From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id 4B83E595B for ; Wed, 12 Nov 2014 16:42:12 +0100 (CET) Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga102.jf.intel.com with ESMTP; 12 Nov 2014 07:49:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.07,369,1413270000"; d="scan'208";a="635784604" Received: from bricha3-mobl3.ger.corp.intel.com ([10.237.220.93]) by orsmga002.jf.intel.com with SMTP; 12 Nov 2014 07:51:47 -0800 Received: by (sSMTP sendmail emulation); Wed, 12 Nov 2014 15:51:46 +0025 Date: Wed, 12 Nov 2014 15:51:46 +0000 From: Bruce Richardson To: jigsaw Message-ID: <20141112155145.GA7952@bricha3-MOBL3> References: <1415630642-3905-1-git-send-email-jigsaw@gmail.com> <20141110151308.GC12532@bricha3-MOBL3> <20141110165530.GA10776@bricha3-MOBL3> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Organization: Intel Shannon Ltd. User-Agent: Mutt/1.5.23 (2014-03-12) Cc: "dev@dpdk.org" Subject: Re: [dpdk-dev] [PATCH v3] Add in_flight_bitmask so as to use full 32 bits of tag. X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 12 Nov 2014 15:42:16 -0000 On Tue, Nov 11, 2014 at 10:53:40AM +0200, jigsaw wrote: > Hi Bruce, > > This patch has little, if any, performance impact. > See the perf stat -d for original and patched version > of test_distributor_perf. Thanks for running the test. Results look ok to me. I confirm my previous ack. /Bruce > > Original version > ============ > perf stat -d ./test_orig -cffff -n2 > ============ > ==== Cache line switch test === > > [4/4590] > Time for 1048576 iterations = 362349056 ticks > Ticks per iteration = 345 > > === Performance test of distributor === > Time per burst: 7043 > Time per packet: 220 > > Worker 0 handled 3377372 packets > Worker 1 handled 2857280 packets > Worker 2 handled 2120982 packets > Worker 3 handled 2112720 packets > Worker 4 handled 2102014 packets > Worker 5 handled 2101314 packets > Worker 6 handled 2099248 packets > Worker 7 handled 2098560 packets > Worker 8 handled 2098114 packets > Worker 9 handled 2097962 packets > Worker 10 handled 2097892 packets > Worker 11 handled 2097856 packets > Worker 12 handled 2097762 packets > Worker 13 handled 2097726 packets > Worker 14 handled 2097630 packets > Total packets: 33554432 (2000000) > === Perf test done === > > > Performance counter stats for './test_orig -cffff -n2 --no-huge': > > 45784.935475 task-clock # 12.847 CPUs utilized > 4822 context-switches # 0.105 K/sec > 7 CPU-migrations # 0.000 K/sec > 1017 page-faults # 0.022 K/sec > 118181375409 cycles # 2.581 GHz > [40.00%] > 88915381373 stalled-cycles-frontend # 75.24% frontend cycles > idle [39.98%] > 77015854611 stalled-cycles-backend # 65.17% backend cycles > idle [39.96%] > 30469536186 instructions # 0.26 insns per cycle > # 2.92 stalled cycles per > insn [49.96%] > 6481829773 branches # 141.571 M/sec > [49.97%] > 45283365 branch-misses # 0.70% of all branches > [50.00%] > 6537115556 L1-dcache-loads # 142.779 M/sec > [50.05%] > 249533128 L1-dcache-load-misses # 3.82% of all L1-dcache > hits [50.08%] > 144469663 LLC-loads # 3.155 M/sec > [40.05%] > 69897760 LLC-load-misses # 48.38% of all LL-cache > hits [40.03%] > > 3.563886431 seconds time elapsed > > > Patched version > ============= > perf stat -d ./test_patch -cffff -n2 > ==== Cache line switch test === > > [4/4747] > Time for 1048576 iterations = 456390606 ticks > Ticks per iteration = 435 > > === Performance test of distributor === > Time per burst: 6566 > Time per packet: 205 > > Worker 0 handled 2528418 packets > Worker 1 handled 2531906 packets > Worker 2 handled 2524133 packets > Worker 3 handled 2507720 packets > Worker 4 handled 2332648 packets > Worker 5 handled 2300046 packets > Worker 6 handled 2213035 packets > Worker 7 handled 2125486 packets > Worker 8 handled 2096726 packets > Worker 9 handled 2089150 packets > Worker 10 handled 2079626 packets > Worker 11 handled 2074512 packets > Worker 12 handled 2073560 packets > Worker 13 handled 2054838 packets > Worker 14 handled 2022628 packets > Total packets: 33554432 (2000000) > === Perf test done === > > > Performance counter stats for './test_patch -cffff -n2 --no-huge': > > 42784.064267 task-clock # 12.546 CPUs utilized > 4517 context-switches # 0.106 K/sec > 5 CPU-migrations # 0.000 K/sec > 1020 page-faults # 0.024 K/sec > 110531300647 cycles # 2.583 GHz > [40.06%] > 83135557406 stalled-cycles-frontend # 75.21% frontend cycles > idle [40.02%] > 71862086289 stalled-cycles-backend # 65.02% backend cycles > idle [39.99%] > 28847200612 instructions # 0.26 insns per cycle > # 2.88 stalled cycles per > insn [49.98%] > 6059139104 branches # 141.621 M/sec > [49.94%] > 38455463 branch-misses # 0.63% of all branches > [49.95%] > 6090169906 L1-dcache-loads # 142.347 M/sec > [49.99%] > 235094118 L1-dcache-load-misses # 3.86% of all L1-dcache > hits [50.03%] > 129974492 LLC-loads # 3.038 M/sec > [40.07%] > 56701883 LLC-load-misses # 43.63% of all LL-cache > hits [40.08%] > > 3.410231367 seconds time elapsed > > > The patched version even runs a bit faster. > > Perf annotation shows that 92% time is spent in the while loop > of rte_distributor_get_pkt. > Anothoer 6% spent in the calculation of match > inside rte_distributor_process. > > Also we can see that there is a big amount of LLC load miss, which is > perhaps due to cache conflict > in the while-loop of rte_distributor_get_pkt? > > I'm no going to draw any conclusion here but evidently the > in_flight_bitmask is not in the hot path, and > is irrelevant to the performance of distributor. > > Thanks a lot for your code review and ack. > > thx & > rgds, > -qinglai > > > On Mon, Nov 10, 2014 at 6:55 PM, Bruce Richardson < > bruce.richardson@intel.com> wrote: > > > On Mon, Nov 10, 2014 at 05:52:26PM +0200, jigsaw wrote: > > > Hi Bruce, > > > > > > Sorry I didn't. I will run a performance test tomorrow. > > > > > > thx & > > > rgds, > > > -qinglai > > > > > > > On the assumption that no performance regressions show up... > > Acked-by: Bruce Richardson > > > > > > > On Mon, Nov 10, 2014 at 5:13 PM, Bruce Richardson < > > > bruce.richardson@intel.com> wrote: > > > > > > > On Mon, Nov 10, 2014 at 04:44:02PM +0200, Qinglai Xiao wrote: > > > > > With introduction of in_flight_bitmask, the whole 32 bits of tag can > > be > > > > > used. Further more, this patch fixed the integer overflow when > > finding > > > > > the matched tags. > > > > > The maximum number workers is now defined as 64, which is length of > > > > > double-word. The link between number of workers and RTE_MAX_LCORE is > > > > > now removed. Compile time check is added to ensure the > > > > > RTE_DISTRIB_MAX_WORKERS is less than or equal to size of double-word. > > > > > > > > > > Signed-off-by: Qinglai Xiao > > > > > > > > Looks good to me. > > > > Just before I ack this, have you checked to see if there is any > > > > performance impact? > > > > > > > > /Bruce > > > > > > > > > --- > > > > > lib/librte_distributor/rte_distributor.c | 64 > > > > ++++++++++++++++++++++-------- > > > > > lib/librte_distributor/rte_distributor.h | 4 ++ > > > > > 2 files changed, 51 insertions(+), 17 deletions(-) > > > > > > > > > > diff --git a/lib/librte_distributor/rte_distributor.c > > > > b/lib/librte_distributor/rte_distributor.c > > > > > index 3dfec4a..2c5d61c 100644 > > > > > --- a/lib/librte_distributor/rte_distributor.c > > > > > +++ b/lib/librte_distributor/rte_distributor.c > > > > > @@ -62,6 +62,13 @@ > > > > > #define RTE_DISTRIB_RETURNS_MASK (RTE_DISTRIB_MAX_RETURNS - 1) > > > > > > > > > > /** > > > > > + * Maximum number of workers allowed. > > > > > + * Be aware of increasing the limit, becaus it is limited by how we > > > > track > > > > > + * in-flight tags. See @in_flight_bitmask and > > @rte_distributor_process > > > > > + */ > > > > > +#define RTE_DISTRIB_MAX_WORKERS 64 > > > > > + > > > > > +/** > > > > > * Buffer structure used to pass the pointer data between cores. > > This > > > > is cache > > > > > * line aligned, but to improve performance and prevent adjacent > > > > cache-line > > > > > * prefetches of buffers for other workers, e.g. when worker 1's > > buffer > > > > is on > > > > > @@ -91,11 +98,17 @@ struct rte_distributor { > > > > > char name[RTE_DISTRIBUTOR_NAMESIZE]; /**< Name of the ring. */ > > > > > unsigned num_workers; /**< Number of workers > > > > polling */ > > > > > > > > > > - uint32_t in_flight_tags[RTE_MAX_LCORE]; > > > > > - /**< Tracks the tag being processed per core, 0 == no > > pkt > > > > */ > > > > > - struct rte_distributor_backlog backlog[RTE_MAX_LCORE]; > > > > > + uint32_t in_flight_tags[RTE_DISTRIB_MAX_WORKERS]; > > > > > + /**< Tracks the tag being processed per core */ > > > > > + uint64_t in_flight_bitmask; > > > > > + /**< on/off bits for in-flight tags. > > > > > + * Note that if RTE_DISTRIB_MAX_WORKERS is larger than > > 64 > > > > then > > > > > + * the bitmask has to expand. > > > > > + */ > > > > > + > > > > > + struct rte_distributor_backlog > > backlog[RTE_DISTRIB_MAX_WORKERS]; > > > > > > > > > > - union rte_distributor_buffer bufs[RTE_MAX_LCORE]; > > > > > + union rte_distributor_buffer bufs[RTE_DISTRIB_MAX_WORKERS]; > > > > > > > > > > struct rte_distributor_returned_pkts returns; > > > > > }; > > > > > @@ -189,6 +202,7 @@ static inline void > > > > > handle_worker_shutdown(struct rte_distributor *d, unsigned wkr) > > > > > { > > > > > d->in_flight_tags[wkr] = 0; > > > > > + d->in_flight_bitmask &= ~(1UL << wkr); > > > > > d->bufs[wkr].bufptr64 = 0; > > > > > if (unlikely(d->backlog[wkr].count != 0)) { > > > > > /* On return of a packet, we need to move the > > > > > @@ -211,7 +225,10 @@ handle_worker_shutdown(struct rte_distributor > > *d, > > > > unsigned wkr) > > > > > pkts[i] = (void *)((uintptr_t)(bl->pkts[idx] >> > > > > > RTE_DISTRIB_FLAG_BITS)); > > > > > } > > > > > - /* recursive call */ > > > > > + /* recursive call. > > > > > + * Note that the tags were set before first level call > > > > > + * to rte_distributor_process. > > > > > + */ > > > > > rte_distributor_process(d, pkts, i); > > > > > bl->count = bl->start = 0; > > > > > } > > > > > @@ -242,6 +259,7 @@ process_returns(struct rte_distributor *d) > > > > > else { > > > > > d->bufs[wkr].bufptr64 = > > > > RTE_DISTRIB_GET_BUF; > > > > > d->in_flight_tags[wkr] = 0; > > > > > + d->in_flight_bitmask &= ~(1UL << wkr); > > > > > } > > > > > oldbuf = data >> RTE_DISTRIB_FLAG_BITS; > > > > > } else if (data & RTE_DISTRIB_RETURN_BUF) { > > > > > @@ -284,14 +302,18 @@ rte_distributor_process(struct rte_distributor > > *d, > > > > > next_value = (((int64_t)(uintptr_t)next_mb) > > > > > << RTE_DISTRIB_FLAG_BITS); > > > > > /* > > > > > - * Set the low bit on the tag, so we can > > guarantee > > > > that > > > > > - * we never store a tag value of zero. That > > means > > > > we can > > > > > - * use the zero-value to indicate that no > > packet is > > > > > - * being processed by a worker. > > > > > + * User is advocated to set tag vaue for each > > > > > + * mbuf before calling rte_distributor_process. > > > > > + * User defined tags are used to identify > > flows, > > > > > + * or sessions. > > > > > */ > > > > > - new_tag = (next_mb->hash.usr | 1); > > > > > + new_tag = next_mb->hash.usr; > > > > > > > > > > - uint32_t match = 0; > > > > > + /* > > > > > + * Note that if RTE_DISTRIB_MAX_WORKERS is > > larger > > > > than 64 > > > > > + * then the size of match has to be expanded. > > > > > + */ > > > > > + uint64_t match = 0; > > > > > unsigned i; > > > > > /* > > > > > * to scan for a match use "xor" and "not" to > > get > > > > a 0/1 > > > > > @@ -303,9 +325,12 @@ rte_distributor_process(struct rte_distributor > > *d, > > > > > match |= (!(d->in_flight_tags[i] ^ > > new_tag) > > > > > << i); > > > > > > > > > > + /* Only turned-on bits are considered as match > > */ > > > > > + match &= d->in_flight_bitmask; > > > > > + > > > > > if (match) { > > > > > next_mb = NULL; > > > > > - unsigned worker = __builtin_ctz(match); > > > > > + unsigned worker = > > __builtin_ctzl(match); > > > > > if (add_to_backlog(&d->backlog[worker], > > > > > next_value) < 0) > > > > > next_idx--; > > > > > @@ -322,6 +347,7 @@ rte_distributor_process(struct rte_distributor > > *d, > > > > > else { > > > > > d->bufs[wkr].bufptr64 = next_value; > > > > > d->in_flight_tags[wkr] = new_tag; > > > > > + d->in_flight_bitmask |= (1UL << wkr); > > > > > next_mb = NULL; > > > > > } > > > > > oldbuf = data >> RTE_DISTRIB_FLAG_BITS; > > > > > @@ -379,11 +405,13 @@ rte_distributor_returned_pkts(struct > > > > rte_distributor *d, > > > > > static inline unsigned > > > > > total_outstanding(const struct rte_distributor *d) > > > > > { > > > > > - unsigned wkr, total_outstanding = 0; > > > > > + unsigned wkr, total_outstanding; > > > > > + > > > > > + total_outstanding = __builtin_popcountl(d->in_flight_bitmask); > > > > > > > > > > for (wkr = 0; wkr < d->num_workers; wkr++) > > > > > - total_outstanding += d->backlog[wkr].count + > > > > > - !!(d->in_flight_tags[wkr]); > > > > > + total_outstanding += d->backlog[wkr].count; > > > > > + > > > > > return total_outstanding; > > > > > } > > > > > > > > > > @@ -423,9 +451,11 @@ rte_distributor_create(const char *name, > > > > > > > > > > /* compilation-time checks */ > > > > > RTE_BUILD_BUG_ON((sizeof(*d) & CACHE_LINE_MASK) != 0); > > > > > - RTE_BUILD_BUG_ON((RTE_MAX_LCORE & 7) != 0); > > > > > + RTE_BUILD_BUG_ON((RTE_DISTRIB_MAX_WORKERS & 7) != 0); > > > > > + RTE_BUILD_BUG_ON(RTE_DISTRIB_MAX_WORKERS > > > > > > + sizeof(d->in_flight_bitmask) * > > CHAR_BIT); > > > > > > > > > > - if (name == NULL || num_workers >= RTE_MAX_LCORE) { > > > > > + if (name == NULL || num_workers >= RTE_DISTRIB_MAX_WORKERS) { > > > > > rte_errno = EINVAL; > > > > > return NULL; > > > > > } > > > > > diff --git a/lib/librte_distributor/rte_distributor.h > > > > b/lib/librte_distributor/rte_distributor.h > > > > > index ec0d74a..cc1d559 100644 > > > > > --- a/lib/librte_distributor/rte_distributor.h > > > > > +++ b/lib/librte_distributor/rte_distributor.h > > > > > @@ -88,6 +88,10 @@ rte_distributor_create(const char *name, unsigned > > > > socket_id, > > > > > * packets. The distributor will ensure that no two packets that > > have > > > > the > > > > > * same flow id, or tag, in the mbuf will be procesed at the same > > time. > > > > > * > > > > > + * The user is advocated to set tag for each mbuf before calling > > this > > > > function. > > > > > + * If user doesn't set the tag, the tag value can be various values > > > > depending on > > > > > + * driver implementation and configuration. > > > > > + * > > > > > * This is not multi-thread safe and should only be called on a > > single > > > > lcore. > > > > > * > > > > > * @param d > > > > > -- > > > > > 1.7.1 > > > > > > > > > > >