From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 1 Mar 2017 09:47:03 +0000
From: Bruce Richardson
To: Jerin Jacob
Cc: olivier.matz@6wind.com, dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v1 01/14] ring: remove split cacheline build setting
Message-ID: <20170301094702.GA15176@bricha3-MOBL3.ger.corp.intel.com>
In-Reply-To: <20170228175423.GA23591@localhost.localdomain>
References: <20170223172407.27664-1-bruce.richardson@intel.com>
 <20170223172407.27664-2-bruce.richardson@intel.com>
 <20170228113511.GA28584@localhost.localdomain>
 <20170228115703.GA4656@bricha3-MOBL3.ger.corp.intel.com>
 <20170228120833.GA30817@localhost.localdomain>
 <20170228135226.GA9784@bricha3-MOBL3.ger.corp.intel.com>
 <20170228175423.GA23591@localhost.localdomain>
Organization: Intel Research and Development Ireland Ltd.
User-Agent: Mutt/1.7.2 (2016-11-26)

On Tue, Feb 28, 2017 at 11:24:25PM +0530, Jerin Jacob wrote:
> On Tue, Feb 28, 2017 at 01:52:26PM +0000, Bruce Richardson wrote:
> > On Tue, Feb 28, 2017 at 05:38:34PM +0530, Jerin Jacob wrote:
> > > On Tue, Feb 28, 2017 at 11:57:03AM +0000, Bruce Richardson wrote:
> > > > On Tue, Feb 28, 2017 at 05:05:13PM +0530, Jerin Jacob wrote:
> > > > > On Thu, Feb 23, 2017 at 05:23:54PM +0000, Bruce Richardson wrote:
> > > > > > Users compiling DPDK should not need to know or care about the
> > > > > > arrangement of cachelines in the rte_ring structure. Therefore just
> > > > > > remove the build option and set the structures to be always split.
> > > > > > For improved performance, use 128B rather than 64B alignment, since
> > > > > > it stops the producer and consumer data being on adjacent cachelines.
> > > > > >
> > > > > > Signed-off-by: Bruce Richardson
> > > > > > ---
> > > > > >  config/common_base                     | 1 -
> > > > > >  doc/guides/rel_notes/release_17_05.rst | 6 ++++++
> > > > > >  lib/librte_ring/rte_ring.c             | 2 --
> > > > > >  lib/librte_ring/rte_ring.h             | 8 ++------
> > > > > >  4 files changed, 8 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/config/common_base b/config/common_base
> > > > > > index aeee13e..099ffda 100644
> > > > > > --- a/config/common_base
> > > > > > +++ b/config/common_base
> > > > > > @@ -448,7 +448,6 @@ CONFIG_RTE_LIBRTE_PMD_NULL_CRYPTO=y
> > > > > >  #
> > > > > >  CONFIG_RTE_LIBRTE_RING=y
> > > > > >  CONFIG_RTE_LIBRTE_RING_DEBUG=n
> > > > > > -CONFIG_RTE_RING_SPLIT_PROD_CONS=n
> > > > > >  CONFIG_RTE_RING_PAUSE_REP_COUNT=0
> > > > > >
> > > > > >  #
> > > > > > diff --git a/doc/guides/rel_notes/release_17_05.rst b/doc/guides/rel_notes/release_17_05.rst
> > > > > > index e25ea9f..ea45e0c 100644
> > > > > > --- a/doc/guides/rel_notes/release_17_05.rst
> > > > > > +++ b/doc/guides/rel_notes/release_17_05.rst
> > > > > > @@ -110,6 +110,12 @@ API Changes
> > > > > >     Also, make sure to start the actual text at the margin.
> > > > > >     =========================================================
> > > > > >
> > > > > > +* **Reworked rte_ring library**
> > > > > > +
> > > > > > +  The rte_ring library has been reworked and updated. The following changes
> > > > > > +  have been made to it:
> > > > > > +
> > > > > > +  * removed the build-time setting ``CONFIG_RTE_RING_SPLIT_PROD_CONS``
> > > > > >
> > > > > >  ABI Changes
> > > > > >  -----------
> > > > > > diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> > > > > > index ca0a108..4bc6da1 100644
> > > > > > --- a/lib/librte_ring/rte_ring.c
> > > > > > +++ b/lib/librte_ring/rte_ring.c
> > > > > > @@ -127,10 +127,8 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> > > > > >  	/* compilation-time checks */
> > > > > >  	RTE_BUILD_BUG_ON((sizeof(struct rte_ring) &
> > > > > >  			  RTE_CACHE_LINE_MASK) != 0);
> > > > > > -#ifdef RTE_RING_SPLIT_PROD_CONS
> > > > > >  	RTE_BUILD_BUG_ON((offsetof(struct rte_ring, cons) &
> > > > > >  			  RTE_CACHE_LINE_MASK) != 0);
> > > > > > -#endif
> > > > > >  	RTE_BUILD_BUG_ON((offsetof(struct rte_ring, prod) &
> > > > > >  			  RTE_CACHE_LINE_MASK) != 0);
> > > > > >  #ifdef RTE_LIBRTE_RING_DEBUG
> > > > > > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > > > > > index 72ccca5..04fe667 100644
> > > > > > --- a/lib/librte_ring/rte_ring.h
> > > > > > +++ b/lib/librte_ring/rte_ring.h
> > > > > > @@ -168,7 +168,7 @@ struct rte_ring {
> > > > > >  		uint32_t mask;           /**< Mask (size-1) of ring. */
> > > > > >  		volatile uint32_t head;  /**< Producer head. */
> > > > > >  		volatile uint32_t tail;  /**< Producer tail. */
> > > > > > -	} prod __rte_cache_aligned;
> > > > > > +	} prod __rte_aligned(RTE_CACHE_LINE_SIZE * 2);
> > > > >
> > > > > I think we need to use RTE_CACHE_LINE_MIN_SIZE instead of
> > > > > RTE_CACHE_LINE_SIZE for alignment here. PPC and ThunderX1 targets have a
> > > > > cache line size of 128B.
> > > > >
> > > > Sure.
> > > >
> > > > However, can you perhaps try a performance test and check to see if
> > > > there is a performance difference between the two values before I change
> > > > it? In my tests I see improved performance by having an extra blank
> > > > cache-line between the producer and consumer data.
> > >
> > > Sure. Which test are you running to measure the performance difference?
> > > Is it app/test/test_ring_perf.c?
> > >
> > Yep, just the basic ring perf test. I look mostly at the core-to-core
> > numbers, since hyperthread-to-hyperthread or NUMA socket to NUMA socket
> > would be far less common use cases IMHO.
>
> Performance test results show a regression with the RTE_CACHE_LINE_MIN_SIZE
> scheme in some use cases, and higher performance in others (testing using
> two physical cores).
>
> # base code
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 84
> MP/MC single enq/dequeue: 301
> SP/SC burst enq/dequeue (size: 8): 20
> MP/MC burst enq/dequeue (size: 8): 46
> SP/SC burst enq/dequeue (size: 32): 12
> MP/MC burst enq/dequeue (size: 32): 18
>
> ### Testing empty dequeue ###
> SC empty dequeue: 7.11
> MC empty dequeue: 12.15
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 19.08
> MP/MC bulk enq/dequeue (size: 8): 46.28
> SP/SC bulk enq/dequeue (size: 32): 11.89
> MP/MC bulk enq/dequeue (size: 32): 18.84
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 37.42
> MP/MC bulk enq/dequeue (size: 8): 73.32
> SP/SC bulk enq/dequeue (size: 32): 18.69
> MP/MC bulk enq/dequeue (size: 32): 24.59
> Test OK
>
> # with ring rework patch
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 84
> MP/MC single enq/dequeue: 301
> SP/SC burst enq/dequeue (size: 8): 19
> MP/MC burst enq/dequeue (size: 8): 45
> SP/SC burst enq/dequeue (size: 32): 11
> MP/MC burst enq/dequeue (size: 32): 18
>
> ### Testing empty dequeue ###
> SC empty dequeue: 7.10
> MC empty dequeue: 12.15
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 18.59
> MP/MC bulk enq/dequeue (size: 8): 45.49
> SP/SC bulk enq/dequeue (size: 32): 11.67
> MP/MC bulk enq/dequeue (size: 32): 18.65
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 37.41
> MP/MC bulk enq/dequeue (size: 8): 72.98
> SP/SC bulk enq/dequeue (size: 32): 18.69
> MP/MC bulk enq/dequeue (size: 32): 24.59
> Test OK
> RTE>>
>
> # with ring rework patch + cache-line size change to one on 128B CL target
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 90
> MP/MC single enq/dequeue: 317
> SP/SC burst enq/dequeue (size: 8): 20
> MP/MC burst enq/dequeue (size: 8): 48
> SP/SC burst enq/dequeue (size: 32): 11
> MP/MC burst enq/dequeue (size: 32): 18
>
> ### Testing empty dequeue ###
> SC empty dequeue: 8.10
> MC empty dequeue: 11.15
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 20.24
> MP/MC bulk enq/dequeue (size: 8): 48.43
> SP/SC bulk enq/dequeue (size: 32): 11.01
> MP/MC bulk enq/dequeue (size: 32): 18.43
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 25.92
> MP/MC bulk enq/dequeue (size: 8): 69.76
> SP/SC bulk enq/dequeue (size: 32): 14.27
> MP/MC bulk enq/dequeue (size: 32): 22.94
> Test OK
> RTE>>

So given that there is not much difference here, is RTE_CACHE_LINE_MIN_SIZE,
i.e. forced 64B alignment, your preference rather than the actual cacheline
size?

/Bruce
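
For readers following the alignment discussion above, here is a minimal,
self-contained sketch of the layout the patch creates. The toy_ring struct
is hypothetical (it is not the real rte_ring definition), and the 64B line
size is an assumption for an x86-style target; on a 128B-cacheline target
such as PPC or ThunderX1 the arithmetic changes, which is exactly the
RTE_CACHE_LINE_MIN_SIZE question being debated. Aligning each block to twice
the line size leaves one blank cacheline between the producer and consumer
fields:

/* layout_sketch.c: hypothetical toy_ring, NOT the real rte_ring.
 * Build: cc -Wall -o layout_sketch layout_sketch.c */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define LINE 64  /* assumed cacheline size; 128 on PPC/ThunderX1 */

/* Rough stand-in for RTE_BUILD_BUG_ON: negative array size if cond holds. */
#define BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

struct toy_ring {
	int flags;
	uint32_t size;
	uint32_t mask;

	/* Producer metadata: starts on a 2*LINE boundary, so the line
	 * following it is padding rather than the consumer's data. */
	struct {
		volatile uint32_t head;
		volatile uint32_t tail;
	} prod __attribute__((aligned(LINE * 2)));

	/* Consumer metadata: same treatment. */
	struct {
		volatile uint32_t head;
		volatile uint32_t tail;
	} cons __attribute__((aligned(LINE * 2)));
};

int main(void)
{
	/* Mirrors the compile-time offset checks done in rte_ring_init(). */
	BUILD_BUG_ON(offsetof(struct toy_ring, prod) % LINE != 0);
	BUILD_BUG_ON(offsetof(struct toy_ring, cons) % LINE != 0);

	/* Prints prod @ 128, cons @ 256: one full blank 64B line sits
	 * between the two hot cachelines. */
	printf("prod @ %zu, cons @ %zu\n",
	       offsetof(struct toy_ring, prod),
	       offsetof(struct toy_ring, cons));
	return 0;
}

With plain __rte_cache_aligned (one-line alignment), cons would start on the
line directly after prod, where adjacent-line hardware prefetching can drag
the neighbouring line between cores; the extra blank line is what Bruce
reports as faster in his tests.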
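Similarly, a simplified sketch of what the quoted ring_perf_autotest numbers
measure. The real test is app/test/test_ring_perf.c, which drives the actual
rte_ring API and times with rte_rdtsc(), so its unit is CPU cycles per
element; the toy below (hypothetical single-threaded ring with no full/empty
checks) only illustrates the assumed arithmetic: total elapsed time divided
by iterations times bulk size.

/* perf_sketch.c: illustrative only; not app/test/test_ring_perf.c.
 * Build: cc -Wall -O2 -o perf_sketch perf_sketch.c */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define RING_SIZE 1024        /* power of two, so mask = size - 1 */
#define ITERATIONS (1u << 20)

static void *slots[RING_SIZE];
static uint32_t head, tail;   /* toy SP/SC indices, no full/empty checks */

static void enqueue_bulk(void **objs, unsigned n)
{
	for (unsigned i = 0; i < n; i++)
		slots[(head + i) & (RING_SIZE - 1)] = objs[i];
	head += n;
}

static void dequeue_bulk(void **objs, unsigned n)
{
	for (unsigned i = 0; i < n; i++)
		objs[i] = slots[(tail + i) & (RING_SIZE - 1)];
	tail += n;
}

static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	void *burst[32];
	memset(burst, 0, sizeof(burst));

	for (unsigned bulk = 8; bulk <= 32; bulk *= 4) {
		uint64_t start = now_ns();
		for (unsigned i = 0; i < ITERATIONS; i++) {
			enqueue_bulk(burst, bulk);
			dequeue_bulk(burst, bulk);
		}
		/* Report cost per element, the same shape as lines like
		 * "SP/SC bulk enq/dequeue (size: 8): 19.08" above (the
		 * real test reports cycles, not nanoseconds). */
		printf("bulk enq/dequeue (size: %u): %.2f ns/element\n",
		       bulk,
		       (double)(now_ns() - start) / ((double)ITERATIONS * bulk));
	}
	return 0;
}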