From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 1 Mar 2017 09:47:03 +0000
From: Bruce Richardson
To: Jerin Jacob
Cc: olivier.matz@6wind.com, dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v1 01/14] ring: remove split cacheline build setting
Message-ID: <20170301094702.GA15176@bricha3-MOBL3.ger.corp.intel.com>
In-Reply-To: <20170228175423.GA23591@localhost.localdomain>
References: <20170223172407.27664-1-bruce.richardson@intel.com>
 <20170223172407.27664-2-bruce.richardson@intel.com>
 <20170228113511.GA28584@localhost.localdomain>
 <20170228115703.GA4656@bricha3-MOBL3.ger.corp.intel.com>
 <20170228120833.GA30817@localhost.localdomain>
 <20170228135226.GA9784@bricha3-MOBL3.ger.corp.intel.com>
 <20170228175423.GA23591@localhost.localdomain>
Organization: Intel Research and Development Ireland Ltd.
User-Agent: Mutt/1.7.2 (2016-11-26)

On Tue, Feb 28, 2017 at 11:24:25PM +0530, Jerin Jacob wrote:
> On Tue, Feb 28, 2017 at 01:52:26PM +0000, Bruce Richardson wrote:
> > On Tue, Feb 28, 2017 at 05:38:34PM +0530, Jerin Jacob wrote:
> > > On Tue, Feb 28, 2017 at 11:57:03AM +0000, Bruce Richardson wrote:
> > > > On Tue, Feb 28, 2017 at 05:05:13PM +0530, Jerin Jacob wrote:
> > > > > On Thu, Feb 23, 2017 at 05:23:54PM +0000, Bruce Richardson wrote:
> > > > > > Users compiling DPDK should not need to know or care about the
> > > > > > arrangement of cachelines in the rte_ring structure. Therefore just
> > > > > > remove the build option and set the structures to be always split.
> > > > > > For improved performance, use 128B rather than 64B alignment, since
> > > > > > it stops the producer and consumer data being on adjacent cachelines.
> > > > > >
> > > > > > Signed-off-by: Bruce Richardson
> > > > > > ---
> > > > > >  config/common_base                     | 1 -
> > > > > >  doc/guides/rel_notes/release_17_05.rst | 6 ++++++
> > > > > >  lib/librte_ring/rte_ring.c             | 2 --
> > > > > >  lib/librte_ring/rte_ring.h             | 8 ++------
> > > > > >  4 files changed, 8 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/config/common_base b/config/common_base
> > > > > > index aeee13e..099ffda 100644
> > > > > > --- a/config/common_base
> > > > > > +++ b/config/common_base
> > > > > > @@ -448,7 +448,6 @@ CONFIG_RTE_LIBRTE_PMD_NULL_CRYPTO=y
> > > > > >  #
> > > > > >  CONFIG_RTE_LIBRTE_RING=y
> > > > > >  CONFIG_RTE_LIBRTE_RING_DEBUG=n
> > > > > > -CONFIG_RTE_RING_SPLIT_PROD_CONS=n
> > > > > >  CONFIG_RTE_RING_PAUSE_REP_COUNT=0
> > > > > >
> > > > > >  #
> > > > > > diff --git a/doc/guides/rel_notes/release_17_05.rst b/doc/guides/rel_notes/release_17_05.rst
> > > > > > index e25ea9f..ea45e0c 100644
> > > > > > --- a/doc/guides/rel_notes/release_17_05.rst
> > > > > > +++ b/doc/guides/rel_notes/release_17_05.rst
> > > > > > @@ -110,6 +110,12 @@ API Changes
> > > > > >     Also, make sure to start the actual text at the margin.
> > > > > >     =========================================================
> > > > > >
> > > > > > +* **Reworked rte_ring library**
> > > > > > +
> > > > > > +  The rte_ring library has been reworked and updated. The following changes
> > > > > > +  have been made to it:
> > > > > > +
> > > > > > +  * removed the build-time setting ``CONFIG_RTE_RING_SPLIT_PROD_CONS``
> > > > > >
> > > > > >  ABI Changes
> > > > > >  -----------
> > > > > > diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> > > > > > index ca0a108..4bc6da1 100644
> > > > > > --- a/lib/librte_ring/rte_ring.c
> > > > > > +++ b/lib/librte_ring/rte_ring.c
> > > > > > @@ -127,10 +127,8 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> > > > > >  	/* compilation-time checks */
> > > > > >  	RTE_BUILD_BUG_ON((sizeof(struct rte_ring) &
> > > > > >  			  RTE_CACHE_LINE_MASK) != 0);
> > > > > > -#ifdef RTE_RING_SPLIT_PROD_CONS
> > > > > >  	RTE_BUILD_BUG_ON((offsetof(struct rte_ring, cons) &
> > > > > >  			  RTE_CACHE_LINE_MASK) != 0);
> > > > > > -#endif
> > > > > >  	RTE_BUILD_BUG_ON((offsetof(struct rte_ring, prod) &
> > > > > >  			  RTE_CACHE_LINE_MASK) != 0);
> > > > > >  #ifdef RTE_LIBRTE_RING_DEBUG
> > > > > > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > > > > > index 72ccca5..04fe667 100644
> > > > > > --- a/lib/librte_ring/rte_ring.h
> > > > > > +++ b/lib/librte_ring/rte_ring.h
> > > > > > @@ -168,7 +168,7 @@ struct rte_ring {
> > > > > >  		uint32_t mask;           /**< Mask (size-1) of ring. */
> > > > > >  		volatile uint32_t head;  /**< Producer head. */
> > > > > >  		volatile uint32_t tail;  /**< Producer tail. */
> > > > > > -	} prod __rte_cache_aligned;
> > > > > > +	} prod __rte_aligned(RTE_CACHE_LINE_SIZE * 2);
> > > > >
> > > > > I think we need to use RTE_CACHE_LINE_MIN_SIZE instead of
> > > > > RTE_CACHE_LINE_SIZE for alignment here. PPC and ThunderX1 targets have a
> > > > > cache line size of 128B.
> > > > >
> > > > Sure.
> > > >
> > > > However, can you perhaps try a performance test and check to see if
> > > > there is a performance difference between the two values before I change
> > > > it? In my tests I see improved performance by having an extra blank
> > > > cache-line between the producer and consumer data.
> > >
> > > Sure. Which test are you running to measure the performance difference?
> > > Is it app/test/test_ring_perf.c?
> > >
> > Yep, just the basic ring perf test. I look mostly at the core-to-core
> > numbers, since hyperthread-to-hyperthread or NUMA socket to NUMA socket
> > would be far less common use cases IMHO.
>
> Performance test results show a regression with the RTE_CACHE_LINE_MIN_SIZE
> scheme in some use cases, and higher performance in others (testing using
> two physical cores).
>
> # base code
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 84
> MP/MC single enq/dequeue: 301
> SP/SC burst enq/dequeue (size: 8): 20
> MP/MC burst enq/dequeue (size: 8): 46
> SP/SC burst enq/dequeue (size: 32): 12
> MP/MC burst enq/dequeue (size: 32): 18
>
> ### Testing empty dequeue ###
> SC empty dequeue: 7.11
> MC empty dequeue: 12.15
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 19.08
> MP/MC bulk enq/dequeue (size: 8): 46.28
> SP/SC bulk enq/dequeue (size: 32): 11.89
> MP/MC bulk enq/dequeue (size: 32): 18.84
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 37.42
> MP/MC bulk enq/dequeue (size: 8): 73.32
> SP/SC bulk enq/dequeue (size: 32): 18.69
> MP/MC bulk enq/dequeue (size: 32): 24.59
> Test OK
>
> # with ring rework patch
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 84
> MP/MC single enq/dequeue: 301
> SP/SC burst enq/dequeue (size: 8): 19
> MP/MC burst enq/dequeue (size: 8): 45
> SP/SC burst enq/dequeue (size: 32): 11
> MP/MC burst enq/dequeue (size: 32): 18
>
> ### Testing empty dequeue ###
> SC empty dequeue: 7.10
> MC empty dequeue: 12.15
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 18.59
> MP/MC bulk enq/dequeue (size: 8): 45.49
> SP/SC bulk enq/dequeue (size: 32): 11.67
> MP/MC bulk enq/dequeue (size: 32): 18.65
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 37.41
> MP/MC bulk enq/dequeue (size: 8): 72.98
> SP/SC bulk enq/dequeue (size: 32): 18.69
> MP/MC bulk enq/dequeue (size: 32): 24.59
> Test OK
> RTE>>
>
> # with ring rework patch + cache-line size change to one on 128B CL target
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 90
> MP/MC single enq/dequeue: 317
> SP/SC burst enq/dequeue (size: 8): 20
> MP/MC burst enq/dequeue (size: 8): 48
> SP/SC burst enq/dequeue (size: 32): 11
> MP/MC burst enq/dequeue (size: 32): 18
>
> ### Testing empty dequeue ###
> SC empty dequeue: 8.10
> MC empty dequeue: 11.15
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 20.24
> MP/MC bulk enq/dequeue (size: 8): 48.43
> SP/SC bulk enq/dequeue (size: 32): 11.01
> MP/MC bulk enq/dequeue (size: 32): 18.43
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 25.92
> MP/MC bulk enq/dequeue (size: 8): 69.76
> SP/SC bulk enq/dequeue (size: 32): 14.27
> MP/MC bulk enq/dequeue (size: 32): 22.94
> Test OK
> RTE>>

So given that there is not much difference here, is RTE_CACHE_LINE_MIN_SIZE,
i.e. forced 64B alignment, your preference rather than the actual cacheline
size?

/Bruce
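
For readers following the alignment discussion above, here is a minimal,
self-contained sketch of the layout the patch creates. The toy_ring struct
is hypothetical (it is not the real rte_ring definition), and the 64B line
size is an assumption for an x86-style target; on a 128B-cacheline target
such as PPC or ThunderX1 the arithmetic changes, which is exactly the
RTE_CACHE_LINE_MIN_SIZE question being debated. Aligning each block to twice
the line size leaves one blank cacheline between the producer and consumer
fields:

/* layout_sketch.c: hypothetical toy_ring, NOT the real rte_ring.
 * Build: cc -Wall -o layout_sketch layout_sketch.c */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define LINE 64  /* assumed cacheline size; 128 on PPC/ThunderX1 */

/* Rough stand-in for RTE_BUILD_BUG_ON: negative array size if cond holds. */
#define BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

struct toy_ring {
	int flags;
	uint32_t size;
	uint32_t mask;

	/* Producer metadata: starts on a 2*LINE boundary, so the line
	 * following it is padding rather than the consumer's data. */
	struct {
		volatile uint32_t head;
		volatile uint32_t tail;
	} prod __attribute__((aligned(LINE * 2)));

	/* Consumer metadata: same treatment. */
	struct {
		volatile uint32_t head;
		volatile uint32_t tail;
	} cons __attribute__((aligned(LINE * 2)));
};

int main(void)
{
	/* Mirrors the compile-time offset checks done in rte_ring_init(). */
	BUILD_BUG_ON(offsetof(struct toy_ring, prod) % LINE != 0);
	BUILD_BUG_ON(offsetof(struct toy_ring, cons) % LINE != 0);

	/* Prints prod @ 128, cons @ 256: one full blank 64B line sits
	 * between the two hot cachelines. */
	printf("prod @ %zu, cons @ %zu\n",
	       offsetof(struct toy_ring, prod),
	       offsetof(struct toy_ring, cons));
	return 0;
}

With plain __rte_cache_aligned (one-line alignment), cons would start on the
line directly after prod, where adjacent-line hardware prefetching can drag
the neighbouring line between cores; the extra blank line is what Bruce
reports as faster in his tests.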
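Similarly, a simplified sketch of what the quoted ring_perf_autotest numbers
measure. The real test is app/test/test_ring_perf.c, which drives the actual
rte_ring API and times with rte_rdtsc(), so its unit is CPU cycles per
element; the toy below (hypothetical single-threaded ring with no full/empty
checks) only illustrates the assumed arithmetic: total elapsed time divided
by iterations times bulk size.

/* perf_sketch.c: illustrative only; not app/test/test_ring_perf.c.
 * Build: cc -Wall -O2 -o perf_sketch perf_sketch.c */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define RING_SIZE 1024        /* power of two, so mask = size - 1 */
#define ITERATIONS (1u << 20)

static void *slots[RING_SIZE];
static uint32_t head, tail;   /* toy SP/SC indices, no full/empty checks */

static void enqueue_bulk(void **objs, unsigned n)
{
	for (unsigned i = 0; i < n; i++)
		slots[(head + i) & (RING_SIZE - 1)] = objs[i];
	head += n;
}

static void dequeue_bulk(void **objs, unsigned n)
{
	for (unsigned i = 0; i < n; i++)
		objs[i] = slots[(tail + i) & (RING_SIZE - 1)];
	tail += n;
}

static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	void *burst[32];
	memset(burst, 0, sizeof(burst));

	for (unsigned bulk = 8; bulk <= 32; bulk *= 4) {
		uint64_t start = now_ns();
		for (unsigned i = 0; i < ITERATIONS; i++) {
			enqueue_bulk(burst, bulk);
			dequeue_bulk(burst, bulk);
		}
		/* Report cost per element, the same shape as lines like
		 * "SP/SC bulk enq/dequeue (size: 8): 19.08" above (the
		 * real test reports cycles, not nanoseconds). */
		printf("bulk enq/dequeue (size: %u): %.2f ns/element\n",
		       bulk,
		       (double)(now_ns() - start) / ((double)ITERATIONS * bulk));
	}
	return 0;
}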