From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <olivier.matz@6wind.com>
Received: from mail-wr0-f181.google.com (mail-wr0-f181.google.com
 [209.85.128.181]) by dpdk.org (Postfix) with ESMTP id 3968A2BFF
 for <dev@dpdk.org>; Wed,  1 Mar 2017 11:17:56 +0100 (CET)
Received: by mail-wr0-f181.google.com with SMTP id u48so26715620wrc.0
 for <dev@dpdk.org>; Wed, 01 Mar 2017 02:17:56 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=6wind-com.20150623.gappssmtp.com; s=20150623;
 h=date:from:to:cc:subject:message-id:in-reply-to:references
 :mime-version:content-transfer-encoding;
 bh=dyQSWYYuBA0EjC8BaTmobPtJd049iN7KCVViUgmINUY=;
 b=XDP2tW3BwbOztISQnlvhartBo6HuEGrREUIgo4JCcWoW2ods9rd8QA524SyazXTMCG
 ryOlHgSwC1ehXsj2dp01PaVJeoG5dpWc9x1SPUxOpO3VVA9aHmdKtRgS2Ka/5YLGTtfU
 y+CArZ5goD0lWM9uwCtX+MqUztbOgoZ71jnp36/shvc8Ghu7WGAqOWcfQ7ne0p5vsxRw
 nwwNp34ir8rORGnLevpIY9yAc/xVb4NUuw5aaBFzdJogxUpC4YAMj7sQgRgQn9LnMwvX
 4CUJMhrcjwRLHE5wr9GD9PWSEAL4og878QDs/IwU00vmKsvhnuCdkoGIMMIml1A9g1yg
 D2yQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:in-reply-to
 :references:mime-version:content-transfer-encoding;
 bh=dyQSWYYuBA0EjC8BaTmobPtJd049iN7KCVViUgmINUY=;
 b=LjCbrDslgGEleZ8GuPJNZsdCH2A9dc5bU4IM3BGUVwWXC73tXmpkDXmfHz9A3tduGR
 +x+Lz1USDf4oFDYnojWYQxtTZ53+UNAk53lKRqE5DEVvkQBknLapQrnwZ5eKdqdw5TT9
 JiJUz7XrvNpSLKsgKMa8ybuZSmQZUS2hw/A+lA+CuiX5JMPUkYaQw5KBzQO/REyist8A
 g2nX237nYxhCb6UXiqLwghoYKhaUNos9AmD6E7ziiK/vFznzWHrTjWXO/Ap/uN4DLm34
 M8OvKh7D4FgPHMfvNI/c/mn1wPs7uga7tb8eKYvOapqeczjbtJQQ6z7cHICyDlfjiJyA
 sZTw==
X-Gm-Message-State: AMke39nZbqAX/Sr+zfTE/G2NzgiWHuHhZfwQoVoynKx96vSo1C6780z9/sPSdoDdlqa18j5A
X-Received: by 10.223.133.5 with SMTP id 5mr6069188wrh.175.1488363476511;
 Wed, 01 Mar 2017 02:17:56 -0800 (PST)
Received: from platinum (2a01cb0c03c651000226b0fffeed02fc.ipv6.abo.wanadoo.fr.
 [2a01:cb0c:3c6:5100:226:b0ff:feed:2fc])
 by smtp.gmail.com with ESMTPSA id w17sm5946030wra.28.2017.03.01.02.17.56
 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256);
 Wed, 01 Mar 2017 02:17:56 -0800 (PST)
Date: Wed, 1 Mar 2017 11:17:53 +0100
From: Olivier Matz <olivier.matz@6wind.com>
To: Bruce Richardson <bruce.richardson@intel.com>
Cc: Jerin Jacob <jerin.jacob@caviumnetworks.com>, dev@dpdk.org
Message-ID: <20170301111753.1223a01e@platinum>
In-Reply-To: <20170301094702.GA15176@bricha3-MOBL3.ger.corp.intel.com>
References: <20170223172407.27664-1-bruce.richardson@intel.com>
 <20170223172407.27664-2-bruce.richardson@intel.com>
 <20170228113511.GA28584@localhost.localdomain>
 <20170228115703.GA4656@bricha3-MOBL3.ger.corp.intel.com>
 <20170228120833.GA30817@localhost.localdomain>
 <20170228135226.GA9784@bricha3-MOBL3.ger.corp.intel.com>
 <20170228175423.GA23591@localhost.localdomain>
 <20170301094702.GA15176@bricha3-MOBL3.ger.corp.intel.com>
X-Mailer: Claws Mail 3.14.1 (GTK+ 2.24.31; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [dpdk-dev] [PATCH v1 01/14] ring: remove split cacheline build
 setting
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Wed, 01 Mar 2017 10:17:57 -0000

Hi Bruce,

On Wed, 1 Mar 2017 09:47:03 +0000, Bruce Richardson
<bruce.richardson@intel.com> wrote:
> On Tue, Feb 28, 2017 at 11:24:25PM +0530, Jerin Jacob wrote:
> > On Tue, Feb 28, 2017 at 01:52:26PM +0000, Bruce Richardson wrote:  
> > > On Tue, Feb 28, 2017 at 05:38:34PM +0530, Jerin Jacob wrote:  
> > > > On Tue, Feb 28, 2017 at 11:57:03AM +0000, Bruce Richardson
> > > > wrote:  
> > > > > On Tue, Feb 28, 2017 at 05:05:13PM +0530, Jerin Jacob wrote:  
> > > > > > On Thu, Feb 23, 2017 at 05:23:54PM +0000, Bruce Richardson
> > > > > > wrote:  
> > > > > > > Users compiling DPDK should not need to know or care
> > > > > > > about the arrangement of cachelines in the rte_ring
> > > > > > > structure. Therefore just remove the build option and set
> > > > > > > the structures to be always split. For improved
> > > > > > > performance use 128B rather than 64B alignment since it
> > > > > > > stops the producer and consumer data being on adjacent


You say you see improved performance on Intel when there is an extra
blank cache line between the producer and consumer data. Do you have any
idea why it behaves like this? Do you think it is related to the
hardware adjacent cache line prefetcher?



> [...]
> > # base code  
> > RTE>>ring_perf_autotest  
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 84
> > MP/MC single enq/dequeue: 301
> > SP/SC burst enq/dequeue (size: 8): 20
> > MP/MC burst enq/dequeue (size: 8): 46
> > SP/SC burst enq/dequeue (size: 32): 12
> > MP/MC burst enq/dequeue (size: 32): 18
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 7.11
> > MC empty dequeue: 12.15
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 19.08
> > MP/MC bulk enq/dequeue (size: 8): 46.28
> > SP/SC bulk enq/dequeue (size: 32): 11.89
> > MP/MC bulk enq/dequeue (size: 32): 18.84
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 37.42
> > MP/MC bulk enq/dequeue (size: 8): 73.32
> > SP/SC bulk enq/dequeue (size: 32): 18.69
> > MP/MC bulk enq/dequeue (size: 32): 24.59
> > Test OK
> > 
> > # with ring rework patch  
> > RTE>>ring_perf_autotest  
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 84
> > MP/MC single enq/dequeue: 301
> > SP/SC burst enq/dequeue (size: 8): 19
> > MP/MC burst enq/dequeue (size: 8): 45
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 18
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 7.10
> > MC empty dequeue: 12.15
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 18.59
> > MP/MC bulk enq/dequeue (size: 8): 45.49
> > SP/SC bulk enq/dequeue (size: 32): 11.67
> > MP/MC bulk enq/dequeue (size: 32): 18.65
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 37.41
> > MP/MC bulk enq/dequeue (size: 8): 72.98
> > SP/SC bulk enq/dequeue (size: 32): 18.69
> > MP/MC bulk enq/dequeue (size: 32): 24.59
> > Test OK  
> > RTE>>  
> > 
> > # with ring rework patch + cache-line size change to one on 128BCL
> > target  
> > RTE>>ring_perf_autotest  
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 90
> > MP/MC single enq/dequeue: 317
> > SP/SC burst enq/dequeue (size: 8): 20
> > MP/MC burst enq/dequeue (size: 8): 48
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 18
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 8.10
> > MC empty dequeue: 11.15
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 20.24
> > MP/MC bulk enq/dequeue (size: 8): 48.43
> > SP/SC bulk enq/dequeue (size: 32): 11.01
> > MP/MC bulk enq/dequeue (size: 32): 18.43
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 25.92
> > MP/MC bulk enq/dequeue (size: 8): 69.76
> > SP/SC bulk enq/dequeue (size: 32): 14.27
> > MP/MC bulk enq/dequeue (size: 32): 22.94
> > Test OK  
> > RTE>>  
> 
> So given that there is not much difference here, is the MIN_SIZE i.e.
> forced 64B, your preference, rather than actual cacheline-size?
> 

I don't really like the CACHE_LINE_MIN_SIZE macro. To me, it has no
clear meaning. The reasons for aligning on the cache line size are
straightforward, but when would we need to align on the minimum
cache line size supported by DPDK? For instance, in the mbuf structure,
aligning on 64 would make more sense to me.

So, I would prefer using (RTE_CACHE_LINE_SIZE * 2) here. If we don't
want it on some architectures, or if this optimization is only useful on
Intel (or on a known set of architectures), I think we could have
something like:

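/*
 * Leave an empty cache line between producer and consumer data only
 * where it helps. INTEL is a placeholder for the real arch test.
 */
#ifdef INTEL
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE * 2)
#else
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
#endif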
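
To make the effect of such a macro concrete, here is a rough standalone
sketch, not the real rte_ring code. It assumes a 64-byte cache line and
GCC/Clang attributes. The names demo_headtail, demo_ring, demo_ring_gap
and RING_PAD_EXTRA_LINE are made up for the illustration, and
__rte_aligned is redefined locally as a stand-in for the DPDK macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache line. DPDK sets RTE_CACHE_LINE_SIZE per target. */
#define RTE_CACHE_LINE_SIZE 64

/* Local stand-in for DPDK's __rte_aligned(), using the GCC/Clang attribute. */
#define __rte_aligned(a) __attribute__((__aligned__(a)))

/* Hypothetical switch playing the role of the arch test (1 = extra line). */
#ifndef RING_PAD_EXTRA_LINE
#define RING_PAD_EXTRA_LINE 1
#endif

#if RING_PAD_EXTRA_LINE
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE * 2)
#else
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
#endif

/* Simplified head/tail pair; the real ring metadata has more fields. */
struct demo_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

/* Producer and consumer data each start on their own aligned boundary. */
struct demo_ring {
	struct demo_headtail prod __rte_ring_aligned;
	struct demo_headtail cons __rte_ring_aligned;
};

/* Whatever the switch says, the two must never share a cache line. */
static_assert(offsetof(struct demo_ring, cons) % RTE_CACHE_LINE_SIZE == 0,
	      "consumer data must start on a cache line boundary");

/* Distance in bytes between the start of producer and consumer data. */
static inline size_t demo_ring_gap(void)
{
	return offsetof(struct demo_ring, cons) -
	       offsetof(struct demo_ring, prod);
}
```

With the switch left at 1, demo_ring_gap() is two cache lines, so one
unused line always sits between prod and cons. Building with
-DRING_PAD_EXTRA_LINE=0 puts them on adjacent lines instead.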


Olivier