Date: Wed, 1 Mar 2017 11:17:53 +0100
From: Olivier Matz
To: Bruce Richardson
Cc: Jerin Jacob, dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v1 01/14] ring: remove split cacheline build setting

Hi Bruce,

On Wed, 1 Mar 2017 09:47:03 +0000, Bruce Richardson wrote:
> On Tue, Feb 28, 2017 at 11:24:25PM +0530, Jerin Jacob wrote:
> > On Tue, Feb 28, 2017 at 01:52:26PM +0000, Bruce Richardson wrote:
> > > On Tue, Feb 28, 2017 at 05:38:34PM +0530, Jerin Jacob wrote:
> > > > On Tue, Feb 28, 2017 at 11:57:03AM +0000, Bruce Richardson wrote:
> > > > > On Tue, Feb 28, 2017 at 05:05:13PM +0530, Jerin Jacob wrote:
> > > > > > On Thu, Feb 23, 2017 at 05:23:54PM +0000, Bruce Richardson wrote:
> > > > > > > Users compiling DPDK should not need to know or care about
> > > > > > > the arrangement of cachelines in the rte_ring structure.
> > > > > > > Therefore just remove the build option and set the
> > > > > > > structures to be always split. For improved performance use
> > > > > > > 128B rather than 64B alignment since it stops the producer
> > > > > > > and consumer data being on adjacent cachelines.

You say you see improved performance on Intel by having an extra blank
cache line between the producer and consumer data. Do you have an idea
why it behaves like this? Do you think it is related to the hardware
adjacent cache line prefetcher?
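For reference, here is a minimal sketch of the layout under discussion:
with the two control blocks aligned to two cache lines, a full spare 64B
line separates producer and consumer data, so the adjacent cache line
prefetcher (which fetches the sibling line of each 128B-aligned pair)
cannot pull one core's hot line into the other core's cache. The type
and field names below are simplified illustrations, not the actual
rte_ring definition:

#include <stdint.h>

#define CACHE_LINE 64 /* x86 cache line size, for illustration */

struct ring_sketch {
        uint32_t size;                  /* read-mostly metadata */

        /* Starts on a 128B boundary: its 64B line plus the spare line
         * after it form one prefetch pair, owned by the producer. */
        struct {
                volatile uint32_t head; /* producer index */
                volatile uint32_t tail;
        } prod __attribute__((aligned(2 * CACHE_LINE)));

        /* The next 128B boundary: the consumer gets its own prefetch
         * pair, so neither core's prefetcher touches the other's data.
         * With plain 64B alignment, prod and cons could instead share
         * one 128B prefetch pair. */
        struct {
                volatile uint32_t head; /* consumer index */
                volatile uint32_t tail;
        } cons __attribute__((aligned(2 * CACHE_LINE)));
};

> [...]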
> > # base code
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 84
> > MP/MC single enq/dequeue: 301
> > SP/SC burst enq/dequeue (size: 8): 20
> > MP/MC burst enq/dequeue (size: 8): 46
> > SP/SC burst enq/dequeue (size: 32): 12
> > MP/MC burst enq/dequeue (size: 32): 18
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 7.11
> > MC empty dequeue: 12.15
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 19.08
> > MP/MC bulk enq/dequeue (size: 8): 46.28
> > SP/SC bulk enq/dequeue (size: 32): 11.89
> > MP/MC bulk enq/dequeue (size: 32): 18.84
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 37.42
> > MP/MC bulk enq/dequeue (size: 8): 73.32
> > SP/SC bulk enq/dequeue (size: 32): 18.69
> > MP/MC bulk enq/dequeue (size: 32): 24.59
> > Test OK
> >
> > # with ring rework patch
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 84
> > MP/MC single enq/dequeue: 301
> > SP/SC burst enq/dequeue (size: 8): 19
> > MP/MC burst enq/dequeue (size: 8): 45
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 18
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 7.10
> > MC empty dequeue: 12.15
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 18.59
> > MP/MC bulk enq/dequeue (size: 8): 45.49
> > SP/SC bulk enq/dequeue (size: 32): 11.67
> > MP/MC bulk enq/dequeue (size: 32): 18.65
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 37.41
> > MP/MC bulk enq/dequeue (size: 8): 72.98
> > SP/SC bulk enq/dequeue (size: 32): 18.69
> > MP/MC bulk enq/dequeue (size: 32): 24.59
> > Test OK
> > RTE>>
> >
> > # with ring rework patch + cache-line size change to one on 128BCL target
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 90
> > MP/MC single enq/dequeue: 317
> > SP/SC burst enq/dequeue (size: 8): 20
> > MP/MC burst enq/dequeue (size: 8): 48
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 18
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 8.10
> > MC empty dequeue: 11.15
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 20.24
> > MP/MC bulk enq/dequeue (size: 8): 48.43
> > SP/SC bulk enq/dequeue (size: 32): 11.01
> > MP/MC bulk enq/dequeue (size: 32): 18.43
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 25.92
> > MP/MC bulk enq/dequeue (size: 8): 69.76
> > SP/SC bulk enq/dequeue (size: 32): 14.27
> > MP/MC bulk enq/dequeue (size: 32): 22.94
> > Test OK
> > RTE>>
>
> So given that there is not much difference here, is the MIN_SIZE, i.e.
> forced 64B, your preference rather than the actual cache-line size?

I don't quite like this CACHE_LINE_MIN_SIZE macro; to me, it does not
mean anything here. The reasons for aligning on a cache line size are
straightforward, but when would we need to align on the minimum cache
line size supported by DPDK? For instance, in the mbuf structure,
aligning on 64B would make more sense to me.

So I would prefer using (RTE_CACHE_LINE_SIZE * 2) here.
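To make the distinction concrete: RTE_CACHE_LINE_MIN_SIZE is a fixed
64B floor shared by all DPDK targets, while RTE_CACHE_LINE_SIZE is the
build target's actual line size. A small sketch of how the two choices
work out (the RING_ALIGN_* names are hypothetical, introduced only for
this comparison):

#include <rte_memory.h> /* RTE_CACHE_LINE_SIZE, RTE_CACHE_LINE_MIN_SIZE */

/*
 * Resulting alignment of the ring's producer/consumer halves:
 *
 *   target                 RTE_CACHE_LINE_SIZE  RTE_CACHE_LINE_MIN_SIZE
 *   x86 (64B lines)        64                   64
 *   128B cache-line ARM    128                  64
 *
 * (RTE_CACHE_LINE_SIZE * 2) gives 128B on x86 and 256B on a 128B
 * cache-line target; a MIN_SIZE-based constant forces 128B everywhere.
 */
#define RING_ALIGN_PROPOSED  (RTE_CACHE_LINE_SIZE * 2)     /* scales with target */
#define RING_ALIGN_MIN_BASED (RTE_CACHE_LINE_MIN_SIZE * 2) /* fixed 128B */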
If we don't want it on some architectures, or if this optimization is
only for Intel (or for all architectures that need it), I think we
could have something like:

/* bla bla */
#ifdef INTEL
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE * 2)
#else
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
#endif
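Spelled out with existing DPDK symbols (assuming RTE_ARCH_X86 stands in
for the INTEL placeholder above, and with a hypothetical my_ring type
showing where the macro would attach), the sketch could look like this:

#include <stdint.h>
#include <rte_memory.h> /* __rte_aligned(), RTE_CACHE_LINE_SIZE */

/* Double the alignment only where the spare cache line between the
 * two halves is known to help (here: x86, per the discussion above). */
#ifdef RTE_ARCH_X86
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE * 2)
#else
#define __rte_ring_aligned __rte_aligned(RTE_CACHE_LINE_SIZE)
#endif

/* Hypothetical ring type with the macro applied to each half. */
struct my_ring_headtail {
        volatile uint32_t head;
        volatile uint32_t tail;
};

struct my_ring {
        struct my_ring_headtail prod __rte_ring_aligned;
        struct my_ring_headtail cons __rte_ring_aligned;
};

Olivier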