From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id AE877A0352;
	Sat, 25 Dec 2021 00:00:20 +0100 (CET)
Received: from [217.70.189.124] (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 1B169410FA;
	Sat, 25 Dec 2021 00:00:14 +0100 (CET)
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
 by mails.dpdk.org (Postfix) with ESMTP id 03AFF4013F
 for <dev@dpdk.org>; Sat, 25 Dec 2021 00:00:10 +0100 (CET)
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
 by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 4F7881FB;
 Fri, 24 Dec 2021 15:00:10 -0800 (PST)
Received: from 2p2660v4-1.austin.arm.com (2p2660v4-1.austin.arm.com
 [10.118.13.211])
 by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 3B6F93F718;
 Fri, 24 Dec 2021 15:00:10 -0800 (PST)
From: Dharmik Thakkar <dharmik.thakkar@arm.com>
To: 
Cc: dev@dpdk.org, nd@arm.com, honnappa.nagarahalli@arm.com,
 ruifeng.wang@arm.com, Dharmik Thakkar <dharmik.thakkar@arm.com>
Subject: [PATCH 0/1] mempool: implement index-based per core cache
Date: Fri, 24 Dec 2021 16:59:22 -0600
Message-Id: <20211224225923.806498-1-dharmik.thakkar@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20210930172735.2675627-1-dharmik.thakkar@arm.com>
References: <20210930172735.2675627-1-dharmik.thakkar@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

The current mempool per-core cache implementation stores pointers to mbufs.
On 64-bit architectures, each pointer consumes 8 bytes.
This patch replaces it with an index-based implementation,
wherein each buffer is addressed by (pool base address + index).
This reduces the amount of memory/cache required for the per-core cache.
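
As an illustration only (this is not the code from the patch), the
pointer/index conversion could be sketched as below. The sketch_* names,
the 512-entry cache size and the use of a 32-bit element index are
assumptions made for the example. It also assumes fixed-size elements
laid out contiguously from the pool base address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative capacity only; not taken from the patch. */
#define SKETCH_CACHE_SIZE 512u

/* Pointer-based cache: one pointer per entry (8 bytes each on 64-bit). */
struct sketch_ptr_cache {
	uint32_t len;
	void *objs[SKETCH_CACHE_SIZE];
};

/* Index-based cache: one 32-bit index per entry (assumed index width). */
struct sketch_idx_cache {
	uint32_t len;
	uint32_t objs[SKETCH_CACHE_SIZE];
};

/*
 * Convert an object pointer into an element index relative to the pool
 * base. Assumes obj lies at a whole multiple of elt_size from base.
 */
static inline uint32_t
sketch_obj_to_index(const void *base, const void *obj, size_t elt_size)
{
	uintptr_t off = (uintptr_t)obj - (uintptr_t)base;

	assert(off % elt_size == 0);
	assert(off / elt_size <= UINT32_MAX);
	return (uint32_t)(off / elt_size);
}

/* Recover the object pointer: pool base address + index * element size. */
static inline void *
sketch_index_to_obj(void *base, uint32_t idx, size_t elt_size)
{
	return (char *)base + (size_t)idx * elt_size;
}
```

With these assumptions, the objs array of the index-based cache occupies
512 * 4 = 2048 bytes, versus 512 * 8 = 4096 bytes for the pointer-based
one on a 64-bit architecture. The price is one extra address computation
on every get and put.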

L3Fwd performance testing reveals a minor improvement in cache
performance (L1 and L2 misses reduced by 0.60%),
with no change in throughput.

Micro-benchmarking the patch using mempool_perf_test shows a
significant improvement in the majority of the test cases.

Number of cores = 1:
n_get_bulk=1 n_put_bulk=1 n_keep=32 %_change_with_patch=18.01
n_get_bulk=1 n_put_bulk=1 n_keep=128 %_change_with_patch=19.91
n_get_bulk=1 n_put_bulk=4 n_keep=32 %_change_with_patch=-20.37 (regression)
n_get_bulk=1 n_put_bulk=4 n_keep=128 %_change_with_patch=-17.01 (regression) 
n_get_bulk=1 n_put_bulk=32 n_keep=32 %_change_with_patch=-25.06 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=128 %_change_with_patch=-23.81 (regression)
n_get_bulk=4 n_put_bulk=1 n_keep=32 %_change_with_patch=53.93
n_get_bulk=4 n_put_bulk=1 n_keep=128 %_change_with_patch=60.90
n_get_bulk=4 n_put_bulk=4 n_keep=32 %_change_with_patch=1.64
n_get_bulk=4 n_put_bulk=4 n_keep=128 %_change_with_patch=8.76
n_get_bulk=4 n_put_bulk=32 n_keep=32 %_change_with_patch=-4.71 (regression)
n_get_bulk=4 n_put_bulk=32 n_keep=128 %_change_with_patch=-3.19 (regression)
n_get_bulk=32 n_put_bulk=1 n_keep=32 %_change_with_patch=65.63
n_get_bulk=32 n_put_bulk=1 n_keep=128 %_change_with_patch=75.19
n_get_bulk=32 n_put_bulk=4 n_keep=32 %_change_with_patch=11.75
n_get_bulk=32 n_put_bulk=4 n_keep=128 %_change_with_patch=15.52
n_get_bulk=32 n_put_bulk=32 n_keep=32 %_change_with_patch=13.45
n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.58

Number of cores = 2:
n_get_bulk=1 n_put_bulk=1 n_keep=32 %_change_with_patch=18.21
n_get_bulk=1 n_put_bulk=1 n_keep=128 %_change_with_patch=21.89
n_get_bulk=1 n_put_bulk=4 n_keep=32 %_change_with_patch=-21.21 (regression)
n_get_bulk=1 n_put_bulk=4 n_keep=128 %_change_with_patch=-17.05 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=32 %_change_with_patch=-26.09 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=128 %_change_with_patch=-23.49 (regression)
n_get_bulk=4 n_put_bulk=1 n_keep=32 %_change_with_patch=56.28
n_get_bulk=4 n_put_bulk=1 n_keep=128 %_change_with_patch=67.69
n_get_bulk=4 n_put_bulk=4 n_keep=32 %_change_with_patch=1.45
n_get_bulk=4 n_put_bulk=4 n_keep=128 %_change_with_patch=8.84
n_get_bulk=4 n_put_bulk=32 n_keep=32 %_change_with_patch=-5.27 (regression)
n_get_bulk=4 n_put_bulk=32 n_keep=128 %_change_with_patch=-3.09 (regression)
n_get_bulk=32 n_put_bulk=1 n_keep=32 %_change_with_patch=76.11
n_get_bulk=32 n_put_bulk=1 n_keep=128 %_change_with_patch=86.06
n_get_bulk=32 n_put_bulk=4 n_keep=32 %_change_with_patch=11.86
n_get_bulk=32 n_put_bulk=4 n_keep=128 %_change_with_patch=16.55
n_get_bulk=32 n_put_bulk=32 n_keep=32 %_change_with_patch=13.01
n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.51


Analysis of the results shows that when n_get_bulk and n_put_bulk are
both 32, there is no performance regression.
IMO, the other sizes are not practical from a performance perspective,
and the regression in those cases can be safely ignored.

Dharmik Thakkar (1):
  mempool: implement index-based per core cache

 lib/mempool/rte_mempool.h             | 114 +++++++++++++++++++++++++-
 lib/mempool/rte_mempool_ops_default.c |   7 ++
 2 files changed, 119 insertions(+), 2 deletions(-)

-- 
2.25.1