From: Dharmik Thakkar
Cc: dev@dpdk.org, nd@arm.com, honnappa.nagarahalli@arm.com, ruifeng.wang@arm.com, Dharmik Thakkar
Subject: [PATCH v2 0/1] mempool: implement index-based per core cache
Date: Wed, 12 Jan 2022 23:36:29 -0600
Message-Id: <20220113053630.886638-1-dharmik.thakkar@arm.com>
In-Reply-To: <20211224225923.806498-1-dharmik.thakkar@arm.com>
References: <20211224225923.806498-1-dharmik.thakkar@arm.com>

The current mempool per-core cache implementation stores pointers to mbufs.
On 64-bit architectures, each pointer consumes 8B.
This patch replaces it with an index-based implementation, in which each
buffer is addressed by (pool base address + index).
This reduces the amount of memory/cache required for the per-core cache.

L3Fwd performance testing reveals minor improvements in cache
performance (L1 and L2 misses reduced by 0.60%) with no change in
throughput.

Micro-benchmarking the patch using mempool_perf_test shows significant
improvement in the majority of the test cases.

Number of cores = 1:
n_get_bulk=1 n_put_bulk=1 n_keep=32 %_change_with_patch=18.01
n_get_bulk=1 n_put_bulk=1 n_keep=128 %_change_with_patch=19.91
n_get_bulk=1 n_put_bulk=4 n_keep=32 %_change_with_patch=-20.37 (regression)
n_get_bulk=1 n_put_bulk=4 n_keep=128 %_change_with_patch=-17.01 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=32 %_change_with_patch=-25.06 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=128 %_change_with_patch=-23.81 (regression)
n_get_bulk=4 n_put_bulk=1 n_keep=32 %_change_with_patch=53.93
n_get_bulk=4 n_put_bulk=1 n_keep=128 %_change_with_patch=60.90
n_get_bulk=4 n_put_bulk=4 n_keep=32 %_change_with_patch=1.64
n_get_bulk=4 n_put_bulk=4 n_keep=128 %_change_with_patch=8.76
n_get_bulk=4 n_put_bulk=32 n_keep=32 %_change_with_patch=-4.71 (regression)
n_get_bulk=4 n_put_bulk=32 n_keep=128 %_change_with_patch=-3.19 (regression)
n_get_bulk=32 n_put_bulk=1 n_keep=32 %_change_with_patch=65.63
n_get_bulk=32 n_put_bulk=1 n_keep=128 %_change_with_patch=75.19
n_get_bulk=32 n_put_bulk=4 n_keep=32 %_change_with_patch=11.75
n_get_bulk=32 n_put_bulk=4 n_keep=128 %_change_with_patch=15.52
n_get_bulk=32 n_put_bulk=32 n_keep=32 %_change_with_patch=13.45
n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.58

Number of cores = 2:
n_get_bulk=1 n_put_bulk=1 n_keep=32 %_change_with_patch=18.21
n_get_bulk=1 n_put_bulk=1 n_keep=128 %_change_with_patch=21.89
n_get_bulk=1 n_put_bulk=4 n_keep=32 %_change_with_patch=-21.21 (regression)
n_get_bulk=1 n_put_bulk=4 n_keep=128 %_change_with_patch=-17.05 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=32 %_change_with_patch=-26.09 (regression)
n_get_bulk=1 n_put_bulk=32 n_keep=128 %_change_with_patch=-23.49 (regression)
n_get_bulk=4 n_put_bulk=1 n_keep=32 %_change_with_patch=56.28
n_get_bulk=4 n_put_bulk=1 n_keep=128 %_change_with_patch=67.69
n_get_bulk=4 n_put_bulk=4 n_keep=32 %_change_with_patch=1.45
n_get_bulk=4 n_put_bulk=4 n_keep=128 %_change_with_patch=8.84
n_get_bulk=4 n_put_bulk=32 n_keep=32 %_change_with_patch=-5.27 (regression)
n_get_bulk=4 n_put_bulk=32 n_keep=128 %_change_with_patch=-3.09 (regression)
n_get_bulk=32 n_put_bulk=1 n_keep=32 %_change_with_patch=76.11
n_get_bulk=32 n_put_bulk=1 n_keep=128 %_change_with_patch=86.06
n_get_bulk=32 n_put_bulk=4 n_keep=32 %_change_with_patch=11.86
n_get_bulk=32 n_put_bulk=4 n_keep=128 %_change_with_patch=16.55
n_get_bulk=32 n_put_bulk=32 n_keep=32 %_change_with_patch=13.01
n_get_bulk=32 n_put_bulk=32 n_keep=128 %_change_with_patch=11.51

From analyzing the results, it is clear that for n_get_bulk and
n_put_bulk sizes of 32 there is no performance regression.
IMO, the other sizes are not practical from a performance perspective,
and the regression in those cases can be safely ignored.

An attempt to increase the size of the mempool to 32GB, by dividing the
index by sizeof(uintptr_t), has led to a performance degradation of ~5%
compared to the base performance.

---
v2:
 - Increase size of mempool to 32GB (Morten)
 - Improve performance for other platforms using dual loop unrolling
---
Dharmik Thakkar (1):
  mempool: implement index-based per core cache

 lib/mempool/rte_mempool.h             | 150 +++++++++++++++++++++++++-
 lib/mempool/rte_mempool_ops_default.c |   7 ++
 2 files changed, 156 insertions(+), 1 deletion(-)

-- 
2.17.1
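
For illustration, the (pool base address + index) addressing and the
sizeof(uintptr_t) scaling mentioned above can be sketched roughly as
below. This is a minimal sketch and not the code in the patch: the
helper names obj_to_index(), index_to_obj() and max_pool_bytes() are
hypothetical, and it assumes every object starts at an offset from the
pool base that is a multiple of sizeof(uintptr_t), so the division loses
no information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers (not part of the DPDK API): convert between an
 * object pointer and a 32-bit index relative to the pool base address.
 * The byte offset is divided by sizeof(uintptr_t) so that a 32-bit
 * index can span a larger pool.
 */
static inline uint32_t obj_to_index(const void *base, const void *obj)
{
    uintptr_t off = (uintptr_t)obj - (uintptr_t)base;

    /* Assumption: objects are aligned to sizeof(uintptr_t) from base. */
    assert(off % sizeof(uintptr_t) == 0);
    return (uint32_t)(off / sizeof(uintptr_t));
}

static inline void *index_to_obj(void *base, uint32_t idx)
{
    /* Scale the index back to a byte offset and add the pool base. */
    return (char *)base + (uintptr_t)idx * sizeof(uintptr_t);
}

/* Largest pool size, in bytes, addressable by a scaled 32-bit index. */
static inline uint64_t max_pool_bytes(void)
{
    return ((uint64_t)UINT32_MAX + 1) * sizeof(uintptr_t);
}
```

Under these assumptions, each cache slot holds a 4B index instead of an
8B pointer, and on a 64-bit target a scaled index covers
2^32 * 8B = 32GB of pool memory, compared with 4GB for an unscaled byte
offset.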