Subject: RE: [PATCH 0/1] mempool: implement index-based per core cache
From: Morten Brørup
To: Dharmik Thakkar
Date: Sat, 25 Dec 2021 01:16:03 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86DAD@smartserver.smartshare.dk>
In-Reply-To: <20211224225923.806498-1-dharmik.thakkar@arm.com>
References: <20210930172735.2675627-1-dharmik.thakkar@arm.com> <20211224225923.806498-1-dharmik.thakkar@arm.com>
List-Id: DPDK patches and discussions

> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> Sent: Friday, 24 December 2021 23.59
>
> The current mempool per-core cache implementation stores pointers to mbufs.
> On 64-bit architectures, each pointer consumes 8 B.
> This patch replaces it with an index-based implementation, wherein each
> buffer is addressed by (pool base address + index). It reduces the amount
> of memory/cache required for the per-core cache.
>
> L3Fwd performance testing reveals minor improvements in cache performance
> (L1 and L2 misses reduced by 0.60%) with no change in throughput.
>
> Micro-benchmarking the patch using mempool_perf_test shows significant
> improvement with the majority of the test cases.
>
> Number of cores = 1:
> n_get_bulk=1  n_put_bulk=1  n_keep=32   %_change_with_patch=18.01
> n_get_bulk=1  n_put_bulk=1  n_keep=128  %_change_with_patch=19.91
> n_get_bulk=1  n_put_bulk=4  n_keep=32   %_change_with_patch=-20.37 (regression)
> n_get_bulk=1  n_put_bulk=4  n_keep=128  %_change_with_patch=-17.01 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=32   %_change_with_patch=-25.06 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=128  %_change_with_patch=-23.81 (regression)
> n_get_bulk=4  n_put_bulk=1  n_keep=32   %_change_with_patch=53.93
> n_get_bulk=4  n_put_bulk=1  n_keep=128  %_change_with_patch=60.90
> n_get_bulk=4  n_put_bulk=4  n_keep=32   %_change_with_patch=1.64
> n_get_bulk=4  n_put_bulk=4  n_keep=128  %_change_with_patch=8.76
> n_get_bulk=4  n_put_bulk=32 n_keep=32   %_change_with_patch=-4.71 (regression)
> n_get_bulk=4  n_put_bulk=32 n_keep=128  %_change_with_patch=-3.19 (regression)
> n_get_bulk=32 n_put_bulk=1  n_keep=32   %_change_with_patch=65.63
> n_get_bulk=32 n_put_bulk=1  n_keep=128  %_change_with_patch=75.19
> n_get_bulk=32 n_put_bulk=4  n_keep=32   %_change_with_patch=11.75
> n_get_bulk=32 n_put_bulk=4  n_keep=128  %_change_with_patch=15.52
> n_get_bulk=32 n_put_bulk=32 n_keep=32   %_change_with_patch=13.45
> n_get_bulk=32 n_put_bulk=32 n_keep=128  %_change_with_patch=11.58
>
> Number of cores = 2:
> n_get_bulk=1  n_put_bulk=1  n_keep=32   %_change_with_patch=18.21
> n_get_bulk=1  n_put_bulk=1  n_keep=128  %_change_with_patch=21.89
> n_get_bulk=1  n_put_bulk=4  n_keep=32   %_change_with_patch=-21.21 (regression)
> n_get_bulk=1  n_put_bulk=4  n_keep=128  %_change_with_patch=-17.05 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=32   %_change_with_patch=-26.09 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=128  %_change_with_patch=-23.49 (regression)
> n_get_bulk=4  n_put_bulk=1  n_keep=32   %_change_with_patch=56.28
> n_get_bulk=4  n_put_bulk=1  n_keep=128  %_change_with_patch=67.69
> n_get_bulk=4  n_put_bulk=4  n_keep=32   %_change_with_patch=1.45
> n_get_bulk=4  n_put_bulk=4  n_keep=128  %_change_with_patch=8.84
> n_get_bulk=4  n_put_bulk=32 n_keep=32   %_change_with_patch=-5.27 (regression)
> n_get_bulk=4  n_put_bulk=32 n_keep=128  %_change_with_patch=-3.09 (regression)
> n_get_bulk=32 n_put_bulk=1  n_keep=32   %_change_with_patch=76.11
> n_get_bulk=32 n_put_bulk=1  n_keep=128  %_change_with_patch=86.06
> n_get_bulk=32 n_put_bulk=4  n_keep=32   %_change_with_patch=11.86
> n_get_bulk=32 n_put_bulk=4  n_keep=128  %_change_with_patch=16.55
> n_get_bulk=32 n_put_bulk=32 n_keep=32   %_change_with_patch=13.01
> n_get_bulk=32 n_put_bulk=32 n_keep=128  %_change_with_patch=11.51
>
> From analyzing the results, it is clear that for n_get_bulk and
> n_put_bulk sizes of 32 there is no performance regression.
> IMO, the other sizes are not practical from a performance perspective,
> and the regression in those cases can be safely ignored.
>
> Dharmik Thakkar (1):
>   mempool: implement index-based per core cache
>
>  lib/mempool/rte_mempool.h             | 114 +++++++++++++++++++++++++-
>  lib/mempool/rte_mempool_ops_default.c |   7 ++
>  2 files changed, 119 insertions(+), 2 deletions(-)
>
> --
> 2.25.1
>

I still think this is very interesting, and your performance numbers are looking good.

However, it limits the size of a mempool to 4 GB. As previously discussed, the max mempool size can be increased by multiplying the index by a constant.

I would suggest using sizeof(uintptr_t) as the constant multiplier, so the mempool can hold objects of any size divisible by sizeof(uintptr_t). And it would be silly to use a mempool to hold objects smaller than sizeof(uintptr_t).

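In other words, something along the lines of the sketch below. The function and parameter names are only for illustration, not the actual patch API, and I am assuming that objects are at least sizeof(uintptr_t)-aligned:

#include <stdint.h>

/*
 * Sketch only - illustrative names, not the actual patch code.
 * With a 32-bit index scaled by sizeof(uintptr_t), the addressable
 * range grows from 4 GB to 32 GB on 64-bit architectures, while the
 * per-core cache still stores only 4 B per object instead of an
 * 8 B pointer.
 */
static inline void *
mempool_cache_idx2obj(uintptr_t pool_base_addr, uint32_t idx)
{
	/* Assumes objects are aligned to sizeof(uintptr_t). */
	return (void *)(pool_base_addr + (uintptr_t)idx * sizeof(uintptr_t));
}

static inline uint32_t
mempool_cache_obj2idx(uintptr_t pool_base_addr, const void *obj)
{
	return (uint32_t)(((uintptr_t)obj - pool_base_addr) / sizeof(uintptr_t));
}
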
How does the performance look if you multiply the index by sizeof(uintptr_t)?

Med venlig hilsen / Kind regards,
-Morten Brørup