Subject: RE: [PATCH 0/1] mempool: implement index-based per core cache
From: Morten Brørup
To: Dharmik Thakkar
Date: Sat, 25 Dec 2021 01:16:03 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86DAD@smartserver.smartshare.dk>
In-Reply-To: <20211224225923.806498-1-dharmik.thakkar@arm.com>
References: <20210930172735.2675627-1-dharmik.thakkar@arm.com> <20211224225923.806498-1-dharmik.thakkar@arm.com>
List-Id: DPDK patches and discussions

> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> Sent: Friday, 24 December 2021 23.59
>
> The current mempool per-core cache implementation stores pointers to mbufs.
> On 64-bit architectures, each pointer consumes 8 B.
> This patch replaces it with an index-based implementation, wherein each
> buffer is addressed by (pool base address + index). It reduces the amount
> of memory/cache required for the per-core cache.
>
> L3Fwd performance testing reveals minor improvements in cache performance
> (L1 and L2 misses reduced by 0.60%) with no change in throughput.
>
> Micro-benchmarking the patch using mempool_perf_test shows significant
> improvement with the majority of the test cases.
>
> Number of cores = 1:
> n_get_bulk=1  n_put_bulk=1  n_keep=32   %_change_with_patch=18.01
> n_get_bulk=1  n_put_bulk=1  n_keep=128  %_change_with_patch=19.91
> n_get_bulk=1  n_put_bulk=4  n_keep=32   %_change_with_patch=-20.37 (regression)
> n_get_bulk=1  n_put_bulk=4  n_keep=128  %_change_with_patch=-17.01 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=32   %_change_with_patch=-25.06 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=128  %_change_with_patch=-23.81 (regression)
> n_get_bulk=4  n_put_bulk=1  n_keep=32   %_change_with_patch=53.93
> n_get_bulk=4  n_put_bulk=1  n_keep=128  %_change_with_patch=60.90
> n_get_bulk=4  n_put_bulk=4  n_keep=32   %_change_with_patch=1.64
> n_get_bulk=4  n_put_bulk=4  n_keep=128  %_change_with_patch=8.76
> n_get_bulk=4  n_put_bulk=32 n_keep=32   %_change_with_patch=-4.71 (regression)
> n_get_bulk=4  n_put_bulk=32 n_keep=128  %_change_with_patch=-3.19 (regression)
> n_get_bulk=32 n_put_bulk=1  n_keep=32   %_change_with_patch=65.63
> n_get_bulk=32 n_put_bulk=1  n_keep=128  %_change_with_patch=75.19
> n_get_bulk=32 n_put_bulk=4  n_keep=32   %_change_with_patch=11.75
> n_get_bulk=32 n_put_bulk=4  n_keep=128  %_change_with_patch=15.52
> n_get_bulk=32 n_put_bulk=32 n_keep=32   %_change_with_patch=13.45
> n_get_bulk=32 n_put_bulk=32 n_keep=128  %_change_with_patch=11.58
>
> Number of cores = 2:
> n_get_bulk=1  n_put_bulk=1  n_keep=32   %_change_with_patch=18.21
> n_get_bulk=1  n_put_bulk=1  n_keep=128  %_change_with_patch=21.89
> n_get_bulk=1  n_put_bulk=4  n_keep=32   %_change_with_patch=-21.21 (regression)
> n_get_bulk=1  n_put_bulk=4  n_keep=128  %_change_with_patch=-17.05 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=32   %_change_with_patch=-26.09 (regression)
> n_get_bulk=1  n_put_bulk=32 n_keep=128  %_change_with_patch=-23.49 (regression)
> n_get_bulk=4  n_put_bulk=1  n_keep=32   %_change_with_patch=56.28
> n_get_bulk=4  n_put_bulk=1  n_keep=128  %_change_with_patch=67.69
> n_get_bulk=4  n_put_bulk=4  n_keep=32   %_change_with_patch=1.45
> n_get_bulk=4  n_put_bulk=4  n_keep=128  %_change_with_patch=8.84
> n_get_bulk=4  n_put_bulk=32 n_keep=32   %_change_with_patch=-5.27 (regression)
> n_get_bulk=4  n_put_bulk=32 n_keep=128  %_change_with_patch=-3.09 (regression)
> n_get_bulk=32 n_put_bulk=1  n_keep=32   %_change_with_patch=76.11
> n_get_bulk=32 n_put_bulk=1  n_keep=128  %_change_with_patch=86.06
> n_get_bulk=32 n_put_bulk=4  n_keep=32   %_change_with_patch=11.86
> n_get_bulk=32 n_put_bulk=4  n_keep=128  %_change_with_patch=16.55
> n_get_bulk=32 n_put_bulk=32 n_keep=32   %_change_with_patch=13.01
> n_get_bulk=32 n_put_bulk=32 n_keep=128  %_change_with_patch=11.51
>
> From analyzing the results, it is clear that for n_get_bulk and
> n_put_bulk sizes of 32 there is no performance regression.
> IMO, the other sizes are not practical from a performance perspective,
> and the regression in those cases can be safely ignored.
>
> Dharmik Thakkar (1):
>   mempool: implement index-based per core cache
>
>  lib/mempool/rte_mempool.h             | 114 +++++++++++++++++++++++++-
>  lib/mempool/rte_mempool_ops_default.c |   7 ++
>  2 files changed, 119 insertions(+), 2 deletions(-)
>
> --
> 2.25.1
>

I still think this is very interesting, and your performance numbers are looking good.

However, it limits the size of a mempool to 4 GB. As previously discussed, the max mempool size can be increased by multiplying the index by a constant.

I would suggest using sizeof(uintptr_t) as the constant multiplier, so the mempool can hold objects of any size divisible by sizeof(uintptr_t). And it would be silly to use a mempool to hold objects smaller than sizeof(uintptr_t).

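In other words, something along the lines of the sketch below. The function and parameter names are only for illustration, not the actual patch API, and I am assuming that objects are at least sizeof(uintptr_t)-aligned:

#include <stdint.h>

/*
 * Sketch only - illustrative names, not the actual patch code.
 * With a 32-bit index scaled by sizeof(uintptr_t), the addressable
 * range grows from 4 GB to 32 GB on 64-bit architectures, while the
 * per-core cache still stores only 4 B per object instead of an
 * 8 B pointer.
 */
static inline void *
mempool_cache_idx2obj(uintptr_t pool_base_addr, uint32_t idx)
{
	/* Assumes objects are aligned to sizeof(uintptr_t). */
	return (void *)(pool_base_addr + (uintptr_t)idx * sizeof(uintptr_t));
}

static inline uint32_t
mempool_cache_obj2idx(uintptr_t pool_base_addr, const void *obj)
{
	return (uint32_t)(((uintptr_t)obj - pool_base_addr) / sizeof(uintptr_t));
}
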
How does the performance look if you multiply the index by sizeof(uintptr_t)?

Med venlig hilsen / Kind regards,
-Morten Brørup