From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 6 Jul 2023 10:43:02 -0700
From: Stephen Hemminger <stephen@networkplumber.org>
To: Dharmik Thakkar <Dharmik.Thakkar@arm.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>, Morten Brørup
 <mb@smartsharesystems.com>, Bruce Richardson <bruce.richardson@intel.com>,
 Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>, dpdk-dev
 <dev@dpdk.org>, nd <nd@arm.com>, Ruifeng Wang <Ruifeng.Wang@arm.com>
Subject: Re: [PATCH 0/1] mempool: implement index-based per core cache
Message-ID: <20230706104302.4fd9807a@hermes.local>
In-Reply-To: <DD1F9EC6-8316-4EFE-BAD6-4D285F0FB6BF@arm.com>
References: <20210930172735.2675627-1-dharmik.thakkar@arm.com>
 <20211224225923.806498-1-dharmik.thakkar@arm.com>
 <98CBD80474FA8B44BF855DF32C47DC35D86DAD@smartserver.smartshare.dk>
 <Ydgg7YlndpBb0g+w@bricha3-MOBL.ger.corp.intel.com>
 <98CBD80474FA8B44BF855DF32C47DC35D86DEA@smartserver.smartshare.dk>
 <YdhFKhWtpzKS6g7l@bricha3-MOBL.ger.corp.intel.com>
 <98CBD80474FA8B44BF855DF32C47DC35D86DEF@smartserver.smartshare.dk>
 <CALBAE1NWcTNiAjxD7hSgGBzsYP0bx+0bsaUxiJ0LpGmCByLRtA@mail.gmail.com>
 <DD1F9EC6-8316-4EFE-BAD6-4D285F0FB6BF@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On Thu, 13 Jan 2022 05:31:18 +0000
Dharmik Thakkar <Dharmik.Thakkar@arm.com> wrote:

> Hi,
>
> Thank you for your valuable review comments and suggestions!
>
> I will be sending out a v2 in which I have increased the size of the
> mempool to 32 GB by using division by sizeof(uintptr_t).
> However, I am seeing ~5% performance degradation with
> mempool_perf_autotest (for bulk size of 32) with this change when
> compared to the base performance.
> Earlier, without this change, I was seeing an improvement of ~13%
> compared to base performance. So, this is a significant degradation.
> I would appreciate your review comments on v2.
>
> Thank you!
>
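For reference, here is a minimal sketch (not code from the patch; the helper
names are made up) of the index/pointer conversion described above, assuming
the v2 scheme of scaling a 32-bit index by sizeof(uintptr_t):

/* Illustrative only: convert between an object pointer and a 32-bit
 * cache index when the index is scaled by sizeof(uintptr_t).
 */
#include <stdint.h>

static inline uint32_t
obj_to_index(uintptr_t pool_base, const void *obj)
{
	return (uint32_t)(((uintptr_t)obj - pool_base) / sizeof(uintptr_t));
}

static inline void *
index_to_obj(uintptr_t pool_base, uint32_t idx)
{
	return (void *)(pool_base + (uintptr_t)idx * sizeof(uintptr_t));
}

With an 8-byte multiplier, a 32-bit index can address 2^32 * 8 B = 32 GB of
pool memory, which is where the 32 GB figure above comes from.
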
> > On Jan 10, 2022, at 12:38 AM, Jerin Jacob <jerinjacobk@gmail.com> wrote:
> >
> > On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup <mb@smartsharesystems.com> wrote:
> >>
> >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>> Sent: Friday, 7 January 2022 14.51
> >>>
> >>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> >>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>> Sent: Friday, 7 January 2022 12.16
> >>>>>
> >>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> >>>>>>> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> >>>>>>> Sent: Friday, 24 December 2021 23.59
> >>>>>>>
> >>>>>>> The current mempool per-core cache implementation stores pointers
> >>>>>>> to mbufs. On 64-bit architectures, each pointer consumes 8 B.
> >>>>>>> This patch replaces it with an index-based implementation,
> >>>>>>> wherein each buffer is addressed by (pool base address + index).
> >>>>>>> It reduces the amount of memory/cache required for the per-core
> >>>>>>> cache.
> >>>>>>>
> >>>>>>> L3Fwd performance testing reveals minor improvements in cache
> >>>>>>> performance (L1 and L2 misses reduced by 0.60%) with no change
> >>>>>>> in throughput.
> >>>>>>>
> >>>>>>> Micro-benchmarking the patch using mempool_perf_test shows
> >>>>>>> significant improvement in the majority of the test cases.
> >>>>>>>
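To make the cache-footprint claim concrete, a rough illustration (sizes
hypothetical; the actual DPDK per-core cache structure differs) of how
indices shrink the per-core cache on 64-bit targets:

/* Illustrative only: a per-core cache holding N objects needs N pointers
 * today, but only N 32-bit indices with the proposed patch, roughly
 * halving its footprint on 64-bit targets.
 */
#include <stdint.h>

#define EXAMPLE_CACHE_SIZE 512                /* hypothetical cache size */

void     *ptr_cache[EXAMPLE_CACHE_SIZE];      /* 512 * 8 B = 4096 B */
uint32_t  idx_cache[EXAMPLE_CACHE_SIZE];      /* 512 * 4 B = 2048 B */
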
> >>>>>>
> >>>>>> I still think this is very interesting. And your performance
> >>>>>> numbers are looking good.
> >>>>>>
> >>>>>> However, it limits the size of a mempool to 4 GB. As previously
> >>>>>> discussed, the max mempool size can be increased by multiplying
> >>>>>> the index with a constant.
> >>>>>>
> >>>>>> I would suggest using sizeof(uintptr_t) as the constant
> >>>>>> multiplier, so the mempool can hold objects of any size divisible
> >>>>>> by sizeof(uintptr_t). And it would be silly to use a mempool to
> >>>>>> hold objects smaller than sizeof(uintptr_t).
> >>>>>>
> >>>>>> How does the performance look if you multiply the index by
> >>>>>> sizeof(uintptr_t)?
> >>>>>>
> >>>>>
> >>>>> Each mempool entry is cache aligned, so we can use that if we want
> >>>>> a bigger multiplier.
> >>>>
> >>>> Thanks for chiming in, Bruce.
> >>>>
> >>>> Please also read this discussion about the multiplier:
> >>>> http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
> >>>>
> >>>
> >>> I actually wondered after I had sent the email whether we had indeed
> >>> an option to disable the cache alignment or not! Thanks for pointing
> >>> out that we do. This brings a couple of additional thoughts:
> >>>
> >>> * Using indexes for the cache should probably be a runtime flag
> >>>   rather than a build-time one.
> >>> * It would seem reasonable to me to disallow use of the indexed-cache
> >>>   flag and the non-cache-aligned flag simultaneously.
> >>> * On the off chance that that restriction is unacceptable, we can
> >>>   make things a little more complicated by doing a runtime
> >>>   computation of the "index-shiftwidth" to use.
> >>>
> >>> Overall, I think defaulting to the cache-line shiftwidth and
> >>> disallowing index-based addressing when using unaligned buffers is
> >>> simplest and easiest, unless we can come up with a valid use case
> >>> for needing more than that.
> >>>
> >>> /Bruce
> >>
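A sketch of what the runtime "index-shiftwidth" computation mentioned above
might look like, assuming the shift is simply the largest power of two
dividing the object size, capped at the cache line size (hypothetical
helper, not an existing DPDK API):

/* Illustrative only: derive a per-mempool shift so that
 * obj = base + (index << shift) can span the whole pool.
 * Assumes obj_size != 0; __builtin_ctz is a GCC/Clang builtin.
 */
#include <stdint.h>

static inline unsigned int
index_shiftwidth(uint32_t obj_size, uint32_t cache_line_size)
{
	uint32_t align = obj_size & -obj_size;  /* largest power of two dividing obj_size */

	if (align > cache_line_size)
		align = cache_line_size;
	return (unsigned int)__builtin_ctz(align);      /* e.g. 64 -> 6 */
}
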
> >> This feature is a performance optimization.
> >>
> >> With that in mind, it should not introduce function pointers or
> >> similar run-time checks in the fast path to determine what kind of
> >> cache to use per mempool. And if an index multiplier is implemented,
> >> it should be a compile-time constant, probably something between
> >> sizeof(uintptr_t) and RTE_MEMPOOL_ALIGN (=RTE_CACHE_LINE_SIZE).
> >>
> >> The patch comes with a tradeoff between better performance and
> >> limited mempool size, and possibly some limitations regarding very
> >> small objects that are not cache line aligned to avoid wasting memory
> >> (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
> >>
> >> With no multiplier, the only tradeoff is that the mempool size is
> >> limited to 4 GB.
> >>
> >> If the multiplier is small (i.e. 8 bytes), the only tradeoff is that
> >> the mempool size is limited to 32 GB. (And a waste of memory for
> >> objects smaller than 8 bytes; but I don't think anyone would use a
> >> mempool to hold objects smaller than 8 bytes.)
> >>
> >> If the multiplier is larger (i.e. the 64-byte cache line size), the
> >> mempool size is instead limited to 256 GB, but
> >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
> >>
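The size limits discussed in this sub-thread follow directly from the
32-bit index: the maximum pool size is 2^32 times the multiplier. A quick
compile-time sanity check of those figures (illustrative only, C11):

/* A 32-bit index can name 2^32 units, so the maximum pool size is
 * 2^32 * multiplier bytes.
 */
_Static_assert((1ULL << 32) * 1  == 4ULL   * (1ULL << 30), "no multiplier: 4 GB");
_Static_assert((1ULL << 32) * 8  == 32ULL  * (1ULL << 30), "8-byte multiplier: 32 GB");
_Static_assert((1ULL << 32) * 64 == 256ULL * (1ULL << 30), "64-byte multiplier: 256 GB");
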
> >> Note: 32-bit platforms gain no benefit from this patch: the pointer
> >> already only uses 4 bytes, so replacing the pointer with a 4-byte
> >> index makes no difference.
> >>
> >>
> >> Since this feature is a performance optimization only, and doesn't
> >> provide any new features, I don't mind it being a compile-time
> >> option.
> >>
> >> If this feature is a compile-time option, and the mempool library is
> >> compiled with the large multiplier, then
> >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the
> >> public header file, so compilation of applications using the flag
> >> will fail. And rte_mempool_create() could RTE_ASSERT() that
> >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter,
> >> or emit a warning about the flag being ignored. Obviously,
> >> rte_mempool_create() should also RTE_ASSERT() that the mempool is not
> >> larger than the library supports, possibly emitting a message that
> >> the mempool library should be built without this feature to support
> >> the larger mempool.
> >>
> >> Here is another thought: If only exotic applications use mempools
> >> larger than 32 GB, this would be a generally acceptable limit, and
> >> DPDK should use the index-based cache as the default, making the
> >> opposite (i.e. the pointer-based cache) a compile-time option
> >> instead. A similar decision was recently made for limiting the
> >> RTE_MAX_LCORE default.
> >>
> >>
> >> Although DPDK is moving away from compile-time options in order to
> >> better support Linux distros, there should be a general exception for
> >> performance and memory optimizations. Otherwise, network appliance
> >> vendors will inherit the increasing amount of DPDK bloat, and we
> >> (network appliance vendors) will eventually be forced to fork DPDK to
> >> get rid of the bloat and achieve the goals originally intended by
> >> DPDK.
> >
> > Agree with Morten's view on this.
> >
> >> If anyone disagrees with the principle of a general exception for
> >> performance and memory optimizations, I would like to pass on the
> >> decision to the Techboard!
> >>

NAK

Having compile-time options like this means one side or the other is not
tested by the CI infrastructure. There never was sufficient justification,
and there were lots of objections.

Dropping the patch.