From: Thomas Monjalon
To: Dharmik Thakkar
Cc: dev@dpdk.org, Jerin Jacob, Morten Brørup, Bruce Richardson,
 Honnappa Nagarahalli, nd, Ruifeng Wang, Stephen Hemminger,
 olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru
Subject: Re: [PATCH 0/1] mempool: implement index-based per core cache
Date: Mon, 31 Jul 2023 14:23:51 +0200
Message-ID: <12619560.VsHLxoZxqI@thomas>
In-Reply-To: <20230706104302.4fd9807a@hermes.local>
References: <20210930172735.2675627-1-dharmik.thakkar@arm.com>
 <20230706104302.4fd9807a@hermes.local>

The v2 was not sent, and Stephen dropped the patch from patchwork.
Do we abandon this feature?
Should I remove it from the roadmap?

06/07/2023 19:43, Stephen Hemminger:
> On Thu, 13 Jan 2022 05:31:18 +0000
> Dharmik Thakkar wrote:
> 
> > Hi,
> > 
> > Thank you for your valuable review comments and suggestions!
> > 
> > I will be sending out a v2 in which I have increased the size of the
> > mempool to 32GB by using division by sizeof(uintptr_t).
> > However, I am seeing ~5% performance degradation with
> > mempool_perf_autotest (for bulk size of 32) with this change when
> > compared to the base performance.
> > Earlier, without this change, I was seeing an improvement of ~13%
> > compared to base performance. So, this is a significant degradation.
> > I would appreciate your review comments on v2.
> > 
> > Thank you!
> > 
> > > On Jan 10, 2022, at 12:38 AM, Jerin Jacob wrote:
> > > 
> > > On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup wrote:
> > >> 
> > >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > >>> Sent: Friday, 7 January 2022 14.51
> > >>> 
> > >>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > >>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > >>>>> Sent: Friday, 7 January 2022 12.16
> > >>>>> 
> > >>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > >>>>>>> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> > >>>>>>> Sent: Friday, 24 December 2021 23.59
> > >>>>>>> 
> > >>>>>>> The current mempool per core cache implementation stores pointers
> > >>>>>>> to mbufs. On 64b architectures, each pointer consumes 8B. This
> > >>>>>>> patch replaces it with an index-based implementation, wherein each
> > >>>>>>> buffer is addressed by (pool base address + index). It reduces the
> > >>>>>>> amount of memory/cache required for the per core cache.
> > >>>>>>> 
> > >>>>>>> L3Fwd performance testing reveals minor improvements in the cache
> > >>>>>>> performance (L1 and L2 misses reduced by 0.60%) with no change in
> > >>>>>>> throughput.
> > >>>>>>> 
> > >>>>>>> Micro-benchmarking the patch using mempool_perf_test shows
> > >>>>>>> significant improvement with the majority of the test cases.
> > >>>>>> 
> > >>>>>> I still think this is very interesting. And your performance
> > >>>>>> numbers are looking good.
> > >>>>>> 
> > >>>>>> However, it limits the size of a mempool to 4 GB. As previously
> > >>>>>> discussed, the max mempool size can be increased by multiplying
> > >>>>>> the index with a constant.
> > >>>>>> 
> > >>>>>> I would suggest using sizeof(uintptr_t) as the constant multiplier,
> > >>>>>> so the mempool can hold objects of any size divisible by
> > >>>>>> sizeof(uintptr_t).
> > >>>>>> And it would be silly to use a mempool to hold objects smaller
> > >>>>>> than sizeof(uintptr_t).
> > >>>>>> 
> > >>>>>> How does the performance look if you multiply the index by
> > >>>>>> sizeof(uintptr_t)?
> > >>>>>> 
> > >>>>> 
> > >>>>> Each mempool entry is cache aligned, so we can use that if we want
> > >>>>> a bigger multiplier.
> > >>>> 
> > >>>> Thanks for chiming in, Bruce.
> > >>>> 
> > >>>> Please also read this discussion about the multiplier:
> > >>>> http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
> > >>>> 
> > >>> 
> > >>> I actually wondered after I had sent the email whether we had indeed
> > >>> an option to disable the cache alignment or not! Thanks for pointing
> > >>> out that we do. This brings a couple of additional thoughts:
> > >>> 
> > >>> * Using indexes for the cache should probably be a runtime flag
> > >>>   rather than a build-time one.
> > >>> * It would seem reasonable to me to disallow use of the indexed-cache
> > >>>   flag and the non-cache-aligned flag simultaneously.
> > >>> * On the off chance that that restriction is unacceptable, then we
> > >>>   can make things a little more complicated by doing a runtime
> > >>>   computation of the "index-shiftwidth" to use.
> > >>> 
> > >>> Overall, I think defaulting to cacheline shiftwidth and disallowing
> > >>> index-based addressing when using unaligned buffers is simplest and
> > >>> easiest, unless we can come up with a valid use case for needing more
> > >>> than that.
> > >>> 
> > >>> /Bruce
> > >> 
> > >> This feature is a performance optimization.
> > >> 
> > >> With that in mind, it should not introduce function pointers or
> > >> similar run-time checks in the fast path to determine what kind of
> > >> cache to use per mempool. And if an index multiplier is implemented,
> > >> it should be a compile time constant, probably something between
> > >> sizeof(uintptr_t) and RTE_MEMPOOL_ALIGN (=RTE_CACHE_LINE_SIZE).
> > >> 
> > >> The patch comes with a tradeoff between better performance and limited
> > >> mempool size, and possibly some limitations regarding very small
> > >> objects that are not cache line aligned in order to avoid wasting
> > >> memory (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
> > >> 
> > >> With no multiplier, the only tradeoff is that the mempool size is
> > >> limited to 4 GB.
> > >> 
> > >> If the multiplier is small (i.e. 8 bytes), the only tradeoff is that
> > >> the mempool size is limited to 32 GB. (And a waste of memory for
> > >> objects smaller than 8 bytes; but I don't think anyone would use a
> > >> mempool to hold objects smaller than 8 bytes.)
> > >> 
> > >> If the multiplier is larger (i.e. the 64 byte cache line size), the
> > >> mempool size is instead limited to 256 GB, but
> > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
> > >> 
> > >> Note: 32 bit platforms have no benefit from this patch: the pointer
> > >> already only uses 4 bytes, so replacing the pointer with a 4 byte
> > >> index makes no difference.
> > >> 
> > >> Since this feature is a performance optimization only, and doesn't
> > >> provide any new features, I don't mind it being a compile time option.
> > >> 
> > >> If this feature is a compile time option, and the mempool library is
> > >> compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ
> > >> could be made undefined in the public header file, so compilation of
> > >> applications using the flag will fail.
> > >> And rte_mempool_create() could RTE_ASSERT() that
> > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter,
> > >> or emit a warning about the flag being ignored. Obviously,
> > >> rte_mempool_create() should also RTE_ASSERT() that the mempool is not
> > >> larger than the library supports, possibly emitting a message that the
> > >> mempool library should be built without this feature to support the
> > >> larger mempool.
> > >> 
> > >> Here is another thought: If only exotic applications use mempools
> > >> larger than 32 GB, this would be a generally acceptable limit, and
> > >> DPDK should use the index-based cache as default, making the opposite
> > >> (i.e. the pointer-based cache) a compile time option instead. A
> > >> similar decision was recently made for limiting the RTE_MAX_LCORE
> > >> default.
> > >> 
> > >> Although DPDK is moving away from compile time options in order to
> > >> better support Linux distros, there should be a general exception for
> > >> performance and memory optimizations. Otherwise, network appliance
> > >> vendors will inherit the increasing amount of DPDK bloat, and we
> > >> (network appliance vendors) will eventually be forced to fork DPDK to
> > >> get rid of the bloat and achieve the goals originally intended by DPDK.
> > > 
> > > Agree with Morten's view on this.
> > > 
> > >> If anyone disagrees with the principle about a general exception for
> > >> performance and memory optimizations, I would like to pass on the
> > >> decision to the Techboard!
> > >> 
> 
> NAK
> Having compile time stuff like this means one side or the other is not
> tested by CI infrastructure. There never was sufficient justification,
> and lots of objections.
> Dropping the patch.
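
To make the scheme under discussion concrete, here is a minimal sketch of
the index/pointer conversion and the resulting size limits. The helper
names and the IDX_MULTIPLIER constant are hypothetical, chosen only to
illustrate the idea; this is not the code from the patch.

#include <stdint.h>

/*
 * Multiplier applied to the 32-bit index. The thread discusses either
 * sizeof(uintptr_t) (8 B on 64-bit) or the 64 B cache line size.
 */
#define IDX_MULTIPLIER sizeof(uintptr_t)

/* The per-core cache stores a 32-bit offset instead of an 8 B pointer. */
static inline uint32_t
obj_to_index(void *pool_base, void *obj)
{
	return (uint32_t)(((uintptr_t)obj - (uintptr_t)pool_base) /
			  IDX_MULTIPLIER);
}

/* Recover the object address as pool base address + index * multiplier. */
static inline void *
index_to_obj(void *pool_base, uint32_t idx)
{
	return (void *)((uintptr_t)pool_base +
			(uintptr_t)idx * IDX_MULTIPLIER);
}

/*
 * Addressable range with a 32-bit index:
 *   multiplier  1 B -> 2^32 *  1 =   4 GB (plain index)
 *   multiplier  8 B -> 2^32 *  8 =  32 GB (sizeof(uintptr_t) on 64-bit)
 *   multiplier 64 B -> 2^32 * 64 = 256 GB (cache line size)
 */

With a compile-time power-of-two multiplier, the division and
multiplication reduce to shifts (the "index-shiftwidth" Bruce mentions),
which is why the thread argues against run-time checks in the fast path.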