From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: RE: [PATCH 0/1] mempool: implement index-based per core cache
Date: Mon, 31 Jul 2023 14:33:03 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D87AAF@smartserver.smartshare.dk>
In-Reply-To: <12619560.VsHLxoZxqI@thomas>
References: <20210930172735.2675627-1-dharmik.thakkar@arm.com> <20230706104302.4fd9807a@hermes.local> <12619560.VsHLxoZxqI@thomas>
From: Morten Brørup
To: "Thomas Monjalon" , "Dharmik Thakkar"
Cc: "Jerin Jacob" , "Bruce Richardson" , "Honnappa Nagarahalli" , "nd" , "Ruifeng Wang" , "Stephen Hemminger"
List-Id: DPDK patches and discussions

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Monday, 31 July 2023 14.24
>
> The v2
was not sent, and Stephen dropped the patch from patchwork.
>
> Do we abandon this feature?

+1, because I think that the zero-copy mempool cache access functions make this patch irrelevant.

> Should I remove it from the roadmap?

+1

>
> 06/07/2023 19:43, Stephen Hemminger:
> > On Thu, 13 Jan 2022 05:31:18 +0000
> > Dharmik Thakkar wrote:
> > >
> > > Hi,
> > >
> > > Thank you for your valuable review comments and suggestions!
> > >
> > > I will be sending out a v2 in which I have increased the size of the
> > > mempool to 32 GB by using division by sizeof(uintptr_t).
> > > However, I am seeing ~5% performance degradation with
> > > mempool_perf_autotest (for a bulk size of 32) with this change
> > > when compared to the base performance.
> > > Earlier, without this change, I was seeing an improvement of ~13%
> > > compared to the base performance. So, this is a significant degradation.
> > > I would appreciate your review comments on v2.
> > >
> > > Thank you!
> > >
> > > > On Jan 10, 2022, at 12:38 AM, Jerin Jacob wrote:
> > > >
> > > > On Sat, Jan 8, 2022 at 3:07 PM Morten Brørup wrote:
> > > >>
> > > >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > >>> Sent: Friday, 7 January 2022 14.51
> > > >>>
> > > >>> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > > >>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > >>>>> Sent: Friday, 7 January 2022 12.16
> > > >>>>>
> > > >>>>> On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > > >>>>>>> From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> > > >>>>>>> Sent: Friday, 24 December 2021 23.59
> > > >>>>>>>
> > > >>>>>>> The current mempool per-core cache implementation stores pointers
> > > >>>>>>> to mbufs. On 64-bit architectures, each pointer consumes 8 B.
> > > >>>>>>> This patch replaces it with an index-based implementation,
> > > >>>>>>> wherein each buffer is addressed by
> > > >>>>>>> (pool base address + index). It reduces the amount of
> > > >>>>>>> memory/cache required for the per-core cache.
> > > >>>>>>>
> > > >>>>>>> L3Fwd performance testing reveals minor improvements in the
> > > >>>>>>> cache performance (L1 and L2 misses reduced by 0.60%) with no
> > > >>>>>>> change in throughput.
> > > >>>>>>>
> > > >>>>>>> Micro-benchmarking the patch using mempool_perf_test shows
> > > >>>>>>> significant improvement with the majority of the test cases.
> > > >>>>>>>
> > > >>>>>> I still think this is very interesting. And your performance
> > > >>>>>> numbers are looking good.
> > > >>>>>>
> > > >>>>>> However, it limits the size of a mempool to 4 GB. As previously
> > > >>>>>> discussed, the max mempool size can be increased by multiplying
> > > >>>>>> the index with a constant.
> > > >>>>>>
> > > >>>>>> I would suggest using sizeof(uintptr_t) as the constant
> > > >>>>>> multiplier, so the mempool can hold objects of any size divisible
> > > >>>>>> by sizeof(uintptr_t). And it would be silly to use a mempool to
> > > >>>>>> hold objects smaller than sizeof(uintptr_t).
> > > >>>>>>
> > > >>>>>> How does the performance look if you multiply the index by
> > > >>>>>> sizeof(uintptr_t)?
> > > >>>>>>
> > > >>>>> Each mempool entry is cache aligned, so we can use that if we want
> > > >>>>> a bigger multiplier.
> > > >>>>
> > > >>>> Thanks for chiming in, Bruce.
> > > >>>>
> > > >>>> Please also read this discussion about the multiplier:
> > > >>>> http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
> > > >>>>
> > > >>> I actually wondered after I had sent the email whether we had indeed
> > > >>> an option to disable the cache alignment or not!
Thanks for pointing out
> > > >>> that we do. This brings a couple of additional thoughts:
> > > >>>
> > > >>> * Using indexes for the cache should probably be a runtime flag
> > > >>>   rather than a build-time one.
> > > >>> * It would seem reasonable to me to disallow use of the
> > > >>>   indexed-cache flag and the non-cache-aligned flag simultaneously.
> > > >>> * On the off chance that that restriction is unacceptable, we can
> > > >>>   make things a little more complicated by doing a runtime
> > > >>>   computation of the "index-shiftwidth" to use.
> > > >>>
> > > >>> Overall, I think defaulting to the cacheline shiftwidth and
> > > >>> disallowing index-based addressing when using unaligned buffers is
> > > >>> simplest and easiest, unless we can come up with a valid use case
> > > >>> for needing more than that.
> > > >>>
> > > >>> /Bruce
> > > >>
> > > >> This feature is a performance optimization.
> > > >>
> > > >> With that in mind, it should not introduce function pointers or
> > > >> similar run-time checks in the fast path to determine what kind of
> > > >> cache to use per mempool. And if an index multiplier is implemented,
> > > >> it should be a compile-time constant, probably something between
> > > >> sizeof(uintptr_t) and RTE_MEMPOOL_ALIGN (= RTE_CACHE_LINE_SIZE).
> > > >>
> > > >> The patch comes with a tradeoff between better performance and
> > > >> limited mempool size, and possibly some limitations regarding very
> > > >> small objects that are not cache line aligned to avoid wasting memory
> > > >> (RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).
> > > >>
> > > >> With no multiplier, the only tradeoff is that the mempool size is
> > > >> limited to 4 GB.
> > > >>
> > > >> If the multiplier is small (e.g. 8 bytes), the only tradeoff is that
> > > >> the mempool size is limited to 32 GB. (And a waste of memory for
> > > >> objects smaller than 8 bytes; but I don't think anyone would use a
> > > >> mempool to hold objects smaller than 8 bytes.)
> > > >>
> > > >> If the multiplier is larger (e.g. the 64-byte cache line size), the
> > > >> mempool size is instead limited to 256 GB, but
> > > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.
> > > >>
> > > >> Note: 32-bit platforms get no benefit from this patch: the pointer
> > > >> already uses only 4 bytes, so replacing the pointer with a 4-byte
> > > >> index makes no difference.
> > > >>
> > > >> Since this feature is a performance optimization only, and doesn't
> > > >> provide any new features, I don't mind it being a compile-time option.
> > > >>
> > > >> If this feature is a compile-time option, and the mempool library is
> > > >> compiled with the large multiplier, then
> > > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ could be made undefined in the
> > > >> public header file, so compilation of applications using the flag
> > > >> will fail. And rte_mempool_create() could RTE_ASSERT() that
> > > >> RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its flags parameter,
> > > >> or emit a warning about the flag being ignored. Obviously,
> > > >> rte_mempool_create() should also RTE_ASSERT() that the mempool is not
> > > >> larger than the library supports, possibly emitting a message that
> > > >> the mempool library should be built without this feature to support
> > > >> the larger mempool.
> > > >>
> > > >> Here is another thought: if only exotic applications use mempools
> > > >> larger than 32 GB, this would be a generally acceptable limit, and
> > > >> DPDK should use the index-based cache as default, making the opposite
> > > >> (i.e. the pointer-based cache) a compile-time option instead. A
> > > >> similar decision was recently made when limiting the RTE_MAX_LCORE
> > > >> default.
> > > >>
> > > >> Although DPDK is moving away from compile-time options in order to
> > > >> better support Linux distros, there should be a general exception for
> > > >> performance and memory optimizations.
Otherwise, network appliance
> > > >> vendors will inherit the increasing amount of DPDK bloat, and we
> > > >> (network appliance vendors) will eventually be forced to fork DPDK to
> > > >> get rid of the bloat and achieve the goals originally intended by
> > > >> DPDK.
> > > >
> > > > Agree with Morten's view on this.
> > > >
> > > >> If anyone disagrees with the principle about a general exception for
> > > >> performance and memory optimizations, I would like to pass on the
> > > >> decision to the Techboard!
> > > >>
> >
> > NAK
> > Having compile-time stuff like this means one side or the other is not
> > tested by CI infrastructure. There never was sufficient justification,
> > and lots of objections.
> > Dropping the patch.