Subject: RE: [PATCH 0/1] mempool: implement index-based per core cache
Date: Sat, 8 Jan 2022 10:37:10 +0100
From: Morten Brørup
To: Bruce Richardson <bruce.richardson@intel.com>,
    Dharmik Thakkar <dharmik.thakkar@arm.com>, dev@dpdk.org
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86DEF@smartserver.smartshare.dk>
References: <20210930172735.2675627-1-dharmik.thakkar@arm.com>
            <20211224225923.806498-1-dharmik.thakkar@arm.com>
            <98CBD80474FA8B44BF855DF32C47DC35D86DAD@smartserver.smartshare.dk>
            <98CBD80474FA8B44BF855DF32C47DC35D86DEA@smartserver.smartshare.dk>

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 7 January 2022 14.51
>
> On Fri, Jan 07, 2022 at 12:29:23PM +0100, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Friday, 7 January 2022 12.16
> > >
> > > On Sat, Dec 25, 2021 at 01:16:03AM +0100, Morten Brørup wrote:
> > > > > From: Dharmik Thakkar [mailto:dharmik.thakkar@arm.com]
> > > > > Sent: Friday, 24 December 2021 23.59
> > > > >
> > > > > Current mempool per core cache implementation stores pointers
> > > > > to mbufs. On 64b architectures, each pointer consumes 8B. This
> > > > > patch replaces it with an index-based implementation, wherein
> > > > > each buffer is addressed by (pool base address + index). It
> > > > > reduces the amount of memory/cache required for the per core
> > > > > cache.
> > > > >
> > > > > L3Fwd performance testing reveals minor improvements in the
> > > > > cache performance (L1 and L2 misses reduced by 0.60%) with no
> > > > > change in throughput.
> > > > >
> > > > > Micro-benchmarking the patch using mempool_perf_test shows
> > > > > significant improvement with the majority of the test cases.
> > > > >
> > > >
> > > > I still think this is very interesting. And your performance
> > > > numbers are looking good.
> > > >
> > > > However, it limits the size of a mempool to 4 GB. As previously
> > > > discussed, the max mempool size can be increased by multiplying
> > > > the index with a constant.
> > > >
> > > > I would suggest using sizeof(uintptr_t) as the constant
> > > > multiplier, so the mempool can hold objects of any size divisible
> > > > by sizeof(uintptr_t). And it would be silly to use a mempool to
> > > > hold objects smaller than sizeof(uintptr_t).
> > > >
> > > > How does the performance look if you multiply the index by
> > > > sizeof(uintptr_t)?
> > > >
> > >
> > > Each mempool entry is cache aligned, so we can use that if we want
> > > a bigger multiplier.
> >
> > Thanks for chiming in, Bruce.
> >
> > Please also read this discussion about the multiplier:
> > http://inbox.dpdk.org/dev/CALBAE1PrQYyOG96f6ECeW1vPF3TOh1h7MZZULiY95z9xjbRuyA@mail.gmail.com/
> >
>
> I actually wondered after I had sent the email whether we had indeed an
> option to disable the cache alignment or not! Thanks for pointing out
> that we do. This brings a couple of additional thoughts:
>
> * Using indexes for the cache should probably be a runtime flag rather
>   than a build-time one.
> * It would seem reasonable to me to disallow use of the indexed-cache
>   flag and the non-cache-aligned flag simultaneously.
> * On the off chance that that restriction is unacceptable, then we can
>   make things a little more complicated by doing a runtime computation
>   of the "index-shiftwidth" to use.
>
> Overall, I think defaulting to cacheline shiftwidth and disallowing
> index-based addressing when using unaligned buffers is simplest and
> easiest, unless we can come up with a valid usecase for needing more
> than that.
>
> /Bruce

This feature is a performance optimization.

With that in mind, it should not introduce function pointers or similar
run-time checks in the fast path to determine what kind of cache to use
per mempool. And if an index multiplier is implemented, it should be a
compile time constant, probably something between sizeof(uintptr_t) and
RTE_MEMPOOL_ALIGN (=RTE_CACHE_LINE_SIZE).

The patch comes with a tradeoff between better performance and limited
mempool size, and possibly some limitations regarding very small objects
that are not cache line aligned in order to avoid wasting memory
(RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ).

With no multiplier, the only tradeoff is that the mempool size is
limited to 4 GB.

If the multiplier is small (e.g. 8 bytes), the only tradeoff is that the
mempool size is limited to 32 GB. (And a waste of memory for objects
smaller than 8 bytes; but I don't think anyone would use a mempool to
hold objects smaller than 8 bytes.)

If the multiplier is larger (e.g. the 64 byte cache line size), the
mempool size is instead limited to 256 GB, but
RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ has no effect.

Note: 32 bit platforms get no benefit from this patch: the pointer
already only uses 4 bytes, so replacing the pointer with a 4 byte index
makes no difference.

Since this feature is a performance optimization only, and doesn't
provide any new features, I don't mind it being a compile time option.

If this feature is a compile time option, and the mempool library is
compiled with the large multiplier, then RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ
could be made undefined in the public header file, so compilation of
applications using the flag will fail. And rte_mempool_create() could
RTE_ASSERT() that RTE_MEMPOOL_POPULATE_F_ALIGN_OBJ is not set in its
flags parameter, or emit a warning about the flag being ignored.
Obviously, rte_mempool_create() should also RTE_ASSERT() that the
mempool is not larger than the library supports, possibly emitting a
message that the mempool library should be built without this feature
to support the larger mempool.
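To make the arithmetic concrete, here is a minimal sketch of the
index <-> pointer conversion with a compile time shift. The names
(MEMPOOL_OBJ_INDEX_SHIFT, obj_to_index(), index_to_obj(),
pool_base_addr) are made up for illustration only; they are not from
the patch or from the mempool API:

    /*
     * Illustration only - hypothetical names, not from the patch.
     * Compile time shift: 0 -> no multiplier (4 GB max),
     * 3 -> 8 B multiplier (sizeof(uintptr_t)),
     * 6 -> 64 B multiplier (RTE_CACHE_LINE_SIZE).
     */
    #include <stdint.h>

    #define MEMPOOL_OBJ_INDEX_SHIFT 3

    /* Store a 4 byte index in the per core cache instead of an 8 byte pointer. */
    static inline uint32_t
    obj_to_index(uintptr_t pool_base_addr, void *obj)
    {
            return (uint32_t)(((uintptr_t)obj - pool_base_addr) >>
                            MEMPOOL_OBJ_INDEX_SHIFT);
    }

    /* Convert the cached index back to an object pointer on dequeue. */
    static inline void *
    index_to_obj(uintptr_t pool_base_addr, uint32_t index)
    {
            return (void *)(pool_base_addr +
                            ((uintptr_t)index << MEMPOOL_OBJ_INDEX_SHIFT));
    }

With a 32 bit index, the addressable pool size is 2^32 bytes shifted
left by that constant: 4 GB with shift 0, 32 GB with shift 3, and
256 GB with shift 6, matching the limits above.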
Here is another thought: If only exotic applications use mempools larger
than 32 GB, this would be a generally acceptable limit, and DPDK should
use the index-based cache as the default, making the opposite (i.e. the
pointer-based cache) a compile time option instead. A similar decision
was recently made for limiting the RTE_MAX_LCORE default.

Although DPDK is moving away from compile time options in order to
better support Linux distros, there should be a general exception for
performance and memory optimizations. Otherwise, network appliance
vendors will inherit the increasing amount of DPDK bloat, and we
(network appliance vendors) will eventually be forced to fork DPDK to
get rid of the bloat and achieve the goals originally intended by DPDK.
If anyone disagrees with the principle of a general exception for
performance and memory optimizations, I would like to pass the decision
on to the Techboard!