Subject: RE: [RFC] mempool: CPU cache aligning mempool driver accesses
Date: Thu, 9 Nov 2023 11:45:46 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F003@smartserver.smartshare.dk>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9EFD6@smartserver.smartshare.dk>
From: Morten Brørup <mb@smartsharesystems.com>
To: Bruce Richardson, Andrew Rybchenko
Cc: dev@dpdk.org

+TO: Andrew, mempool maintainer

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Monday, 6 November 2023 11.29
>
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Monday, 6 November 2023 10.45
> >
> > On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> > > I tried a little experiment, which gave a 25 % improvement in
> > > mempool perf tests for long bursts (n_get_bulk=32 n_put_bulk=32
> > > n_keep=512 constant_n=0) on a Xeon E5-2620 v4 based system.
> > >
> > > This is the concept:
> > >
> > > If all accesses to the mempool driver go through the mempool cache,
> > > we can ensure that these bulk loads/stores are always CPU cache
> > > aligned, by using cache->size when loading from/storing to the
> > > mempool driver.
> > >
> > > Furthermore, it is rumored that most applications use the default
> > > mempool cache size, so if the driver tests for that specific value,
> > > it can use rte_memcpy(src, dst, N) with N known at build time,
> > > allowing optimal performance for copying the array of objects.
> > >
> > > Unfortunately, I need to change the flush threshold factor from 1.5
> > > to 2 to be able to always use cache->size when loading from/storing
> > > to the mempool driver.
> > >
> > > What do you think?

It is the concept of accessing the underlying mempool driver in entire
CPU cache lines that I am seeking feedback on. The provided code is just
an example, mainly for testing the performance of the concept. A rough
sketch of the refill side follows below the quoted PS.

> > > PS: If we can't get rid of the mempool cache size threshold factor,
> > > we really need to expose it through public APIs. A job for another
> > > day.
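
Here is the rough sketch of the refill side mentioned above. It is not
the actual patch: the helper name is invented, it refills only a
completely empty cache for simplicity, and it assumes that cache->objs[]
is (or is made) CPU cache aligned and that cache->size is a multiple of
the number of object pointers per cache line.

#include <rte_mempool.h>

/*
 * Refill an empty per-lcore mempool cache from the mempool driver in
 * units of cache->size objects. Under the alignment assumptions above,
 * the driver access then always covers whole CPU cache lines.
 */
static __rte_always_inline int
mempool_cache_refill(struct rte_mempool *mp, struct rte_mempool_cache *cache)
{
        int ret;

        /* Request exactly cache->size objects from the driver, so the
         * copy covers whole CPU cache lines on both ends. */
        ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size);
        if (ret < 0)
                return ret;

        cache->len = cache->size;
        return 0;
}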
The concept of a mempool per-lcore cache holding more objects than its
size is counterintuitive, and certainly unexpected by most developers.
It is therefore likely to cause runtime errors for applications that
size their mempools tightly. So, if we move forward with this RFC, I
propose eliminating the threshold factor, so the mempool per-lcore
caches cannot hold more objects than their size. When doing this, we
might also choose to double RTE_MEMPOOL_CACHE_MAX_SIZE, to prevent any
performance degradation.

> > > Signed-off-by: Morten Brørup
> > > ---
> >
> > Interesting, thanks.
> >
> > Out of interest, is there any difference in performance you observe
> > if using regular libc memcpy vs rte_memcpy for the ring copies? Since
> > the copy amount is constant, a regular memcpy call should be expanded
> > by the compiler itself, and so should be pretty efficient.
>
> I ran some tests without patching rte_ring_elem_pvt.h, i.e. without
> introducing the constant-size copy loop, and got the majority of the
> performance gain already at that point.
>
> At that point, both pointers are CPU cache aligned when refilling the
> mempool cache, and the destination pointer is CPU cache aligned when
> draining the mempool cache.
>
> In other words: when refilling the mempool cache, it both loads and
> stores entire CPU cache lines; and when draining, it stores entire CPU
> cache lines.
>
> Adding the fixed-size copy loop provided an additional performance
> gain. I didn't test other constant-size copy methods than rte_memcpy.
>
> rte_memcpy should have optimal conditions in this patch, because N is
> known to be 512 * 8 = 4 KiB at build time. Furthermore, both pointers
> are CPU cache aligned when refilling the mempool cache, and the
> destination pointer is CPU cache aligned when draining the mempool
> cache. I don't recall whether pointer alignment matters for
> rte_memcpy, though.
>
> The memcpy in libc (or, more correctly, the compiler intrinsic) will
> do non-temporal copying for large sizes, and I don't know what that
> threshold is, so I think rte_memcpy is the safer bet here, especially
> if someone builds DPDK with a mempool cache size larger than 512
> objects.
>
> On the other hand, non-temporal access to the objects in the ring
> might be beneficial if the ring is so large that they go cold before
> the application loads them from the ring again.
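
To make the constant-size copy concrete, the fast path I have in mind
looks roughly like this. It is only a sketch, not the actual
rte_ring_elem_pvt.h change: the helper name is invented, and it assumes
a mempool cache of RTE_MEMPOOL_CACHE_MAX_SIZE (512 in the default
config) object pointers, i.e. 4 KiB on 64-bit.

#include <rte_common.h>
#include <rte_config.h>
#include <rte_memcpy.h>

/*
 * Copy n object pointers. When n matches the maximum mempool cache
 * size, the byte count is a build-time constant, so rte_memcpy() can
 * expand to its fixed-size fast path; otherwise fall back to a generic
 * element-by-element copy loop.
 */
static __rte_always_inline void
copy_obj_ptrs(void **dst, void * const *src, unsigned int n)
{
        if (n == RTE_MEMPOOL_CACHE_MAX_SIZE) {
                /* 512 * sizeof(void *) = 4 KiB, known at build time. */
                rte_memcpy(dst, src,
                        RTE_MEMPOOL_CACHE_MAX_SIZE * sizeof(void *));
        } else {
                unsigned int i;

                for (i = 0; i < n; i++)
                        dst[i] = src[i];
        }
}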