Subject: RE: [RFC] mempool: CPU cache aligning mempool driver accesses
Date: Mon, 6 Nov 2023 11:29:24 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9EFD6@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35E9EFD4@smartserver.smartshare.dk>
From: Morten Brørup
To: Bruce Richardson
Cc: dev@dpdk.org

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 6 November 2023 10.45
>
> On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> > I tried a little experiment, which gave a 25 % improvement in mempool
> > perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
> > constant_n=0) on a Xeon E5-2620 v4 based system.
> >
> > This is the concept:
> >
> > If all accesses to the mempool driver go through the mempool cache,
> > we can ensure that these bulk loads/stores are always CPU cache
> > aligned, by using cache->size when loading/storing to the mempool
> > driver.
> >
> > Furthermore, it is rumored that most applications use the default
> > mempool cache size, so if the driver tests for that specific value,
> > it can use rte_memcpy(dst, src, N) with N known at build time,
> > allowing optimal performance for copying the array of objects.
> >
> > Unfortunately, I need to change the flush threshold from 1.5 to 2 to
> > be able to always use cache->size when loading/storing to the mempool
> > driver.
> >
> > What do you think?
> >
> > PS: If we can't get rid of the mempool cache size threshold factor,
> > we really need to expose it through public APIs. A job for another day.
> >
> > Signed-off-by: Morten Brørup
> > ---
>
> Interesting, thanks.
>
> Out of interest, is there any difference in performance you observe if
> using regular libc memcpy vs rte_memcpy for the ring copies? Since the
> copy amount is constant, a regular memcpy call should be expanded by the
> compiler itself, and so should be pretty efficient.

I ran some tests without patching rte_ring_elem_pvt.h, i.e. without
introducing the constant-size copy loop. I got the majority of the
performance gain at this point.
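To make the concept more concrete, here is a rough sketch of the refill
side (not the actual patch: the function name and the simplified
single-object get path are made up for illustration, and it assumes
cache->size is a multiple of the number of pointers per CPU cache line):

#include <rte_branch_prediction.h>
#include <rte_common.h>
#include <rte_mempool.h>

/*
 * Refill with exactly cache->size objects instead of a variable count.
 * The store side writes whole cache lines, because it starts at
 * cache->objs[0] and the objs array is CPU cache aligned; and because
 * every dequeue from the backend is a whole multiple of cache->size,
 * the ring's consumer index also stays at a cache-aligned slot, so the
 * load side reads whole cache lines too.
 */
static __rte_always_inline void *
mempool_cache_get_sketch(struct rte_mempool *mp,
		struct rte_mempool_cache *cache)
{
	if (unlikely(cache->len == 0)) {
		if (rte_mempool_ops_dequeue_bulk(mp, cache->objs,
				cache->size) < 0)
			return NULL; /* backend depleted */
		cache->len = cache->size;
	}
	return cache->objs[--cache->len];
}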
At this point, both pointers are CPU cache aligned when refilling the
mempool cache, and the destination pointer is CPU cache aligned when
draining the mempool cache.

In other words: when refilling the mempool cache, it is both loading and
storing entire CPU cache lines; and when draining, it is storing entire
CPU cache lines.

Adding the fixed-size copy loop provided an additional performance gain.
I didn't test constant-size copy methods other than rte_memcpy.

rte_memcpy should have optimal conditions in this patch, because N is
known to be 512 * 8 = 4 KiB at build time. Furthermore, both pointers
are CPU cache aligned when refilling the mempool cache, and the
destination pointer is CPU cache aligned when draining the mempool
cache. I don't recall whether pointer alignment matters for rte_memcpy,
though.

The memcpy in libc (or, more correctly, the compiler intrinsic) will do
non-temporal copying for large sizes, and I don't know what that
threshold is, so I think rte_memcpy is the safe bet here, especially if
someone builds DPDK with a mempool cache size larger than 512 objects.

On the other hand, non-temporal access to the objects in the ring might
be beneficial if the ring is so large that the objects go cold before
the application loads them from the ring again.
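For illustration, the constant-size copy specialization mentioned above
could look roughly like this (a sketch only, not the actual
rte_ring_elem_pvt.h change; the helper name is made up, and keying on
RTE_MEMPOOL_CACHE_MAX_SIZE is just one way to obtain a build-time
constant for the 512-object case discussed above):

#include <rte_common.h>
#include <rte_config.h> /* RTE_MEMPOOL_CACHE_MAX_SIZE */
#include <rte_memcpy.h>

/*
 * Copy n object pointers. When n equals the compile-time constant, the
 * size passed to rte_memcpy is known at build time (512 * 8 = 4 KiB on
 * 64-bit), so the compiler can pick a fixed-size, unrolled copy path.
 */
static __rte_always_inline void
copy_obj_ptrs(void **dst, void * const *src, unsigned int n)
{
	if (n == RTE_MEMPOOL_CACHE_MAX_SIZE)
		rte_memcpy(dst, src,
				RTE_MEMPOOL_CACHE_MAX_SIZE * sizeof(void *));
	else
		rte_memcpy(dst, src, n * sizeof(void *)); /* variable-size fallback */
}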