Subject: RE: [RFC] mempool: CPU cache aligning mempool driver accesses
Date: Mon, 6 Nov 2023 11:29:24 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9EFD6@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35E9EFD4@smartserver.smartshare.dk>
From: Morten Brørup
To: Bruce Richardson
Cc: dev@dpdk.org

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 6 November 2023 10.45
>
> On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> > I tried a little experiment, which gave a 25 % improvement in mempool
> > perf tests for long bursts (n_get_bulk=32 n_put_bulk=32 n_keep=512
> > constant_n=0) on a Xeon E5-2620 v4 based system.
> >
> > This is the concept:
> >
> > If all accesses to the mempool driver go through the mempool cache,
> > we can ensure that these bulk loads/stores are always CPU cache
> > aligned, by using cache->size when loading/storing to the mempool
> > driver.
> >
> > Furthermore, it is rumored that most applications use the default
> > mempool cache size, so if the driver tests for that specific value,
> > it can use rte_memcpy(dst, src, N) with N known at build time,
> > allowing optimal performance for copying the array of objects.
> >
> > Unfortunately, I need to change the flush threshold from 1.5 to 2 to
> > be able to always use cache->size when loading/storing to the mempool
> > driver.
> >
> > What do you think?
> >
> > PS: If we can't get rid of the mempool cache size threshold factor,
> > we really need to expose it through public APIs. A job for another day.
> >
> > Signed-off-by: Morten Brørup
> > ---
>
> Interesting, thanks.
>
> Out of interest, is there any difference in performance you observe if
> using regular libc memcpy vs rte_memcpy for the ring copies? Since the
> copy amount is constant, a regular memcpy call should be expanded by the
> compiler itself, and so should be pretty efficient.

I ran some tests without patching rte_ring_elem_pvt.h, i.e. without
introducing the constant-size copy loop. I got the majority of the
performance gain at this point.
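To make the concept more concrete, here is a rough sketch of the refill
side (not the actual patch: the function name and the simplified
single-object get path are made up for illustration, and it assumes
cache->size is a multiple of the number of pointers per CPU cache line):

#include <rte_branch_prediction.h>
#include <rte_common.h>
#include <rte_mempool.h>

/*
 * Refill with exactly cache->size objects instead of a variable count.
 * The store side writes whole cache lines, because it starts at
 * cache->objs[0] and the objs array is CPU cache aligned; and because
 * every dequeue from the backend is a whole multiple of cache->size,
 * the ring's consumer index also stays at a cache-aligned slot, so the
 * load side reads whole cache lines too.
 */
static __rte_always_inline void *
mempool_cache_get_sketch(struct rte_mempool *mp,
		struct rte_mempool_cache *cache)
{
	if (unlikely(cache->len == 0)) {
		if (rte_mempool_ops_dequeue_bulk(mp, cache->objs,
				cache->size) < 0)
			return NULL; /* backend depleted */
		cache->len = cache->size;
	}
	return cache->objs[--cache->len];
}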
At this point, both pointers are CPU cache aligned when refilling the
mempool cache, and the destination pointer is CPU cache aligned when
draining the mempool cache.

In other words: when refilling the mempool cache, it is both loading and
storing entire CPU cache lines; and when draining, it is storing entire
CPU cache lines.

Adding the fixed-size copy loop provided an additional performance gain.
I didn't test constant-size copy methods other than rte_memcpy.

rte_memcpy should have optimal conditions in this patch, because N is
known to be 512 * 8 = 4 KiB at build time. Furthermore, both pointers
are CPU cache aligned when refilling the mempool cache, and the
destination pointer is CPU cache aligned when draining the mempool
cache. I don't recall whether pointer alignment matters for rte_memcpy,
though.

The memcpy in libc (or, more correctly, the compiler intrinsic) will do
non-temporal copying for large sizes, and I don't know what that
threshold is, so I think rte_memcpy is the safe bet here, especially if
someone builds DPDK with a mempool cache size larger than 512 objects.

On the other hand, non-temporal access to the objects in the ring might
be beneficial if the ring is so large that the objects go cold before
the application loads them from the ring again.
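For illustration, the constant-size copy specialization mentioned above
could look roughly like this (a sketch only, not the actual
rte_ring_elem_pvt.h change; the helper name is made up, and keying on
RTE_MEMPOOL_CACHE_MAX_SIZE is just one way to obtain a build-time
constant for the 512-object case discussed above):

#include <rte_common.h>
#include <rte_config.h> /* RTE_MEMPOOL_CACHE_MAX_SIZE */
#include <rte_memcpy.h>

/*
 * Copy n object pointers. When n equals the compile-time constant, the
 * size passed to rte_memcpy is known at build time (512 * 8 = 4 KiB on
 * 64-bit), so the compiler can pick a fixed-size, unrolled copy path.
 */
static __rte_always_inline void
copy_obj_ptrs(void **dst, void * const *src, unsigned int n)
{
	if (n == RTE_MEMPOOL_CACHE_MAX_SIZE)
		rte_memcpy(dst, src,
				RTE_MEMPOOL_CACHE_MAX_SIZE * sizeof(void *));
	else
		rte_memcpy(dst, src, n * sizeof(void *)); /* variable-size fallback */
}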