DPDK patches and discussions
From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Bruce Richardson" <bruce.richardson@intel.com>,
	"Andrew Rybchenko" <andrew.rybchenko@oktetlabs.ru>
Cc: <dev@dpdk.org>
Subject: RE: [RFC] mempool: CPU cache aligning mempool driver accesses
Date: Thu, 9 Nov 2023 11:45:46 +0100	[thread overview]
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F003@smartserver.smartshare.dk> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9EFD6@smartserver.smartshare.dk>

+TO: Andrew, mempool maintainer

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Monday, 6 November 2023 11.29
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Monday, 6 November 2023 10.45
> >
> > On Sat, Nov 04, 2023 at 06:29:40PM +0100, Morten Brørup wrote:
> > > I tried a little experiment, which gave a 25 % improvement in
> > > mempool perf tests for long bursts (n_get_bulk=32 n_put_bulk=32
> > > n_keep=512 constant_n=0) on a Xeon E5-2620 v4 based system.
> > >
> > > This is the concept:
> > >
> > > If all accesses to the mempool driver go through the mempool cache,
> > > we can ensure that these bulk loads/stores are always CPU cache
> > > aligned, by using cache->size when loading/storing to the mempool
> > > driver.
> > >
> > > Furthermore, it is rumored that most applications use the default
> > > mempool cache size, so if the driver tests for that specific value,
> > > it can use rte_memcpy(src,dst,N) with N known at build time,
> > > allowing optimal performance for copying the array of objects.
> > >
> > > Unfortunately, I need to change the flush threshold from 1.5 to 2
> > > to be able to always use cache->size when loading/storing to the
> > > mempool driver.
> > >
> > > What do you think?

It is the concept of accessing the underlying mempool in entire cache lines that I am seeking feedback on.

The provided code is just an example, mainly for testing performance of the concept.
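To make the concept concrete, here is a minimal sketch. It uses a simplified stand-in for struct rte_mempool_cache (the struct layout, field names and the refill_aligned() helper below are illustrative, not the actual DPDK code), and elides the backend dequeue. The point is that if the object array is cache-line aligned and the cache length is kept a multiple of the pointers-per-cache-line, a transfer of exactly cache->size objects always starts on a cache-line boundary:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64
#define PTRS_PER_LINE (CACHE_LINE / sizeof(void *)) /* 8 pointers on 64-bit */

/* Simplified stand-in for struct rte_mempool_cache; the real layout
 * and the driver dequeue call are elided. */
struct cache {
	unsigned int size; /* configured cache size, a multiple of PTRS_PER_LINE */
	unsigned int len;  /* current number of cached objects */
	void *objs[1024] __attribute__((aligned(CACHE_LINE)));
};

/* Refill by always dequeuing exactly cache->size objects from the
 * mempool driver. Because objs[] is cache-line aligned and len stays
 * a multiple of PTRS_PER_LINE, the store destination covers whole
 * CPU cache lines. */
static void refill_aligned(struct cache *c)
{
	void **dst = &c->objs[c->len];
	/* rte_mempool_ops_dequeue_bulk(mp, dst, c->size) would go here */
	(void)dst;
	c->len += c->size;
}
```

The same reasoning applies to draining: storing cache->size objects from an aligned offset writes whole cache lines to the driver.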

> > >
> > > PS: If we can't get rid of the mempool cache size threshold factor,
> > > we really need to expose it through public APIs. A job for another
> > > day.

The concept that a mempool per-lcore cache can hold more objects than its size is extremely weird, and certainly unexpected by any normal developer. It is therefore likely to cause runtime errors in applications that size their mempools tightly.

So, if we move forward with this RFC, I propose eliminating the threshold factor, so the mempool per-lcore caches cannot hold more objects than their size.
When doing this, we might also choose to double RTE_MEMPOOL_CACHE_MAX_SIZE, to prevent any performance degradation.
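For reference, the arithmetic involved (the calc_flushthresh() helper and its integer encoding of the factor are hypothetical; DPDK computes its threshold from a multiplier constant):

```c
#include <assert.h>

/* The flush threshold is currently size * 1.5; this RFC needs
 * size * 2 so that a drain or refill can always move exactly
 * cache->size objects. factor_x2 encodes the multiplier times
 * two: 3 -> 1.5x, 4 -> 2x (exact when size is even). */
static unsigned int calc_flushthresh(unsigned int size, unsigned int factor_x2)
{
	return size * factor_x2 / 2;
}
```

With the default cache size of 512 objects, the threshold goes from 768 to 1024 objects, which is why doubling RTE_MEMPOOL_CACHE_MAX_SIZE may be needed to keep the same effective headroom.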

> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > Interesting, thanks.
> >
> > Out of interest, is there any difference in performance you observe
> > if using regular libc memcpy vs rte_memcpy for the ring copies? Since
> > the copy amount is constant, a regular memcpy call should be expanded
> > by the compiler itself, and so should be pretty efficient.
> 
> I ran some tests without patching rte_ring_elem_pvt.h, i.e. without
> introducing the constant-size copy loop. I got the majority of the
> performance gain at this point.
> 
> At this point, both pointers are CPU cache aligned when refilling the
> mempool cache, and the destination pointer is CPU cache aligned when
> draining the mempool cache.
> 
> In other words: When refilling the mempool cache, it is both loading
> and storing entire CPU cache lines. And when draining, it is storing
> entire CPU cache lines.
> 
> 
> Adding the fixed-size copy loop provided an additional performance
> gain. I didn't test other constant-size copy methods than rte_memcpy.
> 
> rte_memcpy should have optimal conditions in this patch, because N is
> known to be 512 * 8 = 4 KiB at build time. Furthermore, both pointers
> are CPU cache aligned when refilling the mempool cache, and the
> destination pointer is CPU cache aligned when draining the mempool
> cache. I don't recall if pointer alignment matters for rte_memcpy,
> though.
> 
> The memcpy in libc (or more correctly: intrinsic to the compiler) will
> do non-temporal copying for large sizes, and I don't know what that
> threshold is, so I think rte_memcpy is the safe bet here. Especially if
> someone builds DPDK with a larger mempool cache size than 512 objects.
> 
> On the other hand, non-temporal access to the objects in the ring might
> be beneficial if the ring is so large that they go cold before the
> application loads them from the ring again.


Thread overview: 4+ messages
2023-11-04 17:29 Morten Brørup
2023-11-06  9:45 ` Bruce Richardson
2023-11-06 10:29   ` Morten Brørup
2023-11-09 10:45     ` Morten Brørup [this message]
