DPDK patches and discussions
From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Jerin Jacob" <jerinjacobk@gmail.com>,
	"Bruce Richardson" <bruce.richardson@intel.com>
Cc: "Olivier Matz" <olivier.matz@6wind.com>,
	"Andrew Rybchenko" <andrew.rybchenko@oktetlabs.ru>,
	"dpdk-dev" <dev@dpdk.org>
Subject: RE: [PATCH] mempool: optimize incomplete cache handling
Date: Mon, 10 Jan 2022 11:55:37 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86DF2@smartserver.smartshare.dk>
In-Reply-To: <CALBAE1OjCswxUfaNLWg5y-tnPkFhvvKQ8sJ3JpBoo7ObgeB5OA@mail.gmail.com>

+Bruce; you seemed interested in my work in this area.

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Monday, 10 January 2022 08.27
> 
> On Fri, Jan 7, 2022 at 2:16 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > Sent: Thursday, 6 January 2022 17.55
> > >
> > > On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup
> <mb@smartsharesystems.com>
> > > wrote:
> > > >
> > > > A flush threshold for the mempool cache was introduced in DPDK
> > > version
> > > > 1.3, but rte_mempool_do_generic_get() was not completely updated
> back
> > > > then.
> > > >
> > > > The incompleteness did not cause any functional bugs, so this
> patch
> > > > could be considered refactoring for the purpose of cleaning up.
> > > >
> > > > This patch completes the update of rte_mempool_do_generic_get()
> as
> > > > follows:
> > > >
> > > > 1. A few comments were misplaced or no longer correct.
> > > > Some comments have been updated/added/corrected.
> > > >
> > > > 2. The code that initially screens the cache request was not
> updated.
> > > > The initial screening compared the request length to the cache
> size,
> > > > which was correct before, but became irrelevant with the
> introduction
> > > of
> > > > the flush threshold. E.g. the cache can hold up to flushthresh
> > > objects,
> > > > which is more than its size, so some requests were not served
> from
> > > the
> > > > cache, even though they could be.
> > > > The initial screening has now been corrected to match the initial
> > > > screening in rte_mempool_do_generic_put(), which verifies that a
> > > cache
> > > > is present, and that the length of the request does not overflow
> the
> > > > memory allocated for the cache.
> > > >
> > > > 3. The code flow for satisfying the request from the cache was
> weird.
> > > > The likely code path where the objects are simply served from the
> > > cache
> > > > was treated as unlikely; now it is treated as likely.
> > > > And in the code path where the cache was backfilled first,
> numbers
> > > were
> > > > added and subtracted from the cache length; now this code path
> simply
> > > > sets the cache length to its final value.
> > > >
> > > > 4. The objects were returned in reverse order.
> > > > Returning the objects in reverse order is not necessary, so
> > > rte_memcpy()
> > > > is now used instead.
> > >
> > > Have you checked the performance with network workload?
> > > IMO, reverse order makes sense (LIFO vs. FIFO).
> > > The LIFO makes the cache warm as the same buffers are reused
> > > frequently.
> >
> > I have not done any performance testing. We probably agree that the
> only major difference lies in how the objects are returned. And we
> probably also agree that rte_memcpy() is faster than the copy loop it
> replaced, especially when n is constant at compile time. So the
> performance difference mainly depends on the application, which I will
> discuss below.
> >
> > Let's first consider LIFO vs. FIFO.
> >
> > The key argument for the rte_memcpy() optimization is that we are
> still getting the burst of objects from the top of the stack (LIFO);
> only the order of the objects inside the burst is not reverse anymore.
> >
> > Here is an example:
> >
> > The cache initially contains 8 objects: 01234567.
> >
> > 8 more objects are put into the cache: 89ABCDEF.
> >
> > The cache now holds: 0123456789ABCDEF.
> 
> Agree. However, I think it may matter on machines with smaller L1
> caches and with larger burst sizes, where the scheme determines what
> can stay in L1.

Good point! Thinking further about it made me realize that the mempool cache flushing algorithm is fundamentally flawed, at least in some cases...


rte_mempool_do_generic_put():

When putting objects into the cache, and the cache length exceeds the flush threshold, the most recent (hot) objects are flushed to the ring, thus leaving the less recent (colder) objects at the top of the cache stack.

Example (cache size: 8, flush threshold: 12, put 8 objects):

Initial cache: 01234567

Cache after putting (hot) objects 89ABCDEF: 0123456789ABCDEF

Cache flush threshold reached. Resulting cache: 01234567

Furthermore, the cache has to be completely depleted before the hot objects that were flushed to the ring are retrieved from the ring again.
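
In code, the put path looks roughly like this (paraphrased from memory of rte_mempool_do_generic_put(), so please check the actual source before relying on the details):

/* New (hot) objects go on top of the cache stack. */
void **cache_objs = &cache->objs[cache->len];
rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
cache->len += n;

if (cache->len >= cache->flushthresh) {
        /* Flush everything above cache->size, i.e. the most recently
         * added (hot) objects, back to the ring. */
        rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
                        cache->len - cache->size);
        cache->len = cache->size;
}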


rte_mempool_do_generic_get():

When getting objects from the cache, and the cache does not hold the requested number of objects, the cache will first be backfilled from the ring, thus putting colder objects at the top of the cache stack, and then the objects will be returned from the top of the cache stack, i.e. the backfilled (cold) objects will be returned first.

Example (cache size: 8, get 8 objects):

Initial cache: 0123 (hot or lukewarm)

Cache after backfill to size + requested objects: 0123456789ABCDEF

Returned objects: FEDCBA98 (cold)

Cache after returning objects: 01234567 (i.e. cold objects at the top)
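
Again in code (paraphrased from memory of rte_mempool_do_generic_get(), so the details may be slightly off), the backfill lands on top of whatever is already cached, and the copy loop then serves the request from the top:

if (unlikely(cache == NULL || n >= cache->size))
        goto ring_dequeue; /* initial screening: bypass the cache */

void **cache_objs = cache->objs;

if (cache->len < n) {
        /* Backfill from the ring on top of the existing (cached) objects. */
        uint32_t req = n + (cache->size - cache->len);

        if (unlikely(rte_mempool_ops_dequeue_bulk(mp,
                        &cache->objs[cache->len], req) < 0))
                goto ring_dequeue; /* fall back to the ring directly */

        cache->len += req;
}

/* Serve the request from the top of the stack, i.e. the cold, just
 * backfilled objects first, and in reverse order. */
for (uint32_t index = 0, len = cache->len - 1; index < n; index++, len--)
        *obj_table++ = cache_objs[len];

cache->len -= n;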


> 
> I would suggest splitting each performance improvement into a separate
> patch, for better tracking and to quantify each improvement.

With the new realizations above, I should reconsider my patch from scratch.

I have also been wondering if the mempool cache size really needs to be configurable, or if it could simply be a fixed size.

Bruce mentioned in another thread (http://inbox.dpdk.org/dev/Ydv%2FIMz8eIRPSguY@bricha3-MOBL.ger.corp.intel.com/T/#m02cabb25655c08a0980888df8c41aba9ac8dd6ff) that the typical configuration of the cache size is RTE_MEMPOOL_CACHE_MAX_SIZE.

I would dare to say that it probably suffices to configure whether the mempool has a cache or not! The cache_size parameter of rte_mempool_create() is not respected 1:1 anyway (because each per-lcore cache may consume up to 1.5 x cache_size objects from the mempool backing store), so the cache_size parameter could simply be parsed as zero vs. non-zero to determine whether a cache is wanted.
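
For reference, this is roughly how the flush threshold is derived today (paraphrased from memory of lib/mempool/rte_mempool.c; the multiplier is what gives the 1.5 x cache_size behavior mentioned above):

#define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
#define CALC_CACHE_FLUSHTHRESH(c) ((typeof(c))((c) * CACHE_FLUSHTHRESH_MULTIPLIER))

static void
mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
{
        cache->size = size;
        cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
        cache->len = 0;
}

With a fixed cache size, cache->size and cache->flushthresh could even become compile-time constants.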

> 
> I think the mempool performance test and the tx-only stream mode in
> testpmd can quantify the patches.
> 
> 
> 
> >
> > Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e.
> we are still getting the 4 objects most recently put into the cache.
> >
> > Furthermore, if the application is working with fixed size bursts, it
> will usually put and get the same size burst, i.e. put the burst
> 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache
> again.
> >
> >
> > Here is an example unfavorable scenario:
> >
> > The cache initially contains 4 objects, which have gone cold: 0123.
> >
> > 4 more objects, which happen to be hot, are put into the cache: 4567.
> >
> > Getting 8 objects from the cache gives us 01234567 instead of
> 76543210.
> >
> > Now, if the application only processes the first 4 of the 8 objects
> in the burst, it would have benefitted from those objects being the hot
> 7654 objects instead of the cold 0123 objects.
> >
> > However, I think that most applications process the complete burst,
> so I do consider this scenario unlikely.
> >
> > Similarly, a pipelined application doesn't process objects in reverse
> order at every other step in the pipeline, even though the previous
> step in the pipeline most recently touched the last object of the
> burst.
> >
> >
> > My overall conclusion was that the benefit of using rte_memcpy()
> outweighs the disadvantage of the unfavorable scenario, because I
> consider the probability of the unfavorable scenario occurring very
> low. But again: it mainly depends on the application.
> >
> > If anyone disagrees with the risk analysis described above, I will
> happily provide a version 2 of the patch, where the objects are still
> returned in reverse order. After all, the rte_memcpy() benefit is
> relatively small compared to the impact if the unlikely scenario
> occurs.
> >
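
PS: For completeness regarding point 4 in the patch description: the reverse-order copy loop shown in my get sketch above would be replaced by a single copy along these lines (again paraphrased, not the exact patch):

rte_memcpy(obj_table, &cache_objs[cache->len - n], sizeof(void *) * n);
cache->len -= n;

The burst still comes from the top of the cache stack; only the order of the objects within the burst changes.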


