DPDK patches and discussions
From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Bruce Richardson" <bruce.richardson@intel.com>
Cc: <olivier.matz@6wind.com>, <andrew.rybchenko@oktetlabs.ru>,
	<dev@dpdk.org>
Subject: RE: [PATCH] mempool: optimize get objects with constant n
Date: Tue, 18 Apr 2023 13:29:49 +0200	[thread overview]
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D8788D@smartserver.smartshare.dk> (raw)
In-Reply-To: <ZD55wqJLvwt1Bq7r@bricha3-MOBL.ger.corp.intel.com>

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 18 April 2023 13.07
> 
> On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > When getting objects from the mempool, the number of objects to get is
> > often constant at build time.
> >
> > This patch adds another code path for this case, so the compiler can
> > optimize more, e.g. unroll the copy loop when the entire request is
> > satisfied from the cache.
> >
> > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > mempool_perf_test with constant n shows an increase in rate_persec by an
> > average of 17 %, minimum 9.5 %, maximum 24 %.
> >
> > The code path where the number of objects to get is unknown at build time
> > remains essentially unchanged.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> 
> The change looks like a good idea. Some suggestions inline below, which you may want to
> take on board for any future version. I'd strongly suggest adding some
> extra clarifying code comments, as I suggest below.
> With those extra code comments:
> 
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> > ---
> >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 9f530db24b..ade0100ec7 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp,
> void **obj_table,
> >  	if (unlikely(cache == NULL))
> >  		goto driver_dequeue;
> >
> > -	/* Use the cache as much as we have to return hot objects first */
> > -	len = RTE_MIN(remaining, cache->len);
> >  	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > +		/*
> > +		 * The request size is known at build time, and
> > +		 * the entire request can be satisfied from the cache,
> > +		 * so let the compiler unroll the fixed length copy loop.
> > +		 */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
> 
> This loop looks a little awkward to me. Would it be clearer (and perhaps
> easier for compilers to unroll efficiently) if it were rewritten as:
> 
> 	cache->len -= n;
> 	cache_objs = &cache->objs[cache->len];
> 	for (index = 0; index < n; index++)
> 		obj_table[index] = cache_objs[index];

The mempool cache is a stack, so the copy loop needs to get the objects in decrementing order, i.e. the source index decrements while the destination index increments.

Regardless, your point here is still valid! I expected that any unrolling-capable compiler can unroll *dst++ = *--src; but I can experiment with different compilers on godbolt.org to see if dst[index] = src[-index] is better.

A future version could be hand coded with vector instructions here, like in the Ethdev drivers.

> 
> alternatively those last two lines can be replaced by a memcpy, which the
> compiler should nicely optimize itself, for constant size copy:
> 
> 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);

Just for reference: It was previously discussed to ignore the stack ordering and simply copy the objects, but the idea was rejected due to the potential performance impact of not returning the hottest objects, i.e. the objects at the top of the stack, as the first in the returned array.

> 
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > +
> > +		return 0;
> > +	}
> > +
> > +	/* Use the cache as much as we have to return hot objects first */
> > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > +			RTE_MIN(remaining, cache->len);
> 
> This line confused me a lot, until I applied the patch and stared at it a
> lot :-). Why would the length value depend upon whether or not n is a
> compile-time constant?
> I think it would really help here to add in a comment stating that when n
> is a compile-time constant, at this point it must be "> cache->len" since
> the previous block was not taken, therefore we don't need to call RTE_MIN.

I agree. This is almost like perl... write-only source code.

I will add a few explanatory comments.

> 
> >  	cache->len -= len;
> >  	remaining -= len;
> >  	for (index = 0; index < len; index++)
> >  		*obj_table++ = *--cache_objs;
> >
> > -	if (remaining == 0) {
> > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> >  		/* The entire request is satisfied from the cache. */
> >
> >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> 
> Once I'd worked out the logic for the above conditional check, then this
> conditional adjustment was clear. I just wonder if a further comment might
> help here.
> 
> I am also wondering if having larger blocks for the constant and
> non-constant cases may help. It would lead to some duplication but may
> clarify the code.

I get your point, but the code is still short enough to grasp (once the comments have been added), so I prefer to avoid code duplication.

> 
> > --
> > 2.17.1
> >


Thread overview: 19+ messages
2023-04-11  6:48 Morten Brørup
2023-04-18 11:06 ` Bruce Richardson
2023-04-18 11:29   ` Morten Brørup [this message]
2023-04-18 12:54     ` Bruce Richardson
2023-04-18 12:55     ` Bruce Richardson
2023-06-07  7:51       ` Thomas Monjalon
2023-06-07  8:03         ` Morten Brørup
2023-06-07  8:10           ` Thomas Monjalon
2023-06-07  8:33             ` Morten Brørup
2023-06-07  8:41             ` Morten Brørup
2023-04-18 15:15   ` Tyler Retzlaff
2023-04-18 15:30     ` Morten Brørup
2023-04-18 15:44       ` Tyler Retzlaff
2023-04-18 15:50         ` Morten Brørup
2023-04-18 16:01           ` Tyler Retzlaff
2023-04-18 16:05   ` Morten Brørup
2023-04-18 19:51 ` [PATCH v3] " Morten Brørup
2023-04-18 20:09 ` [PATCH v4] " Morten Brørup
2023-06-07  9:12   ` Thomas Monjalon
