DPDK patches and discussions
* [PATCH] mempool: optimize get objects with constant n
@ 2023-04-11  6:48 Morten Brørup
  2023-04-18 11:06 ` Bruce Richardson
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Morten Brørup @ 2023-04-11  6:48 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko; +Cc: dev, Morten Brørup

When getting objects from the mempool, the number of objects to get is
often constant at build time.

This patch adds another code path for this case, so the compiler can
optimize more, e.g. unroll the copy loop when the entire request is
satisfied from the cache.
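
For example (a hypothetical caller, not part of this patch; 'mp' is assumed
to point to an existing mempool), a fixed-size request lets the compiler see
n as a build-time constant and take the new code path:

        void *objs[32];

        /* n = 32 is a build-time constant here, so the copy loop that
         * dequeues from the mempool cache can be fully unrolled. */
        if (rte_mempool_get_bulk(mp, objs, 32) == 0)
                rte_mempool_put_bulk(mp, objs, 32);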

On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
mempool_perf_test with constant n shows an increase in rate_persec by an
average of 17 %, minimum 9.5 %, maximum 24 %.

The code path where the number of objects to get is unknown at build time
remains essentially unchanged.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 9f530db24b..ade0100ec7 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	if (unlikely(cache == NULL))
 		goto driver_dequeue;
 
-	/* Use the cache as much as we have to return hot objects first */
-	len = RTE_MIN(remaining, cache->len);
 	cache_objs = &cache->objs[cache->len];
+
+	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
+		/*
+		 * The request size is known at build time, and
+		 * the entire request can be satisfied from the cache,
+		 * so let the compiler unroll the fixed length copy loop.
+		 */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
+
+		return 0;
+	}
+
+	/* Use the cache as much as we have to return hot objects first */
+	len = __extension__(__builtin_constant_p(n)) ? cache->len :
+			RTE_MIN(remaining, cache->len);
 	cache->len -= len;
 	remaining -= len;
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	if (remaining == 0) {
+	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
 		/* The entire request is satisfied from the cache. */
 
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
-- 
2.17.1



* Re: [PATCH] mempool: optimize get objects with constant n
  2023-04-11  6:48 [PATCH] mempool: optimize get objects with constant n Morten Brørup
@ 2023-04-18 11:06 ` Bruce Richardson
  2023-04-18 11:29   ` Morten Brørup
                     ` (2 more replies)
  2023-04-18 19:51 ` [PATCH v3] " Morten Brørup
  2023-04-18 20:09 ` [PATCH v4] " Morten Brørup
  2 siblings, 3 replies; 19+ messages in thread
From: Bruce Richardson @ 2023-04-18 11:06 UTC (permalink / raw)
  To: Morten Brørup; +Cc: olivier.matz, andrew.rybchenko, dev

On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> When getting objects from the mempool, the number of objects to get is
> often constant at build time.
> 
> This patch adds another code path for this case, so the compiler can
> optimize more, e.g. unroll the copy loop when the entire request is
> satisfied from the cache.
> 
> On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> mempool_perf_test with constant n shows an increase in rate_persec by an
> average of 17 %, minimum 9.5 %, maximum 24 %.
> 
> The code path where the number of objects to get is unknown at build time
> remains essentially unchanged.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>

Change looks like a good idea. Some suggestions inline below, which you may want to
take on board for any future version. I'd strongly suggest adding some
extra clarifying code comments, as I suggest below.
With those extra code comments:

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

> ---
>  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 9f530db24b..ade0100ec7 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	if (unlikely(cache == NULL))
>  		goto driver_dequeue;
>  
> -	/* Use the cache as much as we have to return hot objects first */
> -	len = RTE_MIN(remaining, cache->len);
>  	cache_objs = &cache->objs[cache->len];
> +
> +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> +		/*
> +		 * The request size is known at build time, and
> +		 * the entire request can be satisfied from the cache,
> +		 * so let the compiler unroll the fixed length copy loop.
> +		 */
> +		cache->len -= n;
> +		for (index = 0; index < n; index++)
> +			*obj_table++ = *--cache_objs;
> +

This loop looks a little awkward to me. Would it be clearer (and perhaps
easier for compilers to unroll efficiently) if it was rewritten as:

	cache->len -= n;
	cache_objs = &cache->objs[cache->len];
	for (index = 0; index < n; index++)
		obj_table[index] = cache_objs[index];

alternatively those last two lines can be replaced by a memcpy, which the
compiler should nicely optimize itself, for constant size copy:

	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);

> +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> +
> +		return 0;
> +	}
> +
> +	/* Use the cache as much as we have to return hot objects first */
> +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> +			RTE_MIN(remaining, cache->len);

This line confused me a lot, until I applied the patch and stared at it a
lot :-). Why would the length value depend upon whether or not n is a
compile-time constant?
I think it would really help here to add in a comment stating that when n
is a compile-time constant, at this point it must be "> cache->len" since
the previous block was untaken, therefore we don't need to call RTE_MIN.
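
Something like the following above that assignment would capture it (exact
wording just a suggestion):

        /*
         * When n is known at build time, the n <= cache->len case has
         * already returned above, so here n > cache->len and RTE_MIN()
         * is unnecessary.
         */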

>  	cache->len -= len;
>  	remaining -= len;
>  	for (index = 0; index < len; index++)
>  		*obj_table++ = *--cache_objs;
>  
> -	if (remaining == 0) {
> +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
>  		/* The entire request is satisfied from the cache. */
>  
>  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);

Once I'd worked out the logic for the above conditional check, then this
conditional adjustment was clear. I just wonder if a further comment might
help here.

I am also wondering if having larger blocks for the constant and
non-constant cases may help. It would lead to some duplication but may
clarify the code.

> -- 
> 2.17.1
> 


* RE: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 11:06 ` Bruce Richardson
@ 2023-04-18 11:29   ` Morten Brørup
  2023-04-18 12:54     ` Bruce Richardson
  2023-04-18 12:55     ` Bruce Richardson
  2023-04-18 15:15   ` Tyler Retzlaff
  2023-04-18 16:05   ` Morten Brørup
  2 siblings, 2 replies; 19+ messages in thread
From: Morten Brørup @ 2023-04-18 11:29 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: olivier.matz, andrew.rybchenko, dev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 18 April 2023 13.07
> 
> On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > When getting objects from the mempool, the number of objects to get is
> > often constant at build time.
> >
> > This patch adds another code path for this case, so the compiler can
> > optimize more, e.g. unroll the copy loop when the entire request is
> > satisfied from the cache.
> >
> > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > mempool_perf_test with constant n shows an increase in rate_persec by an
> > average of 17 %, minimum 9.5 %, maximum 24 %.
> >
> > The code path where the number of objects to get is unknown at build time
> > remains essentially unchanged.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Change looks like a good idea. Some suggestions inline below, which you may want to
> take on board for any future version. I'd strongly suggest adding some
> extra clarifying code comments, as I suggest below.
> With those extra code comments:
> 
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> > ---
> >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 9f530db24b..ade0100ec7 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp,
> void **obj_table,
> >  	if (unlikely(cache == NULL))
> >  		goto driver_dequeue;
> >
> > -	/* Use the cache as much as we have to return hot objects first */
> > -	len = RTE_MIN(remaining, cache->len);
> >  	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > +		/*
> > +		 * The request size is known at build time, and
> > +		 * the entire request can be satisfied from the cache,
> > +		 * so let the compiler unroll the fixed length copy loop.
> > +		 */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
> 
> This loop looks a little awkward to me. Would it be clearer (and perhaps
> easier for compilers to unroll efficiently) if it was rewritten as:
> 
> 	cache->len -= n;
> 	cache_objs = &cache->objs[cache->len];
> 	for (index = 0; index < n; index++)
> 		obj_table[index] = cache_objs[index];

The mempool cache is a stack, so the copy loop needs to get the objects in decrementing order. I.e. the source index decrements and the destination index increments.

Regardless, your point here is still valid! I expected that any unrolling capable compiler can unroll *dst++ = *--src; but I can experiment with different compilers on godbolt.org to see if dst[index] = src[-index] is better.

A future version could be hand coded with vector instructions here, like in the Ethdev drivers.
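
For reference, both loop forms under discussion preserve the stack order; in
the indexed sketch below, indexing is relative to the initial cache_objs
value (untested, just to illustrate the equivalence):

        /* Current form: pointer walk, source decrementing. */
        for (index = 0; index < n; index++)
                *obj_table++ = *--cache_objs;

        /* Indexed form, same order: the first object read is cache_objs[-1]. */
        for (index = 0; index < n; index++)
                obj_table[index] = cache_objs[-1 - index];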

> 
> alternatively those last two lines can be replaced by a memcpy, which the
> compiler should nicely optimize itself, for constant size copy:
> 
> 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);

Just for reference: It was previously discussed to ignore the stack ordering and simply copy the objects, but the idea was rejected due to the potential performance impact of not returning the hottest objects, i.e. the objects at the top of the stack, as the first in the returned array.
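
For illustration, the rejected variant would have looked roughly like this
(only a sketch; note that it returns the hottest object last instead of
first):

        /* Forward copy from the top of the stack, ignoring hotness order. */
        cache->len -= n;
        memcpy(obj_table, &cache->objs[cache->len], sizeof(obj_table[0]) * n);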

> 
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > +
> > +		return 0;
> > +	}
> > +
> > +	/* Use the cache as much as we have to return hot objects first */
> > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > +			RTE_MIN(remaining, cache->len);
> 
> This line confused me a lot, until I applied the patch and stared at it a
> lot :-). Why would the length value depend upon whether or not n is a
> compile-time constant?
> I think it would really help here to add in a comment stating that when n
> is a compile-time constant, at this point it must be "> cache->len" since
> the previous block was untaken, therefore we don't need to call RTE_MIN.

I agree. This is almost like perl... write-only source code.

I will add a few explanatory comments.

> 
> >  	cache->len -= len;
> >  	remaining -= len;
> >  	for (index = 0; index < len; index++)
> >  		*obj_table++ = *--cache_objs;
> >
> > -	if (remaining == 0) {
> > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> >  		/* The entire request is satisfied from the cache. */
> >
> >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> 
> Once I'd worked out the logic for the above conditional check, then this
> conditional adjustment was clear. I just wonder if a further comment might
> help here.
> 
> I am also wondering if having larger blocks for the constant and
> non-constant cases may help. It would lead to some duplication but may
> clarify the code.

I get your point, but the code is still short enough to grasp (after comments have been added). So I prefer to avoid code duplication.

> 
> > --
> > 2.17.1
> >


* Re: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 11:29   ` Morten Brørup
@ 2023-04-18 12:54     ` Bruce Richardson
  2023-04-18 12:55     ` Bruce Richardson
  1 sibling, 0 replies; 19+ messages in thread
From: Bruce Richardson @ 2023-04-18 12:54 UTC (permalink / raw)
  To: Morten Brørup; +Cc: olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 18 April 2023 13.07
> > 
> > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > When getting objects from the mempool, the number of objects to get is
> > > often constant at build time.
> > >
> > > This patch adds another code path for this case, so the compiler can
> > > optimize more, e.g. unroll the copy loop when the entire request is
> > > satisfied from the cache.
> > >
> > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > > mempool_perf_test with constant n shows an increase in rate_persec by an
> > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > >
> > > The code path where the number of objects to get is unknown at build time
> > > remains essentially unchanged.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > 
> > Change looks like a good idea. Some suggestions inline below, which you may want to
> > take on board for any future version. I'd strongly suggest adding some
> > extra clarifying code comments, as I suggest below.
> > With those extra code comments:
> > 
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > 
> > > ---
> > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 9f530db24b..ade0100ec7 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp,
> > void **obj_table,
> > >  	if (unlikely(cache == NULL))
> > >  		goto driver_dequeue;
> > >
> > > -	/* Use the cache as much as we have to return hot objects first */
> > > -	len = RTE_MIN(remaining, cache->len);
> > >  	cache_objs = &cache->objs[cache->len];
> > > +
> > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > +		/*
> > > +		 * The request size is known at build time, and
> > > +		 * the entire request can be satisfied from the cache,
> > > +		 * so let the compiler unroll the fixed length copy loop.
> > > +		 */
> > > +		cache->len -= n;
> > > +		for (index = 0; index < n; index++)
> > > +			*obj_table++ = *--cache_objs;
> > > +
> > 
> > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > easier for compilers to unroll efficiently) if it was rewritten as:
> > 
> > 	cache->len -= n;
> > 	cache_objs = &cache->objs[cache->len];
> > 	for (index = 0; index < n; index++)
> > 		obj_table[index] = cache_objs[index];
> 
> The mempool cache is a stack, so the copy loop needs to get the objects in decrementing order. I.e. the source index decrements and the destination index increments.
> 
> Regardless, your point here is still valid! I expected that any unrolling capable compiler can unroll *dst++ = *--src; but I can experiment with different compilers on godbolt.org to see if dst[index] = src[-index] is better.
> 
> A future version could be hand coded with vector instructions here, like in the Ethdev drivers.
> 
> > 
> > alternatively those last two lines can be replaced by a memcpy, which the
> > compiler should nicely optimize itself, for constant size copy:
> > 
> > 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);
> 
> Just for reference: It was previously discussed to ignore the stack ordering and simply copy the objects, but the idea was rejected due to the potential performance impact of not returning the hottest objects, i.e. the objects at the top of the stack, as the first in the returned array.
> 

Ah, yes, I had forgotten that we want the order reversed, so most recent
objects first. It would be interesting to see if ignoring order within a
burst can lead to some performance improvements, as I expect that in most
cases, the buffers to be used in a burst are to be used pretty much
simultaneously. However, based on your feedback, feel free to ignore this
comment and keep code as-is.

> > 
> > > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > > +
> > > +		return 0;
> > > +	}
> > > +
> > > +	/* Use the cache as much as we have to return hot objects first */
> > > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > > +			RTE_MIN(remaining, cache->len);
> > 
> > This line confused me a lot, until I applied the patch and stared at it a
> > lot :-). Why would the length value depend upon whether or not n is a
> > compile-time constant?
> > I think it would really help here to add in a comment stating that when n
> is a compile-time constant, at this point it must be "> cache->len" since
> > the previous block was untaken, therefore we don't need to call RTE_MIN.
> 
> I agree. This is almost like perl... write-only source code.
> 
> I will add a few explanatory comments.
> 

Thanks.

> > 
> > >  	cache->len -= len;
> > >  	remaining -= len;
> > >  	for (index = 0; index < len; index++)
> > >  		*obj_table++ = *--cache_objs;
> > >
> > > -	if (remaining == 0) {
> > > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> > >  		/* The entire request is satisfied from the cache. */
> > >
> > >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > 
> > Once I'd worked out the logic for the above conditional check, then this
> > conditional adjustment was clear. I just wonder if a further comment might
> > help here.
> > 
> > I am also wondering if having larger blocks for the constant and
> > non-constant cases may help. It would lead to some duplication but may
> > clarify the code.
> 
> I get your point, but the code is still short enough to grasp (after comments have been added). So I prefer to avoid code duplication.
> 

Sure.

> > 
> > > --
> > > 2.17.1
> > >


* Re: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 11:29   ` Morten Brørup
  2023-04-18 12:54     ` Bruce Richardson
@ 2023-04-18 12:55     ` Bruce Richardson
  2023-06-07  7:51       ` Thomas Monjalon
  1 sibling, 1 reply; 19+ messages in thread
From: Bruce Richardson @ 2023-04-18 12:55 UTC (permalink / raw)
  To: Morten Brørup; +Cc: olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 18 April 2023 13.07
> > 
> > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > When getting objects from the mempool, the number of objects to get is
> > > often constant at build time.
> > >
> > > This patch adds another code path for this case, so the compiler can
> > > optimize more, e.g. unroll the copy loop when the entire request is
> > > satisfied from the cache.
> > >
> > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > > mempool_perf_test with constant n shows an increase in rate_persec by an
> > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > >
> > > The code path where the number of objects to get is unknown at build time
> > > remains essentially unchanged.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > 
> > Change looks like a good idea. Some suggestions inline below, which you may want to
> > take on board for any future version. I'd strongly suggest adding some
> > extra clarifying code comments, as I suggest below.
> > With those extra code comments:
> > 
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > 
> > > ---
> > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 9f530db24b..ade0100ec7 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp,
> > void **obj_table,
> > >  	if (unlikely(cache == NULL))
> > >  		goto driver_dequeue;
> > >
> > > -	/* Use the cache as much as we have to return hot objects first */
> > > -	len = RTE_MIN(remaining, cache->len);
> > >  	cache_objs = &cache->objs[cache->len];
> > > +
> > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > +		/*
> > > +		 * The request size is known at build time, and
> > > +		 * the entire request can be satisfied from the cache,
> > > +		 * so let the compiler unroll the fixed length copy loop.
> > > +		 */
> > > +		cache->len -= n;
> > > +		for (index = 0; index < n; index++)
> > > +			*obj_table++ = *--cache_objs;
> > > +
> > 
> > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > easier for compilers to unroll efficiently) if it was rewritten as:
> > 
> > 	cache->len -= n;
> > 	cache_objs = &cache->objs[cache->len];
> > 	for (index = 0; index < n; index++)
> > 		obj_table[index] = cache_objs[index];
> 
> The mempool cache is a stack, so the copy loop needs to get the objects in decrementing order. I.e. the source index decrements and the destination index increments.
> 

BTW: Please add this as a comment in the code too, above the loop, to avoid
future developers (or even future me) asking this question again!

/Bruce



* Re: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 11:06 ` Bruce Richardson
  2023-04-18 11:29   ` Morten Brørup
@ 2023-04-18 15:15   ` Tyler Retzlaff
  2023-04-18 15:30     ` Morten Brørup
  2023-04-18 16:05   ` Morten Brørup
  2 siblings, 1 reply; 19+ messages in thread
From: Tyler Retzlaff @ 2023-04-18 15:15 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: Morten Brørup, olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > When getting objects from the mempool, the number of objects to get is
> > often constant at build time.
> > 
> > This patch adds another code path for this case, so the compiler can
> > optimize more, e.g. unroll the copy loop when the entire request is
> > satisfied from the cache.
> > 
> > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > mempool_perf_test with constant n shows an increase in rate_persec by an
> > average of 17 %, minimum 9.5 %, maximum 24 %.
> > 
> > The code path where the number of objects to get is unknown at build time
> > remains essentially unchanged.
> > 
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Change looks like a good idea. Some suggestions inline below, which you may want to
> take on board for any future version. I'd strongly suggest adding some
> extra clarifying code comments, as I suggest below.
> With those extra code comments:
> 
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> > ---
> >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 9f530db24b..ade0100ec7 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	if (unlikely(cache == NULL))
> >  		goto driver_dequeue;
> >  
> > -	/* Use the cache as much as we have to return hot objects first */
> > -	len = RTE_MIN(remaining, cache->len);
> >  	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {

don't take direct dependency on compiler builtins. define a macro so we
don't have to play shotgun surgery later.

also what is the purpose of using __extension__ here? are you annotating
the use of __builtin_constant_p() or is there more? because if that's
the only reason i see no need to use __extension__ when already using a
compiler specific builtin like this, that it is not standard is implied
and enforced by a compile break.
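
for example, something along these lines (name and placement hypothetical)
would centralize the dependency:

        /* hypothetical wrapper, e.g. in rte_common.h */
        #if defined(__GNUC__) || defined(__clang__)
        #define RTE_BUILTIN_CONSTANT_P(e) __builtin_constant_p(e)
        #else
        #define RTE_BUILTIN_CONSTANT_P(e) 0 /* always take the run-time path */
        #endif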

> > +		/*
> > +		 * The request size is known at build time, and
> > +		 * the entire request can be satisfied from the cache,
> > +		 * so let the compiler unroll the fixed length copy loop.
> > +		 */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
> 
> This loop looks a little awkward to me. Would it be clearer (and perhaps
> easier for compilers to unroll efficiently) if it was rewritten as:
> 
> 	cache->len -= n;
> 	cache_objs = &cache->objs[cache->len];
> 	for (index = 0; index < n; index++)
> 		obj_table[index] = cache_objs[index];
> 
> alternatively those last two lines can be replaced by a memcpy, which the
> compiler should nicely optimize itself, for constant size copy:
> 
> 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);
> 
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > +
> > +		return 0;
> > +	}
> > +
> > +	/* Use the cache as much as we have to return hot objects first */
> > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > +			RTE_MIN(remaining, cache->len);
> 
> This line confused me a lot, until I applied the patch and stared at it a
> lot :-). Why would the length value depend upon whether or not n is a
> compile-time constant?
> I think it would really help here to add in a comment stating that when n
> is a compile-time constant, at this point it must be "> cache->len" since
> the previous block was untaken, therefore we don't need to call RTE_MIN.
> 
> >  	cache->len -= len;
> >  	remaining -= len;
> >  	for (index = 0; index < len; index++)
> >  		*obj_table++ = *--cache_objs;
> >  
> > -	if (remaining == 0) {
> > +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
> >  		/* The entire request is satisfied from the cache. */
> >  
> >  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> 
> Once I'd worked out the logic for the above conditional check, then this
> conditional adjustment was clear. I just wonder if a further comment might
> help here.
> 
> I am also wondering if having larger blocks for the constant and
> non-constant cases may help. It would lead to some duplication but may
> clarify the code.
> 
> > -- 
> > 2.17.1
> > 


* RE: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 15:15   ` Tyler Retzlaff
@ 2023-04-18 15:30     ` Morten Brørup
  2023-04-18 15:44       ` Tyler Retzlaff
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-04-18 15:30 UTC (permalink / raw)
  To: Tyler Retzlaff, Bruce Richardson; +Cc: olivier.matz, andrew.rybchenko, dev

> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Tuesday, 18 April 2023 17.15
> 
> On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > When getting objects from the mempool, the number of objects to get
> is
> > > often constant at build time.
> > >
> > > This patch adds another code path for this case, so the compiler can
> > > optimize more, e.g. unroll the copy loop when the entire request is
> > > satisfied from the cache.
> > >
> > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > > mempool_perf_test with constant n shows an increase in rate_persec
> by an
> > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > >
> > > The code path where the number of objects to get is unknown at build
> time
> > > remains essentially unchanged.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >
> > Change looks like a good idea. Some suggestions inline below, which you may
> > want to take on board for any future version. I'd strongly suggest adding some
> > extra clarifying code comments, as I suggest below.
> > With those extra code comments:
> >
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> >
> > > ---
> > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 9f530db24b..ade0100ec7 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct
> rte_mempool *mp, void **obj_table,
> > >  	if (unlikely(cache == NULL))
> > >  		goto driver_dequeue;
> > >
> > > -	/* Use the cache as much as we have to return hot objects first */
> > > -	len = RTE_MIN(remaining, cache->len);
> > >  	cache_objs = &cache->objs[cache->len];
> > > +
> > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> 
> don't take direct dependency on compiler builtins. define a macro so we
> don't have to play shotgun surgery later.
> 
> also what is the purpose of using __extension__ here? are you annotating
> the use of __builtin_constant_p() or is there more? because if that's
> the only reason i see no need to use __extension__ when already using a
> compiler specific builtin like this, that it is not standard is implied
> and enforced by a compile break.

ARM 32-bit memcpy() [1] does it this way, so I did the same.

[1]: https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_memcpy_32.h#L122

While I agree that a macro for __builtin_constant_p() would be good, it belongs in a patch to fix portability, not in this patch.



* Re: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 15:30     ` Morten Brørup
@ 2023-04-18 15:44       ` Tyler Retzlaff
  2023-04-18 15:50         ` Morten Brørup
  0 siblings, 1 reply; 19+ messages in thread
From: Tyler Retzlaff @ 2023-04-18 15:44 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Bruce Richardson, olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 05:30:27PM +0200, Morten Brørup wrote:
> > From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> > Sent: Tuesday, 18 April 2023 17.15
> > 
> > On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > When getting objects from the mempool, the number of objects to get
> > is
> > > > often constant at build time.
> > > >
> > > > This patch adds another code path for this case, so the compiler can
> > > > optimize more, e.g. unroll the copy loop when the entire request is
> > > > satisfied from the cache.
> > > >
> > > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> > > > mempool_perf_test with constant n shows an increase in rate_persec
> > by an
> > > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > > >
> > > > The code path where the number of objects to get is unknown at build
> > time
> > > > remains essentially unchanged.
> > > >
> > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >
> > > Change looks like a good idea. Some suggestions inline below, which you may
> > > want to take on board for any future version. I'd strongly suggest adding some
> > > extra clarifying code comments, as I suggest below.
> > > With those extra code comments:
> > >
> > > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > >
> > > > ---
> > > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > > index 9f530db24b..ade0100ec7 100644
> > > > --- a/lib/mempool/rte_mempool.h
> > > > +++ b/lib/mempool/rte_mempool.h
> > > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct
> > rte_mempool *mp, void **obj_table,
> > > >  	if (unlikely(cache == NULL))
> > > >  		goto driver_dequeue;
> > > >
> > > > -	/* Use the cache as much as we have to return hot objects first */
> > > > -	len = RTE_MIN(remaining, cache->len);
> > > >  	cache_objs = &cache->objs[cache->len];
> > > > +
> > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > 
> > don't take direct dependency on compiler builtins. define a macro so we
> > don't have to play shotgun surgery later.
> > 
> > also what is the purpose of using __extension__ here? are you annotating
> > the use of __builtin_constant_p() or is there more? because if that's
> > the only reason i see no need to use __extension__ when already using a
> > compiler specific builtin like this, that it is not standard is implied
> > and enforced by a compile break.
> 
> ARM 32-bit memcpy() [1] does it this way, so I did the same.
> 
> [1]: https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_memcpy_32.h#L122

i see thanks.

> 
> While I agree that a macro for __builtin_constant_p() would be good, it belongs in a patch to fix portability, not in this patch.

i agree it isn't part of this change.

would you mind introducing it as a separate patch and depending on it or
do you feel that would delay this patch too much? i wouldn't mind doing
it myself but there is a long merge time on my patches which means i end
up having to carry the adaptations locally for weeks at a time.

thanks


* RE: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 15:44       ` Tyler Retzlaff
@ 2023-04-18 15:50         ` Morten Brørup
  2023-04-18 16:01           ` Tyler Retzlaff
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-04-18 15:50 UTC (permalink / raw)
  To: Tyler Retzlaff; +Cc: Bruce Richardson, olivier.matz, andrew.rybchenko, dev

> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Tuesday, 18 April 2023 17.45
> 
> On Tue, Apr 18, 2023 at 05:30:27PM +0200, Morten Brørup wrote:
> > > From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> > > Sent: Tuesday, 18 April 2023 17.15
> > >
> > > On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > When getting objects from the mempool, the number of objects to
> get
> > > is
> > > > > often constant at build time.
> > > > >
> > > > > This patch adds another code path for this case, so the compiler
> can
> > > > > optimize more, e.g. unroll the copy loop when the entire request
> is
> > > > > satisfied from the cache.
> > > > >
> > > > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc
> 9.4.0,
> > > > > mempool_perf_test with constant n shows an increase in
> rate_persec
> > > by an
> > > > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > > > >
> > > > > The code path where the number of objects to get is unknown at
> build
> > > time
> > > > > remains essentially unchanged.
> > > > >
> > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > >
> > > > Change looks like a good idea. Some suggestions inline below, which you
> > > > may want to take on board for any future version. I'd strongly suggest
> > > > adding some extra clarifying code comments, as I suggest below.
> > > > With those extra code comments:
> > > >
> > > > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > > >
> > > > > ---
> > > > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > > > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/lib/mempool/rte_mempool.h
> b/lib/mempool/rte_mempool.h
> > > > > index 9f530db24b..ade0100ec7 100644
> > > > > --- a/lib/mempool/rte_mempool.h
> > > > > +++ b/lib/mempool/rte_mempool.h
> > > > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct
> > > rte_mempool *mp, void **obj_table,
> > > > >  	if (unlikely(cache == NULL))
> > > > >  		goto driver_dequeue;
> > > > >
> > > > > -	/* Use the cache as much as we have to return hot objects
> first */
> > > > > -	len = RTE_MIN(remaining, cache->len);
> > > > >  	cache_objs = &cache->objs[cache->len];
> > > > > +
> > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache-
> >len) {
> > >
> > > don't take direct dependency on compiler builtins. define a macro so
> we
> > > don't have to play shotgun surgery later.
> > >
> > > also what is the purpose of using __extension__ here? are you
> annotating
> > > the use of __builtin_constant_p() or is there more? because if
> that's
> > > the only reason i see no need to use __extension__ when already
> using a
> > > compiler specific builtin like this, that it is not standard is
> implied
> > > and enforced by a compile break.
> >
> > ARM 32-bit memcpy() [1] does it this way, so I did the same.
> >
> > [1]:
> https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_me
> mcpy_32.h#L122
> 
> i see thanks.
> 
> >
> > While I agree that a macro for __builtin_constant_p() would be good,
> it belongs in a patch to fix portability, not in this patch.
> 
> i agree it isn't part of this change.
> 
> would you mind introducing it as a separate patch and depending on it or
> do you feel that would delay this patch too much? i wouldn't mind doing
> it myself but there is a long merge time on my patches which means i end
> up having to carry the adaptations locally for weeks at a time.

I would rather not.

Introducing global macros in rte_common.h usually triggers a lot of discussion and pushback, and I don't want it to hold back this patch.



* Re: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 15:50         ` Morten Brørup
@ 2023-04-18 16:01           ` Tyler Retzlaff
  0 siblings, 0 replies; 19+ messages in thread
From: Tyler Retzlaff @ 2023-04-18 16:01 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Bruce Richardson, olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 05:50:56PM +0200, Morten Brørup wrote:
> > From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> > Sent: Tuesday, 18 April 2023 17.45
> > 
> > On Tue, Apr 18, 2023 at 05:30:27PM +0200, Morten Brørup wrote:
> > > > From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> > > > Sent: Tuesday, 18 April 2023 17.15
> > > >
> > > > On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> > > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > > When getting objects from the mempool, the number of objects to
> > get
> > > > is
> > > > > > often constant at build time.
> > > > > >
> > > > > > This patch adds another code path for this case, so the compiler
> > can
> > > > > > optimize more, e.g. unroll the copy loop when the entire request
> > is
> > > > > > satisfied from the cache.
> > > > > >
> > > > > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc
> > 9.4.0,
> > > > > > mempool_perf_test with constant n shows an increase in
> > rate_persec
> > > > by an
> > > > > > average of 17 %, minimum 9.5 %, maximum 24 %.
> > > > > >
> > > > > > The code path where the number of objects to get is unknown at
> > build
> > > > time
> > > > > > remains essentially unchanged.
> > > > > >
> > > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > >
> > > > > Change looks like a good idea. Some suggestions inline below, which you
> > > > > may want to take on board for any future version. I'd strongly suggest
> > > > > adding some extra clarifying code comments, as I suggest below.
> > > > > With those extra code comments:
> > > > >
> > > > > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > >
> > > > > > ---
> > > > > >  lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
> > > > > >  1 file changed, 21 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/mempool/rte_mempool.h
> > b/lib/mempool/rte_mempool.h
> > > > > > index 9f530db24b..ade0100ec7 100644
> > > > > > --- a/lib/mempool/rte_mempool.h
> > > > > > +++ b/lib/mempool/rte_mempool.h
> > > > > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct
> > > > rte_mempool *mp, void **obj_table,
> > > > > >  	if (unlikely(cache == NULL))
> > > > > >  		goto driver_dequeue;
> > > > > >
> > > > > > -	/* Use the cache as much as we have to return hot objects
> > first */
> > > > > > -	len = RTE_MIN(remaining, cache->len);
> > > > > >  	cache_objs = &cache->objs[cache->len];
> > > > > > +
> > > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache-
> > >len) {
> > > >
> > > > don't take direct dependency on compiler builtins. define a macro so
> > we
> > > > don't have to play shotgun surgery later.
> > > >
> > > > also what is the purpose of using __extension__ here? are you
> > annotating
> > > > the use of __builtin_constant_p() or is there more? because if
> > that's
> > > > the only reason i see no need to use __extension__ when already
> > using a
> > > > compiler specific builtin like this, that it is not standard is
> > implied
> > > > and enforced by a compile break.
> > >
> > > ARM 32-bit memcpy() [1] does it this way, so I did the same.
> > >
> > > [1]:
> > https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_me
> > mcpy_32.h#L122
> > 
> > i see thanks.
> > 
> > >
> > > While I agree that a macro for __builtin_constant_p() would be good,
> > it belongs in a patch to fix portability, not in this patch.
> > 
> > i agree it isn't part of this change.
> > 
> > would you mind introducing it as a separate patch and depending on it or
> > do you feel that would delay this patch too much? i wouldn't mind doing
> > it myself but there is a long merge time on my patches which means i end
> > up having to carry the adaptations locally for weeks at a time.
> 
> I would rather not.
> 
> Introducing global macros in rte_common.h usually triggers a lot of discussion and pushback, and I don't want it to hold back this patch.

yeah, no kidding. i wish the process was a bit more friendly being on
the receiving end. it's unfortunate because it is discouraging
improvements.

i'll bring a patch for it then.


* RE: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 11:06 ` Bruce Richardson
  2023-04-18 11:29   ` Morten Brørup
  2023-04-18 15:15   ` Tyler Retzlaff
@ 2023-04-18 16:05   ` Morten Brørup
  2 siblings, 0 replies; 19+ messages in thread
From: Morten Brørup @ 2023-04-18 16:05 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: olivier.matz, andrew.rybchenko, dev

> From: Morten Brørup
> Sent: Tuesday, 18 April 2023 13.30
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 18 April 2023 13.07
> >
> > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:

[...]

> > > +		/*
> > > +		 * The request size is known at build time, and
> > > +		 * the entire request can be satisfied from the cache,
> > > +		 * so let the compiler unroll the fixed length copy loop.
> > > +		 */
> > > +		cache->len -= n;
> > > +		for (index = 0; index < n; index++)
> > > +			*obj_table++ = *--cache_objs;
> > > +
> >
> > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > easier for compilers to unroll efficiently) if it was rewritten as:
> >
> > 	cache->len -= n;
> > 	cache_objs = &cache->objs[cache->len];
> > 	for (index = 0; index < n; index++)
> > 		obj_table[index] = cache_objs[index];
> 
> The mempool cache is a stack, so the copy loop needs to get the objects in
> decrementing order. I.e. the source index decrements and the destination
> index increments.
> 
> Regardless, your point here is still valid! I expected that any
> unrolling capable compiler can unroll *dst++ = *--src; but I can
> experiment with different compilers on godbolt.org to see if dst[index]
> = src[-index] is better.

Just for the record... I have now tried experimenting with the alternative, and it makes no difference.



* [PATCH v3] mempool: optimize get objects with constant n
  2023-04-11  6:48 [PATCH] mempool: optimize get objects with constant n Morten Brørup
  2023-04-18 11:06 ` Bruce Richardson
@ 2023-04-18 19:51 ` Morten Brørup
  2023-04-18 20:09 ` [PATCH v4] " Morten Brørup
  2 siblings, 0 replies; 19+ messages in thread
From: Morten Brørup @ 2023-04-18 19:51 UTC (permalink / raw)
  To: dev, olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, roretzla, Morten Brørup

When getting objects from the mempool, the number of objects to get is
often constant at build time.

This patch adds another code path for this case, so the compiler can
optimize more, e.g. unroll the copy loop when the entire request is
satisfied from the cache.

On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
mempool_perf_test with constant n shows an increase in rate_persec by an
average of 17 %, minimum 9.5 %, maximum 24 %.

The code path where the number of objects to get is unknown at build time
remains essentially unchanged.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>

---
v3:
* Rebase to main.
v2:
* Added comments describing why some code is omitted when 'n' is known
  at build time.
* Improved source code readability by explicitly setting 'remaining' where
  relevant, instead of at initialization.
---
 lib/mempool/rte_mempool.h | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index ade0100ec7..7d14dea5c4 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1492,13 +1492,15 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
-	unsigned int remaining = n;
+	unsigned int remaining;
 	uint32_t index, len;
 	void **cache_objs;
 
 	/* No cache provided */
-	if (unlikely(cache == NULL))
+	if (unlikely(cache == NULL)) {
+		remaining = n;
 		goto driver_dequeue;
+	}
 
 	cache_objs = &cache->objs[cache->len];
 
@@ -1518,14 +1520,23 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 		return 0;
 	}
 
-	/* Use the cache as much as we have to return hot objects first */
+	/*
+	 * Use the cache as much as we have to return hot objects first.
+	 * If the request size 'n' is known at build time, the above comparison
+	 * ensures that n > cache->len here, so omit RTE_MIN().
+	 */
 	len = __extension__(__builtin_constant_p(n)) ? cache->len :
-			RTE_MIN(remaining, cache->len);
+			RTE_MIN(n, cache->len);
 	cache->len -= len;
-	remaining -= len;
+	remaining = n - len;
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
+	/*
+	 * If the request size 'n' is known at build time, the case
+	 * where the entire request can be satisfied from the cache
+	 * has already been handled above, so omit handling it here.
+	 */
 	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
 		/* The entire request is satisfied from the cache. */
 
-- 
2.17.1



* [PATCH v4] mempool: optimize get objects with constant n
  2023-04-11  6:48 [PATCH] mempool: optimize get objects with constant n Morten Brørup
  2023-04-18 11:06 ` Bruce Richardson
  2023-04-18 19:51 ` [PATCH v3] " Morten Brørup
@ 2023-04-18 20:09 ` Morten Brørup
  2023-06-07  9:12   ` Thomas Monjalon
  2 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-04-18 20:09 UTC (permalink / raw)
  To: dev, olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, roretzla, Morten Brørup

When getting objects from the mempool, the number of objects to get is
often constant at build time.

This patch adds another code path for this case, so the compiler can
optimize more, e.g. unroll the copy loop when the entire request is
satisfied from the cache.

On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
mempool_perf_test with constant n shows an increase in rate_persec by an
average of 17 %, minimum 9.5 %, maximum 24 %.

The code path where the number of objects to get is unknown at build time
remains essentially unchanged.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>

---
v4:
* Rebased to main.
v3:
* Tried to rebase to main, but failed to include all my changes.
v2:
* Added comments describing why some code is omitted when 'n' is known
  at build time.
* Improved source code readability by explicitly setting 'remaining' where
  relevant, instead of at initialization.
---
 lib/mempool/rte_mempool.h | 41 +++++++++++++++++++++++++++++++++------
 1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 9f530db24b..7d14dea5c4 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1492,23 +1492,52 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
-	unsigned int remaining = n;
+	unsigned int remaining;
 	uint32_t index, len;
 	void **cache_objs;
 
 	/* No cache provided */
-	if (unlikely(cache == NULL))
+	if (unlikely(cache == NULL)) {
+		remaining = n;
 		goto driver_dequeue;
+	}
 
-	/* Use the cache as much as we have to return hot objects first */
-	len = RTE_MIN(remaining, cache->len);
 	cache_objs = &cache->objs[cache->len];
+
+	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
+		/*
+		 * The request size is known at build time, and
+		 * the entire request can be satisfied from the cache,
+		 * so let the compiler unroll the fixed length copy loop.
+		 */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
+
+		return 0;
+	}
+
+	/*
+	 * Use the cache as much as we have to return hot objects first.
+	 * If the request size 'n' is known at build time, the above comparison
+	 * ensures that n > cache->len here, so omit RTE_MIN().
+	 */
+	len = __extension__(__builtin_constant_p(n)) ? cache->len :
+			RTE_MIN(n, cache->len);
 	cache->len -= len;
-	remaining -= len;
+	remaining = n - len;
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	if (remaining == 0) {
+	/*
+	 * If the request size 'n' is known at build time, the case
+	 * where the entire request can be satisfied from the cache
+	 * has already been handled above, so omit handling it here.
+	 */
+	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
 		/* The entire request is satisfied from the cache. */
 
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
-- 
2.17.1



* Re: [PATCH] mempool: optimize get objects with constant n
  2023-04-18 12:55     ` Bruce Richardson
@ 2023-06-07  7:51       ` Thomas Monjalon
  2023-06-07  8:03         ` Morten Brørup
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Monjalon @ 2023-06-07  7:51 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev, olivier.matz, andrew.rybchenko, Bruce Richardson

18/04/2023 14:55, Bruce Richardson:
> On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > > +		/*
> > > > +		 * The request size is known at build time, and
> > > > +		 * the entire request can be satisfied from the cache,
> > > > +		 * so let the compiler unroll the fixed length copy loop.
> > > > +		 */
> > > > +		cache->len -= n;
> > > > +		for (index = 0; index < n; index++)
> > > > +			*obj_table++ = *--cache_objs;
> > > > +
> > > 
> > > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > > easier for compilers to unroll efficiently) if it was rewritten as:
> > > 
> > > 	cache->len -= n;
> > > 	cache_objs = &cache->objs[cache->len];
> > > 	for (index = 0; index < n; index++)
> > > 		obj_table[index] = cache_objs[index];
> > 
> > The mempool cache is a stack, so the copy loop needs to get the objects in decrementing order. I.e. the source index decrements and the destination index increments.
> > 
> 
> > BTW: Please add this as a comment in the code too, above the loop, to avoid
> > future developers (or even future me) asking this question again!

Looks like this request was missed.





* RE: [PATCH] mempool: optimize get objects with constant n
  2023-06-07  7:51       ` Thomas Monjalon
@ 2023-06-07  8:03         ` Morten Brørup
  2023-06-07  8:10           ` Thomas Monjalon
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-06-07  8:03 UTC (permalink / raw)
  To: Thomas Monjalon, Bruce Richardson; +Cc: dev, olivier.matz, andrew.rybchenko

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, 7 June 2023 09.52
> 
> 18/04/2023 14:55, Bruce Richardson:
> > On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > > > +		/*
> > > > > +		 * The request size is known at build time, and
> > > > > +		 * the entire request can be satisfied from the cache,
> > > > > +		 * so let the compiler unroll the fixed length copy loop.
> > > > > +		 */
> > > > > +		cache->len -= n;
> > > > > +		for (index = 0; index < n; index++)
> > > > > +			*obj_table++ = *--cache_objs;
> > > > > +
> > > >
> > > > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > > > easier for compilers to unroll efficiently) if it was rewritten as:
> > > >
> > > > 	cache->len -= n;
> > > > 	cache_objs = &cache->objs[cache->len];
> > > > 	for (index = 0; index < n; index++)
> > > > 		obj_table[index] = cache_objs[index];
> > >
> > > > The mempool cache is a stack, so the copy loop needs to get the objects in
> > > > decrementing order. I.e. the source index decrements and the destination
> > > > index increments.
> > >
> >
> > BTW: Please add this as a comment in the code too, above the loop, to avoid
> > future developers (or even future me) asking this question again!
> 
> Looks like this request was missed.

I intentionally omitted it, because I disagree with the suggestion.

Otherwise, reminders that the mempool cache is a stack should be plastered all over the source code, not just here. For reference, this copy loop (without such a reminder) also exists elsewhere in the same file.

Apologies for not responding to Bruce's request earlier.



* Re: [PATCH] mempool: optimize get objects with constant n
  2023-06-07  8:03         ` Morten Brørup
@ 2023-06-07  8:10           ` Thomas Monjalon
  2023-06-07  8:33             ` Morten Brørup
  2023-06-07  8:41             ` Morten Brørup
  0 siblings, 2 replies; 19+ messages in thread
From: Thomas Monjalon @ 2023-06-07  8:10 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko

07/06/2023 10:03, Morten Brørup:
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Wednesday, 7 June 2023 09.52
> > 
> > 18/04/2023 14:55, Bruce Richardson:
> > > On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > > > > +		/*
> > > > > > +		 * The request size is known at build time, and
> > > > > > +		 * the entire request can be satisfied from the cache,
> > > > > > +		 * so let the compiler unroll the fixed length copy loop.
> > > > > > +		 */
> > > > > > +		cache->len -= n;
> > > > > > +		for (index = 0; index < n; index++)
> > > > > > +			*obj_table++ = *--cache_objs;
> > > > > > +
> > > > >
> > > > > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > > > > easier for compilers to unroll efficiently) if it was rewritten as:
> > > > >
> > > > > 	cache->len -= n;
> > > > > 	cache_objs = &cache->objs[cache->len];
> > > > > 	for (index = 0; index < n; index++)
> > > > > 		obj_table[index] = cache_objs[index];
> > > >
> > > > > The mempool cache is a stack, so the copy loop needs to get the objects
> > > > > in decrementing order. I.e. the source index decrements and the
> > > > > destination index increments.
> > > >
> > >
> > > BTW: Please add this as a comment in the code too, above the loop to avoid
> > > future developers (or even future me) asking this question again!
> > 
> > Looks like this request was missed.
> 
> I intentionally omitted it, because I disagree with the suggestion.
> 
> Otherwise, reminders that the mempool cache is a stack would have to be plastered all over the source code, not just here. For reference, this copy loop (without such a reminder) also exists elsewhere in the same file.
> 
> Apologies for not responding to Bruce's request earlier.

What about adding a general comment at the top of the function,
alongside the assignment of the pointer to the end of the array:

    /* The cache is a stack, so copy will be in reverse order. */
    cache_objs = &cache->objs[cache->len];

I could do it on apply if there is agreement.
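
Concretely, the comment would sit at the top of rte_mempool_do_generic_get(), directly above the pointer assignment, roughly like this (a sketch assembled from the hunk quoted above; the statistics updates and return are elided):

    /* The cache is a stack, so copy will be in reverse order. */
    cache_objs = &cache->objs[cache->len];

    if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
        /*
         * The request size is known at build time, and
         * the entire request can be satisfied from the cache,
         * so let the compiler unroll the fixed length copy loop.
         */
        cache->len -= n;
        for (index = 0; index < n; index++)
            *obj_table++ = *--cache_objs;

        /* ... update statistics and return 0 ... */
    }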



^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH] mempool: optimize get objects with constant n
  2023-06-07  8:10           ` Thomas Monjalon
@ 2023-06-07  8:33             ` Morten Brørup
  2023-06-07  8:41             ` Morten Brørup
  1 sibling, 0 replies; 19+ messages in thread
From: Morten Brørup @ 2023-06-07  8:33 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, 7 June 2023 10.10
> 
> 07/06/2023 10:03, Morten Brørup:
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Wednesday, 7 June 2023 09.52
> > >
> > > 18/04/2023 14:55, Bruce Richardson:
> > > > On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
> > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> > > > > > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> > > > > > > +		/*
> > > > > > > +		 * The request size is known at build time, and
> > > > > > > +		 * the entire request can be satisfied from the cache,
> > > > > > > +		 * so let the compiler unroll the fixed length copy loop.
> > > > > > > +		 */
> > > > > > > +		cache->len -= n;
> > > > > > > +		for (index = 0; index < n; index++)
> > > > > > > +			*obj_table++ = *--cache_objs;
> > > > > > > +
> > > > > >
> > > > > > This loop looks a little awkward to me. Would it be clearer (and perhaps
> > > > > > easier for compilers to unroll efficiently) if it was rewritten as:
> > > > > >
> > > > > > 	cache->len -= n;
> > > > > > 	cache_objs = &cache->objs[cache->len];
> > > > > > 	for (index = 0; index < n; index++)
> > > > > > 		obj_table[index] = cache_objs[index];
> > > > >
> > > > > The mempool cache is a stack, so the copy loop needs to get the objects
> > > > > in decrementing order. I.e. the source index decrements and the
> > > > > destination index increments.
> > > > >
> > > >
> > > > BTW: Please add this as a comment in the code too, above the loop to
> > > > avoid future developers (or even future me) asking this question again!
> > >
> > > Looks like this request was missed.
> >
> > I intentionally omitted it, because I disagree with the suggestion.
> >
> > Otherwise, reminders that the mempool cache is a stack would have to be
> > plastered all over the source code, not just here. For reference, this copy
> > loop (without such a reminder) also exists elsewhere in the same file.
> >
> > Apologies for not responding to Bruce's request earlier.
> 
> What about adding a general comment at the top of the function,
> alongside the assignment of the pointer to the end of the array:
> 
>     /* The cache is a stack, so copy will be in reverse order. */
>     cache_objs = &cache->objs[cache->len];
> 
> I could do it on apply if there is agreement.
> 
ACK from me.

For consistency, please also add the reminder to the two existing reverse-order copy loops in rte_mempool_do_generic_get().


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH] mempool: optimize get objects with constant n
  2023-06-07  8:10           ` Thomas Monjalon
  2023-06-07  8:33             ` Morten Brørup
@ 2023-06-07  8:41             ` Morten Brørup
  1 sibling, 0 replies; 19+ messages in thread
From: Morten Brørup @ 2023-06-07  8:41 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko

> From: Morten Brørup
> Sent: Wednesday, 7 June 2023 10.34
> 
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Wednesday, 7 June 2023 10.10
> >

[...]

> > What about adding a general comment at the top of the function,
> > alongside the assignment of the pointer to the end of the array:
> >
> >     /* The cache is a stack, so copy will be in reverse order. */
> >     cache_objs = &cache->objs[cache->len];
> >
> > I could do it on apply if there is agreement.
> >
> ACK from me.
> 
> For consistency, please also add the reminder to the two existing
> reverse-order copy loops in rte_mempool_do_generic_get().

Rubbish! I was looking at the wrong version of the code.

So, plain ACK from me. No further comments!

Sorry. :-)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v4] mempool: optimize get objects with constant n
  2023-04-18 20:09 ` [PATCH v4] " Morten Brørup
@ 2023-06-07  9:12   ` Thomas Monjalon
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Monjalon @ 2023-06-07  9:12 UTC (permalink / raw)
  To: Morten Brørup
  Cc: dev, olivier.matz, andrew.rybchenko, bruce.richardson, roretzla

18/04/2023 22:09, Morten Brørup:
> When getting objects from the mempool, the number of objects to get is
> often constant at build time.
> 
> This patch adds another code path for this case, so the compiler can
> optimize more, e.g. unroll the copy loop when the entire request is
> satisfied from the cache.
> 
> On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> mempool_perf_test with constant n shows an increase in rate_persec by an
> average of 17 %, minimum 9.5 %, maximum 24 %.
> 
> The code path where the number of objects to get is unknown at build time
> remains essentially unchanged.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>

Applied with suggested added comment, thanks.
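
As a usage note, the new code path is taken whenever the request size is a compile-time constant at the call site, for example (a sketch; mp is assumed to be an initialized mempool with a per-lcore cache):

    void *objs[32];

    /* n == 32 is a build-time constant, so the fixed-length path applies
     * and the compiler can unroll the copy loop whenever the cache holds
     * at least 32 objects. */
    if (rte_mempool_get_bulk(mp, objs, 32) == 0) {
        /* ... use the 32 objects ... */
        rte_mempool_put_bulk(mp, objs, 32);
    }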




^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-06-07  9:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-11  6:48 [PATCH] mempool: optimize get objects with constant n Morten Brørup
2023-04-18 11:06 ` Bruce Richardson
2023-04-18 11:29   ` Morten Brørup
2023-04-18 12:54     ` Bruce Richardson
2023-04-18 12:55     ` Bruce Richardson
2023-06-07  7:51       ` Thomas Monjalon
2023-06-07  8:03         ` Morten Brørup
2023-06-07  8:10           ` Thomas Monjalon
2023-06-07  8:33             ` Morten Brørup
2023-06-07  8:41             ` Morten Brørup
2023-04-18 15:15   ` Tyler Retzlaff
2023-04-18 15:30     ` Morten Brørup
2023-04-18 15:44       ` Tyler Retzlaff
2023-04-18 15:50         ` Morten Brørup
2023-04-18 16:01           ` Tyler Retzlaff
2023-04-18 16:05   ` Morten Brørup
2023-04-18 19:51 ` [PATCH v3] " Morten Brørup
2023-04-18 20:09 ` [PATCH v4] " Morten Brørup
2023-06-07  9:12   ` Thomas Monjalon
