* [PATCH] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-04-11 6:48 UTC
To: olivier.matz, andrew.rybchenko; Cc: dev

When getting objects from the mempool, the number of objects to get is
often constant at build time.

This patch adds another code path for this case, so the compiler can
optimize more, e.g. unroll the copy loop when the entire request is
satisfied from the cache.

On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
mempool_perf_test with constant n shows an increase in rate_persec by an
average of 17 %, minimum 9.5 %, maximum 24 %.

The code path where the number of objects to get is unknown at build time
remains essentially unchanged.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 9f530db24b..ade0100ec7 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	if (unlikely(cache == NULL))
 		goto driver_dequeue;
 
-	/* Use the cache as much as we have to return hot objects first */
-	len = RTE_MIN(remaining, cache->len);
 	cache_objs = &cache->objs[cache->len];
+
+	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
+		/*
+		 * The request size is known at build time, and
+		 * the entire request can be satisfied from the cache,
+		 * so let the compiler unroll the fixed length copy loop.
+		 */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
+
+		return 0;
+	}
+
+	/* Use the cache as much as we have to return hot objects first */
+	len = __extension__(__builtin_constant_p(n)) ? cache->len :
+		RTE_MIN(remaining, cache->len);
 	cache->len -= len;
 	remaining -= len;
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	if (remaining == 0) {
+	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
 		/* The entire request is satisfied from the cache. */
 
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
--
2.17.1
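For readers unfamiliar with the trick, here is a minimal standalone sketch
of the technique the patch relies on; the function and variable names are
illustrative only and are not part of rte_mempool:

	#include <string.h>

	/*
	 * Sketch: __builtin_constant_p(n) evaluates to true only when the
	 * compiler can prove n is a compile-time constant (e.g. after
	 * inlining a call site that passes a literal), selecting a branch
	 * it can fully unroll; otherwise the generic run-time path is taken.
	 */
	static inline void
	copy_objs(void **dst, void **src, unsigned int n)
	{
		if (__builtin_constant_p(n)) {
			/* n is a literal at the call site: the compiler can
			 * emit a fixed sequence of moves instead of a loop. */
			for (unsigned int i = 0; i < n; i++)
				dst[i] = src[i];
		} else {
			/* n only known at run time: generic copy. */
			memcpy(dst, src, n * sizeof(dst[0]));
		}
	}

	/* copy_objs(table, cache, 32) takes the unrolled branch;
	 * copy_objs(table, cache, count) takes the memcpy branch. */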
* Re: [PATCH] mempool: optimize get objects with constant n
From: Bruce Richardson <bruce.richardson@intel.com>
Date: 2023-04-18 11:06 UTC
To: Morten Brørup; Cc: olivier.matz, andrew.rybchenko, dev

On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
> When getting objects from the mempool, the number of objects to get is
> often constant at build time.
>
> This patch adds another code path for this case, so the compiler can
> optimize more, e.g. unroll the copy loop when the entire request is
> satisfied from the cache.
>
> On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
> mempool_perf_test with constant n shows an increase in rate_persec by an
> average of 17 %, minimum 9.5 %, maximum 24 %.
>
> The code path where the number of objects to get is unknown at build time
> remains essentially unchanged.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>

Change looks a good idea. Some suggestions inline below, which you may want
to take on board for any future version. I'd strongly suggest adding some
extra clarifying code comments, as I suggest below.
With those extra code comments:

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

> ---
[...]
> @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	if (unlikely(cache == NULL))
>  		goto driver_dequeue;
>  
> -	/* Use the cache as much as we have to return hot objects first */
> -	len = RTE_MIN(remaining, cache->len);
>  	cache_objs = &cache->objs[cache->len];
> +
> +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> +		/*
> +		 * The request size is known at build time, and
> +		 * the entire request can be satisfied from the cache,
> +		 * so let the compiler unroll the fixed length copy loop.
> +		 */
> +		cache->len -= n;
> +		for (index = 0; index < n; index++)
> +			*obj_table++ = *--cache_objs;
> +

This loop looks a little awkward to me. Would it be clearer (and perhaps
easier for compilers to unroll efficiently) if it was rewritten as:

	cache->len -= n;
	cache_objs = &cache->objs[cache->len];
	for (index = 0; index < n; index++)
		obj_table[index] = cache_objs[index];

Alternatively, those last two lines can be replaced by a memcpy, which the
compiler should nicely optimize itself for a constant size copy:

	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);

> +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> +		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> +
> +		return 0;
> +	}
> +
> +	/* Use the cache as much as we have to return hot objects first */
> +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> +		RTE_MIN(remaining, cache->len);

This line confused me a lot, until I applied the patch and stared at it a
lot :-). Why would the length value depend upon whether or not n is a
compile-time constant?
I think it would really help here to add in a comment stating that when n
is a compile-time constant, at this point it must be "> cache->len", since
the previous block was untaken, and therefore we don't need to call RTE_MIN.

>  	cache->len -= len;
>  	remaining -= len;
>  	for (index = 0; index < len; index++)
>  		*obj_table++ = *--cache_objs;
>  
> -	if (remaining == 0) {
> +	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
>  		/* The entire request is satisfied from the cache. */
>  
>  		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);

Once I'd worked out the logic for the above conditional check, this
conditional adjustment was clear. I just wonder if a further comment might
help here.

I am also wondering if having larger blocks for the constant and
non-constant cases may help. It would lead to some duplication but may
clarify the code.
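Spelling out that reasoning in place, the requested clarification could
look roughly like this (a sketch only; the wording is mine, not the
comment eventually merged):

	/*
	 * When n is a compile-time constant and execution reaches this
	 * point, the earlier "if (constant && n <= cache->len)" branch
	 * was not taken, so n > cache->len must hold, and
	 * RTE_MIN(remaining, cache->len) would always evaluate to
	 * cache->len. The constant case can therefore skip RTE_MIN().
	 */
	len = __extension__(__builtin_constant_p(n)) ? cache->len :
		RTE_MIN(remaining, cache->len);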
* RE: [PATCH] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-04-18 11:29 UTC
To: Bruce Richardson; Cc: olivier.matz, andrew.rybchenko, dev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 18 April 2023 13.07
>
[...]
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
>
> This loop looks a little awkward to me. Would it be clearer (and perhaps
> easier for compilers to unroll efficiently) if it was rewritten as:
>
> 	cache->len -= n;
> 	cache_objs = &cache->objs[cache->len];
> 	for (index = 0; index < n; index++)
> 		obj_table[index] = cache_objs[index];

The mempool cache is a stack, so the copy loop needs to get the objects in
decrementing order, i.e. the source index decrements and the destination
index increments.

Regardless, your point here is still valid! I expected that any unrolling
capable compiler can unroll *dst++ = *--src; but I can experiment with
different compilers on godbolt.org to see if dst[index] = src[-index] is
better.

A future version could be hand coded with vector instructions here, like
in the Ethdev drivers.

> Alternatively, those last two lines can be replaced by a memcpy, which
> the compiler should nicely optimize itself for a constant size copy:
>
> 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);

Just for reference: It was previously discussed to ignore the stack
ordering and simply copy the objects, but the idea was rejected due to the
potential performance impact of not returning the hottest objects, i.e.
the objects at the top of the stack, as the first in the returned array.

> > +	/* Use the cache as much as we have to return hot objects first */
> > +	len = __extension__(__builtin_constant_p(n)) ? cache->len :
> > +		RTE_MIN(remaining, cache->len);
>
> This line confused me a lot, until I applied the patch and stared at it
> a lot :-). [...]
> I think it would really help here to add in a comment stating that when
> n is a compile-time constant, at this point it must be "> cache->len",
> since the previous block was untaken, and therefore we don't need to
> call RTE_MIN.

I agree. This is almost like perl... write-only source code.

I will add a few explanatory comments.

> Once I'd worked out the logic for the above conditional check, this
> conditional adjustment was clear. I just wonder if a further comment
> might help here.
>
> I am also wondering if having larger blocks for the constant and
> non-constant cases may help. It would lead to some duplication but may
> clarify the code.

I get your point, but the code is still short enough to grasp (after
comments have been added), so I prefer to avoid code duplication.
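For illustration, the two loop shapes compared in this exchange pop the
same elements in the same top-of-stack-first order (a sketch reusing the
patch's variables; only the pointer form leaves obj_table advanced
afterwards):

	/* Pointer form used in the patch: both pointers move each iteration. */
	cache_objs = &cache->objs[cache->len];
	cache->len -= n;
	for (index = 0; index < n; index++)
		*obj_table++ = *--cache_objs;

	/* Indexed form from the godbolt experiment mentioned above:
	 * same reverse source order, expressed with array indexing. */
	void **top = &cache->objs[cache->len - 1];
	cache->len -= n;
	for (index = 0; index < n; index++)
		obj_table[index] = top[-index];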
* Re: [PATCH] mempool: optimize get objects with constant n
From: Bruce Richardson <bruce.richardson@intel.com>
Date: 2023-04-18 12:54 UTC
To: Morten Brørup; Cc: olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
[...]
> The mempool cache is a stack, so the copy loop needs to get the objects
> in decrementing order, i.e. the source index decrements and the
> destination index increments.
>
> Regardless, your point here is still valid! I expected that any
> unrolling capable compiler can unroll *dst++ = *--src; but I can
> experiment with different compilers on godbolt.org to see if
> dst[index] = src[-index] is better.
>
> A future version could be hand coded with vector instructions here,
> like in the Ethdev drivers.
>
> > Alternatively, those last two lines can be replaced by a memcpy,
> > which the compiler should nicely optimize itself for a constant size
> > copy:
> >
> > 	memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n);
>
> Just for reference: It was previously discussed to ignore the stack
> ordering and simply copy the objects, but the idea was rejected due to
> the potential performance impact of not returning the hottest objects,
> i.e. the objects at the top of the stack, as the first in the returned
> array.

Ah, yes, I had forgotten that we want the order reversed, so most recent
objects first.

It would be interesting to see if ignoring order within a burst can lead
to some performance improvements, as I expect that in most cases the
buffers in a burst are to be used pretty much simultaneously. However,
based on your feedback, feel free to ignore this comment and keep the
code as-is.

> I agree. This is almost like perl... write-only source code.
>
> I will add a few explanatory comments.

Thanks.

> I get your point, but the code is still short enough to grasp (after
> comments have been added), so I prefer to avoid code duplication.

Sure.
* Re: [PATCH] mempool: optimize get objects with constant n
From: Bruce Richardson <bruce.richardson@intel.com>
Date: 2023-04-18 12:55 UTC
To: Morten Brørup; Cc: olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
[...]
> > This loop looks a little awkward to me. Would it be clearer (and
> > perhaps easier for compilers to unroll efficiently) if it was
> > rewritten as:
> >
> > 	cache->len -= n;
> > 	cache_objs = &cache->objs[cache->len];
> > 	for (index = 0; index < n; index++)
> > 		obj_table[index] = cache_objs[index];
>
> The mempool cache is a stack, so the copy loop needs to get the objects
> in decrementing order, i.e. the source index decrements and the
> destination index increments.

BTW: Please add this as a comment in the code too, above the loop, to
avoid future developers (or even future me) asking this question again!

/Bruce
* Re: [PATCH] mempool: optimize get objects with constant n
From: Thomas Monjalon <thomas@monjalon.net>
Date: 2023-06-07 7:51 UTC
To: Morten Brørup; Cc: dev, olivier.matz, andrew.rybchenko, Bruce Richardson

18/04/2023 14:55, Bruce Richardson:
> On Tue, Apr 18, 2023 at 01:29:49PM +0200, Morten Brørup wrote:
[...]
> > The mempool cache is a stack, so the copy loop needs to get the
> > objects in decrementing order, i.e. the source index decrements and
> > the destination index increments.
>
> BTW: Please add this as a comment in the code too, above the loop, to
> avoid future developers (or even future me) asking this question again!

Looks like this request was missed.
* RE: [PATCH] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-06-07 8:03 UTC
To: Thomas Monjalon, Bruce Richardson; Cc: dev, olivier.matz, andrew.rybchenko

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, 7 June 2023 09.52
>
[...]
> > BTW: Please add this as a comment in the code too, above the loop, to
> > avoid future developers (or even future me) asking this question
> > again!
>
> Looks like this request was missed.

I intentionally omitted it, because I disagree with the suggestion.

Otherwise, reminders that the mempool cache is a stack should be plastered
all over the source code, not just here. For reference, this copy loop
(without such a reminder) also exists elsewhere in the same file.

Apologies for not responding to Bruce's request earlier.
* Re: [PATCH] mempool: optimize get objects with constant n
From: Thomas Monjalon <thomas@monjalon.net>
Date: 2023-06-07 8:10 UTC
To: Morten Brørup; Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko

07/06/2023 10:03, Morten Brørup:
[...]
> I intentionally omitted it, because I disagree with the suggestion.
>
> Otherwise, reminders that the mempool cache is a stack should be
> plastered all over the source code, not just here. For reference, this
> copy loop (without such a reminder) also exists elsewhere in the same
> file.
>
> Apologies for not responding to Bruce's request earlier.

What about doing a general comment at the top of the function,
with the assignment of the pointer at the end of the array:

	/* The cache is a stack, so copy will be in reverse order. */
	cache_objs = &cache->objs[cache->len];

I could do it on apply if there is an agreement.
* RE: [PATCH] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-06-07 8:33 UTC
To: Thomas Monjalon; Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, 7 June 2023 10.10
>
[...]
> What about doing a general comment at the top of the function,
> with the assignment of the pointer at the end of the array:
>
> 	/* The cache is a stack, so copy will be in reverse order. */
> 	cache_objs = &cache->objs[cache->len];
>
> I could do it on apply if there is an agreement.

ACK from me.

For consistency, please also add the reminder to the two existing reverse
order copy loops in rte_mempool_do_generic_get().
* RE: [PATCH] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-06-07 8:41 UTC
To: Thomas Monjalon; Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko

> From: Morten Brørup
> Sent: Wednesday, 7 June 2023 10.34
>
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Wednesday, 7 June 2023 10.10
> >
[...]
> > What about doing a general comment at the top of the function,
> > with the assignment of the pointer at the end of the array:
> >
> > 	/* The cache is a stack, so copy will be in reverse order. */
> > 	cache_objs = &cache->objs[cache->len];
> >
> > I could do it on apply if there is an agreement.
>
> ACK from me.
>
> For consistency, please also add the reminder to the two existing
> reverse order copy loops in rte_mempool_do_generic_get().

Rubbish! I was looking at the wrong version of the code.

So, plain ACK from me. No further comments! Sorry. :-)
* Re: [PATCH] mempool: optimize get objects with constant n
From: Tyler Retzlaff <roretzla@linux.microsoft.com>
Date: 2023-04-18 15:15 UTC
To: Bruce Richardson; Cc: Morten Brørup, olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 12:06:42PM +0100, Bruce Richardson wrote:
> On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote:
[...]
> > -	/* Use the cache as much as we have to return hot objects first */
> > -	len = RTE_MIN(remaining, cache->len);
> >  	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {

don't take a direct dependency on compiler builtins. define a macro so we
don't have to play shotgun surgery later.

also, what is the purpose of using __extension__ here? are you annotating
the use of __builtin_constant_p() or is there more? because if that's the
only reason, i see no need to use __extension__ when already using a
compiler-specific builtin like this; that it is not standard is implied
and enforced by a compile break.

[...]
* RE: [PATCH] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-04-18 15:30 UTC
To: Tyler Retzlaff, Bruce Richardson; Cc: olivier.matz, andrew.rybchenko, dev

> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Tuesday, 18 April 2023 17.15
>
[...]
> > > +	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
>
> don't take a direct dependency on compiler builtins. define a macro so
> we don't have to play shotgun surgery later.
>
> also, what is the purpose of using __extension__ here? are you
> annotating the use of __builtin_constant_p() or is there more? because
> if that's the only reason, i see no need to use __extension__ when
> already using a compiler-specific builtin like this; that it is not
> standard is implied and enforced by a compile break.

ARM 32-bit memcpy() [1] does it this way, so I did the same.

[1]: https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_memcpy_32.h#L122

While I agree that a macro for __builtin_constant_p() would be good, it
belongs in a patch to fix portability, not in this patch.
* Re: [PATCH] mempool: optimize get objects with constant n
From: Tyler Retzlaff <roretzla@linux.microsoft.com>
Date: 2023-04-18 15:44 UTC
To: Morten Brørup; Cc: Bruce Richardson, olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 05:30:27PM +0200, Morten Brørup wrote:
[...]
> ARM 32-bit memcpy() [1] does it this way, so I did the same.
>
> [1]: https://elixir.bootlin.com/dpdk/v23.03/source/lib/eal/arm/include/rte_memcpy_32.h#L122

i see, thanks.

> While I agree that a macro for __builtin_constant_p() would be good, it
> belongs in a patch to fix portability, not in this patch.

i agree it isn't part of this change.

would you mind introducing it as a separate patch and depending on it, or
do you feel that would delay this patch too much? i wouldn't mind doing it
myself, but there is a long merge time on my patches, which means i end up
having to carry the adaptations locally for weeks at a time.

thanks
* RE: [PATCH] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-04-18 15:50 UTC
To: Tyler Retzlaff; Cc: Bruce Richardson, olivier.matz, andrew.rybchenko, dev

> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Tuesday, 18 April 2023 17.45
>
[...]
> > While I agree that a macro for __builtin_constant_p() would be good,
> > it belongs in a patch to fix portability, not in this patch.
>
> i agree it isn't part of this change.
>
> would you mind introducing it as a separate patch and depending on it,
> or do you feel that would delay this patch too much? i wouldn't mind
> doing it myself, but there is a long merge time on my patches, which
> means i end up having to carry the adaptations locally for weeks at a
> time.

I would rather not.

Introducing global macros in rte_common.h usually triggers a lot of
discussion and pushback, and I don't want it to hold back this patch.
* Re: [PATCH] mempool: optimize get objects with constant n
From: Tyler Retzlaff <roretzla@linux.microsoft.com>
Date: 2023-04-18 16:01 UTC
To: Morten Brørup; Cc: Bruce Richardson, olivier.matz, andrew.rybchenko, dev

On Tue, Apr 18, 2023 at 05:50:56PM +0200, Morten Brørup wrote:
[...]
> > would you mind introducing it as a separate patch and depending on
> > it, or do you feel that would delay this patch too much? i wouldn't
> > mind doing it myself, but there is a long merge time on my patches,
> > which means i end up having to carry the adaptations locally for
> > weeks at a time.
>
> I would rather not.
>
> Introducing global macros in rte_common.h usually triggers a lot of
> discussion and pushback, and I don't want it to hold back this patch.

yeah, no kidding. i wish the process was a bit more friendly on the
receiving end. it's unfortunate, because it discourages improvements.

i'll bring a patch for it then.
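For illustration, a wrapper of the kind discussed could look roughly like
the following; the macro name and the non-GNU fallback are assumptions,
not an existing rte_common.h definition:

	/* Hypothetical portability wrapper around the builtin. */
	#if defined(__GNUC__) || defined(__clang__)
	#define RTE_IS_CONSTANT(expr) __extension__(__builtin_constant_p(expr))
	#else
	/* Conservative fallback: treat nothing as a build-time constant,
	 * so callers always take the generic run-time path. */
	#define RTE_IS_CONSTANT(expr) 0
	#endif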
* RE: [PATCH] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-04-18 16:05 UTC
To: Bruce Richardson; Cc: olivier.matz, andrew.rybchenko, dev

> From: Morten Brørup
> Sent: Tuesday, 18 April 2023 13.30
>
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 18 April 2023 13.07
> >
[...]
> > This loop looks a little awkward to me. Would it be clearer (and
> > perhaps easier for compilers to unroll efficiently) if it was
> > rewritten as:
> >
> > 	cache->len -= n;
> > 	cache_objs = &cache->objs[cache->len];
> > 	for (index = 0; index < n; index++)
> > 		obj_table[index] = cache_objs[index];
>
> The mempool cache is a stack, so the copy loop needs to get the objects
> in decrementing order, i.e. the source index decrements and the
> destination index increments.
>
> Regardless, your point here is still valid! I expected that any
> unrolling capable compiler can unroll *dst++ = *--src; but I can
> experiment with different compilers on godbolt.org to see if
> dst[index] = src[-index] is better.

Just for the record... I have now tried experimenting with the
alternative, and it makes no difference.
* [PATCH v3] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-04-18 19:51 UTC
To: dev, olivier.matz, andrew.rybchenko; Cc: bruce.richardson, roretzla

When getting objects from the mempool, the number of objects to get is
often constant at build time.

This patch adds another code path for this case, so the compiler can
optimize more, e.g. unroll the copy loop when the entire request is
satisfied from the cache.

On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
mempool_perf_test with constant n shows an increase in rate_persec by an
average of 17 %, minimum 9.5 %, maximum 24 %.

The code path where the number of objects to get is unknown at build time
remains essentially unchanged.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
v3:
* Rebase to main.
v2:
* Added comments describing why some code is omitted when 'n' is known
  at build time.
* Improved source code readability by explicitly setting 'remaining'
  where relevant, instead of at initialization.
---
 lib/mempool/rte_mempool.h | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index ade0100ec7..7d14dea5c4 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1492,13 +1492,15 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 		unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
-	unsigned int remaining = n;
+	unsigned int remaining;
 	uint32_t index, len;
 	void **cache_objs;
 
 	/* No cache provided */
-	if (unlikely(cache == NULL))
+	if (unlikely(cache == NULL)) {
+		remaining = n;
 		goto driver_dequeue;
+	}
 
 	cache_objs = &cache->objs[cache->len];
@@ -1518,14 +1520,23 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 		return 0;
 	}
 
-	/* Use the cache as much as we have to return hot objects first */
+	/*
+	 * Use the cache as much as we have to return hot objects first.
+	 * If the request size 'n' is known at build time, the above comparison
+	 * ensures that n > cache->len here, so omit RTE_MIN().
+	 */
 	len = __extension__(__builtin_constant_p(n)) ? cache->len :
-		RTE_MIN(remaining, cache->len);
+		RTE_MIN(n, cache->len);
 	cache->len -= len;
-	remaining -= len;
+	remaining = n - len;
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
+	/*
+	 * If the request size 'n' is known at build time, the case
+	 * where the entire request can be satisfied from the cache
+	 * has already been handled above, so omit handling it here.
+	 */
 	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
 		/* The entire request is satisfied from the cache. */
--
2.17.1
* [PATCH v4] mempool: optimize get objects with constant n
From: Morten Brørup <mb@smartsharesystems.com>
Date: 2023-04-18 20:09 UTC
To: dev, olivier.matz, andrew.rybchenko; Cc: bruce.richardson, roretzla

When getting objects from the mempool, the number of objects to get is
often constant at build time.

This patch adds another code path for this case, so the compiler can
optimize more, e.g. unroll the copy loop when the entire request is
satisfied from the cache.

On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0,
mempool_perf_test with constant n shows an increase in rate_persec by an
average of 17 %, minimum 9.5 %, maximum 24 %.

The code path where the number of objects to get is unknown at build time
remains essentially unchanged.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
v4:
* Rebased to main.
v3:
* Tried to rebase to main, but failed to include all my changes.
v2:
* Added comments describing why some code is omitted when 'n' is known
  at build time.
* Improved source code readability by explicitly setting 'remaining'
  where relevant, instead of at initialization.
---
 lib/mempool/rte_mempool.h | 41 +++++++++++++++++++++++++++++++++------
 1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 9f530db24b..7d14dea5c4 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1492,23 +1492,52 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 		unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
-	unsigned int remaining = n;
+	unsigned int remaining;
 	uint32_t index, len;
 	void **cache_objs;
 
 	/* No cache provided */
-	if (unlikely(cache == NULL))
+	if (unlikely(cache == NULL)) {
+		remaining = n;
 		goto driver_dequeue;
+	}
 
-	/* Use the cache as much as we have to return hot objects first */
-	len = RTE_MIN(remaining, cache->len);
 	cache_objs = &cache->objs[cache->len];
+
+	if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
+		/*
+		 * The request size is known at build time, and
+		 * the entire request can be satisfied from the cache,
+		 * so let the compiler unroll the fixed length copy loop.
+		 */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
+		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
+
+		return 0;
+	}
+
+	/*
+	 * Use the cache as much as we have to return hot objects first.
+	 * If the request size 'n' is known at build time, the above comparison
+	 * ensures that n > cache->len here, so omit RTE_MIN().
+	 */
+	len = __extension__(__builtin_constant_p(n)) ? cache->len :
+		RTE_MIN(n, cache->len);
 	cache->len -= len;
-	remaining -= len;
+	remaining = n - len;
 	for (index = 0; index < len; index++)
 		*obj_table++ = *--cache_objs;
 
-	if (remaining == 0) {
+	/*
+	 * If the request size 'n' is known at build time, the case
+	 * where the entire request can be satisfied from the cache
+	 * has already been handled above, so omit handling it here.
+	 */
+	if (!__extension__(__builtin_constant_p(n)) && remaining == 0) {
 		/* The entire request is satisfied from the cache. */
 
 		RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
--
2.17.1
* Re: [PATCH v4] mempool: optimize get objects with constant n
From: Thomas Monjalon <thomas@monjalon.net>
Date: 2023-06-07 9:12 UTC
To: Morten Brørup; Cc: dev, olivier.matz, andrew.rybchenko, bruce.richardson, roretzla

18/04/2023 22:09, Morten Brørup:
> When getting objects from the mempool, the number of objects to get is
> often constant at build time.
>
> This patch adds another code path for this case, so the compiler can
> optimize more, e.g. unroll the copy loop when the entire request is
> satisfied from the cache.
>
[...]
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>

Applied with the suggested comment added, thanks.