DPDK patches and discussions
* [RFC] mempool: rte_mempool_do_generic_get optimizations
@ 2021-12-26 15:34 Morten Brørup
  2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
                   ` (9 more replies)
  0 siblings, 10 replies; 85+ messages in thread
From: Morten Brørup @ 2021-12-26 15:34 UTC (permalink / raw)
  To: Olivier Matz, Andrew Rybchenko, dev

While going through the mempool code for potential optimizations, I found two details in rte_mempool_do_generic_get() that can easily be improved.

Any comments or alternative suggestions?


1. The objects are returned in reverse order. This is silly, and should be optimized.

rte_mempool_do_generic_get() line 1493:

	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	rte_memcpy(obj_table, &cache_objs[cache->len - n], sizeof(void *) * n);
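
For illustration, here is a minimal standalone sketch (plain C, not the actual DPDK code) contrasting the two approaches; the parameter names mirror the snippet above, and plain memcpy() stands in for rte_memcpy():

#include <stdio.h>
#include <string.h>

/* Current behaviour: walk the cache top-down, reversing the burst. */
static void
get_reversed(void **obj_table, void **cache_objs, unsigned int cache_len,
		unsigned int n)
{
	unsigned int index, len;

	for (index = 0, len = cache_len - 1; index < n; ++index, len--, obj_table++)
		*obj_table = cache_objs[len];
}

/* Proposed behaviour: copy the top n cache slots in one forward copy. */
static void
get_forward(void **obj_table, void **cache_objs, unsigned int cache_len,
		unsigned int n)
{
	memcpy(obj_table, &cache_objs[cache_len - n], sizeof(void *) * n);
}

int
main(void)
{
	char objects[] = "01234567";	/* stand-ins for mempool objects */
	void *cache_objs[8], *burst[4];
	unsigned int i;

	for (i = 0; i < 8; i++)
		cache_objs[i] = &objects[i];

	get_reversed(burst, cache_objs, 8, 4);	/* burst: '7' '6' '5' '4' */
	get_forward(burst, cache_objs, 8, 4);	/* burst: '4' '5' '6' '7' (overwrites) */

	for (i = 0; i < 4; i++)
		printf("%c", *(char *)burst[i]);
	printf("\n");	/* prints "4567": same objects, forward order */
	return 0;
}

Both variants take the n most recently cached objects; only the order within the returned burst differs.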


2. The initial screening in rte_mempool_do_generic_get() differs from the initial screening in rte_mempool_do_generic_put().

For reference, rte_mempool_do_generic_put() line 1343:

	/* No cache provided or if put would overflow mem allocated for cache */
	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
		goto ring_enqueue;

Notice how this uses RTE_MEMPOOL_CACHE_MAX_SIZE to determine the maximum burst size into the cache.

Now, rte_mempool_do_generic_get() line 1466:

	/* No cache provided or cannot be satisfied from cache */
	if (unlikely(cache == NULL || n >= cache->size))
		goto ring_dequeue;

	cache_objs = cache->objs;

	/* Can this be satisfied from the cache? */
	if (cache->len < n) {
		/* No. Backfill the cache first, and then fill from it */
		uint32_t req = n + (cache->size - cache->len);

First of all, the cache may already hold up to cache->flushthresh - 1 objects, which is 50 % more than cache->size, so screening for n >= cache->size prevents such requests from being served from the cache, even though they could be!

Second of all, the next step is to check if the cache holds sufficient objects. So the initial screening should only do initial screening. Therefore, I propose changing the initial screening to also use RTE_MEMPOOL_CACHE_MAX_SIZE to determine the maximum burst size from the cache, like in rte_mempool_do_generic_put().

rte_mempool_do_generic_get() line 1466:

-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
		goto ring_dequeue;
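
For illustration, a standalone sketch of the two screening conditions with concrete example numbers (cache size 256 is an arbitrary example; 512 is the RTE_MEMPOOL_CACHE_MAX_SIZE value in rte_mempool.h):

#include <stdio.h>
#include <stdbool.h>

#define RTE_MEMPOOL_CACHE_MAX_SIZE 512	/* value from rte_mempool.h */

/* Current screening: bypass the cache whenever n >= cache->size. */
static bool
old_screening_bypasses_cache(unsigned int n, unsigned int cache_size)
{
	return n >= cache_size;
}

/* Proposed screening: bypass only if n cannot fit in the cache array. */
static bool
new_screening_bypasses_cache(unsigned int n)
{
	return n > RTE_MEMPOOL_CACHE_MAX_SIZE;
}

int
main(void)
{
	unsigned int cache_size = 256;	/* cache->size (example value) */
	unsigned int cache_len = 300;	/* below flushthresh = 1.5 * size = 384 */
	unsigned int n = 256;		/* request; could be served from the cache */

	printf("old screening bypasses the cache: %d (although cache_len=%u >= n)\n",
			old_screening_bypasses_cache(n, cache_size), cache_len);
	printf("new screening bypasses the cache: %d\n",
			new_screening_bypasses_cache(n));
	return 0;
}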


Med venlig hilsen / Kind regards,
-Morten Brørup



* [PATCH] mempool: optimize incomplete cache handling
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
@ 2022-01-06 12:23 ` Morten Brørup
  2022-01-06 16:55   ` Jerin Jacob
  2022-01-14 16:36 ` [PATCH] mempool: fix get objects from mempool with cache Morten Brørup
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-01-06 12:23 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko; +Cc: dev, Morten Brørup

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then.

The incompleteness did not cause any functional bugs, so this patch
could be considered refactoring for the purpose of cleaning up.

This patch completes the update of rte_mempool_do_generic_get() as
follows:

1. A few comments were misplaced or no longer correct.
Some comments have been updated/added/corrected.

2. The code that initially screens the cache request was not updated.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

3. The code flow for satisfying the request from the cache was weird.
The likely code path where the objects are simply served from the cache
was treated as unlikely; now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were
added and subtracted from the cache length; now this code path simply
sets the cache length to its final value.

4. The objects were returned in reverse order.
Returning the objects in reverse order is not necessary, so rte_memcpy()
is now used instead.

This patch also updates/adds/corrects some comments in
rte_mempool_do_generic_put().

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..4c36ad6dd1 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1353,12 +1353,12 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	 *   cache flush threshold) is flushed to the ring.
 	 */
 
-	/* Add elements back into the cache */
+	/* Add the objects to the cache */
 	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
-
 	cache->len += n;
 
 	if (cache->len >= cache->flushthresh) {
+		/* Flush excess objects in the cache to the ring */
 		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
 				cache->len - cache->size);
 		cache->len = cache->size;
@@ -1368,7 +1368,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
@@ -1460,21 +1460,25 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
-	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
 	cache_objs = cache->objs;
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
+	/* Can the request be satisfied from the cache? */
+	if (n <= cache->len) {
+		/* Yes. Simply decrease the cache length */
+		cache->len -= n;
+	} else {
+		/* No. Backfill the cache from the ring first */
+
+		/* Number required to fill the cache + n */
 		uint32_t req = n + (cache->size - cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill the cache from the ring */
 		ret = rte_mempool_ops_dequeue_bulk(mp,
 			&cache->objs[cache->len], req);
 		if (unlikely(ret < 0)) {
@@ -1487,14 +1491,12 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			goto ring_dequeue;
 		}
 
-		cache->len += req;
+		/* Set the length of the backfilled cache - n */
+		cache->len = cache->size;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
-
-	cache->len -= n;
+	/* Get the objects from the cache, at the already decreased offset */
+	rte_memcpy(obj_table, &cache_objs[cache->len], sizeof(void *) * n);
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1505,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {
-- 
2.17.1



* Re: [PATCH] mempool: optimize incomplete cache handling
  2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
@ 2022-01-06 16:55   ` Jerin Jacob
  2022-01-07  8:46     ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Jerin Jacob @ 2022-01-06 16:55 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev

On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then.
>
> The incompleteness did not cause any functional bugs, so this patch
> could be considered refactoring for the purpose of cleaning up.
>
> This patch completes the update of rte_mempool_do_generic_get() as
> follows:
>
> 1. A few comments were misplaced or no longer correct.
> Some comments have been updated/added/corrected.
>
> 2. The code that initially screens the cache request was not updated.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
>
> 3. The code flow for satisfying the request from the cache was weird.
> The likely code path where the objects are simply served from the cache
> was treated as unlikely; now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.
>
> 4. The objects were returned in reverse order.
> Returning the objects in reverse order is not necessary, so rte_memcpy()
> is now used instead.

Have you checked the performance with network workload?
IMO, reverse order makes sense(LIFO vs FIFO).
The LIFO makes the cache warm as the same buffers are reused frequently.


* RE: [PATCH] mempool: optimize incomplete cache handling
  2022-01-06 16:55   ` Jerin Jacob
@ 2022-01-07  8:46     ` Morten Brørup
  2022-01-10  7:26       ` Jerin Jacob
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-01-07  8:46 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Thursday, 6 January 2022 17.55
> 
> On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > A flush threshold for the mempool cache was introduced in DPDK
> version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then.
> >
> > The incompleteness did not cause any functional bugs, so this patch
> > could be considered refactoring for the purpose of cleaning up.
> >
> > This patch completes the update of rte_mempool_do_generic_get() as
> > follows:
> >
> > 1. A few comments were misplaced or no longer correct.
> > Some comments have been updated/added/corrected.
> >
> > 2. The code that initially screens the cache request was not updated.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction
> of
> > the flush threshold. E.g. the cache can hold up to flushthresh
> objects,
> > which is more than its size, so some requests were not served from
> the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a
> cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> >
> > 3. The code flow for satisfying the request from the cache was weird.
> > The likely code path where the objects are simply served from the
> cache
> > was treated as unlikely; now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers
> were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 4. The objects were returned in reverse order.
> > Returning the objects in reverse order is not necessary, so
> rte_memcpy()
> > is now used instead.
> 
> Have you checked the performance with network workload?
> IMO, reverse order makes sense(LIFO vs FIFO).
> The LIFO makes the cache warm as the same buffers are reused
> frequently.

I have not done any performance testing. We probably agree that the only major difference lies in how the objects are returned. And we probably also agree that rte_memcpy() is faster than the copy loop it replaced, especially when n is constant at compile time. So the performance difference mainly depends on the application, which I will discuss below.

Let's first consider LIFO vs. FIFO.

The key argument for the rte_memcpy() optimization is that we are still getting the burst of objects from the top of the stack (LIFO); only the order of the objects inside the burst is not reverse anymore.

Here is an example:

The cache initially contains 8 objects: 01234567.

8 more objects are put into the cache: 89ABCDEF.

The cache now holds: 0123456789ABCDEF.

Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e. we are still getting the 4 objects most recently put into the cache.

Furthermore, if the application is working with fixed size bursts, it will usually put and get the same size burst, i.e. put the burst 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache again.
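
A toy simulation (standalone C, modelling the cache as a stack of single-character object ids; not DPDK code) makes the ordering concrete:

#include <stdio.h>
#include <string.h>

/* Toy model: the mempool cache as a stack of single-character object ids. */
static char cache[32];
static unsigned int cache_len;

static void
cache_put(const char *objs, unsigned int n)
{
	memcpy(&cache[cache_len], objs, n);	/* append at the top of the stack */
	cache_len += n;
}

/* Patched behaviour: take the top n slots, copied in forward order. */
static void
cache_get(char *out, unsigned int n)
{
	cache_len -= n;
	memcpy(out, &cache[cache_len], n);
}

int
main(void)
{
	char burst[9] = { 0 };

	cache_put("01234567", 8);
	cache_put("89ABCDEF", 8);
	cache_get(burst, 4);
	printf("%s\n", burst);	/* prints "CDEF": still the 4 most recent objects */
	return 0;
}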


Here is an example unfavorable scenario:

The cache initially contains 4 objects, which have gone cold: 0123.

4 more objects, which happen to be hot, are put into the cache: 4567.

Getting 8 objects from the cache gives us 01234567 instead of 76543210.

Now, if the application only processes the first 4 of the 8 objects in the burst, it would have benefitted from those objects being the hot 7654 objects instead of the cold 0123 objects.

However, I think that most applications process the complete burst, so I do consider this scenario unlikely.

Similarly, a pipelined application doesn't process objects in reverse order at every other step in the pipeline, even though the previous step in the pipeline most recently touched the last object of the burst.


My overall conclusion was that the benefit of using rte_memcpy() outweighs the disadvantage of the unfavorable scenario, because I consider the probability of the unfavorable scenario occurring very low. But again: it mainly depends on the application.

If anyone disagrees with the risk analysis described above, I will happily provide a version 2 of the patch, where the objects are still returned in reverse order. After all, the rte_memcpy() benefit is relatively small compared to the impact if the unlikely scenario occurs.



* Re: [PATCH] mempool: optimize incomplete cache handling
  2022-01-07  8:46     ` Morten Brørup
@ 2022-01-10  7:26       ` Jerin Jacob
  2022-01-10 10:55         ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Jerin Jacob @ 2022-01-10  7:26 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev

On Fri, Jan 7, 2022 at 2:16 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > Sent: Thursday, 6 January 2022 17.55
> >
> > On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <mb@smartsharesystems.com>
> > wrote:
> > >
> > > A flush threshold for the mempool cache was introduced in DPDK
> > version
> > > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > > then.
> > >
> > > The incompleteness did not cause any functional bugs, so this patch
> > > could be considered refactoring for the purpose of cleaning up.
> > >
> > > This patch completes the update of rte_mempool_do_generic_get() as
> > > follows:
> > >
> > > 1. A few comments were misplaced or no longer correct.
> > > Some comments have been updated/added/corrected.
> > >
> > > 2. The code that initially screens the cache request was not updated.
> > > The initial screening compared the request length to the cache size,
> > > which was correct before, but became irrelevant with the introduction
> > of
> > > the flush threshold. E.g. the cache can hold up to flushthresh
> > objects,
> > > which is more than its size, so some requests were not served from
> > the
> > > cache, even though they could be.
> > > The initial screening has now been corrected to match the initial
> > > screening in rte_mempool_do_generic_put(), which verifies that a
> > cache
> > > is present, and that the length of the request does not overflow the
> > > memory allocated for the cache.
> > >
> > > 3. The code flow for satisfying the request from the cache was weird.
> > > The likely code path where the objects are simply served from the
> > cache
> > > was treated as unlikely; now it is treated as likely.
> > > And in the code path where the cache was backfilled first, numbers
> > were
> > > added and subtracted from the cache length; now this code path simply
> > > sets the cache length to its final value.
> > >
> > > 4. The objects were returned in reverse order.
> > > Returning the objects in reverse order is not necessary, so
> > rte_memcpy()
> > > is now used instead.
> >
> > Have you checked the performance with network workload?
> > IMO, reverse order makes sense(LIFO vs FIFO).
> > The LIFO makes the cache warm as the same buffers are reused
> > frequently.
>
> I have not done any performance testing. We probably agree that the only major difference lies in how the objects are returned. And we probably also agree that rte_memcpy() is faster than the copy loop it replaced, especially when n is constant at compile time. So the performance difference mainly depends on the application, which I will discuss below.
>
> Let's first consider LIFO vs. FIFO.
>
> The key argument for the rte_memcpy() optimization is that we are still getting the burst of objects from the top of the stack (LIFO); only the order of the objects inside the burst is not reverse anymore.
>
> Here is an example:
>
> The cache initially contains 8 objects: 01234567.
>
> 8 more objects are put into the cache: 89ABCDEF.
>
> The cache now holds: 0123456789ABCDEF.

Agree. However, I think it may matter on machines with a smaller L1 cache,
and when the burst size is larger, since that determines what can be in the
L1 cache with this scheme.

I would suggest splitting each performance improvement into a separate
patch, for better tracking and quantification of the performance improvement.

I think the mempool performance test and the tx-only stream mode in testpmd
can quantify the patches.



>
> Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e. we are still getting the 4 objects most recently put into the cache.
>
> Furthermore, if the application is working with fixed size bursts, it will usually put and get the same size burst, i.e. put the burst 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache again.
>
>
> Here is an example unfavorable scenario:
>
> The cache initially contains 4 objects, which have gone cold: 0123.
>
> 4 more objects, which happen to be hot, are put into the cache: 4567.
>
> Getting 8 objects from the cache gives us 01234567 instead of 76543210.
>
> Now, if the application only processes the first 4 of the 8 objects in the burst, it would have benefitted from those objects being the hot 7654 objects instead of the cold 0123 objects.
>
> However, I think that most applications process the complete burst, so I do consider this scenario unlikely.
>
> Similarly, a pipelined application doesn't process objects in reverse order at every other step in the pipeline, even though the previous step in the pipeline most recently touched the last object of the burst.
>
>
> My overall conclusion was that the benefit of using rte_memcpy() outweighs the disadvantage of the unfavorable scenario, because I consider the probability of the unfavorable scenario occurring very low. But again: it mainly depends on the application.
>
> If anyone disagrees with the risk analysis described above, I will happily provide a version 2 of the patch, where the objects are still returned in reverse order. After all, the rte_memcpy() benefit is relatively small compared to the impact if the unlikely scenario occurs.
>


* RE: [PATCH] mempool: optimize incomplete cache handling
  2022-01-10  7:26       ` Jerin Jacob
@ 2022-01-10 10:55         ` Morten Brørup
  0 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-01-10 10:55 UTC (permalink / raw)
  To: Jerin Jacob, Bruce Richardson; +Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev

+Bruce; you seemed interested in my work in this area.

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Monday, 10 January 2022 08.27
> 
> On Fri, Jan 7, 2022 at 2:16 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > Sent: Thursday, 6 January 2022 17.55
> > >
> > > On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup
> <mb@smartsharesystems.com>
> > > wrote:
> > > >
> > > > A flush threshold for the mempool cache was introduced in DPDK
> > > version
> > > > 1.3, but rte_mempool_do_generic_get() was not completely updated
> back
> > > > then.
> > > >
> > > > The incompleteness did not cause any functional bugs, so this
> patch
> > > > could be considered refactoring for the purpose of cleaning up.
> > > >
> > > > This patch completes the update of rte_mempool_do_generic_get()
> as
> > > > follows:
> > > >
> > > > 1. A few comments were misplaced or no longer correct.
> > > > Some comments have been updated/added/corrected.
> > > >
> > > > 2. The code that initially screens the cache request was not
> updated.
> > > > The initial screening compared the request length to the cache
> size,
> > > > which was correct before, but became irrelevant with the
> introduction
> > > of
> > > > the flush threshold. E.g. the cache can hold up to flushthresh
> > > objects,
> > > > which is more than its size, so some requests were not served
> from
> > > the
> > > > cache, even though they could be.
> > > > The initial screening has now been corrected to match the initial
> > > > screening in rte_mempool_do_generic_put(), which verifies that a
> > > cache
> > > > is present, and that the length of the request does not overflow
> the
> > > > memory allocated for the cache.
> > > >
> > > > 3. The code flow for satisfying the request from the cache was
> weird.
> > > > The likely code path where the objects are simply served from the
> > > cache
> > > > was treated as unlikely; now it is treated as likely.
> > > > And in the code path where the cache was backfilled first,
> numbers
> > > were
> > > > added and subtracted from the cache length; now this code path
> simply
> > > > sets the cache length to its final value.
> > > >
> > > > 4. The objects were returned in reverse order.
> > > > Returning the objects in reverse order is not necessary, so
> > > rte_memcpy()
> > > > is now used instead.
> > >
> > > Have you checked the performance with network workload?
> > > IMO, reverse order makes sense(LIFO vs FIFO).
> > > The LIFO makes the cache warm as the same buffers are reused
> > > frequently.
> >
> > I have not done any performance testing. We probably agree that the
> only major difference lies in how the objects are returned. And we
> probably also agree that rte_memcpy() is faster than the copy loop it
> replaced, especially when n is constant at compile time. So the
> performance difference mainly depends on the application, which I will
> discuss below.
> >
> > Let's first consider LIFO vs. FIFO.
> >
> > The key argument for the rte_memcpy() optimization is that we are
> still getting the burst of objects from the top of the stack (LIFO);
> only the order of the objects inside the burst is not reverse anymore.
> >
> > Here is an example:
> >
> > The cache initially contains 8 objects: 01234567.
> >
> > 8 more objects are put into the cache: 89ABCDEF.
> >
> > The cache now holds: 0123456789ABCDEF.
> 
> Agree. However, I think it may matter on machines with a smaller L1
> cache, and when the burst size is larger, since that determines what
> can be in the L1 cache with this scheme.

Good point! Thinking further about it made me realize that the mempool cache flushing algorithm is fundamentally flawed, at least in some cases...


rte_mempool_do_generic_put():

When putting objects into the cache, and the cache length exceeds the flush threshold, the most recent (hot) objects are flushed to the ring, thus leaving the less recent (colder) objects at the top of the cache stack.

Example (cache size: 8, flush threshold: 12, put 8 objects):

Initial cache: 01234567

Cache after putting (hot) objects 89ABCDEF: 0123456789ABCDEF

Cache flush threshold reached. Resulting cache: 01234567

Furthermore, the cache has to be completely depleted before the hot objects that were flushed to the ring are retrieved from the ring again.
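
A toy model of this put path (standalone C, with characters standing in for objects; not the actual DPDK code) shows the effect:

#include <stdio.h>
#include <string.h>

#define CACHE_SIZE	8
#define FLUSH_THRESH	12	/* 1.5 * CACHE_SIZE */

static char cache[3 * CACHE_SIZE];
static unsigned int cache_len = 8;
static char ring[64];			/* stands in for the mempool ring */
static unsigned int ring_len;

/* Toy model of the current put path: append, then flush the excess. */
static void
cache_put(const char *objs, unsigned int n)
{
	memcpy(&cache[cache_len], objs, n);
	cache_len += n;

	if (cache_len >= FLUSH_THRESH) {
		/* The most recently added (hot) objects are flushed. */
		memcpy(&ring[ring_len], &cache[CACHE_SIZE], cache_len - CACHE_SIZE);
		ring_len += cache_len - CACHE_SIZE;
		cache_len = CACHE_SIZE;
	}
}

int
main(void)
{
	memcpy(cache, "01234567", 8);	/* initial (cold) cache content */
	cache_put("89ABCDEF", 8);	/* put 8 hot objects */

	printf("cache: %.*s\n", (int)cache_len, cache);	/* "01234567" (cold) */
	printf("ring : %.*s\n", (int)ring_len, ring);	/* "89ABCDEF" (hot) */
	return 0;
}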


rte_mempool_do_generic_get():

When getting objects from the cache, and the cache does not hold the requested number of objects, the cache will first be backfilled from the ring, thus putting colder objects at the top of the cache stack, and then the objects will be returned from the top of the cache stack, i.e. the backfilled (cold) objects will be returned first.

Example (cache size: 8, get 8 objects):

Initial cache: 0123 (hot or lukewarm)

Cache after backfill to size + requested objects: 0123456789ABCDEF

Returned objects: FEDCBA98 (cold)

Cache after returning objects: 01234567 (i.e. cold objects at the top)
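
And a matching toy model of this get path (same standalone style, not DPDK code):

#include <stdio.h>
#include <string.h>

#define CACHE_SIZE 8

static char cache[2 * CACHE_SIZE];
static unsigned int cache_len = 4;

/* Toy model of the current get path: backfill from the ring, then return
 * the objects from the top of the cache in reverse order. */
static void
cache_get(char *out, const char *ring, unsigned int n)
{
	unsigned int i;

	if (cache_len < n) {
		unsigned int req = n + (CACHE_SIZE - cache_len);

		/* The backfilled (cold) objects land on top of the stack. */
		memcpy(&cache[cache_len], ring, req);
		cache_len += req;
	}
	for (i = 0; i < n; i++)
		out[i] = cache[cache_len - 1 - i];
	cache_len -= n;
}

int
main(void)
{
	char out[9] = { 0 };

	memcpy(cache, "0123", 4);		/* initial (hot or lukewarm) objects */
	cache_get(out, "456789ABCDEF", 8);	/* the ring supplies cold objects */

	printf("returned: %s\n", out);				/* "FEDCBA98": cold first */
	printf("cache   : %.*s\n", (int)cache_len, cache);	/* "01234567" */
	return 0;
}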


> 
> I would suggest splitting each performance improvement into a separate
> patch, for better tracking and quantification of the performance
> improvement.

With the new realizations above, I should reconsider my patch from scratch.

I have also been wondering if the mempool cache size really needs to be configurable, or if it could be a fixed size?

Bruce mentioned in another thread (http://inbox.dpdk.org/dev/Ydv%2FIMz8eIRPSguY@bricha3-MOBL.ger.corp.intel.com/T/#m02cabb25655c08a0980888df8c41aba9ac8dd6ff) that the typical configuration of the cache size is RTE_MEMPOOL_CACHE_MAX_SIZE.

I would dare to say that it probably suffices to configure whether the mempool has a cache or not! The cache_size parameter of rte_mempool_create() is not respected 1:1 anyway (because each per-lcore cache may consume up to 1.5 x cache_size objects from the mempool backing store), so the cache_size parameter could simply be interpreted as non-zero vs. zero to determine whether a cache is wanted or not.
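
(For reference, a minimal and purely illustrative usage sketch showing where the cache_size parameter goes in rte_mempool_create(); the element count and size are arbitrary examples.)

#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_errno.h>
#include <rte_mempool.h>

int
main(int argc, char **argv)
{
	struct rte_mempool *mp;

	if (rte_eal_init(argc, argv) < 0)
		return -1;

	/* The 4th argument is cache_size; because of the flush threshold,
	 * each per-lcore cache may actually hold up to 1.5 * cache_size
	 * objects from the pool's backing store.
	 */
	mp = rte_mempool_create("example_pool",
			8192 - 1,			/* number of elements (example) */
			2048,				/* element size (example) */
			RTE_MEMPOOL_CACHE_MAX_SIZE,	/* per-lcore cache size */
			0,				/* private data size */
			NULL, NULL, NULL, NULL,		/* no constructors */
			rte_socket_id(), 0);
	if (mp == NULL)
		return -rte_errno;

	rte_mempool_free(mp);
	return rte_eal_cleanup();
}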

> 
> I think the mempool performance test and the tx-only stream mode in
> testpmd can quantify the patches.
> 
> 
> 
> >
> > Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e.
> we are still getting the 4 objects most recently put into the cache.
> >
> > Furthermore, if the application is working with fixed size bursts, it
> will usually put and get the same size burst, i.e. put the burst
> 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache
> again.
> >
> >
> > Here is an example unfavorable scenario:
> >
> > The cache initially contains 4 objects, which have gone cold: 0123.
> >
> > 4 more objects, which happen to be hot, are put into the cache: 4567.
> >
> > Getting 8 objects from the cache gives us 01234567 instead of
> 76543210.
> >
> > Now, if the application only processes the first 4 of the 8 objects
> in the burst, it would have benefitted from those objects being the hot
> 7654 objects instead of the cold 0123 objects.
> >
> > However, I think that most applications process the complete burst,
> so I do consider this scenario unlikely.
> >
> > Similarly, a pipelined application doesn't process objects in reverse
> order at every other step in the pipeline, even though the previous
> step in the pipeline most recently touched the last object of the
> burst.
> >
> >
> > My overall conclusion was that the benefit of using rte_memcpy()
> outweighs the disadvantage of the unfavorable scenario, because I
> consider the probability of the unfavorable scenario occurring very
> low. But again: it mainly depends on the application.
> >
> > If anyone disagrees with the risk analysis described above, I will
> happily provide a version 2 of the patch, where the objects are still
> returned in reverse order. After all, the rte_memcpy() benefit is
> relatively small compared to the impact if the unlikely scenario
> occurs.
> >



* [PATCH] mempool: fix get objects from mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
  2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
@ 2022-01-14 16:36 ` Morten Brørup
  2022-01-17 17:35   ` Bruce Richardson
  2022-01-24 15:38   ` Olivier Matz
  2022-01-17 11:52 ` [PATCH] mempool: optimize put objects to " Morten Brørup
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 85+ messages in thread
From: Morten Brørup @ 2022-01-14 16:36 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.

This patch fixes the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the ring.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the ring were returned ahead of the objects from the cache. This is
bad for CPUs with a small L1 cache, which benefit from having the hot
objects first in the returned array. (This is also the reason why
the function returns the objects in reverse order.)
Now, all code paths first return objects from the cache, subsequently
from the ring.

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the ring (instead of only the
number of requested objects minus the objects available in the cache),
and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the ring fails, only
the remaining requested objects are retrieved from the ring.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were
added and subtracted from the cache length; now this code path simply
sets the cache length to its final value.

5. Some comments were not correct anymore.
The comments have been updated.
Most importantly, the description of the successful return value was
inaccurate. Success only returns 0, not >= 0.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 81 ++++++++++++++++++++++++++++-----------
 1 file changed, 59 insertions(+), 22 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..88f1b8b7ab 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
 
 /**
  * @internal Get several objects from the mempool; used internally.
+ *
+ * If cache is enabled, objects are returned from the cache in Last In First
+ * Out (LIFO) order for the benefit of CPUs with small L1 cache.
+ *
  * @param mp
  *   A pointer to the mempool structure.
  * @param obj_table
@@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
  * @param cache
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  * @return
- *   - >=0: Success; number of objects supplied.
+ *   - 0: Success; got n objects.
  *   - <0: Error; code of ring dequeue function.
  */
 static __rte_always_inline int
@@ -1463,38 +1467,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
+	cache_objs = &cache->objs[cache->len];
+
+	if (n <= cache->len) {
+		/* The entire request can be satisfied from the cache. */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
+		return 0;
+	}
+
+	/* Satisfy the first part of the request by depleting the cache. */
+	len = cache->len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
+
+	/* Number of objects remaining to satisfy the request. */
+	len = n - len;
+
+	/* Fill the cache from the ring; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + len);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the ring.
+		 */
+		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
 		if (unlikely(ret < 0)) {
 			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
+			 * That also failed.
+			 * No further action is required to roll the first
+			 * part of the request back into the cache, as both
+			 * cache->len and the objects in the cache are intact.
 			 */
-			goto ring_dequeue;
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
+
+			return ret;
 		}
 
-		cache->len += req;
+		/* Commit that the cache was emptied. */
+		cache->len = 0;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	cache_objs = &cache->objs[cache->size + len];
 
-	cache->len -= n;
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache->len = cache->size;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1540,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring. */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {
-- 
2.17.1



* [PATCH] mempool: optimize put objects to mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
  2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
  2022-01-14 16:36 ` [PATCH] mempool: fix get objects from mempool with cache Morten Brørup
@ 2022-01-17 11:52 ` Morten Brørup
  2022-01-19 14:52 ` [PATCH v2] mempool: fix " Morten Brørup
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-01-17 11:52 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

This patch optimizes the rte_mempool_do_generic_put() caching algorithm.

The algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

(Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
"size" objects, which is reflected in the above description.)

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.

This patch fixes the following two inefficiencies of the old algorithm:

1. The most recent (hot) objects are flushed, leaving the oldest (cold)
objects in the mempool cache.
This is bad for CPUs with a small L1 cache, because when they get
objects from the mempool after the mempool cache has been flushed, they
get cold objects instead of hot objects.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.

2. The cache is still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The current flushing behaviour is suboptimal for real life applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events are out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
NB: When the application is in a state where put/get events are in
balance, the cache should remain within its low and high watermarks, and
the algorithms for refilling/flushing the cache should not come into
play.
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

Not adding the new objects to the mempool cache before flushing it also
allows the memory allocated for the mempool cache to be reduced from 3 x
to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.

Furthermore, a minor bug in the flush threshold comparison has been
corrected; it must be "len > flushthresh", not "len >= flushthresh".
Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
would be flushed already when reaching size elements, not when exceeding
size elements.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.
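
For illustration, a tiny standalone check of the two comparisons, using a flush multiplier of 1 as in the reasoning above (not DPDK code):

#include <stdio.h>

int
main(void)
{
	unsigned int size = 8;
	unsigned int flushthresh = size;	/* flush multiplier of 1, for illustration */
	unsigned int len = size;		/* cache exactly full, but not overfull */

	/* The old comparison flushes a cache that has merely reached its size. */
	printf("len >= flushthresh: %d\n", len >= flushthresh);	/* 1 */
	/* The corrected comparison flushes only when the threshold is exceeded. */
	printf("len >  flushthresh: %d\n", len > flushthresh);	/* 0 */
	return 0;
}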

And finally, using the x86 variant of rte_memcpy() is inefficient here,
where n is relatively small and unknown at compile time.
Now, it has been replaced by an alternative copying method, optimized
for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
or multiples thereof.
The mempool cache is cache line aligned for the benefit of this copying
method, which on some CPU architectures performs worse on data crossing
a cache line boundary.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 47 ++++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..1ce850bedd 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -94,7 +94,8 @@ struct rte_mempool_cache {
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
+	/**< Cache objects */
 } __rte_cache_aligned;
 
 /**
@@ -1344,31 +1345,51 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = cache->objs;
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
+	}
+
+	/* Add the objects to the cache. */
+	for (; n >= 4; n -= 4) {
+#ifdef RTE_ARCH_64
+		rte_mov32((unsigned char *)cache_objs, (const unsigned char *)obj_table);
+#else
+		rte_mov16((unsigned char *)cache_objs, (const unsigned char *)obj_table);
+#endif
+		cache_objs += 4;
+		obj_table += 4;
 	}
+	for (; n > 0; --n)
+		*cache_objs++ = *obj_table++;
 
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
-- 
2.17.1



* Re: [PATCH] mempool: fix get objects from mempool with cache
  2022-01-14 16:36 ` [PATCH] mempool: fix get objects from mempool with cache Morten Brørup
@ 2022-01-17 17:35   ` Bruce Richardson
  2022-01-18  8:25     ` Morten Brørup
  2022-01-24 15:38   ` Olivier Matz
  1 sibling, 1 reply; 85+ messages in thread
From: Bruce Richardson @ 2022-01-17 17:35 UTC (permalink / raw)
  To: Morten Brørup; +Cc: olivier.matz, andrew.rybchenko, jerinjacobk, dev

On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> This patch fixes the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the ring.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the objects
> from the ring were returned ahead of the objects from the cache. This is
> bad for CPUs with a small L1 cache, which benefit from having the hot
> objects first in the returned array. (This is also the reason why
> the function returns the objects in reverse order.)
> Now, all code paths first return objects from the cache, subsequently
> from the ring.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the ring (instead of only the
> number of requested objects minus the objects available in the ring),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the ring fails, only
> the remaining requested objects are retrieved from the ring.
> 
> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.
> 
> 5. Some comments were not correct anymore.
> The comments have been updated.
> Most importantly, the description of the successful return value was
> inaccurate. Success only returns 0, not >= 0.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---

I am a little uncertain about the reversing of the copies taking things out
of the mempool - for machines where we are not that cache constrained will
we lose out in possible optimizations where the compiler optimizes the copy
loop as a memcpy?

Otherwise the logic all looks correct to me.

/Bruce


* RE: [PATCH] mempool: fix get objects from mempool with cache
  2022-01-17 17:35   ` Bruce Richardson
@ 2022-01-18  8:25     ` Morten Brørup
  2022-01-18  9:07       ` Bruce Richardson
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-01-18  8:25 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: olivier.matz, andrew.rybchenko, jerinjacobk, dev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 17 January 2022 18.35
> 
> On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > A flush threshold for the mempool cache was introduced in DPDK
> version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> >
> > This patch fixes the following in rte_mempool_do_generic_get():
> >
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction
> of
> > the flush threshold. E.g. the cache can hold up to flushthresh
> objects,
> > which is more than its size, so some requests were not served from
> the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a
> cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> >
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the ring.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the
> objects
> > from the ring were returned ahead of the objects from the cache. This
> is
> > bad for CPUs with a small L1 cache, which benefit from having the hot
> > objects first in the returned array. (This is also the reason why
> > the function returns the objects in reverse order.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the ring.
> >
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the ring (instead of only the
> > number of requested objects minus the objects available in the ring),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the
> cache,
> > and if the subsequent backfilling of the cache from the ring fails,
> only
> > the remaining requested objects are retrieved from the ring.
> >
> > 4. The code flow for satisfying the request from the cache was
> slightly
> > inefficient:
> > The likely code path where the objects are simply served from the
> cache
> > was treated as unlikely. Now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers
> were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 5. Some comments were not correct anymore.
> > The comments have been updated.
> > Most importantly, the description of the successful return value was
> > inaccurate. Success only returns 0, not >= 0.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> 
> I am a little uncertain about the reversing of the copies taking things
> out
> of the mempool - for machines where we are not that cache constrained
> will
> we lose out in possible optimizations where the compiler optimizes the
> copy
> loop as a memcpy?

The objects are also returned in reverse order in the code it replaces, so this behavior is not introduced by this patch; I only describe the reason for it.

I floated a previous patch, in which the objects were returned in order, but Jerin argued [1] that we should keep it the way it was, unless I could show a performance improvement.

So I retracted that patch to split it up in two independent patches instead. This patch for get(), and [3] for put().

While experimenting with rte_memcpy() for these, I couldn't achieve a performance boost - quite the opposite. So I gave up on it.

Reviewing the x86 variant of rte_memcpy() [2] makes me think that it is inefficient for copying small bulks of pointers, especially when n is unknown at compile time, because its code path goes through a great number of branches.

> 
> Otherwise the logic all looks correct to me.
> 
> /Bruce

[1]: http://inbox.dpdk.org/dev/CALBAE1OjCswxUfaNLWg5y-tnPkFhvvKQ8sJ3JpBoo7ObgeB5OA@mail.gmail.com/
[2]: http://code.dpdk.org/dpdk/latest/source/lib/eal/x86/include/rte_memcpy.h
[3]: http://inbox.dpdk.org/dev/20220117115231.8060-1-mb@smartsharesystems.com/



* Re: [PATCH] mempool: fix get objects from mempool with cache
  2022-01-18  8:25     ` Morten Brørup
@ 2022-01-18  9:07       ` Bruce Richardson
  0 siblings, 0 replies; 85+ messages in thread
From: Bruce Richardson @ 2022-01-18  9:07 UTC (permalink / raw)
  To: Morten Brørup; +Cc: olivier.matz, andrew.rybchenko, jerinjacobk, dev

On Tue, Jan 18, 2022 at 09:25:22AM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Monday, 17 January 2022 18.35
> > 
> > On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > > A flush threshold for the mempool cache was introduced in DPDK
> > version
> > > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > > then, and some inefficiencies were introduced.
> > >
> > > This patch fixes the following in rte_mempool_do_generic_get():
> > >
> > > 1. The code that initially screens the cache request was not updated
> > > with the change in DPDK version 1.3.
> > > The initial screening compared the request length to the cache size,
> > > which was correct before, but became irrelevant with the introduction
> > of
> > > the flush threshold. E.g. the cache can hold up to flushthresh
> > objects,
> > > which is more than its size, so some requests were not served from
> > the
> > > cache, even though they could be.
> > > The initial screening has now been corrected to match the initial
> > > screening in rte_mempool_do_generic_put(), which verifies that a
> > cache
> > > is present, and that the length of the request does not overflow the
> > > memory allocated for the cache.
> > >
> > > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > > behave according to the description of that function.
> > > Specifically, objects must first be returned from the cache,
> > > subsequently from the ring.
> > > After the change in DPDK version 1.3, this was not the behavior when
> > > the request was partially satisfied from the cache; instead, the
> > objects
> > > from the ring were returned ahead of the objects from the cache. This
> > is
> > > bad for CPUs with a small L1 cache, which benefit from having the hot
> > > objects first in the returned array. (This is also the reason why
> > > the function returns the objects in reverse order.)
> > > Now, all code paths first return objects from the cache, subsequently
> > > from the ring.
> > >
> > > 3. If the cache could not be backfilled, the function would attempt
> > > to get all the requested objects from the ring (instead of only the
> > > number of requested objects minus the objects available in the ring),
> > > and the function would fail if that failed.
> > > Now, the first part of the request is always satisfied from the
> > cache,
> > > and if the subsequent backfilling of the cache from the ring fails,
> > only
> > > the remaining requested objects are retrieved from the ring.
> > >
> > > 4. The code flow for satisfying the request from the cache was
> > slightly
> > > inefficient:
> > > The likely code path where the objects are simply served from the
> > cache
> > > was treated as unlikely. Now it is treated as likely.
> > > And in the code path where the cache was backfilled first, numbers
> > were
> > > added and subtracted from the cache length; now this code path simply
> > > sets the cache length to its final value.
> > >
> > > 5. Some comments were not correct anymore.
> > > The comments have been updated.
> > > Most importantly, the description of the successful return value was
> > > inaccurate. Success only returns 0, not >= 0.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > 
> > I am a little uncertain about the reversing of the copies taking things
> > out
> > of the mempool - for machines where we are not that cache constrained
> > will
> > we lose out in possible optimizations where the compiler optimizes the
> > copy
> > loop as a memcpy?
> 
> The objects are also returned in reverse order in the code it replaces, so this behavior is not introduced by this patch; I only describe the reason for it.
> 
> I floated a previous patch, in which the objects were returned in order, but Jerin argued [1] that we should keep it the way it was, unless I could show a performance improvement.
> 
> So I retracted that patch and split it up into two independent patches instead: this patch for get(), and [3] for put().
> 
> While experimenting with rte_memcpy() for these, I couldn't achieve a performance boost - quite the opposite. So I gave up on it.
> 
> Reviewing the x86 variant of rte_memcpy() [2] makes me think that it is inefficient for copying small bulks of pointers, especially when n is unknown at compile time, and its code path goes through a great deal of branches.
>
Thanks for all the explanation.

Reviewed-by: Bruce Richardson <bruce.richardson@intel.com> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v2] mempool: fix put objects to mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
                   ` (2 preceding siblings ...)
  2022-01-17 11:52 ` [PATCH] mempool: optimize put objects to " Morten Brørup
@ 2022-01-19 14:52 ` Morten Brørup
  2022-01-19 15:03 ` [PATCH v3] " Morten Brørup
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-01-19 14:52 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

This patch optimizes the rte_mempool_do_generic_put() caching algorithm,
and fixes a bug in it.

The existing algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
"size" objects, which is reflected in the above description.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.
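
Expressed as code, the new flow is essentially the following (a simplified
sketch of the corresponding hunk in the diff below, leaving out the initial
screening, the statistics and the debug build):

	if (cache->len + n <= cache->flushthresh) {
		/* The new objects fit below the flush threshold:
		 * append them at the current cache length. */
		cache_objs = &cache->objs[cache->len];
		cache->len += n;
	} else {
		/* Adding would cross the threshold: flush the whole
		 * cache to the ring first, keeping only the new objects. */
		rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
		cache_objs = cache->objs;
		cache->len = n;
	}
	/* Copy the n new (hot) objects from obj_table into cache_objs. */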

This patch changes these details:

1. Bug: The cache was still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events is out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
NB: When the application is in a state where put/get events are in
balance, the cache should remain within its low and high watermarks, and
the algorithms for refilling/flushing the cache should not come into
play.
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

2. Minor bug: The flush threshold comparison has been corrected; it must
be "len > flushthresh", not "len >= flushthresh".
Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
would be flushed already when reaching size elements, not when exceeding
size elements.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.

3. Optimization: The most recent (hot) objects are flushed, leaving the
oldest (cold) objects in the mempool cache.
This is bad for CPUs with a small L1 cache, because when they get
objects from the mempool after the mempool cache has been flushed, they
get cold objects instead of hot objects.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.

4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
here, where n is relatively small and unknown at compile time.
Now, it has been replaced by an alternative copying method, optimized
for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
or multiples thereof.
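
For reference, the copying method used in the hunk below can be read as the
following stand-alone helper (a sketch only; the function name is made up
for illustration):

	static inline void
	copy_objs(void **dst, void * const *src, unsigned int n)
	{
		unsigned int i;

		/* Copy four pointers per iteration; PMD burst sizes are
		 * typically multiples of 4, so the tail switch below is
		 * rarely used. */
		for (i = 0; i < (n & ~0x3); i += 4) {
			dst[i] = src[i];
			dst[i + 1] = src[i + 1];
			dst[i + 2] = src[i + 2];
			dst[i + 3] = src[i + 3];
		}
		/* Copy the remaining 0-3 pointers. */
		switch (n & 0x3) {
		case 3:
			dst[i] = src[i];
			i++; /* fallthrough */
		case 2:
			dst[i] = src[i];
			i++; /* fallthrough */
		case 1:
			dst[i] = src[i];
		}
	}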

v2 changes:

- Not adding the new objects to the mempool cache before flushing it
also allows the memory allocated for the mempool cache to be reduced
from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
However, such a change would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
copying method, which on some CPU architectures performs worse on data
crossing a cache boundary.
However, such a change would break the ABI, so it was removed in v2;
and yet another alternative copying method replaced the rte_memcpy().

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 54 +++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..8a7067ee5b 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -94,7 +94,8 @@ struct rte_mempool_cache {
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
+	/**< Cache objects */
 } __rte_cache_aligned;
 
 /**
@@ -1334,6 +1335,7 @@ static __rte_always_inline void
 rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
+	uint32_t index;
 	void **cache_objs;
 
 	/* increment stat now, adding in mempool always success */
@@ -1344,31 +1346,56 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = cache->objs;
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
+	}
+
+	/* Add the objects to the cache. */
+	for (index = 0; index < (n & ~0x3); index += 4) {
+		cache_objs[index] = obj_table[index];
+		cache_objs[index + 1] = obj_table[index + 1];
+		cache_objs[index + 2] = obj_table[index + 2];
+		cache_objs[index + 3] = obj_table[index + 3];
+	}
+	switch (n & 0x3) {
+	case 3:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 2:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 1:
+		cache_objs[index] = obj_table[index];
 	}
 
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
@@ -1377,7 +1404,6 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 #endif
 }
 
-
 /**
  * Put several objects back in the mempool.
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v3] mempool: fix put objects to mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
                   ` (3 preceding siblings ...)
  2022-01-19 14:52 ` [PATCH v2] mempool: fix " Morten Brørup
@ 2022-01-19 15:03 ` Morten Brørup
  2022-01-24 15:39   ` Olivier Matz
  2022-02-02  8:14 ` [PATCH v2] mempool: fix get objects from " Morten Brørup
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-01-19 15:03 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

mempool: fix put objects to mempool with cache

This patch optimizes the rte_mempool_do_generic_put() caching algorithm,
and fixes a bug in it.

The existing algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
"size" objects, which is reflected in the above description.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.

This patch changes these details:

1. Bug: The cache was still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events is out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
NB: When the application is in a state where put/get events are in
balance, the cache should remain within its low and high watermarks, and
the algorithms for refilling/flushing the cache should not come into
play.
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

2. Minor bug: The flush threshold comparison has been corrected; it must
be "len > flushthresh", not "len >= flushthresh".
Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
would be flushed already when reaching size elements, not when exceeding
size elements.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.

3. Optimization: The most recent (hot) objects are flushed, leaving the
oldest (cold) objects in the mempool cache.
This is bad for CPUs with a small L1 cache, because when they get
objects from the mempool after the mempool cache has been flushed, they
get cold objects instead of hot objects.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.

4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
here, where n is relatively small and unknown at compile time.
Now, it has been replaced by an alternative copying method, optimized
for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
or multiples thereof.

v2 changes:

- Not adding the new objects to the mempool cache before flushing it
also allows the memory allocated for the mempool cache to be reduced
from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
However, such a change would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
copying method, which on some CPU architectures performs worse on data
crossing a cache boundary.
However, such a change would break the ABI, so it was removed in v2;
and yet another alternative copying method replaced the rte_memcpy().

v3 changes:

- Actually remove my modifications of the rte_mempool_cache structure.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 51 +++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..7b364cfc74 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1334,6 +1334,7 @@ static __rte_always_inline void
 rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
+	uint32_t index;
 	void **cache_objs;
 
 	/* increment stat now, adding in mempool always success */
@@ -1344,31 +1345,56 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = cache->objs;
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
+	}
+
+	/* Add the objects to the cache. */
+	for (index = 0; index < (n & ~0x3); index += 4) {
+		cache_objs[index] = obj_table[index];
+		cache_objs[index + 1] = obj_table[index + 1];
+		cache_objs[index + 2] = obj_table[index + 2];
+		cache_objs[index + 3] = obj_table[index + 3];
+	}
+	switch (n & 0x3) {
+	case 3:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 2:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 1:
+		cache_objs[index] = obj_table[index];
 	}
 
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
@@ -1377,7 +1403,6 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 #endif
 }
 
-
 /**
  * Put several objects back in the mempool.
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] mempool: fix get objects from mempool with cache
  2022-01-14 16:36 ` [PATCH] mempool: fix get objects from mempool with cache Morten Brørup
  2022-01-17 17:35   ` Bruce Richardson
@ 2022-01-24 15:38   ` Olivier Matz
  2022-01-24 16:11     ` Olivier Matz
  2022-01-28 10:22     ` Morten Brørup
  1 sibling, 2 replies; 85+ messages in thread
From: Olivier Matz @ 2022-01-24 15:38 UTC (permalink / raw)
  To: Morten Brørup; +Cc: andrew.rybchenko, bruce.richardson, jerinjacobk, dev

Hi Morten,

Few comments below.

On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> This patch fixes the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the ring.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the objects
> from the ring were returned ahead of the objects from the cache. This is
> bad for CPUs with a small L1 cache, which benefit from having the hot
> objects first in the returned array. (This is also the reason why
> the function returns the objects in reverse order.)
> Now, all code paths first return objects from the cache, subsequently
> from the ring.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the ring (instead of only the
> number of requested objects minus the objects available in the cache),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the ring fails, only
> the remaining requested objects are retrieved from the ring.

This is the only point I'd consider to be a fix. The problem, from the
user perspective, is that a get() can fail even though there are enough
objects in cache + common pool.

To be honest, I feel a bit uncomfortable having such a list of
problems solved in one commit, even if I understand that they are part
of the same code rework.

Ideally, this fix should be a separate commit. What do you think of
having this simple patch for this fix, and then do the
optimizations/rework in another commit?

  --- a/lib/mempool/rte_mempool.h
  +++ b/lib/mempool/rte_mempool.h
  @@ -1484,7 +1484,22 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
                           * the ring directly. If that fails, we are truly out of
                           * buffers.
                           */
  -                       goto ring_dequeue;
  +                       req = n - cache->len;
  +                       ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, req);
  +                       if (ret < 0) {
  +                               RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
  +                               RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
  +                               return ret;
  +                       }
  +                       obj_table += req;
  +                       len = cache->len;
  +                       while (len > 0)
  +                               *obj_table++ = cache_objs[--len];
  +                       cache->len = 0;
  +                       RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
  +                       RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
  +
  +                       return 0;
                  }
   
                  cache->len += req;

The title of this commit could then be more precise to describe
the solved issue.

> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.
> 
> 5. Some comments were not correct anymore.
> The comments have been updated.
> Most importantly, the description of the successful return value was
> inaccurate. Success only returns 0, not >= 0.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 81 ++++++++++++++++++++++++++++-----------
>  1 file changed, 59 insertions(+), 22 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..88f1b8b7ab 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
>  
>  /**
>   * @internal Get several objects from the mempool; used internally.
> + *
> + * If cache is enabled, objects are returned from the cache in Last In First
> + * Out (LIFO) order for the benefit of CPUs with small L1 cache.
> + *
>   * @param mp
>   *   A pointer to the mempool structure.
>   * @param obj_table
> @@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
>   * @param cache
>   *   A pointer to a mempool cache structure. May be NULL if not needed.
>   * @return
> - *   - >=0: Success; number of objects supplied.
> + *   - 0: Success; got n objects.
>   *   - <0: Error; code of ring dequeue function.
>   */
>  static __rte_always_inline int

I think that part should be in a separate commit too. This is a
documentation fix, which is easily backportable (and should be
backported) (Fixes: af75078fece3 ("first public release")).

> @@ -1463,38 +1467,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	uint32_t index, len;
>  	void **cache_objs;
>  
> -	/* No cache provided or cannot be satisfied from cache */
> -	if (unlikely(cache == NULL || n >= cache->size))
> +	/* No cache provided or if get would overflow mem allocated for cache */
> +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>  		goto ring_dequeue;
>  
> -	cache_objs = cache->objs;
> +	cache_objs = &cache->objs[cache->len];
> +
> +	if (n <= cache->len) {
> +		/* The entire request can be satisfied from the cache. */
> +		cache->len -= n;
> +		for (index = 0; index < n; index++)
> +			*obj_table++ = *--cache_objs;
>  
> -	/* Can this be satisfied from the cache? */
> -	if (cache->len < n) {
> -		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = n + (cache->size - cache->len);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
>  
> -		/* How many do we require i.e. number to fill the cache + the request */
> -		ret = rte_mempool_ops_dequeue_bulk(mp,
> -			&cache->objs[cache->len], req);
> +		return 0;
> +	}
> +
> +	/* Satisfy the first part of the request by depleting the cache. */
> +	len = cache->len;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;
> +
> +	/* Number of objects remaining to satisfy the request. */
> +	len = n - len;
> +
> +	/* Fill the cache from the ring; fetch size + remaining objects. */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> +			cache->size + len);
> +	if (unlikely(ret < 0)) {
> +		/*
> +		 * We are buffer constrained, and not able to allocate
> +		 * cache + remaining.
> +		 * Do not fill the cache, just satisfy the remaining part of
> +		 * the request directly from the ring.
> +		 */
> +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
>  		if (unlikely(ret < 0)) {
>  			/*
> -			 * In the off chance that we are buffer constrained,
> -			 * where we are not able to allocate cache + n, go to
> -			 * the ring directly. If that fails, we are truly out of
> -			 * buffers.
> +			 * That also failed.
> +			 * No furter action is required to roll the first
> +			 * part of the request back into the cache, as both
> +			 * cache->len and the objects in the cache are intact.
>  			 */
> -			goto ring_dequeue;
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> +
> +			return ret;
>  		}
>  
> -		cache->len += req;
> +		/* Commit that the cache was emptied. */
> +		cache->len = 0;
> +
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +
> +		return 0;
>  	}
>  
> -	/* Now fill in the response ... */
> -	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> -		*obj_table = cache_objs[len];
> +	cache_objs = &cache->objs[cache->size + len];
>  
> -	cache->len -= n;
> +	/* Satisfy the remaining part of the request from the filled cache. */
> +	cache->len = cache->size;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;
>  
>  	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
>  	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> @@ -1503,7 +1540,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  
>  ring_dequeue:
>  
> -	/* get remaining objects from ring */
> +	/* Get the objects from the ring. */
>  	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
>  
>  	if (ret < 0) {

About the code itself, it is more readable now, and probably more
efficient. Did you notice any performance change in mempool perf
autotests?

Thanks,
Olivier

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v3] mempool: fix put objects to mempool with cache
  2022-01-19 15:03 ` [PATCH v3] " Morten Brørup
@ 2022-01-24 15:39   ` Olivier Matz
  2022-01-28  9:37     ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Olivier Matz @ 2022-01-24 15:39 UTC (permalink / raw)
  To: Morten Brørup; +Cc: andrew.rybchenko, bruce.richardson, jerinjacobk, dev

Hi Morten,

On Wed, Jan 19, 2022 at 04:03:01PM +0100, Morten Brørup wrote:
> mempool: fix put objects to mempool with cache
> 
> This patch optimizes the rte_mempool_do_generic_put() caching algorithm,
> and fixes a bug in it.

I think we should avoid grouping fixes and optimizations in one
patch. The main reason is that fixes aims to be backported, which
is not the case of optimizations.

> The existing algorithm was:
>  1. Add the objects to the cache
>  2. Anything greater than the cache size (if it crosses the cache flush
>     threshold) is flushed to the ring.
> 
> Please note that the description in the source code said that it kept
> "cache min value" objects after flushing, but the function actually kept
> "size" objects, which is reflected in the above description.
> 
> Now, the algorithm is:
>  1. If the objects cannot be added to the cache without crossing the
>     flush threshold, flush the cache to the ring.
>  2. Add the objects to the cache.
> 
> This patch changes these details:
> 
> 1. Bug: The cache was still full after flushing.
> In the opposite direction, i.e. when getting objects from the cache, the
> cache is refilled to full level when it crosses the low watermark (which
> happens to be zero).
> Similarly, the cache should be flushed to empty level when it crosses
> the high watermark (which happens to be 1.5 x the size of the cache).
> The existing flushing behaviour was suboptimal for real applications,
> because crossing the low or high watermark typically happens when the
> application is in a state where the number of put/get events is out of
> balance, e.g. when absorbing a burst of packets into a QoS queue
> (getting more mbufs from the mempool), or when a burst of packets is
> trickling out from the QoS queue (putting the mbufs back into the
> mempool).
> NB: When the application is in a state where put/get events are in
> balance, the cache should remain within its low and high watermarks, and
> the algorithms for refilling/flushing the cache should not come into
> play.
> Now, the mempool cache is completely flushed when crossing the flush
> threshold, so only the newly put (hot) objects remain in the mempool
> cache afterwards.

I'm not sure we should call this behavior a bug. What is the impact
on applications, from a user perspective? Can it break a use-case, or
have an important performance impact?


> 2. Minor bug: The flush threshold comparison has been corrected; it must
> be "len > flushthresh", not "len >= flushthresh".
> Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
> would be flushed already when reaching size elements, not when exceeding
> size elements.
> Now, flushing is triggered when the flush threshold is exceeded, not
> when reached.

Same here, we should ask ourselves what is the impact before calling
it a bug.


> 3. Optimization: The most recent (hot) objects are flushed, leaving the
> oldest (cold) objects in the mempool cache.
> This is bad for CPUs with a small L1 cache, because when they get
> objects from the mempool after the mempool cache has been flushed, they
> get cold objects instead of hot objects.
> Now, the existing (cold) objects in the mempool cache are flushed before
> the new (hot) objects are added to the mempool cache.
> 
> 4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
> here, where n is relatively small and unknown at compile time.
> Now, it has been replaced by an alternative copying method, optimized
> for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
> or multiples thereof.

For these optimizations, do you have an idea of what is the performance
gain? Ideally (I understand it is not always possible), each optimization
is done separately, and its impact is measured.


> v2 changes:
> 
> - Not adding the new objects to the mempool cache before flushing it
> also allows the memory allocated for the mempool cache to be reduced
> from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
> However, such a change would break the ABI, so it was removed in v2.
> 
> - The mempool cache should be cache line aligned for the benefit of the
> copying method, which on some CPU architectures performs worse on data
> crossing a cache boundary.
> However, such a change would break the ABI, so it was removed in v2;
> and yet another alternative copying method replaced the rte_memcpy().

OK, we may want to keep this in mind for the next abi breakage.


> 
> v3 changes:
> 
> - Actually remove my modifications of the rte_mempool_cache structure.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 51 +++++++++++++++++++++++++++++----------
>  1 file changed, 38 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..7b364cfc74 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1334,6 +1334,7 @@ static __rte_always_inline void
>  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>  			   unsigned int n, struct rte_mempool_cache *cache)
>  {
> +	uint32_t index;
>  	void **cache_objs;
>  
>  	/* increment stat now, adding in mempool always success */
> @@ -1344,31 +1345,56 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>  		goto ring_enqueue;
>  
> -	cache_objs = &cache->objs[cache->len];
> +	/* If the request itself is too big for the cache */
> +	if (unlikely(n > cache->flushthresh))
> +		goto ring_enqueue;
>  
>  	/*
>  	 * The cache follows the following algorithm
> -	 *   1. Add the objects to the cache
> -	 *   2. Anything greater than the cache min value (if it crosses the
> -	 *   cache flush threshold) is flushed to the ring.
> +	 *   1. If the objects cannot be added to the cache without
> +	 *   crossing the flush threshold, flush the cache to the ring.
> +	 *   2. Add the objects to the cache.
>  	 */
>  
> -	/* Add elements back into the cache */
> -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> +	if (cache->len + n <= cache->flushthresh) {
> +		cache_objs = &cache->objs[cache->len];
>  
> -	cache->len += n;
> +		cache->len += n;
> +	} else {
> +		cache_objs = cache->objs;
>  
> -	if (cache->len >= cache->flushthresh) {
> -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> -				cache->len - cache->size);
> -		cache->len = cache->size;
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
> +			rte_panic("cannot put objects in mempool\n");
> +#else
> +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> +#endif
> +		cache->len = n;
> +	}
> +
> +	/* Add the objects to the cache. */
> +	for (index = 0; index < (n & ~0x3); index += 4) {
> +		cache_objs[index] = obj_table[index];
> +		cache_objs[index + 1] = obj_table[index + 1];
> +		cache_objs[index + 2] = obj_table[index + 2];
> +		cache_objs[index + 3] = obj_table[index + 3];
> +	}
> +	switch (n & 0x3) {
> +	case 3:
> +		cache_objs[index] = obj_table[index];
> +		index++; /* fallthrough */
> +	case 2:
> +		cache_objs[index] = obj_table[index];
> +		index++; /* fallthrough */
> +	case 1:
> +		cache_objs[index] = obj_table[index];
>  	}
>  
>  	return;
>  
>  ring_enqueue:
>  
> -	/* push remaining objects in ring */
> +	/* Put the objects into the ring */
>  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
>  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
>  		rte_panic("cannot put objects in mempool\n");
> @@ -1377,7 +1403,6 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>  #endif
>  }
>  
> -
>  /**
>   * Put several objects back in the mempool.
>   *
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] mempool: fix get objects from mempool with cache
  2022-01-24 15:38   ` Olivier Matz
@ 2022-01-24 16:11     ` Olivier Matz
  2022-01-28 10:22     ` Morten Brørup
  1 sibling, 0 replies; 85+ messages in thread
From: Olivier Matz @ 2022-01-24 16:11 UTC (permalink / raw)
  To: Morten Brørup; +Cc: andrew.rybchenko, bruce.richardson, jerinjacobk, dev

On Mon, Jan 24, 2022 at 04:38:58PM +0100, Olivier Matz wrote:
> On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
> >  
> >  /**
> >   * @internal Get several objects from the mempool; used internally.
> > + *
> > + * If cache is enabled, objects are returned from the cache in Last In First
> > + * Out (LIFO) order for the benefit of CPUs with small L1 cache.
> > + *
> >   * @param mp
> >   *   A pointer to the mempool structure.
> >   * @param obj_table
> > @@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
> >   * @param cache
> >   *   A pointer to a mempool cache structure. May be NULL if not needed.
> >   * @return
> > - *   - >=0: Success; number of objects supplied.
> > + *   - 0: Success; got n objects.
> >   *   - <0: Error; code of ring dequeue function.
> >   */
> >  static __rte_always_inline int
> 
> I think that part should be in a separate commit too. This is a
> documentation fix, which is easily backportable (and should be
> backported) (Fixes: af75078fece3 ("first public release")).

I see that the same change is also part of this commit:
https://patches.dpdk.org/project/dpdk/patch/20211223100741.21292-1-chenzhiheng0227@gmail.com/

I think it is better to have a doc fix commit, and remove this chunk
from this patch.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v3] mempool: fix put objects to mempool with cache
  2022-01-24 15:39   ` Olivier Matz
@ 2022-01-28  9:37     ` Morten Brørup
  0 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-01-28  9:37 UTC (permalink / raw)
  To: Olivier Matz; +Cc: andrew.rybchenko, bruce.richardson, jerinjacobk, dev

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Monday, 24 January 2022 16.39
> 
> Hi Morten,
> 
> On Wed, Jan 19, 2022 at 04:03:01PM +0100, Morten Brørup wrote:
> > mempool: fix put objects to mempool with cache
> >
> > This patch optimizes the rte_mempool_do_generic_put() caching
> algorithm,
> > and fixes a bug in it.
> 
> I think we should avoid grouping fixes and optimizations in one
> patch. The main reason is that fixes aims to be backported, which
> is not the case of optimizations.

OK. I'll separate them.

> 
> > The existing algorithm was:
> >  1. Add the objects to the cache
> >  2. Anything greater than the cache size (if it crosses the cache
> flush
> >     threshold) is flushed to the ring.
> >
> > Please note that the description in the source code said that it kept
> > "cache min value" objects after flushing, but the function actually
> kept
> > "size" objects, which is reflected in the above description.
> >
> > Now, the algorithm is:
> >  1. If the objects cannot be added to the cache without crossing the
> >     flush threshold, flush the cache to the ring.
> >  2. Add the objects to the cache.
> >
> > This patch changes these details:
> >
> > 1. Bug: The cache was still full after flushing.
> > In the opposite direction, i.e. when getting objects from the cache,
> the
> > cache is refilled to full level when it crosses the low watermark
> (which
> > happens to be zero).
> > Similarly, the cache should be flushed to empty level when it crosses
> > the high watermark (which happens to be 1.5 x the size of the cache).
> > The existing flushing behaviour was suboptimal for real applications,
> > because crossing the low or high watermark typically happens when the
> > application is in a state where the number of put/get events is out
> of
> > balance, e.g. when absorbing a burst of packets into a QoS queue
> > (getting more mbufs from the mempool), or when a burst of packets is
> > trickling out from the QoS queue (putting the mbufs back into the
> > mempool).
> > NB: When the application is in a state where put/get events are in
> > balance, the cache should remain within its low and high watermarks,
> and
> > the algorithms for refilling/flushing the cache should not come into
> > play.
> > Now, the mempool cache is completely flushed when crossing the flush
> > threshold, so only the newly put (hot) objects remain in the mempool
> > cache afterwards.
> 
> I'm not sure we should call this behavior a bug. What is the impact
> on applications, from a user perspective? Can it break a use-case, or
> have an important performance impact?

It doesn't break anything.

But it doesn't behave as intended (according to its description in the source code), so I do consider it a bug! Any professional tester, when seeing an implementation that doesn't do what is intended, would also flag the implementation as faulty.

It has a performance impact: It causes many more mempool cache flushes than intended. I have elaborated with an example here: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86E54@smartserver.smartshare.dk/T/#t
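
To give a hypothetical illustration here as well (assumed numbers, not the exact example from the link): with a cache size of 32, the flush threshold is 48. If the application only puts, in bursts of 8 objects, the current code reaches the threshold after every second burst and enqueues 16 objects to the ring each time; with the patch, the cache is only flushed when a burst would push it past the threshold, i.e. after every sixth burst, enqueueing 48 objects at once. The same number of objects reach the ring either way, but the number of ring accesses drops to a third.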

> 
> 
> > 2. Minor bug: The flush threshold comparison has been corrected; it
> must
> > be "len > flushthresh", not "len >= flushthresh".
> > Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
> > would be flushed already when reaching size elements, not when
> exceeding
> > size elements.
> > Now, flushing is triggered when the flush threshold is exceeded, not
> > when reached.
> 
> Same here, we should ask ourselves what is the impact before calling
> it a bug.

It's a classic off-by-one bug.

It only impacts performance, causing premature mempool cache flushing.

Referring to my example in the RFC discussion, this bug causes flushing every 3rd application put() instead of every 4th.

> 
> 
> > 3. Optimization: The most recent (hot) objects are flushed, leaving
> the
> > oldest (cold) objects in the mempool cache.
> > This is bad for CPUs with a small L1 cache, because when they get
> > objects from the mempool after the mempool cache has been flushed,
> they
> > get cold objects instead of hot objects.
> > Now, the existing (cold) objects in the mempool cache are flushed
> before
> > the new (hot) objects are added to the mempool cache.
> >
> > 4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
> > here, where n is relatively small and unknown at compile time.
> > Now, it has been replaced by an alternative copying method, optimized
> > for the fact that most Ethernet PMDs operate in bursts of 4 or 8
> mbufs
> > or multiples thereof.
> 
> For these optimizations, do you have an idea of what is the performance
> gain? Ideally (I understand it is not always possible), each
> optimization
> is done separately, and its impact is measured.

Regarding 3: I don't have access to hardware with a CPU with a small L1 cache. But the algorithm was structurally wrong, so I think it should be fixed. Not working with such hardware ourselves, I labeled it an "optimization"... if the patch came from someone with affected hardware, it could reasonably have been labeled a "bug fix".

Regarding 4: I'll stick with rte_memcpy() in the "fix" patch, and provide a separate optimization patch with performance information.

> 
> 
> > v2 changes:
> >
> > - Not adding the new objects to the mempool cache before flushing it
> > also allows the memory allocated for the mempool cache to be reduced
> > from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
> > However, such a change would break the ABI, so it was removed in
> v2.
> >
> > - The mempool cache should be cache line aligned for the benefit of
> the
> > copying method, which on some CPU architectures performs worse on
> data
> > crossing a cache boundary.
> > However, such a change would break the ABI, so it was removed in
> v2;
> > and yet another alternative copying method replaced the rte_memcpy().
> 
> OK, we may want to keep this in mind for the next abi breakage.

Sounds good.

> 
> 
> >
> > v3 changes:
> >
> > - Actually remove my modifications of the rte_mempool_cache
> structure.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/mempool/rte_mempool.h | 51 +++++++++++++++++++++++++++++--------
> --
> >  1 file changed, 38 insertions(+), 13 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..7b364cfc74 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1334,6 +1334,7 @@ static __rte_always_inline void
> >  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const
> *obj_table,
> >  			   unsigned int n, struct rte_mempool_cache *cache)
> >  {
> > +	uint32_t index;
> >  	void **cache_objs;
> >
> >  	/* increment stat now, adding in mempool always success */
> > @@ -1344,31 +1345,56 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
> >  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_enqueue;
> >
> > -	cache_objs = &cache->objs[cache->len];
> > +	/* If the request itself is too big for the cache */
> > +	if (unlikely(n > cache->flushthresh))
> > +		goto ring_enqueue;
> >
> >  	/*
> >  	 * The cache follows the following algorithm
> > -	 *   1. Add the objects to the cache
> > -	 *   2. Anything greater than the cache min value (if it crosses
> the
> > -	 *   cache flush threshold) is flushed to the ring.
> > +	 *   1. If the objects cannot be added to the cache without
> > +	 *   crossing the flush threshold, flush the cache to the ring.
> > +	 *   2. Add the objects to the cache.
> >  	 */
> >
> > -	/* Add elements back into the cache */
> > -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> > +	if (cache->len + n <= cache->flushthresh) {
> > +		cache_objs = &cache->objs[cache->len];
> >
> > -	cache->len += n;
> > +		cache->len += n;
> > +	} else {
> > +		cache_objs = cache->objs;
> >
> > -	if (cache->len >= cache->flushthresh) {
> > -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> > -				cache->len - cache->size);
> > -		cache->len = cache->size;
> > +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> > +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> >len) < 0)
> > +			rte_panic("cannot put objects in mempool\n");
> > +#else
> > +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > +#endif
> > +		cache->len = n;
> > +	}
> > +
> > +	/* Add the objects to the cache. */
> > +	for (index = 0; index < (n & ~0x3); index += 4) {
> > +		cache_objs[index] = obj_table[index];
> > +		cache_objs[index + 1] = obj_table[index + 1];
> > +		cache_objs[index + 2] = obj_table[index + 2];
> > +		cache_objs[index + 3] = obj_table[index + 3];
> > +	}
> > +	switch (n & 0x3) {
> > +	case 3:
> > +		cache_objs[index] = obj_table[index];
> > +		index++; /* fallthrough */
> > +	case 2:
> > +		cache_objs[index] = obj_table[index];
> > +		index++; /* fallthrough */
> > +	case 1:
> > +		cache_objs[index] = obj_table[index];
> >  	}
> >
> >  	return;
> >
> >  ring_enqueue:
> >
> > -	/* push remaining objects in ring */
> > +	/* Put the objects into the ring */
> >  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> >  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
> >  		rte_panic("cannot put objects in mempool\n");
> > @@ -1377,7 +1403,6 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
> >  #endif
> >  }
> >
> > -
> >  /**
> >   * Put several objects back in the mempool.
> >   *
> > --
> > 2.17.1
> >


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH] mempool: fix get objects from mempool with cache
  2022-01-24 15:38   ` Olivier Matz
  2022-01-24 16:11     ` Olivier Matz
@ 2022-01-28 10:22     ` Morten Brørup
  1 sibling, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-01-28 10:22 UTC (permalink / raw)
  To: Olivier Matz; +Cc: andrew.rybchenko, bruce.richardson, jerinjacobk, dev

Olivier, thank you for the detailed feedback on my mempool RFC and patches.

We might disagree on some points, but that is the point of having a discussion. :-)

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Monday, 24 January 2022 16.39
> 
> Hi Morten,
> 
> Few comments below.
> 
> On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > A flush threshold for the mempool cache was introduced in DPDK
> version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> >
> > This patch fixes the following in rte_mempool_do_generic_get():
> >
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction
> of
> > the flush threshold. E.g. the cache can hold up to flushthresh
> objects,
> > which is more than its size, so some requests were not served from
> the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a
> cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.

This bug will cause a major performance degradation in a scenario where the application burst length is the same as the cache size. In this case, the objects are never fetched from the mempool cache.

This scenario occurs if an application has configured a mempool with a cache size matching the application's burst size. Do any such applications exist? I don't know.
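
To illustrate with assumed numbers: with a cache size of 32 (flush threshold 48) and an application burst of 32 objects, the old screening "n >= cache->size" is true for every request, so every get() goes straight to the ring, even when the cache holds objects. With the corrected screening, the ring is only accessed directly when n exceeds the memory allocated for the cache (RTE_MEMPOOL_CACHE_MAX_SIZE).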

> >
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the ring.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the
> objects
> > from the ring were returned ahead of the objects from the cache. This
> is
> > bad for CPUs with a small L1 cache, which benefit from having the hot
> > objects first in the returned array. (This is also the reason why
> > the function returns the objects in reverse order.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the ring.

Formally, the function is buggy when it isn't doing what it is supposed to.

But yes, it only has a performance impact. And perhaps mostly on low-end hardware.
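
A small illustration of the intended behavior: if the cache holds the objects [A, B, C, D], with D put most recently, a get() of two objects should return D and C, so the most recently freed (hot) objects are reused first. With the old partial-fill path, the objects dequeued from the ring ended up ahead of the cached (hot) objects in the returned array.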

> >
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the ring (instead of only the
> > number of requested objects minus the objects available in the cache),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the
> cache,
> > and if the subsequent backfilling of the cache from the ring fails,
> only
> > the remaining requested objects are retrieved from the ring.
> 
> This is the only point I'd consider to be a fix. The problem, from the
> user perspective, is that a get() can fail even though there are enough
> objects in cache + common pool.
> 
> To be honest, I feel a bit uncomfortable having such a list of
> problems solved in one commit, even if I understand that they are part
> of the same code rework.
> 
> Ideally, this fix should be a separate commit. What do you think of
> having this simple patch for this fix, and then do the
> optimizations/rework in another commit?
> 
>   --- a/lib/mempool/rte_mempool.h
>   +++ b/lib/mempool/rte_mempool.h
>   @@ -1484,7 +1484,22 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>                            * the ring directly. If that fails, we are
> truly out of
>                            * buffers.
>                            */
>   -                       goto ring_dequeue;
>   +                       req = n - cache->len;
>   +                       ret = rte_mempool_ops_dequeue_bulk(mp,
> obj_table, req);
>   +                       if (ret < 0) {
>   +                               RTE_MEMPOOL_STAT_ADD(mp,
> get_fail_bulk, 1);
>   +                               RTE_MEMPOOL_STAT_ADD(mp,
> get_fail_objs, n);
>   +                               return ret;
>   +                       }
>   +                       obj_table += req;
>   +                       len = cache->len;
>   +                       while (len > 0)
>   +                               *obj_table++ = cache_objs[--len];
>   +                       cache->len = 0;
>   +                       RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk,
> 1);
>   +                       RTE_MEMPOOL_STAT_ADD(mp, get_success_objs,
> n);
>   +
>   +                       return 0;
>                   }
> 
>                   cache->len += req;
> 
> The title of this commit could then be more precise to describe
> the solved issue.

I get your point here, but I still consider the other modifications bug fixes too, so a unified rework patch works better for me.

As you also noticed yourself, the resulting code is clear and easily readable. If that was not the case, I might have agreed to break it up in a couple of steps.

> 
> > 4. The code flow for satisfying the request from the cache was
> slightly
> > inefficient:
> > The likely code path where the objects are simply served from the
> cache
> > was treated as unlikely. Now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers
> were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 5. Some comments were not correct anymore.
> > The comments have been updated.
> > Most importantly, the description of the successful return value was
> > inaccurate. Success only returns 0, not >= 0.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/mempool/rte_mempool.h | 81 ++++++++++++++++++++++++++++---------
> --
> >  1 file changed, 59 insertions(+), 22 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..88f1b8b7ab 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void
> *obj)
> >
> >  /**
> >   * @internal Get several objects from the mempool; used internally.
> > + *
> > + * If cache is enabled, objects are returned from the cache in Last
> In First
> > + * Out (LIFO) order for the benefit of CPUs with small L1 cache.
> > + *
> >   * @param mp
> >   *   A pointer to the mempool structure.
> >   * @param obj_table
> > @@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void
> *obj)
> >   * @param cache
> >   *   A pointer to a mempool cache structure. May be NULL if not
> needed.
> >   * @return
> > - *   - >=0: Success; number of objects supplied.
> > + *   - 0: Success; got n objects.
> >   *   - <0: Error; code of ring dequeue function.
> >   */
> >  static __rte_always_inline int
> 
> I think that part should be in a separate commit too. This is a
> documentation fix, which is easily backportable (and should be
> backported) (Fixes: af75078fece3 ("first public release")).

OK. I'll take this out from the patch. And as you mentioned in your follow-up, this is also addressed by another commit, so I'll leave it at that.

> 
> > @@ -1463,38 +1467,71 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
> >  	uint32_t index, len;
> >  	void **cache_objs;
> >
> > -	/* No cache provided or cannot be satisfied from cache */
> > -	if (unlikely(cache == NULL || n >= cache->size))
> > +	/* No cache provided or if get would overflow mem allocated for
> cache */
> > +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_dequeue;
> >
> > -	cache_objs = cache->objs;
> > +	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (n <= cache->len) {
> > +		/* The entire request can be satisfied from the cache. */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> >
> > -	/* Can this be satisfied from the cache? */
> > -	if (cache->len < n) {
> > -		/* No. Backfill the cache first, and then fill from it */
> > -		uint32_t req = n + (cache->size - cache->len);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> >
> > -		/* How many do we require i.e. number to fill the cache +
> the request */
> > -		ret = rte_mempool_ops_dequeue_bulk(mp,
> > -			&cache->objs[cache->len], req);
> > +		return 0;
> > +	}
> > +
> > +	/* Satisfy the first part of the request by depleting the cache.
> */
> > +	len = cache->len;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> > +
> > +	/* Number of objects remaining to satisfy the request. */
> > +	len = n - len;
> > +
> > +	/* Fill the cache from the ring; fetch size + remaining objects.
> */
> > +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > +			cache->size + len);
> > +	if (unlikely(ret < 0)) {
> > +		/*
> > +		 * We are buffer constrained, and not able to allocate
> > +		 * cache + remaining.
> > +		 * Do not fill the cache, just satisfy the remaining part
> of
> > +		 * the request directly from the ring.
> > +		 */
> > +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
> >  		if (unlikely(ret < 0)) {
> >  			/*
> > -			 * In the off chance that we are buffer constrained,
> > -			 * where we are not able to allocate cache + n, go to
> > -			 * the ring directly. If that fails, we are truly out
> of
> > -			 * buffers.
> > +			 * That also failed.
> > +			 * No furter action is required to roll the first
> > +			 * part of the request back into the cache, as both
> > +			 * cache->len and the objects in the cache are
> intact.
> >  			 */
> > -			goto ring_dequeue;
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> > +
> > +			return ret;
> >  		}
> >
> > -		cache->len += req;
> > +		/* Commit that the cache was emptied. */
> > +		cache->len = 0;
> > +
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > +
> > +		return 0;
> >  	}
> >
> > -	/* Now fill in the response ... */
> > -	for (index = 0, len = cache->len - 1; index < n; ++index, len--,
> obj_table++)
> > -		*obj_table = cache_objs[len];
> > +	cache_objs = &cache->objs[cache->size + len];
> >
> > -	cache->len -= n;
> > +	/* Satisfy the remaining part of the request from the filled
> cache. */
> > +	cache->len = cache->size;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> >
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > @@ -1503,7 +1540,7 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
> >
> >  ring_dequeue:
> >
> > -	/* get remaining objects from ring */
> > +	/* Get the objects from the ring. */
> >  	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
> >
> >  	if (ret < 0) {
> 
> About the code itself, it is more readable now, and probably more
> efficient. Did you notice any performance change in mempool perf
> autotests ?

No significant performance change in my test environment. Probably also because mempool_perf_autotest doesn't test flushing/refilling the cache.
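
Something along these lines (a sketch only, not an actual autotest; it assumes a mempool "mp" with a per-lcore cache of 256 objects) would exercise those paths:

	void *objs[1024];
	unsigned int i, got = 0;

	/* Getting in bursts of 32 repeatedly empties and refills the cache... */
	for (i = 0; i < 1024 / 32; i++) {
		if (rte_mempool_get_bulk(mp, &objs[got], 32) != 0)
			break;
		got += 32;
	}
	/* ...and putting back in bursts of 32 repeatedly crosses the flush
	 * threshold, triggering cache flushes to the ring. */
	for (i = 0; i < got; i += 32)
		rte_mempool_put_bulk(mp, &objs[i], 32);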

> 
> Thanks,
> Olivier


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v2] mempool: fix get objects from mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
                   ` (4 preceding siblings ...)
  2022-01-19 15:03 ` [PATCH v3] " Morten Brørup
@ 2022-02-02  8:14 ` Morten Brørup
  2022-06-15 21:18   ` Morten Brørup
                     ` (4 more replies)
  2022-02-02 10:33 ` [PATCH v4] mempool: fix mempool cache flushing algorithm Morten Brørup
                   ` (3 subsequent siblings)
  9 siblings, 5 replies; 85+ messages in thread
From: Morten Brørup @ 2022-02-02  8:14 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.

This patch fixes the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

This bug caused a major performance degradation in scenarios where the
application burst length is the same as the cache size. In such cases,
the objects were never fetched from the mempool cache, even though they
could have been.
This scenario occurs e.g. if an application has configured a mempool
with a size matching the application's burst size.
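
For illustration, a minimal sketch of such a configuration (the numbers are only examples):

	/* Cache size equals the application burst size (32), so with the old
	 * screening (n >= cache->size) every such get bypassed the cache. */
	struct rte_mempool *mp = rte_pktmbuf_pool_create("mbuf_pool", 8191, 32,
			0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
	struct rte_mbuf *burst[32];

	if (mp != NULL && rte_pktmbuf_alloc_bulk(mp, burst, 32) == 0)
		rte_pktmbuf_free_bulk(burst, 32);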

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the ring.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the ring were returned ahead of the objects from the cache.
This bug degraded application performance on CPUs with a small L1 cache,
which benefit from having the hot objects first in the returned array.
(This is probably also the reason why the function returns the objects
in reverse order, which it still does.)
Now, all code paths first return objects from the cache, subsequently
from the ring.

The function was not behaving as described by the function using it,
nor as expected by the applications calling it. This in itself is also a bug.

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the ring (instead of only the
number of requested objects minus the objects already served from the
cache), and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the ring fails, only
the remaining requested objects are retrieved from the ring.

The function could thus fail even though there were enough objects in
the cache plus the common pool.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.
And in the code path where the cache was backfilled first, intermediate
values were added to and subtracted from the cache length; now this code
path simply sets the cache length to its final value.

v2 changes
- Do not modify description of return value. This belongs in a separate
doc fix.
- Elaborate even more on which bugs the modifications fix.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 75 ++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 21 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..2898c690b0 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1463,38 +1463,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
+	cache_objs = &cache->objs[cache->len];
+
+	if (n <= cache->len) {
+		/* The entire request can be satisfied from the cache. */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+		return 0;
+	}
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
+	/* Satisfy the first part of the request by depleting the cache. */
+	len = cache->len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
+
+	/* Number of objects remaining to satisfy the request. */
+	len = n - len;
+
+	/* Fill the cache from the ring; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + len);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the ring.
+		 */
+		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
 		if (unlikely(ret < 0)) {
 			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
+			 * That also failed.
+			 * No further action is required to roll the first
+			 * part of the request back into the cache, as both
+			 * cache->len and the objects in the cache are intact.
 			 */
-			goto ring_dequeue;
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
+
+			return ret;
 		}
 
-		cache->len += req;
+		/* Commit that the cache was emptied. */
+		cache->len = 0;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	cache_objs = &cache->objs[cache->size + len];
 
-	cache->len -= n;
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache->len = cache->size;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1536,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring. */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v4] mempool: fix mempool cache flushing algorithm
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
                   ` (5 preceding siblings ...)
  2022-02-02  8:14 ` [PATCH v2] mempool: fix get objects from " Morten Brørup
@ 2022-02-02 10:33 ` Morten Brørup
  2022-04-07  9:04   ` Morten Brørup
                     ` (3 more replies)
  2022-10-04 12:53 ` [PATCH v3] mempool: fix get objects from mempool with cache Andrew Rybchenko
                   ` (2 subsequent siblings)
  9 siblings, 4 replies; 85+ messages in thread
From: Morten Brørup @ 2022-02-02 10:33 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

This patch fixes the rte_mempool_do_generic_put() caching algorithm,
which was fundamentally wrong, causing multiple performance issues when
flushing.

Although the bugs do have serious performance implications when
flushing, the function did not fail when flushing (or otherwise).
Backporting could be considered optional.

The algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
the cache full after flushing, which the above description reflects.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.
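
In condensed form (simplified; the full diff is below), the change is:

	/* Old behaviour: add the objects first, then flush down to "size". */
	rte_memcpy(&cache->objs[cache->len], obj_table, sizeof(void *) * n);
	cache->len += n;
	if (cache->len >= cache->flushthresh) {
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
				cache->len - cache->size);
		cache->len = cache->size;
	}

	/* New behaviour: flush the entire cache first (if needed), then add. */
	if (cache->len + n > cache->flushthresh) {
		rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
		cache->len = 0;
	}
	rte_memcpy(&cache->objs[cache->len], obj_table, sizeof(void *) * n);
	cache->len += n;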

This patch fixes these bugs:

1. The cache was still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events are out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

This bug degraded performance by causing too frequent flushing.

Consider this application scenario:

Either, an lcore thread in the application is in a state of balance,
where it uses the mempool cache within its flush/refill boundaries; in
this situation, the flush method is less important, and this fix is
irrelevant.

Or, an lcore thread in the application is out of balance (either
permanently or temporarily), and mostly gets or puts objects from/to the
mempool. If it mostly puts objects, not flushing all of the objects will
cause more frequent flushing. This is the scenario addressed by this
fix. E.g.:

Cache size=256, flushthresh=384 (1.5x size), initial len=256;
application burst len=32.

If there are "size" objects in the cache after flushing, the cache is
flushed at every 4th burst.

If the cache is flushed completely, the cache is only flushed at every
16th burst.

As you can see, this bug caused the cache to be flushed 4x too
frequently in this example.

And when/if the application thread breaks its pattern of continuously
putting objects, and suddenly starts to get objects instead, it will
either get objects already in the cache, or the get() function will
refill the cache.

The concept of not flushing the cache completely was probably based on
an assumption that it is more likely for an application's lcore thread
to get() after flushing than to put() after flushing.
I strongly disagree with this assumption! If an application thread is
continuously putting so much that it overflows the cache, it is much
more likely to keep putting than it is to start getting. If in doubt,
consider how CPU branch predictors work: When the application has done
something many times consecutively, the branch predictor will expect the
application to do the same again, rather than suddenly do something
else.

Also, if you consider the description of the algorithm in the source
code, and agree that "cache min value" cannot mean "cache size", the
function did not behave as intended. This in itself is a bug.

2. The flush threshold comparison was off by one.
It must be "len > flushthresh", not "len >= flushthresh".
Consider a flush multiplier of 1 instead of 1.5; the cache would be
flushed already when reaching size objects, not when exceeding size
objects. In other words, the cache would not be able to hold "size"
objects, which is clearly a bug.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.

This bug degraded performance due to premature flushing. In my example
above, this bug caused flushing every 3rd burst instead of every 4th.

3. The most recent (hot) objects were flushed, leaving the oldest (cold)
objects in the mempool cache.
This bug degraded performance, because flushing prevented immediate
reuse of the (hot) objects already in the CPU cache.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.

4. With RTE_LIBRTE_MEMPOOL_DEBUG defined, the return value of
rte_mempool_ops_enqueue_bulk() was not checked when flushing the cache.
Now, it is checked in both locations where it is used; obviously still
only if RTE_LIBRTE_MEMPOOL_DEBUG is defined.

v2 changes:

- Not adding the new objects to the mempool cache before flushing it
also allows the memory allocated for the mempool cache to be reduced
from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
However, such a change would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
copying method, which on some CPU architectures performs worse on data
crossing a cache boundary.
However, such a change would break the ABI, so it was removed in v2;
and yet another alternative copying method replaced the rte_memcpy().

v3 changes:

- Actually remove my modifications of the rte_mempool_cache structure.

v4 changes:

- Updated patch title to reflect that the scope of the patch is only
mempool cache flushing.

- Do not replace rte_memcpy() with an alternative copying method. This was
a pure optimization, not a fix.

- Elaborate even more on the bugs fixed by the modifications.

- Added 4th bullet item to the patch description, regarding
rte_mempool_ops_enqueue_bulk() with RTE_LIBRTE_MEMPOOL_DEBUG.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..e7e09e48fc 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = &cache->objs[0];
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
 	}
 
+	/* Add the objects to the cache. */
+	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
+
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v4] mempool: fix mempool cache flushing algorithm
  2022-02-02 10:33 ` [PATCH v4] mempool: fix mempool cache flushing algorithm Morten Brørup
@ 2022-04-07  9:04   ` Morten Brørup
  2022-04-07  9:14     ` Bruce Richardson
  2022-10-04 20:01   ` Morten Brørup
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-04-07  9:04 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko, thomas; +Cc: bruce.richardson, jerinjacobk, dev

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 2 February 2022 11.34
> 
> This patch fixes the rte_mempool_do_generic_put() caching algorithm,
> which was fundamentally wrong, causing multiple performance issues when
> flushing.
> 

[...]

Olivier,

Will you please consider this patch [1] and the other one [2].

The primary bug here is this: When a mempool cache becomes full (i.e. exceeds the "flush threshold"), and is flushed to the backing ring, it is still full afterwards; but it should be empty afterwards. It is not flushed entirely; only the elements exceeding "size" are flushed.

E.g. pipelined applications having ingress threads and egress threads running on different lcores are affected by this bug.

I don't think the real performance impact is very big, but these algorithm level bugs really annoy me.

I'm still wondering how the patch introducing the mempool cache flush threshold could pass internal code review with so many bugs.

[1] https://patchwork.dpdk.org/project/dpdk/patch/20220202103354.79832-1-mb@smartsharesystems.com/
[2] https://patchwork.dpdk.org/project/dpdk/patch/20220202081426.77975-1-mb@smartsharesystems.com/

-Morten

> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
>  1 file changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..e7e09e48fc 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
>  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>  		goto ring_enqueue;
> 
> -	cache_objs = &cache->objs[cache->len];
> +	/* If the request itself is too big for the cache */
> +	if (unlikely(n > cache->flushthresh))
> +		goto ring_enqueue;
> 
>  	/*
>  	 * The cache follows the following algorithm
> -	 *   1. Add the objects to the cache
> -	 *   2. Anything greater than the cache min value (if it crosses
> the
> -	 *   cache flush threshold) is flushed to the ring.

In the code, "the cache min value" is actually "the cache size". This indicates an intention to do something more. Perhaps the patch introducing the "flush threshold" was committed while still incomplete, and just never got completed?

> +	 *   1. If the objects cannot be added to the cache without
> +	 *   crossing the flush threshold, flush the cache to the ring.
> +	 *   2. Add the objects to the cache.
>  	 */
> 
> -	/* Add elements back into the cache */
> -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> +	if (cache->len + n <= cache->flushthresh) {
> +		cache_objs = &cache->objs[cache->len];
> 
> -	cache->len += n;
> +		cache->len += n;
> +	} else {
> +		cache_objs = &cache->objs[0];
> 
> -	if (cache->len >= cache->flushthresh) {
> -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> -				cache->len - cache->size);
> -		cache->len = cache->size;
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> >len) < 0)
> +			rte_panic("cannot put objects in mempool\n");
> +#else
> +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> +#endif
> +		cache->len = n;
>  	}
> 
> +	/* Add the objects to the cache. */
> +	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
> +
>  	return;
> 
>  ring_enqueue:
> 
> -	/* push remaining objects in ring */
> +	/* Put the objects into the ring */
>  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
>  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
>  		rte_panic("cannot put objects in mempool\n");
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4] mempool: fix mempool cache flushing algorithm
  2022-04-07  9:04   ` Morten Brørup
@ 2022-04-07  9:14     ` Bruce Richardson
  2022-04-07  9:26       ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Bruce Richardson @ 2022-04-07  9:14 UTC (permalink / raw)
  To: Morten Brørup
  Cc: olivier.matz, andrew.rybchenko, thomas, jerinjacobk, dev

On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Wednesday, 2 February 2022 11.34
> > 
> > This patch fixes the rte_mempool_do_generic_put() caching algorithm,
> > which was fundamentally wrong, causing multiple performance issues when
> > flushing.
> > 
> 
> [...]
> 
> Olivier,
> 
> Will you please consider this patch [1] and the other one [2].
> 
> The primary bug here is this: When a mempool cache becomes full (i.e. exceeds the "flush threshold"), and is flushed to the backing ring, it is still full afterwards; but it should be empty afterwards. It is not flushed entirely, only the elements exceeding "size" are flushed.
> 

I don't believe it should be flushed entirely; there should always be some
elements left so that even after flush we can still allocate an additional
burst. We want to avoid the situation where a flush of all elements is
immediately followed by a refill of new elements. However, we can flush to
maybe size/2, and improve things. In short, this not emptying is by design
rather than a bug, though we can look to tweak the behaviour.

> E.g. pipelined applications having ingress threads and egress threads running on different lcores are affected by this bug.
> 
If we are looking at improvements for pipelined applications, I think a
bigger win would be to change the default mempool from ring-based to
stack-based. For apps using a run-to-completion model, they should mostly run
out of the mempool cache and should therefore be largely unaffected by such a
change.

> I don't think the real performance impact is very big, but these algorithm level bugs really annoy me.
> 
> I'm still wondering how the patch introducing the mempool cache flush threshold could pass internal code review with so many bugs.
> 
> [1] https://patchwork.dpdk.org/project/dpdk/patch/20220202103354.79832-1-mb@smartsharesystems.com/
> [2] https://patchwork.dpdk.org/project/dpdk/patch/20220202081426.77975-1-mb@smartsharesystems.com/
> 
> -Morten
> 
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
> >  1 file changed, 22 insertions(+), 12 deletions(-)
> > 
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..e7e09e48fc 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool
> > *mp, void * const *obj_table,
> >  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_enqueue;
> > 
> > -	cache_objs = &cache->objs[cache->len];
> > +	/* If the request itself is too big for the cache */
> > +	if (unlikely(n > cache->flushthresh))
> > +		goto ring_enqueue;
> > 
> >  	/*
> >  	 * The cache follows the following algorithm
> > -	 *   1. Add the objects to the cache
> > -	 *   2. Anything greater than the cache min value (if it crosses
> > the
> > -	 *   cache flush threshold) is flushed to the ring.
> 
> In the code, "the cache min value" is actually "the cache size". This indicates an intention to do something more. Perhaps the patch introducing the "flush threshold" was committed while still incomplete, and just never got completed?
> 
> > +	 *   1. If the objects cannot be added to the cache without
> > +	 *   crossing the flush threshold, flush the cache to the ring.
> > +	 *   2. Add the objects to the cache.
> >  	 */
> > 
> > -	/* Add elements back into the cache */
> > -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> > +	if (cache->len + n <= cache->flushthresh) {
> > +		cache_objs = &cache->objs[cache->len];
> > 
> > -	cache->len += n;
> > +		cache->len += n;
> > +	} else {
> > +		cache_objs = &cache->objs[0];
> > 
> > -	if (cache->len >= cache->flushthresh) {
> > -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> > -				cache->len - cache->size);
> > -		cache->len = cache->size;
> > +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> > +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> > >len) < 0)
> > +			rte_panic("cannot put objects in mempool\n");
> > +#else
> > +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > +#endif
> > +		cache->len = n;
> >  	}
> > 
> > +	/* Add the objects to the cache. */
> > +	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
> > +
> >  	return;
> > 
> >  ring_enqueue:
> > 
> > -	/* push remaining objects in ring */
> > +	/* Put the objects into the ring */
> >  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> >  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
> >  		rte_panic("cannot put objects in mempool\n");
> > --
> > 2.17.1
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v4] mempool: fix mempool cache flushing algorithm
  2022-04-07  9:14     ` Bruce Richardson
@ 2022-04-07  9:26       ` Morten Brørup
  2022-04-07 10:32         ` Bruce Richardson
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-04-07  9:26 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: olivier.matz, andrew.rybchenko, thomas, jerinjacobk, dev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 7 April 2022 11.14
> 
> On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Wednesday, 2 February 2022 11.34
> > >
> > > This patch fixes the rte_mempool_do_generic_put() caching
> algorithm,
> > > which was fundamentally wrong, causing multiple performance issues
> when
> > > flushing.
> > >
> >
> > [...]
> >
> > Olivier,
> >
> > Will you please consider this patch [1] and the other one [2].
> >
> > The primary bug here is this: When a mempool cache becomes full (i.e.
> exceeds the "flush threshold"), and is flushed to the backing ring, it
> is still full afterwards; but it should be empty afterwards. It is not
> flushed entirely, only the elements exceeding "size" are flushed.
> >
> 
> I don't believe it should be flushed entirely, there should always be
> some
> elements left so that even after flush we can still allocate an
> additional
> burst. We want to avoid the situation where a flush of all elements is
> immediately followed by a refill of new elements. However, we can flush
> to
> maybe size/2, and improve things. In short, this not emptying is by
> design
> rather than a bug, though we can look to tweak the behaviour.
> 

I initially agreed with you about flushing to size/2.

However, I did think further about it when I wrote the patch, and came to this conclusion: If an application thread repeatedly puts objects into the mempool, and does it so often that the cache overflows (i.e. reaches the flush threshold) and needs to be flushed, it is far more likely that the application thread will continue doing that, rather than start getting objects from the mempool. This speaks for flushing the cache entirely.

Both solutions are better than flushing to size, so if there is a preference for keeping some objects in the cache after flushing, I can update the patch accordingly.
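
For reference, the size/2 variant could look roughly like this (sketch only, not in the patch, keeping the bottom of the stack; the initial screening from the patch still applies):

	if (cache->len + n > cache->flushthresh) {
		uint32_t keep = cache->size / 2;

		/* Flush the top of the stack, keep the bottom "keep" objects. */
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[keep],
				cache->len - keep);
		cache->len = keep;
	}
	/* Add the new objects on top, as in the patch. */
	rte_memcpy(&cache->objs[cache->len], obj_table, sizeof(void *) * n);
	cache->len += n;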

> > E.g. pipelined applications having ingress threads and egress threads
> running on different lcores are affected by this bug.
> >
> If we are looking at improvements for pipelined applications, I think a
> bigger win would be to change the default mempool from ring-based to
> stack-based. For apps using a run-to-completion model, they should run
> out
> of cache and should therefore be largely unaffected by such a change.
> 
> > I don't think the real performance impact is very big, but these
> algorithm level bugs really annoy me.
> >
> > I'm still wondering how the patch introducing the mempool cache flush
> threshold could pass internal code review with so many bugs.
> >
> > [1]
> https://patchwork.dpdk.org/project/dpdk/patch/20220202103354.79832-1-
> mb@smartsharesystems.com/
> > [2]
> https://patchwork.dpdk.org/project/dpdk/patch/20220202081426.77975-1-
> mb@smartsharesystems.com/
> >
> > -Morten
> >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > >  lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
> > >  1 file changed, 22 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 1e7a3c1527..e7e09e48fc 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct
> rte_mempool
> > > *mp, void * const *obj_table,
> > >  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> > >  		goto ring_enqueue;
> > >
> > > -	cache_objs = &cache->objs[cache->len];
> > > +	/* If the request itself is too big for the cache */
> > > +	if (unlikely(n > cache->flushthresh))
> > > +		goto ring_enqueue;
> > >
> > >  	/*
> > >  	 * The cache follows the following algorithm
> > > -	 *   1. Add the objects to the cache
> > > -	 *   2. Anything greater than the cache min value (if it crosses
> > > the
> > > -	 *   cache flush threshold) is flushed to the ring.
> >
> > In the code, "the cache min value" is actually "the cache size". This
> indicates an intention to do something more. Perhaps the patch
> introducing the "flush threshold" was committed while still incomplete,
> and just never got completed?
> >
> > > +	 *   1. If the objects cannot be added to the cache without
> > > +	 *   crossing the flush threshold, flush the cache to the ring.
> > > +	 *   2. Add the objects to the cache.
> > >  	 */
> > >
> > > -	/* Add elements back into the cache */
> > > -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> > > +	if (cache->len + n <= cache->flushthresh) {
> > > +		cache_objs = &cache->objs[cache->len];
> > >
> > > -	cache->len += n;
> > > +		cache->len += n;
> > > +	} else {
> > > +		cache_objs = &cache->objs[0];
> > >
> > > -	if (cache->len >= cache->flushthresh) {
> > > -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> > > -				cache->len - cache->size);
> > > -		cache->len = cache->size;
> > > +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> > > +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> > > >len) < 0)
> > > +			rte_panic("cannot put objects in mempool\n");
> > > +#else
> > > +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > > +#endif
> > > +		cache->len = n;
> > >  	}
> > >
> > > +	/* Add the objects to the cache. */
> > > +	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
> > > +
> > >  	return;
> > >
> > >  ring_enqueue:
> > >
> > > -	/* push remaining objects in ring */
> > > +	/* Put the objects into the ring */
> > >  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> > >  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
> > >  		rte_panic("cannot put objects in mempool\n");
> > > --
> > > 2.17.1
> >


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4] mempool: fix mempool cache flushing algorithm
  2022-04-07  9:26       ` Morten Brørup
@ 2022-04-07 10:32         ` Bruce Richardson
  2022-04-07 10:43           ` Bruce Richardson
  0 siblings, 1 reply; 85+ messages in thread
From: Bruce Richardson @ 2022-04-07 10:32 UTC (permalink / raw)
  To: Morten Brørup
  Cc: olivier.matz, andrew.rybchenko, thomas, jerinjacobk, dev

On Thu, Apr 07, 2022 at 11:26:53AM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Thursday, 7 April 2022 11.14
> > 
> > On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > Sent: Wednesday, 2 February 2022 11.34
> > > >
> > > > This patch fixes the rte_mempool_do_generic_put() caching
> > algorithm,
> > > > which was fundamentally wrong, causing multiple performance issues
> > when
> > > > flushing.
> > > >
> > >
> > > [...]
> > >
> > > Olivier,
> > >
> > > Will you please consider this patch [1] and the other one [2].
> > >
> > > The primary bug here is this: When a mempool cache becomes full (i.e.
> > exceeds the "flush threshold"), and is flushed to the backing ring, it
> > is still full afterwards; but it should be empty afterwards. It is not
> > flushed entirely, only the elements exceeding "size" are flushed.
> > >
> > 
> > I don't believe it should be flushed entirely, there should always be
> > some
> > elements left so that even after flush we can still allocate an
> > additional
> > burst. We want to avoid the situation where a flush of all elements is
> > immediately followed by a refill of new elements. However, we can flush
> > to
> > maybe size/2, and improve things. In short, this not emptying is by
> > design
> > rather than a bug, though we can look to tweak the behaviour.
> > 
> 
> I initially agreed with you about flushing to size/2.
> 
> However, I did think further about it when I wrote the patch, and came to this conclusion: If an application thread repeatedly puts objects into the mempool, and does it so often that the cache overflows (i.e. reaches the flush threshold) and needs to be flushed, it is far more likely that the application thread will continue doing that, rather than start getting objects from the mempool. This speaks for flushing the cache entirely.
> 
> Both solutions are better than flushing to size, so if there is a preference for keeping some objects in the cache after flushing, I can update the patch accordingly.
> 

Would it be worth looking at adding per-core hinting to the mempool?
A core could indicate that it allocates only (e.g. an RX thread), frees only
(e.g. a TX thread), or does both alloc and free (the default). That hint would
only need to be consulted on flush or refill, to specify whether to flush all
or only part of the cache, and similarly whether to refill to the maximum
possible or just to size.
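
As a sketch only - none of these names exist today - it could be something like:

	enum rte_mempool_cache_usage_hint {
		RTE_MEMPOOL_CACHE_HINT_DEFAULT,    /* balanced alloc and free */
		RTE_MEMPOOL_CACHE_HINT_ALLOC_ONLY, /* e.g. an RX thread */
		RTE_MEMPOOL_CACHE_HINT_FREE_ONLY,  /* e.g. a TX thread */
	};

	/* Hypothetical setter; flush/refill would consult the hint to decide
	 * whether to flush all or only part of the cache, and whether to
	 * refill to the maximum possible or just to size. */
	void rte_mempool_cache_set_hint(struct rte_mempool_cache *cache,
			enum rte_mempool_cache_usage_hint hint);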

/Bruce

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4] mempool: fix mempool cache flushing algorithm
  2022-04-07 10:32         ` Bruce Richardson
@ 2022-04-07 10:43           ` Bruce Richardson
  2022-04-07 11:36             ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Bruce Richardson @ 2022-04-07 10:43 UTC (permalink / raw)
  To: Morten Brørup
  Cc: olivier.matz, andrew.rybchenko, thomas, jerinjacobk, dev

On Thu, Apr 07, 2022 at 11:32:12AM +0100, Bruce Richardson wrote:
> On Thu, Apr 07, 2022 at 11:26:53AM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Thursday, 7 April 2022 11.14
> > > 
> > > On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > Sent: Wednesday, 2 February 2022 11.34
> > > > >
> > > > > This patch fixes the rte_mempool_do_generic_put() caching
> > > algorithm,
> > > > > which was fundamentally wrong, causing multiple performance issues
> > > when
> > > > > flushing.
> > > > >
> > > >
> > > > [...]
> > > >
> > > > Olivier,
> > > >
> > > > Will you please consider this patch [1] and the other one [2].
> > > >
> > > > The primary bug here is this: When a mempool cache becomes full (i.e.
> > > exceeds the "flush threshold"), and is flushed to the backing ring, it
> > > is still full afterwards; but it should be empty afterwards. It is not
> > > flushed entirely, only the elements exceeding "size" are flushed.
> > > >
> > > 
> > > I don't believe it should be flushed entirely, there should always be
> > > some
> > > elements left so that even after flush we can still allocate an
> > > additional
> > > burst. We want to avoid the situation where a flush of all elements is
> > > immediately followed by a refill of new elements. However, we can flush
> > > to
> > > maybe size/2, and improve things. In short, this not emptying is by
> > > design
> > > rather than a bug, though we can look to tweak the behaviour.
> > > 
> > 
> > I initially agreed with you about flushing to size/2.
> > 
> > However, I did think further about it when I wrote the patch, and came to this conclusion: If an application thread repeatedly puts objects into the mempool, and does it so often that the cache overflows (i.e. reaches the flush threshold) and needs to be flushed, it is far more likely that the application thread will continue doing that, rather than start getting objects from the mempool. This speaks for flushing the cache entirely.
> > 
> > Both solutions are better than flushing to size, so if there is a preference for keeping some objects in the cache after flushing, I can update the patch accordingly.
> > 
> 
> Would it be worth looking at adding per-core hinting to the mempool?
> Indicate for a core that it allocates-only, i.e. RX thread, frees-only,
> i.e. TX-thread, or does both alloc and free (the default)? That hint could
> be used only on flush or refill to specify whether to flush all or partial,
> and similarly to refill to max possible or just to size.
> 
Actually, taking the idea further, we could always track per-core whether a
core has ever done a flush/refill and use that as the hint instead. It
could even be done in a branch-free manner if we want. For example:

on flush:
	keep_entries = (size >> 1) & (never_refills - 1);

which will set the number of entries to keep to 0 if we have never had to
refill, or to half of size if the thread has previously done refills.
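
Spelling that out a bit (the "never_refills" field is hypothetical, initialized to 1 when the cache is created):

	/* in the refill (get) path: */
	cache->never_refills = 0;

	/* in the flush (put) path, branch-free selection of how much to keep: */
	keep_entries = (cache->size >> 1) & (cache->never_refills - 1);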

/Bruce

^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v4] mempool: fix mempool cache flushing algorithm
  2022-04-07 10:43           ` Bruce Richardson
@ 2022-04-07 11:36             ` Morten Brørup
  0 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-04-07 11:36 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: olivier.matz, andrew.rybchenko, thomas, jerinjacobk, dev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 7 April 2022 12.44
> 
> On Thu, Apr 07, 2022 at 11:32:12AM +0100, Bruce Richardson wrote:
> > On Thu, Apr 07, 2022 at 11:26:53AM +0200, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Thursday, 7 April 2022 11.14
> > > >
> > > > On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > Sent: Wednesday, 2 February 2022 11.34
> > > > > >
> > > > > > This patch fixes the rte_mempool_do_generic_put() caching
> > > > algorithm,
> > > > > > which was fundamentally wrong, causing multiple performance
> issues
> > > > when
> > > > > > flushing.
> > > > > >
> > > > >
> > > > > [...]
> > > > >
> > > > > Olivier,
> > > > >
> > > > > Will you please consider this patch [1] and the other one [2].
> > > > >
> > > > > The primary bug here is this: When a mempool cache becomes full
> (i.e.
> > > > exceeds the "flush threshold"), and is flushed to the backing
> ring, it
> > > > is still full afterwards; but it should be empty afterwards. It
> is not
> > > > flushed entirely, only the elements exceeding "size" are flushed.
> > > > >
> > > >
> > > > I don't believe it should be flushed entirely, there should
> always be
> > > > some
> > > > elements left so that even after flush we can still allocate an
> > > > additional
> > > > burst. We want to avoid the situation where a flush of all
> elements is
> > > > immediately followed by a refill of new elements. However, we can
> flush
> > > > to
> > > > maybe size/2, and improve things. In short, this not emptying is
> by
> > > > design
> > > > rather than a bug, though we can look to tweak the behaviour.
> > > >
> > >
> > > I initially agreed with you about flushing to size/2.
> > >
> > > However, I did think further about it when I wrote the patch, and
> came to this conclusion: If an application thread repeatedly puts
> objects into the mempool, and does it so often that the cache overflows
> (i.e. reaches the flush threshold) and needs to be flushed, it is far
> more likely that the application thread will continue doing that,
> rather than start getting objects from the mempool. This speaks for
> flushing the cache entirely.
> > >
> > > Both solutions are better than flushing to size, so if there is a
> preference for keeping some objects in the cache after flushing, I can
> update the patch accordingly.

I forgot to mention some details here...

The cache is a stack, so leaving objects in it after flushing can be done in one of two ways:

1. Flush the top objects, and leave the bottom objects, which are extremely cold.
2. Flush the bottom objects, and move the objects from the top to the bottom, which is a costly operation.

Theoretically, there is a third option: Make the stack a circular buffer, so its "bottom pointer" can be moved around, instead of copying objects from the top to the bottom after flushing. However, this would add complexity when copying arrays to/from the stack in both of the normal cases, i.e. to/from the application. And it introduces requirements on the cache size. So I quickly discarded the idea when it first came to me.

The provided patch flushes the entire cache, and then stores the newly added objects (the ones causing the flush) in the cache. So it is not completely empty after flushing. It contains some (but not many) objects, and they are hot.

> > >
> >
> > Would it be worth looking at adding per-core hinting to the mempool?
> > Indicate for a core that it allocates-only, i.e. RX thread, frees-
> only,
> > i.e. TX-thread, or does both alloc and free (the default)? That hint
> could
> > be used only on flush or refill to specify whether to flush all or
> partial,
> > and similarly to refill to max possible or just to size.
> >
> Actually, taking the idea further, we could always track per-core
> whether a
> core has ever done a flush/refill and use that as the hint instead. It
> could even be done in a branch-free manner if we want. For example:
> 
> on flush:
> 	keep_entries = (size >> 1) & (never_refills - 1);
> 
> which will set the entries to keep to be 0 if we have never had to
> refill, or
> half of size, if the thread has previously done refills.
> 

Your suggestion is a good idea for a performance improvement.

We would also need "mostly" variants in addition to the "only" variants. Otherwise, the automatic detection could cause problems if triggered by some rare event.

And applications using the "service cores" concept will just fall back to the default alloc-free balanced variant.


Perhaps we should fix the current bugs (my term, not consensus) first, and then look at further performance improvements. It's already uphill getting Acks for my fixes as they are.


Another performance improvement could be hard-coding the mempool cache size to RTE_MEMPOOL_CACHE_MAX_SIZE, so the copying between the cache and the backing ring could be done by a fixed-size, optimized memcpy using vector instructions.

Obviously, whatever we do to optimize this, we should ensure optimal handling of the common case, where objects are only moved between the application and the mempool cache, and the backing ring is not touched.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v2] mempool: fix get objects from mempool with cache
  2022-02-02  8:14 ` [PATCH v2] mempool: fix get objects from " Morten Brørup
@ 2022-06-15 21:18   ` Morten Brørup
  2022-09-29 10:52     ` Morten Brørup
  2022-10-04 12:57   ` Andrew Rybchenko
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-06-15 21:18 UTC (permalink / raw)
  To: Olivier Matz, Andrew Rybchenko
  Cc: Bruce Richardson, Jerin Jacob, Beilei Xing, dev

+CC: Beilei Xing <beilei.xing@intel.com>, i40e maintainer, who may be interested in the performance improvements achieved by this patch.

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 2 February 2022 09.14
> 
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> This patch fixes the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction
> of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> This bug caused a major performance degradation in scenarios where the
> application burst length is the same as the cache size. In such cases,
> the objects were not ever fetched from the mempool cache, regardless if
> they could have been.
> This scenario occurs e.g. if an application has configured a mempool
> with a size matching the application's burst size.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the ring.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the
> objects
> from the ring were returned ahead of the objects from the cache.
> This bug degraded application performance on CPUs with a small L1
> cache,
> which benefit from having the hot objects first in the returned array.
> (This is probably also the reason why the function returns the objects
> in reverse order, which it still does.)
> Now, all code paths first return objects from the cache, subsequently
> from the ring.
> 
> The function was not behaving as described (by the function using it)
> and expected by applications using it. This in itself is also a bug.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the ring (instead of only the
> number of requested objects minus the objects available in the ring),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the ring fails,
> only
> the remaining requested objects are retrieved from the ring.
> 
> The function would fail despite there are enough objects in the cache
> plus the common pool.
> 
> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.
> 
> v2 changes
> - Do not modify description of return value. This belongs in a separate
> doc fix.
> - Elaborate even more on which bugs the modifications fix.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 75 ++++++++++++++++++++++++++++-----------
>  1 file changed, 54 insertions(+), 21 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..2898c690b0 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1463,38 +1463,71 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	uint32_t index, len;
>  	void **cache_objs;
> 
> -	/* No cache provided or cannot be satisfied from cache */
> -	if (unlikely(cache == NULL || n >= cache->size))
> +	/* No cache provided or if get would overflow mem allocated for
> cache */
> +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>  		goto ring_dequeue;
> 
> -	cache_objs = cache->objs;
> +	cache_objs = &cache->objs[cache->len];
> +
> +	if (n <= cache->len) {
> +		/* The entire request can be satisfied from the cache. */
> +		cache->len -= n;
> +		for (index = 0; index < n; index++)
> +			*obj_table++ = *--cache_objs;
> +
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> 
> -	/* Can this be satisfied from the cache? */
> -	if (cache->len < n) {
> -		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = n + (cache->size - cache->len);
> +		return 0;
> +	}
> 
> -		/* How many do we require i.e. number to fill the cache +
> the request */
> -		ret = rte_mempool_ops_dequeue_bulk(mp,
> -			&cache->objs[cache->len], req);
> +	/* Satisfy the first part of the request by depleting the cache.
> */
> +	len = cache->len;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;
> +
> +	/* Number of objects remaining to satisfy the request. */
> +	len = n - len;
> +
> +	/* Fill the cache from the ring; fetch size + remaining objects.
> */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> +			cache->size + len);
> +	if (unlikely(ret < 0)) {
> +		/*
> +		 * We are buffer constrained, and not able to allocate
> +		 * cache + remaining.
> +		 * Do not fill the cache, just satisfy the remaining part
> of
> +		 * the request directly from the ring.
> +		 */
> +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
>  		if (unlikely(ret < 0)) {
>  			/*
> -			 * In the off chance that we are buffer constrained,
> -			 * where we are not able to allocate cache + n, go to
> -			 * the ring directly. If that fails, we are truly out
> of
> -			 * buffers.
> +			 * That also failed.
> +			 * No further action is required to roll the first
> +			 * part of the request back into the cache, as both
> +			 * cache->len and the objects in the cache are
> intact.
>  			 */
> -			goto ring_dequeue;
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> +
> +			return ret;
>  		}
> 
> -		cache->len += req;
> +		/* Commit that the cache was emptied. */
> +		cache->len = 0;
> +
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +
> +		return 0;
>  	}
> 
> -	/* Now fill in the response ... */
> -	for (index = 0, len = cache->len - 1; index < n; ++index, len--,
> obj_table++)
> -		*obj_table = cache_objs[len];
> +	cache_objs = &cache->objs[cache->size + len];
> 
> -	cache->len -= n;
> +	/* Satisfy the remaining part of the request from the filled
> cache. */
> +	cache->len = cache->size;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;
> 
>  	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
>  	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> @@ -1503,7 +1536,7 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
> 
>  ring_dequeue:
> 
> -	/* get remaining objects from ring */
> +	/* Get the objects from the ring. */
>  	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
> 
>  	if (ret < 0) {
> --
> 2.17.1

PING.

According to Patchwork [1], this patch provides up to 10.9 % single thread throughput improvement on XL710 with x86, and 0.7 % improvement with ARM.

Still no interest?

PS: Bruce reviewed V1 of this patch [2], but I don't think it is appropriate to copy a Reviewed-by tag from one version of a patch to another, regardless of how small the changes are.

[1] http://mails.dpdk.org/archives/test-report/2022-February/256462.html
[2] http://inbox.dpdk.org/dev/YeaDSxj%2FuZ0vPMl+@bricha3-MOBL.ger.corp.intel.com/


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v2] mempool: fix get objects from mempool with cache
  2022-06-15 21:18   ` Morten Brørup
@ 2022-09-29 10:52     ` Morten Brørup
  0 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-09-29 10:52 UTC (permalink / raw)
  To: Olivier Matz, Andrew Rybchenko, Bruce Richardson
  Cc: Jerin Jacob, Beilei Xing, dev

PING again.

If the explanation and/or diff is too long-winded, just look at the resulting code instead - it is clean and easily readable.

This patch should not be controversial, so I would like to see it merged into the coming LTS release. (Unlike my other mempool patch [3], which changes the behavior of the mempool cache.)

[3]: https://patchwork.dpdk.org/project/dpdk/patch/20220202103354.79832-1-mb@smartsharesystems.com/

Med venlig hilsen / Kind regards,
-Morten Brørup

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 15 June 2022 23.18
> 
> +CC: Beilei Xing <beilei.xing@intel.com>, i40e maintainer, may be
> interested in the performance improvements achieved by this patch.
> 
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Wednesday, 2 February 2022 09.14
> >
> > A flush threshold for the mempool cache was introduced in DPDK
> version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> >
> > This patch fixes the following in rte_mempool_do_generic_get():
> >
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction
> > of
> > the flush threshold. E.g. the cache can hold up to flushthresh
> objects,
> > which is more than its size, so some requests were not served from
> the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a
> cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> >
> > This bug caused a major performance degradation in scenarios where
> the
> > application burst length is the same as the cache size. In such
> cases,
> > the objects were not ever fetched from the mempool cache, regardless
> if
> > they could have been.
> > This scenario occurs e.g. if an application has configured a mempool
> > with a size matching the application's burst size.
> >
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the ring.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the
> > objects
> > from the ring were returned ahead of the objects from the cache.
> > This bug degraded application performance on CPUs with a small L1
> > cache,
> > which benefit from having the hot objects first in the returned
> array.
> > (This is probably also the reason why the function returns the
> objects
> > in reverse order, which it still does.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the ring.
> >
> > The function was not behaving as described (by the function using it)
> > and expected by applications using it. This in itself is also a bug.
> >
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the ring (instead of only the
> > number of requested objects minus the objects available in the cache),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the
> cache,
> > and if the subsequent backfilling of the cache from the ring fails,
> > only
> > the remaining requested objects are retrieved from the ring.
> >
> > The function would fail despite there being enough objects in the cache
> > plus the common pool.
> >
> > 4. The code flow for satisfying the request from the cache was
> slightly
> > inefficient:
> > The likely code path where the objects are simply served from the
> cache
> > was treated as unlikely. Now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers
> were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > v2 changes
> > - Do not modify description of return value. This belongs in a
> separate
> > doc fix.
> > - Elaborate even more on which bugs the modifications fix.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/mempool/rte_mempool.h | 75 ++++++++++++++++++++++++++++---------
> --
> >  1 file changed, 54 insertions(+), 21 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..2898c690b0 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1463,38 +1463,71 @@ rte_mempool_do_generic_get(struct rte_mempool
> > *mp, void **obj_table,
> >  	uint32_t index, len;
> >  	void **cache_objs;
> >
> > -	/* No cache provided or cannot be satisfied from cache */
> > -	if (unlikely(cache == NULL || n >= cache->size))
> > +	/* No cache provided or if get would overflow mem allocated for
> > cache */
> > +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_dequeue;
> >
> > -	cache_objs = cache->objs;
> > +	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (n <= cache->len) {
> > +		/* The entire request can be satisfied from the cache. */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> > +
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> >
> > -	/* Can this be satisfied from the cache? */
> > -	if (cache->len < n) {
> > -		/* No. Backfill the cache first, and then fill from it */
> > -		uint32_t req = n + (cache->size - cache->len);
> > +		return 0;
> > +	}
> >
> > -		/* How many do we require i.e. number to fill the cache +
> > the request */
> > -		ret = rte_mempool_ops_dequeue_bulk(mp,
> > -			&cache->objs[cache->len], req);
> > +	/* Satisfy the first part of the request by depleting the cache.
> > */
> > +	len = cache->len;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> > +
> > +	/* Number of objects remaining to satisfy the request. */
> > +	len = n - len;
> > +
> > +	/* Fill the cache from the ring; fetch size + remaining objects.
> > */
> > +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > +			cache->size + len);
> > +	if (unlikely(ret < 0)) {
> > +		/*
> > +		 * We are buffer constrained, and not able to allocate
> > +		 * cache + remaining.
> > +		 * Do not fill the cache, just satisfy the remaining part
> > of
> > +		 * the request directly from the ring.
> > +		 */
> > +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
> >  		if (unlikely(ret < 0)) {
> >  			/*
> > -			 * In the off chance that we are buffer constrained,
> > -			 * where we are not able to allocate cache + n, go to
> > -			 * the ring directly. If that fails, we are truly out
> > of
> > -			 * buffers.
> > +			 * That also failed.
> > +			 * No further action is required to roll the first
> > +			 * part of the request back into the cache, as both
> > +			 * cache->len and the objects in the cache are
> > intact.
> >  			 */
> > -			goto ring_dequeue;
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> > +
> > +			return ret;
> >  		}
> >
> > -		cache->len += req;
> > +		/* Commit that the cache was emptied. */
> > +		cache->len = 0;
> > +
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > +
> > +		return 0;
> >  	}
> >
> > -	/* Now fill in the response ... */
> > -	for (index = 0, len = cache->len - 1; index < n; ++index, len--,
> > obj_table++)
> > -		*obj_table = cache_objs[len];
> > +	cache_objs = &cache->objs[cache->size + len];
> >
> > -	cache->len -= n;
> > +	/* Satisfy the remaining part of the request from the filled
> > cache. */
> > +	cache->len = cache->size;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> >
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > @@ -1503,7 +1536,7 @@ rte_mempool_do_generic_get(struct rte_mempool
> > *mp, void **obj_table,
> >
> >  ring_dequeue:
> >
> > -	/* get remaining objects from ring */
> > +	/* Get the objects from the ring. */
> >  	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
> >
> >  	if (ret < 0) {
> > --
> > 2.17.1
> 
> PING.
> 
> According to Patchwork [1], this patch provides up to 10.9 % single
> thread throughput improvement on XL710 with x86, and 0.7 % improvement
> with ARM.
> 
> Still no interest?
> 
> PS: Bruce reviewed V1 of this patch [2], but I don't think it is
> appropriate to copy a Reviewed-by tag from one version of a patch to
> another, regardless of how small the changes are.
> 
> [1] http://mails.dpdk.org/archives/test-report/2022-
> February/256462.html
> [2] http://inbox.dpdk.org/dev/YeaDSxj%2FuZ0vPMl+@bricha3-
> MOBL.ger.corp.intel.com/


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v3] mempool: fix get objects from mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
                   ` (6 preceding siblings ...)
  2022-02-02 10:33 ` [PATCH v4] mempool: fix mempool cache flushing algorithm Morten Brørup
@ 2022-10-04 12:53 ` Andrew Rybchenko
  2022-10-04 14:42   ` Morten Brørup
  2022-10-07 10:44 ` [PATCH v4] " Andrew Rybchenko
  2022-10-09 13:37 ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
  9 siblings, 1 reply; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-04 12:53 UTC (permalink / raw)
  To: Olivier Matz
  Cc: dev, Morten Brørup, Beilei Xing, Bruce Richardson,
	Jerin Jacob Kollanukkaran

From: Morten Brørup <mb@smartsharesystems.com>

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.

Fix the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

This bug caused a major performance degradation in scenarios where the
application burst length is the same as the cache size. In such cases,
the objects were never fetched from the mempool cache, regardless of
whether they could have been.
This scenario occurs e.g. if an application has configured a mempool
with a size matching the application's burst size.

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the ring.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the ring were returned ahead of the objects from the cache.
This bug degraded application performance on CPUs with a small L1 cache,
which benefit from having the hot objects first in the returned array.
(This is probably also the reason why the function returns the objects
in reverse order, which it still does.)
Now, all code paths first return objects from the cache, subsequently
from the ring.

The function was not behaving as described (by the function using it)
and expected by applications using it. This in itself is also a bug.

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the ring (instead of only the
number of requested objects minus the objects available in the cache),
and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the ring fails, only
the remaining requested objects are retrieved from the ring.

The function would fail despite there being enough objects in the cache
plus the common pool.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
v3 changes (Andrew Rybchenko)
 - Always get first objects from the cache even if request is bigger
   than cache size. Remove one corresponding condition from the path
   when request is fully served from cache.
 - Simplify code to avoid duplication:
    - Get objects directly from backend in single place only.
    - Share code which gets from the cache first regardless of whether
      everything is obtained from the cache or just the first part.
 - Rollback cache length in unlikely failure branch to avoid cache
   vs NULL check in success branch.

v2 changes
- Do not modify description of return value. This belongs in a separate
doc fix.
- Elaborate even more on which bugs the modifications fix.

 lib/mempool/rte_mempool.h | 74 +++++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 26 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index a3c4ee351d..58e41ed401 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1443,41 +1443,54 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
+	unsigned int remaining = n;
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided */
+	if (unlikely(cache == NULL))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
+	/* Use the cache as much as we have to return hot objects first */
+	len = RTE_MIN(remaining, cache->len);
+	cache_objs = &cache->objs[cache->len];
+	cache->len -= len;
+	remaining -= len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+	if (remaining == 0) {
+		/* The entire request is satisfied from the cache. */
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
-		if (unlikely(ret < 0)) {
-			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
-			 */
-			goto ring_dequeue;
-		}
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-		cache->len += req;
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	/* if dequeue below would overflow mem allocated for cache */
+	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+		goto ring_dequeue;
 
-	cache->len -= n;
+	/* Fill the cache from the ring; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + remaining);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the ring.
+		 */
+		goto ring_dequeue;
+	}
+
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache_objs = &cache->objs[cache->size + remaining];
+	for (index = 0; index < remaining; index++)
+		*obj_table++ = *--cache_objs;
+
+	cache->len = cache->size;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1486,10 +1499,19 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
-	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
+	/* Get the objects directly from the ring. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, remaining);
 
 	if (ret < 0) {
+		if (cache != NULL) {
+			cache->len = n - remaining;
+			/*
+			 * No further action is required to roll the first part
+			 * of the request back into the cache, as objects in
+			 * the cache are intact.
+			 */
+		}
+
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
 	} else {
-- 
2.30.2


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v2] mempool: fix get objects from mempool with cache
  2022-02-02  8:14 ` [PATCH v2] mempool: fix get objects from " Morten Brørup
  2022-06-15 21:18   ` Morten Brørup
@ 2022-10-04 12:57   ` Andrew Rybchenko
  2022-10-04 15:13     ` Morten Brørup
  2022-10-04 16:03   ` Morten Brørup
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-04 12:57 UTC (permalink / raw)
  To: Morten Brørup, olivier.matz; +Cc: bruce.richardson, jerinjacobk, dev

Hi Morten,

In general I agree that the fix is required.
In the v3 I sent, I'm trying to make it a bit better from my point of
view. See a few notes below.

On 2/2/22 11:14, Morten Brørup wrote:
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> This patch fixes the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> This bug caused a major performance degradation in scenarios where the
> application burst length is the same as the cache size. In such cases,
> the objects were never fetched from the mempool cache, regardless of
> whether they could have been.
> This scenario occurs e.g. if an application has configured a mempool
> with a size matching the application's burst size.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the ring.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the objects
> from the ring were returned ahead of the objects from the cache.
> This bug degraded application performance on CPUs with a small L1 cache,
> which benefit from having the hot objects first in the returned array.
> (This is probably also the reason why the function returns the objects
> in reverse order, which it still does.)
> Now, all code paths first return objects from the cache, subsequently
> from the ring.
> 
> The function was not behaving as described (by the function using it)
> and expected by applications using it. This in itself is also a bug.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the ring (instead of only the
> number of requested objects minus the objects available in the cache),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the ring fails, only
> the remaining requested objects are retrieved from the ring.
> 
> The function would fail despite there being enough objects in the cache
> plus the common pool.
> 
> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.

I've just sent v3 with suggested changes to the patch.

> 
> v2 changes
> - Do not modify description of return value. This belongs in a separate
> doc fix.
> - Elaborate even more on which bugs the modifications fix.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>   lib/mempool/rte_mempool.h | 75 ++++++++++++++++++++++++++++-----------
>   1 file changed, 54 insertions(+), 21 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..2898c690b0 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1463,38 +1463,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>   	uint32_t index, len;
>   	void **cache_objs;
>   
> -	/* No cache provided or cannot be satisfied from cache */
> -	if (unlikely(cache == NULL || n >= cache->size))
> +	/* No cache provided or if get would overflow mem allocated for cache */
> +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))

The second condition is unnecessary until we try to fill the
cache from the backend.
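
In other words (condensing the v3 hunks above, rather than proposing new code), the two checks end up split like this:

	/* Up front, only the presence of a cache needs to be checked. */
	if (unlikely(cache == NULL))
		goto ring_dequeue;
	...
	/* The size limit only matters just before backfilling the cache. */
	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
		goto ring_dequeue;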

>   		goto ring_dequeue;
>   
> -	cache_objs = cache->objs;
> +	cache_objs = &cache->objs[cache->len];
> +
> +	if (n <= cache->len) {
> +		/* The entire request can be satisfied from the cache. */
> +		cache->len -= n;
> +		for (index = 0; index < n; index++)
> +			*obj_table++ = *--cache_objs;
> +
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
>   
> -	/* Can this be satisfied from the cache? */
> -	if (cache->len < n) {
> -		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = n + (cache->size - cache->len);
> +		return 0;
> +	}
>   
> -		/* How many do we require i.e. number to fill the cache + the request */
> -		ret = rte_mempool_ops_dequeue_bulk(mp,
> -			&cache->objs[cache->len], req);
> +	/* Satisfy the first part of the request by depleting the cache. */
> +	len = cache->len;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;

I dislike the duplication of these lines here and above. See v3.
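
For reference, the v3 above folds both copy loops into a single one driven by RTE_MIN() (condensed from the diff earlier in this thread):

	/* Use the cache as much as we have to return hot objects first */
	len = RTE_MIN(remaining, cache->len);
	cache_objs = &cache->objs[cache->len];
	cache->len -= len;
	remaining -= len;
	for (index = 0; index < len; index++)
		*obj_table++ = *--cache_objs;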

> +
> +	/* Number of objects remaining to satisfy the request. */
> +	len = n - len;
> +
> +	/* Fill the cache from the ring; fetch size + remaining objects. */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> +			cache->size + len);
> +	if (unlikely(ret < 0)) {
> +		/*
> +		 * We are buffer constrained, and not able to allocate
> +		 * cache + remaining.
> +		 * Do not fill the cache, just satisfy the remaining part of
> +		 * the request directly from the ring.
> +		 */
> +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);

I dislike the duplication as well. We can goto ring_dequeue
instead. See v3.
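
That is, v3 keeps a single backend call under the ring_dequeue label, dequeues only the remaining objects there, and on failure restores cache->len so the part already served from the cache is effectively rolled back (condensed from the v3 diff above):

ring_dequeue:

	/* Get the objects directly from the ring. */
	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, remaining);

	if (ret < 0) {
		if (cache != NULL)
			cache->len = n - remaining;
		...
	}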

>   		if (unlikely(ret < 0)) {
>   			/*
> -			 * In the off chance that we are buffer constrained,
> -			 * where we are not able to allocate cache + n, go to
> -			 * the ring directly. If that fails, we are truly out of
> -			 * buffers.
> +			 * That also failed.
> +			 * No further action is required to roll the first
> +			 * part of the request back into the cache, as both
> +			 * cache->len and the objects in the cache are intact.
>   			 */
> -			goto ring_dequeue;
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> +
> +			return ret;
>   		}
>   
> -		cache->len += req;
> +		/* Commit that the cache was emptied. */
> +		cache->len = 0;
> +
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +
> +		return 0;
>   	}
>   
> -	/* Now fill in the response ... */
> -	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> -		*obj_table = cache_objs[len];
> +	cache_objs = &cache->objs[cache->size + len];
>   
> -	cache->len -= n;
> +	/* Satisfy the remaining part of the request from the filled cache. */
> +	cache->len = cache->size;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;
>   
>   	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
>   	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> @@ -1503,7 +1536,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>   
>   ring_dequeue:
>   
> -	/* get remaining objects from ring */
> +	/* Get the objects from the ring. */
>   	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
>   
>   	if (ret < 0) {


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v3] mempool: fix get objects from mempool with cache
  2022-10-04 12:53 ` [PATCH v3] mempool: fix get objects from mempool with cache Andrew Rybchenko
@ 2022-10-04 14:42   ` Morten Brørup
  0 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-04 14:42 UTC (permalink / raw)
  To: Andrew Rybchenko, Olivier Matz
  Cc: dev, Beilei Xing, Bruce Richardson, Jerin Jacob Kollanukkaran

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Tuesday, 4 October 2022 14.54
> To: Olivier Matz
> Cc: dev@dpdk.org; Morten Brørup; Beilei Xing; Bruce Richardson; Jerin
> Jacob Kollanukkaran
> Subject: [PATCH v3] mempool: fix get objects from mempool with cache
> 
> From: Morten Brørup <mb@smartsharesystems.com>
> 
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> Fix the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction
> of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> This bug caused a major performance degradation in scenarios where the
> application burst length is the same as the cache size. In such cases,
> the objects were never fetched from the mempool cache, regardless of
> whether they could have been.
> This scenario occurs e.g. if an application has configured a mempool
> with a size matching the application's burst size.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the ring.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the
> objects
> from the ring were returned ahead of the objects from the cache.
> This bug degraded application performance on CPUs with a small L1
> cache,
> which benefit from having the hot objects first in the returned array.
> (This is probably also the reason why the function returns the objects
> in reverse order, which it still does.)
> Now, all code paths first return objects from the cache, subsequently
> from the ring.
> 
> The function was not behaving as described (by the function using it)
> and expected by applications using it. This in itself is also a bug.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the ring (instead of only the
> number of requested objects minus the objects available in the cache),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the ring fails,
> only
> the remaining requested objects are retrieved from the ring.
> 
> The function would fail despite there being enough objects in the cache
> plus the common pool.
> 
> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---
> v3 changes (Andrew Rybchenko)
>  - Always get first objects from the cache even if request is bigger
>    than cache size. Remove one corresponding condition from the path
>    when request is fully served from cache.
>  - Simplify code to avoid duplication:
>     - Get objects directly from backend in single place only.
>     - Share code which gets from the cache first regardless of whether
>       everything is obtained from the cache or just the first part.
>  - Rollback cache length in unlikely failure branch to avoid cache
>    vs NULL check in success branch.
> 
> v2 changes
> - Do not modify description of return value. This belongs in a separate
> doc fix.
> - Elaborate even more on which bugs the modifications fix.
> 
>  lib/mempool/rte_mempool.h | 74 +++++++++++++++++++++++++--------------
>  1 file changed, 48 insertions(+), 26 deletions(-)

Thank you, Andrew.

I haven't compared the resulting assembler output (regarding performance), but I have carefully reviewed the resulting v3 source code for potential bugs in all code paths and for performance, and think it looks good.

The RTE_MIN() macro looks like it prefers the first parameter, so static branch prediction for len=RTE_MIN(remaining, cache->len) should be correct.
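
For context, RTE_MIN() in rte_common.h is defined along these lines (reproduced from memory, so treat the exact spelling as an approximation); the point is that the first operand sits on the "true" side of the conditional:

#define RTE_MIN(a, b) \
	__extension__ ({ \
		typeof (a) _a = (a); \
		typeof (b) _b = (b); \
		_a < _b ? _a : _b; \
	})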

You could consider adding likely() around (cache != NULL) near the bottom of the function, so it matches the unlikely(cache == NULL) at the top of the function; mainly for symmetry in the source code, as I expect it to be the compiler default anyway.

Also, you could add "remaining" to the comment:
/* Get the objects directly from the ring. */
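
Taken together, the two suggestions would make the tail of the function look roughly like this; it is only a sketch on top of the v3 diff above, not a new patch:

ring_dequeue:

	/* Get the remaining objects directly from the ring. */
	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, remaining);

	if (ret < 0) {
		if (likely(cache != NULL)) {
			cache->len = n - remaining;
			/* The objects left in the cache are intact. */
		}
		...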

As is, or with suggested modifications...

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v2] mempool: fix get objects from mempool with cache
  2022-10-04 12:57   ` Andrew Rybchenko
@ 2022-10-04 15:13     ` Morten Brørup
  2022-10-04 15:58       ` Andrew Rybchenko
  2022-10-06 13:43       ` Aaron Conole
  0 siblings, 2 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-04 15:13 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Andrew Rybchenko, olivier.matz, bruce.richardson, jerinjacobk,
	dev, Yuying Zhang, Beilei Xing

@Aaron, do you have any insights or comments on my curiosity below?

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Tuesday, 4 October 2022 14.58
> 
> Hi Morten,
> 
> In general I agree that the fix is required.
> In the v3 I sent, I'm trying to make it a bit better from my point of
> view. See a few notes below.

I stand by my review and acceptance of v3 - this message is not intended to change that! I'm just curious...

I wonder how accurate the automated performance tests ([v2], [v3]) are, and if they are comparable between February and October?

[v2]: http://mails.dpdk.org/archives/test-report/2022-February/256462.html
[v3]: http://mails.dpdk.org/archives/test-report/2022-October/311526.html


Ubuntu 20.04
Kernel: 4.15.0-generic
Compiler: gcc 7.4
NIC: Intel Corporation Ethernet Converged Network Adapter XL710-QDA2 40000 Mbps
Target: x86_64-native-linuxapp-gcc
Fail/Total: 0/4

Detail performance results:
** V2 **:
+----------+-------------+---------+------------+------------------------------+
| num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
|          |             |         |            |           expected           |
+==========+=============+=========+============+==============================+
| 1        | 2           | 512     | 64         | 0.5%                         |
+----------+-------------+---------+------------+------------------------------+
| 1        | 2           | 2048    | 64         | -1.5%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 512     | 64         | 4.3%                         |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 2048    | 64         | 10.9%                        |
+----------+-------------+---------+------------+------------------------------+

** V3 **:
+----------+-------------+---------+------------+------------------------------+
| num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
|          |             |         |            |           expected           |
+==========+=============+=========+============+==============================+
| 1        | 2           | 512     | 64         | -0.7%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 2           | 2048    | 64         | -2.3%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 512     | 64         | 0.5%                         |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 2048    | 64         | 7.9%                         |
+----------+-------------+---------+------------+------------------------------+


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v2] mempool: fix get objects from mempool with cache
  2022-10-04 15:13     ` Morten Brørup
@ 2022-10-04 15:58       ` Andrew Rybchenko
  2022-10-04 18:09         ` Morten Brørup
  2022-10-06 13:43       ` Aaron Conole
  1 sibling, 1 reply; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-04 15:58 UTC (permalink / raw)
  To: Morten Brørup, Aaron Conole
  Cc: olivier.matz, bruce.richardson, jerinjacobk, dev, Yuying Zhang,
	Beilei Xing

On 10/4/22 18:13, Morten Brørup wrote:
> @Aaron, do you have any insights or comments on my curiosity below?
> 
>> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
>> Sent: Tuesday, 4 October 2022 14.58
>>
>> Hi Morten,
>>
>> In general I agree that the fix is required.
>> In the v3 I sent, I'm trying to make it a bit better from my point of
>> view. See a few notes below.
> 
> I stand by my review and acceptance of v3 - this message is not intended to change that! I'm just curious...
> 
> I wonder how accurate the automated performance tests ([v2], [v3]) are, and if they are comparable between February and October?
> 
> [v2]: http://mails.dpdk.org/archives/test-report/2022-February/256462.html
> [v3]: http://mails.dpdk.org/archives/test-report/2022-October/311526.html
> 
> 
> Ubuntu 20.04
> Kernel: 4.15.0-generic
> Compiler: gcc 7.4
> NIC: Intel Corporation Ethernet Converged Network Adapter XL710-QDA2 40000 Mbps
> Target: x86_64-native-linuxapp-gcc
> Fail/Total: 0/4
> 
> Detail performance results:
> ** V2 **:
> +----------+-------------+---------+------------+------------------------------+
> | num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
> |          |             |         |            |           expected           |
> +==========+=============+=========+============+==============================+
> | 1        | 2           | 512     | 64         | 0.5%                         |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 2           | 2048    | 64         | -1.5%                        |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 1           | 512     | 64         | 4.3%                         |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 1           | 2048    | 64         | 10.9%                        |
> +----------+-------------+---------+------------+------------------------------+
> 
> ** V3 **:
> +----------+-------------+---------+------------+------------------------------+
> | num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
> |          |             |         |            |           expected           |
> +==========+=============+=========+============+==============================+
> | 1        | 2           | 512     | 64         | -0.7%                        |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 2           | 2048    | 64         | -2.3%                        |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 1           | 512     | 64         | 0.5%                         |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 1           | 2048    | 64         | 7.9%                         |
> +----------+-------------+---------+------------+------------------------------+
> 

Very interesting. Maybe it makes sense to send your patch and
mine once again to check the current figures and the stability of the results.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v2] mempool: fix get objects from mempool with cache
  2022-02-02  8:14 ` [PATCH v2] mempool: fix get objects from " Morten Brørup
  2022-06-15 21:18   ` Morten Brørup
  2022-10-04 12:57   ` Andrew Rybchenko
@ 2022-10-04 16:03   ` Morten Brørup
  2022-10-04 16:36   ` Morten Brørup
  2022-10-04 16:39   ` Morten Brørup
  4 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-04 16:03 UTC (permalink / raw)
  To: andrew.rybchenko, Aaron Conole
  Cc: olivier.matz, bruce.richardson, jerinjacobk, dev, Yuying Zhang,
	Beilei Xing

RESENT for test purposes.

A flush threshold for the mempool cache was introduced in DPDK version 1.3, but rte_mempool_do_generic_get() was not completely updated back then, and some inefficiencies were introduced.

This patch fixes the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size, which was correct before, but became irrelevant with the introduction of the flush threshold. E.g. the cache can hold up to flushthresh objects, which is more than its size, so some requests were not served from the cache, even though they could be.
The initial screening has now been corrected to match the initial screening in rte_mempool_do_generic_put(), which verifies that a cache is present, and that the length of the request does not overflow the memory allocated for the cache.

This bug caused a major performance degradation in scenarios where the application burst length is the same as the cache size. In such cases, the objects were never fetched from the mempool cache, regardless of whether they could have been.
This scenario occurs e.g. if an application has configured a mempool with a size matching the application's burst size.

2. The function is a helper for rte_mempool_generic_get(), so it must behave according to the description of that function.
Specifically, objects must first be returned from the cache, subsequently from the ring.
After the change in DPDK version 1.3, this was not the behavior when the request was partially satisfied from the cache; instead, the objects from the ring were returned ahead of the objects from the cache.
This bug degraded application performance on CPUs with a small L1 cache, which benefit from having the hot objects first in the returned array.
(This is probably also the reason why the function returns the objects in reverse order, which it still does.) Now, all code paths first return objects from the cache, subsequently from the ring.

The function was not behaving as described (by the function using it) and expected by applications using it. This in itself is also a bug.

3. If the cache could not be backfilled, the function would attempt to get all the requested objects from the ring (instead of only the number of requested objects minus the objects available in the cache), and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache, and if the subsequent backfilling of the cache from the ring fails, only the remaining requested objects are retrieved from the ring.

The function would fail despite there being enough objects in the cache plus the common pool.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache was treated as unlikely. Now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were added and subtracted from the cache length; now this code path simply sets the cache length to its final value.

v2 changes
- Do not modify description of return value. This belongs in a separate doc fix.
- Elaborate even more on which bugs the modifications fix.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 75 ++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 21 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h index 1e7a3c1527..2898c690b0 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1463,38 +1463,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
+	cache_objs = &cache->objs[cache->len];
+
+	if (n <= cache->len) {
+		/* The entire request can be satisfied from the cache. */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+		return 0;
+	}
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
+	/* Satisfy the first part of the request by depleting the cache. */
+	len = cache->len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
+
+	/* Number of objects remaining to satisfy the request. */
+	len = n - len;
+
+	/* Fill the cache from the ring; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + len);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the ring.
+		 */
+		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
 		if (unlikely(ret < 0)) {
 			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
+			 * That also failed.
+			 * No further action is required to roll the first
+			 * part of the request back into the cache, as both
+			 * cache->len and the objects in the cache are intact.
 			 */
-			goto ring_dequeue;
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
+
+			return ret;
 		}
 
-		cache->len += req;
+		/* Commit that the cache was emptied. */
+		cache->len = 0;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	cache_objs = &cache->objs[cache->size + len];
 
-	cache->len -= n;
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache->len = cache->size;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n); @@ -1503,7 +1536,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring. */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {
--
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v2] mempool: fix get objects from mempool with cache
  2022-02-02  8:14 ` [PATCH v2] mempool: fix get objects from " Morten Brørup
                     ` (2 preceding siblings ...)
  2022-10-04 16:03   ` Morten Brørup
@ 2022-10-04 16:36   ` Morten Brørup
  2022-10-04 16:39   ` Morten Brørup
  4 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-04 16:36 UTC (permalink / raw)
  To: andrew.rybchenko, Aaron Conole
  Cc: olivier.matz, bruce.richardson, jerinjacobk, dev, Yuying Zhang,
	Beilei Xing

RESENT for test purposes.

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.

This patch fixes the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

This bug caused a major performance degradation in scenarios where the
application burst length is the same as the cache size. In such cases,
the objects were never fetched from the mempool cache, regardless of
whether they could have been.
This scenario occurs e.g. if an application has configured a mempool
with a size matching the application's burst size.

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the ring.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the ring were returned ahead of the objects from the cache.
This bug degraded application performance on CPUs with a small L1 cache,
which benefit from having the hot objects first in the returned array.
(This is probably also the reason why the function returns the objects
in reverse order, which it still does.)
Now, all code paths first return objects from the cache, subsequently
from the ring.

The function was not behaving as described (by the function using it)
and expected by applications using it. This in itself is also a bug.

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the ring (instead of only the
number of requested objects minus the objects available in the cache),
and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the ring fails, only
the remaining requested objects are retrieved from the ring.

The function would fail despite there being enough objects in the cache
plus the common pool.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were
added and subtracted from the cache length; now this code path simply
sets the cache length to its final value.

v2 changes
- Do not modify description of return value. This belongs in a separate
doc fix.
- Elaborate even more on which bugs the modifications fix.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 75 ++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 21 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..2898c690b0 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1463,38 +1463,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
+	cache_objs = &cache->objs[cache->len];
+
+	if (n <= cache->len) {
+		/* The entire request can be satisfied from the cache. */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+		return 0;
+	}
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
+	/* Satisfy the first part of the request by depleting the cache. */
+	len = cache->len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
+
+	/* Number of objects remaining to satisfy the request. */
+	len = n - len;
+
+	/* Fill the cache from the ring; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + len);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the ring.
+		 */
+		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
 		if (unlikely(ret < 0)) {
 			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
+			 * That also failed.
+			 * No further action is required to roll the first
+			 * part of the request back into the cache, as both
+			 * cache->len and the objects in the cache are intact.
 			 */
-			goto ring_dequeue;
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
+
+			return ret;
 		}
 
-		cache->len += req;
+		/* Commit that the cache was emptied. */
+		cache->len = 0;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	cache_objs = &cache->objs[cache->size + len];
 
-	cache->len -= n;
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache->len = cache->size;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1536,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring. */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v2] mempool: fix get objects from mempool with cache
  2022-02-02  8:14 ` [PATCH v2] mempool: fix get objects from " Morten Brørup
                     ` (3 preceding siblings ...)
  2022-10-04 16:36   ` Morten Brørup
@ 2022-10-04 16:39   ` Morten Brørup
  4 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-04 16:39 UTC (permalink / raw)
  To: andrew.rybchenko, Aaron Conole
  Cc: olivier.matz, bruce.richardson, jerinjacobk, dev, Yuying Zhang,
	Beilei Xing

RESENT for test purposes.

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.

This patch fixes the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

This bug caused a major performance degradation in scenarios where the
application burst length is the same as the cache size. In such cases,
the objects were never fetched from the mempool cache, regardless of
whether they could have been.
This scenario occurs e.g. if an application has configured a mempool
with a size matching the application's burst size.

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the ring.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the ring were returned ahead of the objects from the cache.
This bug degraded application performance on CPUs with a small L1 cache,
which benefit from having the hot objects first in the returned array.
(This is probably also the reason why the function returns the objects
in reverse order, which it still does.)
Now, all code paths first return objects from the cache, subsequently
from the ring.

The function was not behaving as described (by the function using it)
and expected by applications using it. This in itself is also a bug.

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the ring (instead of only the
number of requested objects minus the objects available in the cache),
and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the ring fails, only
the remaining requested objects are retrieved from the ring.

The function would fail despite there being enough objects in the cache
plus the common pool.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were
added and subtracted from the cache length; now this code path simply
sets the cache length to its final value.

v2 changes
- Do not modify description of return value. This belongs in a separate
doc fix.
- Elaborate even more on which bugs the modifications fix.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 75 ++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 21 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..2898c690b0 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1463,38 +1463,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
+	cache_objs = &cache->objs[cache->len];
+
+	if (n <= cache->len) {
+		/* The entire request can be satisfied from the cache. */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+		return 0;
+	}
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
+	/* Satisfy the first part of the request by depleting the cache. */
+	len = cache->len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
+
+	/* Number of objects remaining to satisfy the request. */
+	len = n - len;
+
+	/* Fill the cache from the ring; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + len);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the ring.
+		 */
+		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
 		if (unlikely(ret < 0)) {
 			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
+			 * That also failed.
+			 * No further action is required to roll the first
+			 * part of the request back into the cache, as both
+			 * cache->len and the objects in the cache are intact.
 			 */
-			goto ring_dequeue;
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
+
+			return ret;
 		}
 
-		cache->len += req;
+		/* Commit that the cache was emptied. */
+		cache->len = 0;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	cache_objs = &cache->objs[cache->size + len];
 
-	cache->len -= n;
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache->len = cache->size;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1536,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring. */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v2] mempool: fix get objects from mempool with cache
  2022-10-04 15:58       ` Andrew Rybchenko
@ 2022-10-04 18:09         ` Morten Brørup
  0 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-04 18:09 UTC (permalink / raw)
  To: Andrew Rybchenko, Aaron Conole
  Cc: olivier.matz, bruce.richardson, jerinjacobk, dev, Yuying Zhang,
	Beilei Xing

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Tuesday, 4 October 2022 17.59
> 
> On 10/4/22 18:13, Morten Brørup wrote:
> > @Aaron, do you have any insights or comments to my curiosity below?
> >
> >> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> >> Sent: Tuesday, 4 October 2022 14.58
> >>
> >> Hi Morten,
> >>
> >> In general I agree that the fix is required.
> >> In sent v3 I'm trying to make it a bit better from my point of
> >> view. See few notes below.
> >
> > I stand by my review and accept of v3 - this message is not intended to change that! I'm just curious...
> >
> > I wonder how accurate the automated performance tests ([v2], [v3]) are, and if they are comparable between February and October?
> >
> > [v2]: http://mails.dpdk.org/archives/test-report/2022-February/256462.html
> > [v3]: http://mails.dpdk.org/archives/test-report/2022-October/311526.html
> >
> >
> > Ubuntu 20.04
> > Kernel: 4.15.0-generic
> > Compiler: gcc 7.4
> > NIC: Intel Corporation Ethernet Converged Network Adapter XL710-QDA2 40000 Mbps
> > Target: x86_64-native-linuxapp-gcc
> > Fail/Total: 0/4
> >
> > Detail performance results:
> > ** V2 **:
> > +----------+-------------+---------+------------+------------------------------+
> > | num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
> > |          |             |         |            |           expected           |
> > +==========+=============+=========+============+==============================+
> > | 1        | 2           | 512     | 64         | 0.5%                         |
> > +----------+-------------+---------+------------+------------------------------+
> > | 1        | 2           | 2048    | 64         | -1.5%                        |
> > +----------+-------------+---------+------------+------------------------------+
> > | 1        | 1           | 512     | 64         | 4.3%                         |
> > +----------+-------------+---------+------------+------------------------------+
> > | 1        | 1           | 2048    | 64         | 10.9%                        |
> > +----------+-------------+---------+------------+------------------------------+
> >
> > ** V3 **:
> > +----------+-------------+---------+------------+------------------------------+
> > | num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
> > |          |             |         |            |           expected           |
> > +==========+=============+=========+============+==============================+
> > | 1        | 2           | 512     | 64         | -0.7%                        |
> > +----------+-------------+---------+------------+------------------------------+
> > | 1        | 2           | 2048    | 64         | -2.3%                        |
> > +----------+-------------+---------+------------+------------------------------+
> > | 1        | 1           | 512     | 64         | 0.5%                         |
> > +----------+-------------+---------+------------+------------------------------+
> > | 1        | 1           | 2048    | 64         | 7.9%                         |
> > +----------+-------------+---------+------------+------------------------------+
> >
> 
> Very interesting, maybe it makes sense to send your patch and
> mine once again to check current figures and results stability.
> 

V2 retest:

http://mails.dpdk.org/archives/test-report/2022-October/311609.html

Ubuntu 20.04
Kernel: 4.15.0-generic
Compiler: gcc 7.4
NIC: Intel Corporation Ethernet Converged Network Adapter XL710-QDA2 40000 Mbps
Target: x86_64-native-linuxapp-gcc
Fail/Total: 0/4

Detail performance results: 
+----------+-------------+---------+------------+------------------------------+
| num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
|          |             |         |            |           expected           |
+==========+=============+=========+============+==============================+
| 1        | 2           | 512     | 64         | 0.1%                         |
+----------+-------------+---------+------------+------------------------------+
| 1        | 2           | 2048    | 64         | -3.0%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 512     | 64         | -3.7%                        |
+----------+-------------+---------+------------+------------------------------+
| 1        | 1           | 2048    | 64         | 8.1%                         |
+----------+-------------+---------+------------+------------------------------+

Probably not very accurate.

And 8.1% is very close to the V3 7.9% - nothing to worry about, compared to the previous v2 test result of 10.9%.

-Morten


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v4] mempool: fix mempool cache flushing algorithm
  2022-02-02 10:33 ` [PATCH v4] mempool: fix mempool cache flushing algorithm Morten Brørup
  2022-04-07  9:04   ` Morten Brørup
@ 2022-10-04 20:01   ` Morten Brørup
  2022-10-09 11:11   ` [PATCH 1/2] mempool: check driver enqueue result in one place Andrew Rybchenko
  2022-10-09 13:19   ` [PATCH v4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
  3 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-04 20:01 UTC (permalink / raw)
  To: andrew.rybchenko; +Cc: bruce.richardson, jerinjacobk, dev, olivier.matz

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 2 February 2022 11.34
> 
> This patch fixes the rte_mempool_do_generic_put() caching algorithm,
> which was fundamentally wrong, causing multiple performance issues when
> flushing.
> 
> Although the bugs do have serious performance implications when
> flushing, the function did not fail when flushing (or otherwise).
> Backporting could be considered optional.
> 
> The algorithm was:
>  1. Add the objects to the cache
>  2. Anything greater than the cache size (if it crosses the cache flush
>     threshold) is flushed to the ring.
> 
> Please note that the description in the source code said that it kept
> "cache min value" objects after flushing, but the function actually
> kept
> the cache full after flushing, which the above description reflects.
> 
> Now, the algorithm is:
>  1. If the objects cannot be added to the cache without crossing the
>     flush threshold, flush the cache to the ring.
>  2. Add the objects to the cache.
> 
> This patch fixes these bugs:
> 
> 1. The cache was still full after flushing.
> In the opposite direction, i.e. when getting objects from the cache,
> the
> cache is refilled to full level when it crosses the low watermark
> (which
> happens to be zero).
> Similarly, the cache should be flushed to empty level when it crosses
> the high watermark (which happens to be 1.5 x the size of the cache).
> The existing flushing behaviour was suboptimal for real applications,
> because crossing the low or high watermark typically happens when the
> application is in a state where the number of put/get events is out of
> balance, e.g. when absorbing a burst of packets into a QoS queue
> (getting more mbufs from the mempool), or when a burst of packets is
> trickling out from the QoS queue (putting the mbufs back into the
> mempool).
> Now, the mempool cache is completely flushed when crossing the flush
> threshold, so only the newly put (hot) objects remain in the mempool
> cache afterwards.
> 
> This bug degraded performance by causing too frequent flushing.
> 
> Consider this application scenario:
> 
> Either, an lcore thread in the application is in a state of balance,
> where it uses the mempool cache within its flush/refill boundaries; in
> this situation, the flush method is less important, and this fix is
> irrelevant.
> 
> Or, an lcore thread in the application is out of balance (either
> permanently or temporarily), and mostly gets or puts objects from/to
> the
> mempool. If it mostly puts objects, not flushing all of the objects
> will
> cause more frequent flushing. This is the scenario addressed by this
> fix. E.g.:
> 
> Cache size=256, flushthresh=384 (1.5x size), initial len=256;
> application burst len=32.
> 
> If there are "size" objects in the cache after flushing, the cache is
> flushed at every 4th burst.
> 
> If the cache is flushed completely, the cache is only flushed at every
> 16th burst.
> 
> As you can see, this bug caused the cache to be flushed 4x too
> frequently in this example.
> 
> And when/if the application thread breaks its pattern of continuously
> putting objects, and suddenly starts to get objects instead, it will
> either get objects already in the cache, or the get() function will
> refill the cache.
> 
> The concept of not flushing the cache completely was probably based on
> an assumption that it is more likely for an application's lcore thread
> to get() after flushing than to put() after flushing.
> I strongly disagree with this assumption! If an application thread is
> continuously putting so much that it overflows the cache, it is much
> more likely to keep putting than it is to start getting. If in doubt,
> consider how CPU branch predictors work: When the application has done
> something many times consecutively, the branch predictor will expect
> the
> application to do the same again, rather than suddenly do something
> else.
> 
> Also, if you consider the description of the algorithm in the source
> code, and agree that "cache min value" cannot mean "cache size", the
> function did not behave as intended. This in itself is a bug.
> 
> 2. The flush threshold comparison was off by one.
> It must be "len > flushthresh", not "len >= flushthresh".
> Consider a flush multiplier of 1 instead of 1.5; the cache would be
> flushed already when reaching size objects, not when exceeding size
> objects. In other words, the cache would not be able to hold "size"
> objects, which is clearly a bug.
> Now, flushing is triggered when the flush threshold is exceeded, not
> when reached.
> 
> This bug degraded performance due to premature flushing. In my example
> above, this bug caused flushing every 3rd burst instead of every 4th.
> 
> 3. The most recent (hot) objects were flushed, leaving the oldest
> (cold)
> objects in the mempool cache.
> This bug degraded performance, because flushing prevented immediate
> reuse of the (hot) objects already in the CPU cache.
> Now, the existing (cold) objects in the mempool cache are flushed
> before
> the new (hot) objects are added to the mempool cache.
> 
> 4. With RTE_LIBRTE_MEMPOOL_DEBUG defined, the return value of
> rte_mempool_ops_enqueue_bulk() was not checked when flushing the cache.
> Now, it is checked in both locations where used; and obviously still
> only if RTE_LIBRTE_MEMPOOL_DEBUG is defined.
> 
> v2 changes:
> 
> - Not adding the new objects to the mempool cache before flushing it
> also allows the memory allocated for the mempool cache to be reduced
> from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
> However, such a change would break the ABI, so it was removed in v2.
> 
> - The mempool cache should be cache line aligned for the benefit of the
> copying method, which on some CPU architectures performs worse on data
> crossing a cache boundary.
> However, such a change would break the ABI, so it was removed in v2;
> and yet another alternative copying method replaced the rte_memcpy().
> 
> v3 changes:
> 
> - Actually remove my modifications of the rte_mempool_cache structure.
> 
> v4 changes:
> 
> - Updated patch title to reflect that the scope of the patch is only
> mempool cache flushing.
> 
> - Do not replace rte_memcpy() with alternative copying method. This was
> a pure optimization, not a fix.
> 
> - Elaborate even more on the bugs fixed by the modifications.
> 
> - Added 4th bullet item to the patch description, regarding
> rte_mempool_ops_enqueue_bulk() with RTE_LIBRTE_MEMPOOL_DEBUG.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
>  1 file changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..e7e09e48fc 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
>  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>  		goto ring_enqueue;
> 
> -	cache_objs = &cache->objs[cache->len];
> +	/* If the request itself is too big for the cache */
> +	if (unlikely(n > cache->flushthresh))
> +		goto ring_enqueue;
> 
>  	/*
>  	 * The cache follows the following algorithm
> -	 *   1. Add the objects to the cache
> -	 *   2. Anything greater than the cache min value (if it crosses
> the
> -	 *   cache flush threshold) is flushed to the ring.
> +	 *   1. If the objects cannot be added to the cache without
> +	 *   crossing the flush threshold, flush the cache to the ring.
> +	 *   2. Add the objects to the cache.
>  	 */
> 
> -	/* Add elements back into the cache */
> -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> +	if (cache->len + n <= cache->flushthresh) {
> +		cache_objs = &cache->objs[cache->len];
> 
> -	cache->len += n;
> +		cache->len += n;
> +	} else {
> +		cache_objs = &cache->objs[0];
> 
> -	if (cache->len >= cache->flushthresh) {
> -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> -				cache->len - cache->size);
> -		cache->len = cache->size;
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> >len) < 0)
> +			rte_panic("cannot put objects in mempool\n");
> +#else
> +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> +#endif
> +		cache->len = n;
>  	}
> 
> +	/* Add the objects to the cache. */
> +	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
> +
>  	return;
> 
>  ring_enqueue:
> 
> -	/* push remaining objects in ring */
> +	/* Put the objects into the ring */
>  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
>  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
>  		rte_panic("cannot put objects in mempool\n");
> --
> 2.17.1

Andrew, would you please also take a look at this patch and share your opinion.

I guess that the most controversial change in the patch is that it leaves the mempool cache nearly empty after flushing it.

Without the patch, the mempool cache is left full (at 100% size) after flushing. (Flushing is triggered by crossing the flush threshold, which is 50% above the cache size. This is not changed by the patch.)

As described in the patch, I consider this behavior incorrect: In periods where an application is sending more from its QoS queues than goes into the QoS queues, the mempool_put() function is called more often than the mempool_get() function, so there will naturally be consecutive cache flushing.

Many applications use QoS queues or similar traffic shapers, so mempool cache flushing is not as infrequent and exotic as some might think! (And flushing a burst of packets from the mempool cache to the underlying mempool is considered costly.)

Without the patch, consecutive cache flushing will be processed as many small flushes, because only the 50% objects above the cache size (the objects between the cache size and the cache threshold) are flushed each time.

With the patch, the flushes will be fewer and larger, because the full 150% cache size (every object in the cache up to the cache threshold) will be flushed each time.
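
To make the difference concrete, here is a stand-alone toy model of the
two flushing strategies (simplified; the constants and the
backend_enqueue() stub are made up for illustration, this is not the
actual DPDK code):

	#include <stdio.h>

	#define CACHE_SIZE  256
	#define FLUSHTHRESH 384 /* 1.5 x CACHE_SIZE */

	static unsigned int cache_len; /* objects currently in the cache */

	/* Stand-in for enqueueing objects to the underlying mempool. */
	static void backend_enqueue(unsigned int count)
	{
		printf("flushing %u objects\n", count);
	}

	/* Old behaviour: append first, then trim the cache back to its size. */
	static void put_old(unsigned int n)
	{
		cache_len += n;
		if (cache_len >= FLUSHTHRESH) {
			backend_enqueue(cache_len - CACHE_SIZE);
			cache_len = CACHE_SIZE; /* cache left 100% full */
		}
	}

	/* Patched behaviour: flush everything first if n would not fit. */
	static void put_new(unsigned int n)
	{
		if (cache_len + n > FLUSHTHRESH) {
			backend_enqueue(cache_len);
			cache_len = 0; /* cache left empty */
		}
		cache_len += n; /* only the newly put (hot) objects remain */
	}

	int main(void)
	{
		cache_len = 380; /* nearly at the flush threshold */
		put_old(32);     /* flushes 156 objects, 256 stay cached */

		cache_len = 380;
		put_new(32);     /* flushes 380 objects, only the 32 hot ones stay */
		return 0;
	}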

PS: Bruce and I discussed this patch back in April, but didn't reach a conclusion. You might find some insights in that mail thread.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v2] mempool: fix get objects from mempool with cache
  2022-10-04 15:13     ` Morten Brørup
  2022-10-04 15:58       ` Andrew Rybchenko
@ 2022-10-06 13:43       ` Aaron Conole
  1 sibling, 0 replies; 85+ messages in thread
From: Aaron Conole @ 2022-10-06 13:43 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Andrew Rybchenko, olivier.matz, bruce.richardson, jerinjacobk,
	dev, Yuying Zhang, Beilei Xing

Morten Brørup <mb@smartsharesystems.com> writes:

> @Aaron, do you have any insights or comments to my curiosity below?

Sorry, the perf tests from Feb to Oct should *generally* be comparable
but keep in mind that they are based on different baseline versions of
DPDK.  Also, the perf tests are done as thresholds rather than hard
limits (because there can be some minor variations run to run, iirc).

>> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
>> Sent: Tuesday, 4 October 2022 14.58
>> 
>> Hi Morten,
>> 
>> In general I agree that the fix is required.
>> In sent v3 I'm trying to make it a bit better from my point of
>> view. See few notes below.
>
> I stand by my review and accept of v3 - this message is not intended to change that! I'm just curious...
>
> I wonder how accurate the automated performance tests ([v2], [v3])
> are, and if they are comparable between February and October?
>
> [v2]: http://mails.dpdk.org/archives/test-report/2022-February/256462.html
> [v3]: http://mails.dpdk.org/archives/test-report/2022-October/311526.html
>
>
> Ubuntu 20.04
> Kernel: 4.15.0-generic
> Compiler: gcc 7.4
> NIC: Intel Corporation Ethernet Converged Network Adapter XL710-QDA2 40000 Mbps
> Target: x86_64-native-linuxapp-gcc
> Fail/Total: 0/4
>
> Detail performance results:
> ** V2 **:
> +----------+-------------+---------+------------+------------------------------+
> | num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
> |          |             |         |            |           expected           |
> +==========+=============+=========+============+==============================+
> | 1        | 2           | 512     | 64         | 0.5%                         |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 2           | 2048    | 64         | -1.5%                        |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 1           | 512     | 64         | 4.3%                         |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 1           | 2048    | 64         | 10.9%                        |
> +----------+-------------+---------+------------+------------------------------+
>
> ** V3 **:
> +----------+-------------+---------+------------+------------------------------+
> | num_cpus | num_threads | txd/rxd | frame_size |  throughput difference from  |
> |          |             |         |            |           expected           |
> +==========+=============+=========+============+==============================+
> | 1        | 2           | 512     | 64         | -0.7%                        |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 2           | 2048    | 64         | -2.3%                        |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 1           | 512     | 64         | 0.5%                         |
> +----------+-------------+---------+------------+------------------------------+
> | 1        | 1           | 2048    | 64         | 7.9%                         |
> +----------+-------------+---------+------------+------------------------------+


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v4] mempool: fix get objects from mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
                   ` (7 preceding siblings ...)
  2022-10-04 12:53 ` [PATCH v3] mempool: fix get objects from mempool with cache Andrew Rybchenko
@ 2022-10-07 10:44 ` Andrew Rybchenko
  2022-10-08 20:56   ` Thomas Monjalon
  2022-10-09 13:37 ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
  9 siblings, 1 reply; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-07 10:44 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Morten Brørup

From: Morten Brørup <mb@smartsharesystems.com>

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.

Fix the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

This bug caused a major performance degradation in scenarios where the
application burst length is the same as the cache size. In such cases,
the objects were never fetched from the mempool cache, even though
they could have been.
This scenario occurs e.g. if an application has configured a mempool
with a size matching the application's burst size.
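
As a toy illustration of the screening change (stand-alone, not DPDK
code; the numbers are made up and RTE_MEMPOOL_CACHE_MAX_SIZE is assumed
to be 512 here):

	#include <stdbool.h>
	#include <stdio.h>

	#define RTE_MEMPOOL_CACHE_MAX_SIZE 512 /* assumed compile-time limit */

	int main(void)
	{
		unsigned int cache_size = 256; /* cache->size */
		unsigned int n = 256;          /* burst equals the cache size */

		/* Old screening: bypasses the cache, although the cache may
		 * hold up to flushthresh (1.5 x size = 384) objects. */
		bool old_bypass = (n >= cache_size);

		/* New screening: only bypasses the cache if the request could
		 * never fit in the memory allocated for the cache. */
		bool new_bypass = (n > RTE_MEMPOOL_CACHE_MAX_SIZE);

		printf("old bypasses cache: %d, new bypasses cache: %d\n",
		       old_bypass, new_bypass); /* prints "1" and "0" */
		return 0;
	}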

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the backend.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the backend were returned ahead of the objects from the cache.
This bug degraded application performance on CPUs with a small L1 cache,
which benefit from having the hot objects first in the returned array.
(This is probably also the reason why the function returns the objects
in reverse order, which it still does.)
Now, all code paths first return objects from the cache, subsequently
from the backend.

The function was not behaving as described (by the function using it)
and expected by applications using it. This in itself is also a bug.
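
For illustration, a tiny stand-alone example of the copy loop used by
this patch (the cache contents here are made up; newer objects sit at
higher indices), showing that the most recently cached - hot - objects
come out first, in reverse order:

	#include <stdio.h>

	int main(void)
	{
		/* Toy cache holding 5 objects; higher indices were put more
		 * recently (hotter). */
		void *objs[5] = { (void *)1, (void *)2, (void *)3,
				  (void *)4, (void *)5 };
		unsigned int len = 5, n = 3, index;
		void *obj_table[3];
		void **cache_objs = &objs[len];

		/* Same copy loop as in the diff below: the newest objects
		 * are returned first. */
		for (index = 0; index < n; index++)
			obj_table[index] = *--cache_objs;

		for (index = 0; index < n; index++)
			printf("%p ", obj_table[index]); /* 0x5 0x4 0x3 */
		printf("\n");
		return 0;
	}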

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the backend (instead of only the
number of requested objects minus the objects available in the cache),
and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the backend fails,
only the remaining requested objects are retrieved from the backend.

The function could fail despite there being enough objects in the cache
plus the common pool.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
---
v4 changes (Andrew Rybchenko)
 - Avoid usage of the misleading term ring, since other mempool drivers
   exist; use the term backend
 - Avoid term ring in goto label, use driver_dequeue as a label name
 - Add likely() to cache != NULL in driver dequeue, just for symmetry
 - Highlight that remaining objects are dequeued from the driver

v3 changes (Andrew Rybchenko)
 - Always get first objects from the cache even if request is bigger
   than cache size. Remove one corresponding condition from the path
   when request is fully served from cache.
 - Simplify code to avoid duplication:
    - Get objects directly from backend in single place only.
    - Share code which gets from the cache first, regardless of whether
      everything is obtained from the cache or just the first part.
 - Rollback cache length in unlikely failure branch to avoid cache
   vs NULL check in success branch.

v2 changes
- Do not modify description of return value. This belongs in a separate
doc fix.
- Elaborate even more on which bugs the modifications fix.

 lib/mempool/rte_mempool.h | 80 +++++++++++++++++++++++++--------------
 1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 4c4af2a8ed..2401c4ac80 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1435,60 +1435,82 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  * @return
  *   - >=0: Success; number of objects supplied.
- *   - <0: Error; code of ring dequeue function.
+ *   - <0: Error; code of driver dequeue function.
  */
 static __rte_always_inline int
 rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
+	unsigned int remaining = n;
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
-		goto ring_dequeue;
+	/* No cache provided */
+	if (unlikely(cache == NULL))
+		goto driver_dequeue;
 
-	cache_objs = cache->objs;
+	/* Use the cache as much as we have to return hot objects first */
+	len = RTE_MIN(remaining, cache->len);
+	cache_objs = &cache->objs[cache->len];
+	cache->len -= len;
+	remaining -= len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+	if (remaining == 0) {
+		/* The entire request is satisfied from the cache. */
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
-		if (unlikely(ret < 0)) {
-			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
-			 */
-			goto ring_dequeue;
-		}
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-		cache->len += req;
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	/* if dequeue below would overflow mem allocated for cache */
+	if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+		goto driver_dequeue;
+
+	/* Fill the cache from the backend; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + remaining);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the backend.
+		 */
+		goto driver_dequeue;
+	}
+
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache_objs = &cache->objs[cache->size + remaining];
+	for (index = 0; index < remaining; index++)
+		*obj_table++ = *--cache_objs;
 
-	cache->len -= n;
+	cache->len = cache->size;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
 	return 0;
 
-ring_dequeue:
+driver_dequeue:
 
-	/* get remaining objects from ring */
-	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
+	/* Get remaining objects directly from the backend. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, remaining);
 
 	if (ret < 0) {
+		if (likely(cache != NULL)) {
+			cache->len = n - remaining;
+			/*
+			 * No further action is required to roll the first part
+			 * of the request back into the cache, as objects in
+			 * the cache are intact.
+			 */
+		}
+
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
 	} else {
-- 
2.30.2


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4] mempool: fix get objects from mempool with cache
  2022-10-07 10:44 ` [PATCH v4] " Andrew Rybchenko
@ 2022-10-08 20:56   ` Thomas Monjalon
  2022-10-11 20:30     ` Copy-pasted code should be updated Morten Brørup
  2022-10-14 14:01     ` [PATCH v4] mempool: fix get objects from mempool with cache Olivier Matz
  0 siblings, 2 replies; 85+ messages in thread
From: Thomas Monjalon @ 2022-10-08 20:56 UTC (permalink / raw)
  To: Morten Brørup, Andrew Rybchenko; +Cc: Olivier Matz, dev

07/10/2022 12:44, Andrew Rybchenko:
> From: Morten Brørup <mb@smartsharesystems.com>
> 
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> Fix the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> This bug caused a major performance degradation in scenarios where the
> application burst length is the same as the cache size. In such cases,
> the objects were not ever fetched from the mempool cache, regardless if
> they could have been.
> This scenario occurs e.g. if an application has configured a mempool
> with a size matching the application's burst size.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the backend.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the objects
> from the backend were returned ahead of the objects from the cache.
> This bug degraded application performance on CPUs with a small L1 cache,
> which benefit from having the hot objects first in the returned array.
> (This is probably also the reason why the function returns the objects
> in reverse order, which it still does.)
> Now, all code paths first return objects from the cache, subsequently
> from the backend.
> 
> The function was not behaving as described (by the function using it)
> and expected by applications using it. This in itself is also a bug.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the backend (instead of only the
> number of requested objects minus the objects available in the cache),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the backend fails,
> only the remaining requested objects are retrieved from the backend.
> 
> The function could fail despite there being enough objects in the cache
> plus the common pool.
> 
> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>

Applied, thanks.




^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 1/2] mempool: check driver enqueue result in one place
  2022-02-02 10:33 ` [PATCH v4] mempool: fix mempool cache flushing algorithm Morten Brørup
  2022-04-07  9:04   ` Morten Brørup
  2022-10-04 20:01   ` Morten Brørup
@ 2022-10-09 11:11   ` Andrew Rybchenko
  2022-10-09 11:11     ` [PATCH 2/2] mempool: avoid usage of term ring on put Andrew Rybchenko
  2022-10-09 13:01     ` [PATCH 1/2] mempool: check driver enqueue result in one place Morten Brørup
  2022-10-09 13:19   ` [PATCH v4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
  3 siblings, 2 replies; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 11:11 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Morten Brørup

Enqueue operation must not fail. Move the corresponding debug check
from one particular case to the enqueue operation helper in order
to do it for all invocations.

Log a critical message with useful information instead of rte_panic().

Make the rte_mempool_do_generic_put() implementation more readable and
fix the inconsistency where the return value is checked in one place
but not in another.

Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 lib/mempool/rte_mempool.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 4c4af2a8ed..95d64901e5 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -786,12 +786,19 @@ rte_mempool_ops_enqueue_bulk(struct rte_mempool *mp, void * const *obj_table,
 		unsigned n)
 {
 	struct rte_mempool_ops *ops;
+	int ret;
 
 	RTE_MEMPOOL_STAT_ADD(mp, put_common_pool_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, put_common_pool_objs, n);
 	rte_mempool_trace_ops_enqueue_bulk(mp, obj_table, n);
 	ops = rte_mempool_get_ops(mp->ops_index);
-	return ops->enqueue(mp, obj_table, n);
+	ret = ops->enqueue(mp, obj_table, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+	if (unlikely(ret < 0))
+		RTE_LOG(CRIT, MEMPOOL, "cannot enqueue %u objects to mempool %s\n",
+			n, mp->name);
+#endif
+	return ret;
 }
 
 /**
@@ -1351,12 +1358,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 ring_enqueue:
 
 	/* push remaining objects in ring */
-#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
-	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
-		rte_panic("cannot put objects in mempool\n");
-#else
 	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
-#endif
 }
 
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH 2/2] mempool: avoid usage of term ring on put
  2022-10-09 11:11   ` [PATCH 1/2] mempool: check driver enqueue result in one place Andrew Rybchenko
@ 2022-10-09 11:11     ` Andrew Rybchenko
  2022-10-09 13:08       ` Morten Brørup
  2022-10-09 13:01     ` [PATCH 1/2] mempool: check driver enqueue result in one place Morten Brørup
  1 sibling, 1 reply; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 11:11 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Morten Brørup

The term ring is misleading since the ring is just the default, but
still only one of the possible drivers to store objects.

Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 lib/mempool/rte_mempool.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 95d64901e5..c2d4e8ba55 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1331,7 +1331,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 	/* No cache provided or if put would overflow mem allocated for cache */
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
-		goto ring_enqueue;
+		goto driver_enqueue;
 
 	cache_objs = &cache->objs[cache->len];
 
@@ -1339,7 +1339,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	 * The cache follows the following algorithm
 	 *   1. Add the objects to the cache
 	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   cache flush threshold) is flushed to the backend.
 	 */
 
 	/* Add elements back into the cache */
@@ -1355,9 +1355,9 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 	return;
 
-ring_enqueue:
+driver_enqueue:
 
-	/* push remaining objects in ring */
+	/* push objects to the backend */
 	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
 }
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH 1/2] mempool: check driver enqueue result in one place
  2022-10-09 11:11   ` [PATCH 1/2] mempool: check driver enqueue result in one place Andrew Rybchenko
  2022-10-09 11:11     ` [PATCH 2/2] mempool: avoid usage of term ring on put Andrew Rybchenko
@ 2022-10-09 13:01     ` Morten Brørup
  1 sibling, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-09 13:01 UTC (permalink / raw)
  To: Andrew Rybchenko, Olivier Matz; +Cc: dev

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Sunday, 9 October 2022 13.12
> 
> Enqueue operation must not fail. Move corresponding debug check
> from one particular case to dequeue operation helper in order
> to do it for all invocations.
> 
> Log critical message with useful information instead of rte_panic().
> 
> Make rte_mempool_do_generic_put() implementation more readable and
> fix incosistency when return value is not checked in one place and
> checked in another.
> 
> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---

Moving the debug check to cover all invocations is an improvement. Well spotted!

I have considered if panicking would still be appropriate instead of logging, and have come to the conclusion that I agree with that modification too.

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH 2/2] mempool: avoid usage of term ring on put
  2022-10-09 11:11     ` [PATCH 2/2] mempool: avoid usage of term ring on put Andrew Rybchenko
@ 2022-10-09 13:08       ` Morten Brørup
  2022-10-09 13:14         ` Andrew Rybchenko
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-09 13:08 UTC (permalink / raw)
  To: Andrew Rybchenko, Olivier Matz; +Cc: dev

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Sunday, 9 October 2022 13.12
> 
> Term ring is misleading since it is the default, but still just
> one of possible drivers to store objects.
> 
> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>

PS: The term "ring" (representing the backend) is used a few more times in the documentation parts of the file, but that's a fix for another day. :-)


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH 2/2] mempool: avoid usage of term ring on put
  2022-10-09 13:08       ` Morten Brørup
@ 2022-10-09 13:14         ` Andrew Rybchenko
  0 siblings, 0 replies; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 13:14 UTC (permalink / raw)
  To: Morten Brørup, Olivier Matz; +Cc: dev

On 10/9/22 16:08, Morten Brørup wrote:
>> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
>> Sent: Sunday, 9 October 2022 13.12
>>
>> Term ring is misleading since it is the default, but still just
>> one of possible drivers to store objects.
>>
>> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> ---
> 
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> 
> PS: The term "ring" (representing the backend) is used a few more times in the documentation parts of the file, but that's a fix for another day. :-)
> 

Yes, including the top-level description of how the mempool
works. Unfortunately I do not have enough time right now
to fix the description.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4] mempool: fix mempool cache flushing algorithm
  2022-02-02 10:33 ` [PATCH v4] mempool: fix mempool cache flushing algorithm Morten Brørup
                     ` (2 preceding siblings ...)
  2022-10-09 11:11   ` [PATCH 1/2] mempool: check driver enqueue result in one place Andrew Rybchenko
@ 2022-10-09 13:19   ` Andrew Rybchenko
  3 siblings, 0 replies; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 13:19 UTC (permalink / raw)
  To: Morten Brørup, olivier.matz; +Cc: bruce.richardson, jerinjacobk, dev

> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..e7e09e48fc 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>   	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>   		goto ring_enqueue;
>   
> -	cache_objs = &cache->objs[cache->len];
> +	/* If the request itself is too big for the cache */
> +	if (unlikely(n > cache->flushthresh))
> +		goto ring_enqueue;
>   

n is checked twice above, and the first check is not actually required;
just the later check is.
The check against RTE_MEMPOOL_CACHE_MAX_SIZE was required to ensure
that we do not overflow the cache objects array, since we copied
into the cache before flushing, but that is no longer the case.
We never cross the flush threshold now.

I'll send v5 where I split the most questionable part into
a separate patch at the end.



^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
                   ` (8 preceding siblings ...)
  2022-10-07 10:44 ` [PATCH v4] " Andrew Rybchenko
@ 2022-10-09 13:37 ` Andrew Rybchenko
  2022-10-09 13:37   ` [PATCH v6 1/4] mempool: check driver enqueue result in one place Andrew Rybchenko
                     ` (4 more replies)
  9 siblings, 5 replies; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 13:37 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Morten Brørup, Bruce Richardson

v6 changes (Andrew Rybchenko):

- Fix spelling

v5 changes (Andrew Rybchenko):

- Factor out cosmetic fixes into separate patches to make all
  patches smaller and easier to review
- Remove extra check as per review notes
- Factor out entire cache flushing into a separate patch.
  It keeps the logical changes separate, and is easier to bisect
  and revert.

v4 changes:

- Updated patch title to reflect that the scope of the patch is only
mempool cache flushing.

- Do not replace rte_memcpy() with alternative copying method. This was
a pure optimization, not a fix.

- Elaborate even more on the bugs fixed by the modifications.

- Added 4th bullet item to the patch description, regarding
rte_mempool_ops_enqueue_bulk() with RTE_LIBRTE_MEMPOOL_DEBUG.

v3 changes:

- Actually remove my modifications of the rte_mempool_cache structure.

v2 changes:

- Not adding the new objects to the mempool cache before flushing it
also allows the memory allocated for the mempool cache to be reduced
from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
However, such a change would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
copying method, which on some CPU architectures performs worse on data
crossing a cache boundary.
However, such a change would break the ABI, so it was removed in v2;
and yet another alternative copying method replaced the rte_memcpy().

Andrew Rybchenko (3):
  mempool: check driver enqueue result in one place
  mempool: avoid usage of term ring on put
  mempool: flush cache completely on overflow

Morten Brørup (1):
  mempool: fix cache flushing algorithm

 lib/mempool/rte_mempool.c |  5 ++++
 lib/mempool/rte_mempool.h | 55 ++++++++++++++++++++-------------------
 2 files changed, 33 insertions(+), 27 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v6 1/4] mempool: check driver enqueue result in one place
  2022-10-09 13:37 ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
@ 2022-10-09 13:37   ` Andrew Rybchenko
  2022-10-09 13:37   ` [PATCH v6 2/4] mempool: avoid usage of term ring on put Andrew Rybchenko
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 13:37 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Morten Brørup, Bruce Richardson

Enqueue operation must not fail. Move the corresponding debug check
from one particular case to the enqueue operation helper in order
to do it for all invocations.

Log a critical message with useful information instead of rte_panic().

Make the rte_mempool_do_generic_put() implementation more readable and
fix the inconsistency where the return value is checked in one place
but not in another.

Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 2401c4ac80..bc29d49aab 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -786,12 +786,19 @@ rte_mempool_ops_enqueue_bulk(struct rte_mempool *mp, void * const *obj_table,
 		unsigned n)
 {
 	struct rte_mempool_ops *ops;
+	int ret;
 
 	RTE_MEMPOOL_STAT_ADD(mp, put_common_pool_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, put_common_pool_objs, n);
 	rte_mempool_trace_ops_enqueue_bulk(mp, obj_table, n);
 	ops = rte_mempool_get_ops(mp->ops_index);
-	return ops->enqueue(mp, obj_table, n);
+	ret = ops->enqueue(mp, obj_table, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+	if (unlikely(ret < 0))
+		RTE_LOG(CRIT, MEMPOOL, "cannot enqueue %u objects to mempool %s\n",
+			n, mp->name);
+#endif
+	return ret;
 }
 
 /**
@@ -1351,12 +1358,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 ring_enqueue:
 
 	/* push remaining objects in ring */
-#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
-	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
-		rte_panic("cannot put objects in mempool\n");
-#else
 	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
-#endif
 }
 
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v6 2/4] mempool: avoid usage of term ring on put
  2022-10-09 13:37 ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
  2022-10-09 13:37   ` [PATCH v6 1/4] mempool: check driver enqueue result in one place Andrew Rybchenko
@ 2022-10-09 13:37   ` Andrew Rybchenko
  2022-10-09 13:37   ` [PATCH v6 3/4] mempool: fix cache flushing algorithm Andrew Rybchenko
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 13:37 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Morten Brørup, Bruce Richardson

The term ring is misleading since the ring is just the default, but
still only one of the possible drivers to store objects.

Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index bc29d49aab..a072e5554b 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1331,7 +1331,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 	/* No cache provided or if put would overflow mem allocated for cache */
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
-		goto ring_enqueue;
+		goto driver_enqueue;
 
 	cache_objs = &cache->objs[cache->len];
 
@@ -1339,7 +1339,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	 * The cache follows the following algorithm
 	 *   1. Add the objects to the cache
 	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   cache flush threshold) is flushed to the backend.
 	 */
 
 	/* Add elements back into the cache */
@@ -1355,9 +1355,9 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 	return;
 
-ring_enqueue:
+driver_enqueue:
 
-	/* push remaining objects in ring */
+	/* push objects to the backend */
 	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
 }
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v6 3/4] mempool: fix cache flushing algorithm
  2022-10-09 13:37 ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
  2022-10-09 13:37   ` [PATCH v6 1/4] mempool: check driver enqueue result in one place Andrew Rybchenko
  2022-10-09 13:37   ` [PATCH v6 2/4] mempool: avoid usage of term ring on put Andrew Rybchenko
@ 2022-10-09 13:37   ` Andrew Rybchenko
  2022-10-09 14:31     ` Morten Brørup
  2022-10-09 13:37   ` [PATCH v6 4/4] mempool: flush cache completely on overflow Andrew Rybchenko
  2022-10-10 15:21   ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Thomas Monjalon
  4 siblings, 1 reply; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 13:37 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Morten Brørup, Bruce Richardson

From: Morten Brørup <mb@smartsharesystems.com>

Fix the rte_mempool_do_generic_put() caching flushing algorithm to
keep hot objects in cache instead of cold ones.

The algorithm was:
 1. Add the objects to the cache.
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the backend.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
the cache full after flushing, which the above description reflects.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush some cached objects to the backend to
    free up required space.
 2. Add the objects to the cache.

The most recent (hot) objects were flushed, leaving the oldest (cold)
objects in the mempool cache. The bug degraded performance, because
flushing prevented immediate reuse of the (hot) objects already in
the CPU cache.  Now, the existing (cold) objects in the mempool cache
are flushed before the new (hot) objects are added the to the mempool
cache.

Since nearby code is touched anyway, fix the flush threshold comparison
to do the flushing if the threshold is really exceeded, not just reached.
I.e. it must be "len > flushthresh", not "len >= flushthresh".
Consider a flush multiplier of 1 instead of 1.5; the cache would be
flushed already when reaching size objects, not when exceeding size
objects. In other words, the cache would not be able to hold "size"
objects, which is clearly a bug. The bug could degrade performance
due to premature flushing.

Since we never exceed the flush threshold now, the cache objects array
in the mempool cache may be decreased from RTE_MEMPOOL_CACHE_MAX_SIZE * 3
to RTE_MEMPOOL_CACHE_MAX_SIZE * 2. In fact it could be
CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE), but the flush
threshold multiplier is internal.
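
A small worked example of the new flush path (a stand-alone sketch with
made-up numbers, mirroring the arithmetic in the diff below):

	#include <stdio.h>

	int main(void)
	{
		unsigned int size = 256, flushthresh = 384; /* cache->size, cache->flushthresh */
		unsigned int len = 300;  /* objects currently in the cache */
		unsigned int n = 100;    /* objects being put */

		if (len + n <= flushthresh) {
			len += n; /* fast path: just append to the cache */
		} else {
			/* Flush enough cold objects so that the kept objects
			 * plus the new ones add up to exactly "size". */
			unsigned int keep = (n >= size) ? 0 : (size - n);
			unsigned int flushed = len - keep;

			len = keep + n;
			printf("flushed %u cold objects, cache len is now %u\n",
			       flushed, len); /* flushed 144, len 256 */
		}
		return 0;
	}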

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 lib/mempool/rte_mempool.c |  5 +++++
 lib/mempool/rte_mempool.h | 43 +++++++++++++++++++++++----------------
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index de59009baf..4ba8ab7b63 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -746,6 +746,11 @@ rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
+	/* Check that cache have enough space for flush threshold */
+	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
+			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
+			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
+
 	cache->size = size;
 	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
 	cache->len = 0;
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index a072e5554b..e3364ed7b8 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -90,7 +90,7 @@ struct rte_mempool_cache {
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
 } __rte_cache_aligned;
 
 /**
@@ -1329,30 +1329,39 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
 
-	/* No cache provided or if put would overflow mem allocated for cache */
-	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* No cache provided or the request itself is too big for the cache */
+	if (unlikely(cache == NULL || n > cache->flushthresh))
 		goto driver_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
-
 	/*
-	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the backend.
+	 * The cache follows the following algorithm:
+	 *   1. If the objects cannot be added to the cache without crossing
+	 *      the flush threshold, flush the cache to the backend.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
-
-	cache->len += n;
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
+		cache->len += n;
+	} else {
+		unsigned int keep = (n >= cache->size) ? 0 : (cache->size - n);
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+		/*
+		 * If number of object to keep in the cache is positive:
+		 * keep = cache->size - n < cache->flushthresh - n < cache->len
+		 * since cache->flushthresh > cache->size.
+		 * If keep is 0, cache->len cannot be 0 anyway since
+		 * n <= cache->flushthresh and we'd no be here with
+		 * cache->len == 0.
+		 */
+		cache_objs = &cache->objs[keep];
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len - keep);
+		cache->len = keep + n;
 	}
 
+	/* Add the objects to the cache. */
+	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
+
 	return;
 
 driver_enqueue:
-- 
2.30.2


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v6 4/4] mempool: flush cache completely on overflow
  2022-10-09 13:37 ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
                     ` (2 preceding siblings ...)
  2022-10-09 13:37   ` [PATCH v6 3/4] mempool: fix cache flushing algorithm Andrew Rybchenko
@ 2022-10-09 13:37   ` Andrew Rybchenko
  2022-10-09 14:44     ` Morten Brørup
  2022-10-10 15:21   ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Thomas Monjalon
  4 siblings, 1 reply; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 13:37 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev, Morten Brørup, Bruce Richardson

The cache was still full after flushing. In the opposite direction,
i.e. when getting objects from the cache, the cache is refilled to full
level when it crosses the low watermark (which happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events are out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

This bug degraded performance by causing too frequent flushing.

Consider this application scenario:

Either, an lcore thread in the application is in a state of balance,
where it uses the mempool cache within its flush/refill boundaries; in
this situation, the flush method is less important, and this fix is
irrelevant.

Or, an lcore thread in the application is out of balance (either
permanently or temporarily), and mostly gets or puts objects from/to the
mempool. If it mostly puts objects, not flushing all of the objects will
cause more frequent flushing. This is the scenario addressed by this
fix. E.g.:

Cache size=256, flushthresh=384 (1.5x size), initial len=256;
application burst len=32.

If there are "size" objects in the cache after flushing, the cache is
flushed at every 4th burst.

If the cache is flushed completely, the cache is only flushed at every
16th burst.

As you can see, this bug caused the cache to be flushed 4x too
frequently in this example.

And when/if the application thread breaks its pattern of continuously
putting objects, and suddenly starts to get objects instead, it will
either get objects already in the cache, or the get() function will
refill the cache.

The concept of not flushing the cache completely was probably based on
an assumption that it is more likely for an application's lcore thread
to get() after flushing than to put() after flushing.
I strongly disagree with this assumption! If an application thread is
continuously putting so much that it overflows the cache, it is much
more likely to keep putting than it is to start getting. If in doubt,
consider how CPU branch predictors work: When the application has done
something many times consecutively, the branch predictor will expect the
application to do the same again, rather than suddenly do something
else.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 lib/mempool/rte_mempool.h | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index e3364ed7b8..26b2697572 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1344,19 +1344,9 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 		cache_objs = &cache->objs[cache->len];
 		cache->len += n;
 	} else {
-		unsigned int keep = (n >= cache->size) ? 0 : (cache->size - n);
-
-		/*
-		 * If number of object to keep in the cache is positive:
-		 * keep = cache->size - n < cache->flushthresh - n < cache->len
-		 * since cache->flushthresh > cache->size.
-		 * If keep is 0, cache->len cannot be 0 anyway since
-		 * n <= cache->flushthresh and we'd no be here with
-		 * cache->len == 0.
-		 */
-		cache_objs = &cache->objs[keep];
-		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len - keep);
-		cache->len = keep + n;
+		cache_objs = &cache->objs[0];
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+		cache->len = n;
 	}
 
 	/* Add the objects to the cache. */
-- 
2.30.2


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v6 3/4] mempool: fix cache flushing algorithm
  2022-10-09 13:37   ` [PATCH v6 3/4] mempool: fix cache flushing algorithm Andrew Rybchenko
@ 2022-10-09 14:31     ` Morten Brørup
  2022-10-09 14:51       ` Andrew Rybchenko
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-09 14:31 UTC (permalink / raw)
  To: Andrew Rybchenko, Olivier Matz; +Cc: dev, Bruce Richardson

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Sunday, 9 October 2022 15.38
> 
> From: Morten Brørup <mb@smartsharesystems.com>
> 
> Fix the rte_mempool_do_generic_put() caching flushing algorithm to
> keep hot objects in cache instead of cold ones.
> 
> The algorithm was:
>  1. Add the objects to the cache.
>  2. Anything greater than the cache size (if it crosses the cache flush
>     threshold) is flushed to the backend.
> 
> Please note that the description in the source code said that it kept
> "cache min value" objects after flushing, but the function actually
> kept
> the cache full after flushing, which the above description reflects.
> 
> Now, the algorithm is:
>  1. If the objects cannot be added to the cache without crossing the
>     flush threshold, flush some cached objects to the backend to
>     free up required space.
>  2. Add the objects to the cache.
> 
> The most recent (hot) objects were flushed, leaving the oldest (cold)
> objects in the mempool cache. The bug degraded performance, because
> flushing prevented immediate reuse of the (hot) objects already in
> the CPU cache.  Now, the existing (cold) objects in the mempool cache
> are flushed before the new (hot) objects are added the to the mempool
> cache.
> 
> Since nearby code is touched anyway fix flush threshold comparison
> to do flushing if the threshold is really exceed, not just reached.
> I.e. it must be "len > flushthresh", not "len >= flushthresh".
> Consider a flush multiplier of 1 instead of 1.5; the cache would be
> flushed already when reaching size objects, not when exceeding size
> objects. In other words, the cache would not be able to hold "size"
> objects, which is clearly a bug. The bug could degraded performance
> due to premature flushing.
> 
> Since we never exceed flush threshold now, cache size in the mempool
> may be decreased from RTE_MEMPOOL_CACHE_MAX_SIZE * 3 to
> RTE_MEMPOOL_CACHE_MAX_SIZE * 2. In fact it could be
> CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE), but flush
> threshold multiplier is internal.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---

[...]

> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
>  	 * Cache is allocated to this size to allow it to overflow in
> certain
>  	 * cases to avoid needless emptying of cache.
>  	 */
> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
>  } __rte_cache_aligned;

How much are we allowed to break the ABI here?

This patch reduces the size of the structure by removing a now unused part at the end, which should be harmless.

If we may also move the position of the objs array, I would add __rte_cache_aligned to the objs array. It makes no difference in the general case, but if get/put operations are always 32 objects, it will reduce the number of memory (or last level cache) accesses from five to four 64 B cache lines for every get/put operation.

	uint32_t len;	      /**< Current cache count */
-	/*
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
-	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	/**
+	 * Cache objects
+	 *
+	 * Cache is allocated to this size to allow it to overflow in certain
+	 * cases to avoid needless emptying of cache.
+	 */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
} __rte_cache_aligned;

With or without the above suggested optimization...

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v6 4/4] mempool: flush cache completely on overflow
  2022-10-09 13:37   ` [PATCH v6 4/4] mempool: flush cache completely on overflow Andrew Rybchenko
@ 2022-10-09 14:44     ` Morten Brørup
  2022-10-14 14:01       ` Olivier Matz
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-09 14:44 UTC (permalink / raw)
  To: Andrew Rybchenko, Olivier Matz; +Cc: dev, Bruce Richardson

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Sunday, 9 October 2022 15.38
> To: Olivier Matz
> Cc: dev@dpdk.org; Morten Brørup; Bruce Richardson
> Subject: [PATCH v6 4/4] mempool: flush cache completely on overflow
> 
> The cache was still full after flushing. In the opposite direction,
> i.e. when getting objects from the cache, the cache is refilled to full
> level when it crosses the low watermark (which happens to be zero).
> Similarly, the cache should be flushed to empty level when it crosses
> the high watermark (which happens to be 1.5 x the size of the cache).
> The existing flushing behaviour was suboptimal for real applications,
> because crossing the low or high watermark typically happens when the
> application is in a state where the number of put/get events are out of
> balance, e.g. when absorbing a burst of packets into a QoS queue
> (getting more mbufs from the mempool), or when a burst of packets is
> trickling out from the QoS queue (putting the mbufs back into the
> mempool).
> Now, the mempool cache is completely flushed when crossing the flush
> threshold, so only the newly put (hot) objects remain in the mempool
> cache afterwards.
> 
> This bug degraded performance caused by too frequent flushing.
> 
> Consider this application scenario:
> 
> Either, an lcore thread in the application is in a state of balance,
> where it uses the mempool cache within its flush/refill boundaries; in
> this situation, the flush method is less important, and this fix is
> irrelevant.
> 
> Or, an lcore thread in the application is out of balance (either
> permanently or temporarily), and mostly gets or puts objects from/to
> the
> mempool. If it mostly puts objects, not flushing all of the objects
> will
> cause more frequent flushing. This is the scenario addressed by this
> fix. E.g.:
> 
> Cache size=256, flushthresh=384 (1.5x size), initial len=256;
> application burst len=32.
> 
> If there are "size" objects in the cache after flushing, the cache is
> flushed at every 4th burst.
> 
> If the cache is flushed completely, the cache is only flushed at every
> 16th burst.
> 
> As you can see, this bug caused the cache to be flushed 4x too
> frequently in this example.
> 
> And when/if the application thread breaks its pattern of continuously
> putting objects, and suddenly starts to get objects instead, it will
> either get objects already in the cache, or the get() function will
> refill the cache.
> 
> The concept of not flushing the cache completely was probably based on
> an assumption that it is more likely for an application's lcore thread
> to get() after flushing than to put() after flushing.
> I strongly disagree with this assumption! If an application thread is
> continuously putting so much that it overflows the cache, it is much
> more likely to keep putting than it is to start getting. If in doubt,
> consider how CPU branch predictors work: When the application has done
> something many times consecutively, the branch predictor will expect
> the
> application to do the same again, rather than suddenly do something
> else.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v6 3/4] mempool: fix cache flushing algorithm
  2022-10-09 14:31     ` Morten Brørup
@ 2022-10-09 14:51       ` Andrew Rybchenko
  2022-10-09 15:08         ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-09 14:51 UTC (permalink / raw)
  To: Morten Brørup, Olivier Matz; +Cc: dev, Bruce Richardson

On 10/9/22 17:31, Morten Brørup wrote:
>> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
>> Sent: Sunday, 9 October 2022 15.38
>>
>> From: Morten Brørup <mb@smartsharesystems.com>
>>
>> Fix the rte_mempool_do_generic_put() caching flushing algorithm to
>> keep hot objects in cache instead of cold ones.
>>
>> The algorithm was:
>>   1. Add the objects to the cache.
>>   2. Anything greater than the cache size (if it crosses the cache flush
>>      threshold) is flushed to the backend.
>>
>> Please note that the description in the source code said that it kept
>> "cache min value" objects after flushing, but the function actually
>> kept
>> the cache full after flushing, which the above description reflects.
>>
>> Now, the algorithm is:
>>   1. If the objects cannot be added to the cache without crossing the
>>      flush threshold, flush some cached objects to the backend to
>>      free up required space.
>>   2. Add the objects to the cache.
>>
>> The most recent (hot) objects were flushed, leaving the oldest (cold)
>> objects in the mempool cache. The bug degraded performance, because
>> flushing prevented immediate reuse of the (hot) objects already in
>> the CPU cache.  Now, the existing (cold) objects in the mempool cache
>> are flushed before the new (hot) objects are added the to the mempool
>> cache.
>>
>> Since nearby code is touched anyway fix flush threshold comparison
>> to do flushing if the threshold is really exceed, not just reached.
>> I.e. it must be "len > flushthresh", not "len >= flushthresh".
>> Consider a flush multiplier of 1 instead of 1.5; the cache would be
>> flushed already when reaching size objects, not when exceeding size
>> objects. In other words, the cache would not be able to hold "size"
>> objects, which is clearly a bug. The bug could degraded performance
>> due to premature flushing.
>>
>> Since we never exceed flush threshold now, cache size in the mempool
>> may be decreased from RTE_MEMPOOL_CACHE_MAX_SIZE * 3 to
>> RTE_MEMPOOL_CACHE_MAX_SIZE * 2. In fact it could be
>> CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE), but flush
>> threshold multiplier is internal.
>>
>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> ---
> 
> [...]
> 
>> --- a/lib/mempool/rte_mempool.h
>> +++ b/lib/mempool/rte_mempool.h
>> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
>>   	 * Cache is allocated to this size to allow it to overflow in
>> certain
>>   	 * cases to avoid needless emptying of cache.
>>   	 */
>> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
>> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
>>   } __rte_cache_aligned;
> 
> How much are we allowed to break the ABI here?
> 
> This patch reduces the size of the structure by removing a now unused part at the end, which should be harmless.
> 
> If we may also move the position of the objs array, I would add __rte_cache_aligned to the objs array. It makes no difference in the general case, but if get/put operations are always 32 objects, it will reduce the number of memory (or last level cache) accesses from five to four 64 B cache lines for every get/put operation.
> 
> 	uint32_t len;	      /**< Current cache count */
> -	/*
> -	 * Cache is allocated to this size to allow it to overflow in certain
> -	 * cases to avoid needless emptying of cache.
> -	 */
> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> +	/**
> +	 * Cache objects
> +	 *
> +	 * Cache is allocated to this size to allow it to overflow in certain
> +	 * cases to avoid needless emptying of cache.
> +	 */
> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> } __rte_cache_aligned;

I think aligning objs on cacheline should be a separate patch.

> 
> With or without the above suggested optimization...
> 
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> 


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v6 3/4] mempool: fix cache flushing algorithm
  2022-10-09 14:51       ` Andrew Rybchenko
@ 2022-10-09 15:08         ` Morten Brørup
  2022-10-14 14:01           ` Olivier Matz
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-09 15:08 UTC (permalink / raw)
  To: Andrew Rybchenko, Olivier Matz; +Cc: dev, Bruce Richardson

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Sunday, 9 October 2022 16.52
> 
> On 10/9/22 17:31, Morten Brørup wrote:
> >> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> >> Sent: Sunday, 9 October 2022 15.38
> >>
> >> From: Morten Brørup <mb@smartsharesystems.com>
> >>

[...]

> >> --- a/lib/mempool/rte_mempool.h
> >> +++ b/lib/mempool/rte_mempool.h
> >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> >>   	 * Cache is allocated to this size to allow it to overflow in
> >> certain
> >>   	 * cases to avoid needless emptying of cache.
> >>   	 */
> >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> >>   } __rte_cache_aligned;
> >
> > How much are we allowed to break the ABI here?
> >
> > This patch reduces the size of the structure by removing a now unused
> part at the end, which should be harmless.
> >
> > If we may also move the position of the objs array, I would add
> __rte_cache_aligned to the objs array. It makes no difference in the
> general case, but if get/put operations are always 32 objects, it will
> reduce the number of memory (or last level cache) accesses from five to
> four 64 B cache lines for every get/put operation.
> >
> > 	uint32_t len;	      /**< Current cache count */
> > -	/*
> > -	 * Cache is allocated to this size to allow it to overflow in
> certain
> > -	 * cases to avoid needless emptying of cache.
> > -	 */
> > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> > +	/**
> > +	 * Cache objects
> > +	 *
> > +	 * Cache is allocated to this size to allow it to overflow in
> certain
> > +	 * cases to avoid needless emptying of cache.
> > +	 */
> > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> > } __rte_cache_aligned;
> 
> I think aligning objs on cacheline should be a separate patch.

Good point. I'll let you do it. :-)

PS: Thank you for following up on this patch series, Andrew!

-Morten

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm
  2022-10-09 13:37 ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
                     ` (3 preceding siblings ...)
  2022-10-09 13:37   ` [PATCH v6 4/4] mempool: flush cache completely on overflow Andrew Rybchenko
@ 2022-10-10 15:21   ` Thomas Monjalon
  2022-10-11 19:26     ` Morten Brørup
  2022-10-26 14:09     ` Thomas Monjalon
  4 siblings, 2 replies; 85+ messages in thread
From: Thomas Monjalon @ 2022-10-10 15:21 UTC (permalink / raw)
  To: Andrew Rybchenko; +Cc: Olivier Matz, dev, Morten Brørup, Bruce Richardson

> Andrew Rybchenko (3):
>   mempool: check driver enqueue result in one place
>   mempool: avoid usage of term ring on put
>   mempool: flush cache completely on overflow
> 
> Morten Brørup (1):
>   mempool: fix cache flushing algorithm

Applied only first 2 "cosmetic" patches as discussed with Andrew.
The goal is to run some performance tests
before merging the rest of the series.





^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm
  2022-10-10 15:21   ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Thomas Monjalon
@ 2022-10-11 19:26     ` Morten Brørup
  2022-10-26 14:09     ` Thomas Monjalon
  1 sibling, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-11 19:26 UTC (permalink / raw)
  To: Thomas Monjalon, Andrew Rybchenko; +Cc: Olivier Matz, dev, Bruce Richardson

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Monday, 10 October 2022 17.21
> 
> > Andrew Rybchenko (3):
> >   mempool: check driver enqueue result in one place
> >   mempool: avoid usage of term ring on put
> >   mempool: flush cache completely on overflow
> >
> > Morten Brørup (1):
> >   mempool: fix cache flushing algorithm
> 
> Applied only first 2 "cosmetic" patches as discussed with Andrew.
> The goal is to make some performance tests
> before merging the rest of the series.

I just came to think of this:

Don't test with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE, because some PMDs bypass the mempool library and manipulate the mempool cache structure directly, e.g. https://elixir.bootlin.com/dpdk/latest/source/drivers/net/i40e/i40e_rxtx_vec_avx512.c#L903

The copy-pasted code in those PMDs should probably also be updated to reflect the updated mempool library behavior. :-(


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Copy-pasted code should be updated
  2022-10-08 20:56   ` Thomas Monjalon
@ 2022-10-11 20:30     ` Morten Brørup
  2022-10-11 21:47       ` Honnappa Nagarahalli
  2022-10-14 14:01     ` [PATCH v4] mempool: fix get objects from mempool with cache Olivier Matz
  1 sibling, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-11 20:30 UTC (permalink / raw)
  To: Yuying Zhang, Beilei Xing, Jingjing Wu, Qiming Yang, Qi Zhang
  Cc: Olivier Matz, dev, Thomas Monjalon, Andrew Rybchenko,
	Feifei Wang, ruifeng.wang, honnappa.nagarahalli, nd, techboard

Dear Intel PMD maintainers (CC: techboard),

I strongly recommend that you update the code you copy-pasted from the mempool library to your PMDs, so they reflect the new and improved mempool cache behavior [1]. When choosing to copy-paste code from a core library, you should feel obliged to keep your copied code matching the source code you copied it from!

Also, as reported in bug #1052, you forgot to copy-paste the instrumentation, thereby 1. making the mempool debug statistics invalid and 2. omitting the mempool accesses from the trace when using your PMDs. :-(

Alternatively, just remove the copy-pasted code and use the mempool library's API instead. ;-)
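
For illustration, a rough sketch of what using the library API could look
like in a Tx free path. This is hypothetical and simplified (the helper name
is made up, and real PMDs do refcount/prefree checks before this point); it
assumes all mbufs in the array come from the same mempool, as the fast-free
offload guarantees.

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/*
 * Hypothetical Tx free path using the public mempool API instead of
 * copy-pasted cache handling.
 */
static void
tx_free_bufs_via_mempool_api(struct rte_mbuf **bufs, unsigned int n)
{
	struct rte_mempool *mp = bufs[0]->pool;
	struct rte_mempool_cache *cache;

	/* May return NULL (e.g. non-EAL thread); generic_put handles that. */
	cache = rte_mempool_default_cache(mp, rte_lcore_id());

	/* Statistics and tracing are handled inside the library. */
	rte_mempool_generic_put(mp, (void **)bufs, n, cache);
}

This keeps the hot-path cache handling in one place, so fixes like the above
automatically reach the PMDs as well.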

The direct re-arm code also contains copy-pasted mempool cache handling code - which was accepted with the argument that the same code was already copy-pasted elsewhere. I don't know if the direct re-arm code also needs updating... Authors of that patch (CC to this email), please coordinate with the PMD maintainers.

PS:  As noted in the 22.11-rc1 release notes, more changes to the mempool library [2] may be coming.

[1]: https://patches.dpdk.org/project/dpdk/patch/20221007104450.2567961-1-andrew.rybchenko@oktetlabs.ru/

[2]: https://patches.dpdk.org/project/dpdk/list/?series=25063

-Morten


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: Copy-pasted code should be updated
  2022-10-11 20:30     ` Copy-pasted code should be updated Morten Brørup
@ 2022-10-11 21:47       ` Honnappa Nagarahalli
  2022-10-30  8:44         ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Honnappa Nagarahalli @ 2022-10-11 21:47 UTC (permalink / raw)
  To: Morten Brørup, Yuying Zhang, Beilei Xing, Jingjing Wu,
	Qiming Yang, Qi Zhang
  Cc: Olivier Matz, dev, thomas, Andrew Rybchenko, Feifei Wang,
	Ruifeng Wang, nd, techboard, Kamalakshitha Aligeri, nd

<snip>

> 
> Dear Intel PMD maintainers (CC: techboard),
> 
> I strongly recommend that you update the code you copy-pasted from the
> mempool library to your PMDs, so they reflect the new and improved
> mempool cache behavior [1]. When choosing to copy-paste code from a core
> library, you should feel obliged to keep your copied code matching the source
> code you copied it from!
> 
> Also, as reported in bug #1052, you forgot to copy-paste the instrumentation,
> thereby 1. making the mempool debug statistics invalid and 2. omitting the
> mempool accesses from the trace when using your PMDs. :-(
We are working on mempool APIs to expose the per-core cache memory to the PMDs so that the buffers can be copied directly. We are planning to fix this duplication as part of that work.

> 
> Alternatively, just remove the copy-pasted code and use the mempool
> library's API instead. ;-)
> 
> The direct re-arm code also contains copy-pasted mempool cache handling
> code - which was accepted with the argument that the same code was
> already copy-pasted elsewhere. I don't know if the direct re-arm code also
> needs updating... Authors of that patch (CC to this email), please coordinate
> with the PMD maintainers.
Direct-rearm patch is not accepted yet.

> 
> PS:  As noted in the 22.11-rc1 release notes, more changes to the mempool
> library [2] may be coming.
> 
> [1]:
> https://patches.dpdk.org/project/dpdk/patch/20221007104450.2567961-1-
> andrew.rybchenko@oktetlabs.ru/
> 
> [2]: https://patches.dpdk.org/project/dpdk/list/?series=25063
> 
> -Morten


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v6 3/4] mempool: fix cache flushing algorithm
  2022-10-09 15:08         ` Morten Brørup
@ 2022-10-14 14:01           ` Olivier Matz
  2022-10-14 15:57             ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Olivier Matz @ 2022-10-14 14:01 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Andrew Rybchenko, dev, Bruce Richardson

Hi Morten, Andrew,

On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > Sent: Sunday, 9 October 2022 16.52
> > 
> > On 10/9/22 17:31, Morten Brørup wrote:
> > >> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > >> Sent: Sunday, 9 October 2022 15.38
> > >>
> > >> From: Morten Brørup <mb@smartsharesystems.com>
> > >>
> 
> [...]

I finally took a couple of hours to carefully review the mempool-related
series (including the ones that have already been pushed).

The new behavior looks better to me in all situations I can think about.

> 
> > >> --- a/lib/mempool/rte_mempool.h
> > >> +++ b/lib/mempool/rte_mempool.h
> > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > >>   	 * Cache is allocated to this size to allow it to overflow in
> > >> certain
> > >>   	 * cases to avoid needless emptying of cache.
> > >>   	 */
> > >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> > >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> > >>   } __rte_cache_aligned;
> > >
> > > How much are we allowed to break the ABI here?
> > >
> > > This patch reduces the size of the structure by removing a now unused
> > part at the end, which should be harmless.

It is an ABI breakage: an existing application will use the new 22.11
function to create the mempool (with a smaller cache), but the old
inlined get/put code, which can exceed MAX_SIZE x 2, will remain in use.

But this is a nice memory consumption improvement; in my opinion, we
should accept it for 22.11 with an entry in the release notes.


> > >
> > > If we may also move the position of the objs array, I would add
> > __rte_cache_aligned to the objs array. It makes no difference in the
> > general case, but if get/put operations are always 32 objects, it will
> > reduce the number of memory (or last level cache) accesses from five to
> > four 64 B cache lines for every get/put operation.

Will it really be the case? Since cache->len has to be accessed too,
I don't think it would make a difference.


> > >
> > > 	uint32_t len;	      /**< Current cache count */
> > > -	/*
> > > -	 * Cache is allocated to this size to allow it to overflow in
> > certain
> > > -	 * cases to avoid needless emptying of cache.
> > > -	 */
> > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> > > +	/**
> > > +	 * Cache objects
> > > +	 *
> > > +	 * Cache is allocated to this size to allow it to overflow in
> > certain
> > > +	 * cases to avoid needless emptying of cache.
> > > +	 */
> > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> > > } __rte_cache_aligned;
> > 
> > I think aligning objs on cacheline should be a separate patch.
> 
> Good point. I'll let you do it. :-)
> 
> PS: Thank you for following up on this patch series, Andrew!

Many thanks for this rework.

Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v6 4/4] mempool: flush cache completely on overflow
  2022-10-09 14:44     ` Morten Brørup
@ 2022-10-14 14:01       ` Olivier Matz
  0 siblings, 0 replies; 85+ messages in thread
From: Olivier Matz @ 2022-10-14 14:01 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Andrew Rybchenko, dev, Bruce Richardson

On Sun, Oct 09, 2022 at 04:44:08PM +0200, Morten Brørup wrote:
> > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > Sent: Sunday, 9 October 2022 15.38
> > To: Olivier Matz
> > Cc: dev@dpdk.org; Morten Brørup; Bruce Richardson
> > Subject: [PATCH v6 4/4] mempool: flush cache completely on overflow
> > 
> > The cache was still full after flushing. In the opposite direction,
> > i.e. when getting objects from the cache, the cache is refilled to full
> > level when it crosses the low watermark (which happens to be zero).
> > Similarly, the cache should be flushed to empty level when it crosses
> > the high watermark (which happens to be 1.5 x the size of the cache).
> > The existing flushing behaviour was suboptimal for real applications,
> > because crossing the low or high watermark typically happens when the
> > application is in a state where the number of put/get events are out of
> > balance, e.g. when absorbing a burst of packets into a QoS queue
> > (getting more mbufs from the mempool), or when a burst of packets is
> > trickling out from the QoS queue (putting the mbufs back into the
> > mempool).
> > Now, the mempool cache is completely flushed when crossing the flush
> > threshold, so only the newly put (hot) objects remain in the mempool
> > cache afterwards.
> > 
> > This bug degraded performance caused by too frequent flushing.
> > 
> > Consider this application scenario:
> > 
> > Either, an lcore thread in the application is in a state of balance,
> > where it uses the mempool cache within its flush/refill boundaries; in
> > this situation, the flush method is less important, and this fix is
> > irrelevant.
> > 
> > Or, an lcore thread in the application is out of balance (either
> > permanently or temporarily), and mostly gets or puts objects from/to
> > the
> > mempool. If it mostly puts objects, not flushing all of the objects
> > will
> > cause more frequent flushing. This is the scenario addressed by this
> > fix. E.g.:
> > 
> > Cache size=256, flushthresh=384 (1.5x size), initial len=256;
> > application burst len=32.
> > 
> > If there are "size" objects in the cache after flushing, the cache is
> > flushed at every 4th burst.
> > 
> > If the cache is flushed completely, the cache is only flushed at every
> > 16th burst.
> > 
> > As you can see, this bug caused the cache to be flushed 4x too
> > frequently in this example.
> > 
> > And when/if the application thread breaks its pattern of continuously
> > putting objects, and suddenly starts to get objects instead, it will
> > either get objects already in the cache, or the get() function will
> > refill the cache.
> > 
> > The concept of not flushing the cache completely was probably based on
> > an assumption that it is more likely for an application's lcore thread
> > to get() after flushing than to put() after flushing.
> > I strongly disagree with this assumption! If an application thread is
> > continuously putting so much that it overflows the cache, it is much
> > more likely to keep putting than it is to start getting. If in doubt,
> > consider how CPU branch predictors work: When the application has done
> > something many times consecutively, the branch predictor will expect
> > the
> > application to do the same again, rather than suddenly do something
> > else.
> > 
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > ---
> 
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> 

Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4] mempool: fix get objects from mempool with cache
  2022-10-08 20:56   ` Thomas Monjalon
  2022-10-11 20:30     ` Copy-pasted code should be updated Morten Brørup
@ 2022-10-14 14:01     ` Olivier Matz
  1 sibling, 0 replies; 85+ messages in thread
From: Olivier Matz @ 2022-10-14 14:01 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Morten Brørup, Andrew Rybchenko, dev

On Sat, Oct 08, 2022 at 10:56:06PM +0200, Thomas Monjalon wrote:
> 07/10/2022 12:44, Andrew Rybchenko:
> > From: Morten Brørup <mb@smartsharesystems.com>
> > 
> > A flush threshold for the mempool cache was introduced in DPDK version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> > 
> > Fix the following in rte_mempool_do_generic_get():
> > 
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction of
> > the flush threshold. E.g. the cache can hold up to flushthresh objects,
> > which is more than its size, so some requests were not served from the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> > 
> > This bug caused a major performance degradation in scenarios where the
> > application burst length is the same as the cache size. In such cases,
> > the objects were not ever fetched from the mempool cache, regardless if
> > they could have been.
> > This scenario occurs e.g. if an application has configured a mempool
> > with a size matching the application's burst size.
> > 
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the backend.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the objects
> > from the backend were returned ahead of the objects from the cache.
> > This bug degraded application performance on CPUs with a small L1 cache,
> > which benefit from having the hot objects first in the returned array.
> > (This is probably also the reason why the function returns the objects
> > in reverse order, which it still does.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the backend.
> > 
> > The function was not behaving as described (by the function using it)
> > and expected by applications using it. This in itself is also a bug.
> > 
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the backend (instead of only the
> > number of requested objects minus the objects available in the backend),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the cache,
> > and if the subsequent backfilling of the cache from the backend fails,
> > only the remaining requested objects are retrieved from the backend.
> > 
> > The function would fail despite there are enough objects in the cache
> > plus the common pool.
> > 
> > 4. The code flow for satisfying the request from the cache was slightly
> > inefficient:
> > The likely code path where the objects are simply served from the cache
> > was treated as unlikely. Now it is treated as likely.
> > 
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Applied, thanks.

Better late than never: I reviewed this patch after it was pushed,
and it looks good to me.

Thanks,
Olivier


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v6 3/4] mempool: fix cache flushing algorithm
  2022-10-14 14:01           ` Olivier Matz
@ 2022-10-14 15:57             ` Morten Brørup
  2022-10-14 19:50               ` Olivier Matz
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-14 15:57 UTC (permalink / raw)
  To: Olivier Matz; +Cc: Andrew Rybchenko, dev, Bruce Richardson

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Friday, 14 October 2022 16.01
> 
> Hi Morten, Andrew,
> 
> On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > Sent: Sunday, 9 October 2022 16.52
> > >
> > > On 10/9/22 17:31, Morten Brørup wrote:
> > > >> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > >> Sent: Sunday, 9 October 2022 15.38
> > > >>
> > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > >>
> >
> > [...]
> 
> I finally took a couple of hours to carefully review the mempool-
> related
> series (including the ones that have already been pushed).
> 
> The new behavior looks better to me in all situations I can think
> about.

Extreme care is required when touching a core library like the mempool.

Thank you, Olivier.

> 
> >
> > > >> --- a/lib/mempool/rte_mempool.h
> > > >> +++ b/lib/mempool/rte_mempool.h
> > > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > > >>   	 * Cache is allocated to this size to allow it to overflow
> in
> > > >> certain
> > > >>   	 * cases to avoid needless emptying of cache.
> > > >>   	 */
> > > >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> objects */
> > > >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache
> objects */
> > > >>   } __rte_cache_aligned;
> > > >
> > > > How much are we allowed to break the ABI here?
> > > >
> > > > This patch reduces the size of the structure by removing a now
> unused
> > > part at the end, which should be harmless.
> 
> It is an ABI breakage: an existing application will use the new 22.11
> function to create the mempool (with a smaller cache), but will use the
> old inlined get/put that can exceed MAX_SIZE x 2 will remain.
> 
> But this is a nice memory consumption improvement, in my opinion we
> should accept it for 22.11 with an entry in the release note.
> 
> 
> > > >
> > > > If we may also move the position of the objs array, I would add
> > > __rte_cache_aligned to the objs array. It makes no difference in
> the
> > > general case, but if get/put operations are always 32 objects, it
> will
> > > reduce the number of memory (or last level cache) accesses from
> five to
> > > four 64 B cache lines for every get/put operation.
> 
> Will it really be the case? Since cache->len has to be accessed too,
> I don't think it would make a difference.

Yes, the first cache line, containing cache->len, will always be accessed. I forgot to count that, so the improvement from aligning cache->objs will be five cache line accesses instead of six.

Let me try to explain the scenario in other words:

In an application where a mempool cache is only accessed in bursts of 32 objects (256 B), it matters if those 256 B accesses in the mempool cache start at a cache line aligned address or not. If cache line aligned, accessing those 256 B in the mempool cache will only touch 4 cache lines; if not, 5 cache lines will be touched. (For architectures with 128 B cache line, it will be 2 instead of 3 touched cache lines per mempool cache get/put operation in applications using only bursts of 32 objects.)

If we cache line align cache->objs, those bursts of 32 objects (256 B) will be cache line aligned: Any address at cache->objs[N * 32] is cache line aligned if cache->objs[0] is cache line aligned.

Currently, the cache->objs directly follows cache->len, which makes cache->objs[0] cache line unaligned.

If we decide to break the mempool cache ABI, we might as well include my suggested cache line alignment performance improvement. It doesn't degrade performance for mempool caches that are not accessed exclusively in bursts of 32 objects.

> 
> 
> > > >
> > > > 	uint32_t len;	      /**< Current cache count */
> > > > -	/*
> > > > -	 * Cache is allocated to this size to allow it to overflow
> in
> > > certain
> > > > -	 * cases to avoid needless emptying of cache.
> > > > -	 */
> > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> objects */
> > > > +	/**
> > > > +	 * Cache objects
> > > > +	 *
> > > > +	 * Cache is allocated to this size to allow it to overflow
> in
> > > certain
> > > > +	 * cases to avoid needless emptying of cache.
> > > > +	 */
> > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> __rte_cache_aligned;
> > > > } __rte_cache_aligned;
> > >
> > > I think aligning objs on cacheline should be a separate patch.
> >
> > Good point. I'll let you do it. :-)
> >
> > PS: Thank you for following up on this patch series, Andrew!
> 
> Many thanks for this rework.
> 
> Acked-by: Olivier Matz <olivier.matz@6wind.com>

Perhaps Reviewed-by would be appropriate?


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v6 3/4] mempool: fix cache flushing algorithm
  2022-10-14 15:57             ` Morten Brørup
@ 2022-10-14 19:50               ` Olivier Matz
  2022-10-15  6:57                 ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Olivier Matz @ 2022-10-14 19:50 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Andrew Rybchenko, dev, Bruce Richardson

On Fri, Oct 14, 2022 at 05:57:39PM +0200, Morten Brørup wrote:
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Friday, 14 October 2022 16.01
> > 
> > Hi Morten, Andrew,
> > 
> > On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > > > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > Sent: Sunday, 9 October 2022 16.52
> > > >
> > > > On 10/9/22 17:31, Morten Brørup wrote:
> > > > >> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > >> Sent: Sunday, 9 October 2022 15.38
> > > > >>
> > > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > > >>
> > >
> > > [...]
> > 
> > I finally took a couple of hours to carefully review the mempool-
> > related
> > series (including the ones that have already been pushed).
> > 
> > The new behavior looks better to me in all situations I can think
> > about.
> 
> Extreme care is required when touching a core library like the mempool.
> 
> Thank you, Olivier.
> 
> > 
> > >
> > > > >> --- a/lib/mempool/rte_mempool.h
> > > > >> +++ b/lib/mempool/rte_mempool.h
> > > > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > > > >>   	 * Cache is allocated to this size to allow it to overflow
> > in
> > > > >> certain
> > > > >>   	 * cases to avoid needless emptying of cache.
> > > > >>   	 */
> > > > >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> > objects */
> > > > >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache
> > objects */
> > > > >>   } __rte_cache_aligned;
> > > > >
> > > > > How much are we allowed to break the ABI here?
> > > > >
> > > > > This patch reduces the size of the structure by removing a now
> > unused
> > > > part at the end, which should be harmless.
> > 
> > It is an ABI breakage: an existing application will use the new 22.11
> > function to create the mempool (with a smaller cache), but will use the
> > old inlined get/put that can exceed MAX_SIZE x 2 will remain.
> > 
> > But this is a nice memory consumption improvement, in my opinion we
> > should accept it for 22.11 with an entry in the release note.
> > 
> > 
> > > > >
> > > > > If we may also move the position of the objs array, I would add
> > > > __rte_cache_aligned to the objs array. It makes no difference in
> > the
> > > > general case, but if get/put operations are always 32 objects, it
> > will
> > > > reduce the number of memory (or last level cache) accesses from
> > five to
> > > > four 64 B cache lines for every get/put operation.
> > 
> > Will it really be the case? Since cache->len has to be accessed too,
> > I don't think it would make a difference.
> 
> Yes, the first cache line, containing cache->len, will be accessed always. I forgot to count that; so the improvement by aligning cache->objs will be five cache line accesses instead of six.
> 
> Let me try to explain the scenario in other words:
> 
> In an application where a mempool cache is only accessed in bursts of 32 objects (256 B), it matters if those 256 B accesses in the mempool cache start at a cache line aligned address or not. If cache line aligned, accessing those 256 B in the mempool cache will only touch 4 cache lines; if not, 5 cache lines will be touched. (For architectures with 128 B cache line, it will be 2 instead of 3 touched cache lines per mempool cache get/put operation in applications using only bursts of 32 objects.)
> 
> If we cache line align cache->objs, those bursts of 32 objects (256 B) will be cache line aligned: Any address at cache->objs[N * 32 objects] is cache line aligned if objs->objs[0] is cache line aligned.
> 
> Currently, the cache->objs directly follows cache->len, which makes cache->objs[0] cache line unaligned.
> 
> If we decide to break the mempool cache ABI, we might as well include my suggested cache line alignment performance improvement. It doesn't degrade performance for mempool caches not only accessed in bursts of 32 objects.

I don't follow you. Currently, with 16 objects (128B), we access to 3
cache lines:

      ┌────────┐
      │len     │
cache │********│---
line0 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line2 │        │
      │        │
      └────────┘

With the alignment, it is also 3 cache lines:

      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤---
      │********│ ^
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line2 │********│ |
      │********│ v
      └────────┘---


Am I missing something?

> 
> > 
> > 
> > > > >
> > > > > 	uint32_t len;	      /**< Current cache count */
> > > > > -	/*
> > > > > -	 * Cache is allocated to this size to allow it to overflow
> > in
> > > > certain
> > > > > -	 * cases to avoid needless emptying of cache.
> > > > > -	 */
> > > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> > objects */
> > > > > +	/**
> > > > > +	 * Cache objects
> > > > > +	 *
> > > > > +	 * Cache is allocated to this size to allow it to overflow
> > in
> > > > certain
> > > > > +	 * cases to avoid needless emptying of cache.
> > > > > +	 */
> > > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> > __rte_cache_aligned;
> > > > > } __rte_cache_aligned;
> > > >
> > > > I think aligning objs on cacheline should be a separate patch.
> > >
> > > Good point. I'll let you do it. :-)
> > >
> > > PS: Thank you for following up on this patch series, Andrew!
> > 
> > Many thanks for this rework.
> > 
> > Acked-by: Olivier Matz <olivier.matz@6wind.com>
> 
> Perhaps Reviewed-by would be appropriate?

I was thinking that "Acked-by" was commonly used by maintainers, and
"Reviewed-by" for reviews by community members. After reading the
documentation again, it's not that clear now in my mind :)

Thanks,
Olivier

^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v6 3/4] mempool: fix cache flushing algorithm
  2022-10-14 19:50               ` Olivier Matz
@ 2022-10-15  6:57                 ` Morten Brørup
  2022-10-18 16:32                   ` Jerin Jacob
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-15  6:57 UTC (permalink / raw)
  To: Olivier Matz; +Cc: Andrew Rybchenko, dev, Bruce Richardson

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Friday, 14 October 2022 21.51
> 
> On Fri, Oct 14, 2022 at 05:57:39PM +0200, Morten Brørup wrote:
> > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > Sent: Friday, 14 October 2022 16.01
> > >
> > > Hi Morten, Andrew,
> > >
> > > On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > > > > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > > Sent: Sunday, 9 October 2022 16.52
> > > > >
> > > > > On 10/9/22 17:31, Morten Brørup wrote:
> > > > > >> From: Andrew Rybchenko
> [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > > >> Sent: Sunday, 9 October 2022 15.38
> > > > > >>
> > > > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > > > >>
> > > >
> > > > [...]
> > >
> > > I finally took a couple of hours to carefully review the mempool-
> > > related
> > > series (including the ones that have already been pushed).
> > >
> > > The new behavior looks better to me in all situations I can think
> > > about.
> >
> > Extreme care is required when touching a core library like the
> mempool.
> >
> > Thank you, Olivier.
> >
> > >
> > > >
> > > > > >> --- a/lib/mempool/rte_mempool.h
> > > > > >> +++ b/lib/mempool/rte_mempool.h
> > > > > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > > > > >>   	 * Cache is allocated to this size to allow it to
> overflow
> > > in
> > > > > >> certain
> > > > > >>   	 * cases to avoid needless emptying of cache.
> > > > > >>   	 */
> > > > > >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**<
> Cache
> > > objects */
> > > > > >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**<
> Cache
> > > objects */
> > > > > >>   } __rte_cache_aligned;
> > > > > >
> > > > > > How much are we allowed to break the ABI here?
> > > > > >
> > > > > > This patch reduces the size of the structure by removing a
> now
> > > unused
> > > > > part at the end, which should be harmless.
> > >
> > > It is an ABI breakage: an existing application will use the new
> 22.11
> > > function to create the mempool (with a smaller cache), but will use
> the
> > > old inlined get/put that can exceed MAX_SIZE x 2 will remain.
> > >
> > > But this is a nice memory consumption improvement, in my opinion we
> > > should accept it for 22.11 with an entry in the release note.
> > >
> > >
> > > > > >
> > > > > > If we may also move the position of the objs array, I would
> add
> > > > > __rte_cache_aligned to the objs array. It makes no difference
> in
> > > the
> > > > > general case, but if get/put operations are always 32 objects,
> it
> > > will
> > > > > reduce the number of memory (or last level cache) accesses from
> > > five to
> > > > > four 64 B cache lines for every get/put operation.
> > >
> > > Will it really be the case? Since cache->len has to be accessed
> too,
> > > I don't think it would make a difference.
> >
> > Yes, the first cache line, containing cache->len, will be accessed
> always. I forgot to count that; so the improvement by aligning cache-
> >objs will be five cache line accesses instead of six.
> >
> > Let me try to explain the scenario in other words:
> >
> > In an application where a mempool cache is only accessed in bursts of
> 32 objects (256 B), it matters if those 256 B accesses in the mempool
> cache start at a cache line aligned address or not. If cache line
> aligned, accessing those 256 B in the mempool cache will only touch 4
> cache lines; if not, 5 cache lines will be touched. (For architectures
> with 128 B cache line, it will be 2 instead of 3 touched cache lines
> per mempool cache get/put operation in applications using only bursts
> of 32 objects.)
> >
> > If we cache line align cache->objs, those bursts of 32 objects (256
> B) will be cache line aligned: Any address at cache->objs[N * 32
> objects] is cache line aligned if objs->objs[0] is cache line aligned.
> >
> > Currently, the cache->objs directly follows cache->len, which makes
> cache->objs[0] cache line unaligned.
> >
> > If we decide to break the mempool cache ABI, we might as well include
> my suggested cache line alignment performance improvement. It doesn't
> degrade performance for mempool caches not only accessed in bursts of
> 32 objects.
> 
> I don't follow you. Currently, with 16 objects (128B), we access to 3
> cache lines:
> 
>       ┌────────┐
>       │len     │
> cache │********│---
> line0 │********│ ^
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line1 │********│ |
>       │********│ |
>       ├────────┤ |
>       │********│_v_
> cache │        │
> line2 │        │
>       │        │
>       └────────┘
> 
> With the alignment, it is also 3 cache lines:
> 
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤---
>       │********│ ^
> cache │********│ |
> line1 │********│ |
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line2 │********│ |
>       │********│ v
>       └────────┘---
> 
> 
> Am I missing something?

Accessing the objects at the bottom of the mempool cache is a special case, where cache line0 is also used for objects.

Consider the next burst (and any following bursts):

Current:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │********│---
line2 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line3 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line4 │        │
      │        │
      └────────┘
4 cache lines touched, incl. line0 for len.

With the proposed alignment:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line2 │        │
      │        │
      ├────────┤
      │********│---
cache │********│ ^
line3 │********│ |
      │********│ | 16 objects
      ├────────┤ | 128B
      │********│ |
cache │********│ |
line4 │********│ |
      │********│_v_
      └────────┘
Only 3 cache lines touched, incl. line0 for len.
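
The counts in the two figures above can be reproduced with a few lines of
arithmetic. A standalone sketch, assuming 64 B cache lines, 8 B pointers,
and objs[] at byte offset 16 in the current (non-debug) layout - the offset
is an assumption based on the three uint32_t fields at the head of the
struct:

#include <stdio.h>

/* 64 B cache lines spanned by nb 8-byte pointers starting at byte offset off. */
static unsigned int
lines_spanned(unsigned int off, unsigned int nb)
{
	return (off + nb * 8 - 1) / 64 - off / 64 + 1;
}

int main(void)
{
	unsigned int burst = 16;        /* objects per burst, as in the figures */
	unsigned int start = burst * 8; /* the second burst starts at objs[16] */

	/* Current layout: objs[] right after size/flushthresh/len. */
	printf("current: %u object lines + 1 for len\n", lines_spanned(16 + start, burst));
	/* Proposed layout: objs[] cache line aligned. */
	printf("aligned: %u object lines + 1 for len\n", lines_spanned(64 + start, burst));
	return 0;
}

This prints 3 and 2 object lines, i.e. 4 and 3 total cache line accesses
including the line holding len, matching the figures.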


> 
> >
> > >
> > >
> > > > > >
> > > > > > 	uint32_t len;	      /**< Current cache count */
> > > > > > -	/*
> > > > > > -	 * Cache is allocated to this size to allow it to overflow
> > > in
> > > > > certain
> > > > > > -	 * cases to avoid needless emptying of cache.
> > > > > > -	 */
> > > > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> > > objects */
> > > > > > +	/**
> > > > > > +	 * Cache objects
> > > > > > +	 *
> > > > > > +	 * Cache is allocated to this size to allow it to overflow
> > > in
> > > > > certain
> > > > > > +	 * cases to avoid needless emptying of cache.
> > > > > > +	 */
> > > > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> > > __rte_cache_aligned;
> > > > > > } __rte_cache_aligned;
> > > > >
> > > > > I think aligning objs on cacheline should be a separate patch.
> > > >
> > > > Good point. I'll let you do it. :-)
> > > >
> > > > PS: Thank you for following up on this patch series, Andrew!
> > >
> > > Many thanks for this rework.
> > >
> > > Acked-by: Olivier Matz <olivier.matz@6wind.com>
> >
> > Perhaps Reviewed-by would be appropriate?
> 
> I was thinking that "Acked-by" was commonly used by maintainers, and
> "Reviewed-by" for reviews by community members. After reading the
> documentation again, it's not that clear now in my mind :)
> 
> Thanks,
> Olivier


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v6 3/4] mempool: fix cache flushing algorithm
  2022-10-15  6:57                 ` Morten Brørup
@ 2022-10-18 16:32                   ` Jerin Jacob
  0 siblings, 0 replies; 85+ messages in thread
From: Jerin Jacob @ 2022-10-18 16:32 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Olivier Matz, Andrew Rybchenko, dev, Bruce Richardson

On Sat, Oct 15, 2022 at 12:27 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Friday, 14 October 2022 21.51
> >
> > On Fri, Oct 14, 2022 at 05:57:39PM +0200, Morten Brørup wrote:
> > > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > > Sent: Friday, 14 October 2022 16.01
> > > >
> > > > Hi Morten, Andrew,
> > > >
> > > > On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > > > > > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > > > Sent: Sunday, 9 October 2022 16.52
> > > > > >
> > > > > > On 10/9/22 17:31, Morten Brørup wrote:
> > > > > > >> From: Andrew Rybchenko
> > [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > > > >> Sent: Sunday, 9 October 2022 15.38
> > > > > > >>
> > > > > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > >>
> > > > >
> > > > > [...]
> > > >
> > > > I finally took a couple of hours to carefully review the mempool-
> > > > related
> > > > series (including the ones that have already been pushed).
> > > >
> > > > The new behavior looks better to me in all situations I can think
> > > > about.
> > >
> > > Extreme care is required when touching a core library like the
> > mempool.
> > >
> > > Thank you, Olivier.
> > >
> > > >
> > > > >
> > > > > > >> --- a/lib/mempool/rte_mempool.h
> > > > > > >> +++ b/lib/mempool/rte_mempool.h
> > > > > > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > > > > > >>     * Cache is allocated to this size to allow it to
> > overflow
> > > > in
> > > > > > >> certain
> > > > > > >>     * cases to avoid needless emptying of cache.
> > > > > > >>     */
> > > > > > >> -  void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**<
> > Cache
> > > > objects */
> > > > > > >> +  void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**<
> > Cache
> > > > objects */
> > > > > > >>   } __rte_cache_aligned;
> > > > > > >
> > > > > > > How much are we allowed to break the ABI here?
> > > > > > >
> > > > > > > This patch reduces the size of the structure by removing a
> > now
> > > > unused
> > > > > > part at the end, which should be harmless.
> > > >
> > > > It is an ABI breakage: an existing application will use the new
> > 22.11
> > > > function to create the mempool (with a smaller cache), but will use
> > the
> > > > old inlined get/put that can exceed MAX_SIZE x 2 will remain.
> > > >
> > > > But this is a nice memory consumption improvement, in my opinion we
> > > > should accept it for 22.11 with an entry in the release note.
> > > >
> > > >
> > > > > > >
> > > > > > > If we may also move the position of the objs array, I would
> > add
> > > > > > __rte_cache_aligned to the objs array. It makes no difference
> > in
> > > > the
> > > > > > general case, but if get/put operations are always 32 objects,
> > it
> > > > will
> > > > > > reduce the number of memory (or last level cache) accesses from
> > > > five to
> > > > > > four 64 B cache lines for every get/put operation.
> > > >
> > > > Will it really be the case? Since cache->len has to be accessed
> > too,
> > > > I don't think it would make a difference.
> > >
> > > Yes, the first cache line, containing cache->len, will be accessed
> > always. I forgot to count that; so the improvement by aligning cache-
> > >objs will be five cache line accesses instead of six.
> > >
> > > Let me try to explain the scenario in other words:
> > >
> > > In an application where a mempool cache is only accessed in bursts of
> > 32 objects (256 B), it matters if those 256 B accesses in the mempool
> > cache start at a cache line aligned address or not. If cache line
> > aligned, accessing those 256 B in the mempool cache will only touch 4
> > cache lines; if not, 5 cache lines will be touched. (For architectures
> > with 128 B cache line, it will be 2 instead of 3 touched cache lines
> > per mempool cache get/put operation in applications using only bursts
> > of 32 objects.)
> > >
> > > If we cache line align cache->objs, those bursts of 32 objects (256
> > B) will be cache line aligned: Any address at cache->objs[N * 32
> > objects] is cache line aligned if objs->objs[0] is cache line aligned.
> > >
> > > Currently, the cache->objs directly follows cache->len, which makes
> > cache->objs[0] cache line unaligned.
> > >
> > > If we decide to break the mempool cache ABI, we might as well include
> > my suggested cache line alignment performance improvement. It doesn't
> > degrade performance for mempool caches not only accessed in bursts of
> > 32 objects.
> >
> > I don't follow you. Currently, with 16 objects (128B), we access to 3
> > cache lines:
> >
> >       ┌────────┐
> >       │len     │
> > cache │********│---
> > line0 │********│ ^
> >       │********│ |
> >       ├────────┤ | 16 objects
> >       │********│ | 128B
> > cache │********│ |
> > line1 │********│ |
> >       │********│ |
> >       ├────────┤ |
> >       │********│_v_
> > cache │        │
> > line2 │        │
> >       │        │
> >       └────────┘
> >
> > With the alignment, it is also 3 cache lines:
> >
> >       ┌────────┐
> >       │len     │
> > cache │        │
> > line0 │        │
> >       │        │
> >       ├────────┤---
> >       │********│ ^
> > cache │********│ |
> > line1 │********│ |
> >       │********│ |
> >       ├────────┤ | 16 objects
> >       │********│ | 128B
> > cache │********│ |
> > line2 │********│ |
> >       │********│ v
> >       └────────┘---
> >
> >
> > Am I missing something?
>
> Accessing the objects at the bottom of the mempool cache is a special case, where cache line0 is also used for objects.
>
> Consider the next burst (and any following bursts):
>
> Current:
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line1 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │********│---
> line2 │********│ ^
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line3 │********│ |
>       │********│ |
>       ├────────┤ |
>       │********│_v_
> cache │        │
> line4 │        │
>       │        │
>       └────────┘
> 4 cache lines touched, incl. line0 for len.
>
> With the proposed alignment:
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line1 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line2 │        │
>       │        │
>       ├────────┤
>       │********│---
> cache │********│ ^
> line3 │********│ |
>       │********│ | 16 objects
>       ├────────┤ | 128B
>       │********│ |
> cache │********│ |
> line4 │********│ |
>       │********│_v_
>       └────────┘
> Only 3 cache lines touched, incl. line0 for len.


When tested with testpmd and l3fwd, there was a regression of less than
1%. It could be noise.
But making the objs array cache line aligned fixes that.

In addition to Morten Brørup's point, I think there is also a potential
"load" stall on the cache->len read. What I mean by that is:

Consider the case where len and objs are in the same cache line. The
objects are written as store operations, without reading anything on
that cache line first. After a few stores to the objects, a subsequent
read of len via an enqueue operation may stall, depending on where those
stored objects have reached in the cache hierarchy and on the cache
policy (write-back vs. write-through).

If we see no regression with the cache line alignment in testing on
various platforms, then I think it makes sense to make it cache aligned.
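
To make the pattern concrete, a minimal model of what I mean (this is
not the actual DPDK put/get code; the struct and field names are reused
only for illustration, with the packed layout and 64-bit pointers
assumed):

#include <stdint.h>

/* Simplified model of the packed layout, not the real struct rte_mempool_cache. */
struct cache_model {
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;		/* shares cache line0 with objs[0..12] */
	void *objs[1024];
};

/*
 * One put burst: load len, store the objects, store the new len.
 * When the cache is nearly empty, the stores go to objs[0..12], i.e. to
 * the same cache line as 'len', so the load of cache->len at the start
 * of the next burst may have to wait for those stores, depending on how
 * far they have progressed in the cache hierarchy and on the cache
 * policy. With objs[] on its own cache line, that dependency goes away.
 */
static inline void
model_put(struct cache_model *cache, void * const *obj_table, uint32_t n)
{
	uint32_t i;
	uint32_t len = cache->len;	/* the load that can stall */

	for (i = 0; i < n; i++)
		cache->objs[len + i] = obj_table[i];	/* the stores */
	cache->len = len + n;
}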

>
>
> >
> > >
> > > >
> > > >
> > > > > > >
> > > > > > >     uint32_t len;         /**< Current cache count */
> > > > > > > -   /*
> > > > > > > -    * Cache is allocated to this size to allow it to overflow
> > > > in
> > > > > > certain
> > > > > > > -    * cases to avoid needless emptying of cache.
> > > > > > > -    */
> > > > > > > -   void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> > > > objects */
> > > > > > > +   /**
> > > > > > > +    * Cache objects
> > > > > > > +    *
> > > > > > > +    * Cache is allocated to this size to allow it to overflow
> > > > in
> > > > > > certain
> > > > > > > +    * cases to avoid needless emptying of cache.
> > > > > > > +    */
> > > > > > > +   void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> > > > __rte_cache_aligned;
> > > > > > > } __rte_cache_aligned;
> > > > > >
> > > > > > I think aligning objs on cacheline should be a separate patch.
> > > > >
> > > > > Good point. I'll let you do it. :-)
> > > > >
> > > > > PS: Thank you for following up on this patch series, Andrew!
> > > >
> > > > Many thanks for this rework.
> > > >
> > > > Acked-by: Olivier Matz <olivier.matz@6wind.com>
> > >
> > > Perhaps Reviewed-by would be appropriate?
> >
> > I was thinking that "Acked-by" was commonly used by maintainers, and
> > "Reviewed-by" for reviews by community members. After reading the
> > documentation again, it's not that clear now in my mind :)
> >
> > Thanks,
> > Olivier
>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm
  2022-10-10 15:21   ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Thomas Monjalon
  2022-10-11 19:26     ` Morten Brørup
@ 2022-10-26 14:09     ` Thomas Monjalon
  2022-10-26 14:26       ` Morten Brørup
  1 sibling, 1 reply; 85+ messages in thread
From: Thomas Monjalon @ 2022-10-26 14:09 UTC (permalink / raw)
  To: Andrew Rybchenko, Olivier Matz, Morten Brørup
  Cc: dev, dev, Bruce Richardson

10/10/2022 17:21, Thomas Monjalon:
> > Andrew Rybchenko (3):
> >   mempool: check driver enqueue result in one place
> >   mempool: avoid usage of term ring on put
> >   mempool: flush cache completely on overflow
> > 
> > Morten Brørup (1):
> >   mempool: fix cache flushing algorithm
> 
> Applied only first 2 "cosmetic" patches as discussed with Andrew.
> The goal is to make some performance tests
> before merging the rest of the series.

The last 2 patches are now merged in 22.11-rc2

There were some comments about improving alignment on cache.
Is it something we want in this release?



^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm
  2022-10-26 14:09     ` Thomas Monjalon
@ 2022-10-26 14:26       ` Morten Brørup
  2022-10-26 14:44         ` [PATCH] mempool: cache align mempool cache objects Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-26 14:26 UTC (permalink / raw)
  To: Thomas Monjalon, Andrew Rybchenko, Olivier Matz
  Cc: dev, dev, Bruce Richardson, Jerin Jacob

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, 26 October 2022 16.09
> 
> 10/10/2022 17:21, Thomas Monjalon:
> > > Andrew Rybchenko (3):
> > >   mempool: check driver enqueue result in one place
> > >   mempool: avoid usage of term ring on put
> > >   mempool: flush cache completely on overflow
> > >
> > > Morten Brørup (1):
> > >   mempool: fix cache flushing algorithm
> >
> > Applied only first 2 "cosmetic" patches as discussed with Andrew.
> > The goal is to make some performance tests
> > before merging the rest of the series.
> 
> The last 2 patches are now merged in 22.11-rc2

Thank you.

> 
> There were some comments about improving alignment on cache.
> Is it something we want in this release?

I think so, yes. Jerin also agreed that it was a good idea.

I will send a patch in a few minutes.

-Morten


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH] mempool: cache align mempool cache objects
  2022-10-26 14:26       ` Morten Brørup
@ 2022-10-26 14:44         ` Morten Brørup
  2022-10-26 19:44           ` Andrew Rybchenko
                             ` (3 more replies)
  0 siblings, 4 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-26 14:44 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko, jerinj, thomas
  Cc: bruce.richardson, dev, Morten Brørup

Add __rte_cache_aligned to the objs array.

It makes no difference in the general case, but if get/put operations are
always 32 objects, it will reduce the number of memory (or last level
cache) accesses from five to four 64 B cache lines for every get/put
operation.

For readability reasons, an example using 16 objects follows:

Currently, with 16 objects (128 B), we access 3 cache lines:

      ┌────────┐
      │len     │
cache │********│---
line0 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line2 │        │
      │        │
      └────────┘

With the alignment, it is also 3 cache lines:

      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤---
      │********│ ^
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line2 │********│ |
      │********│ v
      └────────┘---

However, accessing the objects at the bottom of the mempool cache is a
special case, where cache line0 is also used for objects.

Consider the next burst (and any following bursts):

Current:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │********│---
line2 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line3 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line4 │        │
      │        │
      └────────┘
4 cache lines touched, incl. line0 for len.

With the proposed alignment:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line2 │        │
      │        │
      ├────────┤
      │********│---
cache │********│ ^
line3 │********│ |
      │********│ | 16 objects
      ├────────┤ | 128B
      │********│ |
cache │********│ |
line4 │********│ |
      │********│_v_
      └────────┘
Only 3 cache lines touched, incl. line0 for len.

Credits go to Olivier Matz for the nice ASCII graphics.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1f5707f46a..3725a72951 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -86,11 +86,13 @@ struct rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
 	uint32_t flushthresh; /**< Threshold before we flush excess elements */
 	uint32_t len;	      /**< Current cache count */
-	/*
+	/**
+	 * Cache objects
+	 *
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
 } __rte_cache_aligned;
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] mempool: cache align mempool cache objects
  2022-10-26 14:44         ` [PATCH] mempool: cache align mempool cache objects Morten Brørup
@ 2022-10-26 19:44           ` Andrew Rybchenko
  2022-10-27  8:34           ` Olivier Matz
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 85+ messages in thread
From: Andrew Rybchenko @ 2022-10-26 19:44 UTC (permalink / raw)
  To: Morten Brørup, olivier.matz, jerinj, thomas; +Cc: bruce.richardson, dev

On 10/26/22 17:44, Morten Brørup wrote:
> Add __rte_cache_aligned to the objs array.
> 
> It makes no difference in the general case, but if get/put operations are
> always 32 objects, it will reduce the number of memory (or last level
> cache) accesses from five to four 64 B cache lines for every get/put
> operation.
> 
> For readability reasons, an example using 16 objects follows:
> 
> Currently, with 16 objects (128B), we access to 3
> cache lines:
> 
>        ┌────────┐
>        │len     │
> cache │********│---
> line0 │********│ ^
>        │********│ |
>        ├────────┤ | 16 objects
>        │********│ | 128B
> cache │********│ |
> line1 │********│ |
>        │********│ |
>        ├────────┤ |
>        │********│_v_
> cache │        │
> line2 │        │
>        │        │
>        └────────┘
> 
> With the alignment, it is also 3 cache lines:
> 
>        ┌────────┐
>        │len     │
> cache │        │
> line0 │        │
>        │        │
>        ├────────┤---
>        │********│ ^
> cache │********│ |
> line1 │********│ |
>        │********│ |
>        ├────────┤ | 16 objects
>        │********│ | 128B
> cache │********│ |
> line2 │********│ |
>        │********│ v
>        └────────┘---
> 
> However, accessing the objects at the bottom of the mempool cache is a
> special case, where cache line0 is also used for objects.
> 
> Consider the next burst (and any following bursts):
> 
> Current:
>        ┌────────┐
>        │len     │
> cache │        │
> line0 │        │
>        │        │
>        ├────────┤
>        │        │
> cache │        │
> line1 │        │
>        │        │
>        ├────────┤
>        │        │
> cache │********│---
> line2 │********│ ^
>        │********│ |
>        ├────────┤ | 16 objects
>        │********│ | 128B
> cache │********│ |
> line3 │********│ |
>        │********│ |
>        ├────────┤ |
>        │********│_v_
> cache │        │
> line4 │        │
>        │        │
>        └────────┘
> 4 cache lines touched, incl. line0 for len.
> 
> With the proposed alignment:
>        ┌────────┐
>        │len     │
> cache │        │
> line0 │        │
>        │        │
>        ├────────┤
>        │        │
> cache │        │
> line1 │        │
>        │        │
>        ├────────┤
>        │        │
> cache │        │
> line2 │        │
>        │        │
>        ├────────┤
>        │********│---
> cache │********│ ^
> line3 │********│ |
>        │********│ | 16 objects
>        ├────────┤ | 128B
>        │********│ |
> cache │********│ |
> line4 │********│ |
>        │********│_v_
>        └────────┘
> Only 3 cache lines touched, incl. line0 for len.
> 
> Credits go to Olivier Matz for the nice ASCII graphics.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>

Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] mempool: cache align mempool cache objects
  2022-10-26 14:44         ` [PATCH] mempool: cache align mempool cache objects Morten Brørup
  2022-10-26 19:44           ` Andrew Rybchenko
@ 2022-10-27  8:34           ` Olivier Matz
  2022-10-27  9:22             ` Morten Brørup
  2022-10-28  6:35           ` [PATCH v3 1/2] " Morten Brørup
  2022-10-28  6:41           ` [PATCH v4 1/2] mempool: cache align mempool cache objects Morten Brørup
  3 siblings, 1 reply; 85+ messages in thread
From: Olivier Matz @ 2022-10-27  8:34 UTC (permalink / raw)
  To: Morten Brørup
  Cc: andrew.rybchenko, jerinj, thomas, bruce.richardson, dev

Hi Morten,

On Wed, Oct 26, 2022 at 04:44:36PM +0200, Morten Brørup wrote:
> Add __rte_cache_aligned to the objs array.
> 
> It makes no difference in the general case, but if get/put operations are
> always 32 objects, it will reduce the number of memory (or last level
> cache) accesses from five to four 64 B cache lines for every get/put
> operation.
> 
> For readability reasons, an example using 16 objects follows:
> 
> Currently, with 16 objects (128B), we access to 3
> cache lines:
> 
>       ┌────────┐
>       │len     │
> cache │********│---
> line0 │********│ ^
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line1 │********│ |
>       │********│ |
>       ├────────┤ |
>       │********│_v_
> cache │        │
> line2 │        │
>       │        │
>       └────────┘
> 
> With the alignment, it is also 3 cache lines:
> 
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤---
>       │********│ ^
> cache │********│ |
> line1 │********│ |
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line2 │********│ |
>       │********│ v
>       └────────┘---
> 
> However, accessing the objects at the bottom of the mempool cache is a
> special case, where cache line0 is also used for objects.
> 
> Consider the next burst (and any following bursts):
> 
> Current:
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line1 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │********│---
> line2 │********│ ^
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line3 │********│ |
>       │********│ |
>       ├────────┤ |
>       │********│_v_
> cache │        │
> line4 │        │
>       │        │
>       └────────┘
> 4 cache lines touched, incl. line0 for len.
> 
> With the proposed alignment:
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line1 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line2 │        │
>       │        │
>       ├────────┤
>       │********│---
> cache │********│ ^
> line3 │********│ |
>       │********│ | 16 objects
>       ├────────┤ | 128B
>       │********│ |
> cache │********│ |
> line4 │********│ |
>       │********│_v_
>       └────────┘
> Only 3 cache lines touched, incl. line0 for len.

I understand your logic, but are we sure that having an application that
works with bulks of 32 means that the cache will stay aligned to 32
elements for the whole life of the application?

In an application, the alignment of the cache can change if you have
any of:
- software queues (reassembly for instance)
- packet duplication (bridge, multicast)
- locally generated packets (keepalive, control protocol)
- pipeline to other cores

Even with testpmd, which works in bulks of 32, I can see that the
cache fill level is not aligned to 32. Right after starting the
application, we already have this:

  internal cache infos:
    cache_size=250
    cache_count[0]=231

This is probably related to the hw rx rings size, number of queues,
number of ports.

The "250" default value for cache size in testpmd is questionable, but
with --mbcache=256, the behavior is similar.

Also, when we transmit to a NIC, the mbufs are not returned immediately
to the pool; they may stay in the hw tx ring for some time, which is
a driver decision.

After processing traffic on cores 8 and 24 with this testpmd, I get:
    cache_count[0]=231
    cache_count[8]=123
    cache_count[24]=122
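
(Similar numbers can be obtained in any application with
rte_mempool_dump(), which prints the per-lcore cache_count values; a
minimal sketch, where the pool name is only an example:)

#include <stdio.h>
#include <rte_mempool.h>

static void
dump_cache_counts(void)
{
	/* "mbuf_pool" is an example name; use the name your application created. */
	struct rte_mempool *mp = rte_mempool_lookup("mbuf_pool");

	if (mp != NULL)
		rte_mempool_dump(stdout, mp);	/* prints cache_size and cache_count[lcore] */
}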

In my opinion, it is not realistic to think that the mempool cache will
remain aligned to cachelines. In these conditions, it looks better to
keep the structure packed to avoid wasting memory.

Olivier


> 
> Credits go to Olivier Matz for the nice ASCII graphics.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1f5707f46a..3725a72951 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -86,11 +86,13 @@ struct rte_mempool_cache {
>  	uint32_t size;	      /**< Size of the cache */
>  	uint32_t flushthresh; /**< Threshold before we flush excess elements */
>  	uint32_t len;	      /**< Current cache count */
> -	/*
> +	/**
> +	 * Cache objects
> +	 *
>  	 * Cache is allocated to this size to allow it to overflow in certain
>  	 * cases to avoid needless emptying of cache.
>  	 */
> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
>  } __rte_cache_aligned;
>  
>  /**
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH] mempool: cache align mempool cache objects
  2022-10-27  8:34           ` Olivier Matz
@ 2022-10-27  9:22             ` Morten Brørup
  2022-10-27 11:42               ` Olivier Matz
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-27  9:22 UTC (permalink / raw)
  To: Olivier Matz; +Cc: andrew.rybchenko, jerinj, thomas, bruce.richardson, dev

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Thursday, 27 October 2022 10.35
> 
> Hi Morten,
> 
> On Wed, Oct 26, 2022 at 04:44:36PM +0200, Morten Brørup wrote:
> > Add __rte_cache_aligned to the objs array.
> >
> > It makes no difference in the general case, but if get/put operations
> are
> > always 32 objects, it will reduce the number of memory (or last level
> > cache) accesses from five to four 64 B cache lines for every get/put
> > operation.
> >
> > For readability reasons, an example using 16 objects follows:
> >
> > Currently, with 16 objects (128B), we access to 3
> > cache lines:
> >
> >       ┌────────┐
> >       │len     │
> > cache │********│---
> > line0 │********│ ^
> >       │********│ |
> >       ├────────┤ | 16 objects
> >       │********│ | 128B
> > cache │********│ |
> > line1 │********│ |
> >       │********│ |
> >       ├────────┤ |
> >       │********│_v_
> > cache │        │
> > line2 │        │
> >       │        │
> >       └────────┘
> >
> > With the alignment, it is also 3 cache lines:
> >
> >       ┌────────┐
> >       │len     │
> > cache │        │
> > line0 │        │
> >       │        │
> >       ├────────┤---
> >       │********│ ^
> > cache │********│ |
> > line1 │********│ |
> >       │********│ |
> >       ├────────┤ | 16 objects
> >       │********│ | 128B
> > cache │********│ |
> > line2 │********│ |
> >       │********│ v
> >       └────────┘---
> >
> > However, accessing the objects at the bottom of the mempool cache is
> a
> > special case, where cache line0 is also used for objects.
> >
> > Consider the next burst (and any following bursts):
> >
> > Current:
> >       ┌────────┐
> >       │len     │
> > cache │        │
> > line0 │        │
> >       │        │
> >       ├────────┤
> >       │        │
> > cache │        │
> > line1 │        │
> >       │        │
> >       ├────────┤
> >       │        │
> > cache │********│---
> > line2 │********│ ^
> >       │********│ |
> >       ├────────┤ | 16 objects
> >       │********│ | 128B
> > cache │********│ |
> > line3 │********│ |
> >       │********│ |
> >       ├────────┤ |
> >       │********│_v_
> > cache │        │
> > line4 │        │
> >       │        │
> >       └────────┘
> > 4 cache lines touched, incl. line0 for len.
> >
> > With the proposed alignment:
> >       ┌────────┐
> >       │len     │
> > cache │        │
> > line0 │        │
> >       │        │
> >       ├────────┤
> >       │        │
> > cache │        │
> > line1 │        │
> >       │        │
> >       ├────────┤
> >       │        │
> > cache │        │
> > line2 │        │
> >       │        │
> >       ├────────┤
> >       │********│---
> > cache │********│ ^
> > line3 │********│ |
> >       │********│ | 16 objects
> >       ├────────┤ | 128B
> >       │********│ |
> > cache │********│ |
> > line4 │********│ |
> >       │********│_v_
> >       └────────┘
> > Only 3 cache lines touched, incl. line0 for len.
> 
> I understand your logic, but are we sure that having an application
> that
> works with bulks of 32 means that the cache will stay aligned to 32
> elements for the whole life of the application?
> 
> In an application, the alignment of the cache can change if you have
> any of:
> - software queues (reassembly for instance)
> - packet duplication (bridge, multicast)
> - locally generated packets (keepalive, control protocol)
> - pipeline to other cores
> 
> Even with testpmd, which work by bulk of 32, I can see that the size
> of the cache filling is not aligned to 32. Right after starting the
> application, we already have this:
> 
>   internal cache infos:
>     cache_size=250
>     cache_count[0]=231
> 
> This is probably related to the hw rx rings size, number of queues,
> number of ports.
> 
> The "250" default value for cache size in testpmd is questionable, but
> with --mbcache=256, the behavior is similar.
> 
> Also, when we transmit to a NIC, the mbufs are not returned immediatly
> to the pool, they may stay in the hw tx ring during some time, which is
> a driver decision.
> 
> After processing traffic on cores 8 and 24 with this testpmd, I get:
>     cache_count[0]=231
>     cache_count[8]=123
>     cache_count[24]=122
> 
> In my opinion, it is not realistic to think that the mempool cache will
> remain aligned to cachelines. In these conditions, it looks better to
> keep the structure packed to avoid wasting memory.

I agree that it is a special use case to only access the mempool cache in bursts of 32 objects, so the accesses are always cache line aligned. (Generalized, the burst size need not be 32; any burst size that is a multiple of RTE_CACHE_LINE_SIZE/sizeof(void *), i.e. a multiple of 8 on a 64-bit architecture, will do.)

Adding a hole of 52 bytes per mempool cache is nothing, considering that the mempool cache already uses 8 KB (RTE_MEMPOOL_CACHE_MAX_SIZE * 2 * sizeof(void *) = 1024 * 8 bytes) for the objects.

Also - assuming that memory allocations are cache line aligned - the 52 bytes of unused memory cannot be used regardless of whether they are located before or after the objects. Instead of having 52 B unused after the objects, we might as well have a hole of 52 B unused before the objects. In other words: there is really no downside to this.
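
Just to spell out where the 52 B figure comes from, here is a compile-time sketch (illustration only, assuming a 64 B cache line and 64-bit pointers; the real definition lives in rte_mempool.h):

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout model of the patched struct. */
struct cache_layout {
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
	/* 52 B of padding here: 64 - 3 * sizeof(uint32_t) */
	void *objs[1024] __attribute__((aligned(64)));
} __attribute__((aligned(64)));

static_assert(offsetof(struct cache_layout, objs) == 64,
	"objs[] starts on the second cache line");
static_assert(sizeof(((struct cache_layout *)0)->objs) == 8192,
	"8 KB of objects: 1024 pointers of 8 B each");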

Jerin also presented a separate argument for moving the objects to a different cache line than the len field: the risk of a "load-after-store stall" when loading the len field after storing objects in cache line0 [1].

[1]: http://inbox.dpdk.org/dev/CALBAE1P4zFYdLwoQukn5Q-V-nTvc_UBWmWjhaV2uVBXQRytSSA@mail.gmail.com/

A new idea just popped into my head: The hot debug statistics counters (put_bulk, put_objs, get_success_bulk, get_success_objs) could be moved to this free space, reducing the need to touch another cache line for debug counters. I haven’t thought this idea through yet; it might conflict with Jerin's comment.
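
Something like this, purely as a sketch of the idea (not compiled or benchmarked; the counter names are borrowed from the existing debug stats):

struct rte_mempool_cache {	/* hypothetical variant */
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
	/* Hot counters in the padding: 4 * 8 B = 32 B, fits in the 52 B hole. */
	uint64_t put_bulk;
	uint64_t put_objs;
	uint64_t get_success_bulk;
	uint64_t get_success_objs;
#endif
	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
} __rte_cache_aligned;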

> 
> Olivier
> 
> 
> >
> > Credits go to Olivier Matz for the nice ASCII graphics.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/mempool/rte_mempool.h | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1f5707f46a..3725a72951 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -86,11 +86,13 @@ struct rte_mempool_cache {
> >  	uint32_t size;	      /**< Size of the cache */
> >  	uint32_t flushthresh; /**< Threshold before we flush excess
> elements */
> >  	uint32_t len;	      /**< Current cache count */
> > -	/*
> > +	/**
> > +	 * Cache objects
> > +	 *
> >  	 * Cache is allocated to this size to allow it to overflow in
> certain
> >  	 * cases to avoid needless emptying of cache.
> >  	 */
> > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> >  } __rte_cache_aligned;
> >
> >  /**
> > --
> > 2.17.1
> >


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] mempool: cache align mempool cache objects
  2022-10-27  9:22             ` Morten Brørup
@ 2022-10-27 11:42               ` Olivier Matz
  2022-10-27 12:11                 ` Morten Brørup
  0 siblings, 1 reply; 85+ messages in thread
From: Olivier Matz @ 2022-10-27 11:42 UTC (permalink / raw)
  To: Morten Brørup
  Cc: andrew.rybchenko, jerinj, thomas, bruce.richardson, dev

On Thu, Oct 27, 2022 at 11:22:07AM +0200, Morten Brørup wrote:
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Thursday, 27 October 2022 10.35
> > 
> > Hi Morten,
> > 
> > On Wed, Oct 26, 2022 at 04:44:36PM +0200, Morten Brørup wrote:
> > > Add __rte_cache_aligned to the objs array.
> > >
> > > It makes no difference in the general case, but if get/put operations
> > are
> > > always 32 objects, it will reduce the number of memory (or last level
> > > cache) accesses from five to four 64 B cache lines for every get/put
> > > operation.
> > >
> > > For readability reasons, an example using 16 objects follows:
> > >
> > > Currently, with 16 objects (128B), we access to 3
> > > cache lines:
> > >
> > >       ┌────────┐
> > >       │len     │
> > > cache │********│---
> > > line0 │********│ ^
> > >       │********│ |
> > >       ├────────┤ | 16 objects
> > >       │********│ | 128B
> > > cache │********│ |
> > > line1 │********│ |
> > >       │********│ |
> > >       ├────────┤ |
> > >       │********│_v_
> > > cache │        │
> > > line2 │        │
> > >       │        │
> > >       └────────┘
> > >
> > > With the alignment, it is also 3 cache lines:
> > >
> > >       ┌────────┐
> > >       │len     │
> > > cache │        │
> > > line0 │        │
> > >       │        │
> > >       ├────────┤---
> > >       │********│ ^
> > > cache │********│ |
> > > line1 │********│ |
> > >       │********│ |
> > >       ├────────┤ | 16 objects
> > >       │********│ | 128B
> > > cache │********│ |
> > > line2 │********│ |
> > >       │********│ v
> > >       └────────┘---
> > >
> > > However, accessing the objects at the bottom of the mempool cache is
> > a
> > > special case, where cache line0 is also used for objects.
> > >
> > > Consider the next burst (and any following bursts):
> > >
> > > Current:
> > >       ┌────────┐
> > >       │len     │
> > > cache │        │
> > > line0 │        │
> > >       │        │
> > >       ├────────┤
> > >       │        │
> > > cache │        │
> > > line1 │        │
> > >       │        │
> > >       ├────────┤
> > >       │        │
> > > cache │********│---
> > > line2 │********│ ^
> > >       │********│ |
> > >       ├────────┤ | 16 objects
> > >       │********│ | 128B
> > > cache │********│ |
> > > line3 │********│ |
> > >       │********│ |
> > >       ├────────┤ |
> > >       │********│_v_
> > > cache │        │
> > > line4 │        │
> > >       │        │
> > >       └────────┘
> > > 4 cache lines touched, incl. line0 for len.
> > >
> > > With the proposed alignment:
> > >       ┌────────┐
> > >       │len     │
> > > cache │        │
> > > line0 │        │
> > >       │        │
> > >       ├────────┤
> > >       │        │
> > > cache │        │
> > > line1 │        │
> > >       │        │
> > >       ├────────┤
> > >       │        │
> > > cache │        │
> > > line2 │        │
> > >       │        │
> > >       ├────────┤
> > >       │********│---
> > > cache │********│ ^
> > > line3 │********│ |
> > >       │********│ | 16 objects
> > >       ├────────┤ | 128B
> > >       │********│ |
> > > cache │********│ |
> > > line4 │********│ |
> > >       │********│_v_
> > >       └────────┘
> > > Only 3 cache lines touched, incl. line0 for len.
> > 
> > I understand your logic, but are we sure that having an application
> > that
> > works with bulks of 32 means that the cache will stay aligned to 32
> > elements for the whole life of the application?
> > 
> > In an application, the alignment of the cache can change if you have
> > any of:
> > - software queues (reassembly for instance)
> > - packet duplication (bridge, multicast)
> > - locally generated packets (keepalive, control protocol)
> > - pipeline to other cores
> > 
> > Even with testpmd, which work by bulk of 32, I can see that the size
> > of the cache filling is not aligned to 32. Right after starting the
> > application, we already have this:
> > 
> >   internal cache infos:
> >     cache_size=250
> >     cache_count[0]=231
> > 
> > This is probably related to the hw rx rings size, number of queues,
> > number of ports.
> > 
> > The "250" default value for cache size in testpmd is questionable, but
> > with --mbcache=256, the behavior is similar.
> > 
> > Also, when we transmit to a NIC, the mbufs are not returned immediatly
> > to the pool, they may stay in the hw tx ring during some time, which is
> > a driver decision.
> > 
> > After processing traffic on cores 8 and 24 with this testpmd, I get:
> >     cache_count[0]=231
> >     cache_count[8]=123
> >     cache_count[24]=122
> > 
> > In my opinion, it is not realistic to think that the mempool cache will
> > remain aligned to cachelines. In these conditions, it looks better to
> > keep the structure packed to avoid wasting memory.
> 
> I agree that is a special use case to only access the mempool cache in
> bursts of 32 objects, so the accesses are always cache line
> aligned. (Generalized, the burst size must not be 32; a burst size
> that is a multiple of RTE_CACHE_LINE_SIZE/sizeof(void*), i.e. a burst
> size of 8 on a 64-bit architecture, will do.)

Is there a real situation where read/write accesses always happen in
bulks of 32? From what I see in my quick test, that is not the case,
even with testpmd.

> Adding a hole of 52 byte per mempool cache is nothing, considering
> that the mempool cache already uses 8 KB (RTE_MEMPOOL_CACHE_MAX_SIZE *
> 2 * sizeof(void*) = 1024 * 8 byte) for the objects.
>
> Also - assuming that memory allocations are cache line aligned - the
> 52 byte of unused memory cannot be used regardless if they are before
> or after the objects. Instead of having 52 B unused after the objects,
> we might as well have a hole of 52 B unused before the objects. In
> other words: There is really no downside to this.

Correct, the memory waste argument to nack the patch is invalid.

> Jerin also presented a separate argument for moving the objects to
> another cache line than the len field: The risk for "load-after-store
> stall" when loading the len field after storing objects in cache line0
> [1].
> 
> [1]: http://inbox.dpdk.org/dev/CALBAE1P4zFYdLwoQukn5Q-V-nTvc_UBWmWjhaV2uVBXQRytSSA@mail.gmail.com/

I would be cautious about this justification without numbers. The case
where we access the objects in the first cache line (out of several KB)
is maybe not that frequent.

> A new idea just popped into my head: The hot debug statistics
> counters (put_bulk, put_objs, get_success_bulk, get_success_objs)
> could be moved to this free space, reducing the need to touch another
> cache line for debug counters. I haven’t thought this idea through
> yet; it might conflict with Jerin's comment.

Yes, but since the stats are only enabled when RTE_LIBRTE_MEMPOOL_DEBUG
is set, it won't have any impact on non-debug builds.


Honestly, I find it hard to convince myself that it is a real
optimization. I don't see any reason why it would be slower though. So
since we already broke the mempool cache struct ABI in a previous
commit, and since it won't consume more memory, I'm ok to include that
patch. It would be great to have numbers to put some weight in the
balance.




> 
> > 
> > Olivier
> > 
> > 
> > >
> > > Credits go to Olivier Matz for the nice ASCII graphics.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > >  lib/mempool/rte_mempool.h | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 1f5707f46a..3725a72951 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -86,11 +86,13 @@ struct rte_mempool_cache {
> > >  	uint32_t size;	      /**< Size of the cache */
> > >  	uint32_t flushthresh; /**< Threshold before we flush excess
> > elements */
> > >  	uint32_t len;	      /**< Current cache count */
> > > -	/*
> > > +	/**
> > > +	 * Cache objects
> > > +	 *
> > >  	 * Cache is allocated to this size to allow it to overflow in
> > certain
> > >  	 * cases to avoid needless emptying of cache.
> > >  	 */
> > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> > >  } __rte_cache_aligned;
> > >
> > >  /**
> > > --
> > > 2.17.1
> > >
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH] mempool: cache align mempool cache objects
  2022-10-27 11:42               ` Olivier Matz
@ 2022-10-27 12:11                 ` Morten Brørup
  2022-10-27 15:20                   ` Olivier Matz
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-27 12:11 UTC (permalink / raw)
  To: Olivier Matz; +Cc: andrew.rybchenko, jerinj, thomas, bruce.richardson, dev

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Thursday, 27 October 2022 13.43
> 
> On Thu, Oct 27, 2022 at 11:22:07AM +0200, Morten Brørup wrote:
> > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > Sent: Thursday, 27 October 2022 10.35
> > >
> > > Hi Morten,
> > >
> > > On Wed, Oct 26, 2022 at 04:44:36PM +0200, Morten Brørup wrote:
> > > > Add __rte_cache_aligned to the objs array.
> > > >
> > > > It makes no difference in the general case, but if get/put
> operations
> > > are
> > > > always 32 objects, it will reduce the number of memory (or last
> level
> > > > cache) accesses from five to four 64 B cache lines for every
> get/put
> > > > operation.
> > > >
> > > > For readability reasons, an example using 16 objects follows:
> > > >
> > > > Currently, with 16 objects (128B), we access to 3
> > > > cache lines:
> > > >
> > > >       ┌────────┐
> > > >       │len     │
> > > > cache │********│---
> > > > line0 │********│ ^
> > > >       │********│ |
> > > >       ├────────┤ | 16 objects
> > > >       │********│ | 128B
> > > > cache │********│ |
> > > > line1 │********│ |
> > > >       │********│ |
> > > >       ├────────┤ |
> > > >       │********│_v_
> > > > cache │        │
> > > > line2 │        │
> > > >       │        │
> > > >       └────────┘
> > > >
> > > > With the alignment, it is also 3 cache lines:
> > > >
> > > >       ┌────────┐
> > > >       │len     │
> > > > cache │        │
> > > > line0 │        │
> > > >       │        │
> > > >       ├────────┤---
> > > >       │********│ ^
> > > > cache │********│ |
> > > > line1 │********│ |
> > > >       │********│ |
> > > >       ├────────┤ | 16 objects
> > > >       │********│ | 128B
> > > > cache │********│ |
> > > > line2 │********│ |
> > > >       │********│ v
> > > >       └────────┘---
> > > >
> > > > However, accessing the objects at the bottom of the mempool cache
> is
> > > a
> > > > special case, where cache line0 is also used for objects.
> > > >
> > > > Consider the next burst (and any following bursts):
> > > >
> > > > Current:
> > > >       ┌────────┐
> > > >       │len     │
> > > > cache │        │
> > > > line0 │        │
> > > >       │        │
> > > >       ├────────┤
> > > >       │        │
> > > > cache │        │
> > > > line1 │        │
> > > >       │        │
> > > >       ├────────┤
> > > >       │        │
> > > > cache │********│---
> > > > line2 │********│ ^
> > > >       │********│ |
> > > >       ├────────┤ | 16 objects
> > > >       │********│ | 128B
> > > > cache │********│ |
> > > > line3 │********│ |
> > > >       │********│ |
> > > >       ├────────┤ |
> > > >       │********│_v_
> > > > cache │        │
> > > > line4 │        │
> > > >       │        │
> > > >       └────────┘
> > > > 4 cache lines touched, incl. line0 for len.
> > > >
> > > > With the proposed alignment:
> > > >       ┌────────┐
> > > >       │len     │
> > > > cache │        │
> > > > line0 │        │
> > > >       │        │
> > > >       ├────────┤
> > > >       │        │
> > > > cache │        │
> > > > line1 │        │
> > > >       │        │
> > > >       ├────────┤
> > > >       │        │
> > > > cache │        │
> > > > line2 │        │
> > > >       │        │
> > > >       ├────────┤
> > > >       │********│---
> > > > cache │********│ ^
> > > > line3 │********│ |
> > > >       │********│ | 16 objects
> > > >       ├────────┤ | 128B
> > > >       │********│ |
> > > > cache │********│ |
> > > > line4 │********│ |
> > > >       │********│_v_
> > > >       └────────┘
> > > > Only 3 cache lines touched, incl. line0 for len.
> > >
> > > I understand your logic, but are we sure that having an application
> > > that
> > > works with bulks of 32 means that the cache will stay aligned to 32
> > > elements for the whole life of the application?
> > >
> > > In an application, the alignment of the cache can change if you
> have
> > > any of:
> > > - software queues (reassembly for instance)
> > > - packet duplication (bridge, multicast)
> > > - locally generated packets (keepalive, control protocol)
> > > - pipeline to other cores
> > >
> > > Even with testpmd, which work by bulk of 32, I can see that the
> size
> > > of the cache filling is not aligned to 32. Right after starting the
> > > application, we already have this:
> > >
> > >   internal cache infos:
> > >     cache_size=250
> > >     cache_count[0]=231
> > >
> > > This is probably related to the hw rx rings size, number of queues,
> > > number of ports.
> > >
> > > The "250" default value for cache size in testpmd is questionable,
> but
> > > with --mbcache=256, the behavior is similar.
> > >
> > > Also, when we transmit to a NIC, the mbufs are not returned
> immediatly
> > > to the pool, they may stay in the hw tx ring during some time,
> which is
> > > a driver decision.
> > >
> > > After processing traffic on cores 8 and 24 with this testpmd, I
> get:
> > >     cache_count[0]=231
> > >     cache_count[8]=123
> > >     cache_count[24]=122
> > >
> > > In my opinion, it is not realistic to think that the mempool cache
> will
> > > remain aligned to cachelines. In these conditions, it looks better
> to
> > > keep the structure packed to avoid wasting memory.
> >
> > I agree that is a special use case to only access the mempool cache
> in
> > bursts of 32 objects, so the accesses are always cache line
> > aligned. (Generalized, the burst size must not be 32; a burst size
> > that is a multiple of RTE_CACHE_LINE_SIZE/sizeof(void*), i.e. a burst
> > size of 8 on a 64-bit architecture, will do.)
> 
> Is there a real situation where it happens to always have read/write
> accesses per bulks of 32? From what I see in my quick test, it is not
> the case, even with testpmd.
> 
> > Adding a hole of 52 byte per mempool cache is nothing, considering
> > that the mempool cache already uses 8 KB (RTE_MEMPOOL_CACHE_MAX_SIZE
> *
> > 2 * sizeof(void*) = 1024 * 8 byte) for the objects.
> >
> > Also - assuming that memory allocations are cache line aligned - the
> > 52 byte of unused memory cannot be used regardless if they are before
> > or after the objects. Instead of having 52 B unused after the
> objects,
> > we might as well have a hole of 52 B unused before the objects. In
> > other words: There is really no downside to this.
> 
> Correct, the memory waste argument to nack the patch is invalid.
> 
> > Jerin also presented a separate argument for moving the objects to
> > another cache line than the len field: The risk for "load-after-store
> > stall" when loading the len field after storing objects in cache
> line0
> > [1].
> >
> > [1]: http://inbox.dpdk.org/dev/CALBAE1P4zFYdLwoQukn5Q-V-
> nTvc_UBWmWjhaV2uVBXQRytSSA@mail.gmail.com/
> 
> I'll be prudent on this justification without numbers. The case where
> we
> access to the objects of the first cache line (among several KB) is
> maybe not that frequent.
> 
> > A new idea just popped into my head: The hot debug statistics
> > counters (put_bulk, put_objs, get_success_bulk, get_success_objs)
> > could be moved to this free space, reducing the need to touch another
> > cache line for debug counters. I haven’t thought this idea through
> > yet; it might conflict with Jerin's comment.
> 
> Yes, but since the stats are only enabled when RTE_LIBRTE_MEMPOOL_DEBUG
> is set, it won't have any impact on non-debug builds.

Correct, but I do expect that it would reduce the performance cost of using RTE_LIBRTE_MEMPOOL_DEBUG. I'll provide such a patch shortly.

> 
> 
> Honnestly, I find it hard to convince myself that it is a real
> optimization. I don't see any reason why it would be slower though. So
> since we already broke the mempool cache struct ABI in a previous
> commit, and since it won't consume more memory, I'm ok to include that
> patch.

I don't know if there are any such applications now, and you are probably right that there are not. But this patch opens the road towards them.

Acked-by ?

> It would be great to have numbers to put some weight in the
> balance.

Yes, it would also be great if drivers didn't copy-paste code from the mempool library, so that the performance effect of modifications to the mempool library would be reflected in such tests.

> 
> 
> 
> 
> >
> > >
> > > Olivier
> > >
> > >
> > > >
> > > > Credits go to Olivier Matz for the nice ASCII graphics.
> > > >
> > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > ---
> > > >  lib/mempool/rte_mempool.h | 6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/mempool/rte_mempool.h
> b/lib/mempool/rte_mempool.h
> > > > index 1f5707f46a..3725a72951 100644
> > > > --- a/lib/mempool/rte_mempool.h
> > > > +++ b/lib/mempool/rte_mempool.h
> > > > @@ -86,11 +86,13 @@ struct rte_mempool_cache {
> > > >  	uint32_t size;	      /**< Size of the cache */
> > > >  	uint32_t flushthresh; /**< Threshold before we flush excess
> > > elements */
> > > >  	uint32_t len;	      /**< Current cache count */
> > > > -	/*
> > > > +	/**
> > > > +	 * Cache objects
> > > > +	 *
> > > >  	 * Cache is allocated to this size to allow it to overflow
> in
> > > certain
> > > >  	 * cases to avoid needless emptying of cache.
> > > >  	 */
> > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache
> objects */
> > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> __rte_cache_aligned;
> > > >  } __rte_cache_aligned;
> > > >
> > > >  /**
> > > > --
> > > > 2.17.1
> > > >
> >


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] mempool: cache align mempool cache objects
  2022-10-27 12:11                 ` Morten Brørup
@ 2022-10-27 15:20                   ` Olivier Matz
  0 siblings, 0 replies; 85+ messages in thread
From: Olivier Matz @ 2022-10-27 15:20 UTC (permalink / raw)
  To: Morten Brørup
  Cc: andrew.rybchenko, jerinj, thomas, bruce.richardson, dev

On Thu, Oct 27, 2022 at 02:11:29PM +0200, Morten Brørup wrote:
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Thursday, 27 October 2022 13.43
> > 
> > On Thu, Oct 27, 2022 at 11:22:07AM +0200, Morten Brørup wrote:
> > > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > > Sent: Thursday, 27 October 2022 10.35
> > > >
> > > > Hi Morten,
> > > >
> > > > On Wed, Oct 26, 2022 at 04:44:36PM +0200, Morten Brørup wrote:
> > > > > Add __rte_cache_aligned to the objs array.
> > > > >
> > > > > It makes no difference in the general case, but if get/put
> > operations
> > > > are
> > > > > always 32 objects, it will reduce the number of memory (or last
> > level
> > > > > cache) accesses from five to four 64 B cache lines for every
> > get/put
> > > > > operation.
> > > > >
> > > > > For readability reasons, an example using 16 objects follows:
> > > > >
> > > > > Currently, with 16 objects (128B), we access to 3
> > > > > cache lines:
> > > > >
> > > > >       ┌────────┐
> > > > >       │len     │
> > > > > cache │********│---
> > > > > line0 │********│ ^
> > > > >       │********│ |
> > > > >       ├────────┤ | 16 objects
> > > > >       │********│ | 128B
> > > > > cache │********│ |
> > > > > line1 │********│ |
> > > > >       │********│ |
> > > > >       ├────────┤ |
> > > > >       │********│_v_
> > > > > cache │        │
> > > > > line2 │        │
> > > > >       │        │
> > > > >       └────────┘
> > > > >
> > > > > With the alignment, it is also 3 cache lines:
> > > > >
> > > > >       ┌────────┐
> > > > >       │len     │
> > > > > cache │        │
> > > > > line0 │        │
> > > > >       │        │
> > > > >       ├────────┤---
> > > > >       │********│ ^
> > > > > cache │********│ |
> > > > > line1 │********│ |
> > > > >       │********│ |
> > > > >       ├────────┤ | 16 objects
> > > > >       │********│ | 128B
> > > > > cache │********│ |
> > > > > line2 │********│ |
> > > > >       │********│ v
> > > > >       └────────┘---
> > > > >
> > > > > However, accessing the objects at the bottom of the mempool cache
> > is
> > > > a
> > > > > special case, where cache line0 is also used for objects.
> > > > >
> > > > > Consider the next burst (and any following bursts):
> > > > >
> > > > > Current:
> > > > >       ┌────────┐
> > > > >       │len     │
> > > > > cache │        │
> > > > > line0 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │        │
> > > > > cache │        │
> > > > > line1 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │        │
> > > > > cache │********│---
> > > > > line2 │********│ ^
> > > > >       │********│ |
> > > > >       ├────────┤ | 16 objects
> > > > >       │********│ | 128B
> > > > > cache │********│ |
> > > > > line3 │********│ |
> > > > >       │********│ |
> > > > >       ├────────┤ |
> > > > >       │********│_v_
> > > > > cache │        │
> > > > > line4 │        │
> > > > >       │        │
> > > > >       └────────┘
> > > > > 4 cache lines touched, incl. line0 for len.
> > > > >
> > > > > With the proposed alignment:
> > > > >       ┌────────┐
> > > > >       │len     │
> > > > > cache │        │
> > > > > line0 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │        │
> > > > > cache │        │
> > > > > line1 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │        │
> > > > > cache │        │
> > > > > line2 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │********│---
> > > > > cache │********│ ^
> > > > > line3 │********│ |
> > > > >       │********│ | 16 objects
> > > > >       ├────────┤ | 128B
> > > > >       │********│ |
> > > > > cache │********│ |
> > > > > line4 │********│ |
> > > > >       │********│_v_
> > > > >       └────────┘
> > > > > Only 3 cache lines touched, incl. line0 for len.
> > > >
> > > > I understand your logic, but are we sure that having an application
> > > > that
> > > > works with bulks of 32 means that the cache will stay aligned to 32
> > > > elements for the whole life of the application?
> > > >
> > > > In an application, the alignment of the cache can change if you
> > have
> > > > any of:
> > > > - software queues (reassembly for instance)
> > > > - packet duplication (bridge, multicast)
> > > > - locally generated packets (keepalive, control protocol)
> > > > - pipeline to other cores
> > > >
> > > > Even with testpmd, which work by bulk of 32, I can see that the
> > size
> > > > of the cache filling is not aligned to 32. Right after starting the
> > > > application, we already have this:
> > > >
> > > >   internal cache infos:
> > > >     cache_size=250
> > > >     cache_count[0]=231
> > > >
> > > > This is probably related to the hw rx rings size, number of queues,
> > > > number of ports.
> > > >
> > > > The "250" default value for cache size in testpmd is questionable,
> > but
> > > > with --mbcache=256, the behavior is similar.
> > > >
> > > > Also, when we transmit to a NIC, the mbufs are not returned
> > immediatly
> > > > to the pool, they may stay in the hw tx ring during some time,
> > which is
> > > > a driver decision.
> > > >
> > > > After processing traffic on cores 8 and 24 with this testpmd, I
> > get:
> > > >     cache_count[0]=231
> > > >     cache_count[8]=123
> > > >     cache_count[24]=122
> > > >
> > > > In my opinion, it is not realistic to think that the mempool cache
> > will
> > > > remain aligned to cachelines. In these conditions, it looks better
> > to
> > > > keep the structure packed to avoid wasting memory.
> > >
> > > I agree that it is a special use case to only access the mempool cache
> > in
> > > bursts of 32 objects, so the accesses are always cache line
> > > aligned. (Generalized, the burst size does not have to be 32; a burst size
> > > that is a multiple of RTE_CACHE_LINE_SIZE/sizeof(void*), i.e. a burst
> > > size of 8 on a 64-bit architecture, will do.)
> > 
> > Is there a real situation where it happens to always have read/write
> > accesses per bulks of 32? From what I see in my quick test, it is not
> > the case, even with testpmd.
> > 
> > > Adding a hole of 52 bytes per mempool cache is nothing, considering
> > > that the mempool cache already uses 8 KB (RTE_MEMPOOL_CACHE_MAX_SIZE
> > *
> > > 2 * sizeof(void*) = 1024 * 8 byte) for the objects.
> > >
> > > Also - assuming that memory allocations are cache line aligned - the
> > > 52 bytes of unused memory cannot be used regardless of whether they are before
> > > or after the objects. Instead of having 52 B unused after the
> > objects,
> > > we might as well have a hole of 52 B unused before the objects. In
> > > other words: There is really no downside to this.
> > 
> > Correct, the memory waste argument to nack the patch is invalid.
> > 
> > > Jerin also presented a separate argument for moving the objects to
> > > another cache line than the len field: The risk for "load-after-store
> > > stall" when loading the len field after storing objects in cache
> > line0
> > > [1].
> > >
> > > [1]: http://inbox.dpdk.org/dev/CALBAE1P4zFYdLwoQukn5Q-V-
> > nTvc_UBWmWjhaV2uVBXQRytSSA@mail.gmail.com/
> > 
> > I'll be prudent on this justification without numbers. The case where
> > we
> > access the objects of the first cache line (among several KB) is
> > maybe not that frequent.
> > 
> > > A new idea just popped into my head: The hot debug statistics
> > > counters (put_bulk, put_objs, get_success_bulk, get_success_objs)
> > > could be moved to this free space, reducing the need to touch another
> > > cache line for debug counters. I haven’t thought this idea through
> > > yet; it might conflict with Jerin's comment.
> > 
> > Yes, but since the stats are only enabled when RTE_LIBRTE_MEMPOOL_DEBUG
> > is set, it won't have any impact on non-debug builds.
> 
> Correct, but I do expect that it would reduce the performance cost of using RTE_LIBRTE_MEMPOOL_DEBUG. I'll provide such a patch shortly.
> 
> > 
> > 
> > Honestly, I find it hard to convince myself that it is a real
> > optimization. I don't see any reason why it would be slower though. So
> > since we already broke the mempool cache struct ABI in a previous
> > commit, and since it won't consume more memory, I'm ok to include that
> > patch.
> 
> I don't know if there are any such applications now, and you are probably right that there are not. But this patch opens a road towards it.
> 
> Acked-by ?

Acked-by: Olivier Matz <olivier.matz@6wind.com>

Thanks Morten

> 
> > It would be great to have numbers to put some weight in the
> > balance.
> 
> Yes, it would also be great if drivers didn't copy-paste code from the mempool library, so the performance effect of modifications in the mempool library would be reflected in such tests.
> 
> > 
> > 
> > 
> > 
> > >
> > > >
> > > > Olivier
> > > >
> > > >
> > > > >
> > > > > Credits go to Olivier Matz for the nice ASCII graphics.
> > > > >
> > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > ---
> > > > >  lib/mempool/rte_mempool.h | 6 ++++--
> > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/mempool/rte_mempool.h
> > b/lib/mempool/rte_mempool.h
> > > > > index 1f5707f46a..3725a72951 100644
> > > > > --- a/lib/mempool/rte_mempool.h
> > > > > +++ b/lib/mempool/rte_mempool.h
> > > > > @@ -86,11 +86,13 @@ struct rte_mempool_cache {
> > > > >  	uint32_t size;	      /**< Size of the cache */
> > > > >  	uint32_t flushthresh; /**< Threshold before we flush excess
> > > > elements */
> > > > >  	uint32_t len;	      /**< Current cache count */
> > > > > -	/*
> > > > > +	/**
> > > > > +	 * Cache objects
> > > > > +	 *
> > > > >  	 * Cache is allocated to this size to allow it to overflow
> > in
> > > > certain
> > > > >  	 * cases to avoid needless emptying of cache.
> > > > >  	 */
> > > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache
> > objects */
> > > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> > __rte_cache_aligned;
> > > > >  } __rte_cache_aligned;
> > > > >
> > > > >  /**
> > > > > --
> > > > > 2.17.1
> > > > >
> > >
> 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v3 1/2] mempool: cache align mempool cache objects
  2022-10-26 14:44         ` [PATCH] mempool: cache align mempool cache objects Morten Brørup
  2022-10-26 19:44           ` Andrew Rybchenko
  2022-10-27  8:34           ` Olivier Matz
@ 2022-10-28  6:35           ` Morten Brørup
  2022-10-28  6:35             ` [PATCH v3 2/2] mempool: optimized debug statistics Morten Brørup
  2022-10-28  6:41           ` [PATCH v4 1/2] mempool: cache align mempool cache objects Morten Brørup
  3 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-28  6:35 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: jerinj, thomas, bruce.richardson, dev, Morten Brørup

Add __rte_cache_aligned to the objs array.

It makes no difference in the general case, but if get/put operations are
always 32 objects, it will reduce the number of memory (or last level
cache) accesses from five to four 64 B cache lines for every get/put
operation.

For readability reasons, an example using 16 objects follows:

Currently, with 16 objects (128B), we access 3
cache lines:

      ┌────────┐
      │len     │
cache │********│---
line0 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line2 │        │
      │        │
      └────────┘

With the alignment, it is also 3 cache lines:

      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤---
      │********│ ^
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line2 │********│ |
      │********│ v
      └────────┘---

However, accessing the objects at the bottom of the mempool cache is a
special case, where cache line0 is also used for objects.

Consider the next burst (and any following bursts):

Current:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │********│---
line2 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line3 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line4 │        │
      │        │
      └────────┘
4 cache lines touched, incl. line0 for len.

With the proposed alignment:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line2 │        │
      │        │
      ├────────┤
      │********│---
cache │********│ ^
line3 │********│ |
      │********│ | 16 objects
      ├────────┤ | 128B
      │********│ |
cache │********│ |
line4 │********│ |
      │********│_v_
      └────────┘
Only 3 cache lines touched, incl. line0 for len.

Credits go to Olivier Matz for the nice ASCII graphics.
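
The arithmetic behind the five-versus-four cache line count can be
sanity-checked with a small stand-alone sketch. This is only an
illustration: the struct below is a simplified stand-in for struct
rte_mempool_cache, CACHE_LINE and CACHE_MAX are placeholder names (not
the DPDK macros), and 8 B pointers with 64 B cache lines are assumed.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64	/* stand-in for RTE_CACHE_LINE_SIZE */
#define CACHE_MAX  512	/* stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

struct cache_sketch {
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
	/* the proposed alignment of the objs array */
	void *objs[CACHE_MAX * 2] __attribute__((aligned(CACHE_LINE)));
} __attribute__((aligned(CACHE_LINE)));

int main(void)
{
	/* objs now starts on its own cache line, away from len */
	assert(offsetof(struct cache_sketch, objs) == CACHE_LINE);
	/* a burst of 32 pointers covers exactly 4 cache lines (64-bit) */
	assert(32 * sizeof(void *) == 4 * CACHE_LINE);
	return 0;
}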

v3:
* No changes. Made part of a series.
v2:
* No such version.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1f5707f46a..3725a72951 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -86,11 +86,13 @@ struct rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
 	uint32_t flushthresh; /**< Threshold before we flush excess elements */
 	uint32_t len;	      /**< Current cache count */
-	/*
+	/**
+	 * Cache objects
+	 *
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
 } __rte_cache_aligned;
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v3 2/2] mempool: optimized debug statistics
  2022-10-28  6:35           ` [PATCH v3 1/2] " Morten Brørup
@ 2022-10-28  6:35             ` Morten Brørup
  0 siblings, 0 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-28  6:35 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: jerinj, thomas, bruce.richardson, dev, Morten Brørup

When built with debug enabled (RTE_LIBRTE_MEMPOOL_DEBUG defined), the
performance of mempools with caches is improved as follows.

Accessing objects in the mempool is likely to increment either the
put_bulk and put_objs or the get_success_bulk and get_success_objs
debug statistics counters.

By adding an alternative set of these counters to the mempool cache
structure, accessing the dedicated debug statistics structure is avoided in
the likely cases where these counters are incremented.

The trick here is that the cache line holding the mempool cache structure
is accessed anyway, in order to update the "len" field. Updating some
debug statistics counters in the same cache line has lower performance
cost than accessing the debug statistics counters in the dedicated debug
statistics structure, i.e. in another cache line.
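
As a rough sketch of the idea (not the exact code in the diff below), the
likely-path statistics update can be viewed as a helper that picks the
counter location depending on whether a cache is present. The helper name
is made up for illustration; the cache fields are the ones added by this
patch.

static inline void
stat_get_success(struct rte_mempool *mp, struct rte_mempool_cache *cache,
		unsigned int n)
{
#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
	if (likely(cache != NULL)) {
		/* same cache line as cache->len, which is written anyway */
		cache->get_success_bulk += 1;
		cache->get_success_objs += n;
	} else {
		/* dedicated per-lcore stats, i.e. another cache line */
		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
	}
#else
	RTE_SET_USED(mp);
	RTE_SET_USED(cache);
	RTE_SET_USED(n);
#endif
}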

Running mempool_perf_autotest on a VMware virtual server shows an avg.
increase of 6.4 % in rate_persec for the tests with cache. (Only when
built with debug enabled, obviously!)

For the tests without cache, the avg. increase in rate_persec is 0.8 %. I
assume this is noise from the test environment.

v3:
* Try to fix git reference by making part of a series.
* Add --in-reply-to v1 when sending email.
v2:
* Fix spelling and repeated word in commit message, caught by checkpatch.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.c |  7 +++++
 lib/mempool/rte_mempool.h | 55 +++++++++++++++++++++++++++++++--------
 2 files changed, 51 insertions(+), 11 deletions(-)

diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 21c94a2b9f..7b8c00a022 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -1285,6 +1285,13 @@ rte_mempool_dump(FILE *f, struct rte_mempool *mp)
 		sum.get_fail_objs += mp->stats[lcore_id].get_fail_objs;
 		sum.get_success_blks += mp->stats[lcore_id].get_success_blks;
 		sum.get_fail_blks += mp->stats[lcore_id].get_fail_blks;
+		/* Add the fast access statistics, if local caches exist */
+		if (mp->cache_size != 0) {
+			sum.put_bulk += mp->local_cache[lcore_id].put_bulk;
+			sum.put_objs += mp->local_cache[lcore_id].put_objs;
+			sum.get_success_bulk += mp->local_cache[lcore_id].get_success_bulk;
+			sum.get_success_objs += mp->local_cache[lcore_id].get_success_objs;
+		}
 	}
 	fprintf(f, "  stats:\n");
 	fprintf(f, "    put_bulk=%"PRIu64"\n", sum.put_bulk);
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 3725a72951..d84087bc92 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -86,6 +86,14 @@ struct rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
 	uint32_t flushthresh; /**< Threshold before we flush excess elements */
 	uint32_t len;	      /**< Current cache count */
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+	uint32_t unused;
+	/* Fast access statistics, only for likely events */
+	uint64_t put_bulk;             /**< Number of puts. */
+	uint64_t put_objs;             /**< Number of objects successfully put. */
+	uint64_t get_success_bulk;     /**< Successful allocation number. */
+	uint64_t get_success_objs;     /**< Objects successfully allocated. */
+#endif
 	/**
 	 * Cache objects
 	 *
@@ -1327,13 +1335,19 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 {
 	void **cache_objs;
 
+	/* No cache provided */
+	if (unlikely(cache == NULL))
+		goto driver_enqueue;
+
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	/* increment stat now, adding in mempool always success */
-	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
-	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
+	cache->put_bulk += 1;
+	cache->put_objs += n;
+#endif
 
-	/* No cache provided or the request itself is too big for the cache */
-	if (unlikely(cache == NULL || n > cache->flushthresh))
-		goto driver_enqueue;
+	/* The request is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto driver_enqueue_stats_incremented;
 
 	/*
 	 * The cache follows the following algorithm:
@@ -1358,6 +1372,12 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 driver_enqueue:
 
+	/* increment stat now, adding in mempool always success */
+	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
+	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
+
+driver_enqueue_stats_incremented:
+
 	/* push objects to the backend */
 	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
 }
@@ -1464,8 +1484,10 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	if (remaining == 0) {
 		/* The entire request is satisfied from the cache. */
 
-		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
-		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		cache->get_success_bulk += 1;
+		cache->get_success_objs += n;
+#endif
 
 		return 0;
 	}
@@ -1494,8 +1516,10 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 	cache->len = cache->size;
 
-	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
-	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+	cache->get_success_bulk += 1;
+	cache->get_success_objs += n;
+#endif
 
 	return 0;
 
@@ -1517,8 +1541,17 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
 	} else {
-		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
-		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (likely(cache != NULL)) {
+			cache->get_success_bulk += 1;
+			cache->get_success_objs += n;
+		} else {
+#endif
+			RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		}
+#endif
 	}
 
 	return ret;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v4 1/2] mempool: cache align mempool cache objects
  2022-10-26 14:44         ` [PATCH] mempool: cache align mempool cache objects Morten Brørup
                             ` (2 preceding siblings ...)
  2022-10-28  6:35           ` [PATCH v3 1/2] " Morten Brørup
@ 2022-10-28  6:41           ` Morten Brørup
  2022-10-28  6:41             ` [PATCH v4 2/2] mempool: optimized debug statistics Morten Brørup
  2022-10-30  9:17             ` [PATCH v4 1/2] mempool: cache align mempool cache objects Thomas Monjalon
  3 siblings, 2 replies; 85+ messages in thread
From: Morten Brørup @ 2022-10-28  6:41 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: jerinj, thomas, bruce.richardson, dev, Morten Brørup

Add __rte_cache_aligned to the objs array.

It makes no difference in the general case, but if get/put operations are
always 32 objects, it will reduce the number of memory (or last level
cache) accesses from five to four 64 B cache lines for every get/put
operation.

For readability reasons, an example using 16 objects follows:

Currently, with 16 objects (128B), we access 3
cache lines:

      ┌────────┐
      │len     │
cache │********│---
line0 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line2 │        │
      │        │
      └────────┘

With the alignment, it is also 3 cache lines:

      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤---
      │********│ ^
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line2 │********│ |
      │********│ v
      └────────┘---

However, accessing the objects at the bottom of the mempool cache is a
special case, where cache line0 is also used for objects.

Consider the next burst (and any following bursts):

Current:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │********│---
line2 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line3 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line4 │        │
      │        │
      └────────┘
4 cache lines touched, incl. line0 for len.

With the proposed alignment:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line2 │        │
      │        │
      ├────────┤
      │********│---
cache │********│ ^
line3 │********│ |
      │********│ | 16 objects
      ├────────┤ | 128B
      │********│ |
cache │********│ |
line4 │********│ |
      │********│_v_
      └────────┘
Only 3 cache lines touched, incl. line0 for len.

Credits go to Olivier Matz for the nice ASCII graphics.
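
As a usage illustration (hypothetical application code, not part of the
patch): the reduction only materializes for workloads whose get/put burst
size is a multiple of the 8 pointers that fit in a 64 B cache line, e.g.
the 32-object bursts discussed above.

#include <rte_mempool.h>

#define BURST 32

static void
forward_loop(struct rte_mempool *mp)
{
	void *objs[BURST];

	for (;;) {
		/* aligned burst: 32 * 8 B = 4 whole cache lines in the cache */
		if (rte_mempool_get_bulk(mp, objs, BURST) < 0)
			continue;	/* backend exhausted, retry */
		/* ... process the 32 objects ... */
		rte_mempool_put_bulk(mp, objs, BURST);
	}
}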

v4:
* No changes. Added reviewed- and acked-by tags.
v3:
* No changes. Made part of a series.
v2:
* No such version.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
---
 lib/mempool/rte_mempool.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1f5707f46a..3725a72951 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -86,11 +86,13 @@ struct rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
 	uint32_t flushthresh; /**< Threshold before we flush excess elements */
 	uint32_t len;	      /**< Current cache count */
-	/*
+	/**
+	 * Cache objects
+	 *
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
 } __rte_cache_aligned;
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH v4 2/2] mempool: optimized debug statistics
  2022-10-28  6:41           ` [PATCH v4 1/2] mempool: cache align mempool cache objects Morten Brørup
@ 2022-10-28  6:41             ` Morten Brørup
  2022-10-30  9:09               ` Morten Brørup
  2022-10-30  9:17             ` [PATCH v4 1/2] mempool: cache align mempool cache objects Thomas Monjalon
  1 sibling, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-28  6:41 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: jerinj, thomas, bruce.richardson, dev, Morten Brørup

When built with debug enabled (RTE_LIBRTE_MEMPOOL_DEBUG defined), the
performance of mempools with caches is improved as follows.

Accessing objects in the mempool is likely to increment either the
put_bulk and put_objs or the get_success_bulk and get_success_objs
debug statistics counters.

By adding an alternative set of these counters to the mempool cache
structure, accessing the dedicated debug statistics structure is avoided in
the likely cases where these counters are incremented.

The trick here is that the cache line holding the mempool cache structure
is accessed anyway, in order to update the "len" field. Updating some
debug statistics counters in the same cache line has lower performance
cost than accessing the debug statistics counters in the dedicated debug
statistics structure, i.e. in another cache line.

Running mempool_perf_autotest on a VMware virtual server shows an avg.
increase of 6.4 % in rate_persec for the tests with cache. (Only when
built with debug enabled, obviously!)

For the tests without cache, the avg. increase in rate_persec is 0.8 %. I
assume this is noise from the test environment.
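
For applications, reading the statistics is unchanged; the fast-path
counters are folded into the totals by rte_mempool_dump(), as done in the
rte_mempool.c hunk below. A minimal usage sketch (the helper name is made
up):

#include <stdio.h>
#include <rte_mempool.h>

static void
show_pool_stats(struct rte_mempool *mp)
{
	/* totals now include the per-lcore fast-path counters */
	rte_mempool_dump(stdout, mp);
}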

v4:
* Fix spelling and repeated word in commit message, caught by checkpatch.
v3:
* Try to fix git reference by making part of a series.
* Add --in-reply-to v1 when sending email.
v2:
* Fix spelling and repeated word in commit message, caught by checkpatch.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.c |  7 +++++
 lib/mempool/rte_mempool.h | 55 +++++++++++++++++++++++++++++++--------
 2 files changed, 51 insertions(+), 11 deletions(-)

diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 21c94a2b9f..7b8c00a022 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -1285,6 +1285,13 @@ rte_mempool_dump(FILE *f, struct rte_mempool *mp)
 		sum.get_fail_objs += mp->stats[lcore_id].get_fail_objs;
 		sum.get_success_blks += mp->stats[lcore_id].get_success_blks;
 		sum.get_fail_blks += mp->stats[lcore_id].get_fail_blks;
+		/* Add the fast access statistics, if local caches exist */
+		if (mp->cache_size != 0) {
+			sum.put_bulk += mp->local_cache[lcore_id].put_bulk;
+			sum.put_objs += mp->local_cache[lcore_id].put_objs;
+			sum.get_success_bulk += mp->local_cache[lcore_id].get_success_bulk;
+			sum.get_success_objs += mp->local_cache[lcore_id].get_success_objs;
+		}
 	}
 	fprintf(f, "  stats:\n");
 	fprintf(f, "    put_bulk=%"PRIu64"\n", sum.put_bulk);
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 3725a72951..d84087bc92 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -86,6 +86,14 @@ struct rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
 	uint32_t flushthresh; /**< Threshold before we flush excess elements */
 	uint32_t len;	      /**< Current cache count */
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+	uint32_t unused;
+	/* Fast access statistics, only for likely events */
+	uint64_t put_bulk;             /**< Number of puts. */
+	uint64_t put_objs;             /**< Number of objects successfully put. */
+	uint64_t get_success_bulk;     /**< Successful allocation number. */
+	uint64_t get_success_objs;     /**< Objects successfully allocated. */
+#endif
 	/**
 	 * Cache objects
 	 *
@@ -1327,13 +1335,19 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 {
 	void **cache_objs;
 
+	/* No cache provided */
+	if (unlikely(cache == NULL))
+		goto driver_enqueue;
+
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	/* increment stat now, adding in mempool always success */
-	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
-	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
+	cache->put_bulk += 1;
+	cache->put_objs += n;
+#endif
 
-	/* No cache provided or the request itself is too big for the cache */
-	if (unlikely(cache == NULL || n > cache->flushthresh))
-		goto driver_enqueue;
+	/* The request is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto driver_enqueue_stats_incremented;
 
 	/*
 	 * The cache follows the following algorithm:
@@ -1358,6 +1372,12 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 driver_enqueue:
 
+	/* increment stat now, adding in mempool always success */
+	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
+	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
+
+driver_enqueue_stats_incremented:
+
 	/* push objects to the backend */
 	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
 }
@@ -1464,8 +1484,10 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	if (remaining == 0) {
 		/* The entire request is satisfied from the cache. */
 
-		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
-		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		cache->get_success_bulk += 1;
+		cache->get_success_objs += n;
+#endif
 
 		return 0;
 	}
@@ -1494,8 +1516,10 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 	cache->len = cache->size;
 
-	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
-	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+	cache->get_success_bulk += 1;
+	cache->get_success_objs += n;
+#endif
 
 	return 0;
 
@@ -1517,8 +1541,17 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
 		RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
 	} else {
-		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
-		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (likely(cache != NULL)) {
+			cache->get_success_bulk += 1;
+			cache->get_success_objs += n;
+		} else {
+#endif
+			RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		}
+#endif
 	}
 
 	return ret;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: Copy-pasted code should be updated
  2022-10-11 21:47       ` Honnappa Nagarahalli
@ 2022-10-30  8:44         ` Morten Brørup
  2022-10-30 22:50           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-30  8:44 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Yuying Zhang, Beilei Xing, Jingjing Wu,
	Qiming Yang, Qi Zhang
  Cc: Olivier Matz, dev, thomas, Andrew Rybchenko, Feifei Wang,
	Ruifeng Wang, nd, techboard, Kamalakshitha Aligeri, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Tuesday, 11 October 2022 23.48
> 
> <snip>
> 
> >
> > Dear Intel PMD maintainers (CC: techboard),
> >
> > I strongly recommend that you update the code you copy-pasted from
> the
> > mempool library to your PMDs, so they reflect the new and improved
> > mempool cache behavior [1]. When choosing to copy-paste code from a
> core
> > library, you should feel obliged to keep your copied code matching
> the source
> > code you copied it from!
> >
> > Also, as reported in bug #1052, you forgot to copy-paste the
> instrumentation,
> > thereby 1. making the mempool debug statistics invalid and 2.
> omitting the
> > mempool accesses from the trace when using your PMDs. :-(
> We are working on mempool APIs to expose the per core cache memory to
> PMD so that the buffers can be copied directly. We are planning to fix
> this duplication as part of that.

Is the copy-paste bug fix going to make it for 22.11?

Otherwise, these PMDs are managing the mempool cache differently than the mempool library does. (And the mempool library instrumentation will remain partially bypassed for these PMDs.) This should be mentioned as a known bug in the release notes.

> 
> >
> > Alternatively, just remove the copy-pasted code and use the mempool
> > library's API instead. ;-)
> >
> > The direct re-arm code also contains copy-pasted mempool cache
> handling
> > code - which was accepted with the argument that the same code was
> > already copy-pasted elsewhere. I don't know if the direct re-arm code
> also
> > needs updating... Authors of that patch (CC to this email), please
> coordinate
> > with the PMD maintainers.
> Direct-rearm patch is not accepted yet.
> 
> >
> > PS:  As noted in the 22.11-rc1 release notes, more changes to the
> mempool
> > library [2] may be coming.
> >
> > [1]:
> > https://patches.dpdk.org/project/dpdk/patch/20221007104450.2567961-1-
> > andrew.rybchenko@oktetlabs.ru/
> >
> > [2]: https://patches.dpdk.org/project/dpdk/list/?series=25063
> >
> > -Morten
> 


^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH v4 2/2] mempool: optimized debug statistics
  2022-10-28  6:41             ` [PATCH v4 2/2] mempool: optimized debug statistics Morten Brørup
@ 2022-10-30  9:09               ` Morten Brørup
  2022-10-30  9:16                 ` Thomas Monjalon
  0 siblings, 1 reply; 85+ messages in thread
From: Morten Brørup @ 2022-10-30  9:09 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko, Thomas Monjalon
  Cc: jerinj, bruce.richardson, dev

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Friday, 28 October 2022 08.42
> 
> When built with debug enabled (RTE_LIBRTE_MEMPOOL_DEBUG defined), the
> performance of mempools with caches is improved as follows.
> 
> Accessing objects in the mempool is likely to increment either the
> put_bulk and put_objs or the get_success_bulk and get_success_objs
> debug statistics counters.
> 
> By adding an alternative set of these counters to the mempool cache
> structure, accessing the dedicated debug statistics structure is
> avoided in
> the likely cases where these counters are incremented.
> 
> The trick here is that the cache line holding the mempool cache
> structure
> is accessed anyway, in order to update the "len" field. Updating some
> debug statistics counters in the same cache line has lower performance
> cost than accessing the debug statistics counters in the dedicated
> debug
> statistics structure, i.e. in another cache line.
> 
> Running mempool_perf_autotest on a VMware virtual server shows an avg.
> increase of 6.4 % in rate_persec for the tests with cache. (Only when
> built with debug enabled, obviously!)
> 
> For the tests without cache, the avg. increase in rate_persec is 0.8 %.
> I
> assume this is noise from the test environment.
> 
> v4:
> * Fix spelling and repeated word in commit message, caught by
> checkpatch.
> v3:
> * Try to fix git reference by making part of a series.
> * Add --in-reply-to v1 when sending email.
> v2:
> * Fix spelling and repeated word in commit message, caught by
> checkpatch.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.c |  7 +++++
>  lib/mempool/rte_mempool.h | 55 +++++++++++++++++++++++++++++++--------
>  2 files changed, 51 insertions(+), 11 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 21c94a2b9f..7b8c00a022 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -1285,6 +1285,13 @@ rte_mempool_dump(FILE *f, struct rte_mempool
> *mp)
>  		sum.get_fail_objs += mp->stats[lcore_id].get_fail_objs;
>  		sum.get_success_blks += mp-
> >stats[lcore_id].get_success_blks;
>  		sum.get_fail_blks += mp->stats[lcore_id].get_fail_blks;
> +		/* Add the fast access statistics, if local caches exist */
> +		if (mp->cache_size != 0) {
> +			sum.put_bulk += mp->local_cache[lcore_id].put_bulk;
> +			sum.put_objs += mp->local_cache[lcore_id].put_objs;
> +			sum.get_success_bulk += mp-
> >local_cache[lcore_id].get_success_bulk;
> +			sum.get_success_objs += mp-
> >local_cache[lcore_id].get_success_objs;
> +		}
>  	}
>  	fprintf(f, "  stats:\n");
>  	fprintf(f, "    put_bulk=%"PRIu64"\n", sum.put_bulk);
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 3725a72951..d84087bc92 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -86,6 +86,14 @@ struct rte_mempool_cache {
>  	uint32_t size;	      /**< Size of the cache */
>  	uint32_t flushthresh; /**< Threshold before we flush excess
> elements */
>  	uint32_t len;	      /**< Current cache count */
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +	uint32_t unused;
> +	/* Fast access statistics, only for likely events */
> +	uint64_t put_bulk;             /**< Number of puts. */
> +	uint64_t put_objs;             /**< Number of objects
> successfully put. */
> +	uint64_t get_success_bulk;     /**< Successful allocation number.
> */
> +	uint64_t get_success_objs;     /**< Objects successfully
> allocated. */
> +#endif
>  	/**
>  	 * Cache objects
>  	 *
> @@ -1327,13 +1335,19 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
>  {
>  	void **cache_objs;
> 
> +	/* No cache provided */
> +	if (unlikely(cache == NULL))
> +		goto driver_enqueue;
> +
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
>  	/* increment stat now, adding in mempool always success */
> -	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
> -	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
> +	cache->put_bulk += 1;
> +	cache->put_objs += n;
> +#endif
> 
> -	/* No cache provided or the request itself is too big for the
> cache */
> -	if (unlikely(cache == NULL || n > cache->flushthresh))
> -		goto driver_enqueue;
> +	/* The request is too big for the cache */
> +	if (unlikely(n > cache->flushthresh))
> +		goto driver_enqueue_stats_incremented;
> 
>  	/*
>  	 * The cache follows the following algorithm:
> @@ -1358,6 +1372,12 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
> 
>  driver_enqueue:
> 
> +	/* increment stat now, adding in mempool always success */
> +	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
> +	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
> +
> +driver_enqueue_stats_incremented:
> +
>  	/* push objects to the backend */
>  	rte_mempool_ops_enqueue_bulk(mp, obj_table, n);
>  }
> @@ -1464,8 +1484,10 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  	if (remaining == 0) {
>  		/* The entire request is satisfied from the cache. */
> 
> -		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> -		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +		cache->get_success_bulk += 1;
> +		cache->get_success_objs += n;
> +#endif
> 
>  		return 0;
>  	}
> @@ -1494,8 +1516,10 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
> 
>  	cache->len = cache->size;
> 
> -	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> -	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +	cache->get_success_bulk += 1;
> +	cache->get_success_objs += n;
> +#endif
> 
>  	return 0;
> 
> @@ -1517,8 +1541,17 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>  		RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
>  		RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
>  	} else {
> -		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> -		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +		if (likely(cache != NULL)) {
> +			cache->get_success_bulk += 1;
> +			cache->get_success_objs += n;
> +		} else {
> +#endif
> +			RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +			RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +		}
> +#endif
>  	}
> 
>  	return ret;
> --
> 2.17.1

I am retracting this second part of the patch series, and reopening the original patch instead. This second part is probably not going to make it to 22.11 anyway.

Instead, I am going to provide another patch series (after 22.11) to split the current RTE_LIBRTE_MEMPOOL_DEBUG define in two: RTE_LIBRTE_MEMPOOL_STATS for statistics, and RTE_LIBRTE_MEMPOOL_DEBUG for debugging. And then this patch can be added to the RTE_LIBRTE_MEMPOOL_STATS.

-Morten


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4 2/2] mempool: optimized debug statistics
  2022-10-30  9:09               ` Morten Brørup
@ 2022-10-30  9:16                 ` Thomas Monjalon
  0 siblings, 0 replies; 85+ messages in thread
From: Thomas Monjalon @ 2022-10-30  9:16 UTC (permalink / raw)
  To: Morten Brørup
  Cc: olivier.matz, andrew.rybchenko, dev, jerinj, bruce.richardson, dev

30/10/2022 10:09, Morten Brørup:
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Friday, 28 October 2022 08.42
> > 
> > When built with debug enabled (RTE_LIBRTE_MEMPOOL_DEBUG defined), the
> > performance of mempools with caches is improved as follows.
> > 
> > Accessing objects in the mempool is likely to increment either the
> > put_bulk and put_objs or the get_success_bulk and get_success_objs
> > debug statistics counters.
> > 
> > By adding an alternative set of these counters to the mempool cache
> > structure, accessing the dedicated debug statistics structure is
> > avoided in
> > the likely cases where these counters are incremented.
> > 
> > The trick here is that the cache line holding the mempool cache
> > structure
> > is accessed anyway, in order to update the "len" field. Updating some
> > debug statistics counters in the same cache line has lower performance
> > cost than accessing the debug statistics counters in the dedicated
> > debug
> > statistics structure, i.e. in another cache line.
> > 
> > Running mempool_perf_autotest on a VMware virtual server shows an avg.
> > increase of 6.4 % in rate_persec for the tests with cache. (Only when
> > built with debug enabled, obviously!)
> > 
> > For the tests without cache, the avg. increase in rate_persec is 0.8 %.
> > I
> > assume this is noise from the test environment.
> > 
> > v4:
> > * Fix spelling and repeated word in commit message, caught by
> > checkpatch.
> > v3:
> > * Try to fix git reference by making part of a series.
> > * Add --in-reply-to v1 when sending email.
> > v2:
> > * Fix spelling and repeated word in commit message, caught by
> > checkpatch.
> > 
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> 
> I am retracting this second part of the patch series, and reopening the original patch instead. This second part is probably not going to make it to 22.11 anyway.

Indeed, I have decided to take patch 1 only, which is reviewed.

> Instead, I am going to provide another patch series (after 22.11) to split the current RTE_LIBRTE_MEMPOOL_DEBUG define in two: RTE_LIBRTE_MEMPOOL_STATS for statistics, and RTE_LIBRTE_MEMPOOL_DEBUG for debugging. And then this patch can be added to the RTE_LIBRTE_MEMPOOL_STATS.




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH v4 1/2] mempool: cache align mempool cache objects
  2022-10-28  6:41           ` [PATCH v4 1/2] mempool: cache align mempool cache objects Morten Brørup
  2022-10-28  6:41             ` [PATCH v4 2/2] mempool: optimized debug statistics Morten Brørup
@ 2022-10-30  9:17             ` Thomas Monjalon
  1 sibling, 0 replies; 85+ messages in thread
From: Thomas Monjalon @ 2022-10-30  9:17 UTC (permalink / raw)
  To: Morten Brørup
  Cc: olivier.matz, andrew.rybchenko, dev, jerinj, bruce.richardson, dev

28/10/2022 08:41, Morten Brørup:
> Add __rte_cache_aligned to the objs array.
> 
> It makes no difference in the general case, but if get/put operations are
> always 32 objects, it will reduce the number of memory (or last level
> cache) accesses from five to four 64 B cache lines for every get/put
> operation.
> 
> For readability reasons, an example using 16 objects follows:
> 
> Currently, with 16 objects (128B), we access 3
> cache lines:
> 
>       ┌────────┐
>       │len     │
> cache │********│---
> line0 │********│ ^
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line1 │********│ |
>       │********│ |
>       ├────────┤ |
>       │********│_v_
> cache │        │
> line2 │        │
>       │        │
>       └────────┘
> 
> With the alignment, it is also 3 cache lines:
> 
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤---
>       │********│ ^
> cache │********│ |
> line1 │********│ |
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line2 │********│ |
>       │********│ v
>       └────────┘---
> 
> However, accessing the objects at the bottom of the mempool cache is a
> special case, where cache line0 is also used for objects.
> 
> Consider the next burst (and any following bursts):
> 
> Current:
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line1 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │********│---
> line2 │********│ ^
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line3 │********│ |
>       │********│ |
>       ├────────┤ |
>       │********│_v_
> cache │        │
> line4 │        │
>       │        │
>       └────────┘
> 4 cache lines touched, incl. line0 for len.
> 
> With the proposed alignment:
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line1 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line2 │        │
>       │        │
>       ├────────┤
>       │********│---
> cache │********│ ^
> line3 │********│ |
>       │********│ | 16 objects
>       ├────────┤ | 128B
>       │********│ |
> cache │********│ |
> line4 │********│ |
>       │********│_v_
>       └────────┘
> Only 3 cache lines touched, incl. line0 for len.
> 
> Credits go to Olivier Matz for the nice ASCII graphics.
> 
> v4:
> * No changes. Added reviewed- and acked-by tags.
> v3:
> * No changes. Made part of a series.
> v2:
> * No such version.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Acked-by: Olivier Matz <olivier.matz@6wind.com>

Applied only this first patch, thanks.
The second patch needs more time.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: Copy-pasted code should be updated
  2022-10-30  8:44         ` Morten Brørup
@ 2022-10-30 22:50           ` Honnappa Nagarahalli
  0 siblings, 0 replies; 85+ messages in thread
From: Honnappa Nagarahalli @ 2022-10-30 22:50 UTC (permalink / raw)
  To: Morten Brørup, Yuying Zhang, Beilei Xing, Jingjing Wu,
	Qiming Yang, Qi Zhang
  Cc: Olivier Matz, dev, thomas, Andrew Rybchenko, Feifei Wang,
	Ruifeng Wang, nd, techboard, Kamalakshitha Aligeri, nd

<snip>
> >
> > >
> > > Dear Intel PMD maintainers (CC: techboard),
> > >
> > > I strongly recommend that you update the code you copy-pasted from
> > the
> > > mempool library to your PMDs, so they reflect the new and improved
> > > mempool cache behavior [1]. When choosing to copy-paste code from a
> > core
> > > library, you should feel obliged to keep your copied code matching
> > the source
> > > code you copied it from!
> > >
> > > Also, as reported in bug #1052, you forgot to copy-paste the
> > instrumentation,
> > > thereby 1. making the mempool debug statistics invalid and 2.
> > omitting the
> > > mempool accesses from the trace when using your PMDs. :-(
> > We are working on mempool APIs to expose the per core cache memory to
> > PMD so that the buffers can be copied directly. We are planning to fix
> > this duplication as part of that.
> 
> Is the copy-paste bug fix going to make it for 22.11?
It will not make it to 22.11. It is targeted for 23.02.

> 
> Otherwise, these PMDs are managing the mempool cache differently than
> the mempool library does. (And the mempool library instrumentation will
> remain partially bypassed for these PMDs.) This should be mentioned as a
> known bug in the release notes.
Agree

> 
> >
> > >
> > > Alternatively, just remove the copy-pasted code and use the mempool
> > > library's API instead. ;-)
> > >
> > > The direct re-arm code also contains copy-pasted mempool cache
> > handling
> > > code - which was accepted with the argument that the same code was
> > > already copy-pasted elsewhere. I don't know if the direct re-arm
> > > code
> > also
> > > needs updating... Authors of that patch (CC to this email), please
> > coordinate
> > > with the PMD maintainers.
> > Direct-rearm patch is not accepted yet.
> >
> > >
> > > PS:  As noted in the 22.11-rc1 release notes, more changes to the
> > mempool
> > > library [2] may be coming.
> > >
> > > [1]:
> > >
> https://patches.dpdk.org/project/dpdk/patch/20221007104450.2567961-1
> > > -
> > > andrew.rybchenko@oktetlabs.ru/
> > >
> > > [2]: https://patches.dpdk.org/project/dpdk/list/?series=25063
> > >
> > > -Morten
> >


^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2022-10-30 22:51 UTC | newest]

Thread overview: 85+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
2022-01-06 16:55   ` Jerin Jacob
2022-01-07  8:46     ` Morten Brørup
2022-01-10  7:26       ` Jerin Jacob
2022-01-10 10:55         ` Morten Brørup
2022-01-14 16:36 ` [PATCH] mempool: fix get objects from mempool with cache Morten Brørup
2022-01-17 17:35   ` Bruce Richardson
2022-01-18  8:25     ` Morten Brørup
2022-01-18  9:07       ` Bruce Richardson
2022-01-24 15:38   ` Olivier Matz
2022-01-24 16:11     ` Olivier Matz
2022-01-28 10:22     ` Morten Brørup
2022-01-17 11:52 ` [PATCH] mempool: optimize put objects to " Morten Brørup
2022-01-19 14:52 ` [PATCH v2] mempool: fix " Morten Brørup
2022-01-19 15:03 ` [PATCH v3] " Morten Brørup
2022-01-24 15:39   ` Olivier Matz
2022-01-28  9:37     ` Morten Brørup
2022-02-02  8:14 ` [PATCH v2] mempool: fix get objects from " Morten Brørup
2022-06-15 21:18   ` Morten Brørup
2022-09-29 10:52     ` Morten Brørup
2022-10-04 12:57   ` Andrew Rybchenko
2022-10-04 15:13     ` Morten Brørup
2022-10-04 15:58       ` Andrew Rybchenko
2022-10-04 18:09         ` Morten Brørup
2022-10-06 13:43       ` Aaron Conole
2022-10-04 16:03   ` Morten Brørup
2022-10-04 16:36   ` Morten Brørup
2022-10-04 16:39   ` Morten Brørup
2022-02-02 10:33 ` [PATCH v4] mempool: fix mempool cache flushing algorithm Morten Brørup
2022-04-07  9:04   ` Morten Brørup
2022-04-07  9:14     ` Bruce Richardson
2022-04-07  9:26       ` Morten Brørup
2022-04-07 10:32         ` Bruce Richardson
2022-04-07 10:43           ` Bruce Richardson
2022-04-07 11:36             ` Morten Brørup
2022-10-04 20:01   ` Morten Brørup
2022-10-09 11:11   ` [PATCH 1/2] mempool: check driver enqueue result in one place Andrew Rybchenko
2022-10-09 11:11     ` [PATCH 2/2] mempool: avoid usage of term ring on put Andrew Rybchenko
2022-10-09 13:08       ` Morten Brørup
2022-10-09 13:14         ` Andrew Rybchenko
2022-10-09 13:01     ` [PATCH 1/2] mempool: check driver enqueue result in one place Morten Brørup
2022-10-09 13:19   ` [PATCH v4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
2022-10-04 12:53 ` [PATCH v3] mempool: fix get objects from mempool with cache Andrew Rybchenko
2022-10-04 14:42   ` Morten Brørup
2022-10-07 10:44 ` [PATCH v4] " Andrew Rybchenko
2022-10-08 20:56   ` Thomas Monjalon
2022-10-11 20:30     ` Copy-pasted code should be updated Morten Brørup
2022-10-11 21:47       ` Honnappa Nagarahalli
2022-10-30  8:44         ` Morten Brørup
2022-10-30 22:50           ` Honnappa Nagarahalli
2022-10-14 14:01     ` [PATCH v4] mempool: fix get objects from mempool with cache Olivier Matz
2022-10-09 13:37 ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Andrew Rybchenko
2022-10-09 13:37   ` [PATCH v6 1/4] mempool: check driver enqueue result in one place Andrew Rybchenko
2022-10-09 13:37   ` [PATCH v6 2/4] mempool: avoid usage of term ring on put Andrew Rybchenko
2022-10-09 13:37   ` [PATCH v6 3/4] mempool: fix cache flushing algorithm Andrew Rybchenko
2022-10-09 14:31     ` Morten Brørup
2022-10-09 14:51       ` Andrew Rybchenko
2022-10-09 15:08         ` Morten Brørup
2022-10-14 14:01           ` Olivier Matz
2022-10-14 15:57             ` Morten Brørup
2022-10-14 19:50               ` Olivier Matz
2022-10-15  6:57                 ` Morten Brørup
2022-10-18 16:32                   ` Jerin Jacob
2022-10-09 13:37   ` [PATCH v6 4/4] mempool: flush cache completely on overflow Andrew Rybchenko
2022-10-09 14:44     ` Morten Brørup
2022-10-14 14:01       ` Olivier Matz
2022-10-10 15:21   ` [PATCH v6 0/4] mempool: fix mempool cache flushing algorithm Thomas Monjalon
2022-10-11 19:26     ` Morten Brørup
2022-10-26 14:09     ` Thomas Monjalon
2022-10-26 14:26       ` Morten Brørup
2022-10-26 14:44         ` [PATCH] mempool: cache align mempool cache objects Morten Brørup
2022-10-26 19:44           ` Andrew Rybchenko
2022-10-27  8:34           ` Olivier Matz
2022-10-27  9:22             ` Morten Brørup
2022-10-27 11:42               ` Olivier Matz
2022-10-27 12:11                 ` Morten Brørup
2022-10-27 15:20                   ` Olivier Matz
2022-10-28  6:35           ` [PATCH v3 1/2] " Morten Brørup
2022-10-28  6:35             ` [PATCH v3 2/2] mempool: optimized debug statistics Morten Brørup
2022-10-28  6:41           ` [PATCH v4 1/2] mempool: cache align mempool cache objects Morten Brørup
2022-10-28  6:41             ` [PATCH v4 2/2] mempool: optimized debug statistics Morten Brørup
2022-10-30  9:09               ` Morten Brørup
2022-10-30  9:16                 ` Thomas Monjalon
2022-10-30  9:17             ` [PATCH v4 1/2] mempool: cache align mempool cache objects Thomas Monjalon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).