DPDK patches and discussions
* [RFC] mempool: rte_mempool_do_generic_get optimizations
@ 2021-12-26 15:34 Morten Brørup
  2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Morten Brørup @ 2021-12-26 15:34 UTC (permalink / raw)
  To: Olivier Matz, Andrew Rybchenko, dev

While going through the mempool code for potential optimizations, I found two details in rte_mempool_do_generic_get(), which are easily improved.

Any comments or alternative suggestions?


1. The objects are returned in reverse order. This is silly, and should be optimized.

rte_mempool_do_generic_get() line 1493:

	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	rte_memcpy(obj_table, &cache_objs[cache->len - n], sizeof(void *) * n);
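
For illustration, a minimal standalone sketch of the difference (plain C, with standard memcpy() standing in for rte_memcpy() and small integers standing in for object pointers); both variants take the same n objects from the top of the cache stack, only the order inside the returned burst changes:

#include <stdio.h>
#include <string.h>

int main(void)
{
        void *cache_objs[8];
        void *reversed[4], *forward[4], **obj_table;
        unsigned int cache_len = 8, n = 4, index, len;

        for (index = 0; index < 8; index++)
                cache_objs[index] = (void *)(size_t)index;      /* 0..7 */

        /* Current behaviour: copy backwards from the top of the stack. */
        obj_table = reversed;
        for (index = 0, len = cache_len - 1; index < n; ++index, len--, obj_table++)
                *obj_table = cache_objs[len];

        /* Proposed behaviour: one forward copy of the same four objects. */
        memcpy(forward, &cache_objs[cache_len - n], sizeof(void *) * n);

        for (index = 0; index < n; index++)
                printf("%zu %zu\n", (size_t)reversed[index], (size_t)forward[index]);
        return 0;       /* prints: 7 4, 6 5, 5 6, 4 7 */
}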


2. The initial screening in rte_mempool_do_generic_get() differs from the initial screening in rte_mempool_do_generic_put().

For reference, rte_mempool_do_generic_put() line 1343:

	/* No cache provided or if put would overflow mem allocated for cache */
	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
		goto ring_enqueue;

Notice how this uses RTE_MEMPOOL_CACHE_MAX_SIZE to determine the maximum burst size into the cache.

Now, rte_mempool_do_generic_get() line 1466:

	/* No cache provided or cannot be satisfied from cache */
	if (unlikely(cache == NULL || n >= cache->size))
		goto ring_dequeue;

	cache_objs = cache->objs;

	/* Can this be satisfied from the cache? */
	if (cache->len < n) {
		/* No. Backfill the cache first, and then fill from it */
		uint32_t req = n + (cache->size - cache->len);

First of all, the cache might already hold up to cache->flushthresh - 1 objects, which is roughly 50% more than cache->size, so screening for n >= cache->size rejects requests that could have been served from the cache!

Second of all, the next step is to check if the cache holds sufficient objects. So the initial screening should only do initial screening. Therefore, I propose changing the initial screening to also use RTE_MEMPOOL_CACHE_MAX_SIZE to determine the maximum burst size from the cache, like in rte_mempool_do_generic_put().

rte_mempool_do_generic_get() line 1466:

-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
		goto ring_dequeue;
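
To put numbers on it, a small sketch (assuming a cache size of 256, the usual flush threshold of 1.5 x the cache size, and RTE_MEMPOOL_CACHE_MAX_SIZE at its usual value of 512):

#include <stdio.h>

int main(void)
{
        unsigned int cache_size = 256;
        unsigned int flushthresh = cache_size + cache_size / 2; /* 384 */
        unsigned int cache_len = 300;   /* legal; may grow up to flushthresh - 1 */
        unsigned int n = 300;           /* burst entirely present in the cache */

        if (n >= cache_size)
                printf("current screening: sent to the ring, cache ignored\n");
        if (n <= 512 /* RTE_MEMPOOL_CACHE_MAX_SIZE */ && n <= cache_len)
                printf("proposed screening: can be served from the cache\n");
        (void)flushthresh;
        return 0;
}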


Med venlig hilsen / Kind regards,
-Morten Brørup



* [PATCH] mempool: optimize incomplete cache handling
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
@ 2022-01-06 12:23 ` Morten Brørup
  2022-01-06 16:55   ` Jerin Jacob
  2022-01-14 16:36 ` [PATCH] mempool: fix get objects from mempool with cache Morten Brørup
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Morten Brørup @ 2022-01-06 12:23 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko; +Cc: dev, Morten Brørup

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then.

The incompleteness did not cause any functional bugs, so this patch
could be considered refactoring for the purpose of cleaning up.

This patch completes the update of rte_mempool_do_generic_get() as
follows:

1. A few comments were misplaced or no longer correct.
Some comments have been updated/added/corrected.

2. The code that initially screens the cache request was not updated.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

3. The code flow for satisfying the request from the cache was weird.
The likely code path where the objects are simply served from the cache
was treated as unlikely; now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were
added and subtracted from the cache length; now this code path simply
sets the cache length to its final value.

4. The objects were returned in reverse order.
Returning the objects in reverse order is not necessary, so rte_memcpy()
is now used instead.

This patch also updates/adds/corrects some comments in
rte_mempool_do_generic_put().

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..4c36ad6dd1 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1353,12 +1353,12 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	 *   cache flush threshold) is flushed to the ring.
 	 */
 
-	/* Add elements back into the cache */
+	/* Add the objects to the cache */
 	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
-
 	cache->len += n;
 
 	if (cache->len >= cache->flushthresh) {
+		/* Flush excess objects in the cache to the ring */
 		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
 				cache->len - cache->size);
 		cache->len = cache->size;
@@ -1368,7 +1368,7 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
@@ -1460,21 +1460,25 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
-	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
 	cache_objs = cache->objs;
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
+	/* Can the request be satisfied from the cache? */
+	if (n <= cache->len) {
+		/* Yes. Simply decrease the cache length */
+		cache->len -= n;
+	} else {
+		/* No. Backfill the cache from the ring first */
+
+		/* Number required to fill the cache + n */
 		uint32_t req = n + (cache->size - cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill the cache from the ring */
 		ret = rte_mempool_ops_dequeue_bulk(mp,
 			&cache->objs[cache->len], req);
 		if (unlikely(ret < 0)) {
@@ -1487,14 +1491,12 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			goto ring_dequeue;
 		}
 
-		cache->len += req;
+		/* Set the length of the backfilled cache - n */
+		cache->len = cache->size;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
-
-	cache->len -= n;
+	/* Get the objects from the cache, at the already decreased offset */
+	rte_memcpy(obj_table, &cache_objs[cache->len], sizeof(void *) * n);
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1505,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {
-- 
2.17.1



* Re: [PATCH] mempool: optimize incomplete cache handling
  2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
@ 2022-01-06 16:55   ` Jerin Jacob
  2022-01-07  8:46     ` Morten Brørup
  0 siblings, 1 reply; 13+ messages in thread
From: Jerin Jacob @ 2022-01-06 16:55 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev

On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then.
>
> The incompleteness did not cause any functional bugs, so this patch
> could be considered refactoring for the purpose of cleaning up.
>
> This patch completes the update of rte_mempool_do_generic_get() as
> follows:
>
> 1. A few comments were misplaced or no longer correct.
> Some comments have been updated/added/corrected.
>
> 2. The code that initially screens the cache request was not updated.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
>
> 3. The code flow for satisfying the request from the cache was weird.
> The likely code path where the objects are simply served from the cache
> was treated as unlikely; now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.
>
> 4. The objects were returned in reverse order.
> Returning the objects in reverse order is not necessary, so rte_memcpy()
> is now used instead.

Have you checked the performance with network workload?
IMO, reverse order makes sense (LIFO vs FIFO).
The LIFO makes the cache warm as the same buffers are reused frequently.


* RE: [PATCH] mempool: optimize incomplete cache handling
  2022-01-06 16:55   ` Jerin Jacob
@ 2022-01-07  8:46     ` Morten Brørup
  2022-01-10  7:26       ` Jerin Jacob
  0 siblings, 1 reply; 13+ messages in thread
From: Morten Brørup @ 2022-01-07  8:46 UTC (permalink / raw)
  To: Jerin Jacob; +Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Thursday, 6 January 2022 17.55
> 
> On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > A flush threshold for the mempool cache was introduced in DPDK
> version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then.
> >
> > The incompleteness did not cause any functional bugs, so this patch
> > could be considered refactoring for the purpose of cleaning up.
> >
> > This patch completes the update of rte_mempool_do_generic_get() as
> > follows:
> >
> > 1. A few comments were misplaced or no longer correct.
> > Some comments have been updated/added/corrected.
> >
> > 2. The code that initially screens the cache request was not updated.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction
> of
> > the flush threshold. E.g. the cache can hold up to flushthresh
> objects,
> > which is more than its size, so some requests were not served from
> the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a
> cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> >
> > 3. The code flow for satisfying the request from the cache was weird.
> > The likely code path where the objects are simply served from the
> cache
> > was treated as unlikely; now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers
> were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 4. The objects were returned in reverse order.
> > Returning the objects in reverse order is not necessary, so
> rte_memcpy()
> > is now used instead.
> 
> Have you checked the performance with network workload?
> IMO, reverse order makes sense (LIFO vs FIFO).
> The LIFO makes the cache warm as the same buffers are reused
> frequently.

I have not done any performance testing. We probably agree that the only major difference lies in how the objects are returned. And we probably also agree that rte_memcpy() is faster than the copy loop it replaced, especially when n is constant at compile time. So the performance difference mainly depends on the application, which I will discuss below.

Let's first consider LIFO vs. FIFO.

The key argument for the rte_memcpy() optimization is that we are still getting the burst of objects from the top of the stack (LIFO); only the order of the objects inside the burst is not reverse anymore.

Here is an example:

The cache initially contains 8 objects: 01234567.

8 more objects are put into the cache: 89ABCDEF.

The cache now holds: 0123456789ABCDEF.

Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e. we are still getting the 4 objects most recently put into the cache.

Furthermore, if the application is working with fixed size bursts, it will usually put and get the same size burst, i.e. put the burst 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache again.


Here is an example unfavorable scenario:

The cache initially contains 4 objects, which have gone cold: 0123.

4 more objects, which happen to be hot, are put into the cache: 4567.

Getting 8 objects from the cache gives us 01234567 instead of 76543210.

Now, if the application only processes the first 4 of the 8 objects in the burst, it would have benefitted from those objects being the hot 7654 objects instead of the cold 0123 objects.

However, I think that most applications process the complete burst, so I do consider this scenario unlikely.

Similarly, a pipelined application doesn't process objects in reverse order at every other step in the pipeline, even though the previous step in the pipeline most recently touched the last object of the burst.


My overall conclusion was that the benefit of using rte_memcpy() outweighs the disadvantage of the unfavorable scenario, because I consider the probability of the unfavorable scenario occurring very low. But again: it mainly depends on the application.

If anyone disagrees with the risk analysis described above, I will happily provide a version 2 of the patch, where the objects are still returned in reverse order. After all, the rte_memcpy() benefit is relatively small compared to the impact if the unlikely scenario occurs.
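
The scenarios above can be reproduced with a small standalone simulation (a sketch, not DPDK code; characters stand in for object pointers, and the get uses the proposed forward copy):

#include <stdio.h>
#include <string.h>

static char cache[32];                  /* toy model of the cache stack */
static unsigned int cache_len;

static void cache_put(const char *objs, unsigned int n)
{
        memcpy(&cache[cache_len], objs, n);
        cache_len += n;
}

static void cache_get(char *out, unsigned int n)
{
        cache_len -= n;                 /* still taken from the top of the stack */
        memcpy(out, &cache[cache_len], n);
        out[n] = '\0';
}

int main(void)
{
        char burst[16 + 1];

        /* Favorable case: fixed size bursts. */
        cache_len = 0;
        cache_put("01234567", 8);
        cache_put("89ABCDEF", 8);
        cache_get(burst, 4);
        printf("%s\n", burst);  /* CDEF - the most recently put objects */

        /* Unfavorable case: cold 0123, hot 4567, then a get of 8. */
        cache_len = 0;
        cache_put("0123", 4);
        cache_put("4567", 4);
        cache_get(burst, 8);
        printf("%s\n", burst);  /* 01234567 - the cold objects lead the burst */
        return 0;
}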



* Re: [PATCH] mempool: optimize incomplete cache handling
  2022-01-07  8:46     ` Morten Brørup
@ 2022-01-10  7:26       ` Jerin Jacob
  2022-01-10 10:55         ` Morten Brørup
  0 siblings, 1 reply; 13+ messages in thread
From: Jerin Jacob @ 2022-01-10  7:26 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev

On Fri, Jan 7, 2022 at 2:16 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > Sent: Thursday, 6 January 2022 17.55
> >
> > On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <mb@smartsharesystems.com>
> > wrote:
> > >
> > > A flush threshold for the mempool cache was introduced in DPDK
> > version
> > > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > > then.
> > >
> > > The incompleteness did not cause any functional bugs, so this patch
> > > could be considered refactoring for the purpose of cleaning up.
> > >
> > > This patch completes the update of rte_mempool_do_generic_get() as
> > > follows:
> > >
> > > 1. A few comments were misplaced or no longer correct.
> > > Some comments have been updated/added/corrected.
> > >
> > > 2. The code that initially screens the cache request was not updated.
> > > The initial screening compared the request length to the cache size,
> > > which was correct before, but became irrelevant with the introduction
> > of
> > > the flush threshold. E.g. the cache can hold up to flushthresh
> > objects,
> > > which is more than its size, so some requests were not served from
> > the
> > > cache, even though they could be.
> > > The initial screening has now been corrected to match the initial
> > > screening in rte_mempool_do_generic_put(), which verifies that a
> > cache
> > > is present, and that the length of the request does not overflow the
> > > memory allocated for the cache.
> > >
> > > 3. The code flow for satisfying the request from the cache was weird.
> > > The likely code path where the objects are simply served from the
> > cache
> > > was treated as unlikely; now it is treated as likely.
> > > And in the code path where the cache was backfilled first, numbers
> > were
> > > added and subtracted from the cache length; now this code path simply
> > > sets the cache length to its final value.
> > >
> > > 4. The objects were returned in reverse order.
> > > Returning the objects in reverse order is not necessary, so
> > rte_memcpy()
> > > is now used instead.
> >
> > Have you checked the performance with network workload?
> > IMO, reverse order makes sense (LIFO vs FIFO).
> > The LIFO makes the cache warm as the same buffers are reused
> > frequently.
>
> I have not done any performance testing. We probably agree that the only major difference lies in how the objects are returned. And we probably also agree that rte_memcpy() is faster than the copy loop it replaced, especially when n is constant at compile time. So the performance difference mainly depends on the application, which I will discuss below.
>
> Let's first consider LIFO vs. FIFO.
>
> The key argument for the rte_memcpy() optimization is that we are still getting the burst of objects from the top of the stack (LIFO); only the order of the objects inside the burst is not reverse anymore.
>
> Here is an example:
>
> The cache initially contains 8 objects: 01234567.
>
> 8 more objects are put into the cache: 89ABCDEF.
>
> The cache now holds: 0123456789ABCDEF.

Agree. However, I think it may matter on machines with a smaller L1 cache
and with larger burst sizes, where the scheme determines what can stay in
the L1 cache.

I would suggest splitting each performance improvement into a separate
patch for better tracking and quantification of the performance improvement.

I think the mempool performance test and the tx-only stream mode in testpmd
can quantify the patches.



>
> Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e. we are still getting the 4 objects most recently put into the cache.
>
> Furthermore, if the application is working with fixed size bursts, it will usually put and get the same size burst, i.e. put the burst 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache again.
>
>
> Here is an example unfavorable scenario:
>
> The cache initially contains 4 objects, which have gone cold: 0123.
>
> 4 more objects, which happen to be hot, are put into the cache: 4567.
>
> Getting 8 objects from the cache gives us 01234567 instead of 76543210.
>
> Now, if the application only processes the first 4 of the 8 objects in the burst, it would have benefitted from those objects being the hot 7654 objects instead of the cold 0123 objects.
>
> However, I think that most applications process the complete burst, so I do consider this scenario unlikely.
>
> Similarly, a pipelined application doesn't process objects in reverse order at every other step in the pipeline, even though the previous step in the pipeline most recently touched the last object of the burst.
>
>
> My overall conclusion was that the benefit of using rte_memcpy() outweighs the disadvantage of the unfavorable scenario, because I consider the probability of the unfavorable scenario occurring very low. But again: it mainly depends on the application.
>
> If anyone disagrees with the risk analysis described above, I will happily provide a version 2 of the patch, where the objects are still returned in reverse order. After all, the rte_memcpy() benefit is relatively small compared to the impact if the unlikely scenario occurs.
>


* RE: [PATCH] mempool: optimize incomplete cache handling
  2022-01-10  7:26       ` Jerin Jacob
@ 2022-01-10 10:55         ` Morten Brørup
  0 siblings, 0 replies; 13+ messages in thread
From: Morten Brørup @ 2022-01-10 10:55 UTC (permalink / raw)
  To: Jerin Jacob, Bruce Richardson; +Cc: Olivier Matz, Andrew Rybchenko, dpdk-dev

+Bruce; you seemed interested in my work in this area.

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Monday, 10 January 2022 08.27
> 
> On Fri, Jan 7, 2022 at 2:16 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > Sent: Thursday, 6 January 2022 17.55
> > >
> > > On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup
> <mb@smartsharesystems.com>
> > > wrote:
> > > >
> > > > A flush threshold for the mempool cache was introduced in DPDK
> > > version
> > > > 1.3, but rte_mempool_do_generic_get() was not completely updated
> back
> > > > then.
> > > >
> > > > The incompleteness did not cause any functional bugs, so this
> patch
> > > > could be considered refactoring for the purpose of cleaning up.
> > > >
> > > > This patch completes the update of rte_mempool_do_generic_get()
> as
> > > > follows:
> > > >
> > > > 1. A few comments were misplaced or no longer correct.
> > > > Some comments have been updated/added/corrected.
> > > >
> > > > 2. The code that initially screens the cache request was not
> updated.
> > > > The initial screening compared the request length to the cache
> size,
> > > > which was correct before, but became irrelevant with the
> introduction
> > > of
> > > > the flush threshold. E.g. the cache can hold up to flushthresh
> > > objects,
> > > > which is more than its size, so some requests were not served
> from
> > > the
> > > > cache, even though they could be.
> > > > The initial screening has now been corrected to match the initial
> > > > screening in rte_mempool_do_generic_put(), which verifies that a
> > > cache
> > > > is present, and that the length of the request does not overflow
> the
> > > > memory allocated for the cache.
> > > >
> > > > 3. The code flow for satisfying the request from the cache was
> weird.
> > > > The likely code path where the objects are simply served from the
> > > cache
> > > > was treated as unlikely; now it is treated as likely.
> > > > And in the code path where the cache was backfilled first,
> numbers
> > > were
> > > > added and subtracted from the cache length; now this code path
> simply
> > > > sets the cache length to its final value.
> > > >
> > > > 4. The objects were returned in reverse order.
> > > > Returning the objects in reverse order is not necessary, so
> > > rte_memcpy()
> > > > is now used instead.
> > >
> > > Have you checked the performance with network workload?
> > > IMO, reverse order makes sense (LIFO vs FIFO).
> > > The LIFO makes the cache warm as the same buffers are reused
> > > frequently.
> >
> > I have not done any performance testing. We probably agree that the
> only major difference lies in how the objects are returned. And we
> probably also agree that rte_memcpy() is faster than the copy loop it
> replaced, especially when n is constant at compile time. So the
> performance difference mainly depends on the application, which I will
> discuss below.
> >
> > Let's first consider LIFO vs. FIFO.
> >
> > The key argument for the rte_memcpy() optimization is that we are
> still getting the burst of objects from the top of the stack (LIFO);
> only the order of the objects inside the burst is not reverse anymore.
> >
> > Here is an example:
> >
> > The cache initially contains 8 objects: 01234567.
> >
> > 8 more objects are put into the cache: 89ABCDEF.
> >
> > The cache now holds: 0123456789ABCDEF.
> 
> Agree. However, I think it may matter on machines with a smaller L1 cache
> and with larger burst sizes, where the scheme determines what can stay in
> the L1 cache.

Good point! Thinking further about it made me realize that the mempool cache flushing algorithm is fundamentally flawed, at least in some cases...


rte_mempool_do_generic_put():

When putting objects into the cache, and the cache length exceeds the flush threshold, the most recent (hot) objects are flushed to the ring, thus leaving the less recent (colder) objects at the top of the cache stack.

Example (cache size: 8, flush threshold: 12, put 8 objects):

Initial cache: 01234567

Cache after putting (hot) objects 89ABCDEF: 0123456789ABCDEF

Cache flush threshold reached. Resulting cache: 01234567

Furthermore, the cache has to be completely depleted before the hot objects that were flushed to the ring are retrieved from the ring again.


rte_mempool_do_generic_get():

When getting objects from the cache, and the cache does not hold the requested number of objects, the cache will first be backfilled from the ring, thus putting colder objects at the top of the cache stack, and then the objects will be returned from the top of the cache stack, i.e. the backfilled (cold) objects will be returned first.

Example (cache size: 8, get 8 objects):

Initial cache: 0123 (hot or lukewarm)

Cache after backfill to size + requested objects: 0123456789ABCDEF

Returned objects: FEDCBA98 (cold)

Cache after returning objects: 01234567 (i.e. cold objects at the top)
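
Both behaviours can be reproduced with a short standalone simulation of the algorithm as described above (a sketch, not DPDK code; characters stand in for objects, cache size 8, flush threshold 12):

#include <stdio.h>
#include <string.h>

static char cache[32];
static unsigned int cache_len;

int main(void)
{
        const unsigned int size = 8;
        unsigned int i;

        /* put: 01234567 is cached, then 89ABCDEF is put and the flush
         * threshold is crossed; everything above 'size' - the newest (hot)
         * objects - is flushed, and the cold objects remain cached. */
        memcpy(cache, "0123456789ABCDEF", 16);
        cache_len = 16;
        printf("flushed to ring: %.*s\n", (int)(cache_len - size), &cache[size]);
        cache_len = size;
        printf("cache after put: %.*s\n", (int)cache_len, cache);

        /* get: the cache holds the cold 0123 and 8 objects are requested;
         * 12 objects are dequeued from the ring on top of the cold ones,
         * and the request is served from the top of the stack in reverse. */
        memcpy(cache, "0123456789ABCDEF", 16);  /* 0123 + 12 backfilled */
        cache_len = 16;
        printf("returned: ");
        for (i = 0; i < 8; i++)
                putchar(cache[--cache_len]);
        printf("\ncache after get: %.*s\n", (int)cache_len, cache);
        return 0;
}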


> 
> I would suggest splitting each performance improvement into a separate
> patch for better tracking and quantification of the performance improvement.

With the new realizations above, I should reconsider my patch from scratch.

I have also been wondering if the mempool cache size really needs to be configurable, or if it could be a fixed size?

Bruce mentioned in another thread (http://inbox.dpdk.org/dev/Ydv%2FIMz8eIRPSguY@bricha3-MOBL.ger.corp.intel.com/T/#m02cabb25655c08a0980888df8c41aba9ac8dd6ff) that the typical configuration of the cache size is RTE_MEMPOOL_CACHE_MAX_SIZE.

I would dare to say that it probably suffices to configure if the mempool has a cache or not! The cache_size parameter of rte_mempool_create() is not respected 1:1 anyway (because each per-lcore cache may consume up to 1.5 x cache_size objects from the mempool backing store), so the cache_size parameter could be parsed as non-zero vs. zero to determine if a cache is wanted or not.
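
For reference, a hypothetical creation call showing where the cache_size parameter enters the picture (the pool name and sizes are made-up examples); as noted above, the per-lcore cache behind it may hold up to 1.5 x the requested cache_size before it is flushed:

#include <rte_lcore.h>
#include <rte_mbuf.h>

static struct rte_mempool *create_example_pool(void)
{
        return rte_pktmbuf_pool_create("example_pool",
                        8192 - 1,                       /* number of mbufs */
                        RTE_MEMPOOL_CACHE_MAX_SIZE,     /* per-lcore cache size */
                        0,                              /* private area size */
                        RTE_MBUF_DEFAULT_BUF_SIZE,
                        rte_socket_id());
}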

> 
> I think the mempool performance test and the tx-only stream mode in
> testpmd can quantify the patches.
> 
> 
> 
> >
> > Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e.
> we are still getting the 4 objects most recently put into the cache.
> >
> > Furthermore, if the application is working with fixed size bursts, it
> will usually put and get the same size burst, i.e. put the burst
> 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache
> again.
> >
> >
> > Here is an example unfavorable scenario:
> >
> > The cache initially contains 4 objects, which have gone cold: 0123.
> >
> > 4 more objects, which happen to be hot, are put into the cache: 4567.
> >
> > Getting 8 objects from the cache gives us 01234567 instead of
> 76543210.
> >
> > Now, if the application only processes the first 4 of the 8 objects
> in the burst, it would have benefitted from those objects being the hot
> 7654 objects instead of the cold 0123 objects.
> >
> > However, I think that most applications process the complete burst,
> so I do consider this scenario unlikely.
> >
> > Similarly, a pipelined application doesn't process objects in reverse
> order at every other step in the pipeline, even though the previous
> step in the pipeline most recently touched the last object of the
> burst.
> >
> >
> > My overall conclusion was that the benefit of using rte_memcpy()
> outweighs the disadvantage of the unfavorable scenario, because I
> consider the probability of the unfavorable scenario occurring very
> low. But again: it mainly depends on the application.
> >
> > If anyone disagrees with the risk analysis described above, I will
> happily provide a version 2 of the patch, where the objects are still
> returned in reverse order. After all, the rte_memcpy() benefit is
> relatively small compared to the impact if the unlikely scenario
> occurs.
> >



* [PATCH] mempool: fix get objects from mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
  2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
@ 2022-01-14 16:36 ` Morten Brørup
  2022-01-17 17:35   ` Bruce Richardson
  2022-01-17 11:52 ` [PATCH] mempool: optimize put objects to " Morten Brørup
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Morten Brørup @ 2022-01-14 16:36 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.

This patch fixes the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the ring.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the ring were returned ahead of the objects from the cache. This is
bad for CPUs with a small L1 cache, which benefit from having the hot
objects first in the returned array. (This is also the reason why
the function returns the objects in reverse order.)
Now, all code paths first return objects from the cache, subsequently
from the ring.

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the ring (instead of only the
number of requested objects minus the objects available in the cache),
and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the ring fails, only
the remaining requested objects are retrieved from the ring.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were
added and subtracted from the cache length; now this code path simply
sets the cache length to its final value.

5. Some comments were not correct anymore.
The comments have been updated.
Most importantly, the description of the successful return value was
inaccurate. Success only returns 0, not >= 0.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 81 ++++++++++++++++++++++++++++-----------
 1 file changed, 59 insertions(+), 22 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..88f1b8b7ab 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
 
 /**
  * @internal Get several objects from the mempool; used internally.
+ *
+ * If cache is enabled, objects are returned from the cache in Last In First
+ * Out (LIFO) order for the benefit of CPUs with small L1 cache.
+ *
  * @param mp
  *   A pointer to the mempool structure.
  * @param obj_table
@@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
  * @param cache
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  * @return
- *   - >=0: Success; number of objects supplied.
+ *   - 0: Success; got n objects.
  *   - <0: Error; code of ring dequeue function.
  */
 static __rte_always_inline int
@@ -1463,38 +1467,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
+	cache_objs = &cache->objs[cache->len];
+
+	if (n <= cache->len) {
+		/* The entire request can be satisfied from the cache. */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
+		return 0;
+	}
+
+	/* Satisfy the first part of the request by depleting the cache. */
+	len = cache->len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
+
+	/* Number of objects remaining to satisfy the request. */
+	len = n - len;
+
+	/* Fill the cache from the ring; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + len);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the ring.
+		 */
+		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
 		if (unlikely(ret < 0)) {
 			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
+			 * That also failed.
+			 * No further action is required to roll the first
+			 * part of the request back into the cache, as both
+			 * cache->len and the objects in the cache are intact.
 			 */
-			goto ring_dequeue;
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
+
+			return ret;
 		}
 
-		cache->len += req;
+		/* Commit that the cache was emptied. */
+		cache->len = 0;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	cache_objs = &cache->objs[cache->size + len];
 
-	cache->len -= n;
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache->len = cache->size;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1540,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring. */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {
-- 
2.17.1
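
For context, a minimal caller-side sketch of the contract this helper must honour (illustrative only; mp is assumed to be an already created mempool): either all n objects are delivered and 0 is returned, or none are and a negative value is returned, with the cached (hot) objects leading the returned array.

#include <rte_mempool.h>

static int process_burst(struct rte_mempool *mp, void *objs[], unsigned int n)
{
        /* Implicitly uses the calling lcore's default cache, if any. */
        if (rte_mempool_get_bulk(mp, objs, n) < 0)
                return -1;      /* nothing was taken from the pool */

        /* ... use the objects; the hot ones come first ... */

        rte_mempool_put_bulk(mp, objs, n);
        return 0;
}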



* [PATCH] mempool: optimize put objects to mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
  2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
  2022-01-14 16:36 ` [PATCH] mempool: fix get objects from mempool with cache Morten Brørup
@ 2022-01-17 11:52 ` Morten Brørup
  2022-01-19 14:52 ` [PATCH v2] mempool: fix " Morten Brørup
  2022-01-19 15:03 ` [PATCH v3] " Morten Brørup
  4 siblings, 0 replies; 13+ messages in thread
From: Morten Brørup @ 2022-01-17 11:52 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

This patch optimizes the rte_mempool_do_generic_put() caching algorithm.

The algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

(Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
"size" objects, which is reflected in the above description.)

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.

This patch fixes the following two inefficiencies of the old algorithm:

1. The most recent (hot) objects are flushed, leaving the oldest (cold)
objects in the mempool cache.
This is bad for CPUs with a small L1 cache, because when they get
objects from the mempool after the mempool cache has been flushed, they
get cold objects instead of hot objects.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.

2. The cache is still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The current flushing behaviour is suboptimal for real life applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events are out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
NB: When the application is in a state where put/get events are in
balance, the cache should remain within its low and high watermarks, and
the algorithms for refilling/flushing the cache should not come into
play.
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

Not adding the new objects to the mempool cache before flushing it also
allows the memory allocated for the mempool cache to be reduced from 3 x
to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.

Furthermore, a minor bug in the flush threshold comparison has been
corrected; it must be "len > flushthresh", not "len >= flushthresh".
Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
would be flushed already when reaching size elements, not when exceeding
size elements.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.

And finally, using the x86 variant of rte_memcpy() is inefficient here,
where n is relatively small and unknown at compile time.
Now, it has been replaced by an alternative copying method, optimized
for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
or multiples thereof.
The mempool cache is cache line aligned for the benefit of this copying
method, which on some CPU architectures performs worse on data crossing
a cache boundary.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 47 ++++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..1ce850bedd 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -94,7 +94,8 @@ struct rte_mempool_cache {
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
+	/**< Cache objects */
 } __rte_cache_aligned;
 
 /**
@@ -1344,31 +1345,51 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = cache->objs;
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
+	}
+
+	/* Add the objects to the cache. */
+	for (; n >= 4; n -= 4) {
+#ifdef RTE_ARCH_64
+		rte_mov32((unsigned char *)cache_objs, (const unsigned char *)obj_table);
+#else
+		rte_mov16((unsigned char *)cache_objs, (const unsigned char *)obj_table);
+#endif
+		cache_objs += 4;
+		obj_table += 4;
 	}
+	for (; n > 0; --n)
+		*cache_objs++ = *obj_table++;
 
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
-- 
2.17.1
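
In simplified form, the proposed put algorithm looks like this (a standalone sketch, not the actual patch; plain memcpy() stands in for the 4-pointer copy loop, and spill_to_ring() is a made-up stand-in for the ring enqueue):

#include <stdio.h>
#include <string.h>

struct toy_cache {
        unsigned int len;
        unsigned int flushthresh;
        void *objs[32];
};

static void spill_to_ring(void * const *objs, unsigned int n)
{
        printf("flushed %u objects to the ring\n", n);
        (void)objs;
}

/* Flush first (if needed), then add, so the cache ends up holding only
 * the newly put (hot) objects. */
static void cache_put(struct toy_cache *c, void * const *objs, unsigned int n)
{
        if (c->len + n > c->flushthresh) {
                spill_to_ring(c->objs, c->len);
                c->len = 0;
        }
        memcpy(&c->objs[c->len], objs, n * sizeof(void *));
        c->len += n;
}

int main(void)
{
        struct toy_cache c = { .len = 8, .flushthresh = 12 };
        void *burst[8] = { NULL };

        cache_put(&c, burst, 8);        /* 8 + 8 > 12: flush 8, then cache the 8 new */
        printf("cache length after put: %u\n", c.len);  /* 8 */
        return 0;
}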



* Re: [PATCH] mempool: fix get objects from mempool with cache
  2022-01-14 16:36 ` [PATCH] mempool: fix get objects from mempool with cache Morten Brørup
@ 2022-01-17 17:35   ` Bruce Richardson
  2022-01-18  8:25     ` Morten Brørup
  0 siblings, 1 reply; 13+ messages in thread
From: Bruce Richardson @ 2022-01-17 17:35 UTC (permalink / raw)
  To: Morten Brørup; +Cc: olivier.matz, andrew.rybchenko, jerinjacobk, dev

On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> This patch fixes the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the ring.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the objects
> from the ring were returned ahead of the objects from the cache. This is
> bad for CPUs with a small L1 cache, which benefit from having the hot
> objects first in the returned array. (This is also the reason why
> the function returns the objects in reverse order.)
> Now, all code paths first return objects from the cache, subsequently
> from the ring.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the ring (instead of only the
> number of requested objects minus the objects available in the cache),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the ring fails, only
> the remaining requested objects are retrieved from the ring.
> 
> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.
> 
> 5. Some comments were not correct anymore.
> The comments have been updated.
> Most importantly, the description of the successful return value was
> inaccurate. Success only returns 0, not >= 0.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---

I am a little uncertain about the reversing of the copies taking things out
of the mempool - for machines where we are not that cache constrained will
we lose out in possible optimizations where the compiler optimizes the copy
loop as a memcpy?

Otherwise the logic all looks correct to me.

/Bruce


* RE: [PATCH] mempool: fix get objects from mempool with cache
  2022-01-17 17:35   ` Bruce Richardson
@ 2022-01-18  8:25     ` Morten Brørup
  2022-01-18  9:07       ` Bruce Richardson
  0 siblings, 1 reply; 13+ messages in thread
From: Morten Brørup @ 2022-01-18  8:25 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: olivier.matz, andrew.rybchenko, jerinjacobk, dev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 17 January 2022 18.35
> 
> On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > A flush threshold for the mempool cache was introduced in DPDK
> version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> >
> > This patch fixes the following in rte_mempool_do_generic_get():
> >
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction
> of
> > the flush threshold. E.g. the cache can hold up to flushthresh
> objects,
> > which is more than its size, so some requests were not served from
> the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a
> cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> >
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the ring.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the
> objects
> > from the ring were returned ahead of the objects from the cache. This
> is
> > bad for CPUs with a small L1 cache, which benefit from having the hot
> > objects first in the returned array. (This is also the reason why
> > the function returns the objects in reverse order.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the ring.
> >
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the ring (instead of only the
> > number of requested objects minus the objects available in the cache),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the
> cache,
> > and if the subsequent backfilling of the cache from the ring fails,
> only
> > the remaining requested objects are retrieved from the ring.
> >
> > 4. The code flow for satisfying the request from the cache was
> slightly
> > inefficient:
> > The likely code path where the objects are simply served from the
> cache
> > was treated as unlikely. Now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers
> were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 5. Some comments were not correct anymore.
> > The comments have been updated.
> > Most importantly, the description of the successful return value was
> > inaccurate. Success only returns 0, not >= 0.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> 
> I am a little uncertain about the reversing of the copies taking things
> out
> of the mempool - for machines where we are not that cache constrained
> will
> we lose out in possible optimizations where the compiler optimizes the
> copy
> loop as a memcpy?

The objects are also returned in reverse order in the code it replaces, so this behavior is not introduced by this patch; I only describe the reason for it.

I floated a previous patch, in which the objects were returned in order, but Jerin argued [1] that we should keep it the way it was, unless I could show a performance improvement.

So I retracted that patch to split it up in two independent patches instead. This patch for get(), and [3] for put().

While experimenting using rte_memcpy() for these, I couldn't achieve a performance boost - quite the opposite. So I gave up on it.

Reviewing the x86 variant of rte_memcpy() [2] makes me think that it is inefficient for copying small bulks of pointers, especially when n is unknown at compile time, and its code path goes through a great deal of branches.

> 
> Otherwise the logic all looks correct to me.
> 
> /Bruce

[1]: http://inbox.dpdk.org/dev/CALBAE1OjCswxUfaNLWg5y-tnPkFhvvKQ8sJ3JpBoo7ObgeB5OA@mail.gmail.com/
[2]: http://code.dpdk.org/dpdk/latest/source/lib/eal/x86/include/rte_memcpy.h
[3]: http://inbox.dpdk.org/dev/20220117115231.8060-1-mb@smartsharesystems.com/
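
To illustrate the point about copying a small, run-time-sized burst of pointers, a minimal sketch (plain C, not from any of the patches) of the kind of loop a compiler handles well here, without the size dispatching of a general-purpose memcpy implementation:

#include <stdio.h>

static void copy_ptrs(void **dst, void * const *src, unsigned int n)
{
        /* One pointer per iteration; the compiler may unroll or vectorize. */
        while (n-- > 0)
                *dst++ = *src++;
}

int main(void)
{
        void *src[4] = { &src[0], &src[1], &src[2], &src[3] };
        void *dst[4];

        copy_ptrs(dst, src, 4);
        printf("%d\n", dst[3] == &src[3]);      /* 1 */
        return 0;
}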



* Re: [PATCH] mempool: fix get objects from mempool with cache
  2022-01-18  8:25     ` Morten Brørup
@ 2022-01-18  9:07       ` Bruce Richardson
  0 siblings, 0 replies; 13+ messages in thread
From: Bruce Richardson @ 2022-01-18  9:07 UTC (permalink / raw)
  To: Morten Brørup; +Cc: olivier.matz, andrew.rybchenko, jerinjacobk, dev

On Tue, Jan 18, 2022 at 09:25:22AM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Monday, 17 January 2022 18.35
> > 
> > On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > > A flush threshold for the mempool cache was introduced in DPDK
> > version
> > > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > > then, and some inefficiencies were introduced.
> > >
> > > This patch fixes the following in rte_mempool_do_generic_get():
> > >
> > > 1. The code that initially screens the cache request was not updated
> > > with the change in DPDK version 1.3.
> > > The initial screening compared the request length to the cache size,
> > > which was correct before, but became irrelevant with the introduction
> > of
> > > the flush threshold. E.g. the cache can hold up to flushthresh
> > objects,
> > > which is more than its size, so some requests were not served from
> > the
> > > cache, even though they could be.
> > > The initial screening has now been corrected to match the initial
> > > screening in rte_mempool_do_generic_put(), which verifies that a
> > cache
> > > is present, and that the length of the request does not overflow the
> > > memory allocated for the cache.
> > >
> > > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > > behave according to the description of that function.
> > > Specifically, objects must first be returned from the cache,
> > > subsequently from the ring.
> > > After the change in DPDK version 1.3, this was not the behavior when
> > > the request was partially satisfied from the cache; instead, the
> > objects
> > > from the ring were returned ahead of the objects from the cache. This
> > is
> > > bad for CPUs with a small L1 cache, which benefit from having the hot
> > > objects first in the returned array. (This is also the reason why
> > > the function returns the objects in reverse order.)
> > > Now, all code paths first return objects from the cache, subsequently
> > > from the ring.
> > >
> > > 3. If the cache could not be backfilled, the function would attempt
> > > to get all the requested objects from the ring (instead of only the
> > > number of requested objects minus the objects available in the cache),
> > > and the function would fail if that failed.
> > > Now, the first part of the request is always satisfied from the
> > cache,
> > > and if the subsequent backfilling of the cache from the ring fails,
> > only
> > > the remaining requested objects are retrieved from the ring.
> > >
> > > 4. The code flow for satisfying the request from the cache was
> > slightly
> > > inefficient:
> > > The likely code path where the objects are simply served from the
> > cache
> > > was treated as unlikely. Now it is treated as likely.
> > > And in the code path where the cache was backfilled first, numbers
> > were
> > > added and subtracted from the cache length; now this code path simply
> > > sets the cache length to its final value.
> > >
> > > 5. Some comments were not correct anymore.
> > > The comments have been updated.
> > > Most importantly, the description of the successful return value was
> > > inaccurate. Success only returns 0, not >= 0.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > 
> > I am a little uncertain about the reversing of the copies taking things
> > out
> > of the mempool - for machines where we are not that cache constrained
> > will
> > we lose out in possible optimizations where the compiler optimizes the
> > copy
> > loop as a memcpy?
> 
> The objects are also returned in reverse order in the code it replaces, so this behavior is not introduced by this patch; I only describe the reason for it.
> 
> I floated a previous patch, in which the objects were returned in order, but Jerin argued [1] that we should keep it the way it was, unless I could show a performance improvement.
> 
> So I retracted that patch to split it up into two independent patches instead. This patch for get(), and [3] for put().
> 
> While experimenting using rte_memcpy() for these, I couldn't achieve a performance boost - quite the opposite. So I gave up on it.
> 
> Reviewing the x86 variant of rte_memcpy() [2] makes me think that it is inefficient for copying small bulks of pointers, especially when n is unknown at compile time, and its code path goes through a large number of branches.
>
Thanks for all the explanation.

Reviewed-by: Bruce Richardson <bruce.richardson@intel.com> 
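
For reference, below is a condensed sketch of the corrected get flow described in the commit message quoted above. It reuses the public mempool API names (struct rte_mempool_cache, RTE_MEMPOOL_CACHE_MAX_SIZE, rte_mempool_ops_dequeue_bulk()), but the body is a simplified illustration with the statistics handling and branch-prediction hints omitted, not the exact patched function:

#include <rte_common.h>
#include <rte_mempool.h>

/*
 * Hypothetical sketch of the corrected get flow: serve the request from
 * the cache first (hottest objects first, hence the backwards copy),
 * then backfill the cache from the ring and serve the remainder; if the
 * backfill fails, dequeue only the remaining objects from the ring.
 */
static int
sketch_do_generic_get(struct rte_mempool *mp, void **obj_table,
		unsigned int n, struct rte_mempool_cache *cache)
{
	unsigned int remaining = n;
	uint32_t index, len;
	void **cache_objs;

	/* No cache provided, or request too big for the cache memory */
	if (cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE)
		goto ring_dequeue;

	/* Part 1: as many objects as possible straight from the cache */
	cache_objs = cache->objs;
	len = RTE_MIN(remaining, cache->len);
	for (index = 0; index < len; index++)
		*obj_table++ = cache_objs[--cache->len];
	remaining -= len;
	if (remaining == 0)
		return 0; /* likely case: fully served from the cache */

	/* Part 2: the cache is now empty; backfill it from the ring and
	 * serve the rest of the request from the freshly filled cache.
	 */
	if (rte_mempool_ops_dequeue_bulk(mp, cache_objs,
			cache->size + remaining) == 0) {
		cache->len = cache->size;
		for (index = 0; index < remaining; index++)
			*obj_table++ =
				cache_objs[cache->size + remaining - 1 - index];
		return 0;
	}

ring_dequeue:
	/* Backfill failed (or no usable cache): get only the objects not
	 * already served from the cache directly from the ring.
	 */
	return rte_mempool_ops_dequeue_bulk(mp, obj_table, remaining);
}

The likely case, where the whole request is served from the cache, returns early after the first loop; the reverse-order copy keeps the hottest objects first in the returned array, as discussed above.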

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2] mempool: fix put objects to mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
                   ` (2 preceding siblings ...)
  2022-01-17 11:52 ` [PATCH] mempool: optimize put objects to " Morten Brørup
@ 2022-01-19 14:52 ` Morten Brørup
  2022-01-19 15:03 ` [PATCH v3] " Morten Brørup
  4 siblings, 0 replies; 13+ messages in thread
From: Morten Brørup @ 2022-01-19 14:52 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

This patch optimizes the rte_mempool_do_generic_put() caching algorithm,
and fixes a bug in it.

The existing algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
"size" objects, which is reflected in the above description.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.
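
As an illustration (numbers assumed for the example): with a cache size of
256, i.e. a flush threshold of 384, and 300 objects currently cached, putting
a burst of 100 objects under the old algorithm grew the cache to 400 and then
flushed the topmost 144 objects (including the 100 hot objects just put),
leaving 256 cold objects in the cache; under the new algorithm the 300 cold
objects are flushed first, and only the 100 hot objects remain cached.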

This patch changes these details:

1. Bug: The cache was still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the numbers of put and get events are out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
NB: When the application is in a state where put/get events are in
balance, the cache should remain within its low and high watermarks, and
the algorithms for refilling/flushing the cache should not come into
play.
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

2. Minor bug: The flush threshold comparison has been corrected; it must
be "len > flushthresh", not "len >= flushthresh".
Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
would be flushed already when reaching size elements, not when exceeding
size elements.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.

3. Optimization: The most recent (hot) objects are flushed, leaving the
oldest (cold) objects in the mempool cache.
This is bad for CPUs with a small L1 cache, because when they get
objects from the mempool after the mempool cache has been flushed, they
get cold objects instead of hot objects.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.

4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
here, where n is relatively small and unknown at compile time.
Now, it has been replaced by an alternative copying method, optimized
for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
or multiples thereof.

v2 changes:

- Not adding the new objects to the mempool cache before flushing it
also allows the memory allocated for the mempool cache to be reduced
from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
However, such a change would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
copying method, which on some CPU architectures performs worse on data
crossing a cache line boundary.
However, such a change would break the ABI, so it was removed in v2;
and yet another alternative copying method replaced the rte_memcpy().

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 54 +++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..8a7067ee5b 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -94,7 +94,8 @@ struct rte_mempool_cache {
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
+	/**< Cache objects */
 } __rte_cache_aligned;
 
 /**
@@ -1334,6 +1335,7 @@ static __rte_always_inline void
 rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
+	uint32_t index;
 	void **cache_objs;
 
 	/* increment stat now, adding in mempool always success */
@@ -1344,31 +1346,56 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = cache->objs;
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
+	}
+
+	/* Add the objects to the cache. */
+	for (index = 0; index < (n & ~0x3); index += 4) {
+		cache_objs[index] = obj_table[index];
+		cache_objs[index + 1] = obj_table[index + 1];
+		cache_objs[index + 2] = obj_table[index + 2];
+		cache_objs[index + 3] = obj_table[index + 3];
+	}
+	switch (n & 0x3) {
+	case 3:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 2:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 1:
+		cache_objs[index] = obj_table[index];
 	}
 
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
@@ -1377,7 +1404,6 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 #endif
 }
 
-
 /**
  * Put several objects back in the mempool.
  *
-- 
2.17.1
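
As a usage note, here is a minimal, hypothetical sketch of bulk puts and gets that exercise the per-lcore cache path touched by this patch. The pool name, object count, element size, cache size and burst length are arbitrary assumptions, and rte_eal_init() is assumed to have been called and the function to run on an EAL lcore:

#include <rte_lcore.h>
#include <rte_mempool.h>

#define SKETCH_NB_OBJS  4096U	/* objects in the pool (assumed) */
#define SKETCH_OBJ_SIZE 2048U	/* element size in bytes (assumed) */
#define SKETCH_CACHE_SZ 256U	/* per-lcore cache size; flush threshold ~1.5x */

static void
sketch_exercise_cache(void)
{
	struct rte_mempool *mp;
	void *objs[32];

	/* Create a pool with a per-lcore cache; small-burst gets and puts
	 * then take the cached fast path discussed in this thread.
	 */
	mp = rte_mempool_create("sketch_pool", SKETCH_NB_OBJS,
			SKETCH_OBJ_SIZE, SKETCH_CACHE_SZ, 0,
			NULL, NULL, NULL, NULL, rte_socket_id(), 0);
	if (mp == NULL)
		return;

	/* A burst of 32 is well below the flush threshold, so both calls
	 * go through rte_mempool_do_generic_get()/_put() via the cache.
	 */
	if (rte_mempool_get_bulk(mp, objs, 32) == 0)
		rte_mempool_put_bulk(mp, objs, 32);

	rte_mempool_free(mp);
}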


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3] mempool: fix put objects to mempool with cache
  2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
                   ` (3 preceding siblings ...)
  2022-01-19 14:52 ` [PATCH v2] mempool: fix " Morten Brørup
@ 2022-01-19 15:03 ` Morten Brørup
  4 siblings, 0 replies; 13+ messages in thread
From: Morten Brørup @ 2022-01-19 15:03 UTC (permalink / raw)
  To: olivier.matz, andrew.rybchenko
  Cc: bruce.richardson, jerinjacobk, dev, Morten Brørup

This patch optimizes the rte_mempool_do_generic_put() caching algorithm,
and fixes a bug in it.

The existing algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
"size" objects, which is reflected in the above description.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.

This patch changes these details:

1. Bug: The cache was still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the numbers of put and get events are out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
NB: When the application is in a state where put/get events are in
balance, the cache should remain within its low and high watermarks, and
the algorithms for refilling/flushing the cache should not come into
play.
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

2. Minor bug: The flush threshold comparison has been corrected; it must
be "len > flushthresh", not "len >= flushthresh".
Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
would be flushed already when reaching size elements, not when exceeding
size elements.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.

3. Optimization: The most recent (hot) objects are flushed, leaving the
oldest (cold) objects in the mempool cache.
This is bad for CPUs with a small L1 cache, because when they get
objects from the mempool after the mempool cache has been flushed, they
get cold objects instead of hot objects.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.

4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
here, where n is relatively small and unknown at compile time.
Now, it has been replaced by an alternative copying method, optimized
for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
or multiples thereof.

v2 changes:

- Not adding the new objects to the mempool cache before flushing it
also allows the memory allocated for the mempool cache to be reduced
from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
However, such a change would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
copying method, which on some CPU architectures performs worse on data
crossing a cache line boundary.
However, such a change would break the ABI, so it was removed in v2;
and yet another alternative copying method replaced the rte_memcpy().

v3 changes:

- Actually remove my modifications of the rte_mempool_cache structure.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 51 +++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..7b364cfc74 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1334,6 +1334,7 @@ static __rte_always_inline void
 rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
+	uint32_t index;
 	void **cache_objs;
 
 	/* increment stat now, adding in mempool always success */
@@ -1344,31 +1345,56 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = cache->objs;
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
+	}
+
+	/* Add the objects to the cache. */
+	for (index = 0; index < (n & ~0x3); index += 4) {
+		cache_objs[index] = obj_table[index];
+		cache_objs[index + 1] = obj_table[index + 1];
+		cache_objs[index + 2] = obj_table[index + 2];
+		cache_objs[index + 3] = obj_table[index + 3];
+	}
+	switch (n & 0x3) {
+	case 3:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 2:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 1:
+		cache_objs[index] = obj_table[index];
 	}
 
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
@@ -1377,7 +1403,6 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 #endif
 }
 
-
 /**
  * Put several objects back in the mempool.
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-01-19 15:03 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-26 15:34 [RFC] mempool: rte_mempool_do_generic_get optimizations Morten Brørup
2022-01-06 12:23 ` [PATCH] mempool: optimize incomplete cache handling Morten Brørup
2022-01-06 16:55   ` Jerin Jacob
2022-01-07  8:46     ` Morten Brørup
2022-01-10  7:26       ` Jerin Jacob
2022-01-10 10:55         ` Morten Brørup
2022-01-14 16:36 ` [PATCH] mempool: fix get objects from mempool with cache Morten Brørup
2022-01-17 17:35   ` Bruce Richardson
2022-01-18  8:25     ` Morten Brørup
2022-01-18  9:07       ` Bruce Richardson
2022-01-17 11:52 ` [PATCH] mempool: optimize put objects to " Morten Brørup
2022-01-19 14:52 ` [PATCH v2] mempool: fix " Morten Brørup
2022-01-19 15:03 ` [PATCH v3] " Morten Brørup
