DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH] mempool: optimize copy in cache get
@ 2019-05-21  9:03 Xiao Wang
  2019-05-21  9:34 ` Andrew Rybchenko
  0 siblings, 1 reply; 5+ messages in thread
From: Xiao Wang @ 2019-05-21  9:03 UTC (permalink / raw)
  To: olivier.matz; +Cc: dev, arybchenko, Xiao Wang

Use rte_memcpy to improve the pointer array copy. This optimization method
has already been applied to __mempool_generic_put() [1], this patch applies
it to __mempool_generic_get(). Slight performance gain can be observed in
testpmd txonly test.

[1] 863bfb47449 ("mempool: optimize copy in cache")

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 lib/librte_mempool/rte_mempool.h | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 8053f7a04..975da8d22 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -1344,15 +1344,11 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
 		      unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
-	uint32_t index, len;
-	void **cache_objs;
 
 	/* No cache provided or cannot be satisfied from cache */
 	if (unlikely(cache == NULL || n >= cache->size))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
-
 	/* Can this be satisfied from the cache? */
 	if (cache->len < n) {
 		/* No. Backfill the cache first, and then fill from it */
@@ -1375,8 +1371,7 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
 	}
 
 	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	rte_memcpy(obj_table, &cache->objs[cache->len - n], sizeof(void *) * n);
 
 	cache->len -= n;
 
-- 
2.15.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] [PATCH] mempool: optimize copy in cache get
  2019-05-21  9:03 [dpdk-dev] [PATCH] mempool: optimize copy in cache get Xiao Wang
@ 2019-05-21  9:34 ` Andrew Rybchenko
  2019-07-01 13:11   ` Olivier Matz
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Rybchenko @ 2019-05-21  9:34 UTC (permalink / raw)
  To: Xiao Wang, olivier.matz; +Cc: dev

On 5/21/19 12:03 PM, Xiao Wang wrote:
> Use rte_memcpy to improve the pointer array copy. This optimization method
> has already been applied to __mempool_generic_put() [1], this patch applies
> it to __mempool_generic_get(). Slight performance gain can be observed in
> testpmd txonly test.
>
> [1] 863bfb47449 ("mempool: optimize copy in cache")
>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> ---
>   lib/librte_mempool/rte_mempool.h | 7 +------
>   1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> index 8053f7a04..975da8d22 100644
> --- a/lib/librte_mempool/rte_mempool.h
> +++ b/lib/librte_mempool/rte_mempool.h
> @@ -1344,15 +1344,11 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
>   		      unsigned int n, struct rte_mempool_cache *cache)
>   {
>   	int ret;
> -	uint32_t index, len;
> -	void **cache_objs;
>   
>   	/* No cache provided or cannot be satisfied from cache */
>   	if (unlikely(cache == NULL || n >= cache->size))
>   		goto ring_dequeue;
>   
> -	cache_objs = cache->objs;
> -
>   	/* Can this be satisfied from the cache? */
>   	if (cache->len < n) {
>   		/* No. Backfill the cache first, and then fill from it */
> @@ -1375,8 +1371,7 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
>   	}
>   
>   	/* Now fill in the response ... */
> -	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> -		*obj_table = cache_objs[len];
> +	rte_memcpy(obj_table, &cache->objs[cache->len - n], sizeof(void *) * n);
>   
>   	cache->len -= n;
>   

I think the idea of the loop above is to get objects in reverse order to 
order
to reuse cache top objects (put last) first. It should improve cache hit 
etc.
So, performance effect of the patch could be very different on various CPUs
(with different cache sizes) and various work-loads.

So, I doubt that it is a step in right direction.

Andrew.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] [PATCH] mempool: optimize copy in cache get
  2019-05-21  9:34 ` Andrew Rybchenko
@ 2019-07-01 13:11   ` Olivier Matz
  2019-07-01 14:21     ` Wang, Xiao W
  0 siblings, 1 reply; 5+ messages in thread
From: Olivier Matz @ 2019-07-01 13:11 UTC (permalink / raw)
  To: Andrew Rybchenko; +Cc: Xiao Wang, dev

Hi,

On Tue, May 21, 2019 at 12:34:55PM +0300, Andrew Rybchenko wrote:
> On 5/21/19 12:03 PM, Xiao Wang wrote:
> > Use rte_memcpy to improve the pointer array copy. This optimization method
> > has already been applied to __mempool_generic_put() [1], this patch applies
> > it to __mempool_generic_get(). Slight performance gain can be observed in
> > testpmd txonly test.
> > 
> > [1] 863bfb47449 ("mempool: optimize copy in cache")
> > 
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > ---
> >   lib/librte_mempool/rte_mempool.h | 7 +------
> >   1 file changed, 1 insertion(+), 6 deletions(-)
> > 
> > diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> > index 8053f7a04..975da8d22 100644
> > --- a/lib/librte_mempool/rte_mempool.h
> > +++ b/lib/librte_mempool/rte_mempool.h
> > @@ -1344,15 +1344,11 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
> >   		      unsigned int n, struct rte_mempool_cache *cache)
> >   {
> >   	int ret;
> > -	uint32_t index, len;
> > -	void **cache_objs;
> >   	/* No cache provided or cannot be satisfied from cache */
> >   	if (unlikely(cache == NULL || n >= cache->size))
> >   		goto ring_dequeue;
> > -	cache_objs = cache->objs;
> > -
> >   	/* Can this be satisfied from the cache? */
> >   	if (cache->len < n) {
> >   		/* No. Backfill the cache first, and then fill from it */
> > @@ -1375,8 +1371,7 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
> >   	}
> >   	/* Now fill in the response ... */
> > -	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> > -		*obj_table = cache_objs[len];
> > +	rte_memcpy(obj_table, &cache->objs[cache->len - n], sizeof(void *) * n);
> >   	cache->len -= n;
> 
> I think the idea of the loop above is to get objects in reverse order to
> order
> to reuse cache top objects (put last) first. It should improve cache hit
> etc.
> So, performance effect of the patch could be very different on various CPUs
> (with different cache sizes) and various work-loads.
> 
> So, I doubt that it is a step in right direction.

For reference, this was already discussed 3 years ago:

https://mails.dpdk.org/archives/dev/2016-May/039873.html
https://mails.dpdk.org/archives/dev/2016-June/040029.html

I'm still not convinced that reversing object addresses (as it's done
today) is really important. But Andrew is probably right, the impact of
this kind of patch probably varies depending on many factors. More
performance numbers on real-life use-cases would help to decide what to
do.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] [PATCH] mempool: optimize copy in cache get
  2019-07-01 13:11   ` Olivier Matz
@ 2019-07-01 14:21     ` Wang, Xiao W
  2019-07-01 15:00       ` Olivier Matz
  0 siblings, 1 reply; 5+ messages in thread
From: Wang, Xiao W @ 2019-07-01 14:21 UTC (permalink / raw)
  To: Olivier Matz, Andrew Rybchenko; +Cc: dev

Hi,

> -----Original Message-----
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Monday, July 1, 2019 9:11 PM
> To: Andrew Rybchenko <arybchenko@solarflare.com>
> Cc: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Subject: Re: [PATCH] mempool: optimize copy in cache get
> 
> Hi,
> 
> On Tue, May 21, 2019 at 12:34:55PM +0300, Andrew Rybchenko wrote:
> > On 5/21/19 12:03 PM, Xiao Wang wrote:
> > > Use rte_memcpy to improve the pointer array copy. This optimization
> method
> > > has already been applied to __mempool_generic_put() [1], this patch
> applies
> > > it to __mempool_generic_get(). Slight performance gain can be observed
> in
> > > testpmd txonly test.
> > >
> > > [1] 863bfb47449 ("mempool: optimize copy in cache")
> > >
> > > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > > ---
> > >   lib/librte_mempool/rte_mempool.h | 7 +------
> > >   1 file changed, 1 insertion(+), 6 deletions(-)
> > >
> > > diff --git a/lib/librte_mempool/rte_mempool.h
> b/lib/librte_mempool/rte_mempool.h
> > > index 8053f7a04..975da8d22 100644
> > > --- a/lib/librte_mempool/rte_mempool.h
> > > +++ b/lib/librte_mempool/rte_mempool.h
> > > @@ -1344,15 +1344,11 @@ __mempool_generic_get(struct
> rte_mempool *mp, void **obj_table,
> > >   		      unsigned int n, struct rte_mempool_cache *cache)
> > >   {
> > >   	int ret;
> > > -	uint32_t index, len;
> > > -	void **cache_objs;
> > >   	/* No cache provided or cannot be satisfied from cache */
> > >   	if (unlikely(cache == NULL || n >= cache->size))
> > >   		goto ring_dequeue;
> > > -	cache_objs = cache->objs;
> > > -
> > >   	/* Can this be satisfied from the cache? */
> > >   	if (cache->len < n) {
> > >   		/* No. Backfill the cache first, and then fill from it */
> > > @@ -1375,8 +1371,7 @@ __mempool_generic_get(struct rte_mempool
> *mp, void **obj_table,
> > >   	}
> > >   	/* Now fill in the response ... */
> > > -	for (index = 0, len = cache->len - 1; index < n; ++index, len--,
> obj_table++)
> > > -		*obj_table = cache_objs[len];
> > > +	rte_memcpy(obj_table, &cache->objs[cache->len - n], sizeof(void *) *
> n);
> > >   	cache->len -= n;
> >
> > I think the idea of the loop above is to get objects in reverse order to
> > order
> > to reuse cache top objects (put last) first. It should improve cache hit
> > etc.
> > So, performance effect of the patch could be very different on various CPUs
> > (with different cache sizes) and various work-loads.
> >
> > So, I doubt that it is a step in right direction.
> 
> For reference, this was already discussed 3 years ago:
> 
> https://mails.dpdk.org/archives/dev/2016-May/039873.html
> https://mails.dpdk.org/archives/dev/2016-June/040029.html
> 
> I'm still not convinced that reversing object addresses (as it's done
> today) is really important. But Andrew is probably right, the impact of
> this kind of patch probably varies depending on many factors. More
> performance numbers on real-life use-cases would help to decide what to
> do.
> 
> Regards,
> Olivier

I agree, and thanks for the reference link. So theoretically neither way can be
a definite best choice, it depends on various real-life factors. I'm thinking about
how to let app developer be aware of this so that they themselves could make
the choice. Or it's not worth doing due to small perf gain?

BRs,
Xiao 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [dpdk-dev] [PATCH] mempool: optimize copy in cache get
  2019-07-01 14:21     ` Wang, Xiao W
@ 2019-07-01 15:00       ` Olivier Matz
  0 siblings, 0 replies; 5+ messages in thread
From: Olivier Matz @ 2019-07-01 15:00 UTC (permalink / raw)
  To: Wang, Xiao W; +Cc: Andrew Rybchenko, dev

Hi,

On Mon, Jul 01, 2019 at 02:21:41PM +0000, Wang, Xiao W wrote:
> Hi,
> 
> > -----Original Message-----
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Monday, July 1, 2019 9:11 PM
> > To: Andrew Rybchenko <arybchenko@solarflare.com>
> > Cc: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> > Subject: Re: [PATCH] mempool: optimize copy in cache get
> > 
> > Hi,
> > 
> > On Tue, May 21, 2019 at 12:34:55PM +0300, Andrew Rybchenko wrote:
> > > On 5/21/19 12:03 PM, Xiao Wang wrote:
> > > > Use rte_memcpy to improve the pointer array copy. This optimization
> > method
> > > > has already been applied to __mempool_generic_put() [1], this patch
> > applies
> > > > it to __mempool_generic_get(). Slight performance gain can be observed
> > in
> > > > testpmd txonly test.
> > > >
> > > > [1] 863bfb47449 ("mempool: optimize copy in cache")
> > > >
> > > > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > > > ---
> > > >   lib/librte_mempool/rte_mempool.h | 7 +------
> > > >   1 file changed, 1 insertion(+), 6 deletions(-)
> > > >
> > > > diff --git a/lib/librte_mempool/rte_mempool.h
> > b/lib/librte_mempool/rte_mempool.h
> > > > index 8053f7a04..975da8d22 100644
> > > > --- a/lib/librte_mempool/rte_mempool.h
> > > > +++ b/lib/librte_mempool/rte_mempool.h
> > > > @@ -1344,15 +1344,11 @@ __mempool_generic_get(struct
> > rte_mempool *mp, void **obj_table,
> > > >   		      unsigned int n, struct rte_mempool_cache *cache)
> > > >   {
> > > >   	int ret;
> > > > -	uint32_t index, len;
> > > > -	void **cache_objs;
> > > >   	/* No cache provided or cannot be satisfied from cache */
> > > >   	if (unlikely(cache == NULL || n >= cache->size))
> > > >   		goto ring_dequeue;
> > > > -	cache_objs = cache->objs;
> > > > -
> > > >   	/* Can this be satisfied from the cache? */
> > > >   	if (cache->len < n) {
> > > >   		/* No. Backfill the cache first, and then fill from it */
> > > > @@ -1375,8 +1371,7 @@ __mempool_generic_get(struct rte_mempool
> > *mp, void **obj_table,
> > > >   	}
> > > >   	/* Now fill in the response ... */
> > > > -	for (index = 0, len = cache->len - 1; index < n; ++index, len--,
> > obj_table++)
> > > > -		*obj_table = cache_objs[len];
> > > > +	rte_memcpy(obj_table, &cache->objs[cache->len - n], sizeof(void *) *
> > n);
> > > >   	cache->len -= n;
> > >
> > > I think the idea of the loop above is to get objects in reverse order to
> > > order
> > > to reuse cache top objects (put last) first. It should improve cache hit
> > > etc.
> > > So, performance effect of the patch could be very different on various CPUs
> > > (with different cache sizes) and various work-loads.
> > >
> > > So, I doubt that it is a step in right direction.
> > 
> > For reference, this was already discussed 3 years ago:
> > 
> > https://mails.dpdk.org/archives/dev/2016-May/039873.html
> > https://mails.dpdk.org/archives/dev/2016-June/040029.html
> > 
> > I'm still not convinced that reversing object addresses (as it's done
> > today) is really important. But Andrew is probably right, the impact of
> > this kind of patch probably varies depending on many factors. More
> > performance numbers on real-life use-cases would help to decide what to
> > do.
> > 
> > Regards,
> > Olivier
> 
> I agree, and thanks for the reference link. So theoretically neither way can be
> a definite best choice, it depends on various real-life factors. I'm thinking about
> how to let app developer be aware of this so that they themselves could make
> the choice. Or it's not worth doing due to small perf gain?

I don't think it's worth doing a dynamic selection for this.

On the other hand, having performance numbers for different
use-cases/archs is welcome. From a pure cpu cycles perspective, the
rte_memcpy() should be faster.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-07-01 15:00 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-21  9:03 [dpdk-dev] [PATCH] mempool: optimize copy in cache get Xiao Wang
2019-05-21  9:34 ` Andrew Rybchenko
2019-07-01 13:11   ` Olivier Matz
2019-07-01 14:21     ` Wang, Xiao W
2019-07-01 15:00       ` Olivier Matz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).