DPDK patches and discussions
* [RFC] mempool: zero-copy cache put bulk
@ 2022-11-05 13:40 Morten Brørup
  2022-11-05 23:11 ` Honnappa Nagarahalli
  0 siblings, 1 reply; 9+ messages in thread
From: Morten Brørup @ 2022-11-05 13:40 UTC (permalink / raw)
  To: dev, olivier.matz, andrew.rybchenko, honnappa.nagarahalli

Zero-copy access to the mempool cache is beneficial for PMD performance, and must be provided by the mempool library to fix [Bug 1052] without a performance regression.

[Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052


This RFC offers a conceptual zero-copy put function, where the application promises to store some objects, and in return gets an address where it can store them.

I would like some early feedback.

Notes:
* Allowing the 'cache' parameter to be NULL, and getting it from the mempool instead, was inspired by rte_mempool_cache_flush().
* Asserting that the 'mp' parameter is not NULL is not done by other functions, so I omitted it here too.

NB: Please ignore formatting. Also, this code has not even been compile tested.

/**
 * Promise to put objects in a mempool via zero-copy access to a user-owned mempool cache.
 *
 * @param cache
 *   A pointer to the mempool cache.
 * @param mp
 *   A pointer to the mempool.
 * @param n
 *   The number of objects to be put in the mempool cache.
 * @return
 *   The pointer to where to put the objects in the mempool cache.
 *   NULL on error
 *   with rte_errno set appropriately.
 */
static __rte_always_inline void *
rte_mempool_cache_put_bulk_promise(struct rte_mempool_cache *cache,
        struct rte_mempool *mp,
        unsigned int n)
{
    void **cache_objs;

    if (cache == NULL)
        cache = rte_mempool_default_cache(mp, rte_lcore_id());
    if (cache == NULL) {
        rte_errno = EINVAL;
        return NULL;
    }

    rte_mempool_trace_cache_put_bulk_promise(cache, mp, n);

    /* The request itself is too big for the cache */
    if (unlikely(n > cache->flushthresh)) {
        rte_errno = EINVAL;
        return NULL;
    }

    /*
     * The cache follows the following algorithm:
     *   1. If the objects cannot be added to the cache without crossing
     *      the flush threshold, flush the cache to the backend.
     *   2. Add the objects to the cache.
     */

    if (cache->len + n <= cache->flushthresh) {
        cache_objs = &cache->objs[cache->len];
        cache->len += n;
    } else {
        cache_objs = &cache->objs[0];
        rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
        cache->len = n;
    }

    RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
    RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);

    return cache_objs;
}
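
For illustration, here is a rough, untested sketch of how a PMD's TX free path might use the function. The 'example_*' types, fields and names below are made up, not taken from any existing PMD:

struct example_tx_entry {
    struct rte_mbuf *mbuf;
};

struct example_tx_queue {
    struct rte_mempool *mp;            /* mbuf mempool */
    struct rte_mempool_cache *cache;   /* looked up once, stored in the queue */
    struct example_tx_entry *sw_ring;  /* per-descriptor bookkeeping */
    uint16_t tx_next_dd;               /* index of the first mbuf to free */
};

static inline void
example_tx_free_bufs(struct example_tx_queue *txq, unsigned int n)
{
    void **cache_objs;
    unsigned int i;

    /* Promise to store 'n' objects; get the address to store them at. */
    cache_objs = rte_mempool_cache_put_bulk_promise(txq->cache, txq->mp, n);
    if (likely(cache_objs != NULL)) {
        /* Zero-copy: write the mbuf pointers directly into the cache. */
        for (i = 0; i < n; i++)
            cache_objs[i] = txq->sw_ring[txq->tx_next_dd + i].mbuf;
    } else {
        /* Left to the PMD: e.g. gather the pointers into a local array
         * and put them with the ordinary mempool API instead.
         */
    }
    txq->tx_next_dd += n;
}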


Med venlig hilsen / Kind regards,
-Morten Brørup



^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC] mempool: zero-copy cache put bulk
  2022-11-05 13:40 [RFC] mempool: zero-copy cache put bulk Morten Brørup
@ 2022-11-05 23:11 ` Honnappa Nagarahalli
  2022-11-06  6:57   ` Morten Brørup
  0 siblings, 1 reply; 9+ messages in thread
From: Honnappa Nagarahalli @ 2022-11-05 23:11 UTC (permalink / raw)
  To: Morten Brørup, dev, olivier.matz, andrew.rybchenko,
	Kamalakshitha Aligeri
  Cc: nd, nd

+ Akshitha, she is working on similar patch

Few comments inline

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Saturday, November 5, 2022 8:40 AM
> To: dev@dpdk.org; olivier.matz@6wind.com;
> andrew.rybchenko@oktetlabs.ru; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: [RFC] mempool: zero-copy cache put bulk
> 
> Zero-copy access to the mempool cache is beneficial for PMD performance,
> and must be provided by the mempool library to fix [Bug 1052] without a
> performance regression.
> 
> [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> 
> 
> This RFC offers a conceptual zero-copy put function, where the application
> promises to store some objects, and in return gets an address where to store
> them.
> 
> I would like some early feedback.
> 
> Notes:
> * Allowing the 'cache' parameter to be NULL, and getting it from the
> mempool instead, was inspired by rte_mempool_cache_flush().
I am not sure why the 'cache' parameter is required for this API. This API should take the mempool as the parameter.

We have based our API on 'rte_mempool_do_generic_put' and removed the 'cache' parameter. This new API, on success, returns the pointer to memory where the objects are copied. On failure it returns NULL and the caller has to call 'rte_mempool_ops_enqueue_bulk'. Alternatively, the new API could do this as well, and the PMD does not need to do anything if it gets a NULL pointer.

We should think about providing a similar API on the RX side to keep it symmetric.

> * Asserting that the 'mp' parameter is not NULL is not done by other
> functions, so I omitted it here too.
> 
> NB: Please ignore formatting. Also, this code has not even been compile
> tested.
We are a little bit ahead: we have tested the changes with the i40e PF PMD, written unit test cases, and are going through internal review. We will send out an RFC on Monday.

> 
> /**
>  * Promise to put objects in a mempool via zero-copy access to a user-owned
> mempool cache.
>  *
>  * @param cache
>  *   A pointer to the mempool cache.
>  * @param mp
>  *   A pointer to the mempool.
>  * @param n
>  *   The number of objects to be put in the mempool cache.
>  * @return
>  *   The pointer to where to put the objects in the mempool cache.
>  *   NULL on error
>  *   with rte_errno set appropriately.
>  */
> static __rte_always_inline void *
> rte_mempool_cache_put_bulk_promise(struct rte_mempool_cache *cache,
>         struct rte_mempool *mp,
>         unsigned int n)
> {
>     void **cache_objs;
> 
>     if (cache == NULL)
>         cache = rte_mempool_default_cache(mp, rte_lcore_id());
>     if (cache == NULL) {
>         rte_errno = EINVAL;
>         return NULL;
>     }
> 
>     rte_mempool_trace_cache_put_bulk_promise(cache, mp, n);
> 
>     /* The request itself is too big for the cache */
>     if (unlikely(n > cache->flushthresh)) {
>         rte_errno = EINVAL;
>         return NULL;
>     }
> 
>     /*
>      * The cache follows the following algorithm:
>      *   1. If the objects cannot be added to the cache without crossing
>      *      the flush threshold, flush the cache to the backend.
>      *   2. Add the objects to the cache.
>      */
> 
>     if (cache->len + n <= cache->flushthresh) {
>         cache_objs = &cache->objs[cache->len];
>         cache->len += n;
>     } else {
>         cache_objs = &cache->objs[0];
>         rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
>         cache->len = n;
>     }
> 
>     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
>     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> 
>     return cache_objs;
> }
> 
> 
> Med venlig hilsen / Kind regards,
> -Morten Brørup
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC] mempool: zero-copy cache put bulk
  2022-11-05 23:11 ` Honnappa Nagarahalli
@ 2022-11-06  6:57   ` Morten Brørup
  2022-11-09 17:57     ` Honnappa Nagarahalli
  0 siblings, 1 reply; 9+ messages in thread
From: Morten Brørup @ 2022-11-06  6:57 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, olivier.matz, andrew.rybchenko,
	Kamalakshitha Aligeri
  Cc: nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Sunday, 6 November 2022 00.11
> 
> + Akshitha, she is working on similar patch
> 
> Few comments inline
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Saturday, November 5, 2022 8:40 AM
> >
> > Zero-copy access to the mempool cache is beneficial for PMD
> performance,
> > and must be provided by the mempool library to fix [Bug 1052] without
> a
> > performance regression.
> >
> > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> >
> >
> > This RFC offers a conceptual zero-copy put function, where the
> application
> > promises to store some objects, and in return gets an address where
> to store
> > them.
> >
> > I would like some early feedback.
> >
> > Notes:
> > * Allowing the 'cache' parameter to be NULL, and getting it from the
> > mempool instead, was inspired by rte_mempool_cache_flush().
> I am not sure why the 'cache' parameter is required for this API. This
> API should take the mem pool as the parameter.
> 
> We have based our API on 'rte_mempool_do_generic_put' and removed the
> 'cache' parameter.

I thoroughly considered omitting the 'cache' parameter, but included it for two reasons:

1. The function is a "mempool cache" function (i.e. primarily working on the mempool cache), not a "mempool" function.

So it is appropriate to have a pointer directly to the structure it is working on. Following this through, I also made 'cache' the first parameter and 'mp' the second, like in rte_mempool_cache_flush().

2. In most cases, the function only accesses the mempool structure in order to get the cache pointer. Skipping this step improves performance.

And since the cache is created along with the mempool itself (and thus never changes for a mempool), it would be safe for the PMD to store the 'cache' pointer along with the 'mp' pointer in the PMD's queue structure.

E.g. in the i40e PMD the i40e_rx_queue structure could include a "struct rte_mempool_cache *cache" field, which could be used in i40e_rxq_rearm() [1] instead of "cache = rte_mempool_default_cache(rxq->mp, rte_lcore_id())".

[1] https://elixir.bootlin.com/dpdk/v22.11-rc2/source/drivers/net/i40e/i40e_rxtx_vec_avx512.c#L31
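
A minimal sketch of what I mean (hypothetical structure and field names; it assumes the cache pointer is populated by the lcore that services the queue, e.g. lazily on the first burst):

/* Hypothetical: cache the mempool cache pointer in the queue structure. */
struct example_rx_queue {
    struct rte_mempool *mp;
    struct rte_mempool_cache *cache;   /* mempool cache of the servicing lcore */
    /* ... */
};

static inline struct rte_mempool_cache *
example_rxq_cache(struct example_rx_queue *rxq)
{
    /* Look the cache up once; afterwards the fast path just reads the field. */
    if (unlikely(rxq->cache == NULL))
        rxq->cache = rte_mempool_default_cache(rxq->mp, rte_lcore_id());
    return rxq->cache;
}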

> This new API, on success, returns the pointer to
> memory where the objects are copied. On failure it returns NULL and the
> caller has to call 'rte_mempool_ops_enqueue_bulk'. Alternatively, the
> new API could do this as well and PMD does not need to do anything if
> it gets a NULL pointer.

Yes, we agree about these two details:

1. The function should return a pointer, not an integer.
It would be a waste to use another CPU register to convey a success/error integer value, when the success/failure information is just as easily conveyed by the pointer return value (non-NULL/NULL), and rte_errno for various error values in the unlikely cases.

2. The function should leave it up to the PMD what to do if direct access to the cache is unavailable.

> 
> We should think about providing  similar API on the RX side to keep it
> symmetric.

I sent an RFC for that too:
http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87488@smartserver.smartshare.dk/T/#u


> 
> > * Asserting that the 'mp' parameter is not NULL is not done by other
> > functions, so I omitted it here too.
> >
> > NB: Please ignore formatting. Also, this code has not even been
> compile
> > tested.
> We are little bit ahead, tested the changes with i40e PF PMD, wrote
> unit test cases, going through internal review, will send out RFC on
> Monday

Sounds good. Looking forward to review.

> 
> >
> > /**
> >  * Promise to put objects in a mempool via zero-copy access to a
> user-owned
> > mempool cache.
> >  *
> >  * @param cache
> >  *   A pointer to the mempool cache.
> >  * @param mp
> >  *   A pointer to the mempool.
> >  * @param n
> >  *   The number of objects to be put in the mempool cache.
> >  * @return
> >  *   The pointer to where to put the objects in the mempool cache.
> >  *   NULL on error
> >  *   with rte_errno set appropriately.
> >  */
> > static __rte_always_inline void *
> > rte_mempool_cache_put_bulk_promise(struct rte_mempool_cache *cache,
> >         struct rte_mempool *mp,
> >         unsigned int n)
> > {
> >     void **cache_objs;
> >
> >     if (cache == NULL)
> >         cache = rte_mempool_default_cache(mp, rte_lcore_id());
> >     if (cache == NULL) {
> >         rte_errno = EINVAL;
> >         return NULL;
> >     }
> >
> >     rte_mempool_trace_cache_put_bulk_promise(cache, mp, n);
> >
> >     /* The request itself is too big for the cache */
> >     if (unlikely(n > cache->flushthresh)) {
> >         rte_errno = EINVAL;
> >         return NULL;
> >     }
> >
> >     /*
> >      * The cache follows the following algorithm:
> >      *   1. If the objects cannot be added to the cache without
> crossing
> >      *      the flush threshold, flush the cache to the backend.
> >      *   2. Add the objects to the cache.
> >      */
> >
> >     if (cache->len + n <= cache->flushthresh) {
> >         cache_objs = &cache->objs[cache->len];
> >         cache->len += n;
> >     } else {
> >         cache_objs = &cache->objs[0];
> >         rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> >         cache->len = n;
> >     }
> >
> >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> >
> >     return cache_objs;
> > }
> >
> >
> > Med venlig hilsen / Kind regards,
> > -Morten Brørup
> >
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC] mempool: zero-copy cache put bulk
  2022-11-06  6:57   ` Morten Brørup
@ 2022-11-09 17:57     ` Honnappa Nagarahalli
  2022-11-09 20:36       ` Morten Brørup
  0 siblings, 1 reply; 9+ messages in thread
From: Honnappa Nagarahalli @ 2022-11-09 17:57 UTC (permalink / raw)
  To: Morten Brørup, dev, olivier.matz, andrew.rybchenko,
	Kamalakshitha Aligeri
  Cc: nd, nd

<snip>

> 
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Sunday, 6 November 2022 00.11
> >
> > + Akshitha, she is working on similar patch
> >
> > Few comments inline
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: Saturday, November 5, 2022 8:40 AM
> > >
> > > Zero-copy access to the mempool cache is beneficial for PMD
> > performance,
> > > and must be provided by the mempool library to fix [Bug 1052]
> > > without
> > a
> > > performance regression.
> > >
> > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > >
> > >
> > > This RFC offers a conceptual zero-copy put function, where the
> > application
> > > promises to store some objects, and in return gets an address where
> > to store
> > > them.
> > >
> > > I would like some early feedback.
> > >
> > > Notes:
> > > * Allowing the 'cache' parameter to be NULL, and getting it from the
> > > mempool instead, was inspired by rte_mempool_cache_flush().
> > I am not sure why the 'cache' parameter is required for this API. This
> > API should take the mem pool as the parameter.
> >
> > We have based our API on 'rte_mempool_do_generic_put' and removed
> the
> > 'cache' parameter.
> 
> I thoroughly considered omitting the 'cache' parameter, but included it for
> two reasons:
> 
> 1. The function is a "mempool cache" function (i.e. primarily working on the
> mempool cache), not a "mempool" function.
> 
> So it is appropriate to have a pointer directly to the structure it is working on.
> Following this through, I also made 'cache' the first parameter and 'mp' the
> second, like in rte_mempool_cache_flush().
I am wondering if the PMD should be aware of the cache or not. For example, in the case of pipeline mode, the RX and TX side of the PMD are running on different cores.
However, since the rte_mempool_cache_flush API is provided, maybe that decision has already been made? Interestingly, rte_mempool_cache_flush is called by just a single PMD.

So, the question is: should we allow zero-copy only for the per-core cache, or for other cases as well?

> 
> 2. In most cases, the function only accesses the mempool structure in order to
> get the cache pointer. Skipping this step improves performance.
> 
> And since the cache is created along with the mempool itself (and thus never
> changes for a mempool), it would be safe for the PMD to store the 'cache'
> pointer along with the 'mp' pointer in the PMD's queue structure.
Agreed

> 
> E.g. in the i40e PMD the i40e_rx_queue structure could include a "struct
> rte_mempool_cache *cache" field, which could be used i40e_rxq_rearm() [1]
> instead of "cache = rte_mempool_default_cache(rxq->mp, rte_lcore_id())".
> 
> [1] https://elixir.bootlin.com/dpdk/v22.11-
> rc2/source/drivers/net/i40e/i40e_rxtx_vec_avx512.c#L31
> 
> > This new API, on success, returns the pointer to memory where the
> > objects are copied. On failure it returns NULL and the caller has to
> > call 'rte_mempool_ops_enqueue_bulk'. Alternatively, the new API could
> > do this as well and PMD does not need to do anything if it gets a NULL
> > pointer.
> 
> Yes, we agree about these two details:
> 
> 1. The function should return a pointer, not an integer.
> It would be a waste to use a another CPU register to convey a success/error
> integer value, when the success/failure information is just as easily conveyed
> by the pointer return value (non-NULL/NULL), and rte_errno for various error
> values in the unlikely cases.
> 
> 2. The function should leave it up to the PMD what to do if direct access to
> the cache is unavailable.
Just wondering about the advantage of this. I do not think PMDs have much of a choice other than calling 'rte_mempool_ops_enqueue_bulk'.

> 
> >
> > We should think about providing  similar API on the RX side to keep it
> > symmetric.
> 
> I sent an RFC for that too:
> http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87488@
> smartserver.smartshare.dk/T/#u
> 
> 
> >
> > > * Asserting that the 'mp' parameter is not NULL is not done by other
> > > functions, so I omitted it here too.
> > >
> > > NB: Please ignore formatting. Also, this code has not even been
> > compile
> > > tested.
> > We are little bit ahead, tested the changes with i40e PF PMD, wrote
> > unit test cases, going through internal review, will send out RFC on
> > Monday
> 
> Sounds good. Looking forward to review.
> 
> >
> > >
> > > /**
> > >  * Promise to put objects in a mempool via zero-copy access to a
> > user-owned
> > > mempool cache.
> > >  *
> > >  * @param cache
> > >  *   A pointer to the mempool cache.
> > >  * @param mp
> > >  *   A pointer to the mempool.
> > >  * @param n
> > >  *   The number of objects to be put in the mempool cache.
> > >  * @return
> > >  *   The pointer to where to put the objects in the mempool cache.
> > >  *   NULL on error
> > >  *   with rte_errno set appropriately.
> > >  */
> > > static __rte_always_inline void *
> > > rte_mempool_cache_put_bulk_promise(struct rte_mempool_cache
> *cache,
> > >         struct rte_mempool *mp,
> > >         unsigned int n)
> > > {
> > >     void **cache_objs;
> > >
> > >     if (cache == NULL)
> > >         cache = rte_mempool_default_cache(mp, rte_lcore_id());
Any reason we need this? If we are expecting the PMD to store the pointer to the cache and a NULL is passed, wouldn't that mean it is a mempool with no per-core cache?
We could also leave the NULL check to the PMD.

> > >     if (cache == NULL) {
> > >         rte_errno = EINVAL;
> > >         return NULL;
> > >     }
> > >
> > >     rte_mempool_trace_cache_put_bulk_promise(cache, mp, n);
> > >
> > >     /* The request itself is too big for the cache */
> > >     if (unlikely(n > cache->flushthresh)) {
> > >         rte_errno = EINVAL;
> > >         return NULL;
> > >     }
> > >
> > >     /*
> > >      * The cache follows the following algorithm:
> > >      *   1. If the objects cannot be added to the cache without
> > crossing
> > >      *      the flush threshold, flush the cache to the backend.
> > >      *   2. Add the objects to the cache.
> > >      */
> > >
> > >     if (cache->len + n <= cache->flushthresh) {
> > >         cache_objs = &cache->objs[cache->len];
> > >         cache->len += n;
> > >     } else {
> > >         cache_objs = &cache->objs[0];
> > >         rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > >         cache->len = n;
> > >     }
> > >
> > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
These are new stats. Do they break ABI compatibility (even though they are under the DEBUG flag)?

> > >
> > >     return cache_objs;
> > > }
> > >
> > >
> > > Med venlig hilsen / Kind regards,
> > > -Morten Brørup
> > >
> >


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC] mempool: zero-copy cache put bulk
  2022-11-09 17:57     ` Honnappa Nagarahalli
@ 2022-11-09 20:36       ` Morten Brørup
  2022-11-09 22:45         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 9+ messages in thread
From: Morten Brørup @ 2022-11-09 20:36 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, olivier.matz, andrew.rybchenko,
	Kamalakshitha Aligeri, Bruce Richardson
  Cc: nd

+To: Bruce, who also showed interest in this topic and might have more insights.

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 9 November 2022 18.58
> 
> <snip>
> 
> >
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Sunday, 6 November 2022 00.11
> > >
> > > + Akshitha, she is working on similar patch
> > >
> > > Few comments inline
> > >
> > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > Sent: Saturday, November 5, 2022 8:40 AM
> > > >
> > > > Zero-copy access to the mempool cache is beneficial for PMD
> > > performance,
> > > > and must be provided by the mempool library to fix [Bug 1052]
> > > > without
> > > a
> > > > performance regression.
> > > >
> > > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > > >
> > > >
> > > > This RFC offers a conceptual zero-copy put function, where the
> > > application
> > > > promises to store some objects, and in return gets an address
> where
> > > to store
> > > > them.
> > > >
> > > > I would like some early feedback.
> > > >
> > > > Notes:
> > > > * Allowing the 'cache' parameter to be NULL, and getting it from
> the
> > > > mempool instead, was inspired by rte_mempool_cache_flush().
> > > I am not sure why the 'cache' parameter is required for this API.
> This
> > > API should take the mem pool as the parameter.
> > >
> > > We have based our API on 'rte_mempool_do_generic_put' and removed
> > the
> > > 'cache' parameter.
> >
> > I thoroughly considered omitting the 'cache' parameter, but included
> it for
> > two reasons:
> >
> > 1. The function is a "mempool cache" function (i.e. primarily working
> on the
> > mempool cache), not a "mempool" function.
> >
> > So it is appropriate to have a pointer directly to the structure it
> is working on.
> > Following this through, I also made 'cache' the first parameter and
> 'mp' the
> > second, like in rte_mempool_cache_flush().
> I am wondering if the PMD should be aware of the cache or not. For ex:
> in the case of pipeline mode, the RX and TX side of the PMD are running
> on different cores.

In that example, the PMD can store two cache pointers, one for each of the RX and TX side.

And if the PMD is unaware of the cache pointer, it can look it up at runtime using rte_lcore_id(), like it does in the current Intel PMDs.

> However, since the rte_mempool_cache_flush API is provided, may be that
> decision is already done? Interestingly, rte_mempool_cache_flush is
> called by just a single PMD.

I intentionally aligned this RFC with rte_mempool_cache_flush() to maintain consistency.

However, the API is not set in stone. It should always be acceptable to consider improved alternatives.

> 
> So, the question is, should we allow zero-copy only for per-core cache
> or for other cases as well.

I suppose that the mempool library was designed to have a mempool associated with exactly one mempool cache per core. (Alternatively, the mempool can be configured with no mempool caches at all.)

We should probably stay loyal to that design concept, and only allow zero-copy for per-core cache.

If you can come up with an example of the opposite, I would like to explore that option too... I can't think of a good example myself, and perhaps I'm overlooking a relevant use case.

> 
> >
> > 2. In most cases, the function only accesses the mempool structure in
> order to
> > get the cache pointer. Skipping this step improves performance.
> >
> > And since the cache is created along with the mempool itself (and
> thus never
> > changes for a mempool), it would be safe for the PMD to store the
> 'cache'
> > pointer along with the 'mp' pointer in the PMD's queue structure.
> Agreed
> 
> >
> > E.g. in the i40e PMD the i40e_rx_queue structure could include a
> "struct
> > rte_mempool_cache *cache" field, which could be used i40e_rxq_rearm()
> [1]
> > instead of "cache = rte_mempool_default_cache(rxq->mp,
> rte_lcore_id())".
> >
> > [1] https://elixir.bootlin.com/dpdk/v22.11-
> > rc2/source/drivers/net/i40e/i40e_rxtx_vec_avx512.c#L31
> >
> > > This new API, on success, returns the pointer to memory where the
> > > objects are copied. On failure it returns NULL and the caller has
> to
> > > call 'rte_mempool_ops_enqueue_bulk'. Alternatively, the new API
> could
> > > do this as well and PMD does not need to do anything if it gets a
> NULL
> > > pointer.
> >
> > Yes, we agree about these two details:
> >
> > 1. The function should return a pointer, not an integer.
> > It would be a waste to use a another CPU register to convey a
> success/error
> > integer value, when the success/failure information is just as easily
> conveyed
> > by the pointer return value (non-NULL/NULL), and rte_errno for
> various error
> > values in the unlikely cases.
> >
> > 2. The function should leave it up to the PMD what to do if direct
> access to
> > the cache is unavailable.
> Just wondering about the advantage of this. I do not think PMD's have
> much of a choice other than calling 'rte_mempool_ops_enqueue_bulk'

I agree, but that was not my point. Let me try to rephrase:

The PMD is more likely to know how to efficiently build the array of mbufs to pass to rte_mempool_ops_enqueue_bulk() than the mempool library - many PMDs already implement a variety of vector instruction variants to do exactly that. So we should not try to be clever and add a fallback path - this job belongs in the PMD.

The PMD might not even have the array of mbufs lined up when calling rte_mempool_cache_put_bulk_promise(). The PMD could have an array of internal structures, where the mbuf pointer is an element in that structure.
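
For example (hypothetical names, untested), the fallback in the PMD could gather the pointers from its own per-descriptor structures into a temporary array and enqueue them directly to the mempool driver:

/* EXAMPLE_TX_FREE_BURST is a made-up upper bound on 'n'. */
void *objs[EXAMPLE_TX_FREE_BURST];
unsigned int i;

/* Gather the mbuf pointers from the PMD's internal structures. */
for (i = 0; i < n; i++)
    objs[i] = txq->sw_ring[txq->tx_next_dd + i].mbuf;

/* Bypass the cache and enqueue directly to the mempool backend. */
rte_mempool_ops_enqueue_bulk(txq->mp, objs, n);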

> 
> >
> > >
> > > We should think about providing  similar API on the RX side to keep
> it
> > > symmetric.
> >
> > I sent an RFC for that too:
> > http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87488@
> > smartserver.smartshare.dk/T/#u
> >
> >
> > >
> > > > * Asserting that the 'mp' parameter is not NULL is not done by
> other
> > > > functions, so I omitted it here too.
> > > >
> > > > NB: Please ignore formatting. Also, this code has not even been
> > > compile
> > > > tested.
> > > We are little bit ahead, tested the changes with i40e PF PMD, wrote
> > > unit test cases, going through internal review, will send out RFC
> on
> > > Monday
> >
> > Sounds good. Looking forward to review.
> >
> > >
> > > >
> > > > /**
> > > >  * Promise to put objects in a mempool via zero-copy access to a
> > > user-owned
> > > > mempool cache.
> > > >  *
> > > >  * @param cache
> > > >  *   A pointer to the mempool cache.
> > > >  * @param mp
> > > >  *   A pointer to the mempool.
> > > >  * @param n
> > > >  *   The number of objects to be put in the mempool cache.
> > > >  * @return
> > > >  *   The pointer to where to put the objects in the mempool
> cache.
> > > >  *   NULL on error
> > > >  *   with rte_errno set appropriately.
> > > >  */
> > > > static __rte_always_inline void *
> > > > rte_mempool_cache_put_bulk_promise(struct rte_mempool_cache
> > *cache,
> > > >         struct rte_mempool *mp,
> > > >         unsigned int n)
> > > > {
> > > >     void **cache_objs;
> > > >
> > > >     if (cache == NULL)
> > > >         cache = rte_mempool_default_cache(mp, rte_lcore_id());
> Any reason we need this? If we are expecting the PMD to store the
> pointer to cache and a NULL is passed, it would mean it is a mempool
> with no per-core cache?
> We could also leave the NULL check to the PMD.

Personally, I would strongly prefer requiring the cache pointer to be valid and using RTE_ASSERT() here, instead of allowing a NULL pointer as a special case and looking it up inside the function. I consider this special NULL case useless bloat, which does not belong in a fast path library function.

But I copied this approach from rte_mempool_cache_flush().

We could expose an "unsafe" function, where passing NULL pointers is not allowed, and a "safe" function (fixing the cache pointer if NULL) for consistency.

If the rte_mempool_cache_flush() function is popular, we could also expose an "unsafe" variant of it, where passing NULL pointers is disallowed.

I wonder if there are any examples of such safe/unsafe variants in DPDK? It would be nice to have a common naming convention for such function variants.
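
For the sake of discussion, the "unsafe" variant of the function in this RFC could look as follows (hypothetical name, untested; the body is identical to the proposed function, only the NULL handling differs):

static __rte_always_inline void *
rte_mempool_cache_put_bulk_promise_unsafe(struct rte_mempool_cache *cache,
        struct rte_mempool *mp,
        unsigned int n)
{
    void **cache_objs;

    /* The caller must provide a valid cache pointer. */
    RTE_ASSERT(cache != NULL);

    rte_mempool_trace_cache_put_bulk_promise(cache, mp, n);

    /* The request itself is too big for the cache. */
    if (unlikely(n > cache->flushthresh)) {
        rte_errno = EINVAL;
        return NULL;
    }

    /* Flush the cache to the backend if the 'n' objects do not fit. */
    if (cache->len + n <= cache->flushthresh) {
        cache_objs = &cache->objs[cache->len];
        cache->len += n;
    } else {
        cache_objs = &cache->objs[0];
        rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
        cache->len = n;
    }

    RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
    RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);

    return cache_objs;
}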

> 
> > > >     if (cache == NULL) {
> > > >         rte_errno = EINVAL;
> > > >         return NULL;
> > > >     }
> > > >
> > > >     rte_mempool_trace_cache_put_bulk_promise(cache, mp, n);
> > > >
> > > >     /* The request itself is too big for the cache */
> > > >     if (unlikely(n > cache->flushthresh)) {
> > > >         rte_errno = EINVAL;
> > > >         return NULL;
> > > >     }
> > > >
> > > >     /*
> > > >      * The cache follows the following algorithm:
> > > >      *   1. If the objects cannot be added to the cache without
> > > crossing
> > > >      *      the flush threshold, flush the cache to the backend.
> > > >      *   2. Add the objects to the cache.
> > > >      */
> > > >
> > > >     if (cache->len + n <= cache->flushthresh) {
> > > >         cache_objs = &cache->objs[cache->len];
> > > >         cache->len += n;
> > > >     } else {
> > > >         cache_objs = &cache->objs[0];
> > > >         rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > > >         cache->len = n;
> > > >     }
> > > >
> > > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> > > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> These are new stats. Do these break ABI compatibility (though these are
> under DEBUG flag)?

They are not mempool cache stats; they are only kept in the cache structure to provide alternative (i.e. faster) update access to some (i.e. the most often updated) of the existing mempool stats. The patch is [1], and it is part of a series currently under discussion as to whether it should go into 22.11-rc3 or not [2].

[1]: https://patchwork.dpdk.org/project/dpdk/patch/20221109181852.109856-3-mb@smartsharesystems.com/
[2]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D874A6@smartserver.smartshare.dk/T/#m41bf4e8bd886db49f11c8dbd63741b353277082f

> 
> > > >
> > > >     return cache_objs;
> > > > }
> > > >
> > > >
> > > > Med venlig hilsen / Kind regards,
> > > > -Morten Brørup
> > > >
> > >
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC] mempool: zero-copy cache put bulk
  2022-11-09 20:36       ` Morten Brørup
@ 2022-11-09 22:45         ` Honnappa Nagarahalli
  2022-11-10 10:15           ` Morten Brørup
  0 siblings, 1 reply; 9+ messages in thread
From: Honnappa Nagarahalli @ 2022-11-09 22:45 UTC (permalink / raw)
  To: Morten Brørup, dev, olivier.matz, andrew.rybchenko,
	Kamalakshitha Aligeri, Bruce Richardson
  Cc: nd, nd

<snip>

> 
> +To: Bruce also showed interest in this topic, and might have more insights.
> 
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Wednesday, 9 November 2022 18.58
> >
> > <snip>
> >
> > >
> > > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > > Sent: Sunday, 6 November 2022 00.11
> > > >
> > > > + Akshitha, she is working on similar patch
> > > >
> > > > Few comments inline
> > > >
> > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > Sent: Saturday, November 5, 2022 8:40 AM
> > > > >
> > > > > Zero-copy access to the mempool cache is beneficial for PMD
> > > > performance,
> > > > > and must be provided by the mempool library to fix [Bug 1052]
> > > > > without
> > > > a
> > > > > performance regression.
> > > > >
> > > > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > > > >
> > > > >
> > > > > This RFC offers a conceptual zero-copy put function, where the
> > > > application
> > > > > promises to store some objects, and in return gets an address
> > where
> > > > to store
> > > > > them.
> > > > >
> > > > > I would like some early feedback.
> > > > >
> > > > > Notes:
> > > > > * Allowing the 'cache' parameter to be NULL, and getting it from
> > the
> > > > > mempool instead, was inspired by rte_mempool_cache_flush().
> > > > I am not sure why the 'cache' parameter is required for this API.
> > This
> > > > API should take the mem pool as the parameter.
> > > >
> > > > We have based our API on 'rte_mempool_do_generic_put' and removed
> > > the
> > > > 'cache' parameter.
> > >
> > > I thoroughly considered omitting the 'cache' parameter, but included
> > it for
> > > two reasons:
> > >
> > > 1. The function is a "mempool cache" function (i.e. primarily
> > > working
> > on the
> > > mempool cache), not a "mempool" function.
> > >
> > > So it is appropriate to have a pointer directly to the structure it
> > is working on.
> > > Following this through, I also made 'cache' the first parameter and
> > 'mp' the
> > > second, like in rte_mempool_cache_flush().
> > I am wondering if the PMD should be aware of the cache or not. For ex:
> > in the case of pipeline mode, the RX and TX side of the PMD are
> > running on different cores.
> 
> In that example, the PMD can store two cache pointers, one for each of the
> RX and TX side.
I did not understand this. If the RX core and TX core have their own per-core caches, the logic would not work. For example, the RX core cache would not get filled.

In the case of pipeline mode, there will not be a per-core cache. The buffers would be allocated and freed from a global ring or a global lockless stack.

> 
> And if the PMD is unaware of the cache pointer, it can look it up at runtime
> using rte_lcore_id(), like it does in the current Intel PMDs.
> 
> > However, since the rte_mempool_cache_flush API is provided, may be
> > that decision is already done? Interestingly, rte_mempool_cache_flush
> > is called by just a single PMD.
> 
> I intentionally aligned this RFC with rte_mempool_cache_flush() to maintain
> consistency.
> 
> However, the API is not set in stone. It should always be acceptable to
> consider improved alternatives.
> 
> >
> > So, the question is, should we allow zero-copy only for per-core cache
> > or for other cases as well.
> 
> I suppose that the mempool library was designed to have a mempool
> associated with exactly one mempool cache per core. (Alternatively, the
> mempool can be configured with no mempool caches at all.)
> 
> We should probably stay loyal to that design concept, and only allow zero-
> copy for per-core cache.
> 
> If you can come up with an example of the opposite, I would like to explore
> that option too... I can't think of a good example myself, and perhaps I'm
> overlooking a relevant use case.
The use case I am talking about is the pipeline mode as I mentioned above. Let me know if you agree.

> 
> >
> > >
> > > 2. In most cases, the function only accesses the mempool structure
> > > in
> > order to
> > > get the cache pointer. Skipping this step improves performance.
> > >
> > > And since the cache is created along with the mempool itself (and
> > thus never
> > > changes for a mempool), it would be safe for the PMD to store the
> > 'cache'
> > > pointer along with the 'mp' pointer in the PMD's queue structure.
> > Agreed
> >
> > >
> > > E.g. in the i40e PMD the i40e_rx_queue structure could include a
> > "struct
> > > rte_mempool_cache *cache" field, which could be used
> > > i40e_rxq_rearm()
> > [1]
> > > instead of "cache = rte_mempool_default_cache(rxq->mp,
> > rte_lcore_id())".
> > >
> > > [1] https://elixir.bootlin.com/dpdk/v22.11-
> > > rc2/source/drivers/net/i40e/i40e_rxtx_vec_avx512.c#L31
> > >
> > > > This new API, on success, returns the pointer to memory where the
> > > > objects are copied. On failure it returns NULL and the caller has
> > to
> > > > call 'rte_mempool_ops_enqueue_bulk'. Alternatively, the new API
> > could
> > > > do this as well and PMD does not need to do anything if it gets a
> > NULL
> > > > pointer.
> > >
> > > Yes, we agree about these two details:
> > >
> > > 1. The function should return a pointer, not an integer.
> > > It would be a waste to use a another CPU register to convey a
> > success/error
> > > integer value, when the success/failure information is just as
> > > easily
> > conveyed
> > > by the pointer return value (non-NULL/NULL), and rte_errno for
> > various error
> > > values in the unlikely cases.
> > >
> > > 2. The function should leave it up to the PMD what to do if direct
> > access to
> > > the cache is unavailable.
> > Just wondering about the advantage of this. I do not think PMD's have
> > much of a choice other than calling 'rte_mempool_ops_enqueue_bulk'
> 
> I agree, but that was not my point. Let me try to rephrase:
> 
> The PMD is more likely to know how to efficiently build the array of mbufs to
> pass to rte_mempool_ops_enqueue_bulk() than the mempool library - many
> PMDs already implement a variety of vector instruction variants to do exactly
> that. So we should not try to be clever and add a fallback path - this job
> belongs in the PMD.
> 
> The PMD might not even have the array of mbufs lined up when calling
> rte_mempool_cache_put_bulk_promise(). The PMD could have an array of
> internal structures, where the mbuf pointer is an element in that structure.
Agree, you are correct. We should leave it to the PMD to handle the failure case.

> 
> >
> > >
> > > >
> > > > We should think about providing  similar API on the RX side to
> > > > keep
> > it
> > > > symmetric.
> > >
> > > I sent an RFC for that too:
> > >
> http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87488@
> > > smartserver.smartshare.dk/T/#u
> > >
> > >
> > > >
> > > > > * Asserting that the 'mp' parameter is not NULL is not done by
> > other
> > > > > functions, so I omitted it here too.
> > > > >
> > > > > NB: Please ignore formatting. Also, this code has not even been
> > > > compile
> > > > > tested.
> > > > We are little bit ahead, tested the changes with i40e PF PMD,
> > > > wrote unit test cases, going through internal review, will send
> > > > out RFC
> > on
> > > > Monday
> > >
> > > Sounds good. Looking forward to review.
> > >
> > > >
> > > > >
> > > > > /**
> > > > >  * Promise to put objects in a mempool via zero-copy access to a
> > > > user-owned
> > > > > mempool cache.
> > > > >  *
> > > > >  * @param cache
> > > > >  *   A pointer to the mempool cache.
> > > > >  * @param mp
> > > > >  *   A pointer to the mempool.
> > > > >  * @param n
> > > > >  *   The number of objects to be put in the mempool cache.
> > > > >  * @return
> > > > >  *   The pointer to where to put the objects in the mempool
> > cache.
> > > > >  *   NULL on error
> > > > >  *   with rte_errno set appropriately.
> > > > >  */
> > > > > static __rte_always_inline void *
> > > > > rte_mempool_cache_put_bulk_promise(struct rte_mempool_cache
> > > *cache,
> > > > >         struct rte_mempool *mp,
> > > > >         unsigned int n)
> > > > > {
> > > > >     void **cache_objs;
> > > > >
> > > > >     if (cache == NULL)
> > > > >         cache = rte_mempool_default_cache(mp, rte_lcore_id());
> > Any reason we need this? If we are expecting the PMD to store the
> > pointer to cache and a NULL is passed, it would mean it is a mempool
> > with no per-core cache?
> > We could also leave the NULL check to the PMD.
> 
> Personally, I would strongly prefer requiring the cache pointer to be valid,
> and use RTE_ASSERT() here, instead of allowing a NULL pointer as a special
> case to look it up inside the function. I consider this special NULL case useless
> bloat, which does not belong in a fast path library function.
> 
> But I copied this approach from rte_mempool_cache_flush().
The API definition does not require it to do this check. We might be able to delete the check in rte_mempool_cache_flush.

> 
> We could expose an "unsafe" function where is not allowed to pass NULL
> pointers, and a "safe" function (fixing the cache pointer if NULL) for
> consistency.
> 
> If the rte_mempool_cache_flush() function is popular, we could also expose
> an "unsafe" variant where passing NULL pointers are disallowed.
> 
> I wonder if there are any examples of such safe/unsafe variants in DPDK? It
> would be nice with a common naming convention for such function variants.
> 
> >
> > > > >     if (cache == NULL) {
> > > > >         rte_errno = EINVAL;
> > > > >         return NULL;
> > > > >     }
> > > > >
> > > > >     rte_mempool_trace_cache_put_bulk_promise(cache, mp, n);
> > > > >
> > > > >     /* The request itself is too big for the cache */
> > > > >     if (unlikely(n > cache->flushthresh)) {
> > > > >         rte_errno = EINVAL;
> > > > >         return NULL;
> > > > >     }
> > > > >
> > > > >     /*
> > > > >      * The cache follows the following algorithm:
> > > > >      *   1. If the objects cannot be added to the cache without
> > > > crossing
> > > > >      *      the flush threshold, flush the cache to the backend.
> > > > >      *   2. Add the objects to the cache.
> > > > >      */
> > > > >
> > > > >     if (cache->len + n <= cache->flushthresh) {
> > > > >         cache_objs = &cache->objs[cache->len];
> > > > >         cache->len += n;
> > > > >     } else {
> > > > >         cache_objs = &cache->objs[0];
> > > > >         rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > > > >         cache->len = n;
> > > > >     }
> > > > >
> > > > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> > > > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> > These are new stats. Do these break ABI compatibility (though these
> > are under DEBUG flag)?
> 
> They are not mempool cache stats, they are only kept in the cache structure
> to provide alternative (i.e. faster) update access to some (i.e. the most often
> updated) of the existing mempool stats. The patch is [1], and part of a series
> currently being discussed if should go into 22.11-rc3 or not [2].
> 
> [1]:
> https://patchwork.dpdk.org/project/dpdk/patch/20221109181852.109856-3-
> mb@smartsharesystems.com/
> [2]:
> http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D874A6
> @smartserver.smartshare.dk/T/#m41bf4e8bd886db49f11c8dbd63741b35327
> 7082f
> 
> >
> > > > >
> > > > >     return cache_objs;
> > > > > }
> > > > >
> > > > >
> > > > > Med venlig hilsen / Kind regards, -Morten Brørup
> > > > >
> > > >
> >


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC] mempool: zero-copy cache put bulk
  2022-11-09 22:45         ` Honnappa Nagarahalli
@ 2022-11-10 10:15           ` Morten Brørup
  2022-11-10 11:00             ` Bruce Richardson
  0 siblings, 1 reply; 9+ messages in thread
From: Morten Brørup @ 2022-11-10 10:15 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, olivier.matz, andrew.rybchenko,
	Kamalakshitha Aligeri, Bruce Richardson
  Cc: nd, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 9 November 2022 23.46
> >
> > +To: Bruce also showed interest in this topic, and might have more
> insights.
> >
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Wednesday, 9 November 2022 18.58
> > >
> > > <snip>
> > >
> > > >
> > > > > From: Honnappa Nagarahalli
> [mailto:Honnappa.Nagarahalli@arm.com]
> > > > > Sent: Sunday, 6 November 2022 00.11
> > > > >
> > > > > + Akshitha, she is working on similar patch
> > > > >
> > > > > Few comments inline
> > > > >
> > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > Sent: Saturday, November 5, 2022 8:40 AM
> > > > > >
> > > > > > Zero-copy access to the mempool cache is beneficial for PMD
> > > > > performance,
> > > > > > and must be provided by the mempool library to fix [Bug 1052]
> > > > > > without
> > > > > a
> > > > > > performance regression.
> > > > > >
> > > > > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > > > > >
> > > > > >
> > > > > > This RFC offers a conceptual zero-copy put function, where
> the
> > > > > application
> > > > > > promises to store some objects, and in return gets an address
> > > where
> > > > > to store
> > > > > > them.
> > > > > >
> > > > > > I would like some early feedback.
> > > > > >
> > > > > > Notes:
> > > > > > * Allowing the 'cache' parameter to be NULL, and getting it
> from
> > > the
> > > > > > mempool instead, was inspired by rte_mempool_cache_flush().
> > > > > I am not sure why the 'cache' parameter is required for this
> API.
> > > This
> > > > > API should take the mem pool as the parameter.
> > > > >
> > > > > We have based our API on 'rte_mempool_do_generic_put' and
> removed
> > > > the
> > > > > 'cache' parameter.
> > > >
> > > > I thoroughly considered omitting the 'cache' parameter, but
> included
> > > it for
> > > > two reasons:
> > > >
> > > > 1. The function is a "mempool cache" function (i.e. primarily
> > > > working
> > > on the
> > > > mempool cache), not a "mempool" function.
> > > >
> > > > So it is appropriate to have a pointer directly to the structure
> it
> > > is working on.
> > > > Following this through, I also made 'cache' the first parameter
> and
> > > 'mp' the
> > > > second, like in rte_mempool_cache_flush().
> > > I am wondering if the PMD should be aware of the cache or not. For
> ex:
> > > in the case of pipeline mode, the RX and TX side of the PMD are
> > > running on different cores.
> >
> > In that example, the PMD can store two cache pointers, one for each
> of the
> > RX and TX side.
> I did not understand this. If RX core and TX core have their own per-
> core caches the logic would not work. For ex: the RX core cache would
> not get filled.
> 
> In the case of pipeline mode, there will not be a per-core cache. The
> buffers would be allocated and freed from a global ring or a global
> lockless stack.

Aha... Now I understand what you mean: You are referring to use cases where the mempool is configured to *not* have a mempool cache.

For a mempool without a mempool cache, the proposed "mempool cache" zero-copy functions can obviously not be used.

We need "mempool" zero-copy functions for the mempools that have no mempool cache.

However, those functions depend on the mempool's underlying backing store.

E.g. zero-copy access to a ring has certain requirements [1].

[1]: http://doc.dpdk.org/guides/prog_guide/ring_lib.html#ring-peek-zero-copy-api

For a stack, I think it is possible to locklessly zero-copy pop objects. But it is impossible to locklessly zero-copy push elements to a stack; another thread can race to pop some objects from the stack before the pushing thread has finished writing them into the stack.

Furthermore, the ring zero-copy get function cannot return a consecutive array of objects when wrapping, and PMD functions using vector instructions usually rely on handling chunks of e.g. 8 objects.

Just for a second, let me theorize into the absurd: Even worse, if a mempool's underlying backing store does not use an array of pointers as its internal storage structure, it is impossible to use a pointer to an array of pointers for zero-copy transactions. E.g. if the backing store uses a list or a tree structure for its storage, a pointer to somewhere in the list or tree structure is not an array of object pointers.

Anyway, we could consider designing a generic API for zero-copy mempool get/put; but it should be compatible with all underlying backing stores - or return failure, so the PMD can fall back to the standard functions, if the mempool is in a state where zero-copy access to a contiguous burst cannot be provided. E.g. zero-copy get from a ring can return failure when zero-copy access to the ring is temporarily unavailable due to being at a point where it would wrap.
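
(For reference: if I remember the ring "peek" zero-copy API in [1] correctly, it handles the wrap-around by returning the reserved area in up to two pieces, and it is only supported for certain ring sync modes. A rough sketch from memory follows - please check rte_ring_peek_zc.h rather than trusting it:)

/* Rough sketch, from memory: 'r', 'n' and 'use_objects' are assumed to
 * exist in the caller, and the ring must use a supported sync mode
 * (single-producer/consumer or HTS, if I recall correctly).
 */
struct rte_ring_zc_data zcd;
unsigned int avail, done;

done = rte_ring_dequeue_zc_burst_start(r, n, &zcd, &avail);
if (done != 0) {
    /* zcd.n1 objects are contiguous at zcd.ptr1; if the ring wrapped,
     * the remaining (done - zcd.n1) objects are at zcd.ptr2 - which is
     * exactly why one consecutive array cannot always be returned.
     */
    use_objects(zcd.ptr1, zcd.n1);             /* made-up consumer */
    if (done != zcd.n1)
        use_objects(zcd.ptr2, done - zcd.n1);  /* wrapped remainder */
    rte_ring_dequeue_zc_finish(r, done);
}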

Here is a conceptual proposal for such an API.

/* Mempool zero-copy transaction state. Opaque outside the mempool API. */
struct rte_mempool_zc_transaction_state {
	char	opaque[RTE_CACHE_LINE_SIZE];
};

/** Start zero-copy get/put bulk transaction.
 *
 * @param[in] mp
 *   Pointer to the mempool.
 * @param[out] obj_table_ptr
 *   Where to store the pointer to
 *   the zero-copy array of objects in the mempool.
 * @param[in] n
 *   Number of objects in the transaction.
 * @param[in] cache
 *   Pointer to the mempool cache. May be NULL if unknown.
 * @param[out] transaction
 *   Where to store the opaque transaction information.
 *   Used internally by the mempool library.
 * @return
 *   - 1: Transaction completed;
 *        '_finish' must not be called.
 *   - 0: Transaction started;
 *        '_finish' must be called to complete the transaction.
 *   - <0: Error; failure code.
 */
static __rte_always_inline int
rte_mempool_get/put_zc_bulk_start(
	struct rte_mempool *mp,
	void ***obj_table_ptr,
	unsigned int n,
	struct rte_mempool_cache *cache,
	struct rte_mempool_zc_transaction_state *transaction);

/** Finish zero-copy get/put bulk transaction.
 *
 * @param[in] mp
 *   Pointer to mempool.
 * @param[in] obj_table_ptr
 *   Pointer to the zero-copy array of objects in the mempool,
 *   returned by the 'start' function.
 * @param[in] n
 *   Number of objects in the transaction.
 *   Must be the same as for the 'start' function.
 * @param[in] transaction
 *   Opaque transaction information,
 *   returned by the 'start' function.
 *   Used internally by the mempool library.
 */
static __rte_always_inline void
rte_mempool_get/put_zc_bulk_finish(
	struct rte_mempool *mp,
	void **obj_table,
	unsigned int n,
	struct rte_mempool_zc_transaction_state *transaction);

Note that these are *bulk* functions, so 'n' has to remain the same for a 'finish' call as it was for the 'start' call of a transaction.

And then the underlying backing stores would need to provide callbacks that implement these functions, if they offer zero-copy functionality.

The mempool implementation of these could start by checking for a mempool cache, and use the "mempool cache" zero-copy if present.

Some internal state information (from the mempool library or the underlying mempool backing store) may need to be carried over from the 'start' to the 'finish' function, so I have added a transaction state parameter. The transaction state must be held by the application for thread safety reasons. Think of this like the 'flags' parameter to the Linux kernel's spin_lock_irqsave/irqrestore() functions.

We could omit the 'obj_table' and 'n' parameters from the 'finish' functions and store them in the transaction state if needed; but we might possibly achieve higher performance by passing them as parameters instead.
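
To make the intended use concrete, here is a hedged sketch of a PMD TX free path driving the 'put' instantiation of the placeholder names above (the txq fields and the fallback function are made up, and the fragment is assumed to live inside such a free function):

struct rte_mempool_zc_transaction_state trans;
void **objs;
unsigned int i;
int ret;

ret = rte_mempool_put_zc_bulk_start(mp, &objs, n, cache, &trans);
if (ret < 0) {
    /* Zero-copy unavailable (e.g. a ring about to wrap);
     * fall back to the PMD's ordinary bulk put path.
     */
    example_tx_free_bufs_fallback(txq, n);   /* made-up fallback */
    return;
}

/* Write the object pointers directly into the mempool's storage. */
for (i = 0; i < n; i++)
    objs[i] = txq->sw_ring[txq->tx_next_dd + i].mbuf;

if (ret == 0)   /* transaction started; must be completed */
    rte_mempool_put_zc_bulk_finish(mp, objs, n, &trans);
/* ret == 1: the transaction already completed in the 'start' call. */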

> 
> >
> > And if the PMD is unaware of the cache pointer, it can look it up at
> runtime
> > using rte_lcore_id(), like it does in the current Intel PMDs.
> >
> > > However, since the rte_mempool_cache_flush API is provided, may be
> > > that decision is already done? Interestingly,
> rte_mempool_cache_flush
> > > is called by just a single PMD.
> >
> > I intentionally aligned this RFC with rte_mempool_cache_flush() to
> maintain
> > consistency.
> >
> > However, the API is not set in stone. It should always be acceptable
> to
> > consider improved alternatives.
> >
> > >
> > > So, the question is, should we allow zero-copy only for per-core
> cache
> > > or for other cases as well.
> >
> > I suppose that the mempool library was designed to have a mempool
> > associated with exactly one mempool cache per core. (Alternatively,
> the
> > mempool can be configured with no mempool caches at all.)
> >
> > We should probably stay loyal to that design concept, and only allow
> zero-
> > copy for per-core cache.
> >
> > If you can come up with an example of the opposite, I would like to
> explore
> > that option too... I can't think of a good example myself, and
> perhaps I'm
> > overlooking a relevant use case.
> The use case I am talking about is the pipeline mode as I mentioned
> above. Let me know if you agree.

I see what you mean, and I don't object. :-)

However, I still think the "mempool cache" zero-copy functions could be useful.

They would be needed for the generic "mempool" zero-copy functions anyway.

And the "mempool cache" zero-copy functions are much simpler to design, implement and use than the "mempool" zero-copy functions, so it is a good first step.

> 
> >
> > >
> > > >
> > > > 2. In most cases, the function only accesses the mempool
> structure
> > > > in
> > > order to
> > > > get the cache pointer. Skipping this step improves performance.
> > > >
> > > > And since the cache is created along with the mempool itself (and
> > > thus never
> > > > changes for a mempool), it would be safe for the PMD to store the
> > > 'cache'
> > > > pointer along with the 'mp' pointer in the PMD's queue structure.
> > > Agreed
> > >
> > > >
> > > > E.g. in the i40e PMD the i40e_rx_queue structure could include a
> > > "struct
> > > > rte_mempool_cache *cache" field, which could be used
> > > > i40e_rxq_rearm()
> > > [1]
> > > > instead of "cache = rte_mempool_default_cache(rxq->mp,
> > > rte_lcore_id())".
> > > >
> > > > [1] https://elixir.bootlin.com/dpdk/v22.11-
> > > > rc2/source/drivers/net/i40e/i40e_rxtx_vec_avx512.c#L31
> > > >
> > > > > This new API, on success, returns the pointer to memory where
> the
> > > > > objects are copied. On failure it returns NULL and the caller
> has
> > > to
> > > > > call 'rte_mempool_ops_enqueue_bulk'. Alternatively, the new API
> > > could
> > > > > do this as well and PMD does not need to do anything if it gets
> a
> > > NULL
> > > > > pointer.
> > > >
> > > > Yes, we agree about these two details:
> > > >
> > > > 1. The function should return a pointer, not an integer.
> > > > It would be a waste to use a another CPU register to convey a
> > > success/error
> > > > integer value, when the success/failure information is just as
> > > > easily
> > > conveyed
> > > > by the pointer return value (non-NULL/NULL), and rte_errno for
> > > various error
> > > > values in the unlikely cases.
> > > >
> > > > 2. The function should leave it up to the PMD what to do if
> direct
> > > access to
> > > > the cache is unavailable.
> > > Just wondering about the advantage of this. I do not think PMD's
> have
> > > much of a choice other than calling 'rte_mempool_ops_enqueue_bulk'
> >
> > I agree, but that was not my point. Let me try to rephrase:
> >
> > The PMD is more likely to know how to efficiently build the array of
> mbufs to
> > pass to rte_mempool_ops_enqueue_bulk() than the mempool library -
> many
> > PMDs already implement a variety of vector instruction variants to do
> exactly
> > that. So we should not try to be clever and add a fallback path -
> this job
> > belongs in the PMD.
> >
> > The PMD might not even have the array of mbufs lined up when calling
> > rte_mempool_cache_put_bulk_promise(). The PMD could have an array of
> > internal structures, where the mbuf pointer is an element in that
> structure.
> Agree, you are correct. We should leave it to PMD to handle the failure
> case.
> 
> >
> > >
> > > >
> > > > >
> > > > > We should think about providing  similar API on the RX side to
> > > > > keep
> > > it
> > > > > symmetric.
> > > >
> > > > I sent an RFC for that too:
> > > >
> > http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87488@
> > > > smartserver.smartshare.dk/T/#u
> > > >
> > > >
> > > > >
> > > > > > * Asserting that the 'mp' parameter is not NULL is not done
> by
> > > other
> > > > > > functions, so I omitted it here too.
> > > > > >
> > > > > > NB: Please ignore formatting. Also, this code has not even
> been
> > > > > compile
> > > > > > tested.
> > > > > We are little bit ahead, tested the changes with i40e PF PMD,
> > > > > wrote unit test cases, going through internal review, will send
> > > > > out RFC
> > > on
> > > > > Monday
> > > >
> > > > Sounds good. Looking forward to review.
> > > >
> > > > >
> > > > > >
> > > > > > /**
> > > > > >  * Promise to put objects in a mempool via zero-copy access
> to a
> > > > > user-owned
> > > > > > mempool cache.
> > > > > >  *
> > > > > >  * @param cache
> > > > > >  *   A pointer to the mempool cache.
> > > > > >  * @param mp
> > > > > >  *   A pointer to the mempool.
> > > > > >  * @param n
> > > > > >  *   The number of objects to be put in the mempool cache.
> > > > > >  * @return
> > > > > >  *   The pointer to where to put the objects in the mempool
> > > cache.
> > > > > >  *   NULL on error
> > > > > >  *   with rte_errno set appropriately.
> > > > > >  */
> > > > > > static __rte_always_inline void *
> > > > > > rte_mempool_cache_put_bulk_promise(struct rte_mempool_cache
> > > > *cache,
> > > > > >         struct rte_mempool *mp,
> > > > > >         unsigned int n)
> > > > > > {
> > > > > >     void **cache_objs;
> > > > > >
> > > > > >     if (cache == NULL)
> > > > > >         cache = rte_mempool_default_cache(mp, rte_lcore_id());
> > > Any reason we need this? If we are expecting the PMD to store the
> > > pointer to cache and a NULL is passed, it would mean it is a
> mempool
> > > with no per-core cache?
> > > We could also leave the NULL check to the PMD.
> >
> > Personally, I would strongly prefer requiring the cache pointer to be
> > valid, and use RTE_ASSERT() here, instead of allowing a NULL pointer as
> > a special case to look it up inside the function. I consider this
> > special NULL case useless bloat, which does not belong in a fast path
> > library function.
> >
> > But I copied this approach from rte_mempool_cache_flush().
> The API definition does not bind it to do this check. We might be able
> to delete the check in rte_mempool_cache_flush.
> 
> >
> > We could expose an "unsafe" function where passing NULL pointers is not
> > allowed, and a "safe" function (fixing the cache pointer if NULL) for
> > consistency.
> >
> > If the rte_mempool_cache_flush() function is popular, we could also
> > expose an "unsafe" variant where passing NULL pointers is disallowed.
> >
> > I wonder if there are any examples of such safe/unsafe variants in
> > DPDK? It would be nice to have a common naming convention for such
> > function variants.
> >
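> > Just to make the "unsafe" idea concrete - conceptual only, including the
> > name - the NULL handling would simply become an assertion:
> >
> > static __rte_always_inline void *
> > rte_mempool_cache_put_bulk_promise_unsafe(struct rte_mempool_cache *cache,
> >         struct rte_mempool *mp, unsigned int n)
> > {
> >     void **cache_objs;
> >
> >     RTE_ASSERT(cache != NULL); /* caller must pass a valid cache pointer */
> >     /* ... same body as the RFC function, minus the NULL lookup and the
> >      * EINVAL return ... */
> > }
> >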
> > >
> > > > > >     if (cache == NULL) {
> > > > > >         rte_errno = EINVAL;
> > > > > >         return NULL;
> > > > > >     }
> > > > > >
> > > > > >     rte_mempool_trace_cache_put_bulk_promise(cache, mp, n);
> > > > > >
> > > > > >     /* The request itself is too big for the cache */
> > > > > >     if (unlikely(n > cache->flushthresh)) {
> > > > > >         rte_errno = EINVAL;
> > > > > >         return NULL;
> > > > > >     }
> > > > > >
> > > > > >     /*
> > > > > >      * The cache follows the following algorithm:
> > > > > >      *   1. If the objects cannot be added to the cache
> without
> > > > > crossing
> > > > > >      *      the flush threshold, flush the cache to the
> backend.
> > > > > >      *   2. Add the objects to the cache.
> > > > > >      */
> > > > > >
> > > > > >     if (cache->len + n <= cache->flushthresh) {
> > > > > >         cache_objs = &cache->objs[cache->len];
> > > > > >         cache->len += n;
> > > > > >     } else {
> > > > > >         cache_objs = &cache->objs[0];
> > > > > >         rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > > > > >         cache->len = n;
> > > > > >     }
> > > > > >
> > > > > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> > > > > >     RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> > > These are new stats. Do these break ABI compatibility (though these
> > > are under DEBUG flag)?
> >
> > They are not mempool cache stats; they are only kept in the cache
> > structure to provide alternative (i.e. faster) update access to some
> > (i.e. the most often updated) of the existing mempool stats. The patch
> > is [1], and it is part of a series currently being discussed as to
> > whether it should go into 22.11-rc3 or not [2].
> >
> > [1]: https://patchwork.dpdk.org/project/dpdk/patch/20221109181852.109856-3-mb@smartsharesystems.com/
> > [2]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D874A6@smartserver.smartshare.dk/T/#m41bf4e8bd886db49f11c8dbd63741b353277082f
> >
> > >
> > > > > >
> > > > > >     return cache_objs;
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Med venlig hilsen / Kind regards, -Morten Brørup
> > > > > >
> > > > >
> > >
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] mempool: zero-copy cache put bulk
  2022-11-10 10:15           ` Morten Brørup
@ 2022-11-10 11:00             ` Bruce Richardson
  2022-11-11  4:24               ` Honnappa Nagarahalli
  0 siblings, 1 reply; 9+ messages in thread
From: Bruce Richardson @ 2022-11-10 11:00 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Honnappa Nagarahalli, dev, olivier.matz, andrew.rybchenko,
	Kamalakshitha Aligeri, nd

On Thu, Nov 10, 2022 at 11:15:27AM +0100, Morten Brørup wrote:
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Wednesday, 9 November 2022 23.46
> > >
> > > +To: Bruce also showed interest in this topic, and might have more
> > insights.
> > >
> > > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > > Sent: Wednesday, 9 November 2022 18.58
> > > >
> > > > <snip>
> > > >
> > > > >
> > > > > > From: Honnappa Nagarahalli
> > [mailto:Honnappa.Nagarahalli@arm.com]
> > > > > > Sent: Sunday, 6 November 2022 00.11
> > > > > >
> > > > > > + Akshitha, she is working on similar patch
> > > > > >
> > > > > > Few comments inline
> > > > > >
> > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > Sent: Saturday, November 5, 2022 8:40 AM
> > > > > > >
> > > > > > > Zero-copy access to the mempool cache is beneficial for PMD
> > > > > > performance,
> > > > > > > and must be provided by the mempool library to fix [Bug 1052]
> > > > > > > without
> > > > > > a
> > > > > > > performance regression.
> > > > > > >
> > > > > > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > > > > > >
> > > > > > >
> > > > > > > This RFC offers a conceptual zero-copy put function, where
> > the
> > > > > > application
> > > > > > > promises to store some objects, and in return gets an address
> > > > where
> > > > > > to store
> > > > > > > them.
> > > > > > >
> > > > > > > I would like some early feedback.
> > > > > > >
> > > > > > > Notes:
> > > > > > > * Allowing the 'cache' parameter to be NULL, and getting it
> > from
> > > > the
> > > > > > > mempool instead, was inspired by rte_mempool_cache_flush().
> > > > > > I am not sure why the 'cache' parameter is required for this
> > API.
> > > > This
> > > > > > API should take the mem pool as the parameter.
> > > > > >
> > > > > > We have based our API on 'rte_mempool_do_generic_put' and
> > removed
> > > > > the
> > > > > > 'cache' parameter.
> > > > >
> > > > > I thoroughly considered omitting the 'cache' parameter, but
> > included
> > > > it for
> > > > > two reasons:
> > > > >
> > > > > 1. The function is a "mempool cache" function (i.e. primarily
> > > > > working
> > > > on the
> > > > > mempool cache), not a "mempool" function.
> > > > >
> > > > > So it is appropriate to have a pointer directly to the structure
> > it
> > > > is working on.
> > > > > Following this through, I also made 'cache' the first parameter
> > and
> > > > 'mp' the
> > > > > second, like in rte_mempool_cache_flush().
> > > > I am wondering if the PMD should be aware of the cache or not. For
> > ex:
> > > > in the case of pipeline mode, the RX and TX side of the PMD are
> > > > running on different cores.
> > >
> > > In that example, the PMD can store two cache pointers, one for each
> > of the
> > > RX and TX side.
> > I did not understand this. If RX core and TX core have their own per-
> > core caches the logic would not work. For ex: the RX core cache would
> > not get filled.
> > 
> > In the case of pipeline mode, there will not be a per-core cache. The
> > buffers would be allocated and freed from a global ring or a global
> > lockless stack.
> 
> Aha... Now I understand what you mean: You are referring to use cases where the mempool is configured to *not* have a mempool cache.
> 
> For a mempool without a mempool cache, the proposed "mempool cache" zero-copy functions can obviously not be used.
> 
> We need "mempool" zero-copy functions for the mempools that have no mempool cache.
> 
> However, those functions depend on the mempool's underlying backing store.
> 
> E.g. zero-copy access to a ring has certain requirements [1].
> 
> [1]: http://doc.dpdk.org/guides/prog_guide/ring_lib.html#ring-peek-zero-copy-api
> 
> For a stack, I think it is possible to locklessly zero-copy pop objects. But it is impossible to locklessly zero-copy push elements to a stack; another thread can race to pop some objects from the stack before the pushing thread has finished writing them into the stack.
> 
> Furthermore, the ring zero-copy get function cannot return a consecutive array of objects when wrapping, and PMD functions using vector instructions usually rely on handling chunks of e.g. 8 objects.
> 
> Just for a second, let me theorize into the absurd: Even worse, if a mempool's underlying backing store does not use an array of pointers as its internal storage structure, it is impossible to use a pointer to an array of pointers for zero-copy transactions. E.g. if the backing store uses a list or a tree structure for its storage, a pointer to somewhere in the list or tree structure is not an array of object pointers.
> 
> Anyway, we could consider designing a generic API for zero-copy mempool get/put; but it should be compatible with all underlying backing stores - or return failure, so the PMD can fall back to the standard functions, if the mempool is in a state where zero-copy access to a contiguous burst cannot be provided. E.g. zero-copy get from a ring can return failure when zero-copy access to the ring is temporarily unavailable due to being at a point where it would wrap.
> 
> Here is a conceptual proposal for such an API.
> 
> /* Mempool zero-copy transaction state. Opaque outside the mempool API. */
> struct rte_mempool_zc_transaction_state {
> 	char	opaque[RTE_CACHE_LINE_SIZE];
> };
> 
> /** Start zero-copy get/put bulk transaction.
>  *
>  * @param[in] mp
>  *   Pointer to the mempool.
>  * @param[out] obj_table_ptr
>  *   Where to store the pointer to
>  *   the zero-copy array of objects in the mempool.
>  * @param[in] n
>  *   Number of objects in the transaction.
>  * @param[in] cache
>  *   Pointer to the mempool cache. May be NULL if unknown.
>  * @param[out] transaction
>  *   Where to store the opaque transaction information.
>  *   Used internally by the mempool library.
>  * @return
>  *   - 1: Transaction completed;
>  *        '_finish' must not be called.
>  *   - 0: Transaction started;
>  *        '_finish' must be called to complete the transaction.
>  *   - <0: Error; failure code.
>  */
> static __rte_always_inline int
> rte_mempool_get/put_zc_bulk_start(
> 	struct rte_mempool *mp,
> 	void ***obj_table_ptr,
> 	unsigned int n,
> 	struct rte_mempool_cache *cache,
> 	rte_mempool_zc_transaction_state *transaction);
> 
> /** Finish zero-copy get/put bulk transaction.
>  *
>  * @param[in] mp
>  *   Pointer to mempool.
>  * @param[in] obj_table_ptr
>  *   Pointer to the zero-copy array of objects in the mempool,
>  *   returned by the 'start' function.
>  * @param[in] n
>  *   Number of objects in the transaction.
>  *   Must be the same as for the 'start' function.
>  * @param[in] transaction
>  *   Opaque transaction information,
>  *   returned by the 'start' function.
>  *   Used internally by the mempool library.
>  */
> static __rte_always_inline void
> rte_mempool_get/put_zc_bulk_finish(
> 	struct rte_mempool *mp,
> 	void **obj_table,
> 	unsigned int n,
> 	rte_mempool_zc_transaction_state *transaction);
> 
> Note that these are *bulk* functions, so 'n' has to remain the same for a 'finish' call as it was for the 'start' call of a transaction.
> 
> And then the underlying backing stores would need to provide callbacks that implement these functions, if they offer zero-copy functionality.
> 
> The mempool implementation of these could start by checking for a mempool cache, and use the "mempool cache" zero-copy if present.
> 
> Some internal state information (from the mempool library or the underlying mempool backing store) may need to be carried over from the 'start' to the 'finish' function, so I have added a transaction state parameter. The transaction state must be held by the application for thread safety reasons. Think of this like the 'flags' parameter to the Linux kernel's spin_lock_irqsave/irqrestore() functions.
> 
> We could omit the 'obj_table' and 'n' parameters from the 'finish' functions and store them in the transaction state if needed; but we might possibly achieve higher performance by passing them as parameters instead.
> 
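> A caller (e.g. a PMD TX free path) could use the pair roughly like this -
> conceptual only, with 'mbufs' standing in for the caller's own array, and
> assuming that a return value of 1 still means the caller writes the objects
> but must skip the '_finish' call:
> 
> 	struct rte_mempool_zc_transaction_state zc;
> 	void **objs;
> 	int ret;
> 
> 	ret = rte_mempool_put_zc_bulk_start(mp, &objs, n, cache, &zc);
> 	if (unlikely(ret < 0)) {
> 		/* Zero-copy currently unavailable; use the classic API. */
> 		rte_mempool_put_bulk(mp, (void **)mbufs, n);
> 	} else {
> 		/* Write the object pointers directly into the mempool/cache. */
> 		for (unsigned int i = 0; i < n; i++)
> 			objs[i] = mbufs[i];
> 		if (ret == 0)
> 			rte_mempool_put_zc_bulk_finish(mp, objs, n, &zc);
> 	}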
> > 
> > >
> > > And if the PMD is unaware of the cache pointer, it can look it up at
> > runtime
> > > using rte_lcore_id(), like it does in the current Intel PMDs.
> > >
> > > > However, since the rte_mempool_cache_flush API is provided, maybe
> > > > that decision is already made? Interestingly,
> > > > rte_mempool_cache_flush is called by just a single PMD.
> > >
> > > I intentionally aligned this RFC with rte_mempool_cache_flush() to
> > maintain
> > > consistency.
> > >
> > > However, the API is not set in stone. It should always be acceptable
> > to
> > > consider improved alternatives.
> > >
> > > >
> > > > So, the question is, should we allow zero-copy only for per-core
> > cache
> > > > or for other cases as well.
> > >
> > > I suppose that the mempool library was designed to have a mempool
> > > associated with exactly one mempool cache per core. (Alternatively,
> > the
> > > mempool can be configured with no mempool caches at all.)
> > >
> > > We should probably stay loyal to that design concept, and only allow
> > zero-
> > > copy for per-core cache.
> > >
> > > If you can come up with an example of the opposite, I would like to
> > explore
> > > that option too... I can't think of a good example myself, and
> > perhaps I'm
> > > overlooking a relevant use case.
> > The use case I am talking about is the pipeline mode as I mentioned
> > above. Let me know if you agree.
> 
> I see what you mean, and I don't object. :-)
> 
> However, I still think the "mempool cache" zero-copy functions could be useful.
> 
> They would be needed for the generic "mempool" zero-copy functions anyway.
> 
> And the "mempool cache" zero-copy functions are much simpler to design, implement and use than the "mempool" zero-copy functions, so it is a good first step.
> 
I would think that even in pipeline-mode applications a mempool cache would
still be very useful, as it would likely reduce the number of accesses to the
underlying shared ring or stack resources.

For example, if, as is common, the application works in bursts of up to 32,
but the mempool cache was configured as 128, it means that the TX side
would flush buffers to the shared pool at most every 4 bursts and likely
less frequently than that due to the bursts themselves not always being the
full 32. Since accesses to the underlying ring and stack tend to be slow
due to locking or atomic operations, the more you can reduce the accesses
the better.
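
Concretely (numbers purely illustrative), such a pool would be created with a
per-lcore cache along these lines:

	/* 8191 mbufs, per-lcore cache of 128 entries. */
	struct rte_mempool *mp = rte_pktmbuf_pool_create("mbuf_pool", 8191, 128,
			0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());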

Therefore, I would very much consider use of a mempool without a cache as
an edge case - one we need to support, but not optimize for, since
mempool accesses without a cache would already be rather slow.

My 2c here.
/Bruce

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [RFC] mempool: zero-copy cache put bulk
  2022-11-10 11:00             ` Bruce Richardson
@ 2022-11-11  4:24               ` Honnappa Nagarahalli
  0 siblings, 0 replies; 9+ messages in thread
From: Honnappa Nagarahalli @ 2022-11-11  4:24 UTC (permalink / raw)
  To: Bruce Richardson, Morten Brørup
  Cc: dev, olivier.matz, andrew.rybchenko, Kamalakshitha Aligeri, nd, nd

<snip>

> > > > > > > From: Honnappa Nagarahalli
> > > [mailto:Honnappa.Nagarahalli@arm.com]
> > > > > > > Sent: Sunday, 6 November 2022 00.11
> > > > > > >
> > > > > > > + Akshitha, she is working on similar patch
> > > > > > >
> > > > > > > Few comments inline
> > > > > > >
> > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > Sent: Saturday, November 5, 2022 8:40 AM
> > > > > > > >
> > > > > > > > Zero-copy access to the mempool cache is beneficial for
> > > > > > > > PMD
> > > > > > > performance,
> > > > > > > > and must be provided by the mempool library to fix [Bug
> > > > > > > > 1052] without
> > > > > > > a
> > > > > > > > performance regression.
> > > > > > > >
> > > > > > > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > > > > > > >
> > > > > > > >
> > > > > > > > This RFC offers a conceptual zero-copy put function, where
> > > the
> > > > > > > application
> > > > > > > > promises to store some objects, and in return gets an
> > > > > > > > address
> > > > > where
> > > > > > > to store
> > > > > > > > them.
> > > > > > > >
> > > > > > > > I would like some early feedback.
> > > > > > > >
> > > > > > > > Notes:
> > > > > > > > * Allowing the 'cache' parameter to be NULL, and getting
> > > > > > > > it
> > > from
> > > > > the
> > > > > > > > mempool instead, was inspired by rte_mempool_cache_flush().
> > > > > > > I am not sure why the 'cache' parameter is required for this
> > > API.
> > > > > This
> > > > > > > API should take the mem pool as the parameter.
> > > > > > >
> > > > > > > We have based our API on 'rte_mempool_do_generic_put' and
> > > removed
> > > > > > the
> > > > > > > 'cache' parameter.
> > > > > >
> > > > > > I thoroughly considered omitting the 'cache' parameter, but
> > > included
> > > > > it for
> > > > > > two reasons:
> > > > > >
> > > > > > 1. The function is a "mempool cache" function (i.e. primarily
> > > > > > working
> > > > > on the
> > > > > > mempool cache), not a "mempool" function.
> > > > > >
> > > > > > So it is appropriate to have a pointer directly to the
> > > > > > structure
> > > it
> > > > > is working on.
> > > > > > Following this through, I also made 'cache' the first
> > > > > > parameter
> > > and
> > > > > 'mp' the
> > > > > > second, like in rte_mempool_cache_flush().
> > > > > I am wondering if the PMD should be aware of the cache or not.
> > > > > For
> > > ex:
> > > > > in the case of pipeline mode, the RX and TX side of the PMD are
> > > > > running on different cores.
> > > >
> > > > In that example, the PMD can store two cache pointers, one for
> > > > each
> > > of the
> > > > RX and TX side.
> > > I did not understand this. If RX core and TX core have their own
> > > per- core caches the logic would not work. For ex: the RX core cache
> > > would not get filled.
> > >
> > > In the case of pipeline mode, there will not be a per-core cache.
> > > The buffers would be allocated and freed from a global ring or a
> > > global lockless stack.
> >
> > Aha... Now I understand what you mean: You are referring to use cases
> where the mempool is configured to *not* have a mempool cache.
> >
> > For a mempool without a mempool cache, the proposed "mempool cache"
> zero-copy functions can obviously not be used.
> >
> > We need "mempool" zero-copy functions for the mempools that have no
> mempool cache.
> >
> > However, those functions depend on the mempool's underlying backing
> store.
> >
> > E.g. zero-copy access to a ring has certain requirements [1].
> >
> > [1]:
> > http://doc.dpdk.org/guides/prog_guide/ring_lib.html#ring-peek-zero-cop
> > y-api
> >
> > For a stack, I think it is possible to locklessly zero-copy pop objects. But it is
> impossible to locklessly zero-copy push elements to a stack; another thread
> can race to pop some objects from the stack before the pushing thread has
> finished writing them into the stack.
> >
> > Furthermore, the ring zero-copy get function cannot return a consecutive
> array of objects when wrapping, and PMD functions using vector instructions
> usually rely on handling chunks of e.g. 8 objects.
> >
> > Just for a second, let me theorize into the absurd: Even worse, if a
> mempool's underlying backing store does not use an array of pointers as its
> internal storage structure, it is impossible to use a pointer to an array of
> pointers for zero-copy transactions. E.g. if the backing store uses a list or a
> tree structure for its storage, a pointer to somewhere in the list or tree
> structure is not an array of objects pointers.
> >
> > Anyway, we could consider designing a generic API for zero-copy mempool
> get/put; but it should be compatible with all underlying backing stores - or
> return failure, so the PMD can fall back to the standard functions, if the
> mempool is in a state where zero-copy access to a contiguous burst cannot
> be provided. E.g. zero-copy get from a ring can return failure when zero-copy
> access to the ring is temporarily unavailable due to being at a point where it
> would wrap.
> >
> > Here is a conceptual proposal for such an API.
> >
> > /* Mempool zero-copy transaction state. Opaque outside the mempool
> > API. */ struct rte_mempool_zc_transaction_state {
> > 	char	opaque[RTE_CACHE_LINE_SIZE];
> > };
> >
> > /** Start zero-copy get/put bulk transaction.
> >  *
> >  * @param[in] mp
> >  *   Pointer to the mempool.
> >  * @param[out] obj_table_ptr
> >  *   Where to store the pointer to
> >  *   the zero-copy array of objects in the mempool.
> >  * @param[in] n
> >  *   Number of objects in the transaction.
> >  * @param[in] cache
> >  *   Pointer to the mempool cache. May be NULL if unknown.
> >  * @param[out] transaction
> >  *   Where to store the opaque transaction information.
> >  *   Used internally by the mempool library.
> >  * @return
> >  *   - 1: Transaction completed;
> >  *        '_finish' must not be called.
> >  *   - 0: Transaction started;
> >  *        '_finish' must be called to complete the transaction.
> >  *   - <0: Error; failure code.
> >  */
> > static __rte_always_inline int
> > rte_mempool_get/put_zc_bulk_start(
> > 	struct rte_mempool *mp,
> > 	void ***obj_table_ptr,
> > 	unsigned int n,
> > 	struct rte_mempool_cache *cache,
> > 	rte_mempool_zc_transaction_state *transaction);
> >
> > /** Finish zero-copy get/put bulk transaction.
> >  *
> >  * @param[in] mp
> >  *   Pointer to mempool.
> >  * @param[in] obj_table_ptr
> >  *   Pointer to the zero-copy array of objects in the mempool,
> >  *   returned by the 'start' function.
> >  * @param[in] n
> >  *   Number of objects in the transaction.
> >  *   Must be the same as for the 'start' function.
> >  * @param[in] transaction
> >  *   Opaque transaction information,
> >  *   returned by the 'start' function.
> >  *   Used internally by the mempool library.
> >  */
> > static __rte_always_inline void
> > rte_mempool_get/put_zc_bulk_finish(
> > 	struct rte_mempool *mp,
> > 	void **obj_table,
> > 	unsigned int n,
> > 	rte_mempool_zc_transaction_state *transaction);
> >
> > Note that these are *bulk* functions, so 'n' has to remain the same for a
> 'finish' call as it was for the 'start' call of a transaction.
> >
> > And then the underlying backing stores would need to provide callbacks
> that implement these functions, if they offer zero-copy functionality.
> >
> > The mempool implementation of these could start by checking for a
> mempool cache, and use the "mempool cache" zero-copy if present.
> >
> > Some internal state information (from the mempool library or the
> underlying mempool backing store) may need to be carried over from the
> 'start' to the 'finish' function, so I have added a transaction state parameter.
> The transaction state must be held by the application for thread safety
> reasons. Think of this like the 'flags' parameter to the Linux kernel's
> spin_lock_irqsave/irqrestore() functions.
> >
> > We could omit the 'obj_table' and 'n' parameters from the 'finish' functions
> and store them in the transaction state if needed; but we might possibly
> achieve higher performance by passing them as parameters instead.
> >
> > >
> > > >
> > > > And if the PMD is unaware of the cache pointer, it can look it up
> > > > at
> > > runtime
> > > > using rte_lcore_id(), like it does in the current Intel PMDs.
> > > >
> > > > > However, since the rte_mempool_cache_flush API is provided, may
> > > > > be that decision is already done? Interestingly,
> > > rte_mempool_cache_flush
> > > > > is called by just a single PMD.
> > > >
> > > > I intentionally aligned this RFC with rte_mempool_cache_flush() to
> > > maintain
> > > > consistency.
> > > >
> > > > However, the API is not set in stone. It should always be
> > > > acceptable
> > > to
> > > > consider improved alternatives.
> > > >
> > > > >
> > > > > So, the question is, should we allow zero-copy only for per-core
> > > cache
> > > > > or for other cases as well.
> > > >
> > > > I suppose that the mempool library was designed to have a mempool
> > > > associated with exactly one mempool cache per core.
> > > > (Alternatively,
> > > the
> > > > mempool can be configured with no mempool caches at all.)
> > > >
> > > > We should probably stay loyal to that design concept, and only
> > > > allow
> > > zero-
> > > > copy for per-core cache.
> > > >
> > > > If you can come up with an example of the opposite, I would like
> > > > to
> > > explore
> > > > that option too... I can't think of a good example myself, and
> > > perhaps I'm
> > > > overlooking a relevant use case.
> > > The use case I am talking about is the pipeline mode as I mentioned
> > > above. Let me know if you agree.
> >
> > I see what you mean, and I don't object. :-)
> >
> > However, I still think the "mempool cache" zero-copy functions could be
> useful.
> >
> > They would be needed for the generic "mempool" zero-copy functions
> anyway.
> >
> > And the "mempool cache" zero-copy functions are much simpler to design,
> implement and use than the "mempool" zero-copy functions, so it is a good
> first step.
> >
> I would think that even in pipeline mode applications a mempool cache
> would still be very useful, as it would like reduce the number of accesses to
> the underlying shared ring or stack resources.
> 
> For example, if, as is common, the application works in bursts of up to 32, but
> the mempool cache was configured as 128, it means that the TX side would
> flush buffers to the shared pool at most every 4 bursts and likely less
> frequently than that due to the bursts themselves not always being the full
> 32. Since accesses to the underlying ring and stack tend to be slow due to
> locking or atomic operations, the more you can reduce the accesses the
> better.
> 
> Therefore, I would very much consider use of a mempool without a cache as
> an edge-case - but one we need to support, but not optimize for, since
> mempool accesses without cache would already be rather slow.
Understood. Looks like supporting APIs for the per-core cache is enough for now. I do not see a lot of advantages for a mempool without a cache.

> 
> My 2c here.
> /Bruce

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-11-11  4:25 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-05 13:40 [RFC] mempool: zero-copy cache put bulk Morten Brørup
2022-11-05 23:11 ` Honnappa Nagarahalli
2022-11-06  6:57   ` Morten Brørup
2022-11-09 17:57     ` Honnappa Nagarahalli
2022-11-09 20:36       ` Morten Brørup
2022-11-09 22:45         ` Honnappa Nagarahalli
2022-11-10 10:15           ` Morten Brørup
2022-11-10 11:00             ` Bruce Richardson
2022-11-11  4:24               ` Honnappa Nagarahalli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).