Subject: RE: [PATCH] mempool: fix get objects from mempool with cache
Date: Fri, 28 Jan 2022 11:22:58 +0100
From: Morten Brørup <mb@smartsharesystems.com>
To: Olivier Matz <olivier.matz@6wind.com>
Cc: dev@dpdk.org
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86E56@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35D86DB2@smartserver.smartshare.dk> <20220114163650.94288-1-mb@smartsharesystems.com>

Olivier, thank you for the detailed feedback on my mempool RFC and patches.

We might disagree on some points, but that is the point of having a discussion. :-)

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Monday, 24 January 2022 16.39
> 
> Hi Morten,
> 
> Few comments below.
> 
> On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > A flush threshold for the mempool cache was introduced in DPDK version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> >
> > This patch fixes the following in rte_mempool_do_generic_get():
> >
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction of
> > the flush threshold. E.g. the cache can hold up to flushthresh objects,
> > which is more than its size, so some requests were not served from the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.

This bug will cause a major performance degradation in a scenario where the application burst length is the same as the cache size. In this case, the objects are never fetched from the mempool cache.

This scenario occurs if an application has configured a mempool cache size matching the application's burst size. Do any such applications exist? I don't know.
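To illustrate the difference, here is a minimal sketch (not the actual DPDK code; the helper names, the cache_size parameter and the example sizes are made up for this mail, RTE_MEMPOOL_CACHE_MAX_SIZE is the constant used by the patched screening):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RTE_MEMPOOL_CACHE_MAX_SIZE 512	/* limit from rte_mempool.h */

/* Old screening: "if (n >= cache->size) goto ring_dequeue" */
static bool old_may_use_cache(uint32_t n, uint32_t cache_size)
{
	return n < cache_size;
}

/* New screening: "if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) goto ring_dequeue" */
static bool new_may_use_cache(uint32_t n)
{
	return n <= RTE_MEMPOOL_CACHE_MAX_SIZE;
}

int main(void)
{
	uint32_t cache_size = 256;	/* arbitrary example */
	uint32_t n = cache_size;	/* burst equal to the cache size */

	printf("old: %s, new: %s\n",
		old_may_use_cache(n, cache_size) ? "cache" : "ring",
		new_may_use_cache(n) ? "cache" : "ring");
	return 0;
}

With n equal to cache_size, the old condition sends the request straight to the ring, while the new condition still lets the cache serve it (possibly after a refill).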
> >
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the ring.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the objects
> > from the ring were returned ahead of the objects from the cache. This is
> > bad for CPUs with a small L1 cache, which benefit from having the hot
> > objects first in the returned array. (This is also the reason why
> > the function returns the objects in reverse order.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the ring.

Formally, the function is buggy when it isn't doing what it is supposed to do.

But yes, it only has a performance impact, and perhaps mostly on low-end hardware.

> >
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the ring (instead of only the
> > number of requested objects minus the objects available in the cache),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the cache,
> > and if the subsequent backfilling of the cache from the ring fails, only
> > the remaining requested objects are retrieved from the ring.
> 
> This is the only point I'd consider to be a fix. The problem, from the
> user perspective, is that a get() can fail despite there being enough
> objects in cache + common pool.
> 
> To be honest, I feel a bit uncomfortable having such a list of
> problems solved in one commit, even if I understand that they are part
> of the same code rework.
> 
> Ideally, this fix should be a separate commit. What do you think of
> having this simple patch for this fix, and then doing the
> optimizations/rework in another commit?
> 
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1484,7 +1484,22 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  			 * the ring directly. If that fails, we are truly out of
>  			 * buffers.
>  			 */
> -			goto ring_dequeue;
> +			req = n - cache->len;
> +			ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, req);
> +			if (ret < 0) {
> +				RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> +				RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> +				return ret;
> +			}
> +			obj_table += req;
> +			len = cache->len;
> +			while (len > 0)
> +				*obj_table++ = cache_objs[--len];
> +			cache->len = 0;
> +			RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +			RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +
> +			return 0;
>  		}
> 
>  		cache->len += req;
> 
> The title of this commit could then be more precise to describe
> the solved issue.

I get your point here, but I still consider the other modifications bug fixes too, so a unified rework patch works better for me.

As you also noticed yourself, the resulting code is clear and easily readable. If that was not the case, I might have agreed to break it up into a couple of steps.

> 
> > 4. The code flow for satisfying the request from the cache was slightly
> > inefficient:
> > The likely code path, where the objects are simply served from the cache,
> > was treated as unlikely. Now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 5. Some comments were not correct anymore.
> > The comments have been updated.
> > Most importantly, the description of the successful return value was
> > inaccurate. Success only returns 0, not >= 0.
> >
> > Signed-off-by: Morten Brørup
> > ---
> >  lib/mempool/rte_mempool.h | 81 ++++++++++++++++++++++++++++-----------
> >  1 file changed, 59 insertions(+), 22 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..88f1b8b7ab 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
> >
> >  /**
> >   * @internal Get several objects from the mempool; used internally.
> > + *
> > + * If cache is enabled, objects are returned from the cache in Last In First
> > + * Out (LIFO) order for the benefit of CPUs with small L1 cache.
> > + *
> >   * @param mp
> >   *   A pointer to the mempool structure.
> >   * @param obj_table
> > @@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
> >   * @param cache
> >   *   A pointer to a mempool cache structure. May be NULL if not needed.
> >   * @return
> > - *   - >=0: Success; number of objects supplied.
> > + *   - 0: Success; got n objects.
> >   *   - <0: Error; code of ring dequeue function.
> >   */
> >  static __rte_always_inline int
> 
> I think that part should be in a separate commit too. This is a
> documentation fix, which is easily backportable (and should be
> backported) (Fixes: af75078fece3 ("first public release")).

OK. I'll take this out of the patch. And as you mentioned in your follow-up, this is also addressed by another commit, so I'll leave it at that.

> 
> > @@ -1463,38 +1467,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	uint32_t index, len;
> >  	void **cache_objs;
> >
> > -	/* No cache provided or cannot be satisfied from cache */
> > -	if (unlikely(cache == NULL || n >= cache->size))
> > +	/* No cache provided or if get would overflow mem allocated for cache */
> > +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_dequeue;
> >
> > -	cache_objs = cache->objs;
> > +	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (n <= cache->len) {
> > +		/* The entire request can be satisfied from the cache. */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> >
> > -	/* Can this be satisfied from the cache? */
> > -	if (cache->len < n) {
> > -		/* No. Backfill the cache first, and then fill from it */
> > -		uint32_t req = n + (cache->size - cache->len);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> >
> > -		/* How many do we require i.e. number to fill the cache + the request */
> > -		ret = rte_mempool_ops_dequeue_bulk(mp,
> > -			&cache->objs[cache->len], req);
> > +		return 0;
> > +	}
> > +
> > +	/* Satisfy the first part of the request by depleting the cache. */
> > +	len = cache->len;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> > +
> > +	/* Number of objects remaining to satisfy the request. */
> > +	len = n - len;
> > +
> > +	/* Fill the cache from the ring; fetch size + remaining objects. */
> > +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > +		cache->size + len);
> > +	if (unlikely(ret < 0)) {
> > +		/*
> > +		 * We are buffer constrained, and not able to allocate
> > +		 * cache + remaining.
> > +		 * Do not fill the cache, just satisfy the remaining part of
> > +		 * the request directly from the ring.
> > +		 */
> > +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
> >  		if (unlikely(ret < 0)) {
> >  			/*
> > -			 * In the off chance that we are buffer constrained,
> > -			 * where we are not able to allocate cache + n, go to
> > -			 * the ring directly. If that fails, we are truly out of
> > -			 * buffers.
> > +			 * That also failed.
> > +			 * No further action is required to roll the first
> > +			 * part of the request back into the cache, as both
> > +			 * cache->len and the objects in the cache are intact.
> >  			 */
> > -			goto ring_dequeue;
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> > +
> > +			return ret;
> >  		}
> >
> > -		cache->len += req;
> > +		/* Commit that the cache was emptied. */
> > +		cache->len = 0;
> > +
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > +
> > +		return 0;
> >  	}
> >
> > -	/* Now fill in the response ... */
> > -	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> > -		*obj_table = cache_objs[len];
> > +	cache_objs = &cache->objs[cache->size + len];
> >
> > -	cache->len -= n;
> > +	/* Satisfy the remaining part of the request from the filled cache. */
> > +	cache->len = cache->size;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> >
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > @@ -1503,7 +1540,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >
> >  ring_dequeue:
> >
> > -	/* get remaining objects from ring */
> > +	/* Get the objects from the ring. */
> >  	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
> >
> >  	if (ret < 0) {
> 
> About the code itself, it is more readable now, and probably more
> efficient. Did you notice any performance change in the mempool perf
> autotests?

No significant performance change in my test environment. Probably also because mempool_perf_autotest doesn't test flushing/refilling the cache; a loop like the sketch at the end of this mail would exercise those paths.

> 
> Thanks,
> Olivier
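PS: For reference, a minimal sketch of such a loop (not part of mempool_perf_autotest; the pool name, element size, cache size, burst size and iteration count are arbitrary choices for illustration). With BURST larger than the cache size, every get has to refill the cache from the ring and every put crosses the flush threshold, which is exactly the code reworked by the patch:

#include <stdio.h>
#include <rte_eal.h>
#include <rte_mempool.h>
#include <rte_cycles.h>

#define POOL_SIZE   4095
#define ELT_SIZE    64
#define CACHE_SIZE  256
#define BURST       (CACHE_SIZE + 64)	/* larger than the cache */
#define ITERATIONS  100000

int main(int argc, char **argv)
{
	void *objs[BURST];
	struct rte_mempool *mp;
	uint64_t start, cycles;
	unsigned int i;

	if (rte_eal_init(argc, argv) < 0)
		return -1;

	mp = rte_mempool_create("cache_stress", POOL_SIZE, ELT_SIZE,
			CACHE_SIZE, 0, NULL, NULL, NULL, NULL,
			SOCKET_ID_ANY, 0);
	if (mp == NULL)
		return -1;

	start = rte_rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		/* Each burst exceeds the cache, so gets refill from the
		 * ring (or bypass the cache entirely before the patch)
		 * and puts hit the flush threshold. */
		if (rte_mempool_get_bulk(mp, objs, BURST) < 0)
			break;
		rte_mempool_put_bulk(mp, objs, BURST);
	}
	cycles = rte_rdtsc() - start;

	printf("%u iterations, %.1f cycles/burst\n", i,
			i ? (double)cycles / i : 0.0);

	rte_mempool_free(mp);
	return 0;
}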