Subject: RE: [PATCH] mempool: fix get objects from mempool with cache
Date: Fri, 28 Jan 2022 11:22:58 +0100
From: Morten Brørup <mb@smartsharesystems.com>
To: Olivier Matz <olivier.matz@6wind.com>
Cc: dev@dpdk.org
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86E56@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35D86DB2@smartserver.smartshare.dk> <20220114163650.94288-1-mb@smartsharesystems.com>

Olivier, thank you for the detailed feedback on my mempool RFC and patches.

We might disagree on some points, but that is the point of having a discussion. :-)

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Monday, 24 January 2022 16.39
> 
> Hi Morten,
> 
> Few comments below.
> 
> On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > A flush threshold for the mempool cache was introduced in DPDK version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> >
> > This patch fixes the following in rte_mempool_do_generic_get():
> >
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction of
> > the flush threshold. E.g. the cache can hold up to flushthresh objects,
> > which is more than its size, so some requests were not served from the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.

This bug will cause a major performance degradation in a scenario where the application burst length is the same as the cache size. In this case, the objects are never fetched from the mempool cache.

This scenario occurs if an application has configured a mempool cache size matching the application's burst size. Do any such applications exist? I don't know.
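To illustrate the difference, here is a minimal sketch (not the actual DPDK code; the helper names, the cache_size parameter and the example sizes are made up for this mail, RTE_MEMPOOL_CACHE_MAX_SIZE is the constant used by the patched screening):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RTE_MEMPOOL_CACHE_MAX_SIZE 512	/* limit from rte_mempool.h */

/* Old screening: "if (n >= cache->size) goto ring_dequeue" */
static bool old_may_use_cache(uint32_t n, uint32_t cache_size)
{
	return n < cache_size;
}

/* New screening: "if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) goto ring_dequeue" */
static bool new_may_use_cache(uint32_t n)
{
	return n <= RTE_MEMPOOL_CACHE_MAX_SIZE;
}

int main(void)
{
	uint32_t cache_size = 256;	/* arbitrary example */
	uint32_t n = cache_size;	/* burst equal to the cache size */

	printf("old: %s, new: %s\n",
		old_may_use_cache(n, cache_size) ? "cache" : "ring",
		new_may_use_cache(n) ? "cache" : "ring");
	return 0;
}

With n equal to cache_size, the old condition sends the request straight to the ring, while the new condition still lets the cache serve it (possibly after a refill).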
> >
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the ring.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the objects
> > from the ring were returned ahead of the objects from the cache. This is
> > bad for CPUs with a small L1 cache, which benefit from having the hot
> > objects first in the returned array. (This is also the reason why
> > the function returns the objects in reverse order.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the ring.

Formally, the function is buggy when it isn't doing what it is supposed to do.

But yes, it only has a performance impact, and perhaps mostly on low-end hardware.

> >
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the ring (instead of only the
> > number of requested objects minus the objects available in the cache),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the cache,
> > and if the subsequent backfilling of the cache from the ring fails, only
> > the remaining requested objects are retrieved from the ring.
> 
> This is the only point I'd consider to be a fix. The problem, from the
> user perspective, is that a get() can fail despite there being enough
> objects in cache + common pool.
> 
> To be honest, I feel a bit uncomfortable having such a list of
> problems solved in one commit, even if I understand that they are part
> of the same code rework.
> 
> Ideally, this fix should be a separate commit. What do you think of
> having this simple patch for this fix, and then doing the
> optimizations/rework in another commit?
> 
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1484,7 +1484,22 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  			 * the ring directly. If that fails, we are truly out of
>  			 * buffers.
>  			 */
> -			goto ring_dequeue;
> +			req = n - cache->len;
> +			ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, req);
> +			if (ret < 0) {
> +				RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> +				RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> +				return ret;
> +			}
> +			obj_table += req;
> +			len = cache->len;
> +			while (len > 0)
> +				*obj_table++ = cache_objs[--len];
> +			cache->len = 0;
> +			RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +			RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +
> +			return 0;
>  		}
> 
>  		cache->len += req;
> 
> The title of this commit could then be more precise to describe
> the solved issue.

I get your point here, but I still consider the other modifications bug fixes too, so a unified rework patch works better for me.

As you also noticed yourself, the resulting code is clear and easily readable. If that was not the case, I might have agreed to break it up into a couple of steps.

> 
> > 4. The code flow for satisfying the request from the cache was slightly
> > inefficient:
> > The likely code path, where the objects are simply served from the cache,
> > was treated as unlikely. Now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 5. Some comments were not correct anymore.
> > The comments have been updated.
> > Most importantly, the description of the successful return value was
> > inaccurate. Success only returns 0, not >= 0.
> >
> > Signed-off-by: Morten Brørup
> > ---
> >  lib/mempool/rte_mempool.h | 81 ++++++++++++++++++++++++++++-----------
> >  1 file changed, 59 insertions(+), 22 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..88f1b8b7ab 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
> >
> >  /**
> >   * @internal Get several objects from the mempool; used internally.
> > + *
> > + * If cache is enabled, objects are returned from the cache in Last In First
> > + * Out (LIFO) order for the benefit of CPUs with small L1 cache.
> > + *
> >   * @param mp
> >   *   A pointer to the mempool structure.
> >   * @param obj_table
> > @@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
> >   * @param cache
> >   *   A pointer to a mempool cache structure. May be NULL if not needed.
> >   * @return
> > - *   - >=0: Success; number of objects supplied.
> > + *   - 0: Success; got n objects.
> >   *   - <0: Error; code of ring dequeue function.
> >   */
> >  static __rte_always_inline int
> 
> I think that part should be in a separate commit too. This is a
> documentation fix, which is easily backportable (and should be
> backported) (Fixes: af75078fece3 ("first public release")).

OK. I'll take this out of the patch. And as you mentioned in your follow-up, this is also addressed by another commit, so I'll leave it at that.

> 
> > @@ -1463,38 +1467,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >  	uint32_t index, len;
> >  	void **cache_objs;
> >
> > -	/* No cache provided or cannot be satisfied from cache */
> > -	if (unlikely(cache == NULL || n >= cache->size))
> > +	/* No cache provided or if get would overflow mem allocated for cache */
> > +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_dequeue;
> >
> > -	cache_objs = cache->objs;
> > +	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (n <= cache->len) {
> > +		/* The entire request can be satisfied from the cache. */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> >
> > -	/* Can this be satisfied from the cache? */
> > -	if (cache->len < n) {
> > -		/* No. Backfill the cache first, and then fill from it */
> > -		uint32_t req = n + (cache->size - cache->len);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> >
> > -		/* How many do we require i.e. number to fill the cache + the request */
> > -		ret = rte_mempool_ops_dequeue_bulk(mp,
> > -			&cache->objs[cache->len], req);
> > +		return 0;
> > +	}
> > +
> > +	/* Satisfy the first part of the request by depleting the cache. */
> > +	len = cache->len;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> > +
> > +	/* Number of objects remaining to satisfy the request. */
> > +	len = n - len;
> > +
> > +	/* Fill the cache from the ring; fetch size + remaining objects. */
> > +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > +		cache->size + len);
> > +	if (unlikely(ret < 0)) {
> > +		/*
> > +		 * We are buffer constrained, and not able to allocate
> > +		 * cache + remaining.
> > +		 * Do not fill the cache, just satisfy the remaining part of
> > +		 * the request directly from the ring.
> > +		 */
> > +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
> >  		if (unlikely(ret < 0)) {
> >  			/*
> > -			 * In the off chance that we are buffer constrained,
> > -			 * where we are not able to allocate cache + n, go to
> > -			 * the ring directly. If that fails, we are truly out of
> > -			 * buffers.
> > +			 * That also failed.
> > +			 * No further action is required to roll the first
> > +			 * part of the request back into the cache, as both
> > +			 * cache->len and the objects in the cache are intact.
> >  			 */
> > -			goto ring_dequeue;
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> > +
> > +			return ret;
> >  		}
> >
> > -		cache->len += req;
> > +		/* Commit that the cache was emptied. */
> > +		cache->len = 0;
> > +
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > +
> > +		return 0;
> >  	}
> >
> > -	/* Now fill in the response ... */
> > -	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> > -		*obj_table = cache_objs[len];
> > +	cache_objs = &cache->objs[cache->size + len];
> >
> > -	cache->len -= n;
> > +	/* Satisfy the remaining part of the request from the filled cache. */
> > +	cache->len = cache->size;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> >
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > @@ -1503,7 +1540,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
> >
> >  ring_dequeue:
> >
> > -	/* get remaining objects from ring */
> > +	/* Get the objects from the ring. */
> >  	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
> >
> >  	if (ret < 0) {
> 
> About the code itself, it is more readable now, and probably more
> efficient. Did you notice any performance change in the mempool perf
> autotests?

No significant performance change in my test environment. Probably also because mempool_perf_autotest doesn't test flushing/refilling the cache; a loop like the sketch at the end of this mail would exercise those paths.

> 
> Thanks,
> Olivier
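PS: For reference, a minimal sketch of such a loop (not part of mempool_perf_autotest; the pool name, element size, cache size, burst size and iteration count are arbitrary choices for illustration). With BURST larger than the cache size, every get has to refill the cache from the ring and every put crosses the flush threshold, which is exactly the code reworked by the patch:

#include <stdio.h>
#include <rte_eal.h>
#include <rte_mempool.h>
#include <rte_cycles.h>

#define POOL_SIZE   4095
#define ELT_SIZE    64
#define CACHE_SIZE  256
#define BURST       (CACHE_SIZE + 64)	/* larger than the cache */
#define ITERATIONS  100000

int main(int argc, char **argv)
{
	void *objs[BURST];
	struct rte_mempool *mp;
	uint64_t start, cycles;
	unsigned int i;

	if (rte_eal_init(argc, argv) < 0)
		return -1;

	mp = rte_mempool_create("cache_stress", POOL_SIZE, ELT_SIZE,
			CACHE_SIZE, 0, NULL, NULL, NULL, NULL,
			SOCKET_ID_ANY, 0);
	if (mp == NULL)
		return -1;

	start = rte_rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		/* Each burst exceeds the cache, so gets refill from the
		 * ring (or bypass the cache entirely before the patch)
		 * and puts hit the flush threshold. */
		if (rte_mempool_get_bulk(mp, objs, BURST) < 0)
			break;
		rte_mempool_put_bulk(mp, objs, BURST);
	}
	cycles = rte_rdtsc() - start;

	printf("%u iterations, %.1f cycles/burst\n", i,
			i ? (double)cycles / i : 0.0);

	rte_mempool_free(mp);
	return 0;
}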