From: Morten Brørup
To: Olivier Matz
Subject: RE: [PATCH v3] mempool: fix put objects to mempool with cache
Date: Fri, 28 Jan 2022 10:37:49 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86E55@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35D86DB2@smartserver.smartshare.dk>
 <20220119150301.42583-1-mb@smartsharesystems.com>

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Monday, 24 January 2022 16.39
>
> Hi Morten,
>
> On Wed, Jan 19, 2022 at 04:03:01PM +0100, Morten Brørup wrote:
> > mempool: fix put objects to mempool with cache
> >
> > This patch optimizes the rte_mempool_do_generic_put() caching
> > algorithm, and fixes a bug in it.
>
> I think we should avoid grouping fixes and optimizations in one
> patch. The main reason is that fixes aim to be backported, which
> is not the case for optimizations.

OK. I'll separate them.

>
> > The existing algorithm was:
> > 1. Add the objects to the cache
> > 2. Anything greater than the cache size (if it crosses the cache
> > flush threshold) is flushed to the ring.
> >
> > Please note that the description in the source code said that it
> > kept "cache min value" objects after flushing, but the function
> > actually kept "size" objects, which is reflected in the above
> > description.
> >
> > Now, the algorithm is:
> > 1. If the objects cannot be added to the cache without crossing the
> > flush threshold, flush the cache to the ring.
> > 2. Add the objects to the cache.
> >
> > This patch changes these details:
> >
> > 1. Bug: The cache was still full after flushing.
> > In the opposite direction, i.e. when getting objects from the cache,
> > the cache is refilled to full level when it crosses the low
> > watermark (which happens to be zero).
> > Similarly, the cache should be flushed to empty level when it
> > crosses the high watermark (which happens to be 1.5 x the size of
> > the cache).
> > The existing flushing behaviour was suboptimal for real
> > applications, because crossing the low or high watermark typically
> > happens when the application is in a state where the number of
> > put/get events is out of balance, e.g.
> > when absorbing a burst of packets into a QoS queue (getting more
> > mbufs from the mempool), or when a burst of packets is trickling
> > out from the QoS queue (putting the mbufs back into the mempool).
> > NB: When the application is in a state where put/get events are in
> > balance, the cache should remain within its low and high
> > watermarks, and the algorithms for refilling/flushing the cache
> > should not come into play.
> > Now, the mempool cache is completely flushed when crossing the
> > flush threshold, so only the newly put (hot) objects remain in the
> > mempool cache afterwards.
>
> I'm not sure we should call this behavior a bug. What is the impact
> on applications, from a user perspective? Can it break a use-case, or
> have an important performance impact?

It doesn't break anything. But it doesn't behave as intended (according
to its description in the source code), so I do consider it a bug! Any
professional tester, when seeing an implementation that doesn't do what
is intended, would also flag the implementation as faulty.

It has a performance impact: it causes many more mempool cache flushes
than intended.

I have elaborated with an example here:
http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86E54@smartserver.smartshare.dk/T/#t

>
>
> > 2. Minor bug: The flush threshold comparison has been corrected; it
> > must be "len > flushthresh", not "len >= flushthresh".
> > Reasoning: Consider a flush multiplier of 1 instead of 1.5; the
> > cache would be flushed already when reaching size elements, not
> > when exceeding size elements.
> > Now, flushing is triggered when the flush threshold is exceeded,
> > not when reached.
>
> Same here, we should ask ourselves what is the impact before calling
> it a bug.

It's a classic off-by-one bug. It only impacts performance, causing
premature mempool cache flushing. Referring to my example in the RFC
discussion, this bug causes flushing every 3rd application put()
instead of every 4th.

>
>
> > 3. Optimization: The most recent (hot) objects are flushed, leaving
> > the oldest (cold) objects in the mempool cache.
> > This is bad for CPUs with a small L1 cache, because when they get
> > objects from the mempool after the mempool cache has been flushed,
> > they get cold objects instead of hot objects.
> > Now, the existing (cold) objects in the mempool cache are flushed
> > before the new (hot) objects are added to the mempool cache.
> >
> > 4. Optimization: Using the x86 variant of rte_memcpy() is
> > inefficient here, where n is relatively small and unknown at
> > compile time.
> > Now, it has been replaced by an alternative copying method,
> > optimized for the fact that most Ethernet PMDs operate in bursts of
> > 4 or 8 mbufs or multiples thereof.
>
> For these optimizations, do you have an idea of what the performance
> gain is? Ideally (I understand it is not always possible), each
> optimization is done separately, and its impact is measured.

Regarding 3: I don't have access to hardware with a CPU with a small L1
cache. But the algorithm was structurally wrong, so I think it should
be fixed. Not working with such hardware ourselves, I labeled it an
"optimization"... if the patch came from someone with affected
hardware, it could reasonably have been labeled a "bug fix".

Regarding 4: I'll stick with rte_memcpy() in the "fix" patch, and
provide a separate optimization patch with performance information.
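Coming back to the off-by-one comparison in point 2, the effect can be
seen in isolation with a small standalone program. This is just an
illustrative sketch: CACHE_SIZE and the burst size of 32 are made-up
example values, only the 1.5 multiplier matches the flush multiplier
mentioned above.

#include <stdio.h>

/* Example values only; in the mempool library these come from the
 * cache configuration (cache->size and cache->flushthresh).
 */
#define CACHE_SIZE      256
#define FLUSH_THRESHOLD ((unsigned int)(CACHE_SIZE * 1.5)) /* 384 */

int main(void)
{
	unsigned int len;

	/* Grow the cache length in bursts of 32 objects and compare the
	 * old (">=") and the corrected (">") flush conditions.
	 */
	for (len = 0; len <= FLUSH_THRESHOLD + 32; len += 32) {
		int old_flush = (len >= FLUSH_THRESHOLD);
		int new_flush = (len > FLUSH_THRESHOLD);

		if (old_flush != new_flush)
			printf("len=%u: old comparison flushes, new one does not\n",
			    len);
	}

	return 0;
}

The two conditions only disagree when len lands exactly on the flush
threshold (len=384 here), which is the premature flush described above.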
>
>
> > v2 changes:
> >
> > - Not adding the new objects to the mempool cache before flushing
> > it also allows the memory allocated for the mempool cache to be
> > reduced from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
> > However, such a change would break the ABI, so it was removed in v2.
> >
> > - The mempool cache should be cache line aligned for the benefit of
> > the copying method, which on some CPU architectures performs worse
> > on data crossing a cache boundary.
> > However, such a change would break the ABI, so it was removed in
> > v2; and yet another alternative copying method replaced the
> > rte_memcpy().
>
> OK, we may want to keep this in mind for the next ABI breakage.

Sounds good.

>
>
> >
> > v3 changes:
> >
> > - Actually remove my modifications of the rte_mempool_cache
> > structure.
> >
> > Signed-off-by: Morten Brørup
> > ---
> >  lib/mempool/rte_mempool.h | 51 +++++++++++++++++++++++++++++----------
> >  1 file changed, 38 insertions(+), 13 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..7b364cfc74 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1334,6 +1334,7 @@ static __rte_always_inline void
> >  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
> >  			   unsigned int n, struct rte_mempool_cache *cache)
> >  {
> > +	uint32_t index;
> >  	void **cache_objs;
> >
> >  	/* increment stat now, adding in mempool always success */
> > @@ -1344,31 +1345,56 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
> >  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_enqueue;
> >
> > -	cache_objs = &cache->objs[cache->len];
> > +	/* If the request itself is too big for the cache */
> > +	if (unlikely(n > cache->flushthresh))
> > +		goto ring_enqueue;
> >
> >  	/*
> >  	 * The cache follows the following algorithm
> > -	 *   1. Add the objects to the cache
> > -	 *   2. Anything greater than the cache min value (if it crosses the
> > -	 *   cache flush threshold) is flushed to the ring.
> > +	 *   1. If the objects cannot be added to the cache without
> > +	 *   crossing the flush threshold, flush the cache to the ring.
> > +	 *   2. Add the objects to the cache.
> >  	 */
> >
> > -	/* Add elements back into the cache */
> > -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> >
> > -	cache->len += n;
> > +	if (cache->len + n <= cache->flushthresh) {
> > +		cache_objs = &cache->objs[cache->len];
> >
> > -	if (cache->len >= cache->flushthresh) {
> > -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> > -				cache->len - cache->size);
> > -		cache->len = cache->size;
> > +		cache->len += n;
> > +	} else {
> > +		cache_objs = cache->objs;
> > +
> > +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> > +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
> > +			rte_panic("cannot put objects in mempool\n");
> > +#else
> > +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > +#endif
> > +		cache->len = n;
> > +	}
> > +
> > +	/* Add the objects to the cache. */
> > +	for (index = 0; index < (n & ~0x3); index += 4) {
> > +		cache_objs[index] = obj_table[index];
> > +		cache_objs[index + 1] = obj_table[index + 1];
> > +		cache_objs[index + 2] = obj_table[index + 2];
> > +		cache_objs[index + 3] = obj_table[index + 3];
> > +	}
> > +	switch (n & 0x3) {
> > +	case 3:
> > +		cache_objs[index] = obj_table[index];
> > +		index++; /* fallthrough */
> > +	case 2:
> > +		cache_objs[index] = obj_table[index];
> > +		index++; /* fallthrough */
> > +	case 1:
> > +		cache_objs[index] = obj_table[index];
> >  	}
> >
> >  	return;
> >
> >  ring_enqueue:
> >
> > -	/* push remaining objects in ring */
> > +	/* Put the objects into the ring */
> >  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> >  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
> >  		rte_panic("cannot put objects in mempool\n");
> > @@ -1377,7 +1403,6 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
> >  #endif
> >  }
> >
> > -
> >  /**
> >   * Put several objects back in the mempool.
> >   *
> > --
> > 2.17.1
> >
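PS: The 4-objects-at-a-time copy loop from the patch can also be
checked in isolation. The sketch below just lifts the loop and the
remainder switch into a standalone function; the function name
copy_objs and the test values are made up for the exercise and are not
part of the patch.

#include <stdio.h>
#include <stdint.h>

/* Standalone copy of the unrolled loop, so the remainder handling
 * (n not being a multiple of 4) can be verified by itself.
 */
static void
copy_objs(void **dst, void * const *src, unsigned int n)
{
	uint32_t index;

	for (index = 0; index < (n & ~0x3); index += 4) {
		dst[index] = src[index];
		dst[index + 1] = src[index + 1];
		dst[index + 2] = src[index + 2];
		dst[index + 3] = src[index + 3];
	}
	switch (n & 0x3) {
	case 3:
		dst[index] = src[index];
		index++; /* fallthrough */
	case 2:
		dst[index] = src[index];
		index++; /* fallthrough */
	case 1:
		dst[index] = src[index];
	}
}

int main(void)
{
	void *src[7], *dst[7] = { NULL };
	unsigned int i;

	/* Use small fake pointer values so the copy is easy to inspect. */
	for (i = 0; i < 7; i++)
		src[i] = (void *)(uintptr_t)(i + 1);

	copy_objs(dst, src, 7); /* one unrolled pass of 4 plus a remainder of 3 */

	for (i = 0; i < 7; i++)
		printf("dst[%u] = %p\n", i, dst[i]);

	return 0;
}

With n = 7 the loop copies four objects and the switch falls through
from case 3 down to case 1 for the remaining three.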