From: Morten Brørup <mb@smartsharesystems.com>
To: olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru
Cc: bruce.richardson@intel.com, jerinjacobk@gmail.com, dev@dpdk.org,
	Morten Brørup <mb@smartsharesystems.com>
Subject: [PATCH v4] mempool: fix mempool cache flushing algorithm
Date: Wed, 2 Feb 2022 11:33:54 +0100
Message-Id: <20220202103354.79832-1-mb@smartsharesystems.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D86DB2@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35D86DB2@smartserver.smartshare.dk>

This patch fixes the rte_mempool_do_generic_put() caching algorithm,
which was fundamentally wrong, causing multiple performance issues when
flushing.

Although the bugs have serious performance implications when flushing,
the function did not fail when flushing (or otherwise), so backporting
could be considered optional.

The algorithm was:
 1. Add the objects to the cache.
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that "cache
min value" objects were kept after flushing, but the function actually
kept the cache full after flushing, which the description above
reflects.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.
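For illustration only, here is a simplified, stand-alone sketch of the
two strategies described above. The names toy_cache, backend_enqueue,
put_old and put_new are made up for this sketch and are not DPDK API;
they stand in for struct rte_mempool_cache (fields objs, len, size,
flushthresh) and rte_mempool_ops_enqueue_bulk(). The sketch also omits
the bypass for requests larger than the cache or the flush threshold,
which the actual code handles separately.

/*
 * Simplified illustration of the old and new put-path flushing
 * strategies. Stand-alone sketch; not the DPDK implementation.
 */
#include <stdio.h>
#include <string.h>

#define TOY_CACHE_MAX 1024	/* generously sized; bounds checks omitted */

struct toy_cache {
	unsigned int size;		/* nominal cache size */
	unsigned int flushthresh;	/* 1.5 * size */
	unsigned int len;		/* number of objects currently cached */
	void *objs[TOY_CACHE_MAX];
};

/* Stand-in for the ring/backend enqueue; here it only reports the flush. */
void
backend_enqueue(void * const *objs, unsigned int n)
{
	(void)objs;
	printf("flushing %u objects to the backend\n", n);
}

/* Old behaviour: add first, then flush down to "size" objects when the
 * threshold is reached, so the cache stays full of cold objects. */
void
put_old(struct toy_cache *c, void * const *obj_table, unsigned int n)
{
	memcpy(&c->objs[c->len], obj_table, sizeof(void *) * n);
	c->len += n;

	if (c->len >= c->flushthresh) {	/* note: >=, i.e. off by one */
		backend_enqueue(&c->objs[c->size], c->len - c->size);
		c->len = c->size;	/* the cache remains full */
	}
}

/* New behaviour: if adding would cross the threshold, flush the whole
 * (cold) cache first, then add the new (hot) objects at the bottom. */
void
put_new(struct toy_cache *c, void * const *obj_table, unsigned int n)
{
	void **cache_objs;

	if (c->len + n <= c->flushthresh) {
		cache_objs = &c->objs[c->len];
		c->len += n;
	} else {
		cache_objs = &c->objs[0];
		backend_enqueue(c->objs, c->len);	/* flush everything */
		c->len = n;	/* only the new (hot) objects remain */
	}

	memcpy(cache_objs, obj_table, sizeof(void *) * n);
}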
This patch fixes the following bugs:

1. The cache was still full after flushing.

In the opposite direction, i.e. when getting objects from the cache,
the cache is refilled to its full level when it crosses the low
watermark (which happens to be zero). Similarly, the cache should be
flushed to its empty level when it crosses the high watermark (which
happens to be 1.5 x the cache size).

The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events is out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).

Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

This bug degraded performance by causing too frequent flushing.

Consider this application scenario:

Either an lcore thread in the application is in a state of balance,
where it uses the mempool cache within its flush/refill boundaries; in
this situation, the flushing method is less important, and this fix is
irrelevant.

Or an lcore thread in the application is out of balance (either
permanently or temporarily), and mostly gets or puts objects from/to
the mempool. If it mostly puts objects, not flushing all of the objects
will cause more frequent flushing. This is the scenario addressed by
this fix. E.g.:

Cache size=256, flushthresh=384 (1.5 x size), initial len=256;
application burst len=32.

If there are "size" objects left in the cache after flushing, the cache
is flushed at every 4th burst. If the cache is flushed completely, it
is only flushed at every 12th burst. As you can see, this bug caused
the cache to be flushed 3x too frequently in this example. (A small
counting sketch after this bug list walks through the arithmetic.)

And when/if the application thread breaks its pattern of continuously
putting objects, and suddenly starts to get objects instead, it will
either get objects that are already in the cache, or the get() function
will refill the cache.

The concept of not flushing the cache completely was probably based on
an assumption that it is more likely for an application's lcore thread
to get() after flushing than to put() after flushing. I strongly
disagree with this assumption! If an application thread is continuously
putting so much that it overflows the cache, it is much more likely to
keep putting than to suddenly start getting. If in doubt, consider how
CPU branch predictors work: when the application has done something
many times in a row, the branch predictor will expect it to do the same
again, rather than suddenly do something else.

Also, if you consider the description of the algorithm in the source
code, and agree that "cache min value" cannot mean "cache size", the
function did not behave as intended. This is a bug in itself.

2. The flush threshold comparison was off by one.

It must be "len > flushthresh", not "len >= flushthresh". Consider a
flush multiplier of 1 instead of 1.5: the cache would already be
flushed when reaching "size" objects, not when exceeding "size"
objects. In other words, the cache would not be able to hold "size"
objects, which is clearly a bug. Now, flushing is triggered when the
flush threshold is exceeded, not when it is reached.

This bug degraded performance due to premature flushing. In the example
above, it caused flushing at every 4th burst instead of every 5th.

3. The most recent (hot) objects were flushed, leaving the oldest
(cold) objects in the mempool cache.

This bug degraded performance, because flushing prevented immediate
reuse of the (hot) objects already in the CPU cache. Now, the existing
(cold) objects in the mempool cache are flushed before the new (hot)
objects are added to the mempool cache.

4. With RTE_LIBRTE_MEMPOOL_DEBUG defined, the return value of
rte_mempool_ops_enqueue_bulk() was not checked when flushing the cache.

Now, it is checked in both locations where it is used; and obviously
still only if RTE_LIBRTE_MEMPOOL_DEBUG is defined.
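To double-check the flush-frequency arithmetic in the example under
bug 1, the small stand-alone program below counts flushes per burst for
both policies, assuming the example's parameters (size 256, flush
threshold 384, bursts of 32, continuous puts). It only models the cache
length counter; it does not use the mempool API and is not part of the
patch.

/*
 * Counting model only. Assumes the example's parameters: size = 256,
 * flushthresh = 384, burst = 32, and an application that only puts.
 */
#include <stdio.h>

int main(void)
{
	const unsigned int size = 256, flushthresh = 384, burst = 32;
	const unsigned int bursts = 1000000;
	unsigned int len, flushes, i;

	/* Old policy: add first, flush down to "size" when len >= flushthresh. */
	len = size;	/* the cache starts, and stays, full */
	flushes = 0;
	for (i = 0; i < bursts; i++) {
		len += burst;
		if (len >= flushthresh) {
			len = size;
			flushes++;
		}
	}
	printf("old policy: one flush every %u bursts\n", bursts / flushes);

	/* New policy: flush everything when the burst would cross the
	 * threshold; only the newly put objects remain afterwards. */
	len = size;
	flushes = 0;
	for (i = 0; i < bursts; i++) {
		if (len + burst > flushthresh) {
			len = 0;
			flushes++;
		}
		len += burst;
	}
	printf("new policy: one flush every %u bursts\n", bursts / flushes);

	return 0;
}

With these parameters it reports one flush every 4th burst for the old
policy and every 12th burst for the new one.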
v2 changes:

- Not adding the new objects to the mempool cache before flushing it
  also allows the memory allocated for the mempool cache to be reduced
  from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE. However, such a change
  would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
  copying method, which on some CPU architectures performs worse on
  data crossing a cache line boundary. However, such a change would
  break the ABI, so it was removed in v2; and yet another alternative
  copying method replaced the rte_memcpy().

v3 changes:

- Actually remove my modifications of the rte_mempool_cache structure.

v4 changes:

- Updated patch title to reflect that the scope of the patch is only
  mempool cache flushing.

- Do not replace rte_memcpy() with an alternative copying method. This
  was a pure optimization, not a fix.

- Elaborate even more on the bugs fixed by the modifications.

- Added a 4th bullet item to the patch description, regarding
  rte_mempool_ops_enqueue_bulk() with RTE_LIBRTE_MEMPOOL_DEBUG.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..e7e09e48fc 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = &cache->objs[0];
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
 	}
 
+	/* Add the objects to the cache. */
+	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
+
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
-- 
2.17.1