From: Morten Brørup <mb@smartsharesystems.com>
To: olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru
Cc: bruce.richardson@intel.com, jerinjacobk@gmail.com, dev@dpdk.org,
	Morten Brørup <mb@smartsharesystems.com>
Subject: [PATCH v4] mempool: fix mempool cache flushing algorithm
Date: Wed, 2 Feb 2022 11:33:54 +0100
Message-Id: <20220202103354.79832-1-mb@smartsharesystems.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D86DB2@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35D86DB2@smartserver.smartshare.dk>

This patch fixes the rte_mempool_do_generic_put() caching algorithm,
which was fundamentally wrong, causing multiple performance issues when
flushing.

Although the bugs have serious performance implications when flushing,
the function did not fail when flushing (or otherwise), so backporting
could be considered optional.

The algorithm was:
 1. Add the objects to the cache.
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that "cache
min value" objects were kept after flushing, but the function actually
kept the cache full after flushing, which the description above
reflects.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.
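For illustration only, here is a simplified, stand-alone sketch of the
two strategies described above. The names toy_cache, backend_enqueue,
put_old and put_new are made up for this sketch and are not DPDK API;
they stand in for struct rte_mempool_cache (fields objs, len, size,
flushthresh) and rte_mempool_ops_enqueue_bulk(). The sketch also omits
the bypass for requests larger than the cache or the flush threshold,
which the actual code handles separately.

/*
 * Simplified illustration of the old and new put-path flushing
 * strategies. Stand-alone sketch; not the DPDK implementation.
 */
#include <stdio.h>
#include <string.h>

#define TOY_CACHE_MAX 1024	/* generously sized; bounds checks omitted */

struct toy_cache {
	unsigned int size;		/* nominal cache size */
	unsigned int flushthresh;	/* 1.5 * size */
	unsigned int len;		/* number of objects currently cached */
	void *objs[TOY_CACHE_MAX];
};

/* Stand-in for the ring/backend enqueue; here it only reports the flush. */
void
backend_enqueue(void * const *objs, unsigned int n)
{
	(void)objs;
	printf("flushing %u objects to the backend\n", n);
}

/* Old behaviour: add first, then flush down to "size" objects when the
 * threshold is reached, so the cache stays full of cold objects. */
void
put_old(struct toy_cache *c, void * const *obj_table, unsigned int n)
{
	memcpy(&c->objs[c->len], obj_table, sizeof(void *) * n);
	c->len += n;

	if (c->len >= c->flushthresh) {	/* note: >=, i.e. off by one */
		backend_enqueue(&c->objs[c->size], c->len - c->size);
		c->len = c->size;	/* the cache remains full */
	}
}

/* New behaviour: if adding would cross the threshold, flush the whole
 * (cold) cache first, then add the new (hot) objects at the bottom. */
void
put_new(struct toy_cache *c, void * const *obj_table, unsigned int n)
{
	void **cache_objs;

	if (c->len + n <= c->flushthresh) {
		cache_objs = &c->objs[c->len];
		c->len += n;
	} else {
		cache_objs = &c->objs[0];
		backend_enqueue(c->objs, c->len);	/* flush everything */
		c->len = n;	/* only the new (hot) objects remain */
	}

	memcpy(cache_objs, obj_table, sizeof(void *) * n);
}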
This patch fixes the following bugs:

1. The cache was still full after flushing.

In the opposite direction, i.e. when getting objects from the cache,
the cache is refilled to its full level when it crosses the low
watermark (which happens to be zero). Similarly, the cache should be
flushed to its empty level when it crosses the high watermark (which
happens to be 1.5 x the cache size).

The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events is out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).

Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

This bug degraded performance by causing too frequent flushing.

Consider this application scenario:

Either an lcore thread in the application is in a state of balance,
where it uses the mempool cache within its flush/refill boundaries; in
this situation, the flushing method is less important, and this fix is
irrelevant.

Or an lcore thread in the application is out of balance (either
permanently or temporarily), and mostly gets or puts objects from/to
the mempool. If it mostly puts objects, not flushing all of the objects
will cause more frequent flushing. This is the scenario addressed by
this fix. E.g.:

Cache size=256, flushthresh=384 (1.5 x size), initial len=256;
application burst len=32.

If there are "size" objects left in the cache after flushing, the cache
is flushed at every 4th burst. If the cache is flushed completely, it
is only flushed at every 12th burst. As you can see, this bug caused
the cache to be flushed 3x too frequently in this example. (A small
counting sketch after this bug list walks through the arithmetic.)

And when/if the application thread breaks its pattern of continuously
putting objects, and suddenly starts to get objects instead, it will
either get objects that are already in the cache, or the get() function
will refill the cache.

The concept of not flushing the cache completely was probably based on
an assumption that it is more likely for an application's lcore thread
to get() after flushing than to put() after flushing. I strongly
disagree with this assumption! If an application thread is continuously
putting so much that it overflows the cache, it is much more likely to
keep putting than to suddenly start getting. If in doubt, consider how
CPU branch predictors work: when the application has done something
many times in a row, the branch predictor will expect it to do the same
again, rather than suddenly do something else.

Also, if you consider the description of the algorithm in the source
code, and agree that "cache min value" cannot mean "cache size", the
function did not behave as intended. This is a bug in itself.

2. The flush threshold comparison was off by one.

It must be "len > flushthresh", not "len >= flushthresh". Consider a
flush multiplier of 1 instead of 1.5: the cache would already be
flushed when reaching "size" objects, not when exceeding "size"
objects. In other words, the cache would not be able to hold "size"
objects, which is clearly a bug. Now, flushing is triggered when the
flush threshold is exceeded, not when it is reached.

This bug degraded performance due to premature flushing. In the example
above, it caused flushing at every 4th burst instead of every 5th.

3. The most recent (hot) objects were flushed, leaving the oldest
(cold) objects in the mempool cache.

This bug degraded performance, because flushing prevented immediate
reuse of the (hot) objects already in the CPU cache. Now, the existing
(cold) objects in the mempool cache are flushed before the new (hot)
objects are added to the mempool cache.

4. With RTE_LIBRTE_MEMPOOL_DEBUG defined, the return value of
rte_mempool_ops_enqueue_bulk() was not checked when flushing the cache.

Now, it is checked in both locations where it is used; and obviously
still only if RTE_LIBRTE_MEMPOOL_DEBUG is defined.
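To double-check the flush-frequency arithmetic in the example under
bug 1, the small stand-alone program below counts flushes per burst for
both policies, assuming the example's parameters (size 256, flush
threshold 384, bursts of 32, continuous puts). It only models the cache
length counter; it does not use the mempool API and is not part of the
patch.

/*
 * Counting model only. Assumes the example's parameters: size = 256,
 * flushthresh = 384, burst = 32, and an application that only puts.
 */
#include <stdio.h>

int main(void)
{
	const unsigned int size = 256, flushthresh = 384, burst = 32;
	const unsigned int bursts = 1000000;
	unsigned int len, flushes, i;

	/* Old policy: add first, flush down to "size" when len >= flushthresh. */
	len = size;	/* the cache starts, and stays, full */
	flushes = 0;
	for (i = 0; i < bursts; i++) {
		len += burst;
		if (len >= flushthresh) {
			len = size;
			flushes++;
		}
	}
	printf("old policy: one flush every %u bursts\n", bursts / flushes);

	/* New policy: flush everything when the burst would cross the
	 * threshold; only the newly put objects remain afterwards. */
	len = size;
	flushes = 0;
	for (i = 0; i < bursts; i++) {
		if (len + burst > flushthresh) {
			len = 0;
			flushes++;
		}
		len += burst;
	}
	printf("new policy: one flush every %u bursts\n", bursts / flushes);

	return 0;
}

With these parameters it reports one flush every 4th burst for the old
policy and every 12th burst for the new one.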
v2 changes:

- Not adding the new objects to the mempool cache before flushing it
  also allows the memory allocated for the mempool cache to be reduced
  from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE. However, such a change
  would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
  copying method, which on some CPU architectures performs worse on
  data crossing a cache line boundary. However, such a change would
  break the ABI, so it was removed in v2; and yet another alternative
  copying method replaced the rte_memcpy().

v3 changes:

- Actually remove my modifications of the rte_mempool_cache structure.

v4 changes:

- Updated patch title to reflect that the scope of the patch is only
  mempool cache flushing.

- Do not replace rte_memcpy() with an alternative copying method. This
  was a pure optimization, not a fix.

- Elaborate even more on the bugs fixed by the modifications.

- Added a 4th bullet item to the patch description, regarding
  rte_mempool_ops_enqueue_bulk() with RTE_LIBRTE_MEMPOOL_DEBUG.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..e7e09e48fc 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = &cache->objs[0];
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
 	}
 
+	/* Add the objects to the cache. */
+	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
+
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
-- 
2.17.1