From: Morten Brørup
To: olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru
Cc: bruce.richardson@intel.com, jerinjacobk@gmail.com, dev@dpdk.org,
	Morten Brørup
Subject: [PATCH v2] mempool: fix put objects to mempool with cache
Date: Wed, 19 Jan 2022 15:52:36 +0100
Message-Id: <20220119145236.42431-1-mb@smartsharesystems.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D86DB2@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35D86DB2@smartserver.smartshare.dk>

This patch optimizes the rte_mempool_do_generic_put() caching algorithm
and fixes a bug in it.

The existing algorithm was:
 1. Add the objects to the cache.
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
"size" objects, which is reflected in the above description.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.
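For readability, here is a minimal sketch of that flow, assuming
<rte_mempool.h> is included. The helper name is made up; statistics,
debug checks, the "request larger than the flush threshold" fallback to
the ring, and the unrolled copy are all omitted. The authoritative
change is the diff below.

    /* Hypothetical sketch of the new put path described above. */
    static inline void
    sketch_generic_put(struct rte_mempool *mp, void * const *obj_table,
                       unsigned int n, struct rte_mempool_cache *cache)
    {
            unsigned int i;

            /* 1. Adding n objects would cross the flush threshold:
             *    flush the whole (cold) cache to the ring first.
             */
            if (cache->len + n > cache->flushthresh) {
                    rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
                    cache->len = 0;
            }

            /* 2. Add the new (hot) objects to the cache. */
            for (i = 0; i < n; i++)
                    cache->objs[cache->len + i] = obj_table[i];
            cache->len += n;
    }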
This patch changes these details:

1. Bug: The cache was still full after flushing.
In the opposite direction, i.e. when getting objects from the cache,
the cache is refilled to full level when it crosses the low watermark
(which happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events is out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
NB: When the application is in a state where put/get events are in
balance, the cache should remain within its low and high watermarks,
and the algorithms for refilling/flushing the cache should not come
into play.
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards; a worked example follows this list.

2. Minor bug: The flush threshold comparison has been corrected; it
must be "len > flushthresh", not "len >= flushthresh".
Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
would be flushed already when reaching size elements, not when
exceeding size elements.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.

3. Optimization: The most recent (hot) objects are flushed, leaving the
oldest (cold) objects in the mempool cache.
This is bad for CPUs with a small L1 cache, because when they get
objects from the mempool after the mempool cache has been flushed, they
get cold objects instead of hot objects.
Now, the existing (cold) objects in the mempool cache are flushed
before the new (hot) objects are added to the mempool cache.

4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
here, where n is relatively small and unknown at compile time.
Now, it has been replaced by an alternative copying method, optimized
for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
or multiples thereof.
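To illustrate points 1 and 2 with hypothetical numbers (using the
1.5 x cache size flush threshold mentioned above): with a cache size of
512, the flush threshold is 768. If the cache holds 760 objects and 32
more are put, the old code would reach 792 >= 768 and flush back down
to 512 objects, i.e. the cache would still be full afterwards; the new
code sees that 760 + 32 = 792 > 768, flushes all 760 (cold) objects to
the ring first, and keeps only the 32 newly put (hot) objects in the
cache.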
v2 changes:

- Not adding the new objects to the mempool cache before flushing it
  also allows the memory allocated for the mempool cache to be reduced
  from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE. However, such a change
  would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
  copying method, which on some CPU architectures performs worse on
  data crossing a cache boundary. However, such a change would break
  the ABI, so it was removed in v2; and yet another alternative copying
  method replaced the rte_memcpy().

Signed-off-by: Morten Brørup
---
 lib/mempool/rte_mempool.h | 54 +++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..8a7067ee5b 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -94,7 +94,8 @@ struct rte_mempool_cache {
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
+	/**< Cache objects */
 } __rte_cache_aligned;
 
 /**
@@ -1334,6 +1335,7 @@ static __rte_always_inline void
 rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
+	uint32_t index;
 	void **cache_objs;
 
 	/* increment stat now, adding in mempool always success */
@@ -1344,31 +1346,56 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = cache->objs;
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
+	}
+
+	/* Add the objects to the cache. */
+	for (index = 0; index < (n & ~0x3); index += 4) {
+		cache_objs[index] = obj_table[index];
+		cache_objs[index + 1] = obj_table[index + 1];
+		cache_objs[index + 2] = obj_table[index + 2];
+		cache_objs[index + 3] = obj_table[index + 3];
+	}
+	switch (n & 0x3) {
+	case 3:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 2:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 1:
+		cache_objs[index] = obj_table[index];
 	}
 
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
@@ -1377,7 +1404,6 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 #endif
 }
 
-
 /**
  * Put several objects back in the mempool.
  *
-- 
2.17.1
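As a usage note: the path modified above is reached through the public
mempool API roughly as follows. This is a hypothetical sketch (the
helper name and burst size are made up), and it assumes the mempool was
created with a non-zero per-lcore cache size; otherwise
rte_mempool_default_cache() returns NULL and the ring is used directly.

    #include <rte_lcore.h>
    #include <rte_mempool.h>

    #define BURST 32	/* arbitrary burst size for illustration */

    /* Hypothetical helper: get and put a burst through the default
     * per-lcore cache, which is the code path touched by this patch.
     */
    static int
    exercise_cache(struct rte_mempool *mp)
    {
            void *objs[BURST];
            struct rte_mempool_cache *cache =
                    rte_mempool_default_cache(mp, rte_lcore_id());

            /* Get a burst; refills the cache from the ring if needed. */
            if (rte_mempool_generic_get(mp, objs, BURST, cache) < 0)
                    return -1;

            /* Put it back; flushes the cache to the ring if the flush
             * threshold would be crossed (the behaviour changed above).
             */
            rte_mempool_generic_put(mp, objs, BURST, cache);
            return 0;
    }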