Subject: RE: [PATCH v4] mempool: fix mempool cache flushing algorithm
Date: Thu, 7 Apr 2022 13:36:17 +0200
From: Morten Brørup <mb@smartsharesystems.com>
To: Bruce Richardson <bruce.richardson@intel.com>

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 7 April 2022 12.44
>
> On Thu, Apr 07, 2022 at 11:32:12AM +0100, Bruce Richardson wrote:
> > On Thu, Apr 07, 2022 at 11:26:53AM +0200, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Thursday, 7 April 2022 11.14
> > > >
> > > > On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > Sent: Wednesday, 2 February 2022 11.34
> > > > > >
> > > > > > This patch fixes the rte_mempool_do_generic_put() caching
> > > > > > algorithm, which was fundamentally wrong, causing multiple
> > > > > > performance issues when flushing.
> > > > >
> > > > > [...]
> > > > >
> > > > > Olivier,
> > > > >
> > > > > Will you please consider this patch [1] and the other one [2].
> > > > >
> > > > > The primary bug here is this: When a mempool cache becomes full
> > > > > (i.e. exceeds the "flush threshold") and is flushed to the backing
> > > > > ring, it is still full afterwards; but it should be empty
> > > > > afterwards. It is not flushed entirely; only the elements
> > > > > exceeding "size" are flushed.
> > > >
> > > > I don't believe it should be flushed entirely; there should always
> > > > be some elements left, so that even after a flush we can still
> > > > allocate an additional burst. We want to avoid the situation where
> > > > a flush of all elements is immediately followed by a refill of new
> > > > elements. However, we can flush to maybe size/2 and improve things.
> > > > In short, this not emptying is by design rather than a bug, though
> > > > we can look to tweak the behaviour.
> > >
> > > I initially agreed with you about flushing to size/2.
> > >
> > > However, I did think further about it when I wrote the patch, and
> > > came to this conclusion: If an application thread repeatedly puts
> > > objects into the mempool, and does it so often that the cache
> > > overflows (i.e. reaches the flush threshold) and needs to be flushed,
> > > it is far more likely that the application thread will continue doing
> > > that, rather than start getting objects from the mempool. This speaks
> > > for flushing the cache entirely.
> > >
> > > Both solutions are better than flushing to size, so if there is a
> > > preference for keeping some objects in the cache after flushing, I
> > > can update the patch accordingly.

I forgot to mention some details here...

The cache is a stack, so leaving objects in it after flushing can be done
in one of two ways:

1. Flush the top objects and leave the bottom objects, which are
extremely cold.

2. Flush the bottom objects and move the objects from the top to the
bottom, which is a costly operation.

Theoretically, there is a third option: Make the stack a circular buffer,
so its "bottom pointer" can be moved around instead of copying objects
from the top to the bottom after flushing. However, this would add
complexity when copying arrays to/from the stack in the normal cases,
i.e. to/from the application, and it introduces requirements on the cache
size. So I quickly discarded the idea when it first came to me.

The provided patch flushes the entire cache and then stores the newly
added objects (the ones causing the flush) in the cache, as sketched at
the end of this mail. So the cache is not completely empty after
flushing: it contains some (but not many) objects, and they are hot.

> > Would it be worth looking at adding per-core hinting to the mempool?
> > Indicate for a core that it allocates-only, i.e. an RX thread,
> > frees-only, i.e. a TX thread, or does both alloc and free (the
> > default)? That hint could be used only on flush or refill to specify
> > whether to flush all or partial, and similarly on refill whether to
> > refill to the max possible or just to size.
>
> Actually, taking the idea further, we could always track per core
> whether a core has ever done a flush/refill and use that as the hint
> instead. It could even be done in a branch-free manner if we want. For
> example:
>
> on flush:
> keep_entries = (size >> 1) & (never_refills - 1);
>
> which will set the entries to keep to 0 if we have never had to refill,
> or to half of size if the thread has previously done refills.

Your suggestion is a good idea for a performance improvement.

We would also need "mostly" variants in addition to the "only" variants;
otherwise, the automatic detection will cause problems if it is triggered
by some rare event. And applications using the "service cores" concept
will just fall back to the default balanced alloc/free variant.

Perhaps we should fix the current bugs (my term, not consensus) first,
and then look at further performance improvements. It is already an
uphill battle getting Acks for my fixes as they are.

Another performance improvement could be hard-coding the mempool cache
size to RTE_MEMPOOL_CACHE_MAX_SIZE, so that the copying between the cache
and the backing ring can be done by a fixed-size, optimized memcpy using
vector instructions.

Obviously, whatever we do to optimize this, we should ensure optimal
handling of the common case, where objects are only moved between the
application and the mempool cache and never touch the backing ring.
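
To make the flushing behaviour described above concrete, here is a rough
sketch of the put path. It is not the literal code from the patch (nor
the current rte_mempool_do_generic_put()); it ignores statistics, the
no-cache path and the exact threshold comparison, and the helper name
cache_put_sketch() is made up for the illustration:

#include <rte_mempool.h>
#include <rte_memcpy.h>

/* Illustrative only: flush the *entire* cache to the backing ring when
 * the flush threshold is crossed, then store the newly freed (hot)
 * objects in the now-empty cache. */
static inline void
cache_put_sketch(struct rte_mempool *mp, void * const *obj_table,
		unsigned int n, struct rte_mempool_cache *cache)
{
	if (cache->len + n > cache->flushthresh) {
		/* The old behaviour flushed only cache->len - cache->size
		 * elements here, leaving the cache full up to "size". */
		rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
		cache->len = 0;
	}

	/* The n objects being put are the hottest; keep them cached. */
	rte_memcpy(&cache->objs[cache->len], obj_table, sizeof(void *) * n);
	cache->len += n;
}

The point being that after a flush the cache holds only the objects from
the current put, rather than "size" cold objects.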
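
And to show the arithmetic behind the branch-free trick in Bruce's
example above (never_refills is the flag name from that example; the
small test program around it is only added here to demonstrate the
unsigned wrap-around):

#include <stdint.h>
#include <stdio.h>

/* never_refills is 1 if this thread has never refilled from the backing
 * ring, 0 otherwise. (never_refills - 1) is then either 0x0 or
 * 0xFFFFFFFF, so the AND yields either 0 or size/2 without a branch. */
static inline uint32_t
keep_entries_on_flush(uint32_t size, uint32_t never_refills)
{
	return (size >> 1) & (never_refills - 1);
}

int
main(void)
{
	printf("%u\n", keep_entries_on_flush(512, 1)); /* 0: free-only thread */
	printf("%u\n", keep_entries_on_flush(512, 0)); /* 256: thread also refills */
	return 0;
}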