Subject: RE: [RFC] mempool: modify flush threshold
Date: Fri, 28 Jan 2022 09:52:26 +0100
From: Morten Brørup
To: Olivier Matz, Jerin Jacob
Cc: Bruce Richardson, Andrew Rybchenko, Honnappa Nagarahalli, dev@dpdk.org
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D86E54@smartserver.smartshare.dk>

+Jerin, you had feedback regarding the mempool.

> From: Morten Brørup
> Sent: Friday, 28 January 2022 09.41
>
> +Honnappa, you have shown a keen interest in this topic.
>
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Monday, 24 January 2022 16.57
> >
> > On Mon, Jan 10, 2022 at 09:40:48AM +0000, Bruce Richardson wrote:
> > > On Sat, Jan 08, 2022 at 12:00:17PM +0100, Morten Brørup wrote:
> > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > Sent: Friday, 7 January 2022 16.12
> > > > >
> > > > > On Tue, Dec 28, 2021 at 03:28:45PM +0100, Morten Brørup wrote:
> > > > > > Hi mempool maintainers and DPDK team.
> > > > > >
> > > > > > Does anyone know the reason or history why CACHE_FLUSHTHRESH_MULTIPLIER was chosen to be 1.5? I think it is counterintuitive.
> > > > > >
> > > > > > The mempool cache flush threshold was introduced in DPDK version 1.3; it was not in DPDK version 1.2. The copyright notice for rte_mempool.c says year 2012.
> > > > > >
> > > > > > Here is my analysis:
> > > > > >
> > > > > > With the multiplier of 1.5, a mempool cache is allowed to be filled up to 50 % above its target size before its excess entries are flushed to the mempool (thereby reducing the cache length to the target size).
> > > > > >
> > > > > > In the opposite direction, a mempool cache is allowed to be drained completely, i.e. up to 100 % below its target size.
> > > > > >
> > > > > > My instinct tells me that it would be more natural to let a mempool cache go the same amount above and below its target size, i.e. using a flush multiplier of 2 instead of 1.5.
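
[For readers joining the thread here: the threshold in question is derived from the cache size in rte_mempool.c, and the flushing happens in the generic put path. The sketch below is paraphrased from memory rather than copied from the DPDK sources, so treat it as illustrative and check rte_mempool.c/rte_mempool.h for the exact code:]

    /* Paraphrased sketch of the current behavior, not verbatim DPDK code. */
    #define CACHE_FLUSHTHRESH_MULTIPLIER 1.5
    cache->flushthresh = cache->size * CACHE_FLUSHTHRESH_MULTIPLIER;

    /* put: objects are pushed on top of the cache (a stack); when the
     * threshold is reached, everything above "size" is flushed to the
     * backing pool, leaving exactly "size" objects in the cache. */
    memcpy(&cache->objs[cache->len], obj_table, sizeof(void *) * n);
    cache->len += n;
    if (cache->len >= cache->flushthresh) {
        rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
                cache->len - cache->size);
        cache->len = cache->size;
    }
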
> > > > > >
> > > > > > Also, the cache should be allowed to fill up to and including the flush threshold, so it is flushed when the threshold is exceeded, instead of when it is reached.
> > > > > >
> > > > > > Here is a simplified example:
> > > > > >
> > > > > > Imagine a cache target size of 32, corresponding to a typical packet burst. With a flush threshold multiplier of 2 (and len > threshold instead of len >= threshold), the cache could hold 1 +/-1 packet bursts. With the current multiplier it can only hold [0 .. 1.5[ packet bursts, not really providing a lot of elasticity.
> > > > >
> > > > > Hi Morten,
> > > > >
> > > > > Interesting to see this being looked at again. The original idea of adding in some extra room above the requested value was to avoid the worst-case scenario of a pool oscillating between full and empty repeatedly due to the addition/removal of perhaps a single packet. As for why 1.5 was chosen as the value, I don't recall any particular reason for it myself. The main objective was to have separate flush and size values, so that we could go a bit above full, and when flushing, not empty the entire cache down to zero.
> > > >
> > > > Thanks for providing the historical background for this feature, Bruce.
> > > >
> > > > > In terms of the behavioural points you make above, I wonder if symmetry is actually necessary or desirable in this case. After all, the ideal case is probably to keep the mempool neither full nor empty, so that both allocations and frees can be done without having to go to the underlying shared data structure. To accommodate this, the mempool will only flush when the number of elements is greater than size * 1.5, and then it only flushes elements down to size, ensuring that allocations can still take place. On allocation, new buffers are taken when we don't have enough in the cache to fulfil the request, and then the cache is filled up to size, not to the flush threshold.
> > > >
> > > > I agree with the ideal case.
> > > >
> > > > However, it looks like the addition of the flush threshold also changed the "size" parameter to effectively become "desired length" instead. This interpretation is also supported by the flush algorithm, which doesn't flush below the "size", but to the "size". So based on that interpretation, I was wondering why it is not symmetrical around the "desired length", but allowed to go 100 % below and only 50 % above.
> > > >
> > > > > Now, for the scenario you describe - where the mempool cache size is set to be the same as the burst size - this scheme probably does break down, in that we don't really have any burst elasticity. However, I would query if that is a configuration that is used, since to the user it should appear correctly to provide no elasticity. Looking at testpmd, and our example apps, the standard there is a burst size of 32 and a mempool cache of ~256.
> > > > > In OVS code, netdev-dpdk.c seems to initialize the mempool with a cache size of RTE_MEMPOOL_CACHE_MAX_SIZE (through the define MP_CACHE_SZ). In all these cases, I think the 1.5 threshold should work just fine for us.
> > > >
> > > > My example was for demonstration only, and hopefully not being used by any applications.
> > > >
> > > > The simplified example was intended to demonstrate the theoretical effect of the imbalance in using the 1.5 threshold. It will be similar with a cache size of 256 objects: You will be allowed to go 8 bursts (calculation: 256 objects / 32 objects per burst) below the cache "size" without retrieving objects from the backing pool, but only 4 bursts above the cache "size" before flushing objects to the backing pool.
> > > >
> > > > > That said, if you want to bump it up to 2x, I can't say I'd object strongly, as it should be harmless, I think.
> > > >
> > > > From an academic viewpoint, 2x seems much better to me than 1.5x. And based on your background information from the introduction of the flush threshold, there was no academic reason why 1.5x was chosen - this is certainly valuable feedback.
> >
> > Just to say I agree with Morten's analysis. Having a threshold at 2x size would make more sense to me.
> >
> > Without the corner cases (big requests, or common pool empty), I would have suggested this, which looks more symmetric than what we have today:
> >
> > get()
> >   fill request with as many objs as possible from cache
> >   if cache is empty, request "size" more objects from common pool
> >   fill the rest of the request from cache
>
> I agree, and this is doable without a performance penalty.
>
> > put()
> >   fill cache with provided objects up to 2x size
> >   if cache exceeds threshold, fill common pool with "size" objects
> >   fill the cache with the remaining objs from request
>
> I tried implementing this already, but it cannot be done without a performance cost, one way or the other:
>
> 1. If we flush the top of the cache (which is a stack), we are flushing the hot objects, leaving cold objects at the new top of the cache.
>
> 2. If we flush the bottom of the cache, we need to move the remaining objects in the cache down. E.g. flushing the objects at index [0..256[ in the cache means that the objects at index [256..len[ in the cache must be moved down to index [0..len-256[.
>
> So I came to the conclusion that flushing the cache needs to flush it completely. I also consider this the best choice for real applications:
>
> Either an lcore thread in the application is in a state of balance, where it uses the mempool cache within its flush/refill boundaries - which makes the flush method less important.
>
> Or an lcore thread in the application is out of balance (either permanently or temporarily), and mostly gets or puts objects from/to the mempool. If it mostly puts objects, not flushing all of the objects will cause more frequent flushing. E.g.:
>
> Cache size=256, flushthresh=512 (2x, not 1.5x), initial len=256; application burst len=32.
>
> If flushing leaves "size" objects in the cache, the cache is flushed every 8th burst. And with flushthresh=384 (1.5x), flushing occurs every 4th burst, which is what the mempool does today!
>
> If the cache is flushed completely, the cache is only flushed at every 16th burst.
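
[To make the comparison above concrete, below is one way the "flush it completely" idea could look, under the assumption that the burst is never larger than the cache. It is an illustrative sketch only - not an actual patch and not existing DPDK code - reusing the field names from the mempool cache structure:]

    /* put, sketched: if adding the objects would exceed the flush threshold,
     * flush the entire cache to the backing pool first, then store the new
     * (hot) objects in the now-empty cache. Corner cases are omitted. */
    if (cache->len + n > cache->flushthresh) {
        rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
        cache->len = 0;
    }
    memcpy(&cache->objs[cache->len], obj_table, sizeof(void *) * n);
    cache->len += n;

[With cache size=256, flushthresh=512 and burst=32, this flushes roughly every 16th burst, matching the arithmetic above.]
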
>
> And when/if the application thread breaks its pattern of continuously putting objects, and suddenly starts to get objects instead, it will either get objects already in the cache, or the get() function will refill the cache.
>
> The concept of not flushing the cache completely is based on the assumption that it is more likely for an application's lcore thread to get() after flushing than to put() after flushing. I strongly disagree with this assumption! If an application thread is continuously putting so much that it overflows the cache, it is much more likely to keep putting than it is to start getting. This is CPU branch prediction 101.
>
> > But this is a quite sensitive area, and modifying these functions should be done with care, as it impacts mbuf allocation.
>
> Certainly!
>
> The current mempool caching algorithm is defective, and I'm trying to repair it. Honnappa already provided very valuable feedback regarding CPUs with small L1 cache, which must be taken carefully into consideration.

I'm sorry... The L1 cache feedback was from Jerin, not Honnappa.

> For now, there seems to be consensus regarding changing the flush threshold multiplier from 1.5 to 2.
>
> > > > I will consider providing a patch. Worst case it probably doesn't do any harm. And if anyone is really interested, they can look at the mempool debug stats to see the difference it makes for their specific application, e.g. the put_common_pool_bulk/put_bulk and get_common_pool_bulk/get_success_bulk ratios should decrease.
> > >
> > > Sure, thanks.
> > >
> > > /Bruce
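
[For anyone who wants to measure the effect on their own application: a rough sketch of how the two ratios mentioned above could be computed, assuming a build with the mempool debug stats enabled (RTE_LIBRTE_MEMPOOL_DEBUG). The field names follow the counters named above; please verify them against rte_mempool.h, since this is written from memory. rte_mempool_dump() also prints these counters.]

    #include <stdio.h>
    #include <rte_mempool.h>

    #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
    /* Sum the per-lcore mempool counters and print the two ratios. */
    static void
    print_cache_flush_ratios(const struct rte_mempool *mp)
    {
        uint64_t put_bulk = 0, put_cp_bulk = 0;
        uint64_t get_ok_bulk = 0, get_cp_bulk = 0;
        unsigned int lcore;

        for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
            put_bulk    += mp->stats[lcore].put_bulk;
            put_cp_bulk += mp->stats[lcore].put_common_pool_bulk;
            get_ok_bulk += mp->stats[lcore].get_success_bulk;
            get_cp_bulk += mp->stats[lcore].get_common_pool_bulk;
        }
        /* Lower ratios mean fewer accesses to the backing (common) pool. */
        printf("put_common_pool_bulk/put_bulk: %.3f\n",
               put_bulk ? (double)put_cp_bulk / put_bulk : 0.0);
        printf("get_common_pool_bulk/get_success_bulk: %.3f\n",
               get_ok_bulk ? (double)get_cp_bulk / get_ok_bulk : 0.0);
    }
    #endif
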