From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 27 Oct 2022 17:20:51 +0200
From: Olivier Matz
To: Morten Brørup
Cc: andrew.rybchenko@oktetlabs.ru, jerinj@marvell.com, thomas@monjalon.net,
 bruce.richardson@intel.com, dev@dpdk.org
Subject: Re: [PATCH] mempool: cache align mempool cache objects
References: <98CBD80474FA8B44BF855DF32C47DC35D8744E@smartserver.smartshare.dk>
 <20221026144436.71068-1-mb@smartsharesystems.com>
 <98CBD80474FA8B44BF855DF32C47DC35D87450@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D87452@smartserver.smartshare.dk>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87452@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

On Thu, Oct 27, 2022 at 02:11:29PM +0200, Morten Brørup wrote:
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Thursday, 27 October 2022 13.43
> >
> > On Thu, Oct 27, 2022 at 11:22:07AM +0200, Morten Brørup wrote:
> > > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > > Sent: Thursday, 27 October 2022 10.35
> > > >
> > > > Hi Morten,
> > > >
> > > > On Wed, Oct 26, 2022 at 04:44:36PM +0200, Morten Brørup wrote:
> > > > > Add __rte_cache_aligned to the objs array.
> > > > >
> > > > > It makes no difference in the general case, but if get/put operations are
> > > > > always 32 objects, it will reduce the number of memory (or last level
> > > > > cache) accesses from five to four 64 B cache lines for every get/put
> > > > > operation.
> > > > >
> > > > > For readability reasons, an example using 16 objects follows:
> > > > >
> > > > > Currently, with 16 objects (128B), we access to 3
> > > > > cache lines:
> > > > >
> > > > >       ┌────────┐
> > > > >       │len     │
> > > > > cache │********│---
> > > > > line0 │********│ ^
> > > > >       │********│ |
> > > > >       ├────────┤ | 16 objects
> > > > >       │********│ | 128B
> > > > > cache │********│ |
> > > > > line1 │********│ |
> > > > >       │********│ |
> > > > >       ├────────┤ |
> > > > >       │********│_v_
> > > > > cache │        │
> > > > > line2 │        │
> > > > >       │        │
> > > > >       └────────┘
> > > > >
> > > > > With the alignment, it is also 3 cache lines:
> > > > >
> > > > >       ┌────────┐
> > > > >       │len     │
> > > > > cache │        │
> > > > > line0 │        │
> > > > >       │        │
> > > > >       ├────────┤---
> > > > >       │********│ ^
> > > > > cache │********│ |
> > > > > line1 │********│ |
> > > > >       │********│ |
> > > > >       ├────────┤ | 16 objects
> > > > >       │********│ | 128B
> > > > > cache │********│ |
> > > > > line2 │********│ |
> > > > >       │********│ v
> > > > >       └────────┘---
> > > > >
> > > > > However, accessing the objects at the bottom of the mempool cache is a
> > > > > special case, where cache line0 is also used for objects.
> > > > >
> > > > > Consider the next burst (and any following bursts):
> > > > >
> > > > > Current:
> > > > >       ┌────────┐
> > > > >       │len     │
> > > > > cache │        │
> > > > > line0 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │        │
> > > > > cache │        │
> > > > > line1 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │        │
> > > > > cache │********│---
> > > > > line2 │********│ ^
> > > > >       │********│ |
> > > > >       ├────────┤ | 16 objects
> > > > >       │********│ | 128B
> > > > > cache │********│ |
> > > > > line3 │********│ |
> > > > >       │********│ |
> > > > >       ├────────┤ |
> > > > >       │********│_v_
> > > > > cache │        │
> > > > > line4 │        │
> > > > >       │        │
> > > > >       └────────┘
> > > > > 4 cache lines touched, incl. line0 for len.
> > > > >
> > > > > With the proposed alignment:
> > > > >       ┌────────┐
> > > > >       │len     │
> > > > > cache │        │
> > > > > line0 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │        │
> > > > > cache │        │
> > > > > line1 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │        │
> > > > > cache │        │
> > > > > line2 │        │
> > > > >       │        │
> > > > >       ├────────┤
> > > > >       │********│---
> > > > > cache │********│ ^
> > > > > line3 │********│ |
> > > > >       │********│ | 16 objects
> > > > >       ├────────┤ | 128B
> > > > >       │********│ |
> > > > > cache │********│ |
> > > > > line4 │********│ |
> > > > >       │********│_v_
> > > > >       └────────┘
> > > > > Only 3 cache lines touched, incl. line0 for len.
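
To make the layout difference above concrete, here is a minimal standalone
sketch, not the actual rte_mempool.h: the field set is trimmed to what the
diagrams show, CACHE_LINE_SIZE and CACHE_MAX_SIZE are local stand-ins for the
RTE_ constants, and __attribute__((aligned(...))) stands in for
__rte_cache_aligned. It only prints where objs[] lands in each variant.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_LINE_SIZE 64
    #define CACHE_MAX_SIZE  512   /* stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

    /* Current layout: objs[] starts 16 bytes into cache line 0, so the first
     * few pointers share that line with size/flushthresh/len. */
    struct cache_packed {
            uint32_t size;
            uint32_t flushthresh;
            uint32_t len;
            void *objs[CACHE_MAX_SIZE * 2];
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    /* Proposed layout: objs[] starts on its own cache line, leaving the rest
     * of line 0 after len unused. */
    struct cache_aligned {
            uint32_t size;
            uint32_t flushthresh;
            uint32_t len;
            void *objs[CACHE_MAX_SIZE * 2] __attribute__((aligned(CACHE_LINE_SIZE)));
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    int main(void)
    {
            printf("packed:  objs at offset %zu\n", offsetof(struct cache_packed, objs));  /* 16 */
            printf("aligned: objs at offset %zu\n", offsetof(struct cache_aligned, objs)); /* 64 */
            return 0;
    }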
> > > >
> > > > I understand your logic, but are we sure that having an application that
> > > > works with bulks of 32 means that the cache will stay aligned to 32
> > > > elements for the whole life of the application?
> > > >
> > > > In an application, the alignment of the cache can change if you have
> > > > any of:
> > > > - software queues (reassembly for instance)
> > > > - packet duplication (bridge, multicast)
> > > > - locally generated packets (keepalive, control protocol)
> > > > - pipeline to other cores
> > > >
> > > > Even with testpmd, which works by bulks of 32, I can see that the size
> > > > of the cache filling is not aligned to 32. Right after starting the
> > > > application, we already have this:
> > > >
> > > >   internal cache infos:
> > > >     cache_size=250
> > > >     cache_count[0]=231
> > > >
> > > > This is probably related to the hw rx rings size, number of queues,
> > > > number of ports.
> > > >
> > > > The "250" default value for cache size in testpmd is questionable, but
> > > > with --mbcache=256, the behavior is similar.
> > > >
> > > > Also, when we transmit to a NIC, the mbufs are not returned immediately
> > > > to the pool, they may stay in the hw tx ring during some time, which is
> > > > a driver decision.
> > > >
> > > > After processing traffic on cores 8 and 24 with this testpmd, I get:
> > > >   cache_count[0]=231
> > > >   cache_count[8]=123
> > > >   cache_count[24]=122
> > > >
> > > > In my opinion, it is not realistic to think that the mempool cache will
> > > > remain aligned to cachelines. In these conditions, it looks better to
> > > > keep the structure packed to avoid wasting memory.
> > >
> > > I agree that it is a special use case to only access the mempool cache in
> > > bursts of 32 objects, so the accesses are always cache line
> > > aligned. (Generalized, the burst size need not be 32; any burst size
> > > that is a multiple of RTE_CACHE_LINE_SIZE/sizeof(void*), i.e. a multiple
> > > of 8 on a 64-bit architecture, will do.)
> >
> > Is there a real situation where it happens to always have read/write
> > accesses per bulks of 32? From what I see in my quick test, it is not
> > the case, even with testpmd.
> >
> > > Adding a hole of 52 bytes per mempool cache is nothing, considering
> > > that the mempool cache already uses 8 KB (RTE_MEMPOOL_CACHE_MAX_SIZE *
> > > 2 * sizeof(void*) = 1024 * 8 bytes) for the objects.
> > >
> > > Also - assuming that memory allocations are cache line aligned - the
> > > 52 bytes of unused memory cannot be used regardless of whether they are
> > > before or after the objects. Instead of having 52 B unused after the
> > > objects, we might as well have a hole of 52 B unused before the objects.
> > > In other words: There is really no downside to this.
> >
> > Correct, the memory waste argument to nack the patch is invalid.
> >
> > > Jerin also presented a separate argument for moving the objects to
> > > another cache line than the len field: The risk of a "load-after-store
> > > stall" when loading the len field after storing objects in cache line0
> > > [1].
> > >
> > > [1]: http://inbox.dpdk.org/dev/CALBAE1P4zFYdLwoQukn5Q-V-nTvc_UBWmWjhaV2uVBXQRytSSA@mail.gmail.com/
> >
> > I'll be prudent about this justification without numbers. The case where
> > we access the objects of the first cache line (out of several KB) is
> > maybe not that frequent.
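
The cache-line counts from the commit message's "next burst" example can be
reproduced with a small back-of-the-envelope helper. This is not mempool code;
it assumes 64 B cache lines, 8 B pointers, and objs[] at byte offset 16 for the
packed layout or 64 for the aligned one, purely for illustration.

    #include <stdio.h>

    /* How many 64 B cache lines does a get/put of n objects touch, given the
     * current fill level len and the byte offset of objs[] in the structure?
     * Cache line 0 (which holds len) is always counted. */
    static unsigned int lines_touched(unsigned int objs_offset, unsigned int len,
                                      unsigned int n)
    {
            unsigned int first = objs_offset + (len - n) * 8; /* first object byte */
            unsigned int last = objs_offset + len * 8 - 1;    /* last object byte */
            unsigned int lines = last / 64 - first / 64 + 1;  /* lines holding objects */

            return (first / 64 == 0) ? lines : lines + 1;     /* add line 0 for len */
    }

    int main(void)
    {
            /* The "next burst" example: 16 objects taken from a cache holding 32. */
            printf("packed:  %u cache lines\n", lines_touched(16, 32, 16)); /* 4 */
            printf("aligned: %u cache lines\n", lines_touched(64, 32, 16)); /* 3 */
            return 0;
    }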
> >
> > > A new idea just popped into my head: The hot debug statistics
> > > counters (put_bulk, put_objs, get_success_bulk, get_success_objs)
> > > could be moved to this free space, reducing the need to touch another
> > > cache line for debug counters. I haven't thought this idea through
> > > yet; it might conflict with Jerin's comment.
> >
> > Yes, but since the stats are only enabled when RTE_LIBRTE_MEMPOOL_DEBUG
> > is set, it won't have any impact on non-debug builds.
>
> Correct, but I do expect that it would reduce the performance cost of
> using RTE_LIBRTE_MEMPOOL_DEBUG. I'll provide such a patch shortly.
>
> > Honestly, I find it hard to convince myself that it is a real
> > optimization. I don't see any reason why it would be slower though. So
> > since we already broke the mempool cache struct ABI in a previous
> > commit, and since it won't consume more memory, I'm ok to include that
> > patch.
>
> I don't know if there are any such applications now, and you are
> probably right that there are not. But this patch opens a road towards
> it.
>
> Acked-by ?

Acked-by: Olivier Matz

Thanks Morten

> > It would be great to have numbers to put some weight in the
> > balance.
>
> Yes, it would also be great if drivers didn't copy-paste code from the
> mempool library, so the performance effect of modifications in the
> mempool library would be reflected in such tests.
>
> > Olivier
> >
> > > > > Credits go to Olivier Matz for the nice ASCII graphics.
> > > > >
> > > > > Signed-off-by: Morten Brørup
> > > > > ---
> > > > >  lib/mempool/rte_mempool.h | 6 ++++--
> > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > > > index 1f5707f46a..3725a72951 100644
> > > > > --- a/lib/mempool/rte_mempool.h
> > > > > +++ b/lib/mempool/rte_mempool.h
> > > > > @@ -86,11 +86,13 @@ struct rte_mempool_cache {
> > > > >  	uint32_t size; /**< Size of the cache */
> > > > >  	uint32_t flushthresh; /**< Threshold before we flush excess elements */
> > > > >  	uint32_t len; /**< Current cache count */
> > > > > -	/*
> > > > > +	/**
> > > > > +	 * Cache objects
> > > > > +	 *
> > > > >  	 * Cache is allocated to this size to allow it to overflow in certain
> > > > >  	 * cases to avoid needless emptying of cache.
> > > > >  	 */
> > > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> > > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> > > > >  } __rte_cache_aligned;
> > > > >
> > > > >  /**
> > > > > --
> > > > > 2.17.1
> > > > >
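
For completeness, a purely hypothetical sketch of the debug-counter idea
discussed in the thread above. This is not a DPDK patch: the struct name is
made up, the RTE_ macros and __rte_cache_aligned are assumed to come from the
DPDK headers, and whether this placement actually helps, or conflicts with
Jerin's load-after-store concern, is untested.

    #include <stdint.h>
    /* Assumes <rte_common.h> and <rte_mempool.h> provide __rte_cache_aligned,
     * RTE_MEMPOOL_CACHE_MAX_SIZE and RTE_LIBRTE_MEMPOOL_DEBUG. */

    struct rte_mempool_cache_sketch {       /* hypothetical name, illustration only */
            uint32_t size;        /**< Size of the cache */
            uint32_t flushthresh; /**< Threshold before we flush excess elements */
            uint32_t len;         /**< Current cache count */
    #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
            /* The four hot counters (32 bytes) fit in the hole that the
             * alignment of objs[] opens up after len, so debug builds would
             * not touch an extra cache line for them. */
            uint64_t put_bulk;
            uint64_t put_objs;
            uint64_t get_success_bulk;
            uint64_t get_success_objs;
    #endif
            /** Cache objects */
            void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
    } __rte_cache_aligned;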