Date: Thu, 27 Oct 2022 13:42:40 +0200
From: Olivier Matz <olivier.matz@6wind.com>
To: Morten Brørup <mb@smartsharesystems.com>
Cc: andrew.rybchenko@oktetlabs.ru, jerinj@marvell.com, thomas@monjalon.net,
 bruce.richardson@intel.com, dev@dpdk.org
Subject: Re: [PATCH] mempool: cache align mempool cache objects
References: <98CBD80474FA8B44BF855DF32C47DC35D8744E@smartserver.smartshare.dk>
 <20221026144436.71068-1-mb@smartsharesystems.com>
 <98CBD80474FA8B44BF855DF32C47DC35D87450@smartserver.smartshare.dk>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87450@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

On Thu, Oct 27, 2022 at 11:22:07AM +0200, Morten Brørup wrote:
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Thursday, 27 October 2022 10.35
> >
> > Hi Morten,
> >
> > On Wed, Oct 26, 2022 at 04:44:36PM +0200, Morten Brørup wrote:
> > > Add __rte_cache_aligned to the objs array.
> > >
> > > It makes no difference in the general case, but if get/put operations
> > > are always 32 objects, it will reduce the number of memory (or last
> > > level cache) accesses from five to four 64 B cache lines for every
> > > get/put operation.
> > >
> > > For readability reasons, an example using 16 objects follows:
> > >
> > > Currently, with 16 objects (128B), we access 3 cache lines:
> > >
> > >       ┌────────┐
> > >       │len     │
> > > cache │********│---
> > > line0 │********│ ^
> > >       │********│ |
> > >       ├────────┤ | 16 objects
> > >       │********│ | 128B
> > > cache │********│ |
> > > line1 │********│ |
> > >       │********│ |
> > >       ├────────┤ |
> > >       │********│_v_
> > > cache │        │
> > > line2 │        │
> > >       │        │
> > >       └────────┘
> > >
> > > With the alignment, it is also 3 cache lines:
> > >
> > >       ┌────────┐
> > >       │len     │
> > > cache │        │
> > > line0 │        │
> > >       │        │
> > >       ├────────┤---
> > >       │********│ ^
> > > cache │********│ |
> > > line1 │********│ |
> > >       │********│ |
> > >       ├────────┤ | 16 objects
> > >       │********│ | 128B
> > > cache │********│ |
> > > line2 │********│ |
> > >       │********│ v
> > >       └────────┘---
> > >
> > > However, accessing the objects at the bottom of the mempool cache is
> > > a special case, where cache line0 is also used for objects.
> > >
> > > Consider the next burst (and any following bursts):
> > >
> > > Current:
> > >       ┌────────┐
> > >       │len     │
> > > cache │        │
> > > line0 │        │
> > >       │        │
> > >       ├────────┤
> > >       │        │
> > > cache │        │
> > > line1 │        │
> > >       │        │
> > >       ├────────┤
> > >       │        │
> > > cache │********│---
> > > line2 │********│ ^
> > >       │********│ |
> > >       ├────────┤ | 16 objects
> > >       │********│ | 128B
> > > cache │********│ |
> > > line3 │********│ |
> > >       │********│ |
> > >       ├────────┤ |
> > >       │********│_v_
> > > cache │        │
> > > line4 │        │
> > >       │        │
> > >       └────────┘
> > > 4 cache lines touched, incl. line0 for len.
> > >
> > > With the proposed alignment:
> > >       ┌────────┐
> > >       │len     │
> > > cache │        │
> > > line0 │        │
> > >       │        │
> > >       ├────────┤
> > >       │        │
> > > cache │        │
> > > line1 │        │
> > >       │        │
> > >       ├────────┤
> > >       │        │
> > > cache │        │
> > > line2 │        │
> > >       │        │
> > >       ├────────┤
> > >       │********│---
> > > cache │********│ ^
> > > line3 │********│ |
> > >       │********│ | 16 objects
> > >       ├────────┤ | 128B
> > >       │********│ |
> > > cache │********│ |
> > > line4 │********│ |
> > >       │********│_v_
> > >       └────────┘
> > > Only 3 cache lines touched, incl. line0 for len.
> >
> > I understand your logic, but are we sure that having an application
> > that works with bulks of 32 means that the cache will stay aligned to
> > 32 elements for the whole life of the application?
> >
> > In an application, the alignment of the cache can change if you have
> > any of:
> > - software queues (reassembly for instance)
> > - packet duplication (bridge, multicast)
> > - locally generated packets (keepalive, control protocol)
> > - pipeline to other cores
> >
> > Even with testpmd, which works by bulks of 32, I can see that the size
> > of the cache filling is not aligned to 32. Right after starting the
> > application, we already have this:
> >
> >   internal cache infos:
> >     cache_size=250
> >     cache_count[0]=231
> >
> > This is probably related to the hw rx rings size, number of queues,
> > number of ports.
> >
> > The "250" default value for cache size in testpmd is questionable, but
> > with --mbcache=256, the behavior is similar.
> >
> > Also, when we transmit to a NIC, the mbufs are not returned immediately
> > to the pool, they may stay in the hw tx ring during some time, which is
> > a driver decision.
> >
> > After processing traffic on cores 8 and 24 with this testpmd, I get:
> >   cache_count[0]=231
> >   cache_count[8]=123
> >   cache_count[24]=122
> >
> > In my opinion, it is not realistic to think that the mempool cache
> > will remain aligned to cachelines.
> > In these conditions, it looks better to keep the structure packed to
> > avoid wasting memory.

> I agree that it is a special use case to only access the mempool cache
> in bursts of 32 objects, so the accesses are always cache line
> aligned. (Generalized, the burst size need not be 32; any burst size
> that is a multiple of RTE_CACHE_LINE_SIZE/sizeof(void*), i.e. a
> multiple of 8 on a 64-bit architecture, will do.)

Is there a real situation where it happens to always have read/write
accesses per bulks of 32? From what I see in my quick test, it is not
the case, even with testpmd.

> Adding a hole of 52 bytes per mempool cache is nothing, considering
> that the mempool cache already uses 8 KB (RTE_MEMPOOL_CACHE_MAX_SIZE *
> 2 * sizeof(void*) = 1024 * 8 bytes) for the objects.
>
> Also - assuming that memory allocations are cache line aligned - the
> 52 bytes of unused memory cannot be used regardless of whether they
> are before or after the objects. Instead of having 52 B unused after
> the objects, we might as well have a hole of 52 B unused before the
> objects. In other words: there is really no downside to this.

Correct, the memory waste argument to nack the patch is invalid.

> Jerin also presented a separate argument for moving the objects to
> another cache line than the len field: the risk of a "load-after-store
> stall" when loading the len field after storing objects in cache line0
> [1].
>
> [1]: http://inbox.dpdk.org/dev/CALBAE1P4zFYdLwoQukn5Q-V-nTvc_UBWmWjhaV2uVBXQRytSSA@mail.gmail.com/

I'll be prudent on this justification without numbers. The case where
we access the objects of the first cache line (among several KB) is
maybe not that frequent.

> A new idea just popped into my head: the hot debug statistics
> counters (put_bulk, put_objs, get_success_bulk, get_success_objs)
> could be moved to this free space, reducing the need to touch another
> cache line for debug counters. I haven't thought this idea through
> yet; it might conflict with Jerin's comment.
Yes, but since the stats are only enabled when RTE_LIBRTE_MEMPOOL_DEBUG
is set, it won't have any impact on non-debug builds.

Honestly, I find it hard to convince myself that it is a real
optimization. I don't see any reason why it would be slower, though. So,
since we already broke the mempool cache struct ABI in a previous
commit, and since it won't consume more memory, I'm ok to include that
patch.

It would be great to have numbers to put some weight in the balance.

> >
> > Olivier
> >
> > >
> > > Credits go to Olivier Matz for the nice ASCII graphics.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > >  lib/mempool/rte_mempool.h | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 1f5707f46a..3725a72951 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -86,11 +86,13 @@ struct rte_mempool_cache {
> > >  	uint32_t size; /**< Size of the cache */
> > >  	uint32_t flushthresh; /**< Threshold before we flush excess elements */
> > >  	uint32_t len; /**< Current cache count */
> > > -	/*
> > > +	/**
> > > +	 * Cache objects
> > > +	 *
> > >  	 * Cache is allocated to this size to allow it to overflow in certain
> > >  	 * cases to avoid needless emptying of cache.
> > >  	 */
> > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> > >  } __rte_cache_aligned;
> > >
> > >  /**
> > > --
> > > 2.17.1
> >