From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <66b862bc-ca22-02da-069c-a82c2d52ed49@lysator.liu.se>
Date: Tue, 5 Sep 2023 07:50:06 +0200
Subject: Re: [RFC] cache guard
To: Morten Brørup, Thomas Monjalon
Cc: Bruce Richardson, dev@dpdk.org, olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru, honnappa.nagarahalli@arm.com, konstantin.v.ananyev@yandex.ru, mattias.ronnblom@ericsson.com
References: <98CBD80474FA8B44BF855DF32C47DC35D87B39@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D87B47@smartserver.smartshare.dk>
 <2507011.Sgy9Pd6rRy@thomas>
 <98CBD80474FA8B44BF855DF32C47DC35D87B69@smartserver.smartshare.dk>
 <10be7ca8-fe5a-ba24-503a-9c6ce2fef710@lysator.liu.se>
 <98CBD80474FA8B44BF855DF32C47DC35D87B6F@smartserver.smartshare.dk>
From: Mattias Rönnblom
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87B6F@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

On 2023-09-04 14:48, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 4 September 2023 14.07
>>
>> On 2023-09-01 20:52, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Friday, 1 September 2023 18.58
>>>>
>>>> On 2023-09-01 14:26, Thomas Monjalon wrote:
>>>>> 27/08/2023 10:34, Morten Brørup:
>>>>>> +CC Honnappa and Konstantin, Ring lib maintainers
>>>>>> +CC Mattias, PRNG lib maintainer
>>>>>>
>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>> Sent: Friday, 25 August 2023 11.24
>>>>>>>
>>>>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>>>>>> +CC mempool maintainers
>>>>>>>>
>>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>>>>>
>>>>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>>>>>> Bruce,
>>>>>>>>>>
>>>>>>>>>> With this patch [1], it is noted that the ring producer and
>>>>>>>>>> consumer data should not be on adjacent cache lines, for
>>>>>>>>>> performance reasons.
>>>>>>>>>>
>>>>>>>>>> [1]:
>>>>>>>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
>>>>>>>>>>
>>>>>>>>>> (It's obvious that they cannot share the same cache line,
>>>>>>>>>> because they are accessed by two different threads.)
>>>>>>>>>>
>>>>>>>>>> Intuitively, I would think that having them on different
>>>>>>>>>> cache lines would suffice. Why does having an empty cache
>>>>>>>>>> line between them make a difference?
>>>>>>>>>>
>>>>>>>>>> And does it need to be an empty cache line? Or does it
>>>>>>>>>> suffice having the second structure start two cache lines
>>>>>>>>>> after the start of the first structure (e.g. if the size of
>>>>>>>>>> the first structure is two cache lines)?
>>>>>>>>>>
>>>>>>>>>> I'm asking because the same principle might apply to other
>>>>>>>>>> code too.
>>>>>>>>>>
>>>>>>>>> Hi Morten,
>>>>>>>>>
>>>>>>>>> this was something we discovered when working on the
>>>>>>>>> distributor library. If we have cachelines per core where
>>>>>>>>> there is heavy access, having some cachelines as a gap
>>>>>>>>> between the content cachelines can help performance. We
>>>>>>>>> believe this helps due to avoiding issues with the HW
>>>>>>>>> prefetchers (e.g. the adjacent cacheline prefetcher) bringing
>>>>>>>>> in the second cacheline speculatively when an operation is
>>>>>>>>> done on the first line.
>>>>>>>>
>>>>>>>> I guessed that it had something to do with speculative
>>>>>>>> prefetching, but wasn't sure. Good to get confirmation, and
>>>>>>>> that it has a measurable effect somewhere. Very interesting!
>>>>>>>>
>>>>>>>> NB: More comments in the ring lib about stuff like this would
>>>>>>>> be nice.
>>>>>>>>
>>>>>>>> So, for the mempool lib, what do you think about applying the
>>>>>>>> same technique to the rte_mempool_debug_stats structure (which
>>>>>>>> is an array indexed per lcore)...
>>>>>>>> Two adjacent lcores heavily accessing their local mempool
>>>>>>>> caches seems likely to me. But how heavy does the access need
>>>>>>>> to be for this technique to be relevant?
>>>>>>>>
>>>>>>>
>>>>>>> No idea how heavy the accesses need to be for this to have a
>>>>>>> noticeable effect. For things like debug stats, I wonder how
>>>>>>> worthwhile making such a change would be, but then again, any
>>>>>>> change would have very low impact too in that case.
>>>>>>
>>>>>> I just tried adding padding to some of the hot structures in our
>>>>>> own application, and observed a significant performance
>>>>>> improvement for those.
>>>>>>
>>>>>> So I think this technique should have higher visibility in DPDK
>>>>>> by adding a new cache macro to rte_common.h:
>>>>>
>>>>> +1 to make more visibility in doc and adding a macro, good idea!
>>>>>
>>>>
>>>> A worry I have is that for CPUs with large (in this context) N,
>>>> you will end up with a lot of padding to avoid next-N-lines false
>>>> sharing. That would be padding after, and in the general
>>>> (non-array) case also before, the actual per-lcore data. A slight
>>>> nuisance is also that those prefetched lines of padding will never
>>>> contain anything useful, and thus fetching them will always be a
>>>> waste.
>>>
>>> Out of curiosity, what is the largest N anyone here on the list is
>>> aware of?
>>>
>>>>
>>>> Padding/alignment may not be the only way to avoid
>>>> HW-prefetcher-induced false sharing for per-lcore data structures.
>>>>
>>>> What we are discussing here is organizing the statically allocated
>>>> per-lcore structs of a particular module in an array with the
>>>> appropriate padding/alignment. In this model, all data related to
>>>> a particular module is close (memory address/page-wise), but not
>>>> so close as to cause false sharing.
>>>>
>>>> /* rte_a.c */
>>>>
>>>> struct rte_a_state
>>>> {
>>>>         int x;
>>>>         RTE_CACHE_GUARD;
>>>> } __rte_cache_aligned;
>>>>
>>>> static struct rte_a_state a_states[RTE_MAX_LCORE];
>>>>
>>>> /* rte_b.c */
>>>>
>>>> struct rte_b_state
>>>> {
>>>>         char y;
>>>>         char z;
>>>>         RTE_CACHE_GUARD;
>>>> } __rte_cache_aligned;
>>>>
>>>> static struct rte_b_state b_states[RTE_MAX_LCORE];
>>>>
>>>> What you would end up with at runtime, when the linker has done
>>>> its job, is something that essentially looks like this (in
>>>> memory):
>>>>
>>>> struct {
>>>>         struct rte_a_state a_states[RTE_MAX_LCORE];
>>>>         struct rte_b_state b_states[RTE_MAX_LCORE];
>>>> };
>>>>
>>>> You could consider turning it around, and keeping data (i.e.,
>>>> module structs) related to a particular lcore, for all modules,
>>>> close. In other words, keeping per-lcore arrays of variable-sized
>>>> elements.
>>>>
>>>> So, something that will end up looking like this (in memory, not
>>>> in the source code):
>>>>
>>>> struct rte_lcore_state
>>>> {
>>>>         struct rte_a_state a_state;
>>>>         struct rte_b_state b_state;
>>>>         RTE_CACHE_GUARD;
>>>> };
>>>>
>>>> struct rte_lcore_state lcore_states[RTE_MAX_LCORE];
>>>>
>>>> In such a scenario, the per-lcore struct type for a module need
>>>> not (and should not) be cache-line-aligned (but may still have
>>>> some alignment requirements). Data will be more tightly packed,
>>>> and the "next lines" prefetched may actually be useful (although
>>>> I'm guessing in practice they usually will not be).
>>>>
>>>> There may be several ways to implement that scheme. The above is
>>>> to illustrate how things would look in memory, not necessarily at
>>>> the level of the source code.
>>>>
>>>> One way could be to fit the per-module-per-lcore struct in a chunk
>>>> of memory allocated from a per-lcore heap. In such a case, the
>>>> DPDK heap would need an extension, maybe with semantics similar to
>>>> those of NUMA-node-specific allocations.
>>>>
>>>> Another way would be to use thread-local storage (TLS, __thread),
>>>> although it's unclear to me how well TLS works with larger data
>>>> structures.
>>>>
>>>> A third way may be to somehow achieve something that looks like
>>>> the above example, using macros, without breaking module
>>>> encapsulation or generally being too intrusive or otherwise
>>>> cumbersome.
>>>>
>>>> Not sure this is worth the trouble (compared to just more
>>>> padding), but I thought it was an idea worth sharing.
>>>
>>> I think what Mattias suggests is relevant, and it would be great if
>>> a generic solution could be found for DPDK.
>>>
>>> For reference, we initially used RTE_PER_LCORE(module_variable),
>>> i.e. thread-local storage, extensively in our application modules.
>>> But it has two disadvantages:
>>> 1. TLS does not use hugepages. (The same applies to global and
>>> local variables, BTW.)
>>> 2. We need to set up global pointers to these TLS variables, so
>>> they can be accessed from the main lcore (e.g. for statistics).
>>> This means that every module needs some sort of
>>> module_per_lcore_init, called by the thread after its creation, to
>>> set module_global_ptr[rte_lcore_id()] =
>>> &RTE_PER_LCORE(module_variable).
>>>
>>
>> Good points. I never thought about the initialization issue.
>>
>> How about memory consumption and TLS? If you have many non-EAL
>> threads in the DPDK process, would the system allocate TLS memory
>> for DPDK lcore-specific data structures? Assuming a scenario where
>> __thread was used instead of the standard DPDK pattern.

> Room for all the __thread variables is allocated and initialized for
> every thread started.
>
> Note: RTE_PER_LCORE variables are simply __thread-wrapped variables:
>
> #define RTE_DEFINE_PER_LCORE(type, name) \
>         __thread __typeof__(type) per_lcore_##name
>
>>
>>> Eventually, we gave up and migrated to the DPDK standard design
>>> pattern of instantiating a global module_variable[RTE_MAX_LCORE],
>>> and letting each thread use its own entry in that array.
>>>
>>> And as Mattias suggests, this adds a lot of useless padding,
>>> because each module's variables now need to start on their own
>>> cache line.
>>>
>>> So a generic solution with packed per-thread data would be a great
>>> long-term solution.
>>>
>>> Short term, I can live with the simple cache guard. It is very easy
>>> to implement and use.
>>>
>> The RTE_CACHE_GUARD pattern is also intrusive, in the sense that it
>> needs to be explicitly added everywhere (just like
>> __rte_cache_aligned), error-prone, and somewhat brittle (in the face
>> of changed ).
>
> Agree. Like a lot of other stuff in DPDK. DPDK is very explicit. :-)
>
>>
>> (I mentioned this not to discourage the use of RTE_CACHE_GUARD, but
>> rather to invite something more efficient, robust and easier to
>> use.)
>
> I get that you don't discourage it, but I don't expect the discussion
> to proceed further at this time. So please don't forget to also ACK
> this patch. ;-)
>

Acked-by: Mattias Rönnblom

> There should be a wish list, where concepts like your suggested
> improvement could be easily noted.
>