Subject: Re: [RFC] cache guard
From: Mattias Rönnblom
To: Morten Brørup, Bruce Richardson
Cc: dev@dpdk.org, olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru,
 honnappa.nagarahalli@arm.com, konstantin.v.ananyev@yandex.ru,
 mattias.ronnblom@ericsson.com
Date: Sun, 27 Aug 2023 15:55:04 +0200
Message-ID: <5a56c531-5d5c-777b-c1ea-dbcc25009220@lysator.liu.se>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87B47@smartserver.smartshare.dk>

On 2023-08-27 10:34, Morten Brørup wrote:
> +CC Honnappa and Konstantin, Ring lib maintainers
> +CC Mattias, PRNG lib maintainer
>
>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>> Sent: Friday, 25 August 2023 11.24
>>
>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>> +CC mempool maintainers
>>>
>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>> Sent: Friday, 25 August 2023 10.23
>>>>
>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>> Bruce,
>>>>>
>>>>> With this patch [1], it is noted that the ring producer and
>>>>> consumer data should not be on adjacent cache lines, for
>>>>> performance reasons.
>>>>>
>>>>> [1]:
>>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
>>>>>
>>>>> (It's obvious that they cannot share the same cache line, because
>>>>> they are accessed by two different threads.)
>>>>> Intuitively, I would think that having them on different cache
>>>>> lines would suffice. Why does having an empty cache line between
>>>>> them make a difference?
>>>>>
>>>>> And does it need to be an empty cache line? Or does it suffice
>>>>> having the second structure start at two cache lines after the
>>>>> start of the first structure (e.g. if the size of the first
>>>>> structure is two cache lines)?
>>>>>
>>>>> I'm asking because the same principle might apply to other code
>>>>> too.
>>>>>
>>>> Hi Morten,
>>>>
>>>> this was something we discovered when working on the distributor
>>>> library. If we have cachelines per core where there is heavy access,
>>>> having some cachelines as a gap between the content cachelines can
>>>> help performance. We believe this helps due to avoiding issues with
>>>> the HW prefetchers (e.g. adjacent cacheline prefetcher) bringing in
>>>> the second cacheline speculatively when an operation is done on the
>>>> first line.
>>>
>>> I guessed that it had something to do with speculative prefetching,
>>> but wasn't sure. Good to get confirmation, and that it has a
>>> measurable effect somewhere. Very interesting!
>>>
>>> NB: More comments in the ring lib about stuff like this would be nice.
>>>
>>> So, for the mempool lib, what do you think about applying the same
>>> technique to the rte_mempool_debug_stats structure (which is an
>>> array indexed per lcore)... Two adjacent lcores heavily accessing
>>> their local mempool caches seems likely to me. But how heavy does
>>> the access need to be for this technique to be relevant?
>>>
>>
>> No idea how heavy the accesses need to be for this to have a
>> noticeable effect. For things like debug stats, I wonder how
>> worthwhile making such a change would be, but then again, any change
>> would have very low impact too in that case.
>
> I just tried adding padding to some of the hot structures in our own
> application, and observed a significant performance improvement for
> those.
>
> So I think this technique should have higher visibility in DPDK by
> adding a new cache macro to rte_common.h:
>
> /**
>  * Empty cache line, to guard against speculative prefetching.
>  * "to guard against false sharing-like effects on systems with a
>  * next-N-lines hardware prefetcher"
>  * Use as spacing between data accessed by different lcores,
>  * to prevent cache thrashing on CPUs with speculative prefetching.
>  */
> #define RTE_CACHE_GUARD(name) char cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
>

You could have a macro which specifies how much guarding there needs to
be, ideally defined on a per-CPU basis. (These things have nothing to do
with the ISA, but everything to do with the implementation.)

I'm not sure N is always 1. So the guard padding should be
RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES bytes, and wrap the whole
thing in

#if RTE_CACHE_GUARD_LINES > 0
#endif

...so you can disable this (cute!) hack (on custom DPDK builds) in case
you have disabled hardware prefetching, which seems generally to be a
good idea for packet processing type applications.

...which leads me to another suggestion: add a note on disabling
hardware prefetching in the optimization guide.

Seems like a very good idea to have this in <rte_common.h>, and
otherwise make this issue visible and known.
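To illustrate the RTE_CACHE_GUARD_LINES idea, a rough sketch only; the
option name, its default value and the way it would be set are all
assumptions here, nothing agreed upon:

/* Hypothetical build-time knob; ideally set per target CPU by the
 * build system. The default of 1 is only an assumption for this sketch. */
#ifndef RTE_CACHE_GUARD_LINES
#define RTE_CACHE_GUARD_LINES 1
#endif

#if RTE_CACHE_GUARD_LINES > 0
/* Pad with one or more whole cache lines, to keep a next-N-lines
 * hardware prefetcher away from the neighboring data. */
#define RTE_CACHE_GUARD(name) \
	char cache_guard_##name[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
	__rte_cache_aligned;
#else
/* Guarding disabled (e.g. on builds for systems with HW prefetching
 * turned off): expands to nothing, so no padding is emitted.
 * With this variant the macro is used without a trailing semicolon. */
#define RTE_CACHE_GUARD(name)
#endif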
> To be used like this:
>
> struct rte_ring {
>	char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
>	/**< Name of the ring. */
>	int flags; /**< Flags supplied at creation. */
>	const struct rte_memzone *memzone;
>	/**< Memzone, if any, containing the rte_ring */
>	uint32_t size; /**< Size of ring. */
>	uint32_t mask; /**< Mask (size-1) of ring. */
>	uint32_t capacity; /**< Usable size of ring */
>
> -	char pad0 __rte_cache_aligned; /**< empty cache line */
> +	RTE_CACHE_GUARD(prod); /**< Isolate producer status. */
>
>	/** Ring producer status. */
>	union {
>		struct rte_ring_headtail prod;
>		struct rte_ring_hts_headtail hts_prod;
>		struct rte_ring_rts_headtail rts_prod;
>	} __rte_cache_aligned;
>
> -	char pad1 __rte_cache_aligned; /**< empty cache line */
> +	RTE_CACHE_GUARD(both); /**< Isolate producer from consumer. */
>
>	/** Ring consumer status. */
>	union {
>		struct rte_ring_headtail cons;
>		struct rte_ring_hts_headtail hts_cons;
>		struct rte_ring_rts_headtail rts_cons;
>	} __rte_cache_aligned;
>
> -	char pad2 __rte_cache_aligned; /**< empty cache line */
> +	RTE_CACHE_GUARD(cons); /**< Isolate consumer status. */
> };
>
>
> And for the mempool library:
>
> #ifdef RTE_LIBRTE_MEMPOOL_STATS
> /**
>  * A structure that stores the mempool statistics (per-lcore).
>  * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs) are not
>  * captured since they can be calculated from other stats.
>  * For example: put_cache_objs = put_objs - put_common_pool_objs.
>  */
> struct rte_mempool_debug_stats {
>	uint64_t put_bulk; /**< Number of puts. */
>	uint64_t put_objs; /**< Number of objects successfully put. */
>	uint64_t put_common_pool_bulk; /**< Number of bulks enqueued in common pool. */
>	uint64_t put_common_pool_objs; /**< Number of objects enqueued in common pool. */
>	uint64_t get_common_pool_bulk; /**< Number of bulks dequeued from common pool. */
>	uint64_t get_common_pool_objs; /**< Number of objects dequeued from common pool. */
>	uint64_t get_success_bulk; /**< Successful allocation number. */
>	uint64_t get_success_objs; /**< Objects successfully allocated. */
>	uint64_t get_fail_bulk; /**< Failed allocation number. */
>	uint64_t get_fail_objs; /**< Objects that failed to be allocated. */
>	uint64_t get_success_blks; /**< Successful allocation number of contiguous blocks. */
>	uint64_t get_fail_blks; /**< Failed allocation number of contiguous blocks. */
> +	RTE_CACHE_GUARD(debug_stats); /**< Isolation between lcores. */
> } __rte_cache_aligned;
> #endif
>
> struct rte_mempool {
>	[...]
> #ifdef RTE_LIBRTE_MEMPOOL_STATS
>	/** Per-lcore statistics.
>	 *
>	 * Plus one, for unregistered non-EAL threads.
>	 */
>	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
> #endif
> } __rte_cache_aligned;
>
>
> It also seems relevant for the PRNG library:
>
> /lib/eal/common/rte_random.c:
>
> struct rte_rand_state {
>	uint64_t z1;
>	uint64_t z2;
>	uint64_t z3;
>	uint64_t z4;
>	uint64_t z5;
> +	RTE_CACHE_GUARD(z);
> } __rte_cache_aligned;
>

Yes.

Should there be two cache guard macros? One parameter-free
RTE_CACHE_GUARD and a RTE_CACHE_NAMED_GUARD(name) macro?

Maybe it's better just to keep the single macro, but have a convention
with some generic name (i.e., not 'z' above) for the guard field, like
'cache_guard' or just 'guard'. Having a unique name makes no sense,
except in rare cases where you need multiple guard lines per struct.
(A parameter-free variant is sketched at the end of this mail.)

> /* One instance each for every lcore id-equipped thread, and one
>  * additional instance to be shared by all other threads (i.e., all
>  * unregistered non-EAL threads).
>  */
> static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
>
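For illustration only, a parameter-free RTE_CACHE_GUARD could be pieced
together along these lines. It relies on __COUNTER__ (a GCC/Clang/MSVC
extension) to keep the generated field names unique, so several guards
can live in the same struct; just a sketch, not an agreed-upon API:

/* Helper indirection so __COUNTER__ is expanded before token pasting. */
#define RTE_CACHE_GUARD_HELPER2(unique) \
	char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
#define RTE_CACHE_GUARD_HELPER1(unique) RTE_CACHE_GUARD_HELPER2(unique)
/* Parameter-free guard; each use gets its own distinct field name. */
#define RTE_CACHE_GUARD RTE_CACHE_GUARD_HELPER1(__COUNTER__)

Used e.g. as (no trailing semicolon, since the helper already provides
one; struct name is made up):

struct per_lcore_state {
	uint64_t counter;
	RTE_CACHE_GUARD
} __rte_cache_aligned;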