Subject: Re: [RFC 1/5] eal: add static per-lcore memory allocation facility
Date: Fri, 9 Feb 2024 12:46:05 +0100
From: Mattias Rönnblom
To: Morten Brørup, Mattias Rönnblom, dev@dpdk.org
Cc: Stephen Hemminger

On 2024-02-09 09:25, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Thursday, 8 February 2024 19.17
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary use case is statically allocating small chunks of
>> often-used data, which is logically related, but where there are
>> performance benefits to reap from keeping updates local to an lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decouple the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality to the
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its otherwise seemingly viable approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as an
>> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
>> structs.
>> The benefit of lcore variables over this approach is that data
>> related to the same lcore now is close (spatially, in memory),
>> rather than data used by the same module, which in turn avoids
>> excessive use of padding and the resulting pollution of caches with
>> unused data.
>>
>> Signed-off-by: Mattias Rönnblom
>> ---
>
> This looks very promising. :-)
>
> Here's a bunch of comments, questions and suggestions.
>
>
> * Question: Performance.
> What is the cost of accessing an lcore variable vs a variable in TLS?
> I suppose the relative cost diminishes if the variable is a larger struct, compared to a simple uint64_t.
>

In case all the relevant data is available in a cache close to the
core, both options carry quite low overhead.

Accessing an lcore variable will always require a TLS lookup, in the
form of retrieving the lcore_id of the current thread. In that sense,
there will likely be a number of extra instructions required to do the
lcore variable address lookup (i.e., doing the load from the
rte_lcore_var table based on the lcore_id you just looked up, and
adding the variable's offset). (A small standalone sketch of this
lookup path is included at the end of this mail.)

A TLS lookup will incur an extra overhead of less than a clock cycle,
compared to accessing a non-TLS static variable, in case static
linking is used. For shared objects, TLS is much more expensive
(something often visible in dynamically linked DPDK app flame graphs,
in the form of the __tls_addr symbol). Then you need to add ~3
cc/access. This is on a micro benchmark running on an x86_64 Raptor
Lake P-core.

(To visualize the difference between shared object and not, one can
use Compiler Explorer and -fPIC versus -fPIE.)

Things get more complicated if you access the same variable in the
same section of code, since then it can be left on the stack/in a
register by the compiler, especially if LTO is used. In other words,
if you do rte_lcore_id() several times in a row, only the first one
will cost you anything. This happens fairly often in DPDK, with
rte_lcore_id().

Finally, if you do something like

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index af9fffd81b..a65c30d27e 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
 static __rte_always_inline
 struct rte_rand_state *__rte_rand_get_state(void)
 {
-	unsigned int idx;
-
-	idx = rte_lcore_id();
-
-	if (unlikely(idx == LCORE_ID_ANY))
-		return &unregistered_rand_state;
-
-	return RTE_LCORE_VAR_PTR(rand_state);
+	return &unregistered_rand_state;
 }

 uint64_t

...and re-run the rand_perf_autotest, at least I see no difference at
all (in a statically linked build). Both result in rte_rand() using
~11 cc/call. What that suggests is that the TLS overhead is very
small, and that any extra instructions required by lcore variables
don't add much, if anything at all, at least in this particular case.

> Some of my suggestions below might also affect performance.
>
>
> * Advantage: Provides direct access to worker thread variables.
> With the current alternative (thread-local storage), the main thread cannot access the TLS variables of the worker threads,
> unless worker threads publish global access pointers.
> Lcore variables of any lcore thread can be directly accessed by any thread, which simplifies code.
>
>
> * Advantage: Roadmap towards hugemem.
> It would be nice if the lcore variable memory was allocated in hugemem, to reduce TLB misses.
> The current alternative (thread-local storage) is also not using hugemem, so not a degradation.
>

I agree, but the thing is it's hard to figure out how much memory is
required for this kind of variable, given how DPDK is built and
linked. In an OS kernel, you can just take all the symbols, put them
in a special section, and size that section. Such a thing can't
easily be done with DPDK, since shared object builds are supported,
plus this facility should be available not only to DPDK modules, but
also to the application, so relying on linker scripts isn't really
feasible (probably not even feasible for DPDK itself).

In that scenario, you want to size up the per-lcore buffer to be so
large that you don't have to worry about overruns. That will waste
memory. If you use huge page memory, paging can't help you avoid
pre-allocating actual physical memory.

That said, even large (by static per-lcore data standards) buffers
are potentially small enough not to grow the amount of memory used by
a DPDK process too much. You need to provision for RTE_MAX_LCORE of
them though.

The values of lcore variables should be small, and thus incur few TLB
misses, so you may not gain much from huge pages. In my world, it's
more about "fitting often-used per-lcore data into L1 or L2 CPU
caches", rather than the easier "fitting often-used per-lcore data
into a working set size reasonably expected to be covered by hardware
TLB/caches".

> Lcore variables are available very early at startup, so I guess the RTE memory allocator is not yet available.
> Hugemem could be allocated using O/S allocation, so there is a possible road towards using hugemem.
>

With the current design, that's true. I'm not sure it's a strict
requirement though, but it does make things simpler.

> Either way, using hugemem would require one more indirection (the pointer to the allocated hugemem).
> I don't know which has better performance, using hugemem or avoiding the additional pointer dereferencing.
>
>
> * Suggestion: Consider adding an entry for unregistered non-EAL threads.
> Please consider making room for one more entry, shared by all unregistered non-EAL threads, i.e.
> making the array size RTE_MAX_LCORE + 1 and indexing by (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
>
> It would be convenient for the use cases where a variable shared by the unregistered non-EAL threads doesn't need special treatment.
>

I thought about this, but it would require a conditional in the
lookup macro, as you show. More importantly, it would make the whole
thing less elegant and harder to understand. It's bad enough that
"per-lcore" is actually "per-lcore id" (or the equivalent "per-EAL
thread and registered non-EAL thread"). Adding a "btw it's + 1" is
not an improvement.

But useful? Sure.

I think you may still need other data for dealing with unregistered
threads, for example a mutex or spin lock to deal with concurrency
issues that arise with shared data. There may also be cases where you
are best off by simply disallowing unregistered threads from calling
into that API.

> Obviously, this might affect performance.
> If the performance cost is not negligible, the additional entry (and indexing branch) could be disabled at build time.
>
>
> * Suggestion: Do not fix the alignment at 16 byte.
> Pass an alignment parameter to rte_lcore_var_alloc() and use alignof() when calling it:
>
> +#include <stdalign.h>
> +
> +#define RTE_LCORE_VAR_ALLOC(name) \
> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> +
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment) \
> +	name = rte_lcore_var_alloc(size, alignment)
> +
> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size) \
> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> +
> +++ config/rte_config.h
> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
>

That seems like a very good idea. I'll look into it.

> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but behaves differently.
>
>> +/**
>> + * Iterate over each lcore id's value for a lcore variable.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH(var, name) \
>> +	for (unsigned int lcore_id = \
>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0); \
>> +	     lcore_id < RTE_MAX_LCORE; \
>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>> +
>
> The macro name RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(i), which only iterates on running cores.
> You might want to give it a name that differs more.
>

True. Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
confusion, for sure.

Being consistent with <rte_lcore.h> is not so easy, since it's not
even consistent with itself. For example, rte_lcore_count() returns
the number of lcores (EAL threads) *plus the number of registered
non-EAL threads*, and RTE_LCORE_FOREACH() gives a different count. :)

> If it wasn't for API breakage, I would suggest renaming RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
>
> Small detail: "var" is a pointer, so consider renaming it to "ptr" and adding _PTR to the macro name.

The "var" name comes from how <sys/queue.h> names things. I think I
had it as "ptr" initially. I'll change it back.

Thanks a lot Morten.
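
PS: To make the access path discussed above concrete, here is a small,
self-contained C sketch. This is not the RFC code; all names, sizes
and the allocation scheme are made up for illustration. It only shows
the idea: an "lcore variable" is an offset handed out at a requested
alignment, and an access is "look up my lcore id, index a per-lcore-id
buffer, add the offset", plus a FOREACH-style walk over all lcore ids.

#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

/* Made-up stand-ins for RTE_MAX_LCORE and the per-lcore buffer size. */
#define MAX_LCORE 128
#define LCORE_BUF_SIZE 1024

static char lcore_buffers[MAX_LCORE][LCORE_BUF_SIZE];
static size_t next_offset;

/* Stand-in for rte_lcore_id(); the real one is a TLS lookup. */
static unsigned int current_lcore_id;

/* Hand out 'size' bytes at 'align' alignment in every lcore id's
 * buffer; the returned offset acts as the variable's handle. */
static size_t
lcore_var_alloc(size_t size, size_t align)
{
	next_offset = (next_offset + align - 1) & ~(align - 1);

	size_t offset = next_offset;

	next_offset += size;

	return offset;
}

/* The access path: one lcore id lookup, one table index, one add. */
static void *
lcore_var_ptr(size_t offset)
{
	return &lcore_buffers[current_lcore_id][offset];
}

int
main(void)
{
	size_t counter = lcore_var_alloc(sizeof(long), alignof(long));

	current_lcore_id = 3; /* pretend this thread has lcore id 3 */
	*(long *)lcore_var_ptr(counter) = 42;

	/* FOREACH-style iteration over every lcore id's value */
	for (unsigned int lcore_id = 0; lcore_id < MAX_LCORE; lcore_id++) {
		long *value = (long *)&lcore_buffers[lcore_id][counter];

		if (*value != 0)
			printf("lcore id %u: %ld\n", lcore_id, *value);
	}

	return 0;
}

Whether the extra table load and offset add are measurable in practice
is exactly what the rte_random.c experiment above was meant to probe.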