From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	"Mattias Rönnblom" <mattias.ronnblom@ericsson.com>,
	dev@dpdk.org
Cc: Stephen Hemminger <stephen@networkplumber.org>
Subject: Re: [RFC 1/5] eal: add static per-lcore memory allocation facility
Date: Fri, 9 Feb 2024 12:46:05 +0100	[thread overview]
Message-ID: <fcdc52f7-1e6c-4aa2-a389-47d0264cf3e9@lysator.liu.se> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F200@smartserver.smartshare.dk>

On 2024-02-09 09:25, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Thursday, 8 February 2024 19.17
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is logically related, but where
>> there are performance benefits to be reaped from keeping updates local
>> to an lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decouple the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar in functionality to the FreeBSD
>> kernel's DPCPU_*() family of macros and the associated build-time
>> machinery. DPCPU uses linker scripts, which effectively prevents the
>> reuse of its otherwise seemingly viable approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore is now close (spatially, in memory), rather than data used by
>> the same module, which in turn avoids excessive use of padding and
>> polluting caches with unused data.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
> 
> This looks very promising. :-)
> 
> Here's a bunch of comments, questions and suggestions.
> 
> 
> * Question: Performance.
> What is the cost of accessing an lcore variable vs a variable in TLS?
> I suppose the relative cost diminishes if the variable is a larger struct, compared to a simple uint64_t.
> 

If all the relevant data is available in a cache close to the core,
both options carry quite low overhead.

Accessing an lcore variable always requires a TLS lookup, in the form
of retrieving the lcore_id of the current thread. On top of that, a
few extra instructions are needed for the lcore variable address
lookup itself (i.e., loading from the rte_lcore_var table based on the
lcore_id you just looked up and adding the variable's offset).
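
In rough terms, that lookup amounts to something like the sketch
below. The names (lcore_var_buffer, MAX_LCORE_VAR, handle_offset) are
made up for illustration and are not the actual RFC implementation;
LCORE_ID_ANY handling is also ignored for brevity.

#include <stddef.h>

#include <rte_lcore.h>

#define MAX_LCORE_VAR 1024 /* hypothetical per-lcore buffer size */

static char lcore_var_buffer[RTE_MAX_LCORE][MAX_LCORE_VAR];

static inline void *
lcore_var_ptr(size_t handle_offset)
{
        /* the TLS part: rte_lcore_id() reads a per-thread variable */
        unsigned int lcore_id = rte_lcore_id();

        /* the table part: index by lcore id, add the variable's offset */
        return &lcore_var_buffer[lcore_id][handle_offset];
}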

With static linking, a TLS lookup adds less than a clock cycle of
overhead compared to accessing a non-TLS static variable. For shared
objects, TLS is much more expensive (something often visible in flame
graphs of dynamically linked DPDK apps, in the form of the
__tls_get_addr symbol); there you need to add roughly 3 cc/access.
These numbers are from a micro benchmark running on an x86_64 Raptor
Lake P-core.

(To visualize the difference between the shared object case and the
static case, one can use Compiler Explorer and compare -fPIC versus
-fPIE.)
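
For example, a snippet as small as the one below makes the difference
visible on most x86_64 toolchains: compiled with -O2 -fPIC the access
goes through __tls_get_addr, while with -O2 -fPIE it is a plain
%fs-relative load.

static _Thread_local unsigned int per_thread_counter;

unsigned int
bump(void)
{
        return ++per_thread_counter;
}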

Things get more complicated if you access the same variable several
times in the same section of code, since the compiler can then keep
the result on the stack or in a register, especially if LTO is used.
In other words, if you call rte_lcore_id() several times in a row,
only the first call costs you anything. That pattern is fairly common
in DPDK.
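
A hypothetical example of the same effect with an lcore variable,
reusing RTE_LCORE_VAR_PTR() from the patch (the foo_stats struct and
the lcore_stats handle are made up; in real code the handle would be
declared and allocated with the patch's macros):

#include <stdint.h>

#include <rte_lcore_var.h>

struct foo_stats {
        uint64_t pkts;
        uint64_t bytes;
};

/* hypothetical handle; normally set up via the patch's macros */
static struct foo_stats *lcore_stats;

static void
foo_account(uint16_t n_pkts, uint32_t n_bytes)
{
        /* one TLS + table lookup here... */
        struct foo_stats *stats = RTE_LCORE_VAR_PTR(lcore_stats);

        /* ...after which the compiler typically keeps "stats" in a
         * register for the remaining accesses */
        stats->pkts += n_pkts;
        stats->bytes += n_bytes;
}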

Finally, if you do something like

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index af9fffd81b..a65c30d27e 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
  static __rte_always_inline
  struct rte_rand_state *__rte_rand_get_state(void)
  {
-       unsigned int idx;
-
-       idx = rte_lcore_id();
-
-       if (unlikely(idx == LCORE_ID_ANY))
-               return &unregistered_rand_state;
-
-       return RTE_LCORE_VAR_PTR(rand_state);
+       return &unregistered_rand_state;
  }

  uint64_t

...and re-run rand_perf_autotest, I see no difference at all (at least
in a statically linked build). Both variants result in rte_rand()
using ~11 cc/call. What that suggests is that the TLS overhead is very
small, and that any extra instructions required by lcore variables
don't add much, if anything at all, at least in this particular case.

> Some of my suggestions below might also affect performance.
> 
> 
> * Advantage: Provides direct access to worker thread variables.
> With the current alternative (thread-local storage), the main thread cannot access the TLS variables of the worker threads,
> unless worker threads publish global access pointers.
> Lcore variables of any lcore thread can be directly accessed by any thread, which simplifies code.
> 
> 
> * Advantage: Roadmap towards hugemem.
> It would be nice if the lcore variable memory was allocated in hugemem, to reduce TLB misses.
> The current alternative (thread-local storage) is also not using hugemem, so this is not a degradation.
> 

I agree, but the thing is it's hard to figure out how much memory is
required for these kinds of variables, given how DPDK is built and
linked. In an OS kernel, you can just take all the symbols, put them
in a special section, and size that section. Such a thing can't easily
be done with DPDK, since shared object builds are supported, and since
this facility should be available not only to DPDK modules but also to
the application, so relying on linker scripts isn't really feasible
(probably not even for DPDK itself).

In that scenario, you want to size the per-lcore buffer so large that
you never have to worry about overruns, which wastes memory. And if
you use huge page memory, paging can't help you avoid pre-allocating
actual physical memory.

That said, even large (by static per-lcore data standards) buffers are 
potentially small enough not to grow the amount of memory used by a DPDK 
process too much. You need to provision for RTE_MAX_LCORE of them though.
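
Just to give a sense of the numbers: with RTE_MAX_LCORE at its default
of 128, even a generously sized 1 MB per-lcore buffer adds up to 128
MB of pre-allocated memory, all of which would be backed by physical
memory up front if huge pages were used.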

The values of lcore variables should be small, and thus incur few TLB
misses, so you may not gain much from huge pages. In my world, it's
more about "fitting often-used per-lcore data into the L1 or L2 CPU
caches" than the easier "fitting often-used per-lcore data into a
working set reasonably expected to be covered by the hardware
TLB/caches".

> Lcore variables are available very early at startup, so I guess the RTE memory allocator is not yet available.
> Hugemem could be allocated using O/S allocation, so there is a possible road towards using hugemem.
> 

With the current design, that's true. I'm not sure it's a strict
requirement though, but it does make things simpler.

> Either way, using hugemem would require one more indirection (the pointer to the allocated hugemem).
> I don't know which has better performance, using hugemem or avoiding the additional pointer dereferencing.
> 
> 
> * Suggestion: Consider adding an entry for unregistered non-EAL threads.
> Please consider making room for one more entry, shared by all unregistered non-EAL threads, i.e.
> making the array size RTE_MAX_LCORE + 1 and indexing by (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
> 
> It would be convenient for the use cases where a variable shared by the unregistered non-EAL threads doesn't need special treatment.
> 

I thought about this, but it would require a conditional in the lookup
macro, as you show. More importantly, it would make the whole
<rte_lcore_var.h> thing less elegant and harder to understand. It's
bad enough that "per-lcore" is actually "per-lcore id" (or the
equivalent "per-EAL thread and registered non-EAL thread"). Adding a
"by the way, it's <what I said before> + 1" is not an improvement.

But useful? Sure.
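
For reference, such a lookup could look something like the sketch
below. It reuses RTE_LCORE_VAR_LCORE_PTR() from the patch and assumes
the per-lcore value array has been grown to RTE_MAX_LCORE + 1 entries;
it's not something I'm proposing.

#define RTE_LCORE_VAR_PTR_SHARED_FALLBACK(name)                         \
        RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id() < RTE_MAX_LCORE ?        \
                                rte_lcore_id() : RTE_MAX_LCORE, name)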

I think you may still need other data for dealing with unregistered
threads, for example a mutex or spin lock to deal with the concurrency
issues that arise with shared data.

There may also be cases where you are best off simply disallowing
unregistered threads from calling into that API.

> Obviously, this might affect performance.
> If the performance cost is not negligible, the additional entry (and indexing branch) could be disabled at build time.
> 
> 
> * Suggestion: Do not fix the alignment at 16 byte.
> Pass an alignment parameter to rte_lcore_var_alloc() and use alignof() when calling it:
> 
> +#include <stdalign.h>
> +
> +#define RTE_LCORE_VAR_ALLOC(name)			\
> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> +
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
> +	name = rte_lcore_var_alloc(size, alignment)
> +
> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> +
> + +++ /config/rte_config.h
> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
> 
> 

That seems like a very good idea. I'll look into it.
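
With your macros, the PRNG state allocation in patch 3/5 would then
presumably end up as something like (sketch):

RTE_LCORE_VAR_ALLOC(rand_state);

...which expands to

rand_state = rte_lcore_var_alloc(sizeof(*(rand_state)), alignof(*(rand_state)));

i.e., natural alignment for free at the allocation site.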

> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but behaves differently.
> 
>> +/**
>> + * Iterate over each lcore id's value for a lcore variable.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
>> +	for (unsigned int lcore_id =					\
>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
>> +	     lcore_id < RTE_MAX_LCORE;					\
>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>> +
> 
> The macro name RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(i), which only iterates on running cores.
> You might want to give it a name that differs more.
> 

True.

Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for confusion, 
for sure.

Being consistent with <rte_lcore.h> is not so easy, since it's not even 
consistent with itself. For example, rte_lcore_count() returns the 
number of lcores (EAL threads) *plus the number of registered non-EAL 
threads*, and RTE_LCORE_FOREACH() gives a different count. :)
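
Whatever the final name, typical usage would be along these lines (a
sketch, reusing the hypothetical foo_stats/lcore_stats from earlier in
this mail):

uint64_t total_pkts = 0;
struct foo_stats *stats;

RTE_LCORE_VAR_FOREACH(stats, lcore_stats)
        total_pkts += stats->pkts;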

> If it wasn't for API breakage, I would suggest renaming RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
> 
> Small detail: "var" is a pointer, so consider renaming it to "ptr" and adding _PTR to the macro name.

The "var" name comes from how <sys/queue.h> names things. I think I had 
it as "ptr" initially. I'll change it back.

Thanks a lot Morten.

