Subject: Re: [RFC 1/5] eal: add static per-lcore memory allocation facility
Date: Mon, 19 Feb 2024 08:49:06 +0100
From: Mattias Rönnblom
To: Morten Brørup, Mattias Rönnblom, dev@dpdk.org
Cc: Stephen Hemminger
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F203@smartserver.smartshare.dk>
References: <20240208181644.455233-1-mattias.ronnblom@ericsson.com>
 <20240208181644.455233-2-mattias.ronnblom@ericsson.com>
 <98CBD80474FA8B44BF855DF32C47DC35E9F200@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35E9F203@smartserver.smartshare.dk>

On 2024-02-09 14:04, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Friday, 9 February 2024 12.46
>>
>> On 2024-02-09 09:25, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>> Sent: Thursday, 8 February 2024 19.17
>>>>
>>>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>>>
>>>> An lcore variable has one value for every current and future lcore
>>>> id-equipped thread.
>>>>
>>>> The primary use case is for statically allocating small chunks of
>>>> often-used data, which is related logically, but where there are
>>>> performance benefits to reap from having updates being local to an
>>>> lcore.
>>>>
>>>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>>>> _Thread_local), but decoupling the values' lifetime from that of the
>>>> threads.
>>>>
>>>> Lcore variables are also similar in terms of functionality provided
>>>> by the FreeBSD kernel's DPCPU_*() family of macros and the
>>>> associated build-time machinery.
>>>> DPCPU uses linker scripts, which effectively prevents the reuse of
>>>> its, otherwise seemingly viable, approach.
>>>>
>>>> The currently-prevailing way to solve the same problem as lcore
>>>> variables is to keep a module's per-lcore data as an
>>>> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
>>>> structs. The benefit of lcore variables over this approach is that
>>>> data related to the same lcore now is close (spatially, in memory),
>>>> rather than data used by the same module, which in turn avoids
>>>> excessive use of padding, polluting caches with unused data.
>>>>
>>>> Signed-off-by: Mattias Rönnblom
>>>> ---
>>>
>>> This looks very promising. :-)
>>>
>>> Here's a bunch of comments, questions and suggestions.
>>>
>>>
>>> * Question: Performance.
>>> What is the cost of accessing an lcore variable vs a variable in TLS?
>>> I suppose the relative cost diminishes if the variable is a larger
>>> struct, compared to a simple uint64_t.
>>>
>>
>> In case all the relevant data is available in a cache close to the
>> core, both options carry quite low overhead.
>>
>> Accessing an lcore variable will always require a TLS lookup, in the
>> form of retrieving the lcore_id of the current thread. In that sense,
>> there will likely be a number of extra instructions required to do
>> the lcore variable address lookup (i.e., doing the load from the
>> rte_lcore_var table based on the lcore_id you just looked up, and
>> adding the variable's offset).
>>
>> A TLS lookup will incur an extra overhead of less than a clock cycle,
>> compared to accessing a non-TLS static variable, in case static
>> linking is used. For shared objects, TLS is much more expensive
>> (something often visible in dynamically linked DPDK app flame graphs,
>> in the form of the __tls_addr symbol). Then you need to add
>> ~3 cc/access. This on a micro benchmark running on an x86_64 Raptor
>> Lake P-core.
>>
>> (To visualize the difference between shared object and not, one can
>> use Compiler Explorer and -fPIC versus -fPIE.)
>>
>> Things get more complicated if you access the same variable in the
>> same section of code, since then it can be left on the stack/in a
>> register by the compiler, especially if LTO is used. In other words,
>> if you do rte_lcore_id() several times in a row, only the first one
>> will cost you anything. This happens fairly often in DPDK, with
>> rte_lcore_id().
>>
>> Finally, if you do something like
>>
>> diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
>> index af9fffd81b..a65c30d27e 100644
>> --- a/lib/eal/common/rte_random.c
>> +++ b/lib/eal/common/rte_random.c
>> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
>>  static __rte_always_inline
>>  struct rte_rand_state *__rte_rand_get_state(void)
>>  {
>> -	unsigned int idx;
>> -
>> -	idx = rte_lcore_id();
>> -
>> -	if (unlikely(idx == LCORE_ID_ANY))
>> -		return &unregistered_rand_state;
>> -
>> -	return RTE_LCORE_VAR_PTR(rand_state);
>> +	return &unregistered_rand_state;
>>  }
>>
>>  uint64_t
>>
>> ...and re-run the rand_perf_autotest, at least I see no difference at
>> all (in a statically linked build). Both result in rte_rand() using
>> ~11 cc/call. What that suggests is that TLS overhead is very small,
>> and that any extra instructions required by lcore variables don't add
>> much, if anything at all, at least in this particular case.
>
> Excellent. Thank you for a thorough and detailed answer, Mattias.
>
>>
>>> Some of my suggestions below might also affect performance.
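As a point of reference, the per-lcore array pattern discussed above
looks roughly like the sketch below. The "foo" module and its counters
are made up for illustration; RTE_CACHE_GUARD, __rte_cache_aligned and
rte_lcore_id() are the existing DPDK facilities the commit message
refers to.

#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore.h>

/* Prevailing pattern: one cache-aligned slot per lcore id. The padding
 * (plus the guard against next-line prefetcher effects) avoids false
 * sharing, but a module with only a few bytes of per-lcore state still
 * consumes at least one cache line for each of the RTE_MAX_LCORE
 * entries, and one lcore's data from different modules ends up
 * scattered across memory. */
struct foo_lcore_state {
	uint64_t packets;
	uint64_t bytes;
	RTE_CACHE_GUARD;
} __rte_cache_aligned;

static struct foo_lcore_state foo_states[RTE_MAX_LCORE];

static inline struct foo_lcore_state *
foo_get_state(void)
{
	return &foo_states[rte_lcore_id()];
}

With an lcore variable, the lookup would instead be a
RTE_LCORE_VAR_PTR() on a handle set up once with RTE_LCORE_VAR_ALLOC(),
and the per-entry cache-line padding goes away, since values belonging
to the same lcore id are packed together in that lcore's buffer.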
>>>
>>>
>>> * Advantage: Provides direct access to worker thread variables.
>>> With the current alternative (thread-local storage), the main thread
>>> cannot access the TLS variables of the worker threads, unless worker
>>> threads publish global access pointers.
>>> Lcore variables of any lcore thread can be directly accessed by any
>>> thread, which simplifies code.
>>>
>>>
>>> * Advantage: Roadmap towards hugemem.
>>> It would be nice if the lcore variable memory was allocated in
>>> hugemem, to reduce TLB misses.
>>> The current alternative (thread-local storage) is also not using
>>> hugemem, so not a degradation.
>>>
>>
>> I agree, but the thing is it's hard to figure out how much memory is
>> required for these kinds of variables, given how DPDK is built and
>> linked. In an OS kernel, you can just take all the symbols, put them
>> in a special section, and size that section. Such a thing can't
>> easily be done with DPDK, since shared object builds are supported,
>> plus this facility should be available not only to DPDK modules, but
>> also to the application, so relying on linker scripts isn't really
>> feasible (probably not even feasible for DPDK itself).
>>
>> In that scenario, you want to size up the per-lcore buffer to be so
>> large that you don't have to worry about overruns. That will waste
>> memory. If you use huge page memory, paging can't help you avoid
>> pre-allocating actual physical memory.
>
> Good point.
> I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE), but
> I hadn't considered how paging helps us use less physical memory than
> that.
>
>>
>> That said, even large (by static per-lcore data standards) buffers
>> are potentially small enough not to grow the amount of memory used by
>> a DPDK process too much. You need to provision for RTE_MAX_LCORE of
>> them though.
>>
>> The value of lcore variables should be small, and thus incur few TLB
>> misses, so you may not gain much from huge pages. In my world, it's
>> more about "fitting often-used per-lcore data into L1 or L2 CPU
>> caches", rather than the easier "fitting often-used per-lcore data
>> into a working set size reasonably expected to be covered by hardware
>> TLB/caches".
>
> Yes, I suppose that lcore variables are intended to be small, and
> large per-lcore structures should keep following the current design
> patterns for allocation and access.
> It seems to me that support for per-lcore heaps should be the solution
> for supporting use cases requiring many, larger and/or dynamic objects
> on a per-lcore basis.

Ideally, you would design both that mechanism and lcore variables
together, but if you couple enough improvements together you will never
get anywhere. An instance of where perfect is the enemy of good,
perhaps.

> Perhaps this guideline is worth mentioning in the documentation.
>

What is missing, more specifically? The size limitation and the static
nature of lcore variables are described, and which current design
patterns they are expected to (partly) replace is also covered.

>>
>>> Lcore variables are available very early at startup, so I guess the
>>> RTE memory allocator is not yet available.
>>> Hugemem could be allocated using O/S allocation, so there is a
>>> possible road towards using hugemem.
>>>
>>
>> With the current design, that's true. I'm not sure it's a strict
>> requirement though, but it does make things simpler.
>>
>>> Either way, using hugemem would require one more indirection (the
>>> pointer to the allocated hugemem).
>>> I don't know which has better performance, using hugemem or avoiding
>>> the additional pointer dereferencing.
>>>
>>>
>>> * Suggestion: Consider adding an entry for unregistered non-EAL
>>> threads.
>>> Please consider making room for one more entry, shared by all
>>> unregistered non-EAL threads, i.e. making the array size
>>> RTE_MAX_LCORE + 1 and indexing by
>>> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
>>>
>>> It would be convenient for the use cases where a variable shared by
>>> the unregistered non-EAL threads doesn't need special treatment.
>>>
>>
>> I thought about this, but it would require a conditional in the
>> lookup macro, as you show. More importantly, it would make the whole
>> thing less elegant and harder to understand. It's bad enough that
>> "per-lcore" is actually "per-lcore id" (or the equivalent "per-EAL
>> thread and unregistered EAL-thread"). Adding a "btw it's <what I said
>> before> + 1" is not an improvement.
>
> We could promote the "one more entry for unregistered non-EAL threads"
> design pattern (for relevant use cases only!) by extending EAL with
> one more TLS variable, maintained like _thread_id, but set to
> RTE_MAX_LCORE when _thread_id is set to -1:
>
> +++ eal_common_thread.c:
> RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
> + RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;
>
> and
>
> +++ rte_lcore.h:
> static inline unsigned
> rte_lcore_id(void)
> {
> 	return RTE_PER_LCORE(_lcore_id);
> }
> + static inline unsigned
> + rte_lcore_idx(void)
> + {
> + 	return RTE_PER_LCORE(_lcore_idx);
> + }
>
> That would eliminate the
> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE)
> conditional, also where currently used.
>

Wouldn't that effectively give a shared lcore id to all unregistered
threads? We definitely shouldn't further complicate anything related to
the DPDK threading model, in my opinion. If a module needs one or more
variable instances that aren't per lcore, use regular static allocation
instead. I would favor clarity over convenience here, at least until we
know better (see below as well).

>>
>> But useful? Sure.
>>
>> I think you may still need other data for dealing with unregistered
>> threads, for example a mutex or spin lock to deal with concurrency
>> issues that arise with shared data.
>
> Adding the extra entry is only for the benefit of use cases where
> special handling is not required. It will make the code for those use
> cases much cleaner. I think it is useful.
>

It will make it shorter, but not less clean, I would argue.

> Use cases requiring special handling should still do the special
> handling they do today.
>

For DPDK modules using lcore variables and which treat unregistered
threads as "full citizens", I expect special handling of unregistered
threads to be the norm. Take rte_random.h as an example. The current
API does not guarantee MT safety for concurrent calls from unregistered
threads. It probably should, and it should probably be by means of a
mutex (not a spinlock).

The reason I'm not running off to make an rte_random.c patch is that
it's unclear to me what the role of unregistered threads in DPDK is.

I'm reasonably comfortable with a model where there are many threads
that basically don't interact with the DPDK APIs (except maybe some
very narrow exposure, like the preemption-safe ring variant). One
example of such a design would be a big, slow control plane which uses
multi-threading and the Linux process scheduler for work scheduling,
hosted in the same process as a DPDK data plane app.
What I find more strange is a scenario where there are unregistered
threads which interact with a wide variety of DPDK APIs, do so at high
rates/with high performance requirements, and are expected to be
preemption-safe. So they are basically EAL threads without an lcore id.

Support for that latter scenario has also been voiced, in previous
discussions, from what I recall.

I think it's hard to answer the question of an "unregistered thread
spare" for lcore variables without first knowing what the future should
look like for unregistered threads in DPDK, in terms of being able to
call into DPDK APIs, preemption-safety guarantees, etc.

It seems that until you have a clearer picture of how generally to
treat unregistered threads, you are best off with just a per-lcore id
instance of lcore variables.

>>
>> There may also be cases where you are best off by simply disallowing
>> unregistered threads from calling into that API.
>>
>>> Obviously, this might affect performance.
>>> If the performance cost is not negligible, the additional entry (and
>>> indexing branch) could be disabled at build time.
>>>
>>>
>>> * Suggestion: Do not fix the alignment at 16 bytes.
>>> Pass an alignment parameter to rte_lcore_var_alloc() and use
>>> alignof() when calling it:
>>>
>>> +#include <stdalign.h>
>>> +
>>> +#define RTE_LCORE_VAR_ALLOC(name) \
>>> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
>>> +
>>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment) \
>>> +	name = rte_lcore_var_alloc(size, alignment)
>>> +
>>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size) \
>>> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
>>> +
>>> + +++ /config/rte_config.h
>>> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
>>>
>>
>> That seems like a very good idea. I'll look into it.
>>
>>> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(),
>>> but behaves differently.
>>>
>>>> +/**
>>>> + * Iterate over each lcore id's value for a lcore variable.
>>>> + */
>>>> +#define RTE_LCORE_VAR_FOREACH(var, name) \
>>>> +	for (unsigned int lcore_id = \
>>>> +		(((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0); \
>>>> +	     lcore_id < RTE_MAX_LCORE; \
>>>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>>>> +
>>>
>>> The macro name RTE_LCORE_VAR_FOREACH() resembles
>>> RTE_LCORE_FOREACH(i), which only iterates on running cores.
>>> You might want to give it a name that differs more.
>>>
>>
>> True.
>>
>> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
>> confusion, for sure.
>>
>> Being consistent with <rte_lcore.h> is not so easy, since it's not
>> even consistent with itself. For example, rte_lcore_count() returns
>> the number of lcores (EAL threads) *plus the number of registered
>> non-EAL threads*, and RTE_LCORE_FOREACH() gives a different count. :)
>
> Naming is hard. I don't have a good name, and can only offer
> inspiration...
>
> <rte_lcore.h> has RTE_LCORE_FOREACH() and its
> RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.
>
> Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a
> variant.
>
>>
>>> If it wasn't for API breakage, I would suggest renaming
>>> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
>>>
>>> Small detail: "var" is a pointer, so consider renaming it to "ptr"
>>> and adding _PTR to the macro name.
>>
>> The "var" name comes from how <sys/queue.h> names things. I think I
>> had it as "ptr" initially. I'll change it back.
>
> Thanks.
>
>>
>> Thanks a lot Morten.
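For readers following the thread, a rough end-to-end usage sketch of
the API as quoted above may help put the macros in context. The names
follow the RFC under discussion (RTE_LCORE_VAR_ALLOC,
RTE_LCORE_VAR_PTR, RTE_LCORE_VAR_FOREACH); their signatures may change
along the lines discussed in this mail (an alignment parameter, a
different FOREACH name), the handle declaration is simplified to a
plain pointer, and the "foo_stats" module is hypothetical.

#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore.h>

struct foo_stats {
	uint64_t events;
};

/* Handle to the lcore variable; shown as a plain pointer for brevity. */
static struct foo_stats *foo_stats_var;

RTE_INIT(foo_stats_init)
{
	/* Allocates one instance per lcore id, up front, before main(). */
	RTE_LCORE_VAR_ALLOC(foo_stats_var);
}

/* Fast path, run in an EAL thread: touches only this lcore's instance. */
static inline void
foo_stats_bump(void)
{
	struct foo_stats *stats = RTE_LCORE_VAR_PTR(foo_stats_var);

	stats->events++;
}

/* Control path, e.g. the main thread: may read any lcore id's value. */
static uint64_t
foo_stats_total(void)
{
	struct foo_stats *stats;
	uint64_t total = 0;

	RTE_LCORE_VAR_FOREACH(stats, foo_stats_var)
		total += stats->events;

	return total;
}

Note that, per the macro quoted earlier in the thread, the FOREACH loop
declares its own lcore_id iterator internally, so the caller only
provides the value pointer.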