From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <68c8f01c-d404-4b63-adca-13b560c95df8@lysator.liu.se>
Date: Tue, 20 Feb 2024 11:47:14 +0100
Subject: Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
From: Mattias Rönnblom
To: Bruce Richardson, Mattias Rönnblom
Cc: dev@dpdk.org, Morten Brørup, Stephen Hemminger
References: <20240219094036.485727-2-mattias.ronnblom@ericsson.com>
 <20240220084908.488252-1-mattias.ronnblom@ericsson.com>
 <20240220084908.488252-2-mattias.ronnblom@ericsson.com>

On 2024-02-20 10:11, Bruce Richardson wrote:
> On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary use case is statically allocating small chunks of
>> often-used data, which is logically related, but where there are
>> performance benefits to reap from having updates be local to an
>> lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decouple the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar, in terms of functionality, to the
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its otherwise seemingly viable approach.
>>
>> The currently prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as an
>> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
>> structs.
>> The benefit of lcore variables over this approach is that data
>> related to the same lcore is now close (spatially, in memory),
>> rather than data used by the same module, which in turn avoids
>> excessive use of padding and polluting caches with unused data.
>>
>> RFC v3:
>>  * Replace use of GCC-specific alignof(<expression>) with
>>    alignof(<type>).
>>  * Update example to reflect FOREACH macro name change (in RFC v2).
>>
>> RFC v2:
>>  * Use alignof to derive alignment requirements. (Morten Brørup)
>>  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>  * Allow user-specified alignment, but limit max to cache line size.
>>
>> Signed-off-by: Mattias Rönnblom
>> ---
>>  config/rte_config.h                   |   1 +
>>  doc/api/doxy-api-index.md             |   1 +
>>  lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>>  lib/eal/common/meson.build            |   1 +
>>  lib/eal/include/meson.build           |   1 +
>>  lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>>  lib/eal/version.map                   |   4 +
>>  7 files changed, 465 insertions(+)
>>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>  create mode 100644 lib/eal/include/rte_lcore_var.h
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index da265d7dd2..884482e473 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -30,6 +30,7 @@
>>  /* EAL defines */
>>  #define RTE_CACHE_GUARD_LINES 1
>>  #define RTE_MAX_HEAPS 32
>> +#define RTE_MAX_LCORE_VAR 1048576
>>  #define RTE_MAX_MEMSEG_LISTS 128
>>  #define RTE_MAX_MEMSEG_PER_LIST 8192
>>  #define RTE_MAX_MEM_MB_PER_LIST 32768
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index a6a768bd7c..bb06bb7ca1 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>>    [interrupts](@ref rte_interrupts.h),
>>    [launch](@ref rte_launch.h),
>>    [lcore](@ref rte_lcore.h),
>> +  [lcore-variable](@ref rte_lcore_var.h),
>>    [per-lcore](@ref rte_per_lcore.h),
>>    [service cores](@ref rte_service.h),
>>    [keepalive](@ref rte_keepalive.h),
>> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
>> new file mode 100644
>> index 0000000000..dfd11cbd0b
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,82 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define WARN_THRESHOLD 75
>> +
>> +/*
>> + * Avoid using offset zero, since it would result in a NULL-value
>> + * "handle" (offset) pointer, which in principle and per the API
>> + * definition shouldn't be an issue, but may confuse some tools and
>> + * users.
>> + */
>> +#define INITIAL_OFFSET 1
>> +
>> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
>> +
>
> While I like the idea of improved handling for per-core variables, my main
> concern with this set is this definition here, which adds yet another
> dependency on the compile-time defined RTE_MAX_LCORE value.
>

Lcore variables replace one RTE_MAX_LCORE-dependent pattern with
another. You could even argue that the dependency on RTE_MAX_LCORE is
reduced with lcore variables, if you look at where, and in how many
places, this macro is used in the code base.

Centralizing per-lcore data management may also provide some
opportunity in the future for extending the API to cope with a more
dynamic RTE_MAX_LCORE variant.
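To spell out the two patterns being compared, consider a hypothetical
"foo" module. I'm citing the lcore variable macro names from memory
here; see <rte_lcore_var.h> in this series for the exact API.

#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore.h>
#include <rte_lcore_var.h>

/*
 * Variant A, the prevailing pattern: a module-private,
 * RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
 * structs. Every entry is padded out to whole cache lines, and one
 * lcore's data is scattered across all modules' arrays.
 */
struct foo_lcore_state {
        uint64_t count;
        RTE_CACHE_GUARD;
} __rte_cache_aligned;

static struct foo_lcore_state foo_states[RTE_MAX_LCORE];

static void
foo_count(void)
{
        foo_states[rte_lcore_id()].count++;
}

/*
 * Variant B, the same module using an lcore variable: the handle is
 * an offset into the common per-lcore buffer, so data belonging to
 * one lcore ends up spatially close regardless of owning module, and
 * no per-module padding or cache alignment is needed.
 */
struct foo_lcore_state_v2 {
        uint64_t count;
};

static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state_v2, foo_state);

RTE_INIT(foo_init)
{
        RTE_LCORE_VAR_ALLOC(foo_state);
}

static void
foo_count_v2(void)
{
        RTE_LCORE_VAR_VALUE(foo_state)->count++;
}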
Extending it that way can't happen without ABI breakage, of course,
but we are never going to change anything related to RTE_MAX_LCORE
without breaking the ABI anyway, since this constant is everywhere,
including compiled into the application itself.

> I believe we already have an issue with this #define where it's impossible
> to come up with a single value that works for all, or nearly all, cases. The
> current default is still 128, yet DPDK needs to support systems where the
> number of cores is well into the hundreds, requiring workarounds of core
> mappings or customized builds of DPDK. Upping the value fixes those issues
> at the cost of memory footprint explosion for smaller systems.
>

I agree this is an issue. RTE_MAX_LCORE also needs to be sized to
accommodate not only all cores used, but the sum of all EAL threads
and registered non-EAL threads. So there is no reliable way to
discover what RTE_MAX_LCORE should be on a particular piece of
hardware, since the actual number of lcore ids needed is up to the
application.

Why is the default set so low? Linux has NR_CPUS, which serves the
same purpose and which is set to 4096 by default, if I recall
correctly. Shouldn't we at least be able to increase it to 256?

> I'm therefore nervous about putting more dependencies on this value, when I
> feel we should be moving away from its use, to allow more runtime
> configurability of cores.
>

What more specifically do you have in mind?

Maybe I'm overly pessimistic, but supporting lcores without any upper
bound, and also allowing them to be added and removed at any point
during run time, seems far-fetched, given where DPDK is today.

Including an actual upper bound, set during DPDK run-time
initialization and lower than RTE_MAX_LCORE, seems easier. I think
there is some equivalent in the Linux kernel (cf. nr_cpu_ids). Again,
you would need to accommodate future rte_register_thread() calls.

<rte_lcore_var.h> could be extended with user-specified lcore variable
init/free callbacks, to allow lazy or late initialization.

If one could have a way to retrieve the maximum possible number of
lcore ids *for a particular DPDK process* (as opposed to a particular
build), it would be possible to avoid touching the per-lcore buffers
for lcore ids that will never be used. With the data in BSS, it would
never be mapped/allocated.

An issue with BSS data is that there may be very RT-sensitive
applications that decide to lock all memory into RAM, to avoid latency
jitter caused by paging, and such applications would suffer from a
large rte_lcore_var (or, today, all the static arrays). Lcore
variables make this worse, since rte_lcore_var is larger than the sum
of today's static arrays, and must be so, with some margin, since
there is no way to figure out ahead of time how much memory will
actually be needed.

> For this set/feature, would it be possible to have a run-time allocated
> (and sized) array rather than using the RTE_MAX_LCORE value?
>

What I explored was having the per-lcore buffers dynamically
allocated. What I ran into was that I saw no apparent benefit, and
with dynamic allocation there were new problems to solve. One was
assuring that the lcore variable buffers were allocated before being
used. In particular, if you want to use huge page memory, lcore
variables may be available only once that machinery is ready to accept
requests.

Also, with huge page memory, you won't get the benefit you get from
demand paging and BSS (i.e., only used memory is actually allocated).
With malloc(), I believe you generally do get that same benefit, if
your allocation is sufficiently large.
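For concreteness, a run-time sized variant along the lines you suggest
might look something like the below sketch. The function and variable
names are made up, not from this series, and it leaves open exactly
when during EAL initialization it would run.

#include <errno.h>
#include <stdlib.h>

#include <rte_config.h>

/* Hypothetical run-time allocation of the per-lcore buffers,
 * replacing the static rte_lcore_var array. */
static char *lcore_var_buf;

int
eal_lcore_var_init(unsigned int max_lcore_ids)
{
        /*
         * One contiguous area. A sufficiently large calloc() is
         * typically served by mmap()'d, demand-paged memory, giving
         * much of BSS's only-touched-pages-are-backed behavior. The
         * catch is that this buffer must exist before the first lcore
         * variable is allocated, e.g., from an RTE_INIT() constructor
         * running before main().
         */
        lcore_var_buf = calloc(max_lcore_ids, RTE_MAX_LCORE_VAR);

        return lcore_var_buf != NULL ? 0 : -ENOMEM;
}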
I also considered just allocating chunks, fitting (say) 64 kB worth of
lcore variables in each. That turned out more complex, and to no
benefit other than reducing the footprint for mlockall()-type apps,
which seemed like a corner case.

I never considered a no-upper-bound, dynamic RTE_MAX_LCORE.

> Thanks, (and apologies for the mini-rant!)
>
> /Bruce

Thanks for the comments. This was nowhere near a rant.