DPDK patches and discussions
* [RFC 0/5] Lcore variables
@ 2024-02-08 18:16 Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                   ` (4 more replies)
  0 siblings, 5 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how best to allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean,
and seemingly very performant solution to a real problem.

One thing that is unclear to the author is how this API relates to a
potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Contrary to what the version.map edit suggests, this RFC is not meant
as a proposal for DPDK 24.03.

Mattias Rönnblom (5):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable

 app/test/meson.build                  |   1 +
 app/test/test_lcore_var.c             | 384 ++++++++++++++++++++++++++
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  80 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_random.c           |  30 +-
 lib/eal/common/rte_service.c          | 119 ++++----
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 352 +++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 lib/power/rte_power_pmd_mgmt.c        |  27 +-
 12 files changed, 925 insertions(+), 76 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1



* [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  2024-02-09  8:25   ` Morten Brørup
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 2/5] eal: add lcore variable test suite Mattias Rönnblom
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.
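
The mechanism behind this can be modelled without DPDK. The sketch
below is an illustration only, not the patch's implementation: it
assumes a scaled-down buffer (the MODEL_* constants and model_*()
functions are hypothetical names), uses a plain offset as the handle
where the patch stores the offset in a typed pointer, and copies values
with memcpy() rather than going through typed pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, scaled-down stand-ins for RTE_MAX_LCORE and
 * RTE_MAX_LCORE_VAR. */
#define MODEL_MAX_LCORE 4
#define MODEL_MAX_VAR 256
/* Like the patch, never hand out offset zero as a handle. */
#define MODEL_INITIAL_OFFSET 16

/* One private buffer per lcore id; zeroed, since it has static storage. */
static char model_buf[MODEL_MAX_LCORE][MODEL_MAX_VAR];
static size_t model_allocated = MODEL_INITIAL_OFFSET;

/* Reserve 'size' bytes at the same offset in every lcore's buffer.
 * 'align' must be a power of two. The returned offset is the handle. */
static size_t
model_alloc(size_t size, size_t align)
{
	size_t offset = (model_allocated + align - 1) & ~(align - 1);

	assert(offset + size < MODEL_MAX_VAR);
	model_allocated = offset + size;

	return offset;
}

/* Store an int into the instance owned by 'lcore_id'. */
static void
model_set_int(unsigned int lcore_id, size_t handle, int value)
{
	memcpy(&model_buf[lcore_id][handle], &value, sizeof(value));
}

/* Load the int from the instance owned by 'lcore_id'. */
static int
model_get_int(unsigned int lcore_id, size_t handle)
{
	int value;

	memcpy(&value, &model_buf[lcore_id][handle], sizeof(value));

	return value;
}

/* One allocation yields one independent value per lcore id. */
static int
model_demo(void)
{
	size_t counter = model_alloc(sizeof(int), sizeof(int));
	unsigned int id;
	int sum = 0;

	for (id = 0; id < MODEL_MAX_LCORE; id++)
		model_set_int(id, counter, (int)id * 10);

	for (id = 0; id < MODEL_MAX_LCORE; id++)
		sum += model_get_int(id, counter);

	return sum;
}
```

A single handle addresses MODEL_MAX_LCORE distinct objects, all of
which exist from the moment of allocation, regardless of whether a
thread with that lcore id has been created yet.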

The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of frequently used, logically related data, where there
are performance benefits to reap from keeping updates local to an
lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU uses linker scripts, which effectively prevents the
reuse of its otherwise seemingly viable approach.

The currently prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore is now close (spatially, in memory), rather than data used by
the same module, which in turn avoids excessive use of padding that
pollutes caches with unused data.
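
The padding cost of the array pattern can be made concrete with a small
standalone sketch. It is not part of the patch: it assumes a 64-byte
cache line and a single guard line, and the SKETCH_* and sketch_*()
names are hypothetical. It compares the per-lcore footprint of a
cache-aligned, guarded struct with that of the bare struct, which is
roughly what an lcore variable consumes in each lcore's buffer (plus
alignment of at most 16 bytes).

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache line size; DPDK's RTE_CACHE_LINE_SIZE depends on the
 * target. */
#define SKETCH_CACHE_LINE 64

/* The actual per-lcore payload. */
struct foo_state {
	int a;
	long b;
};

/* The array pattern: cache-line aligned, followed by one guard line. */
struct foo_state_padded {
	alignas(SKETCH_CACHE_LINE) struct foo_state state;
	alignas(SKETCH_CACHE_LINE) char guard[SKETCH_CACHE_LINE];
};

/* The payload fits in one line, so the padded variant is two lines. */
static_assert(sizeof(struct foo_state_padded) == 2 * SKETCH_CACHE_LINE,
	      "unexpected padded size");

/* Bytes consumed per lcore id by the padded-array pattern. */
static size_t
sketch_array_bytes_per_lcore(void)
{
	return sizeof(struct foo_state_padded);
}

/* Bytes consumed per lcore id when the struct is packed next to other
 * data belonging to the same lcore. */
static size_t
sketch_lcore_var_bytes_per_lcore(void)
{
	return sizeof(struct foo_state);
}
```

Under these assumptions each lcore's entry in the array occupies two
full cache lines, of which only sizeof(struct foo_state) bytes carry
data.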

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  80 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 352 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 7 files changed, 440 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index da265d7dd2..884482e473 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -30,6 +30,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index a6a768bd7c..bb06bb7ca1 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -98,6 +98,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..5276fe7192
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+/* XXX: should this file be called eal_common_ldata.c or rte_ldata.c? */
+
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define WARN_THRESHOLD 75
+#define MAX_AUTO_ALIGNMENT 16U
+
+/*
+ * Avoid using offset zero, since it would result in a NULL-value
+ * "handle" (offset) pointer, which in principle and per the API
+ * definition shouldn't be an issue, but may confuse some tools and
+ * users.
+ */
+#define INITIAL_OFFSET MAX_AUTO_ALIGNMENT
+
+char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
+
+static uintptr_t allocated = INITIAL_OFFSET;
+
+static void
+verify_allocation(uintptr_t new_allocated)
+{
+	static bool has_warned;
+
+	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
+
+	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
+	    !has_warned) {
+		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
+			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
+			RTE_MAX_LCORE_VAR);
+		has_warned = true;
+	}
+}
+
+static void *
+lcore_var_alloc(size_t size, size_t alignment)
+{
+	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, alignment);
+
+	void *offset = (void *)new_allocated;
+
+	new_allocated += size;
+
+	verify_allocation(new_allocated);
+
+	allocated = new_allocated;
+
+	EAL_LOG(DEBUG, "Allocated %zu bytes of per-lcore data with a "
+		"%zu-byte alignment", size, alignment);
+
+	return offset;
+}
+
+void *
+rte_lcore_var_alloc(size_t size)
+{
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+
+	/* Allocations are naturally aligned (i.e., the same alignment
+	 * as the object size), up to a maximum of 16 bytes, which
+	 * should satisfy the alignment requirements of any kind of
+	 * object.
+	 */
+	size_t alignment = RTE_MIN(size, MAX_AUTO_ALIGNMENT);
+
+	return lcore_var_alloc(size, alignment);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..c1854dc6a4
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,352 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. In other words,
+ * there's one copy of its value for each and every current and future
+ * lcore id-equipped thread, with the total number of copies amounting
+ * to \c RTE_MAX_LCORE.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a \c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). A handle may be passed between modules and
+ * threads just like any pointer, but its value is not the address of
+ * any particular object, but rather just an opaque identifier, stored
+ * in a typed pointer (to inform the access macros of the value type).
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
+ *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * generally only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by different lcore
+ * ids *may* be frequently read or written by the owners without the
+ * risk of false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomics) should be
+ * employed to assure there are no data races between the owning
+ * thread and any non-owner threads accessing the same lcore variable
+ * instance.
+ *
+ * The value of the lcore variable for a particular lcore id may be
+ * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
+ * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * To modify the value of an lcore variable for a particular lcore id,
+ * either access the object through the pointer retrieved by \ref
+ * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
+ * RTE_LCORE_VAR_LCORE_SET.
+ *
+ * Each access macro has a shorthand which may be used by an EAL
+ * thread or registered non-EAL thread to access the lcore variable
+ * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
+ * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
+ *
+ * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier. The
+ * *identifier* value is common across all lcore ids.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like \c int,
+ * but would more typically be a \c struct. An application may choose
+ * to define an lcore variable, which it then goes on to never
+ * allocate.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * The sum of all lcore variables, plus any padding required, must be
+ * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
+ * violation of this maximum results in the process being terminated.
+ *
+ * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
+ * same order of magnitude in size as a thread stack.
+ *
+ * The lcore variable storage buffers are kept in the BSS section in
+ * the resulting binary, where data generally isn't mapped in until
+ * it's accessed. This means that unused portions of the lcore
+ * variable storage area will not occupy any physical memory (with a
+ * granularity of the memory page size [usually 4 kB]).
+ *
+ * Lcore variables should generally *not* be \ref __rte_cache_aligned
+ * and need *not* include a \ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the
+ * case of an lcore variable instance, all nearby data structures
+ * should almost always be written to by a single thread (the lcore
+ * variable owner). Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         unsigned int lcore_id;
+ *
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * \endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * } __rte_cache_aligned;
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * \endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions and for example next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to \ref rte_lcore_var.h is the \ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., \ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follows that of a particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. One effect of this is thread-local variables must be
+ *     initialized in a "lazy" manner (e.g., at the point of thread
+ *     creation). Lcore variables may be accessed immediately after
+ *     having been allocated (which is usually before any thread beyond
+ *     the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id, and thus
+ *     not for such "regular" threads.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread and GCC _Thread_local, such data sharing is
+ *     supported. In the C11 standard, the result of accessing another
+ *     thread's _Thread_local object is implementation-defined. Lcore
+ *     variable instances may be accessed reliably by any thread.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various per-lcore id instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore variable handle, rather than a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC_SZ(name, size)	\
+	name = rte_lcore_var_alloc(size)
+
+/**
+ * Allocate space for an lcore variable of the size suggested by the
+ * handle pointer type, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC(name)			\
+	RTE_LCORE_VAR_ALLOC_SZ(name, sizeof(*(name)))
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a \ref
+ * RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SZ(name, size)				\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SZ(name, size);			\
+	}
+
+/**
+ * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
+	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
+	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+/**
+ * Get value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
+
+/**
+ * Set the value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
+
+/**
+ * Get value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
+
+/**
+ * Set value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_SET(name, value) \
+	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ */
+#define RTE_LCORE_VAR_FOREACH(var, name)				\
+	for (unsigned int lcore_id =					\
+		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
+
+/**
+ * Allocate space in the per-lcore id buffer for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * @return
+ *   The id of the variable, stored in a void pointer value.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 5e0cd47c82..e90b86115a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -393,6 +393,10 @@ EXPERIMENTAL {
 	# added in 23.07
 	rte_memzone_max_get;
 	rte_memzone_max_set;
+
+	# added in 24.03
+	rte_lcore_var_alloc;
+	rte_lcore_var;
 };
 
 INTERNAL {
-- 
2.34.1



* [RFC 2/5] eal: add lcore variable test suite
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 384 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 385 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 6389ae83ee..93412cce51 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..0229f90bf2
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,384 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static bool
+rand_bool(void)
+{
+	return rte_rand() & 1;
+}
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_PTR(test_int);
+
+	bool naturally_aligned = RTE_PTR_ALIGN_CEIL(ptr, sizeof(int)) == ptr;
+
+	bool equal;
+
+	if (rand_bool())
+		equal = RTE_LCORE_VAR_GET(test_int) == state->old_value;
+	else
+		equal = *(RTE_LCORE_VAR_PTR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	if (rand_bool())
+		RTE_LCORE_VAR_SET(test_int, state->new_value);
+	else
+		*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		RTE_LCORE_VAR_LCORE_SET(lcore_id, test_int, state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		TEST_ASSERT_EQUAL(state->new_value,
+				  RTE_LCORE_VAR_LCORE_GET(lcore_id, test_int),
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH(v, test_int) {
+		printf("expected %d actual %d\n",
+		       states[lcore_id].new_value, *v);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_PTR(test_struct);
+
+	/*
+	 * Lcore variable alignment is based on object size, not any
+	 * particular requirements of the struct's fields.
+	 */
+	bool properly_aligned =
+		RTE_PTR_ALIGN_CEIL(lcore_struct, 16) == lcore_struct;
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_struct);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state
+{
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_PTR(test_array);
+
+	/*
+	 * Lcore variable alignment is based on object size, not any
+	 * particular requirements of the array's elements.
+	 */
+	bool properly_aligned =
+		RTE_PTR_ALIGN_CEIL(lcore_array, 16) == lcore_array;
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(RTE_LCORE_VAR_LCORE_GET(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_array);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (RTE_MAX_LCORE_VAR / 2)
+
+static int
+test_many_lvars(void)
+{
+	void **handlers = malloc(sizeof(void *) * MANY_LVARS);
+	int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		void *handle = rte_lcore_var_alloc(1);
+
+		uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), handle);
+
+		*b = (uint8_t)i;
+
+		handlers[i] = handle;
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_FOREACH_WORKER(lcore_id) {
+			uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(),
+							       handlers[i]);
+			TEST_ASSERT_EQUAL((uint8_t)i, *b,
+					  "Unexpected lcore variable value.");
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1



* [RFC 3/5] random: keep PRNG state in lcore variable
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 2/5] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 4/5] power: keep per-lcore " Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 5/5] service: " Mattias Rönnblom
  4 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace the RTE_MAX_LCORE-sized static array of cache-aligned and
RTE_CACHE_GUARDed struct instances holding the PRNG state with a more
cache-friendly lcore variable holding the same state.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_random.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 7709b8f2c6..af9fffd81b 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct rte_rand_state {
@@ -19,14 +20,12 @@ struct rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
-} __rte_cache_aligned;
+};
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+static RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state __rte_cache_aligned;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_PTR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1



* [RFC 4/5] power: keep per-lcore state in lcore variable
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
                   ` (2 preceding siblings ...)
  2024-02-08 18:16 ` [RFC 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 5/5] service: " Mattias Rönnblom
  4 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/power/rte_power_pmd_mgmt.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 591fc69f36..bb20e564de 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -68,8 +69,8 @@ struct pmd_core_cfg {
 	/**< Number of queues ready to enter power optimized state */
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
-} __rte_cache_aligned;
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+};
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -772,10 +770,13 @@ RTE_INIT(rte_power_ethdev_pmgmt_init) {
 	size_t i;
 	int j;
 
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
+
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		struct pmd_core_cfg *lcore_cfg =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_cfgs);
+		TAILQ_INIT(&lcore_cfg->head);
 	}
 
 	/* initialize config defaults */
-- 
2.34.1



* [RFC 5/5] service: keep per-lcore state in lcore variable
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
                   ` (3 preceding siblings ...)
  2024-02-08 18:16 ` [RFC 4/5] power: keep per-lcore " Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  4 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 119 ++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index d959c91459..c557e80409 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,11 +102,12 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
+	else {
+		struct core_state *cs;
+		RTE_LCORE_VAR_FOREACH(cs, lcore_states)
+			memset(cs, 0, sizeof(struct core_state));
 	}
 
 	int i;
@@ -122,7 +124,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +137,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +286,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +293,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +454,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +467,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +489,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +535,16 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs =
+		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +552,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +573,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +590,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +642,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +694,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +712,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +737,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +761,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +785,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +815,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +824,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +849,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +860,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +868,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +876,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +885,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +901,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +948,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +977,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +989,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1028,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1



* RE: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-09  8:25   ` Morten Brørup
  2024-02-09 11:46     ` Mattias Rönnblom
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
  1 sibling, 1 reply; 42+ messages in thread
From: Morten Brørup @ 2024-02-09  8:25 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Thursday, 8 February 2024 19.17
> 
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small chunks of often-used data, which is related logically, but where
> there are performance benefits to reap from having updates being local
> to an lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decouple the values' lifetime from that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoids excessive use of padding,
> polluting caches with unused data.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---

This looks very promising. :-)

Here's a bunch of comments, questions and suggestions.


* Question: Performance.
What is the cost of accessing an lcore variable vs a variable in TLS?
I suppose the relative cost diminishes if the variable is a larger struct, compared to a simple uint64_t.

Some of my suggestions below might also affect performance.


* Advantage: Provides direct access to worker thread variables.
With the current alternative (thread-local storage), the main thread cannot access the TLS variables of the worker threads,
unless worker threads publish global access pointers.
Lcore variables of any lcore thread can be directly accessed by any thread, which simplifies code.


* Advantage: Roadmap towards hugemem.
It would be nice if the lcore variable memory was allocated in hugemem, to reduce TLB misses.
The current alternative (thread-local storage) is also not using hugemem, so not a degradation.

Lcore variables are available very early at startup, so I guess the RTE memory allocator is not yet available.
Hugemem could be allocated using O/S allocation, so there is a possible road towards using hugemem.

Either way, using hugemem would require one more indirection (the pointer to the allocated hugemem).
I don't know which has better performance, using hugemem or avoiding the additional pointer dereferencing.


* Suggestion: Consider adding an entry for unregistered non-EAL threads.
Please consider making room for one more entry, shared by all unregistered non-EAL threads, i.e.
making the array size RTE_MAX_LCORE + 1 and indexing by (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).

It would be convenient for the use cases where a variable shared by the unregistered non-EAL threads doesn't need special treatment.

Obviously, this might affect performance.
If the performance cost is not negligible, the additional entry (and indexing branch) could be disabled at build time.
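
A minimal, self-contained sketch of that indexing, for illustration only. The SKETCH_* names and the counter array are hypothetical stand-ins, not part of the RFC or of DPDK; the "any" id is assumed to be a value that is not below the maximum lcore count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for RTE_MAX_LCORE and LCORE_ID_ANY. */
#define SKETCH_MAX_LCORE 128
#define SKETCH_LCORE_ID_ANY UINT32_MAX

/* One slot per lcore id, plus one shared by all unregistered threads. */
#define SKETCH_NUM_SLOTS (SKETCH_MAX_LCORE + 1)

static uint64_t sketch_counters[SKETCH_NUM_SLOTS];

/* Map an lcore id to a slot; every out-of-range id gets the extra slot. */
static inline unsigned int
sketch_slot_index(unsigned int lcore_id)
{
	return lcore_id < SKETCH_MAX_LCORE ? lcore_id : SKETCH_MAX_LCORE;
}

/* Return the per-lcore instance, or the shared one for unregistered threads. */
static inline uint64_t *
sketch_counter_ptr(unsigned int lcore_id)
{
	unsigned int idx = sketch_slot_index(lcore_id);

	assert(idx < SKETCH_NUM_SLOTS);
	return &sketch_counters[idx];
}
```

The branch in sketch_slot_index() is the per-access cost referred to above; the shared slot would still need its own synchronization if several unregistered threads update it.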


* Suggestion: Do not fix the alignment at 16 bytes.
Pass an alignment parameter to rte_lcore_var_alloc() and use alignof() when calling it:

+#include <stdalign.h>
+
+#define RTE_LCORE_VAR_ALLOC(name)			\
+	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
+
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
+	name = rte_lcore_var_alloc(size, alignment)
+
+#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
+	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
+
+ +++ /config/rte_config.h
+#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
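
To illustrate what an alignment parameter buys, here is a minimal sketch of an offset allocator that honors a caller-supplied alignment. It is an assumption about how such an allocator could look, not the RFC's implementation; all sketch_* and SKETCH_* names are hypothetical, and it returns an offset into a per-lcore buffer rather than a handle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size of each lcore's buffer. */
#define SKETCH_MAX_LCORE_VAR 1024

/* Next free offset; the same offset is valid in every lcore's buffer. */
static size_t sketch_offset;

/* Round value up to the nearest multiple of align (a power of two). */
static inline size_t
sketch_align_ceil(size_t value, size_t align)
{
	return (value + align - 1) & ~(align - 1);
}

/* Reserve size bytes at the requested alignment; (size_t)-1 if full. */
static inline size_t
sketch_lcore_var_alloc(size_t size, size_t align)
{
	size_t start;

	assert(align != 0 && (align & (align - 1)) == 0);

	start = sketch_align_ceil(sketch_offset, align);
	if (size > SKETCH_MAX_LCORE_VAR ||
	    start > SKETCH_MAX_LCORE_VAR - size)
		return (size_t)-1;

	sketch_offset = start + size;
	return start;
}
```

In this model, three 1-byte variables allocated with an alignment of 1 land on offsets 0, 1 and 2, whereas a fixed 16-byte alignment would place them on 0, 16 and 32.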


* Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but behaves differently.

> +/**
> + * Iterate over each lcore id's value for a lcore variable.
> + */
> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
> +	for (unsigned int lcore_id =					\
> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
> +	     lcore_id < RTE_MAX_LCORE;					\
> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> +

The macro name RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(i), which only iterates on running cores.
You might want to give it a name that differs more.

If it wasn't for API breakage, I would suggest renaming RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)

Small detail: "var" is a pointer, so consider renaming it to "ptr" and adding _PTR to the macro name.
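
The behavioral difference can be shown with a small stand-alone sketch. The SKETCH_* macros and the enabled-lcore table below are hypothetical models of the two iteration styles, not the DPDK macros themselves.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_LCORE 8 /* hypothetical stand-in for RTE_MAX_LCORE */

/* Hypothetical: the lcore ids that are enabled in this run. */
static const bool sketch_enabled[SKETCH_MAX_LCORE] = {
	[0] = true, [2] = true, [3] = true
};

/* One value per lcore id, whether that id is in use or not. */
static int sketch_values[SKETCH_MAX_LCORE];

static_assert(sizeof(sketch_values) / sizeof(sketch_values[0]) ==
	      SKETCH_MAX_LCORE, "one slot per lcore id");

/* Visits every slot, in the style of RTE_LCORE_VAR_FOREACH(). */
#define SKETCH_VAR_FOREACH_PTR(ptr)				\
	for ((ptr) = &sketch_values[0];				\
	     (ptr) < &sketch_values[SKETCH_MAX_LCORE];		\
	     (ptr)++)

/* Visits only enabled lcore ids, in the style of RTE_LCORE_FOREACH(). */
#define SKETCH_LCORE_FOREACH(id)				\
	for ((id) = 0; (id) < SKETCH_MAX_LCORE; (id)++)		\
		if (sketch_enabled[(id)])

/* Number of iterations made by the "every slot" style. */
static inline unsigned int
sketch_count_slots(void)
{
	unsigned int n = 0;
	int *ptr;

	SKETCH_VAR_FOREACH_PTR(ptr)
		n++;
	return n;
}

/* Number of iterations made by the "enabled only" style. */
static inline unsigned int
sketch_count_enabled(void)
{
	unsigned int n = 0;
	unsigned int id;

	SKETCH_LCORE_FOREACH(id)
		n++;
	return n;
}
```

With the table above, the first loop runs SKETCH_MAX_LCORE times and the second only three times, which is the kind of surprise a more distinct macro name would help avoid.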


* Re: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-09  8:25   ` Morten Brørup
@ 2024-02-09 11:46     ` Mattias Rönnblom
  2024-02-09 13:04       ` Morten Brørup
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-09 11:46 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-09 09:25, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Thursday, 8 February 2024 19.17
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is related logically, but where
>> there are performance benefits to reap from having updates being local
>> to an lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decouple the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality provided by
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoids excessive use of padding,
>> polluting caches with unused data.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
> 
> This looks very promising. :-)
> 
> Here's a bunch of comments, questions and suggestions.
> 
> 
> * Question: Performance.
> What is the cost of accessing an lcore variable vs a variable in TLS?
> I suppose the relative cost diminishes if the variable is a larger struct, compared to a simple uint64_t.
> 

In case all the relevant data is available in a cache close to the core, 
both options carry quite low overhead.

Accessing an lcore variable will always require a TLS lookup, in the form 
of retrieving the lcore_id of the current thread. In that sense, there 
will likely be a number of extra instructions required to do the lcore 
variable address lookup (i.e., doing the load from rte_lcore_var table 
based on the lcore_id you just looked up, and adding the variable's offset).
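
As a rough illustration of those steps, here is a simplified stand-alone model, not the actual <rte_lcore_var.h> code. It assumes one fixed-size buffer per lcore id and represents a variable by its offset; the sketch_* names, the sizes and the TLS variable are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 4      /* hypothetical lcore id limit */
#define SKETCH_MAX_LCORE_VAR 64 /* hypothetical per-lcore buffer size */

/* One contiguous buffer per lcore id; an lcore's variables are neighbors. */
static _Alignas(16) unsigned char
sketch_buf[SKETCH_MAX_LCORE][SKETCH_MAX_LCORE_VAR];

/* Stand-in for the per-thread lcore id that is kept in TLS. */
static _Thread_local unsigned int sketch_lcore_id;

/* Address of a variable (given by its offset) for an explicit lcore id. */
static inline void *
sketch_lcore_ptr(unsigned int lcore_id, size_t offset)
{
	assert(lcore_id < SKETCH_MAX_LCORE);
	assert(offset < SKETCH_MAX_LCORE_VAR);

	return &sketch_buf[lcore_id][offset];
}

/* Same, for the calling thread: one TLS read, then index and add offset. */
static inline void *
sketch_ptr(size_t offset)
{
	return sketch_lcore_ptr(sketch_lcore_id, offset);
}
```

In this model the same variable on two adjacent lcore ids is exactly SKETCH_MAX_LCORE_VAR bytes apart, so the lookup reduces to a multiply-and-add on top of the TLS read.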

A TLS lookup will incur an extra overhead of less than a clock cycle, 
compared to accessing a non-TLS static variable, in case static linking 
is used. For shared objects, TLS is much more expensive (something often 
visible in dynamically linked DPDK app flame graphs, in the form of the 
__tls_get_addr symbol). Then you need to add ~3 cc/access. This is on a 
micro benchmark running on an x86_64 Raptor Lake P-core.

(To visualize the difference between shared object and not, one can use 
Compiler Explorer and -fPIC versus -fPIE.)

Things get more complicated if you access the same variable in the same 
section of code, since then it can be left on the stack/in a register by 
the compiler, especially if LTO is used. In other words, if you do 
rte_lcore_id() several times in a row, only the first one will cost you 
anything. This happens fairly often in DPDK, with rte_lcore_id().

Finally, if you do something like

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index af9fffd81b..a65c30d27e 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
  static __rte_always_inline
  struct rte_rand_state *__rte_rand_get_state(void)
  {
-       unsigned int idx;
-
-       idx = rte_lcore_id();
-
-       if (unlikely(idx == LCORE_ID_ANY))
-               return &unregistered_rand_state;
-
-       return RTE_LCORE_VAR_PTR(rand_state);
+       return &unregistered_rand_state;
  }

  uint64_t

...and re-run the rand_perf_autotest, at least I see no difference at 
all (in a statically linked build). Both result in rte_rand() using ~11
cc/call. What that suggests is that TLS overhead is very small, and that
any extra instructions required by lcore variables don't add much, if
anything at all, at least in this particular case.

> Some of my suggestions below might also affect performance.
> 
> 
> * Advantage: Provides direct access to worker thread variables.
> With the current alternative (thread-local storage), the main thread cannot access the TLS variables of the worker threads,
> unless worker threads publish global access pointers.
> Lcore variables of any lcore thread can be directly accessed by any thread, which simplifies code.
> 
> 
> * Advantage: Roadmap towards hugemem.
> It would be nice if the lcore variable memory was allocated in hugemem, to reduce TLB misses.
> The current alternative (thread-local storage) is also not using hugemem, so not a degradation.
> 

I agree, but the thing is it's hard to figure out how much memory is 
required for this kind of variable, given how DPDK is built and
linked. In an OS kernel, you can just take all the symbols, put them in 
a special section, and size that section. Such a thing can't easily be 
done with DPDK, since shared object builds are supported, plus that this 
facility should be available not only to DPDK modules, but also the 
application, so relying on linker scripts isn't really feasible (and
probably not even feasible for DPDK itself).

In that scenario, you want to size up the per-lcore buffer to be so 
large, you don't have to worry about overruns. That will waste memory. 
If you use huge page memory, paging can't help you to avoid 
pre-allocating actual physical memory.

That said, even large (by static per-lcore data standards) buffers are 
potentially small enough not to grow the amount of memory used by a DPDK 
process too much. You need to provision for RTE_MAX_LCORE of them though.
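
A back-of-the-envelope sketch of that provisioning cost follows. The
numbers used in the note after it (128 lcore ids, a 1 MB buffer per lcore
id, 4 kB pages) are assumptions for illustration, and both helper
functions are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Virtual memory reserved up front: one full buffer per lcore id. */
static size_t
lcore_var_reserved(size_t max_lcore, size_t per_lcore_size)
{
	/* Guard against overflow in the multiplication below. */
	assert(per_lcore_size == 0 || max_lcore <= (size_t)-1 / per_lcore_size);
	return max_lcore * per_lcore_size;
}

/*
 * Physical memory actually touched with demand paging, assuming each
 * per-lcore buffer starts on a page boundary and only the first
 * used_bytes of the buffers of used_lcores lcore ids are ever written.
 */
static size_t
lcore_var_resident(size_t used_lcores, size_t used_bytes, size_t page_size)
{
	size_t pages;

	assert(page_size != 0);
	pages = (used_bytes + page_size - 1) / page_size;
	return used_lcores * pages * page_size;
}
```

Under these assumptions, lcore_var_reserved(128, 1 << 20) is 128 MB of
address space, while 8 active lcore ids using 5000 bytes each touch only
lcore_var_resident(8, 5000, 4096) = 64 kB of physical memory. Backing the
whole reservation with huge pages would instead commit the full amount.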

The value of lcore variables should be small, and thus incur few TLB 
misses, so you may not gain much from huge pages. In my world, it's more 
about "fitting often-used per-lcore data into L1 or L2 CPU caches", 
rather than the easier "fitting often-used per-lcore data into a working 
set size reasonably expected to be covered by hardware TLB/caches".

> Lcore variables are available very early at startup, so I guess the RTE memory allocator is not yet available.
> Hugemem could be allocated using O/S allocation, so there is a possible road towards using hugemem.
> 

With the current design, that's true. I'm not sure it's a strict
requirement though, but it does make things simpler.

> Either way, using hugemem would require one more indirection (the pointer to the allocated hugemem).
> I don't know which has better performance, using hugemem or avoiding the additional pointer dereferencing.
> 
> 
> * Suggestion: Consider adding an entry for unregistered non-EAL threads.
> Please consider making room for one more entry, shared by all unregistered non-EAL threads, i.e.
> making the array size RTE_MAX_LCORE + 1 and indexing by (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
> 
> It would be convenient for the use cases where a variable shared by the unregistered non-EAL threads don't need special treatment.
> 

I thought about this, but it would require a conditional in the lookup 
macro, as you show. More importantly, it would make the whole 
<rte_lcore_var.h> thing less elegant and harder to understand. It's bad 
enough that "per-lcore" is actually "per-lcore id" (or the equivalent
"per-EAL thread and registered non-EAL thread"). Adding a "btw it's <what
I said before> + 1" is not an improvement.

But useful? Sure.

I think you may still need other data for dealing with unregistered 
threads, for example a mutex or spin lock to deal with concurrency 
issues that arise with shared data.

There may also be cases where you are best off by simply disallowing
unregistered threads from calling into that API.

> Obviously, this might affect performance.
> If the performance cost is not negligible, the additional entry (and indexing branch) could be disabled at build time.
> 
> 
> * Suggestion: Do not fix the alignment at 16 byte.
> Pass an alignment parameter to rte_lcore_var_alloc() and use alignof() when calling it:
> 
> +#include <stdalign.h>
> +
> +#define RTE_LCORE_VAR_ALLOC(name)			\
> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> +
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
> +	name = rte_lcore_var_alloc(size, alignment)
> +
> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> +
> + +++ config/rte_config.h
> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
> 
> 

That seems like a very good idea. I'll look into it.
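
To make the idea concrete, here is a minimal sketch of an offset
allocator that takes the alignment as a parameter. It is not the RFC's
rte_lcore_var_alloc(); the names next_offset, MAX_LCORE_VAR and
lcore_var_alloc_offset() are hypothetical, and the function returns an
offset into the per-lcore buffer rather than a handle:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LCORE_VAR 1024  /* hypothetical per-lcore buffer size, in bytes */

static size_t next_offset;  /* first unused byte in every per-lcore buffer */

/*
 * Reserve size bytes at the given power-of-two alignment. Returns the
 * offset shared by all lcore ids, or (size_t)-1 if the buffer is full.
 */
static size_t
lcore_var_alloc_offset(size_t size, size_t align)
{
	size_t start;

	assert(align != 0 && (align & (align - 1)) == 0);

	/* Round the next free offset up to the requested alignment. */
	start = (next_offset + align - 1) & ~(align - 1);
	if (start > MAX_LCORE_VAR || size > MAX_LCORE_VAR - start)
		return (size_t)-1;

	next_offset = start + size;
	return start;
}
```

A caller would pass sizeof(T) and alignof(T) for the variable's type, so
a one-byte flag no longer costs 16 bytes, while a cache-line-aligned
struct can still ask for 64.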

> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but behaves differently.
> 
>> +/**
>> + * Iterate over each lcore id's value for a lcore variable.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
>> +	for (unsigned int lcore_id =					\
>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
>> +	     lcore_id < RTE_MAX_LCORE;					\
>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>> +
> 
> The macro name RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(i), which only iterates on running cores.
> You might want to give it a name that differs more.
> 

True.

Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for confusion, 
for sure.

Being consistent with <rte_lcore.h> is not so easy, since it's not even 
consistent with itself. For example, rte_lcore_count() returns the 
number of lcores (EAL threads) *plus the number of registered non-EAL 
threads*, and RTE_LCORE_FOREACH() gives a different count. :)
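
The behavioral difference behind the naming concern can be sketched as
follows. This is an illustration with made-up names (counter,
lcore_enabled, MAX_LCORE), not DPDK code: one loop visits every lcore id,
as RTE_LCORE_VAR_FOREACH() does, and the other visits only ids currently
in use, which is roughly how RTE_LCORE_FOREACH() is described in this
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LCORE 4  /* hypothetical stand-in for RTE_MAX_LCORE */

static_assert(MAX_LCORE > 0, "need at least one lcore id");

static uint64_t counter[MAX_LCORE];    /* one value per lcore id */
static bool lcore_enabled[MAX_LCORE];  /* ids that currently have a thread */

/* Sum over every lcore id, whether or not a thread currently holds it. */
static uint64_t
sum_all_ids(void)
{
	uint64_t sum = 0;

	for (unsigned int id = 0; id < MAX_LCORE; id++)
		sum += counter[id];
	return sum;
}

/* Sum over the lcore ids that are currently enabled only. */
static uint64_t
sum_enabled_ids(void)
{
	uint64_t sum = 0;

	for (unsigned int id = 0; id < MAX_LCORE; id++)
		if (lcore_enabled[id])
			sum += counter[id];
	return sum;
}
```

The two sums differ whenever a value was left behind by a thread that no
longer holds its lcore id, which is possible since the values outlive the
threads.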

> If it wasn't for API breakage, I would suggest renaming RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
> 
> Small detail: "var" is a pointer, so consider renaming it to "ptr" and adding _PTR to the macro name.

The "var" name comes from how <sys/queue.h> names things. I think I had 
it as "ptr" initially. I'll change it back.

Thanks a lot Morten.


* RE: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-09 11:46     ` Mattias Rönnblom
@ 2024-02-09 13:04       ` Morten Brørup
  2024-02-19  7:49         ` Mattias Rönnblom
  0 siblings, 1 reply; 42+ messages in thread
From: Morten Brørup @ 2024-02-09 13:04 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Friday, 9 February 2024 12.46
> 
> On 2024-02-09 09:25, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Thursday, 8 February 2024 19.17
> >>
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> >>
> >> The primary <rte_lcore_var.h> use case is for statically allocating
> >> small chunks of often-used data, which is related logically, but
> where
> >> there are performance benefits to reap from having updates being
> local
> >> to an lcore.
> >>
> >> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> >> _Thread_local), but decoupling the values' life time with that of
> the
> >> threads.
> >>
> >> Lcore variables are also similar in terms of functionality provided
> by
> >> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >> build-time machinery. DPCPU uses linker scripts, which effectively
> >> prevents the reuse of its, otherwise seemingly viable, approach.
> >>
> >> The currently-prevailing way to solve the same problem as lcore
> >> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-
> sized
> >> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> >> lcore variables over this approach is that data related to the same
> >> lcore now is close (spatially, in memory), rather than data used by
> >> the same module, which in turn avoid excessive use of padding,
> >> polluting caches with unused data.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> ---
> >
> > This looks very promising. :-)
> >
> > Here's a bunch of comments, questions and suggestions.
> >
> >
> > * Question: Performance.
> > What is the cost of accessing an lcore variable vs a variable in TLS?
> > I suppose the relative cost diminishes if the variable is a larger
> struct, compared to a simple uint64_t.
> >
> 
> In case all the relevant data is available in a cache close to the
> core,
> both options carry quite low overhead.
> 
> Accessing a lcore variable will always require a TLS lookup, in the
> form
> of retrieving the lcore_id of the current thread. In that sense, there
> will likely be a number of extra instructions required to do the lcore
> variable address lookup (i.e., doing the load from rte_lcore_var table
> based on the lcore_id you just looked up, and adding the variable's
> offset).
> 
> A TLS lookup will incur an extra overhead of less than a clock cycle,
> compared to accessing a non-TLS static variable, in case static linking
> is used. For shared objects, TLS is much more expensive (something
> often
> visible in dynamically linked DPDK app flame graphs, in the form of the
> __tls_get_addr symbol). Then you need to add ~3 cc/access. This is on a
> microbenchmark running on an x86_64 Raptor Lake P-core.
>
> (To visualize the difference between shared object and not, one can use
> Compiler Explorer and -fPIC versus -fPIE.)
> 
> Things get more complicated if you access the same variable in the same
> section of code, since then it can be left on the stack/in a register by
> the compiler, especially if LTO is used. In other words, if you do
> rte_lcore_id() several times in a row, only the first one will cost you
> anything. This happens fairly often in DPDK, with rte_lcore_id().
> 
> Finally, if you do something like
> 
> diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
> index af9fffd81b..a65c30d27e 100644
> --- a/lib/eal/common/rte_random.c
> +++ b/lib/eal/common/rte_random.c
> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
>   static __rte_always_inline
>   struct rte_rand_state *__rte_rand_get_state(void)
>   {
> -       unsigned int idx;
> -
> -       idx = rte_lcore_id();
> -
> -       if (unlikely(idx == LCORE_ID_ANY))
> -               return &unregistered_rand_state;
> -
> -       return RTE_LCORE_VAR_PTR(rand_state);
> +       return &unregistered_rand_state;
>   }
> 
>   uint64_t
> 
> ...and re-run the rand_perf_autotest, at least I see no difference at
> all (in a statically linked build). Both result in rte_rand() using
> ~11
> cc/call. What that suggests is that TLS overhead is very small, and
> that
> any extra instructions required by lcore variables don't add much, if
> anything at all, at least in this particular case.

Excellent. Thank you for a thorough and detailed answer, Mattias.

> 
> > Some of my suggestions below might also affect performance.
> >
> >
> > * Advantage: Provides direct access to worker thread variables.
> > With the current alternative (thread-local storage), the main thread
> cannot access the TLS variables of the worker threads,
> > unless worker threads publish global access pointers.
> > Lcore variables of any lcore thread can be directly accessed by any
> thread, which simplifies code.
> >
> >
> > * Advantage: Roadmap towards hugemem.
> > It would be nice if the lcore variable memory was allocated in
> hugemem, to reduce TLB misses.
> > The current alternative (thread-local storage) is also not using
> hugemem, so not a degradation.
> >
> 
> I agree, but the thing is it's hard to figure out how much memory is
> required for this kind of variable, given how DPDK is built and
> linked. In an OS kernel, you can just take all the symbols, put them in
> a special section, and size that section. Such a thing can't easily be
> done with DPDK, since shared object builds are supported, plus that
> this
> facility should be available not only to DPDK modules, but also the
> application, so relying on linker scripts isn't really feasible (and
> probably not even feasible for DPDK itself).
> 
> In that scenario, you want to size up the per-lcore buffer to be so
> large, you don't have to worry about overruns. That will waste memory.
> If you use huge page memory, paging can't help you to avoid
> pre-allocating actual physical memory.

Good point.
I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE), but I hadn't considered how paging helps us use less physical memory than that.

> 
> That said, even large (by static per-lcore data standards) buffers are
> potentially small enough not to grow the amount of memory used by a
> DPDK
> process too much. You need to provision for RTE_MAX_LCORE of them
> though.
> 
> The value of lcore variables should be small, and thus incur few TLB
> misses, so you may not gain much from huge pages. In my world, it's
> more
> about "fitting often-used per-lcore data into L1 or L2 CPU caches",
> rather than the easier "fitting often-used per-lcore data into a
> working
> set size reasonably expected to be covered by hardware TLB/caches".

Yes, I suppose that lcore variables are intended to be small, and large per-lcore structures should keep following the current design patterns for allocation and access.

Perhaps this guideline is worth mentioning in the documentation.

> 
> > Lcore variables are available very early at startup, so I guess the
> RTE memory allocator is not yet available.
> > Hugemem could be allocated using O/S allocation, so there is a
> possible road towards using hugemem.
> >
> 
> With the current design, that's true. I'm not sure it's a strict
> requirement though, but it does make things simpler.
> 
> > Either way, using hugemem would require one more indirection (the
> pointer to the allocated hugemem).
> > I don't know which has better performance, using hugemem or avoiding
> the additional pointer dereferencing.
> >
> >
> > * Suggestion: Consider adding an entry for unregistered non-EAL
> threads.
> > Please consider making room for one more entry, shared by all
> unregistered non-EAL threads, i.e.
> > making the array size RTE_MAX_LCORE + 1 and indexing by
> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
> >
> > It would be convenient for the use cases where a variable shared by
> the unregistered non-EAL threads don't need special treatment.
> >
> 
> I thought about this, but it would require a conditional in the lookup
> macro, as you show. More importantly, it would make the whole
> <rte_lcore_var.h> thing less elegant and harder to understand. It's bad
> enough that "per-lcore" is actually "per-lcore id" (or the equivalent
> "per-EAL thread and registered non-EAL thread"). Adding a "btw it's <what
> I said before> + 1" is not an improvement.

We could promote the "one more entry for unregistered non-EAL threads" design pattern (for relevant use cases only!) by extending EAL with one more TLS variable, maintained like _thread_id, but set to RTE_MAX_LCORE when _thread_id is set to -1:

+++ eal_common_thread.c:
  RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
+ RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;

and

+++ rte_lcore.h:
static inline unsigned
rte_lcore_id(void)
{
	return RTE_PER_LCORE(_lcore_id);
}
+ static inline unsigned
+ rte_lcore_idx(void)
+ {
+ 	return RTE_PER_LCORE(_lcore_idx);
+ }

That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently used.
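
A self-contained sketch of that pattern follows, under the assumption
that a second TLS value is updated wherever the lcore id is assigned or
released. All names here (my_lcore_id, my_lcore_idx, set_lcore_id(),
LCORE_ID_NONE, MAX_LCORE) are hypothetical stand-ins, not EAL symbols:

```c
#include <assert.h>
#include <limits.h>

#define MAX_LCORE 4             /* hypothetical stand-in for RTE_MAX_LCORE */
#define LCORE_ID_NONE UINT_MAX  /* hypothetical stand-in for LCORE_ID_ANY */

/* Threads start out unregistered: no lcore id, index = spare entry. */
static _Thread_local unsigned int my_lcore_id = LCORE_ID_NONE;
static _Thread_local unsigned int my_lcore_idx = MAX_LCORE;

/* Keep both TLS values in sync when an id is assigned or released. */
static void
set_lcore_id(unsigned int id)
{
	assert(id < MAX_LCORE || id == LCORE_ID_NONE);
	my_lcore_id = id;
	/* The conditional is paid once here, not on every lookup. */
	my_lcore_idx = id < MAX_LCORE ? id : MAX_LCORE;
}

static inline unsigned int
lcore_id(void)
{
	return my_lcore_id;
}

/* Branch-free index into an array of MAX_LCORE + 1 entries. */
static inline unsigned int
lcore_idx(void)
{
	return my_lcore_idx;
}
```

With this, an array sized MAX_LCORE + 1 can be indexed by lcore_idx()
directly, and all unregistered threads land on the last entry.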

> 
> But useful? Sure.
> 
> I think you may still need other data for dealing with unregistered
> threads, for example a mutex or spin lock to deal with concurrency
> issues that arise with shared data.

Adding the extra entry is only for the benefit of use cases where special handling is not required. It will make the code for those use cases much cleaner. I think it is useful.

Use cases requiring special handling should still do the special handling they do today.

> 
> There may also be cases where you are best off by simply disallowing
> unregistered threads from calling into that API.
> 
> > Obviously, this might affect performance.
> > If the performance cost is not negligible, the additional entry (and
> indexing branch) could be disabled at build time.
> >
> >
> > * Suggestion: Do not fix the alignment at 16 byte.
> > Pass an alignment parameter to rte_lcore_var_alloc() and use
> alignof() when calling it:
> >
> > +#include <stdalign.h>
> > +
> > +#define RTE_LCORE_VAR_ALLOC(name)			\
> > +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> > +
> > +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
> > +	name = rte_lcore_var_alloc(size, alignment)
> > +
> > +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
> > +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> > +
> > + +++ config/rte_config.h
> > +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
> >
> >
> 
> That seems like a very good idea. I'll look into it.
> 
> > * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but
> behaves differently.
> >
> >> +/**
> >> + * Iterate over each lcore id's value for a lcore variable.
> >> + */
> >> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
> >> +	for (unsigned int lcore_id =					\
> >> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
> >> +	     lcore_id < RTE_MAX_LCORE;					\
> >> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> >> +
> >
> > The macro name RTE_LCORE_VAR_FOREACH() resembles
> RTE_LCORE_FOREACH(i), which only iterates on running cores.
> > You might want to give it a name that differs more.
> >
> 
> True.
> 
> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
> confusion,
> for sure.
> 
> Being consistent with <rte_lcore.h> is not so easy, since it's not even
> consistent with itself. For example, rte_lcore_count() returns the
> number of lcores (EAL threads) *plus the number of registered non-EAL
> threads*, and RTE_LCORE_FOREACH() gives a different count. :)

Naming is hard. I don't have a good name, and can only offer inspiration...

<rte_lcore.h> has RTE_LCORE_FOREACH() and its RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.

Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a variant.

> 
> > If it wasn't for API breakage, I would suggest renaming
> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
> >
> > Small detail: "var" is a pointer, so consider renaming it to "ptr"
> and adding _PTR to the macro name.
> 
> The "var" name comes from how <sys/queue.h> names things. I think I had
> it as "ptr" initially. I'll change it back.

Thanks.

> 
> Thanks a lot Morten.


* Re: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-09 13:04       ` Morten Brørup
@ 2024-02-19  7:49         ` Mattias Rönnblom
  2024-02-19 11:10           ` Morten Brørup
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  7:49 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-09 14:04, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Friday, 9 February 2024 12.46
>>
>> On 2024-02-09 09:25, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>> Sent: Thursday, 8 February 2024 19.17
>>>>
>>>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>>>
>>>> An lcore variable has one value for every current and future lcore
>>>> id-equipped thread.
>>>>
>>>> The primary <rte_lcore_var.h> use case is for statically allocating
>>>> small chunks of often-used data, which is related logically, but
>> where
>>>> there are performance benefits to reap from having updates being
>> local
>>>> to an lcore.
>>>>
>>>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>>>> _Thread_local), but decoupling the values' life time with that of
>> the
>>>> threads.
>>>>
>>>> Lcore variables are also similar in terms of functionality provided
>> by
>>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>>>> build-time machinery. DPCPU uses linker scripts, which effectively
>>>> prevents the reuse of its, otherwise seemingly viable, approach.
>>>>
>>>> The currently-prevailing way to solve the same problem as lcore
>>>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-
>> sized
>>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>>>> lcore variables over this approach is that data related to the same
>>>> lcore now is close (spatially, in memory), rather than data used by
>>>> the same module, which in turn avoid excessive use of padding,
>>>> polluting caches with unused data.
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>
>>> This looks very promising. :-)
>>>
>>> Here's a bunch of comments, questions and suggestions.
>>>
>>>
>>> * Question: Performance.
>>> What is the cost of accessing an lcore variable vs a variable in TLS?
>>> I suppose the relative cost diminishes if the variable is a larger
>> struct, compared to a simple uint64_t.
>>>
>>
>> In case all the relevant data is available in a cache close to the
>> core,
>> both options carry quite low overhead.
>>
>> Accessing a lcore variable will always require a TLS lookup, in the
>> form
>> of retrieving the lcore_id of the current thread. In that sense, there
>> will likely be a number of extra instructions required to do the lcore
>> variable address lookup (i.e., doing the load from rte_lcore_var table
>> based on the lcore_id you just looked up, and adding the variable's
>> offset).
>>
>> A TLS lookup will incur an extra overhead of less than a clock cycle,
>> compared to accessing a non-TLS static variable, in case static linking
>> is used. For shared objects, TLS is much more expensive (something
>> often
>> visible in dynamically linked DPDK app flame graphs, in the form of the
>> __tls_get_addr symbol). Then you need to add ~3 cc/access. This is on a
>> microbenchmark running on an x86_64 Raptor Lake P-core.
>>
>> (To visualize the difference between shared object and not, one can use
>> Compiler Explorer and -fPIC versus -fPIE.)
>>
>> Things get more complicated if you access the same variable in the same
>> section of code, since then it can be left on the stack/in a register by
>> the compiler, especially if LTO is used. In other words, if you do
>> rte_lcore_id() several times in a row, only the first one will cost you
>> anything. This happens fairly often in DPDK, with rte_lcore_id().
>>
>> Finally, if you do something like
>>
>> diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
>> index af9fffd81b..a65c30d27e 100644
>> --- a/lib/eal/common/rte_random.c
>> +++ b/lib/eal/common/rte_random.c
>> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
>>    static __rte_always_inline
>>    struct rte_rand_state *__rte_rand_get_state(void)
>>    {
>> -       unsigned int idx;
>> -
>> -       idx = rte_lcore_id();
>> -
>> -       if (unlikely(idx == LCORE_ID_ANY))
>> -               return &unregistered_rand_state;
>> -
>> -       return RTE_LCORE_VAR_PTR(rand_state);
>> +       return &unregistered_rand_state;
>>    }
>>
>>    uint64_t
>>
>> ...and re-run the rand_perf_autotest, at least I see no difference at
>> all (in a statically linked build). Both result in rte_rand() using
>> ~11
>> cc/call. What that suggests is that TLS overhead is very small, and
>> that
>> any extra instructions required by lcore variables don't add much, if
>> anything at all, at least in this particular case.
> 
> Excellent. Thank you for a thorough and detailed answer, Mattias.
> 
>>
>>> Some of my suggestions below might also affect performance.
>>>
>>>
>>> * Advantage: Provides direct access to worker thread variables.
>>> With the current alternative (thread-local storage), the main thread
>> cannot access the TLS variables of the worker threads,
>>> unless worker threads publish global access pointers.
>>> Lcore variables of any lcore thread can be directly accessed by any
>> thread, which simplifies code.
>>>
>>>
>>> * Advantage: Roadmap towards hugemem.
>>> It would be nice if the lcore variable memory was allocated in
>> hugemem, to reduce TLB misses.
>>> The current alternative (thread-local storage) is also not using
>> hugemem, so not a degradation.
>>>
>>
>> I agree, but the thing is it's hard to figure out how much memory is
>> required for this kind of variable, given how DPDK is built and
>> linked. In an OS kernel, you can just take all the symbols, put them in
>> a special section, and size that section. Such a thing can't easily be
>> done with DPDK, since shared object builds are supported, plus that
>> this
>> facility should be available not only to DPDK modules, but also the
>> application, so relying on linker scripts isn't really feasible (and
>> probably not even feasible for DPDK itself).
>>
>> In that scenario, you want to size up the per-lcore buffer to be so
>> large, you don't have to worry about overruns. That will waste memory.
>> If you use huge page memory, paging can't help you to avoid
>> pre-allocating actual physical memory.
> 
> Good point.
> I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE), but I hadn't considered how paging helps us use less physical memory than that.
> 
>>
>> That said, even large (by static per-lcore data standards) buffers are
>> potentially small enough not to grow the amount of memory used by a
>> DPDK
>> process too much. You need to provision for RTE_MAX_LCORE of them
>> though.
>>
>> The value of lcore variables should be small, and thus incur few TLB
>> misses, so you may not gain much from huge pages. In my world, it's
>> more
>> about "fitting often-used per-lcore data into L1 or L2 CPU caches",
>> rather than the easier "fitting often-used per-lcore data into a
>> working
>> set size reasonably expected to be covered by hardware TLB/caches".
> 
> Yes, I suppose that lcore variables are intended to be small, and large per-lcore structures should keep following the current design patterns for allocation and access.
> 

It seems to me that support for per-lcore heaps should be the solution 
for supporting use cases requiring many, larger and/or dynamic objects 
on a per-lcore basis.

Ideally, you would design both that mechanism and lcore variables
together, but if you couple enough improvements together, you will
never get anywhere. An instance where perfect is the enemy of good,
perhaps.

> Perhaps this guideline is worth mentioning in the documentation.
> 

What is missing, more specifically? The size limitation and the static
nature of lcore variables are described, and which current design
patterns they are expected to (partly) replace is also covered.

>>
>>> Lcore variables are available very early at startup, so I guess the
>> RTE memory allocator is not yet available.
>>> Hugemem could be allocated using O/S allocation, so there is a
>> possible road towards using hugemem.
>>>
>>
>> With the current design, that's true. I'm not sure it's a strict
>> requirement though, but it does make things simpler.
>>
>>> Either way, using hugemem would require one more indirection (the
>> pointer to the allocated hugemem).
>>> I don't know which has better performance, using hugemem or avoiding
>> the additional pointer dereferencing.
>>>
>>>
>>> * Suggestion: Consider adding an entry for unregistered non-EAL
>> threads.
>>> Please consider making room for one more entry, shared by all
>> unregistered non-EAL threads, i.e.
>>> making the array size RTE_MAX_LCORE + 1 and indexing by
>> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
>>>
>>> It would be convenient for the use cases where a variable shared by
>> the unregistered non-EAL threads don't need special treatment.
>>>
>>
>> I thought about this, but it would require a conditional in the lookup
>> macro, as you show. More importantly, it would make the whole
>> <rte_lcore_var.h> thing less elegant and harder to understand. It's bad
>> enough that "per-lcore" is actually "per-lcore id" (or the equivalent
>> "per-EAL thread and registered non-EAL thread"). Adding a "btw it's <what
>> I said before> + 1" is not an improvement.
> 
> We could promote the "one more entry for unregistered non-EAL threads" design pattern (for relevant use cases only!) by extending EAL with one more TLS variable, maintained like _thread_id, but set to RTE_MAX_LCORE when _thread_id is set to -1:
> 
> +++ eal_common_thread.c:
>    RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
> + RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;
> 
> and
> 
> +++ rte_lcore.h:
> static inline unsigned
> rte_lcore_id(void)
> {
> 	return RTE_PER_LCORE(_lcore_id);
> }
> + static inline unsigned
> + rte_lcore_idx(void)
> + {
> + 	return RTE_PER_LCORE(_lcore_idx);
> + }
> 
> That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently used.
> 

Wouldn't that effectively give a shared lcore id to all unregistered 
threads?

We definitely shouldn't further complicate anything related to the DPDK 
threading model, in my opinion.

If a module needs one or more variable instances that aren't per lcore, 
use regular static allocation instead. I would favor clarity over 
convenience here, at least until we know better (see below as well).

>>
>> But useful? Sure.
>>
>> I think you may still need other data for dealing with unregistered
>> threads, for example a mutex or spin lock to deal with concurrency
>> issues that arise with shared data.
> 
> Adding the extra entry is only for the benefit of use cases where special handling is not required. It will make the code for those use cases much cleaner. I think it is useful.
> 

It will make it shorter, but not less clean, I would argue.

> Use cases requiring special handling should still do the special handling they do today.
> 

For DPDK modules using lcore variables and which treat unregistered 
threads as "full citizens", I expect special handling of unregistered 
threads to be the norm. Take rte_random.h as an example. Current API 
does not guarantee MT safety for concurrent calls of unregistered 
threads. It probably should, and it should probably be by means of a 
mutex (not spinlock).
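
As a sketch of what that special handling could look like (this is not
the rte_random.c code; the struct, the counter-based "generator" and all
names are hypothetical, and a real PRNG step would replace the
increment): lcore-id-equipped threads use their own state without
locking, while all unregistered threads share one state behind a mutex.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_LCORE 4             /* hypothetical stand-in for RTE_MAX_LCORE */
#define LCORE_ID_NONE UINT_MAX  /* hypothetical stand-in for LCORE_ID_ANY */

struct gen_state {
	uint64_t n;  /* toy state; a real PRNG would keep its registers here */
};

static struct gen_state lcore_state[MAX_LCORE];  /* one per lcore id */
static struct gen_state unregistered_state;      /* shared by the rest */
static pthread_mutex_t unregistered_lock = PTHREAD_MUTEX_INITIALIZER;

/* Toy stand-in for a generator step: not random, just deterministic. */
static uint64_t
gen_step(struct gen_state *state)
{
	return ++state->n;
}

static uint64_t
gen_next(unsigned int lcore_id)
{
	uint64_t value;

	assert(lcore_id < MAX_LCORE || lcore_id == LCORE_ID_NONE);

	/* State private to one lcore id: no synchronization needed. */
	if (lcore_id < MAX_LCORE)
		return gen_step(&lcore_state[lcore_id]);

	/* Unregistered threads may be preempted, so block rather than spin. */
	pthread_mutex_lock(&unregistered_lock);
	value = gen_step(&unregistered_state);
	pthread_mutex_unlock(&unregistered_lock);

	return value;
}
```

The lock is only taken on the unregistered path, so the common case for
lcore-id-equipped threads stays as cheap as before.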

The reason I'm not running off to make a rte_random.c patch is that
it's unclear to me what the role of unregistered threads in DPDK is. I'm
reasonably comfortable with a model where there are many threads that
basically don't interact with the DPDK APIs (except maybe some very
narrow exposure, like the preemption-safe ring variant). One example of
such a design would be a big, slow control plane which uses
multi-threading and the Linux process scheduler for work scheduling,
hosted in the same process as a DPDK data plane app.

What I find more strange is a scenario where there are unregistered
threads which interact with a wide variety of DPDK APIs, do so
at-high-rates/with-high-performance-requirements and are expected to be
preemption-safe. So they are basically EAL threads without an lcore id.

Support for that latter scenario has also been voiced, in previous 
discussions, from what I recall.

I think it's hard to answer the question of an "unregistered thread
spare" for lcore variables without first knowing what the future should 
look like for unregistered threads in DPDK, in terms of being able to 
call into DPDK APIs, preemption-safety guarantees, etc.

It seems that until you have a clearer picture of how generally to treat 
unregistered threads, you are best off with just a per-lcore id instance 
of lcore variables.

>>
>> There may also be cases were you are best off by simply disallowing
>> unregistered threads from calling into that API.
>>
>>> Obviously, this might affect performance.
>>> If the performance cost is not negligble, the addtional entry (and
>> indexing branch) could be disabled at build time.
>>>
>>>
>>> * Suggestion: Do not fix the alignment at 16 byte.
>>> Pass an alignment parameter to rte_lcore_var_alloc() and use
>> alignof() when calling it:
>>>
>>> +#include <stdalign.h>
>>> +
>>> +#define RTE_LCORE_VAR_ALLOC(name)			\
>>> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
>>> +
>>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
>>> +	name = rte_lcore_var_alloc(size, alignment)
>>> +
>>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
>>> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
>>> +
>>> + +++ /cconfig/rte_config.h
>>> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
>>>
>>>
>>
>> That seems like a very good idea. I'll look into it.
>>
>>> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but
>> behaves differently.
>>>
>>>> +/**
>>>> + * Iterate over each lcore id's value for a lcore variable.
>>>> + */
>>>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
>>>> +	for (unsigned int lcore_id =					\
>>>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
>>>> +	     lcore_id < RTE_MAX_LCORE;					\
>>>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>>>> +
>>>
>>> The macro name RTE_LCORE_VAR_FOREACH() resembles
>> RTE_LCORE_FOREACH(i), which only iterates on running cores.
>>> You might want to give it a name that differs more.
>>>
>>
>> True.
>>
>> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
>> confusion,
>> for sure.
>>
>> Being consistent with <rte_lcore.h> is not so easy, since it's not even
>> consistent with itself. For example, rte_lcore_count() returns the
>> number of lcores (EAL threads) *plus the number of registered non-EAL
>> threads*, and RTE_LCORE_FOREACH() give a different count. :)
> 
> Naming is hard. I don't have a good name, and can only offer inspiration...
> 
> <rte_lcore.h> has RTE_LCORE_FOREACH() and its RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.
> 
> Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a variant.
> 
>>
>>> If it wasn't for API breakage, I would suggest renaming
>> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
>>>
>>> Small detail: "var" is a pointer, so consider renaming it to "ptr"
>> and adding _PTR to the macro name.
>>
>> The "var" name comes from how <sys/queue.h> names things. I think I had
>> it as "ptr" initially. I'll change it back.
> 
> Thanks.
> 
>>
>> Thanks a lot Morten.


* [RFC v2 0/5] Lcore variables
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-09  8:25   ` Morten Brørup
@ 2024-02-19  9:40   ` Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                       ` (4 more replies)
  1 sibling, 5 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how to best allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean
and seemingly very performant solution to a real problem.

One thing that is unclear to the author is how this API relates to a
potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Contrary to what the version.map edit suggests, this RFC is not meant
as a proposal for DPDK 24.03.

Mattias Rönnblom (5):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable

 app/test/meson.build                  |   1 +
 app/test/test_lcore_var.c             | 408 ++++++++++++++++++++++++++
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  82 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_random.c           |  30 +-
 lib/eal/common/rte_service.c          | 119 ++++----
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 374 +++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 lib/power/rte_power_pmd_mgmt.c        |  27 +-
 12 files changed, 973 insertions(+), 76 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1



* [RFC v2 1/5] eal: add static per-lcore memory allocation facility
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 2/5] eal: add lcore variable test suite Mattias Rönnblom
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of often-used data that is logically related, but where
there are performance benefits to reap from keeping updates local
to an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU uses linker scripts, which effectively prevent the
reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore is now close (spatially, in memory), rather than data used by
the same module, which in turn avoids the excessive use of padding
that pollutes caches with unused data.
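
A stand-alone sketch of that layout, for illustration only (it is not
the code in this patch). It assumes one fixed-size buffer per lcore id
and a handle that is really a byte offset into every buffer. The
sketch_* names and the small constants are hypothetical stand-ins for
RTE_MAX_LCORE and RTE_MAX_LCORE_VAR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 4   /* hypothetical stand-in for RTE_MAX_LCORE */
#define SKETCH_MAX_VAR 1024  /* hypothetical stand-in for RTE_MAX_LCORE_VAR */

/* One buffer per lcore id: all variables of one lcore are neighbors. */
static _Alignas(64) char sketch_buf[SKETCH_MAX_LCORE][SKETCH_MAX_VAR];

/* Offset zero is skipped, so that no handle compares equal to NULL. */
static uintptr_t sketch_allocated = 1;

/* Bump-allocate 'size' bytes at the same offset in every buffer. */
static void *
sketch_alloc(size_t size, size_t align)
{
	uintptr_t offset;

	assert(align != 0 && (align & (align - 1)) == 0);

	/* Round the current offset up to the requested alignment. */
	offset = (sketch_allocated + align - 1) & ~(uintptr_t)(align - 1);
	assert(offset + size < SKETCH_MAX_VAR);
	sketch_allocated = offset + size;

	/* The "pointer" returned is only an offset, never dereferenced. */
	return (void *)offset;
}

/* Translate (lcore id, handle) into the address of that lcore's value. */
static void *
sketch_ptr(unsigned int lcore_id, void *handle)
{
	return &sketch_buf[lcore_id][(uintptr_t)handle];
}
```

Two variables allocated back to back end up a few bytes apart within
one lcore's buffer, while the same variable for two different lcores is
a whole buffer apart, which is why no per-variable padding is needed.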

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  82 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 374 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 7 files changed, 464 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index da265d7dd2..884482e473 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -30,6 +30,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index a6a768bd7c..bb06bb7ca1 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -98,6 +98,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..dfd11cbd0b
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define WARN_THRESHOLD 75
+
+/*
+ * Avoid using offset zero, since it would result in a NULL-value
+ * "handle" (offset) pointer, which in principle and per the API
+ * definition shouldn't be an issue, but may confuse some tools and
+ * users.
+ */
+#define INITIAL_OFFSET 1
+
+char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
+
+static uintptr_t allocated = INITIAL_OFFSET;
+
+static void
+verify_allocation(uintptr_t new_allocated)
+{
+	static bool has_warned;
+
+	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
+
+	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
+	    !has_warned) {
+		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
+			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
+			RTE_MAX_LCORE_VAR);
+		has_warned = true;
+	}
+}
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
+
+	void *offset = (void *)new_allocated;
+
+	new_allocated += size;
+
+	verify_allocation(new_allocated);
+
+	allocated = new_allocated;
+
+	EAL_LOG(DEBUG, "Allocated %zu bytes of per-lcore data with a "
+		"%zu-byte alignment", size, align);
+
+	return offset;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines, as
+	 * well as having the base pointer aligned on the cache line
+	 * size, assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..4434fc21ef
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,374 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. In other words,
+ * there's one copy of its value for each and every current and future
+ * lcore id-equipped thread, with the total number of copies amounting
+ * to \c RTE_MAX_LCORE.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a \c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). A handle may be passed between modules
+ * and threads just like any pointer, but its value is not the address
+ * of any particular object, but rather just an opaque identifier, stored
+ * in a typed pointer (to inform the access macro of the value's type).
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
+ *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * generally only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by different lcore
+ * ids *may* be frequently read or written by the owners without the
+ * risk of false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomics) should be
+ * employed to assure there are no data races between the owning
+ * thread and any non-owner threads accessing the same lcore variable
+ * instance.
+ *
+ * The value of the lcore variable for a particular lcore id may be
+ * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
+ * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * To modify the value of an lcore variable for a particular lcore id,
+ * either access the object through the pointer retrieved by \ref
+ * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
+ * RTE_LCORE_VAR_LCORE_SET.
+ *
+ * Each access macro has a shorthand which may be used by an EAL
+ * thread or registered non-EAL thread to access the lcore variable
+ * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
+ * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
+ *
+ * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier. The
+ * *identifier* value is common across all lcore ids.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like \c int,
+ * but would more typically be a \c struct. An application may choose
+ * to define an lcore variable, which it then goes on to never
+ * allocate.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * The sum of all lcore variables, plus any padding required, must be
+ * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
+ * violation of this maximum results in the process being terminated.
+ *
+ * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
+ * same order of magnitude in size as a thread stack.
+ *
+ * The lcore variable storage buffers are kept in the BSS section in
+ * the resulting binary, where data generally isn't mapped in until
+ * it's accessed. This means that unused portions of the lcore
+ * variable storage area will not occupy any physical memory (with a
+ * granularity of the memory page size [usually 4 kB]).
+ *
+ * Lcore variables should generally *not* be \ref __rte_cache_aligned
+ * and need *not* include a \ref RTE_CACHE_GUARD field, since these
+ * constructs exist to avoid false sharing. In the
+ * case of an lcore variable instance, all nearby data structures
+ * should almost-always be written to by a single thread (the lcore
+ * variable owner). Adding padding will increase the effective memory
+ * working set size, and potentially reduce performance.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         unsigned int lcore_id;
+ *
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * \endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * } __rte_cache_aligned;
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * \endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions and for example next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to \ref rte_lcore_var.h is the \ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., \ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follows that of a particular thread. The data cannot
+ *     be accessed before the thread has been created, nor after it
+ *     has exited. One effect of this is that thread-local variables
+ *     must be initialized in a "lazy" manner (e.g., at the point of
+ *     thread creation). Lcore variables may be accessed immediately
+ *     after having been allocated (which is usually prior to any
+ *     thread beyond the main thread running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id, and thus
+ *     not for such "regular" threads.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread and GCC _Thread_local, such data sharing is
+ *     supported. In the C11 standard, the result of accessing another
+ *     thread's _Thread_local object is implementation-defined. Lcore
+ *     variable instances may be accessed reliably by any thread.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various per-lcore id instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)	\
+	name = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
+	name = rte_lcore_var_alloc(size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC(name)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)), alignof(*(name)))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a \ref
+ * RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
+	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
+	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+/**
+ * Get value of a lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
+
+/**
+ * Set the value of a lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
+
+/**
+ * Get value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
+
+/**
+ * Set value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_SET(name, value) \
+	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
+
+/**
+ * Iterate over each lcore id's value for a lcore variable.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)				\
+	for (unsigned int lcore_id =					\
+		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and less than
+ *   or equal to \c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The id of the variable, stored in a void pointer value.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 5e0cd47c82..e90b86115a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -393,6 +393,10 @@ EXPERIMENTAL {
 	# added in 23.07
 	rte_memzone_max_get;
 	rte_memzone_max_set;
+
+	# added in 24.03
+	rte_lcore_var_alloc;
+	rte_lcore_var;
 };
 
 INTERNAL {
-- 
2.34.1



* [RFC v2 2/5] eal: add lcore variable test suite
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

RFC v2:
 * Improve alignment-related test coverage.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 408 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 409 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 6389ae83ee..93412cce51 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..310d32e10d
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,408 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static bool
+rand_bool(void)
+{
+	return rte_rand() & 1;
+}
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_PTR(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal;
+
+	if (rand_bool())
+		equal = RTE_LCORE_VAR_GET(test_int) == state->old_value;
+	else
+		equal = *(RTE_LCORE_VAR_PTR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	if (rand_bool())
+		RTE_LCORE_VAR_SET(test_int, state->new_value);
+	else
+		*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		RTE_LCORE_VAR_LCORE_SET(lcore_id, test_int, state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		TEST_ASSERT_EQUAL(state->new_value,
+				  RTE_LCORE_VAR_LCORE_GET(lcore_id, test_int),
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* Take the opportunity to test the foreach macro. The macro
+	 * declares its own lcore_id, so no separate counter is needed.
+	 */
+	int *v;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_PTR(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_struct);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state
+{
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_PTR(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(RTE_LCORE_VAR_LCORE_GET(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_array);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (RTE_MAX_LCORE_VAR / 2)
+
+static int
+test_many_lvars(void)
+{
+	void **handlers = malloc(sizeof(void *) * MANY_LVARS);
+	int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		void *handle = rte_lcore_var_alloc(1, 1);
+
+		uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), handle);
+
+		*b = (uint8_t)i;
+
+		handlers[i] = handle;
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_FOREACH_WORKER(lcore_id) {
+			uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(),
+							       handlers[i]);
+			TEST_ASSERT_EQUAL((uint8_t)i, *b,
+					  "Unexpected lcore variable value.");
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1
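
The alignment cases in this test suite depend on every lcore variable being placed at an offset that honours the requested alignment. Below is a minimal sketch of that idea, assuming a simple bump allocator over a fixed-size per-lcore buffer. The names lvar_alloc_offset(), lvar_is_aligned() and LVAR_BUF_SIZE are hypothetical and exist only for illustration; they are not the functions added by this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LVAR_BUF_SIZE 1024 /* hypothetical per-lcore buffer size */

static size_t lvar_next_offset; /* first unused byte in the buffer */

/* True if ptr is a multiple of align (align must be a power of two). */
static bool
lvar_is_aligned(const void *ptr, size_t align)
{
	return ((uintptr_t)ptr & (align - 1)) == 0;
}

/*
 * Reserve size bytes at the next offset that is a multiple of align.
 * Returns the offset, or SIZE_MAX if the buffer is exhausted.
 */
static size_t
lvar_alloc_offset(size_t size, size_t align)
{
	size_t offset;

	assert(align != 0 && (align & (align - 1)) == 0);

	/* round the next free offset up to the requested alignment */
	offset = (lvar_next_offset + align - 1) & ~(align - 1);

	if (size > LVAR_BUF_SIZE || offset > LVAR_BUF_SIZE - size)
		return SIZE_MAX;

	lvar_next_offset = offset + size;

	return offset;
}
```

An aligned offset only yields an aligned pointer if the buffer base itself is aligned at least as strictly, which is presumably why the tests check the resulting pointers rather than the offsets.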


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v2 3/5] random: keep PRNG state in lcore variable
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 2/5] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  2024-02-19 11:22       ` Morten Brørup
  2024-02-19  9:40     ` [RFC v2 4/5] power: keep per-lcore " Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 5/5] service: " Mattias Rönnblom
  4 siblings, 1 reply; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_random.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 7709b8f2c6..af9fffd81b 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct rte_rand_state {
@@ -19,14 +20,12 @@ struct rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
-} __rte_cache_aligned;
+};
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state __rte_cache_aligned;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_PTR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v2 4/5] power: keep per-lcore state in lcore variable
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
                       ` (2 preceding siblings ...)
  2024-02-19  9:40     ` [RFC v2 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 5/5] service: " Mattias Rönnblom
  4 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/power/rte_power_pmd_mgmt.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 591fc69f36..bb20e564de 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -68,8 +69,8 @@ struct pmd_core_cfg {
 	/**< Number of queues ready to enter power optimized state */
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
-} __rte_cache_aligned;
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+};
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -772,10 +770,13 @@ RTE_INIT(rte_power_ethdev_pmgmt_init) {
 	size_t i;
 	int j;
 
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
+
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		struct pmd_core_cfg *lcore_cfg =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_cfgs);
+		TAILQ_INIT(&lcore_cfg->head);
 	}
 
 	/* initialize config defaults */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v2 5/5] service: keep per-lcore state in lcore variable
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
                       ` (3 preceding siblings ...)
  2024-02-19  9:40     ` [RFC v2 4/5] power: keep per-lcore " Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  4 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 119 ++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index d959c91459..de205c5da5 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,11 +102,12 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
+	else {
+		struct core_state *cs;
+		RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+			memset(cs, 0, sizeof(struct core_state));
 	}
 
 	int i;
@@ -122,7 +124,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +137,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +286,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +293,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +454,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +467,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs = RTE_LCORE_VAR_PTR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +489,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_PTR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +535,16 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs =
+		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +552,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +573,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +590,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +642,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +694,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +712,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +737,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +761,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +785,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +815,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +824,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +849,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +860,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +868,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +876,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +885,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +901,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +948,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +977,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +989,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1028,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-19  7:49         ` Mattias Rönnblom
@ 2024-02-19 11:10           ` Morten Brørup
  2024-02-19 14:31             ` Mattias Rönnblom
  0 siblings, 1 reply; 42+ messages in thread
From: Morten Brørup @ 2024-02-19 11:10 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 19 February 2024 08.49
> 
> On 2024-02-09 14:04, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Friday, 9 February 2024 12.46
> >>
> >> On 2024-02-09 09:25, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>> Sent: Thursday, 8 February 2024 19.17
> >>>>
> >>>> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>>>
> >>>> An lcore variable has one value for every current and future lcore
> >>>> id-equipped thread.
> >>>>
> >>>> The primary <rte_lcore_var.h> use case is for statically allocating
> >>>> small chunks of often-used data, which is related logically, but where
> >>>> there are performance benefits to reap from having updates being local
> >>>> to an lcore.
> >>>>
> >>>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> >>>> _Thread_local), but decouple the values' lifetime from that of the
> >>>> threads.
> >>>>
> >>>> Lcore variables are also similar in terms of functionality provided by
> >>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >>>> build-time machinery. DPCPU uses linker scripts, which effectively
> >>>> prevents the reuse of its, otherwise seemingly viable, approach.
> >>>>
> >>>> The currently-prevailing way to solve the same problem as lcore
> >>>> variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
> >>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> >>>> lcore variables over this approach is that data related to the same
> >>>> lcore now is close (spatially, in memory), rather than data used by
> >>>> the same module, which in turn avoids excessive use of padding,
> >>>> polluting caches with unused data.
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>> ---
> >>>
> >>> This looks very promising. :-)
> >>>
> >>> Here's a bunch of comments, questions and suggestions.
> >>>
> >>>
> >>> * Question: Performance.
> >>> What is the cost of accessing an lcore variable vs a variable in TLS?
> >>> I suppose the relative cost diminishes if the variable is a larger
> >>> struct, compared to a simple uint64_t.
> >>>
> >>
> >> In case all the relevant data is available in a cache close to the core,
> >> both options carry quite low overhead.
> >>
> >> Accessing an lcore variable will always require a TLS lookup, in the form
> >> of retrieving the lcore_id of the current thread. In that sense, there
> >> will likely be a number of extra instructions required to do the lcore
> >> variable address lookup (i.e., doing the load from the rte_lcore_var table
> >> based on the lcore_id you just looked up, and adding the variable's
> >> offset).
> >>
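
The address lookup described in the quoted paragraph can be sketched as below: pick the per-lcore buffer by lcore id, then add the variable's offset. This is a standalone model, not the code from the RFC; sketch_lcore_buf, sketch_lcore_ptr() and the buffer size are hypothetical, and in DPDK the lcore id would come from rte_lcore_id().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 4   /* stand-in for RTE_MAX_LCORE */
#define SKETCH_LCORE_BUF 256 /* hypothetical per-lcore buffer size */

/* One contiguous block of lcore variable storage per lcore id. */
static uint8_t sketch_lcore_buf[SKETCH_MAX_LCORE][SKETCH_LCORE_BUF];

/* Address of the value at the given offset for the given lcore id. */
static void *
sketch_lcore_ptr(unsigned int lcore_id, size_t offset)
{
	assert(lcore_id < SKETCH_MAX_LCORE);
	assert(offset < SKETCH_LCORE_BUF);

	return &sketch_lcore_buf[lcore_id][offset];
}
```

In this model the same offset identifies the variable on every lcore, and two lcores' copies are always a whole buffer apart.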
> >> A TLS lookup will incur an extra overhead of less than a clock cycle,
> >> compared to accessing a non-TLS static variable, in case static linking
> >> is used. For shared objects, TLS is much more expensive (something often
> >> visible in dynamically linked DPDK app flame graphs, in the form of the
> >> __tls_get_addr symbol). Then you need to add ~3 cc/access. This is from a
> >> microbenchmark running on an x86_64 Raptor Lake P-core.
> >>
> >> (To visualize the difference between shared object and not, one can use
> >> Compiler Explorer and -fPIC versus -fPIE.)
> >>
> >> Things get more complicated if you access the same variable in the
> same
> >> section of code, since then it can be left on the stack/in a register
> by
> >> the compiler, especially if LTO is used. In other words, if you do
> >> rte_lcore_id() several times in a row, only the first one will cost
> you
> >> anything. This happens fairly often in DPDK, with rte_lcore_id().
> >>
> >> Finally, if you do something like
> >>
> >> diff --git a/lib/eal/common/rte_random.c
> b/lib/eal/common/rte_random.c
> >> index af9fffd81b..a65c30d27e 100644
> >> --- a/lib/eal/common/rte_random.c
> >> +++ b/lib/eal/common/rte_random.c
> >> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state
> *state)
> >>    static __rte_always_inline
> >>    struct rte_rand_state *__rte_rand_get_state(void)
> >>    {
> >> -       unsigned int idx;
> >> -
> >> -       idx = rte_lcore_id();
> >> -
> >> -       if (unlikely(idx == LCORE_ID_ANY))
> >> -               return &unregistered_rand_state;
> >> -
> >> -       return RTE_LCORE_VAR_PTR(rand_state);
> >> +       return &unregistered_rand_state;
> >>    }
> >>
> >>    uint64_t
> >>
> >> ...and re-run the rand_perf_autotest, at least I see no difference
> at
> >> all (in a statically linked build). Both results in rte_rand() using
> >> ~11
> >> cc/call. What that suggests is that TLS overhead is very small, and
> >> that
> >> any extra instructions required by lcore variables doesn't add much,
> if
> >> anything at all, at least in this particular case.
> >
> > Excellent. Thank you for a thorough and detailed answer, Mattias.
> >
> >>
> >>> Some of my suggestions below might also affect performance.
> >>>
> >>>
> >>> * Advantage: Provides direct access to worker thread variables.
> >>> With the current alternative (thread-local storage), the main
> thread
> >> cannot access the TLS variables of the worker threads,
> >>> unless worker threads publish global access pointers.
> >>> Lcore variables of any lcore thread can be directly accessed by
> any
> >> thread, which simplifies code.
> >>>
> >>>
> >>> * Advantage: Roadmap towards hugemem.
> >>> It would be nice if the lcore variable memory was allocated in
> >> hugemem, to reduce TLB misses.
> >>> The current alternative (thread-local storage) is also not using
> >> hugemem, so not a degradation.
> >>>
> >>
> >> I agree, but the thing is it's hard to figure out how much memory is
> >> required for these kind of variables, given how DPDK is built and
> >> linked. In an OS kernel, you can just take all the symbols, put them
> in
> >> a special section, and size that section. Such a thing can't easily
> be
> >> done with DPDK, since shared object builds are supported, plus that
> >> this
> >> facility should be available not only to DPDK modules, but also the
> >> application, so relying on linker scripts isn't really feasible (and
> >> probably not even feasible for DPDK itself).
> >>
> >> In that scenario, you want to size up the per-lcore buffer to be so
> >> large, you don't have to worry about overruns. That will waste
> memory.
> >> If you use huge page memory, paging can't help you to avoid
> >> pre-allocating actual physical memory.
> >
> > Good point.
> > I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE),
> but I hadn't considered how paging helps us use less physical memory
> than that.
> >
> >>
> >> That said, even large (by static per-lcore data standards) buffers
> are
> >> potentially small enough not to grow the amount of memory used by a
> >> DPDK
> >> process too much. You need to provision for RTE_MAX_LCORE of them
> >> though.
> >>
> >> The value of lcore variables should be small, and thus incur few TLB
> >> misses, so you may not gain much from huge pages. In my world, it's
> >> more
> >> about "fitting often-used per-lcore data into L1 or L2 CPU caches",
> >> rather than the easier "fitting often-used per-lcore data into a
> >> working
> >> set size reasonably expected to be covered by hardware TLB/caches".
> >
> > Yes, I suppose that lcore variables are intended to be small, and
> large per-lcore structures should keep following the current design
> patterns for allocation and access.
> >
> 
> It seems to me that support for per-lcore heaps should be the solution
> for supporting use cases requiring many, larger and/or dynamic objects
> on a per-lcore basis.
> 
> Ideally, you would design both that mechanism and lcore variables
> together, but then if you couple enough amount of improvements together
> you will never get anywhere. An instance of where perfect is the enemy
> of good, perhaps.

So true. :-)

> 
> > Perhaps this guideline is worth mentioning in the documentation.
> >
> 
> What is missing, more specifically? The size limitation and the static
> nature of lcore variables is described, and what current design
> patterns
> they expected to (partly) replace is also covered.

Your documentation is fine, and nothing specific is missing here.
I was thinking out loud that the high-level DPDK documentation should describe common design patterns.

> 
> >>
> >>> Lcore variables are available very early at startup, so I guess the
> >> RTE memory allocator is not yet available.
> >>> Hugemem could be allocated using O/S allocation, so there is a
> >> possible road towards using hugemem.
> >>>
> >>
> >> With the current design, that's true. I'm not sure it's a strict
> >> requirement though, but it does make things simpler.
> >>
> >>> Either way, using hugemem would require one more indirection (the
> >> pointer to the allocated hugemem).
> >>> I don't know which has better performance, using hugemem or
> avoiding
> >> the additional pointer dereferencing.
> >>>
> >>>
> >>> * Suggestion: Consider adding an entry for unregistered non-EAL
> >> threads.
> >>> Please consider making room for one more entry, shared by all
> >> unregistered non-EAL threads, i.e.
> >>> making the array size RTE_MAX_LCORE + 1 and indexing by
> >> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
> >>>
> >>> It would be convenient for the use cases where a variable shared by
> >> the unregistered non-EAL threads don't need special treatment.
> >>>
> >>
> >> I thought about this, but it would require a conditional in the
> lookup
> >> macro, as you show. More importantly, it would make the whole
> >> <rte_lcore_var.h> thing less elegant and harder to understand. It's
> bad
> >> enough that "per-lcore" is actually "per-lcore id" (or the
> equivalent
> >> "per-EAL thread and unregistered EAL-thread"). Adding a "btw it's
> <what
> >> I said before> + 1" is not an improvement.
> >
> > We could promote "one more entry for unregistered non-EAL threads"
> design pattern (for relevant use cases only!) by extending EAL with one
> more TLS variable, maintained like _thread_id, but set to RTE_MAX_LCORE
> when _thread_id is set to -1:
> >
> > +++ eal_common_thread.c:
> >    RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
> > + RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;

Oops... wrong reference! I meant to refer to _lcore_id, not _thread_id. Correction:

We could promote the "one more entry for unregistered non-EAL threads" design pattern (for relevant use cases only!) by extending EAL with one more TLS variable, maintained like _lcore_id, but set to RTE_MAX_LCORE when _lcore_id is set to LCORE_ID_ANY:

+++ eal_common_thread.c:
  RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;
+ RTE_DEFINE_PER_LCORE(unsigned int, _lcore_idx) = RTE_MAX_LCORE;

> >
> > and
> >
> > +++ rte_lcore.h:
> > static inline unsigned
> > rte_lcore_id(void)
> > {
> > 	return RTE_PER_LCORE(_lcore_id);
> > }
> > + static inline unsigned
> > + rte_lcore_idx(void)
> > + {
> > + 	return RTE_PER_LCORE(_lcore_idx);
> > + }
> >
> > That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ?
> rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently used.
> >
> 
> Wouldn't that effectively give a shared lcore id to all unregistered
> threads?

Yes, just like rte_lcore_id() is LCORE_ID_ANY (i.e. UINT32_MAX) for all unregistered threads. The difference is that it would be usable for array indexing: it behaves as a shadow variable of RTE_PER_LCORE(_lcore_id), optimizing away the "rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE" conditional when indexing.
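
To make the shadow-variable idea concrete, here is a rough standalone sketch. It is not EAL code: MAX_LCORE, lcore_idx, set_lcore_id() and the counters array are made-up stand-ins, and plain C11 _Thread_local is used in place of RTE_DEFINE_PER_LCORE(). It only shows that the clamped lookup and the shadow-index lookup pick the same slot.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the DPDK definitions; the values are illustrative only. */
#define MAX_LCORE 128u
#define LCORE_ID_ANY UINT32_MAX

/* Per-thread lcore id, plus the proposed shadow index (hypothetical name). */
static _Thread_local unsigned int lcore_id = LCORE_ID_ANY;
static _Thread_local unsigned int lcore_idx = MAX_LCORE;

/* One slot per lcore id, plus one shared by all unregistered threads. */
static uint64_t counters[MAX_LCORE + 1];

/* Keep both variables in sync whenever the lcore id changes. */
static void set_lcore_id(unsigned int id)
{
	assert(id < MAX_LCORE || id == LCORE_ID_ANY);
	lcore_id = id;
	lcore_idx = (id < MAX_LCORE) ? id : MAX_LCORE;
}

/* Current pattern: clamp the lcore id on every access. */
static uint64_t *counter_clamped(void)
{
	return &counters[lcore_id < MAX_LCORE ? lcore_id : MAX_LCORE];
}

/* Proposed pattern: index directly with the shadow variable. */
static uint64_t *counter_shadow(void)
{
	return &counters[lcore_idx];
}
```

Note that the last slot is shared, so any module using it from several unregistered threads at once would still need its own synchronization.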

> 
> We definitely shouldn't further complicate anything related to the DPDK
> threading model, in my opinion.
> 
> If a module needs one or more variable instances that aren't per lcore,
> use regular static allocation instead. I would favor clarity over
> convenience here, at least until we know better (see below as well).
> 
> >>
> >> But useful? Sure.
> >>
> >> I think you may still need other data for dealing with unregistered
> >> threads, for example a mutex or spin lock to deal with concurrency
> >> issues that arises with shared data.
> >
> > Adding the extra entry is only for the benefit of use cases where
> special handling is not required. It will make the code for those use
> cases much cleaner. I think it is useful.
> >
> 
> It will make it shorter, but not less clean, I would argue.
> 
> > Use cases requiring special handling should still do the special
> handling they do today.
> >
> 
> For DPDK modules using lcore variables and which treat unregistered
> threads as "full citizens", I expect special handling of unregistered
> threads to be the norm. Take rte_random.h as an example. Current API
> does not guarantee MT safety for concurrent calls of unregistered
> threads. It probably should, and it should probably be by means of a
> mutex (not spinlock).
> 
> The reason I'm not running off to make a rte_random.c patch is that
> it's unclear to me what is the role of unregistered threads in DPDK.
> I'm
> reasonably comfortable with a model where there are many threads that
> basically don't interact with the DPDK APIs (except maybe some very
> narrow exposure, like the preemption-safe ring variant). One example of
> such a design would be big slow control plane which uses multi-
> threading
> and the Linux process scheduler for work scheduling, hosted in the same
> process as a DPDK data plane app.
> 
> What I find more strange is a scenario where there are unregistered
> threads which interacts with a wide variety of DPDK APIs, does so
> at-high-rates/with-high-performance-requirements and are expected to be
> preemption-safe. So they are basically EAL threads without a lcore id.

Yes, this is happening in the wild.
E.g. our application has a mode where it uses fewer EAL threads, and processes more in non-EAL threads. In other words, the same work is processed either by an EAL thread or a non-EAL thread, depending on the application's mode.
The extra array entry would be useful for such use cases.

> 
> Support for that latter scenario has also been voiced, in previous
> discussions, from what I recall.
> 
> I think it's hard to answer the question of a "unregistered thread
> spare" for lcore variables without first knowing what the future should
> look like for unregistered threads in DPDK, in terms of being able to
> call into DPDK APIs, preemption-safety guarantees, etc.
> 
> It seems that until you have a clearer picture of how generally to
> treat
> unregistered threads, you are best off with just a per-lcore id
> instance
> of lcore variables.

I get your point. It also reduces the risk of bugs caused by incorrect use of the additional entry.

I am arguing from a different angle: providing the extra entry will help uncover relevant use cases.

> 
> >>
> >> There may also be cases where you are best off by simply disallowing
> >> unregistered threads from calling into that API.
> >>
> >>> Obviously, this might affect performance.
> >>> If the performance cost is not negligible, the additional entry (and
> >> indexing branch) could be disabled at build time.
> >>>
> >>>
> >>> * Suggestion: Do not fix the alignment at 16 byte.
> >>> Pass an alignment parameter to rte_lcore_var_alloc() and use
> >> alignof() when calling it:
> >>>
> >>> +#include <stdalign.h>
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC(name)			\
> >>> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)
> 	\
> >>> +	name = rte_lcore_var_alloc(size, alignment)
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
> >>> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> >>> +
> >>> + +++ /config/rte_config.h
> >>> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
> >>>
> >>>
> >>
> >> That seems like a very good idea. I'll look into it.
> >>
> >>> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(),
> but
> >> behaves differently.
> >>>
> >>>> +/**
> >>>> + * Iterate over each lcore id's value for a lcore variable.
> >>>> + */
> >>>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
> >>>> +	for (unsigned int lcore_id =					\
> >>>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);
> 	\
> >>>> +	     lcore_id < RTE_MAX_LCORE;
> 	\
> >>>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id,
> name))
> >>>> +
> >>>
> >>> The macro name RTE_LCORE_VAR_FOREACH() resembles
> >> RTE_LCORE_FOREACH(i), which only iterates on running cores.
> >>> You might want to give it a name that differs more.
> >>>
> >>
> >> True.
> >>
> >> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
> >> confusion,
> >> for sure.
> >>
> >> Being consistent with <rte_lcore.h> is not so easy, since it's not
> even
> >> consistent with itself. For example, rte_lcore_count() returns the
> >> number of lcores (EAL threads) *plus the number of registered non-
> EAL
> >> threads*, and RTE_LCORE_FOREACH() give a different count. :)
> >
> > Naming is hard. I don't have a good name, and can only offer
> inspiration...
> >
> > <rte_lcore.h> has RTE_LCORE_FOREACH() and its
> RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.
> >
> > Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a
> variant.
> >
> >>
> >>> If it wasn't for API breakage, I would suggest renaming
> >> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
> >>>
> >>> Small detail: "var" is a pointer, so consider renaming it to "ptr"
> >> and adding _PTR to the macro name.
> >>
> >> The "var" name comes from how <sys/queue.h> names things. I think I
> had
> >> it as "ptr" initially. I'll change it back.
> >
> > Thanks.
> >
> >>
> >> Thanks a lot Morten.


* RE: [RFC v2 3/5] random: keep PRNG state in lcore variable
  2024-02-19  9:40     ` [RFC v2 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-19 11:22       ` Morten Brørup
  2024-02-19 14:04         ` Mattias Rönnblom
  0 siblings, 1 reply; 42+ messages in thread
From: Morten Brørup @ 2024-02-19 11:22 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Monday, 19 February 2024 10.41
> 
> Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
> cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
> same state in a more cache-friendly lcore variable.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---

[...]

> @@ -19,14 +20,12 @@ struct rte_rand_state {
>  	uint64_t z3;
>  	uint64_t z4;
>  	uint64_t z5;
> -	RTE_CACHE_GUARD;
> -} __rte_cache_aligned;
> +};
> 
> -/* One instance each for every lcore id-equipped thread, and one
> - * additional instance to be shared by all others threads (i.e., all
> - * unregistered non-EAL threads).
> - */
> -static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
> +RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
> +
> +/* instance to be shared by all unregistered non-EAL threads */
> +static struct rte_rand_state unregistered_rand_state
> __rte_cache_aligned;

The unregistered_rand_state instance is still __rte_cache_aligned; consider also adding an RTE_CACHE_GUARD to it.
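
For illustration, this is roughly what a cache-aligned instance with a trailing guard amounts to. It is a standalone sketch in plain C11, not DPDK code: CACHE_LINE_SIZE, GUARD_LINES and the guarded_rand_state wrapper are assumptions standing in for RTE_CACHE_LINE_SIZE, the guard line count and the RTE_CACHE_GUARD member.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumed; platform dependent */
#define GUARD_LINES 1      /* assumed number of guard cache lines */

/* PRNG state, with the field names used in the patch. */
struct rand_state {
	uint64_t z1;
	uint64_t z2;
	uint64_t z3;
	uint64_t z4;
	uint64_t z5;
};

/*
 * Cache-aligned state followed by guard padding, so that an object placed
 * right after this one does not end up on the next cache line, where an
 * adjacent-line prefetcher could cause false sharing.
 */
struct guarded_rand_state {
	alignas(CACHE_LINE_SIZE) struct rand_state state;
	alignas(CACHE_LINE_SIZE) char guard[CACHE_LINE_SIZE * GUARD_LINES];
};

/* Instance shared by all unregistered non-EAL threads. */
static struct guarded_rand_state unregistered_state;

static struct rand_state *unregistered_rand_state(void)
{
	return &unregistered_state.state;
}
```

With these assumed sizes, the 40-byte state grows to two full cache lines, which is the padding cost being traded for isolation.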



* Re: [RFC v2 3/5] random: keep PRNG state in lcore variable
  2024-02-19 11:22       ` Morten Brørup
@ 2024-02-19 14:04         ` Mattias Rönnblom
  2024-02-19 15:10           ` Morten Brørup
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-19 14:04 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-19 12:22, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Monday, 19 February 2024 10.41
>>
>> Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
>> cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
>> same state in a more cache-friendly lcore variable.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
> 
> [...]
> 
>> @@ -19,14 +20,12 @@ struct rte_rand_state {
>>   	uint64_t z3;
>>   	uint64_t z4;
>>   	uint64_t z5;
>> -	RTE_CACHE_GUARD;
>> -} __rte_cache_aligned;
>> +};
>>
>> -/* One instance each for every lcore id-equipped thread, and one
>> - * additional instance to be shared by all others threads (i.e., all
>> - * unregistered non-EAL threads).
>> - */
>> -static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
>> +RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
>> +
>> +/* instance to be shared by all unregistered non-EAL threads */
>> +static struct rte_rand_state unregistered_rand_state
>> __rte_cache_aligned;
> 
> The unregistered_rand_state instance is still __rte_cache_aligned; consider also adding an RTE_CACHE_GUARD to it.
> 

It shouldn't be cache-line aligned. I'll remove it. Thanks.


* Re: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-19 11:10           ` Morten Brørup
@ 2024-02-19 14:31             ` Mattias Rönnblom
  2024-02-19 15:04               ` Morten Brørup
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-19 14:31 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-19 12:10, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 19 February 2024 08.49
>>
>> On 2024-02-09 14:04, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Friday, 9 February 2024 12.46
>>>>
>>>> On 2024-02-09 09:25, Morten Brørup wrote:
>>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>>>> Sent: Thursday, 8 February 2024 19.17
>>>>>>
>>>>>> Introduce DPDK per-lcore id variables, or lcore variables for
>> short.
>>>>>>
>>>>>> An lcore variable has one value for every current and future lcore
>>>>>> id-equipped thread.
>>>>>>
>>>>>> The primary <rte_lcore_var.h> use case is for statically
>> allocating
>>>>>> small chunks of often-used data, which is related logically, but
>>>> where
>>>>>> there are performance benefits to reap from having updates being
>>>> local
>>>>>> to an lcore.
>>>>>>
>>>>>> Lcore variables are similar to thread-local storage (TLS, e.g.,
>> C11
>>>>>> _Thread_local), but decoupling the values' life time with that of
>>>> the
>>>>>> threads.
>>>>>>
>>>>>> Lcore variables are also similar in terms of functionality
>> provided
>>>> by
>>>>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>>>>>> build-time machinery. DPCPU uses linker scripts, which effectively
>>>>>> prevents the reuse of its, otherwise seemingly viable, approach.
>>>>>>
>>>>>> The currently-prevailing way to solve the same problem as lcore
>>>>>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-
>>>> sized
>>>>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>>>>>> lcore variables over this approach is that data related to the
>> same
>>>>>> lcore now is close (spatially, in memory), rather than data used
>> by
>>>>>> the same module, which in turn avoids excessive use of padding,
>>>>>> polluting caches with unused data.
>>>>>>
>>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>>> ---
>>>>>
>>>>> This looks very promising. :-)
>>>>>
>>>>> Here's a bunch of comments, questions and suggestions.
>>>>>
>>>>>
>>>>> * Question: Performance.
>>>>> What is the cost of accessing an lcore variable vs a variable in
>> TLS?
>>>>> I suppose the relative cost diminishes if the variable is a larger
>>>> struct, compared to a simple uint64_t.
>>>>>
>>>>
>>>> In case all the relevant data is available in a cache close to the
>>>> core,
>>>> both options carry quite low overhead.
>>>>
>>>> Accessing a lcore variable will always require a TLS lookup, in the
>>>> form
>>>> of retrieving the lcore_id of the current thread. In that sense,
>> there
>>>> will likely be a number of extra instructions required to do the
>> lcore
>>>> variable address lookup (i.e., doing the load from rte_lcore_var
>> table
>>>> based on the lcore_id you just looked up, and adding the variable's
>>>> offset).
>>>>
>>>> A TLS lookup will incur an extra overhead of less than a clock
>> cycle,
>>>> compared to accessing a non-TLS static variable, in case static
>> linking
>>>> is used. For shared objects, TLS is much more expensive (something
>>>> often
>>>> visible in dynamically linked DPDK app flame graphs, in the form of
>> the
>>>> __tls_addr symbol). Then you need to add ~3 cc/access. This on a
>> micro
>>>> benchmark running on a x86_64 Raptor Lake P-core.
>>>>
>>>> (To visualize the difference between shared object and not, one can
>> use
>>>> Compiler Explorer and -fPIC versus -fPIE.)
>>>>
>>>> Things get more complicated if you access the same variable in the
>> same
>>>> section of code, since then it can be left on the stack/in a register
>> by
>>>> the compiler, especially if LTO is used. In other words, if you do
>>>> rte_lcore_id() several times in a row, only the first one will cost
>> you
>>>> anything. This happens fairly often in DPDK, with rte_lcore_id().
>>>>
>>>> Finally, if you do something like
>>>>
>>>> diff --git a/lib/eal/common/rte_random.c
>> b/lib/eal/common/rte_random.c
>>>> index af9fffd81b..a65c30d27e 100644
>>>> --- a/lib/eal/common/rte_random.c
>>>> +++ b/lib/eal/common/rte_random.c
>>>> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state
>> *state)
>>>>     static __rte_always_inline
>>>>     struct rte_rand_state *__rte_rand_get_state(void)
>>>>     {
>>>> -       unsigned int idx;
>>>> -
>>>> -       idx = rte_lcore_id();
>>>> -
>>>> -       if (unlikely(idx == LCORE_ID_ANY))
>>>> -               return &unregistered_rand_state;
>>>> -
>>>> -       return RTE_LCORE_VAR_PTR(rand_state);
>>>> +       return &unregistered_rand_state;
>>>>     }
>>>>
>>>>     uint64_t
>>>>
>>>> ...and re-run the rand_perf_autotest, at least I see no difference
>> at
>>>> all (in a statically linked build). Both results in rte_rand() using
>>>> ~11
>>>> cc/call. What that suggests is that TLS overhead is very small, and
>>>> that
>>>> any extra instructions required by lcore variables doesn't add much,
>> if
>>>> anything at all, at least in this particular case.
>>>
>>> Excellent. Thank you for a thorough and detailed answer, Mattias.
>>>
>>>>
>>>>> Some of my suggestions below might also affect performance.
>>>>>
>>>>>
>>>>> * Advantage: Provides direct access to worker thread variables.
>>>>> With the current alternative (thread-local storage), the main
>> thread
>>>> cannot access the TLS variables of the worker threads,
>>>>> unless worker threads publish global access pointers.
>>>>> Lcore variables of any lcore thread can be directly accessed by
>> any
>>>> thread, which simplifies code.
>>>>>
>>>>>
>>>>> * Advantage: Roadmap towards hugemem.
>>>>> It would be nice if the lcore variable memory was allocated in
>>>> hugemem, to reduce TLB misses.
>>>>> The current alternative (thread-local storage) is also not using
>>>> hugemem, so not a degradation.
>>>>>
>>>>
>>>> I agree, but the thing is it's hard to figure out how much memory is
>>>> required for these kind of variables, given how DPDK is built and
>>>> linked. In an OS kernel, you can just take all the symbols, put them
>> in
>>>> a special section, and size that section. Such a thing can't easily
>> be
>>>> done with DPDK, since shared object builds are supported, plus that
>>>> this
>>>> facility should be available not only to DPDK modules, but also the
>>>> application, so relying on linker scripts isn't really feasible (and
>>>> probably not even feasible for DPDK itself).
>>>>
>>>> In that scenario, you want to size up the per-lcore buffer to be so
>>>> large, you don't have to worry about overruns. That will waste
>> memory.
>>>> If you use huge page memory, paging can't help you to avoid
>>>> pre-allocating actual physical memory.
>>>
>>> Good point.
>>> I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE),
>> but I hadn't considered how paging helps us use less physical memory
>> than that.
>>>
>>>>
>>>> That said, even large (by static per-lcore data standards) buffers
>> are
>>>> potentially small enough not to grow the amount of memory used by a
>>>> DPDK
>>>> process too much. You need to provision for RTE_MAX_LCORE of them
>>>> though.
>>>>
>>>> The value of lcore variables should be small, and thus incur few TLB
>>>> misses, so you may not gain much from huge pages. In my world, it's
>>>> more
>>>> about "fitting often-used per-lcore data into L1 or L2 CPU caches",
>>>> rather than the easier "fitting often-used per-lcore data into a
>>>> working
>>>> set size reasonably expected to be covered by hardware TLB/caches".
>>>
>>> Yes, I suppose that lcore variables are intended to be small, and
>> large per-lcore structures should keep following the current design
>> patterns for allocation and access.
>>>
>>
>> It seems to me that support for per-lcore heaps should be the solution
>> for supporting use cases requiring many, larger and/or dynamic objects
>> on a per-lcore basis.
>>
>> Ideally, you would design both that mechanism and lcore variables
>> together, but then if you couple enough amount of improvements together
>> you will never get anywhere. An instance of where perfect is the enemy
>> of good, perhaps.
> 
> So true. :-)
> 
>>
>>> Perhaps this guideline is worth mentioning in the documentation.
>>>
>>
>> What is missing, more specifically? The size limitation and the static
>> nature of lcore variables is described, and what current design
>> patterns
>> they expected to (partly) replace is also covered.
> 
> Your documentation is fine, and nothing specific is missing here.
> I was thinking out loud that the high-level DPDK documentation should describe common design patterns.
> 
>>
>>>>
>>>>> Lcore variables are available very early at startup, so I guess the
>>>> RTE memory allocator is not yet available.
>>>>> Hugemem could be allocated using O/S allocation, so there is a
>>>> possible road towards using hugemem.
>>>>>
>>>>
>>>> With the current design, that's true. I'm not sure it's a strict
>>>> requirement though, but it does make things simpler.
>>>>
>>>>> Either way, using hugemem would require one more indirection (the
>>>> pointer to the allocated hugemem).
>>>>> I don't know which has better performance, using hugemem or
>> avoiding
>>>> the additional pointer dereferencing.
>>>>>
>>>>>
>>>>> * Suggestion: Consider adding an entry for unregistered non-EAL
>>>> threads.
>>>>> Please consider making room for one more entry, shared by all
>>>> unregistered non-EAL threads, i.e.
>>>>> making the array size RTE_MAX_LCORE + 1 and indexing by
>>>> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
>>>>>
>>>>> It would be convenient for the use cases where a variable shared by
>>>> the unregistered non-EAL threads don't need special treatment.
>>>>>
>>>>
>>>> I thought about this, but it would require a conditional in the
>> lookup
>>>> macro, as you show. More importantly, it would make the whole
>>>> <rte_lcore_var.h> thing less elegant and harder to understand. It's
>> bad
>>>> enough that "per-lcore" is actually "per-lcore id" (or the
>> equivalent
>>>> "per-EAL thread and unregistered EAL-thread"). Adding a "btw it's
>> <what
>>>> I said before> + 1" is not an improvement.
>>>
>>> We could promote "one more entry for unregistered non-EAL threads"
>> design pattern (for relevant use cases only!) by extending EAL with one
>> more TLS variable, maintained like _thread_id, but set to RTE_MAX_LCORE
>> when _thread_id is set to -1:
>>>
>>> +++ eal_common_thread.c:
>>>     RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
>>> + RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;
> 
> Oops... wrong reference! I meant to refer to _lcore_id, not _thread_id. Correction:
> 

OK. I subconsciously ignored this mistake, and read it as "_lcore_id".

> We could promote the "one more entry for unregistered non-EAL threads" design pattern (for relevant use cases only!) by extending EAL with one more TLS variable, maintained like _lcore_id, but set to RTE_MAX_LCORE when _lcore_id is set to LCORE_ID_ANY:
> 
> +++ eal_common_thread.c:
>    RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;
> + RTE_DEFINE_PER_LCORE(unsigned int, _lcore_idx) = RTE_MAX_LCORE;
> 
>>>
>>> and
>>>
>>> +++ rte_lcore.h:
>>> static inline unsigned
>>> rte_lcore_id(void)
>>> {
>>> 	return RTE_PER_LCORE(_lcore_id);
>>> }
>>> + static inline unsigned
>>> + rte_lcore_idx(void)
>>> + {
>>> + 	return RTE_PER_LCORE(_lcore_idx);
>>> + }
>>>
>>> That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ?
>> rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently used.
>>>
>>
>> Wouldn't that effectively give a shared lcore id to all unregistered
>> threads?
> 
> Yes, just like rte_lcore_id() is LCORE_ID_ANY (i.e. UINT32_MAX) for all unregistered threads. The difference is that it would be usable for array indexing: it behaves as a shadow variable of RTE_PER_LCORE(_lcore_id), optimizing away the "rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE" conditional when indexing.
> 
>>
>> We definitely shouldn't further complicate anything related to the DPDK
>> threading model, in my opinion.
>>
>> If a module needs one or more variable instances that aren't per lcore,
>> use regular static allocation instead. I would favor clarity over
>> convenience here, at least until we know better (see below as well).
>>
>>>>
>>>> But useful? Sure.
>>>>
>>>> I think you may still need other data for dealing with unregistered
>>>> threads, for example a mutex or spin lock to deal with concurrency
>>>> issues that arises with shared data.
>>>
>>> Adding the extra entry is only for the benefit of use cases where
>> special handling is not required. It will make the code for those use
>> cases much cleaner. I think it is useful.
>>>
>>
>> It will make it shorter, but not less clean, I would argue.
>>
>>> Use cases requiring special handling should still do the special
>> handling they do today.
>>>
>>
>> For DPDK modules using lcore variables and which treat unregistered
>> threads as "full citizens", I expect special handling of unregistered
>> threads to be the norm. Take rte_random.h as an example. Current API
>> does not guarantee MT safety for concurrent calls of unregistered
>> threads. It probably should, and it should probably be by means of a
>> mutex (not spinlock).
>>
>> The reason I'm not running off to make a rte_random.c patch is that
>> it's unclear to me what is the role of unregistered threads in DPDK.
>> I'm
>> reasonably comfortable with a model where there are many threads that
>> basically don't interact with the DPDK APIs (except maybe some very
>> narrow exposure, like the preemption-safe ring variant). One example of
>> such a design would be big slow control plane which uses multi-
>> threading
>> and the Linux process scheduler for work scheduling, hosted in the same
>> process as a DPDK data plane app.
>>
>> What I find more strange is a scenario where there are unregistered
>> threads which interact with a wide variety of DPDK APIs, do so
>> at-high-rates/with-high-performance-requirements and are expected to be
>> preemption-safe. So they are basically EAL threads without an lcore id.
> 
> Yes, this is happening in the wild.
> E.g. our application has a mode where it uses fewer EAL threads, and processes more in non-EAL threads. So to say, the same work is processed either by an EAL thread or a non-EAL thread, depending on the application's mode.
> The extra array entry would be useful for such use cases.
> 

Is there some particular reason you can't register those non-EAL threads?

>>
>> Support for that latter scenario has also been voiced, in previous
>> discussions, from what I recall.
>>
>> I think it's hard to answer the question of an "unregistered thread
>> spare" for lcore variables without first knowing what the future should
>> look like for unregistered threads in DPDK, in terms of being able to
>> call into DPDK APIs, preemption-safety guarantees, etc.
>>
>> It seems that until you have a clearer picture of how generally to
>> treat unregistered threads, you are best off with just a per-lcore id
>> instance of lcore variables.
> 
> I get your point. It also reduces the risk of bugs caused by incorrect use of the additional entry.
> 
> I am arguing from a different angle: providing the extra entry will help uncover relevant use cases.
> 

Maybe have two "spares" in case you find two new use cases? :)

No, adding spares doesn't work, unless you rework the API and rename it 
to fit the new purpose of not only providing per-lcore id variables, but 
per-something-else.

>>
>>>>
>>>> There may also be cases where you are best off by simply disallowing
>>>> unregistered threads from calling into that API.
>>>>
>>>>> Obviously, this might affect performance.
>>>>> If the performance cost is not negligible, the additional entry (and
>>>>> indexing branch) could be disabled at build time.
>>>>>
>>>>>
>>>>> * Suggestion: Do not fix the alignment at 16 bytes.
>>>>> Pass an alignment parameter to rte_lcore_var_alloc() and use
>>>>> alignof() when calling it:
>>>>>
>>>>> +#include <stdalign.h>
>>>>> +
>>>>> +#define RTE_LCORE_VAR_ALLOC(name)			\
>>>>> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
>>>>> +
>>>>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
>>>>> +	name = rte_lcore_var_alloc(size, alignment)
>>>>> +
>>>>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
>>>>> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
>>>>> +
>>>>> + +++ config/rte_config.h
>>>>> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
>>>>>
>>>>>
>>>>
>>>> That seems like a very good idea. I'll look into it.
>>>>
>>>>> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but
>>>>> behaves differently.
>>>>>
>>>>>> +/**
>>>>>> + * Iterate over each lcore id's value for a lcore variable.
>>>>>> + */
>>>>>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
>>>>>> +	for (unsigned int lcore_id =					\
>>>>>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
>>>>>> +	     lcore_id < RTE_MAX_LCORE;					\
>>>>>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>>>>>> +
>>>>>
>>>>> The macro name RTE_LCORE_VAR_FOREACH() resembles
>>>>> RTE_LCORE_FOREACH(i), which only iterates on running cores.
>>>>> You might want to give it a name that differs more.
>>>>>
>>>>
>>>> True.
>>>>
>>>> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
>>>> confusion, for sure.
>>>>
>>>> Being consistent with <rte_lcore.h> is not so easy, since it's not
>>>> even consistent with itself. For example, rte_lcore_count() returns
>>>> the number of lcores (EAL threads) *plus the number of registered
>>>> non-EAL threads*, and RTE_LCORE_FOREACH() gives a different count. :)
>>>
>>> Naming is hard. I don't have a good name, and can only offer
>>> inspiration...
>>>
>>> <rte_lcore.h> has RTE_LCORE_FOREACH() and its
>>> RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.
>>>
>>> Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a
>>> variant.
>>>
>>>>
>>>>> If it wasn't for API breakage, I would suggest renaming
>>>>> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
>>>>>
>>>>> Small detail: "var" is a pointer, so consider renaming it to "ptr"
>>>>> and adding _PTR to the macro name.
>>>>
>>>> The "var" name comes from how <sys/queue.h> names things. I think I
>>>> had it as "ptr" initially. I'll change it back.
>>>
>>> Thanks.
>>>
>>>>
>>>> Thanks a lot Morten.


* RE: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-19 14:31             ` Mattias Rönnblom
@ 2024-02-19 15:04               ` Morten Brørup
  0 siblings, 0 replies; 42+ messages in thread
From: Morten Brørup @ 2024-02-19 15:04 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 19 February 2024 15.32
> 
> On 2024-02-19 12:10, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Monday, 19 February 2024 08.49
> >>
> >> On 2024-02-09 14:04, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >>>> Sent: Friday, 9 February 2024 12.46
> >>>>
> >>>> On 2024-02-09 09:25, Morten Brørup wrote:
> >>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>>>> Sent: Thursday, 8 February 2024 19.17
> >>>>>>
> >>>>>> Introduce DPDK per-lcore id variables, or lcore variables for
> >>>>>> short.
> >>>>>>
> >>>>>> An lcore variable has one value for every current and future
> >>>>>> lcore id-equipped thread.
> >>>>>>
> >>>>>> The primary <rte_lcore_var.h> use case is for statically
> >>>>>> allocating small chunks of often-used data, which is related
> >>>>>> logically, but where there are performance benefits to reap from
> >>>>>> having updates being local to an lcore.
> >>>>>>
> >>>>>> Lcore variables are similar to thread-local storage (TLS, e.g.,
> >>>>>> C11 _Thread_local), but decoupling the values' life time with
> >>>>>> that of the threads.
> >>>>>>
> >>>>>> Lcore variables are also similar in terms of functionality
> >>>>>> provided by FreeBSD kernel's DPCPU_*() family of macros and the
> >>>>>> associated build-time machinery. DPCPU uses linker scripts, which
> >>>>>> effectively prevents the reuse of its, otherwise seemingly
> >>>>>> viable, approach.
> >>>>>>
> >>>>>> The currently-prevailing way to solve the same problem as lcore
> >>>>>> variables is to keep a module's per-lcore data as
> >>>>>> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
> >>>>>> structs. The benefit of lcore variables over this approach is
> >>>>>> that data related to the same lcore now is close (spatially, in
> >>>>>> memory), rather than data used by the same module, which in turn
> >>>>>> avoid excessive use of padding, polluting caches with unused
> >>>>>> data.
> >>>>>>
> >>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>>>> ---

[...]

> > Oops... wrong reference! I meant to refer to _lcore_id, not
> > _thread_id. Correction:
> >
> 
> OK. I subconsciously ignored this mistake, and read it as "_lcore_id".

:-)

[...]

> >> For DPDK modules using lcore variables and which treat unregistered
> >> threads as "full citizens", I expect special handling of
> >> unregistered threads to be the norm. Take rte_random.h as an example.
> >> Current API does not guarantee MT safety for concurrent calls of
> >> unregistered threads. It probably should, and it should probably be
> >> by means of a mutex (not spinlock).
> >>
> >> The reason I'm not running off to make a rte_random.c patch is that
> >> it's unclear to me what the role of unregistered threads in DPDK is.
> >> I'm reasonably comfortable with a model where there are many threads
> >> that basically don't interact with the DPDK APIs (except maybe some
> >> very narrow exposure, like the preemption-safe ring variant). One
> >> example of such a design would be a big slow control plane which uses
> >> multi-threading and the Linux process scheduler for work scheduling,
> >> hosted in the same process as a DPDK data plane app.
> >>
> >> What I find more strange is a scenario where there are unregistered
> >> threads which interact with a wide variety of DPDK APIs, do so
> >> at-high-rates/with-high-performance-requirements and are expected to
> >> be preemption-safe. So they are basically EAL threads without an
> >> lcore id.
> >
> > Yes, this is happening in the wild.
> > E.g. our application has a mode where it uses fewer EAL threads, and
> > processes more in non-EAL threads. So to say, the same work is
> > processed either by an EAL thread or a non-EAL thread, depending on
> > the application's mode.
> > The extra array entry would be useful for such use cases.
> >
> 
> Is there some particular reason you can't register those non-EAL
> threads?

Legacy. I suppose we could just do that instead.
Thanks for the suggestion!

> 
> >>
> >> Support for that latter scenario has also been voiced, in previous
> >> discussions, from what I recall.
> >>
> >> I think it's hard to answer the question of an "unregistered thread
> >> spare" for lcore variables without first knowing what the future
> >> should look like for unregistered threads in DPDK, in terms of being
> >> able to call into DPDK APIs, preemption-safety guarantees, etc.
> >>
> >> It seems that until you have a clearer picture of how generally to
> >> treat unregistered threads, you are best off with just a per-lcore id
> >> instance of lcore variables.
> >
> > I get your point. It also reduces the risk of bugs caused by
> > incorrect use of the additional entry.
> >
> > I am arguing from a different angle: providing the extra entry will
> > help uncover relevant use cases.
> >
> 
> Maybe have two "spares" in case you find two new use cases? :)
> 
> No, adding spares doesn't work, unless you rework the API and rename it
> to fit the new purpose of not only providing per-lcore id variables,
> but per-something-else.
> 

OK. I'm convinced.



* RE: [RFC v2 3/5] random: keep PRNG state in lcore variable
  2024-02-19 14:04         ` Mattias Rönnblom
@ 2024-02-19 15:10           ` Morten Brørup
  0 siblings, 0 replies; 42+ messages in thread
From: Morten Brørup @ 2024-02-19 15:10 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 19 February 2024 15.04
> 
> On 2024-02-19 12:22, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Monday, 19 February 2024 10.41
> >>
> >> Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
> >> cache-aligned and RTE_CACHE_GUARDed struct instances with keeping
> >> the same state in a more cache-friendly lcore variable.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> ---
> >
> > [...]
> >
> >> @@ -19,14 +20,12 @@ struct rte_rand_state {
> >>   	uint64_t z3;
> >>   	uint64_t z4;
> >>   	uint64_t z5;
> >> -	RTE_CACHE_GUARD;
> >> -} __rte_cache_aligned;
> >> +};
> >>
> >> -/* One instance each for every lcore id-equipped thread, and one
> >> - * additional instance to be shared by all others threads (i.e., all
> >> - * unregistered non-EAL threads).
> >> - */
> >> -static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
> >> +RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
> >> +
> >> +/* instance to be shared by all unregistered non-EAL threads */
> >> +static struct rte_rand_state unregistered_rand_state __rte_cache_aligned;
> >
> > The unregistered_rand_state instance is still __rte_cache_aligned;
> > consider also adding an RTE_CACHE_GUARD to it.
> >
> 
> It shouldn't be cache-line aligned. I'll remove it. Thanks.

Agreed; that fix is just as good. Then,

Acked-by: Morten Brørup <mb@smartsharesystems.com>
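
For readers following along, here is a minimal, self-contained sketch of
the pattern discussed above: one state instance per lcore id, plus a
single instance shared by all unregistered threads. The mutex around the
shared instance follows the suggestion made elsewhere in this thread; it
is an assumption for illustration, not something the patch does. All
toy_* names are hypothetical and this is not the rte_random.c code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define TOY_MAX_LCORE 4			/* stand-in for RTE_MAX_LCORE */
#define TOY_LCORE_ID_ANY UINT32_MAX	/* stand-in for LCORE_ID_ANY */

struct toy_rand_state {
	uint64_t counter;	/* placeholder for real PRNG state */
};

/* One instance per lcore id; only the owning thread updates it. */
static struct toy_rand_state toy_lcore_states[TOY_MAX_LCORE];

/* Single instance shared by all threads lacking an lcore id. */
static struct toy_rand_state toy_unregistered_state;
static pthread_mutex_t toy_unregistered_lock = PTHREAD_MUTEX_INITIALIZER;

/* Advance the state and return the new value (not a real PRNG). */
static uint64_t
toy_step(struct toy_rand_state *state)
{
	return ++state->counter;
}

/* Pick the per-lcore state if there is one, else the locked shared one. */
static uint64_t
toy_rand(unsigned int lcore_id)
{
	uint64_t value;

	if (lcore_id != TOY_LCORE_ID_ANY)
		return toy_step(&toy_lcore_states[lcore_id]);

	pthread_mutex_lock(&toy_unregistered_lock);
	value = toy_step(&toy_unregistered_state);
	pthread_mutex_unlock(&toy_unregistered_lock);

	return value;
}
```

The per-lcore path stays lock-free, so threads with an lcore id pay
nothing for the protection given to unregistered callers.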



* [RFC v3 0/6] Lcore variables
  2024-02-19  9:40     ` [RFC v2 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-20  8:49       ` Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                           ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how to best allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean,
and seemingly very performant solution to a real problem.

One thing that is unclear to the author is how this API relates to a
potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Contrary to what the version.map edit suggests, this RFC is not meant
as a proposal for DPDK 24.03.

Mattias Rönnblom (6):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 app/test/meson.build                  |   1 +
 app/test/test_lcore_var.c             | 407 ++++++++++++++++++++++++++
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  82 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_random.c           |  30 +-
 lib/eal/common/rte_service.c          | 119 ++++----
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 lib/eal/x86/rte_power_intrinsics.c    |  17 +-
 lib/power/rte_power_pmd_mgmt.c        |  36 ++-
 13 files changed, 987 insertions(+), 88 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1



* [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-20  9:11           ` Bruce Richardson
                             ` (2 more replies)
  2024-02-20  8:49         ` [RFC v3 2/6] eal: add lcore variable test suite Mattias Rönnblom
                           ` (4 subsequent siblings)
  5 siblings, 3 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of often-used data, which is related logically, but where
there are performance benefits to reap from keeping updates local to
an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to that provided by
the FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore is now close (spatially, in memory), rather than data used by
the same module. This in turn avoids excessive use of padding, which
pollutes caches with unused data.
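
The layout described above can be illustrated with a small standalone
sketch. It is a toy model, not the patch itself: the toy_* names, the
buffer sizes and the 64-byte cache line are assumptions, and offsets are
kept as plain integers rather than being stored in typed pointers as the
real handles are.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for RTE_MAX_LCORE and RTE_MAX_LCORE_VAR (assumed values). */
#define TOY_MAX_LCORE 4
#define TOY_MAX_LCORE_VAR 1024

/* One contiguous buffer per lcore id; 64 is an assumed cache line size. */
static _Alignas(64) char toy_lcore_var[TOY_MAX_LCORE][TOY_MAX_LCORE_VAR];

/* Next free offset; starts at 1 so that no variable gets offset zero. */
static uintptr_t toy_allocated = 1;

/* Reserve 'size' bytes at a power-of-2 'align'; return the offset. */
static uintptr_t
toy_alloc(size_t size, size_t align)
{
	uintptr_t offset = (toy_allocated + align - 1) & ~(uintptr_t)(align - 1);

	assert(offset + size < TOY_MAX_LCORE_VAR);
	toy_allocated = offset + size;

	return offset;
}

/* Translate (lcore id, offset) into the address of that lcore's value. */
static void *
toy_ptr(unsigned int lcore_id, uintptr_t offset)
{
	return &toy_lcore_var[lcore_id][offset];
}

/* Distance in bytes from address 'a' to address 'b'. */
static ptrdiff_t
toy_distance(const void *a, const void *b)
{
	return (const char *)b - (const char *)a;
}
```

Two variables allocated back to back end up a few bytes apart within
each lcore's buffer, while the same variable for two neighbouring lcore
ids is a full TOY_MAX_LCORE_VAR bytes apart, which is why no per-struct
padding is needed to keep lcores off each other's cache lines.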

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
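
As background to the alignof items above: C11's alignof (from
<stdalign.h>) is only defined for type names, and accepting an
expression is a GNU extension. Wrapping the expression in typeof yields
a type name again. Below is a minimal sketch of that idea, assuming GCC
or Clang; TOY_HANDLE_ALIGN and struct toy_state are hypothetical names,
and typeof is spelled __typeof__ here so the snippet also builds in
strict ISO modes, whereas the patch uses plain typeof.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

struct toy_state {
	char tag;
	double value;
};

/* Hypothetical helper: alignment of the object type a handle points to. */
#define TOY_HANDLE_ALIGN(handle) alignof(__typeof__(*(handle)))

/* Derive the alignment from a handle without naming its type. */
static size_t
toy_state_align(void)
{
	struct toy_state *handle = NULL;

	/* __typeof__ does not evaluate its operand; NULL is never read. */
	return TOY_HANDLE_ALIGN(handle);
}
```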

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  82 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 7 files changed, 465 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index da265d7dd2..884482e473 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -30,6 +30,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index a6a768bd7c..bb06bb7ca1 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -98,6 +98,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..dfd11cbd0b
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define WARN_THRESHOLD 75
+
+/*
+ * Avoid using offset zero, since it would result in a NULL-value
+ * "handle" (offset) pointer, which in principle and per the API
+ * definition shouldn't be an issue, but may confuse some tools and
+ * users.
+ */
+#define INITIAL_OFFSET 1
+
+char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
+
+static uintptr_t allocated = INITIAL_OFFSET;
+
+static void
+verify_allocation(uintptr_t new_allocated)
+{
+	static bool has_warned;
+
+	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
+
+	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
+	    !has_warned) {
+		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
+			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
+			RTE_MAX_LCORE_VAR);
+		has_warned = true;
+	}
+}
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
+
+	void *offset = (void *)new_allocated;
+
+	new_allocated += size;
+
+	verify_allocation(new_allocated);
+
+	allocated = new_allocated;
+
+	EAL_LOG(DEBUG, "Allocated %zu bytes of per-lcore data with a "
+		"%zu-byte alignment", size, align);
+
+	return offset;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines, as
+	 * well as having the base pointer aligned on cache line size,
+	 * assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..da49d48d7c
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,375 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. In other words,
+ * there's one copy of its value for each and every current and future
+ * lcore id-equipped thread, with the total number of copies amounting
+ * to \c RTE_MAX_LCORE.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a \c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). A handle may be passed between modules and
+ * threads just like any pointer, but its value is not the address of
+ * any particular object, but rather just an opaque identifier, stored
+ * in a typed pointer (to inform the access macros of the value type).
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
+ *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * generally only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by different lcore
+ * ids *may* be frequently read or written by the owners without the
+ * risk of false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomics) should be
+ * employed to assure there are no data races between the owning
+ * thread and any non-owner threads accessing the same lcore variable
+ * instance.
+ *
+ * The value of the lcore variable for a particular lcore id may be
+ * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
+ * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * To modify the value of an lcore variable for a particular lcore id,
+ * either access the object through the pointer retrieved by \ref
+ * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
+ * RTE_LCORE_VAR_LCORE_SET.
+ *
+ * Each access macro has a shorthand which may be used by an EAL
+ * thread or registered non-EAL thread to access the lcore variable
+ * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
+ * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
+ *
+ * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier. The
+ * *identifier* value is common across all lcore ids.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like \c int,
+ * but would more typically be a \c struct. An application may choose
+ * to define an lcore variable, which it then goes on to never
+ * allocate.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * The sum of all lcore variables, plus any padding required, must be
+ * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
+ * violation of this maximum results in the process being terminated.
+ *
+ * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
+ * same order of magnitude in size as a thread stack.
+ *
+ * The lcore variable storage buffers are kept in the BSS section in
+ * the resulting binary, where data generally isn't mapped in until
+ * it's accessed. This means that unused portions of the lcore
+ * variable storage area will not occupy any physical memory (with a
+ * granularity of the memory page size [usually 4 kB]).
+ *
+ * Lcore variables should generally *not* be \ref __rte_cache_aligned
+ * and need *not* include a \ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the
+ * case of an lcore variable instance, all nearby data structures
+ * should almost-always be written to by a single thread (the lcore
+ * variable owner). Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         unsigned int lcore_id;
+ *
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * \endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * } __rte_cache_aligned;
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * \endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions and for example next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to \ref rte_lcore_var.h is the \ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., \ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of a particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. One effect of this is that thread-local variables must
+ *     be initialized in a "lazy" manner (e.g., at the point of thread
+ *     creation). Lcore variables may be accessed immediately after
+ *     having been allocated (which is usually before any thread beyond
+ *     the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id, and thus
+ *     not for such "regular" threads.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread and GCC _Thread_local, such data sharing is
+ *     supported. In the C11 standard, the result of accessing another
+ *     thread's _Thread_local object is implementation-defined. Lcore
+ *     variable instances may be accessed reliably by any thread.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various per-lcore id instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore variable handle, rather than a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)	\
+	name = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
+	name = rte_lcore_var_alloc(size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC(name)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),		\
+				       alignof(typeof(*(name))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a \ref
+ * RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
+	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
+	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+/**
+ * Get value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
+
+/**
+ * Set the value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
+
+/**
+ * Get value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
+
+/**
+ * Set value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_SET(name, value) \
+	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)				\
+	for (unsigned int lcore_id =					\
+		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and less than
+ *   or equal to \c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The id of the variable, stored in a void pointer value.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 5e0cd47c82..e90b86115a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -393,6 +393,10 @@ EXPERIMENTAL {
 	# added in 23.07
 	rte_memzone_max_get;
 	rte_memzone_max_set;
+
+	# added in 24.03
+	rte_lcore_var_alloc;
+	rte_lcore_var;
 };
 
 INTERNAL {
-- 
2.34.1
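
A note on the addressing model behind the macros in this patch: the
value returned by rte_lcore_var_alloc() is an offset stored in a
pointer, and __RTE_LCORE_VAR_LCORE_PTR() uses it to index the row of
rte_lcore_var that belongs to a given lcore id. Below is a minimal
standalone sketch of that model. It is not the EAL implementation
(which is not shown in this part of the thread). All toy_* names and
sizes are hypothetical, and starting the offset at 1, so that a handle
is never NULL, is an assumption of the sketch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_LCORE 4		/* hypothetical stand-in for RTE_MAX_LCORE */
#define TOY_MAX_LCORE_VAR 256	/* hypothetical stand-in for RTE_MAX_LCORE_VAR */

/* One row of bytes per lcore id; every row has the same layout. */
static alignas(64) char toy_lcore_var[TOY_MAX_LCORE][TOY_MAX_LCORE_VAR];

/* Next free offset. Starts at 1 (an assumption of this sketch) so that
 * no handle is ever NULL.
 */
static size_t toy_offset = 1;

/* Bump allocator: returns the offset of the new variable, disguised
 * as a pointer. align == 0 means "suitable for any type".
 */
static void *
toy_lcore_var_alloc(size_t size, size_t align)
{
	if (align == 0)
		align = alignof(max_align_t);

	assert((align & (align - 1)) == 0); /* must be a power of 2 */

	size_t start = (toy_offset + align - 1) & ~(align - 1);

	if (start + size > TOY_MAX_LCORE_VAR)
		abort(); /* buffer exhaustion is fatal */

	toy_offset = start + size;

	return (void *)(uintptr_t)start;
}

/* Translate a handle into a pointer to one lcore id's instance. */
static void *
toy_lcore_ptr(unsigned int lcore_id, void *handle)
{
	return &toy_lcore_var[lcore_id][(uintptr_t)handle];
}
```

With this layout, instances of different variables for the same lcore
id are packed next to each other, while instances of the same variable
for different lcore ids are TOY_MAX_LCORE_VAR bytes apart.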


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v3 2/6] eal: add lcore variable test suite
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Add test suite to exercise the <rte_lcore_var.h> API.

RFC v2:
 * Improve alignment-related test coverage.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 407 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 408 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 6389ae83ee..93412cce51 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..27084e91e9
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,407 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static bool
+rand_bool(void)
+{
+	return rte_rand() & 1;
+}
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_PTR(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal;
+
+	if (rand_bool())
+		equal = RTE_LCORE_VAR_GET(test_int) == state->old_value;
+	else
+		equal = *(RTE_LCORE_VAR_PTR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	if (rand_bool())
+		RTE_LCORE_VAR_SET(test_int, state->new_value);
+	else
+		*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		RTE_LCORE_VAR_LCORE_SET(lcore_id, test_int, state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		TEST_ASSERT_EQUAL(state->new_value,
+				  RTE_LCORE_VAR_LCORE_GET(lcore_id, test_int),
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* Take the opportunity to test the foreach macro, which declares
+	 * and advances its own lcore_id; the body must not increment it.
+	 */
+	int *v;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_PTR(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_struct);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_PTR(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(RTE_LCORE_VAR_LCORE_GET(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_array);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (RTE_MAX_LCORE_VAR / 2)
+
+static int
+test_many_lvars(void)
+{
+	void **handlers = malloc(sizeof(void *) * MANY_LVARS);
+	int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		void *handle = rte_lcore_var_alloc(1, 1);
+
+		uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), handle);
+
+		*b = (uint8_t)i;
+
+		handlers[i] = handle;
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_FOREACH_WORKER(lcore_id) {
+			uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(),
+							       handlers[i]);
+			TEST_ASSERT_EQUAL((uint8_t)i, *b,
+					  "Unexpected lcore variable value.");
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1
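
One property of RTE_LCORE_VAR_FOREACH_VALUE worth keeping in mind when
writing tests like these: as defined in patch 1/6, the macro declares
its own loop variable named lcore_id, so a loop body that mentions
lcore_id sees the macro's variable, not one from the enclosing scope.
The standalone sketch below reproduces the shape of the macro over a
plain array to show the effect. The toy_* names are hypothetical and
only serve as illustration.

```c
#include <stddef.h>

#define TOY_MAX_LCORE 8	/* hypothetical stand-in for RTE_MAX_LCORE */

static int toy_values[TOY_MAX_LCORE];

/* Same shape as RTE_LCORE_VAR_FOREACH_VALUE: the macro itself declares
 * lcore_id and advances both lcore_id and var on every iteration.
 */
#define TOY_FOREACH_VALUE(var)						\
	for (unsigned int lcore_id = (((var) = &toy_values[0]), 0);	\
	     lcore_id < TOY_MAX_LCORE;					\
	     lcore_id++, (var) = &toy_values[lcore_id])

/* Body only reads lcore_id: every element is visited once. */
static unsigned int
toy_count_visits(void)
{
	unsigned int visits = 0;
	int *v;

	TOY_FOREACH_VALUE(v) {
		*v = (int)lcore_id;
		visits++;
	}

	return visits;
}

/* Body increments lcore_id as well: this advances the macro's own
 * index, so every other element is skipped.
 */
static unsigned int
toy_count_visits_double_step(void)
{
	unsigned int visits = 0;
	int *v;

	TOY_FOREACH_VALUE(v) {
		(void)v;
		visits++;
		lcore_id++;
	}

	return visits;
}
```

Here toy_count_visits() visits all TOY_MAX_LCORE elements, while
toy_count_visits_double_step() visits only half of them.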


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v3 3/6] random: keep PRNG state in lcore variable
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 2/6] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-20 15:31           ` Morten Brørup
  2024-02-20  8:49         ` [RFC v3 4/6] power: keep per-lcore " Mattias Rönnblom
                           ` (2 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/common/rte_random.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 7709b8f2c6..adbbf13f0e 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct rte_rand_state {
@@ -19,14 +20,12 @@ struct rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
-} __rte_cache_aligned;
+};
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_PTR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1
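
For readers skimming the diff, the state selection after this patch
can be summarized as: threads with an lcore id use their own lcore
variable instance, and all unregistered non-EAL threads share one
separate static instance, seeded with the value that follows the last
lcore id. A minimal standalone sketch of that pattern, with
hypothetical toy_* names, a plain array in place of the lcore variable
and a single-word state in place of struct rte_rand_state:

```c
#include <stdint.h>

#define TOY_MAX_LCORE 4			/* hypothetical stand-in for RTE_MAX_LCORE */
#define TOY_LCORE_ID_ANY UINT32_MAX	/* hypothetical stand-in for LCORE_ID_ANY */

struct toy_rand_state {
	uint64_t z;
};

/* Plain array standing in for the per-lcore id instances. */
static struct toy_rand_state toy_lcore_states[TOY_MAX_LCORE];

/* Instance shared by all threads without an lcore id. */
static struct toy_rand_state toy_unregistered_state;

/* Seed every per-lcore instance with seed + lcore id, then the shared
 * instance with the next value, mirroring the loop in rte_srand().
 */
static void
toy_seed_all(uint64_t seed)
{
	unsigned int lcore_id;

	for (lcore_id = 0; lcore_id < TOY_MAX_LCORE; lcore_id++)
		toy_lcore_states[lcore_id].z = seed + lcore_id;

	toy_unregistered_state.z = seed + lcore_id;
}

/* Pick the state for the calling thread's lcore id. */
static struct toy_rand_state *
toy_get_state(unsigned int lcore_id)
{
	if (lcore_id == TOY_LCORE_ID_ANY)
		return &toy_unregistered_state;

	return &toy_lcore_states[lcore_id];
}
```

Note that the shared instance offers no protection against concurrent
use by several unregistered threads in this sketch.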


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v3 4/6] power: keep per-lcore state in lcore variable
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
                           ` (2 preceding siblings ...)
  2024-02-20  8:49         ` [RFC v3 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 5/6] service: " Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

RFC v3:
 * Replace for loop with FOREACH macro.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/power/rte_power_pmd_mgmt.c | 36 ++++++++++++++++------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 591fc69f36..ea30454895 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -68,8 +69,8 @@ struct pmd_core_cfg {
 	/**< Number of queues ready to enter power optimized state */
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
-} __rte_cache_aligned;
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+};
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v3 5/6] service: keep per-lcore state in lcore variable
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
                           ` (3 preceding siblings ...)
  2024-02-20  8:49         ` [RFC v3 4/6] power: keep per-lcore " Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-22  9:42           ` Morten Brørup
  2024-02-20  8:49         ` [RFC v3 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 1 reply; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 119 ++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index d959c91459..de205c5da5 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,11 +102,12 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
+	else {
+		struct core_state *cs;
+		RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+			memset(cs, 0, sizeof(struct core_state));
 	}
 
 	int i;
@@ -122,7 +124,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +137,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +286,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +293,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +454,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +467,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs = RTE_LCORE_VAR_PTR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +489,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_PTR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +535,16 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs =
+		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +552,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +573,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +590,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +642,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +694,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +712,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +737,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +761,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +785,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +815,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +824,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +849,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +860,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +868,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +876,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +885,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +901,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +948,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +977,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +989,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1028,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC v3 6/6] eal: keep per-lcore power intrinsics state in lcore variable
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
                           ` (4 preceding siblings ...)
  2024-02-20  8:49         ` [RFC v3 5/6] service: " Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  5 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Keep per-lcore power intrinsics state in an lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 532a2e646b..f4659af77e 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -12,10 +13,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -170,7 +175,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_PTR(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -262,7 +267,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_PTR(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -301,8 +306,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_PTR(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-20  9:11           ` Bruce Richardson
  2024-02-20 10:47             ` Mattias Rönnblom
  2024-02-21  9:43           ` Jerin Jacob
  2024-02-22  9:22           ` Morten Brørup
  2 siblings, 1 reply; 42+ messages in thread
From: Bruce Richardson @ 2024-02-20  9:11 UTC (permalink / raw)
  To: Mattias Rönnblom; +Cc: dev, hofors, Morten Brørup, Stephen Hemminger

On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small chunks of often-used data, which is related logically, but where
> there are performance benefits to reap from having updates being local
> to an lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' life time with that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoid excessive use of padding,
> polluting caches with unused data.
> 
> RFC v3:
>  * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>  * Update example to reflect FOREACH macro name change (in RFC v2).
> 
> RFC v2:
>  * Use alignof to derive alignment requirements. (Morten Brørup)
>  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>  * Allow user-specified alignment, but limit max to cache line size.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  config/rte_config.h                   |   1 +
>  doc/api/doxy-api-index.md             |   1 +
>  lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>  lib/eal/common/meson.build            |   1 +
>  lib/eal/include/meson.build           |   1 +
>  lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>  lib/eal/version.map                   |   4 +
>  7 files changed, 465 insertions(+)
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
> 
> diff --git a/config/rte_config.h b/config/rte_config.h
> index da265d7dd2..884482e473 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -30,6 +30,7 @@
>  /* EAL defines */
>  #define RTE_CACHE_GUARD_LINES 1
>  #define RTE_MAX_HEAPS 32
> +#define RTE_MAX_LCORE_VAR 1048576
>  #define RTE_MAX_MEMSEG_LISTS 128
>  #define RTE_MAX_MEMSEG_PER_LIST 8192
>  #define RTE_MAX_MEM_MB_PER_LIST 32768
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index a6a768bd7c..bb06bb7ca1 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>    [interrupts](@ref rte_interrupts.h),
>    [launch](@ref rte_launch.h),
>    [lcore](@ref rte_lcore.h),
> +  [lcore-varible](@ref rte_lcore_var.h),
>    [per-lcore](@ref rte_per_lcore.h),
>    [service cores](@ref rte_service.h),
>    [keepalive](@ref rte_keepalive.h),
> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
> new file mode 100644
> index 0000000000..dfd11cbd0b
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,82 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_log.h>
> +
> +#include <rte_lcore_var.h>
> +
> +#include "eal_private.h"
> +
> +#define WARN_THRESHOLD 75
> +
> +/*
> + * Avoid using offset zero, since it would result in a NULL-value
> + * "handle" (offset) pointer, which in principle and per the API
> + * definition shouldn't be an issue, but may confuse some tools and
> + * users.
> + */
> +#define INITIAL_OFFSET 1
> +
> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
> +

While I like the idea of improved handling for per-core variables, my main
concern with this set is this definition here, which adds yet another
dependency on the compile-time defined RTE_MAX_LCORE value.

I believe we already have an issue with this #define where it's impossible
to come up with a single value that works for all, or nearly all cases. The
current default is still 128, yet DPDK needs to support systems where the
number of cores is well into the hundreds, requiring workarounds of core
mappings or customized builds of DPDK. Upping the value fixes those issues
at the cost of a memory footprint explosion for smaller systems.

I'm therefore nervous about putting more dependencies on this value, when I
feel we should be moving away from its use, to allow more runtime
configurability of cores.

For this set/feature, would it be possible to have a run-time allocated
(and sized) array rather than using the RTE_MAX_LCORE value?

Thanks, (and apologies for the mini-rant!)

/Bruce

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  9:11           ` Bruce Richardson
@ 2024-02-20 10:47             ` Mattias Rönnblom
  2024-02-20 11:39               ` Bruce Richardson
  0 siblings, 1 reply; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-20 10:47 UTC (permalink / raw)
  To: Bruce Richardson, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger

On 2024-02-20 10:11, Bruce Richardson wrote:
> On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is related logically, but where
>> there are performance benefits to reap from having updates being local
>> to an lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' life time with that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality provided by
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoid excessive use of padding,
>> polluting caches with unused data.
>>
>> RFC v3:
>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>
>> RFC v2:
>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>   * Allow user-specified alignment, but limit max to cache line size.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   config/rte_config.h                   |   1 +
>>   doc/api/doxy-api-index.md             |   1 +
>>   lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>>   lib/eal/common/meson.build            |   1 +
>>   lib/eal/include/meson.build           |   1 +
>>   lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>>   lib/eal/version.map                   |   4 +
>>   7 files changed, 465 insertions(+)
>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index da265d7dd2..884482e473 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -30,6 +30,7 @@
>>   /* EAL defines */
>>   #define RTE_CACHE_GUARD_LINES 1
>>   #define RTE_MAX_HEAPS 32
>> +#define RTE_MAX_LCORE_VAR 1048576
>>   #define RTE_MAX_MEMSEG_LISTS 128
>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index a6a768bd7c..bb06bb7ca1 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>>     [interrupts](@ref rte_interrupts.h),
>>     [launch](@ref rte_launch.h),
>>     [lcore](@ref rte_lcore.h),
>> +  [lcore-varible](@ref rte_lcore_var.h),
>>     [per-lcore](@ref rte_per_lcore.h),
>>     [service cores](@ref rte_service.h),
>>     [keepalive](@ref rte_keepalive.h),
>> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
>> new file mode 100644
>> index 0000000000..dfd11cbd0b
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,82 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define WARN_THRESHOLD 75
>> +
>> +/*
>> + * Avoid using offset zero, since it would result in a NULL-value
>> + * "handle" (offset) pointer, which in principle and per the API
>> + * definition shouldn't be an issue, but may confuse some tools and
>> + * users.
>> + */
>> +#define INITIAL_OFFSET 1
>> +
>> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
>> +
> 
> While I like the idea of improved handling for per-core variables, my main
> concern with this set is this definition here, which adds yet another
> dependency on the compile-time defined RTE_MAX_LCORE value.
> 

Lcore variables replace one RTE_MAX_LCORE-dependent pattern with another.

You could even argue the dependency on RTE_MAX_LCORE is reduced with 
lcore variables, if you look at where/in how many places in the code 
base this macro is being used. Centralizing per-lcore data management 
may also provide some opportunity in the future for extending the API to 
cope with some more dynamic RTE_MAX_LCORE variant. Not without ABI 
breakage of course, but we are not ever going to change anything related 
to RTE_MAX_LCORE without breaking the ABI, since this constant is 
everywhere, including compiled into the application itself.

> I believe we already have an issue with this #define where it's impossible
> to come up with a single value that works for all, or nearly all cases. The
> current default is still 128, yet DPDK needs to support systems where the
> number of cores is well into the hundreds, requiring workarounds of core
> mappings or customized builds of DPDK. Upping the value fixes those issues
> at the cost to memory footprint explosion for smaller systems.
> 

I agree this is an issue.

RTE_MAX_LCORE also needs to be sized to accommodate not only all cores
used, but the sum of all EAL threads and registered non-EAL threads.

So, there is no reliable way to derive what RTE_MAX_LCORE should be from
a particular piece of hardware, since the actual number of lcore ids
needed is up to the application.

Why is the default set so low? Linux has MAX_CPUS, which serves the same 
purpose, which is set to 4096 by default, if I recall correctly. 
Shouldn't we at least be able to increase it to 256?

> I'm therefore nervous about putting more dependencies on this value, when I
> feel we should be moving away from its use, to allow more runtime
> configurability of cores.
> 

What more specifically do you have in mind?

Maybe I'm overly pessimistic, but supporting lcores without any upper 
bound and also allowing them to be added and removed at any point during 
run time seems far-fetched, given where DPDK is today.

Introducing an actual upper bound, set during DPDK run-time
initialization and lower than RTE_MAX_LCORE, seems easier. I think there
is some equivalent in the Linux kernel. Again, you would need to
accommodate future rte_thread_register() calls.

<rte_lcore_var.h> could be extended with user-specified lcore variable
init/free callbacks, to allow lazy or late initialization.

If one could have a way to retrieve the max possible lcore ids *for a 
particular DPDK process* (as opposed to a particular build) it would be 
possible to avoid touching the per-lcore buffers for lcore ids that 
would never be used. With data in BSS, it would never be mapped/allocated.

An issue with BSS data is that very RT-sensitive applications may
decide to lock all memory into RAM (e.g., with mlockall()), to avoid
latency jitter caused by paging, and those would suffer from a large
rte_lcore_var (or from all the current static arrays). Lcore variables
make this worse, since rte_lcore_var is larger than the sum of today's
static arrays, and must be so, with some margin, since there is no way
to figure out ahead of time how much memory is actually going to be needed.
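
For readers unfamiliar with the layout under discussion, here is a
minimal, self-contained sketch of the idea behind rte_lcore_var: one
fixed-size static buffer per lcore id, with each variable identified by
an offset that is valid in every buffer. All sketch_* names and the
sizes are made up for illustration; this is not the code from the patch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical stand-ins for RTE_MAX_LCORE and RTE_MAX_LCORE_VAR. */
#define SKETCH_MAX_LCORE 4
#define SKETCH_MAX_LCORE_VAR 1024

/* One buffer per lcore id. Zero-initialized static data ends up in BSS. */
static alignas(64) char sketch_lcore_var[SKETCH_MAX_LCORE][SKETCH_MAX_LCORE_VAR];

/* Start at 1 so that no variable is ever handed offset zero. */
static size_t sketch_next_offset = 1;

/*
 * Reserve 'size' bytes in every per-lcore buffer. 'align' must be a
 * power of two. Returns the offset (the "handle"), or 0 if the
 * buffers are exhausted.
 */
static size_t
sketch_alloc(size_t size, size_t align)
{
	size_t offset = (sketch_next_offset + align - 1) & ~(align - 1);

	if (offset + size > SKETCH_MAX_LCORE_VAR)
		return 0;

	sketch_next_offset = offset + size;

	return offset;
}

/* Translate (lcore id, handle) into a pointer to that lcore's value. */
static void *
sketch_ptr(unsigned int lcore_id, size_t offset)
{
	return &sketch_lcore_var[lcore_id][offset];
}
```

In this sketch, two variables allocated back to back sit next to each
other within one lcore's buffer, while the same variable for two
different lcore ids is a full buffer (SKETCH_MAX_LCORE_VAR bytes) apart.
The total static footprint is SKETCH_MAX_LCORE * SKETCH_MAX_LCORE_VAR
bytes regardless of how much is actually allocated, which is why
locking all memory into RAM is costly.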

> For this set/feature, would it be possible to have a run-time allocated
> (and sized) array rather than using the RTE_MAX_LCORE value?
> 

What I explored was having the per-lcore buffers dynamically allocated.
I saw no apparent benefit, and dynamic allocation brought new problems
to solve. One was to ensure the lcore variable buffers were allocated
before they were used. In particular, if you want to use huge page
memory, lcore variables may be available only once that machinery is
ready to accept requests.

Also, with huge page memory, you won't get the benefit you get from
demand paging and BSS (i.e., only used memory is actually allocated).

With malloc(), I believe you generally do get that same benefit, if your
allocation is sufficiently large.

I also considered just allocating chunks, fitting (say) 64 kB worth of
lcore variables in each. It turned out more complex, and to no benefit,
other than reducing the footprint for mlockall()-type apps, which seemed
like a corner case.
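
To make the chunk alternative concrete, below is a rough sketch of what
lazily allocated 64 kB chunks could look like. It is an assumption about
the general shape only, not the code that was actually tried; all
sketch_* names and limits are hypothetical. It is not thread-safe, and
it assumes no variable straddles a chunk boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical limits, chosen only for illustration. */
#define SKETCH_MAX_LCORE 8
#define SKETCH_CHUNK_SIZE (64 * 1024)
#define SKETCH_MAX_CHUNKS 16

/* Per-lcore table of chunk pointers; NULL until first use. */
static char *sketch_chunks[SKETCH_MAX_LCORE][SKETCH_MAX_CHUNKS];

/*
 * Translate (lcore id, offset) into a pointer, allocating the
 * zero-filled chunk backing that offset on first access.
 */
static void *
sketch_chunk_ptr(unsigned int lcore_id, size_t offset)
{
	size_t chunk = offset / SKETCH_CHUNK_SIZE;

	if (lcore_id >= SKETCH_MAX_LCORE || chunk >= SKETCH_MAX_CHUNKS)
		return NULL;

	if (sketch_chunks[lcore_id][chunk] == NULL) {
		sketch_chunks[lcore_id][chunk] = calloc(1, SKETCH_CHUNK_SIZE);
		if (sketch_chunks[lcore_id][chunk] == NULL)
			return NULL;
	}

	return sketch_chunks[lcore_id][chunk] + offset % SKETCH_CHUNK_SIZE;
}
```

Compared to a single static array, every access now goes through an
extra table lookup and a NULL check, and allocation failure has to be
handled at the point of use.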

I never considered a dynamic RTE_MAX_LCORE with no upper bound.

> Thanks, (and apologies for the mini-rant!)
> 
> /Bruce

Thanks for the comments. This was nowhere near a rant.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20 10:47             ` Mattias Rönnblom
@ 2024-02-20 11:39               ` Bruce Richardson
  2024-02-20 13:37                 ` Morten Brørup
  2024-02-20 16:26                 ` Mattias Rönnblom
  0 siblings, 2 replies; 42+ messages in thread
From: Bruce Richardson @ 2024-02-20 11:39 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Mattias Rönnblom, dev, Morten Brørup, Stephen Hemminger

On Tue, Feb 20, 2024 at 11:47:14AM +0100, Mattias Rönnblom wrote:
> On 2024-02-20 10:11, Bruce Richardson wrote:
> > On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
> > > Introduce DPDK per-lcore id variables, or lcore variables for short.
> > > 
> > > An lcore variable has one value for every current and future lcore
> > > id-equipped thread.
> > > 
> > > The primary <rte_lcore_var.h> use case is for statically allocating
> > > small chunks of often-used data, which is related logically, but where
> > > there are performance benefits to reap from having updates being local
> > > to an lcore.
> > > 
> > > Lcore variables are similar to thread-local storage (TLS, e.g., C11
> > > _Thread_local), but decoupling the values' life time with that of the
> > > threads.

<snip>

> > > +/*
> > > + * Avoid using offset zero, since it would result in a NULL-value
> > > + * "handle" (offset) pointer, which in principle and per the API
> > > + * definition shouldn't be an issue, but may confuse some tools and
> > > + * users.
> > > + */
> > > +#define INITIAL_OFFSET 1
> > > +
> > > +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
> > > +
> > 
> > While I like the idea of improved handling for per-core variables, my main
> > concern with this set is this definition here, which adds yet another
> > dependency on the compile-time defined RTE_MAX_LCORE value.
> > 
> 
> lcore variables replaces one RTE_MAX_LCORE-dependent pattern with another.
> 
> You could even argue the dependency on RTE_MAX_LCORE is reduced with lcore
> variables, if you look at where/in how many places in the code base this
> macro is being used. Centralizing per-lcore data management may also provide
> some opportunity in the future for extending the API to cope with some more
> dynamic RTE_MAX_LCORE variant. Not without ABI breakage of course, but we
> are not ever going to change anything related to RTE_MAX_LCORE without
> breaking the ABI, since this constant is everywhere, including compiled into
> the application itself.
> 

Yep, that is true if it's widely used.

> > I believe we already have an issue with this #define where it's impossible
> > to come up with a single value that works for all, or nearly all cases. The
> > current default is still 128, yet DPDK needs to support systems where the
> > number of cores is well into the hundreds, requiring workarounds of core
> > mappings or customized builds of DPDK. Upping the value fixes those issues
> > at the cost to memory footprint explosion for smaller systems.
> > 
> 
> I agree this is an issue.
> 
> RTE_MAX_LCORE also need to be sized to accommodate not only all cores used,
> but the sum of all EAL threads and registered non-EAL threads.
> 
> So, there is no reliable way to discover what RTE_MAX_LCORE is on a
> particular piece of hardware, since the actual number of lcore ids needed is
> up to the application.
> 
> Why is the default set so low? Linux has MAX_CPUS, which serves the same
> purpose, which is set to 4096 by default, if I recall correctly. Shouldn't
> we at least be able to increase it to 256?

The default is so low because of the mempool caches. These are an array of
buffer pointers with 512 (IIRC) entries per core up to RTE_MAX_LCORE.

> 
> > I'm therefore nervous about putting more dependencies on this value, when I
> > feel we should be moving away from its use, to allow more runtime
> > configurability of cores.
> > 
> 
> What more specifically do you have in mind?
> 

I don't think having a dynamically scaling RTE_MAX_LCORE is feasible, but
what I would like to see is a runtime specified value. For example, you
could run DPDK with EAL parameter "--max-lcores=1024" for large systems or
"--max-lcores=32" for small ones. That would then be used at init-time to
scale all internal datastructures appropriately.
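
As a rough illustration of init-time sizing (a sketch under stated
assumptions, not an existing DPDK facility): the per-lcore buffers are
allocated once, with the lcore count supplied at run time, for example
parsed from the hypothetical "--max-lcores" option. All sketch_* names
and the buffer size are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical size of each lcore's buffer. */
#define SKETCH_LCORE_VAR_SIZE 4096

struct sketch_lcore_bufs {
	unsigned int max_lcores; /* chosen at init time, not compile time */
	char *base;              /* max_lcores * SKETCH_LCORE_VAR_SIZE bytes */
};

/* Allocate zero-filled buffers for 'max_lcores' lcore ids. */
static int
sketch_bufs_init(struct sketch_lcore_bufs *bufs, unsigned int max_lcores)
{
	if (max_lcores == 0)
		return -1;

	bufs->base = calloc(max_lcores, SKETCH_LCORE_VAR_SIZE);
	if (bufs->base == NULL)
		return -1;

	bufs->max_lcores = max_lcores;

	return 0;
}

/* Bounds-checked lookup against the run-time limit. */
static void *
sketch_bufs_ptr(const struct sketch_lcore_bufs *bufs, unsigned int lcore_id,
		size_t offset)
{
	if (lcore_id >= bufs->max_lcores || offset >= SKETCH_LCORE_VAR_SIZE)
		return NULL;

	return bufs->base + (size_t)lcore_id * SKETCH_LCORE_VAR_SIZE + offset;
}

static void
sketch_bufs_free(struct sketch_lcore_bufs *bufs)
{
	free(bufs->base);
	bufs->base = NULL;
	bufs->max_lcores = 0;
}
```

A small system would call sketch_bufs_init(&bufs, 32) and a large one
sketch_bufs_init(&bufs, 1024); the bound check then uses the run-time
value rather than a compile-time constant.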

/Bruce

<snip for brevity>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20 11:39               ` Bruce Richardson
@ 2024-02-20 13:37                 ` Morten Brørup
  2024-02-20 16:26                 ` Mattias Rönnblom
  1 sibling, 0 replies; 42+ messages in thread
From: Morten Brørup @ 2024-02-20 13:37 UTC (permalink / raw)
  To: Bruce Richardson, Mattias Rönnblom
  Cc: Mattias Rönnblom, dev, Stephen Hemminger

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 20 February 2024 12.39
> 
> On Tue, Feb 20, 2024 at 11:47:14AM +0100, Mattias Rönnblom wrote:
> > On 2024-02-20 10:11, Bruce Richardson wrote:
> > > On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
> > > > Introduce DPDK per-lcore id variables, or lcore variables for
> short.
> > > >
> > > > An lcore variable has one value for every current and future
> lcore
> > > > id-equipped thread.
> > > >
> > > > The primary <rte_lcore_var.h> use case is for statically
> allocating
> > > > small chunks of often-used data, which is related logically, but
> where
> > > > there are performance benefits to reap from having updates being
> local
> > > > to an lcore.
> > > >
> > > > Lcore variables are similar to thread-local storage (TLS, e.g.,
> C11
> > > > _Thread_local), but decoupling the values' life time with that of
> the
> > > > threads.
> 
> <snip>
> 
> > > > +/*
> > > > + * Avoid using offset zero, since it would result in a NULL-
> value
> > > > + * "handle" (offset) pointer, which in principle and per the API
> > > > + * definition shouldn't be an issue, but may confuse some tools
> and
> > > > + * users.
> > > > + */
> > > > +#define INITIAL_OFFSET 1
> > > > +
> > > > +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR]
> __rte_cache_aligned;
> > > > +
> > >
> > > While I like the idea of improved handling for per-core variables,
> my main
> > > concern with this set is this definition here, which adds yet
> another
> > > dependency on the compile-time defined RTE_MAX_LCORE value.
> > >
> >
> > lcore variables replaces one RTE_MAX_LCORE-dependent pattern with
> another.
> >
> > You could even argue the dependency on RTE_MAX_LCORE is reduced with
> lcore
> > variables, if you look at where/in how many places in the code base
> this
> > macro is being used. Centralizing per-lcore data management may also
> provide
> > some opportunity in the future for extending the API to cope with
> some more
> > dynamic RTE_MAX_LCORE variant. Not without ABI breakage of course,
> but we
> > are not ever going to change anything related to RTE_MAX_LCORE
> without
> > breaking the ABI, since this constant is everywhere, including
> compiled into
> > the application itself.
> >
> 
> Yep, that is true if it's widely used.
> 
> > > I believe we already have an issue with this #define where it's impossible
> > > to come up with a single value that works for all, or nearly all cases. The
> > > current default is still 128, yet DPDK needs to support systems where the
> > > number of cores is well into the hundreds, requiring workarounds of core
> > > mappings or customized builds of DPDK. Upping the value fixes those issues
> > > at the cost to memory footprint explosion for smaller systems.
> > >
> >
> > I agree this is an issue.
> >
> > RTE_MAX_LCORE also need to be sized to accommodate not only all cores used,
> > but the sum of all EAL threads and registered non-EAL threads.
> >
> > So, there is no reliable way to discover what RTE_MAX_LCORE is on a
> > particular piece of hardware, since the actual number of lcore ids needed is
> > up to the application.
> >
> > Why is the default set so low? Linux has MAX_CPUS, which serves the same
> > purpose, which is set to 4096 by default, if I recall correctly. Shouldn't
> > we at least be able to increase it to 256?

I recall a recent techboard meeting where the default was discussed. The default was set this low because it suffices for the vast majority of hardware out there, and applications for bigger platforms can be expected to build DPDK with a different configuration themselves. And as Bruce also mentions, it's a tradeoff against memory consumption.

> 
> The default is so low because of the mempool caches. These are an array of
> buffer pointers with 512 (IIRC) entries per core up to RTE_MAX_LCORE.

The decision was based on a need to make a quick decision, so we used narrow guesstimates, not a broader memory consumption analysis.

If we really cared about default memory consumption, we should reduce the default RTE_MAX_QUEUES_PER_PORT from 1024 too. It has quite an effect.

Having hard data about which build time configuration parameters have the biggest effect on memory consumption would be extremely useful for tweaking the parameters for resource limited hardware.
It's a mix of static and dynamic allocation, so it's not obvious which scalable data structures consume the most memory.

> 
> >
> > > I'm therefore nervous about putting more dependencies on this value, when I
> > > feel we should be moving away from its use, to allow more runtime
> > > configurability of cores.
> > >
> >
> > What more specifically do you have in mind?
> >
> 
> I don't think having a dynamically scaling RTE_MAX_LCORE is feasible, but
> what I would like to see is a runtime specified value. For example, you
> could run DPDK with EAL parameter "--max-lcores=1024" for large systems or
> "--max-lcores=32" for small ones. That would then be used at init-time to
> scale all internal datastructures appropriately.
> 

I agree 100 % that a better long term solution should be on the general road map.
Memory is a precious resource, but few seem to care about it.

A mix could provide an easy migration path:
Having RTE_MAX_LCORE as the hard upper limit (and default value) for a runtime specified max number ("rte_max_lcores").
With this, the goal would be for modules with very small data sets to continue using RTE_MAX_LCORE fixed size arrays, and for modules with larger data sets to migrate to rte_max_lcores dynamically sized arrays.
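
To make the mix concrete, here is a rough standalone sketch. It is not DPDK code: rte_max_lcores, max_lcores_set() and the two example modules are hypothetical names used only for illustration, and RTE_MAX_LCORE is defined locally as a stand-in for the build-time constant.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the build-time constant; 128 is the default cited in this thread. */
#define RTE_MAX_LCORE 128

/* Hypothetical run-time limit; defaults to the build-time hard upper limit. */
static unsigned int rte_max_lcores = RTE_MAX_LCORE;

/* Hypothetical setter, e.g. called once while parsing EAL arguments. */
static int
max_lcores_set(unsigned int n)
{
	if (n == 0 || n > RTE_MAX_LCORE)
		return -EINVAL;
	rte_max_lcores = n;
	return 0;
}

/* Module with a very small data set: keeps its fixed-size array. */
static uint64_t small_counters[RTE_MAX_LCORE];

static void
small_counter_inc(unsigned int lcore_id)
{
	assert(lcore_id < rte_max_lcores);
	small_counters[lcore_id]++;
}

/* Module with a larger data set: sized from the run-time limit at init. */
struct big_state {
	char scratch[4096];
};

static struct big_state *big_states;

static int
big_states_init(void)
{
	big_states = calloc(rte_max_lcores, sizeof(*big_states));
	return big_states != NULL ? 0 : -ENOMEM;
}

static struct big_state *
big_state_get(unsigned int lcore_id)
{
	assert(lcore_id < rte_max_lcores);
	return &big_states[lcore_id];
}
```

With the limit set to 32, the larger module in this sketch allocates 32 entries instead of 128, while the small module is left untouched.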

I am opposed to blocking a new patch series only because it adds another RTE_MAX_LCORE sized array. We already have plenty of those.
It can be migrated to a dynamically sized array at a later time, just like the other modules with RTE_MAX_LCORE sized arrays.
Perhaps "fixing" an existing module would free up more memory than fixing this module. Let's spend development resources where they have the biggest impact.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC v3 3/6] random: keep PRNG state in lcore variable
  2024-02-20  8:49         ` [RFC v3 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-20 15:31           ` Morten Brørup
  0 siblings, 0 replies; 42+ messages in thread
From: Morten Brørup @ 2024-02-20 15:31 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Tuesday, 20 February 2024 09.49
> 

[...]

> @@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
> 
>  	idx = rte_lcore_id();
> 
> -	/* last instance reserved for unregistered non-EAL threads */
>  	if (unlikely(idx == LCORE_ID_ANY))

idx is now only used here, so you could get rid of it by comparing directly to rte_lcore_id() instead.

Minor detail only; don't spin the patch for it.

> -		idx = RTE_MAX_LCORE;
> +		return &unregistered_rand_state;
> 
> -	return &rand_states[idx];
> +	return RTE_LCORE_VAR_PTR(rand_state);
>  }


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20 11:39               ` Bruce Richardson
  2024-02-20 13:37                 ` Morten Brørup
@ 2024-02-20 16:26                 ` Mattias Rönnblom
  1 sibling, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-20 16:26 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Mattias Rönnblom, dev, Morten Brørup, Stephen Hemminger

On 2024-02-20 12:39, Bruce Richardson wrote:
> On Tue, Feb 20, 2024 at 11:47:14AM +0100, Mattias Rönnblom wrote:
>> On 2024-02-20 10:11, Bruce Richardson wrote:
>>> On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
>>>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>>>
>>>> An lcore variable has one value for every current and future lcore
>>>> id-equipped thread.
>>>>
>>>> The primary <rte_lcore_var.h> use case is for statically allocating
>>>> small chunks of often-used data, which is related logically, but where
>>>> there are performance benefits to reap from having updates being local
>>>> to an lcore.
>>>>
>>>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>>>> _Thread_local), but decoupling the values' life time with that of the
>>>> threads.
> 
> <snip>
> 
>>>> +/*
>>>> + * Avoid using offset zero, since it would result in a NULL-value
>>>> + * "handle" (offset) pointer, which in principle and per the API
>>>> + * definition shouldn't be an issue, but may confuse some tools and
>>>> + * users.
>>>> + */
>>>> +#define INITIAL_OFFSET 1
>>>> +
>>>> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
>>>> +
>>>
>>> While I like the idea of improved handling for per-core variables, my main
>>> concern with this set is this definition here, which adds yet another
>>> dependency on the compile-time defined RTE_MAX_LCORE value.
>>>
>>
>> lcore variables replaces one RTE_MAX_LCORE-dependent pattern with another.
>>
>> You could even argue the dependency on RTE_MAX_LCORE is reduced with lcore
>> variables, if you look at where/in how many places in the code base this
>> macro is being used. Centralizing per-lcore data management may also provide
>> some opportunity in the future for extending the API to cope with some more
>> dynamic RTE_MAX_LCORE variant. Not without ABI breakage of course, but we
>> are not ever going to change anything related to RTE_MAX_LCORE without
>> breaking the ABI, since this constant is everywhere, including compiled into
>> the application itself.
>>
> 
> Yep, that is true if it's widely used.
> 
>>> I believe we already have an issue with this #define where it's impossible
>>> to come up with a single value that works for all, or nearly all cases. The
>>> current default is still 128, yet DPDK needs to support systems where the
>>> number of cores is well into the hundreds, requiring workarounds of core
>>> mappings or customized builds of DPDK. Upping the value fixes those issues
>>> at the cost to memory footprint explosion for smaller systems.
>>>
>>
>> I agree this is an issue.
>>
>> RTE_MAX_LCORE also need to be sized to accommodate not only all cores used,
>> but the sum of all EAL threads and registered non-EAL threads.
>>
>> So, there is no reliable way to discover what RTE_MAX_LCORE is on a
>> particular piece of hardware, since the actual number of lcore ids needed is
>> up to the application.
>>
>> Why is the default set so low? Linux has MAX_CPUS, which serves the same
>> purpose, which is set to 4096 by default, if I recall correctly. Shouldn't
>> we at least be able to increase it to 256?
> 
> The default is so low because of the mempool caches. These are an array of
> buffer pointers with 512 (IIRC) entries per core up to RTE_MAX_LCORE.
> 
>>
>>> I'm therefore nervous about putting more dependencies on this value, when I
>>> feel we should be moving away from its use, to allow more runtime
>>> configurability of cores.
>>>
>>
>> What more specifically do you have in mind?
>>
> 
> I don't think having a dynamically scaling RTE_MAX_LCORE is feasible, but
> what I would like to see is a runtime specified value. For example, you
> could run DPDK with EAL parameter "--max-lcores=1024" for large systems or
> "--max-lcores=32" for small ones. That would then be used at init-time to
> scale all internal datastructures appropriately.
> 

Sounds reasonable to me, especially if you take a gradual approach.

By gradual I mean something like adding a function
rte_lcore_max_possible(), or something like that, returning the EAL
init-specified value. DPDK libraries/PMDs could then gradually be made
aware of, and take advantage of, the fact that lcore ids will always be
below a certain threshold, usually significantly lower than RTE_MAX_LCORE.

The only change required for lcore variables would be that the FOREACH 
macro would use the run-time-max value, rather than RTE_MAX_LCORE, which 
in turn would leave all the higher-numbered lcore id buffers 
untouched/unmapped.

The set of possible lcore ids could also be expressed as a bitset, if
you have a machine with a huge number of cores, running many small DPDK
instances.
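
As a rough standalone sketch of the FOREACH change (not the patch itself): the per-lcore buffers and the offset-style handle are modeled on the RFC, but all names, including lcore_max_possible_get(), are hypothetical, and the buffer size is shrunk for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LCORE 128       /* stand-in for RTE_MAX_LCORE */
#define MAX_LCORE_VAR 4096  /* stand-in for RTE_MAX_LCORE_VAR, shrunk here */

/* Per-lcore buffers, modeled on the patch; cache-line aligned, in BSS. */
static _Alignas(64) char lcore_var[MAX_LCORE][MAX_LCORE_VAR];

/* Hypothetical value that EAL init would derive from its arguments. */
static unsigned int lcore_max_possible = MAX_LCORE;

static unsigned int
lcore_max_possible_get(void)
{
	return lcore_max_possible;
}

/* Translate a handle (really an offset) into a pointer for one lcore id. */
static void *
lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
{
	assert(lcore_id < MAX_LCORE);
	return &lcore_var[lcore_id][(uintptr_t)handle];
}

/* Iterate only over lcore ids below the run-time limit. */
#define LCORE_VAR_FOREACH_VALUE(var, handle)				\
	for (unsigned int lcore_id = 0;					\
	     lcore_id < lcore_max_possible_get() &&			\
	     (((var) = lcore_var_lcore_ptr(lcore_id, handle)), 1);	\
	     lcore_id++)

/* Example user: set every reachable value, return how many were visited. */
static unsigned int
fill_values(uint64_t *handle, uint64_t value)
{
	unsigned int visited = 0;
	uint64_t *v;

	LCORE_VAR_FOREACH_VALUE(v, handle) {
		*v = value;
		visited++;
	}

	return visited;
}
```

Buffers for lcore ids at or above the run-time limit are never written by the loop, so in this model their pages stay untouched.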

> /Bruce
> 
> <snip for brevity>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-20  9:11           ` Bruce Richardson
@ 2024-02-21  9:43           ` Jerin Jacob
  2024-02-21 10:31             ` Morten Brørup
  2024-02-21 14:26             ` Mattias Rönnblom
  2024-02-22  9:22           ` Morten Brørup
  2 siblings, 2 replies; 42+ messages in thread
From: Jerin Jacob @ 2024-02-21  9:43 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Stephen Hemminger, Tomasz Duszynski

On Tue, Feb 20, 2024 at 2:35 PM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> Introduce DPDK per-lcore id variables, or lcore variables for short.
>
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
>
> The primary <rte_lcore_var.h> use case is for statically allocating
> small chunks of often-used data, which is related logically, but where
> there are performance benefits to reap from having updates being local
> to an lcore.

I think, in order to quantify the gain, we must add a performance test
case to measure the access cycles with lcore variables scheme vs this
scheme.
Other PMU counters (cache misses) may be interesting, but we don't yet
have a means in DPDK to do self-monitoring, such as:
https://patches.dpdk.org/project/dpdk/patch/20221213104350.3218167-1-tduszynski@marvell.com/
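
For illustration only, a rough single-threaded sketch of what such a test could compare. It is standalone C, not a DPDK test case: the sizes, names, and the use of clock_gettime() are assumptions, and a real test would presumably run on several lcores and count cycles (e.g., with rte_rdtsc()).

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <stdint.h>
#include <time.h>

#define N_LCORES 8     /* illustrative, not RTE_MAX_LCORE */
#define N_MODULES 16   /* number of modules keeping per-lcore state */
#define CACHE_LINE 64  /* assumed cache line size */

/* Static-array scheme: each module has its own cache-line-padded array. */
struct padded_counter {
	_Alignas(CACHE_LINE) volatile uint64_t value;
};
static struct padded_counter module_arrays[N_MODULES][N_LCORES];

/* Lcore-variable-like scheme: one lcore's values are packed together. */
struct lcore_block {
	_Alignas(CACHE_LINE) volatile uint64_t values[N_MODULES];
};
static struct lcore_block lcore_blocks[N_LCORES];

static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Bump every module's counter for one lcore; return the sum of the counters. */
static uint64_t
bench_padded(unsigned int lcore_id, uint64_t iterations, uint64_t *elapsed_ns)
{
	uint64_t start, sum = 0;

	assert(lcore_id < N_LCORES);
	start = now_ns();
	for (uint64_t i = 0; i < iterations; i++)
		for (unsigned int m = 0; m < N_MODULES; m++)
			module_arrays[m][lcore_id].value++;
	*elapsed_ns = now_ns() - start;

	for (unsigned int m = 0; m < N_MODULES; m++)
		sum += module_arrays[m][lcore_id].value;
	return sum;
}

/* Same work, but on the packed per-lcore layout. */
static uint64_t
bench_packed(unsigned int lcore_id, uint64_t iterations, uint64_t *elapsed_ns)
{
	uint64_t start, sum = 0;

	assert(lcore_id < N_LCORES);
	start = now_ns();
	for (uint64_t i = 0; i < iterations; i++)
		for (unsigned int m = 0; m < N_MODULES; m++)
			lcore_blocks[lcore_id].values[m]++;
	*elapsed_ns = now_ns() - start;

	for (unsigned int m = 0; m < N_MODULES; m++)
		sum += lcore_blocks[lcore_id].values[m];
	return sum;
}
```

In the padded layout, one lcore's 16 counters occupy 16 separate cache lines; in the packed layout they fit in two. Whether that shows up in the timings depends on the working set and the machine, which is what the test would have to establish.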

>
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' life time with that of the
> threads.
>
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
>
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoid excessive use of padding,
> polluting caches with unused data.
>
> RFC v3:
>  * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>  * Update example to reflect FOREACH macro name change (in RFC v2).
>
> RFC v2:
>  * Use alignof to derive alignment requirements. (Morten Brørup)
>  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>  * Allow user-specified alignment, but limit max to cache line size.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  config/rte_config.h                   |   1 +
>  doc/api/doxy-api-index.md             |   1 +
>  lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>  lib/eal/common/meson.build            |   1 +
>  lib/eal/include/meson.build           |   1 +
>  lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>  lib/eal/version.map                   |   4 +
>  7 files changed, 465 insertions(+)
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
>
> diff --git a/config/rte_config.h b/config/rte_config.h
> index da265d7dd2..884482e473 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -30,6 +30,7 @@
>  /* EAL defines */
>  #define RTE_CACHE_GUARD_LINES 1
>  #define RTE_MAX_HEAPS 32
> +#define RTE_MAX_LCORE_VAR 1048576
>  #define RTE_MAX_MEMSEG_LISTS 128
>  #define RTE_MAX_MEMSEG_PER_LIST 8192
>  #define RTE_MAX_MEM_MB_PER_LIST 32768
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index a6a768bd7c..bb06bb7ca1 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>    [interrupts](@ref rte_interrupts.h),
>    [launch](@ref rte_launch.h),
>    [lcore](@ref rte_lcore.h),
> +  [lcore-variable](@ref rte_lcore_var.h),
>    [per-lcore](@ref rte_per_lcore.h),
>    [service cores](@ref rte_service.h),
>    [keepalive](@ref rte_keepalive.h),
> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
> new file mode 100644
> index 0000000000..dfd11cbd0b
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,82 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_log.h>
> +
> +#include <rte_lcore_var.h>
> +
> +#include "eal_private.h"
> +
> +#define WARN_THRESHOLD 75
> +
> +/*
> + * Avoid using offset zero, since it would result in a NULL-value
> + * "handle" (offset) pointer, which in principle and per the API
> + * definition shouldn't be an issue, but may confuse some tools and
> + * users.
> + */
> +#define INITIAL_OFFSET 1
> +
> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
> +
> +static uintptr_t allocated = INITIAL_OFFSET;
> +
> +static void
> +verify_allocation(uintptr_t new_allocated)
> +{
> +       static bool has_warned;
> +
> +       RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
> +
> +       if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
> +           !has_warned) {
> +               EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
> +                       "of the maximum capacity (%d bytes)", WARN_THRESHOLD,
> +                       RTE_MAX_LCORE_VAR);
> +               has_warned = true;
> +       }
> +}
> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +       uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
> +
> +       void *offset = (void *)new_allocated;
> +
> +       new_allocated += size;
> +
> +       verify_allocation(new_allocated);
> +
> +       allocated = new_allocated;
> +
> +       EAL_LOG(DEBUG, "Allocated %zu bytes of per-lcore data with a "
> +               "%zu-byte alignment", size, align);
> +
> +       return offset;
> +}
> +
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +{
> +       /* Having the per-lcore buffer size be a multiple of the cache
> +        * line size, and the base pointer aligned on a cache line,
> +        * assures that aligned offsets also translate to aligned
> +        * pointers across all values.
> +        */
> +       RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> +       RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> +
> +       /* '0' means asking for worst-case alignment requirements */
> +       if (align == 0)
> +               align = alignof(max_align_t);
> +
> +       RTE_ASSERT(rte_is_power_of_2(align));
> +
> +       return lcore_var_alloc(size, align);
> +}
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 22a626ba6f..d41403680b 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -18,6 +18,7 @@ sources += files(
>          'eal_common_interrupts.c',
>          'eal_common_launch.c',
>          'eal_common_lcore.c',
> +        'eal_common_lcore_var.c',
>          'eal_common_mcfg.c',
>          'eal_common_memalloc.c',
>          'eal_common_memory.c',
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..9449253e23 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -27,6 +27,7 @@ headers += files(
>          'rte_keepalive.h',
>          'rte_launch.h',
>          'rte_lcore.h',
> +        'rte_lcore_var.h',
>          'rte_lock_annotations.h',
>          'rte_malloc.h',
>          'rte_mcslock.h',
> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
> new file mode 100644
> index 0000000000..da49d48d7c
> --- /dev/null
> +++ b/lib/eal/include/rte_lcore_var.h
> @@ -0,0 +1,375 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#ifndef _RTE_LCORE_VAR_H_
> +#define _RTE_LCORE_VAR_H_
> +
> +/**
> + * @file
> + *
> + * RTE Per-lcore id variables
> + *
> + * This API provides a mechanism to create and access per-lcore id
> + * variables in a space- and cycle-efficient manner.
> + *
> + * A per-lcore id variable (or lcore variable for short) has one value
> + * for each EAL thread and registered non-EAL thread. In other words,
> + * there's one copy of its value for each and every current and future
> + * lcore id-equipped thread, with the total number of copies amounting
> + * to \c RTE_MAX_LCORE.
> + *
> + * In order to access the values of an lcore variable, a handle is
> + * used. The type of the handle is a pointer to the value's type
> + * (e.g., for a \c uint32_t lcore variable, the handle is a
> + * <code>uint32_t *</code>). A handle may be passed between modules and
> + * threads just like any pointer, but its value is not the address of
> + * any particular object, but rather just an opaque identifier, stored
> + * in a typed pointer (to inform the access macro the type of values).
> + *
> + * @b Creation
> + *
> + * An lcore variable is created in two steps:
> + *  1. Define a lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
> + *  2. Allocate lcore variable storage and initialize the handle with
> + *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
> + *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
> + *     module initialization, but may be done at any time.
> + *
> + * An lcore variable is not tied to the owning thread's lifetime. It's
> + * available for use by any thread immediately after having been
> + * allocated, and continues to be available throughout the lifetime of
> + * the EAL.
> + *
> + * Lcore variables cannot and need not be freed.
> + *
> + * @b Access
> + *
> + * The value of any lcore variable for any lcore id may be accessed
> + * from any thread (including unregistered threads), but it should
> + * generally only be *frequently* read from or written to by the owner.
> + *
> + * Values of the same lcore variable but owned by different lcore
> + * ids *may* be frequently read or written by the owners without the
> + * risk of false sharing.
> + *
> + * An appropriate synchronization mechanism (e.g., atomics) should be
> + * employed to assure there are no data races between the owning
> + * thread and any non-owner threads accessing the same lcore variable
> + * instance.
> + *
> + * The value of the lcore variable for a particular lcore id may be
> + * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
> + * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
> + *
> + * To modify the value of an lcore variable for a particular lcore id,
> + * either access the object through the pointer retrieved by \ref
> + * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
> + * RTE_LCORE_VAR_LCORE_SET.
> + *
> + * The access macros each has a short-hand which may be used by an EAL
> + * thread or registered non-EAL thread to access the lcore variable
> + * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
> + * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
> + *
> + * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
> + * pointer with the same type as the value, it may not be directly
> + * dereferenced and must be treated as an opaque identifier. The
> + * *identifier* value is common across all lcore ids.
> + *
> + * @b Storage
> + *
> + * An lcore variable's values may be of a primitive type like \c int,
> + * but would more typically be a \c struct. An application may choose
> + * to define an lcore variable, which it then goes on to never
> + * allocate.
> + *
> + * The lcore variable handle introduces a per-variable (not
> + * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
> + * there are some memory footprint gains to be made by organizing all
> + * per-lcore id data for a particular module as one lcore variable
> + * (e.g., as a struct).
> + *
> + * The sum of all lcore variables, plus any padding required, must be
> + * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
> + * violation of this maximum results in the process being terminated.
> + *
> + * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
> + * same order of magnitude in size as a thread stack.
> + *
> + * The lcore variable storage buffers are kept in the BSS section in
> + * the resulting binary, where data generally isn't mapped in until
> + * it's accessed. This means that unused portions of the lcore
> + * variable storage area will not occupy any physical memory (with a
> + * granularity of the memory page size [usually 4 kB]).
> + *
> + * Lcore variables should generally *not* be \ref __rte_cache_aligned
> + * and need *not* include a \ref RTE_CACHE_GUARD field, since these
> + * constructs are designed to avoid false sharing. In the
> + * case of an lcore variable instance, all nearby data structures
> + * should almost-always be written to by a single thread (the lcore
> + * variable owner). Adding padding will increase the effective memory
> + * working set size, potentially reducing performance.
> + *
> + * @b Example
> + *
> + * Below is an example of the use of an lcore variable:
> + *
> + * \code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + * };
> + *
> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
> + *
> + * long foo_get_a_plus_b(void)
> + * {
> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
> + *
> + *         return state->a + state->b;
> + * }
> + *
> + * RTE_INIT(rte_foo_init)
> + * {
> + *         unsigned int lcore_id;
> + *
> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
> + *
> + *         struct foo_lcore_state *state;
> + *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
> + *                 (initialize 'state')
> + *         }
> + *
> + *         (other initialization)
> + * }
> + * \endcode
> + *
> + *
> + * @b Alternatives
> + *
> + * Lcore variables are designed to replace a pattern exemplified below:
> + * \code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + *         RTE_CACHE_GUARD;
> + * } __rte_cache_aligned;
> + *
> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
> + * \endcode
> + *
> + * This scheme is simple and effective, but has one drawback: the data
> + * is organized so that objects related to all lcores for a particular
> + * module are kept close in memory. At a bare minimum, this forces the
> + * use of cache-line alignment to avoid false sharing. With CPU
> + * hardware prefetching and memory loads resulting from speculative
> + * execution (functions which seemingly are getting more eager faster
> + * than they are getting more intelligent), one or more "guard" cache
> + * lines may be required to separate one lcore's data from another's.
> + *
> + * Lcore variables have the upside of working with, not against, the
> + * CPU's assumptions and for example next-line prefetchers may well
> + * work the way their designers intended (i.e., to the benefit, not
> + * detriment, of system performance).
> + *
> + * Another alternative to \ref rte_lcore_var.h is the \ref
> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
> + * e.g., GCC __thread or C11 _Thread_local). The main differences
> + * between using the various forms of TLS (e.g., \ref
> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
> + * variables are:
> + *
> + *   * The existence and non-existence of a thread-local variable
> + *     instance follow that of a particular thread. The data cannot be
> + *     accessed before the thread has been created, nor after it has
> + *     exited. One effect of this is that thread-local variables must
> + *     be initialized in a "lazy" manner (e.g., at the point of thread
> + *     creation). Lcore variables may be accessed immediately after
> + *     having been allocated (which is usually before any thread beyond
> + *     the main thread is running).
> + *   * A thread-local variable is duplicated across all threads in the
> + *     process, including unregistered non-EAL threads (i.e.,
> + *     "regular" threads). For DPDK applications heavily relying on
> + *     multi-threading (in conjunction to DPDK's "one thread per core"
> + *     pattern), either by having many concurrent threads or
> + *     creating/destroying threads at a high rate, an excessive use of
> + *     thread-local variables may cause inefficiencies (e.g.,
> + *     increased thread creation overhead due to thread-local storage
> + *     initialization or increased total RAM footprint usage). Lcore
> + *     variables *only* exist for threads with an lcore id, and thus
> + *     not for such "regular" threads.
> + *   * Whether data in thread-local storage may be shared between threads
> + *     (i.e., whether a pointer to a thread-local variable can be passed to
> + *     and successfully dereferenced by a non-owning thread) depends on
> + *     the details of the TLS implementation. With GCC __thread and
> + *     GCC _Thread_local, such data sharing is supported. In the C11
> + *     standard, the result of accessing another thread's
> + *     _Thread_local object is implementation-defined. Lcore variable
> + *     instances may be accessed reliably by any thread.
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stddef.h>
> +#include <stdalign.h>
> +
> +#include <rte_common.h>
> +#include <rte_config.h>
> +#include <rte_lcore.h>
> +
> +/**
> + * Given the lcore variable type, produces the type of the lcore
> + * variable handle.
> + */
> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)                \
> +       type *
> +
> +/**
> + * Define a lcore variable handle.
> + *
> + * This macro defines a variable which is used as a handle to access
> + * the various per-lcore id instances of a per-lcore id variable.
> + *
> + * The aim with this macro is to make clear at the point of
> + * declaration that this is an lcore variable handle, rather than a
> + * regular pointer.
> + *
> + * Add @b static as a prefix in case the lcore variable is only to be
> + * accessed from a particular translation unit.
> + */
> +#define RTE_LCORE_VAR_HANDLE(type, name)       \
> +       RTE_LCORE_VAR_HANDLE_TYPE(type) name
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)      \
> +       name = rte_lcore_var_alloc(size, align)
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle,
> + * with values aligned for any type of object.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)   \
> +       name = rte_lcore_var_alloc(size, 0)
> +
> +/**
> + * Allocate space for an lcore variable of the size and alignment requirements
> + * suggested by the handle pointer type, and initialize its handle.
> + */
> +#define RTE_LCORE_VAR_ALLOC(name)                                      \
> +       RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),           \
> +                                      alignof(typeof(*(name))))
> +
> +/**
> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
> + * means of a \ref RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)               \
> +       RTE_INIT(rte_lcore_var_init_ ## name)                           \
> +       {                                                               \
> +               RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);      \
> +       }
> +
> +/**
> + * Allocate an explicitly-sized lcore variable by means of a \ref
> + * RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)            \
> +       RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
> +
> +/**
> + * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT(name)                                       \
> +       RTE_INIT(rte_lcore_var_init_ ## name)                           \
> +       {                                                               \
> +               RTE_LCORE_VAR_ALLOC(name);                              \
> +       }
> +
> +#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)              \
> +       ((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
> +
> +/**
> + * Get pointer to lcore variable instance with the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)                                \
> +       ((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> +
> +/**
> + * Get value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)                \
> +       (*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
> +
> +/**
> + * Set the value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)         \
> +       (*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
> +
> +/**
> + * Get pointer to lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
> +
> +/**
> + * Get value of lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
> +
> +/**
> + * Set value of lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_SET(name, value) \
> +       RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
> +
> +/**
> + * Iterate over each lcore id's value for a lcore variable.
> + */
> +#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)                         \
> +       for (unsigned int lcore_id =                                    \
> +                    (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);   \
> +            lcore_id < RTE_MAX_LCORE;                                  \
> +            lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> +
> +extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
> +
> +/**
> + * Allocate space in the per-lcore id buffers for a lcore variable.
> + *
> + * The pointer returned is only an opaque identifier of the variable. To
> + * get an actual pointer to a particular instance of the variable use
> + * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
> + *
> + * The allocation is always successful, barring a fatal exhaustion of
> + * the per-lcore id buffer space.
> + *
> + * @param size
> + *   The size (in bytes) of the variable's per-lcore id value.
> + * @param align
> + *   If 0, the values will be suitably aligned for any kind of type
> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
> + *   on a multiple of *align*, which must be a power of 2 and equal or
> + *   less than \c RTE_CACHE_LINE_SIZE.
> + * @return
> + *   The id of the variable, stored in a void pointer value.
> + */
> +__rte_experimental
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_LCORE_VAR_H_ */
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 5e0cd47c82..e90b86115a 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -393,6 +393,10 @@ EXPERIMENTAL {
>         # added in 23.07
>         rte_memzone_max_get;
>         rte_memzone_max_set;
> +
> +       # added in 24.03
> +       rte_lcore_var_alloc;
> +       rte_lcore_var;
>  };
>
>  INTERNAL {
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-21  9:43           ` Jerin Jacob
@ 2024-02-21 10:31             ` Morten Brørup
  2024-02-21 14:26             ` Mattias Rönnblom
  1 sibling, 0 replies; 42+ messages in thread
From: Morten Brørup @ 2024-02-21 10:31 UTC (permalink / raw)
  To: Jerin Jacob, Mattias Rönnblom
  Cc: dev, hofors, Stephen Hemminger, Tomasz Duszynski

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Wednesday, 21 February 2024 10.44
> 
> On Tue, Feb 20, 2024 at 2:35 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
> >
> > Introduce DPDK per-lcore id variables, or lcore variables for short.
> >
> > An lcore variable has one value for every current and future lcore
> > id-equipped thread.
> >
> > The primary <rte_lcore_var.h> use case is for statically allocating
> > small chunks of often-used data, which is related logically, but where
> > there are performance benefits to reap from having updates being local
> > to an lcore.
> 
> I think, in order to quantify the gain, we must add a performance test
> case to measure the access cycles with lcore variables scheme vs this
> scheme.
> Other PMU counters (cache misses) may be interesting, but we don't have
> means in DPDK to do self monitoring now like
> https://patches.dpdk.org/project/dpdk/patch/20221213104350.3218167-1-tduszynski@marvell.com/
> 
> >
> > Lcore variables are similar to thread-local storage (TLS, e.g., C11
> > _Thread_local), but decouple the values' lifetime from that of the
> > threads.

Lcore variables can be accessed by other threads, unlike TLS variables.

If a TLS variable needs to be accessed by other threads, there must also be an RTE_MAX_LCORE-sized array of pointers to the TLS variable, where each worker thread must initialize the entry pointing to its TLS variable.
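
A minimal sketch of that workaround, in plain C11 with pthreads rather than DPDK's macros. All names are made up for illustration, and MAX_WORKERS stands in for RTE_MAX_LCORE. It assumes GCC/Clang behavior, where a pointer to another thread's _Thread_local object can be dereferenced while that thread is still alive:

```c
#define _POSIX_C_SOURCE 200809L /* for pthread barriers under strict -std=c11 */

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WORKERS 4 /* hypothetical stand-in for RTE_MAX_LCORE */

/* One instance per thread; other threads cannot name it directly. */
static _Thread_local uint64_t tls_counter;

/* Registry through which other threads reach each worker's instance. */
static uint64_t *counter_ptrs[MAX_WORKERS];

static pthread_barrier_t published; /* all pointers and values are written */
static pthread_barrier_t done;      /* reader is finished; workers may exit */

static void *
worker_main(void *arg)
{
	size_t worker_id = (size_t)(uintptr_t)arg;

	/* Every worker must publish the address of its own TLS instance. */
	counter_ptrs[worker_id] = &tls_counter;
	tls_counter = 100 * (worker_id + 1);

	pthread_barrier_wait(&published);
	/* Stay alive: the TLS instance ceases to exist when the thread exits. */
	pthread_barrier_wait(&done);

	return NULL;
}

/* Spawn the workers and sum their counters from the calling thread.
 * Error handling is omitted for brevity.
 */
uint64_t
sum_worker_counters(void)
{
	pthread_t threads[MAX_WORKERS];
	uint64_t sum = 0;
	size_t i;

	pthread_barrier_init(&published, NULL, MAX_WORKERS + 1);
	pthread_barrier_init(&done, NULL, MAX_WORKERS + 1);

	for (i = 0; i < MAX_WORKERS; i++)
		pthread_create(&threads[i], NULL, worker_main,
			       (void *)(uintptr_t)i);

	pthread_barrier_wait(&published);
	for (i = 0; i < MAX_WORKERS; i++)
		sum += *counter_ptrs[i]; /* cross-thread read via the registry */
	pthread_barrier_wait(&done);

	for (i = 0; i < MAX_WORKERS; i++)
		pthread_join(threads[i], NULL);

	pthread_barrier_destroy(&published);
	pthread_barrier_destroy(&done);

	return sum;
}
```

Note that the reader has to finish before the workers exit. With lcore variables, neither the pointer registry nor this lifetime coupling would be needed.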

> >
> > Lcore variables are also similar in terms of functionality provided by
> > FreeBSD kernel's DPCPU_*() family of macros and the associated
> > build-time machinery. DPCPU uses linker scripts, which effectively
> > prevents the reuse of its, otherwise seemingly viable, approach.
> >
> > The currently-prevailing way to solve the same problem as lcore
> > variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> > array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> > lcore variables over this approach is that data related to the same
> > lcore now is close (spatially, in memory), rather than data used by
> > the same module, which in turn avoids excessive use of padding
> > that pollutes caches with unused data.
> >

There are 3 ways to implement per-lcore variables:
1. Thread-local storage, available via RTE_DEFINE_PER_LCORE(type, name).
2. RTE_MAX_LCORE-sized arrays.
3. Lcore variables, as provided by this patch series.

Perhaps an overview of differences and performance numbers would help understand the benefits of this patch series.

The advantages of packing more variables into the same cache line may be hard to measure without PMU counters, and could perhaps be described or estimated instead.
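
As a rough illustration of such an estimate, the sketch below compares the per-lcore footprint of the two layouts. The 64-byte cache line, the single guard line, and the struct and function names are assumptions for the example, not taken from DPDK, and padding between differently-aligned lcore variables is ignored:

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size */

/* Per-lcore state in the RTE_MAX_LCORE-sized array pattern:
 * cache-line aligned, followed by one guard cache line.
 */
struct padded_state {
	alignas(CACHE_LINE) int a;
	long b;
	alignas(CACHE_LINE) char guard[CACHE_LINE];
};

/* The same state as an lcore variable would store it: no alignment
 * beyond the natural one, and no guard.
 */
struct packed_state {
	int a;
	long b;
};

/* Bytes occupied per lcore by n_modules modules, array pattern. */
size_t
padded_bytes(size_t n_modules)
{
	return n_modules * sizeof(struct padded_state);
}

/* Bytes occupied per lcore by n_modules modules, packed back to back. */
size_t
packed_bytes(size_t n_modules)
{
	return n_modules * sizeof(struct packed_state);
}

/* Number of cache lines needed to hold the given number of bytes. */
size_t
cache_lines(size_t bytes)
{
	return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}
```

On a typical LP64 platform, 8 such modules come to 1024 bytes (16 cache lines, half of them guards) per lcore in the array layout, against 128 bytes (2 cache lines) when packed.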


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-21  9:43           ` Jerin Jacob
  2024-02-21 10:31             ` Morten Brørup
@ 2024-02-21 14:26             ` Mattias Rönnblom
  1 sibling, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-21 14:26 UTC (permalink / raw)
  To: Jerin Jacob, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger, Tomasz Duszynski

On 2024-02-21 10:43, Jerin Jacob wrote:
> On Tue, Feb 20, 2024 at 2:35 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is related logically, but where
>> there are performance benefits to reap from having updates being local
>> to an lcore.
> 
> I think, in order to quantify the gain, we must add a performance test
> case to measure the acces cycles with lcore variables scheme vs this
> scheme.

As I might have mentioned elsewhere in the thread, the micro benchmarks 
are already there, in the form of the service and random perf tests.

The service perf tests don't show any difference, and the rand perf
tests seem to indicate lcore variables add one (1) core clock cycle per
rte_rand() call (measured on Raptor Lake E- and P-cores).

The effects on a real-world app would be highly dependent on which DPDK
services it's using that themselves use static per-lcore data, and
to what extent the app itself uses per-lcore data.

Provided lcore variables perform as well as the cache-aligned static
array pattern in micro benchmarks, lcore variables should always be
as good or better in a real-world app, because the cache working set
size will always be smaller (no padding).

That said, I don't think lcore variables will result in a noticeable
performance gain for the typical app. If you do see large gains, I
suspect it will be on systems with next-N-lines prefetchers where the
lcore data wasn't RTE_CACHE_GUARDed.
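
To make the "no padding" argument concrete, here is a stand-alone sketch of the addressing scheme. It is a simplification of the patch, not the patch itself: the names and the small constants are hypothetical stand-ins, the handle is a plain offset rather than a typed pointer, and exhaustion returns 0 instead of terminating the process.

```c
#include <stdalign.h>
#include <stddef.h>

#define MAX_LCORE 4        /* stand-in for RTE_MAX_LCORE */
#define MAX_LCORE_VAR 4096 /* stand-in for RTE_MAX_LCORE_VAR */
#define CACHE_LINE 64      /* assumed cache line size */

/* One buffer per lcore id; a variable lives at the same offset in each. */
static alignas(CACHE_LINE) char lcore_buf[MAX_LCORE][MAX_LCORE_VAR];

/* Next free offset. Offset 0 is skipped, as in the patch. */
static size_t allocated = 1;

/* Reserve 'size' bytes in every per-lcore buffer and return the
 * variable's offset (its "handle"), or 0 if the buffers are exhausted.
 * 'align' must be a power of 2.
 */
size_t
var_alloc(size_t size, size_t align)
{
	size_t offset = (allocated + align - 1) & ~(align - 1);

	if (offset + size > MAX_LCORE_VAR)
		return 0;

	allocated = offset + size;

	return offset;
}

/* Resolve a handle to one lcore id's instance: base plus offset, i.e.
 * the same kind of address computation as indexing a static array.
 */
void *
var_ptr(unsigned int lcore_id, size_t handle)
{
	return &lcore_buf[lcore_id][handle];
}
```

Two variables allocated one after the other end up adjacent within each lcore's buffer, with only alignment padding between them, while instances belonging to different lcore ids are MAX_LCORE_VAR bytes apart.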

> Other PMU counters(Cache misses) may be interesting but we dont have
> means in DPDK to do self monitoring now like
> https://patches.dpdk.org/project/dpdk/patch/20221213104350.3218167-1-tduszynski@marvell.com/
> 
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decouple the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality provided by
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoids excessive use of padding
>> that pollutes caches with unused data.
>>
>> RFC v3:
>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>
>> RFC v2:
>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>   * Allow user-specified alignment, but limit max to cache line size.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   config/rte_config.h                   |   1 +
>>   doc/api/doxy-api-index.md             |   1 +
>>   lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>>   lib/eal/common/meson.build            |   1 +
>>   lib/eal/include/meson.build           |   1 +
>>   lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>>   lib/eal/version.map                   |   4 +
>>   7 files changed, 465 insertions(+)
>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index da265d7dd2..884482e473 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -30,6 +30,7 @@
>>   /* EAL defines */
>>   #define RTE_CACHE_GUARD_LINES 1
>>   #define RTE_MAX_HEAPS 32
>> +#define RTE_MAX_LCORE_VAR 1048576
>>   #define RTE_MAX_MEMSEG_LISTS 128
>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index a6a768bd7c..bb06bb7ca1 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>>     [interrupts](@ref rte_interrupts.h),
>>     [launch](@ref rte_launch.h),
>>     [lcore](@ref rte_lcore.h),
>> +  [lcore-variable](@ref rte_lcore_var.h),
>>     [per-lcore](@ref rte_per_lcore.h),
>>     [service cores](@ref rte_service.h),
>>     [keepalive](@ref rte_keepalive.h),
>> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
>> new file mode 100644
>> index 0000000000..dfd11cbd0b
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,82 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define WARN_THRESHOLD 75
>> +
>> +/*
>> + * Avoid using offset zero, since it would result in a NULL-value
>> + * "handle" (offset) pointer, which in principle and per the API
>> + * definition shouldn't be an issue, but may confuse some tools and
>> + * users.
>> + */
>> +#define INITIAL_OFFSET 1
>> +
>> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
>> +
>> +static uintptr_t allocated = INITIAL_OFFSET;
>> +
>> +static void
>> +verify_allocation(uintptr_t new_allocated)
>> +{
>> +       static bool has_warned;
>> +
>> +       RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
>> +
>> +       if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
>> +           !has_warned) {
>> +               EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
>> +                       "of the maximum capacity (%d bytes)", WARN_THRESHOLD,
>> +                       RTE_MAX_LCORE_VAR);
>> +               has_warned = true;
>> +       }
>> +}
>> +
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +       uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
>> +
>> +       void *offset = (void *)new_allocated;
>> +
>> +       new_allocated += size;
>> +
>> +       verify_allocation(new_allocated);
>> +
>> +       allocated = new_allocated;
>> +
>> +       EAL_LOG(DEBUG, "Allocated %zu bytes of per-lcore data with a "
>> +               "%zu-byte alignment", size, align);
>> +
>> +       return offset;
>> +}
>> +
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align)
>> +{
>> +       /* Having the per-lcore buffer size aligned on cache lines, as
>> +        * well as having the base pointer aligned on cache line size,
>> +        * assures that aligned offsets also translate to aligned
>> +        * pointers across all values.
>> +        */
>> +       RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
>> +       RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
>> +
>> +       /* '0' means asking for worst-case alignment requirements */
>> +       if (align == 0)
>> +               align = alignof(max_align_t);
>> +
>> +       RTE_ASSERT(rte_is_power_of_2(align));
>> +
>> +       return lcore_var_alloc(size, align);
>> +}
>> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
>> index 22a626ba6f..d41403680b 100644
>> --- a/lib/eal/common/meson.build
>> +++ b/lib/eal/common/meson.build
>> @@ -18,6 +18,7 @@ sources += files(
>>           'eal_common_interrupts.c',
>>           'eal_common_launch.c',
>>           'eal_common_lcore.c',
>> +        'eal_common_lcore_var.c',
>>           'eal_common_mcfg.c',
>>           'eal_common_memalloc.c',
>>           'eal_common_memory.c',
>> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
>> index e94b056d46..9449253e23 100644
>> --- a/lib/eal/include/meson.build
>> +++ b/lib/eal/include/meson.build
>> @@ -27,6 +27,7 @@ headers += files(
>>           'rte_keepalive.h',
>>           'rte_launch.h',
>>           'rte_lcore.h',
>> +        'rte_lcore_var.h',
>>           'rte_lock_annotations.h',
>>           'rte_malloc.h',
>>           'rte_mcslock.h',
>> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
>> new file mode 100644
>> index 0000000000..da49d48d7c
>> --- /dev/null
>> +++ b/lib/eal/include/rte_lcore_var.h
>> @@ -0,0 +1,375 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#ifndef _RTE_LCORE_VAR_H_
>> +#define _RTE_LCORE_VAR_H_
>> +
>> +/**
>> + * @file
>> + *
>> + * RTE Per-lcore id variables
>> + *
>> + * This API provides a mechanism to create and access per-lcore id
>> + * variables in a space- and cycle-efficient manner.
>> + *
>> + * A per-lcore id variable (or lcore variable for short) has one value
>> + * for each EAL thread and registered non-EAL thread. In other words,
>> + * there's one copy of its value for each and every current and future
>> + * lcore id-equipped thread, with the total number of copies amounting
>> + * to \c RTE_MAX_LCORE.
>> + *
>> + * In order to access the values of an lcore variable, a handle is
>> + * used. The type of the handle is a pointer to the value's type
>> + * (e.g., for a \c uint32_t lcore variable, the handle is a
>> + * <code>uint32_t *</code>). A handle may be passed between modules and
>> + * threads just like any pointer, but its value is not the address of
>> + * any particular object, but rather just an opaque identifier, stored
>> + * in a typed pointer (to inform the access macros of the value type).
>> + *
>> + * @b Creation
>> + *
>> + * An lcore variable is created in two steps:
>> + *  1. Define an lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
>> + *  2. Allocate lcore variable storage and initialize the handle with
>> + *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
>> + *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
>> + *     of module initialization, but may be done at any time.
>> + *
>> + * An lcore variable is not tied to the owning thread's lifetime. It's
>> + * available for use by any thread immediately after having been
>> + * allocated, and continues to be available throughout the lifetime of
>> + * the EAL.
>> + *
>> + * Lcore variables cannot and need not be freed.
>> + *
>> + * @b Access
>> + *
>> + * The value of any lcore variable for any lcore id may be accessed
>> + * from any thread (including unregistered threads), but it should
>> + * generally only be *frequently* read from or written to by the owner.
>> + *
>> + * Values of the same lcore variable but owned by different lcore
>> + * ids *may* be frequently read or written by the owners without the
>> + * risk of false sharing.
>> + *
>> + * An appropriate synchronization mechanism (e.g., atomics) should be
>> + * employed to assure there are no data races between the owning
>> + * thread and any non-owner threads accessing the same lcore variable
>> + * instance.
>> + *
>> + * The value of the lcore variable for a particular lcore id may be
>> + * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
>> + * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
>> + *
>> + * To modify the value of an lcore variable for a particular lcore id,
>> + * either access the object through the pointer retrieved by \ref
>> + * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
>> + * RTE_LCORE_VAR_LCORE_SET.
>> + *
>> + * The access macros each have a short-hand which may be used by an EAL
>> + * thread or registered non-EAL thread to access the lcore variable
>> + * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
>> + * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
>> + *
>> + * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
>> + * pointer with the same type as the value, it may not be directly
>> + * dereferenced and must be treated as an opaque identifier. The
>> + * *identifier* value is common across all lcore ids.
>> + *
>> + * @b Storage
>> + *
>> + * An lcore variable's values may be of a primitive type like \c int,
>> + * but would more typically be a \c struct. An application may choose
>> + * to define an lcore variable, which it then goes on to never
>> + * allocate.
>> + *
>> + * The lcore variable handle introduces a per-variable (not
>> + * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
>> + * there are some memory footprint gains to be made by organizing all
>> + * per-lcore id data for a particular module as one lcore variable
>> + * (e.g., as a struct).
>> + *
>> + * The sum of all lcore variables, plus any padding required, must be
>> + * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
>> + * violation of this maximum results in the process being terminated.
>> + *
>> + * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
>> + * same order of magnitude in size as a thread stack.
>> + *
>> + * The lcore variable storage buffers are kept in the BSS section in
>> + * the resulting binary, where data generally isn't mapped in until
>> + * it's accessed. This means that unused portions of the lcore
>> + * variable storage area will not occupy any physical memory (with a
>> + * granularity of the memory page size [usually 4 kB]).
>> + *
>> + * Lcore variables should generally *not* be \ref __rte_cache_aligned
>> + * and need *not* include a \ref RTE_CACHE_GUARD field, since these
>> + * constructs are designed to avoid false sharing. In the
>> + * case of an lcore variable instance, all nearby data structures
>> + * should almost-always be written to by a single thread (the lcore
>> + * variable owner). Adding padding will increase the effective memory
>> + * working set size, potentially reducing performance.
>> + *
>> + * @b Example
>> + *
>> + * Below is an example of the use of an lcore variable:
>> + *
>> + * \code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + * };
>> + *
>> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
>> + *
>> + * long foo_get_a_plus_b(void)
>> + * {
>> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
>> + *
>> + *         return state->a + state->b;
>> + * }
>> + *
>> + * RTE_INIT(rte_foo_init)
>> + * {
>> + *         struct foo_lcore_state *state;
>> + *
>> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
>> + *
>> + *         // The FOREACH macro declares its own 'lcore_id' loop variable.
>> + *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
>> + *                 (initialize 'state')
>> + *         }
>> + *
>> + *         (other initialization)
>> + * }
>> + * \endcode
>> + *
>> + *
>> + * @b Alternatives
>> + *
>> + * Lcore variables are designed to replace a pattern exemplified below:
>> + * \code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + *         RTE_CACHE_GUARD;
>> + * } __rte_cache_aligned;
>> + *
>> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
>> + * \endcode
>> + *
>> + * This scheme is simple and effective, but has one drawback: the data
>> + * is organized so that objects related to all lcores for a particular
>> + * module are kept close in memory. At a bare minimum, this forces the
>> + * use of cache-line alignment to avoid false sharing. With CPU
>> + * hardware prefetching and memory loads resulting from speculative
>> + * execution (functions which seemingly are getting more eager faster
>> + * than they are getting more intelligent), one or more "guard" cache
>> + * lines may be required to separate one lcore's data from another's.
>> + *
>> + * Lcore variables have the upside of working with, not against, the
>> + * CPU's assumptions and for example next-line prefetchers may well
>> + * work the way their designers intended (i.e., to the benefit, not
>> + * detriment, of system performance).
>> + *
>> + * Another alternative to \ref rte_lcore_var.h is the \ref
>> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
>> + * e.g., GCC __thread or C11 _Thread_local). The main differences
>> + * between using the various forms of TLS (e.g., \ref
>> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
>> + * variables are:
>> + *
>> + *   * The existence and non-existence of a thread-local variable
>> + *     instance follow that of a particular thread. The data cannot be
>> + *     accessed before the thread has been created, nor after it has
>> + *     exited. One effect of this is thread-local variables must be
>> + *     initialized in a "lazy" manner (e.g., at the point of thread
>> + *     creation). Lcore variables may be accessed immediately after
>> + *     having been allocated (which is usually before any thread
>> + *     beyond the main thread is running).
>> + *   * A thread-local variable is duplicated across all threads in the
>> + *     process, including unregistered non-EAL threads (i.e.,
>> + *     "regular" threads). For DPDK applications heavily relying on
>> + *     multi-threading (in conjunction with DPDK's "one thread per core"
>> + *     pattern), either by having many concurrent threads or
>> + *     creating/destroying threads at a high rate, an excessive use of
>> + *     thread-local variables may cause inefficiencies (e.g.,
>> + *     increased thread creation overhead due to thread-local storage
>> + *     initialization or increased total RAM footprint usage). Lcore
>> + *     variables *only* exist for threads with an lcore id, and thus
>> + *     not for such "regular" threads.
>> + *   * Whether data in thread-local storage may be shared between
>> + *     threads (i.e., whether a pointer to a thread-local variable can
>> + *     be passed to and successfully dereferenced by a non-owning
>> + *     thread) depends on the details of the TLS implementation. With
>> + *     GCC __thread and GCC _Thread_local, such data sharing is
>> + *     supported. In the C11 standard, the result of accessing another
>> + *     thread's _Thread_local object is implementation-defined. Lcore
>> + *     variable instances may be accessed reliably by any thread.
>> + */
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +#include <stddef.h>
>> +#include <stdalign.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_config.h>
>> +#include <rte_lcore.h>
>> +
>> +/**
>> + * Given the lcore variable type, produces the type of the lcore
>> + * variable handle.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)                \
>> +       type *
>> +
>> +/**
>> + * Define a lcore variable handle.
>> + *
>> + * This macro defines a variable which is used as a handle to access
>> + * the various per-lcore id instances of a per-lcore id variable.
>> + *
>> + * The aim with this macro is to make clear at the point of
>> + * declaration that this is an lcore variable handle, rather than a
>> + * regular pointer.
>> + *
>> + * Add @b static as a prefix in case the lcore variable is only to be
>> + * accessed from a particular translation unit.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE(type, name)       \
>> +       RTE_LCORE_VAR_HANDLE_TYPE(type) name
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)      \
>> +       name = rte_lcore_var_alloc(size, align)
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle,
>> + * with values aligned for any type of object.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)   \
>> +       name = rte_lcore_var_alloc(size, 0)
>> +
>> +/**
>> + * Allocate space for an lcore variable of the size and alignment requirements
>> + * suggested by the handle pointer type, and initialize its handle.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC(name)                                      \
>> +       RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),           \
>> +                                      alignof(typeof(*(name))))
>> +
>> +/**
>> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
>> + * means of a \ref RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)               \
>> +       RTE_INIT(rte_lcore_var_init_ ## name)                           \
>> +       {                                                               \
>> +               RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);      \
>> +       }
>> +
>> +/**
>> + * Allocate an explicitly-sized lcore variable by means of a \ref
>> + * RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)            \
>> +       RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
>> +
>> +/**
>> + * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT(name)                                       \
>> +       RTE_INIT(rte_lcore_var_init_ ## name)                           \
>> +       {                                                               \
>> +               RTE_LCORE_VAR_ALLOC(name);                              \
>> +       }
>> +
>> +#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)              \
>> +       ((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
>> +
>> +/**
>> + * Get pointer to lcore variable instance with the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)                                \
>> +       ((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>> +
>> +/**
>> + * Get value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)                \
>> +       (*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
>> +
>> +/**
>> + * Set the value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)         \
>> +       (*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
>> +
>> +/**
>> + * Get pointer to lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
>> +
>> +/**
>> + * Get value of lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
>> +
>> +/**
>> + * Set value of lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_SET(name, value) \
>> +       RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
>> +
>> +/**
>> + * Iterate over each lcore id's value for a lcore variable.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)                         \
>> +       for (unsigned int lcore_id =                                    \
>> +                    (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);   \
>> +            lcore_id < RTE_MAX_LCORE;                                  \
>> +            lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>> +
>> +extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
>> +
>> +/**
>> + * Allocate space in the per-lcore id buffers for a lcore variable.
>> + *
>> + * The pointer returned is only an opaque identifier of the variable. To
>> + * get an actual pointer to a particular instance of the variable use
>> + * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
>> + *
>> + * The allocation is always successful, barring a fatal exhaustion of
>> + * the per-lcore id buffer space.
>> + *
>> + * @param size
>> + *   The size (in bytes) of the variable's per-lcore id value.
>> + * @param align
>> + *   If 0, the values will be suitably aligned for any kind of type
>> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
>> + *   on a multiple of *align*, which must be a power of 2 and equal or
>> + *   less than \c RTE_CACHE_LINE_SIZE.
>> + * @return
>> + *   The id of the variable, stored in a void pointer value.
>> + */
>> +__rte_experimental
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align);
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* _RTE_LCORE_VAR_H_ */
>> diff --git a/lib/eal/version.map b/lib/eal/version.map
>> index 5e0cd47c82..e90b86115a 100644
>> --- a/lib/eal/version.map
>> +++ b/lib/eal/version.map
>> @@ -393,6 +393,10 @@ EXPERIMENTAL {
>>          # added in 23.07
>>          rte_memzone_max_get;
>>          rte_memzone_max_set;
>> +
>> +       # added in 24.03
>> +       rte_lcore_var_alloc;
>> +       rte_lcore_var;
>>   };
>>
>>   INTERNAL {
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-20  9:11           ` Bruce Richardson
  2024-02-21  9:43           ` Jerin Jacob
@ 2024-02-22  9:22           ` Morten Brørup
  2024-02-23 10:12             ` Mattias Rönnblom
  2 siblings, 1 reply; 42+ messages in thread
From: Morten Brørup @ 2024-02-22  9:22 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Tuesday, 20 February 2024 09.49
> 
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small chunks of often-used data, which is related logically, but where
> there are performance benefits to reap from having updates being local
> to an lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decouple the values' lifetime from that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoid excessive use of padding,
> polluting caches with unused data.
> 
> RFC v3:
>  * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>  * Update example to reflect FOREACH macro name change (in RFC v2).
> 
> RFC v2:
>  * Use alignof to derive alignment requirements. (Morten Brørup)
>  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>  * Allow user-specified alignment, but limit max to cache line size.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  config/rte_config.h                   |   1 +
>  doc/api/doxy-api-index.md             |   1 +
>  lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>  lib/eal/common/meson.build            |   1 +
>  lib/eal/include/meson.build           |   1 +
>  lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>  lib/eal/version.map                   |   4 +
>  7 files changed, 465 insertions(+)
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
> 
> diff --git a/config/rte_config.h b/config/rte_config.h
> index da265d7dd2..884482e473 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -30,6 +30,7 @@
>  /* EAL defines */
>  #define RTE_CACHE_GUARD_LINES 1
>  #define RTE_MAX_HEAPS 32
> +#define RTE_MAX_LCORE_VAR 1048576
>  #define RTE_MAX_MEMSEG_LISTS 128
>  #define RTE_MAX_MEMSEG_PER_LIST 8192
>  #define RTE_MAX_MEM_MB_PER_LIST 32768
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index a6a768bd7c..bb06bb7ca1 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>    [interrupts](@ref rte_interrupts.h),
>    [launch](@ref rte_launch.h),
>    [lcore](@ref rte_lcore.h),
> +  [lcore-varible](@ref rte_lcore_var.h),
>    [per-lcore](@ref rte_per_lcore.h),
>    [service cores](@ref rte_service.h),
>    [keepalive](@ref rte_keepalive.h),
> diff --git a/lib/eal/common/eal_common_lcore_var.c
> b/lib/eal/common/eal_common_lcore_var.c
> new file mode 100644
> index 0000000000..dfd11cbd0b
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,82 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_log.h>
> +
> +#include <rte_lcore_var.h>
> +
> +#include "eal_private.h"
> +
> +#define WARN_THRESHOLD 75

Crossing this threshold is not an error condition, so 75 % seems low for a WARNING.
Consider raising it to 95 %, lowering the log level to NOTICE, or both.

> +
> +/*
> + * Avoid using offset zero, since it would result in a NULL-value
> + * "handle" (offset) pointer, which in principle and per the API
> + * definition shouldn't be an issue, but may confuse some tools and
> + * users.
> + */
> +#define INITIAL_OFFSET 1
> +
> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
> +
> +static uintptr_t allocated = INITIAL_OFFSET;

Please add an API to get the amount of allocated lcore variable memory.
The easy option is to make the above variable public (with a proper name, e.g. rte_lcore_var_allocated).

The total amount of lcore variable memory is already public: RTE_MAX_LCORE_VAR.
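
A minimal sketch of what such an accessor could look like, written as a standalone mock rather than against EAL. The names lcore_var_alloc_sketch(), lcore_var_allocated_sketch() and MAX_LCORE_VAR_SKETCH are hypothetical; only the bump-allocation arithmetic is taken from lcore_var_alloc() in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for RTE_MAX_LCORE_VAR. */
#define MAX_LCORE_VAR_SKETCH 1048576

/* Offset 0 is skipped, as with INITIAL_OFFSET in the patch. */
static uintptr_t allocated_sketch = 1;

/* Round v up to the next multiple of align (which must be a power of 2). */
static uintptr_t
align_ceil_sketch(uintptr_t v, size_t align)
{
	return (v + align - 1) & ~((uintptr_t)align - 1);
}

/* Bump allocator mirroring lcore_var_alloc(); returns the offset. */
static uintptr_t
lcore_var_alloc_sketch(size_t size, size_t align)
{
	uintptr_t offset = align_ceil_sketch(allocated_sketch, align);

	/* The patch uses RTE_VERIFY() here; assert() stands in for it. */
	assert(offset + size < MAX_LCORE_VAR_SKETCH);
	allocated_sketch = offset + size;

	return offset;
}

/* Hypothetical getter: bytes consumed so far, alignment padding included. */
static size_t
lcore_var_allocated_sketch(void)
{
	return allocated_sketch;
}
```

An application could then compare the returned value against RTE_MAX_LCORE_VAR to see how much headroom is left, without EAL having to pick a warning threshold on its behalf.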

> +
> +static void
> +verify_allocation(uintptr_t new_allocated)
> +{
> +	static bool has_warned;
> +
> +	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
> +
> +	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
> +	    !has_warned) {
> +		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
> +			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
> +			RTE_MAX_LCORE_VAR);
> +		has_warned = true;
> +	}
> +}
> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
> +
> +	void *offset = (void *)new_allocated;
> +
> +	new_allocated += size;
> +
> +	verify_allocation(new_allocated);
> +
> +	allocated = new_allocated;
> +
> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> +		"%"PRIuPTR"-byte alignment", size, align);
> +
> +	return offset;
> +}
> +
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +{
> +	/* Having the per-lcore buffer size aligned on cache lines
> +	 * assures as well as having the base pointer aligned on cache
> +	 * size assures that aligned offsets also translate to aligned
> +	 * pointers across all values.
> +	 */
> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> +
> +	/* '0' means asking for worst-case alignment requirements */
> +	if (align == 0)
> +		align = alignof(max_align_t);
> +
> +	RTE_ASSERT(rte_is_power_of_2(align));
> +
> +	return lcore_var_alloc(size, align);
> +}
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 22a626ba6f..d41403680b 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -18,6 +18,7 @@ sources += files(
>          'eal_common_interrupts.c',
>          'eal_common_launch.c',
>          'eal_common_lcore.c',
> +        'eal_common_lcore_var.c',
>          'eal_common_mcfg.c',
>          'eal_common_memalloc.c',
>          'eal_common_memory.c',
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..9449253e23 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -27,6 +27,7 @@ headers += files(
>          'rte_keepalive.h',
>          'rte_launch.h',
>          'rte_lcore.h',
> +        'rte_lcore_var.h',
>          'rte_lock_annotations.h',
>          'rte_malloc.h',
>          'rte_mcslock.h',
> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
> new file mode 100644
> index 0000000000..da49d48d7c
> --- /dev/null
> +++ b/lib/eal/include/rte_lcore_var.h
> @@ -0,0 +1,375 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#ifndef _RTE_LCORE_VAR_H_
> +#define _RTE_LCORE_VAR_H_
> +
> +/**
> + * @file
> + *
> + * RTE Per-lcore id variables
> + *
> + * This API provides a mechanism to create and access per-lcore id
> + * variables in a space- and cycle-efficient manner.
> + *
> + * A per-lcore id variable (or lcore variable for short) has one value
> + * for each EAL thread and registered non-EAL thread. In other words,
> + * there's one copy of its value for each and every current and future
> + * lcore id-equipped thread, with the total number of copies amounting
> + * to \c RTE_MAX_LCORE.
> + *
> + * In order to access the values of an lcore variable, a handle is
> + * used. The type of the handle is a pointer to the value's type
> + * (e.g., for \c uint32_t lcore variable, the handle is a
> + * <code>uint32_t *</code>. A handle may be passed between modules and
> + * threads just like any pointer, but its value is not the address of
> + * any particular object, but rather just an opaque identifier, stored
> + * in a typed pointer (to inform the access macro the type of values).
> + *
> + * @b Creation
> + *
> + * An lcore variable is created in two steps:
> + *  1. Define a lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
> + *  2. Allocate lcore variable storage and initialize the handle with
> + *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
> + *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs the time of
> + *     module initialization, but may be done at any time.
> + *
> + * An lcore variable is not tied to the owning thread's lifetime. It's
> + * available for use by any thread immediately after having been
> + * allocated, and continues to be available throughout the lifetime of
> + * the EAL.
> + *
> + * Lcore variables cannot and need not be freed.
> + *
> + * @b Access
> + *
> + * The value of any lcore variable for any lcore id may be accessed
> + * from any thread (including unregistered threads), but is should
> + * generally only *frequently* read from or written to by the owner.
> + *
> + * Values of the same lcore variable but owned by to different lcore
> + * ids *may* be frequently read or written by the owners without the
> + * risk of false sharing.
> + *
> + * An appropriate synchronization mechanism (e.g., atomics) should
> + * employed to assure there are no data races between the owning
> + * thread and any non-owner threads accessing the same lcore variable
> + * instance.
> + *
> + * The value of the lcore variable for a particular lcore id may be
> + * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
> + * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
> + *
> + * To modify the value of an lcore variable for a particular lcore id,
> + * either access the object through the pointer retrieved by \ref
> + * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
> + * RTE_LCORE_VAR_LCORE_SET.
> + *
> + * The access macros each has a short-hand which may be used by an EAL
> + * thread or registered non-EAL thread to access the lcore variable
> + * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
> + * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
> + *
> + * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
> + * pointer with the same type as the value, it may not be directly
> + * dereferenced and must be treated as an opaque identifier. The
> + * *identifier* value is common across all lcore ids.
> + *
> + * @b Storage
> + *
> + * An lcore variable's values may by of a primitive type like \c int,
> + * but would more typically be a \c struct. An application may choose
> + * to define an lcore variable, which it then it goes on to never
> + * allocate.
> + *
> + * The lcore variable handle introduces a per-variable (not
> + * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
> + * there are some memory footprint gains to be made by organizing all
> + * per-lcore id data for a particular module as one lcore variable
> + * (e.g., as a struct).
> + *
> + * The sum of all lcore variables, plus any padding required, must be
> + * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
> + * violation of this maximum results in the process being terminated.
> + *
> + * It's reasonable to expected that \c RTE_MAX_LCORE_VAR is on the
> + * same order of magnitude in size as a thread stack.
> + *
> + * The lcore variable storage buffers are kept in the BSS section in
> + * the resulting binary, where data generally isn't mapped in until
> + * it's accessed. This means that unused portions of the lcore
> + * variable storage area will not occupy any physical memory (with a
> + * granularity of the memory page size [usually 4 kB]).
> + *
> + * Lcore variables should generally *not* be \ref __rte_cache_aligned
> + * and need *not* include a \ref RTE_CACHE_GUARD field, since the use
> + * of these constructs are designed to avoid false sharing. In the
> + * case of an lcore variable instance, all nearby data structures
> + * should almost-always be written to by a single thread (the lcore
> + * variable owner). Adding padding will increase the effective memory
> + * working set size, and potentially reducing performance.
> + *
> + * @b Example
> + *
> + * Below is an example of the use of an lcore variable:
> + *
> + * \code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + * };
> + *
> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
> + *
> + * long foo_get_a_plus_b(void)
> + * {
> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
> + *
> + *         return state->a + state->b;
> + * }
> + *
> + * RTE_INIT(rte_foo_init)
> + * {
> + *         unsigned int lcore_id;

RTE_LCORE_VAR_FOREACH_VALUE() declares lcore_id itself, so this declaration can be removed from the example.

> + *
> + *         RTE_LCORE_VAR_ALLOC(foo_state);

Typo: foo_state -> lcore_states

> + *
> + *         struct foo_lcore_state *state;
> + *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_states) {

Typo:
RTE_LCORE_VAR_FOREACH_VALUE(lcore_states)
->
RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states)
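
With the fixes to the example noted above applied, the usage pattern would read roughly as follows. This is a self-contained mock, not the EAL implementation: the MOCK_* names, the tiny buffer sizes and mock_lcore_var_alloc() are made up for illustration, __typeof__ and __attribute__((aligned)) assume GCC or Clang, and the loop macro forms the pointer only after the bounds check, which differs slightly from the macro in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, far smaller than the real EAL constants. */
#define MOCK_MAX_LCORE 4
#define MOCK_MAX_LCORE_VAR 64

/* Static char array, mirroring the storage used in the patch. */
static char mock_lcore_var[MOCK_MAX_LCORE][MOCK_MAX_LCORE_VAR]
	__attribute__((aligned(64)));
static uintptr_t mock_allocated = 1; /* offset 0 is never handed out */

/* Bump allocation; the returned "pointer" is only an offset in disguise. */
static void *
mock_lcore_var_alloc(size_t size, size_t align)
{
	uintptr_t offset = (mock_allocated + align - 1) & ~((uintptr_t)align - 1);

	assert(offset + size < MOCK_MAX_LCORE_VAR);
	mock_allocated = offset + size;

	return (void *)offset;
}

#define MOCK_ALLOC(name) \
	((name) = mock_lcore_var_alloc(sizeof(*(name)), \
				       _Alignof(__typeof__(*(name)))))

#define MOCK_LCORE_PTR(lcore_id, name) \
	((__typeof__(name))(void *)&mock_lcore_var[lcore_id][(uintptr_t)(name)])

/* Declares lcore_id itself, so callers must not declare it again. */
#define MOCK_FOREACH_VALUE(var, name) \
	for (unsigned int lcore_id = 0; \
	     lcore_id < MOCK_MAX_LCORE && \
	     ((var) = MOCK_LCORE_PTR(lcore_id, name), 1); \
	     lcore_id++)

struct foo_lcore_state {
	int a;
	long b;
};

static struct foo_lcore_state *lcore_states;

/* Plays the role of RTE_INIT(rte_foo_init) in the documentation example. */
static void
foo_init(void)
{
	struct foo_lcore_state *state;

	MOCK_ALLOC(lcore_states);

	MOCK_FOREACH_VALUE(state, lcore_states) {
		state->a = (int)lcore_id;
		state->b = 10;
	}
}

/* Takes an explicit lcore id, since the mock has no rte_lcore_id(). */
static long
foo_get_a_plus_b(unsigned int lcore_id)
{
	struct foo_lcore_state *state = MOCK_LCORE_PTR(lcore_id, lcore_states);

	return state->a + state->b;
}
```

Note that the loop body can use lcore_id without any declaration in foo_init(), which is why the separate declaration in the documentation example is redundant.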

> + *                 (initialize 'state')
> + *         }
> + *
> + *         (other initialization)
> + * }
> + * \endcode
> + *
> + *
> + * @b Alternatives
> + *
> + * Lcore variables are designed to replace a pattern exemplified below:
> + * \code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + *         RTE_CACHE_GUARD;
> + * } __rte_cache_aligned;
> + *
> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
> + * \endcode
> + *
> + * This scheme is simple and effective, but has one drawback: the data
> + * is organized so that objects related to all lcores for a particular
> + * module is kept close in memory. At a bare minimum, this forces the
> + * use of cache-line alignment to avoid false sharing. With CPU
> + * hardware prefetching and memory loads resulting from speculative
> + * execution (functions which seemingly are getting more eager faster
> + * than they are getting more intelligent), one or more "guard" cache
> + * lines may be required to separate one lcore's data from another's.
> + *
> + * Lcore variables has the upside of working with, not against, the
> + * CPU's assumptions and for example next-line prefetchers may well
> + * work the way its designers intended (i.e., to the benefit, not
> + * detriment, of system performance).
> + *
> + * Another alternative to \ref rte_lcore_var.h is the \ref
> + * rte_per_lcore.h API, which make use of thread-local storage (TLS,
> + * e.g., GCC __thread or C11 _Thread_local). The main differences
> + * between by using the various forms of TLS (e.g., \ref
> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
> + * variables are:
> + *
> + *   * The existence and non-existence of a thread-local variable
> + *     instance follow that of particular thread's. The data cannot be
> + *     accessed before the thread has been created, nor after it has
> + *     exited. One effect of this is thread-local variables must
> + *     initialized in a "lazy" manner (e.g., at the point of thread
> + *     creation). Lcore variables may be accessed immediately after
> + *     having been allocated (which is usually prior any thread beyond
> + *     the main thread is running).
> + *   * A thread-local variable is duplicated across all threads in the
> + *     process, including unregistered non-EAL threads (i.e.,
> + *     "regular" threads). For DPDK applications heavily relying on
> + *     multi-threading (in conjunction to DPDK's "one thread per core"
> + *     pattern), either by having many concurrent threads or
> + *     creating/destroying threads at a high rate, an excessive use of
> + *     thread-local variables may cause inefficiencies (e.g.,
> + *     increased thread creation overhead due to thread-local storage
> + *     initialization or increased total RAM footprint usage). Lcore
> + *     variables *only* exist for threads with an lcore id, and thus
> + *     not for such "regular" threads.
> + *   * If data in thread-local storage may be shared between threads
> + *     (i.e., can a pointer to a thread-local variable be passed to
> + *     and successfully dereferenced by non-owning thread) depends on
> + *     the details of the TLS implementation. With GCC __thread and
> + *     GCC _Thread_local, such data sharing is supported. In the C11
> + *     standard, the result of accessing another thread's
> + *     _Thread_local object is implementation-defined. Lcore variable
> + *     instances may be accessed reliably by any thread.
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stddef.h>
> +#include <stdalign.h>
> +
> +#include <rte_common.h>
> +#include <rte_config.h>
> +#include <rte_lcore.h>
> +
> +/**
> + * Given the lcore variable type, produces the type of the lcore
> + * variable handle.
> + */
> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
> +	type *

This macro seems superfluous.
In RTE_LCORE_VAR_HANDLE(type, name) just use:
 type * name
Are there other use cases for it?

> +
> +/**
> + * Define a lcore variable handle.
> + *
> + * This macro defines a variable which is used as a handle to access
> + * the various per-lcore id instances of a per-lcore id variable.
> + *
> + * The aim with this macro is to make clear at the point of
> + * declaration that this is an lcore handler, rather than a regular
> + * pointer.
> + *
> + * Add @b static as a prefix in case the lcore variable are only to be
> + * accessed from a particular translation unit.
> + */
> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name

Thinking out loud here...
Consider whether this name should be more similar to RTE_DEFINE_PER_LCORE(type, name), e.g. RTE_DEFINE_LCORE_VAR(type, name) or RTE_LCORE_VAR_DEFINE(type, name).
Using the common prefix RTE_LCORE_VAR is preferable.
Using the term "handle" indicates that it is opaque and needs to be allocated by an allocation function.
On the other hand, the "handle" is not unique per thread, so it's not really a "handle".

> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)	\
> +	name = rte_lcore_var_alloc(size, align)
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle,
> + * with values aligned for any type of object.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
> +	name = rte_lcore_var_alloc(size, 0)
> +
> +/**
> + * Allocate space for an lcore variable of the size and alignment
> requirements
> + * suggested by the handler pointer type, and initialize its handle.
> + */
> +#define RTE_LCORE_VAR_ALLOC(name)					\
> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),		\
> +				       alignof(typeof(*(name))))
> +
> +/**
> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
> + * means of a \ref RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
> +	}
> +
> +/**
> + * Allocate an explicitly-sized lcore variable by means of a \ref
> + * RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
> +
> +/**
> + * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT(name)					\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC(name);				\
> +	}
> +
> +#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
> +	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))

This macro also seems superfluous.
Doesn't RTE_LCORE_VAR_LCORE_PTR() suffice?

> +
> +/**
> + * Get pointer to lcore variable instance with the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
> +	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))

This uses type casting.
I wonder if additional build-time type checking would be possible...
Nice to have: The compiler should fail if name is not a pointer but, e.g., a struct, a uint64_t, or even a uintptr_t.
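
One possible way to get such a check, as a rough sketch only: dereference the handle inside an unevaluated sizeof, which compiles only for pointer (or array) operands. The MOCK_* names and the small buffer are hypothetical stand-ins for the EAL definitions, and __typeof__ and __attribute__((aligned)) assume GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mock of the per-lcore buffers, far smaller than in EAL. */
#define MOCK_MAX_LCORE 4
#define MOCK_MAX_LCORE_VAR 64

static char mock_lcore_var[MOCK_MAX_LCORE][MOCK_MAX_LCORE_VAR]
	__attribute__((aligned(64)));

/*
 * sizeof(*(name)) is an unevaluated operand, so it costs nothing at run
 * time, but the unary '*' is a compile error for struct and integer operands.
 */
#define MOCK_REQUIRE_DEREFERENCEABLE(name) ((void)sizeof(*(name)))

/* Same address computation as in the patch, preceded by the check. */
#define MOCK_CHECKED_LCORE_PTR(lcore_id, name) \
	(MOCK_REQUIRE_DEREFERENCEABLE(name), \
	 (__typeof__(name))(void *)&mock_lcore_var[lcore_id][(uintptr_t)(name)])

/* Usage: a handle is a typed pointer whose value is just an offset. */
static uint32_t
mock_store_and_load(unsigned int lcore_id, uint32_t *handle, uint32_t value)
{
	assert(lcore_id < MOCK_MAX_LCORE);

	*MOCK_CHECKED_LCORE_PTR(lcore_id, handle) = value;

	return *MOCK_CHECKED_LCORE_PTR(lcore_id, handle);
}

/*
 * The following would be rejected at build time, since '*' cannot be
 * applied to an integer:
 *
 *	uint64_t not_a_handle = 8;
 *	MOCK_CHECKED_LCORE_PTR(0, not_a_handle);
 */
```

An array passed as name would get past the sizeof, but is still rejected by the cast to __typeof__(name), since C does not allow casting to an array type.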

> +
> +/**
> + * Get value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))

The four accessor macros, RTE_LCORE_VAR[_LCORE]_GET/SET(), seem superfluous.
They make the API seem more complex than just using RTE_LCORE_VAR[_LCORE]_PTR() for access.

> +
> +/**
> + * Set the value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
> +
> +/**
> + * Get pointer to lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
> +
> +/**
> + * Get value of lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
> +
> +/**
> + * Set value of lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_SET(name, value) \
> +	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
> +
> +/**
> + * Iterate over each lcore id's value for a lcore variable.
> + */
> +#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)				\
> +	for (unsigned int lcore_id =					\
> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
> +	     lcore_id < RTE_MAX_LCORE;					\
> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))

RTE_LCORE_VAR_FOREACH_PTR(ptr, name) would be an even better name, considering that "var" is really a pointer.

I also wonder about build-time type checking here...
Nice to have: The compiler should fail if "ptr" is not a pointer.

> +
> +extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
> +
> +/**
> + * Allocate space in the per-lcore id buffers for a lcore variable.
> + *
> + * The pointer returned is only an opaque identifer of the variable. To
> + * get an actual pointer to a particular instance of the variable use
> + * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
> + *
> + * The allocation is always successful, barring a fatal exhaustion of
> + * the per-lcore id buffer space.
> + *
> + * @param size
> + *   The size (in bytes) of the variable's per-lcore id value.
> + * @param align
> + *   If 0, the values will be suitably aligned for any kind of type
> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
> + *   on a multiple of *align*, which must be a power of 2 and equal or
> + *   less than \c RTE_CACHE_LINE_SIZE.
> + * @return
> + *   The id of the variable, stored in a void pointer value.
> + */
> +__rte_experimental
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_LCORE_VAR_H_ */
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 5e0cd47c82..e90b86115a 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -393,6 +393,10 @@ EXPERIMENTAL {
>  	# added in 23.07
>  	rte_memzone_max_get;
>  	rte_memzone_max_set;
> +
> +	# added in 24.03
> +	rte_lcore_var_alloc;
> +	rte_lcore_var;
>  };
> 
>  INTERNAL {
> --
> 2.34.1

Acked-by: Morten Brørup <mb@smartsharesystems.com>



* RE: [RFC v3 5/6] service: keep per-lcore state in lcore variable
  2024-02-20  8:49         ` [RFC v3 5/6] service: " Mattias Rönnblom
@ 2024-02-22  9:42           ` Morten Brørup
  2024-02-23 10:19             ` Mattias Rönnblom
  0 siblings, 1 reply; 42+ messages in thread
From: Morten Brørup @ 2024-02-22  9:42 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Tuesday, 20 February 2024 09.49
> 
> Replace static array of cache-aligned structs with an lcore variable,
> to slightly benefit code simplicity and performance.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---


> @@ -486,8 +489,7 @@ service_runner_func(void *arg)
>  {
>  	RTE_SET_USED(arg);
>  	uint8_t i;
> -	const int lcore = rte_lcore_id();
> -	struct core_state *cs = &lcore_states[lcore];
> +	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);

Typo: TAB -> SPACE.

> 
>  	rte_atomic_store_explicit(&cs->thread_active, 1,
> rte_memory_order_seq_cst);
> 
> @@ -533,13 +535,16 @@ service_runner_func(void *arg)
>  int32_t
>  rte_service_lcore_may_be_active(uint32_t lcore)
>  {
> -	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
> +	struct core_state *cs =
> +		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
> +
> +	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
>  		return -EINVAL;

This comment is mostly related to patch 1 in the series...

You are setting cs = RTE_LCORE_VAR_LCORE_PTR(lcore, ...) before validating that lcore < RTE_MAX_LCORE. I wondered whether that could be an overrun bug.

Looking at the RTE_LCORE_VAR_LCORE_PTR() macro implementation, it is clear that it only computes an address. Perhaps its description could mention that the macro is safe to use with an "invalid" lcore_id, as long as the resulting pointer is not dereferenced.
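
A more conservative ordering sidesteps the question entirely: validate the lcore id first, and only then form the pointer. The sketch below is a standalone mock; the MOCK_* names, mock_core_state and the fixed handle offset are hypothetical, the return value 1 is a placeholder for whatever the real function reports, and __typeof__ and __attribute__((aligned)) assume GCC or Clang.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins, far smaller than the real EAL constants. */
#define MOCK_MAX_LCORE 4
#define MOCK_MAX_LCORE_VAR 64

struct mock_core_state {
	uint8_t is_service_core;
};

static char mock_lcore_var[MOCK_MAX_LCORE][MOCK_MAX_LCORE_VAR]
	__attribute__((aligned(64)));

/* The handle value is an offset; 8 is an arbitrary choice for this mock. */
static struct mock_core_state *mock_lcore_states =
	(struct mock_core_state *)8;

#define MOCK_LCORE_PTR(lcore_id, name) \
	((__typeof__(name))(void *)&mock_lcore_var[lcore_id][(uintptr_t)(name)])

/* Test helper: mark an lcore id as a service core. */
static void
mock_set_service_core(uint32_t lcore)
{
	assert(lcore < MOCK_MAX_LCORE);
	MOCK_LCORE_PTR(lcore, mock_lcore_states)->is_service_core = 1;
}

/* Validate lcore first, and only then form the per-lcore pointer. */
static int32_t
mock_lcore_may_be_active(uint32_t lcore)
{
	struct mock_core_state *cs;

	if (lcore >= MOCK_MAX_LCORE)
		return -EINVAL;

	cs = MOCK_LCORE_PTR(lcore, mock_lcore_states);
	if (!cs->is_service_core)
		return -EINVAL;

	return 1; /* placeholder for the real "may be active" result */
}
```

This costs one extra branch compared to the combined condition in the patch, but no out-of-range address is ever computed, so the macro description would not need a special note about invalid lcore ids.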



* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-22  9:22           ` Morten Brørup
@ 2024-02-23 10:12             ` Mattias Rönnblom
  0 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-23 10:12 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-22 10:22, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Tuesday, 20 February 2024 09.49
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is related logically, but where
>> there are performance benefits to reap from having updates being local
>> to an lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' life time with that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality provided by
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoid excessive use of padding,
>> polluting caches with unused data.
>>
>> RFC v3:
>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>
>> RFC v2:
>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>   * Allow user-specified alignment, but limit max to cache line size.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   config/rte_config.h                   |   1 +
>>   doc/api/doxy-api-index.md             |   1 +
>>   lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>>   lib/eal/common/meson.build            |   1 +
>>   lib/eal/include/meson.build           |   1 +
>>   lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>>   lib/eal/version.map                   |   4 +
>>   7 files changed, 465 insertions(+)
>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index da265d7dd2..884482e473 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -30,6 +30,7 @@
>>   /* EAL defines */
>>   #define RTE_CACHE_GUARD_LINES 1
>>   #define RTE_MAX_HEAPS 32
>> +#define RTE_MAX_LCORE_VAR 1048576
>>   #define RTE_MAX_MEMSEG_LISTS 128
>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index a6a768bd7c..bb06bb7ca1 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>>     [interrupts](@ref rte_interrupts.h),
>>     [launch](@ref rte_launch.h),
>>     [lcore](@ref rte_lcore.h),
>> +  [lcore-varible](@ref rte_lcore_var.h),
>>     [per-lcore](@ref rte_per_lcore.h),
>>     [service cores](@ref rte_service.h),
>>     [keepalive](@ref rte_keepalive.h),
>> diff --git a/lib/eal/common/eal_common_lcore_var.c
>> b/lib/eal/common/eal_common_lcore_var.c
>> new file mode 100644
>> index 0000000000..dfd11cbd0b
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,82 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define WARN_THRESHOLD 75
> 
> It's not an error condition, so 75 % seems like a low threshold for WARNING.
> Consider increasing it to 95 %, or change the level to NOTICE.
> Or both.
> 

I'll make an attempt at a variant which uses the libc heap instead of 
BSS, and does so dynamically. Then one need not worry about a fixed-size 
upper bound, barring heap allocation failures (which you are best off 
making fatal in the lcore variables case).

The glibc heap is available early (as early as the earliest RTE_INIT()).

You also avoid the headache of thinking about what happens if indeed all 
of the rte_lcore_var array is backed by actual memory. That could be due 
to mlockall(), huge page use for BSS, or systems where BSS is not 
on-demand mapped. I have no idea how paging works on Windows NT, for 
example.
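
One possible shape of such a heap-backed variant, as a rough standalone sketch and not the actual follow-up patch. All SKETCH_* and sketch_* names and sizes are hypothetical, and C11 aligned_alloc() is assumed to be available (it is in glibc). Here the handle is a real pointer to lcore id 0's copy, and the copies for other lcore ids sit at a fixed stride from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sizes, far smaller than anything EAL would use. */
#define SKETCH_MAX_LCORE 4
#define SKETCH_CHUNK 128	/* per-lcore bytes in each heap buffer */
#define SKETCH_CACHE_LINE 64

static char *current_buf;	/* buffer currently being carved up */
static size_t used;		/* bytes consumed of lcore id 0's chunk */

/*
 * Returns a pointer to lcore id 0's copy of the new variable. Buffers
 * are never freed, in line with lcore variables not being freeable.
 */
static void *
sketch_lcore_var_alloc(size_t size, size_t align)
{
	size_t offset;

	assert(size <= SKETCH_CHUNK);
	assert(align != 0 && (align & (align - 1)) == 0);
	assert(align <= SKETCH_CACHE_LINE);

	offset = (used + align - 1) & ~(align - 1);

	if (current_buf == NULL || offset + size > SKETCH_CHUNK) {
		size_t buf_size = (size_t)SKETCH_MAX_LCORE * SKETCH_CHUNK;

		/* No room left: grab a fresh buffer; failure is fatal. */
		current_buf = aligned_alloc(SKETCH_CACHE_LINE, buf_size);
		if (current_buf == NULL)
			abort();
		memset(current_buf, 0, buf_size);
		offset = 0;
	}

	used = offset + size;

	return current_buf + offset;
}

/* The copy for lcore_id sits a fixed stride away from lcore id 0's copy. */
static void *
sketch_lcore_var_ptr(unsigned int lcore_id, void *handle)
{
	assert(lcore_id < SKETCH_MAX_LCORE);

	return (char *)handle + (size_t)lcore_id * SKETCH_CHUNK;
}
```

With this layout there is no global upper bound to warn about. Only the size of a single variable is limited (to SKETCH_CHUNK here), and untouched buffers are simply never allocated.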

>> +
>> +/*
>> + * Avoid using offset zero, since it would result in a NULL-value
>> + * "handle" (offset) pointer, which in principle and per the API
>> + * definition shouldn't be an issue, but may confuse some tools and
>> + * users.
>> + */
>> +#define INITIAL_OFFSET 1
>> +
>> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
>> +
>> +static uintptr_t allocated = INITIAL_OFFSET;
> 
> Please add an API to get the amount of allocated lcore variable memory.
> The easy option is to make the above variable public (with a proper name, e.g. rte_lcore_var_allocated).
> 
> The total amount of lcore variable memory is already public: RTE_MAX_LCORE_VAR.
> 

Makes sense with the RFC v3 design.

If you eliminate the fixed upper bound and use the heap, there shouldn't 
be any particular need to track lcore variable memory use separately 
from other heap users.

>> +
>> +static void
>> +verify_allocation(uintptr_t new_allocated)
>> +{
>> +	static bool has_warned;
>> +
>> +	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
>> +
>> +	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
>> +	    !has_warned) {
>> +		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
>> +			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
>> +			RTE_MAX_LCORE_VAR);
>> +		has_warned = true;
>> +	}
>> +}
>> +
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
>> +
>> +	void *offset = (void *)new_allocated;
>> +
>> +	new_allocated += size;
>> +
>> +	verify_allocation(new_allocated);
>> +
>> +	allocated = new_allocated;
>> +
>> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>> +		"%"PRIuPTR"-byte alignment", size, align);
>> +
>> +	return offset;
>> +}
>> +
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	/* Having the per-lcore buffer size aligned on cache lines
>> +	 * assures as well as having the base pointer aligned on cache
>> +	 * size assures that aligned offsets also translate to aligned
>> +	 * pointers across all values.
>> +	 */
>> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
>> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
>> +
>> +	/* '0' means asking for worst-case alignment requirements */
>> +	if (align == 0)
>> +		align = alignof(max_align_t);
>> +
>> +	RTE_ASSERT(rte_is_power_of_2(align));
>> +
>> +	return lcore_var_alloc(size, align);
>> +}
>> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
>> index 22a626ba6f..d41403680b 100644
>> --- a/lib/eal/common/meson.build
>> +++ b/lib/eal/common/meson.build
>> @@ -18,6 +18,7 @@ sources += files(
>>           'eal_common_interrupts.c',
>>           'eal_common_launch.c',
>>           'eal_common_lcore.c',
>> +        'eal_common_lcore_var.c',
>>           'eal_common_mcfg.c',
>>           'eal_common_memalloc.c',
>>           'eal_common_memory.c',
>> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
>> index e94b056d46..9449253e23 100644
>> --- a/lib/eal/include/meson.build
>> +++ b/lib/eal/include/meson.build
>> @@ -27,6 +27,7 @@ headers += files(
>>           'rte_keepalive.h',
>>           'rte_launch.h',
>>           'rte_lcore.h',
>> +        'rte_lcore_var.h',
>>           'rte_lock_annotations.h',
>>           'rte_malloc.h',
>>           'rte_mcslock.h',
>> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
>> new file mode 100644
>> index 0000000000..da49d48d7c
>> --- /dev/null
>> +++ b/lib/eal/include/rte_lcore_var.h
>> @@ -0,0 +1,375 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#ifndef _RTE_LCORE_VAR_H_
>> +#define _RTE_LCORE_VAR_H_
>> +
>> +/**
>> + * @file
>> + *
>> + * RTE Per-lcore id variables
>> + *
>> + * This API provides a mechanism to create and access per-lcore id
>> + * variables in a space- and cycle-efficient manner.
>> + *
>> + * A per-lcore id variable (or lcore variable for short) has one value
>> + * for each EAL thread and registered non-EAL thread. In other words,
>> + * there's one copy of its value for each and every current and future
>> + * lcore id-equipped thread, with the total number of copies amounting
>> + * to \c RTE_MAX_LCORE.
>> + *
>> + * In order to access the values of an lcore variable, a handle is
>> + * used. The type of the handle is a pointer to the value's type
>> + * (e.g., for \c uint32_t lcore variable, the handle is a
>> + * <code>uint32_t *</code>. A handle may be passed between modules and
>> + * threads just like any pointer, but its value is not the address of
>> + * any particular object, but rather just an opaque identifier, stored
>> + * in a typed pointer (to inform the access macro the type of values).
>> + *
>> + * @b Creation
>> + *
>> + * An lcore variable is created in two steps:
>> + *  1. Define a lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
>> + *  2. Allocate lcore variable storage and initialize the handle with
>> + *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
>> + *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
>> + *     module initialization, but may be done at any time.
>> + *
>> + * An lcore variable is not tied to the owning thread's lifetime. It's
>> + * available for use by any thread immediately after having been
>> + * allocated, and continues to be available throughout the lifetime of
>> + * the EAL.
>> + *
>> + * Lcore variables cannot and need not be freed.
>> + *
>> + * @b Access
>> + *
>> + * The value of any lcore variable for any lcore id may be accessed
>> + * from any thread (including unregistered threads), but it should
>> + * generally only be *frequently* read from or written to by the owner.
>> + *
>> + * Values of the same lcore variable but owned by two different lcore
>> + * ids *may* be frequently read or written by the owners without the
>> + * risk of false sharing.
>> + *
>> + * An appropriate synchronization mechanism (e.g., atomics) should be
>> + * employed to assure there are no data races between the owning
>> + * thread and any non-owner threads accessing the same lcore variable
>> + * instance.
>> + *
>> + * The value of the lcore variable for a particular lcore id may be
>> + * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
>> + * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
>> + *
>> + * To modify the value of an lcore variable for a particular lcore id,
>> + * either access the object through the pointer retrieved by \ref
>> + * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
>> + * RTE_LCORE_VAR_LCORE_SET.
>> + *
>> + * The access macros each has a short-hand which may be used by an EAL
>> + * thread or registered non-EAL thread to access the lcore variable
>> + * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
>> + * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
>> + *
>> + * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
>> + * pointer with the same type as the value, it may not be directly
>> + * dereferenced and must be treated as an opaque identifier. The
>> + * *identifier* value is common across all lcore ids.
>> + *
>> + * @b Storage
>> + *
>> + * An lcore variable's values may be of a primitive type like \c int,
>> + * but would more typically be a \c struct. An application may choose
>> + * to define an lcore variable, which it then goes on to never
>> + * allocate.
>> + *
>> + * The lcore variable handle introduces a per-variable (not
>> + * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
>> + * there are some memory footprint gains to be made by organizing all
>> + * per-lcore id data for a particular module as one lcore variable
>> + * (e.g., as a struct).
>> + *
>> + * The sum of all lcore variables, plus any padding required, must be
>> + * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
>> + * violation of this maximum results in the process being terminated.
>> + *
>> + * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
>> + * same order of magnitude in size as a thread stack.
>> + *
>> + * The lcore variable storage buffers are kept in the BSS section in
>> + * the resulting binary, where data generally isn't mapped in until
>> + * it's accessed. This means that unused portions of the lcore
>> + * variable storage area will not occupy any physical memory (with a
>> + * granularity of the memory page size [usually 4 kB]).
>> + *
>> + * Lcore variables should generally *not* be \ref __rte_cache_aligned
>> + * and need *not* include a \ref RTE_CACHE_GUARD field, since the use
>> + * of these constructs are designed to avoid false sharing. In the
>> + * case of an lcore variable instance, all nearby data structures
>> + * should almost-always be written to by a single thread (the lcore
>> + * variable owner). Adding padding will increase the effective memory
>> + * working set size, and potentially reducing performance.
>> + *
>> + * @b Example
>> + *
>> + * Below is an example of the use of an lcore variable:
>> + *
>> + * \code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + * };
>> + *
>> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
>> + *
>> + * long foo_get_a_plus_b(void)
>> + * {
>> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
>> + *
>> + *         return state->a + state->b;
>> + * }
>> + *
>> + * RTE_INIT(rte_foo_init)
>> + * {
>> + *         unsigned int lcore_id;
> 
> This variable is part of RTE_LCORE_VAR_FOREACH_VALUE(), and can be removed from here.
> 
>> + *
>> + *         RTE_LCORE_VAR_ALLOC(foo_state);
> 
> Typo: foo_state -> lcore_states
> 

Will fix.

>> + *
>> + *         struct foo_lcore_state *state;
>> + *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_states) {
> 
> Typo:
> RTE_LCORE_VAR_FOREACH_VALUE(lcore_states)
> ->
> RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states)
> 

Will fix.

>> + *                 (initialize 'state')
>> + *         }
>> + *
>> + *         (other initialization)
>> + * }
>> + * \endcode
>> + *
>> + *
>> + * @b Alternatives
>> + *
>> + * Lcore variables are designed to replace a pattern exemplified below:
>> + * \code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + *         RTE_CACHE_GUARD;
>> + * } __rte_cache_aligned;
>> + *
>> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
>> + * \endcode
>> + *
>> + * This scheme is simple and effective, but has one drawback: the data
>> + * is organized so that objects related to all lcores for a particular
>> + * module is kept close in memory. At a bare minimum, this forces the
>> + * use of cache-line alignment to avoid false sharing. With CPU
>> + * hardware prefetching and memory loads resulting from speculative
>> + * execution (functions which seemingly are getting more eager faster
>> + * than they are getting more intelligent), one or more "guard" cache
>> + * lines may be required to separate one lcore's data from another's.
>> + *
>> + * Lcore variables has the upside of working with, not against, the
>> + * CPU's assumptions and for example next-line prefetchers may well
>> + * work the way its designers intended (i.e., to the benefit, not
>> + * detriment, of system performance).
>> + *
>> + * Another alternative to \ref rte_lcore_var.h is the \ref
>> + * rte_per_lcore.h API, which make use of thread-local storage (TLS,
>> + * e.g., GCC __thread or C11 _Thread_local). The main differences
>> + * between by using the various forms of TLS (e.g., \ref
>> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
>> + * variables are:
>> + *
>> + *   * The existence and non-existence of a thread-local variable
>> + *     instance follow that of particular thread's. The data cannot be
>> + *     accessed before the thread has been created, nor after it has
>> + *     exited. One effect of this is thread-local variables must
>> + *     initialized in a "lazy" manner (e.g., at the point of thread
>> + *     creation). Lcore variables may be accessed immediately after
>> + *     having been allocated (which is usually prior any thread beyond
>> + *     the main thread is running).
>> + *   * A thread-local variable is duplicated across all threads in the
>> + *     process, including unregistered non-EAL threads (i.e.,
>> + *     "regular" threads). For DPDK applications heavily relying on
>> + *     multi-threading (in conjunction to DPDK's "one thread per core"
>> + *     pattern), either by having many concurrent threads or
>> + *     creating/destroying threads at a high rate, an excessive use of
>> + *     thread-local variables may cause inefficiencies (e.g.,
>> + *     increased thread creation overhead due to thread-local storage
>> + *     initialization or increased total RAM footprint usage). Lcore
>> + *     variables *only* exist for threads with an lcore id, and thus
>> + *     not for such "regular" threads.
>> + *   * If data in thread-local storage may be shared between threads
>> + *     (i.e., can a pointer to a thread-local variable be passed to
>> + *     and successfully dereferenced by non-owning thread) depends on
>> + *     the details of the TLS implementation. With GCC __thread and
>> + *     GCC _Thread_local, such data sharing is supported. In the C11
>> + *     standard, the result of accessing another thread's
>> + *     _Thread_local object is implementation-defined. Lcore variable
>> + *     instances may be accessed reliably by any thread.
>> + */
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +#include <stddef.h>
>> +#include <stdalign.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_config.h>
>> +#include <rte_lcore.h>
>> +
>> +/**
>> + * Given the lcore variable type, produces the type of the lcore
>> + * variable handle.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
>> +	type *
> 
> This macro seems superfluous.
> In RTE_LCORE_VAR_HANDLE(type, name) just use:
>   type * name
> Are there other use cases for it?
> 

It's just a marker, like RTE_LCORE_VAR_HANDLE(), to indicate this is not 
your average pointer type.

It's not obvious these marker macros make things more clear. One could 
just say in the API docs that lcore handles are opaque pointers to the 
lcore variable's type, and make clear they may only be dereferenced 
through the provided macros.

>> +
>> +/**
>> + * Define a lcore variable handle.
>> + *
>> + * This macro defines a variable which is used as a handle to access
>> + * the various per-lcore id instances of a per-lcore id variable.
>> + *
>> + * The aim with this macro is to make clear at the point of
>> + * declaration that this is an lcore handler, rather than a regular
>> + * pointer.
>> + *
>> + * Add @b static as a prefix in case the lcore variable are only to be
>> + * accessed from a particular translation unit.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
>> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> 
> Thinking out loud here...
> Consider if this name should be more similar with RTE_DEFINE_PER_LCORE(type, name), e.g. RTE_DEFINE_LCORE_VAR(type, name) or RTE_LCORE_VAR_DEFINE(type, name).
> Using the common prefix RTE_LCORE_VAR is preferable.
> Using the term "handle" indicates that it is opaque and needs to be allocated by an allocation function.
> On the other hand, the "handle" is not unique per thread, so it's not really a "handle".
> 

It's a handle to a variable, not a handle to a particular instance of 
its values.

>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)	\
>> +	name = rte_lcore_var_alloc(size, align)
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle,
>> + * with values aligned for any type of object.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
>> +	name = rte_lcore_var_alloc(size, 0)
>> +
>> +/**
>> + * Allocate space for an lcore variable of the size and alignment requirements
>> + * suggested by the handler pointer type, and initialize its handle.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC(name)					\
>> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),		\
>> +				       alignof(typeof(*(name))))
>> +
>> +/**
>> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
>> + * means of a \ref RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
>> +	}
>> +
>> +/**
>> + * Allocate an explicitly-sized lcore variable by means of a \ref
>> + * RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
>> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
>> +
>> +/**
>> + * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT(name)					\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC(name);				\
>> +	}
>> +
>> +#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
>> +	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
> 
> This macro also seems superfluous.
> Doesn't RTE_LCORE_VAR_LCORE_PTR() suffice?
> 

It's just functional decomposition (but for macros). To make the whole 
thing a little more readable.

Maybe I should change "name" to "handle" in this and other instances 
(e.g., RTE_LCORE_VAR_LCORE_PTR).

>> +
>> +/**
>> + * Get pointer to lcore variable instance with the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
>> +	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> 
> This uses type casting.
> I wonder if additional build-time type checking would be possible...
> Nice to have: The compiler should fail if name is not a pointer, but a struct or an uint64_t, or even an uintptr_t.
> 
There is no way to compare the type of the lcore variable (at the point
of declaration) with the type of the handle pointer at the point of
handle "dereferencing" (which is essentially what this macro does).

You can't cast a struct to a pointer. You could assure it's a pointer by 
replacing the __RTE_LCORE_VAR_LCORE_PTR() with

static inline void *
__rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
{
	return (void *)&rte_lcore_var[lcore_id][(uintptr_t)handle];
}

(Bad practice to use a macro when a function can do the job anyway.)

Maybe this function shouldn't even have the "__" prefix. There could
well be valid use cases where you want void *-typed access to an lcore
variable value.

I'll use a function in the next RFC version.
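Roughly, the function-plus-macro split could look like the sketch below. The sketch_*/SKETCH_* names and the array sizes are placeholders chosen so the snippet compiles on its own; they are not the RFC's identifiers. The function does the address arithmetic on a void * handle, and the macro only casts the result back to the handle's type.

```c
#include <stdint.h>

/* Placeholder sizes, chosen only to keep the sketch self-contained. */
#define SKETCH_MAX_LCORE 4
#define SKETCH_MAX_LCORE_VAR 256

static _Alignas(64) char
sketch_lcore_var[SKETCH_MAX_LCORE][SKETCH_MAX_LCORE_VAR];

/* The handle is an offset stored in a pointer, as in the RFC. Taking
 * it as a 'void *' parameter means a struct argument is rejected by
 * the compiler, and an integer argument is at least diagnosed.
 */
static inline void *
sketch_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
{
	return (void *)&sketch_lcore_var[lcore_id][(uintptr_t)handle];
}

/* Typed wrapper: the result has the same pointer type as the handle.
 * __typeof__ is a GCC/clang extension (standard as typeof in C23).
 */
#define SKETCH_LCORE_VAR_LCORE_PTR(lcore_id, handle) \
	((__typeof__(handle))sketch_lcore_var_lcore_ptr(lcore_id, handle))
```

How strictly a non-pointer handle is rejected depends on the compiler and its flags: passing a struct is a hard error, while passing an integer such as a uint64_t may only produce a warning unless warnings are treated as errors.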

>> +
>> +/**
>> + * Get value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
> 
> The four accessor functions, RTE_LCORE_VAR[_LCORE]_GET/SET(), seem superfluous.
> They make the API seem more complex than just using RTE_LCORE_VAR[_LCORE]_PTR() for access.
> 

They are (somewhat) useful when the value is a primitive type.

RTE_LCORE_VAR_SET(my_int, 17);

versus

*RTE_LCORE_VAR_PTR(my_int) = 17;

Former is slightly more readable, imo, but I agree with you that these 
macros do clutter up the API.

>> +
>> +/**
>> + * Set the value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
>> +
>> +/**
>> + * Get pointer to lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
>> +
>> +/**
>> + * Get value of lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
>> +
>> +/**
>> + * Set value of lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_SET(name, value) \
>> +	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
>> +
>> +/**
>> + * Iterate over each lcore id's value for a lcore variable.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)				\
>> +	for (unsigned int lcore_id =					\
>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
>> +	     lcore_id < RTE_MAX_LCORE;					\
>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> 
> RTE_LCORE_VAR_FOREACH_PTR(ptr, name) would be an even better name; considering that "var" is really a pointer.
> 

No, it's for each value, referenced via the pointer.

RTE_LCORE_VAR_FOREACH_VALUE_PTR() is too long.

I'll change "var" -> "ptr".

> I also wonder about build-time type checking here...
> Nice to have: The compiler should fail if "ptr" is not a pointer.
> 

I agree.

>> +
>> +extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
>> +
>> +/**
>> + * Allocate space in the per-lcore id buffers for a lcore variable.
>> + *
>> + * The pointer returned is only an opaque identifier of the variable. To
>> + * get an actual pointer to a particular instance of the variable use
>> + * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
>> + *
>> + * The allocation is always successful, barring a fatal exhaustion of
>> + * the per-lcore id buffer space.
>> + *
>> + * @param size
>> + *   The size (in bytes) of the variable's per-lcore id value.
>> + * @param align
>> + *   If 0, the values will be suitably aligned for any kind of type
>> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
>> + *   on a multiple of *align*, which must be a power of 2 and equal or
>> + *   less than \c RTE_CACHE_LINE_SIZE.
>> + * @return
>> + *   The id of the variable, stored in a void pointer value.
>> + */
>> +__rte_experimental
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align);
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* _RTE_LCORE_VAR_H_ */
>> diff --git a/lib/eal/version.map b/lib/eal/version.map
>> index 5e0cd47c82..e90b86115a 100644
>> --- a/lib/eal/version.map
>> +++ b/lib/eal/version.map
>> @@ -393,6 +393,10 @@ EXPERIMENTAL {
>>   	# added in 23.07
>>   	rte_memzone_max_get;
>>   	rte_memzone_max_set;
>> +
>> +	# added in 24.03
>> +	rte_lcore_var_alloc;
>> +	rte_lcore_var;
>>   };
>>
>>   INTERNAL {
>> --
>> 2.34.1
> 
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 

* Re: [RFC v3 5/6] service: keep per-lcore state in lcore variable
  2024-02-22  9:42           ` Morten Brørup
@ 2024-02-23 10:19             ` Mattias Rönnblom
  0 siblings, 0 replies; 42+ messages in thread
From: Mattias Rönnblom @ 2024-02-23 10:19 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-22 10:42, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Tuesday, 20 February 2024 09.49
>>
>> Replace static array of cache-aligned structs with an lcore variable,
>> to slightly benefit code simplicity and performance.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
> 
> 
>> @@ -486,8 +489,7 @@ service_runner_func(void *arg)
>>   {
>>   	RTE_SET_USED(arg);
>>   	uint8_t i;
>> -	const int lcore = rte_lcore_id();
>> -	struct core_state *cs = &lcore_states[lcore];
>> +	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
> 
> Typo: TAB -> SPACE.
> 

Will fix.

>>
>>   	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
>>
>> @@ -533,13 +535,16 @@ service_runner_func(void *arg)
>>   int32_t
>>   rte_service_lcore_may_be_active(uint32_t lcore)
>>   {
>> -	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
>> +	struct core_state *cs =
>> +		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
>> +
>> +	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
>>   		return -EINVAL;
> 
> This comment is mostly related to patch 1 in the series...
> 
> You are setting cs = RTE_LCORE_VAR_LCORE_PTR(lcore, ...) before validating that lcore < RTE_MAX_LCORE. I wondered if that potentially was an overrun bug.
> 
> It is obvious when looking at the RTE_LCORE_VAR_LCORE_PTR() macro implementation, but perhaps its description could mention that it is safe to use with an "invalid" lcore_id, but not dereferencing it.
> 

I thought about adding something equivalent to an RTE_ASSERT() on 
lcore_id in the dereferencing macros, but then I thought that maybe it 
is a valid use case to pass invalid lcore ids.

Invalid ids being OK or not, I think the above code should do "cs =
/../" *after* the lcore id check. Now it looks strange and forces the
reader to consider whether this is valid or not, for no good reason.

The lcore variable API docs should probably explicitly allow invalid
lcore ids in the macros.
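
For illustration, the reordered check could look something like the sketch below. It is self-contained on purpose: the sketch_* names, the SKETCH_MAX_LCORE value and the plain array standing in for the lcore variable are all assumptions, not the service library's code.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 4 /* placeholder for RTE_MAX_LCORE */

struct sketch_core_state {
	uint8_t is_service_core;
	uint8_t thread_active;
};

/* Plain array standing in for the lcore variable storage. */
static struct sketch_core_state sketch_states[SKETCH_MAX_LCORE];

/* Stand-in for RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states). */
static struct sketch_core_state *
sketch_core_state_ptr(uint32_t lcore)
{
	return &sketch_states[lcore];
}

static int32_t
sketch_lcore_may_be_active(uint32_t lcore)
{
	struct sketch_core_state *cs;

	/* Validate the lcore id first ... */
	if (lcore >= SKETCH_MAX_LCORE)
		return -EINVAL;

	/* ... and only then form the per-lcore pointer. */
	cs = sketch_core_state_ptr(lcore);

	if (!cs->is_service_core)
		return -EINVAL;

	return cs->thread_active;
}
```

Written this way, the reader never has to reason about whether forming a pointer from an out-of-range lcore id is harmless.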

Thread overview: 42+ messages
2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-02-09  8:25   ` Morten Brørup
2024-02-09 11:46     ` Mattias Rönnblom
2024-02-09 13:04       ` Morten Brørup
2024-02-19  7:49         ` Mattias Rönnblom
2024-02-19 11:10           ` Morten Brørup
2024-02-19 14:31             ` Mattias Rönnblom
2024-02-19 15:04               ` Morten Brørup
2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
2024-02-19  9:40     ` [RFC v2 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-02-20  9:11           ` Bruce Richardson
2024-02-20 10:47             ` Mattias Rönnblom
2024-02-20 11:39               ` Bruce Richardson
2024-02-20 13:37                 ` Morten Brørup
2024-02-20 16:26                 ` Mattias Rönnblom
2024-02-21  9:43           ` Jerin Jacob
2024-02-21 10:31             ` Morten Brørup
2024-02-21 14:26             ` Mattias Rönnblom
2024-02-22  9:22           ` Morten Brørup
2024-02-23 10:12             ` Mattias Rönnblom
2024-02-20  8:49         ` [RFC v3 2/6] eal: add lcore variable test suite Mattias Rönnblom
2024-02-20  8:49         ` [RFC v3 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-02-20 15:31           ` Morten Brørup
2024-02-20  8:49         ` [RFC v3 4/6] power: keep per-lcore " Mattias Rönnblom
2024-02-20  8:49         ` [RFC v3 5/6] service: " Mattias Rönnblom
2024-02-22  9:42           ` Morten Brørup
2024-02-23 10:19             ` Mattias Rönnblom
2024-02-20  8:49         ` [RFC v3 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-02-19  9:40     ` [RFC v2 2/5] eal: add lcore variable test suite Mattias Rönnblom
2024-02-19  9:40     ` [RFC v2 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-02-19 11:22       ` Morten Brørup
2024-02-19 14:04         ` Mattias Rönnblom
2024-02-19 15:10           ` Morten Brørup
2024-02-19  9:40     ` [RFC v2 4/5] power: keep per-lcore " Mattias Rönnblom
2024-02-19  9:40     ` [RFC v2 5/5] service: " Mattias Rönnblom
2024-02-08 18:16 ` [RFC 2/5] eal: add lcore variable test suite Mattias Rönnblom
2024-02-08 18:16 ` [RFC 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-02-08 18:16 ` [RFC 4/5] power: keep per-lcore " Mattias Rönnblom
2024-02-08 18:16 ` [RFC 5/5] service: " Mattias Rönnblom
