DPDK patches and discussions
* [RFC 0/5] Lcore variables
@ 2024-02-08 18:16 Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                   ` (4 more replies)
  0 siblings, 5 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how to best allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In
the author's opinion, they do however provide a reasonably simple,
clean, and seemingly very performant solution to a real problem.

One thing that is unclear to the author is how this API relates to a
potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Contrary to what the version.map edit suggests, this RFC is not meant
as a proposal for DPDK 24.03.

Mattias Rönnblom (5):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable

 app/test/meson.build                  |   1 +
 app/test/test_lcore_var.c             | 384 ++++++++++++++++++++++++++
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  80 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_random.c           |  30 +-
 lib/eal/common/rte_service.c          | 119 ++++----
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 352 +++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 lib/power/rte_power_pmd_mgmt.c        |  27 +-
 12 files changed, 925 insertions(+), 76 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1



* [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  2024-02-09  8:25   ` Morten Brørup
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 2/5] eal: add lcore variable test suite Mattias Rönnblom
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is the static allocation of
small chunks of often-used data, which are logically related, but
where there are performance benefits to reap from keeping updates
local to an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetimes from those of the
threads.

Lcore variables are also similar in terms of functionality to the
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables do is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed structs.
The benefit of lcore variables over this approach is that data related
to the same lcore is now close (spatially, in memory), rather than
data used by the same module, which in turn avoids excessive use of
padding and the resulting pollution of caches with unused data.
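
To illustrate, below is a minimal sketch of the two approaches. The
names are hypothetical; the "before" fragment is the prevailing array
pattern, the "after" fragment uses the new API:

	/* Before: per-module array, padded to avoid false sharing. */
	struct foo_lcore_state {
		int a;
		long b;
		RTE_CACHE_GUARD;
	} __rte_cache_aligned;

	static struct foo_lcore_state foo_states[RTE_MAX_LCORE];

	/* After: an lcore variable; no alignment or guards needed. */
	struct foo_lcore_state {
		int a;
		long b;
	};

	static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, foo_states);
	RTE_LCORE_VAR_INIT(foo_states);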

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  80 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 352 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 7 files changed, 440 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index da265d7dd2..884482e473 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -30,6 +30,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index a6a768bd7c..bb06bb7ca1 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -98,6 +98,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..5276fe7192
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+/* XXX: should this file be called eal_common_ldata.c or rte_ldata.c? */
+
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define WARN_THRESHOLD 75
+#define MAX_AUTO_ALIGNMENT 16U
+
+/*
+ * Avoid using offset zero, since it would result in a NULL-value
+ * "handle" (offset) pointer, which in principle and per the API
+ * definition shouldn't be an issue, but may confuse some tools and
+ * users.
+ */
+#define INITIAL_OFFSET MAX_AUTO_ALIGNMENT
+
+char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
+
+static uintptr_t allocated = INITIAL_OFFSET;
+
+static void
+verify_allocation(uintptr_t new_allocated)
+{
+	static bool has_warned;
+
+	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
+
+	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
+	    !has_warned) {
+		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
+			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
+			RTE_MAX_LCORE_VAR);
+		has_warned = true;
+	}
+}
+
+static void *
+lcore_var_alloc(size_t size, size_t alignment)
+{
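+	/* A simple bump allocator: the aligned offset doubles as the handle. */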
+	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, alignment);
+
+	void *offset = (void *)new_allocated;
+
+	new_allocated += size;
+
+	verify_allocation(new_allocated);
+
+	allocated = new_allocated;
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, alignment);
+
+	return offset;
+}
+
+void *
+rte_lcore_var_alloc(size_t size)
+{
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+
+	/* Allocations are naturally aligned (i.e., the alignment is
+	 * the same as the object size), up to a maximum of 16 bytes,
+	 * which should satisfy the alignment requirements of any kind
+	 * of object.
+	 */
+	size_t alignment = RTE_MIN(size, MAX_AUTO_ALIGNMENT);
+
+	return lcore_var_alloc(size, alignment);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..c1854dc6a4
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,352 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. In other words,
+ * there's one copy of its value for each and every current and future
+ * lcore id-equipped thread, with the total number of copies amounting
+ * to \c RTE_MAX_LCORE.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a \c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). A handle may be passed between modules
+ * and threads just like any pointer, but its value is not the address
+ * of any particular object, but rather just an opaque identifier,
+ * stored in a typed pointer (to inform the access macros of the value
+ * type).
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
+ *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the
+ *     time of module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * generally only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by different lcore
+ * ids *may* be frequently read or written by their owners without the
+ * risk of false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomics) should be
+ * employed to assure there are no data races between the owning
+ * thread and any non-owner threads accessing the same lcore variable
+ * instance.
+ *
+ * The value of the lcore variable for a particular lcore id may be
+ * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
+ * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * To modify the value of an lcore variable for a particular lcore id,
+ * either access the object through the pointer retrieved by \ref
+ * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
+ * RTE_LCORE_VAR_LCORE_SET.
+ *
+ * Each of the access macros has a short-hand which may be used by an EAL
+ * thread or registered non-EAL thread to access the lcore variable
+ * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
+ * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
+ *
+ * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier. The
+ * *identifier* value is common across all lcore ids.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like \c int,
+ * but would more typically be a \c struct. An application may choose
+ * to define an lcore variable which it then never allocates.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * The sum of all lcore variables, plus any padding required, must be
+ * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
+ * violation of this maximum results in the process being terminated.
+ *
+ * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
+ * same order of magnitude in size as a thread stack.
+ *
+ * The lcore variable storage buffers are kept in the BSS section in
+ * the resulting binary, where data generally isn't mapped in until
+ * it's accessed. This means that unused portions of the lcore
+ * variable storage area will not occupy any physical memory (with a
+ * granularity of the memory page size [usually 4 kB]).
+ *
+ * Lcore variables should generally *not* be \ref __rte_cache_aligned
+ * and need *not* include a \ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, all nearby data structures should
+ * almost-always be written to by a single thread (the lcore variable
+ * owner). Adding padding will increase the effective memory working
+ * set size, potentially reducing performance.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * \endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * } __rte_cache_aligned;
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * \endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions, and next-line prefetchers, for example, may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to \ref rte_lcore_var.h is the \ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., \ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore variables
+ * are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of the particular thread. The data cannot
+ *     be accessed before the thread has been created, nor after it
+ *     has exited. One effect of this is that thread-local variables
+ *     must be initialized in a "lazy" manner (e.g., at the point of
+ *     thread creation). Lcore variables may be accessed immediately
+ *     after having been allocated (which is usually prior to any
+ *     thread beyond the main thread running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per
+ *     core" pattern), either by having many concurrent threads or by
+ *     creating/destroying threads at a high rate, excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization, or an increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id, and thus
+ *     not for such "regular" threads.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread and GCC _Thread_local, such data sharing is
+ *     supported. In the C11 standard, the result of accessing another
+ *     thread's _Thread_local object is implementation-defined. Lcore
+ *     variable instances may be accessed reliably by any thread.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various per-lcore id instances of a per-lcore id variable.
+ *
+ * The aim of this macro is to make clear at the point of
+ * declaration that this is an lcore variable handle, rather than a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC_SZ(name, size)	\
+	name = rte_lcore_var_alloc(size)
+
+/**
+ * Allocate space for an lcore variable of the size suggested by the
+ * handle's pointer type, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC(name)			\
+	RTE_LCORE_VAR_ALLOC_SZ(name, sizeof(*(name)))
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a \ref
+ * RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SZ(name, size)				\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SZ(name, size);			\
+	}
+
+/**
+ * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
+	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
+	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+/**
+ * Get the value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
+
+/**
+ * Set the value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
+
+/**
+ * Get value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
+
+/**
+ * Set value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_SET(name, value) \
+	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
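+ *
+ * Example (a sketch; \c handle is assumed to be an already-allocated
+ * \c int lcore variable handle; \c lcore_id is declared by the macro
+ * itself and is available in the loop body):
+ *
+ * \code{.c}
+ * int *value;
+ * RTE_LCORE_VAR_FOREACH(value, handle) {
+ *         printf("lcore %u: %d\n", lcore_id, *value);
+ * }
+ * \endcode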
+ */
+#define RTE_LCORE_VAR_FOREACH(var, name)				\
+	for (unsigned int lcore_id =					\
+		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
+
+/**
+ * Allocate space in the per-lcore id buffer for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable.
+ * To get an actual pointer to a particular instance of the variable,
+ * use \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * @return
+ *   The id of the variable, stored in a void pointer value.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 5e0cd47c82..e90b86115a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -393,6 +393,10 @@ EXPERIMENTAL {
 	# added in 23.07
 	rte_memzone_max_get;
 	rte_memzone_max_set;
+
+	# added in 24.03
+	rte_lcore_var;
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.34.1



* [RFC 2/5] eal: add lcore variable test suite
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 384 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 385 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 6389ae83ee..93412cce51 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..0229f90bf2
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,384 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static bool
+rand_bool(void)
+{
+	return rte_rand() & 1;
+}
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_PTR(test_int);
+
+	bool naturally_aligned = RTE_PTR_ALIGN_CEIL(ptr, sizeof(int)) == ptr;
+
+	bool equal;
+
+	if (rand_bool())
+		equal = RTE_LCORE_VAR_GET(test_int) == state->old_value;
+	else
+		equal = *(RTE_LCORE_VAR_PTR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	if (rand_bool())
+		RTE_LCORE_VAR_SET(test_int, state->new_value);
+	else
+		*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		RTE_LCORE_VAR_LCORE_SET(lcore_id, test_int, state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		TEST_ASSERT_EQUAL(state->new_value,
+				  RTE_LCORE_VAR_LCORE_GET(lcore_id, test_int),
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH(v, test_int) {
+		printf("expected %d actual %d\n",
+		       states[lcore_id].new_value, *v);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_PTR(test_struct);
+
+	/*
+	 * Lcore variable alignment is based on object size, not on any
+	 * particular requirements of the struct's fields.
+	 */
+	bool properly_aligned =
+		RTE_PTR_ALIGN_CEIL(lcore_struct, 16) == lcore_struct;
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_struct);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_PTR(test_array);
+
+	/*
+	 * Lcore variable alignment is based on object size, not on any
+	 * particular requirements of the array's element type.
+	 */
+	bool properly_aligned =
+		RTE_PTR_ALIGN_CEIL(lcore_array, 16) == lcore_array;
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(RTE_LCORE_VAR_LCORE_GET(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_array);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (RTE_MAX_LCORE_VAR / 2)
+
+static int
+test_many_lvars(void)
+{
+	void **handlers = malloc(sizeof(void *) * MANY_LVARS);
+	int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		void *handle = rte_lcore_var_alloc(1);
+
+		uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), handle);
+
+		*b = (uint8_t)i;
+
+		handlers[i] = handle;
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_FOREACH_WORKER(lcore_id) {
+			uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(),
+							       handlers[i]);
+			TEST_ASSERT_EQUAL((uint8_t)i, *b,
+					  "Unexpected lcore variable value.");
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1



* [RFC 3/5] random: keep PRNG state in lcore variable
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 2/5] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 4/5] power: keep per-lcore " Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 5/5] service: " Mattias Rönnblom
  4 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace keeping PRNG state in an RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.
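
In short, the per-lcore state access changes from indexing a static
array to going through the lcore variable handle. A simplified sketch
(see the diff below for the actual handling of unregistered non-EAL
threads):

	/* before */
	struct rte_rand_state *state = &rand_states[rte_lcore_id()];

	/* after */
	struct rte_rand_state *state = RTE_LCORE_VAR_PTR(rand_state);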

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_random.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 7709b8f2c6..af9fffd81b 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct rte_rand_state {
@@ -19,14 +20,12 @@ struct rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
-} __rte_cache_aligned;
+};
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state __rte_cache_aligned;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_PTR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1



* [RFC 4/5] power: keep per-lcore state in lcore variable
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
                   ` (2 preceding siblings ...)
  2024-02-08 18:16 ` [RFC 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  2024-02-08 18:16 ` [RFC 5/5] service: " Mattias Rönnblom
  4 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/power/rte_power_pmd_mgmt.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 591fc69f36..bb20e564de 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -68,8 +69,8 @@ struct pmd_core_cfg {
 	/**< Number of queues ready to enter power optimized state */
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
-} __rte_cache_aligned;
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+};
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -772,10 +770,13 @@ RTE_INIT(rte_power_ethdev_pmgmt_init) {
 	size_t i;
 	int j;
 
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
+
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		struct pmd_core_cfg *lcore_cfg =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_cfgs);
+		TAILQ_INIT(&lcore_cfg->head);
 	}
 
 	/* initialize config defaults */
-- 
2.34.1



* [RFC 5/5] service: keep per-lcore state in lcore variable
  2024-02-08 18:16 [RFC 0/5] Lcore variables Mattias Rönnblom
                   ` (3 preceding siblings ...)
  2024-02-08 18:16 ` [RFC 4/5] power: keep per-lcore " Mattias Rönnblom
@ 2024-02-08 18:16 ` Mattias Rönnblom
  4 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-08 18:16 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 119 ++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index d959c91459..c557e80409 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,11 +102,12 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
+	else {
+		struct core_state *cs;
+		RTE_LCORE_VAR_FOREACH(cs, lcore_states)
+			memset(cs, 0, sizeof(struct core_state));
 	}
 
 	int i;
@@ -122,7 +124,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +137,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +286,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +293,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +454,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +467,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +489,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +535,16 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs =
+		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +552,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +573,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +590,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +642,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +694,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +712,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +737,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +761,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +785,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +815,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +824,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +849,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +860,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +868,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +876,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +885,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +901,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +948,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +977,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +989,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1028,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1



* RE: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-09  8:25   ` Morten Brørup
  2024-02-09 11:46     ` Mattias Rönnblom
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-02-09  8:25 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Thursday, 8 February 2024 19.17
> 
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is the static allocation of
> small chunks of often-used data, which are logically related, but
> where there are performance benefits to reap from keeping updates
> local to an lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decouple the values' lifetimes from those of the
> threads.
> 
> Lcore variables are also similar in terms of functionality to the
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its otherwise seemingly viable approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables do is to keep a module's per-lcore data as an
> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed structs.
> The benefit of lcore variables over this approach is that data related
> to the same lcore is now close (spatially, in memory), rather than
> data used by the same module, which in turn avoids excessive use of
> padding and the resulting pollution of caches with unused data.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---

This looks very promising. :-)

Here's a bunch of comments, questions and suggestions.


* Question: Performance.
What is the cost of accessing an lcore variable vs a variable in TLS?
I suppose the relative cost diminishes if the variable is a larger struct, compared to a simple uint64_t.

Some of my suggestions below might also affect performance.


* Advantage: Provides direct access to worker thread variables.
With the current alternative (thread-local storage), the main thread cannot access the TLS variables of the worker threads,
unless worker threads publish global access pointers.
Lcore variables of any lcore thread can be directly accessed by any thread, which simplifies code.
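
As a rough illustration (handle name and functions are invented here;
the macros are the ones from this RFC), the main thread could e.g. sum
up a per-lcore counter directly:

static uint64_t *foo_counter; /* lcore variable handle */

RTE_INIT(foo_init)
{
	RTE_LCORE_VAR_ALLOC(foo_counter);
}

/* Any thread (e.g., the main thread) may sum all lcore ids' instances.
 * Synchronization with the owning threads is ignored for brevity. */
static uint64_t
foo_total_count(void)
{
	uint64_t total = 0;
	unsigned int lcore_id;

	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
		total += *RTE_LCORE_VAR_LCORE_PTR(lcore_id, foo_counter);

	return total;
}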


* Advantage: Roadmap towards hugemem.
It would be nice if the lcore variable memory was allocated in hugemem, to reduce TLB misses.
The current alternative (thread-local storage) is also not using hugemem, so not a degradation.

Lcore variables are available very early at startup, so I guess the RTE memory allocator is not yet available.
Hugemem could be allocated using O/S allocation, so there is a possible road towards using hugemem.

Either way, using hugemem would require one more indirection (the pointer to the allocated hugemem).
I don't know which has better performance, using hugemem or avoiding the additional pointer dereferencing.


* Suggestion: Consider adding an entry for unregistered non-EAL threads.
Please consider making room for one more entry, shared by all unregistered non-EAL threads, i.e.
making the array size RTE_MAX_LCORE + 1 and indexing by (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).

It would be convenient for the use cases where a variable shared by the unregistered non-EAL threads doesn't need special treatment.

Obviously, this might affect performance.
If the performance cost is not negligible, the additional entry (and indexing branch) could be disabled at build time.
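
A minimal sketch of the suggested indexing, shown on a plain per-lcore
array for simplicity (struct and function names invented):

static struct foo_state foo_states[RTE_MAX_LCORE + 1];

static inline struct foo_state *
foo_get_state(void)
{
	unsigned int lcore_id = rte_lcore_id();

	/* unregistered non-EAL threads (lcore_id == LCORE_ID_ANY) all
	 * map to the shared extra entry */
	return &foo_states[lcore_id < RTE_MAX_LCORE ?
			   lcore_id : RTE_MAX_LCORE];
}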


* Suggestion: Do not fix the alignment at 16 bytes.
Pass an alignment parameter to rte_lcore_var_alloc() and use alignof() when calling it:

+#include <stdalign.h>
+
+#define RTE_LCORE_VAR_ALLOC(name)			\
+	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
+
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
+	name = rte_lcore_var_alloc(size, alignment)
+
+#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
+	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
+
+ +++ config/rte_config.h
+#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16


* Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but behaves differently.

> +/**
> + * Iterate over each lcore id's value for a lcore variable.
> + */
> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
> +	for (unsigned int lcore_id =					\
> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
> +	     lcore_id < RTE_MAX_LCORE;					\
> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> +

The macro name RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(i), which only iterates on running cores.
You might want to give it a name that differs more.

If it wasn't for API breakage, I would suggest renaming RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)

Small detail: "var" is a pointer, so consider renaming it to "ptr" and adding _PTR to the macro name.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-09  8:25   ` Morten Brørup
@ 2024-02-09 11:46     ` Mattias Rönnblom
  2024-02-09 13:04       ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-09 11:46 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-09 09:25, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Thursday, 8 February 2024 19.17
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is related logically, but where
>> there are performance benefits to reap from having updates being local
>> to an lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' life time with that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality provided by
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoid excessive use of padding,
>> polluting caches with unused data.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
> 
> This looks very promising. :-)
> 
> Here's a bunch of comments, questions and suggestions.
> 
> 
> * Question: Performance.
> What is the cost of accessing an lcore variable vs a variable in TLS?
> I suppose the relative cost diminishes if the variable is a larger struct, compared to a simple uint64_t.
> 

In case all the relevant data is available in a cache close to the core, 
both options carry quite low overhead.

Accessing an lcore variable will always require a TLS lookup, in the 
form of retrieving the lcore_id of the current thread. In that sense, 
there will likely be a number of extra instructions required to do the 
lcore variable address lookup (i.e., doing the load from the 
rte_lcore_var table based on the lcore_id you just looked up, and 
adding the variable's offset).

A TLS lookup will incur an extra overhead of less than a clock cycle, 
compared to accessing a non-TLS static variable, in case static linking 
is used. For shared objects, TLS is much more expensive (something often 
visible in dynamically linked DPDK app flame graphs, in the form of the 
__tls_get_addr symbol). Then you need to add ~3 cc/access. This was 
measured on a micro benchmark running on an x86_64 Raptor Lake P-core.

(To visualize the difference between shared object and not, one can use 
Compiler Explorer and -fPIC versus -fPIE.)

Things get more complicated if you access the same variable in the same 
section of code, since then it can be left on the stack/in a register by 
the compiler, especially if LTO is used. In other words, if you do 
rte_lcore_id() several times in a row, only the first one will cost you 
anything. This happens fairly often in DPDK, with rte_lcore_id().

Finally, if you do something like

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index af9fffd81b..a65c30d27e 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
  static __rte_always_inline
  struct rte_rand_state *__rte_rand_get_state(void)
  {
-       unsigned int idx;
-
-       idx = rte_lcore_id();
-
-       if (unlikely(idx == LCORE_ID_ANY))
-               return &unregistered_rand_state;
-
-       return RTE_LCORE_VAR_PTR(rand_state);
+       return &unregistered_rand_state;
  }

  uint64_t

...and re-run the rand_perf_autotest, at least I see no difference at 
all (in a statically linked build). Both result in rte_rand() using ~11 
cc/call. What that suggests is that TLS overhead is very small, and that 
any extra instructions required by lcore variables don't add much, if 
anything at all, at least in this particular case.
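
(The figure above comes from a loop of roughly the following shape;
this is a simplified sketch, not the actual rand_perf_autotest code,
and the iteration count is arbitrary.)

#include <inttypes.h>
#include <stdio.h>

#include <rte_cycles.h>
#include <rte_random.h>

#define ITERATIONS 10000000ULL

static void
measure_rte_rand(void)
{
	uint64_t sum = 0;
	uint64_t i;
	uint64_t start = rte_rdtsc();

	for (i = 0; i < ITERATIONS; i++)
		sum += rte_rand();

	uint64_t cycles = rte_rdtsc() - start;

	/* print 'sum' to keep the loop from being optimized away */
	printf("rte_rand(): ~%.1f cc/call (sum %"PRIu64")\n",
	       (double)cycles / ITERATIONS, sum);
}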

> Some of my suggestions below might also affect performance.
> 
> 
> * Advantage: Provides direct access to worker thread variables.
> With the current alternative (thread-local storage), the main thread cannot access the TLS variables of the worker threads,
> unless worker threads publish global access pointers.
> Lcore variables of any lcore thread can be direcctly accessed by any thread, which simplifies code.
> 
> 
> * Advantage: Roadmap towards hugemem.
> It would be nice if the lcore variable memory was allocated in hugemem, to reduce TLB misses.
> The current alternative (thread-local storage) is also not using hugement, so not a degradation.
> 

I agree, but the thing is it's hard to figure out how much memory is 
required for these kind of variables, given how DPDK is built and 
linked. In an OS kernel, you can just take all the symbols, put them in 
a special section, and size that section. Such a thing can't easily be 
done with DPDK, since shared object builds are supported, plus that this 
facility should be available not only to DPDK modules, but also the 
application, so relying on linker scripts isn't really feasible 
(probably not even feasible for DPDK itself).

In that scenario, you want to size up the per-lcore buffer to be so 
large, you don't have to worry about overruns. That will waste memory. 
If you use huge page memory, paging can't help you to avoid 
pre-allocating actual physical memory.

That said, even large (by static per-lcore data standards) buffers are 
potentially small enough not to grow the amount of memory used by a DPDK 
process too much. You need to provision for RTE_MAX_LCORE of them though.

The value of lcore variables should be small, and thus incur few TLB 
misses, so you may not gain much from huge pages. In my world, it's more 
about "fitting often-used per-lcore data into L1 or L2 CPU caches", 
rather than the easier "fitting often-used per-lcore data into a working 
set size reasonably expected to be covered by hardware TLB/caches".

> Lcore variables are available very early at startup, so I guess the RTE memory allocator is not yet available.
> Hugemem could be allocated using O/S allocation, so there is a possible road towards using hugemem.
> 

With the current design, that's true. I'm not sure it's a strict 
requirement though, but it does make things simpler.

> Either way, using hugement would require one more indirection (the pointer to the allocated hugemem).
> I don't know which has better performance, using hugemem or avoiding the additional pointer dereferencing.
> 
> 
> * Suggestion: Consider adding an entry for unregistered non-EAL threads.
> Please consider making room for one more entry, shared by all unregistered non-EAL threads, i.e.
> making the array size RTE_MAX_LCORE + 1 and indexing by (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
> 
> It would be convenient for the use cases where a variable shared by the unregistered non-EAL threads don't need special treatment.
> 

I thought about this, but it would require a conditional in the lookup 
macro, as you show. More importantly, it would make the whole 
<rte_lcore_var.h> thing less elegant and harder to understand. It's bad 
enough that "per-lcore" is actually "per-lcore id" (or the equivalent 
"per-EAL thread and unregistered EAL-thread"). Adding a "btw it's <what 
I said before> + 1" is not an improvement.

But useful? Sure.

I think you may still need other data for dealing with unregistered 
threads, for example a mutex or spin lock to deal with concurrency 
issues that arise with shared data.
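
A sketch of what such special handling could look like, mirroring the
rte_random.c pattern above (all names invented):

static struct foo_state *foo_states; /* lcore variable handle */

/* shared by all unregistered non-EAL threads */
static struct foo_state unregistered_foo_state;
static rte_spinlock_t unregistered_foo_lock = RTE_SPINLOCK_INITIALIZER;

static struct foo_state *
foo_get_state(void)
{
	unsigned int lcore_id = rte_lcore_id();

	if (unlikely(lcore_id == LCORE_ID_ANY)) {
		/* caller must hold unregistered_foo_lock */
		return &unregistered_foo_state;
	}

	return RTE_LCORE_VAR_PTR(foo_states);
}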

There may also be cases where you are best off by simply disallowing 
unregistered threads from calling into that API.

> Obviously, this might affect performance.
> If the performance cost is not negligble, the addtional entry (and indexing branch) could be disabled at build time.
> 
> 
> * Suggestion: Do not fix the alignment at 16 byte.
> Pass an alignment parameter to rte_lcore_var_alloc() and use alignof() when calling it:
> 
> +#include <stdalign.h>
> +
> +#define RTE_LCORE_VAR_ALLOC(name)			\
> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> +
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
> +	name = rte_lcore_var_alloc(size, alignment)
> +
> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> +
> + +++ /cconfig/rte_config.h
> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
> 
> 

That seems like a very good idea. I'll look into it.

> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but behaves differently.
> 
>> +/**
>> + * Iterate over each lcore id's value for a lcore variable.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
>> +	for (unsigned int lcore_id =					\
>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
>> +	     lcore_id < RTE_MAX_LCORE;					\
>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>> +
> 
> The macro name RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(i), which only iterates on running cores.
> You might want to give it a name that differs more.
> 

True.

Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for confusion, 
for sure.

Being consistent with <rte_lcore.h> is not so easy, since it's not even 
consistent with itself. For example, rte_lcore_count() returns the 
number of lcores (EAL threads) *plus the number of registered non-EAL 
threads*, and RTE_LCORE_FOREACH() gives a different count. :)

> If it wasn't for API breakage, I would suggest renaming RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
> 
> Small detail: "var" is a pointer, so consider renaming it to "ptr" and adding _PTR to the macro name.

The "var" name comes from how <sys/queue.h> names things. I think I had 
it as "ptr" initially. I'll change it back.

Thanks a lot Morten.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-09 11:46     ` Mattias Rönnblom
@ 2024-02-09 13:04       ` Morten Brørup
  2024-02-19  7:49         ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-02-09 13:04 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Friday, 9 February 2024 12.46
> 
> On 2024-02-09 09:25, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Thursday, 8 February 2024 19.17
> >>
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> >>
> >> The primary <rte_lcore_var.h> use case is for statically allocating
> >> small chunks of often-used data, which is related logically, but
> where
> >> there are performance benefits to reap from having updates being
> local
> >> to an lcore.
> >>
> >> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> >> _Thread_local), but decoupling the values' life time with that of
> the
> >> threads.
> >>
> >> Lcore variables are also similar in terms of functionality provided
> by
> >> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >> build-time machinery. DPCPU uses linker scripts, which effectively
> >> prevents the reuse of its, otherwise seemingly viable, approach.
> >>
> >> The currently-prevailing way to solve the same problem as lcore
> >> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-
> sized
> >> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> >> lcore variables over this approach is that data related to the same
> >> lcore now is close (spatially, in memory), rather than data used by
> >> the same module, which in turn avoid excessive use of padding,
> >> polluting caches with unused data.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> ---
> >
> > This looks very promising. :-)
> >
> > Here's a bunch of comments, questions and suggestions.
> >
> >
> > * Question: Performance.
> > What is the cost of accessing an lcore variable vs a variable in TLS?
> > I suppose the relative cost diminishes if the variable is a larger
> struct, compared to a simple uint64_t.
> >
> 
> In case all the relevant data is available in a cache close to the
> core,
> both options carry quite low overhead.
> 
> Accessing a lcore variable will always require a TLS lookup, in the
> form
> of retrieving the lcore_id of the current thread. In that sense, there
> will likely be a number of extra instructions required to do the lcore
> variable address lookup (i.e., doing the load from rte_lcore_var table
> based on the lcore_id you just looked up, and adding the variable's
> offset).
> 
> A TLS lookup will incur an extra overhead of less than a clock cycle,
> compared to accessing a non-TLS static variable, in case static linking
> is used. For shared objects, TLS is much more expensive (something
> often
> visible in dynamically linked DPDK app flame graphs, in the form of the
> __tls_addr symbol). Then you need to add ~3 cc/access. This on a micro
> benchmark running on a x86_64 Raptor Lake P-core.
> 
> (To visialize the difference between shared object and not, one can use
> Compiler Explorer and -fPIC versus -fPIE.)
> 
> Things get more complicated if you access the same variable in the same
> section code, since then it can be left on the stack/in a register by
> the compiler, especially if LTO is used. In other words, if you do
> rte_lcore_id() several times in a row, only the first one will cost you
> anything. This happens fairly often in DPDK, with rte_lcore_id().
> 
> Finally, if you do something like
> 
> diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
> index af9fffd81b..a65c30d27e 100644
> --- a/lib/eal/common/rte_random.c
> +++ b/lib/eal/common/rte_random.c
> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
>   static __rte_always_inline
>   struct rte_rand_state *__rte_rand_get_state(void)
>   {
> -       unsigned int idx;
> -
> -       idx = rte_lcore_id();
> -
> -       if (unlikely(idx == LCORE_ID_ANY))
> -               return &unregistered_rand_state;
> -
> -       return RTE_LCORE_VAR_PTR(rand_state);
> +       return &unregistered_rand_state;
>   }
> 
>   uint64_t
> 
> ...and re-run the rand_perf_autotest, at least I see no difference at
> all (in a statically linked build). Both results in rte_rand() using
> ~11
> cc/call. What that suggests is that TLS overhead is very small, and
> that
> any extra instructions required by lcore variables doesn't add much, if
> anything at all, at least in this particular case.

Excellent. Thank you for a thorough and detailed answer, Mattias.

> 
> > Some of my suggestions below might also affect performance.
> >
> >
> > * Advantage: Provides direct access to worker thread variables.
> > With the current alternative (thread-local storage), the main thread
> cannot access the TLS variables of the worker threads,
> > unless worker threads publish global access pointers.
> > Lcore variables of any lcore thread can be direcctly accessed by any
> thread, which simplifies code.
> >
> >
> > * Advantage: Roadmap towards hugemem.
> > It would be nice if the lcore variable memory was allocated in
> hugemem, to reduce TLB misses.
> > The current alternative (thread-local storage) is also not using
> hugement, so not a degradation.
> >
> 
> I agree, but the thing is it's hard to figure out how much memory is
> required for these kind of variables, given how DPDK is built and
> linked. In an OS kernel, you can just take all the symbols, put them in
> a special section, and size that section. Such a thing can't easily be
> done with DPDK, since shared object builds are supported, plus that
> this
> facility should be available not only to DPDK modules, but also the
> application, so relying on linker scripts isn't really feasible (not
> probably not even feasible for DPDK itself).
> 
> In that scenario, you want to size up the per-lcore buffer to be so
> large, you don't have to worry about overruns. That will waste memory.
> If you use huge page memory, paging can't help you to avoid
> pre-allocating actual physical memory.

Good point.
I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE), but I hadn't considered how paging helps us use less physical memory than that.

> 
> That said, even large (by static per-lcore data standards) buffers are
> potentially small enough not to grow the amount of memory used by a
> DPDK
> process too much. You need to provision for RTE_MAX_LCORE of them
> though.
> 
> The value of lcore variables should be small, and thus incur few TLB
> misses, so you may not gain much from huge pages. In my world, it's
> more
> about "fitting often-used per-lcore data into L1 or L2 CPU caches",
> rather than the easier "fitting often-used per-lcore data into a
> working
> set size reasonably expected to be covered by hardware TLB/caches".

Yes, I suppose that lcore variables are intended to be small, and large per-lcore structures should keep following the current design patterns for allocation and access.

Perhaps this guideline is worth mentioning in the documentation.

> 
> > Lcore variables are available very early at startup, so I guess the
> RTE memory allocator is not yet available.
> > Hugemem could be allocated using O/S allocation, so there is a
> possible road towards using hugemem.
> >
> 
> With the current design, that true. I'm not sure it's a strict
> requirement though, but it does makes things simpler.
> 
> > Either way, using hugement would require one more indirection (the
> pointer to the allocated hugemem).
> > I don't know which has better performance, using hugemem or avoiding
> the additional pointer dereferencing.
> >
> >
> > * Suggestion: Consider adding an entry for unregistered non-EAL
> threads.
> > Please consider making room for one more entry, shared by all
> unregistered non-EAL threads, i.e.
> > making the array size RTE_MAX_LCORE + 1 and indexing by
> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
> >
> > It would be convenient for the use cases where a variable shared by
> the unregistered non-EAL threads don't need special treatment.
> >
> 
> I thought about this, but it would require a conditional in the lookup
> macro, as you show. More importantly, it would make the whole
> <rte_lcore_var.h> thing less elegant and harder to understand. It's bad
> enough that "per-lcore" is actually "per-lcore id" (or the equivalent
> "per-EAL thread and unregistered EAL-thread"). Adding a "btw it's <what
> I said before> + 1" is not an improvement.

We could promote the "one more entry for unregistered non-EAL threads" design pattern (for relevant use cases only!) by extending EAL with one more TLS variable, maintained like _thread_id, but set to RTE_MAX_LCORE when _thread_id is set to -1:

+++ eal_common_thread.c:
  RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
+ RTE_DEFINE_PER_LCORE(int, _lcore_idx) = RTE_MAX_LCORE;

and

+++ rte_lcore.h:
static inline unsigned
rte_lcore_id(void)
{
	return RTE_PER_LCORE(_lcore_id);
}
+ static inline unsigned
+ rte_lcore_idx(void)
+ {
+ 	return RTE_PER_LCORE(_lcore_idx);
+ }

That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently used.
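
With such an index, the extra-entry lookup sketched earlier in this
thread would reduce to an unconditional array access (hypothetical
names again):

static struct foo_state foo_states[RTE_MAX_LCORE + 1];

static inline struct foo_state *
foo_get_state(void)
{
	return &foo_states[rte_lcore_idx()];
}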

> 
> But useful? Sure.
> 
> I think you may still need other data for dealing with unregistered
> threads, for example a mutex or spin lock to deal with concurrency
> issues that arises with shared data.

Adding the extra entry is only for the benefit of use cases where special handling is not required. It will make the code for those use cases much cleaner. I think it is useful.

Use cases requiring special handling should still do the special handling they do today.

> 
> There may also be cases were you are best off by simply disallowing
> unregistered threads from calling into that API.
> 
> > Obviously, this might affect performance.
> > If the performance cost is not negligble, the addtional entry (and
> indexing branch) could be disabled at build time.
> >
> >
> > * Suggestion: Do not fix the alignment at 16 byte.
> > Pass an alignment parameter to rte_lcore_var_alloc() and use
> alignof() when calling it:
> >
> > +#include <stdalign.h>
> > +
> > +#define RTE_LCORE_VAR_ALLOC(name)			\
> > +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> > +
> > +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
> > +	name = rte_lcore_var_alloc(size, alignment)
> > +
> > +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
> > +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> > +
> > + +++ /cconfig/rte_config.h
> > +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
> >
> >
> 
> That seems like a very good idea. I'll look into it.
> 
> > * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but
> behaves differently.
> >
> >> +/**
> >> + * Iterate over each lcore id's value for a lcore variable.
> >> + */
> >> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
> >> +	for (unsigned int lcore_id =					\
> >> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
> >> +	     lcore_id < RTE_MAX_LCORE;					\
> >> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> >> +
> >
> > The macro name RTE_LCORE_VAR_FOREACH() resembles
> RTE_LCORE_FOREACH(i), which only iterates on running cores.
> > You might want to give it a name that differs more.
> >
> 
> True.
> 
> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
> confusion,
> for sure.
> 
> Being consistent with <rte_lcore.h> is not so easy, since it's not even
> consistent with itself. For example, rte_lcore_count() returns the
> number of lcores (EAL threads) *plus the number of registered non-EAL
> threads*, and RTE_LCORE_FOREACH() give a different count. :)

Naming is hard. I don't have a good name, and can only offer inspiration...

<rte_lcore.h> has RTE_LCORE_FOREACH() and its RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.

Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a variant.

> 
> > If it wasn't for API breakage, I would suggest renaming
> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
> >
> > Small detail: "var" is a pointer, so consider renaming it to "ptr"
> and adding _PTR to the macro name.
> 
> The "var" name comes from how <sys/queue.h> names things. I think I had
> it as "ptr" initially. I'll change it back.

Thanks.

> 
> Thanks a lot Morten.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-09 13:04       ` Morten Brørup
@ 2024-02-19  7:49         ` Mattias Rönnblom
  2024-02-19 11:10           ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  7:49 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-09 14:04, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Friday, 9 February 2024 12.46
>>
>> On 2024-02-09 09:25, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>> Sent: Thursday, 8 February 2024 19.17
>>>>
>>>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>>>
>>>> An lcore variable has one value for every current and future lcore
>>>> id-equipped thread.
>>>>
>>>> The primary <rte_lcore_var.h> use case is for statically allocating
>>>> small chunks of often-used data, which is related logically, but
>> where
>>>> there are performance benefits to reap from having updates being
>> local
>>>> to an lcore.
>>>>
>>>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>>>> _Thread_local), but decoupling the values' life time with that of
>> the
>>>> threads.
>>>>
>>>> Lcore variables are also similar in terms of functionality provided
>> by
>>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>>>> build-time machinery. DPCPU uses linker scripts, which effectively
>>>> prevents the reuse of its, otherwise seemingly viable, approach.
>>>>
>>>> The currently-prevailing way to solve the same problem as lcore
>>>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-
>> sized
>>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>>>> lcore variables over this approach is that data related to the same
>>>> lcore now is close (spatially, in memory), rather than data used by
>>>> the same module, which in turn avoid excessive use of padding,
>>>> polluting caches with unused data.
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>
>>> This looks very promising. :-)
>>>
>>> Here's a bunch of comments, questions and suggestions.
>>>
>>>
>>> * Question: Performance.
>>> What is the cost of accessing an lcore variable vs a variable in TLS?
>>> I suppose the relative cost diminishes if the variable is a larger
>> struct, compared to a simple uint64_t.
>>>
>>
>> In case all the relevant data is available in a cache close to the
>> core,
>> both options carry quite low overhead.
>>
>> Accessing a lcore variable will always require a TLS lookup, in the
>> form
>> of retrieving the lcore_id of the current thread. In that sense, there
>> will likely be a number of extra instructions required to do the lcore
>> variable address lookup (i.e., doing the load from rte_lcore_var table
>> based on the lcore_id you just looked up, and adding the variable's
>> offset).
>>
>> A TLS lookup will incur an extra overhead of less than a clock cycle,
>> compared to accessing a non-TLS static variable, in case static linking
>> is used. For shared objects, TLS is much more expensive (something
>> often
>> visible in dynamically linked DPDK app flame graphs, in the form of the
>> __tls_addr symbol). Then you need to add ~3 cc/access. This on a micro
>> benchmark running on a x86_64 Raptor Lake P-core.
>>
>> (To visialize the difference between shared object and not, one can use
>> Compiler Explorer and -fPIC versus -fPIE.)
>>
>> Things get more complicated if you access the same variable in the same
>> section code, since then it can be left on the stack/in a register by
>> the compiler, especially if LTO is used. In other words, if you do
>> rte_lcore_id() several times in a row, only the first one will cost you
>> anything. This happens fairly often in DPDK, with rte_lcore_id().
>>
>> Finally, if you do something like
>>
>> diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
>> index af9fffd81b..a65c30d27e 100644
>> --- a/lib/eal/common/rte_random.c
>> +++ b/lib/eal/common/rte_random.c
>> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state *state)
>>    static __rte_always_inline
>>    struct rte_rand_state *__rte_rand_get_state(void)
>>    {
>> -       unsigned int idx;
>> -
>> -       idx = rte_lcore_id();
>> -
>> -       if (unlikely(idx == LCORE_ID_ANY))
>> -               return &unregistered_rand_state;
>> -
>> -       return RTE_LCORE_VAR_PTR(rand_state);
>> +       return &unregistered_rand_state;
>>    }
>>
>>    uint64_t
>>
>> ...and re-run the rand_perf_autotest, at least I see no difference at
>> all (in a statically linked build). Both results in rte_rand() using
>> ~11
>> cc/call. What that suggests is that TLS overhead is very small, and
>> that
>> any extra instructions required by lcore variables doesn't add much, if
>> anything at all, at least in this particular case.
> 
> Excellent. Thank you for a thorough and detailed answer, Mattias.
> 
>>
>>> Some of my suggestions below might also affect performance.
>>>
>>>
>>> * Advantage: Provides direct access to worker thread variables.
>>> With the current alternative (thread-local storage), the main thread
>> cannot access the TLS variables of the worker threads,
>>> unless worker threads publish global access pointers.
>>> Lcore variables of any lcore thread can be direcctly accessed by any
>> thread, which simplifies code.
>>>
>>>
>>> * Advantage: Roadmap towards hugemem.
>>> It would be nice if the lcore variable memory was allocated in
>> hugemem, to reduce TLB misses.
>>> The current alternative (thread-local storage) is also not using
>> hugement, so not a degradation.
>>>
>>
>> I agree, but the thing is it's hard to figure out how much memory is
>> required for these kind of variables, given how DPDK is built and
>> linked. In an OS kernel, you can just take all the symbols, put them in
>> a special section, and size that section. Such a thing can't easily be
>> done with DPDK, since shared object builds are supported, plus that
>> this
>> facility should be available not only to DPDK modules, but also the
>> application, so relying on linker scripts isn't really feasible (not
>> probably not even feasible for DPDK itself).
>>
>> In that scenario, you want to size up the per-lcore buffer to be so
>> large, you don't have to worry about overruns. That will waste memory.
>> If you use huge page memory, paging can't help you to avoid
>> pre-allocating actual physical memory.
> 
> Good point.
> I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE), but I hadn't considered how paging helps us use less physical memory than that.
> 
>>
>> That said, even large (by static per-lcore data standards) buffers are
>> potentially small enough not to grow the amount of memory used by a
>> DPDK
>> process too much. You need to provision for RTE_MAX_LCORE of them
>> though.
>>
>> The value of lcore variables should be small, and thus incur few TLB
>> misses, so you may not gain much from huge pages. In my world, it's
>> more
>> about "fitting often-used per-lcore data into L1 or L2 CPU caches",
>> rather than the easier "fitting often-used per-lcore data into a
>> working
>> set size reasonably expected to be covered by hardware TLB/caches".
> 
> Yes, I suppose that lcore variables are intended to be small, and large per-lcore structures should keep following the current design patterns for allocation and access.
> 

It seems to me that support for per-lcore heaps should be the solution 
for supporting use cases requiring many, larger and/or dynamic objects 
on a per-lcore basis.

Ideally, you would design both that mechanism and lcore variables 
together, but if you couple enough improvements together 
you will never get anywhere. An instance of where perfect is the enemy 
of good, perhaps.

> Perhaps this guideline is worth mentioning in the documentation.
> 

What is missing, more specifically? The size limitation and the static 
nature of lcore variables are described, and what current design patterns 
they are expected to (partly) replace is also covered.

>>
>>> Lcore variables are available very early at startup, so I guess the
>> RTE memory allocator is not yet available.
>>> Hugemem could be allocated using O/S allocation, so there is a
>> possible road towards using hugemem.
>>>
>>
>> With the current design, that true. I'm not sure it's a strict
>> requirement though, but it does makes things simpler.
>>
>>> Either way, using hugement would require one more indirection (the
>> pointer to the allocated hugemem).
>>> I don't know which has better performance, using hugemem or avoiding
>> the additional pointer dereferencing.
>>>
>>>
>>> * Suggestion: Consider adding an entry for unregistered non-EAL
>> threads.
>>> Please consider making room for one more entry, shared by all
>> unregistered non-EAL threads, i.e.
>>> making the array size RTE_MAX_LCORE + 1 and indexing by
>> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
>>>
>>> It would be convenient for the use cases where a variable shared by
>> the unregistered non-EAL threads don't need special treatment.
>>>
>>
>> I thought about this, but it would require a conditional in the lookup
>> macro, as you show. More importantly, it would make the whole
>> <rte_lcore_var.h> thing less elegant and harder to understand. It's bad
>> enough that "per-lcore" is actually "per-lcore id" (or the equivalent
>> "per-EAL thread and unregistered EAL-thread"). Adding a "btw it's <what
>> I said before> + 1" is not an improvement.
> 
> We could promote "one more entry for unregistered non-EAL threads" design pattern (for relevant use cases only!) by extending EAL with one more TLS variable, maintained like _thread_id, but set to RTE_MAX_LCORE when _tread_id is set to -1:
> 
> +++ eal_common_thread.c:
>    RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
> + RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;
> 
> and
> 
> +++ rte_lcore.h:
> static inline unsigned
> rte_lcore_id(void)
> {
> 	return RTE_PER_LCORE(_lcore_id);
> }
> + static inline unsigned
> + rte_lcore_idx(void)
> + {
> + 	return RTE_PER_LCORE(_lcore_idx);
> + }
> 
> That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently used.
> 

Wouldn't that effectively give a shared lcore id to all unregistered 
threads?

We definitely shouldn't further complicate anything related to the DPDK 
threading model, in my opinion.

If a module needs one or more variable instances that aren't per lcore, 
use regular static allocation instead. I would favor clarity over 
convenience here, at least until we know better (see below as well).

>>
>> But useful? Sure.
>>
>> I think you may still need other data for dealing with unregistered
>> threads, for example a mutex or spin lock to deal with concurrency
>> issues that arises with shared data.
> 
> Adding the extra entry is only for the benefit of use cases where special handling is not required. It will make the code for those use cases much cleaner. I think it is useful.
> 

It will make it shorter, but not cleaner, I would argue.

> Use cases requiring special handling should still do the special handling they do today.
> 

For DPDK modules using lcore variables and which treat unregistered 
threads as "full citizens", I expect special handling of unregistered 
threads to be the norm. Take rte_random.h as an example. Current API 
does not guarantee MT safety for concurrent calls of unregistered 
threads. It probably should, and it should probably be by means of a 
mutex (not spinlock).

The reason I'm not running off to make a rte_random.c patch is that it's 
unclear to me what the role of unregistered threads in DPDK is. I'm 
reasonably comfortable with a model where there are many threads that 
basically don't interact with the DPDK APIs (except maybe some very 
narrow exposure, like the preemption-safe ring variant). One example of 
such a design would be a big, slow control plane which uses multi-threading 
and the Linux process scheduler for work scheduling, hosted in the same 
process as a DPDK data plane app.

What I find stranger is a scenario where there are unregistered 
threads which interact with a wide variety of DPDK APIs, do so 
at high rates/with high performance requirements, and are expected to be 
preemption-safe. So they are basically EAL threads without an lcore id.

Support for that latter scenario has also been voiced, in previous 
discussions, from what I recall.

I think it's hard to answer the question of an "unregistered thread 
spare" for lcore variables without first knowing what the future should 
look like for unregistered threads in DPDK, in terms of being able to 
call into DPDK APIs, preemption-safety guarantees, etc.

It seems that until you have a clearer picture of how generally to treat 
unregistered threads, you are best off with just a per-lcore id instance 
of lcore variables.

>>
>> There may also be cases were you are best off by simply disallowing
>> unregistered threads from calling into that API.
>>
>>> Obviously, this might affect performance.
>>> If the performance cost is not negligble, the addtional entry (and
>> indexing branch) could be disabled at build time.
>>>
>>>
>>> * Suggestion: Do not fix the alignment at 16 byte.
>>> Pass an alignment parameter to rte_lcore_var_alloc() and use
>> alignof() when calling it:
>>>
>>> +#include <stdalign.h>
>>> +
>>> +#define RTE_LCORE_VAR_ALLOC(name)			\
>>> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
>>> +
>>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)	\
>>> +	name = rte_lcore_var_alloc(size, alignment)
>>> +
>>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
>>> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
>>> +
>>> + +++ /cconfig/rte_config.h
>>> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
>>>
>>>
>>
>> That seems like a very good idea. I'll look into it.
>>
>>> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(), but
>> behaves differently.
>>>
>>>> +/**
>>>> + * Iterate over each lcore id's value for a lcore variable.
>>>> + */
>>>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
>>>> +	for (unsigned int lcore_id =					\
>>>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
>>>> +	     lcore_id < RTE_MAX_LCORE;					\
>>>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>>>> +
>>>
>>> The macro name RTE_LCORE_VAR_FOREACH() resembles
>> RTE_LCORE_FOREACH(i), which only iterates on running cores.
>>> You might want to give it a name that differs more.
>>>
>>
>> True.
>>
>> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
>> confusion,
>> for sure.
>>
>> Being consistent with <rte_lcore.h> is not so easy, since it's not even
>> consistent with itself. For example, rte_lcore_count() returns the
>> number of lcores (EAL threads) *plus the number of registered non-EAL
>> threads*, and RTE_LCORE_FOREACH() give a different count. :)
> 
> Naming is hard. I don't have a good name, and can only offer inspiration...
> 
> <rte_lcore.h> has RTE_LCORE_FOREACH() and its RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.
> 
> Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a variant.
> 
>>
>>> If it wasn't for API breakage, I would suggest renaming
>> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
>>>
>>> Small detail: "var" is a pointer, so consider renaming it to "ptr"
>> and adding _PTR to the macro name.
>>
>> The "var" name comes from how <sys/queue.h> names things. I think I had
>> it as "ptr" initially. I'll change it back.
> 
> Thanks.
> 
>>
>> Thanks a lot Morten.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v2 0/5] Lcore variables
  2024-02-08 18:16 ` [RFC 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-09  8:25   ` Morten Brørup
@ 2024-02-19  9:40   ` Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                       ` (4 more replies)
  1 sibling, 5 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question on how to best allocate static per-lcore memory has been
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple and
clean and seemingly very performant solution to a real problem.

One thing that is unclear to the author is how this API relates to a
potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Contrary to what the version.map edit suggests, this RFC is not meant
as a proposal for DPDK 24.03.

Mattias Rönnblom (5):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable

 app/test/meson.build                  |   1 +
 app/test/test_lcore_var.c             | 408 ++++++++++++++++++++++++++
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  82 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_random.c           |  30 +-
 lib/eal/common/rte_service.c          | 119 ++++----
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 374 +++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 lib/power/rte_power_pmd_mgmt.c        |  27 +-
 12 files changed, 973 insertions(+), 76 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v2 1/5] eal: add static per-lcore memory allocation facility
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 2/5] eal: add lcore variable test suite Mattias Rönnblom
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of often-used data, which is related logically, but where
there are performance benefits to reap from having updates being local
to an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decoupling the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its, otherwise seemingly viable, approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore is now close (spatially, in memory), rather than data used by
the same module, which in turn avoids excessive use of padding,
polluting caches with unused data.

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  82 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 374 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 7 files changed, 464 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index da265d7dd2..884482e473 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -30,6 +30,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index a6a768bd7c..bb06bb7ca1 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -98,6 +98,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..dfd11cbd0b
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define WARN_THRESHOLD 75
+
+/*
+ * Avoid using offset zero, since it would result in a NULL-value
+ * "handle" (offset) pointer, which in principle and per the API
+ * definition shouldn't be an issue, but may confuse some tools and
+ * users.
+ */
+#define INITIAL_OFFSET 1
+
+char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
+
+static uintptr_t allocated = INITIAL_OFFSET;
+
+static void
+verify_allocation(uintptr_t new_allocated)
+{
+	static bool has_warned;
+
+	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
+
+	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
+	    !has_warned) {
+		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
+			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
+			RTE_MAX_LCORE_VAR);
+		has_warned = true;
+	}
+}
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
+
+	void *offset = (void *)new_allocated;
+
+	new_allocated += size;
+
+	verify_allocation(new_allocated);
+
+	allocated = new_allocated;
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return offset;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer cache-line aligned,
+	 * assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..4434fc21ef
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,374 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. In other words,
+ * there's one copy of its value for each and every current and future
+ * lcore id-equipped thread, with the total number of copies amounting
+ * to \c RTE_MAX_LCORE.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a \c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). A handle may be passed between modules and
+ * threads just like any pointer, but its value is not the address of
+ * any particular object, but rather just an opaque identifier, stored
+ * in a typed pointer (to inform the access macro the type of values).
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define a lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
+ *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
+ *     module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * generally only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids *may* be frequently read or written by the owners without the
+ * risk of false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomics) should
+ * be employed to assure there are no data races between the owning
+ * thread and any non-owner threads accessing the same lcore variable
+ * instance.
+ *
+ * The value of the lcore variable for a particular lcore id may be
+ * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
+ * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * To modify the value of an lcore variable for a particular lcore id,
+ * either access the object through the pointer retrieved by \ref
+ * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
+ * RTE_LCORE_VAR_LCORE_SET.
+ *
+ * The access macros each have a short-hand which may be used by an EAL
+ * thread or registered non-EAL thread to access the lcore variable
+ * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
+ * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
+ *
+ * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier. The
+ * *identifier* value is common across all lcore ids.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like \c int,
+ * but would more typically be a \c struct. An application may choose
+ * to define an lcore variable, which it then goes on to never
+ * allocate.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * The sum of all lcore variables, plus any padding required, must be
+ * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
+ * violation of this maximum results in the process being terminated.
+ *
+ * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
+ * same order of magnitude in size as a thread stack.
+ *
+ * The lcore variable storage buffers are kept in the BSS section in
+ * the resulting binary, where data generally isn't mapped in until
+ * it's accessed. This means that unused portions of the lcore
+ * variable storage area will not occupy any physical memory (with a
+ * granularity of the memory page size [usually 4 kB]).
+ *
+ * Lcore variables should generally *not* be \ref __rte_cache_aligned
+ * and need *not* include a \ref RTE_CACHE_GUARD field, since the use
+ * of these constructs is meant to avoid false sharing. In the
+ * case of an lcore variable instance, all nearby data structures
+ * should almost-always be written to by a single thread (the lcore
+ * variable owner). Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * \endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * } __rte_cache_aligned;
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * \endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions. A next-line prefetcher, for example, may well
+ * work the way its designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to \ref rte_lcore_var.h is the \ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., \ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follows that of the particular thread. The data cannot
+ *     be accessed before the thread has been created, nor after it
+ *     has exited. One effect of this is that thread-local variables
+ *     must be initialized in a "lazy" manner (e.g., at the point of
+ *     thread creation). Lcore variables may be accessed immediately
+ *     after having been allocated (which is usually prior to any
+ *     thread beyond the main thread running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or an increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id, and thus
+ *     not for such "regular" threads.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread, and with _Thread_local as implemented by GCC,
+ *     such data sharing is supported. In the C11 standard, the result
+ *     of accessing another thread's _Thread_local object is
+ *     implementation-defined. Lcore variable instances may be
+ *     accessed reliably by any thread, as sketched below.
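+ *
+ * A minimal sketch of such cross-thread sharing, reusing the
+ * \c lcore_states handle from the example above:
+ *
+ * \code{.c}
+ * struct foo_lcore_state *
+ * foo_get_state(unsigned int lcore_id)
+ * {
+ *         return RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_states);
+ * }
+ * \endcode
+ *
+ * The returned pointer remains valid regardless of whether the thread
+ * with that lcore id has been created yet or has already exited,
+ * something which has no reliable TLS counterpart.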
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various per-lcore id instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore variable handle, rather than a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
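+ *
+ * A usage sketch (the handle name and buffer size are hypothetical):
+ *
+ * \code{.c}
+ * static RTE_LCORE_VAR_HANDLE(unsigned char, scratch);
+ *
+ * RTE_INIT(scratch_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(scratch, 1024,
+ *                                        RTE_CACHE_LINE_SIZE);
+ * }
+ * \endcode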
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)	\
+	name = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
+	name = rte_lcore_var_alloc(size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC(name)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)), alignof(*(name)))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a \ref
+ * RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
+	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
+	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+/**
+ * Get the value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
+
+/**
+ * Set the value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
+
+/**
+ * Get the value of the lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
+
+/**
+ * Set the value of the lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_SET(name, value) \
+	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
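+ *
+ * A usage sketch, summing across all lcore ids (the already-allocated
+ * \c uint64_t lcore variable handle \c counter is an assumption):
+ *
+ * \code{.c}
+ * uint64_t total = 0;
+ * uint64_t *value;
+ *
+ * RTE_LCORE_VAR_FOREACH_VALUE(value, counter)
+ *         total += *value;
+ * \endcode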
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)				\
+	for (unsigned int lcore_id =					\
+		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable, use
+ * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
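+ *
+ * Most callers would use the \ref RTE_LCORE_VAR_ALLOC family of
+ * macros rather than calling this function directly. A direct-call
+ * sketch (the \c struct \c foo type and the handle name are
+ * hypothetical):
+ *
+ * \code{.c}
+ * static RTE_LCORE_VAR_HANDLE(struct foo, foo_handle);
+ *
+ * RTE_INIT(foo_init)
+ * {
+ *         foo_handle = rte_lcore_var_alloc(sizeof(struct foo),
+ *                                          alignof(struct foo));
+ * }
+ * \endcode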
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than \c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The id of the variable, stored in a void pointer value.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 5e0cd47c82..e90b86115a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -393,6 +393,10 @@ EXPERIMENTAL {
 	# added in 23.07
 	rte_memzone_max_get;
 	rte_memzone_max_set;
+
+	# added in 24.03
+	rte_lcore_var;
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v2 2/5] eal: add lcore variable test suite
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

RFC v2:
 * Improve alignment-related test coverage.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 408 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 409 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 6389ae83ee..93412cce51 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..310d32e10d
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,408 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static bool
+rand_bool(void)
+{
+	return rte_rand() & 1;
+}
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_PTR(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal;
+
+	if (rand_bool())
+		equal = RTE_LCORE_VAR_GET(test_int) == state->old_value;
+	else
+		equal = *(RTE_LCORE_VAR_PTR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	if (rand_bool())
+		RTE_LCORE_VAR_SET(test_int, state->new_value);
+	else
+		*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		RTE_LCORE_VAR_LCORE_SET(lcore_id, test_int, state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		TEST_ASSERT_EQUAL(state->new_value,
+				  RTE_LCORE_VAR_LCORE_GET(lcore_id, test_int),
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_PTR(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_struct);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state
+{
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_PTR(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(RTE_LCORE_VAR_LCORE_GET(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_array);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (RTE_MAX_LCORE_VAR / 2)
+
+static int
+test_many_lvars(void)
+{
+	void **handlers = malloc(sizeof(void *) * MANY_LVARS);
+	int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		void *handle = rte_lcore_var_alloc(1, 1);
+
+		uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), handle);
+
+		*b = (uint8_t)i;
+
+		handlers[i] = handle;
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_FOREACH_WORKER(lcore_id) {
+			uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(),
+							       handlers[i]);
+			TEST_ASSERT_EQUAL((uint8_t)i, *b,
+					  "Unexpected lcore variable value.");
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v2 3/5] random: keep PRNG state in lcore variable
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 2/5] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  2024-02-19 11:22       ` Morten Brørup
  2024-02-19  9:40     ` [RFC v2 4/5] power: keep per-lcore " Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 5/5] service: " Mattias Rönnblom
  4 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_random.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 7709b8f2c6..af9fffd81b 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct rte_rand_state {
@@ -19,14 +20,12 @@ struct rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
-} __rte_cache_aligned;
+};
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state __rte_cache_aligned;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_PTR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v2 4/5] power: keep per-lcore state in lcore variable
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
                       ` (2 preceding siblings ...)
  2024-02-19  9:40     ` [RFC v2 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  2024-02-19  9:40     ` [RFC v2 5/5] service: " Mattias Rönnblom
  4 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/power/rte_power_pmd_mgmt.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 591fc69f36..bb20e564de 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -68,8 +69,8 @@ struct pmd_core_cfg {
 	/**< Number of queues ready to enter power optimized state */
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
-} __rte_cache_aligned;
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+};
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -772,10 +770,13 @@ RTE_INIT(rte_power_ethdev_pmgmt_init) {
 	size_t i;
 	int j;
 
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
+
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		struct pmd_core_cfg *lcore_cfg =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_cfgs);
+		TAILQ_INIT(&lcore_cfg->head);
 	}
 
 	/* initialize config defaults */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v2 5/5] service: keep per-lcore state in lcore variable
  2024-02-19  9:40   ` [RFC v2 0/5] Lcore variables Mattias Rönnblom
                       ` (3 preceding siblings ...)
  2024-02-19  9:40     ` [RFC v2 4/5] power: keep per-lcore " Mattias Rönnblom
@ 2024-02-19  9:40     ` Mattias Rönnblom
  4 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-19  9:40 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 119 ++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index d959c91459..de205c5da5 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,11 +102,12 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
+	else {
+		struct core_state *cs;
+		RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+			memset(cs, 0, sizeof(struct core_state));
 	}
 
 	int i;
@@ -122,7 +124,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +137,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +286,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +293,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +454,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +467,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +489,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +535,16 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs =
+		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +552,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +573,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +590,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +642,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +694,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +712,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +737,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +761,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +785,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +815,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +824,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +849,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +860,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +868,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +876,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +885,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +901,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +948,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +977,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +989,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1028,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-19  7:49         ` Mattias Rönnblom
@ 2024-02-19 11:10           ` Morten Brørup
  2024-02-19 14:31             ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-02-19 11:10 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 19 February 2024 08.49
> 
> On 2024-02-09 14:04, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Friday, 9 February 2024 12.46
> >>
> >> On 2024-02-09 09:25, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>> Sent: Thursday, 8 February 2024 19.17
> >>>>
> >>>> Introduce DPDK per-lcore id variables, or lcore variables for
> short.
> >>>>
> >>>> An lcore variable has one value for every current and future lcore
> >>>> id-equipped thread.
> >>>>
> >>>> The primary <rte_lcore_var.h> use case is for statically
> allocating
> >>>> small chunks of often-used data, which is related logically, but
> >> where
> >>>> there are performance benefits to reap from having updates being
> >> local
> >>>> to an lcore.
> >>>>
> >>>> Lcore variables are similar to thread-local storage (TLS, e.g.,
> C11
> >>>> _Thread_local), but decoupling the values' life time with that of
> >> the
> >>>> threads.
> >>>>
> >>>> Lcore variables are also similar in terms of functionality
> provided
> >> by
> >>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >>>> build-time machinery. DPCPU uses linker scripts, which effectively
> >>>> prevents the reuse of its, otherwise seemingly viable, approach.
> >>>>
> >>>> The currently-prevailing way to solve the same problem as lcore
> >>>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-
> >> sized
> >>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> >>>> lcore variables over this approach is that data related to the
> same
> >>>> lcore now is close (spatially, in memory), rather than data used
> by
> >>>> the same module, which in turn avoid excessive use of padding,
> >>>> polluting caches with unused data.
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>> ---
> >>>
> >>> This looks very promising. :-)
> >>>
> >>> Here's a bunch of comments, questions and suggestions.
> >>>
> >>>
> >>> * Question: Performance.
> >>> What is the cost of accessing an lcore variable vs a variable in
> TLS?
> >>> I suppose the relative cost diminishes if the variable is a larger
> >> struct, compared to a simple uint64_t.
> >>>
> >>
> >> In case all the relevant data is available in a cache close to the
> >> core,
> >> both options carry quite low overhead.
> >>
> >> Accessing a lcore variable will always require a TLS lookup, in the
> >> form
> >> of retrieving the lcore_id of the current thread. In that sense,
> there
> >> will likely be a number of extra instructions required to do the
> lcore
> >> variable address lookup (i.e., doing the load from rte_lcore_var
> table
> >> based on the lcore_id you just looked up, and adding the variable's
> >> offset).
> >>
> >> A TLS lookup will incur an extra overhead of less than a clock
> cycle,
> >> compared to accessing a non-TLS static variable, in case static
> linking
> >> is used. For shared objects, TLS is much more expensive (something
> >> often
> >> visible in dynamically linked DPDK app flame graphs, in the form of
> the
> >> __tls_addr symbol). Then you need to add ~3 cc/access. This on a
> micro
> >> benchmark running on a x86_64 Raptor Lake P-core.
> >>
> >> (To visualize the difference between shared object and not, one can
> use
> >> Compiler Explorer and -fPIC versus -fPIE.)
> >>
> >> Things get more complicated if you access the same variable in the
> same
> >> section code, since then it can be left on the stack/in a register
> by
> >> the compiler, especially if LTO is used. In other words, if you do
> >> rte_lcore_id() several times in a row, only the first one will cost
> you
> >> anything. This happens fairly often in DPDK, with rte_lcore_id().
> >>
> >> Finally, if you do something like
> >>
> >> diff --git a/lib/eal/common/rte_random.c
> b/lib/eal/common/rte_random.c
> >> index af9fffd81b..a65c30d27e 100644
> >> --- a/lib/eal/common/rte_random.c
> >> +++ b/lib/eal/common/rte_random.c
> >> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state
> *state)
> >>    static __rte_always_inline
> >>    struct rte_rand_state *__rte_rand_get_state(void)
> >>    {
> >> -       unsigned int idx;
> >> -
> >> -       idx = rte_lcore_id();
> >> -
> >> -       if (unlikely(idx == LCORE_ID_ANY))
> >> -               return &unregistered_rand_state;
> >> -
> >> -       return RTE_LCORE_VAR_PTR(rand_state);
> >> +       return &unregistered_rand_state;
> >>    }
> >>
> >>    uint64_t
> >>
> >> ...and re-run the rand_perf_autotest, at least I see no difference
> at
> >> all (in a statically linked build). Both results in rte_rand() using
> >> ~11
> >> cc/call. What that suggests is that TLS overhead is very small, and
> >> that
> >> any extra instructions required by lcore variables doesn't add much,
> if
> >> anything at all, at least in this particular case.
> >
> > Excellent. Thank you for a thorough and detailed answer, Mattias.
> >
> >>
> >>> Some of my suggestions below might also affect performance.
> >>>
> >>>
> >>> * Advantage: Provides direct access to worker thread variables.
> >>> With the current alternative (thread-local storage), the main
> thread
> >> cannot access the TLS variables of the worker threads,
> >>> unless worker threads publish global access pointers.
> >>> Lcore variables of any lcore thread can be directly accessed by
> any
> >> thread, which simplifies code.
> >>>
> >>>
> >>> * Advantage: Roadmap towards hugemem.
> >>> It would be nice if the lcore variable memory was allocated in
> >> hugemem, to reduce TLB misses.
> >>> The current alternative (thread-local storage) is also not using
> >> hugemem, so not a degradation.
> >>>
> >>
> >> I agree, but the thing is it's hard to figure out how much memory is
> >> required for these kind of variables, given how DPDK is built and
> >> linked. In an OS kernel, you can just take all the symbols, put them
> in
> >> a special section, and size that section. Such a thing can't easily
> be
> >> done with DPDK, since shared object builds are supported, plus that
> >> this
> >> facility should be available not only to DPDK modules, but also the
> >> application, so relying on linker scripts isn't really feasible (not
> >> probably not even feasible for DPDK itself).
> >>
> >> In that scenario, you want to size up the per-lcore buffer to be so
> >> large, you don't have to worry about overruns. That will waste
> memory.
> >> If you use huge page memory, paging can't help you to avoid
> >> pre-allocating actual physical memory.
> >
> > Good point.
> > I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE),
> but I hadn't considered how paging helps us use less physical memory
> than that.
> >
> >>
> >> That said, even large (by static per-lcore data standards) buffers
> are
> >> potentially small enough not to grow the amount of memory used by a
> >> DPDK
> >> process too much. You need to provision for RTE_MAX_LCORE of them
> >> though.
> >>
> >> The value of lcore variables should be small, and thus incur few TLB
> >> misses, so you may not gain much from huge pages. In my world, it's
> >> more
> >> about "fitting often-used per-lcore data into L1 or L2 CPU caches",
> >> rather than the easier "fitting often-used per-lcore data into a
> >> working
> >> set size reasonably expected to be covered by hardware TLB/caches".
> >
> > Yes, I suppose that lcore variables are intended to be small, and
> large per-lcore structures should keep following the current design
> patterns for allocation and access.
> >
> 
> It seems to me that support for per-lcore heaps should be the solution
> for supporting use cases requiring many, larger and/or dynamic objects
> on a per-lcore basis.
> 
> Ideally, you would design both that mechanism and lcore variables
> together, but then if you couple enough amount of improvements together
> you will never get anywhere. An instance of where perfect is the enemy
> of good, perhaps.

So true. :-)

> 
> > Perhaps this guideline is worth mentioning in the documentation.
> >
> 
> What is missing, more specifically? The size limitation and the static
> nature of lcore variables is described, and what current design
> patterns
> they expected to (partly) replace is also covered.

Your documentation is fine, and nothing specific is missing here.
I was thinking out loud that the high level DPDK documentation should describe common design patterns.

> 
> >>
> >>> Lcore variables are available very early at startup, so I guess the
> >> RTE memory allocator is not yet available.
> >>> Hugemem could be allocated using O/S allocation, so there is a
> >> possible road towards using hugemem.
> >>>
> >>
> >> With the current design, that's true. I'm not sure it's a strict
> >> requirement though, but it does make things simpler.
> >>
> >>> Either way, using hugemem would require one more indirection (the
> >> pointer to the allocated hugemem).
> >>> I don't know which has better performance, using hugemem or
> avoiding
> >> the additional pointer dereferencing.
> >>>
> >>>
> >>> * Suggestion: Consider adding an entry for unregistered non-EAL
> >> threads.
> >>> Please consider making room for one more entry, shared by all
> >> unregistered non-EAL threads, i.e.
> >>> making the array size RTE_MAX_LCORE + 1 and indexing by
> >> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
> >>>
> >>> It would be convenient for the use cases where a variable shared by
> >> the unregistered non-EAL threads don't need special treatment.
> >>>
> >>
> >> I thought about this, but it would require a conditional in the
> lookup
> >> macro, as you show. More importantly, it would make the whole
> >> <rte_lcore_var.h> thing less elegant and harder to understand. It's
> bad
> >> enough that "per-lcore" is actually "per-lcore id" (or the
> equivalent
> >> "per-EAL thread and unregistered EAL-thread"). Adding a "btw it's
> <what
> >> I said before> + 1" is not an improvement.
> >
> > We could promote "one more entry for unregistered non-EAL threads"
> design pattern (for relevant use cases only!) by extending EAL with one
> more TLS variable, maintained like _thread_id, but set to RTE_MAX_LCORE
> when _thread_id is set to -1:
> >
> > +++ eal_common_thread.c:
> >    RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
> > + RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;

Ups... wrong reference! I meant to refer to _lcore_id, not _thread_id. Correction:

We could promote "one more entry for unregistered non-EAL threads" design pattern (for relevant use cases only!) by extending EAL with one more TLS variable, maintained like _lcore_id, but set to RTE_MAX_LCORE when _lcore_id is set to LCORE_ID_ANY:

+++ eal_common_thread.c:
  RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;
+ RTE_DEFINE_PER_LCORE(unsigned int, _lcore_idx) = RTE_MAX_LCORE;

> >
> > and
> >
> > +++ rte_lcore.h:
> > static inline unsigned
> > rte_lcore_id(void)
> > {
> > 	return RTE_PER_LCORE(_lcore_id);
> > }
> > + static inline unsigned
> > + rte_lcore_idx(void)
> > + {
> > + 	return RTE_PER_LCORE(_lcore_idx);
> > + }
> >
> > That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ?
> rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently used.
> >
> 
> Wouldn't that effectively give a shared lcore id to all unregistered
> threads?

Yes, just like the rte_lcore_id() is LCORE_ID_ANY (i.e. UINT32_MAX) for all unregistered threads; but it will be usable for array indexing, behaving as a shadow variable of RTE_PER_LCORE(_lcore_id) for optimizing away the "rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE" when indexing.
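
A hypothetical sketch of what such an index enables (struct foo_state, foo_states and rte_lcore_idx() are assumed names, not existing API):

  /* one extra entry, shared by all unregistered non-EAL threads */
  static struct foo_state foo_states[RTE_MAX_LCORE + 1];

  struct foo_state *
  foo_get_state(void)
  {
          /* no LCORE_ID_ANY special case needed */
          return &foo_states[rte_lcore_idx()];
  }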

> 
> We definitely shouldn't further complicate anything related to the DPDK
> threading model, in my opinion.
> 
> If a module needs one or more variable instances that aren't per lcore,
> use regular static allocation instead. I would favor clarity over
> convenience here, at least until we know better (see below as well).
> 
> >>
> >> But useful? Sure.
> >>
> >> I think you may still need other data for dealing with unregistered
> >> threads, for example a mutex or spin lock to deal with concurrency
> >> issues that arises with shared data.
> >
> > Adding the extra entry is only for the benefit of use cases where
> special handling is not required. It will make the code for those use
> cases much cleaner. I think it is useful.
> >
> 
> It will make it shorter, but not less clean, I would argue.
> 
> > Use cases requiring special handling should still do the special
> handling they do today.
> >
> 
> For DPDK modules using lcore variables and which treat unregistered
> threads as "full citizens", I expect special handling of unregistered
> threads to be the norm. Take rte_random.h as an example. Current API
> does not guarantee MT safety for concurrent calls of unregistered
> threads. It probably should, and it should probably be by means of a
> mutex (not spinlock).
> 
> The reason I'm not running off to make a rte_random.c patch is that's
> it's unclear to me what is the role of unregistered threads in DPDK.
> I'm
> reasonably comfortable with a model where there are many threads that
> basically don't interact with the DPDK APIs (except maybe some very
> narrow exposure, like the preemption-safe ring variant). One example of
> such a design would be big slow control plane which uses multi-
> threading
> and the Linux process scheduler for work scheduling, hosted in the same
> process as a DPDK data plane app.
> 
> What I find more strange is a scenario where there are unregistered
> threads which interacts with a wide variety of DPDK APIs, does so
> at-high-rates/with-high-performance-requirements and are expected to be
> preemption-safe. So they are basically EAL threads without a lcore id.

Yes, this is happening in the wild.
E.g. our application has a mode where it uses fewer EAL threads, and processes more in non-EAL threads. That is, the same work is processed either by an EAL thread or a non-EAL thread, depending on the application's mode.
The extra array entry would be useful for such use cases.

> 
> Support for that latter scenario has also been voiced, in previous
> discussions, from what I recall.
> 
> I think it's hard to answer the question of a "unregistered thread
> spare" for lcore variables without first knowing what the future should
> look like for unregistered threads in DPDK, in terms of being able to
> call into DPDK APIs, preemption-safety guarantees, etc.
> 
> It seems that until you have a clearer picture of how generally to
> treat
> unregistered threads, you are best off with just a per-lcore id
> instance
> of lcore variables.

I get your point. It also reduces the risk of bugs caused by incorrect use of the additional entry.

I am arguing for a different angle: Providing the extra entry will help uncover relevant use cases.

> 
> >>
> >> There may also be cases were you are best off by simply disallowing
> >> unregistered threads from calling into that API.
> >>
> >>> Obviously, this might affect performance.
> >>> If the performance cost is not negligible, the additional entry (and
> >> indexing branch) could be disabled at build time.
> >>>
> >>>
> >>> * Suggestion: Do not fix the alignment at 16 byte.
> >>> Pass an alignment parameter to rte_lcore_var_alloc() and use
> >> alignof() when calling it:
> >>>
> >>> +#include <stdalign.h>
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC(name)			\
> >>> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)
> 	\
> >>> +	name = rte_lcore_var_alloc(size, alignment)
> >>> +
> >>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
> >>> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
> >>> +
> >>> + +++ /config/rte_config.h
> >>> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
> >>>
> >>>
> >>
> >> That seems like a very good idea. I'll look into it.
> >>
> >>> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(),
> but
> >> behaves differently.
> >>>
> >>>> +/**
> >>>> + * Iterate over each lcore id's value for a lcore variable.
> >>>> + */
> >>>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
> >>>> +	for (unsigned int lcore_id =					\
> >>>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);
> 	\
> >>>> +	     lcore_id < RTE_MAX_LCORE;
> 	\
> >>>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id,
> name))
> >>>> +
> >>>
> >>> The macro name RTE_LCORE_VAR_FOREACH() resembles
> >> RTE_LCORE_FOREACH(i), which only iterates on running cores.
> >>> You might want to give it a name that differs more.
> >>>
> >>
> >> True.
> >>
> >> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
> >> confusion,
> >> for sure.
> >>
> >> Being consistent with <rte_lcore.h> is not so easy, since it's not
> even
> >> consistent with itself. For example, rte_lcore_count() returns the
> >> number of lcores (EAL threads) *plus the number of registered non-
> EAL
> >> threads*, and RTE_LCORE_FOREACH() give a different count. :)
> >
> > Naming is hard. I don't have a good name, and can only offer
> inspiration...
> >
> > <rte_lcore.h> has RTE_LCORE_FOREACH() and its
> RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.
> >
> > Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a
> variant.
> >
> >>
> >>> If it wasn't for API breakage, I would suggest renaming
> >> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
> >>>
> >>> Small detail: "var" is a pointer, so consider renaming it to "ptr"
> >> and adding _PTR to the macro name.
> >>
> >> The "var" name comes from how <sys/queue.h> names things. I think I
> had
> >> it as "ptr" initially. I'll change it back.
> >
> > Thanks.
> >
> >>
> >> Thanks a lot Morten.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v2 3/5] random: keep PRNG state in lcore variable
  2024-02-19  9:40     ` [RFC v2 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-19 11:22       ` Morten Brørup
  2024-02-19 14:04         ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-02-19 11:22 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Monday, 19 February 2024 10.41
> 
> Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
> cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
> same state in a more cache-friendly lcore variable.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---

[...]

> @@ -19,14 +20,12 @@ struct rte_rand_state {
>  	uint64_t z3;
>  	uint64_t z4;
>  	uint64_t z5;
> -	RTE_CACHE_GUARD;
> -} __rte_cache_aligned;
> +};
> 
> -/* One instance each for every lcore id-equipped thread, and one
> - * additional instance to be shared by all others threads (i.e., all
> - * unregistered non-EAL threads).
> - */
> -static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
> +RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
> +
> +/* instance to be shared by all unregistered non-EAL threads */
> +static struct rte_rand_state unregistered_rand_state
> __rte_cache_aligned;

The unregistered_rand_state instance is still __rte_cache_aligned; consider also adding an RTE_CACHE_GUARD to it.
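
For reference, a guarded variant could look roughly like the following (a sketch only; since RTE_CACHE_GUARD is a struct member macro, the standalone instance would have to be wrapped):

/* hypothetical: shared instance with trailing guard cache line(s) */
static struct {
	struct rte_rand_state state;
	RTE_CACHE_GUARD;
} unregistered_rand __rte_cache_aligned;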


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v2 3/5] random: keep PRNG state in lcore variable
  2024-02-19 11:22       ` Morten Brørup
@ 2024-02-19 14:04         ` Mattias Rönnblom
  2024-02-19 15:10           ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-19 14:04 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-19 12:22, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Monday, 19 February 2024 10.41
>>
>> Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
>> cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
>> same state in a more cache-friendly lcore variable.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
> 
> [...]
> 
>> @@ -19,14 +20,12 @@ struct rte_rand_state {
>>   	uint64_t z3;
>>   	uint64_t z4;
>>   	uint64_t z5;
>> -	RTE_CACHE_GUARD;
>> -} __rte_cache_aligned;
>> +};
>>
>> -/* One instance each for every lcore id-equipped thread, and one
>> - * additional instance to be shared by all others threads (i.e., all
>> - * unregistered non-EAL threads).
>> - */
>> -static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
>> +RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
>> +
>> +/* instance to be shared by all unregistered non-EAL threads */
>> +static struct rte_rand_state unregistered_rand_state
>> __rte_cache_aligned;
> 
> The unregistered_rand_state instance is still __rte_cache_aligned; consider also adding an RTE_CACHE_GUARD to it.
> 

It shouldn't be cache-line aligned. I'll remove it. Thanks.
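
I.e., the intended fix is a plain static instance, with neither cache-line alignment nor a guard:

static struct rte_rand_state unregistered_rand_state;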

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-19 11:10           ` Morten Brørup
@ 2024-02-19 14:31             ` Mattias Rönnblom
  2024-02-19 15:04               ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-19 14:31 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-19 12:10, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 19 February 2024 08.49
>>
>> On 2024-02-09 14:04, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Friday, 9 February 2024 12.46
>>>>
>>>> On 2024-02-09 09:25, Morten Brørup wrote:
>>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>>>> Sent: Thursday, 8 February 2024 19.17
>>>>>>
>>>>>> Introduce DPDK per-lcore id variables, or lcore variables for
>> short.
>>>>>>
>>>>>> An lcore variable has one value for every current and future lcore
>>>>>> id-equipped thread.
>>>>>>
>>>>>> The primary <rte_lcore_var.h> use case is for statically
>> allocating
>>>>>> small chunks of often-used data, which is related logically, but
>>>> where
>>>>>> there are performance benefits to reap from having updates being
>>>> local
>>>>>> to an lcore.
>>>>>>
>>>>>> Lcore variables are similar to thread-local storage (TLS, e.g.,
>> C11
>>>>>> _Thread_local), but decoupling the values' life time with that of
>>>> the
>>>>>> threads.
>>>>>>
>>>>>> Lcore variables are also similar in terms of functionality
>> provided
>>>> by
>>>>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>>>>>> build-time machinery. DPCPU uses linker scripts, which effectively
>>>>>> prevents the reuse of its, otherwise seemingly viable, approach.
>>>>>>
>>>>>> The currently-prevailing way to solve the same problem as lcore
>>>>>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-
>>>> sized
>>>>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>>>>>> lcore variables over this approach is that data related to the
>> same
>>>>>> lcore now is close (spatially, in memory), rather than data used
>> by
>>>>>> the same module, which in turn avoid excessive use of padding,
>>>>>> polluting caches with unused data.
>>>>>>
>>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>>> ---
>>>>>
>>>>> This looks very promising. :-)
>>>>>
>>>>> Here's a bunch of comments, questions and suggestions.
>>>>>
>>>>>
>>>>> * Question: Performance.
>>>>> What is the cost of accessing an lcore variable vs a variable in
>> TLS?
>>>>> I suppose the relative cost diminishes if the variable is a larger
>>>> struct, compared to a simple uint64_t.
>>>>>
>>>>
>>>> In case all the relevant data is available in a cache close to the
>>>> core,
>>>> both options carry quite low overhead.
>>>>
>>>> Accessing a lcore variable will always require a TLS lookup, in the
>>>> form
>>>> of retrieving the lcore_id of the current thread. In that sense,
>> there
>>>> will likely be a number of extra instructions required to do the
>> lcore
>>>> variable address lookup (i.e., doing the load from rte_lcore_var
>> table
>>>> based on the lcore_id you just looked up, and adding the variable's
>>>> offset).
>>>>
>>>> A TLS lookup will incur an extra overhead of less than a clock
>> cycle,
>>>> compared to accessing a non-TLS static variable, in case static
>> linking
>>>> is used. For shared objects, TLS is much more expensive (something
>>>> often
>>>> visible in dynamically linked DPDK app flame graphs, in the form of
>> the
>>>> __tls_addr symbol). Then you need to add ~3 cc/access. This on a
>> micro
>>>> benchmark running on a x86_64 Raptor Lake P-core.
>>>>
>>>> (To visualize the difference between shared object and not, one can
>> use
>>>> Compiler Explorer and -fPIC versus -fPIE.)
>>>>
>>>> Things get more complicated if you access the same variable in the
>> same
>>>> section of code, since then it can be left on the stack/in a register
>> by
>>>> the compiler, especially if LTO is used. In other words, if you do
>>>> rte_lcore_id() several times in a row, only the first one will cost
>> you
>>>> anything. This happens fairly often in DPDK, with rte_lcore_id().
>>>>
>>>> Finally, if you do something like
>>>>
>>>> diff --git a/lib/eal/common/rte_random.c
>> b/lib/eal/common/rte_random.c
>>>> index af9fffd81b..a65c30d27e 100644
>>>> --- a/lib/eal/common/rte_random.c
>>>> +++ b/lib/eal/common/rte_random.c
>>>> @@ -125,14 +125,7 @@ __rte_rand_lfsr258(struct rte_rand_state
>> *state)
>>>>     static __rte_always_inline
>>>>     struct rte_rand_state *__rte_rand_get_state(void)
>>>>     {
>>>> -       unsigned int idx;
>>>> -
>>>> -       idx = rte_lcore_id();
>>>> -
>>>> -       if (unlikely(idx == LCORE_ID_ANY))
>>>> -               return &unregistered_rand_state;
>>>> -
>>>> -       return RTE_LCORE_VAR_PTR(rand_state);
>>>> +       return &unregistered_rand_state;
>>>>     }
>>>>
>>>>     uint64_t
>>>>
>>>> ...and re-run the rand_perf_autotest, at least I see no difference
>> at
>>>> all (in a statically linked build). Both results in rte_rand() using
>>>> ~11
>>>> cc/call. What that suggests is that TLS overhead is very small, and
>>>> that
>>>> any extra instructions required by lcore variables don't add much,
>> if
>>>> anything at all, at least in this particular case.
>>>
>>> Excellent. Thank you for a thorough and detailed answer, Mattias.
>>>
>>>>
>>>>> Some of my suggestions below might also affect performance.
>>>>>
>>>>>
>>>>> * Advantage: Provides direct access to worker thread variables.
>>>>> With the current alternative (thread-local storage), the main
>> thread
>>>> cannot access the TLS variables of the worker threads,
>>>>> unless worker threads publish global access pointers.
>>>>> Lcore variables of any lcore thread can be direcctly accessed by
>> any
>>>> thread, which simplifies code.
>>>>>
>>>>>
>>>>> * Advantage: Roadmap towards hugemem.
>>>>> It would be nice if the lcore variable memory was allocated in
>>>> hugemem, to reduce TLB misses.
>>>>> The current alternative (thread-local storage) is also not using
>>>> hugemem, so not a degradation.
>>>>>
>>>>
>>>> I agree, but the thing is it's hard to figure out how much memory is
>>>> required for these kind of variables, given how DPDK is built and
>>>> linked. In an OS kernel, you can just take all the symbols, put them
>> in
>>>> a special section, and size that section. Such a thing can't easily
>> be
>>>> done with DPDK, since shared object builds are supported, plus that
>>>> this
>>>> facility should be available not only to DPDK modules, but also the
>>>> application, so relying on linker scripts isn't really feasible
>>>> (probably not even feasible for DPDK itself).
>>>>
>>>> In that scenario, you want to size up the per-lcore buffer to be so
>>>> large, you don't have to worry about overruns. That will waste
>> memory.
>>>> If you use huge page memory, paging can't help you to avoid
>>>> pre-allocating actual physical memory.
>>>
>>> Good point.
>>> I had noticed that RTE_MAX_LCORE_VAR was 1 MB (per RTE_MAX_LCORE),
>> but I hadn't considered how paging helps us use less physical memory
>> than that.
>>>
>>>>
>>>> That said, even large (by static per-lcore data standards) buffers
>> are
>>>> potentially small enough not to grow the amount of memory used by a
>>>> DPDK
>>>> process too much. You need to provision for RTE_MAX_LCORE of them
>>>> though.
>>>>
>>>> The value of lcore variables should be small, and thus incur few TLB
>>>> misses, so you may not gain much from huge pages. In my world, it's
>>>> more
>>>> about "fitting often-used per-lcore data into L1 or L2 CPU caches",
>>>> rather than the easier "fitting often-used per-lcore data into a
>>>> working
>>>> set size reasonably expected to be covered by hardware TLB/caches".
>>>
>>> Yes, I suppose that lcore variables are intended to be small, and
>> large per-lcore structures should keep following the current design
>> patterns for allocation and access.
>>>
>>
>> It seems to me that support for per-lcore heaps should be the solution
>> for supporting use cases requiring many, larger and/or dynamic objects
>> on a per-lcore basis.
>>
>> Ideally, you would design both that mechanism and lcore variables
>> together, but then if you couple enough amount of improvements together
>> you will never get anywhere. An instance of where perfect is the enemy
>> of good, perhaps.
> 
> So true. :-)
> 
>>
>>> Perhaps this guideline is worth mentioning in the documentation.
>>>
>>
>> What is missing, more specifically? The size limitation and the static
>> nature of lcore variables is described, and what current design
>> patterns
>> they expected to (partly) replace is also covered.
> 
> Your documentation is fine, and nothing specific is missing here.
> I was thinking out loud that the high level DPDK documentation should describe common design patterns.
> 
>>
>>>>
>>>>> Lcore variables are available very early at startup, so I guess the
>>>> RTE memory allocator is not yet available.
>>>>> Hugemem could be allocated using O/S allocation, so there is a
>>>> possible road towards using hugemem.
>>>>>
>>>>
>>>> With the current design, that's true. I'm not sure it's a strict
>>>> requirement though, but it does makes things simpler.
>>>>
>>>>> Either way, using hugemem would require one more indirection (the
>>>> pointer to the allocated hugemem).
>>>>> I don't know which has better performance, using hugemem or
>> avoiding
>>>> the additional pointer dereferencing.
>>>>>
>>>>>
>>>>> * Suggestion: Consider adding an entry for unregistered non-EAL
>>>> threads.
>>>>> Please consider making room for one more entry, shared by all
>>>> unregistered non-EAL threads, i.e.
>>>>> making the array size RTE_MAX_LCORE + 1 and indexing by
>>>> (rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE).
>>>>>
>>>>> It would be convenient for the use cases where a variable shared by
>>>> the unregistered non-EAL threads don't need special treatment.
>>>>>
>>>>
>>>> I thought about this, but it would require a conditional in the
>> lookup
>>>> macro, as you show. More importantly, it would make the whole
>>>> <rte_lcore_var.h> thing less elegant and harder to understand. It's
>> bad
>>>> enough that "per-lcore" is actually "per-lcore id" (or the
>> equivalent
>>>> "per-EAL thread and unregistered EAL-thread"). Adding a "btw it's
>> <what
>>>> I said before> + 1" is not an improvement.
>>>
>>> We could promote "one more entry for unregistered non-EAL threads"
>> design pattern (for relevant use cases only!) by extending EAL with one
>> more TLS variable, maintained like _thread_id, but set to RTE_MAX_LCORE
>> when _thread_id is set to -1:
>>>
>>> +++ eal_common_thread.c:
>>>     RTE_DEFINE_PER_LCORE(int, _thread_id) = -1;
>>> + RTE_DEFINE_PER_LCORE(int, _thread_idx) = RTE_MAX_LCORE;
> 
> Ups... wrong reference! I meant to refer to _lcore_id, not _thread_id. Correction:
> 

OK. I subconsciously ignored this mistake, and read it as "_lcore_id".

> We could promote the "one more entry for unregistered non-EAL threads" design pattern (for relevant use cases only!) by extending EAL with one more TLS variable, maintained like _lcore_id, but set to RTE_MAX_LCORE when _lcore_id is set to LCORE_ID_ANY:
> 
> +++ eal_common_thread.c:
>    RTE_DEFINE_PER_LCORE(unsigned int, _lcore_id) = LCORE_ID_ANY;
> + RTE_DEFINE_PER_LCORE(unsigned int, _lcore_idx) = RTE_MAX_LCORE;
> 
>>>
>>> and
>>>
>>> +++ rte_lcore.h:
>>> static inline unsigned
>>> rte_lcore_id(void)
>>> {
>>> 	return RTE_PER_LCORE(_lcore_id);
>>> }
>>> + static inline unsigned
>>> + rte_lcore_idx(void)
>>> + {
>>> + 	return RTE_PER_LCORE(_lcore_idx);
>>> + }
>>>
>>> That would eliminate the (rte_lcore_id() < RTE_MAX_LCORE ?
>> rte_lcore_id() : RTE_MAX_LCORE) conditional, also where currently used.
>>>
>>
>> Wouldn't that effectively give a shared lcore id to all unregistered
>> threads?
> 
> Yes, just like rte_lcore_id() is LCORE_ID_ANY (i.e., UINT32_MAX) for all unregistered threads; but it will be usable for array indexing, behaving as a shadow variable of RTE_PER_LCORE(_lcore_id) that optimizes away the "rte_lcore_id() < RTE_MAX_LCORE ? rte_lcore_id() : RTE_MAX_LCORE" conditional when indexing.
> 
>>
>> We definitely shouldn't further complicate anything related to the DPDK
>> threading model, in my opinion.
>>
>> If a module needs one or more variable instances that aren't per lcore,
>> use regular static allocation instead. I would favor clarity over
>> convenience here, at least until we know better (see below as well).
>>
>>>>
>>>> But useful? Sure.
>>>>
>>>> I think you may still need other data for dealing with unregistered
>>>> threads, for example a mutex or spin lock to deal with concurrency
>>>> issues that arise with shared data.
>>>
>>> Adding the extra entry is only for the benefit of use cases where
>> special handling is not required. It will make the code for those use
>> cases much cleaner. I think it is useful.
>>>
>>
>> It will make it shorter, but not less clean, I would argue.
>>
>>> Use cases requiring special handling should still do the special
>> handling they do today.
>>>
>>
>> For DPDK modules using lcore variables and which treat unregistered
>> threads as "full citizens", I expect special handling of unregistered
>> threads to be the norm. Take rte_random.h as an example. Current API
>> does not guarantee MT safety for concurrent calls of unregistered
>> threads. It probably should, and it should probably be by means of a
>> mutex (not spinlock).
>>
>> The reason I'm not running off to make a rte_random.c patch is that
>> it's unclear to me what the role of unregistered threads in DPDK is.
>> I'm
>> reasonably comfortable with a model where there are many threads that
>> basically don't interact with the DPDK APIs (except maybe some very
>> narrow exposure, like the preemption-safe ring variant). One example of
>> such a design would be a big, slow control plane which uses multi-
>> threading
>> and the Linux process scheduler for work scheduling, hosted in the same
>> process as a DPDK data plane app.
>>
>> What I find more strange is a scenario where there are unregistered
>> threads which interact with a wide variety of DPDK APIs, do so
>> at high rates/with high performance requirements and are expected to be
>> preemption-safe. So they are basically EAL threads without an lcore id.
> 
> Yes, this is happening in the wild.
> E.g. our application has a mode where it uses fewer EAL threads, and processes more in non-EAL threads. That is to say, the same work is processed either by an EAL thread or a non-EAL thread, depending on the application's mode.
> The extra array entry would be useful for such use cases.
> 

Is there some particular reason you can't register those non-EAL threads?

>>
>> Support for that latter scenario has also been voiced, in previous
>> discussions, from what I recall.
>>
>> I think it's hard to answer the question of a "unregistered thread
>> spare" for lcore variables without first knowing what the future should
>> look like for unregistered threads in DPDK, in terms of being able to
>> call into DPDK APIs, preemption-safety guarantees, etc.
>>
>> It seems that until you have a clearer picture of how generally to
>> treat
>> unregistered threads, you are best off with just a per-lcore id
>> instance
>> of lcore variables.
> 
> I get your point. It also reduces the risk of bugs caused by incorrect use of the additional entry.
> 
> I am arguing for a different angle: Providing the extra entry will help uncover relevant use cases.
> 

Maybe have two "spares" in case you find two new use cases? :)

No, adding spares doesn't work, unless you rework the API and rename it 
to fit the new purpose of not only providing per-lcore id variables, but 
per-something-else.

>>
>>>>
>>>> There may also be cases were you are best off by simply disallowing
>>>> unregistered threads from calling into that API.
>>>>
>>>>> Obviously, this might affect performance.
>>>>> If the performance cost is not negligible, the additional entry (and
>>>> indexing branch) could be disabled at build time.
>>>>>
>>>>>
>>>>> * Suggestion: Do not fix the alignment at 16 byte.
>>>>> Pass an alignment parameter to rte_lcore_var_alloc() and use
>>>> alignof() when calling it:
>>>>>
>>>>> +#include <stdalign.h>
>>>>> +
>>>>> +#define RTE_LCORE_VAR_ALLOC(name)			\
>>>>> +	name = rte_lcore_var_alloc(sizeof(*(name)), alignof(*(name)))
>>>>> +
>>>>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, alignment)
>> 	\
>>>>> +	name = rte_lcore_var_alloc(size, alignment)
>>>>> +
>>>>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
>>>>> +	name = rte_lcore_var_alloc(size, RTE_LCORE_VAR_ALIGNMENT_DEFAULT)
>>>>> +
>>>>> + +++ /config/rte_config.h
>>>>> +#define RTE_LCORE_VAR_ALIGNMENT_DEFAULT 16
>>>>>
>>>>>
>>>>
>>>> That seems like a very good idea. I'll look into it.
>>>>
>>>>> * Concern: RTE_LCORE_VAR_FOREACH() resembles RTE_LCORE_FOREACH(),
>> but
>>>> behaves differently.
>>>>>
>>>>>> +/**
>>>>>> + * Iterate over each lcore id's value for a lcore variable.
>>>>>> + */
>>>>>> +#define RTE_LCORE_VAR_FOREACH(var, name)				\
>>>>>> +	for (unsigned int lcore_id =					\
>>>>>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);
>> 	\
>>>>>> +	     lcore_id < RTE_MAX_LCORE;
>> 	\
>>>>>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id,
>> name))
>>>>>> +
>>>>>
>>>>> The macro name RTE_LCORE_VAR_FOREACH() resembles
>>>> RTE_LCORE_FOREACH(i), which only iterates on running cores.
>>>>> You might want to give it a name that differs more.
>>>>>
>>>>
>>>> True.
>>>>
>>>> Maybe RTE_LCORE_VAR_FOREACH_VALUE() is better? Still room for
>>>> confusion,
>>>> for sure.
>>>>
>>>> Being consistent with <rte_lcore.h> is not so easy, since it's not
>> even
>>>> consistent with itself. For example, rte_lcore_count() returns the
>>>> number of lcores (EAL threads) *plus the number of registered non-
>> EAL
>>>> threads*, and RTE_LCORE_FOREACH() give a different count. :)
>>>
>>> Naming is hard. I don't have a good name, and can only offer
>> inspiration...
>>>
>>> <rte_lcore.h> has RTE_LCORE_FOREACH() and its
>> RTE_LCORE_FOREACH_WORKER() variant with _WORKER appended.
>>>
>>> Perhaps RTE_LCORE_VAR_FOREACH_ALL(), with _ALL appended to indicate a
>> variant.
>>>
>>>>
>>>>> If it wasn't for API breakage, I would suggest renaming
>>>> RTE_LCORE_FOREACH() instead, but that's not realistic. ;-)
>>>>>
>>>>> Small detail: "var" is a pointer, so consider renaming it to "ptr"
>>>> and adding _PTR to the macro name.
>>>>
>>>> The "var" name comes from how <sys/queue.h> names things. I think I
>> had
>>>> it as "ptr" initially. I'll change it back.
>>>
>>> Thanks.
>>>
>>>>
>>>> Thanks a lot Morten.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC 1/5] eal: add static per-lcore memory allocation facility
  2024-02-19 14:31             ` Mattias Rönnblom
@ 2024-02-19 15:04               ` Morten Brørup
  0 siblings, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-02-19 15:04 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 19 February 2024 15.32
> 
> On 2024-02-19 12:10, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Monday, 19 February 2024 08.49
> >>
> >> On 2024-02-09 14:04, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >>>> Sent: Friday, 9 February 2024 12.46
> >>>>
> >>>> On 2024-02-09 09:25, Morten Brørup wrote:
> >>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>>>> Sent: Thursday, 8 February 2024 19.17
> >>>>>>
> >>>>>> Introduce DPDK per-lcore id variables, or lcore variables for
> >> short.
> >>>>>>
> >>>>>> An lcore variable has one value for every current and future
> lcore
> >>>>>> id-equipped thread.
> >>>>>>
> >>>>>> The primary <rte_lcore_var.h> use case is for statically
> >> allocating
> >>>>>> small chunks of often-used data, which is related logically, but
> >>>> where
> >>>>>> there are performance benefits to reap from having updates being
> >>>> local
> >>>>>> to an lcore.
> >>>>>>
> >>>>>> Lcore variables are similar to thread-local storage (TLS, e.g.,
> >> C11
> >>>>>> _Thread_local), but decoupling the values' life time with that
> of
> >>>> the
> >>>>>> threads.
> >>>>>>
> >>>>>> Lcore variables are also similar in terms of functionality
> >> provided
> >>>> by
> >>>>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >>>>>> build-time machinery. DPCPU uses linker scripts, which
> effectively
> >>>>>> prevents the reuse of its, otherwise seemingly viable, approach.
> >>>>>>
> >>>>>> The currently-prevailing way to solve the same problem as lcore
> >>>>>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-
> >>>> sized
> >>>>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit
> of
> >>>>>> lcore variables over this approach is that data related to the
> >> same
> >>>>>> lcore now is close (spatially, in memory), rather than data used
> >> by
> >>>>>> the same module, which in turn avoid excessive use of padding,
> >>>>>> polluting caches with unused data.
> >>>>>>
> >>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>>>> ---

[...]

> > Ups... wrong reference! I meant to refer to _lcore_id, not
> _thread_id. Correction:
> >
> 
> OK. I subconsciously ignored this mistake, and read it as "_lcore_id".

:-)

[...]

> >> For DPDK modules using lcore variables and which treat unregistered
> >> threads as "full citizens", I expect special handling of
> unregistered
> >> threads to be the norm. Take rte_random.h as an example. Current API
> >> does not guarantee MT safety for concurrent calls of unregistered
> >> threads. It probably should, and it should probably be by means of a
> >> mutex (not spinlock).
> >>
> >> The reason I'm not running off to make a rte_random.c patch is that
> >> it's unclear to me what the role of unregistered threads in DPDK is.
> >> I'm
> >> reasonably comfortable with a model where there are many threads
> that
> >> basically don't interact with the DPDK APIs (except maybe some very
> >> narrow exposure, like the preemption-safe ring variant). One example
> of
> >> such a design would be a big, slow control plane which uses multi-
> >> threading
> >> and the Linux process scheduler for work scheduling, hosted in the
> same
> >> process as a DPDK data plane app.
> >>
> >> What I find more strange is a scenario where there are unregistered
> >> threads which interact with a wide variety of DPDK APIs, do so
> >> at high rates/with high performance requirements and are expected to
> >> be preemption-safe. So they are basically EAL threads without an
> >> lcore id.
> >
> > Yes, this is happening in the wild.
> > E.g. our application has a mode where it uses fewer EAL threads, and
> processes more in non-EAL threads. That is to say, the same work is
> processed either by an EAL thread or a non-EAL thread, depending on the
> application's mode.
> > The extra array entry would be useful for such use cases.
> >
> 
> Is there some particular reason you can't register those non-EAL
> threads?

Legacy. I suppose we could just do that instead.
Thanks for the suggestion!
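
For the record, roughly what such a registration looks like in a non-EAL worker (a sketch; worker_main() is made up, while rte_thread_register()/rte_thread_unregister() are the existing EAL calls from <rte_lcore.h>):

static void *
worker_main(void *arg)
{
	/* acquire a free lcore id; afterwards rte_lcore_id() !=
	 * LCORE_ID_ANY, and per-lcore id data may be used directly */
	if (rte_thread_register() != 0)
		return NULL; /* no lcore id left */

	/* ... process work, as if running on an EAL thread ... */

	rte_thread_unregister(); /* release the lcore id */
	return arg;
}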

> 
> >>
> >> Support for that latter scenario has also been voiced, in previous
> >> discussions, from what I recall.
> >>
> >> I think it's hard to answer the question of a "unregistered thread
> >> spare" for lcore variables without first knowing what the future
> should
> >> look like for unregistered threads in DPDK, in terms of being able
> to
> >> call into DPDK APIs, preemption-safety guarantees, etc.
> >>
> >> It seems that until you have a clearer picture of how generally to
> >> treat
> >> unregistered threads, you are best off with just a per-lcore id
> >> instance
> >> of lcore variables.
> >
> > I get your point. It also reduces the risk of bugs caused by
> incorrect use of the additional entry.
> >
> > I am arguing for a different angle: Providing the extra entry will
> help uncover relevant use cases.
> >
> 
> Maybe have two "spares" in case you find two new uses cases? :)
> 
> No, adding spares doesn't work, unless you rework the API and rename it
> to fit the new purpose of not only providing per-lcore id variables,
> but per-something-else.
> 

OK. I'm convinced.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v2 3/5] random: keep PRNG state in lcore variable
  2024-02-19 14:04         ` Mattias Rönnblom
@ 2024-02-19 15:10           ` Morten Brørup
  0 siblings, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-02-19 15:10 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 19 February 2024 15.04
> 
> On 2024-02-19 12:22, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Monday, 19 February 2024 10.41
> >>
> >> Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
> >> cache-aligned and RTE_CACHE_GUARDed struct instances with keeping
> the
> >> same state in a more cache-friendly lcore variable.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> ---
> >
> > [...]
> >
> >> @@ -19,14 +20,12 @@ struct rte_rand_state {
> >>   	uint64_t z3;
> >>   	uint64_t z4;
> >>   	uint64_t z5;
> >> -	RTE_CACHE_GUARD;
> >> -} __rte_cache_aligned;
> >> +};
> >>
> >> -/* One instance each for every lcore id-equipped thread, and one
> >> - * additional instance to be shared by all others threads (i.e.,
> all
> >> - * unregistered non-EAL threads).
> >> - */
> >> -static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
> >> +RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
> >> +
> >> +/* instance to be shared by all unregistered non-EAL threads */
> >> +static struct rte_rand_state unregistered_rand_state
> >> __rte_cache_aligned;
> >
> > The unregistered_rand_state instance is still __rte_cache_aligned;
> consider also adding an RTE_CACHE_GUARD to it.
> >
> 
> It shouldn't be cache-line aligned. I'll remove it. Thanks.

Agreed; that fix is just as good. Then,

Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v3 0/6] Lcore variables
  2024-02-19  9:40     ` [RFC v2 1/5] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-20  8:49       ` Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                           ` (5 more replies)
  0 siblings, 6 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question on how to best allocate static per-lcore memory has been
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean
and seemingly very performant solution to a real problem.

One thing that is unclear to the author is how this API relates to a
potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Contrary to what the version.map edit suggests, this RFC is not meant
as a proposal for DPDK 24.03.

Mattias Rönnblom (6):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 app/test/meson.build                  |   1 +
 app/test/test_lcore_var.c             | 407 ++++++++++++++++++++++++++
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  82 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_random.c           |  30 +-
 lib/eal/common/rte_service.c          | 119 ++++----
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 lib/eal/x86/rte_power_intrinsics.c    |  17 +-
 lib/power/rte_power_pmd_mgmt.c        |  36 ++-
 13 files changed, 987 insertions(+), 88 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-20  9:11           ` Bruce Richardson
                             ` (3 more replies)
  2024-02-20  8:49         ` [RFC v3 2/6] eal: add lcore variable test suite Mattias Rönnblom
                           ` (4 subsequent siblings)
  5 siblings, 4 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of often-used data, which is related logically, but where
there are performance benefits to reap from having updates being local
to an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar, in terms of functionality, to the
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its, otherwise seemingly viable, approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore is now close (spatially, in memory), rather than data used by
the same module, which in turn avoids excessive use of padding,
polluting caches with unused data.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  82 ++++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 7 files changed, 465 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index da265d7dd2..884482e473 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -30,6 +30,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index a6a768bd7c..bb06bb7ca1 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -98,6 +98,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..dfd11cbd0b
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define WARN_THRESHOLD 75
+
+/*
+ * Avoid using offset zero, since it would result in a NULL-value
+ * "handle" (offset) pointer, which in principle and per the API
+ * definition shouldn't be an issue, but may confuse some tools and
+ * users.
+ */
+#define INITIAL_OFFSET 1
+
+char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
+
+static uintptr_t allocated = INITIAL_OFFSET;
+
+static void
+verify_allocation(uintptr_t new_allocated)
+{
+	static bool has_warned;
+
+	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
+
+	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
+	    !has_warned) {
+		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
+			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
+			RTE_MAX_LCORE_VAR);
+		has_warned = true;
+	}
+}
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
+
+	void *offset = (void *)new_allocated;
+
+	new_allocated += size;
+
+	verify_allocation(new_allocated);
+
+	allocated = new_allocated;
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return offset;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size cache-line aligned, as
+	 * well as having the base pointer cache-line aligned, assures
+	 * that aligned offsets also translate to aligned pointers
+	 * across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..da49d48d7c
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,375 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. In other words,
+ * there's one copy of its value for each and every current and future
+ * lcore id-equipped thread, with the total number of copies amounting
+ * to \c RTE_MAX_LCORE.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a \c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). A handle may be passed between modules and
+ * threads just like any pointer, but its value is not the address of
+ * any particular object, but rather just an opaque identifier, stored
+ * in a typed pointer (to inform the access macros of the values' type).
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
+ *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
+ *     module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * generally only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids *may* be frequently read or written by the owners without the
+ * risk of false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomics) should
+ * be employed to assure there are no data races between the owning
+ * thread and any non-owner threads accessing the same lcore variable
+ * instance.
+ *
+ * The value of the lcore variable for a particular lcore id may be
+ * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
+ * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * To modify the value of an lcore variable for a particular lcore id,
+ * either access the object through the pointer retrieved by \ref
+ * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
+ * RTE_LCORE_VAR_LCORE_SET.
+ *
+ * The access macros each have a short-hand which may be used by an EAL
+ * thread or registered non-EAL thread to access the lcore variable
+ * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
+ * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
+ *
+ * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier. The
+ * *identifier* value is common across all lcore ids.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like \c int,
+ * but would more typically be a \c struct. An application may choose
+ * to define an lcore variable which it then never goes on to
+ * allocate.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * The sum of all lcore variables, plus any padding required, must be
+ * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
+ * violation of this maximum results in the process being terminated.
+ *
+ * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
+ * same order of magnitude in size as a thread stack.
+ *
+ * The lcore variable storage buffers are kept in the BSS section in
+ * the resulting binary, where data generally isn't mapped in until
+ * it's accessed. This means that unused portions of the lcore
+ * variable storage area will not occupy any physical memory (with a
+ * granularity of the memory page size [usually 4 kB]).
+ *
+ * Lcore variables should generally *not* be \ref __rte_cache_aligned
+ * and need *not* include a \ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the
+ * case of an lcore variable instance, all nearby data structures
+ * should almost-always be written to by a single thread (the lcore
+ * variable owner). Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         struct foo_lcore_state *state;
+ *
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         (visit each lcore id's value in turn)
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * \endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * } __rte_cache_aligned;
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * \endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions, and for example next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to \ref rte_lcore_var.h is the \ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., \ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of the particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. One effect of this is that thread-local variables must be
+ *     initialized in a "lazy" manner (e.g., at the point of thread
+ *     creation). Lcore variables may be accessed immediately after
+ *     having been allocated (which usually happens before any thread
+ *     beyond the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id, and thus
+ *     not for such "regular" threads.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable can be passed
+ *     to and successfully dereferenced by a non-owning thread) depends on
+ *     the details of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, such data sharing is supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various per-lcore id instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore variable handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)	\
+	name = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
+	name = rte_lcore_var_alloc(size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC(name)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),		\
+				       alignof(typeof(*(name))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a \ref
+ * RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
+	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
+	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+/**
+ * Get value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
+
+/**
+ * Set the value of an lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
+
+/**
+ * Get value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
+
+/**
+ * Set value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_SET(name, value) \
+	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)				\
+	for (unsigned int lcore_id =					\
+		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
+
+extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than \c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The id of the variable, stored in a void pointer value.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 5e0cd47c82..e90b86115a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -393,6 +393,10 @@ EXPERIMENTAL {
 	# added in 23.07
 	rte_memzone_max_get;
 	rte_memzone_max_set;
+
+	# added in 24.03
+	rte_lcore_var_alloc;
+	rte_lcore_var;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
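
For illustration, a minimal sketch of how a module could use the API
introduced by this patch. The module and the "busy_cycles" variable are
invented for the example; the macros are the ones defined above.

#include <stdint.h>

#include <rte_lcore.h>
#include <rte_lcore_var.h>

static RTE_LCORE_VAR_HANDLE(uint64_t, busy_cycles);

RTE_LCORE_VAR_INIT(busy_cycles);

/* Called from an EAL thread or a registered non-EAL thread. */
static void
account_busy(uint64_t cycles)
{
	*RTE_LCORE_VAR_PTR(busy_cycles) += cycles;
}

/* May be called from any thread; the reads are unsynchronized, which
 * is acceptable for rough statistics.
 */
static uint64_t
total_busy(void)
{
	uint64_t *cycles;
	uint64_t sum = 0;

	RTE_LCORE_VAR_FOREACH_VALUE(cycles, busy_cycles)
		sum += *cycles;

	return sum;
}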

* [RFC v3 2/6] eal: add lcore variable test suite
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Add test suite to exercise the <rte_lcore_var.h> API.

RFC v2:
 * Improve alignment-related test coverage.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 407 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 408 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 6389ae83ee..93412cce51 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..27084e91e9
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,407 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static bool
+rand_bool(void)
+{
+	return rte_rand() & 1;
+}
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_PTR(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal;
+
+	if (rand_bool())
+		equal = RTE_LCORE_VAR_GET(test_int) == state->old_value;
+	else
+		equal = *(RTE_LCORE_VAR_PTR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	if (rand_bool())
+		RTE_LCORE_VAR_SET(test_int, state->new_value);
+	else
+		*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		RTE_LCORE_VAR_LCORE_SET(lcore_id, test_int, state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		TEST_ASSERT_EQUAL(state->new_value,
+				  RTE_LCORE_VAR_LCORE_GET(lcore_id, test_int),
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_PTR(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_struct);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_PTR(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(RTE_LCORE_VAR_LCORE_GET(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_array);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (RTE_MAX_LCORE_VAR / 2)
+
+static int
+test_many_lvars(void)
+{
+	void **handlers = malloc(sizeof(void *) * MANY_LVARS);
+	int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		void *handle = rte_lcore_var_alloc(1, 1);
+
+		uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), handle);
+
+		*b = (uint8_t)i;
+
+		handlers[i] = handle;
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_FOREACH_WORKER(lcore_id) {
+			uint8_t *b = __RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(),
+							       handlers[i]);
+			TEST_ASSERT_EQUAL((uint8_t)i, *b,
+					  "Unexpected lcore variable value.");
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v3 3/6] random: keep PRNG state in lcore variable
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 2/6] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-20 15:31           ` Morten Brørup
  2024-02-20  8:49         ` [RFC v3 4/6] power: keep per-lcore " Mattias Rönnblom
                           ` (2 subsequent siblings)
  5 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/common/rte_random.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 7709b8f2c6..adbbf13f0e 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct rte_rand_state {
@@ -19,14 +20,12 @@ struct rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
-} __rte_cache_aligned;
+};
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_PTR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
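
The patch above also illustrates the idiom for threads without an lcore
id: lcore variables only cover lcore id-equipped threads, so a module
serving unregistered non-EAL threads keeps one ordinary static fallback
instance and dispatches on LCORE_ID_ANY. As a sketch, with invented
names:

#include <stdint.h>

#include <rte_branch_prediction.h>
#include <rte_lcore.h>
#include <rte_lcore_var.h>

struct state {
	uint64_t counter;
};

static RTE_LCORE_VAR_HANDLE(struct state, lcore_state);

/* shared by all unregistered non-EAL threads */
static struct state unregistered_state;

static struct state *
get_state(void)
{
	if (unlikely(rte_lcore_id() == LCORE_ID_ANY))
		return &unregistered_state;

	return RTE_LCORE_VAR_PTR(lcore_state);
}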

* [RFC v3 4/6] power: keep per-lcore state in lcore variable
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
                           ` (2 preceding siblings ...)
  2024-02-20  8:49         ` [RFC v3 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 5/6] service: " Mattias Rönnblom
  2024-02-20  8:49         ` [RFC v3 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

RFC v3:
 * Replace for loop with FOREACH macro.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/power/rte_power_pmd_mgmt.c | 36 ++++++++++++++++------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 591fc69f36..ea30454895 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -68,8 +69,8 @@ struct pmd_core_cfg {
 	/**< Number of queues ready to enter power optimized state */
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
-} __rte_cache_aligned;
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+};
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v3 5/6] service: keep per-lcore state in lcore variable
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
                           ` (3 preceding siblings ...)
  2024-02-20  8:49         ` [RFC v3 4/6] power: keep per-lcore " Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  2024-02-22  9:42           ` Morten Brørup
  2024-02-20  8:49         ` [RFC v3 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 119 ++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index d959c91459..de205c5da5 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,11 +102,12 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
+	else {
+		struct core_state *cs;
+		RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+			memset(cs, 0, sizeof(struct core_state));
 	}
 
 	int i;
@@ -122,7 +124,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +137,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +286,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +293,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +454,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +467,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +489,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +535,16 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs =
+		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +552,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +573,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +590,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +642,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +694,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +712,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +737,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +761,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +785,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +815,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +824,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +849,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +860,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +868,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +876,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +885,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +901,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +948,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +977,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +989,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1028,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
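
A design point visible in this patch: the lcore variable API has no free
operation, so a library that may be initialized more than once, as
rte_service_init() may be after rte_service_finalize(), must allocate
its handle only once and instead zero the per-lcore instances on
re-initialization. The general shape, with a placeholder state type:

#include <stdint.h>
#include <string.h>

#include <rte_lcore_var.h>

struct module_state {
	uint64_t mask; /* placeholder field */
};

static RTE_LCORE_VAR_HANDLE(struct module_state, states);

static void
module_init(void)
{
	if (states == NULL)
		RTE_LCORE_VAR_ALLOC(states);
	else {
		struct module_state *s;

		/* instances persist; reset them instead of freeing */
		RTE_LCORE_VAR_FOREACH_VALUE(s, states)
			memset(s, 0, sizeof(*s));
	}
}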

* [RFC v3 6/6] eal: keep per-lcore power intrinsics state in lcore variable
  2024-02-20  8:49       ` [RFC v3 0/6] Lcore variables Mattias Rönnblom
                           ` (4 preceding siblings ...)
  2024-02-20  8:49         ` [RFC v3 5/6] service: " Mattias Rönnblom
@ 2024-02-20  8:49         ` Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-20  8:49 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Keep per-lcore power intrinsics state in a lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 532a2e646b..f4659af77e 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -12,10 +13,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -170,7 +175,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_PTR(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -262,7 +267,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_PTR(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -301,8 +306,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_PTR(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
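
Note that two allocation styles appear across these patches: rte_random
allocates explicitly from an already-existing RTE_INIT constructor,
while this patch uses the RTE_LCORE_VAR_INIT() wrapper, which generates
the constructor for you. The two are equivalent; with a hypothetical
"foo" handle:

RTE_LCORE_VAR_HANDLE(struct foo, foo);

/* Either allocate explicitly, which suits modules whose init
 * constructor has other work to do ...
 */
RTE_INIT(foo_init)
{
	RTE_LCORE_VAR_ALLOC(foo);
}

/* ... or, equivalently, use the one-liner wrapper (a real module would
 * pick one of the two, not both):
 */
RTE_LCORE_VAR_INIT(foo);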

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-20  9:11           ` Bruce Richardson
  2024-02-20 10:47             ` Mattias Rönnblom
  2024-02-21  9:43           ` Jerin Jacob
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 313+ messages in thread
From: Bruce Richardson @ 2024-02-20  9:11 UTC (permalink / raw)
  To: Mattias Rönnblom; +Cc: dev, hofors, Morten Brørup, Stephen Hemminger

On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small chunks of often-used data, which is related logically, but where
> there are performance benefits to reap from having updates being local
> to an lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' life time with that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoid excessive use of padding,
> polluting caches with unused data.
> 
> RFC v3:
>  * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>  * Update example to reflect FOREACH macro name change (in RFC v2).
> 
> RFC v2:
>  * Use alignof to derive alignment requirements. (Morten Brørup)
>  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>  * Allow user-specified alignment, but limit max to cache line size.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  config/rte_config.h                   |   1 +
>  doc/api/doxy-api-index.md             |   1 +
>  lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>  lib/eal/common/meson.build            |   1 +
>  lib/eal/include/meson.build           |   1 +
>  lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>  lib/eal/version.map                   |   4 +
>  7 files changed, 465 insertions(+)
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
> 
> diff --git a/config/rte_config.h b/config/rte_config.h
> index da265d7dd2..884482e473 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -30,6 +30,7 @@
>  /* EAL defines */
>  #define RTE_CACHE_GUARD_LINES 1
>  #define RTE_MAX_HEAPS 32
> +#define RTE_MAX_LCORE_VAR 1048576
>  #define RTE_MAX_MEMSEG_LISTS 128
>  #define RTE_MAX_MEMSEG_PER_LIST 8192
>  #define RTE_MAX_MEM_MB_PER_LIST 32768
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index a6a768bd7c..bb06bb7ca1 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>    [interrupts](@ref rte_interrupts.h),
>    [launch](@ref rte_launch.h),
>    [lcore](@ref rte_lcore.h),
> +  [lcore-varible](@ref rte_lcore_var.h),
>    [per-lcore](@ref rte_per_lcore.h),
>    [service cores](@ref rte_service.h),
>    [keepalive](@ref rte_keepalive.h),
> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
> new file mode 100644
> index 0000000000..dfd11cbd0b
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,82 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_log.h>
> +
> +#include <rte_lcore_var.h>
> +
> +#include "eal_private.h"
> +
> +#define WARN_THRESHOLD 75
> +
> +/*
> + * Avoid using offset zero, since it would result in a NULL-value
> + * "handle" (offset) pointer, which in principle and per the API
> + * definition shouldn't be an issue, but may confuse some tools and
> + * users.
> + */
> +#define INITIAL_OFFSET 1
> +
> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
> +

While I like the idea of improved handling for per-core variables, my main
concern with this set is this definition here, which adds yet another
dependency on the compile-time defined RTE_MAX_LCORE value.

I believe we already have an issue with this #define where it's impossible
to come up with a single value that works for all, or nearly all cases. The
current default is still 128, yet DPDK needs to support systems where the
number of cores is well into the hundreds, requiring workarounds of core
mappings or customized builds of DPDK. Upping the value fixes those issues
at the cost of memory footprint explosion for smaller systems.

I'm therefore nervous about putting more dependencies on this value, when I
feel we should be moving away from its use, to allow more runtime
configurability of cores.

For this set/feature, would it be possible to have a run-time allocated
(and sized) array rather than using the RTE_MAX_LCORE value?

Thanks, (and apologies for the mini-rant!)

/Bruce

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  9:11           ` Bruce Richardson
@ 2024-02-20 10:47             ` Mattias Rönnblom
  2024-02-20 11:39               ` Bruce Richardson
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-20 10:47 UTC (permalink / raw)
  To: Bruce Richardson, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger

On 2024-02-20 10:11, Bruce Richardson wrote:
> On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is related logically, but where
>> there are performance benefits to reap from having updates being local
>> to an lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' life time with that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality provided by
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoid excessive use of padding,
>> polluting caches with unused data.
>>
>> RFC v3:
>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>
>> RFC v2:
>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>   * Allow user-specified alignment, but limit max to cache line size.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   config/rte_config.h                   |   1 +
>>   doc/api/doxy-api-index.md             |   1 +
>>   lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>>   lib/eal/common/meson.build            |   1 +
>>   lib/eal/include/meson.build           |   1 +
>>   lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>>   lib/eal/version.map                   |   4 +
>>   7 files changed, 465 insertions(+)
>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index da265d7dd2..884482e473 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -30,6 +30,7 @@
>>   /* EAL defines */
>>   #define RTE_CACHE_GUARD_LINES 1
>>   #define RTE_MAX_HEAPS 32
>> +#define RTE_MAX_LCORE_VAR 1048576
>>   #define RTE_MAX_MEMSEG_LISTS 128
>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index a6a768bd7c..bb06bb7ca1 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>>     [interrupts](@ref rte_interrupts.h),
>>     [launch](@ref rte_launch.h),
>>     [lcore](@ref rte_lcore.h),
>> +  [lcore-varible](@ref rte_lcore_var.h),
>>     [per-lcore](@ref rte_per_lcore.h),
>>     [service cores](@ref rte_service.h),
>>     [keepalive](@ref rte_keepalive.h),
>> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
>> new file mode 100644
>> index 0000000000..dfd11cbd0b
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,82 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define WARN_THRESHOLD 75
>> +
>> +/*
>> + * Avoid using offset zero, since it would result in a NULL-value
>> + * "handle" (offset) pointer, which in principle and per the API
>> + * definition shouldn't be an issue, but may confuse some tools and
>> + * users.
>> + */
>> +#define INITIAL_OFFSET 1
>> +
>> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
>> +
> 
> While I like the idea of improved handling for per-core variables, my main
> concern with this set is this definition here, which adds yet another
> dependency on the compile-time defined RTE_MAX_LCORE value.
> 

Lcore variables replace one RTE_MAX_LCORE-dependent pattern with another.

You could even argue the dependency on RTE_MAX_LCORE is reduced with 
lcore variables, if you look at where/in how many places in the code 
base this macro is being used. Centralizing per-lcore data management 
may also provide some opportunity in the future for extending the API to 
cope with some more dynamic RTE_MAX_LCORE variant. Not without ABI 
breakage of course, but we are not ever going to change anything related 
to RTE_MAX_LCORE without breaking the ABI, since this constant is 
everywhere, including compiled into the application itself.

> I believe we already have an issue with this #define where it's impossible
> to come up with a single value that works for all, or nearly all cases. The
> current default is still 128, yet DPDK needs to support systems where the
> number of cores is well into the hundreds, requiring workarounds of core
> mappings or customized builds of DPDK. Upping the value fixes those issues
> at the cost of memory footprint explosion for smaller systems.
> 

I agree this is an issue.

RTE_MAX_LCORE also needs to be sized to accommodate not only all cores 
used, but the sum of all EAL threads and registered non-EAL threads.

So, there is no reliable way to derive what RTE_MAX_LCORE should be on a 
particular piece of hardware, since the actual number of lcore ids 
needed is up to the application.

Why is the default set so low? Linux has NR_CPUS, which serves the same 
purpose and is set to 4096 by default, if I recall correctly. 
Shouldn't we at least be able to increase it to 256?

> I'm therefore nervous about putting more dependencies on this value, when I
> feel we should be moving away from its use, to allow more runtime
> configurability of cores.
> 

What more specifically do you have in mind?

Maybe I'm overly pessimistic, but supporting lcores without any upper 
bound and also allowing them to be added and removed at any point during 
run time seems far-fetched, given where DPDK is today.

Introducing an actual upper bound, set during DPDK run-time 
initialization and lower than RTE_MAX_LCORE, seems easier. I think there 
is some equivalent in the Linux kernel. Again, you would need to 
accommodate future rte_register_thread() calls.

<rte_lcore_var.h> could be extended with user-specified lcore variable 
init/free callbacks, to allow lazy or late initialization.

If one could retrieve the maximum possible number of lcore ids *for a 
particular DPDK process* (as opposed to a particular build), it would be 
possible to avoid touching the per-lcore buffers for lcore ids that 
would never be used. With data in BSS, it would never be mapped/allocated.

An issue with BSS data is that there might be very RT-sensitive 
applications deciding to lock all memory into RAM, to avoid latency 
jitter caused by paging, and such applications would suffer from a 
large rte_lcore_var (or all the current static arrays). Lcore variables 
make this worse, since rte_lcore_var is larger than the sum of today's 
static arrays, and must be so, with some margin, since there is no way 
to figure out ahead of time how much memory is actually going to be needed.
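
To put a number on it: with the defaults in this RFC, RTE_MAX_LCORE_VAR 
is 1048576 bytes (1 MiB) and RTE_MAX_LCORE defaults to 128, so 
rte_lcore_var comes to 128 * 1 MiB = 128 MiB of BSS. Demand paging backs 
only the fraction actually touched, but mlockall(MCL_CURRENT | 
MCL_FUTURE) would pin all of it.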

> For this set/feature, would it be possible to have a run-time allocated
> (and sized) array rather than using the RTE_MAX_LCORE value?
> 

What I explored was having the per-lcore buffers dynamically allocated. 
What I ran into was that there was no apparent benefit, and with dynamic 
allocation there were new problems to solve. One was to ensure lcore 
variable buffers were allocated before they were used. In particular, 
if you want to use huge page memory, lcore variables may be available 
only when that machinery is ready to accept requests.

Also, with huge page memory, you won't get the benefit you get from 
demand paging and BSS (i.e., only used memory is actually allocated).

With malloc(), I believe you generally do get that same benefit, if 
your allocation is sufficiently large.

I also considered just allocating chunks, fitting (say) 64 kB worth of 
lcore variables in each. That turned out more complex, and to no benefit 
other than reducing the footprint for mlockall()-type apps, which seemed 
like a corner case.

I never considered a dynamic RTE_MAX_LCORE with no upper bound.

> Thanks, (and apologies for the mini-rant!)
> 
> /Bruce

Thanks for the comments. This was nowhere near a rant.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20 10:47             ` Mattias Rönnblom
@ 2024-02-20 11:39               ` Bruce Richardson
  2024-02-20 13:37                 ` Morten Brørup
  2024-02-20 16:26                 ` Mattias Rönnblom
  0 siblings, 2 replies; 313+ messages in thread
From: Bruce Richardson @ 2024-02-20 11:39 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Mattias Rönnblom, dev, Morten Brørup, Stephen Hemminger

On Tue, Feb 20, 2024 at 11:47:14AM +0100, Mattias Rönnblom wrote:
> On 2024-02-20 10:11, Bruce Richardson wrote:
> > On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
> > > Introduce DPDK per-lcore id variables, or lcore variables for short.
> > > 
> > > An lcore variable has one value for every current and future lcore
> > > id-equipped thread.
> > > 
> > > The primary <rte_lcore_var.h> use case is for statically allocating
> > > small chunks of often-used data, which is related logically, but where
> > > there are performance benefits to reap from having updates being local
> > > to an lcore.
> > > 
> > > Lcore variables are similar to thread-local storage (TLS, e.g., C11
> > > _Thread_local), but decoupling the values' life time with that of the
> > > threads.

<snip>

> > > +/*
> > > + * Avoid using offset zero, since it would result in a NULL-value
> > > + * "handle" (offset) pointer, which in principle and per the API
> > > + * definition shouldn't be an issue, but may confuse some tools and
> > > + * users.
> > > + */
> > > +#define INITIAL_OFFSET 1
> > > +
> > > +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
> > > +
> > 
> > While I like the idea of improved handling for per-core variables, my main
> > concern with this set is this definition here, which adds yet another
> > dependency on the compile-time defined RTE_MAX_LCORE value.
> > 
> 
> Lcore variables replace one RTE_MAX_LCORE-dependent pattern with another.
> 
> You could even argue the dependency on RTE_MAX_LCORE is reduced with lcore
> variables, if you look at where/in how many places in the code base this
> macro is being used. Centralizing per-lcore data management may also provide
> some opportunity in the future for extending the API to cope with some more
> dynamic RTE_MAX_LCORE variant. Not without ABI breakage of course, but we
> are not ever going to change anything related to RTE_MAX_LCORE without
> breaking the ABI, since this constant is everywhere, including compiled into
> the application itself.
> 

Yep, that is true if it's widely used.

> > I believe we already have an issue with this #define where it's impossible
> > to come up with a single value that works for all, or nearly all cases. The
> > current default is still 128, yet DPDK needs to support systems where the
> > number of cores is well into the hundreds, requiring workarounds of core
> > mappings or customized builds of DPDK. Upping the value fixes those issues
> > at the cost of memory footprint explosion for smaller systems.
> > 
> 
> I agree this is an issue.
> 
> RTE_MAX_LCORE also needs to be sized to accommodate not only all cores used,
> but the sum of all EAL threads and registered non-EAL threads.
> 
> So, there is no reliable way to derive what RTE_MAX_LCORE should be on a
> particular piece of hardware, since the actual number of lcore ids needed is
> up to the application.
> 
> Why is the default set so low? Linux has NR_CPUS, which serves the same
> purpose and is set to 4096 by default, if I recall correctly. Shouldn't
> we at least be able to increase it to 256?

The default is so low because of the mempool caches. Each mempool carries
a per-core cache array of 512 (IIRC) buffer pointers per core, dimensioned
for RTE_MAX_LCORE cores.

> 
> > I'm therefore nervous about putting more dependencies on this value, when I
> > feel we should be moving away from its use, to allow more runtime
> > configurability of cores.
> > 
> 
> What more specifically do you have in mind?
> 

I don't think having a dynamically scaling RTE_MAX_LCORE is feasible, but
what I would like to see is a runtime-specified value. For example, you
could run DPDK with the EAL parameter "--max-lcores=1024" for large systems
or "--max-lcores=32" for small ones. That would then be used at init-time to
scale all internal data structures appropriately.
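
A minimal sketch of the shape of that (the --max-lcores flag and the
accessor below are hypothetical, not existing EAL API):

    /* hypothetical: set once from --max-lcores in rte_eal_init() */
    static unsigned int eal_max_lcores = RTE_MAX_LCORE; /* default */

    unsigned int
    rte_max_lcores(void)
    {
            return eal_max_lcores;
    }

Internal data structures would then be sized from rte_max_lcores() at init
time, rather than dimensioned with RTE_MAX_LCORE at build time.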

/Bruce

<snip for brevity>

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20 11:39               ` Bruce Richardson
@ 2024-02-20 13:37                 ` Morten Brørup
  2024-02-20 16:26                 ` Mattias Rönnblom
  1 sibling, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-02-20 13:37 UTC (permalink / raw)
  To: Bruce Richardson, Mattias Rönnblom
  Cc: Mattias Rönnblom, dev, Stephen Hemminger

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 20 February 2024 12.39
> 
> On Tue, Feb 20, 2024 at 11:47:14AM +0100, Mattias Rönnblom wrote:
> > On 2024-02-20 10:11, Bruce Richardson wrote:
> > > On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
> > > > Introduce DPDK per-lcore id variables, or lcore variables for
> short.
> > > >
> > > > An lcore variable has one value for every current and future
> lcore
> > > > id-equipped thread.
> > > >
> > > > The primary <rte_lcore_var.h> use case is for statically
> allocating
> > > > small chunks of often-used data, which is related logically, but
> where
> > > > there are performance benefits to reap from having updates being
> local
> > > > to an lcore.
> > > >
> > > > Lcore variables are similar to thread-local storage (TLS, e.g.,
> C11
> > > > _Thread_local), but decoupling the values' lifetime from that of
> the
> > > > threads.
> 
> <snip>
> 
> > > > +/*
> > > > + * Avoid using offset zero, since it would result in a NULL-
> value
> > > > + * "handle" (offset) pointer, which in principle and per the API
> > > > + * definition shouldn't be an issue, but may confuse some tools
> and
> > > > + * users.
> > > > + */
> > > > +#define INITIAL_OFFSET 1
> > > > +
> > > > +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR]
> __rte_cache_aligned;
> > > > +
> > >
> > > While I like the idea of improved handling for per-core variables,
> my main
> > > concern with this set is this definition here, which adds yet
> another
> > > dependency on the compile-time defined RTE_MAX_LCORE value.
> > >
> >
> > Lcore variables replace one RTE_MAX_LCORE-dependent pattern with
> another.
> >
> > You could even argue the dependency on RTE_MAX_LCORE is reduced with
> lcore
> > variables, if you look at where/in how many places in the code base
> this
> > macro is being used. Centralizing per-lcore data management may also
> provide
> > some opportunity in the future for extending the API to cope with
> some more
> > dynamic RTE_MAX_LCORE variant. Not without ABI breakage of course,
> but we
> > are not ever going to change anything related to RTE_MAX_LCORE
> without
> > breaking the ABI, since this constant is everywhere, including
> compiled into
> > the application itself.
> >
> 
> Yep, that is true if it's widely used.
> 
> > > I believe we already have an issue with this #define where it's
> impossible
> > > to come up with a single value that works for all, or nearly all
> cases. The
> > > current default is still 128, yet DPDK needs to support systems
> where the
> > > number of cores is well into the hundreds, requiring workarounds of
> core
> > > mappings or customized builds of DPDK. Upping the value fixes those
> issues
> > > at the cost of memory footprint explosion for smaller systems.
> > >
> >
> > I agree this is an issue.
> >
> > RTE_MAX_LCORE also needs to be sized to accommodate not only all cores
> used,
> > but the sum of all EAL threads and registered non-EAL threads.
> >
> > So, there is no reliable way to discover what RTE_MAX_LCORE is on a
> > particular piece of hardware, since the actual number of lcore ids
> needed is
> > up to the application.
> >
> > Why is the default set so low? Linux has NR_CPUS, which serves the
> same
> > purpose and is set to 4096 by default, if I recall correctly.
> Shouldn't
> > we at least be able to increase it to 256?

I recall a recent techboard meeting where the default was discussed. The default was agreed so low because it suffices for the vast majority of hardware out there, and applications for bigger platforms can be expected to build DPDK with a different configuration themselves. And as Bruce also mentions, it's a tradeoff for memory consumption.

> 
> The default is so low because of the mempool caches. Each mempool carries
> an array of buffer pointers with 512 (IIRC) entries per core, up to
> RTE_MAX_LCORE.

The decision had to be made quickly, so we used narrow guesstimates, not a broader memory consumption analysis.

If we really cared about default memory consumption, we should reduce the default RTE_MAX_QUEUES_PER_PORT from 1024 too. It has quite an effect.

Having hard data about which build-time configuration parameters have the biggest effect on memory consumption would be extremely useful for tweaking the parameters for resource-limited hardware.
It's a mix of static and dynamic allocation, so it's not obvious which scalable data structures consume the most memory.

> 
> >
> > > I'm therefore nervous about putting more dependencies on this
> value, when I
> > > feel we should be moving away from its use, to allow more runtime
> > > configurability of cores.
> > >
> >
> > What more specifically do you have in mind?
> >
> 
> I don't think having a dynamically scaling RTE_MAX_LCORE is feasible,
> but
> what I would like to see is a runtime-specified value. For example, you
> could run DPDK with the EAL parameter "--max-lcores=1024" for large systems
> or
> "--max-lcores=32" for small ones. That would then be used at init-time
> to
> scale all internal data structures appropriately.
> 

I agree 100% that a better long-term solution should be on the general road map.
Memory is a precious resource, but few seem to care about it.

A mix could provide an easy migration path:
Having RTE_MAX_LCORE as the hard upper limit (and default value) for a runtime-specified max number ("rte_max_lcores").
With this, the goal would be for modules with very small data sets to continue using RTE_MAX_LCORE-sized fixed arrays, and for modules with larger data sets to migrate to dynamically sized arrays of rte_max_lcores entries.
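
A sketch of that hybrid (rte_max_lcores is the hypothetical runtime value;
the structs are illustrative):

    #include <stdint.h>
    #include <stdlib.h>

    /* very small per-lcore data set: the fixed-size array stays */
    static uint64_t small_counters[RTE_MAX_LCORE];

    /* large per-lcore data set: sized from the runtime value instead */
    struct big_state {
            char scratch[4096];
    };
    static struct big_state *big_states;

    static int
    module_init(unsigned int rte_max_lcores)
    {
            big_states = calloc(rte_max_lcores, sizeof(*big_states));
            return big_states != NULL ? 0 : -1;
    }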

I am opposed to blocking a new patch series only because it adds another RTE_MAX_LCORE-sized array. We already have plenty of those.
It can be migrated towards a dynamically sized array at a later time, just like the other modules with RTE_MAX_LCORE-sized arrays.
Perhaps "fixing" an existing module would free up more memory than fixing this module. Let's spend development resources where they have the biggest impact.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v3 3/6] random: keep PRNG state in lcore variable
  2024-02-20  8:49         ` [RFC v3 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-20 15:31           ` Morten Brørup
  0 siblings, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-02-20 15:31 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Tuesday, 20 February 2024 09.49
> 

[...]

> @@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
> 
>  	idx = rte_lcore_id();
> 
> -	/* last instance reserved for unregistered non-EAL threads */
>  	if (unlikely(idx == LCORE_ID_ANY))

idx is now only used here, so you could get rid of it by comparing directly to rte_lcore_id() instead.

Minor detail only; don't spin the patch for it.

> -		idx = RTE_MAX_LCORE;
> +		return &unregistered_rand_state;
> 
> -	return &rand_states[idx];
> +	return RTE_LCORE_VAR_PTR(rand_state);
>  }
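
With that change applied, the accessor would reduce to something like this
sketch (reconstructed from the quoted diff, not taken from the actual
patch):

    struct rte_rand_state *
    __rte_rand_get_state(void)
    {
            if (unlikely(rte_lcore_id() == LCORE_ID_ANY))
                    return &unregistered_rand_state;

            return RTE_LCORE_VAR_PTR(rand_state);
    }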


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20 11:39               ` Bruce Richardson
  2024-02-20 13:37                 ` Morten Brørup
@ 2024-02-20 16:26                 ` Mattias Rönnblom
  1 sibling, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-20 16:26 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Mattias Rönnblom, dev, Morten Brørup, Stephen Hemminger

On 2024-02-20 12:39, Bruce Richardson wrote:
> On Tue, Feb 20, 2024 at 11:47:14AM +0100, Mattias Rönnblom wrote:
>> On 2024-02-20 10:11, Bruce Richardson wrote:
>>> On Tue, Feb 20, 2024 at 09:49:03AM +0100, Mattias Rönnblom wrote:
>>>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>>>
>>>> An lcore variable has one value for every current and future lcore
>>>> id-equipped thread.
>>>>
>>>> The primary <rte_lcore_var.h> use case is for statically allocating
>>>> small chunks of often-used data, which is related logically, but where
>>>> there are performance benefits to reap from having updates being local
>>>> to an lcore.
>>>>
>>>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>>>> _Thread_local), but decoupling the values' lifetime from that of the
>>>> threads.
> 
> <snip>
> 
>>>> +/*
>>>> + * Avoid using offset zero, since it would result in a NULL-value
>>>> + * "handle" (offset) pointer, which in principle and per the API
>>>> + * definition shouldn't be an issue, but may confuse some tools and
>>>> + * users.
>>>> + */
>>>> +#define INITIAL_OFFSET 1
>>>> +
>>>> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
>>>> +
>>>
>>> While I like the idea of improved handling for per-core variables, my main
>>> concern with this set is this definition here, which adds yet another
>>> dependency on the compile-time defined RTE_MAX_LCORE value.
>>>
>>
>> Lcore variables replace one RTE_MAX_LCORE-dependent pattern with another.
>>
>> You could even argue the dependency on RTE_MAX_LCORE is reduced with lcore
>> variables, if you look at where/in how many places in the code base this
>> macro is being used. Centralizing per-lcore data management may also provide
>> some opportunity in the future for extending the API to cope with some more
>> dynamic RTE_MAX_LCORE variant. Not without ABI breakage of course, but we
>> are not ever going to change anything related to RTE_MAX_LCORE without
>> breaking the ABI, since this constant is everywhere, including compiled into
>> the application itself.
>>
> 
> Yep, that is true if it's widely used.
> 
>>> I believe we already have an issue with this #define where it's impossible
>>> to come up with a single value that works for all, or nearly all cases. The
>>> current default is still 128, yet DPDK needs to support systems where the
>>> number of cores is well into the hundreds, requiring workarounds of core
>>> mappings or customized builds of DPDK. Upping the value fixes those issues
>>> at the cost of memory footprint explosion for smaller systems.
>>>
>>
>> I agree this is an issue.
>>
>> RTE_MAX_LCORE also needs to be sized to accommodate not only all cores used,
>> but the sum of all EAL threads and registered non-EAL threads.
>>
>> So, there is no reliable way to discover what RTE_MAX_LCORE is on a
>> particular piece of hardware, since the actual number of lcore ids needed is
>> up to the application.
>>
>> Why is the default set so low? Linux has NR_CPUS, which serves the same
>> purpose and is set to 4096 by default, if I recall correctly. Shouldn't
>> we at least be able to increase it to 256?
> 
> The default is so low because of the mempool caches. Each mempool carries an
> array of buffer pointers with 512 (IIRC) entries per core, up to RTE_MAX_LCORE.
> 
>>
>>> I'm therefore nervous about putting more dependencies on this value, when I
>>> feel we should be moving away from its use, to allow more runtime
>>> configurability of cores.
>>>
>>
>> What more specifically do you have in mind?
>>
> 
> I don't think having a dynamically scaling RTE_MAX_LCORE is feasible, but
> what I would like to see is a runtime-specified value. For example, you
> could run DPDK with the EAL parameter "--max-lcores=1024" for large systems or
> "--max-lcores=32" for small ones. That would then be used at init-time to
> scale all internal data structures appropriately.
> 

Sounds reasonable to me, especially if you would take a gradual approach.

By gradual I mean something like adding a function 
rte_lcore_max_possible(), or something like that, returning the EAL 
init-specified value. DPDK libraries/PMDs could then gradually be made 
aware and take advantage of knowing that lcore ids will always be 
below a certain threshold, usually significantly lower than RTE_MAX_LCORE.

The only change required for lcore variables would be that the FOREACH 
macro would use the run-time-max value, rather than RTE_MAX_LCORE, which 
in turn would leave all the higher-numbered lcore id buffers 
untouched/unmapped.
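
In concrete terms, that FOREACH change might look like this sketch (with
rte_lcore_max_possible() being the hypothetical accessor mentioned above;
the body otherwise mirrors the macro from the patch):

    #define RTE_LCORE_VAR_FOREACH_VALUE(var, name)                        \
            for (unsigned int lcore_id =                                  \
                         (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0); \
                 lcore_id < rte_lcore_max_possible();                     \
                 lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))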

The set of possible lcore ids could also be expressed as a bitset, if 
you have a machine with a huge number of cores, running many small DPDK 
instances.

> /Bruce
> 
> <snip for brevity>

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-20  9:11           ` Bruce Richardson
@ 2024-02-21  9:43           ` Jerin Jacob
  2024-02-21 10:31             ` Morten Brørup
  2024-02-21 14:26             ` Mattias Rönnblom
  2024-02-22  9:22           ` Morten Brørup
  2024-02-25 15:03           ` [RFC v4 0/6] Lcore variables Mattias Rönnblom
  3 siblings, 2 replies; 313+ messages in thread
From: Jerin Jacob @ 2024-02-21  9:43 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Stephen Hemminger, Tomasz Duszynski

On Tue, Feb 20, 2024 at 2:35 PM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> Introduce DPDK per-lcore id variables, or lcore variables for short.
>
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
>
> The primary <rte_lcore_var.h> use case is for statically allocating
> small chunks of often-used data, which is related logically, but where
> there are performance benefits to reap from having updates being local
> to an lcore.

I think, in order to quantify the gain, we must add a performance test
case to measure the access cycles with the lcore variables scheme vs this
scheme.
Other PMU counters (cache misses) may be interesting, but we don't have
the means in DPDK to do self-monitoring now, like
https://patches.dpdk.org/project/dpdk/patch/20221213104350.3218167-1-tduszynski@marvell.com/
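
Such a micro-benchmark could be as simple as this sketch ('counter' is an
illustrative uint64_t lcore variable; the usual rte_rdtsc() measurement
caveats apply, and it must run on a thread with an lcore id):

    #include <rte_cycles.h>
    #include <rte_lcore_var.h>

    static RTE_LCORE_VAR_HANDLE(uint64_t, counter);
    RTE_LCORE_VAR_INIT(counter);

    static uint64_t
    measure_access_cycles(unsigned int iterations)
    {
            uint64_t start = rte_rdtsc();

            for (unsigned int i = 0; i < iterations; i++)
                    (*RTE_LCORE_VAR_PTR(counter))++;

            return (rte_rdtsc() - start) / iterations;
    }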

>
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> > _Thread_local), but decoupling the values' lifetime from that of the
> threads.
>
> > Lcore variables are also similar, in terms of functionality, to the
> > FreeBSD kernel's DPCPU_*() family of macros and the associated
> > build-time machinery. DPCPU uses linker scripts, which effectively
> > prevents the reuse of its otherwise seemingly viable approach.
>
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> > the same module, which in turn avoids excessive use of padding,
> polluting caches with unused data.
>
> RFC v3:
>  * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>  * Update example to reflect FOREACH macro name change (in RFC v2).
>
> RFC v2:
>  * Use alignof to derive alignment requirements. (Morten Brørup)
>  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>  * Allow user-specified alignment, but limit max to cache line size.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  config/rte_config.h                   |   1 +
>  doc/api/doxy-api-index.md             |   1 +
>  lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>  lib/eal/common/meson.build            |   1 +
>  lib/eal/include/meson.build           |   1 +
>  lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>  lib/eal/version.map                   |   4 +
>  7 files changed, 465 insertions(+)
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
>
> diff --git a/config/rte_config.h b/config/rte_config.h
> index da265d7dd2..884482e473 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -30,6 +30,7 @@
>  /* EAL defines */
>  #define RTE_CACHE_GUARD_LINES 1
>  #define RTE_MAX_HEAPS 32
> +#define RTE_MAX_LCORE_VAR 1048576
>  #define RTE_MAX_MEMSEG_LISTS 128
>  #define RTE_MAX_MEMSEG_PER_LIST 8192
>  #define RTE_MAX_MEM_MB_PER_LIST 32768
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index a6a768bd7c..bb06bb7ca1 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>    [interrupts](@ref rte_interrupts.h),
>    [launch](@ref rte_launch.h),
>    [lcore](@ref rte_lcore.h),
> +  [lcore-variable](@ref rte_lcore_var.h),
>    [per-lcore](@ref rte_per_lcore.h),
>    [service cores](@ref rte_service.h),
>    [keepalive](@ref rte_keepalive.h),
> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
> new file mode 100644
> index 0000000000..dfd11cbd0b
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,82 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_log.h>
> +
> +#include <rte_lcore_var.h>
> +
> +#include "eal_private.h"
> +
> +#define WARN_THRESHOLD 75
> +
> +/*
> + * Avoid using offset zero, since it would result in a NULL-value
> + * "handle" (offset) pointer, which in principle and per the API
> + * definition shouldn't be an issue, but may confuse some tools and
> + * users.
> + */
> +#define INITIAL_OFFSET 1
> +
> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
> +
> +static uintptr_t allocated = INITIAL_OFFSET;
> +
> +static void
> +verify_allocation(uintptr_t new_allocated)
> +{
> +       static bool has_warned;
> +
> +       RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
> +
> +       if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
> +           !has_warned) {
> +               EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
> +                       "of the maximum capacity (%d bytes)", WARN_THRESHOLD,
> +                       RTE_MAX_LCORE_VAR);
> +               has_warned = true;
> +       }
> +}
> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +       uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
> +
> +       void *offset = (void *)new_allocated;
> +
> +       new_allocated += size;
> +
> +       verify_allocation(new_allocated);
> +
> +       allocated = new_allocated;
> +
> +       EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> +               "%"PRIuPTR"-byte alignment", size, align);
> +
> +       return offset;
> +}
> +
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +{
> +       /* Having the per-lcore buffer size aligned on cache lines,
> +        * as well as having the base pointer aligned on the cache
> +        * line size, assures that aligned offsets also translate to
> +        * aligned pointers across all values.
> +        */
> +       RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> +       RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> +
> +       /* '0' means asking for worst-case alignment requirements */
> +       if (align == 0)
> +               align = alignof(max_align_t);
> +
> +       RTE_ASSERT(rte_is_power_of_2(align));
> +
> +       return lcore_var_alloc(size, align);
> +}
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 22a626ba6f..d41403680b 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -18,6 +18,7 @@ sources += files(
>          'eal_common_interrupts.c',
>          'eal_common_launch.c',
>          'eal_common_lcore.c',
> +        'eal_common_lcore_var.c',
>          'eal_common_mcfg.c',
>          'eal_common_memalloc.c',
>          'eal_common_memory.c',
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..9449253e23 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -27,6 +27,7 @@ headers += files(
>          'rte_keepalive.h',
>          'rte_launch.h',
>          'rte_lcore.h',
> +        'rte_lcore_var.h',
>          'rte_lock_annotations.h',
>          'rte_malloc.h',
>          'rte_mcslock.h',
> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
> new file mode 100644
> index 0000000000..da49d48d7c
> --- /dev/null
> +++ b/lib/eal/include/rte_lcore_var.h
> @@ -0,0 +1,375 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#ifndef _RTE_LCORE_VAR_H_
> +#define _RTE_LCORE_VAR_H_
> +
> +/**
> + * @file
> + *
> + * RTE Per-lcore id variables
> + *
> + * This API provides a mechanism to create and access per-lcore id
> + * variables in a space- and cycle-efficient manner.
> + *
> + * A per-lcore id variable (or lcore variable for short) has one value
> + * for each EAL thread and registered non-EAL thread. In other words,
> + * there's one copy of its value for each and every current and future
> + * lcore id-equipped thread, with the total number of copies amounting
> + * to \c RTE_MAX_LCORE.
> + *
> + * In order to access the values of an lcore variable, a handle is
> + * used. The type of the handle is a pointer to the value's type
> + * (e.g., for a \c uint32_t lcore variable, the handle is a
> + * <code>uint32_t *</code>). A handle may be passed between modules and
> + * threads just like any pointer, but its value is not the address of
> + * any particular object, but rather just an opaque identifier, stored
> + * in a typed pointer (to inform the access macro the type of values).
> + *
> + * @b Creation
> + *
> + * An lcore variable is created in two steps:
> + *  1. Define a lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
> + *  2. Allocate lcore variable storage and initialize the handle with
> + *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
> + *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
> + *     module initialization, but may be done at any time.
> + *
> + * An lcore variable is not tied to the owning thread's lifetime. It's
> + * available for use by any thread immediately after having been
> + * allocated, and continues to be available throughout the lifetime of
> + * the EAL.
> + *
> + * Lcore variables cannot and need not be freed.
> + *
> + * @b Access
> + *
> + * The value of any lcore variable for any lcore id may be accessed
> + * from any thread (including unregistered threads), but it should
> + * generally only be *frequently* read from or written to by the owner.
> + *
> + * Values of the same lcore variable but owned by different lcore
> + * ids *may* be frequently read or written by the owners without the
> + * risk of false sharing.
> + *
> + * An appropriate synchronization mechanism (e.g., atomics) should
> + * be employed to assure there are no data races between the owning
> + * thread and any non-owner threads accessing the same lcore variable
> + * instance.
> + *
> + * The value of the lcore variable for a particular lcore id may be
> + * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
> + * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
> + *
> + * To modify the value of an lcore variable for a particular lcore id,
> + * either access the object through the pointer retrieved by \ref
> + * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
> + * RTE_LCORE_VAR_LCORE_SET.
> + *
> + * The access macros each have a short-hand which may be used by an EAL
> + * thread or registered non-EAL thread to access the lcore variable
> + * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
> + * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
> + *
> + * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
> + * pointer with the same type as the value, it may not be directly
> + * dereferenced and must be treated as an opaque identifier. The
> + * *identifier* value is common across all lcore ids.
> + *
> + * @b Storage
> + *
> + * An lcore variable's values may be of a primitive type like \c int,
> + * but would more typically be a \c struct. An application may choose
> + * to define an lcore variable, which it then goes on to never
> + * allocate.
> + *
> + * The lcore variable handle introduces a per-variable (not
> + * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
> + * there are some memory footprint gains to be made by organizing all
> + * per-lcore id data for a particular module as one lcore variable
> + * (e.g., as a struct).
> + *
> + * The sum of all lcore variables, plus any padding required, must be
> + * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
> + * violation of this maximum results in the process being terminated.
> + *
> + * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
> + * same order of magnitude in size as a thread stack.
> + *
> + * The lcore variable storage buffers are kept in the BSS section in
> + * the resulting binary, where data generally isn't mapped in until
> + * it's accessed. This means that unused portions of the lcore
> + * variable storage area will not occupy any physical memory (with a
> + * granularity of the memory page size [usually 4 kB]).
> + *
> + * Lcore variables should generally *not* be \ref __rte_cache_aligned
> + * and need *not* include a \ref RTE_CACHE_GUARD field, since the use
> + * of these constructs is designed to avoid false sharing. In the
> + * case of an lcore variable instance, all nearby data structures
> + * should almost-always be written to by a single thread (the lcore
> + * variable owner). Adding padding will increase the effective memory
> + * working set size, and potentially reduce performance.
> + *
> + * @b Example
> + *
> + * Below is an example of the use of an lcore variable:
> + *
> + * \code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + * };
> + *
> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
> + *
> + * long foo_get_a_plus_b(void)
> + * {
> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
> + *
> + *         return state->a + state->b;
> + * }
> + *
> + * RTE_INIT(rte_foo_init)
> + * {
> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
> + *
> + *         struct foo_lcore_state *state;
> + *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
> + *                 (initialize 'state')
> + *         }
> + *
> + *         (other initialization)
> + * }
> + * \endcode
> + *
> + *
> + * @b Alternatives
> + *
> + * Lcore variables are designed to replace a pattern exemplified below:
> + * \code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + *         RTE_CACHE_GUARD;
> + * } __rte_cache_aligned;
> + *
> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
> + * \endcode
> + *
> + * This scheme is simple and effective, but has one drawback: the data
> + * is organized so that objects related to all lcores for a particular
> + * module are kept close in memory. At a bare minimum, this forces the
> + * use of cache-line alignment to avoid false sharing. With CPU
> + * hardware prefetching and memory loads resulting from speculative
> + * execution (functions which seemingly are getting more eager faster
> + * than they are getting more intelligent), one or more "guard" cache
> + * lines may be required to separate one lcore's data from another's.
> + *
> + * Lcore variables have the upside of working with, not against, the
> + * CPU's assumptions and for example next-line prefetchers may well
> + * work the way their designers intended (i.e., to the benefit, not
> + * detriment, of system performance).
> + *
> + * Another alternative to \ref rte_lcore_var.h is the \ref
> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
> + * e.g., GCC __thread or C11 _Thread_local). The main differences
> + * between using the various forms of TLS (e.g., \ref
> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
> + * variables are:
> + *
> + *   * The existence and non-existence of a thread-local variable
> + *     instance follow that of the particular thread. The data cannot be
> + *     accessed before the thread has been created, nor after it has
> + *     exited. One effect of this is that thread-local variables must be
> + *     initialized in a "lazy" manner (e.g., at the point of thread
> + *     creation). Lcore variables may be accessed immediately after
> + *     having been allocated (which is usually prior to any thread beyond
> + *     the main thread running).
> + *   * A thread-local variable is duplicated across all threads in the
> + *     process, including unregistered non-EAL threads (i.e.,
> + *     "regular" threads). For DPDK applications heavily relying on
> + *     multi-threading (in conjunction with DPDK's "one thread per core"
> + *     pattern), either by having many concurrent threads or
> + *     creating/destroying threads at a high rate, an excessive use of
> + *     thread-local variables may cause inefficiencies (e.g.,
> + *     increased thread creation overhead due to thread-local storage
> + *     initialization or increased total RAM footprint usage). Lcore
> + *     variables *only* exist for threads with an lcore id, and thus
> + *     not for such "regular" threads.
> + *   * Whether data in thread-local storage may be shared between threads
> + *     (i.e., whether a pointer to a thread-local variable can be passed
> + *     to and successfully dereferenced by a non-owning thread) depends on
> + *     the details of the TLS implementation. With GCC __thread and
> + *     GCC _Thread_local, such data sharing is supported. In the C11
> + *     standard, the result of accessing another thread's
> + *     _Thread_local object is implementation-defined. Lcore variable
> + *     instances may be accessed reliably by any thread.
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stddef.h>
> +#include <stdalign.h>
> +
> +#include <rte_common.h>
> +#include <rte_config.h>
> +#include <rte_lcore.h>
> +
> +/**
> + * Given the lcore variable type, produces the type of the lcore
> + * variable handle.
> + */
> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)                \
> +       type *
> +
> +/**
> + * Define a lcore variable handle.
> + *
> + * This macro defines a variable which is used as a handle to access
> + * the various per-lcore id instances of a per-lcore id variable.
> + *
> + * The aim with this macro is to make clear at the point of
> + * declaration that this is an lcore handle, rather than a regular
> + * pointer.
> + *
> + * Add @b static as a prefix in case the lcore variable is only to be
> + * accessed from a particular translation unit.
> + */
> +#define RTE_LCORE_VAR_HANDLE(type, name)       \
> +       RTE_LCORE_VAR_HANDLE_TYPE(type) name
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)      \
> +       name = rte_lcore_var_alloc(size, align)
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle,
> + * with values aligned for any type of object.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)   \
> +       name = rte_lcore_var_alloc(size, 0)
> +
> +/**
> + * Allocate space for an lcore variable of the size and alignment requirements
> + * suggested by the handle pointer type, and initialize its handle.
> + */
> +#define RTE_LCORE_VAR_ALLOC(name)                                      \
> +       RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),           \
> +                                      alignof(typeof(*(name))))
> +
> +/**
> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
> + * means of a \ref RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)               \
> +       RTE_INIT(rte_lcore_var_init_ ## name)                           \
> +       {                                                               \
> +               RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);      \
> +       }
> +
> +/**
> + * Allocate an explicitly-sized lcore variable by means of a \ref
> + * RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)            \
> +       RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
> +
> +/**
> + * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT(name)                                       \
> +       RTE_INIT(rte_lcore_var_init_ ## name)                           \
> +       {                                                               \
> +               RTE_LCORE_VAR_ALLOC(name);                              \
> +       }
> +
> +#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)              \
> +       ((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
> +
> +/**
> + * Get pointer to lcore variable instance with the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)                                \
> +       ((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> +
> +/**
> + * Get value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)                \
> +       (*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
> +
> +/**
> + * Set the value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)         \
> +       (*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
> +
> +/**
> + * Get pointer to lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
> +
> +/**
> + * Get value of lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
> +
> +/**
> + * Set value of lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_SET(name, value) \
> +       RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
> +
> +/**
> + * Iterate over each lcore id's value for a lcore variable.
> + */
> +#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)                         \
> +       for (unsigned int lcore_id =                                    \
> +                    (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);   \
> +            lcore_id < RTE_MAX_LCORE;                                  \
> +            lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> +
> +extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
> +
> +/**
> + * Allocate space in the per-lcore id buffers for a lcore variable.
> + *
> + * The pointer returned is only an opaque identifier of the variable. To
> + * get an actual pointer to a particular instance of the variable use
> + * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
> + *
> + * The allocation is always successful, barring a fatal exhaustion of
> + * the per-lcore id buffer space.
> + *
> + * @param size
> + *   The size (in bytes) of the variable's per-lcore id value.
> + * @param align
> + *   If 0, the values will be suitably aligned for any kind of type
> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
> + *   on a multiple of *align*, which must be a power of 2 and equal or
> + *   less than \c RTE_CACHE_LINE_SIZE.
> + * @return
> + *   The id of the variable, stored in a void pointer value.
> + */
> +__rte_experimental
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_LCORE_VAR_H_ */
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 5e0cd47c82..e90b86115a 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -393,6 +393,10 @@ EXPERIMENTAL {
>         # added in 23.07
>         rte_memzone_max_get;
>         rte_memzone_max_set;
> +
> +       # added in 24.03
> +       rte_lcore_var_alloc;
> +       rte_lcore_var;
>  };
>
>  INTERNAL {
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-21  9:43           ` Jerin Jacob
@ 2024-02-21 10:31             ` Morten Brørup
  2024-02-21 14:26             ` Mattias Rönnblom
  1 sibling, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-02-21 10:31 UTC (permalink / raw)
  To: Jerin Jacob, Mattias Rönnblom
  Cc: dev, hofors, Stephen Hemminger, Tomasz Duszynski

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Wednesday, 21 February 2024 10.44
> 
> On Tue, Feb 20, 2024 at 2:35 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
> >
> > Introduce DPDK per-lcore id variables, or lcore variables for short.
> >
> > An lcore variable has one value for every current and future lcore
> > id-equipped thread.
> >
> > The primary <rte_lcore_var.h> use case is for statically allocating
> > small chunks of often-used data, which is related logically, but
> where
> > there are performance benefits to reap from having updates being
> local
> > to an lcore.
> 
> I think, in order to quantify the gain, we must add a performance test
> case to measure the access cycles with the lcore variables scheme vs this
> scheme.
> Other PMU counters (cache misses) may be interesting, but we don't have
> the means in DPDK to do self-monitoring now, like
> https://patches.dpdk.org/project/dpdk/patch/20221213104350.3218167-1-
> tduszynski@marvell.com/
> 
> >
> > Lcore variables are similar to thread-local storage (TLS, e.g., C11
> > _Thread_local), but decoupling the values' lifetime from that of the
> > threads.

Lcore variables can be accessed by other threads, unlike TLS variables.

If a TLS variable needs to be accessed by other threads, there must also be an RTE_MAX_LCORE-sized array of pointers to the TLS variable, where each worker thread must initialize the entry pointing to its TLS variable.
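
A sketch of that workaround (struct foo_state and the function names are
illustrative):

    #include <rte_lcore.h>
    #include <rte_per_lcore.h>

    struct foo_state { uint64_t counter; };

    static RTE_DEFINE_PER_LCORE(struct foo_state, foo_tls_state);

    /* per-lcore pointers, so non-owning threads can reach the data */
    static struct foo_state *foo_state_ptrs[RTE_MAX_LCORE];

    /* each worker thread must run this once before other threads can
     * access its state */
    static void
    foo_register_state(void)
    {
            foo_state_ptrs[rte_lcore_id()] = &RTE_PER_LCORE(foo_tls_state);
    }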

> >
> > Lcore variables are also similar in terms of functionality provided
> by
> > FreeBSD kernel's DPCPU_*() family of macros and the associated
> > build-time machinery. DPCPU uses linker scripts, which effectively
> > prevents the reuse of its, otherwise seemingly viable, approach.
> >
> > The currently-prevailing way to solve the same problem as lcore
> > variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> > array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> > lcore variables over this approach is that data related to the same
> > lcore now is close (spatially, in memory), rather than data used by
> > the same module, which in turn avoid excessive use of padding,
> > polluting caches with unused data.
> >

There are 3 ways to implement per-lcore variables:
1. Thread-local storage, available via RTE_DEFINE_PER_LCORE(type, name).
2. RTE_MAX_LCORE-sized arrays.
3. Lcore variables, as provided by this patch series.

Perhaps an overview of differences and performance numbers would help understand the benefits of this patch series.

The advantages of packing more variables into the same cache line may be hard to measure without PMU counters, and could perhaps be described or estimated instead.
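
For reference, the three declaration patterns side by side (a sketch;
struct foo_state is illustrative):

    /* 1. thread-local storage */
    static RTE_DEFINE_PER_LCORE(struct foo_state, tls_states);

    /* 2. RTE_MAX_LCORE-sized array */
    static struct foo_state array_states[RTE_MAX_LCORE] __rte_cache_aligned;

    /* 3. lcore variable, as proposed in this patch series */
    static RTE_LCORE_VAR_HANDLE(struct foo_state, lcore_states);
    RTE_LCORE_VAR_INIT(lcore_states);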


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-21  9:43           ` Jerin Jacob
  2024-02-21 10:31             ` Morten Brørup
@ 2024-02-21 14:26             ` Mattias Rönnblom
  1 sibling, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-21 14:26 UTC (permalink / raw)
  To: Jerin Jacob, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger, Tomasz Duszynski

On 2024-02-21 10:43, Jerin Jacob wrote:
> On Tue, Feb 20, 2024 at 2:35 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is related logically, but where
>> there are performance benefits to reap from having updates being local
>> to an lcore.
> 
> I think, in order to quantify the gain, we must add a performance test
> case to measure the acces cycles with lcore variables scheme vs this
> scheme.

As I might have mentioned elsewhere in the thread, the micro benchmarks 
are already there, in the form of the service and random perf tests.

The service perf tests don't show any difference, and the rand perf 
tests seem to indicate lcore variables add one (1) core clock cycle per 
rte_rand() call (measured on Raptor Lake E- and P-cores).

The effects on a real-world app would be highly dependent on what DPDK 
services it's using that themselves are using static per-lcore data, and 
to what extent the app itself uses per-lcore data.

Provided lcore variables perform as well as the cache-aligned static 
array pattern in micro benchmarks, lcore variables should always 
be as good or better in a real-world app, because the cache working set 
size will always be smaller (no padding).

That said, I don't think lcore variables will result in a noticeable 
performance gain for the typical app. If you do see large gains, I 
suspect it will be on systems with next-N-lines prefetchers where the 
lcore data wasn't RTE_CACHE_GUARDed.

> Other PMU counters(Cache misses) may be interesting but we dont have
> means in DPDK to do self monitoring now like
> https://patches.dpdk.org/project/dpdk/patch/20221213104350.3218167-1-tduszynski@marvell.com/
> 
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar, in terms of functionality, to the
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its otherwise seemingly viable approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoids excessive use of padding,
>> polluting caches with unused data.
>>
>> RFC v3:
>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>
>> RFC v2:
>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>   * Allow user-specified alignment, but limit max to cache line size.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   config/rte_config.h                   |   1 +
>>   doc/api/doxy-api-index.md             |   1 +
>>   lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>>   lib/eal/common/meson.build            |   1 +
>>   lib/eal/include/meson.build           |   1 +
>>   lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>>   lib/eal/version.map                   |   4 +
>>   7 files changed, 465 insertions(+)
>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index da265d7dd2..884482e473 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -30,6 +30,7 @@
>>   /* EAL defines */
>>   #define RTE_CACHE_GUARD_LINES 1
>>   #define RTE_MAX_HEAPS 32
>> +#define RTE_MAX_LCORE_VAR 1048576
>>   #define RTE_MAX_MEMSEG_LISTS 128
>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index a6a768bd7c..bb06bb7ca1 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>>     [interrupts](@ref rte_interrupts.h),
>>     [launch](@ref rte_launch.h),
>>     [lcore](@ref rte_lcore.h),
>> +  [lcore-variable](@ref rte_lcore_var.h),
>>     [per-lcore](@ref rte_per_lcore.h),
>>     [service cores](@ref rte_service.h),
>>     [keepalive](@ref rte_keepalive.h),
>> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
>> new file mode 100644
>> index 0000000000..dfd11cbd0b
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,82 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define WARN_THRESHOLD 75
>> +
>> +/*
>> + * Avoid using offset zero, since it would result in a NULL-value
>> + * "handle" (offset) pointer, which in principle and per the API
>> + * definition shouldn't be an issue, but may confuse some tools and
>> + * users.
>> + */
>> +#define INITIAL_OFFSET 1
>> +
>> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
>> +
>> +static uintptr_t allocated = INITIAL_OFFSET;
>> +
>> +static void
>> +verify_allocation(uintptr_t new_allocated)
>> +{
>> +       static bool has_warned;
>> +
>> +       RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
>> +
>> +       if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
>> +           !has_warned) {
>> +               EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
>> +                       "of the maximum capacity (%d bytes)", WARN_THRESHOLD,
>> +                       RTE_MAX_LCORE_VAR);
>> +               has_warned = true;
>> +       }
>> +}
>> +
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +       uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
>> +
>> +       void *offset = (void *)new_allocated;
>> +
>> +       new_allocated += size;
>> +
>> +       verify_allocation(new_allocated);
>> +
>> +       allocated = new_allocated;
>> +
>> +       EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>> +               "%"PRIuPTR"-byte alignment", size, align);
>> +
>> +       return offset;
>> +}
>> +
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align)
>> +{
>> +       /* Having the per-lcore buffer size aligned on cache lines,
>> +        * as well as having the base pointer aligned on the cache
>> +        * line size, assures that aligned offsets also translate to
>> +        * aligned pointers across all values.
>> +        */
>> +       RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
>> +       RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
>> +
>> +       /* '0' means asking for worst-case alignment requirements */
>> +       if (align == 0)
>> +               align = alignof(max_align_t);
>> +
>> +       RTE_ASSERT(rte_is_power_of_2(align));
>> +
>> +       return lcore_var_alloc(size, align);
>> +}
>> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
>> index 22a626ba6f..d41403680b 100644
>> --- a/lib/eal/common/meson.build
>> +++ b/lib/eal/common/meson.build
>> @@ -18,6 +18,7 @@ sources += files(
>>           'eal_common_interrupts.c',
>>           'eal_common_launch.c',
>>           'eal_common_lcore.c',
>> +        'eal_common_lcore_var.c',
>>           'eal_common_mcfg.c',
>>           'eal_common_memalloc.c',
>>           'eal_common_memory.c',
>> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
>> index e94b056d46..9449253e23 100644
>> --- a/lib/eal/include/meson.build
>> +++ b/lib/eal/include/meson.build
>> @@ -27,6 +27,7 @@ headers += files(
>>           'rte_keepalive.h',
>>           'rte_launch.h',
>>           'rte_lcore.h',
>> +        'rte_lcore_var.h',
>>           'rte_lock_annotations.h',
>>           'rte_malloc.h',
>>           'rte_mcslock.h',
>> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
>> new file mode 100644
>> index 0000000000..da49d48d7c
>> --- /dev/null
>> +++ b/lib/eal/include/rte_lcore_var.h
>> @@ -0,0 +1,375 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#ifndef _RTE_LCORE_VAR_H_
>> +#define _RTE_LCORE_VAR_H_
>> +
>> +/**
>> + * @file
>> + *
>> + * RTE Per-lcore id variables
>> + *
>> + * This API provides a mechanism to create and access per-lcore id
>> + * variables in a space- and cycle-efficient manner.
>> + *
>> + * A per-lcore id variable (or lcore variable for short) has one value
>> + * for each EAL thread and registered non-EAL thread. In other words,
>> + * there's one copy of its value for each and every current and future
>> + * lcore id-equipped thread, with the total number of copies amounting
>> + * to \c RTE_MAX_LCORE.
>> + *
>> + * In order to access the values of an lcore variable, a handle is
>> + * used. The type of the handle is a pointer to the value's type
>> + * (e.g., for a \c uint32_t lcore variable, the handle is a
>> + * <code>uint32_t *</code>). A handle may be passed between modules and
>> + * threads just like any pointer, but its value is not the address of
>> + * any particular object, but rather just an opaque identifier, stored
>> + * in a typed pointer (to inform the access macro the type of values).
>> + *
>> + * @b Creation
>> + *
>> + * An lcore variable is created in two steps:
>> + *  1. Define a lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
>> + *  2. Allocate lcore variable storage and initialize the handle with
>> + *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
>> + *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
>> + *     module initialization, but may be done at any time.
>> + *
>> + * An lcore variable is not tied to the owning thread's lifetime. It's
>> + * available for use by any thread immediately after having been
>> + * allocated, and continues to be available throughout the lifetime of
>> + * the EAL.
>> + *
>> + * Lcore variables cannot and need not be freed.
>> + *
>> + * @b Access
>> + *
>> + * The value of any lcore variable for any lcore id may be accessed
>> + * from any thread (including unregistered threads), but it should
>> + * generally only be *frequently* read from or written to by the owner.
>> + *
>> + * Values of the same lcore variable but owned by different lcore
>> + * ids *may* be frequently read or written by the owners without the
>> + * risk of false sharing.
>> + *
>> + * An appropriate synchronization mechanism (e.g., atomics) should
>> + * be employed to assure there are no data races between the owning
>> + * thread and any non-owner threads accessing the same lcore variable
>> + * instance.
>> + *
>> + * The value of the lcore variable for a particular lcore id may be
>> + * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
>> + * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
>> + *
>> + * To modify the value of an lcore variable for a particular lcore id,
>> + * either access the object through the pointer retrieved by \ref
>> + * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
>> + * RTE_LCORE_VAR_LCORE_SET.
>> + *
>> + * The access macros each have a short-hand which may be used by an EAL
>> + * thread or registered non-EAL thread to access the lcore variable
>> + * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
>> + * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
>> + *
>> + * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
>> + * pointer with the same type as the value, it may not be directly
>> + * dereferenced and must be treated as an opaque identifier. The
>> + * *identifier* value is common across all lcore ids.
>> + *
>> + * @b Storage
>> + *
>> + * An lcore variable's values may be of a primitive type like \c int,
>> + * but would more typically be a \c struct. An application may choose
>> + * to define an lcore variable, which it then goes on to never
>> + * allocate.
>> + *
>> + * The lcore variable handle introduces a per-variable (not
>> + * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
>> + * there are some memory footprint gains to be made by organizing all
>> + * per-lcore id data for a particular module as one lcore variable
>> + * (e.g., as a struct).
>> + *
>> + * The sum of all lcore variables, plus any padding required, must be
>> + * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
>> + * violation of this maximum results in the process being terminated.
>> + *
>> + * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
>> + * same order of magnitude in size as a thread stack.
>> + *
>> + * The lcore variable storage buffers are kept in the BSS section in
>> + * the resulting binary, where data generally isn't mapped in until
>> + * it's accessed. This means that unused portions of the lcore
>> + * variable storage area will not occupy any physical memory (with a
>> + * granularity of the memory page size [usually 4 kB]).
>> + *
>> + * Lcore variables should generally *not* be \ref __rte_cache_aligned
>> + * and need *not* include a \ref RTE_CACHE_GUARD field, since the use
>> + * of these constructs is designed to avoid false sharing. In the
>> + * case of an lcore variable instance, all nearby data structures
>> + * should almost-always be written to by a single thread (the lcore
>> + * variable owner). Adding padding will increase the effective memory
>> + * working set size, and potentially reduce performance.
>> + *
>> + * @b Example
>> + *
>> + * Below is an example of the use of an lcore variable:
>> + *
>> + * \code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + * };
>> + *
>> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
>> + *
>> + * long foo_get_a_plus_b(void)
>> + * {
>> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
>> + *
>> + *         return state->a + state->b;
>> + * }
>> + *
>> + * RTE_INIT(rte_foo_init)
>> + * {
>> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
>> + *
>> + *         struct foo_lcore_state *state;
>> + *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
>> + *                 (initialize 'state')
>> + *         }
>> + *
>> + *         (other initialization)
>> + * }
>> + * \endcode
>> + *
>> + *
>> + * @b Alternatives
>> + *
>> + * Lcore variables are designed to replace a pattern exemplified below:
>> + * \code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + *         RTE_CACHE_GUARD;
>> + * } __rte_cache_aligned;
>> + *
>> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
>> + * \endcode
>> + *
>> + * This scheme is simple and effective, but has one drawback: the data
>> + * is organized so that objects related to all lcores for a particular
>> + * module are kept close in memory. At a bare minimum, this forces the
>> + * use of cache-line alignment to avoid false sharing. With CPU
>> + * hardware prefetching and memory loads resulting from speculative
>> + * execution (functions which seemingly are getting more eager faster
>> + * than they are getting more intelligent), one or more "guard" cache
>> + * lines may be required to separate one lcore's data from another's.
>> + *
>> + * Lcore variables have the upside of working with, not against, the
>> + * CPU's assumptions, and for example next-line prefetchers may well
>> + * work the way their designers intended (i.e., to the benefit, not
>> + * detriment, of system performance).
>> + *
>> + * Another alternative to \ref rte_lcore_var.h is the \ref
>> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
>> + * e.g., GCC __thread or C11 _Thread_local). The main differences
>> + * between using the various forms of TLS (e.g., \ref
>> + * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
>> + * variables are:
>> + *
>> + *   * The lifetime of a thread-local variable instance follows that
>> + *     of the owning thread. The data cannot be accessed before the
>> + *     thread has been created, nor after it has exited. One effect of
>> + *     this is that thread-local variables must be initialized in a
>> + *     "lazy" manner (e.g., at the point of thread creation). Lcore
>> + *     variables may be accessed immediately after having been
>> + *     allocated (which is usually prior to any thread beyond the
>> + *     main thread running).
>> + *   * A thread-local variable is duplicated across all threads in the
>> + *     process, including unregistered non-EAL threads (i.e.,
>> + *     "regular" threads). For DPDK applications heavily relying on
>> + *     multi-threading (in conjunction with DPDK's "one thread per core"
>> + *     pattern), either by having many concurrent threads or
>> + *     creating/destroying threads at a high rate, an excessive use of
>> + *     thread-local variables may cause inefficiencies (e.g.,
>> + *     increased thread creation overhead due to thread-local storage
>> + *     initialization or increased total RAM footprint usage). Lcore
>> + *     variables *only* exist for threads with an lcore id, and thus
>> + *     not for such "regular" threads.
>> + *   * Whether data in thread-local storage may be shared between
>> + *     threads (i.e., whether a pointer to a thread-local variable can
>> + *     be passed to and successfully dereferenced by a non-owning
>> + *     thread) depends on the details of the TLS implementation. With
>> + *     GCC __thread and
>> + *     GCC _Thread_local, such data sharing is supported. In the C11
>> + *     standard, the result of accessing another thread's
>> + *     _Thread_local object is implementation-defined. Lcore variable
>> + *     instances may be accessed reliably by any thread.
>> + */
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +#include <stddef.h>
>> +#include <stdalign.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_config.h>
>> +#include <rte_lcore.h>
>> +
>> +/**
>> + * Given the lcore variable type, produces the type of the lcore
>> + * variable handle.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)                \
>> +       type *
>> +
>> +/**
>> + * Define a lcore variable handle.
>> + *
>> + * This macro defines a variable which is used as a handle to access
>> + * the various per-lcore id instances of a per-lcore id variable.
>> + *
>> + * The aim with this macro is to make clear at the point of
>> + * declaration that this is an lcore handle, rather than a regular
>> + * pointer.
>> + *
>> + * Add @b static as a prefix in case the lcore variable is only to be
>> + * accessed from a particular translation unit.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE(type, name)       \
>> +       RTE_LCORE_VAR_HANDLE_TYPE(type) name
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)      \
>> +       name = rte_lcore_var_alloc(size, align)
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle,
>> + * with values aligned for any type of object.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)   \
>> +       name = rte_lcore_var_alloc(size, 0)
>> +
>> +/**
>> + * Allocate space for an lcore variable of the size and alignment requirements
>> + * suggested by the handle pointer type, and initialize its handle.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC(name)                                      \
>> +       RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),           \
>> +                                      alignof(typeof(*(name))))
>> +
>> +/**
>> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
>> + * means of a \ref RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)               \
>> +       RTE_INIT(rte_lcore_var_init_ ## name)                           \
>> +       {                                                               \
>> +               RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);      \
>> +       }
>> +
>> +/**
>> + * Allocate an explicitly-sized lcore variable by means of a \ref
>> + * RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)            \
>> +       RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
>> +
>> +/**
>> + * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT(name)                                       \
>> +       RTE_INIT(rte_lcore_var_init_ ## name)                           \
>> +       {                                                               \
>> +               RTE_LCORE_VAR_ALLOC(name);                              \
>> +       }
>> +
>> +#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)              \
>> +       ((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
>> +
>> +/**
>> + * Get pointer to lcore variable instance with the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)                                \
>> +       ((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>> +
>> +/**
>> + * Get value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)                \
>> +       (*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
>> +
>> +/**
>> + * Set the value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)         \
>> +       (*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
>> +
>> +/**
>> + * Get pointer to lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
>> +
>> +/**
>> + * Get value of lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
>> +
>> +/**
>> + * Set value of lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_SET(name, value) \
>> +       RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
>> +
>> +/**
>> + * Iterate over each lcore id's value for a lcore variable.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)                         \
>> +       for (unsigned int lcore_id =                                    \
>> +                    (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);   \
>> +            lcore_id < RTE_MAX_LCORE;                                  \
>> +            lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
>> +
>> +extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
>> +
>> +/**
>> + * Allocate space in the per-lcore id buffers for a lcore variable.
>> + *
>> + * The pointer returned is only an opaque identifier of the variable. To
>> + * get an actual pointer to a particular instance of the variable use
>> + * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
>> + *
>> + * The allocation is always successful, barring a fatal exhaustion of
>> + * the per-lcore id buffer space.
>> + *
>> + * @param size
>> + *   The size (in bytes) of the variable's per-lcore id value.
>> + * @param align
>> + *   If 0, the values will be suitably aligned for any kind of type
>> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
>> + *   on a multiple of *align*, which must be a power of 2 and equal or
>> + *   less than \c RTE_CACHE_LINE_SIZE.
>> + * @return
>> + *   The id of the variable, stored in a void pointer value.
>> + */
>> +__rte_experimental
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align);
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* _RTE_LCORE_VAR_H_ */
>> diff --git a/lib/eal/version.map b/lib/eal/version.map
>> index 5e0cd47c82..e90b86115a 100644
>> --- a/lib/eal/version.map
>> +++ b/lib/eal/version.map
>> @@ -393,6 +393,10 @@ EXPERIMENTAL {
>>          # added in 23.07
>>          rte_memzone_max_get;
>>          rte_memzone_max_set;
>> +
>> +       # added in 24.03
>> +       rte_lcore_var_alloc;
>> +       rte_lcore_var;
>>   };
>>
>>   INTERNAL {
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-20  9:11           ` Bruce Richardson
  2024-02-21  9:43           ` Jerin Jacob
@ 2024-02-22  9:22           ` Morten Brørup
  2024-02-23 10:12             ` Mattias Rönnblom
  2024-02-25 15:03           ` [RFC v4 0/6] Lcore variables Mattias Rönnblom
  3 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-02-22  9:22 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Tuesday, 20 February 2024 09.49
> 
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small chunks of often-used data, which is related logically, but where
> there are performance benefits to reap from having updates being local
> to an lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' life time with that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoid excessive use of padding,
> polluting caches with unused data.
> 
> RFC v3:
>  * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>  * Update example to reflect FOREACH macro name change (in RFC v2).
> 
> RFC v2:
>  * Use alignof to derive alignment requirements. (Morten Brørup)
>  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>  * Allow user-specified alignment, but limit max to cache line size.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  config/rte_config.h                   |   1 +
>  doc/api/doxy-api-index.md             |   1 +
>  lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>  lib/eal/common/meson.build            |   1 +
>  lib/eal/include/meson.build           |   1 +
>  lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>  lib/eal/version.map                   |   4 +
>  7 files changed, 465 insertions(+)
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
> 
> diff --git a/config/rte_config.h b/config/rte_config.h
> index da265d7dd2..884482e473 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -30,6 +30,7 @@
>  /* EAL defines */
>  #define RTE_CACHE_GUARD_LINES 1
>  #define RTE_MAX_HEAPS 32
> +#define RTE_MAX_LCORE_VAR 1048576
>  #define RTE_MAX_MEMSEG_LISTS 128
>  #define RTE_MAX_MEMSEG_PER_LIST 8192
>  #define RTE_MAX_MEM_MB_PER_LIST 32768
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index a6a768bd7c..bb06bb7ca1 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>    [interrupts](@ref rte_interrupts.h),
>    [launch](@ref rte_launch.h),
>    [lcore](@ref rte_lcore.h),
> +  [lcore-varible](@ref rte_lcore_var.h),
>    [per-lcore](@ref rte_per_lcore.h),
>    [service cores](@ref rte_service.h),
>    [keepalive](@ref rte_keepalive.h),
> diff --git a/lib/eal/common/eal_common_lcore_var.c
> b/lib/eal/common/eal_common_lcore_var.c
> new file mode 100644
> index 0000000000..dfd11cbd0b
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,82 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_log.h>
> +
> +#include <rte_lcore_var.h>
> +
> +#include "eal_private.h"
> +
> +#define WARN_THRESHOLD 75

It's not an error condition, so 75 % seems like a low threshold for WARNING.
Consider increasing it to 95 %, or change the level to NOTICE.
Or both.
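
I.e., something like this (a sketch combining both suggestions; only the
threshold and the log level change relative to the quoted code):

	#define NOTICE_THRESHOLD 95

	if (new_allocated > (NOTICE_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
	    !has_warned) {
		EAL_LOG(NOTICE, "Per-lcore data usage has exceeded %d%% "
			"of the maximum capacity (%d bytes)", NOTICE_THRESHOLD,
			RTE_MAX_LCORE_VAR);
		has_warned = true;
	}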

> +
> +/*
> + * Avoid using offset zero, since it would result in a NULL-value
> + * "handle" (offset) pointer, which in principle and per the API
> + * definition shouldn't be an issue, but may confuse some tools and
> + * users.
> + */
> +#define INITIAL_OFFSET 1
> +
> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
> +
> +static uintptr_t allocated = INITIAL_OFFSET;

Please add an API to get the amount of allocated lcore variable memory.
The easy option is to make the above variable public (with a proper name, e.g. rte_lcore_var_allocated).

The total amount of lcore variable memory is already public: RTE_MAX_LCORE_VAR.
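
For illustration, a minimal sketch of what such an accessor could look
like (the function name is hypothetical, not something the patch
provides):

	/* Sketch: return the number of bytes of per-lcore id buffer
	 * space allocated so far; 'allocated' is the static counter in
	 * eal_common_lcore_var.c.
	 */
	size_t
	rte_lcore_var_allocated_get(void)
	{
		return (size_t)allocated;
	}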

> +
> +static void
> +verify_allocation(uintptr_t new_allocated)
> +{
> +	static bool has_warned;
> +
> +	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
> +
> +	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
> +	    !has_warned) {
> +		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
> +			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
> +			RTE_MAX_LCORE_VAR);
> +		has_warned = true;
> +	}
> +}
> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
> +
> +	void *offset = (void *)new_allocated;
> +
> +	new_allocated += size;
> +
> +	verify_allocation(new_allocated);
> +
> +	allocated = new_allocated;
> +
> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> +		"%"PRIuPTR"-byte alignment", size, align);
> +
> +	return offset;
> +}
> +
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +{
> +	/* Having the per-lcore buffer size aligned on cache lines
> +	 * assures as well as having the base pointer aligned on cache
> +	 * size assures that aligned offsets also translate to aligned
> +	 * pointers across all values.
> +	 */
> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> +
> +	/* '0' means asking for worst-case alignment requirements */
> +	if (align == 0)
> +		align = alignof(max_align_t);
> +
> +	RTE_ASSERT(rte_is_power_of_2(align));
> +
> +	return lcore_var_alloc(size, align);
> +}
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 22a626ba6f..d41403680b 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -18,6 +18,7 @@ sources += files(
>          'eal_common_interrupts.c',
>          'eal_common_launch.c',
>          'eal_common_lcore.c',
> +        'eal_common_lcore_var.c',
>          'eal_common_mcfg.c',
>          'eal_common_memalloc.c',
>          'eal_common_memory.c',
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..9449253e23 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -27,6 +27,7 @@ headers += files(
>          'rte_keepalive.h',
>          'rte_launch.h',
>          'rte_lcore.h',
> +        'rte_lcore_var.h',
>          'rte_lock_annotations.h',
>          'rte_malloc.h',
>          'rte_mcslock.h',
> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
> new file mode 100644
> index 0000000000..da49d48d7c
> --- /dev/null
> +++ b/lib/eal/include/rte_lcore_var.h
> @@ -0,0 +1,375 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#ifndef _RTE_LCORE_VAR_H_
> +#define _RTE_LCORE_VAR_H_
> +
> +/**
> + * @file
> + *
> + * RTE Per-lcore id variables
> + *
> + * This API provides a mechanism to create and access per-lcore id
> + * variables in a space- and cycle-efficient manner.
> + *
> + * A per-lcore id variable (or lcore variable for short) has one value
> + * for each EAL thread and registered non-EAL thread. In other words,
> + * there's one copy of its value for each and every current and future
> + * lcore id-equipped thread, with the total number of copies amounting
> + * to \c RTE_MAX_LCORE.
> + *
> + * In order to access the values of an lcore variable, a handle is
> + * used. The type of the handle is a pointer to the value's type
> + * (e.g., for a \c uint32_t lcore variable, the handle is a
> + * <code>uint32_t *</code>). A handle may be passed between modules and
> + * threads just like any pointer, but its value is not the address of
> + * any particular object, but rather just an opaque identifier, stored
> + * in a typed pointer (to inform the access macros of the type of values).
> + *
> + * @b Creation
> + *
> + * An lcore variable is created in two steps:
> + *  1. Define a lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
> + *  2. Allocate lcore variable storage and initialize the handle with
> + *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
> + *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
> + *     module initialization, but may be done at any time.
> + *
> + * An lcore variable is not tied to the owning thread's lifetime. It's
> + * available for use by any thread immediately after having been
> + * allocated, and continues to be available throughout the lifetime of
> + * the EAL.
> + *
> + * Lcore variables cannot and need not be freed.
> + *
> + * @b Access
> + *
> + * The value of any lcore variable for any lcore id may be accessed
> + * from any thread (including unregistered threads), but it should
> + * generally only be *frequently* read from or written to by the owner.
> + *
> + * Values of the same lcore variable but owned by different lcore
> + * ids *may* be frequently read or written by the owners without the
> + * risk of false sharing.
> + *
> + * An appropriate synchronization mechanism (e.g., atomics) should
> + * be employed to assure there are no data races between the owning
> + * thread and any non-owner threads accessing the same lcore variable
> + * instance.
> + *
> + * The value of the lcore variable for a particular lcore id may be
> + * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
> + * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
> + *
> + * To modify the value of an lcore variable for a particular lcore id,
> + * either access the object through the pointer retrieved by \ref
> + * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
> + * RTE_LCORE_VAR_LCORE_SET.
> + *
> + * The access macros each have a short-hand which may be used by an EAL
> + * thread or registered non-EAL thread to access the lcore variable
> + * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
> + * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
> + *
> + * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
> + * pointer with the same type as the value, it may not be directly
> + * dereferenced and must be treated as an opaque identifier. The
> + * *identifier* value is common across all lcore ids.
> + *
> + * @b Storage
> + *
> + * An lcore variable's values may be of a primitive type like \c int,
> + * but would more typically be a \c struct. An application may choose
> + * to define an lcore variable which it then never goes on to
> + * allocate.
> + *
> + * The lcore variable handle introduces a per-variable (not
> + * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
> + * there are some memory footprint gains to be made by organizing all
> + * per-lcore id data for a particular module as one lcore variable
> + * (e.g., as a struct).
> + *
> + * The sum of all lcore variables, plus any padding required, must be
> + * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
> + * violation of this maximum results in the process being terminated.
> + *
> + * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
> + * same order of magnitude in size as a thread stack.
> + *
> + * The lcore variable storage buffers are kept in the BSS section in
> + * the resulting binary, where data generally isn't mapped in until
> + * it's accessed. This means that unused portions of the lcore
> + * variable storage area will not occupy any physical memory (with a
> + * granularity of the memory page size [usually 4 kB]).
> + *
> + * Lcore variables should generally *not* be \ref __rte_cache_aligned
> + * and need *not* include a \ref RTE_CACHE_GUARD field, since these
> + * constructs are designed to avoid false sharing. In the case of an
> + * lcore variable instance, all nearby data structures should
> + * almost-always be written to by a single thread (the lcore variable
> + * owner). Adding padding will increase the effective memory working
> + * set size, and potentially reduce performance.
> + *
> + * @b Example
> + *
> + * Below is an example of the use of an lcore variable:
> + *
> + * \code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + * };
> + *
> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
> + *
> + * long foo_get_a_plus_b(void)
> + * {
> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
> + *
> + *         return state->a + state->b;
> + * }
> + *
> + * RTE_INIT(rte_foo_init)
> + * {
> + *         unsigned int lcore_id;

This variable is part of RTE_LCORE_VAR_FOREACH_VALUE(), and can be removed from here.

> + *
> + *         RTE_LCORE_VAR_ALLOC(foo_state);

Typo: foo_state -> lcore_states

> + *
> + *         struct foo_lcore_state *state;
> + *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_states) {

Typo:
RTE_LCORE_VAR_FOREACH_VALUE(lcore_states)
->
RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states)
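
With all three fixes applied (including dropping the now-unused lcore_id
declaration, per the comment further up), the example's constructor
would read:

	RTE_INIT(rte_foo_init)
	{
		RTE_LCORE_VAR_ALLOC(lcore_states);

		struct foo_lcore_state *state;
		RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
			/* initialize 'state' */
		}

		/* (other initialization) */
	}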

> + *                 (initialize 'state')
> + *         }
> + *
> + *         (other initialization)
> + * }
> + * \endcode
> + *
> + *
> + * @b Alternatives
> + *
> + * Lcore variables are designed to replace a pattern exemplified below:
> + * \code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + *         RTE_CACHE_GUARD;
> + * } __rte_cache_aligned;
> + *
> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
> + * \endcode
> + *
> + * This scheme is simple and effective, but has one drawback: the data
> + * is organized so that objects related to all lcores for a particular
> + * module are kept close in memory. At a bare minimum, this forces the
> + * use of cache-line alignment to avoid false sharing. With CPU
> + * hardware prefetching and memory loads resulting from speculative
> + * execution (functions which seemingly are getting more eager faster
> + * than they are getting more intelligent), one or more "guard" cache
> + * lines may be required to separate one lcore's data from another's.
> + *
> + * Lcore variables have the upside of working with, not against, the
> + * CPU's assumptions, and for example next-line prefetchers may well
> + * work the way their designers intended (i.e., to the benefit, not
> + * detriment, of system performance).
> + *
> + * Another alternative to \ref rte_lcore_var.h is the \ref
> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
> + * e.g., GCC __thread or C11 _Thread_local). The main differences
> + * between using the various forms of TLS (e.g., \ref
> + * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
> + * variables are:
> + *
> + *   * The lifetime of a thread-local variable instance follows that
> + *     of the owning thread. The data cannot be accessed before the
> + *     thread has been created, nor after it has exited. One effect of
> + *     this is that thread-local variables must be initialized in a
> + *     "lazy" manner (e.g., at the point of thread creation). Lcore
> + *     variables may be accessed immediately after having been
> + *     allocated (which is usually prior to any thread beyond the
> + *     main thread running).
> + *   * A thread-local variable is duplicated across all threads in the
> + *     process, including unregistered non-EAL threads (i.e.,
> + *     "regular" threads). For DPDK applications heavily relying on
> + *     multi-threading (in conjunction with DPDK's "one thread per core"
> + *     pattern), either by having many concurrent threads or
> + *     creating/destroying threads at a high rate, an excessive use of
> + *     thread-local variables may cause inefficiencies (e.g.,
> + *     increased thread creation overhead due to thread-local storage
> + *     initialization or increased total RAM footprint usage). Lcore
> + *     variables *only* exist for threads with an lcore id, and thus
> + *     not for such "regular" threads.
> + *   * Whether data in thread-local storage may be shared between
> + *     threads (i.e., whether a pointer to a thread-local variable can
> + *     be passed to and successfully dereferenced by a non-owning
> + *     thread) depends on the details of the TLS implementation. With
> + *     GCC __thread and
> + *     GCC _Thread_local, such data sharing is supported. In the C11
> + *     standard, the result of accessing another thread's
> + *     _Thread_local object is implementation-defined. Lcore variable
> + *     instances may be accessed reliably by any thread.
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stddef.h>
> +#include <stdalign.h>
> +
> +#include <rte_common.h>
> +#include <rte_config.h>
> +#include <rte_lcore.h>
> +
> +/**
> + * Given the lcore variable type, produces the type of the lcore
> + * variable handle.
> + */
> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
> +	type *

This macro seems superfluous.
In RTE_LCORE_VAR_HANDLE(type, name) just use:
 type * name
Are there other use cases for it?
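
I.e. (sketch):

	#define RTE_LCORE_VAR_HANDLE(type, name) \
		type *name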

> +
> +/**
> + * Define a lcore variable handle.
> + *
> + * This macro defines a variable which is used as a handle to access
> + * the various per-lcore id instances of a per-lcore id variable.
> + *
> + * The aim with this macro is to make clear at the point of
> + * declaration that this is an lcore handle, rather than a regular
> + * pointer.
> + *
> + * Add @b static as a prefix in case the lcore variable is only to be
> + * accessed from a particular translation unit.
> + */
> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name

Thinking out loud here...
Consider if this name should be more similar to RTE_DEFINE_PER_LCORE(type, name), e.g. RTE_DEFINE_LCORE_VAR(type, name) or RTE_LCORE_VAR_DEFINE(type, name).
Using the common prefix RTE_LCORE_VAR is preferable.
Using the term "handle" indicates that it is opaque and needs to be allocated by an allocation function.
On the other hand, the "handle" is not unique per thread, so it's not really a "handle".

> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)	\
> +	name = rte_lcore_var_alloc(size, align)
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle,
> + * with values aligned for any type of object.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
> +	name = rte_lcore_var_alloc(size, 0)
> +
> +/**
> + * Allocate space for an lcore variable of the size and alignment
> + * requirements suggested by the handle pointer type, and initialize
> + * its handle.
> + */
> +#define RTE_LCORE_VAR_ALLOC(name)					\
> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),		\
> +				       alignof(typeof(*(name))))
> +
> +/**
> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
> + * means of a \ref RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
> +	}
> +
> +/**
> + * Allocate an explicitly-sized lcore variable by means of a \ref
> + * RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
> +
> +/**
> + * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
> + */
> +#define RTE_LCORE_VAR_INIT(name)					\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC(name);				\
> +	}
> +
> +#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
> +	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))

This macro also seems superfluous.
Doesn't RTE_LCORE_VAR_LCORE_PTR() suffice?

> +
> +/**
> + * Get pointer to lcore variable instance with the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
> +	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))

This uses type casting.
I wonder if additional build-time type checking would be possible...
Nice to have: The compiler should fail if name is not a pointer, but a struct, a uint64_t, or even an uintptr_t.
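
For what it's worth, a partial, GCC-leaning sketch of such a check (the
macro name is hypothetical): dereferencing the handle inside a sizeof
expression makes compilation fail for struct and integer handles,
without evaluating anything at run time. It would not, however, catch a
pointer of the wrong type.

	/* Compile-time check that 'handle' is of pointer (or array)
	 * type; sizeof(*(handle)) is a constraint violation otherwise.
	 */
	#define __RTE_LCORE_VAR_CHECK_HANDLE(handle) \
		((void)sizeof(*(handle)))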

> +
> +/**
> + * Get value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))

The four accessor functions, RTE_LCORE_VAR[_LCORE]_GET/SET(), seem superfluous.
They make the API seem more complex than just using RTE_LCORE_VAR[_LCORE]_PTR() for access.

> +
> +/**
> + * Set the value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
> +
> +/**
> + * Get pointer to lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
> +
> +/**
> + * Get value of lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
> +
> +/**
> + * Set value of lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_SET(name, value) \
> +	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
> +
> +/**
> + * Iterate over each lcore id's value for a lcore variable.
> + */
> +#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)				\
> +	for (unsigned int lcore_id =					\
> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
> +	     lcore_id < RTE_MAX_LCORE;					\
> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))

RTE_LCORE_VAR_FOREACH_PTR(ptr, name) would be an even better name, considering that "var" is really a pointer.

I also wonder about build-time type checking here...
Nice to have: The compiler should fail if "ptr" is not a pointer.

> +
> +extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
> +
> +/**
> + * Allocate space in the per-lcore id buffers for a lcore variable.
> + *
> + * The pointer returned is only an opaque identifier of the variable. To
> + * get an actual pointer to a particular instance of the variable use
> + * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
> + *
> + * The allocation is always successful, barring a fatal exhaustion of
> + * the per-lcore id buffer space.
> + *
> + * @param size
> + *   The size (in bytes) of the variable's per-lcore id value.
> + * @param align
> + *   If 0, the values will be suitably aligned for any kind of type
> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
> + *   on a multiple of *align*, which must be a power of 2 and equal or
> + *   less than \c RTE_CACHE_LINE_SIZE.
> + * @return
> + *   The id of the variable, stored in a void pointer value.
> + */
> +__rte_experimental
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_LCORE_VAR_H_ */
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 5e0cd47c82..e90b86115a 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -393,6 +393,10 @@ EXPERIMENTAL {
>  	# added in 23.07
>  	rte_memzone_max_get;
>  	rte_memzone_max_set;
> +
> +	# added in 24.03
> +	rte_lcore_var_alloc;
> +	rte_lcore_var;
>  };
> 
>  INTERNAL {
> --
> 2.34.1

Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v3 5/6] service: keep per-lcore state in lcore variable
  2024-02-20  8:49         ` [RFC v3 5/6] service: " Mattias Rönnblom
@ 2024-02-22  9:42           ` Morten Brørup
  2024-02-23 10:19             ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-02-22  9:42 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Tuesday, 20 February 2024 09.49
> 
> Replace static array of cache-aligned structs with an lcore variable,
> to slightly benefit code simplicity and performance.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---


> @@ -486,8 +489,7 @@ service_runner_func(void *arg)
>  {
>  	RTE_SET_USED(arg);
>  	uint8_t i;
> -	const int lcore = rte_lcore_id();
> -	struct core_state *cs = &lcore_states[lcore];
> +	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);

Typo: TAB -> SPACE.

> 
>  	rte_atomic_store_explicit(&cs->thread_active, 1,
> rte_memory_order_seq_cst);
> 
> @@ -533,13 +535,16 @@ service_runner_func(void *arg)
>  int32_t
>  rte_service_lcore_may_be_active(uint32_t lcore)
>  {
> -	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
> +	struct core_state *cs =
> +		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
> +
> +	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
>  		return -EINVAL;

This comment is mostly related to patch 1 in the series...

You are setting cs = RTE_LCORE_VAR_LCORE_PTR(lcore, ...) before validating that lcore < RTE_MAX_LCORE. I wondered if that potentially was an overrun bug.

It is obvious when looking at the RTE_LCORE_VAR_LCORE_PTR() macro implementation, but perhaps its description could mention that it is safe to use with an "invalid" lcore_id, as long as the resulting pointer is not dereferenced.
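
For reference, the quoted pattern relies on || short-circuiting; the
pointer is computed for any lcore value, but dereferenced only after the
range check:

	struct core_state *cs =
		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);

	/* 'cs' may refer outside rte_lcore_var for an out-of-range
	 * 'lcore', but it is not dereferenced in that case.
	 */
	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
		return -EINVAL;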


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v3 1/6] eal: add static per-lcore memory allocation facility
  2024-02-22  9:22           ` Morten Brørup
@ 2024-02-23 10:12             ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-23 10:12 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-22 10:22, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Tuesday, 20 February 2024 09.49
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is related logically, but where
>> there are performance benefits to reap from having updates being local
>> to an lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' life time with that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality provided by
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoid excessive use of padding,
>> polluting caches with unused data.
>>
>> RFC v3:
>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>
>> RFC v2:
>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>   * Allow user-specified alignment, but limit max to cache line size.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   config/rte_config.h                   |   1 +
>>   doc/api/doxy-api-index.md             |   1 +
>>   lib/eal/common/eal_common_lcore_var.c |  82 ++++++
>>   lib/eal/common/meson.build            |   1 +
>>   lib/eal/include/meson.build           |   1 +
>>   lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
>>   lib/eal/version.map                   |   4 +
>>   7 files changed, 465 insertions(+)
>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index da265d7dd2..884482e473 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -30,6 +30,7 @@
>>   /* EAL defines */
>>   #define RTE_CACHE_GUARD_LINES 1
>>   #define RTE_MAX_HEAPS 32
>> +#define RTE_MAX_LCORE_VAR 1048576
>>   #define RTE_MAX_MEMSEG_LISTS 128
>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index a6a768bd7c..bb06bb7ca1 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -98,6 +98,7 @@ The public API headers are grouped by topics:
>>     [interrupts](@ref rte_interrupts.h),
>>     [launch](@ref rte_launch.h),
>>     [lcore](@ref rte_lcore.h),
>> +  [lcore-varible](@ref rte_lcore_var.h),
>>     [per-lcore](@ref rte_per_lcore.h),
>>     [service cores](@ref rte_service.h),
>>     [keepalive](@ref rte_keepalive.h),
>> diff --git a/lib/eal/common/eal_common_lcore_var.c
>> b/lib/eal/common/eal_common_lcore_var.c
>> new file mode 100644
>> index 0000000000..dfd11cbd0b
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,82 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define WARN_THRESHOLD 75
> 
> It's not an error condition, so 75 % seems like a low threshold for WARNING.
> Consider increasing it to 95 %, or change the level to NOTICE.
> Or both.
> 

I'll make an attempt at a variant which uses the libc heap instead of 
BSS, and does so dynamically. Then one need not worry about a fixed-size 
upper bound, barring heap allocation failures (which you are best off 
making fatal in the lcore variables case).

The glibc heap is available early (as early as the earliest RTE_INIT()).

You also avoid the headache of thinking about what happens if indeed all 
of the rte_lcore_var array is backed by actual memory. That could be due 
to mlockall(), huge page use for BSS, or systems where BSS is not 
on-demand mapped. I have no idea how paging works on Windows NT, for 
example.
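
A rough sketch of what such a heap-backed allocator could look like
(the names and the chunking strategy are speculative, not the actual
follow-up patch; <stdlib.h> provides aligned_alloc()):

	static void *lcore_buffer; /* current chunk, covering all lcore ids */
	static size_t offset = RTE_MAX_LCORE_VAR; /* force initial allocation */

	static void *
	lcore_var_alloc(size_t size, size_t align)
	{
		void *handle;

		offset = RTE_ALIGN_CEIL(offset, align);

		if (offset + size > RTE_MAX_LCORE_VAR) {
			/* One heap allocation serving all lcore ids; the
			 * handle points into the lcore id 0 part, and the
			 * access macros would add lcore_id *
			 * RTE_MAX_LCORE_VAR to it.
			 */
			lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
						     RTE_MAX_LCORE *
						     RTE_MAX_LCORE_VAR);
			RTE_VERIFY(lcore_buffer != NULL);
			offset = 0;
		}

		handle = RTE_PTR_ADD(lcore_buffer, offset);
		offset += size;

		return handle;
	}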

>> +
>> +/*
>> + * Avoid using offset zero, since it would result in a NULL-value
>> + * "handle" (offset) pointer, which in principle and per the API
>> + * definition shouldn't be an issue, but may confuse some tools and
>> + * users.
>> + */
>> +#define INITIAL_OFFSET 1
>> +
>> +char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR] __rte_cache_aligned;
>> +
>> +static uintptr_t allocated = INITIAL_OFFSET;
> 
> Please add an API to get the amount of allocated lcore variable memory.
> The easy option is to make the above variable public (with a proper name, e.g. rte_lcore_var_allocated).
> 
> The total amount of lcore variable memory is already public: RTE_MAX_LCORE_VAR.
> 

Makes sense with the RFC v3 design.

If you eliminate the fixed upper bound and use the heap, there shouldn't 
be any particular need to track lcore variable memory use separately 
from other heap users.

>> +
>> +static void
>> +verify_allocation(uintptr_t new_allocated)
>> +{
>> +	static bool has_warned;
>> +
>> +	RTE_VERIFY(new_allocated < RTE_MAX_LCORE_VAR);
>> +
>> +	if (new_allocated > (WARN_THRESHOLD * RTE_MAX_LCORE_VAR) / 100 &&
>> +	    !has_warned) {
>> +		EAL_LOG(WARNING, "Per-lcore data usage has exceeded %d%% "
>> +			"of the maximum capacity (%d bytes)", WARN_THRESHOLD,
>> +			RTE_MAX_LCORE_VAR);
>> +		has_warned = true;
>> +	}
>> +}
>> +
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	uintptr_t new_allocated = RTE_ALIGN_CEIL(allocated, align);
>> +
>> +	void *offset = (void *)new_allocated;
>> +
>> +	new_allocated += size;
>> +
>> +	verify_allocation(new_allocated);
>> +
>> +	allocated = new_allocated;
>> +
>> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>> +		"%"PRIuPTR"-byte alignment", size, align);
>> +
>> +	return offset;
>> +}
>> +
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	/* Having the per-lcore buffer size aligned on cache lines
>> +	 * assures as well as having the base pointer aligned on cache
>> +	 * size assures that aligned offsets also translate to aligned
>> +	 * pointers across all values.
>> +	 */
>> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
>> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
>> +
>> +	/* '0' means asking for worst-case alignment requirements */
>> +	if (align == 0)
>> +		align = alignof(max_align_t);
>> +
>> +	RTE_ASSERT(rte_is_power_of_2(align));
>> +
>> +	return lcore_var_alloc(size, align);
>> +}
>> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
>> index 22a626ba6f..d41403680b 100644
>> --- a/lib/eal/common/meson.build
>> +++ b/lib/eal/common/meson.build
>> @@ -18,6 +18,7 @@ sources += files(
>>           'eal_common_interrupts.c',
>>           'eal_common_launch.c',
>>           'eal_common_lcore.c',
>> +        'eal_common_lcore_var.c',
>>           'eal_common_mcfg.c',
>>           'eal_common_memalloc.c',
>>           'eal_common_memory.c',
>> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
>> index e94b056d46..9449253e23 100644
>> --- a/lib/eal/include/meson.build
>> +++ b/lib/eal/include/meson.build
>> @@ -27,6 +27,7 @@ headers += files(
>>           'rte_keepalive.h',
>>           'rte_launch.h',
>>           'rte_lcore.h',
>> +        'rte_lcore_var.h',
>>           'rte_lock_annotations.h',
>>           'rte_malloc.h',
>>           'rte_mcslock.h',
>> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
>> new file mode 100644
>> index 0000000000..da49d48d7c
>> --- /dev/null
>> +++ b/lib/eal/include/rte_lcore_var.h
>> @@ -0,0 +1,375 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#ifndef _RTE_LCORE_VAR_H_
>> +#define _RTE_LCORE_VAR_H_
>> +
>> +/**
>> + * @file
>> + *
>> + * RTE Per-lcore id variables
>> + *
>> + * This API provides a mechanism to create and access per-lcore id
>> + * variables in a space- and cycle-efficient manner.
>> + *
>> + * A per-lcore id variable (or lcore variable for short) has one value
>> + * for each EAL thread and registered non-EAL thread. In other words,
>> + * there's one copy of its value for each and every current and future
>> + * lcore id-equipped thread, with the total number of copies amounting
>> + * to \c RTE_MAX_LCORE.
>> + *
>> + * In order to access the values of an lcore variable, a handle is
>> + * used. The type of the handle is a pointer to the value's type
>> + * (e.g., for a \c uint32_t lcore variable, the handle is a
>> + * <code>uint32_t *</code>). A handle may be passed between modules and
>> + * threads just like any pointer, but its value is not the address of
>> + * any particular object, but rather just an opaque identifier, stored
>> + * in a typed pointer (to inform the access macros of the type of values).
>> + *
>> + * @b Creation
>> + *
>> + * An lcore variable is created in two steps:
>> + *  1. Define a lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
>> + *  2. Allocate lcore variable storage and initialize the handle with
>> + *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
>> + *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
>> + *     module initialization, but may be done at any time.
>> + *
>> + * An lcore variable is not tied to the owning thread's lifetime. It's
>> + * available for use by any thread immediately after having been
>> + * allocated, and continues to be available throughout the lifetime of
>> + * the EAL.
>> + *
>> + * Lcore variables cannot and need not be freed.
>> + *
>> + * @b Access
>> + *
>> + * The value of any lcore variable for any lcore id may be accessed
>> + * from any thread (including unregistered threads), but it should
>> + * generally only be *frequently* read from or written to by the owner.
>> + *
>> + * Values of the same lcore variable but owned by different lcore
>> + * ids *may* be frequently read or written by the owners without the
>> + * risk of false sharing.
>> + *
>> + * An appropriate synchronization mechanism (e.g., atomics) should
>> + * be employed to assure there are no data races between the owning
>> + * thread and any non-owner threads accessing the same lcore variable
>> + * instance.
>> + *
>> + * The value of the lcore variable for a particular lcore id may be
>> + * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
>> + * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
>> + *
>> + * To modify the value of an lcore variable for a particular lcore id,
>> + * either access the object through the pointer retrieved by \ref
>> + * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
>> + * RTE_LCORE_VAR_LCORE_SET.
>> + *
>> + * The access macros each have a short-hand which may be used by an EAL
>> + * thread or registered non-EAL thread to access the lcore variable
>> + * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
>> + * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
>> + *
>> + * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
>> + * pointer with the same type as the value, it may not be directly
>> + * dereferenced and must be treated as an opaque identifier. The
>> + * *identifier* value is common across all lcore ids.
>> + *
>> + * @b Storage
>> + *
>> + * An lcore variable's values may be of a primitive type like \c int,
>> + * but would more typically be a \c struct. An application may choose
>> + * to define an lcore variable which it then never goes on to
>> + * allocate.
>> + *
>> + * The lcore variable handle introduces a per-variable (not
>> + * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
>> + * there are some memory footprint gains to be made by organizing all
>> + * per-lcore id data for a particular module as one lcore variable
>> + * (e.g., as a struct).
>> + *
>> + * The sum of all lcore variables, plus any padding required, must be
>> + * less than the DPDK build-time constant \c RTE_MAX_LCORE_VAR. A
>> + * violation of this maximum results in the process being terminated.
>> + *
>> + * It's reasonable to expect that \c RTE_MAX_LCORE_VAR is on the
>> + * same order of magnitude in size as a thread stack.
>> + *
>> + * The lcore variable storage buffers are kept in the BSS section in
>> + * the resulting binary, where data generally isn't mapped in until
>> + * it's accessed. This means that unused portions of the lcore
>> + * variable storage area will not occupy any physical memory (with a
>> + * granularity of the memory page size [usually 4 kB]).
>> + *
>> + * Lcore variables should generally *not* be \ref __rte_cache_aligned
>> + * and need *not* include a \ref RTE_CACHE_GUARD field, since these
>> + * constructs are designed to avoid false sharing. In the case of an
>> + * lcore variable instance, all nearby data structures should
>> + * almost-always be written to by a single thread (the lcore variable
>> + * owner). Adding padding will increase the effective memory working
>> + * set size, and potentially reduce performance.
>> + *
>> + * @b Example
>> + *
>> + * Below is an example of the use of an lcore variable:
>> + *
>> + * \code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + * };
>> + *
>> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
>> + *
>> + * long foo_get_a_plus_b(void)
>> + * {
>> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
>> + *
>> + *         return state->a + state->b;
>> + * }
>> + *
>> + * RTE_INIT(rte_foo_init)
>> + * {
>> + *         unsigned int lcore_id;
> 
> This variable is part of RTE_LCORE_VAR_FOREACH_VALUE(), and can be removed from here.
> 
>> + *
>> + *         RTE_LCORE_VAR_ALLOC(foo_state);
> 
> Typo: foo_state -> lcore_states
> 

Will fix.

>> + *
>> + *         struct foo_lcore_state *state;
>> + *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_states) {
> 
> Typo:
> RTE_LCORE_VAR_FOREACH_VALUE(lcore_states)
> ->
> RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states)
> 

Will fix.

>> + *                 (initialize 'state')
>> + *         }
>> + *
>> + *         (other initialization)
>> + * }
>> + * \endcode
>> + *
>> + *
>> + * @b Alternatives
>> + *
>> + * Lcore variables are designed to replace a pattern exemplified below:
>> + * \code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + *         RTE_CACHE_GUARD;
>> + * } __rte_cache_aligned;
>> + *
>> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
>> + * \endcode
>> + *
>> + * This scheme is simple and effective, but has one drawback: the data
>> + * is organized so that objects related to all lcores for a particular
>> + * module are kept close in memory. At a bare minimum, this forces the
>> + * use of cache-line alignment to avoid false sharing. With CPU
>> + * hardware prefetching and memory loads resulting from speculative
>> + * execution (functions which seemingly are getting more eager faster
>> + * than they are getting more intelligent), one or more "guard" cache
>> + * lines may be required to separate one lcore's data from another's.
>> + *
>> + * Lcore variables have the upside of working with, not against, the
>> + * CPU's assumptions, and for example next-line prefetchers may well
>> + * work the way their designers intended (i.e., to the benefit, not
>> + * detriment, of system performance).
>> + *
>> + * Another alternative to \ref rte_lcore_var.h is the \ref
>> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
>> + * e.g., GCC __thread or C11 _Thread_local). The main differences
>> + * between using the various forms of TLS (e.g., \ref
>> + * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
>> + * variables are:
>> + *
>> + *   * The lifetime of a thread-local variable instance follows that
>> + *     of the owning thread. The data cannot be accessed before the
>> + *     thread has been created, nor after it has exited. One effect of
>> + *     this is that thread-local variables must be initialized in a
>> + *     "lazy" manner (e.g., at the point of thread creation). Lcore
>> + *     variables may be accessed immediately after having been
>> + *     allocated (which is usually prior to any thread beyond the
>> + *     main thread running).
>> + *   * A thread-local variable is duplicated across all threads in the
>> + *     process, including unregistered non-EAL threads (i.e.,
>> + *     "regular" threads). For DPDK applications heavily relying on
>> + *     multi-threading (in conjunction with DPDK's "one thread per core"
>> + *     pattern), either by having many concurrent threads or
>> + *     creating/destroying threads at a high rate, an excessive use of
>> + *     thread-local variables may cause inefficiencies (e.g.,
>> + *     increased thread creation overhead due to thread-local storage
>> + *     initialization or increased total RAM footprint usage). Lcore
>> + *     variables *only* exist for threads with an lcore id, and thus
>> + *     not for such "regular" threads.
>> + *   * Whether data in thread-local storage may be shared between
>> + *     threads (i.e., whether a pointer to a thread-local variable can
>> + *     be passed to and successfully dereferenced by a non-owning
>> + *     thread) depends on the details of the TLS implementation. With
>> + *     GCC __thread and
>> + *     GCC _Thread_local, such data sharing is supported. In the C11
>> + *     standard, the result of accessing another thread's
>> + *     _Thread_local object is implementation-defined. Lcore variable
>> + *     instances may be accessed reliably by any thread.
>> + */
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +#include <stddef.h>
>> +#include <stdalign.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_config.h>
>> +#include <rte_lcore.h>
>> +
>> +/**
>> + * Given the lcore variable type, produces the type of the lcore
>> + * variable handle.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
>> +	type *
> 
> This macro seems superfluous.
> In RTE_LCORE_VAR_HANDLE(type, name) just use:
>   type * name
> Are there other use cases for it?
> 

It's just a marker, like RTE_LCORE_VAR_HANDLE(), to indicate this is not 
your average pointer type.

It's not obvious these marker macros make things more clear. One could 
just say in the API docs that lcore handles are opaque pointers to the 
lcore variable's type, and make clear they may only be dereferenced 
through the provided macros.

>> +
>> +/**
>> + * Define a lcore variable handle.
>> + *
>> + * This macro defines a variable which is used as a handle to access
>> + * the various per-lcore id instances of a per-lcore id variable.
>> + *
>> + * The aim with this macro is to make clear at the point of
>> + * declaration that this is an lcore handle, rather than a regular
>> + * pointer.
>> + *
>> + * Add @b static as a prefix in case the lcore variable is only to be
>> + * accessed from a particular translation unit.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
>> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> 
> Thinking out loud here...
> Consider if this name should be more similar with RTE_DEFINE_PER_LCORE(type, name), e.g. RTE_DEFINE_LCORE_VAR(type, name) or RTE_LCORE_VAR_DEFINE(type, name).
> Using the common prefix RTE_LCORE_VAR is preferable.
> Using the term "handle" indicates that it is opaque and needs to be allocated by an allocation function.
> On the other hand, the "handle" is not unique per thread, so it's nor really a "handle".
> 

It's a handle to a variable, not a handle to a particular instance of 
its values.
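
For example (a sketch; "my_state" is a hypothetical, already-allocated 
handle):

struct my_state *v0 = RTE_LCORE_VAR_LCORE_PTR(0, my_state);
struct my_state *v1 = RTE_LCORE_VAR_LCORE_PTR(1, my_state);

The handle value is the same in both cases, but two distinct value 
instances are produced (i.e., v0 != v1).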

>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align)	\
>> +	name = rte_lcore_var_alloc(size, align)
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle,
>> + * with values aligned for any type of object.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE(name, size)	\
>> +	name = rte_lcore_var_alloc(size, 0)
>> +
>> +/**
>> + * Allocate space for an lcore variable of the size and alignment
>> + * requirements suggested by the handle pointer type, and initialize
>> + * its handle.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC(name)					\
>> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, sizeof(*(name)),		\
>> +				       alignof(typeof(*(name))))
>> +
>> +/**
>> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
>> + * means of a \ref RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
>> +	}
>> +
>> +/**
>> + * Allocate an explicitly-sized lcore variable by means of a \ref
>> + * RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
>> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
>> +
>> +/**
>> + * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
>> + */
>> +#define RTE_LCORE_VAR_INIT(name)					\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC(name);				\
>> +	}
>> +
>> +#define __RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)		\
>> +	((void *)(&rte_lcore_var[lcore_id][(uintptr_t)(name)]))
> 
> This macro also seems superfluous.
> Doesn't RTE_LCORE_VAR_LCORE_PTR() suffice?
> 

It's just functional decomposition (but for macros), to make the whole 
thing a little more readable.

Maybe I should change "name" to "handle" in this and other instances 
(e.g., RTE_LCORE_VAR_LCORE_PTR).

>> +
>> +/**
>> + * Get pointer to lcore variable instance with the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)				\
>> +	((typeof(name))__RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> 
> This uses type casting.
> I wonder if additional build-time type checking would be possible...
> Nice to have: The compiler should fail if name is not a pointer, but a struct or an uint64_t, or even an uintptr_t.
> 
There is no way to compare the type of the lcore variable (at the point 
of declaration) with the type of the handle pointer at the point of 
handle "dereferencing" (which essentially is what this macro does).

You can't cast a struct to a pointer. You could assure it's a pointer by 
replacing the __RTE_LCORE_VAR_LCORE_PTR() with

static inline void *
__rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
{
	return (void *)&rte_lcore_var[lcore_id][(uintptr_t)handle];
}

(Bad practice to use a macro when a function can do the job anyway.)

Maybe this function shouldn't even have the "__" prefix. There could 
well be valid use cases where you want void *-typed access to an lcore 
variable value.
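
For example (a sketch; the module and handle names are made up):

void *
my_module_lcore_state(unsigned int lcore_id)
{
	return __rte_lcore_var_lcore_ptr(lcore_id, my_module_state);
}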

I'll use a function in the next RFC version.

>> +
>> +/**
>> + * Get value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, name)		\
>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)))
> 
> The four accessor functions, RTE_LCORE_VAR[_LCORE]_GET/SET(), seem superfluous.
> They make the API seem more complex than just using RTE_LCORE_VAR[_LCORE]_PTR() for access.
> 

They are (somewhat) useful when the value is a primitive type.

RTE_LCORE_VAR_SET(my_int, 17);

versus

*RTE_LCORE_VAR_PTR(my_int) = 17;

The former is slightly more readable, imo, but I agree with you that these 
macros do clutter up the API.

>> +
>> +/**
>> + * Set the value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, name, value)		\
>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, name)) = (value))
>> +
>> +/**
>> + * Get pointer to lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_PTR(name) RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), name)
>> +
>> +/**
>> + * Get value of lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_GET(name) RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), name)
>> +
>> +/**
>> + * Set value of lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_SET(name, value) \
>> +	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), name, value)
>> +
>> +/**
>> + * Iterate over each lcore id's value for a lcore variable.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH_VALUE(var, name)				\
>> +	for (unsigned int lcore_id =					\
>> +		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, name)), 0);	\
>> +	     lcore_id < RTE_MAX_LCORE;					\
>> +	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, name))
> 
> RTE_LCORE_VAR_FOREACH_PTR(ptr, name) would be an even better name; considering that "var" is really a pointer.
> 

No, it's for each value, referenced via the pointer.

RTE_LCORE_VAR_FOREACH_VALUE_PTR() is too long.

I'll change "var" -> "ptr".
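
Typical use would then look like (a sketch, with a hypothetical 
int-typed handle):

int *ptr;

RTE_LCORE_VAR_FOREACH_VALUE(ptr, my_handle)
	*ptr = 0;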

> I also wonder about build-time type checking here...
> Nice to have: The compiler should fail if "ptr" is not a pointer.
> 

I agree.
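
Something like the below might do it (an untested sketch; it relies on 
unary '*' requiring a pointer operand):

#define __RTE_LCORE_VAR_CHECK_PTR(ptr)		\
	((void)sizeof(*(ptr)))

Evaluating sizeof(*(ptr)) in an unevaluated context should make the 
compiler reject a struct- or integer-typed "ptr". A void pointer would 
still slip through (or merely produce a warning, depending on the 
compiler and flags).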

>> +
>> +extern char rte_lcore_var[RTE_MAX_LCORE][RTE_MAX_LCORE_VAR];
>> +
>> +/**
>> + * Allocate space in the per-lcore id buffers for a lcore variable.
>> + *
>> + * The pointer returned is only an opaque identifier of the variable. To
>> + * get an actual pointer to a particular instance of the variable use
>> + * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
>> + *
>> + * The allocation is always successful, barring a fatal exhaustion of
>> + * the per-lcore id buffer space.
>> + *
>> + * @param size
>> + *   The size (in bytes) of the variable's per-lcore id value.
>> + * @param align
>> + *   If 0, the values will be suitably aligned for any kind of type
>> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
>> + *   on a multiple of *align*, which must be a power of 2 and equal to or
>> + *   less than \c RTE_CACHE_LINE_SIZE.
>> + * @return
>> + *   The id of the variable, stored in a void pointer value.
>> + */
>> +__rte_experimental
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align);
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* _RTE_LCORE_VAR_H_ */
>> diff --git a/lib/eal/version.map b/lib/eal/version.map
>> index 5e0cd47c82..e90b86115a 100644
>> --- a/lib/eal/version.map
>> +++ b/lib/eal/version.map
>> @@ -393,6 +393,10 @@ EXPERIMENTAL {
>>   	# added in 23.07
>>   	rte_memzone_max_get;
>>   	rte_memzone_max_set;
>> +
>> +	# added in 24.03
>> +	rte_lcore_var_alloc;
>> +	rte_lcore_var;
>>   };
>>
>>   INTERNAL {
>> --
>> 2.34.1
> 
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v3 5/6] service: keep per-lcore state in lcore variable
  2024-02-22  9:42           ` Morten Brørup
@ 2024-02-23 10:19             ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-23 10:19 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-22 10:42, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Tuesday, 20 February 2024 09.49
>>
>> Replace static array of cache-aligned structs with an lcore variable,
>> to slightly benefit code simplicity and performance.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
> 
> 
>> @@ -486,8 +489,7 @@ service_runner_func(void *arg)
>>   {
>>   	RTE_SET_USED(arg);
>>   	uint8_t i;
>> -	const int lcore = rte_lcore_id();
>> -	struct core_state *cs = &lcore_states[lcore];
>> +	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
> 
> Typo: TAB -> SPACE.
> 

Will fix.

>>
>>   	rte_atomic_store_explicit(&cs->thread_active, 1,
>> rte_memory_order_seq_cst);
>>
>> @@ -533,13 +535,16 @@ service_runner_func(void *arg)
>>   int32_t
>>   rte_service_lcore_may_be_active(uint32_t lcore)
>>   {
>> -	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
>> +	struct core_state *cs =
>> +		RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
>> +
>> +	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
>>   		return -EINVAL;
> 
> This comment is mostly related to patch 1 in the series...
> 
> You are setting cs = RTE_LCORE_VAR_LCORE_PTR(lcore, ...) before validating that lcore < RTE_MAX_LCORE. I wondered if that potentially was an overrun bug.
> 
> It is obvious when looking at the RTE_LCORE_VAR_LCORE_PTR() macro implementation, but perhaps its description could mention that it is safe to use with an "invalid" lcore_id, but not to dereference the result.
> 

I thought about adding something equivalent to an RTE_ASSERT() on 
lcore_id in the dereferencing macros, but then I thought that maybe it 
is a valid use case to pass invalid lcore ids.

Invalid ids being OK or not, I think the above code should do "cs = 
/../" *after* the lcore id check. Now it looks strange and forces the 
reader to consider whether this is valid or not, for no good reason.
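
I.e., something like:

	struct core_state *cs;

	if (lcore >= RTE_MAX_LCORE)
		return -EINVAL;

	cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);

	if (!cs->is_service_core)
		return -EINVAL;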

The lcore variable API docs should probably explicitly allow invalid 
core id in the macros.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v4 0/6] Lcore variables
  2024-02-20  8:49         ` [RFC v3 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                             ` (2 preceding siblings ...)
  2024-02-22  9:22           ` Morten Brørup
@ 2024-02-25 15:03           ` Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                               ` (5 more replies)
  3 siblings, 6 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-25 15:03 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how best to allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In
the author's opinion, they do however provide a reasonably simple,
clean, and seemingly very performant solution to a real problem.

One thing that is unclear to the author is how this API relates to a
potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Contrary to what the version.map edit suggests, this RFC is not meant
as a proposal for DPDK 24.03.

Mattias Rönnblom (6):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 app/test/meson.build                  |   1 +
 app/test/test_lcore_var.c             | 439 ++++++++++++++++++++++++++
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  68 ++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_random.c           |  30 +-
 lib/eal/common/rte_service.c          | 120 ++++---
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 lib/eal/x86/rte_power_intrinsics.c    |  17 +-
 lib/power/rte_power_pmd_mgmt.c        |  36 +--
 13 files changed, 1006 insertions(+), 88 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v4 1/6] eal: add static per-lcore memory allocation facility
  2024-02-25 15:03           ` [RFC v4 0/6] Lcore variables Mattias Rönnblom
@ 2024-02-25 15:03             ` Mattias Rönnblom
  2024-02-27  9:58               ` Morten Brørup
  2024-02-28 10:09               ` [RFC v5 0/6] Lcore variables Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 2/6] eal: add lcore variable test suite Mattias Rönnblom
                               ` (4 subsequent siblings)
  5 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-25 15:03 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of often-used data, which is related logically, but where
there are performance benefits to reap from having updates being local
to an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but with the values' lifetime decoupled from that of
the threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU uses linker scripts, which effectively prevents the
reuse of its otherwise seemingly viable approach.

The currently prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed structs.
The benefit of lcore variables over this approach is that data related
to the same lcore is now kept close (spatially, in memory), rather
than data used by the same module. This in turn avoids excessive use
of padding, which pollutes caches with unused data.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance that the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  68 +++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 375 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 7 files changed, 451 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index d743a5c3d3..0dac33d3b9 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 8c1eb8fafa..a3b8391570 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..5c353ebd46
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
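+
+/* Offset into the current lcore buffer. Initialized to the buffer-full
+ * mark, so that the very first allocation triggers the creation of the
+ * initial lcore buffer.
+ */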
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines, as
+	 * well as having the base pointer aligned on cache line size,
+	 * assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..09a7c7d4f6
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,375 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. In other words,
+ * there's one copy of its value for each and every current and future
+ * lcore id-equipped thread, with the total number of copies amounting
+ * to \c RTE_MAX_LCORE.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a \c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). A handle may be passed between modules and
+ * threads just like any pointer, but its value is not the address of
+ * any particular object, but rather just an opaque identifier, stored
+ * in a typed pointer (to inform the access macros of the values' type).
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using \ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by \ref RTE_LCORE_VAR_ALLOC or
+ *     \ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the
+ *     time of module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * generally only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids *may* be frequently read or written by the owners without the
+ * risk of false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomics) should be
+ * employed to assure there are no data races between the owning
+ * thread and any non-owner threads accessing the same lcore variable
+ * instance.
+ *
+ * The value of the lcore variable for a particular lcore id may be
+ * retrieved with \ref RTE_LCORE_VAR_LCORE_GET. To get a pointer to the
+ * same object, use \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * To modify the value of an lcore variable for a particular lcore id,
+ * either access the object through the pointer retrieved by \ref
+ * RTE_LCORE_VAR_LCORE_PTR or, for primitive types, use \ref
+ * RTE_LCORE_VAR_LCORE_SET.
+ *
+ * The access macros each have a shorthand which may be used by an EAL
+ * thread or registered non-EAL thread to access the lcore variable
+ * instance of its own lcore id. Those are \ref RTE_LCORE_VAR_GET,
+ * \ref RTE_LCORE_VAR_PTR, and \ref RTE_LCORE_VAR_SET.
+ *
+ * Although the handle (as defined by \ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier. The
+ * *identifier* value is common across all lcore ids.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like \c int,
+ * but would more typically be a \c struct. An application may choose
+ * to define an lcore variable which it then never allocates.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of \c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant \c RTE_MAX_LCORE_VAR.
+ *
+ * The lcore variable values are stored in a series of lcore buffers,
+ * which are allocated from the libc heap. Heap allocation failures
+ * are treated as fatal.
+ *
+ * Lcore variables should generally *not* be \ref __rte_cache_aligned
+ * and need *not* include a \ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, all nearby data structures should almost
+ * always be written to by a single thread (the lcore variable
+ * owner). Adding padding will increase the effective memory working
+ * set size, potentially reducing performance.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_PTR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * \endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * \code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * } __rte_cache_aligned;
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * \endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions, and for example next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to \ref rte_lcore_var.h is the \ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between the use of the various forms of TLS (e.g., \ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of the particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. One effect of this is that thread-local variables must
+ *     be initialized in a "lazy" manner (e.g., at the point of thread
+ *     creation). Lcore variables may be accessed immediately after
+ *     having been allocated (which is usually before any thread beyond
+ *     the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or an increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id, and thus
+ *     not for such "regular" threads.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable may be passed
+ *     to and successfully dereferenced by a non-owning thread) depends on
+ *     the details of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, such data sharing is supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define a lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various per-lcore id instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore variable handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a \ref
+ * RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a \ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+static inline void *
+__rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)			\
+	((typeof(handle))__rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get value of a lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_GET(lcore_id, handle)	\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)))
+
+/**
+ * Set the value of a lcore variable instance of the specified lcore id.
+ */
+#define RTE_LCORE_VAR_LCORE_SET(lcore_id, handle, value)		\
+	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)) = (value))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_PTR(handle) \
+	RTE_LCORE_VAR_LCORE_PTR(rte_lcore_id(), handle)
+
+/**
+ * Get value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_GET(handle) \
+	RTE_LCORE_VAR_LCORE_GET(rte_lcore_id(), handle)
+
+/**
+ * Set value of lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_SET(handle, value) \
+	RTE_LCORE_VAR_LCORE_SET(rte_lcore_id(), handle, value)
+
+/**
+ * Iterate over each lcore id's value for a lcore variable.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(var, handle)			\
+	for (unsigned int lcore_id =					\
+		     (((var) = RTE_LCORE_VAR_LCORE_PTR(0, handle)), 0);	\
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (var) = RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for a lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * \ref RTE_LCORE_VAR_PTR or \ref RTE_LCORE_VAR_LCORE_PTR.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal to or
+ *   less than \c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The id of the variable, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 5e0cd47c82..e90b86115a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -393,6 +393,10 @@ EXPERIMENTAL {
 	# added in 23.07
 	rte_memzone_max_get;
 	rte_memzone_max_set;
+
+	# added in 24.03
+	rte_lcore_var_alloc;
+	rte_lcore_var;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v4 2/6] eal: add lcore variable test suite
  2024-02-25 15:03           ` [RFC v4 0/6] Lcore variables Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-25 15:03             ` Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
                               ` (3 subsequent siblings)
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-25 15:03 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Add test suite to exercise the <rte_lcore_var.h> API.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.
RFC v2:
 * Improve alignment-related test coverage.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 439 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 440 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..846affa98c 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..d24403b0f7
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,439 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static bool
+rand_bool(void)
+{
+	return rte_rand() & 1;
+}
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_PTR(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal;
+
+	if (rand_bool())
+		equal = RTE_LCORE_VAR_GET(test_int) == state->old_value;
+	else
+		equal = *(RTE_LCORE_VAR_PTR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	if (rand_bool())
+		RTE_LCORE_VAR_SET(test_int, state->new_value);
+	else
+		*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		RTE_LCORE_VAR_LCORE_SET(lcore_id, test_int, state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		TEST_ASSERT_EQUAL(state->new_value,
+				  RTE_LCORE_VAR_LCORE_GET(lcore_id, test_int),
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_PTR(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_struct);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_PTR(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(RTE_LCORE_VAR_LCORE_GET(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before = RTE_LCORE_VAR_LCORE_GET(lcore_id, before_array);
+		char after = RTE_LCORE_VAR_LCORE_GET(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_PTR(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = RTE_LCORE_VAR_LCORE_GET(lcore_id,
+							     handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_PTR(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_PTR(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v4 3/6] random: keep PRNG state in lcore variable
  2024-02-25 15:03           ` [RFC v4 0/6] Lcore variables Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 2/6] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-02-25 15:03             ` Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 4/6] power: keep per-lcore " Mattias Rönnblom
                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-25 15:03 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/common/rte_random.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 7709b8f2c6..adbbf13f0e 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct rte_rand_state {
@@ -19,14 +20,12 @@ struct rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
-} __rte_cache_aligned;
+};
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_PTR(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_PTR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v4 4/6] power: keep per-lcore state in lcore variable
  2024-02-25 15:03           ` [RFC v4 0/6] Lcore variables Mattias Rönnblom
                               ` (2 preceding siblings ...)
  2024-02-25 15:03             ` [RFC v4 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-25 15:03             ` Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 5/6] service: " Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-25 15:03 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

RFC v3:
 * Replace for loop with FOREACH macro.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/power/rte_power_pmd_mgmt.c | 36 ++++++++++++++++------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 591fc69f36..ea30454895 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -68,8 +69,8 @@ struct pmd_core_cfg {
 	/**< Number of queues ready to enter power optimized state */
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
-} __rte_cache_aligned;
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+};
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_PTR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_PTR(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v4 5/6] service: keep per-lcore state in lcore variable
  2024-02-25 15:03           ` [RFC v4 0/6] Lcore variables Mattias Rönnblom
                               ` (3 preceding siblings ...)
  2024-02-25 15:03             ` [RFC v4 4/6] power: keep per-lcore " Mattias Rönnblom
@ 2024-02-25 15:03             ` Mattias Rönnblom
  2024-02-25 16:28               ` Mattias Rönnblom
  2024-02-25 15:03             ` [RFC v4 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-25 15:03 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 120 ++++++++++++++++++++---------------
 1 file changed, 69 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index d959c91459..7fbae704ed 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,11 +102,12 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
+	else {
+		struct core_state *cs;
+		RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+			memset(cs, 0, sizeof(struct core_state));
 	}
 
 	int i;
@@ -122,7 +124,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +137,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +286,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +293,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +454,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +467,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +489,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_PTR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +535,17 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs;
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
+	cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +553,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +574,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +591,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +643,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +695,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +713,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +738,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +762,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +786,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +816,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +825,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +850,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +861,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +869,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +877,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +886,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +902,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +949,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +978,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +990,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1029,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v4 6/6] eal: keep per-lcore power intrinsics state in lcore variable
  2024-02-25 15:03           ` [RFC v4 0/6] Lcore variables Mattias Rönnblom
                               ` (4 preceding siblings ...)
  2024-02-25 15:03             ` [RFC v4 5/6] service: " Mattias Rönnblom
@ 2024-02-25 15:03             ` Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-25 15:03 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Keep per-lcore power intrinsics state in a lcore variable to reduce
cache working set size and avoid any CPU next-line prefetching from
causing false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 532a2e646b..f4659af77e 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -12,10 +13,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -170,7 +175,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_PTR(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -262,7 +267,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_PTR(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -301,8 +306,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_PTR(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v4 5/6] service: keep per-lcore state in lcore variable
  2024-02-25 15:03             ` [RFC v4 5/6] service: " Mattias Rönnblom
@ 2024-02-25 16:28               ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-25 16:28 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: Morten Brørup, Stephen Hemminger

On 2024-02-25 16:03, Mattias Rönnblom wrote:
> Replace static array of cache-aligned structs with an lcore variable,
> to slightly benefit code simplicity and performance.
> 
> RFC v4:
>   * Remove strange-looking lcore value lookup potentially containing
>     invalid lcore id. (Morten Brørup)
>   * Replace misplaced tab with space. (Morten Brørup)
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>   lib/eal/common/rte_service.c | 120 ++++++++++++++++++++---------------
>   1 file changed, 69 insertions(+), 51 deletions(-)
> 
> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> index d959c91459..7fbae704ed 100644
> --- a/lib/eal/common/rte_service.c
> +++ b/lib/eal/common/rte_service.c
> @@ -11,6 +11,7 @@
>   
>   #include <eal_trace_internal.h>
>   #include <rte_lcore.h>
> +#include <rte_lcore_var.h>
>   #include <rte_branch_prediction.h>
>   #include <rte_common.h>
>   #include <rte_cycles.h>
> @@ -75,7 +76,7 @@ struct core_state {
>   
>   static uint32_t rte_service_count;
>   static struct rte_service_spec_impl *rte_services;
> -static struct core_state *lcore_states;
> +static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
>   static uint32_t rte_service_library_initialized;
>   
>   int32_t
> @@ -101,11 +102,12 @@ rte_service_init(void)
>   		goto fail_mem;
>   	}
>   
> -	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
> -			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
> -	if (!lcore_states) {
> -		EAL_LOG(ERR, "error allocating core states array");
> -		goto fail_mem;
> +	if (lcore_states == NULL)
> +		RTE_LCORE_VAR_ALLOC(lcore_states);
> +	else {
> +		struct core_state *cs;
> +		RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
> +			memset(cs, 0, sizeof(struct core_state));
>   	}
>   
>   	int i;
> @@ -122,7 +124,6 @@ rte_service_init(void)
>   	return 0;
>   fail_mem:
>   	rte_free(rte_services);
> -	rte_free(lcore_states);
>   	return -ENOMEM;
>   }
>   
> @@ -136,7 +137,6 @@ rte_service_finalize(void)
>   	rte_eal_mp_wait_lcore();
>   
>   	rte_free(rte_services);
> -	rte_free(lcore_states);
>   
>   	rte_service_library_initialized = 0;
>   }
> @@ -286,7 +286,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
>   int32_t
>   rte_service_component_unregister(uint32_t id)
>   {
> -	uint32_t i;
>   	struct rte_service_spec_impl *s;
>   	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
>   
> @@ -294,9 +293,10 @@ rte_service_component_unregister(uint32_t id)
>   
>   	s->internal_flags &= ~(SERVICE_F_REGISTERED);
>   
> +	struct core_state *cs;
>   	/* clear the run-bit in all cores */
> -	for (i = 0; i < RTE_MAX_LCORE; i++)
> -		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
> +	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
> +		cs->service_mask &= ~(UINT64_C(1) << id);
>   
>   	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
>   
> @@ -454,7 +454,10 @@ rte_service_may_be_active(uint32_t id)
>   		return -EINVAL;
>   
>   	for (i = 0; i < lcore_count; i++) {
> -		if (lcore_states[ids[i]].service_active_on_lcore[id])
> +		struct core_state *cs =
> +			RTE_LCORE_VAR_LCORE_PTR(ids[i], lcore_states);
> +
> +		if (cs->service_active_on_lcore[id])
>   			return 1;
>   	}
>   
> @@ -464,7 +467,7 @@ rte_service_may_be_active(uint32_t id)
>   int32_t
>   rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
>   {
> -	struct core_state *cs = &lcore_states[rte_lcore_id()];
> +	struct core_state *cs =	RTE_LCORE_VAR_PTR(lcore_states);
>   	struct rte_service_spec_impl *s;
>   
>   	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> @@ -486,8 +489,7 @@ service_runner_func(void *arg)
>   {
>   	RTE_SET_USED(arg);
>   	uint8_t i;
> -	const int lcore = rte_lcore_id();
> -	struct core_state *cs = &lcore_states[lcore];
> +	struct core_state *cs = RTE_LCORE_VAR_PTR(lcore_states);
>   
>   	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
>   
> @@ -533,13 +535,17 @@ service_runner_func(void *arg)
>   int32_t
>   rte_service_lcore_may_be_active(uint32_t lcore)
>   {
> -	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
> +	struct core_state *cs;
> +
> +	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)

This doesn't work, since 'cs' is not yet initialized. I'll fix it in v5.
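
Something like this should do in v5 (sketch only; the point being that
the lcore id must be validated before it is used for the lcore variable
lookup, and the lookup done before the first dereference):

int32_t
rte_service_lcore_may_be_active(uint32_t lcore)
{
	struct core_state *cs;

	if (lcore >= RTE_MAX_LCORE)
		return -EINVAL;

	cs = RTE_LCORE_VAR_LCORE_PTR(lcore, lcore_states);
	if (!cs->is_service_core)
		return -EINVAL;

	/* Load thread_active using ACQUIRE to avoid instructions
	 * dependent on the result being re-ordered before this load
	 * completes.
	 */
	return rte_atomic_load_explicit(&cs->thread_active,
			rte_memory_order_acquire);
}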

<snip>

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v4 1/6] eal: add static per-lcore memory allocation facility
  2024-02-25 15:03             ` [RFC v4 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-27  9:58               ` Morten Brørup
  2024-02-27 13:44                 ` Mattias Rönnblom
  2024-02-28 10:09               ` [RFC v5 0/6] Lcore variables Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-02-27  9:58 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Sunday, 25 February 2024 16.03

[...]

> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +	void *handle;
> +	void *value;
> +
> +	offset = RTE_ALIGN_CEIL(offset, align);
> +
> +	if (offset + size > RTE_MAX_LCORE_VAR) {

This would be the usual comparison:
if (lcore_buffer == NULL) {

> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> +					     LCORE_BUFFER_SIZE);
> +		RTE_VERIFY(lcore_buffer != NULL);
> +
> +		offset = 0;
> +	}

[...]

> +/**
> + * Define a lcore variable handle.
> + *
> + * This macro defines a variable which is used as a handle to access
> + * the various per-lcore id instances of a per-lcore id variable.
> + *
> + * The aim with this macro is to make clear at the point of
> + * declaration that this is an lcore handler, rather than a regular
> + * pointer.
> + *
> + * Add @b static as a prefix in case the lcore variable are only to be
> + * accessed from a particular translation unit.
> + */
> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> +

The parameter is "name" here, and "handle" in other macros.
Just mentioning to make sure you thought about it.

> +/**
> + * Get pointer to lcore variable instance with the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)			\
> +	((typeof(handle))__rte_lcore_var_lcore_ptr(lcore_id, handle))
> +
> +/**
> + * Get value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, handle)	\
> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)))
> +
> +/**
> + * Set the value of a lcore variable instance of the specified lcore id.
> + */
> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, handle, value)		\
> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)) = (value))

I still think RTE_LCORE_VAR[_LCORE]_PTR() suffice, and RTE_LCORE_VAR[_LCORE]_GET/SET are superfluous.
But I don't insist on their removal. :-)
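
After all, given the definitions quoted above, GET/SET are just
dereferences of the corresponding PTR() macros, e.g. (handle name made
up):

uint32_t v = *RTE_LCORE_VAR_PTR(my_handle);  /* instead of _GET() */
*RTE_LCORE_VAR_PTR(my_handle) = v;           /* instead of _SET() */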

With or without suggested changes...

For the series,
Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v4 1/6] eal: add static per-lcore memory allocation facility
  2024-02-27  9:58               ` Morten Brørup
@ 2024-02-27 13:44                 ` Mattias Rönnblom
  2024-02-27 15:05                   ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-27 13:44 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-27 10:58, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Sunday, 25 February 2024 16.03
> 
> [...]
> 
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	void *handle;
>> +	void *value;
>> +
>> +	offset = RTE_ALIGN_CEIL(offset, align);
>> +
>> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> 
> This would be the usual comparison:
> if (lcore_buffer == NULL) {
> 
>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>> +					     LCORE_BUFFER_SIZE);
>> +		RTE_VERIFY(lcore_buffer != NULL);
>> +
>> +		offset = 0;
>> +	}
> 
> [...]
> 
>> +/**
>> + * Define a lcore variable handle.
>> + *
>> + * This macro defines a variable which is used as a handle to access
>> + * the various per-lcore id instances of a per-lcore id variable.
>> + *
>> + * The aim with this macro is to make clear at the point of
>> + * declaration that this is an lcore handler, rather than a regular
>> + * pointer.
>> + *
>> + * Add @b static as a prefix in case the lcore variable are only to be
>> + * accessed from a particular translation unit.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
>> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
>> +
> 
> The parameter is "name" here, and "handle" in other macros.
> Just mentioning to make sure you thought about it.
> 
>> +/**
>> + * Get pointer to lcore variable instance with the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)			\
>> +	((typeof(handle))__rte_lcore_var_lcore_ptr(lcore_id, handle))
>> +
>> +/**
>> + * Get value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, handle)	\
>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)))
>> +
>> +/**
>> + * Set the value of a lcore variable instance of the specified lcore id.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, handle, value)		\
>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)) = (value))
> 
> I still think RTE_LCORE_VAR[_LCORE]_PTR() suffice, and RTE_LCORE_VAR[_LCORE]_GET/SET are superfluous.
> But I don't insist on their removal. :-)
> 

I'll remove them. One can always add them later. Nothing I've seen in 
the DPDK code base so far has called for their use.

Should the RTE_LCORE_VAR_PTR() be renamed RTE_LCORE_VAR_VALUE() (and 
still return a pointer, obviously)? "PTR" seems a little superfluous 
(Hungarian). "RTE_LCORE_VAR()" would be short, but not very descriptive.
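
For comparison, a typical call site under each candidate name (handle
name made up):

struct foo *v = RTE_LCORE_VAR_PTR(foo_handle);    /* current */
struct foo *v = RTE_LCORE_VAR_VALUE(foo_handle);  /* proposed */
struct foo *v = RTE_LCORE_VAR(foo_handle);        /* short */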

> With or without suggested changes...
> 
> For the series,
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 

Thanks for all help.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v4 1/6] eal: add static per-lcore memory allocation facility
  2024-02-27 13:44                 ` Mattias Rönnblom
@ 2024-02-27 15:05                   ` Morten Brørup
  2024-02-27 16:27                     ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-02-27 15:05 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 27 February 2024 14.44
> 
> On 2024-02-27 10:58, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Sunday, 25 February 2024 16.03
> >
> > [...]
> >
> >> +static void *
> >> +lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +	void *handle;
> >> +	void *value;
> >> +
> >> +	offset = RTE_ALIGN_CEIL(offset, align);
> >> +
> >> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> >
> > This would be the usual comparison:
> > if (lcore_buffer == NULL) {
> >
> >> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> +					     LCORE_BUFFER_SIZE);
> >> +		RTE_VERIFY(lcore_buffer != NULL);
> >> +
> >> +		offset = 0;
> >> +	}
> >
> > [...]
> >
> >> +/**
> >> + * Define a lcore variable handle.
> >> + *
> >> + * This macro defines a variable which is used as a handle to access
> >> + * the various per-lcore id instances of a per-lcore id variable.
> >> + *
> >> + * The aim with this macro is to make clear at the point of
> >> + * declaration that this is an lcore handler, rather than a regular
> >> + * pointer.
> >> + *
> >> + * Add @b static as a prefix in case the lcore variable are only to
> be
> >> + * accessed from a particular translation unit.
> >> + */
> >> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> >> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> >> +
> >
> > The parameter is "name" here, and "handle" in other macros.
> > Just mentioning to make sure you thought about it.
> >
> >> +/**
> >> + * Get pointer to lcore variable instance with the specified lcore
> id.
> >> + */
> >> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)			\
> >> +	((typeof(handle))__rte_lcore_var_lcore_ptr(lcore_id, handle))
> >> +
> >> +/**
> >> + * Get value of a lcore variable instance of the specified lcore id.
> >> + */
> >> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, handle)	\
> >> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)))
> >> +
> >> +/**
> >> + * Set the value of a lcore variable instance of the specified lcore
> id.
> >> + */
> >> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, handle, value)		\
> >> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)) = (value))
> >
> > I still think RTE_LCORE_VAR[_LCORE]_PTR() suffice, and
> RTE_LCORE_VAR[_LCORE]_GET/SET are superfluous.
> > But I don't insist on their removal. :-)
> >
> 
> I'll remove them. One can always add them later. Nothing I've seen in
> the DPDK code base so far has called for their use.
> 
> Should the RTE_LCORE_VAR_PTR() be renamed RTE_LCORE_VAR_VALUE() (and
> still return a pointer, obviously)? "PTR" seems a little superfluous
> (Hungarian). "RTE_LCORE_VAR()" would be short, but not very descriptive.

Good question...

I would try to align this name and the name of the associated foreach macro, currently RTE_LCORE_VAR_FOREACH_VALUE(var, handle).

It seems confusing to have a macro named _VALUE() returning a pointer.
(Which is why I also dislike the foreach macro's current name and "var" parameter name.)

If it is supposed to be frequently used, a shorter name is preferable.
Which leans towards RTE_LCORE_VAR().

And then RTE_FOREACH_LCORE_VAR(iterator, handle) or RTE_LCORE_VAR_FOREACH(iterator, handle).

But then it is not obvious from the name that they operate on pointers.
We don't use Hungarian style in DPDK, so perhaps that is acceptable.


Your conclusion that GET/SET are not generally required inspired me for another idea...
Maybe returning a pointer is not the right thing to do!

I wonder if there are any obstacles to generally dereferencing the lcore variable pointer, like this:

#define RTE_LCORE_VAR_LCORE(lcore_id, handle) \
	(*(typeof(handle))__rte_lcore_var_lcore_ptr(lcore_id, handle))

It would work for both get and set:
RTE_LCORE_VAR(foo) = RTE_LCORE_VAR(bar);

And also for functions being passed the address of the variable.
E.g. memset(&RTE_LCORE_VAR(foo), ...) would expand to:
memset(&(*(typeof(foo))__rte_lcore_var_lcore_ptr(rte_lcore_id(), foo)), ...);


One more thought, not related to the above discussion:

The TLS per-lcore variables are built with "per_lcore_" prefix added to the names, like this:
#define RTE_DEFINE_PER_LCORE(type, name) \
	__thread __typeof__(type) per_lcore_##name

Should the lcore variables have something similar, i.e.:
#define RTE_LCORE_VAR_HANDLE(type, name) \
	RTE_LCORE_VAR_HANDLE_TYPE(type) lcore_var_##name


> 
> > With or without suggested changes...
> >
> > For the series,
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >
> 
> Thanks for all help.

Thank you for the detailed consideration of my feedback.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v4 1/6] eal: add static per-lcore memory allocation facility
  2024-02-27 15:05                   ` Morten Brørup
@ 2024-02-27 16:27                     ` Mattias Rönnblom
  2024-02-27 16:51                       ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-27 16:27 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

On 2024-02-27 16:05, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Tuesday, 27 February 2024 14.44
>>
>> On 2024-02-27 10:58, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>> Sent: Sunday, 25 February 2024 16.03
>>>
>>> [...]
>>>
>>>> +static void *
>>>> +lcore_var_alloc(size_t size, size_t align)
>>>> +{
>>>> +	void *handle;
>>>> +	void *value;
>>>> +
>>>> +	offset = RTE_ALIGN_CEIL(offset, align);
>>>> +
>>>> +	if (offset + size > RTE_MAX_LCORE_VAR) {
>>>
>>> This would be the usual comparison:
>>> if (lcore_buffer == NULL) {
>>>
>>>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>>>> +					     LCORE_BUFFER_SIZE);
>>>> +		RTE_VERIFY(lcore_buffer != NULL);
>>>> +
>>>> +		offset = 0;
>>>> +	}
>>>
>>> [...]
>>>
>>>> +/**
>>>> + * Define a lcore variable handle.
>>>> + *
>>>> + * This macro defines a variable which is used as a handle to access
>>>> + * the various per-lcore id instances of a per-lcore id variable.
>>>> + *
>>>> + * The aim with this macro is to make clear at the point of
>>>> + * declaration that this is an lcore handler, rather than a regular
>>>> + * pointer.
>>>> + *
>>>> + * Add @b static as a prefix in case the lcore variable are only to
>> be
>>>> + * accessed from a particular translation unit.
>>>> + */
>>>> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
>>>> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
>>>> +
>>>
>>> The parameter is "name" here, and "handle" in other macros.
>>> Just mentioning to make sure you thought about it.
>>>
>>>> +/**
>>>> + * Get pointer to lcore variable instance with the specified lcore
>> id.
>>>> + */
>>>> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)			\
>>>> +	((typeof(handle))__rte_lcore_var_lcore_ptr(lcore_id, handle))
>>>> +
>>>> +/**
>>>> + * Get value of a lcore variable instance of the specified lcore id.
>>>> + */
>>>> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, handle)	\
>>>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)))
>>>> +
>>>> +/**
>>>> + * Set the value of a lcore variable instance of the specified lcore
>> id.
>>>> + */
>>>> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, handle, value)		\
>>>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)) = (value))
>>>
>>> I still think RTE_LCORE_VAR[_LCORE]_PTR() suffice, and
>> RTE_LCORE_VAR[_LCORE]_GET/SET are superfluous.
>>> But I don't insist on their removal. :-)
>>>
>>
>> I'll remove them. One can always add them later. Nothing I've seen in
>> the DPDK code base so far has been called for their use.
>>
>> Should the RTE_LCORE_VAR_PTR() be renamed RTE_LCORE_VAR_VALUE() (and
>> still return a pointer, obviously)? "PTR" seems a little superfluous
>> (Hungarian). "RTE_LCORE_VAR()" would be short, but not very descriptive.
> 
> Good question...
> 
> I would try to align this name and the name of the associated foreach macro, currently RTE_LCORE_VAR_FOREACH_VALUE(var, handle).
> 
> It seems confusing to have a macro named _VALUE() returning a pointer.
> (Which is why I also dislike the foreach macro's current name and "var" parameter name.)
> 

Not sure I agree. In C, you often ask for a value and get a pointer to 
that value. I'll leave it VALUE() for now.

> If it is supposed to be frequently used, a shorter name is preferable.
> Which leans towards RTE_LCORE_VAR().
> 
> And then RTE_FOREACH_LCORE_VAR(iterator, handle) or RTE_LCORE_VAR_FOREACH(iterator, handle).
> 

RTE_LCORE_VAR_FOREACH was the original name, which was changed because 
it was confusingly close to RTE_LCORE_FOREACH(), but had different 
semantics in regard to which lcore ids are iterated over (EAL threads 
only, versus all lcore ids).
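
To illustrate the difference in iteration semantics (handle and
function names made up):

unsigned int lcore_id;
RTE_LCORE_FOREACH(lcore_id)
	do_something(lcore_id);     /* enabled EAL threads only */

struct foo *value;
RTE_LCORE_VAR_FOREACH_VALUE(value, foo_handle)
	do_something_else(value);   /* all RTE_MAX_LCORE lcore ids */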

> But then it is not obvious from the name that they operate on pointers.
> We don't use Hungarian style in DPDK, so perhaps that is acceptable.
> 
> 
> Your conclusion that GET/SET are not generally required inspired me for another idea...
> Maybe returning a pointer is not the right thing to do!
> 
> I wonder if there are any obstacles to generally dereferencing the lcore variable pointer, like this:
> 
> #define RTE_LCORE_VAR_LCORE(lcore_id, handle) \
> 	(*(typeof(handle))__rte_lcore_var_lcore_ptr(lcore_id, handle))
> 
> It would work for both get and set:
> RTE_LCORE_VAR(foo) = RTE_LCORE_VAR(bar);
> 
> And also for functions being passed the address of the variable.
> E.g. memset(&RTE_LCORE_VAR(foo), ...) would expand to:
> memset(&(*(typeof(foo))__rte_lcore_var_lcore_ptr(rte_lcore_id(), foo)), ...);
> 
> 

The value is usually accessed by means of a pointer, so no need to 
return *pointer.

> One more thought, not related to the above discussion:
> 
> The TLS per-lcore variables are built with "per_lcore_" prefix added to the names, like this:
> #define RTE_DEFINE_PER_LCORE(type, name) \
> 	__thread __typeof__(type) per_lcore_##name
> 
> Should the lcore variables have something similar, i.e.:
> #define RTE_LCORE_VAR_HANDLE(type, name) \
> 	RTE_LCORE_VAR_HANDLE_TYPE(type) lcore_var_##name
> 

I started out with a prefix, but I removed it, since you may want to 
access (copy, assign) the handle pointer directly, and thus need to 
know its real name. Also, I didn't see why you need a prefix.

For example, consider a section of code where you want to use one of two 
variables depending on condition.

RTE_LCORE_VAR_HANDLE(int, actual);

if (something)
     actual = some_handle;
else
     actual = some_other_handle;

int *value = RTE_LCORE_VAR_VALUE(actual);

This above doesn't work if some_handle is actually named 
rte_lcore_var_some_handle or something like that.

If you want to add a prefix (for which there shouldn't be a need), you 
would need a macro RTE_LCORE_VAR_NAME() as well, so the user can derive 
the actual name (including the prefix).
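
E.g., something like this (hypothetical, had a prefix been kept):

#define RTE_LCORE_VAR_NAME(name) lcore_var_ ## name

actual = RTE_LCORE_VAR_NAME(some_handle);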

> 
>>
>>> With or without suggested changes...
>>>
>>> For the series,
>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>>
>>
>> Thanks for all help.
> 
> Thank you for the detailed consideration of my feedback.
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v4 1/6] eal: add static per-lcore memory allocation facility
  2024-02-27 16:27                     ` Mattias Rönnblom
@ 2024-02-27 16:51                       ` Morten Brørup
  0 siblings, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-02-27 16:51 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev; +Cc: Stephen Hemminger

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 27 February 2024 17.28
> 
> On 2024-02-27 16:05, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Tuesday, 27 February 2024 14.44
> >>
> >> On 2024-02-27 10:58, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>> Sent: Sunday, 25 February 2024 16.03
> >>>
> >>> [...]
> >>>
> >>>> +static void *
> >>>> +lcore_var_alloc(size_t size, size_t align)
> >>>> +{
> >>>> +	void *handle;
> >>>> +	void *value;
> >>>> +
> >>>> +	offset = RTE_ALIGN_CEIL(offset, align);
> >>>> +
> >>>> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> >>>
> >>> This would be the usual comparison:
> >>> if (lcore_buffer == NULL) {
> >>>
> >>>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >>>> +					     LCORE_BUFFER_SIZE);
> >>>> +		RTE_VERIFY(lcore_buffer != NULL);
> >>>> +
> >>>> +		offset = 0;
> >>>> +	}
> >>>
> >>> [...]
> >>>
> >>>> +/**
> >>>> + * Define a lcore variable handle.
> >>>> + *
> >>>> + * This macro defines a variable which is used as a handle to
> access
> >>>> + * the various per-lcore id instances of a per-lcore id variable.
> >>>> + *
> >>>> + * The aim with this macro is to make clear at the point of
> >>>> + * declaration that this is an lcore handler, rather than a
> regular
> >>>> + * pointer.
> >>>> + *
> >>>> + * Add @b static as a prefix in case the lcore variable are only
> to
> >> be
> >>>> + * accessed from a particular translation unit.
> >>>> + */
> >>>> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> >>>> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> >>>> +
> >>>
> >>> The parameter is "name" here, and "handle" in other macros.
> >>> Just mentioning to make sure you thought about it.
> >>>
> >>>> +/**
> >>>> + * Get pointer to lcore variable instance with the specified lcore
> >> id.
> >>>> + */
> >>>> +#define RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)
> 	\
> >>>> +	((typeof(handle))__rte_lcore_var_lcore_ptr(lcore_id,
> handle))
> >>>> +
> >>>> +/**
> >>>> + * Get value of a lcore variable instance of the specified lcore
> id.
> >>>> + */
> >>>> +#define RTE_LCORE_VAR_LCORE_GET(lcore_id, handle)	\
> >>>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)))
> >>>> +
> >>>> +/**
> >>>> + * Set the value of a lcore variable instance of the specified
> lcore
> >> id.
> >>>> + */
> >>>> +#define RTE_LCORE_VAR_LCORE_SET(lcore_id, handle, value)
> 	\
> >>>> +	(*(RTE_LCORE_VAR_LCORE_PTR(lcore_id, handle)) = (value))
> >>>
> >>> I still think RTE_LCORE_VAR[_LCORE]_PTR() suffice, and
> >> RTE_LCORE_VAR[_LCORE]_GET/SET are superfluous.
> >>> But I don't insist on their removal. :-)
> >>>
> >>
> >> I'll remove them. One can always add them later. Nothing I've seen in
> >> the DPDK code base so far has called for their use.
> >>
> >> Should the RTE_LCORE_VAR_PTR() be renamed RTE_LCORE_VAR_VALUE() (and
> >> still return a pointer, obviously)? "PTR" seems a little superfluous
> >> (Hungarian). "RTE_LCORE_VAR()" would be short, but not very
> descriptive.
> >
> > Good question...
> >
> > I would try to align this name and the name of the associated foreach
> macro, currently RTE_LCORE_VAR_FOREACH_VALUE(var, handle).
> >
> > It seems confusing to have a macro named _VALUE() returning a pointer.
> > (Which is why I also dislike the foreach macro's current name and
> "var" parameter name.)
> >
> 
> Not sure I agree. In C, you often ask for a value and get a pointer to
> that value. I'll leave it VALUE() for now.

Yes, fopen() is an example of this.
But such functions don't have VALUE in their names.
(I'm not so worried about the "var" parameter name being confusing.)

You can leave it VALUE for now, just keep an open mind for changing it. :-)

> 
> > If it is supposed to be frequently used, a shorter name is preferable.
> > Which leans towards RTE_LCORE_VAR().
> >
> > And then RTE_FOREACH_LCORE_VAR(iterator, handle) or
> RTE_LCORE_VAR_FOREACH(iterator, handle).
> >
> 
> RTE_LCORE_VAR_FOREACH was the original name, which was changed because
> it was confusingly close to RTE_LCORE_FOREACH(), but had different
> semantics in regard to which lcore ids are iterated over (EAL threads
> only, versus all lcore ids).

I know I was going in circles here.
Perhaps when we get used to the lcore variables, the similar name might not be confusing anymore. I suppose this happened to me during the review discussions.
I don't have a solid answer, so I'm throwing the ball around to see how it bounces.

> 
> > But then it is not obvious from the name that they operate on
> pointers.
> > We don't use Hungarian style in DPDK, so perhaps that is acceptable.
> >
> >
> > Your conclusion that GET/SET are not generally required inspired me
> for another idea...
> > Maybe returning a pointer is not the right thing to do!
> >
> > I wonder if there are any obstacles to generally dereferencing the
> lcore variable pointer, like this:
> >
> > #define RTE_LCORE_VAR_LCORE(lcore_id, handle) \
> > 	(*(typeof(handle))__rte_lcore_var_lcore_ptr(lcore_id, handle))
> >
> > It would work for both get and set:
> > RTE_LCORE_VAR(foo) = RTE_LCORE_VAR(bar);
> >
> > And also for functions being passed the address of the variable.
> > E.g. memset(&RTE_LCORE_VAR(foo), ...) would expand to:
> > memset(&(*(typeof(foo))__rte_lcore_var_lcore_ptr(rte_lcore_id(),
> foo)), ...);
> >
> >
> 
> The value is usually accessed by means of a pointer, so no need to
> return *pointer.

OK. I suppose you have a pretty good overview of the relevant use cases by now.

> 
> > One more thought, not related to the above discussion:
> >
> > The TLS per-lcore variables are built with "per_lcore_" prefix added
> to the names, like this:
> > #define RTE_DEFINE_PER_LCORE(type, name) \
> > 	__thread __typeof__(type) per_lcore_##name
> >
> > Should the lcore variables have something similar, i.e.:
> > #define RTE_LCORE_VAR_HANDLE(type, name) \
> > 	RTE_LCORE_VAR_HANDLE_TYPE(type) lcore_var_##name
> >
> 
> I started out with a prefix, but I removed it, since you may want to
> access (copy, assign) the handler pointer directly, and thus need to
> know it's real name. Also, I didn't see why you need a prefix.
> 
> For example, consider a section of code where you want to use one of two
> variables depending on condition.
> 
> RTE_LCORE_VAR_HANDLE(actual, int);
> 
> if (something)
>      actual = some_handle;
> else
>      actual = some_other_handle;
> 
> int *value = RTE_LCORE_VAR_VALUE(actual);
> 
> This above doesn't work if some_handle is actually named
> rte_lcore_var_some_handle or something like that.
> 
> If you want to add a prefix (for which there shouldn't be a need), you
> would need a macro RTE_LCORE_VAR_NAME() as well, so the user can derive
> the actual name (including the prefix).

Thanks for the detailed reply.
Let's not add a prefix.

> 
> >
> >>
> >>> With or without suggested changes...
> >>>
> >>> For the series,
> >>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >>>
> >>
> >> Thanks for all help.
> >
> > Thank you for the detailed consideration of my feedback.
> >

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v5 0/6] Lcore variables
  2024-02-25 15:03             ` [RFC v4 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-27  9:58               ` Morten Brørup
@ 2024-02-28 10:09               ` Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                   ` (5 more replies)
  1 sibling, 6 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-28 10:09 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how to best allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean,
and seemingly very performant solution to a real problem.

One thing that is unclear to the author is how this API relates to a
potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Contrary to what the version.map edit suggests, this RFC is not meant
as a proposal for DPDK 24.03.

Mattias Rönnblom (6):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 app/test/meson.build                  |   1 +
 app/test/test_lcore_var.c             | 432 ++++++++++++++++++++++++++
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  68 ++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_random.c           |  30 +-
 lib/eal/common/rte_service.c          | 118 ++++---
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 368 ++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 lib/eal/x86/rte_power_intrinsics.c    |  17 +-
 lib/power/rte_power_pmd_mgmt.c        |  36 +--
 13 files changed, 990 insertions(+), 88 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v5 1/6] eal: add static per-lcore memory allocation facility
  2024-02-28 10:09               ` [RFC v5 0/6] Lcore variables Mattias Rönnblom
@ 2024-02-28 10:09                 ` Mattias Rönnblom
  2024-03-19 12:52                   ` Konstantin Ananyev
  2024-05-06  8:27                   ` [RFC v6 0/6] Lcore variables Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 2/6] eal: add lcore variable test suite Mattias Rönnblom
                                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-28 10:09 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of often-used data, which is related logically, but where
there are performance benefits to reap from having updates being local
to an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.
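
For illustration (type and variable names made up):

/* TLS: one instance per thread, tied to the thread's lifetime */
static _Thread_local struct foo tls_foo;

/* lcore variable: one instance per lcore id, available for the
 * lifetime of the EAL
 */
static RTE_LCORE_VAR_HANDLE(struct foo, lcore_foo);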

Lcore variables are also similar in terms of functionality provided by
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its, otherwise seemingly viable, approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore is now close (spatially, in memory), rather than data used by
the same module. This in turn avoids excessive use of padding and the
pollution of caches with unused data.
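
A sketch of the two layouts (struct and field names made up):

/* prevailing pattern: cache-aligned, guarded, per-module array */
static struct foo_lcore_data {
	long count;
	RTE_CACHE_GUARD;
} __rte_cache_aligned foo_data[RTE_MAX_LCORE];

/* the same data as an lcore variable; no padding required */
struct foo_lcore_data {
	long count;
};
static RTE_LCORE_VAR_HANDLE(struct foo_lcore_data, foo_data);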

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represent the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  68 +++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 368 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   4 +
 7 files changed, 444 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index d743a5c3d3..0dac33d3b9 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 8c1eb8fafa..a3b8391570 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..5c353ebd46
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size, as well as the base
+	 * pointer, aligned on the cache line size assures that aligned
+	 * offsets also translate to aligned pointers across all
+	 * values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..1db479253d
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,368 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. In other words,
+ * there's one copy of its value for each and every current and future
+ * lcore id-equipped thread, with the total number of copies amounting
+ * to @c RTE_MAX_LCORE.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for an @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). A handle may be passed between modules and
+ * threads just like any pointer, but its value is not the address of
+ * any particular object, but rather just an opaque identifier, stored
+ * in a typed pointer (to inform the access macros of the value type).
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * generally only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by different lcore
+ * ids *may* be frequently read or written by their owners without the
+ * risk of false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomics) should be
+ * employed to assure there are no data races between the owning
+ * thread and any non-owner threads accessing the same lcore variable
+ * instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value, for which a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct. An application may choose
+ * to define an lcore variable, which it then goes on to never
+ * allocate.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * The size of an lcore variable's value cannot exceed the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variables are stored in a series of lcore buffers, which
+ * are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since the use
+ * of these constructs is designed to avoid false sharing. In the
+ * case of an lcore variable instance, all nearby data structures
+ * should almost always be written to by a single thread (the lcore
+ * variable owner). Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * } __rte_cache_aligned;
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions and for example next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore variables
+ * are:
+ *
+ *   * The existence of a thread-local variable instance follows that
+ *     of the particular thread it belongs to. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. One effect of this is that thread-local variables must
+ *     be initialized in a "lazy" manner (e.g., at the point of thread
+ *     creation). Lcore variables may be accessed immediately after
+ *     having been allocated (which usually is prior to any thread
+ *     beyond the main thread running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or an increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id, and thus
+ *     not for such "regular" threads.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread and GCC _Thread_local, such data sharing is
+ *     supported. In the C11 standard, the result of accessing another
+ *     thread's _Thread_local object is implementation-defined. Lcore
+ *     variable instances may be accessed reliably by any thread.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various per-lcore id instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param value
+ *   A pointer successively set to point to the lcore variable value
+ *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
+	for (unsigned int lcore_id =					\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the libc heap.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal to
+ *   or less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The id of the variable, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 5e0cd47c82..e90b86115a 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -393,6 +393,10 @@ EXPERIMENTAL {
 	# added in 23.07
 	rte_memzone_max_get;
 	rte_memzone_max_set;
+
+	# added in 24.03
+	rte_lcore_var_alloc;
+	rte_lcore_var;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
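
A note to make the handle arithmetic in <rte_lcore_var.h> above concrete:
with the current implementation, the handle returned by
rte_lcore_var_alloc() doubles as the address of lcore id 0's value (even
though the API asks you to treat it as opaque), and every other lcore
id's value sits at a fixed RTE_MAX_LCORE_VAR-byte stride from it. A
minimal sketch, assuming only the API above; the struct and function
names are made up for illustration:

#include <stdint.h>

#include <rte_lcore_var.h>

struct foo_state {
	uint64_t pkts;
};

static RTE_LCORE_VAR_HANDLE(struct foo_state, foo_states);

RTE_LCORE_VAR_INIT(foo_states);

/* RTE_LCORE_VAR_LCORE_VALUE(lcore_id, foo_states) boils down to
 *
 *   (struct foo_state *)RTE_PTR_ADD(foo_states,
 *                                   lcore_id * RTE_MAX_LCORE_VAR)
 *
 * so values for consecutive lcore ids lie RTE_MAX_LCORE_VAR bytes
 * apart, each close to other modules' data for the same lcore id.
 */
static void
foo_account_pkt(unsigned int lcore_id)
{
	struct foo_state *state =
		RTE_LCORE_VAR_LCORE_VALUE(lcore_id, foo_states);

	state->pkts++;
}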

* [RFC v5 2/6] eal: add lcore variable test suite
  2024-02-28 10:09               ` [RFC v5 0/6] Lcore variables Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-02-28 10:09                 ` Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
                                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-28 10:09 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Add test suite to exercise the <rte_lcore_var.h> API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..846affa98c 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..e07d13460f
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
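
For reference, what the MANY_LVARS constant in the suite above works out
to, assuming the default RTE_MAX_LCORE_VAR of 1048576 bytes from the
first patch:

#include <stdint.h>

#include <rte_config.h>

/* 2 * 1048576 / 4 = 524288 uint32_t-sized lcore variables, i.e.,
 * 2 MiB worth of per-lcore offsets - enough to force the bump
 * allocator in eal_common_lcore_var.c through more than one backing
 * buffer, which is what test_many_lvars is meant to exercise.
 */
#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))

_Static_assert(MANY_LVARS * sizeof(uint32_t) == 2 * RTE_MAX_LCORE_VAR,
	       "test spans two full lcore buffers' worth of offsets");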

* [RFC v5 3/6] random: keep PRNG state in lcore variable
  2024-02-28 10:09               ` [RFC v5 0/6] Lcore variables Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 2/6] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-02-28 10:09                 ` Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 4/6] power: keep per-lcore " Mattias Rönnblom
                                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-28 10:09 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/common/rte_random.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 7709b8f2c6..b265660283 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct rte_rand_state {
@@ -19,14 +20,12 @@ struct rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
-} __rte_cache_aligned;
+};
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
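
In footprint terms, what the patch above removes per lcore id can be
read off the old struct layout; a sketch, assuming RTE_CACHE_GUARD_LINES
is 1 (the rte_config.h default):

#include <stdint.h>

#include <rte_common.h>

/* The pre-patch per-lcore entry: 40 bytes of PRNG state, padded to a
 * full cache line, followed by one guard cache line. The lcore
 * variable version stores only the bare 40-byte struct, packed next
 * to other modules' data for the same lcore id.
 */
struct old_rte_rand_state {
	uint64_t z1, z2, z3, z4, z5;
	RTE_CACHE_GUARD;
} __rte_cache_aligned;

_Static_assert(sizeof(struct old_rte_rand_state) ==
	       2 * RTE_CACHE_LINE_SIZE,
	       "two cache lines per lcore id in the old scheme");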

* [RFC v5 4/6] power: keep per-lcore state in lcore variable
  2024-02-28 10:09               ` [RFC v5 0/6] Lcore variables Mattias Rönnblom
                                   ` (2 preceding siblings ...)
  2024-02-28 10:09                 ` [RFC v5 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-02-28 10:09                 ` Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 5/6] service: " Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-28 10:09 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

RFC v3:
 * Replace for loop with FOREACH macro.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/power/rte_power_pmd_mgmt.c | 36 ++++++++++++++++------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 591fc69f36..595c8091e6 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -68,8 +69,8 @@ struct pmd_core_cfg {
 	/**< Number of queues ready to enter power optimized state */
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
-} __rte_cache_aligned;
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+};
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v5 5/6] service: keep per-lcore state in lcore variable
  2024-02-28 10:09               ` [RFC v5 0/6] Lcore variables Mattias Rönnblom
                                   ` (3 preceding siblings ...)
  2024-02-28 10:09                 ` [RFC v5 4/6] power: keep per-lcore " Mattias Rönnblom
@ 2024-02-28 10:09                 ` Mattias Rönnblom
  2024-02-28 10:09                 ` [RFC v5 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-28 10:09 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/common/rte_service.c | 118 ++++++++++++++++++++---------------
 1 file changed, 67 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index d959c91459..5429ddce41 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,11 +102,12 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
+	else {
+		struct core_state *cs;
+		RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+			memset(cs, 0, sizeof(struct core_state));
 	}
 
 	int i;
@@ -122,7 +124,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +137,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +286,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +293,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +454,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +467,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +489,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +535,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +551,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +572,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +589,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +641,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +693,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +711,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +736,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +760,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +784,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +814,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +823,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +848,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +859,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +867,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +875,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +884,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +900,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +947,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +976,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +988,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1027,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
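
The rte_service_init() hunk above leans on an API guarantee from the
first patch: rte_lcore_var_alloc() never returns NULL, so a handle that
is still NULL reliably means "not yet allocated". And since lcore
variables cannot be freed, re-initialization must zero the old values
instead of reallocating. The resulting idiom, in sketch form (struct
fields elided):

#include <string.h>

#include <rte_lcore_var.h>

struct core_state {
	uint64_t service_mask;
	/* ... (see the patch above for the real layout) */
};

static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);

static void
module_init(void)
{
	if (lcore_states == NULL) {
		/* first init: values are zeroed on allocation */
		RTE_LCORE_VAR_ALLOC(lcore_states);
	} else {
		/* re-init: clear the still-live values by hand */
		struct core_state *cs;

		RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
			memset(cs, 0, sizeof(struct core_state));
	}
}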

* [RFC v5 6/6] eal: keep per-lcore power intrinsics state in lcore variable
  2024-02-28 10:09               ` [RFC v5 0/6] Lcore variables Mattias Rönnblom
                                   ` (4 preceding siblings ...)
  2024-02-28 10:09                 ` [RFC v5 5/6] service: " Mattias Rönnblom
@ 2024-02-28 10:09                 ` Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-02-28 10:09 UTC (permalink / raw)
  To: dev; +Cc: hofors, Morten Brørup, Stephen Hemminger, Mattias Rönnblom

Keep per-lcore power intrinsics state in an lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 532a2e646b..23d1761f0a 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -4,6 +4,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -12,10 +13,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -170,7 +175,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -262,7 +267,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -301,8 +306,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
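
Unlike the service and random patches, which call RTE_LCORE_VAR_ALLOC()
from existing init code, this one allocates through RTE_LCORE_VAR_INIT().
Per the macro definitions in the first patch, that is merely a
constructor-wrapped allocation; a sketch, with a stand-in struct body:

#include <rte_common.h>
#include <rte_lcore_var.h>

struct power_wait_status {
	int placeholder; /* see the patch above for the real layout */
};

RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);

/* constructor-based allocation, as used in this patch... */
RTE_LCORE_VAR_INIT(wait_status);

/* ...which is shorthand for (cf. rte_lcore_var.h in the first patch):
 *
 *   RTE_INIT(rte_lcore_var_init_wait_status)
 *   {
 *           RTE_LCORE_VAR_ALLOC(wait_status);
 *   }
 */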

* RE: [RFC v5 1/6] eal: add static per-lcore memory allocation facility
  2024-02-28 10:09                 ` [RFC v5 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-03-19 12:52                   ` Konstantin Ananyev
  2024-03-20 10:24                     ` Mattias Rönnblom
  2024-05-06  8:27                   ` [RFC v6 0/6] Lcore variables Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Konstantin Ananyev @ 2024-03-19 12:52 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Morten Brørup, Stephen Hemminger


Hi Mattias,
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small chunks of often-used data, which is related logically, but where
> there are performance benefits to reap from having updates being local
> to an lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decouple the values' lifetime from that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoids excessive use of padding,
> polluting caches with unused data.

Thanks for the RFC, a very interesting one.
A few comments/questions below.

 
> RFC v5:
>  * In Doxygen, consistently use @<cmd> (and not \<cmd>).
>  * The RTE_LCORE_VAR_GET() and SET() convenience access macros
>    covered an uncommon use case, where the lcore value is of a
>    primitive type, rather than a struct, and is thus eliminated
>    from the API. (Morten Brørup)
>  * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
>    to RTE_LCORE_VAR_VALUE().
>  * The underscores are removed from __rte_lcore_var_lcore_ptr() to
>    signal that this function is a part of the public API.
>  * Macro arguments are documented.
> 
> RFC v4:
>  * Replace large static array with libc heap-allocated memory. One
>    implication of this change is there no longer exists a fixed upper
>    bound for the total amount of memory used by lcore variables.
>    RTE_MAX_LCORE_VAR has changed meaning, and now represent the
>    maximum size of any individual lcore variable value.
>  * Fix issues in example. (Morten Brørup)
>  * Improve access macro type checking. (Morten Brørup)
>  * Refer to the lcore variable handle as "handle" and not "name" in
>    various macros.
>  * Document lack of thread safety in rte_lcore_var_alloc().
>  * Provide API-level assurance the lcore variable handle is
>    always non-NULL, to allow applications to use NULL to mean
>    "not yet allocated".
>  * Note zero-sized allocations are not allowed.
>  * Give API-level guarantee the lcore variable values are zeroed.
> 
> RFC v3:
>  * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>  * Update example to reflect FOREACH macro name change (in RFC v2).
> 
> RFC v2:
>  * Use alignof to derive alignment requirements. (Morten Brørup)
>  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>  * Allow user-specified alignment, but limit max to cache line size.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  config/rte_config.h                   |   1 +
>  doc/api/doxy-api-index.md             |   1 +
>  lib/eal/common/eal_common_lcore_var.c |  68 +++++
>  lib/eal/common/meson.build            |   1 +
>  lib/eal/include/meson.build           |   1 +
>  lib/eal/include/rte_lcore_var.h       | 368 ++++++++++++++++++++++++++
>  lib/eal/version.map                   |   4 +
>  7 files changed, 444 insertions(+)
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
> 
> diff --git a/config/rte_config.h b/config/rte_config.h
> index d743a5c3d3..0dac33d3b9 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -41,6 +41,7 @@
>  /* EAL defines */
>  #define RTE_CACHE_GUARD_LINES 1
>  #define RTE_MAX_HEAPS 32
> +#define RTE_MAX_LCORE_VAR 1048576
>  #define RTE_MAX_MEMSEG_LISTS 128
>  #define RTE_MAX_MEMSEG_PER_LIST 8192
>  #define RTE_MAX_MEM_MB_PER_LIST 32768
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index 8c1eb8fafa..a3b8391570 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
>    [interrupts](@ref rte_interrupts.h),
>    [launch](@ref rte_launch.h),
>    [lcore](@ref rte_lcore.h),
> +  [lcore-variable](@ref rte_lcore_var.h),
>    [per-lcore](@ref rte_per_lcore.h),
>    [service cores](@ref rte_service.h),
>    [keepalive](@ref rte_keepalive.h),
> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
> new file mode 100644
> index 0000000000..5c353ebd46
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_log.h>
> +
> +#include <rte_lcore_var.h>
> +
> +#include "eal_private.h"
> +
> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> +
> +static void *lcore_buffer;
> +static size_t offset = RTE_MAX_LCORE_VAR;
> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +	void *handle;
> +	void *value;
> +
> +	offset = RTE_ALIGN_CEIL(offset, align);
> +
> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> +					     LCORE_BUFFER_SIZE);

Hmm... do I get it right: if offset is <= RTE_MAX_LCORE_VAR, and offset + size > RTE_MAX_LCORE_VAR,
we simply overwrite lcore_buffer with a newly allocated buffer of the same size?
I understand that you expect this just never to happen (the total size of all lcore vars never exceeds 1MB), but still
I think we need to handle it in some better way than just ignoring the possibility...
Maybe RTE_VERIFY() at least?
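
To spell out the scenario I have in mind (hypothetical numbers, for
illustration only):

	/* Suppose the current buffer is nearly full: offset == 1048000. */
	void *h = rte_lcore_var_alloc(1000, 8);
	/* 1048000 + 1000 > RTE_MAX_LCORE_VAR (1048576), so lcore_buffer
	 * is reassigned to a fresh LCORE_BUFFER_SIZE block, and 'h' ends
	 * up at offset 0 of the new block - with no record kept, as far
	 * as I can see, of the first block. */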

As a more generic question - do we need to support LCORE_VAR for dlopen()s that could happen after rte_eal_init()
has been called and the lcore threads have been created?
Because, if not, then we could probably make this construction much more flexible:
one buffer per lcore, allocation on demand, etc.

> +		RTE_VERIFY(lcore_buffer != NULL);
> +
> +		offset = 0;
> +	}
> +
> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
> +
> +	offset += size;
> +
> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
> +		memset(value, 0, size);
> +
> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> +		"%"PRIuPTR"-byte alignment", size, align);
> +
> +	return handle;
> +}
> +
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +{
> +	/* Having the per-lcore buffer size aligned on cache lines,
> +	 * as well as having the base pointer cache-line aligned,
> +	 * assures that aligned offsets also translate to aligned
> +	 * pointers across all values.
> +	 */
> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> +	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
> +
> +	/* '0' means asking for worst-case alignment requirements */
> +	if (align == 0)
> +		align = alignof(max_align_t);
> +
> +	RTE_ASSERT(rte_is_power_of_2(align));
> +
> +	return lcore_var_alloc(size, align);
> +}

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v5 1/6] eal: add static per-lcore memory allocation facility
  2024-03-19 12:52                   ` Konstantin Ananyev
@ 2024-03-20 10:24                     ` Mattias Rönnblom
  2024-03-20 14:18                       ` Konstantin Ananyev
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-03-20 10:24 UTC (permalink / raw)
  To: Konstantin Ananyev, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger

On 2024-03-19 13:52, Konstantin Ananyev wrote:
> 
> Hi Mattias,
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small chunks of often-used data, which is related logically, but where
>> there are performance benefits to reap from having updates being local
>> to an lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decouple the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar, in terms of functionality, to the
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoids excessive use of padding,
>> polluting caches with unused data.
> 
> Thanks for the RFC, a very interesting one.
> A few comments/questions below.
> 
>   
>> RFC v5:
>>   * In Doxygen, consistently use @<cmd> (and not \<cmd>).
>>   * The RTE_LCORE_VAR_GET() and SET() convenience access macros
>>     covered an uncommon use case, where the lcore value is of a
>>     primitive type, rather than a struct, and is thus eliminated
>>     from the API. (Morten Brørup)
>>   * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
>>     to RTE_LCORE_VAR_VALUE().
>>   * The underscores are removed from __rte_lcore_var_lcore_ptr() to
>>     signal that this function is a part of the public API.
>>   * Macro arguments are documented.
>>
>> RFC v4:
>>   * Replace large static array with libc heap-allocated memory. One
>>     implication of this change is there no longer exists a fixed upper
>>     bound for the total amount of memory used by lcore variables.
>>     RTE_MAX_LCORE_VAR has changed meaning, and now represents the
>>     maximum size of any individual lcore variable value.
>>   * Fix issues in example. (Morten Brørup)
>>   * Improve access macro type checking. (Morten Brørup)
>>   * Refer to the lcore variable handle as "handle" and not "name" in
>>     various macros.
>>   * Document lack of thread safety in rte_lcore_var_alloc().
>>   * Provide API-level assurance the lcore variable handle is
>>     always non-NULL, to allow applications to use NULL to mean
>>     "not yet allocated".
>>   * Note zero-sized allocations are not allowed.
>>   * Give API-level guarantee the lcore variable values are zeroed.
>>
>> RFC v3:
>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>
>> RFC v2:
>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>   * Allow user-specified alignment, but limit max to cache line size.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>>   config/rte_config.h                   |   1 +
>>   doc/api/doxy-api-index.md             |   1 +
>>   lib/eal/common/eal_common_lcore_var.c |  68 +++++
>>   lib/eal/common/meson.build            |   1 +
>>   lib/eal/include/meson.build           |   1 +
>>   lib/eal/include/rte_lcore_var.h       | 368 ++++++++++++++++++++++++++
>>   lib/eal/version.map                   |   4 +
>>   7 files changed, 444 insertions(+)
>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index d743a5c3d3..0dac33d3b9 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -41,6 +41,7 @@
>>   /* EAL defines */
>>   #define RTE_CACHE_GUARD_LINES 1
>>   #define RTE_MAX_HEAPS 32
>> +#define RTE_MAX_LCORE_VAR 1048576
>>   #define RTE_MAX_MEMSEG_LISTS 128
>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index 8c1eb8fafa..a3b8391570 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
>>     [interrupts](@ref rte_interrupts.h),
>>     [launch](@ref rte_launch.h),
>>     [lcore](@ref rte_lcore.h),
>> +  [lcore-variable](@ref rte_lcore_var.h),
>>     [per-lcore](@ref rte_per_lcore.h),
>>     [service cores](@ref rte_service.h),
>>     [keepalive](@ref rte_keepalive.h),
>> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
>> new file mode 100644
>> index 0000000000..5c353ebd46
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,68 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>> +
>> +static void *lcore_buffer;
>> +static size_t offset = RTE_MAX_LCORE_VAR;
>> +
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	void *handle;
>> +	void *value;
>> +
>> +	offset = RTE_ALIGN_CEIL(offset, align);
>> +
>> +	if (offset + size > RTE_MAX_LCORE_VAR) {
>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>> +					     LCORE_BUFFER_SIZE);
> 
> Hmm... do I get it right: if offset is <= RTE_MAX_LCORE_VAR, and offset + size > RTE_MAX_LCORE_VAR,
> we simply overwrite lcore_buffer with a newly allocated buffer of the same size?

No, it's just the pointer that is overwritten. The old buffer will 
remain in memory.
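
A sketch of what happens across such a roll-over (the variable names
are made up, for illustration):

static RTE_LCORE_VAR_HANDLE(int, early);
static RTE_LCORE_VAR_HANDLE(int, late);

RTE_INIT(roll_over_example)
{
	RTE_LCORE_VAR_ALLOC(early); /* served from the first buffer */
	/* ...allocations exhausting the first buffer... */
	RTE_LCORE_VAR_ALLOC(late);  /* served from a second, fresh buffer */
}

int foo_sum(void)
{
	/* Both handles remain dereferenceable: the first buffer is
	 * never free()d; only the 'lcore_buffer' pointer used for
	 * *future* allocations was replaced. */
	return *RTE_LCORE_VAR_VALUE(early) + *RTE_LCORE_VAR_VALUE(late);
}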

> I understand that you expect this just never to happen (the total size of all lcore vars never exceeds 1MB), but still
> I think we need to handle it in some better way than just ignoring the possibility...
> Maybe RTE_VERIFY() at least?
> 

In this revision of the patch set, RTE_MAX_LCORE_VAR does not represent 
an upper bound for the sum of all lcore variables' size, but rather only 
the maximum size of a single lcore variable.

Variable alignment and size constraints are RTE_ASSERT()ed at the point 
of allocation. One could argue they should be RTE_VERIFY()-ed instead, 
since there aren't any performance constraints.
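
Should one go down that route, the entry point could look something
like the below (a sketch only, not part of this patch):

void *
rte_lcore_var_alloc(size_t size, size_t align)
{
	/* Unlike RTE_ASSERT(), which is compiled out unless
	 * RTE_ENABLE_ASSERT is set, RTE_VERIFY() is always active,
	 * and allocation is off the fast path anyway. */
	RTE_VERIFY(size > 0 && size <= RTE_MAX_LCORE_VAR);
	RTE_VERIFY(align <= RTE_CACHE_LINE_SIZE);

	/* '0' means asking for worst-case alignment requirements */
	if (align == 0)
		align = alignof(max_align_t);

	RTE_VERIFY(rte_is_power_of_2(align));

	return lcore_var_alloc(size, align);
}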

> As a more generic question - do we need to support LCORE_VAR for dlopen()s that could happen after rte_eal_init()
> has been called and the lcore threads have been created?

Yes, allocations after rte_eal_init() (caused by dlopen() or otherwise) 
must be allowed imo, and are allowed. Otherwise applications sitting on 
top of DPDK can't use this facility.
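
For example, a hypothetical module loaded via dlopen() well after EAL
initialization can still do the below (all names made up):

struct plugin_lcore_state {
	uint64_t pkts_seen;
};

static RTE_LCORE_VAR_HANDLE(struct plugin_lcore_state, plugin_states);

/* An RTE_INIT() constructor runs when the shared object is loaded -
 * for a dlopen()ed plugin, long after the lcore threads exist. The
 * values are zeroed and immediately usable for all lcore ids. */
RTE_INIT(plugin_lcore_state_init)
{
	RTE_LCORE_VAR_ALLOC(plugin_states);
}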

> Because, if not, then we could probably make this construction much more flexible:
> one buffer per lcore, allocation on demand, etc.
> 

On-demand allocations are already supported, but one can't do free(). 
That's why I've called what this module provides "static allocation", 
while it may be more appropriately described as "dynamic allocation 
without deallocation".

"True" dynamic memory allocation of per-lcore memory would be very 
useful, but is an entirely different beast in terms of complexity and 
(if it is to be usable in the packet processing fast path) performance 
requirements.

"True" dynamic memory allocation would also result in something less 
compact (at least if you use the usual pattern with a per-object heap 
header).
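
To illustrate: a per-lcore heap would typically need something like
the below per object (a hypothetical layout), while the only per-value
overhead of this module is alignment padding:

struct lcore_heap_obj_hdr {
	size_t size;                     /* needed to implement free() */
	struct lcore_heap_obj_hdr *next; /* free list linkage */
};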

>> +		RTE_VERIFY(lcore_buffer != NULL);
>> +
>> +		offset = 0;
>> +	}
>> +
>> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
>> +
>> +	offset += size;
>> +
>> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
>> +		memset(value, 0, size);
>> +
>> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>> +		"%"PRIuPTR"-byte alignment", size, align);
>> +
>> +	return handle;
>> +}
>> +
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	/* Having the per-lcore buffer size aligned on cache lines,
>> +	 * as well as having the base pointer cache-line aligned,
>> +	 * assures that aligned offsets also translate to aligned
>> +	 * pointers across all values.
>> +	 */
>> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
>> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
>> +	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
>> +
>> +	/* '0' means asking for worst-case alignment requirements */
>> +	if (align == 0)
>> +		align = alignof(max_align_t);
>> +
>> +	RTE_ASSERT(rte_is_power_of_2(align));
>> +
>> +	return lcore_var_alloc(size, align);
>> +}

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v5 1/6] eal: add static per-lcore memory allocation facility
  2024-03-20 10:24                     ` Mattias Rönnblom
@ 2024-03-20 14:18                       ` Konstantin Ananyev
  0 siblings, 0 replies; 313+ messages in thread
From: Konstantin Ananyev @ 2024-03-20 14:18 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger



> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> >>
> >> The primary <rte_lcore_var.h> use case is for statically allocating
> >> small chunks of often-used data, which is related logically, but where
> >> there are performance benefits to reap from having updates being local
> >> to an lcore.
> >>
> >> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> >> _Thread_local), but decouple the values' lifetime from that of the
> >> threads.
> >>
> >> Lcore variables are also similar, in terms of functionality, to the
> >> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >> build-time machinery. DPCPU uses linker scripts, which effectively
> >> prevents the reuse of its, otherwise seemingly viable, approach.
> >>
> >> The currently-prevailing way to solve the same problem as lcore
> >> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> >> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> >> lcore variables over this approach is that data related to the same
> >> lcore now is close (spatially, in memory), rather than data used by
> >> the same module, which in turn avoids excessive use of padding,
> >> polluting caches with unused data.
> >
> > Thanks for the RFC, a very interesting one.
> > A few comments/questions below.
> >
> >
> >> RFC v5:
> >>   * In Doxygen, consistently use @<cmd> (and not \<cmd>).
> >>   * The RTE_LCORE_VAR_GET() and SET() convenience access macros
> >>     covered an uncommon use case, where the lcore value is of a
> >>     primitive type, rather than a struct, and is thus eliminated
> >>     from the API. (Morten Brørup)
> >>   * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
> >>     to RTE_LCORE_VAR_VALUE().
> >>   * The underscores are removed from __rte_lcore_var_lcore_ptr() to
> >>     signal that this function is a part of the public API.
> >>   * Macro arguments are documented.
> >>
> >> RFC v4:
> >>   * Replace large static array with libc heap-allocated memory. One
> >>     implication of this change is there no longer exists a fixed upper
> >>     bound for the total amount of memory used by lcore variables.
> >>     RTE_MAX_LCORE_VAR has changed meaning, and now represents the
> >>     maximum size of any individual lcore variable value.
> >>   * Fix issues in example. (Morten Brørup)
> >>   * Improve access macro type checking. (Morten Brørup)
> >>   * Refer to the lcore variable handle as "handle" and not "name" in
> >>     various macros.
> >>   * Document lack of thread safety in rte_lcore_var_alloc().
> >>   * Provide API-level assurance the lcore variable handle is
> >>     always non-NULL, to allow applications to use NULL to mean
> >>     "not yet allocated".
> >>   * Note zero-sized allocations are not allowed.
> >>   * Give API-level guarantee the lcore variable values are zeroed.
> >>
> >> RFC v3:
> >>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
> >>   * Update example to reflect FOREACH macro name change (in RFC v2).
> >>
> >> RFC v2:
> >>   * Use alignof to derive alignment requirements. (Morten Brørup)
> >>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
> >>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
> >>   * Allow user-specified alignment, but limit max to cache line size.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >> ---
> >>   config/rte_config.h                   |   1 +
> >>   doc/api/doxy-api-index.md             |   1 +
> >>   lib/eal/common/eal_common_lcore_var.c |  68 +++++
> >>   lib/eal/common/meson.build            |   1 +
> >>   lib/eal/include/meson.build           |   1 +
> >>   lib/eal/include/rte_lcore_var.h       | 368 ++++++++++++++++++++++++++
> >>   lib/eal/version.map                   |   4 +
> >>   7 files changed, 444 insertions(+)
> >>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
> >>   create mode 100644 lib/eal/include/rte_lcore_var.h
> >>
> >> diff --git a/config/rte_config.h b/config/rte_config.h
> >> index d743a5c3d3..0dac33d3b9 100644
> >> --- a/config/rte_config.h
> >> +++ b/config/rte_config.h
> >> @@ -41,6 +41,7 @@
> >>   /* EAL defines */
> >>   #define RTE_CACHE_GUARD_LINES 1
> >>   #define RTE_MAX_HEAPS 32
> >> +#define RTE_MAX_LCORE_VAR 1048576
> >>   #define RTE_MAX_MEMSEG_LISTS 128
> >>   #define RTE_MAX_MEMSEG_PER_LIST 8192
> >>   #define RTE_MAX_MEM_MB_PER_LIST 32768
> >> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> >> index 8c1eb8fafa..a3b8391570 100644
> >> --- a/doc/api/doxy-api-index.md
> >> +++ b/doc/api/doxy-api-index.md
> >> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
> >>     [interrupts](@ref rte_interrupts.h),
> >>     [launch](@ref rte_launch.h),
> >>     [lcore](@ref rte_lcore.h),
> >> +  [lcore-variable](@ref rte_lcore_var.h),
> >>     [per-lcore](@ref rte_per_lcore.h),
> >>     [service cores](@ref rte_service.h),
> >>     [keepalive](@ref rte_keepalive.h),
> >> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
> >> new file mode 100644
> >> index 0000000000..5c353ebd46
> >> --- /dev/null
> >> +++ b/lib/eal/common/eal_common_lcore_var.c
> >> @@ -0,0 +1,68 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2024 Ericsson AB
> >> + */
> >> +
> >> +#include <inttypes.h>
> >> +
> >> +#include <rte_common.h>
> >> +#include <rte_debug.h>
> >> +#include <rte_log.h>
> >> +
> >> +#include <rte_lcore_var.h>
> >> +
> >> +#include "eal_private.h"
> >> +
> >> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> >> +
> >> +static void *lcore_buffer;
> >> +static size_t offset = RTE_MAX_LCORE_VAR;
> >> +
> >> +static void *
> >> +lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +	void *handle;
> >> +	void *value;
> >> +
> >> +	offset = RTE_ALIGN_CEIL(offset, align);
> >> +
> >> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> >> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> +					     LCORE_BUFFER_SIZE);
> >
> > Hmm... do I get it right: if offset is <= RTE_MAX_LCORE_VAR, and offset + size > RTE_MAX_LCORE_VAR,
> > we simply overwrite lcore_buffer with a newly allocated buffer of the same size?
> 
> No, it's just the pointer that is overwritten. The old buffer will
> remain in memory.

Ah, OK, I missed that you changed the handle-to-pointer conversion in the new version too.
The handle is now not just an offset, but an actual pointer to the lcore 0 value, so all
we have to do is add the lcore id offset.
Makes sense, thanks for clarifying.
LGTM then.
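
For the record, the conversion then boils down to the fixed-stride add
from the patch:

static inline void *
rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
{
	/* The handle is the address of the lcore id 0 value; the values
	 * for all other lcore ids follow at RTE_MAX_LCORE_VAR strides. */
	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
}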
 

> 
> > I understand that you expect this just never to happen (the total size of all lcore vars never exceeds 1MB), but still
> > I think we need to handle it in some better way than just ignoring the possibility...
> > Maybe RTE_VERIFY() at least?
> >
> 
> In this revision of the patch set, RTE_MAX_LCORE_VAR does not represent
> an upper bound for the sum of all lcore variables' size, but rather only
> the maximum size of a single lcore variable.
> 
> Variable alignment and size constraints are RTE_ASSERT()ed at the point
> of allocation. One could argue they should be RTE_VERIFY()-ed instead,
> since there aren't any performance constraints.
> 
> > As a more generic question - do we need to support LCORE_VAR for dlopen()s that could happen after rte_eal_init()
> > has been called and the lcore threads have been created?
> 
> Yes, allocations after rte_eal_init() (caused by dlopen() or otherwise)
> must be allowed imo, and are allowed. Otherwise applications sitting on
> top of DPDK can't use this facility.
> 
> > Because, if not, then we could probably make this construction much more flexible:
> > one buffer per lcore, allocation on demand, etc.
> >
> 
> On-demand allocations are already supported, but one can't do free().
> That's why I've called what this module provides "static allocation",
> while it may be more appropriately described as "dynamic allocation
> without deallocation".
> 
> "True" dynamic memory allocation of per-lcore memory would be very
> useful, but is an entirely different beast in terms of complexity and
> (if it is to be usable in the packet processing fast path) performance
> requirements.
> 
> "True" dynamic memory allocation would also result in something less
> compact (at least if you use the usual pattern with a per-object heap
> header).
> 
> >> +		RTE_VERIFY(lcore_buffer != NULL);
> >> +
> >> +		offset = 0;
> >> +	}
> >> +
> >> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
> >> +
> >> +	offset += size;
> >> +
> >> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
> >> +		memset(value, 0, size);
> >> +
> >> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> >> +		"%"PRIuPTR"-byte alignment", size, align);
> >> +
> >> +	return handle;
> >> +}
> >> +
> >> +void *
> >> +rte_lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +	/* Having the per-lcore buffer size aligned on cache lines,
> >> +	 * as well as having the base pointer cache-line aligned,
> >> +	 * assures that aligned offsets also translate to aligned
> >> +	 * pointers across all values.
> >> +	 */
> >> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> >> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> >> +	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
> >> +
> >> +	/* '0' means asking for worst-case alignment requirements */
> >> +	if (align == 0)
> >> +		align = alignof(max_align_t);
> >> +
> >> +	RTE_ASSERT(rte_is_power_of_2(align));
> >> +
> >> +	return lcore_var_alloc(size, align);
> >> +}

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v6 0/6] Lcore variables
  2024-02-28 10:09                 ` [RFC v5 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-03-19 12:52                   ` Konstantin Ananyev
@ 2024-05-06  8:27                   ` Mattias Rönnblom
  2024-05-06  8:27                     ` [RFC v6 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                       ` (6 more replies)
  1 sibling, 7 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-05-06  8:27 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, Mattias Rönnblom

This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how best to allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean,
and seemingly very performant solution to a real problem.

One thing that is unclear to the author is how this API relates to a
potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Mattias Rönnblom (6):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 app/test/meson.build                  |   1 +
 app/test/test_lcore_var.c             | 432 ++++++++++++++++++++++++++
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  69 ++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/common/rte_random.c           |  28 +-
 lib/eal/common/rte_service.c          | 115 +++----
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 384 +++++++++++++++++++++++
 lib/eal/version.map                   |   3 +
 lib/eal/x86/rte_power_intrinsics.c    |  17 +-
 lib/power/rte_power_pmd_mgmt.c        |  34 +-
 13 files changed, 1000 insertions(+), 87 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v6 1/6] eal: add static per-lcore memory allocation facility
  2024-05-06  8:27                   ` [RFC v6 0/6] Lcore variables Mattias Rönnblom
@ 2024-05-06  8:27                     ` Mattias Rönnblom
  2024-09-10  7:03                       ` [PATCH 0/6] Lcore variables Mattias Rönnblom
  2024-05-06  8:27                     ` [RFC v6 2/6] eal: add lcore variable test suite Mattias Rönnblom
                                       ` (5 subsequent siblings)
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-05-06  8:27 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small chunks of often-used data, which is related logically, but where
there are performance benefits to reap from having updates being local
to an lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar, in terms of functionality, to the
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its, otherwise seemingly viable, approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore now is close (spatially, in memory), rather than data used by
the same module, which in turn avoids excessive use of padding,
polluting caches with unused data.
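
As a sketch of the difference (the struct and field names are
illustrative only):

/* Prevailing pattern: a module's data for all lcores is contiguous,
 * forcing cache-line alignment and guard padding per entry. */
struct __rte_cache_aligned foo_lcore_state {
	int a;
	long b;
	RTE_CACHE_GUARD;
};
static struct foo_lcore_state foo_states[RTE_MAX_LCORE];

/* Lcore variable equivalent: the alignment and guard can be dropped,
 * since neighboring data in a per-lcore buffer belongs to the same
 * lcore. */
struct foo_lcore_state_lv {
	int a;
	long b;
};
static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state_lv, foo_lcore_states);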

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and is thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 config/rte_config.h                   |   1 +
 doc/api/doxy-api-index.md             |   1 +
 lib/eal/common/eal_common_lcore_var.c |  69 +++++
 lib/eal/common/meson.build            |   1 +
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_lcore_var.h       | 384 ++++++++++++++++++++++++++
 lib/eal/version.map                   |   3 +
 7 files changed, 460 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/config/rte_config.h b/config/rte_config.h
index dd7bb0d35b..311692e498 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 8c1eb8fafa..a3b8391570 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..74ad8272ec
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer cache-line aligned,
+	 * assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..cfbcac41dd
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,384 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * copy for each current and future lcore id-equipped thread, with the
+ * total number of copies amounting to @c RTE_MAX_LCORE. The value of
+ * an lcore variable for a particular lcore id is independent from
+ * other values (for other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
+ *     module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races between
+ * the owning thread and any non-owner threads accessing the same
+ * lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never allocates.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * The lcore variable values are stored in a series of lcore buffers, which
+ * are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the
+ * case of an lcore variable instance, the thread most recently
+ * accessing nearby data structures should almost always be the lcore
+ * variable's owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions, and, for example, next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follows that of the particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. As a result, thread-local variables must be initialized in
+ *     a "lazy" manner (e.g., at the point of thread creation). Lcore
+ *     variables may be accessed immediately after having been
+ *     allocated (which may be prior to any thread beyond the main
+ *     thread running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction to DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable can be passed
+ *     to and successfully dereferenced by a non-owning thread) depends on
+ *     the details of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, such data sharing is supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define a lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various per-lcore id instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for a lcore variable.
+ *
+ * @param value
+ *   A pointer successively set to point to the lcore variable value
+ *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
+	for (unsigned int lcore_id =					\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The id of the variable, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 3df50c3fbb..7702642785 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,9 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
+	rte_lcore_var;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v6 2/6] eal: add lcore variable test suite
  2024-05-06  8:27                   ` [RFC v6 0/6] Lcore variables Mattias Rönnblom
  2024-05-06  8:27                     ` [RFC v6 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-05-06  8:27                     ` Mattias Rönnblom
  2024-05-06  8:27                     ` [RFC v6 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
                                       ` (4 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-05-06  8:27 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, Mattias Rönnblom

Add test suite to exercise the <rte_lcore_var.h> API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..846affa98c 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..e07d13460f
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
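+/* Enough uint32_t-sized lcore variables that their combined size
+ * exceeds one per-lcore buffer region, forcing the allocator to
+ * spill over into at least one additional lcore buffer.
+ */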
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
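+/* Verify that a value of the maximum allowed size (RTE_MAX_LCORE_VAR
+ * bytes) can be allocated, and that the per-lcore instances do not
+ * overlap.
+ */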
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v6 3/6] random: keep PRNG state in lcore variable
  2024-05-06  8:27                   ` [RFC v6 0/6] Lcore variables Mattias Rönnblom
  2024-05-06  8:27                     ` [RFC v6 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-05-06  8:27                     ` [RFC v6 2/6] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-05-06  8:27                     ` Mattias Rönnblom
  2024-05-06  8:27                     ` [RFC v6 4/6] power: keep per-lcore " Mattias Rönnblom
                                       ` (3 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-05-06  8:27 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v6 4/6] power: keep per-lcore state in lcore variable
  2024-05-06  8:27                   ` [RFC v6 0/6] Lcore variables Mattias Rönnblom
                                       ` (2 preceding siblings ...)
  2024-05-06  8:27                     ` [RFC v6 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-05-06  8:27                     ` Mattias Rönnblom
  2024-05-06  8:27                     ` [RFC v6 5/6] service: " Mattias Rönnblom
                                       ` (2 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-05-06  8:27 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

RFC v3:
 * Replace for loop with FOREACH macro.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/power/rte_power_pmd_mgmt.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a5139dd4f7 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v6 5/6] service: keep per-lcore state in lcore variable
  2024-05-06  8:27                   ` [RFC v6 0/6] Lcore variables Mattias Rönnblom
                                       ` (3 preceding siblings ...)
  2024-05-06  8:27                     ` [RFC v6 4/6] power: keep per-lcore " Mattias Rönnblom
@ 2024-05-06  8:27                     ` Mattias Rönnblom
  2024-05-06  8:27                     ` [RFC v6 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  2024-09-02 14:42                     ` [RFC v6 0/6] Lcore variables Morten Brørup
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-05-06  8:27 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/common/rte_service.c | 115 +++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 56379930b6..03379f1588 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,12 +102,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -122,7 +119,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +132,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +281,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +288,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +449,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +462,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +484,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +530,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +546,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +567,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +584,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +636,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +688,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +706,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +731,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +755,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +779,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +809,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +818,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +879,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +895,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +942,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +971,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +983,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1022,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [RFC v6 6/6] eal: keep per-lcore power intrinsics state in lcore variable
  2024-05-06  8:27                   ` [RFC v6 0/6] Lcore variables Mattias Rönnblom
                                       ` (4 preceding siblings ...)
  2024-05-06  8:27                     ` [RFC v6 5/6] service: " Mattias Rönnblom
@ 2024-05-06  8:27                     ` Mattias Rönnblom
  2024-09-02 14:42                     ` [RFC v6 0/6] Lcore variables Morten Brørup
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-05-06  8:27 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, Mattias Rönnblom

Keep per-lcore power intrinsics state in a lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [RFC v6 0/6] Lcore variables
  2024-05-06  8:27                   ` [RFC v6 0/6] Lcore variables Mattias Rönnblom
                                       ` (5 preceding siblings ...)
  2024-05-06  8:27                     ` [RFC v6 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
@ 2024-09-02 14:42                     ` Morten Brørup
  2024-09-10  6:41                       ` Mattias Rönnblom
  6 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-02 14:42 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: hofors, Stephen Hemminger, Konstantin Ananyev

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Monday, 6 May 2024 10.27
> 
> This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
> data allocation.
> 
> Please refer to the <rte_lcore_var.h> API documentation for both a
> rationale for this new API, and a comparison to the alternatives
> available.
> 
> The adoption of this API would affect many different DPDK modules, but
> the author updated only a few, mostly to serve as examples in this
> RFC, and to iron out some, but surely not all, wrinkles in the API.
> 
> The question on how to best allocate static per-lcore memory has been
> up several times on the dev mailing list, for example in the thread on
> "random: use per lcore state" RFC by Stephen Hemminger.
> 
> Lcore variables are surely not the answer to all your per-lcore-data
> needs, since it only allows for more-or-less static allocation. In the
> author's opinion, it does however provide a reasonably simple and
> clean and seemingly very much performant solution to a real problem.

This RFC is an improvement of the design pattern of allocating a RTE_MAX_LCORE sized array of structs per library, which typically introduces a lot of padding, and thus wastes L1 data cache.
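
To put a number on the padding waste, here is a hypothetical example (struct members invented for illustration; assuming 64-byte cache lines and RTE_CACHE_GUARD_LINES == 1) of the old pattern:

struct __rte_cache_aligned old_lcore_state {
	uint32_t pkts;      /* 4 bytes */
	uint64_t bytes[2];  /* 16 bytes */
	RTE_CACHE_GUARD;    /* one 64-byte guard line */
};

static struct old_lcore_state old_states[RTE_MAX_LCORE];

sizeof(struct old_lcore_state) is 128 bytes, of which only 24 (including internal padding) carry data; with RTE_MAX_LCORE == 128, the array occupies 16 KB, roughly 13 KB of it padding. An lcore variable holding the same state would store only the 24 data bytes per lcore id.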

I would like to see it as a patch getting into DPDK 24.11.

> 
> One thing that is unclear to the author is how this API relates to a
> potential future per-lcore dynamic allocator (e.g., a per-lcore heap).

Perfection is the enemy of progress.
Let's consider this a 1:1 upgrade of an existing design pattern, and not worry about how to broaden its scope in the future.

-Morten


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v6 0/6] Lcore variables
  2024-09-02 14:42                     ` [RFC v6 0/6] Lcore variables Morten Brørup
@ 2024-09-10  6:41                       ` Mattias Rönnblom
  2024-09-10 15:41                         ` Stephen Hemminger
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-10  6:41 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev

On 2024-09-02 16:42, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Monday, 6 May 2024 10.27
>>
>> This RFC presents a new API <rte_lcore_var.h> for static per-lcore id
>> data allocation.
>>
>> Please refer to the <rte_lcore_var.h> API documentation for both a
>> rationale for this new API, and a comparison to the alternatives
>> available.
>>
>> The adoption of this API would affect many different DPDK modules, but
>> the author updated only a few, mostly to serve as examples in this
>> RFC, and to iron out some, but surely not all, wrinkles in the API.
>>
>> The question on how to best allocate static per-lcore memory has been
>> up several times on the dev mailing list, for example in the thread on
>> "random: use per lcore state" RFC by Stephen Hemminger.
>>
>> Lcore variables are surely not the answer to all your per-lcore-data
>> needs, since it only allows for more-or-less static allocation. In the
>> author's opinion, it does however provide a reasonably simple and
>> clean and seemingly very much performant solution to a real problem.
> 
> This RFC is an improvement of the design pattern of allocating a RTE_MAX_LCORE sized array of structs per library, which typically introduces a lot of padding, and thus wastes L1 data cache.
> 
> I would like to see it as a patch getting into DPDK 24.11.
> 

I would be happy to develop and maintain this DPDK module.

I will submit this as a v1 PATCH.

>>
>> One thing that is unclear to the author is how this API relates to a
>> potential future per-lcore dynamic allocator (e.g., a per-lcore heap).
> 
> Perfection is the enemy of progress.
> Let's consider this a 1:1 upgrade of an existing design pattern, and not worry about how to broaden its scope in the future.
> 
> -Morten
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH 0/6] Lcore variables
  2024-05-06  8:27                     ` [RFC v6 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-10  7:03                       ` Mattias Rönnblom
  2024-09-10  7:03                         ` [PATCH 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                           ` (5 more replies)
  0 siblings, 6 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-10  7:03 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
patch set, and to iron out some, but surely not all, wrinkles in the API.

The question on how to best allocate static per-lcore memory has been
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, it does however provide a reasonably simple and
clean and seemingly very much performant solution to a real problem.

Mattias Rönnblom (6):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                            |   6 +
 app/test/meson.build                   |   1 +
 app/test/test_lcore_var.c              | 432 +++++++++++++++++++++++++
 config/rte_config.h                    |   1 +
 doc/api/doxy-api-index.md              |   1 +
 doc/guides/rel_notes/release_24_11.rst |  14 +
 lib/eal/common/eal_common_lcore_var.c  |  69 ++++
 lib/eal/common/meson.build             |   1 +
 lib/eal/common/rte_random.c            |  28 +-
 lib/eal/common/rte_service.c           | 115 ++++---
 lib/eal/include/meson.build            |   1 +
 lib/eal/include/rte_lcore_var.h        | 384 ++++++++++++++++++++++
 lib/eal/version.map                    |   3 +
 lib/eal/x86/rte_power_intrinsics.c     |  17 +-
 lib/power/rte_power_pmd_mgmt.c         |  34 +-
 15 files changed, 1020 insertions(+), 87 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH 1/6] eal: add static per-lcore memory allocation facility
  2024-09-10  7:03                       ` [PATCH 0/6] Lcore variables Mattias Rönnblom
@ 2024-09-10  7:03                         ` Mattias Rönnblom
  2024-09-10  9:32                           ` Morten Brørup
                                             ` (2 more replies)
  2024-09-10  7:03                         ` [PATCH 2/6] eal: add lcore variable test suite Mattias Rönnblom
                                           ` (4 subsequent siblings)
  5 siblings, 3 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-10  7:03 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in terms of functionality to the
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
structs. The benefit of lcore variables over this approach is that
data related to the same lcore is now close (spatially, in memory),
rather than data used by the same module. This in turn avoids
excessive use of padding, which pollutes caches with unused data.
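
As an illustration (a conceptual sketch, not part of the patch
itself), consider two modules A and B, each keeping one per-lcore
state struct:

/* Prevailing pattern -- state grouped per module:
 *
 *   a_states[RTE_MAX_LCORE]: | A0 | A1 | A2 | ...
 *   b_states[RTE_MAX_LCORE]: | B0 | B1 | B2 | ...
 *
 * Lcore 1's hot data (A1 and B1) lives in two unrelated memory
 * regions, each instance padded out to full cache lines.
 *
 * Lcore variables -- state grouped per lcore id:
 *
 *   lcore buffer: | A0 B0 (unused) | A1 B1 (unused) | ...
 *                   '- lcore id 0 -' '- lcore id 1 -'
 *
 * Lcore 1's hot data is contiguous within its own
 * RTE_MAX_LCORE_VAR-sized region, and no per-module padding is
 * needed to avoid false sharing.
 */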

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                            |   6 +
 config/rte_config.h                    |   1 +
 doc/api/doxy-api-index.md              |   1 +
 doc/guides/rel_notes/release_24_11.rst |  14 +
 lib/eal/common/eal_common_lcore_var.c  |  69 +++++
 lib/eal/common/meson.build             |   1 +
 lib/eal/include/meson.build            |   1 +
 lib/eal/include/rte_lcore_var.h        | 384 +++++++++++++++++++++++++
 lib/eal/version.map                    |   3 +
 9 files changed, 480 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c5a703b5c0..362d9a3f28 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index dd7bb0d35b..311692e498 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..07d7cbc66c 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore-variable](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..adb8eb404d 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,20 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each lcore.
+
+    With lcore variables, data is organized spatially on a per-lcore
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..74ad8272ec
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
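+
+/* Initializing the offset to its maximum value makes the first call
+ * to lcore_var_alloc() allocate the initial lcore buffer.
+ */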
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
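+	/* If the value does not fit in the remainder of the current
+	 * buffer's per-lcore region, allocate a new buffer. Old
+	 * buffers are never freed, since values handed out from them
+	 * may still be in use.
+	 */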
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
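+	/* Zero the value for every lcore id, as guaranteed by the API. */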
+	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer aligned on cache line
+	 * size, assures that aligned offsets also translate to
+	 * aligned pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..7d3178c424
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,384 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Per-lcore id variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * copy for each current and future lcore id-equipped thread, with the
+ * total number of copies amounting to @c RTE_MAX_LCORE. The value of
+ * an lcore variable for a particular lcore id is independent from
+ * other values (for other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define a lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
+ *     module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races between
+ * the owning thread and any non-owner threads accessing the same
+ * lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never allocates.
+ *
+ * The size of a lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * The lcore variable values are stored in a series of lcore buffers,
+ * which
+ * are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, the thread most recently accessing nearby
+ * data structures should almost always be the lcore variable's
+ * owner. Adding padding will increase the effective memory working
+ * set size, potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this forces the
+ * use of cache-line alignment to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (features which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions, and for example next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of the particular thread. The data cannot
+ *     be accessed before the thread has been created, nor after it
+ *     has exited. As a result, thread-local variables must be
+ *     initialized in a "lazy" manner (e.g., at the point of thread
+ *     creation). Lcore variables may be accessed immediately after
+ *     having been allocated (which may be prior to any thread beyond
+ *     the main thread running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread and GCC _Thread_local, such data sharing is
+ *     supported. In the C11 standard, the result of accessing another
+ *     thread's _Thread_local object is implementation-defined. Lcore
+ *     variable instances may be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
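+ *
+ * Example (a sketch; the struct type and handle name are illustrative
+ * only):
+ * @code{.c}
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * RTE_INIT(foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ * }
+ * @endcode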
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
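+	/*
+	 * The value instances of an lcore variable are spaced
+	 * RTE_MAX_LCORE_VAR bytes apart in the lcore buffer, with the
+	 * handle holding the address of the lcore id 0 instance.
+	 */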
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
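+ *
+ * Example (a sketch, reusing the handle from the earlier example):
+ * @code{.c}
+ * struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ * @endcode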
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param value
+ *   A pointer successively set to point to the lcore variable value
+ *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
+ * @param handle
+ *   The lcore variable handle.
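+ *
+ * Example (a sketch; @c handle and @c size are assumed to be the
+ * variable's handle and value size, respectively):
+ * @code{.c}
+ * void *value;
+ *
+ * RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
+ *         memset(value, 0, size);
+ * @endcode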
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
+	for (unsigned int lcore_id =					\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The id of the variable, stored in a void pointer value. The value
+ *   is always non-NULL.
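+ *
+ * Example (a sketch; this call is what @ref RTE_LCORE_VAR_ALLOC
+ * expands to for a given handle):
+ * @code{.c}
+ * handle = rte_lcore_var_alloc(sizeof(*handle), alignof(typeof(*handle)));
+ * @endcode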
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..5f5a3522c0 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,9 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
+	rte_lcore_var;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH 2/6] eal: add lcore variable test suite
  2024-09-10  7:03                       ` [PATCH 0/6] Lcore variables Mattias Rönnblom
  2024-09-10  7:03                         ` [PATCH 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-10  7:03                         ` Mattias Rönnblom
  2024-09-10  7:03                         ` [PATCH 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
                                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-10  7:03 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Add test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..e07d13460f
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* a private, larger struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
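+/*
+ * A variable count chosen so that the allocations should span more
+ * than one lcore buffer, each of which can hold at most
+ * RTE_MAX_LCORE_VAR bytes' worth of variable values per lcore id.
+ */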
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH 3/6] random: keep PRNG state in lcore variable
  2024-09-10  7:03                       ` [PATCH 0/6] Lcore variables Mattias Rönnblom
  2024-09-10  7:03                         ` [PATCH 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-10  7:03                         ` [PATCH 2/6] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-09-10  7:03                         ` Mattias Rönnblom
  2024-09-10  7:03                         ` [PATCH 4/6] power: keep per-lcore " Mattias Rönnblom
                                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-10  7:03 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH 4/6] power: keep per-lcore state in lcore variable
  2024-09-10  7:03                       ` [PATCH 0/6] Lcore variables Mattias Rönnblom
                                           ` (2 preceding siblings ...)
  2024-09-10  7:03                         ` [PATCH 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-09-10  7:03                         ` Mattias Rönnblom
  2024-09-10  7:03                         ` [PATCH 5/6] service: " Mattias Rönnblom
  2024-09-10  7:03                         ` [PATCH 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-10  7:03 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a5139dd4f7 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH 5/6] service: keep per-lcore state in lcore variable
  2024-09-10  7:03                       ` [PATCH 0/6] Lcore variables Mattias Rönnblom
                                           ` (3 preceding siblings ...)
  2024-09-10  7:03                         ` [PATCH 4/6] power: keep per-lcore " Mattias Rönnblom
@ 2024-09-10  7:03                         ` Mattias Rönnblom
  2024-09-10  7:03                         ` [PATCH 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-10  7:03 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 115 +++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 56379930b6..03379f1588 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,12 +102,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -122,7 +119,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +132,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +281,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +288,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +449,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +462,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +484,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +530,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +546,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +567,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +584,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +636,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +688,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +706,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +731,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +755,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +779,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +809,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +818,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +879,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +895,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +942,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +971,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +983,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1022,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH 6/6] eal: keep per-lcore power intrinsics state in lcore variable
  2024-09-10  7:03                       ` [PATCH 0/6] Lcore variables Mattias Rönnblom
                                           ` (4 preceding siblings ...)
  2024-09-10  7:03                         ` [PATCH 5/6] service: " Mattias Rönnblom
@ 2024-09-10  7:03                         ` Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-10  7:03 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Keep per-lcore power intrinsics state in a lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH 1/6] eal: add static per-lcore memory allocation facility
  2024-09-10  7:03                         ` [PATCH 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-10  9:32                           ` Morten Brørup
  2024-09-10 10:44                             ` Mattias Rönnblom
  2024-09-11 10:32                           ` Morten Brørup
  2024-09-11 17:04                           ` [PATCH v2 0/6] Lcore variables Mattias Rönnblom
  2 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-10  9:32 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Tuesday, 10 September 2024 09.04
> 
> Introduce DPDK per-lcore id variables, or lcore variables for short.

Throughout the descriptions and comments,
please replace "lcore id" with "lcore" (e.g. "per-lcore variables"),
when referring to the lcore, and not the index of the lcore.
(Your intention might be to highlight that it only covers threads with an lcore id,
but if that wasn't the case, you would refer to them as "threads" not "lcores".)
Except, of course, when referring to an actual lcore id, e.g. lcore_id function parameters.

Paraphrasing:
Consider the type of what you are referring to;
use "lcore" if its type is "thread", and
use "lcore id" if its type is "int".

I might be wrong here, but please think hard about it.

> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small, frequently-accessed data structures, for which one instance
> should exist for each lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' life time with that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoid excessive use of padding,
> polluting caches with unused data.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> --

> +++ b/doc/api/doxy-api-index.md
> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
>    [interrupts](@ref rte_interrupts.h),
>    [launch](@ref rte_launch.h),
>    [lcore](@ref rte_lcore.h),
> +  [lcore-varible](@ref rte_lcore_var.h),

Typo: varible -> variable


> +++ b/doc/guides/rel_notes/release_24_11.rst
> @@ -55,6 +55,20 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
> 
> +* **Added EAL per-lcore static memory allocation facility.**
> +
> +    Added EAL API <rte_lcore_var.h> for statically allocating small,
> +    frequently-accessed data structures, for which one instance should
> +    exist for each lcore.
> +
> +    With lcore variables, data is organized spatially on a per-lcore
> +    basis, rather than per library or PMD, avoiding the need for cache
> +    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
> +    reduces CPU cache internal fragmentation, improving performance.
> +
> +    Lcore variables are similar to thread-local storage (TLS, e.g.,
> +    C11 _Thread_local), but decoupling the values' life time from that
> +    of the threads.

When referring to TLS, you might want to clarify that lcore variables are not instantiated for unregistered threads.


> +static void *lcore_buffer;
> +static size_t offset = RTE_MAX_LCORE_VAR;
> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +	void *handle;
> +	void *value;
> +
> +	offset = RTE_ALIGN_CEIL(offset, align);
> +
> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> +					     LCORE_BUFFER_SIZE);
> +		RTE_VERIFY(lcore_buffer != NULL);
> +
> +		offset = 0;
> +	}

To determine if the lcore_buffer memory should be allocated, why not just check if lcore_buffer == NULL?
Then offset wouldn't need an initial value of RTE_MAX_LCORE_VAR.

> +
> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
> +
> +	offset += size;
> +
> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
> +		memset(value, 0, size);
> +
> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> +		"%"PRIuPTR"-byte alignment", size, align);
> +
> +	return handle;
> +}


> +/**
> + * @file
> + *
> + * RTE Per-lcore id variables

Suggest mentioning the short form too, e.g.:
"RTE Per-lcore id variables (RTE Lcore variables)"

> + *
> + * This API provides a mechanism to create and access per-lcore id
> + * variables in a space- and cycle-efficient manner.
> + *
> + * A per-lcore id variable (or lcore variable for short) has one value
> + * for each EAL thread and registered non-EAL thread.

And service thread.

> + * There is one
> + * copy for each current and future lcore id-equipped thread, with the

"one copy" -> "one instance"

> + * total number of copies amounting to @c RTE_MAX_LCORE. The value of

"copies" -> "instances"

> + * an lcore variable for a particular lcore id is independent from
> + * other values (for other lcore ids) within the same lcore variable.
> + *
> + * In order to access the values of an lcore variable, a handle is
> + * used. The type of the handle is a pointer to the value's type
> + * (e.g., for @c uint32_t lcore variable, the handle is a
> + * <code>uint32_t *</code>. The handler type is used to inform the

Typo: "handler" -> "handle", I think :-/
Found this typo multiple times; search-replace.

> + * access macros the type of the values. A handle may be passed
> + * between modules and threads just like any pointer, but its value
> + * must be treated as a an opaque identifier. An allocated handle
> + * never has the value NULL.
> + *
> + * @b Creation
> + *
> + * An lcore variable is created in two steps:
> + *  1. Define a lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
> + *  2. Allocate lcore variable storage and initialize the handle with
> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs the time of
> + *     module initialization, but may be done at any time.
> + *
> + * An lcore variable is not tied to the owning thread's lifetime. It's
> + * available for use by any thread immediately after having been
> + * allocated, and continues to be available throughout the lifetime of
> + * the EAL.
> + *
> + * Lcore variables cannot and need not be freed.
> + *
> + * @b Access
> + *
> + * The value of any lcore variable for any lcore id may be accessed
> + * from any thread (including unregistered threads), but it should
> + * only be *frequently* read from or written to by the owner.
> + *
> + * Values of the same lcore variable but owned by to different lcore

Typo: to -> two

> + * ids may be frequently read or written by the owners without risking
> + * false sharing.
> + *
> + * An appropriate synchronization mechanism (e.g., atomic loads and
> + * stores) should employed to assure there are no data races between
> + * the owning thread and any non-owner threads accessing the same
> + * lcore variable instance.
> + *
> + * The value of the lcore variable for a particular lcore id is
> + * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
> + *
> + * A common pattern is for an EAL thread or a registered non-EAL
> + * thread to access its own lcore variable value. For this purpose, a
> + * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
> + *
> + * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
> + * pointer with the same type as the value, it may not be directly
> + * dereferenced and must be treated as an opaque identifier.
> + *
> + * Lcore variable handles and value pointers may be freely passed
> + * between different threads.
> + *
> + * @b Storage
> + *
> + * An lcore variable's values may by of a primitive type like @c int,

Two typos: "values may by" -> "value may be"

> + * but would more typically be a @c struct.
> + *
> + * The lcore variable handle introduces a per-variable (not
> + * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
> + * there are some memory footprint gains to be made by organizing all
> + * per-lcore id data for a particular module as one lcore variable
> + * (e.g., as a struct).
> + *
> + * An application may choose to define an lcore variable handle, which
> + * it then it goes on to never allocate.
> + *
> + * The size of a lcore variable's value must be less than the DPDK
> + * build-time constant @c RTE_MAX_LCORE_VAR.
> + *
> + * The lcore variable are stored in a series of lcore buffers, which
> + * are allocated from the libc heap. Heap allocation failures are
> + * treated as fatal.
> + *
> + * Lcore variables should generally *not* be @ref __rte_cache_aligned
> + * and need *not* include a @ref RTE_CACHE_GUARD field, since the use
> + * of these constructs are designed to avoid false sharing. In the
> + * case of an lcore variable instance, the thread most recently
> + * accessing nearby data structures should almost-always the lcore

Missing word: should almost-always *be* the lcore variables' owner.


> + * variables' owner. Adding padding will increase the effective memory
> + * working set size, potentially reducing performance.
> + *
> + * Lcore variable values take on an initial value of zero.
> + *
> + * @b Example
> + *
> + * Below is an example of the use of an lcore variable:
> + *
> + * @code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + * };
> + *
> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
> + *
> + * long foo_get_a_plus_b(void)
> + * {
> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
> + *
> + *         return state->a + state->b;
> + * }
> + *
> + * RTE_INIT(rte_foo_init)
> + * {
> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
> + *
> + *         struct foo_lcore_state *state;
> + *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
> + *                 (initialize 'state')

Consider: (initialize 'state') -> /* initialize 'state' */

> + *         }
> + *
> + *         (other initialization)

Consider: (other initialization) -> /* other initialization */

> + * }
> + * @endcode
> + *
> + *
> + * @b Alternatives
> + *
> + * Lcore variables are designed to replace a pattern exemplified below:
> + * @code{.c}
> + * struct __rte_cache_aligned foo_lcore_state {
> + *         int a;
> + *         long b;
> + *         RTE_CACHE_GUARD;
> + * };
> + *
> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
> + * @endcode
> + *
> + * This scheme is simple and effective, but has one drawback: the data
> + * is organized so that objects related to all lcores for a particular
> + * module is kept close in memory. At a bare minimum, this forces the
> + * use of cache-line alignment to avoid false sharing. With CPU

Consider adding: use of *padding to* cache-line alignment
My point here is:
This sentence should somehow include the word "padding".
This paragraph is not only about cache line alignment, it is primarily about padding.

> + * hardware prefetching and memory loads resulting from speculative
> + * execution (functions which seemingly are getting more eager faster
> + * than they are getting more intelligent), one or more "guard" cache
> + * lines may be required to separate one lcore's data from another's.
> + *
> + * Lcore variables has the upside of working with, not against, the

Typo: has -> have

> + * CPU's assumptions and for example next-line prefetchers may well
> + * work the way its designers intended (i.e., to the benefit, not
> + * detriment, of system performance).
> + *
> + * Another alternative to @ref rte_lcore_var.h is the @ref
> + * rte_per_lcore.h API, which make use of thread-local storage (TLS,

Typo: make -> makes

> + * e.g., GCC __thread or C11 _Thread_local). The main differences
> + * between using the various forms of TLS (e.g., @ref
> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
> + * variables are:
> + *
> + *   * The existence and non-existence of a thread-local variable
> + *     instance follow that of particular thread's. The data cannot be

Typo: "thread's" -> "threads", I think. :-/

> + *     accessed before the thread has been created, nor after it has
> + *     exited. As a result, thread-local variables must initialized in

Missing word: must *be* initialized

> + *     a "lazy" manner (e.g., at the point of thread creation). Lcore
> + *     variables may be accessed immediately after having been
> + *     allocated (which may be prior to any thread beyond the main
> + *     thread is running).
> + *   * A thread-local variable is duplicated across all threads in the
> + *     process, including unregistered non-EAL threads (i.e.,
> + *     "regular" threads). For DPDK applications heavily relying on
> + *     multi-threading (in conjunction with DPDK's "one thread per core"
> + *     pattern), either by having many concurrent threads or
> + *     creating/destroying threads at a high rate, an excessive use of
> + *     thread-local variables may cause inefficiencies (e.g.,
> + *     increased thread creation overhead due to thread-local storage
> + *     initialization or increased total RAM footprint usage). Lcore
> + *     variables *only* exist for threads with an lcore id.
> + *   * If data in thread-local storage may be shared between threads
> + *     (i.e., can a pointer to a thread-local variable be passed to
> + *     and successfully dereferenced by non-owning thread) depends on
> + *     the details of the TLS implementation. With GCC __thread and
> + *     GCC _Thread_local, such data sharing is supported. In the C11
> + *     standard, the result of accessing another thread's
> + *     _Thread_local object is implementation-defined. Lcore variable
> + *     instances may be accessed reliably by any thread.
> + */
> +
> +#include <stddef.h>
> +#include <stdalign.h>
> +
> +#include <rte_common.h>
> +#include <rte_config.h>
> +#include <rte_lcore.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * Given the lcore variable type, produces the type of the lcore
> + * variable handle.
> + */
> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
> +	type *
> +
> +/**
> + * Define a lcore variable handle.

Typo: "a lcore" -> "an lcore"
Found this typo multiple times; search-replace "a lcore".

> + *
> + * This macro defines a variable which is used as a handle to access
> + * the various per-lcore id instances of a per-lcore id variable.

Suggest:
"the various per-lcore id instances of a per-lcore id variable" ->
"the various instances of a per-lcore id variable"

> + *
> + * The aim with this macro is to make clear at the point of
> + * declaration that this is an lcore handler, rather than a regular
> + * pointer.
> + *
> + * Add @b static as a prefix in case the lcore variable are only to be

Typo: are -> is

> + * accessed from a particular translation unit.
> + */
> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle.
> + *
> + * The values of the lcore variable are initialized to zero.

Consider adding: "the lcore variable *instances* are initialized"
Found this typo multiple times; search-replace.

> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
> +	handle = rte_lcore_var_alloc(size, align)
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle,
> + * with values aligned for any type of object.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
> +
> +/**
> + * Allocate space for an lcore variable of the size and alignment
> requirements
> + * suggested by the handler pointer type, and initialize its handle.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_ALLOC(handle)					\
> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
> +				       alignof(typeof(*(handle))))
> +
> +/**
> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
> + * means of a @ref RTE_INIT constructor.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
> +	}
> +
> +/**
> + * Allocate an explicitly-sized lcore variable by means of a @ref
> + * RTE_INIT constructor.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
> +
> +/**
> + * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_INIT(name)					\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC(name);				\
> +	}
> +
> +/**
> + * Get void pointer to lcore variable instance with the specified
> + * lcore id.
> + *
> + * @param lcore_id
> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> + *   instances should be accessed. The lcore id need not be valid
> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
> + *   is also not valid (and thus should not be dereferenced).
> + * @param handle
> + *   The lcore variable handle.
> + */
> +static inline void *
> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> +{
> +	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
> +}
> +
> +/**
> + * Get pointer to lcore variable instance with the specified lcore id.
> + *
> + * @param lcore_id
> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> + *   instances should be accessed. The lcore id need not be valid
> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
> + *   is also not valid (and thus should not be dereferenced).
> + * @param handle
> + *   The lcore variable handle.
> + */
> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
> +
> +/**
> + * Get pointer to lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_VALUE(handle) \
> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> +
> +/**
> + * Iterate over each lcore id's value for a lcore variable.
> + *
> + * @param value
> + *   A pointer set successivly set to point to lcore variable value

"set successivly set" -> "successivly set"

Thinking out loud, ignore at your preference:
During the RFC discussions, the term used for referring to an lcore variable was discussed;
we considered "pointer", but settled for "value".
Perhaps "instance" would be usable in comments like like the one describing this function...
"A pointer set successivly set to point to lcore variable value" ->
"A pointer set successivly set to point to lcore variable instance".
I don't know.


> + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
> + * @param handle
> + *   The lcore variable handle.
> + */
> +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
> +	for (unsigned int lcore_id =					\
> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
> +	     lcore_id < RTE_MAX_LCORE;					\
> +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
> +
> +/**
> + * Allocate space in the per-lcore id buffers for a lcore variable.
> + *
> + * The pointer returned is only an opaque identifier of the variable. To
> + * get an actual pointer to a particular instance of the variable use
> + * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
> + *
> + * The lcore variable values' memory is set to zero.
> + *
> + * The allocation is always successful, barring a fatal exhaustion of
> + * the per-lcore id buffer space.
> + *
> + * rte_lcore_var_alloc() is not multi-thread safe.
> + *
> + * @param size
> + *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
> + * @param align
> + *   If 0, the values will be suitably aligned for any kind of type
> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
> + *   on a multiple of *align*, which must be a power of 2 and equal or
> + *   less than @c RTE_CACHE_LINE_SIZE.
> + * @return
> + *   The id of the variable, stored in a void pointer value. The value

"id" -> "handle"

> + *   is always non-NULL.
> + */
> +__rte_experimental
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_LCORE_VAR_H_ */
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index e3ff412683..5f5a3522c0 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -396,6 +396,9 @@ EXPERIMENTAL {
> 
>  	# added in 24.03
>  	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
> +
> +	rte_lcore_var_alloc;
> +	rte_lcore_var;

No such function: rte_lcore_var

>  };
> 
>  INTERNAL {
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH 1/6] eal: add static per-lcore memory allocation facility
  2024-09-10  9:32                           ` Morten Brørup
@ 2024-09-10 10:44                             ` Mattias Rönnblom
  2024-09-10 13:07                               ` Morten Brørup
  2024-09-10 15:55                               ` Stephen Hemminger
  0 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-10 10:44 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand

On 2024-09-10 11:32, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Tuesday, 10 September 2024 09.04
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> Throughout the descriptions and comments,
> please replace "lcore id" with "lcore" (e.g. "per-lcore variables"),
> when referring to the lcore, and not the index of the lcore.
> (Your intention might be to highlight that it only covers threads with an lcore id,
> but if that wasn't the case, you would refer to them as "threads" not "lcores".)
> Except, of course, when referring to an actual lcore id, e.g. lcore_id function parameters.

"lcore" is just another word for "EAL thread." The lcore variables exist 
in one instance for every thread with an lcore id, thus also for 
registered non-EAL threads (i.e., threads which are not lcores).

I've tried to summarize the (very confusing) terminology of DPDK's 
threading model here:
https://ericsson.github.io/dataplanebook/threading/threading.html#eal-threads

So, in my world, "per-lcore id variables" is pretty accurate. You could 
say "variables with per-lcore id values" if you want to make it even 
more clear what's going on.
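
To make the distinction concrete, here is a minimal sketch (with a
made-up 'counter' handle, assumed to be allocated elsewhere) of a
plain thread acquiring an lcore id via rte_thread_register() before
touching an lcore variable:

static RTE_LCORE_VAR_HANDLE(int, counter);

static void *
worker(void *arg)
{
	(void)arg;

	/* At this point rte_lcore_id() == LCORE_ID_ANY; no lcore
	 * variable instance may be accessed. */
	if (rte_thread_register() == 0) {
		/* The thread now has an lcore id, and thus owns one
		 * instance of every lcore variable. */
		(*RTE_LCORE_VAR_VALUE(counter))++;
	}

	return NULL;
}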

> 
> Paraphrasing:
> Consider the type of what you are referring to;
> use "lcore" if its type is "thread", and
> use "lcore id" if its type is "int".
> 
> I might be wrong here, but please think hard about it.
> 
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small, frequently-accessed data structures, for which one instance
>> should exist for each lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' life time from that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality provided by
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoids excessive use of padding,
>> polluting caches with unused data.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>
>> --
> 
>> +++ b/doc/api/doxy-api-index.md
>> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
>>     [interrupts](@ref rte_interrupts.h),
>>     [launch](@ref rte_launch.h),
>>     [lcore](@ref rte_lcore.h),
>> +  [lcore-varible](@ref rte_lcore_var.h),
> 
> Typo: varible -> variable
> 
> 

I'll change it to "lcore variables" (no dash, plural).

>> +++ b/doc/guides/rel_notes/release_24_11.rst
>> @@ -55,6 +55,20 @@ New Features
>>        Also, make sure to start the actual text at the margin.
>>        =======================================================
>>
>> +* **Added EAL per-lcore static memory allocation facility.**
>> +
>> +    Added EAL API <rte_lcore_var.h> for statically allocating small,
>> +    frequently-accessed data structures, for which one instance should
>> +    exist for each lcore.
>> +
>> +    With lcore variables, data is organized spatially on a per-lcore
>> +    basis, rather than per library or PMD, avoiding the need for cache
>> +    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
>> +    reduces CPU cache internal fragmentation, improving performance.
>> +
>> +    Lcore variables are similar to thread-local storage (TLS, e.g.,
>> +    C11 _Thread_local), but decoupling the values' life time from that
>> +    of the threads.
> 
> When referring to TLS, you might want to clarify that lcore variables are not instantiated for unregistered threads.
> 

Isn't that clear from the first paragraph? Although it should say "per 
lcore id", rather than "per lcore."

> 
>> +static void *lcore_buffer;
>> +static size_t offset = RTE_MAX_LCORE_VAR;
>> +
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	void *handle;
>> +	void *value;
>> +
>> +	offset = RTE_ALIGN_CEIL(offset, align);
>> +
>> +	if (offset + size > RTE_MAX_LCORE_VAR) {
>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>> +					     LCORE_BUFFER_SIZE);
>> +		RTE_VERIFY(lcore_buffer != NULL);
>> +
>> +		offset = 0;
>> +	}
> 
> To determine if the lcore_buffer memory should be allocated, why not just check if lcore_buffer == NULL?

Because it may be the case that lcore_buffer is non-NULL but the remaining
space is too small to service the allocation.
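
To illustrate, a self-contained sketch of that logic (illustrative
names and a deliberately tiny buffer size, not the actual
implementation; align is assumed to be a power of 2):

#include <stdalign.h>
#include <stdlib.h>

#define MAX_VAR 64 /* stand-in for RTE_MAX_LCORE_VAR */

static void *buffer;            /* current buffer */
static size_t offset = MAX_VAR; /* forces an allocation on first use */

static void *
sketch_alloc(size_t size, size_t align)
{
	offset = (offset + align - 1) & ~(align - 1);

	if (offset + size > MAX_VAR) {
		/* A NULL check alone would not do: the buffer may
		 * exist, but with too little room left. Any unused
		 * tail of the old buffer is simply abandoned. */
		buffer = aligned_alloc(alignof(max_align_t), MAX_VAR);
		offset = 0;
	}

	void *value = (char *)buffer + offset;
	offset += size;

	return value;
}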

> Then offset wouldn't need an initial value of RTE_MAX_LCORE_VAR.
> 
>> +
>> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
>> +
>> +	offset += size;
>> +
>> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
>> +		memset(value, 0, size);
>> +
>> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>> +		"%"PRIuPTR"-byte alignment", size, align);
>> +
>> +	return handle;
>> +}
> 
> 
>> +/**
>> + * @file
>> + *
>> + * RTE Per-lcore id variables
> 
> Suggest mentioning the short form too, e.g.:
> "RTE Per-lcore id variables (RTE Lcore variables)"

What about just "RTE Lcore variables"?

Exactly what they are is thoroughly described in the text that follows.

> 
>> + *
>> + * This API provides a mechanism to create and access per-lcore id
>> + * variables in a space- and cycle-efficient manner.
>> + *
>> + * A per-lcore id variable (or lcore variable for short) has one value
>> + * for each EAL thread and registered non-EAL thread.
> 
> And service thread.

Service threads are EAL threads, or, at a bare minimum, must have an
lcore id, and thus must be registered.

> 
>> + * There is one
>> + * copy for each current and future lcore id-equipped thread, with the
> 
> "one copy" -> "one instance"
> 

Fixed.

>> + * total number of copies amounting to @c RTE_MAX_LCORE. The value of
> 
> "copies" -> "instances"
> 

OK, I'll rephrase that sentence.

>> + * an lcore variable for a particular lcore id is independent from
>> + * other values (for other lcore ids) within the same lcore variable.
>> + *
>> + * In order to access the values of an lcore variable, a handle is
>> + * used. The type of the handle is a pointer to the value's type
>> + * (e.g., for @c uint32_t lcore variable, the handle is a
>> + * <code>uint32_t *</code>. The handler type is used to inform the
> 
> Typo: "handler" -> "handle", I think :-/
> Found this typo multiple times; search-replace.

Fixed.

> 
>> + * access macros the type of the values. A handle may be passed
>> + * between modules and threads just like any pointer, but its value
>> + * must be treated as an opaque identifier. An allocated handle
>> + * never has the value NULL.
>> + *
>> + * @b Creation
>> + *
>> + * An lcore variable is created in two steps:
>> + *  1. Define a lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
>> + *  2. Allocate lcore variable storage and initialize the handle with
>> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
>> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
>> + *     module initialization, but may be done at any time.
>> + *
>> + * An lcore variable is not tied to the owning thread's lifetime. It's
>> + * available for use by any thread immediately after having been
>> + * allocated, and continues to be available throughout the lifetime of
>> + * the EAL.
>> + *
>> + * Lcore variables cannot and need not be freed.
>> + *
>> + * @b Access
>> + *
>> + * The value of any lcore variable for any lcore id may be accessed
>> + * from any thread (including unregistered threads), but it should
>> + * only be *frequently* read from or written to by the owner.
>> + *
>> + * Values of the same lcore variable but owned by to different lcore
> 
> Typo: to -> two
> 

Fixed.

>> + * ids may be frequently read or written by the owners without risking
>> + * false sharing.
>> + *
>> + * An appropriate synchronization mechanism (e.g., atomic loads and
>> + * stores) should be employed to assure there are no data races between
>> + * the owning thread and any non-owner threads accessing the same
>> + * lcore variable instance.
>> + *
>> + * The value of the lcore variable for a particular lcore id is
>> + * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
>> + *
>> + * A common pattern is for an EAL thread or a registered non-EAL
>> + * thread to access its own lcore variable value. For this purpose, a
>> + * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
>> + *
>> + * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
>> + * pointer with the same type as the value, it may not be directly
>> + * dereferenced and must be treated as an opaque identifier.
>> + *
>> + * Lcore variable handles and value pointers may be freely passed
>> + * between different threads.
>> + *
>> + * @b Storage
>> + *
>> + * An lcore variable's values may by of a primitive type like @c int,
> 
> Two typos: "values may by" -> "value may be"
> 

That's not a typo. An lcore variable takes on multiple values, one for
each lcore id. That said, I guess you could refer to the whole thing 
(the set of values) as the "value" as well.
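
As an illustration of that reading, using the macros from this patch
(the 'hits' variable is made up): one handle, and behind it
RTE_MAX_LCORE independent values:

static RTE_LCORE_VAR_HANDLE(long, hits);

static long
hits_total(void)
{
	long sum = 0;
	long *value;

	/* Visit the value owned by each lcore id, 0..RTE_MAX_LCORE-1. */
	RTE_LCORE_VAR_FOREACH_VALUE(value, hits)
		sum += *value;

	return sum;
}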

>> + * but would more typically be a @c struct.
>> + *
>> + * The lcore variable handle introduces a per-variable (not
>> + * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
>> + * there are some memory footprint gains to be made by organizing all
>> + * per-lcore id data for a particular module as one lcore variable
>> + * (e.g., as a struct).
>> + *
>> + * An application may choose to define an lcore variable handle, which
>> + * it then goes on to never allocate.
>> + *
>> + * The size of a lcore variable's value must be less than the DPDK
>> + * build-time constant @c RTE_MAX_LCORE_VAR.
>> + *
>> + * The lcore variables are stored in a series of lcore buffers, which
>> + * are allocated from the libc heap. Heap allocation failures are
>> + * treated as fatal.
>> + *
>> + * Lcore variables should generally *not* be @ref __rte_cache_aligned
>> + * and need *not* include a @ref RTE_CACHE_GUARD field, since the use
>> + * of these constructs is designed to avoid false sharing. In the
>> + * case of an lcore variable instance, the thread most recently
>> + * accessing nearby data structures should almost-always the lcore
> 
> Missing word: should almost-always *be* the lcore variables' owner.
> 

Fixed.

> 
>> + * variables' owner. Adding padding will increase the effective memory
>> + * working set size, potentially reducing performance.
>> + *
>> + * Lcore variable values take on an initial value of zero.
>> + *
>> + * @b Example
>> + *
>> + * Below is an example of the use of an lcore variable:
>> + *
>> + * @code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + * };
>> + *
>> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
>> + *
>> + * long foo_get_a_plus_b(void)
>> + * {
>> + *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
>> + *
>> + *         return state->a + state->b;
>> + * }
>> + *
>> + * RTE_INIT(rte_foo_init)
>> + * {
>> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
>> + *
>> + *         struct foo_lcore_state *state;
>> + *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
>> + *                 (initialize 'state')
> 
> Consider: (initialize 'state') -> /* initialize 'state' */
> 

I think I tried that, and it failed because the compiler didn't like 
nested comments.
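
For the record, the failure mode would be that C block comments do not
nest: the example sits inside the header's /** ... */ doxygen comment,
so something like

 * RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
 *         /* initialize 'state' */

would have its closing */ terminate the enclosing doxygen comment,
leaving the rest of the example outside any comment.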

>> + *         }
>> + *
>> + *         (other initialization)
> 
> Consider: (other initialization) -> /* other initialization */
> 
>> + * }
>> + * @endcode
>> + *
>> + *
>> + * @b Alternatives
>> + *
>> + * Lcore variables are designed to replace a pattern exemplified below:
>> + * @code{.c}
>> + * struct __rte_cache_aligned foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + *         RTE_CACHE_GUARD;
>> + * };
>> + *
>> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
>> + * @endcode
>> + *
>> + * This scheme is simple and effective, but has one drawback: the data
>> + * is organized so that objects related to all lcores for a particular
>> + * module is kept close in memory. At a bare minimum, this forces the
>> + * use of cache-line alignment to avoid false sharing. With CPU
> 
> Consider adding: use of *padding to* cache-line alignment
> My point here is:
> This sentence should somehow include the word "padding".

I'm not sure everyone thinks about __rte_cache_aligned or cache-aligned 
heap allocations as "padded."

> This paragraph is not only about cache line alignment, it is primarily about padding.
> 

"At a bare minimum, this requires sizing data structures (e.g., using 
`__rte_cache_aligned`) to an even number of cache lines to avoid false 
sharing."

How about this?
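
As a sketch of what that sizing means in the array-of-structs pattern
(the struct name is made up): the alignment attribute pads each
element out to a whole number of cache lines, which a static_assert
can make explicit:

#include <assert.h>
#include <rte_common.h>

struct __rte_cache_aligned bar_lcore_state {
	int a; /* 4 bytes of payload ... */
	/* ... but alignment pads the struct to a full cache line. */
};

static_assert(sizeof(struct bar_lcore_state) % RTE_CACHE_LINE_SIZE == 0,
	      "each array element spans a whole number of cache lines");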

>> + * hardware prefetching and memory loads resulting from speculative
>> + * execution (functions which seemingly are getting more eager faster
>> + * than they are getting more intelligent), one or more "guard" cache
>> + * lines may be required to separate one lcore's data from another's.
>> + *
>> + * Lcore variables has the upside of working with, not against, the
> 
> Typo: has -> have
> 

Fixed.

>> + * CPU's assumptions and for example next-line prefetchers may well
>> + * work the way its designers intended (i.e., to the benefit, not
>> + * detriment, of system performance).
>> + *
>> + * Another alternative to @ref rte_lcore_var.h is the @ref
>> + * rte_per_lcore.h API, which make use of thread-local storage (TLS,
> 
> Typo: make -> makes >

Fixed.

>> + * e.g., GCC __thread or C11 _Thread_local). The main differences
>> + * between using the various forms of TLS (e.g., @ref
>> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
>> + * variables are:
>> + *
>> + *   * The existence and non-existence of a thread-local variable
>> + *     instance follow that of particular thread's. The data cannot be
> 
> Typo: "thread's" -> "threads", I think. :-/
> 

It's not a typo.

>> + *     accessed before the thread has been created, nor after it has
>> + *     exited. As a result, thread-local variables must initialized in
> 
> Missing word: must *be* initialized
> 

Fixed.

>> + *     a "lazy" manner (e.g., at the point of thread creation). Lcore
>> + *     variables may be accessed immediately after having been
>> + *     allocated (which may be prior to any thread beyond the main
>> + *     thread is running).
>> + *   * A thread-local variable is duplicated across all threads in the
>> + *     process, including unregistered non-EAL threads (i.e.,
>> + *     "regular" threads). For DPDK applications heavily relying on
>> + *     multi-threading (in conjunction with DPDK's "one thread per core"
>> + *     pattern), either by having many concurrent threads or
>> + *     creating/destroying threads at a high rate, an excessive use of
>> + *     thread-local variables may cause inefficiencies (e.g.,
>> + *     increased thread creation overhead due to thread-local storage
>> + *     initialization or increased total RAM footprint usage). Lcore
>> + *     variables *only* exist for threads with an lcore id.
>> + *   * If data in thread-local storage may be shared between threads
>> + *     (i.e., can a pointer to a thread-local variable be passed to
>> + *     and successfully dereferenced by non-owning thread) depends on
>> + *     the details of the TLS implementation. With GCC __thread and
>> + *     GCC _Thread_local, such data sharing is supported. In the C11
>> + *     standard, the result of accessing another thread's
>> + *     _Thread_local object is implementation-defined. Lcore variable
>> + *     instances may be accessed reliably by any thread.
>> + */
>> +
>> +#include <stddef.h>
>> +#include <stdalign.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_config.h>
>> +#include <rte_lcore.h>
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * Given the lcore variable type, produces the type of the lcore
>> + * variable handle.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
>> +	type *
>> +
>> +/**
>> + * Define a lcore variable handle.
> 
> Typo: "a lcore" -> "an lcore"
> Found this typo multiple times; search-replace "a lcore".
> 

Yes, fixed.

>> + *
>> + * This macro defines a variable which is used as a handle to access
>> + * the various per-lcore id instances of a per-lcore id variable.
> 
> Suggest:
> "the various per-lcore id instances of a per-lcore id variable" ->
> "the various instances of a per-lcore id variable" >

Sounds good.

>> + *
>> + * The aim with this macro is to make clear at the point of
>> + * declaration that this is an lcore handler, rather than a regular
>> + * pointer.
>> + *
>> + * Add @b static as a prefix in case the lcore variable are only to be
> 
> Typo: are -> is
> 

Fixed.

>> + * accessed from a particular translation unit.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
>> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle.
>> + *
>> + * The values of the lcore variable are initialized to zero.
> 
> Consider adding: "the lcore variable *instances* are initialized"
> Found this typo multiple times; search-replace.
> 

It's not a typo. "Values" is just short for "instances of the value", 
just like "instances" is. Using instances everywhere may confuse the 
reader that an instance has both a name and a value, which is not the case.
I don't know, maybe I should be using "values" everywhere instead of 
"instances".

I agree there's some lack of consistency here and potential room for 
improvement, but I'm not sure exactly what improvement looks like.

>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
>> +	handle = rte_lcore_var_alloc(size, align)
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle,
>> + * with values aligned for any type of object.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
>> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
>> +
>> +/**
>> + * Allocate space for an lcore variable of the size and alignment
>> requirements
>> + * suggested by the handler pointer type, and initialize its handle.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC(handle)					\
>> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
>> +				       alignof(typeof(*(handle))))
>> +
>> +/**
>> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
>> + * means of a @ref RTE_INIT constructor.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
>> +	}
>> +
>> +/**
>> + * Allocate an explicitly-sized lcore variable by means of a @ref
>> + * RTE_INIT constructor.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
>> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
>> +
>> +/**
>> + * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_INIT(name)					\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC(name);				\
>> +	}
>> +
>> +/**
>> + * Get void pointer to lcore variable instance with the specified
>> + * lcore id.
>> + *
>> + * @param lcore_id
>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
>> + *   instances should be accessed. The lcore id need not be valid
>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
>> + *   is also not valid (and thus should not be dereferenced).
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +static inline void *
>> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
>> +{
>> +	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
>> +}
>> +
>> +/**
>> + * Get pointer to lcore variable instance with the specified lcore id.
>> + *
>> + * @param lcore_id
>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
>> + *   instances should be accessed. The lcore id need not be valid
>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
>> + *   is also not valid (and thus should not be dereferenced).
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
>> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
>> +
>> +/**
>> + * Get pointer to lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_VALUE(handle) \
>> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
>> +
>> +/**
>> + * Iterate over each lcore id's value for a lcore variable.
>> + *
>> + * @param value
>> + *   A pointer set successivly set to point to lcore variable value
> 
> "set successivly set" -> "successivly set"
> 
> Thinking out loud, ignore at your preference:
> During the RFC discussions, the term used for referring to an lcore variable was discussed;
> we considered "pointer", but settled for "value".
> Perhaps "instance" would be usable in comments like like the one describing this function...
> "A pointer set successivly set to point to lcore variable value" ->
> "A pointer set successivly set to point to lcore variable instance".
> I don't know.
> 

I also don't know.

> 
>> + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
>> +	for (unsigned int lcore_id =					\
>> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
>> +	     lcore_id < RTE_MAX_LCORE;					\
>> +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
>> +
>> +/**
>> + * Allocate space in the per-lcore id buffers for a lcore variable.
>> + *
>> + * The pointer returned is only an opaque identifier of the variable. To
>> + * get an actual pointer to a particular instance of the variable use
>> + * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
>> + *
>> + * The lcore variable values' memory is set to zero.
>> + *
>> + * The allocation is always successful, barring a fatal exhaustion of
>> + * the per-lcore id buffer space.
>> + *
>> + * rte_lcore_var_alloc() is not multi-thread safe.
>> + *
>> + * @param size
>> + *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
>> + * @param align
>> + *   If 0, the values will be suitably aligned for any kind of type
>> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
>> + *   on a multiple of *align*, which must be a power of 2 and equal or
>> + *   less than @c RTE_CACHE_LINE_SIZE.
>> + * @return
>> + *   The id of the variable, stored in a void pointer value. The value
> 
> "id" -> "handle"
> 

Fixed.

>> + *   is always non-NULL.
>> + */
>> +__rte_experimental
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align);
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* _RTE_LCORE_VAR_H_ */
>> diff --git a/lib/eal/version.map b/lib/eal/version.map
>> index e3ff412683..5f5a3522c0 100644
>> --- a/lib/eal/version.map
>> +++ b/lib/eal/version.map
>> @@ -396,6 +396,9 @@ EXPERIMENTAL {
>>
>>   	# added in 24.03
>>   	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
>> +
>> +	rte_lcore_var_alloc;
>> +	rte_lcore_var;
> 
> No such function: rte_lcore_var

Indeed. That variable is gone. Fixed.

Thanks a lot for your review, Morten.

> 
>>   };
>>
>>   INTERNAL {
>> --
>> 2.34.1
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH 1/6] eal: add static per-lcore memory allocation facility
  2024-09-10 10:44                             ` Mattias Rönnblom
@ 2024-09-10 13:07                               ` Morten Brørup
  2024-09-10 15:55                               ` Stephen Hemminger
  1 sibling, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-09-10 13:07 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 10 September 2024 12.45
> 
> On 2024-09-10 11:32, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Tuesday, 10 September 2024 09.04
> >>
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >
> > Throughout the descriptions and comments,
> > please replace "lcore id" with "lcore" (e.g. "per-lcore variables"),
> > when referring to the lcore, and not the index of the lcore.
> > (Your intention might be to highlight that it only covers threads with
> an lcore id,
> > but if that wasn't the case, you would refer to them as "threads" not
> "lcores".)
> > Except, of course, when referring to an actual lcore id, e.g. lcore_id
> function parameters.
> 
> "lcore" is just another word for "EAL thread." The lcore variables exist
> in one instance for every thread with an lcore id, thus also for
> registered non-EAL threads (i.e., threads which are not lcores).
> 
> I've tried to summarize the (very confusing) terminology of DPDK's
> threading model here:
> https://ericsson.github.io/dataplanebook/threading/threading.html#eal-
> threads
> 
> So, in my world, "per-lcore id variables" is pretty accurate. You could
> say "variables with per-lcore id values" if you want to make it even
> more clear what's going on.

With your reference terminology in mind, "per-lcore id variables" is OK with me.

<rant>
DPDK's lcore terminology has drifted quite far away from its original 1:1 meaning, but I'm not going to try to clean it up.
It also seems the meaning of "socket" is drifting.

And the DPDK project's API/ABI compatibility ambitions seem to favor bolting new features onto the pile, rather than replacing APIs that have grown misleading with new APIs serving new requirements.
</rant>

> 
> >
> > Paraphrasing:
> > Consider the type of what you are referring to;
> > use "lcore" if its type is "thread", and
> > use "lcore id" if its type is "int".
> >
> > I might be wrong here, but please think hard about it.
> >
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> >>
> >> The primary <rte_lcore_var.h> use case is for statically allocating
> >> small, frequently-accessed data structures, for which one instance
> >> should exist for each lcore.
> >>
> >> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> >> _Thread_local), but decoupling the values' life time from that of the
> >> threads.
> >>
> >> Lcore variables are also similar in terms of functionality provided
> by
> >> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >> build-time machinery. DPCPU uses linker scripts, which effectively
> >> prevents the reuse of its, otherwise seemingly viable, approach.
> >>
> >> The currently-prevailing way to solve the same problem as lcore
> >> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> >> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> >> lcore variables over this approach is that data related to the same
> >> lcore now is close (spatially, in memory), rather than data used by
> >> the same module, which in turn avoids excessive use of padding,
> >> polluting caches with unused data.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >>
> >> --
> >
> >> +++ b/doc/api/doxy-api-index.md
> >> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
> >>     [interrupts](@ref rte_interrupts.h),
> >>     [launch](@ref rte_launch.h),
> >>     [lcore](@ref rte_lcore.h),
> >> +  [lcore-varible](@ref rte_lcore_var.h),
> >
> > Typo: varible -> variable
> >
> >
> 
> I'll change it to "lcore variables" (no dash, plural).

+1

> 
> >> +++ b/doc/guides/rel_notes/release_24_11.rst
> >> @@ -55,6 +55,20 @@ New Features
> >>        Also, make sure to start the actual text at the margin.
> >>        =======================================================
> >>
> >> +* **Added EAL per-lcore static memory allocation facility.**
> >> +
> >> +    Added EAL API <rte_lcore_var.h> for statically allocating small,
> >> +    frequently-accessed data structures, for which one instance
> should
> >> +    exist for each lcore.
> >> +
> >> +    With lcore variables, data is organized spatially on a per-lcore
> >> +    basis, rather than per library or PMD, avoiding the need for
> cache
> >> +    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
> >> +    reduces CPU cache internal fragmentation, improving performance.
> >> +
> >> +    Lcore variables are similar to thread-local storage (TLS, e.g.,
> >> +    C11 _Thread_local), but decoupling the values' life time from
> that
> >> +    of the threads.
> >
> > When referring to TLS, you might want to clarify that lcore variables
> are not instantiated for unregistered threads.
> >
> 
> Isn't that clear from the first paragraph? Although it should say "per
> lcore id", rather than "per lcore."

Yes, almost.
But in this paragraph, when you mention that they are similar to TLS, someone might not catch that it still applies (that they are only instantiated for lcores and not all threads). So clarify one extra time, just to ensure that everyone gets it.

> 
> >
> >> +static void *lcore_buffer;
> >> +static size_t offset = RTE_MAX_LCORE_VAR;
> >> +
> >> +static void *
> >> +lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +	void *handle;
> >> +	void *value;
> >> +
> >> +	offset = RTE_ALIGN_CEIL(offset, align);
> >> +
> >> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> >> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> +					     LCORE_BUFFER_SIZE);
> >> +		RTE_VERIFY(lcore_buffer != NULL);
> >> +
> >> +		offset = 0;
> >> +	}
> >
> > To determine if the lcore_buffer memory should be allocated, why not
> just check if lcore_buffer == NULL?
> 
> Because it may be the case that lcore_buffer is non-NULL but the remaining
> space is too small to service the allocation.

There's no error handling of that case. You simply forget about the allocated memory, and behave like the initial allocation/initialization.

> 
> > Then offset wouldn't need an initial value of RTE_MAX_LCORE_VAR.
> >
> >> +
> >> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
> >> +
> >> +	offset += size;
> >> +
> >> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
> >> +		memset(value, 0, size);
> >> +
> >> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with
> a "
> >> +		"%"PRIuPTR"-byte alignment", size, align);
> >> +
> >> +	return handle;
> >> +}
> >
> >
> >> +/**
> >> + * @file
> >> + *
> >> + * RTE Per-lcore id variables
> >
> > Suggest mentioning the short form too, e.g.:
> > "RTE Per-lcore id variables (RTE Lcore variables)"
> 
> What about just "RTE Lcore variables"?

+1

> 
> Exactly what they are is thoroughly described in the text that follows.
> 
> >
> >> + *
> >> + * This API provides a mechanism to create and access per-lcore id
> >> + * variables in a space- and cycle-efficient manner.
> >> + *
> >> + * A per-lcore id variable (or lcore variable for short) has one
> value
> >> + * for each EAL thread and registered non-EAL thread.
> >
> > And service thread.
> 
> Service threads are EAL threads, or, at a bare minimum, must have an
> lcore id, and thus must be registered.

Service threads have an lcore id, yes, but they have rte_lcore_role_t enum value ROLE_SERVICE, which differs from that of EAL threads (ROLE_EAL). Registered non-EAL threads have yet another role, ROLE_NON_EAL.
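
A hedged sketch of that distinction, using the existing
rte_eal_lcore_role() API (the helper itself is made up):

#include <rte_lcore.h>

static const char *
role_name(unsigned int lcore_id)
{
	switch (rte_eal_lcore_role(lcore_id)) {
	case ROLE_RTE:     return "EAL thread (main or worker)";
	case ROLE_SERVICE: return "service lcore";
	case ROLE_NON_EAL: return "registered non-EAL thread";
	default:           return "lcore id not in use";
	}
}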

> 
> >
> >> + * There is one
> >> + * copy for each current and future lcore id-equipped thread, with
> the
> >
> > "one copy" -> "one instance"
> >
> 
> Fixed.
> 
> >> + * total number of copies amounting to @c RTE_MAX_LCORE. The value
> of
> >
> > "copies" -> "instances"
> >
> 
> OK, I'll rephrase that sentence.
> 
> >> + * an lcore variable for a particular lcore id is independent from
> >> + * other values (for other lcore ids) within the same lcore
> variable.
> >> + *
> >> + * In order to access the values of an lcore variable, a handle is
> >> + * used. The type of the handle is a pointer to the value's type
> >> + * (e.g., for @c uint32_t lcore variable, the handle is a
> >> + * <code>uint32_t *</code>. The handler type is used to inform the
> >
> > Typo: "handler" -> "handle", I think :-/
> > Found this typo multiple times; search-replace.
> 
> Fixed.
> 
> >
> >> + * access macros the type of the values. A handle may be passed
> >> + * between modules and threads just like any pointer, but its value
> >> + * must be treated as an opaque identifier. An allocated handle
> >> + * never has the value NULL.
> >> + *
> >> + * @b Creation
> >> + *
> >> + * An lcore variable is created in two steps:
> >> + *  1. Define a lcore variable handle by using @ref
> RTE_LCORE_VAR_HANDLE.
> >> + *  2. Allocate lcore variable storage and initialize the handle
> with
> >> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
> >> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
> of
> >> + *     module initialization, but may be done at any time.
> >> + *
> >> + * An lcore variable is not tied to the owning thread's lifetime.
> It's
> >> + * available for use by any thread immediately after having been
> >> + * allocated, and continues to be available throughout the lifetime
> of
> >> + * the EAL.
> >> + *
> >> + * Lcore variables cannot and need not be freed.
> >> + *
> >> + * @b Access
> >> + *
> >> + * The value of any lcore variable for any lcore id may be accessed
> >> + * from any thread (including unregistered threads), but it should
> >> + * only be *frequently* read from or written to by the owner.
> >> + *
> >> + * Values of the same lcore variable but owned by to different lcore
> >
> > Typo: to -> two
> >
> 
> Fixed.
> 
> >> + * ids may be frequently read or written by the owners without
> risking
> >> + * false sharing.
> >> + *
> >> + * An appropriate synchronization mechanism (e.g., atomic loads and
> >> + * stores) should be employed to assure there are no data races between
> >> + * the owning thread and any non-owner threads accessing the same
> >> + * lcore variable instance.
> >> + *
> >> + * The value of the lcore variable for a particular lcore id is
> >> + * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
> >> + *
> >> + * A common pattern is for an EAL thread or a registered non-EAL
> >> + * thread to access its own lcore variable value. For this purpose,
> a
> >> + * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
> >> + *
> >> + * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is
> a
> >> + * pointer with the same type as the value, it may not be directly
> >> + * dereferenced and must be treated as an opaque identifier.
> >> + *
> >> + * Lcore variable handles and value pointers may be freely passed
> >> + * between different threads.
> >> + *
> >> + * @b Storage
> >> + *
> >> + * An lcore variable's values may by of a primitive type like @c
> int,
> >
> > Two typos: "values may by" -> "value may be"
> >
> 
> That's not a typo. An lcore variable takes on multiple values, one for
> each lcore id. That said, I guess you could refer to the whole thing
> (the set of values) as the "value" as well.

OK. Reading it the way you explain, I get it. No typo.

> 
> >> + * but would more typically be a @c struct.
> >> + *
> >> + * The lcore variable handle introduces a per-variable (not
> >> + * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
> >> + * there are some memory footprint gains to be made by organizing
> all
> >> + * per-lcore id data for a particular module as one lcore variable
> >> + * (e.g., as a struct).
> >> + *
> >> + * An application may choose to define an lcore variable handle,
> which
> >> + * it then goes on to never allocate.
> >> + *
> >> + * The size of a lcore variable's value must be less than the DPDK
> >> + * build-time constant @c RTE_MAX_LCORE_VAR.
> >> + *
> >> + * The lcore variables are stored in a series of lcore buffers, which
> >> + * are allocated from the libc heap. Heap allocation failures are
> >> + * treated as fatal.
> >> + *
> >> + * Lcore variables should generally *not* be @ref
> __rte_cache_aligned
> >> + * and need *not* include a @ref RTE_CACHE_GUARD field, since the
> use
> >> + * of these constructs is designed to avoid false sharing. In the
> >> + * case of an lcore variable instance, the thread most recently
> >> + * accessing nearby data structures should almost-always the lcore
> >
> > Missing word: should almost-always *be* the lcore variables' owner.
> >
> 
> Fixed.
> 
> >
> >> + * variables' owner. Adding padding will increase the effective
> memory
> >> + * working set size, potentially reducing performance.
> >> + *
> >> + * Lcore variable values take on an initial value of zero.
> >> + *
> >> + * @b Example
> >> + *
> >> + * Below is an example of the use of an lcore variable:
> >> + *
> >> + * @code{.c}
> >> + * struct foo_lcore_state {
> >> + *         int a;
> >> + *         long b;
> >> + * };
> >> + *
> >> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state,
> lcore_states);
> >> + *
> >> + * long foo_get_a_plus_b(void)
> >> + * {
> >> + *         struct foo_lcore_state *state =
> RTE_LCORE_VAR_VALUE(lcore_states);
> >> + *
> >> + *         return state->a + state->b;
> >> + * }
> >> + *
> >> + * RTE_INIT(rte_foo_init)
> >> + * {
> >> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
> >> + *
> >> + *         struct foo_lcore_state *state;
> >> + *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
> >> + *                 (initialize 'state')
> >
> > Consider: (initialize 'state') -> /* initialize 'state' */
> >
> 
> I think I tried that, and it failed because the compiler didn't like
> nested comments.

OK, no objections. Just leave it as is.

> 
> >> + *         }
> >> + *
> >> + *         (other initialization)
> >
> > Consider: (other initialization) -> /* other initialization */
> >
> >> + * }
> >> + * @endcode
> >> + *
> >> + *
> >> + * @b Alternatives
> >> + *
> >> + * Lcore variables are designed to replace a pattern exemplified
> below:
> >> + * @code{.c}
> >> + * struct __rte_cache_aligned foo_lcore_state {
> >> + *         int a;
> >> + *         long b;
> >> + *         RTE_CACHE_GUARD;
> >> + * };
> >> + *
> >> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
> >> + * @endcode
> >> + *
> >> + * This scheme is simple and effective, but has one drawback: the
> data
> >> + * is organized so that objects related to all lcores for a
> particular
> >> + * module is kept close in memory. At a bare minimum, this forces
> the
> >> + * use of cache-line alignment to avoid false sharing. With CPU
> >
> > Consider adding: use of *padding to* cache-line alignment
> > My point here is:
> > This sentence should somehow include the word "padding".
> 
> I'm not sure everyone thinks about __rte_cache_aligned or cache-aligned
> heap allocations as "padded."
> 
> > This paragraph is not only about cache line alignment, it is primarily
> about padding.
> >
> 
> "At a bare minimum, this requires sizing data structures (e.g., using
> `__rte_cache_aligned`) to an even number of cache lines to avoid false
> sharing."
> 
> How about this?

OK. Sizing might imply padding, so it serves the point I was targeting.
But "even number" -> "whole number". The number might be odd. :-)

> 
> >> + * hardware prefetching and memory loads resulting from speculative
> >> + * execution (functions which seemingly are getting more eager
> faster
> >> + * than they are getting more intelligent), one or more "guard"
> cache
> >> + * lines may be required to separate one lcore's data from
> another's.
> >> + *
> >> + * Lcore variables has the upside of working with, not against, the
> >
> > Typo: has -> have
> >
> 
> Fixed.
> 
> >> + * CPU's assumptions and for example next-line prefetchers may well
> >> + * work the way its designers intended (i.e., to the benefit, not
> >> + * detriment, of system performance).
> >> + *
> >> + * Another alternative to @ref rte_lcore_var.h is the @ref
> >> + * rte_per_lcore.h API, which make use of thread-local storage (TLS,
> >
> > Typo: make -> makes >
> 
> Fixed.
> 
> >> + * e.g., GCC __thread or C11 _Thread_local). The main differences
> >> + * between using the various forms of TLS (e.g., @ref
> >> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
> >> + * variables are:
> >> + *
> >> + *   * The existence and non-existence of a thread-local variable
> >> + *     instance follow that of particular thread's. The data cannot
> be
> >
> > Typo: "thread's" -> "threads", I think. :-/
> >
> 
> It's not a typo.

OK.

> 
> >> + *     accessed before the thread has been created, nor after it has
> >> + *     exited. As a result, thread-local variables must initialized
> in
> >
> > Missing word: must *be* initialized
> >
> 
> Fixed.
> 
> >> + *     a "lazy" manner (e.g., at the point of thread creation).
> Lcore
> >> + *     variables may be accessed immediately after having been
> >> + *     allocated (which may be prior to any thread beyond the main
> >> + *     thread is running).
> >> + *   * A thread-local variable is duplicated across all threads in
> the
> >> + *     process, including unregistered non-EAL threads (i.e.,
> >> + *     "regular" threads). For DPDK applications heavily relying on
> >> + *     multi-threading (in conjunction with DPDK's "one thread per
> core"
> >> + *     pattern), either by having many concurrent threads or
> >> + *     creating/destroying threads at a high rate, an excessive use
> of
> >> + *     thread-local variables may cause inefficiencies (e.g.,
> >> + *     increased thread creation overhead due to thread-local
> storage
> >> + *     initialization or increased total RAM footprint usage). Lcore
> >> + *     variables *only* exist for threads with an lcore id.
> >> + *   * If data in thread-local storage may be shared between threads
> >> + *     (i.e., can a pointer to a thread-local variable be passed to
> >> + *     and successfully dereferenced by non-owning thread) depends
> on
> >> + *     the details of the TLS implementation. With GCC __thread and
> >> + *     GCC _Thread_local, such data sharing is supported. In the C11
> >> + *     standard, the result of accessing another thread's
> >> + *     _Thread_local object is implementation-defined. Lcore
> variable
> >> + *     instances may be accessed reliably by any thread.
> >> + */
> >> +
> >> +#include <stddef.h>
> >> +#include <stdalign.h>
> >> +
> >> +#include <rte_common.h>
> >> +#include <rte_config.h>
> >> +#include <rte_lcore.h>
> >> +
> >> +#ifdef __cplusplus
> >> +extern "C" {
> >> +#endif
> >> +
> >> +/**
> >> + * Given the lcore variable type, produces the type of the lcore
> >> + * variable handle.
> >> + */
> >> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
> >> +	type *
> >> +
> >> +/**
> >> + * Define a lcore variable handle.
> >
> > Typo: "a lcore" -> "an lcore"
> > Found this typo multiple times; search-replace "a lcore".
> >
> 
> Yes, fixed.
> 
> >> + *
> >> + * This macro defines a variable which is used as a handle to access
> >> + * the various per-lcore id instances of a per-lcore id variable.
> >
> > Suggest:
> > "the various per-lcore id instances of a per-lcore id variable" ->
> > "the various instances of a per-lcore id variable" >
> 
> Sounds good.
> 
> >> + *
> >> + * The aim with this macro is to make clear at the point of
> >> + * declaration that this is an lcore handler, rather than a regular
> >> + * pointer.
> >> + *
> >> + * Add @b static as a prefix in case the lcore variable are only to
> be
> >
> > Typo: are -> is
> >
> 
> Fixed.
> 
> >> + * accessed from a particular translation unit.
> >> + */
> >> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> >> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> >> +
> >> +/**
> >> + * Allocate space for an lcore variable, and initialize its handle.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> >
> > Consider adding: "the lcore variable *instances* are initialized"
> > Found this typo multiple times; search-replace.
> >
> 
> It's not a typo. "Values" is just short for "instances of the value",
> just like "instances" is. Using instances everywhere may confuse the
> reader that an instance has both a name and a value, which is not the case.
> I don't know, maybe I should be using "values" everywhere instead of
> "instances".
> 
> I agree there's some lack of consistency here and potential room for
> improvement, but I'm not sure exactly what improvement looks like.

Yes, perhaps using "values" (instead of "instances of the value") everywhere,
and avoiding "instances", might be better.

If you repeat/paraphrase your above explanation in the documentation and/or source code, it should cover it.

> 
> >> + */
> >> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
> >> +	handle = rte_lcore_var_alloc(size, align)
> >> +
> >> +/**
> >> + * Allocate space for an lcore variable, and initialize its handle,
> >> + * with values aligned for any type of object.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> >> + */
> >> +#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
> >> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
> >> +
> >> +/**
> >> + * Allocate space for an lcore variable of the size and alignment
> >> requirements
> >> + * suggested by the handler pointer type, and initialize its handle.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> >> + */
> >> +#define RTE_LCORE_VAR_ALLOC(handle)					\
> >> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
> >> +				       alignof(typeof(*(handle))))
> >> +
> >> +/**
> >> + * Allocate an explicitly-sized, explicitly-aligned lcore variable
> by
> >> + * means of a @ref RTE_INIT constructor.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> >> + */
> >> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
> >> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> >> +	{								\
> >> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
> >> +	}
> >> +
> >> +/**
> >> + * Allocate an explicitly-sized lcore variable by means of a @ref
> >> + * RTE_INIT constructor.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> >> + */
> >> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
> >> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
> >> +
> >> +/**
> >> + * Allocate an lcore variable by means of a @ref RTE_INIT
> constructor.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> >> + */
> >> +#define RTE_LCORE_VAR_INIT(name)					\
> >> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> >> +	{								\
> >> +		RTE_LCORE_VAR_ALLOC(name);				\
> >> +	}
> >> +
> >> +/**
> >> + * Get void pointer to lcore variable instance with the specified
> >> + * lcore id.
> >> + *
> >> + * @param lcore_id
> >> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> >> + *   instances should be accessed. The lcore id need not be valid
> >> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the
> pointer
> >> + *   is also not valid (and thus should not be dereferenced).
> >> + * @param handle
> >> + *   The lcore variable handle.
> >> + */
> >> +static inline void *
> >> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> >> +{
> >> +	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
> >> +}
> >> +
> >> +/**
> >> + * Get pointer to lcore variable instance with the specified lcore
> id.
> >> + *
> >> + * @param lcore_id
> >> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> >> + *   instances should be accessed. The lcore id need not be valid
> >> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the
> pointer
> >> + *   is also not valid (and thus should not be dereferenced).
> >> + * @param handle
> >> + *   The lcore variable handle.
> >> + */
> >> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)
> 	\
> >> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
> >> +
> >> +/**
> >> + * Get pointer to lcore variable instance of the current thread.
> >> + *
> >> + * May only be used by EAL threads and registered non-EAL threads.
> >> + */
> >> +#define RTE_LCORE_VAR_VALUE(handle) \
> >> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> >> +
> >> +/**
> >> + * Iterate over each lcore id's value for a lcore variable.
> >> + *
> >> + * @param value
> >> + *   A pointer set successivly set to point to lcore variable value
> >
> > "set successivly set" -> "successivly set"

Don't forget.

> >
> > Thinking out loud, ignore at your preference:
> > During the RFC discussions, the term used for referring to an lcore
> variable was discussed;
> > we considered "pointer", but settled for "value".
> > Perhaps "instance" would be usable in comments like like the one
> describing this function...
> > "A pointer set successivly set to point to lcore variable value" ->
> > "A pointer set successivly set to point to lcore variable instance".
> > I don't know.
> >
> 
> I also don't know.

Referring to the terminology above, if you go for "value" rather than "instance" (or "instance of the value"), stick with "value" here too.

> 
> >
> >> + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
> >> + * @param handle
> >> + *   The lcore variable handle.
> >> + */
> >> +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
> >> +	for (unsigned int lcore_id =					\
> >> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0);
> \
> >> +	     lcore_id < RTE_MAX_LCORE;					\
> >> +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
> handle))
> >> +
> >> +/**
> >> + * Allocate space in the per-lcore id buffers for a lcore variable.
> >> + *
> >> + * The pointer returned is only an opaque identifer of the variable.
> To
> >> + * get an actual pointer to a particular instance of the variable
> use
> >> + * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
> >> + *
> >> + * The lcore variable values' memory is set to zero.
> >> + *
> >> + * The allocation is always successful, barring a fatal exhaustion
> of
> >> + * the per-lcore id buffer space.
> >> + *
> >> + * rte_lcore_var_alloc() is not multi-thread safe.
> >> + *
> >> + * @param size
> >> + *   The size (in bytes) of the variable's per-lcore id value. Must
> be > 0.
> >> + * @param align
> >> + *   If 0, the values will be suitably aligned for any kind of type
> >> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be
> aligned
> >> + *   on a multiple of *align*, which must be a power of 2 and equal
> or
> >> + *   less than @c RTE_CACHE_LINE_SIZE.
> >> + * @return
> >> + *   The id of the variable, stored in a void pointer value. The
> value
> >
> > "id" -> "handle"
> >
> 
> Fixed.
> 
> >> + *   is always non-NULL.
> >> + */
> >> +__rte_experimental
> >> +void *
> >> +rte_lcore_var_alloc(size_t size, size_t align);
> >> +
> >> +#ifdef __cplusplus
> >> +}
> >> +#endif
> >> +
> >> +#endif /* _RTE_LCORE_VAR_H_ */
> >> diff --git a/lib/eal/version.map b/lib/eal/version.map
> >> index e3ff412683..5f5a3522c0 100644
> >> --- a/lib/eal/version.map
> >> +++ b/lib/eal/version.map
> >> @@ -396,6 +396,9 @@ EXPERIMENTAL {
> >>
> >>   	# added in 24.03
> >>   	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
> >> +
> >> +	rte_lcore_var_alloc;
> >> +	rte_lcore_var;
> >
> > No such function: rte_lcore_var
> 
> Indeed. That variable is gone. Fixed.
> 
> Thanks a lot for your review, Morten.

Thanks a lot for your contribution, Mattias. :-)

> 
> >
> >>   };
> >>
> >>   INTERNAL {
> >> --
> >> 2.34.1
> >

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [RFC v6 0/6] Lcore variables
  2024-09-10  6:41                       ` Mattias Rönnblom
@ 2024-09-10 15:41                         ` Stephen Hemminger
  0 siblings, 0 replies; 313+ messages in thread
From: Stephen Hemminger @ 2024-09-10 15:41 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Morten Brørup, Mattias Rönnblom, dev, Konstantin Ananyev

On Tue, 10 Sep 2024 08:41:19 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 2024-09-02 16:42, Morten Brørup wrote:

On a related note, latest GCC supports annotating the address space
of variables. Kernel uses it for RCU.

It would be good if DPDK could do this for:
	- per lcore data
	- data in huge pages
	- data protected by rcu

With these annotations, various checkers and compilers can warn about
places where such data is passed incorrectly (with a cast available
to override the warning).
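
For illustration, kernel-style annotations (checked by the sparse
static analyzer rather than by plain GCC) look roughly like the
sketch below; the __rte_lcore marker is hypothetical and not part of
any patch in this thread:

#ifdef __CHECKER__
/* The pointee lives in a separate address space; dereferencing it
 * directly, or mixing it with plain pointers, makes the checker warn.
 */
#define __rte_lcore __attribute__((noderef, address_space(__rte_lcore)))
#else
#define __rte_lcore
#endif

/* hypothetical usage: accessing *handle without going through the
 * access macros (which would cast) could then be flagged
 */
static int __rte_lcore *handle;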



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH 1/6] eal: add static per-lcore memory allocation facility
  2024-09-10 10:44                             ` Mattias Rönnblom
  2024-09-10 13:07                               ` Morten Brørup
@ 2024-09-10 15:55                               ` Stephen Hemminger
  1 sibling, 0 replies; 313+ messages in thread
From: Stephen Hemminger @ 2024-09-10 15:55 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Morten Brørup, Mattias Rönnblom, dev,
	Konstantin Ananyev, David Marchand, Nandini Persad

On Tue, 10 Sep 2024 12:44:49 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> "lcore" is just another word for "EAL thread." The lcore variables exist 
> in one instance for every thread with an lcore id, thus also for 
> registered non-EAL threads (i.e., threads which are not lcores).
> 
> I've tried to summarize the (very confusing) terminology of DPDK's 
> threading model here:
> https://ericsson.github.io/dataplanebook/threading/threading.html#eal-threads
> 
> So, in my world, "per-lcore id variables" is pretty accurate. You could 
> say "variables with per-lcore id values" if you want to make it even 
> clearer what's going on.

This is good and should be in DPDK documentation along with references
to other Intel/Arm documentation.

I don't see a glossary section in current documentation.
The issue goes deeper: there is no clear introduction in the current DPDK documentation.

My suggestion would be something similar to Fd.io VPP and other projects

	About DPDK
	- Introduction
	- Glossary
	- Supported platforms
	- Release notes
	- FAQ

	Getting started
	- Getting started on Linux
	...
	- Sample Applications

	Developer documentation
	- Programmer’s Guide
	- HowTo Guides
	- DPDK Tools User Guides
	- Testpmd Application User Guide
	- Drivers
	    - Network Interface
	    - Baseband
		...

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH 1/6] eal: add static per-lcore memory allocation facility
  2024-09-10  7:03                         ` [PATCH 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-10  9:32                           ` Morten Brørup
@ 2024-09-11 10:32                           ` Morten Brørup
  2024-09-11 15:05                             ` Mattias Rönnblom
  2024-09-11 17:04                           ` [PATCH v2 0/6] Lcore variables Mattias Rönnblom
  2 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-11 10:32 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Tyler Retzlaff

> +static void *lcore_buffer;
[...]
> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> +					     LCORE_BUFFER_SIZE);

Since lcore_buffer is never freed again, it is easy to support Windows:

#ifdef RTE_EXEC_ENV_WINDOWS
#include <malloc.h>
#endif

#ifndef RTE_EXEC_ENV_WINDOWS
lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
		LCORE_BUFFER_SIZE);
#else
/* Never freed again, so don't worry about _aligned_free(). */
lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
		RTE_CACHE_LINE_SIZE);
#endif

Ref:
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-malloc?view=msvc-170

NB: Note the opposite parameter order.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH 1/6] eal: add static per-lcore memory allocation facility
  2024-09-11 10:32                           ` Morten Brørup
@ 2024-09-11 15:05                             ` Mattias Rönnblom
  2024-09-11 15:07                               ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-11 15:05 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Tyler Retzlaff

On 2024-09-11 12:32, Morten Brørup wrote:
>> +static void *lcore_buffer;
> [...]
>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>> +					     LCORE_BUFFER_SIZE);
> 
> Since lcore_buffer is never freed again, it is easy to support Windows:
> 
> #ifdef RTE_EXEC_ENV_WINDOWS
> #include <malloc.h>
> #endif
> 
> #ifndef RTE_EXEC_ENV_WINDOWS
> lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> 		LCORE_BUFFER_SIZE);
> #else
> /* Never freed again, so don't worry about _aligned_free(). */

What is the reason for this comment? It seems like it addresses the 
Windows code path in particular.

> lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> 		RTE_CACHE_LINE_SIZE);
> #endif
> 
> Ref:
> https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/aligned-malloc?view=msvc-170
> 
> NB: Note the opposite parameter order.
> 

Thanks. I will add something like this.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH 1/6] eal: add static per-lcore memory allocation facility
  2024-09-11 15:05                             ` Mattias Rönnblom
@ 2024-09-11 15:07                               ` Morten Brørup
  0 siblings, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-09-11 15:07 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Tyler Retzlaff

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 11 September 2024 17.05
> 
> On 2024-09-11 12:32, Morten Brørup wrote:
> >> +static void *lcore_buffer;
> > [...]
> >> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> +					     LCORE_BUFFER_SIZE);
> >
> > Since lcore_buffer is never freed again, it is easy to support
> Windows:
> >
> > #ifdef RTE_EXEC_ENV_WINDOWS
> > #include <malloc.h>
> > #endif
> >
> > #ifndef RTE_EXEC_ENV_WINDOWS
> > lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> > 		LCORE_BUFFER_SIZE);
> > #else
> > /* Never freed again, so don't worry about _aligned_free(). */
> 
> What is the reason for this comment? It seems like it addresses the
> Windows code path in particular.

It is Windows specific.
Memory allocated with _aligned_malloc() cannot be freed with free(); it needs to be freed with _aligned_free().
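
To make the pairing explicit, here is a minimal sketch (not part of
the patch) of how the two allocation flavors and their matching
deallocators line up:

static void *
alloc_cache_aligned(size_t size)
{
#ifdef RTE_EXEC_ENV_WINDOWS
	/* note the reversed parameter order: size first, then alignment */
	return _aligned_malloc(size, RTE_CACHE_LINE_SIZE);
#else
	/* alignment comes first; C11 aligned_alloc() also expects the
	 * size to be a multiple of the alignment */
	return aligned_alloc(RTE_CACHE_LINE_SIZE, size);
#endif
}

static void
free_cache_aligned(void *p)
{
#ifdef RTE_EXEC_ENV_WINDOWS
	_aligned_free(p); /* free(p) is invalid for _aligned_malloc() memory */
#else
	free(p);
#endif
}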

> 
> > lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> > 		RTE_CACHE_LINE_SIZE);
> > #endif
> >
> > Ref:
> > https://learn.microsoft.com/en-us/cpp/c-runtime-
> library/reference/aligned-malloc?view=msvc-170
> >
> > NB: Note the opposite parameter order.
> >
> 
> Thanks. I will add something like this.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v2 0/6] Lcore variables
  2024-09-10  7:03                         ` [PATCH 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-10  9:32                           ` Morten Brørup
  2024-09-11 10:32                           ` Morten Brørup
@ 2024-09-11 17:04                           ` Mattias Rönnblom
  2024-09-11 17:04                             ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                               ` (5 more replies)
  2 siblings, 6 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-11 17:04 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
patch set, and to iron out some, but surely not all, wrinkles in the API.

The question of how best to allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean
and seemingly highly performant solution to a real problem.

Mattias Rönnblom (6):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable test suite
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                            |   6 +
 app/test/meson.build                   |   1 +
 app/test/test_lcore_var.c              | 432 +++++++++++++++++++++++++
 config/rte_config.h                    |   1 +
 doc/api/doxy-api-index.md              |   1 +
 doc/guides/rel_notes/release_24_11.rst |  14 +
 lib/eal/common/eal_common_lcore_var.c  |  78 +++++
 lib/eal/common/meson.build             |   1 +
 lib/eal/common/rte_random.c            |  28 +-
 lib/eal/common/rte_service.c           | 115 ++++---
 lib/eal/include/meson.build            |   1 +
 lib/eal/include/rte_lcore_var.h        | 385 ++++++++++++++++++++++
 lib/eal/version.map                    |   2 +
 lib/eal/x86/rte_power_intrinsics.c     |  17 +-
 lib/power/rte_power_pmd_mgmt.c         |  34 +-
 15 files changed, 1029 insertions(+), 87 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-11 17:04                           ` [PATCH v2 0/6] Lcore variables Mattias Rönnblom
@ 2024-09-11 17:04                             ` Mattias Rönnblom
  2024-09-12  2:33                               ` fengchengwen
                                                 ` (2 more replies)
  2024-09-11 17:04                             ` [PATCH v2 2/6] eal: add lcore variable test suite Mattias Rönnblom
                                               ` (4 subsequent siblings)
  5 siblings, 3 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-11 17:04 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but with the values' lifetime decoupled from that of
the threads.

Lcore variables are also similar to the functionality provided by the
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its, otherwise seemingly viable, approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
structs. The benefit of lcore variables over this approach is that
data related to the same lcore is now kept close (spatially, in
memory), rather than data used by the same module. This in turn
avoids excessive use of padding, which would pollute caches with
unused data.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is that there no longer exists a fixed
   upper bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                            |   6 +
 config/rte_config.h                    |   1 +
 doc/api/doxy-api-index.md              |   1 +
 doc/guides/rel_notes/release_24_11.rst |  14 +
 lib/eal/common/eal_common_lcore_var.c  |  78 +++++
 lib/eal/common/meson.build             |   1 +
 lib/eal/include/meson.build            |   1 +
 lib/eal/include/rte_lcore_var.h        | 385 +++++++++++++++++++++++++
 lib/eal/version.map                    |   2 +
 9 files changed, 489 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c5a703b5c0..362d9a3f28 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index dd7bb0d35b..311692e498 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..a3884f7491 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,20 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but with the values' lifetime decoupled from
+    that of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..309822039b
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
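+		/* Not enough room left in the current buffer; allocate
+		 * a new one. Any previous buffer is deliberately
+		 * leaked, since handles into it must stay valid for
+		 * the lifetime of the process.
+		 */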
+#ifdef RTE_EXEC_ENV_WINDOWS
+		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
+					       RTE_CACHE_LINE_SIZE);
+#else
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+#endif
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines, as
+	 * well as having the base pointer aligned on the cache line
+	 * size, assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..ec3ab714a8
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,385 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * instance for each current and future lcore id-equipped thread, with
+ * a total of RTE_MAX_LCORE instances. The value of an lcore variable
+ * for a particular lcore id is independent from other values (for
+ * other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
+ *     module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races between
+ * the owning thread and any non-owner threads accessing the same
+ * lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never allocates.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * The lcore variables are stored in a series of lcore buffers, which
+ * are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the
+ * case of an lcore variable instance, the thread most recently
+ * accessing nearby data structures should almost always be the lcore
+ * variable's owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to an
+ * even number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions; for example, next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between by using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of the owning thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. As a result, thread-local variables must be initialized in
+ *     a "lazy" manner (e.g., at the point of thread creation). Lcore
+ *     variables may be accessed immediately after having been
+ *     allocated (which may be before any thread beyond the main
+ *     thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable can be passed
+ *     to and successfully dereferenced by a non-owning thread) depends on
+ *     the details of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, such data sharing is supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
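+	/* The handle doubles as the address of the lcore id 0 value;
+	 * values for the other lcore ids are laid out at fixed
+	 * RTE_MAX_LCORE_VAR strides from that base.
+	 */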
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param value
+ *   A pointer successively set to point to the lcore variable value
+ *   corresponding to each lcore id (up to @c RTE_MAX_LCORE).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
+	for (unsigned int lcore_id =					\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal to
+ *   or less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..0c80bf7331 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,8 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v2 2/6] eal: add lcore variable test suite
  2024-09-11 17:04                           ` [PATCH v2 0/6] Lcore variables Mattias Rönnblom
  2024-09-11 17:04                             ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-11 17:04                             ` Mattias Rönnblom
  2024-09-12  7:35                               ` Jerin Jacob
  2024-09-11 17:04                             ` [PATCH v2 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
                                               ` (3 subsequent siblings)
  5 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-11 17:04 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Add test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..e07d13460f
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
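+/* Enough variables to force the allocator to spill over into at
+ * least one additional lcore buffer.
+ */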
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v2 3/6] random: keep PRNG state in lcore variable
  2024-09-11 17:04                           ` [PATCH v2 0/6] Lcore variables Mattias Rönnblom
  2024-09-11 17:04                             ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-11 17:04                             ` [PATCH v2 2/6] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-09-11 17:04                             ` Mattias Rönnblom
  2024-09-11 17:04                             ` [PATCH v2 4/6] power: keep per-lcore " Mattias Rönnblom
                                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-11 17:04 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v2 4/6] power: keep per-lcore state in lcore variable
  2024-09-11 17:04                           ` [PATCH v2 0/6] Lcore variables Mattias Rönnblom
                                               ` (2 preceding siblings ...)
  2024-09-11 17:04                             ` [PATCH v2 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-09-11 17:04                             ` Mattias Rönnblom
  2024-09-11 17:04                             ` [PATCH v2 5/6] service: " Mattias Rönnblom
  2024-09-11 17:04                             ` [PATCH v2 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-11 17:04 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Replace the static array of cache-aligned structs with an lcore
variable, slightly benefiting both code simplicity and performance.
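
In outline (a condensed sketch; the complete change is in the diff
below), per-lcore initialization moves from indexing a static array
to iterating over the lcore variable's values:

	/* before */
	for (i = 0; i < RTE_DIM(lcore_cfgs); i++)
		TAILQ_INIT(&lcore_cfgs[i].head);

	/* after: one iteration per possible lcore id */
	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
		TAILQ_INIT(&lcore_cfg->head);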

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a5139dd4f7 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v2 5/6] service: keep per-lcore state in lcore variable
  2024-09-11 17:04                           ` [PATCH v2 0/6] Lcore variables Mattias Rönnblom
                                               ` (3 preceding siblings ...)
  2024-09-11 17:04                             ` [PATCH v2 4/6] power: keep per-lcore " Mattias Rönnblom
@ 2024-09-11 17:04                             ` Mattias Rönnblom
  2024-09-11 17:04                             ` [PATCH v2 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-11 17:04 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Replace the static array of cache-aligned structs with an lcore
variable, slightly benefiting both code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 115 +++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 56379930b6..03379f1588 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,12 +102,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -122,7 +119,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +132,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +281,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +288,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +449,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +462,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +484,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +530,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +546,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +567,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +584,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +636,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +688,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +706,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +731,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +755,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +779,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +809,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +818,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +879,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +895,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +942,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +971,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +983,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1022,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v2 6/6] eal: keep per-lcore power intrinsics state in lcore variable
  2024-09-11 17:04                           ` [PATCH v2 0/6] Lcore variables Mattias Rönnblom
                                               ` (4 preceding siblings ...)
  2024-09-11 17:04                             ` [PATCH v2 5/6] service: " Mattias Rönnblom
@ 2024-09-11 17:04                             ` Mattias Rönnblom
  5 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-11 17:04 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Mattias Rönnblom

Keep per-lcore power intrinsics state in an lcore variable to reduce
the cache working set size and to avoid CPU next-line prefetching
causing false sharing.
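
The two access patterns in this file map directly onto the two
accessor macros (a condensed sketch; see the diff below):

	/* a thread operating on its own lcore's state, e.g. when
	 * entering a sleep
	 */
	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);

	/* one lcore operating on another lcore's state, by id,
	 * e.g. on wakeup
	 */
	struct power_wait_status *w =
		RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);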

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-11 17:04                             ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-12  2:33                               ` fengchengwen
  2024-09-12  5:35                                 ` Mattias Rönnblom
  2024-09-12  8:44                               ` [PATCH v3 0/7] Lcore variables Mattias Rönnblom
  2024-09-12  9:10                               ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Morten Brørup
  2 siblings, 1 reply; 313+ messages in thread
From: fengchengwen @ 2024-09-12  2:33 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand

On 2024/9/12 1:04, Mattias Rönnblom wrote:
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small, frequently-accessed data structures, for which one instance
> should exist for each lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' lifetime from that of the
> threads.
> 
> Lcore variables are also similar, in terms of functionality, to the
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoids excessive use of padding,
> polluting caches with unused data.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> --
> 
> PATCH v2:
>  * Add Windows support. (Morten Brørup)
>  * Fix lcore variables API index reference. (Morten Brørup)
>  * Various improvements of the API documentation. (Morten Brørup)
>  * Elimination of unused symbol in version.map. (Morten Brørup)

This history could be moved to the cover letter.

> 
> PATCH:
>  * Update MAINTAINERS and release notes.
>  * Stop covering included files in extern "C" {}.
> 
> RFC v6:
>  * Include <stdlib.h> to get aligned_alloc().
>  * Tweak documentation (grammar).
>  * Provide API-level guarantees that lcore variable values take on an
>    initial value of zero.
>  * Fix misplaced __rte_cache_aligned in the API doc example.
> 
> RFC v5:
>  * In Doxygen, consistently use @<cmd> (and not \<cmd>).
>  * The RTE_LCORE_VAR_GET() and SET() convenience access macros
>    covered an uncommon use case, where the lcore value is of a
>    primitive type, rather than a struct, and are thus eliminated
>    from the API. (Morten Brørup)
>  * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
>    to RTE_LCORE_VAR_VALUE().
>  * The underscores are removed from __rte_lcore_var_lcore_ptr() to
>    signal that this function is a part of the public API.
>  * Macro arguments are documented.
> 
> RFC v4:
>  * Replace large static array with libc heap-allocated memory. One
>    implication of this change is there no longer exists a fixed upper
>    bound for the total amount of memory used by lcore variables.
>    RTE_MAX_LCORE_VAR has changed meaning, and now represents the
>    maximum size of any individual lcore variable value.
>  * Fix issues in example. (Morten Brørup)
>  * Improve access macro type checking. (Morten Brørup)
>  * Refer to the lcore variable handle as "handle" and not "name" in
>    various macros.
>  * Document lack of thread safety in rte_lcore_var_alloc().
>  * Provide API-level assurance the lcore variable handle is
>    always non-NULL, to allow applications to use NULL to mean
>    "not yet allocated".
>  * Note zero-sized allocations are not allowed.
>  * Give API-level guarantee the lcore variable values are zeroed.
> 
> RFC v3:
>  * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>  * Update example to reflect FOREACH macro name change (in RFC v2).
> 
> RFC v2:
>  * Use alignof to derive alignment requirements. (Morten Brørup)
>  * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>    *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>  * Allow user-specified alignment, but limit max to cache line size.
> ---
>  MAINTAINERS                            |   6 +
>  config/rte_config.h                    |   1 +
>  doc/api/doxy-api-index.md              |   1 +
>  doc/guides/rel_notes/release_24_11.rst |  14 +
>  lib/eal/common/eal_common_lcore_var.c  |  78 +++++
>  lib/eal/common/meson.build             |   1 +
>  lib/eal/include/meson.build            |   1 +
>  lib/eal/include/rte_lcore_var.h        | 385 +++++++++++++++++++++++++
>  lib/eal/version.map                    |   2 +
>  9 files changed, 489 insertions(+)
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c5a703b5c0..362d9a3f28 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
>  F: lib/eal/common/rte_random.c
>  F: app/test/test_rand_perf.c
>  
> +Lcore Variables
> +M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> +F: lib/eal/include/rte_lcore_var.h
> +F: lib/eal/common/eal_common_lcore_var.c
> +F: app/test/test_lcore_var.c
> +
>  ARM v7
>  M: Wathsala Vithanage <wathsala.vithanage@arm.com>
>  F: config/arm/
> diff --git a/config/rte_config.h b/config/rte_config.h
> index dd7bb0d35b..311692e498 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -41,6 +41,7 @@
>  /* EAL defines */
>  #define RTE_CACHE_GUARD_LINES 1
>  #define RTE_MAX_HEAPS 32
> +#define RTE_MAX_LCORE_VAR 1048576
>  #define RTE_MAX_MEMSEG_LISTS 128
>  #define RTE_MAX_MEMSEG_PER_LIST 8192
>  #define RTE_MAX_MEM_MB_PER_LIST 32768
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index f9f0300126..ed577f14ee 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
>    [interrupts](@ref rte_interrupts.h),
>    [launch](@ref rte_launch.h),
>    [lcore](@ref rte_lcore.h),
> +  [lcore variables](@ref rte_lcore_var.h),
>    [per-lcore](@ref rte_per_lcore.h),
>    [service cores](@ref rte_service.h),
>    [keepalive](@ref rte_keepalive.h),
> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
> index 0ff70d9057..a3884f7491 100644
> --- a/doc/guides/rel_notes/release_24_11.rst
> +++ b/doc/guides/rel_notes/release_24_11.rst
> @@ -55,6 +55,20 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
>  
> +* **Added EAL per-lcore static memory allocation facility.**
> +
> +    Added EAL API <rte_lcore_var.h> for statically allocating small,
> +    frequently-accessed data structures, for which one instance should
> +    exist for each EAL thread and registered non-EAL thread.
> +
> +    With lcore variables, data is organized spatially on a per-lcore id
> +    basis, rather than per library or PMD, avoiding the need for cache
> +    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
> +    reduces CPU cache internal fragmentation, improving performance.
> +
> +    Lcore variables are similar to thread-local storage (TLS, e.g.,
> +    C11 _Thread_local), but decoupling the values' life time from that
> +    of the threads.
>  
>  Removed Items
>  -------------
> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
> new file mode 100644
> index 0000000000..309822039b
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,78 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +#include <stdlib.h>
> +
> +#ifdef RTE_EXEC_ENV_WINDOWS
> +#include <malloc.h>
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_log.h>
> +
> +#include <rte_lcore_var.h>
> +
> +#include "eal_private.h"
> +
> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> +
> +static void *lcore_buffer;
> +static size_t offset = RTE_MAX_LCORE_VAR;
> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +	void *handle;
> +	void *value;
> +
> +	offset = RTE_ALIGN_CEIL(offset, align);
> +
> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> +#ifdef RTE_EXEC_ENV_WINDOWS
> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> +					       RTE_CACHE_LINE_SIZE);
> +#else
> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> +					     LCORE_BUFFER_SIZE);
> +#endif
> +		RTE_VERIFY(lcore_buffer != NULL);
> +
> +		offset = 0;
> +	}
> +
> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
> +
> +	offset += size;
> +
> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
> +		memset(value, 0, size);
> +
> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> +		"%"PRIuPTR"-byte alignment", size, align);

Currently the data is allocated with libc malloc functions; I think that's mainly for the INIT macros, which will run before main().
But it will introduce the following problems:
1\ It can't benefit from huge pages. This patch may reserve many 1 MB regions, one per lcore; if we could place them in huge pages it would reduce the TLB miss rate, especially for frequently accessed data.
2\ It can't work across multiple processes. Much of the current lcore data doesn't support multi-process either, but I think it's worth doing, as it would help with service recovery when a secondary process fails and reboots.

...


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-12  2:33                               ` fengchengwen
@ 2024-09-12  5:35                                 ` Mattias Rönnblom
  2024-09-12  7:05                                   ` fengchengwen
  2024-09-12  7:28                                   ` Jerin Jacob
  0 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  5:35 UTC (permalink / raw)
  To: fengchengwen, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand

On 2024-09-12 04:33, fengchengwen wrote:
> On 2024/9/12 1:04, Mattias Rönnblom wrote:
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small, frequently-accessed data structures, for which one instance
>> should exist for each lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar, in terms of functionality, to the
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoids excessive use of padding,
>> polluting caches with unused data.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>
>> --
>>
>> PATCH v2:
>>   * Add Windows support. (Morten Brørup)
>>   * Fix lcore variables API index reference. (Morten Brørup)
>>   * Various improvements of the API documentation. (Morten Brørup)
>>   * Elimination of unused symbol in version.map. (Morten Brørup)
> 
> This history could be moved to the cover letter.
> 
>>
>> PATCH:
>>   * Update MAINTAINERS and release notes.
>>   * Stop covering included files in extern "C" {}.
>>
>> RFC v6:
>>   * Include <stdlib.h> to get aligned_alloc().
>>   * Tweak documentation (grammar).
>>   * Provide API-level guarantees that lcore variable values take on an
>>     initial value of zero.
>>   * Fix misplaced __rte_cache_aligned in the API doc example.
>>
>> RFC v5:
>>   * In Doxygen, consistently use @<cmd> (and not \<cmd>).
>>   * The RTE_LCORE_VAR_GET() and SET() convenience access macros
>>     covered an uncommon use case, where the lcore value is of a
>>     primitive type, rather than a struct, and are thus eliminated
>>     from the API. (Morten Brørup)
>>   * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
>>     to RTE_LCORE_VAR_VALUE().
>>   * The underscores are removed from __rte_lcore_var_lcore_ptr() to
>>     signal that this function is a part of the public API.
>>   * Macro arguments are documented.
>>
>> RFC v4:
>>   * Replace large static array with libc heap-allocated memory. One
>>     implication of this change is there no longer exists a fixed upper
>>     bound for the total amount of memory used by lcore variables.
>>     RTE_MAX_LCORE_VAR has changed meaning, and now represents the
>>     maximum size of any individual lcore variable value.
>>   * Fix issues in example. (Morten Brørup)
>>   * Improve access macro type checking. (Morten Brørup)
>>   * Refer to the lcore variable handle as "handle" and not "name" in
>>     various macros.
>>   * Document lack of thread safety in rte_lcore_var_alloc().
>>   * Provide API-level assurance the lcore variable handle is
>>     always non-NULL, to allow applications to use NULL to mean
>>     "not yet allocated".
>>   * Note zero-sized allocations are not allowed.
>>   * Give API-level guarantee the lcore variable values are zeroed.
>>
>> RFC v3:
>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>
>> RFC v2:
>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>   * Allow user-specified alignment, but limit max to cache line size.
>> ---
>>   MAINTAINERS                            |   6 +
>>   config/rte_config.h                    |   1 +
>>   doc/api/doxy-api-index.md              |   1 +
>>   doc/guides/rel_notes/release_24_11.rst |  14 +
>>   lib/eal/common/eal_common_lcore_var.c  |  78 +++++
>>   lib/eal/common/meson.build             |   1 +
>>   lib/eal/include/meson.build            |   1 +
>>   lib/eal/include/rte_lcore_var.h        | 385 +++++++++++++++++++++++++
>>   lib/eal/version.map                    |   2 +
>>   9 files changed, 489 insertions(+)
>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index c5a703b5c0..362d9a3f28 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
>>   F: lib/eal/common/rte_random.c
>>   F: app/test/test_rand_perf.c
>>   
>> +Lcore Variables
>> +M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> +F: lib/eal/include/rte_lcore_var.h
>> +F: lib/eal/common/eal_common_lcore_var.c
>> +F: app/test/test_lcore_var.c
>> +
>>   ARM v7
>>   M: Wathsala Vithanage <wathsala.vithanage@arm.com>
>>   F: config/arm/
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index dd7bb0d35b..311692e498 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -41,6 +41,7 @@
>>   /* EAL defines */
>>   #define RTE_CACHE_GUARD_LINES 1
>>   #define RTE_MAX_HEAPS 32
>> +#define RTE_MAX_LCORE_VAR 1048576
>>   #define RTE_MAX_MEMSEG_LISTS 128
>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index f9f0300126..ed577f14ee 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
>>     [interrupts](@ref rte_interrupts.h),
>>     [launch](@ref rte_launch.h),
>>     [lcore](@ref rte_lcore.h),
>> +  [lcore variables](@ref rte_lcore_var.h),
>>     [per-lcore](@ref rte_per_lcore.h),
>>     [service cores](@ref rte_service.h),
>>     [keepalive](@ref rte_keepalive.h),
>> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
>> index 0ff70d9057..a3884f7491 100644
>> --- a/doc/guides/rel_notes/release_24_11.rst
>> +++ b/doc/guides/rel_notes/release_24_11.rst
>> @@ -55,6 +55,20 @@ New Features
>>        Also, make sure to start the actual text at the margin.
>>        =======================================================
>>   
>> +* **Added EAL per-lcore static memory allocation facility.**
>> +
>> +    Added EAL API <rte_lcore_var.h> for statically allocating small,
>> +    frequently-accessed data structures, for which one instance should
>> +    exist for each EAL thread and registered non-EAL thread.
>> +
>> +    With lcore variables, data is organized spatially on a per-lcore id
>> +    basis, rather than per library or PMD, avoiding the need for cache
>> +    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
>> +    reduces CPU cache internal fragmentation, improving performance.
>> +
>> +    Lcore variables are similar to thread-local storage (TLS, e.g.,
>> +    C11 _Thread_local), but decoupling the values' life time from that
>> +    of the threads.
>>   
>>   Removed Items
>>   -------------
>> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
>> new file mode 100644
>> index 0000000000..309822039b
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,78 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +#include <stdlib.h>
>> +
>> +#ifdef RTE_EXEC_ENV_WINDOWS
>> +#include <malloc.h>
>> +#endif
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>> +
>> +static void *lcore_buffer;
>> +static size_t offset = RTE_MAX_LCORE_VAR;
>> +
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	void *handle;
>> +	void *value;
>> +
>> +	offset = RTE_ALIGN_CEIL(offset, align);
>> +
>> +	if (offset + size > RTE_MAX_LCORE_VAR) {
>> +#ifdef RTE_EXEC_ENV_WINDOWS
>> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
>> +					       RTE_CACHE_LINE_SIZE);
>> +#else
>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>> +					     LCORE_BUFFER_SIZE);
>> +#endif
>> +		RTE_VERIFY(lcore_buffer != NULL);
>> +
>> +		offset = 0;
>> +	}
>> +
>> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
>> +
>> +	offset += size;
>> +
>> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
>> +		memset(value, 0, size);
>> +
>> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>> +		"%"PRIuPTR"-byte alignment", size, align);
> 
> Currently the data is allocated with libc malloc functions; I think that's mainly for the INIT macros, which will run before main().
> But it will introduce the following problems:
> 1\ It can't benefit from huge pages. This patch may reserve many 1 MB regions, one per lcore; if we could place them in huge pages it would reduce the TLB miss rate, especially for frequently accessed data.

This mechanism is for small allocations, the sum of which is also
expected to be small (although the system won't break if they aren't).

If you have large allocations, you are better off using lazy huge page 
allocations further down the initialization process. Otherwise, you will 
end up using memory for RTE_MAX_LCORE instances, rather than the actual 
lcore count, which could be substantially smaller.
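
Something like the below, that is (a sketch only; module_init() and
struct big_state are made-up names, with rte_calloc() used much like
pre-patch rte_service.c does):

	#include <errno.h>
	#include <stdint.h>

	#include <rte_lcore.h>
	#include <rte_malloc.h>

	/* hypothetical module keeping large per-lcore state */
	struct big_state {
		uint64_t counters[8192];
	};

	static struct big_state *big_states;

	int
	module_init(void)
	{
		/* called after rte_eal_init(), so the allocation is
		 * served from DPDK (hugepage-backed) memory
		 */
		big_states = rte_calloc("big_states", RTE_MAX_LCORE,
				sizeof(struct big_state),
				RTE_CACHE_LINE_SIZE);
		return big_states != NULL ? 0 : -ENOMEM;
	}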

But sure, everything else being equal, you could have used huge pages 
for these lcore variable values. But everything isn't equal.

> 2\ It can't work across multiple processes. Much of the current lcore data doesn't support multi-process either, but I think it's worth doing, as it would help with service recovery when a secondary process fails and reboots.
> 
> ...
> 

Not sure I think that's a downside. Further cementing that anti-pattern 
into DPDK seems to be a bad idea to me.

Lcore variables don't *introduce* any of these issues, since the
mechanisms they're replacing also have these shortcomings (if you think
about them as such - I'm not sure I do).

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-12  5:35                                 ` Mattias Rönnblom
@ 2024-09-12  7:05                                   ` fengchengwen
  2024-09-12  7:28                                   ` Jerin Jacob
  1 sibling, 0 replies; 313+ messages in thread
From: fengchengwen @ 2024-09-12  7:05 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand

On 2024/9/12 13:35, Mattias Rönnblom wrote:
> On 2024-09-12 04:33, fengchengwen wrote:
>> On 2024/9/12 1:04, Mattias Rönnblom wrote:
>>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>>
>>> An lcore variable has one value for every current and future lcore
>>> id-equipped thread.
>>>
>>> The primary <rte_lcore_var.h> use case is for statically allocating
>>> small, frequently-accessed data structures, for which one instance
>>> should exist for each lcore.
>>>
>>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>>> _Thread_local), but decoupling the values' lifetime from that of the
>>> threads.
>>>
>>> Lcore variables are also similar, in terms of functionality, to the
>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>>> build-time machinery. DPCPU uses linker scripts, which effectively
>>> prevents the reuse of its, otherwise seemingly viable, approach.
>>>
>>> The currently-prevailing way to solve the same problem as lcore
>>> variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
>>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>>> lcore variables over this approach is that data related to the same
>>> lcore now is close (spatially, in memory), rather than data used by
>>> the same module, which in turn avoids excessive use of padding,
>>> polluting caches with unused data.
>>>
>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>>
>>> -- 
>>>
>>> PATCH v2:
>>>   * Add Windows support. (Morten Brørup)
>>>   * Fix lcore variables API index reference. (Morten Brørup)
>>>   * Various improvements of the API documentation. (Morten Brørup)
>>>   * Elimination of unused symbol in version.map. (Morten Brørup)
>>
>> This history could be moved to the cover letter.
>>
>>>
>>> PATCH:
>>>   * Update MAINTAINERS and release notes.
>>>   * Stop covering included files in extern "C" {}.
>>>
>>> RFC v6:
>>>   * Include <stdlib.h> to get aligned_alloc().
>>>   * Tweak documentation (grammar).
>>>   * Provide API-level guarantees that lcore variable values take on an
>>>     initial value of zero.
>>>   * Fix misplaced __rte_cache_aligned in the API doc example.
>>>
>>> RFC v5:
>>>   * In Doxygen, consistently use @<cmd> (and not \<cmd>).
>>>   * The RTE_LCORE_VAR_GET() and SET() convenience access macros
>>>     covered an uncommon use case, where the lcore value is of a
>>>     primitive type, rather than a struct, and are thus eliminated
>>>     from the API. (Morten Brørup)
>>>   * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
>>>     to RTE_LCORE_VAR_VALUE().
>>>   * The underscores are removed from __rte_lcore_var_lcore_ptr() to
>>>     signal that this function is a part of the public API.
>>>   * Macro arguments are documented.
>>>
>>> RFC v4:
>>>   * Replace large static array with libc heap-allocated memory. One
>>>     implication of this change is there no longer exists a fixed upper
>>>     bound for the total amount of memory used by lcore variables.
>>>     RTE_MAX_LCORE_VAR has changed meaning, and now represents the
>>>     maximum size of any individual lcore variable value.
>>>   * Fix issues in example. (Morten Brørup)
>>>   * Improve access macro type checking. (Morten Brørup)
>>>   * Refer to the lcore variable handle as "handle" and not "name" in
>>>     various macros.
>>>   * Document lack of thread safety in rte_lcore_var_alloc().
>>>   * Provide API-level assurance the lcore variable handle is
>>>     always non-NULL, to allow applications to use NULL to mean
>>>     "not yet allocated".
>>>   * Note zero-sized allocations are not allowed.
>>>   * Give API-level guarantee the lcore variable values are zeroed.
>>>
>>> RFC v3:
>>>   * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
>>>   * Update example to reflect FOREACH macro name change (in RFC v2).
>>>
>>> RFC v2:
>>>   * Use alignof to derive alignment requirements. (Morten Brørup)
>>>   * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
>>>     *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
>>>   * Allow user-specified alignment, but limit max to cache line size.
>>> ---
>>>   MAINTAINERS                            |   6 +
>>>   config/rte_config.h                    |   1 +
>>>   doc/api/doxy-api-index.md              |   1 +
>>>   doc/guides/rel_notes/release_24_11.rst |  14 +
>>>   lib/eal/common/eal_common_lcore_var.c  |  78 +++++
>>>   lib/eal/common/meson.build             |   1 +
>>>   lib/eal/include/meson.build            |   1 +
>>>   lib/eal/include/rte_lcore_var.h        | 385 +++++++++++++++++++++++++
>>>   lib/eal/version.map                    |   2 +
>>>   9 files changed, 489 insertions(+)
>>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index c5a703b5c0..362d9a3f28 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
>>>   F: lib/eal/common/rte_random.c
>>>   F: app/test/test_rand_perf.c
>>>   +Lcore Variables
>>> +M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>> +F: lib/eal/include/rte_lcore_var.h
>>> +F: lib/eal/common/eal_common_lcore_var.c
>>> +F: app/test/test_lcore_var.c
>>> +
>>>   ARM v7
>>>   M: Wathsala Vithanage <wathsala.vithanage@arm.com>
>>>   F: config/arm/
>>> diff --git a/config/rte_config.h b/config/rte_config.h
>>> index dd7bb0d35b..311692e498 100644
>>> --- a/config/rte_config.h
>>> +++ b/config/rte_config.h
>>> @@ -41,6 +41,7 @@
>>>   /* EAL defines */
>>>   #define RTE_CACHE_GUARD_LINES 1
>>>   #define RTE_MAX_HEAPS 32
>>> +#define RTE_MAX_LCORE_VAR 1048576
>>>   #define RTE_MAX_MEMSEG_LISTS 128
>>>   #define RTE_MAX_MEMSEG_PER_LIST 8192
>>>   #define RTE_MAX_MEM_MB_PER_LIST 32768
>>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>>> index f9f0300126..ed577f14ee 100644
>>> --- a/doc/api/doxy-api-index.md
>>> +++ b/doc/api/doxy-api-index.md
>>> @@ -99,6 +99,7 @@ The public API headers are grouped by topics:
>>>     [interrupts](@ref rte_interrupts.h),
>>>     [launch](@ref rte_launch.h),
>>>     [lcore](@ref rte_lcore.h),
>>> +  [lcore variables](@ref rte_lcore_var.h),
>>>     [per-lcore](@ref rte_per_lcore.h),
>>>     [service cores](@ref rte_service.h),
>>>     [keepalive](@ref rte_keepalive.h),
>>> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
>>> index 0ff70d9057..a3884f7491 100644
>>> --- a/doc/guides/rel_notes/release_24_11.rst
>>> +++ b/doc/guides/rel_notes/release_24_11.rst
>>> @@ -55,6 +55,20 @@ New Features
>>>        Also, make sure to start the actual text at the margin.
>>>        =======================================================
>>>   +* **Added EAL per-lcore static memory allocation facility.**
>>> +
>>> +    Added EAL API <rte_lcore_var.h> for statically allocating small,
>>> +    frequently-accessed data structures, for which one instance should
>>> +    exist for each EAL thread and registered non-EAL thread.
>>> +
>>> +    With lcore variables, data is organized spatially on a per-lcore id
>>> +    basis, rather than per library or PMD, avoiding the need for cache
>>> +    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
>>> +    reduces CPU cache internal fragmentation, improving performance.
>>> +
>>> +    Lcore variables are similar to thread-local storage (TLS, e.g.,
>>> +    C11 _Thread_local), but decoupling the values' life time from that
>>> +    of the threads.
>>>     Removed Items
>>>   -------------
>>> diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
>>> new file mode 100644
>>> index 0000000000..309822039b
>>> --- /dev/null
>>> +++ b/lib/eal/common/eal_common_lcore_var.c
>>> @@ -0,0 +1,78 @@
>>> +/* SPDX-License-Identifier: BSD-3-Clause
>>> + * Copyright(c) 2024 Ericsson AB
>>> + */
>>> +
>>> +#include <inttypes.h>
>>> +#include <stdlib.h>
>>> +
>>> +#ifdef RTE_EXEC_ENV_WINDOWS
>>> +#include <malloc.h>
>>> +#endif
>>> +
>>> +#include <rte_common.h>
>>> +#include <rte_debug.h>
>>> +#include <rte_log.h>
>>> +
>>> +#include <rte_lcore_var.h>
>>> +
>>> +#include "eal_private.h"
>>> +
>>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>>> +
>>> +static void *lcore_buffer;
>>> +static size_t offset = RTE_MAX_LCORE_VAR;
>>> +
>>> +static void *
>>> +lcore_var_alloc(size_t size, size_t align)
>>> +{
>>> +    void *handle;
>>> +    void *value;
>>> +
>>> +    offset = RTE_ALIGN_CEIL(offset, align);
>>> +
>>> +    if (offset + size > RTE_MAX_LCORE_VAR) {
>>> +#ifdef RTE_EXEC_ENV_WINDOWS
>>> +        lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
>>> +                           RTE_CACHE_LINE_SIZE);
>>> +#else
>>> +        lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>>> +                         LCORE_BUFFER_SIZE);
>>> +#endif
>>> +        RTE_VERIFY(lcore_buffer != NULL);
>>> +
>>> +        offset = 0;
>>> +    }
>>> +
>>> +    handle = RTE_PTR_ADD(lcore_buffer, offset);
>>> +
>>> +    offset += size;
>>> +
>>> +    RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
>>> +        memset(value, 0, size);
>>> +
>>> +    EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>>> +        "%"PRIuPTR"-byte alignment", size, align);
>>
>> Currently the data is allocated with libc malloc functions; I think that's mainly for the INIT macros, which will run before main().
>> But it will introduce the following problems:
>> 1\ It can't benefit from huge pages. This patch may reserve many 1 MB regions, one per lcore; if we could place them in huge pages it would reduce the TLB miss rate, especially for frequently accessed data.
> 
> This mechanism is for small allocations, the sum of which is also expected to be small (although the system won't break if they aren't).
> 
> If you have large allocations, you are better off using lazy huge page allocations further down the initialization process. Otherwise, you will end up using memory for RTE_MAX_LCORE instances, rather than the actual lcore count, which could be substantially smaller.

Yes, it may cost too much memory if allocated from hugepage memory.

> 
> But sure, everything else being equal, you could have used huge pages for these lcore variable values. But everything isn't equal.
> 
>> 2\ It can't work across multiple processes. Much of the current lcore data doesn't support multi-process either, but I think it's worth doing, as it would help with service recovery when a secondary process fails and reboots.
>>
>> ...
>>
> 
> Not sure I think that's a downside. Further cementing that anti-pattern into DPDK seems to be a bad idea to me.
> 
> Lcore variables don't *introduce* any of these issues, since the mechanisms they're replacing also have these shortcomings (if you think about them as such - I'm not sure I do).

Got it.

This feature is an enhancement of the current lcore data handling, which brings together scattered data from the point of view of a single core,
and currently it seems hard to extend it to support hugepage memory.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-12  5:35                                 ` Mattias Rönnblom
  2024-09-12  7:05                                   ` fengchengwen
@ 2024-09-12  7:28                                   ` Jerin Jacob
  1 sibling, 0 replies; 313+ messages in thread
From: Jerin Jacob @ 2024-09-12  7:28 UTC (permalink / raw)
  To: Mattias Rönnblom, Anatoly Burakov
  Cc: fengchengwen, Mattias Rönnblom, dev, Morten Brørup,
	Stephen Hemminger, Konstantin Ananyev, David Marchand

On Thu, Sep 12, 2024 at 11:05 AM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
> On 2024-09-12 04:33, fengchengwen wrote:
> > On 2024/9/12 1:04, Mattias Rönnblom wrote:
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> >>
> >> The primary <rte_lcore_var.h> use case is for statically allocating
> >> small, frequently-accessed data structures, for which one instance
> >> should exist for each lcore.
> >>
> >> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> >> _Thread_local), but decouple the values' lifetime from that of the
> >> threads.
> >>
> >> Lcore variables are also similar in terms of functionality to that
> >> provided by the FreeBSD kernel's DPCPU_*() family of macros and the
> >> associated build-time machinery. DPCPU uses linker scripts, which
> >> effectively prevents the reuse of its, otherwise seemingly viable,
> >> approach.
> >>
> >> The currently-prevailing way to solve the same problem as lcore
> >> variables is to keep a module's per-lcore data as an
> >> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
> >> structs. The benefit of lcore variables over this approach is that
> >> data related to the same lcore is now close (spatially, in memory),
> >> rather than data used by the same module, which in turn avoids the
> >> excessive use of padding that pollutes caches with unused data.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >>
> >> --
> >>

> >> +
> >> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> >> +
> >> +static void *lcore_buffer;
> >> +static size_t offset = RTE_MAX_LCORE_VAR;
> >> +
> >> +static void *
> >> +lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +    void *handle;
> >> +    void *value;
> >> +
> >> +    offset = RTE_ALIGN_CEIL(offset, align);
> >> +
> >> +    if (offset + size > RTE_MAX_LCORE_VAR) {
> >> +#ifdef RTE_EXEC_ENV_WINDOWS
> >> +            lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> >> +                                           RTE_CACHE_LINE_SIZE);
> >> +#else
> >> +            lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> +                                         LCORE_BUFFER_SIZE);
> >> +#endif
> >> +            RTE_VERIFY(lcore_buffer != NULL);
> >> +
> >> +            offset = 0;
> >> +    }
> >> +
> >> +    handle = RTE_PTR_ADD(lcore_buffer, offset);
> >> +
> >> +    offset += size;
> >> +
> >> +    RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
> >> +            memset(value, 0, size);
> >> +
> >> +    EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> >> +            "%"PRIuPTR"-byte alignment", size, align);
> >
> > Currently the data is malloc'ed by a libc function; I think that's mainly for the INIT macros, which run before main().
> > But it introduces the following problems:
> > 1\ It can't benefit from huge pages. This patch may reserve many 1 MB blocks, one for each lcore; placing the data in huge pages would reduce the TLB miss rate, especially for frequently accessed data.
>
> This mechanism is for small allocations, the sum of which is also
> expected to be small (although the system won't break if it isn't).
>
> If you have large allocations, you are better off using lazy huge page
> allocations further down the initialization process. Otherwise, you will
> end up using memory for RTE_MAX_LCORE instances, rather than the actual
> lcore count, which could be substantially smaller.

+ @Anatoly Burakov

If I am not wrong, the DPDK huge page memory allocator (rte_malloc())
may have overhead similar to glibc's. Meaning, a hugepage is allocated
only when needed, once existing space is exhausted.
If so, why not use rte_malloc() if available?
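
For context, a minimal sketch (with hypothetical foo_* names) of why the libc heap
is used: lcore variables are typically allocated from RTE_INIT() constructors, which
run before main(), and thus before rte_eal_init() has brought up the rte_malloc()
heap:

#include <rte_common.h>
#include <rte_lcore_var.h>

struct foo_state {
	int a;
};

static RTE_LCORE_VAR_HANDLE(struct foo_state, foo_states);

RTE_INIT(foo_init)
{
	/* runs at program load time; rte_malloc() is not yet usable,
	 * leaving the libc heap as the only option at this point
	 */
	RTE_LCORE_VAR_ALLOC(foo_states);
}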



>
> But sure, everything else being equal, you could have used huge pages
> for these lcore variable values. But everything isn't equal.
>
> > 2\ It can't work across multiple processes. Many of the current per-lcore data structures also don't support multi-process, but I think it's worth doing, as it would help with service recovery when a secondary process fails and reboots.
> >
> > ...
> >
>
> Not sure I think that's a downside. Further cementing that anti-pattern
> into DPDK seems to be a bad idea to me.
>
> Lcore variables don't *introduce* any of these issues, since the
> mechanisms they're replacing also have these shortcomings (if you think
> about them as such - I'm not sure I do).

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 2/6] eal: add lcore variable test suite
  2024-09-11 17:04                             ` [PATCH v2 2/6] eal: add lcore variable test suite Mattias Rönnblom
@ 2024-09-12  7:35                               ` Jerin Jacob
  2024-09-12  8:56                                 ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Jerin Jacob @ 2024-09-12  7:35 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand

On Wed, Sep 11, 2024 at 11:08 PM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> Add test suite to exercise the <rte_lcore_var.h> API.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>
> --
>
> RFC v5:
>  * Adapt tests to reflect the removal of the GET() and SET() macros.
>
> RFC v4:
>  * Check all lcore id's values for all variables in the many variables
>    test case.
>  * Introduce test case for max-sized lcore variables.
>
> RFC v2:
>  * Improve alignment-related test coverage.
> ---
>  app/test/meson.build      |   1 +
>  app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 433 insertions(+)
>  create mode 100644 app/test/test_lcore_var.c
>
> diff --git a/app/test/meson.build b/app/test/meson.build
> index e29258e6ec..48279522f0 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -103,6 +103,7 @@ source_file_deps = {
>      'test_ipsec_sad.c': ['ipsec'],
>      'test_kvargs.c': ['kvargs'],
>      'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
> +    'test_lcore_var.c': [],
>      'test_lcores.c': [],
>      'test_link_bonding.c': ['ethdev', 'net_bond',
> +}
> +
> +REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);

IMO, it would be good to add a perf test suite for these operations,
like other library calls have. It could be compared with TLS for the
same operation, so that end users can decide whether to use the scheme
based on their use case, and we get a performance test case to avoid
future regressions in this library.

It may not show any difference in numbers now, but once we have
self-monitoring performance counters[1], it can in the future.
[1]
https://patches.dpdk.org/project/dpdk/patch/20230201131757.1787527-1-tduszynski@marvell.com/




> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v3 0/7] Lcore variables
  2024-09-11 17:04                             ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-12  2:33                               ` fengchengwen
@ 2024-09-12  8:44                               ` Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                   ` (6 more replies)
  2024-09-12  9:10                               ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Morten Brørup
  2 siblings, 7 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  8:44 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how best to allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean,
and seemingly very performant solution to a real problem.

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                            |   6 +
 app/test/meson.build                   |   2 +
 app/test/test_lcore_var.c              | 432 +++++++++++++++++++++++++
 app/test/test_lcore_var_perf.c         | 160 +++++++++
 config/rte_config.h                    |   1 +
 doc/api/doxy-api-index.md              |   1 +
 doc/guides/rel_notes/release_24_11.rst |  14 +
 lib/eal/common/eal_common_lcore_var.c  |  78 +++++
 lib/eal/common/meson.build             |   1 +
 lib/eal/common/rte_random.c            |  28 +-
 lib/eal/common/rte_service.c           | 115 ++++---
 lib/eal/include/meson.build            |   1 +
 lib/eal/include/rte_lcore_var.h        | 385 ++++++++++++++++++++++
 lib/eal/version.map                    |   2 +
 lib/eal/x86/rte_power_intrinsics.c     |  17 +-
 lib/power/rte_power_pmd_mgmt.c         |  34 +-
 16 files changed, 1190 insertions(+), 87 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v3 1/7] eal: add static per-lcore memory allocation facility
  2024-09-12  8:44                               ` [PATCH v3 0/7] Lcore variables Mattias Rönnblom
@ 2024-09-12  8:44                                 ` Mattias Rönnblom
  2024-09-16 10:52                                   ` [PATCH v4 0/7] Lcore variables Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  8:44 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in terms of functionality to that
provided by the FreeBSD kernel's DPCPU_*() family of macros and the
associated build-time machinery. DPCPU uses linker scripts, which
effectively prevents the reuse of its, otherwise seemingly viable,
approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed structs.
The benefit of lcore variables over this approach is that data related
to the same lcore is now close (spatially, in memory), rather than
data used by the same module, which in turn avoids the excessive use
of padding that pollutes caches with unused data.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is that there no longer exists a fixed
   upper bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                            |   6 +
 config/rte_config.h                    |   1 +
 doc/api/doxy-api-index.md              |   1 +
 doc/guides/rel_notes/release_24_11.rst |  14 +
 lib/eal/common/eal_common_lcore_var.c  |  78 +++++
 lib/eal/common/meson.build             |   1 +
 lib/eal/include/meson.build            |   1 +
 lib/eal/include/rte_lcore_var.h        | 385 +++++++++++++++++++++++++
 lib/eal/version.map                    |   2 +
 9 files changed, 489 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c5a703b5c0..362d9a3f28 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index dd7bb0d35b..311692e498 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..a3884f7491 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,20 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..309822039b
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
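+	/* Current buffer exhausted, or none yet allocated: get a new
+	 * one. Old buffers are deliberately never freed, since the
+	 * values they hold must remain valid for the remaining
+	 * lifetime of the process.
+	 */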
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+#ifdef RTE_EXEC_ENV_WINDOWS
+		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
+					       RTE_CACHE_LINE_SIZE);
+#else
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+#endif
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size cache-line aligned, as well
+	 * as having the base pointer cache-line aligned, assures that
+	 * aligned offsets also translate to aligned pointers across
+	 * all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..ec3ab714a8
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,385 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * instance for each current and future lcore id-equipped thread, with
+ * a total of RTE_MAX_LCORE instances. The value of an lcore variable
+ * for a particular lcore id is independent from other values (for
+ * other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
+ *     module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races between
+ * the owning thread and any non-owner threads accessing the same
+ * lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never allocates.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * The lcore variables are stored in a series of lcore buffers, which
+ * are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the
+ * case of an lcore variable instance, the thread most recently
+ * accessing nearby data structures should almost always be the lcore
+ * variable's owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to an
+ * even number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly get more eager faster than
+ * they get more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions; a next-line prefetcher, for example, may well
+ * work the way its designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follows that of its particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. As a result, thread-local variables must be initialized in
+ *     a "lazy" manner (e.g., at the point of thread creation). Lcore
+ *     variables may be accessed immediately after having been
+ *     allocated (which may be before any thread other than the main
+ *     thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable can be passed
+ *     to and successfully dereferenced by a non-owning thread) depends on
+ *     the details of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, such data sharing is supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
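+	/* Values for a given variable are laid out with a stride of
+	 * RTE_MAX_LCORE_VAR bytes. The handle doubles as the address
+	 * of the lcore id 0 value.
+	 */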
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param value
+ *   A pointer successively set to point to lcore variable value
+ *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
+	for (unsigned int lcore_id =					\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..0c80bf7331 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,8 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v3 2/7] eal: add lcore variable functional tests
  2024-09-12  8:44                               ` [PATCH v3 0/7] Lcore variables Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-12  8:44                                 ` Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  8:44 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..e07d13460f
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-12  8:44                               ` [PATCH v3 0/7] Lcore variables Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-09-12  8:44                                 ` Mattias Rönnblom
  2024-09-12  9:39                                   ` Morten Brørup
  2024-09-12 13:09                                   ` Jerin Jacob
  2024-09-12  8:44                                 ` [PATCH v3 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  8:44 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than that of alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).
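
As a usage note: one way to run the benchmark (assuming a meson build
directory named "build") is via the test binary, e.g.:

  DPDK_TEST=lcore_var_perf_autotest ./build/app/test/dpdk-test

which prints the measured per-update latency for each of the three
access methods.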

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
 2 files changed, 161 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..ea1d7ba90b
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <stdio.h>
+
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+init(struct lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+update(struct lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+static RTE_DEFINE_PER_LCORE(struct lcore_state, tls_lcore_state);
+
+static void
+tls_init(void)
+{
+	init(&RTE_PER_LCORE(tls_lcore_state));
+}
+
+static __rte_noinline void
+tls_update(void)
+{
+	update(&RTE_PER_LCORE(tls_lcore_state));
+}
+
+struct __rte_cache_aligned lcore_state_aligned {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static struct lcore_state_aligned sarray_lcore_state[RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	struct lcore_state *state =
+		(struct lcore_state *)&sarray_lcore_state[rte_lcore_id()];
+
+	init(state);
+}
+
+static __rte_noinline void
+sarray_update(void)
+{
+	struct lcore_state *state =
+		(struct lcore_state *)&sarray_lcore_state[rte_lcore_id()];
+
+	update(state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct lcore_state, lvar_lcore_state);
+
+static void
+lvar_init(void)
+{
+	RTE_LCORE_VAR_ALLOC(lvar_lcore_state);
+
+	struct lcore_state *state = RTE_LCORE_VAR_VALUE(lvar_lcore_state);
+
+	init(state);
+}
+
+static __rte_noinline void
+lvar_update(void)
+{
+	struct lcore_state *state = RTE_LCORE_VAR_VALUE(lvar_lcore_state);
+
+	update(state);
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static double
+benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
+{
+	uint64_t i;
+	uint64_t start;
+	uint64_t end;
+	double latency;
+
+	init_fun();
+
+	start = rte_get_timer_cycles();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun();
+
+	end = rte_get_timer_cycles();
+
+	latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
+
+	return latency;
+}
+
+static int
+test_lcore_var_access(void)
+{
+	/* Note: the potential performance benefit of lcore variables
+	 * compared with thread-local storage or the use of statically
+	 * sized, lcore id-indexed arrays is not shorter latencies in
+	 * a scenario with low cache pressure, but rather fewer cache
+	 * misses in a real-world scenario, with extensive cache
+	 * usage. These tests just try to assure that the lcore
+	 * variable overhead is not significantly greater than that of
+	 * other alternatives, when the per-lcore data is in L1.
+	 */
+	double tls_latency;
+	double sarray_latency;
+	double lvar_latency;
+
+	tls_latency = benchmark_access_method(tls_init, tls_update);
+	sarray_latency = benchmark_access_method(sarray_init, sarray_update);
+	lvar_latency = benchmark_access_method(lvar_init, lvar_update);
+
+	printf("Latencies [ns/update]\n");
+	printf("Thread-local storage  Static array  Lcore variables\n");
+	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
+	       sarray_latency * 1e9, lvar_latency * 1e9);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v3 4/7] random: keep PRNG state in lcore variable
  2024-09-12  8:44                               ` [PATCH v3 0/7] Lcore variables Mattias Rönnblom
                                                   ` (2 preceding siblings ...)
  2024-09-12  8:44                                 ` [PATCH v3 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-09-12  8:44                                 ` Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 5/7] power: keep per-lcore " Mattias Rönnblom
                                                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  8:44 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v3 5/7] power: keep per-lcore state in lcore variable
  2024-09-12  8:44                               ` [PATCH v3 0/7] Lcore variables Mattias Rönnblom
                                                   ` (3 preceding siblings ...)
  2024-09-12  8:44                                 ` [PATCH v3 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-09-12  8:44                                 ` Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 6/7] service: " Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  8:44 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Replace the static array of cache-aligned structs with an lcore
variable, slightly improving code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a5139dd4f7 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v3 6/7] service: keep per-lcore state in lcore variable
  2024-09-12  8:44                               ` [PATCH v3 0/7] Lcore variables Mattias Rönnblom
                                                   ` (4 preceding siblings ...)
  2024-09-12  8:44                                 ` [PATCH v3 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-09-12  8:44                                 ` Mattias Rönnblom
  2024-09-12  8:44                                 ` [PATCH v3 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  8:44 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Replace the static array of cache-aligned structs with an lcore
variable, slightly improving code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 115 +++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 56379930b6..03379f1588 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,12 +102,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -122,7 +119,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +132,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +281,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +288,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +449,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +462,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +484,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +530,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +546,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +567,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +584,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +636,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +688,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +706,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +731,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +755,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +779,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +809,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +818,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +879,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +895,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +942,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +971,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +983,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1022,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v3 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-09-12  8:44                               ` [PATCH v3 0/7] Lcore variables Mattias Rönnblom
                                                   ` (5 preceding siblings ...)
  2024-09-12  8:44                                 ` [PATCH v3 6/7] service: " Mattias Rönnblom
@ 2024-09-12  8:44                                 ` Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  8:44 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Keep per-lcore power intrinsics state in an lcore variable to reduce
the cache working set size and avoid any CPU next-line prefetching
causing false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 2/6] eal: add lcore variable test suite
  2024-09-12  7:35                               ` Jerin Jacob
@ 2024-09-12  8:56                                 ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12  8:56 UTC (permalink / raw)
  To: Jerin Jacob, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand

On 2024-09-12 09:35, Jerin Jacob wrote:
> On Wed, Sep 11, 2024 at 11:08 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> Add test suite to exercise the <rte_lcore_var.h> API.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>
>> --
>>
>> RFC v5:
>>   * Adapt tests to reflect the removal of the GET() and SET() macros.
>>
>> RFC v4:
>>   * Check all lcore id's values for all variables in the many variables
>>     test case.
>>   * Introduce test case for max-sized lcore variables.
>>
>> RFC v2:
>>   * Improve alignment-related test coverage.
>> ---
>>   app/test/meson.build      |   1 +
>>   app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 433 insertions(+)
>>   create mode 100644 app/test/test_lcore_var.c
>>
>> diff --git a/app/test/meson.build b/app/test/meson.build
>> index e29258e6ec..48279522f0 100644
>> --- a/app/test/meson.build
>> +++ b/app/test/meson.build
>> @@ -103,6 +103,7 @@ source_file_deps = {
>>       'test_ipsec_sad.c': ['ipsec'],
>>       'test_kvargs.c': ['kvargs'],
>>       'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
>> +    'test_lcore_var.c': [],
>>       'test_lcores.c': [],
>>       'test_link_bonding.c': ['ethdev', 'net_bond',
>> +}
>> +
>> +REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
> 
> IMO, it would be good to add a perf test suite for these operations,
> like for other library calls. It may be compared with TLS on the same
> operation, so that end users can decide whether to use the scheme based
> on their use case, and we get a performance test case to avoid future
> regressions for this library.
> 

OK. I've added a micro benchmark.

> It may not show any difference in numbers now, but once we have
> self-monitoring performance counters [1], it may in the future.
> [1[]
> https://patches.dpdk.org/project/dpdk/patch/20230201131757.1787527-1-tduszynski@marvell.com/
> 
> 
> 
> 
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-11 17:04                             ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-12  2:33                               ` fengchengwen
  2024-09-12  8:44                               ` [PATCH v3 0/7] Lcore variables Mattias Rönnblom
@ 2024-09-12  9:10                               ` Morten Brørup
  2024-09-12 13:16                                 ` Jerin Jacob
  2 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-12  9:10 UTC (permalink / raw)
  To: Mattias Rönnblom, dev, Jerin Jacob, Chengwen Feng
  Cc: Mattias Rönnblom, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Anatoly Burakov

> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)

Considering hugepages...

Lcore variables may be allocated before DPDK's memory allocator (rte_malloc()) is ready, so rte_malloc() cannot be used for lcore variables.

And lcore variables are not usable (shared) for DPDK multi-process, so the lcore_buffer could be allocated through the O/S APIs as anonymous hugepages, instead of using rte_malloc().

The alternative, using rte_malloc(), would disallow allocating lcore variables before DPDK's memory allocator has been initialized, which I think is too late.

Anyway, hugepage backing is not a "must have" here, it is a "nice to have". It can be added to the lcore variables subsystem at a later time.


Here are some thoughts about optimizing for TLB entry usage...

If lcore variables use hugepages, and LCORE_BUFFER_SIZE matches the hugepage size (2 MB), all the lcore variables will only consume 1 hugepage TLB entry.
However, this may limit the max size of an lcore variable (RTE_MAX_LCORE_VAR) too much, if the system supports many lcores (RTE_MAX_LCORE).
E.g. with 1024 lcores, the max size of an lcore variable would be 2048 bytes.
And with 128 lcores, the max size of an lcore variable would be 16 KB.

So if we want to optimize for hugepage TLB entry usage, the question becomes: What is a reasonable max size of an lcore variable?

And although hugepage backing is only a "nice to have", the max size of an lcore variable (RTE_MAX_LCORE_VAR) is part of the API/ABI, so we should consider it now, if we want to optimize for hugepage TLB entry usage in the future.
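
For illustration, the arithmetic as a minimal sketch (HUGEPAGE_SIZE is
a hypothetical constant for this example, not an existing DPDK define):

#define HUGEPAGE_SIZE (2 * 1024 * 1024) /* 2 MB */

/* Largest RTE_MAX_LCORE_VAR for which LCORE_BUFFER_SIZE
 * (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE) still fits in a single 2 MB
 * hugepage, i.e., in one TLB entry:
 */
#define MAX_LCORE_VAR_ONE_TLB_ENTRY (HUGEPAGE_SIZE / RTE_MAX_LCORE)

/* RTE_MAX_LCORE == 1024  ->  2048 bytes per lcore variable
 * RTE_MAX_LCORE == 128   ->  16 KB per lcore variable
 */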


A few more comments below, not related to hugepages.

> +
> +static void *lcore_buffer;
> +static size_t offset = RTE_MAX_LCORE_VAR;
> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +	void *handle;
> +	void *value;
> +
> +	offset = RTE_ALIGN_CEIL(offset, align);
> +
> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> +#ifdef RTE_EXEC_ENV_WINDOWS
> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> +					       RTE_CACHE_LINE_SIZE);
> +#else
> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> +					     LCORE_BUFFER_SIZE);
> +#endif
> +		RTE_VERIFY(lcore_buffer != NULL);
> +
> +		offset = 0;
> +	}
> +
> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
> +
> +	offset += size;
> +
> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
> +		memset(value, 0, size);
> +
> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> +		"%"PRIuPTR"-byte alignment", size, align);
> +
> +	return handle;
> +}
> +
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +{
> +	/* Having the per-lcore buffer size aligned on cache lines,
> +	 * as well as having the base pointer aligned on cache line
> +	 * size, assures that aligned offsets also translate to aligned
> +	 * pointers across all values.
> +	 */
> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> +	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);

This specific RTE_ASSERT() should be upgraded to RTE_VERIFY(), so it is checked in non-debug builds too.
The code is slow path and not inline, and if this check doesn't pass, accessing the lcore variable will cause a buffer overrun. Prefer failing early.
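
I.e., a minimal sketch of the suggested one-line change:

-	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+	RTE_VERIFY(size <= RTE_MAX_LCORE_VAR);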

> +
> +	/* '0' means asking for worst-case alignment requirements */
> +	if (align == 0)
> +		align = alignof(max_align_t);
> +
> +	RTE_ASSERT(rte_is_power_of_2(align));
> +
> +	return lcore_var_alloc(size, align);
> +}


> +/**
> + * Allocate space in the per-lcore id buffers for an lcore variable.
> + *
> + * The pointer returned is only an opaque identifier of the variable. To
> + * get an actual pointer to a particular instance of the variable use
> + * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
> + *
> + * The lcore variable values' memory is set to zero.
> + *
> + * The allocation is always successful, barring a fatal exhaustion of
> + * the per-lcore id buffer space.
> + *
> + * rte_lcore_var_alloc() is not multi-thread safe.
> + *
> + * @param size
> + *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
> + * @param align
> + *   If 0, the values will be suitably aligned for any kind of type
> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
> + *   on a multiple of *align*, which must be a power of 2 and equal or
> + *   less than @c RTE_CACHE_LINE_SIZE.
> + * @return
> + *   The variable's handle, stored in a void pointer value. The value
> + *   is always non-NULL.
> + */
> +__rte_experimental

I don't know how useful these are, but consider adding:
#ifndef RTE_TOOLCHAIN_MSVC
__attribute__((malloc))
__attribute__((alloc_size(1)))
__attribute__((alloc_align(2)))
__attribute__((returns_nonnull))
#endif

> +void *
> +rte_lcore_var_alloc(size_t size, size_t align);


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-12  8:44                                 ` [PATCH v3 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-09-12  9:39                                   ` Morten Brørup
  2024-09-12 13:01                                     ` Mattias Rönnblom
  2024-09-12 13:09                                   ` Jerin Jacob
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-12  9:39 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

> +struct lcore_state {
> +	uint64_t a;
> +	uint64_t b;
> +	uint64_t sum;
> +};
> +
> +static __rte_always_inline void
> +update(struct lcore_state *state)
> +{
> +	state->sum += state->a * state->b;
> +}
> +
> +static RTE_DEFINE_PER_LCORE(struct lcore_state, tls_lcore_state);
> +
> +static __rte_noinline void
> +tls_update(void)
> +{
> +	update(&RTE_PER_LCORE(tls_lcore_state));

I would normally access TLS variables directly, not through a pointer, i.e.:

RTE_PER_LCORE(tls_lcore_state.sum) += RTE_PER_LCORE(tls_lcore_state.a) * RTE_PER_LCORE(tls_lcore_state.b);

On the other hand, then it wouldn't be 1:1 comparable to the two other test cases.

Besides, I expect the compiler to optimize away the indirect access, and produce the same output (as for the alternative implementation) anyway.

No change requested. Just noticing.

> +}
> +
> +struct __rte_cache_aligned lcore_state_aligned {
> +	uint64_t a;
> +	uint64_t b;
> +	uint64_t sum;

Please add RTE_CACHE_GUARD here, for 100 % matching the common design pattern.

> +};
> +
> +static struct lcore_state_aligned sarray_lcore_state[RTE_MAX_LCORE];
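
I.e., a sketch of the struct above with the guard added (illustration
only):

struct __rte_cache_aligned lcore_state_aligned {
	uint64_t a;
	uint64_t b;
	uint64_t sum;
	RTE_CACHE_GUARD;
};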


> +	printf("Latencies [ns/update]\n");
> +	printf("Thread-local storage  Static array  Lcore variables\n");
> +	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> +	       sarray_latency * 1e9, lvar_latency * 1e9);

I prefer cycles over ns. Perhaps you could show both?


With RTE_CACHE_GUARD added where mentioned,

Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-12  9:39                                   ` Morten Brørup
@ 2024-09-12 13:01                                     ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12 13:01 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

On 2024-09-12 11:39, Morten Brørup wrote:
>> +struct lcore_state {
>> +	uint64_t a;
>> +	uint64_t b;
>> +	uint64_t sum;
>> +};
>> +
>> +static __rte_always_inline void
>> +update(struct lcore_state *state)
>> +{
>> +	state->sum += state->a * state->b;
>> +}
>> +
>> +static RTE_DEFINE_PER_LCORE(struct lcore_state, tls_lcore_state);
>> +
>> +static __rte_noinline void
>> +tls_update(void)
>> +{
>> +	update(&RTE_PER_LCORE(tls_lcore_state));
> 
> I would normally access TLS variables directly, not through a pointer, i.e.:
> 
> RTE_PER_LCORE(tls_lcore_state.sum) += RTE_PER_LCORE(tls_lcore_state.a) * RTE_PER_LCORE(tls_lcore_state.b);
> 
> On the other hand, then it wouldn't be 1:1 comparable to the two other test cases.
> 
> Besides, I expect the compiler to optimize away the indirect access, and produce the same output (as for the alternative implementation) anyway.
> 
> No change requested. Just noticing.
> 
>> +}
>> +
>> +struct __rte_cache_aligned lcore_state_aligned {
>> +	uint64_t a;
>> +	uint64_t b;
>> +	uint64_t sum;
> 
> Please add RTE_CACHE_GUARD here, for 100 % matching the common design pattern.
> 

Will do.

>> +};
>> +
>> +static struct lcore_state_aligned sarray_lcore_state[RTE_MAX_LCORE];
> 
> 
>> +	printf("Latencies [ns/update]\n");
>> +	printf("Thread-local storage  Static array  Lcore variables\n");
>> +	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
>> +	       sarray_latency * 1e9, lvar_latency * 1e9);
> 
> I prefer cycles over ns. Perhaps you could show both?
> 

That makes you an x86 guy. :) Since only on x86 do those cycles make
any sense.

I didn't want to use cycles since it would be a very small value on 
certain (e.g., old ARM) platforms.

But, elsewhere in the perf tests TSC cycles are used, so maybe I should 
switch to using those nevertheless.

> 
> With RTE_CACHE_GUARD added where mentioned,
> 
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-12  8:44                                 ` [PATCH v3 3/7] eal: add lcore variable performance test Mattias Rönnblom
  2024-09-12  9:39                                   ` Morten Brørup
@ 2024-09-12 13:09                                   ` Jerin Jacob
  2024-09-12 13:20                                     ` Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Jerin Jacob @ 2024-09-12 13:09 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob

On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> Add basic micro benchmark for lcore variables, in an attempt to assure
> that the overhead isn't significantly greater than alternative
> approaches, in scenarios where the benefits aren't expected to show up
> (i.e., when plenty of cache is available compared to the working set
> size of the per-lcore data).
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  app/test/meson.build           |   1 +
>  app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
>  2 files changed, 161 insertions(+)
>  create mode 100644 app/test/test_lcore_var_perf.c


> +static double
> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> +{
> +       uint64_t i;
> +       uint64_t start;
> +       uint64_t end;
> +       double latency;
> +
> +       init_fun();
> +
> +       start = rte_get_timer_cycles();
> +
> +       for (i = 0; i < ITERATIONS; i++)
> +               update_fun();
> +
> +       end = rte_get_timer_cycles();

Use precise variant. rte_rdtsc_precise() or so to be accurate

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-12  9:10                               ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Morten Brørup
@ 2024-09-12 13:16                                 ` Jerin Jacob
  2024-09-12 13:41                                   ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Jerin Jacob @ 2024-09-12 13:16 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, dev, Chengwen Feng, Mattias Rönnblom,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Anatoly Burakov

On Thu, Sep 12, 2024 at 2:40 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>
> Considering hugepages...
>
> Lcore variables may be allocated before DPDK's memory allocator (rte_malloc()) is ready, so rte_malloc() cannot be used for lcore variables.
>
> And lcore variables are not usable (shared) for DPDK multi-process, so the lcore_buffer could be allocated through the O/S APIs as anonymous hugepages, instead of using rte_malloc().
>
> The alternative, using rte_malloc(), would disallow allocating lcore variables before DPDK's memory allocator has been initialized, which I think is too late.

I thought it is not. A lot of the subsystems are initialized after the
memory subsystem is initialized.
[1] is the example given in the documentation. I thought RTE_INIT needs
to be replaced if the subsystem is initialized after memory is
initialized (which is the case for most of the libraries).
Trace library had a similar situation. It is managed like [2]



[1]
 * struct foo_lcore_state {
 *         int a;
 *         long b;
 * };
 *
 * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
 *
 * long foo_get_a_plus_b(void)
 * {
 *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
 *
 *         return state->a + state->b;
 * }
 *
 * RTE_INIT(rte_foo_init)
 * {
 *         RTE_LCORE_VAR_ALLOC(lcore_states);
 *
 *         struct foo_lcore_state *state;
 *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
 *                 (initialize 'state')
 *         }
 *
 *         (other initialization)
 * }


[2]


        /* First attempt from huge page */
        header = eal_malloc_no_trace(NULL, trace_mem_sz(trace->buff_len), 8);
        if (header) {
                trace->lcore_meta[count].area = TRACE_AREA_HUGEPAGE;
                goto found;
        }

        /* Second attempt from heap */
        header = malloc(trace_mem_sz(trace->buff_len));
        if (header == NULL) {
                trace_crit("trace mem malloc attempt failed");
                header = NULL;
                goto fail;

        }

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-12 13:09                                   ` Jerin Jacob
@ 2024-09-12 13:20                                     ` Mattias Rönnblom
  2024-09-12 15:11                                       ` Jerin Jacob
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-12 13:20 UTC (permalink / raw)
  To: Jerin Jacob, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob

On 2024-09-12 15:09, Jerin Jacob wrote:
> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> Add basic micro benchmark for lcore variables, in an attempt to assure
>> that the overhead isn't significantly greater than alternative
>> approaches, in scenarios where the benefits aren't expected to show up
>> (i.e., when plenty of cache is available compared to the working set
>> size of the per-lcore data).
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   app/test/meson.build           |   1 +
>>   app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
>>   2 files changed, 161 insertions(+)
>>   create mode 100644 app/test/test_lcore_var_perf.c
> 
> 
>> +static double
>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>> +{
>> +       uint64_t i;
>> +       uint64_t start;
>> +       uint64_t end;
>> +       double latency;
>> +
>> +       init_fun();
>> +
>> +       start = rte_get_timer_cycles();
>> +
>> +       for (i = 0; i < ITERATIONS; i++)
>> +               update_fun();
>> +
>> +       end = rte_get_timer_cycles();
> 
> Use precise variant. rte_rdtsc_precise() or so to be accurate

With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-12 13:16                                 ` Jerin Jacob
@ 2024-09-12 13:41                                   ` Morten Brørup
  2024-09-12 15:22                                     ` Jerin Jacob
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-12 13:41 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Mattias Rönnblom, dev, Chengwen Feng, Mattias Rönnblom,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Anatoly Burakov

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Thursday, 12 September 2024 15.17
> 
> On Thu, Sep 12, 2024 at 2:40 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> >
> > Considering hugepages...
> >
> > Lcore variables may be allocated before DPDK's memory allocator
> (rte_malloc()) is ready, so rte_malloc() cannot be used for lcore variables.
> >
> > And lcore variables are not usable (shared) for DPDK multi-process, so the
> lcore_buffer could be allocated through the O/S APIs as anonymous hugepages,
> instead of using rte_malloc().
> >
> > The alternative, using rte_malloc(), would disallow allocating lcore
> variables before DPDK's memory allocator has been initialized, which I think
> is too late.
> 
> I thought it is not. A lot of the subsystems are initialized after the
> memory subsystem is initialized.
> [1] is the example given in the documentation. I thought RTE_INIT needs
> to be replaced if the subsystem is initialized after memory is
> initialized (which is the case for most of the libraries).

The RTE_INIT functions are called before main(). That is not very useful here.

Yes, it would be good to replace (or supplement) RTE_INIT_PRIO by something similar, which calls the list of "INIT" functions at the appropriate time during EAL initialization.

DPDK should then use this "INIT" list for all its initialization, so the init function of new features (such as this, and trace) can be inserted at the correct location in the list.
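
A hypothetical sketch of what such a mechanism could look like (nothing
like this exists in EAL today; all names are made up):

#include <rte_common.h>

typedef void (*eal_late_init_cb_t)(void);

static eal_late_init_cb_t eal_late_init_cbs[256];
static unsigned int eal_late_init_cb_count;

/* Register the callback at load time, but run it later. */
#define RTE_EAL_LATE_INIT(fn)						\
	static void fn(void);						\
	RTE_INIT(fn ## _register)					\
	{								\
		eal_late_init_cbs[eal_late_init_cb_count++] = fn;	\
	}								\
	static void fn(void)

/* Called from rte_eal_init(), after the memory subsystem is up: */
static void
eal_run_late_init(void)
{
	unsigned int i;

	for (i = 0; i < eal_late_init_cb_count; i++)
		eal_late_init_cbs[i]();
}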

> Trace library had a similar situation. It is managed like [2]

Yes, if we insist on using rte_malloc() for lcore variables, the alternative is to prohibit establishing lcore variables in functions called through RTE_INIT.
Although I don't like this alternative, it might be viable.

> 
> 
> 
> [1]
>  * struct foo_lcore_state {
>  *         int a;
>  *         long b;
>  * };
>  *
>  * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
>  *
>  * long foo_get_a_plus_b(void)
>  * {
>  *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
>  *
>  *         return state->a + state->b;
>  * }
>  *
>  * RTE_INIT(rte_foo_init)
>  * {
>  *         RTE_LCORE_VAR_ALLOC(lcore_states);
>  *
>  *         struct foo_lcore_state *state;
>  *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
>  *                 (initialize 'state')
>  *         }
>  *
>  *         (other initialization)
>  * }
> 
> 
> [2]
> 
> 
>         /* First attempt from huge page */
>         header = eal_malloc_no_trace(NULL, trace_mem_sz(trace->buff_len), 8);
>         if (header) {
>                 trace->lcore_meta[count].area = TRACE_AREA_HUGEPAGE;
>                 goto found;
>         }
> 
>         /* Second attempt from heap */
>         header = malloc(trace_mem_sz(trace->buff_len));
>         if (header == NULL) {
>                 trace_crit("trace mem malloc attempt failed");
>                 header = NULL;
>                 goto fail;
> 
>         }

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-12 13:20                                     ` Mattias Rönnblom
@ 2024-09-12 15:11                                       ` Jerin Jacob
  2024-09-13  6:47                                         ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Jerin Jacob @ 2024-09-12 15:11 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Mattias Rönnblom, dev, Morten Brørup,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
> On 2024-09-12 15:09, Jerin Jacob wrote:
> > On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > <mattias.ronnblom@ericsson.com> wrote:
> >>
> >> Add basic micro benchmark for lcore variables, in an attempt to assure
> >> that the overhead isn't significantly greater than alternative
> >> approaches, in scenarios where the benefits aren't expected to show up
> >> (i.e., when plenty of cache is available compared to the working set
> >> size of the per-lcore data).
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> ---
> >>   app/test/meson.build           |   1 +
> >>   app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
> >>   2 files changed, 161 insertions(+)
> >>   create mode 100644 app/test/test_lcore_var_perf.c
> >
> >
> >> +static double
> >> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >> +{
> >> +       uint64_t i;
> >> +       uint64_t start;
> >> +       uint64_t end;
> >> +       double latency;
> >> +
> >> +       init_fun();
> >> +
> >> +       start = rte_get_timer_cycles();
> >> +
> >> +       for (i = 0; i < ITERATIONS; i++)
> >> +               update_fun();
> >> +
> >> +       end = rte_get_timer_cycles();
> >
> > Use precise variant. rte_rdtsc_precise() or so to be accurate
>
> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.

I was thinking of it another way: with 1e7 iterations, the additional
barrier of the precise variant will be amortized, and we get more
_deterministic_ behavior, especially if we print cycles and if we need
to catch regressions.
Furthermore, you may consider replacing rte_random() in the fast path
with a running number or so, if it is not deterministic in cycle
computation.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-12 13:41                                   ` Morten Brørup
@ 2024-09-12 15:22                                     ` Jerin Jacob
  2024-09-18 10:11                                       ` Jerin Jacob
  0 siblings, 1 reply; 313+ messages in thread
From: Jerin Jacob @ 2024-09-12 15:22 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, dev, Chengwen Feng, Mattias Rönnblom,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Anatoly Burakov

On Thu, Sep 12, 2024 at 7:11 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > Sent: Thursday, 12 September 2024 15.17
> >
> > On Thu, Sep 12, 2024 at 2:40 PM Morten Brørup <mb@smartsharesystems.com>
> > wrote:
> > >
> > > > +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> > >
> > > Considering hugepages...
> > >
> > > Lcore variables may be allocated before DPDK's memory allocator
> > (rte_malloc()) is ready, so rte_malloc() cannot be used for lcore variables.
> > >
> > > And lcore variables are not usable (shared) for DPDK multi-process, so the
> > lcore_buffer could be allocated through the O/S APIs as anonymous hugepages,
> > instead of using rte_malloc().
> > >
> > > The alternative, using rte_malloc(), would disallow allocating lcore
> > variables before DPDK's memory allocator has been initialized, which I think
> > is too late.
> >
> > I thought it is not. A lot of the subsystems are initialized after the
> > memory subsystem is initialized.
> > [1] is the example given in the documentation. I thought RTE_INIT needs
> > to be replaced if the subsystem is initialized after memory is
> > initialized (which is the case for most of the libraries).
>
> The list of RTE_INIT functions are called before main(). It is not very useful.
>
> Yes, it would be good to replace (or supplement) RTE_INIT_PRIO by something similar, which calls the list of "INIT" functions at the appropriate time during EAL initialization.
>
> DPDK should then use this "INIT" list for all its initialization, so the init function of new features (such as this, and trace) can be inserted at the correct location in the list.
>
> > Trace library had a similar situation. It is managed like [2]
>
> Yes, if we insist on using rte_malloc() for lcore variables, the alternative is to prohibit establishing lcore variables in functions called through RTE_INIT.

I was not insisting on using ONLY rte_malloc(), since rte_malloc() can
be called before rte_eal_init() (it will return NULL). The alloc routine
can first check whether rte_malloc() is available, and if not, switch
over to glibc.
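
E.g., a minimal sketch of such a fallback, assuming (as said) that
rte_malloc() returns NULL before the memory subsystem is up, and that
LCORE_BUFFER_SIZE is a multiple of RTE_CACHE_LINE_SIZE (as
aligned_alloc() requires the size to be a multiple of the alignment):

#include <stdlib.h>

#include <rte_malloc.h>

static void *
lcore_buffer_alloc(void)
{
	void *buf = rte_malloc(NULL, LCORE_BUFFER_SIZE,
			       RTE_CACHE_LINE_SIZE);

	if (buf == NULL) /* memory subsystem not yet initialized */
		buf = aligned_alloc(RTE_CACHE_LINE_SIZE, LCORE_BUFFER_SIZE);

	return buf;
}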

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-12 15:11                                       ` Jerin Jacob
@ 2024-09-13  6:47                                         ` Mattias Rönnblom
  2024-09-13 11:23                                           ` Jerin Jacob
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-13  6:47 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Mattias Rönnblom, dev, Morten Brørup,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

On 2024-09-12 17:11, Jerin Jacob wrote:
> On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>
>> On 2024-09-12 15:09, Jerin Jacob wrote:
>>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
>>> <mattias.ronnblom@ericsson.com> wrote:
>>>>
>>>> Add basic micro benchmark for lcore variables, in an attempt to assure
>>>> that the overhead isn't significantly greater than alternative
>>>> approaches, in scenarios where the benefits aren't expected to show up
>>>> (i.e., when plenty of cache is available compared to the working set
>>>> size of the per-lcore data).
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>>    app/test/meson.build           |   1 +
>>>>    app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
>>>>    2 files changed, 161 insertions(+)
>>>>    create mode 100644 app/test/test_lcore_var_perf.c
>>>
>>>
>>>> +static double
>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>>>> +{
>>>> +       uint64_t i;
>>>> +       uint64_t start;
>>>> +       uint64_t end;
>>>> +       double latency;
>>>> +
>>>> +       init_fun();
>>>> +
>>>> +       start = rte_get_timer_cycles();
>>>> +
>>>> +       for (i = 0; i < ITERATIONS; i++)
>>>> +               update_fun();
>>>> +
>>>> +       end = rte_get_timer_cycles();
>>>
>>> Use precise variant. rte_rdtsc_precise() or so to be accurate
>>
>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> 
> I was thinking of it another way: with 1e7 iterations, the additional
> barrier of the precise variant will be amortized, and we get more
> _deterministic_ behavior, especially if we print cycles and if we need
> to catch regressions.

If you time a section of code which spends ~40000000 cycles, it doesn't 
matter if you add or remove a few cycles at the beginning and the end.

The rte_rdtsc_precise() is both better (more precise in the sense of 
more serialization), and worse (because it's more costly, and thus more 
intrusive).

You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It 
doesn't matter.
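
To put rough numbers on it: at, say, ~4 cycles per update, 1e7
iterations amount to ~4e7 cycles in total, while the start/stop timing
overhead is on the order of tens of cycles, i.e., a relative error well
below 0.0001%.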

> Furthermore, you may consider replacing rte_random() in the fast path
> with a running number or so, if it is not deterministic in cycle
> computation.

rte_rand() is not used in the fast path. I don't understand what you 
mean by "running number".

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-13  6:47                                         ` Mattias Rönnblom
@ 2024-09-13 11:23                                           ` Jerin Jacob
  2024-09-13 14:40                                             ` Morten Brørup
  2024-09-16 10:50                                             ` Mattias Rönnblom
  0 siblings, 2 replies; 313+ messages in thread
From: Jerin Jacob @ 2024-09-13 11:23 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Mattias Rönnblom, dev, Morten Brørup,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
> On 2024-09-12 17:11, Jerin Jacob wrote:
> > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >>
> >> On 2024-09-12 15:09, Jerin Jacob wrote:
> >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> >>> <mattias.ronnblom@ericsson.com> wrote:
> >>>>
> >>>> Add basic micro benchmark for lcore variables, in an attempt to assure
> >>>> that the overhead isn't significantly greater than alternative
> >>>> approaches, in scenarios where the benefits aren't expected to show up
> >>>> (i.e., when plenty of cache is available compared to the working set
> >>>> size of the per-lcore data).
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>> ---
> >>>>    app/test/meson.build           |   1 +
> >>>>    app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
> >>>>    2 files changed, 161 insertions(+)
> >>>>    create mode 100644 app/test/test_lcore_var_perf.c
> >>>
> >>>
> >>>> +static double
> >>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >>>> +{
> >>>> +       uint64_t i;
> >>>> +       uint64_t start;
> >>>> +       uint64_t end;
> >>>> +       double latency;
> >>>> +
> >>>> +       init_fun();
> >>>> +
> >>>> +       start = rte_get_timer_cycles();
> >>>> +
> >>>> +       for (i = 0; i < ITERATIONS; i++)
> >>>> +               update_fun();
> >>>> +
> >>>> +       end = rte_get_timer_cycles();
> >>>
> >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> >>
> >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> >
> > I was thinking of it another way: with 1e7 iterations, the additional
> > barrier of the precise variant will be amortized, and we get more
> > _deterministic_ behavior, especially if we print cycles and if we need
> > to catch regressions.
>
> If you time a section of code which spends ~40000000 cycles, it doesn't
> matter if you add or remove a few cycles at the beginning and the end.
>
> The rte_rdtsc_precise() is both better (more precise in the sense of
> more serialization), and worse (because it's more costly, and thus more
> intrusive).

We can calibrate the overhead to remove the cost.

>
> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> doesn't matter.

Yes. In this setup and it is pretty inaccurate PER iteration. Please
refer to the below patch to see the difference.

Patch 1: Make nanoseconds to cycles per iteration
------------------------------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..b8d25400f593 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
void (*update_fun)(void))

        end = rte_get_timer_cycles();

-       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
+       latency = ((end - start)) / ITERATIONS;

        return latency;
 }
@@ -137,8 +137,7 @@ test_lcore_var_access(void)

-       printf("Latencies [ns/update]\n");
+       printf("Latencies [cycles/update]\n");
        printf("Thread-local storage  Static array  Lcore variables\n");
-       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-              sarray_latency * 1e9, lvar_latency * 1e9);
+       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
lvar_latency);

        return TEST_SUCCESS;
 }


Patch 2: Change to precise with calibration
-----------------------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..8142ecd56241 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -96,23 +96,28 @@ lvar_update(void)
 static double
 benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
 {
-       uint64_t i;
+       double tsc_latency;
+       double latency;
        uint64_t start;
        uint64_t end;
-       double latency;
+       uint64_t i;

-       init_fun();
+       /* calculate rte_rdtsc_precise overhead */
+       start = rte_rdtsc_precise();
+       end = rte_rdtsc_precise();
+       tsc_latency = (end - start);

-       start = rte_get_timer_cycles();
+       init_fun();

-       for (i = 0; i < ITERATIONS; i++)
+       latency = 0;
+       for (i = 0; i < ITERATIONS; i++) {
+               start = rte_rdtsc_precise();
                update_fun();
+               end = rte_rdtsc_precise();
+               latency += (end - start) - tsc_latency;
+       }

-       end = rte_get_timer_cycles();
-
-       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
-
-       return latency;
+       return latency / (double)ITERATIONS;
 }

 static int
@@ -135,10 +140,9 @@ test_lcore_var_access(void)
        sarray_latency = benchmark_access_method(sarray_init, sarray_update);
        lvar_latency = benchmark_access_method(lvar_init, lvar_update);

-       printf("Latencies [ns/update]\n");
+       printf("Latencies [cycles/update]\n");
        printf("Thread-local storage  Static array  Lcore variables\n");
-       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-              sarray_latency * 1e9, lvar_latency * 1e9);
+       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
lvar_latency);

        return TEST_SUCCESS;
 }

ARM N2 core with patch 1(aka current scheme)
-----------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 7.0           7.0              7.0


ARM N2 core with patch 2
-----------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                11.4          15.5             15.5

x86 i9 core with patch 1(aka current scheme)
------------------------------------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [ns/update]
Thread-local storage  Static array  Lcore variables
                 5.0           6.0              6.0

x86 i9 core with patch 2
--------------------------------
 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 5.3          10.6             11.7





>
> > Furthermore, you may consider replacing rte_random() in fast path to
> > running number or so if it is not deterministic in cycle computation.
>
> rte_rand() is not used in the fast path. I don't understand what you

I missed that. Ignore this comment.

> mean by "running number".

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-13 11:23                                           ` Jerin Jacob
@ 2024-09-13 14:40                                             ` Morten Brørup
  2024-09-16  8:12                                               ` Jerin Jacob
  2024-09-16 10:50                                             ` Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-13 14:40 UTC (permalink / raw)
  To: Jerin Jacob, Mattias Rönnblom
  Cc: Mattias Rönnblom, dev, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Friday, 13 September 2024 13.24
> 
> On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se>
> wrote:
> >
> > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se>
> wrote:
> > >>
> > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > >>> <mattias.ronnblom@ericsson.com> wrote:
> > >>>> +static double
> > >>>> +benchmark_access_method(void (*init_fun)(void), void
> (*update_fun)(void))
> > >>>> +{
> > >>>> +       uint64_t i;
> > >>>> +       uint64_t start;
> > >>>> +       uint64_t end;
> > >>>> +       double latency;
> > >>>> +
> > >>>> +       init_fun();
> > >>>> +
> > >>>> +       start = rte_get_timer_cycles();
> > >>>> +
> > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > >>>> +               update_fun();
> > >>>> +
> > >>>> +       end = rte_get_timer_cycles();
> > >>>
> > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > >>
> > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > >
> > > I was thinking in another way, with 1e7 iteration, the additional
> > > barrier on precise will be amortized, and we get more _deterministic_
> > > behavior e.s.p in case if we print cycles and if we need to catch
> > > regressions.
> >
> > If you time a section of code which spends ~40000000 cycles, it doesn't
> > matter if you add or remove a few cycles at the beginning and the end.
> >
> > The rte_rdtsc_precise() is both better (more precise in the sense of
> > more serialization), and worse (because it's more costly, and thus more
> > intrusive).
> 
> We can calibrate the overhead to remove the cost.
> 
> >
> > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > doesn't matter.
> 
> Yes. In this setup and it is pretty inaccurate PER iteration. Please
> refer to the below patch to see the difference.

No, Mattias is right. The time is sampled once before the loop, then the function is executed 10 million (ITERATIONS) times in the loop, and then the time is sampled once again.

So the overhead and accuracy of the timing function are amortized across the 10 million calls to the function being measured, and become insignificant.

Other perf tests also do it this way, and also use rte_get_timer_cycles(). E.g. the mempool_perf test.
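
As a rough worked example (the cost of the timer call is an assumption for illustration): even if the two rte_get_timer_cycles() calls cost ~100 cycles in total, spread over the 10 million updates they contribute 100 / 10000000 = 0.00001 cycles per update, orders of magnitude below the ~5-15 cycles per update being measured.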

Another detail: The for loop itself may cost a few cycles, which may not be irrelevant when measuring a function using very few cycles. If the compiler doesn't unroll the loop, it should be done manually:

        for (i = 0; i < ITERATIONS / 100; i++) {
                update_fun();
                update_fun();
                ... repeated 100 times
        }
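
If spelling out the call 100 times by hand is too unwieldy, a repeat macro can generate the unrolled body. A minimal sketch (the REPEAT_* names are made up for illustration, not existing DPDK macros):

#define REPEAT_10(expr) \
        do { expr; expr; expr; expr; expr; expr; expr; expr; expr; expr; } while (0)
#define REPEAT_100(expr) \
        REPEAT_10(REPEAT_10(expr))

        /* the compiler now sees 100 consecutive calls per iteration,
         * so the loop overhead is divided by 100 */
        for (i = 0; i < ITERATIONS / 100; i++)
                REPEAT_100(update_fun());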


> 
> Patch 1: Make nanoseconds to cycles per iteration
> ------------------------------------------------------------------
> 
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..b8d25400f593 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> void (*update_fun)(void))
> 
>         end = rte_get_timer_cycles();
> 
> -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> +       latency = ((end - start)) / ITERATIONS;

This calculation uses integer arithmetic, which will round down the resulting latency.
Please use floating point arithmetic: latency = (end - start) / (double)ITERATIONS;

> 
>         return latency;
>  }
> @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> 
> -       printf("Latencies [ns/update]\n");
> +       printf("Latencies [cycles/update]\n");
>         printf("Thread-local storage  Static array  Lcore variables\n");
> -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -              sarray_latency * 1e9, lvar_latency * 1e9);
> +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> lvar_latency);
> 
>         return TEST_SUCCESS;
>  }

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-13 14:40                                             ` Morten Brørup
@ 2024-09-16  8:12                                               ` Jerin Jacob
  2024-09-16  9:51                                                 ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Jerin Jacob @ 2024-09-16  8:12 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, Mattias Rönnblom, dev,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

On Fri, Sep 13, 2024 at 8:10 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > Sent: Friday, 13 September 2024 13.24
> >
> > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se>
> > wrote:
> > >
> > > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se>
> > wrote:
> > > >>
> > > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > > >>> <mattias.ronnblom@ericsson.com> wrote:
> > > >>>> +static double
> > > >>>> +benchmark_access_method(void (*init_fun)(void), void
> > (*update_fun)(void))
> > > >>>> +{
> > > >>>> +       uint64_t i;
> > > >>>> +       uint64_t start;
> > > >>>> +       uint64_t end;
> > > >>>> +       double latency;
> > > >>>> +
> > > >>>> +       init_fun();
> > > >>>> +
> > > >>>> +       start = rte_get_timer_cycles();
> > > >>>> +
> > > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > > >>>> +               update_fun();
> > > >>>> +
> > > >>>> +       end = rte_get_timer_cycles();
> > > >>>
> > > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > > >>
> > > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > > >
> > > > I was thinking in another way, with 1e7 iteration, the additional
> > > > barrier on precise will be amortized, and we get more _deterministic_
> > > > behavior e.s.p in case if we print cycles and if we need to catch
> > > > regressions.
> > >
> > > If you time a section of code which spends ~40000000 cycles, it doesn't
> > > matter if you add or remove a few cycles at the beginning and the end.
> > >
> > > The rte_rdtsc_precise() is both better (more precise in the sense of
> > > more serialization), and worse (because it's more costly, and thus more
> > > intrusive).
> >
> > We can calibrate the overhead to remove the cost.
> >
> > >
> > > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > > doesn't matter.
> >
> > Yes. In this setup and it is pretty inaccurate PER iteration. Please
> > refer to the below patch to see the difference.
>
> No, Mattias is right. The time is sampled once before the loop, then the function is executed 10 million (ITERATIONS) times in the loop, and then the time is sampled once again.

No. I am not disagreeing. That's why I said, “Yes. In this setup”.

All I am saying is that there is a more accurate way of doing the
measurement for this test, along with “data”, at
https://mails.dpdk.org/archives/dev/2024-September/301227.html


>
> So the overhead and accuracy of the timing function is amortized across the 10 million calls to the function being measured, and becomes insignificant.
>
> Other perf tests also do it this way, and also use rte_get_timer_cycles(). E.g. the mempool_perf test.
>
> Another detail: The for loop itself may cost a few cycles, which may not be irrelevant when measuring a function using very few cycles. If the compiler doesn't unroll the loop, it should be done manually:
>
>         for (i = 0; i < ITERATIONS / 100; i++) {
>                 update_fun();
>                 update_fun();
>                 ... repeated 100 times

I have done a similar scheme for the trace perf inline function test
at https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30

Either the above scheme or the below scheme needs to be used as
mentioned in https://mails.dpdk.org/archives/dev/2024-September/301227.html

+       for (i = 0; i < ITERATIONS; i++) {
+               start = rte_rdtsc_precise();
                update_fun();
+               end = rte_rdtsc_precise();
+               latency += (end - start) - tsc_latency;
+       }




>         }
>
>
> >
> > Patch 1: Make nanoseconds to cycles per iteration
> > ------------------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..b8d25400f593 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > void (*update_fun)(void))
> >
> >         end = rte_get_timer_cycles();
> >
> > -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > +       latency = ((end - start)) / ITERATIONS;
>
> This calculation uses integer arithmetic, which will round down the resulting latency.
> Please use floating point arithmetic: latency = (end - start) / (double)ITERATIONS;

Yup. It is in patch 2
https://mails.dpdk.org/archives/dev/2024-September/301227.html

>
> >
> >         return latency;
> >  }
> > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >         printf("Thread-local storage  Static array  Lcore variables\n");
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >         return TEST_SUCCESS;
> >  }

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-16  8:12                                               ` Jerin Jacob
@ 2024-09-16  9:51                                                 ` Morten Brørup
  0 siblings, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-09-16  9:51 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Mattias Rönnblom, Mattias Rönnblom, dev,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Monday, 16 September 2024 10.12
> 
> On Fri, Sep 13, 2024 at 8:10 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > Sent: Friday, 13 September 2024 13.24
> > >
> > > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se>
> > > wrote:
> > > >
> > > > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom
> <hofors@lysator.liu.se>
> > > wrote:
> > > > >>
> > > > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > > > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > > > >>> <mattias.ronnblom@ericsson.com> wrote:
> > > > >>>> +static double
> > > > >>>> +benchmark_access_method(void (*init_fun)(void), void
> > > (*update_fun)(void))
> > > > >>>> +{
> > > > >>>> +       uint64_t i;
> > > > >>>> +       uint64_t start;
> > > > >>>> +       uint64_t end;
> > > > >>>> +       double latency;
> > > > >>>> +
> > > > >>>> +       init_fun();
> > > > >>>> +
> > > > >>>> +       start = rte_get_timer_cycles();
> > > > >>>> +
> > > > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > > > >>>> +               update_fun();
> > > > >>>> +
> > > > >>>> +       end = rte_get_timer_cycles();
> > > > >>>
> > > > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > > > >>
> > > > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > > > >
> > > > > I was thinking in another way, with 1e7 iteration, the additional
> > > > > barrier on precise will be amortized, and we get more _deterministic_
> > > > > behavior e.s.p in case if we print cycles and if we need to catch
> > > > > regressions.
> > > >
> > > > If you time a section of code which spends ~40000000 cycles, it doesn't
> > > > matter if you add or remove a few cycles at the beginning and the end.
> > > >
> > > > The rte_rdtsc_precise() is both better (more precise in the sense of
> > > > more serialization), and worse (because it's more costly, and thus more
> > > > intrusive).
> > >
> > > We can calibrate the overhead to remove the cost.
> > >
> > > >
> > > > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > > > doesn't matter.
> > >
> > > Yes. In this setup and it is pretty inaccurate PER iteration. Please
> > > refer to the below patch to see the difference.
> >
> > No, Mattias is right. The time is sampled once before the loop, then the
> function is executed 10 million (ITERATIONS) times in the loop, and then the
> time is sampled once again.
> 
> No. I am not disagreeing. That why I said, “Yes. In this setup”.

Sorry, I misunderstood. Then we're all on the same page here. :-)

> 
> All I am saying, there is a more accurate way of doing measurement for
> this test along with “data” at
> https://mails.dpdk.org/archives/dev/2024-September/301227.html
> 
> 
> >
> > So the overhead and accuracy of the timing function is amortized across the
> 10 million calls to the function being measured, and becomes insignificant.
> >
> > Other perf tests also do it this way, and also use rte_get_timer_cycles().
> E.g. the mempool_perf test.
> >
> > Another detail: The for loop itself may cost a few cycles, which may not be
> irrelevant when measuring a function using very few cycles. If the compiler
> doesn't unroll the loop, it should be done manually:
> >
> >         for (i = 0; i < ITERATIONS / 100; i++) {
> >                 update_fun();
> >                 update_fun();
> >                 ... repeated 100 times
> 
> I have done a similar scheme for trace perf for inline function test
> at https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30

Nice macro. :-)

> 
> Either the above scheme or the below scheme needs to be used as
> mentioned in https://mails.dpdk.org/archives/dev/2024-September/301227.html
> 
> +       for (i = 0; i < ITERATIONS; i++) {
> +               start = rte_rdtsc_precise();
>                 update_fun();
> +               end = rte_rdtsc_precise();
> +               latency += (end - start) - tsc_latency;
> +       }
> 

I prefer reading the timestamps outside the loop.
If there is any jitter in the execution time (or cycles used) by rte_rdtsc_precise(), it gets amortized when used outside the loop. If used inside the loop, the jitter adds up, and may affect the result.

On the other hand, I guess using rte_rdtsc_precise() inside the loop may show different results, due to its memory barriers. I don't know; just speculating.

Maybe we want to use both methods to measure this? Considering that we are measuring the time to access frequently used variables in hot parts of the code, as implemented by three different design patterns. Performance here is quite important.

And if we want to subtract the overhead from rte_rdtsc_precise() itself - which I think is a good idea if used inside the loop - we probably need another loop to measure that, rather than just calling it twice and subtracting the returned values.
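
A minimal sketch of such a calibration loop (CALIBRATION_ITERATIONS is a made-up constant for illustration):

        uint64_t i, start, end;
        double tsc_overhead;

        /* average the cost of rte_rdtsc_precise() over many
         * back-to-back calls, rather than over a single pair */
        start = rte_rdtsc_precise();
        for (i = 0; i < CALIBRATION_ITERATIONS; i++)
                rte_rdtsc_precise();
        end = rte_rdtsc_precise();

        tsc_overhead = (end - start) / (double)CALIBRATION_ITERATIONS;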

> 
> 
> 
> >         }
> >
> >
> > >
> > > Patch 1: Make nanoseconds to cycles per iteration
> > > ------------------------------------------------------------------
> > >
> > > diff --git a/app/test/test_lcore_var_perf.c
> b/app/test/test_lcore_var_perf.c
> > > index ea1d7ba90b52..b8d25400f593 100644
> > > --- a/app/test/test_lcore_var_perf.c
> > > +++ b/app/test/test_lcore_var_perf.c
> > > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > > void (*update_fun)(void))
> > >
> > >         end = rte_get_timer_cycles();
> > >
> > > -       latency = ((end - start) / (double)rte_get_timer_hz()) /
> ITERATIONS;
> > > +       latency = ((end - start)) / ITERATIONS;
> >
> > This calculation uses integer arithmetic, which will round down the
> resulting latency.
> > Please use floating point arithmetic: latency = (end - start) /
> (double)ITERATIONS;
> 
> Yup. It is in patch 2
> https://mails.dpdk.org/archives/dev/2024-September/301227.html

Yep; my comment was mostly meant for Mattias, if he switches from nanoseconds to cycles, to remember to use floating point calculation here.

> 
> >
> > >
> > >         return latency;
> > >  }
> > > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> > >
> > > -       printf("Latencies [ns/update]\n");
> > > +       printf("Latencies [cycles/update]\n");
> > >         printf("Thread-local storage  Static array  Lcore variables\n");
> > > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > > lvar_latency);
> > >
> > >         return TEST_SUCCESS;
> > >  }

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-13 11:23                                           ` Jerin Jacob
  2024-09-13 14:40                                             ` Morten Brørup
@ 2024-09-16 10:50                                             ` Mattias Rönnblom
  2024-09-18 10:04                                               ` Jerin Jacob
  1 sibling, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 10:50 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Mattias Rönnblom, dev, Morten Brørup,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

On 2024-09-13 13:23, Jerin Jacob wrote:
> On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>
>> On 2024-09-12 17:11, Jerin Jacob wrote:
>>> On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>>>
>>>> On 2024-09-12 15:09, Jerin Jacob wrote:
>>>>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
>>>>> <mattias.ronnblom@ericsson.com> wrote:
>>>>>>
>>>>>> Add basic micro benchmark for lcore variables, in an attempt to assure
>>>>>> that the overhead isn't significantly greater than alternative
>>>>>> approaches, in scenarios where the benefits aren't expected to show up
>>>>>> (i.e., when plenty of cache is available compared to the working set
>>>>>> size of the per-lcore data).
>>>>>>
>>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>>> ---
>>>>>>     app/test/meson.build           |   1 +
>>>>>>     app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
>>>>>>     2 files changed, 161 insertions(+)
>>>>>>     create mode 100644 app/test/test_lcore_var_perf.c
>>>>>
>>>>>
>>>>>> +static double
>>>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>>>>>> +{
>>>>>> +       uint64_t i;
>>>>>> +       uint64_t start;
>>>>>> +       uint64_t end;
>>>>>> +       double latency;
>>>>>> +
>>>>>> +       init_fun();
>>>>>> +
>>>>>> +       start = rte_get_timer_cycles();
>>>>>> +
>>>>>> +       for (i = 0; i < ITERATIONS; i++)
>>>>>> +               update_fun();
>>>>>> +
>>>>>> +       end = rte_get_timer_cycles();
>>>>>
>>>>> Use precise variant. rte_rdtsc_precise() or so to be accurate
>>>>
>>>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
>>>
>>> I was thinking in another way, with 1e7 iteration, the additional
>>> barrier on precise will be amortized, and we get more _deterministic_
>>> behavior e.s.p in case if we print cycles and if we need to catch
>>> regressions.
>>
>> If you time a section of code which spends ~40000000 cycles, it doesn't
>> matter if you add or remove a few cycles at the beginning and the end.
>>
>> The rte_rdtsc_precise() is both better (more precise in the sense of
>> more serialization), and worse (because it's more costly, and thus more
>> intrusive).
> 
> We can calibrate the overhead to remove the cost.
> 
What you are interested in is primarily the impact on (instruction) 
throughput, not the latency of the sequence of instructions that must be 
retired in order to load the lcore variable values, when you switch from
(say) lcore id-indexed static arrays to lcore variables in your module.

Usually, there is no reason to make a distinction between latency and 
throughput in this context, but as you zoom into very short snippets of 
code being executed, the difference becomes relevant. For example, 
adding a div instruction won't necessarily add 12 cc to your program's 
execution time on a Zen 4, even though that is its latency. Rather, the 
effects may, depending on data dependencies and what other instructions 
are executed in parallel, be much smaller.

So, one could argue the ILP you get with the loop is a feature, not a bug.

With or without per-iteration latency measurements, these benchmarks are 
not very useful at best, and misleading at worst. I will rework them to 
include more than a single module/lcore variable, which I think would be 
somewhat of an improvement.

Even better would be to have some real domain logic, instead of just a 
dummy multiplication.
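
To make the latency/throughput distinction concrete, here is a sketch (hypothetical, not part of the patch set) of an update function with a data dependency carried across iterations; feeding the result back serializes the updates, so the benchmark measures latency rather than throughput:

static RTE_LCORE_VAR_HANDLE(uint64_t, chained_var); /* hypothetical */

/* chained_var is assumed to be allocated with RTE_LCORE_VAR_ALLOC()
 * at init, like the other benchmarked variables */
static uint64_t
chained_update(uint64_t input)
{
        uint64_t *value = RTE_LCORE_VAR_VALUE(chained_var);

        *value += input;

        return *value;
}

/* in the benchmark loop, each result is fed back as the next input,
 * preventing the CPU from overlapping successive updates:
 *
 *     acc = 0;
 *     for (i = 0; i < ITERATIONS; i++)
 *             acc = chained_update(acc);
 */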

>>
>> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
>> doesn't matter.
> 
> Yes. In this setup and it is pretty inaccurate PER iteration. Please
> refer to the below patch to see the difference.
> 
> Patch 1: Make nanoseconds to cycles per iteration
> ------------------------------------------------------------------
> 
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..b8d25400f593 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> void (*update_fun)(void))
> 
>          end = rte_get_timer_cycles();
> 
> -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> +       latency = ((end - start)) / ITERATIONS;
> 
>          return latency;
>   }
> @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> 
> -       printf("Latencies [ns/update]\n");
> +       printf("Latencies [cycles/update]\n");
>          printf("Thread-local storage  Static array  Lcore variables\n");
> -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -              sarray_latency * 1e9, lvar_latency * 1e9);
> +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> lvar_latency);
> 
>          return TEST_SUCCESS;
>   }
> 
> 
> Patch 2: Change to precise with calibration
> -----------------------------------------------------------
> 
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..8142ecd56241 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -96,23 +96,28 @@ lvar_update(void)
>   static double
>   benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>   {
> -       uint64_t i;
> +       double tsc_latency;
> +       double latency;
>          uint64_t start;
>          uint64_t end;
> -       double latency;
> +       uint64_t i;
> 
> -       init_fun();
> +       /* calculate rte_rdtsc_precise overhead */
> +       start = rte_rdtsc_precise();
> +       end = rte_rdtsc_precise();
> +       tsc_latency = (end - start);
> 
> -       start = rte_get_timer_cycles();
> +       init_fun();
> 
> -       for (i = 0; i < ITERATIONS; i++)
> +       latency = 0;
> +       for (i = 0; i < ITERATIONS; i++) {
> +               start = rte_rdtsc_precise();
>                  update_fun();
> +               end = rte_rdtsc_precise();
> +               latency += (end - start) - tsc_latency;
> +       }
> 
> -       end = rte_get_timer_cycles();
> -
> -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> -
> -       return latency;
> +       return latency / (double)ITERATIONS;
>   }
> 
>   static int
> @@ -135,10 +140,9 @@ test_lcore_var_access(void)
>          sarray_latency = benchmark_access_method(sarray_init, sarray_update);
>          lvar_latency = benchmark_access_method(lvar_init, lvar_update);
> 
> -       printf("Latencies [ns/update]\n");
> +       printf("Latencies [cycles/update]\n");
>          printf("Thread-local storage  Static array  Lcore variables\n");
> -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -              sarray_latency * 1e9, lvar_latency * 1e9);
> +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> lvar_latency);
> 
>          return TEST_SUCCESS;
>   }
> 
> ARM N2 core with patch 1(aka current scheme)
> -----------------------------------
> 
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [cycles/update]
> Thread-local storage  Static array  Lcore variables
>                   7.0           7.0              7.0
> 
> 
> ARM N2 core with patch 2
> -----------------------------------
> 
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [cycles/update]
> Thread-local storage  Static array  Lcore variables
>                  11.4          15.5             15.5
> 
> x86 i9 core with patch 1(aka current scheme)
> ------------------------------------------------------------
> 
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [ns/update]
> Thread-local storage  Static array  Lcore variables
>                   5.0           6.0              6.0
> 
> x86 i9 core with patch 2
> --------------------------------
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [cycles/update]
> Thread-local storage  Static array  Lcore variables
>                   5.3          10.6             11.7
> 
> 
> 
> 
> 
>>
>>> Furthermore, you may consider replacing rte_random() in fast path to
>>> running number or so if it is not deterministic in cycle computation.
>>
>> rte_rand() is not used in the fast path. I don't understand what you
> 
> I missed that. Ignore this comment.
> 
>> mean by "running number".

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v4 0/7]  Lcore variables
  2024-09-12  8:44                                 ` [PATCH v3 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-16 10:52                                   ` Mattias Rönnblom
  2024-09-16 10:52                                     ` [PATCH v4 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                       ` (6 more replies)
  0 siblings, 7 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 10:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
patch set, and to iron out some, but surely not all, wrinkles in the API.

The question on how to best allocate static per-lcore memory has been
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean,
and seemingly very performant solution to a real problem.

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                            |   6 +
 app/test/meson.build                   |   2 +
 app/test/test_lcore_var.c              | 432 +++++++++++++++++++++++++
 app/test/test_lcore_var_perf.c         | 244 ++++++++++++++
 config/rte_config.h                    |   1 +
 doc/api/doxy-api-index.md              |   1 +
 doc/guides/rel_notes/release_24_11.rst |  14 +
 lib/eal/common/eal_common_lcore_var.c  |  78 +++++
 lib/eal/common/meson.build             |   1 +
 lib/eal/common/rte_random.c            |  28 +-
 lib/eal/common/rte_service.c           | 115 ++++---
 lib/eal/include/meson.build            |   1 +
 lib/eal/include/rte_lcore_var.h        | 385 ++++++++++++++++++++++
 lib/eal/version.map                    |   2 +
 lib/eal/x86/rte_power_intrinsics.c     |  17 +-
 lib/power/rte_power_pmd_mgmt.c         |  34 +-
 16 files changed, 1274 insertions(+), 87 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-16 10:52                                   ` [PATCH v4 0/7] Lcore variables Mattias Rönnblom
@ 2024-09-16 10:52                                     ` Mattias Rönnblom
  2024-09-16 14:02                                       ` Konstantin Ananyev
  2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
  2024-09-16 10:52                                     ` [PATCH v4 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                       ` (5 subsequent siblings)
  6 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 10:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU uses linker scripts, which effectively prevents the
reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
structs. The benefit of lcore variables over this approach is that
data related to the same lcore is now kept close (spatially, in
memory), rather than data used by the same module, which in turn
avoids excessive use of padding and polluting caches with unused data.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and is thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                            |   6 +
 config/rte_config.h                    |   1 +
 doc/api/doxy-api-index.md              |   1 +
 doc/guides/rel_notes/release_24_11.rst |  14 +
 lib/eal/common/eal_common_lcore_var.c  |  78 +++++
 lib/eal/common/meson.build             |   1 +
 lib/eal/include/meson.build            |   1 +
 lib/eal/include/rte_lcore_var.h        | 385 +++++++++++++++++++++++++
 lib/eal/version.map                    |   2 +
 9 files changed, 489 insertions(+)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c5a703b5c0..362d9a3f28 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index dd7bb0d35b..311692e498 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..a3884f7491 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,20 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..309822039b
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
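+
+/*
+ * The offset is initialized to its upper bound, so the first call to
+ * rte_lcore_var_alloc() will not fit in the current (not-yet-allocated)
+ * buffer, and thus triggers the allocation of the initial lcore buffer.
+ */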
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+#ifdef RTE_EXEC_ENV_WINDOWS
+		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
+					       RTE_CACHE_LINE_SIZE);
+#else
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+#endif
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer aligned on cache line
+	 * size, assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..ec3ab714a8
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,385 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * instance for each current and future lcore id-equipped thread, with
+ * a total of RTE_MAX_LCORE instances. The value of an lcore variable
+ * for a particular lcore id is independent from other values (for
+ * other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
+ *     module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races between
+ * the owning thread and any non-owner threads accessing the same
+ * lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never allocates.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * The lcore variables are stored in a series of lcore buffers, which
+ * are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the
+ * case of an lcore variable instance, the thread most recently
+ * accessing nearby data structures should almost always be the lcore
+ * variable's owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to an
+ * even number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions, and, for example, next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follows that of the particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. As a result, thread-local variables must be initialized in
+ *     a "lazy" manner (e.g., at the point of thread creation). Lcore
+ *     variables may be accessed immediately after having been
+ *     allocated (which may be prior to any thread beyond the main
+ *     thread running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction to DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable can be passed
+ *     to and successfully dereferenced by a non-owning thread) depends on
+ *     the details of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, such data sharing is supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param value
+ *   A pointer successively set to point to lcore variable value
+ *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
+ * @param handle
+ *   The lcore variable handle.
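+ *
+ * @b Example (a sketch; assumes @c my_handle was defined with
+ * RTE_LCORE_VAR_HANDLE(int, my_handle) and allocated beforehand):
+ * @code{.c}
+ * int *value;
+ *
+ * RTE_LCORE_VAR_FOREACH_VALUE(value, my_handle)
+ *         *value = 0;
+ * @endcode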
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
+	for (unsigned int lcore_id =					\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..0c80bf7331 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,8 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v4 2/7] eal: add lcore variable functional tests
  2024-09-16 10:52                                   ` [PATCH v4 0/7] Lcore variables Mattias Rönnblom
  2024-09-16 10:52                                     ` [PATCH v4 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-16 10:52                                     ` Mattias Rönnblom
  2024-09-16 10:52                                     ` [PATCH v4 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                       ` (4 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 10:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add functional test suite to exercise the <rte_lcore_var.h> API.
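
For reference, the core pattern the suite exercises boils down to the
following sketch (the real tests also cover explicit sizing, explicit
alignment, structs, arrays, many variables, and max-sized variables):

	RTE_LCORE_VAR_HANDLE(int, test_int);
	RTE_LCORE_VAR_INIT(test_int);

	/* access this thread's instance ... */
	*RTE_LCORE_VAR_VALUE(test_int) = 42;

	/* ... or any instance, by lcore id */
	int v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);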

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..e07d13460f
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v4 3/7] eal: add lcore variable performance test
  2024-09-16 10:52                                   ` [PATCH v4 0/7] Lcore variables Mattias Rönnblom
  2024-09-16 10:52                                     ` [PATCH v4 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-16 10:52                                     ` [PATCH v4 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-09-16 10:52                                     ` Mattias Rönnblom
  2024-09-16 11:13                                       ` Mattias Rönnblom
  2024-09-16 10:52                                     ` [PATCH v4 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                       ` (3 subsequent siblings)
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 10:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add a basic micro benchmark for lcore variables, in an attempt to
verify that the overhead isn't significantly greater than that of
alternative approaches, in scenarios where the benefits aren't
expected to show up (i.e., when plenty of cache is available compared
to the working set size of the per-lcore data).
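
The per-update latencies reported are derived in the obvious way; in
sketch form (mirroring benchmark_access() in the diff below):

	uint64_t start = rte_rdtsc();

	for (i = 0; i < ITERATIONS; i++)
		update_fun(mods[i & (num_mods - 1)]);

	latency = (rte_rdtsc() - start) / (double)ITERATIONS;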

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

--

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 244 +++++++++++++++++++++++++++++++++
 2 files changed, 245 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..8b0abc771c
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,244 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =
+		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %13.1f %13.1f %16.1f\n", num_mods, sarray_latency,
+	       tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays are not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each wiht a small, per-lcore state. Note however that
+ * these tests has very little non-lcore/thread local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("Latencies [TSC cycles/update]\n");
+	printf("Modules/Variables  Static array  Thread-local Storage  "
+	       "Lcore variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v4 4/7] random: keep PRNG state in lcore variable
  2024-09-16 10:52                                   ` [PATCH v4 0/7] Lcore variables Mattias Rönnblom
                                                       ` (2 preceding siblings ...)
  2024-09-16 10:52                                     ` [PATCH v4 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-09-16 10:52                                     ` Mattias Rönnblom
  2024-09-16 16:11                                       ` Konstantin Ananyev
  2024-09-16 10:52                                     ` [PATCH v4 5/7] power: keep per-lcore " Mattias Rönnblom
                                                       ` (2 subsequent siblings)
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 10:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v4 5/7] power: keep per-lcore state in lcore variable
  2024-09-16 10:52                                   ` [PATCH v4 0/7] Lcore variables Mattias Rönnblom
                                                       ` (3 preceding siblings ...)
  2024-09-16 10:52                                     ` [PATCH v4 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-09-16 10:52                                     ` Mattias Rönnblom
  2024-09-16 16:12                                       ` Konstantin Ananyev
  2024-09-16 10:52                                     ` [PATCH v4 6/7] service: " Mattias Rönnblom
  2024-09-16 10:52                                     ` [PATCH v4 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 10:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a5139dd4f7 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v4 6/7] service: keep per-lcore state in lcore variable
  2024-09-16 10:52                                   ` [PATCH v4 0/7] Lcore variables Mattias Rönnblom
                                                       ` (4 preceding siblings ...)
  2024-09-16 10:52                                     ` [PATCH v4 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-09-16 10:52                                     ` Mattias Rönnblom
  2024-09-16 16:13                                       ` Konstantin Ananyev
  2024-09-16 10:52                                     ` [PATCH v4 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 10:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 115 +++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 56379930b6..03379f1588 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,12 +102,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -122,7 +119,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +132,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +281,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +288,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +449,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +462,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +484,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +530,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +546,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +567,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +584,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +636,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +688,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +706,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +731,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +755,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +779,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +809,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +818,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +879,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +895,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +942,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +971,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +983,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1022,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v4 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-09-16 10:52                                   ` [PATCH v4 0/7] Lcore variables Mattias Rönnblom
                                                       ` (5 preceding siblings ...)
  2024-09-16 10:52                                     ` [PATCH v4 6/7] service: " Mattias Rönnblom
@ 2024-09-16 10:52                                     ` Mattias Rönnblom
  2024-09-16 16:14                                       ` Konstantin Ananyev
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 10:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Keep per-lcore power intrinsics state in an lcore variable, to reduce
the cache working set size and avoid any CPU next-line prefetching
causing false sharing.
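
As a sketch of the layout difference: with the old static array,
wait_status[n] and wait_status[n + 1] occupy adjacent cache lines,
which a next-line prefetcher may pull in together. With an lcore
variable, the instances for different lcore ids are RTE_MAX_LCORE_VAR
bytes apart:

	struct power_wait_status *a =
		RTE_LCORE_VAR_LCORE_VALUE(0, wait_status);
	struct power_wait_status *b =
		RTE_LCORE_VAR_LCORE_VALUE(1, wait_status);

	/* (uintptr_t)b - (uintptr_t)a == RTE_MAX_LCORE_VAR */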

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v4 3/7] eal: add lcore variable performance test
  2024-09-16 10:52                                     ` [PATCH v4 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-09-16 11:13                                       ` Mattias Rönnblom
  2024-09-16 11:54                                         ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 11:13 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob

On 2024-09-16 12:52, Mattias Rönnblom wrote:
> Add a basic micro benchmark for lcore variables, in an attempt to
> verify that the overhead isn't significantly greater than that of
> alternative approaches, in scenarios where the benefits aren't
> expected to show up (i.e., when plenty of cache is available compared
> to the working set size of the per-lcore data).
> 

Here are some test results for a Raptor Cove @ 3.2 GHz (GCC 11):

  + ------------------------------------------------------- +
  + Test Suite : lcore variable perf autotest
  + ------------------------------------------------------- +
Latencies [TSC cycles/update]
Modules/Variables  Static array  Thread-local Storage  Lcore variables
                 1           3.9           5.5              3.7
                 2           3.8           5.5              3.8
                 4           4.9           5.5              3.7
                 8           3.8           5.5              3.8
                16          11.3           5.5              3.7
                32          20.9           5.5              3.7
                64          23.5           5.5              3.7
               128          23.2           5.5              3.7
               256          23.5           5.5              3.7
               512          24.1           5.5              3.7
              1024          25.3           5.5              3.9
  + TestCase [ 0] : test_lcore_var_access succeeded
  + ------------------------------------------------------- +


The reason for TLS being slower than lcore variables (which in turn 
rely on TLS for the lcore id lookup) is the lazy initialization 
conditional imposed on that variant. Could that conditional be avoided 
(which is module-dependent, I suppose), TLS beats lcore variables at 
~3.0 cycles/update.

I must say I'm surprised to see lcore variables doing this well, at 
these very modest working set sizes. Probably, you can stay at 
near-zero L1 misses with lcore variables (and TLS), but start missing 
the L1 with static arrays.
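
To illustrate, an eagerly-initialized TLS variant (hypothetical, since 
it assumes the module gets a chance to run its init code at thread 
start) would shave off that branch:

static __rte_noinline void
tls_update_eager(unsigned int mod)
{
	struct mod_lcore_state_lazy *state =
		&RTE_PER_LCORE(tls_lcore_state[mod]);

	/* no "initialized" check needed here */
	mod_update(&state->mod_state);
}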

> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> 
> --
> 
> PATCH v4:
>   * Rework the tests to be a little less unrealistic. Instead of a
>     single dummy module using a single variable, use a number of
>     variables/modules. In this way, differences in cache effects may
>     show up.
>   * Add RTE_CACHE_GUARD to better mimic that static array pattern.
>     (Morten Brørup)
>   * Show latencies as TSC cycles. (Morten Brørup)
> ---
>   app/test/meson.build           |   1 +
>   app/test/test_lcore_var_perf.c | 244 +++++++++++++++++++++++++++++++++
>   2 files changed, 245 insertions(+)
>   create mode 100644 app/test/test_lcore_var_perf.c
> 
> diff --git a/app/test/meson.build b/app/test/meson.build
> index 48279522f0..d4e0c59900 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -104,6 +104,7 @@ source_file_deps = {
>       'test_kvargs.c': ['kvargs'],
>       'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
>       'test_lcore_var.c': [],
> +    'test_lcore_var_perf.c': [],
>       'test_lcores.c': [],
>       'test_link_bonding.c': ['ethdev', 'net_bond',
>           'net'] + packet_burst_generator_deps + virtual_pmd_deps,
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> new file mode 100644
> index 0000000000..8b0abc771c
> --- /dev/null
> +++ b/app/test/test_lcore_var_perf.c
> @@ -0,0 +1,244 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#define MAX_MODS 1024
> +
> +#include <stdio.h>
> +
> +#include <rte_bitops.h>
> +#include <rte_cycles.h>
> +#include <rte_lcore_var.h>
> +#include <rte_per_lcore.h>
> +#include <rte_random.h>
> +
> +#include "test.h"
> +
> +struct mod_lcore_state {
> +	uint64_t a;
> +	uint64_t b;
> +	uint64_t sum;
> +};
> +
> +static void
> +mod_init(struct mod_lcore_state *state)
> +{
> +	state->a = rte_rand();
> +	state->b = rte_rand();
> +	state->sum = 0;
> +}
> +
> +static __rte_always_inline void
> +mod_update(volatile struct mod_lcore_state *state)
> +{
> +	state->sum += state->a * state->b;
> +}
> +
> +struct __rte_cache_aligned mod_lcore_state_aligned {
> +	struct mod_lcore_state mod_state;
> +
> +	RTE_CACHE_GUARD;
> +};
> +
> +static struct mod_lcore_state_aligned
> +sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
> +
> +static void
> +sarray_init(void)
> +{
> +	unsigned int lcore_id = rte_lcore_id();
> +	int mod;
> +
> +	for (mod = 0; mod < MAX_MODS; mod++) {
> +		struct mod_lcore_state *mod_state =
> +			&sarray_lcore_state[mod][lcore_id].mod_state;
> +
> +		mod_init(mod_state);
> +	}
> +}
> +
> +static __rte_noinline void
> +sarray_update(unsigned int mod)
> +{
> +	unsigned int lcore_id = rte_lcore_id();
> +	struct mod_lcore_state *mod_state =
> +		&sarray_lcore_state[mod][lcore_id].mod_state;
> +
> +	mod_update(mod_state);
> +}
> +
> +struct mod_lcore_state_lazy {
> +	struct mod_lcore_state mod_state;
> +	bool initialized;
> +};
> +
> +/*
> + * Note: it's usually a bad idea to have this much thread-local storage
> + * allocated in a real application, since it will incur a cost on
> + * thread creation and non-lcore thread memory usage.
> + */
> +static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
> +			    tls_lcore_state)[MAX_MODS];
> +
> +static inline void
> +tls_init(struct mod_lcore_state_lazy *state)
> +{
> +	mod_init(&state->mod_state);
> +
> +	state->initialized = true;
> +}
> +
> +static __rte_noinline void
> +tls_update(unsigned int mod)
> +{
> +	struct mod_lcore_state_lazy *state =
> +		&RTE_PER_LCORE(tls_lcore_state[mod]);
> +
> +	/* With thread-local storage, initialization must usually be lazy */
> +	if (!state->initialized)
> +		tls_init(state);
> +
> +	mod_update(&state->mod_state);
> +}
> +
> +RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
> +
> +static void
> +lvar_init(void)
> +{
> +	unsigned int mod;
> +
> +	for (mod = 0; mod < MAX_MODS; mod++) {
> +		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
> +
> +		struct mod_lcore_state *state =
> +			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
> +
> +		mod_init(state);
> +	}
> +}
> +
> +static __rte_noinline void
> +lvar_update(unsigned int mod)
> +{
> +	struct mod_lcore_state *state =
> +		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
> +
> +	mod_update(state);
> +}
> +
> +static void
> +shuffle(unsigned int *elems, size_t len)
> +{
> +	size_t i;
> +
> +	for (i = len - 1; i > 0; i--) {
> +		unsigned int other = rte_rand_max(i + 1);
> +
> +		unsigned int tmp = elems[other];
> +		elems[other] = elems[i];
> +		elems[i] = tmp;
> +	}
> +}
> +
> +#define ITERATIONS UINT64_C(10000000)
> +
> +static inline double
> +benchmark_access(const unsigned int *mods, unsigned int num_mods,
> +		 void (*init_fun)(void), void (*update_fun)(unsigned int))
> +{
> +	unsigned int i;
> +	double start;
> +	double end;
> +	double latency;
> +	unsigned int num_mods_mask = num_mods - 1;
> +
> +	RTE_VERIFY(rte_is_power_of_2(num_mods));
> +
> +	if (init_fun != NULL)
> +		init_fun();
> +
> +	/* Warm up cache and make sure TLS variables are initialized */
> +	for (i = 0; i < num_mods; i++)
> +		update_fun(i);
> +
> +	start = rte_rdtsc();
> +
> +	for (i = 0; i < ITERATIONS; i++)
> +		update_fun(mods[i & num_mods_mask]);
> +
> +	end = rte_rdtsc();
> +
> +	latency = (end - start) / ITERATIONS;
> +
> +	return latency;
> +}
> +
> +static void
> +test_lcore_var_access_n(unsigned int num_mods)
> +{
> +	double sarray_latency;
> +	double tls_latency;
> +	double lvar_latency;
> +	unsigned int mods[num_mods];
> +	unsigned int i;
> +
> +	for (i = 0; i < num_mods; i++)
> +		mods[i] = i;
> +
> +	shuffle(mods, num_mods);
> +
> +	sarray_latency =
> +		benchmark_access(mods, num_mods, sarray_init, sarray_update);
> +
> +	tls_latency =
> +		benchmark_access(mods, num_mods, NULL, tls_update);
> +
> +	lvar_latency =
> +		benchmark_access(mods, num_mods, lvar_init, lvar_update);
> +
> +	printf("%17u %13.1f %13.1f %16.1f\n", num_mods, sarray_latency,
> +	       tls_latency, lvar_latency);
> +}
> +
> +/*
> + * The potential performance benefit of lcore variables compared to
> + * the use of statically sized, lcore id-indexed arrays is not
> + * shorter latencies in a scenario with low cache pressure, but rather
> + * fewer cache misses in a real-world scenario, with extensive cache
> + * usage. These tests are a crude simulation of such, using <N> dummy
> + * modules, each with a small, per-lcore state. Note however that
> + * these tests have very little non-lcore/thread local state, which is
> + * unrealistic.
> + */
> +
> +static int
> +test_lcore_var_access(void)
> +{
> +	unsigned int num_mods = 1;
> +
> +	printf("Latencies [TSC cycles/update]\n");
> +	printf("Modules/Variables  Static array  Thread-local Storage  "
> +	       "Lcore variables\n");
> +
> +	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
> +		test_lcore_var_access_n(num_mods);
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static struct unit_test_suite lcore_var_testsuite = {
> +	.suite_name = "lcore variable perf autotest",
> +	.unit_test_cases = {
> +		TEST_CASE(test_lcore_var_access),
> +		TEST_CASES_END()
> +	},
> +};
> +
> +static int
> +test_lcore_var_perf(void)
> +{
> +	return unit_test_suite_runner(&lcore_var_testsuite);
> +}
> +
> +REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 3/7] eal: add lcore variable performance test
  2024-09-16 11:13                                       ` Mattias Rönnblom
@ 2024-09-16 11:54                                         ` Morten Brørup
  2024-09-16 16:12                                           ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-16 11:54 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 16 September 2024 13.13
> 
> On 2024-09-16 12:52, Mattias Rönnblom wrote:
> > Add basic micro benchmark for lcore variables, in an attempt to assure
> > that the overhead isn't significantly greater than alternative
> > approaches, in scenarios where the benefits aren't expected to show up
> > (i.e., when plenty of cache is available compared to the working set
> > size of the per-lcore data).
> >
> 
> Here are some test results for a Raptor Cove @ 3,2 GHz (GCC 11):
> 
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [TSC cycles/update]
> Modules/Variables  Static array  Thread-local Storage  Lcore variables
>                  1           3.9           5.5              3.7
>                  2           3.8           5.5              3.8
>                  4           4.9           5.5              3.7
>                  8           3.8           5.5              3.8
>                 16          11.3           5.5              3.7
>                 32          20.9           5.5              3.7
>                 64          23.5           5.5              3.7
>                128          23.2           5.5              3.7
>                256          23.5           5.5              3.7
>                512          24.1           5.5              3.7
>               1024          25.3           5.5              3.9
>   + TestCase [ 0] : test_lcore_var_access succeeded
>   + ------------------------------------------------------- +
> 
> 
> The reason for TLS being slower than lcore variables (which in turn
> rely on TLS for the lcore id lookup) is the lazy initialization
> conditional imposed on that variant. Were that conditional avoided
> (which is module-dependent, I suppose), TLS would beat lcore variables
> at ~3.0 cycles/update.

I think you should not assume lazy initialization of TLS in your benchmark.
Our application uses TLS, and when spinning up a new thread, we call a per-lcore init function of each module before calling the per-lcore run function. This design pattern is also described in Figure 1.4 [1] in the Programmer's Guide.

[1]: https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html
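
In code, that pattern is roughly the following sketch, where
foo_lcore_init() and bar_lcore_init() stand in for hypothetical module
init functions, and run() for the actual per-lcore run function:

static int
lcore_main(void *arg)
{
	/* each module's per-lcore init runs in the new thread, before
	 * the run function proper
	 */
	foo_lcore_init();
	bar_lcore_init();

	return run(arg);
}

/* launched from the main lcore: */
rte_eal_remote_launch(lcore_main, NULL, lcore_id);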

> 
> I must say I'm surprised to see lcore variables doing this well at
> these very modest working set sizes. Probably, you can stay at near-zero
> L1 misses with lcore variables (and TLS), but start missing the L1 with
> static arrays.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-16 10:52                                     ` [PATCH v4 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-16 14:02                                       ` Konstantin Ananyev
  2024-09-16 17:39                                         ` Morten Brørup
  2024-09-17 14:28                                         ` Mattias Rönnblom
  2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
  1 sibling, 2 replies; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-16 14:02 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob



> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small, frequently-accessed data structures, for which one instance
> should exist for each lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decouple the values' lifetime from that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality to the
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its otherwise seemingly viable approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore is now close (spatially, in memory), rather than data used by
> the same module, which in turn avoids excessive use of padding,
> polluting caches with unused data.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>

LGTM in general, a few small questions (mostly nits), see below. 
 
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,78 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +#include <stdlib.h>
> +
> +#ifdef RTE_EXEC_ENV_WINDOWS
> +#include <malloc.h>
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_log.h>
> +
> +#include <rte_lcore_var.h>
> +
> +#include "eal_private.h"
> +
> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> +
> +static void *lcore_buffer;
> +static size_t offset = RTE_MAX_LCORE_VAR;
> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +	void *handle;
> +	void *value;
> +
> +	offset = RTE_ALIGN_CEIL(offset, align);
> +
> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> +#ifdef RTE_EXEC_ENV_WINDOWS
> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> +					       RTE_CACHE_LINE_SIZE);
> +#else
> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> +					     LCORE_BUFFER_SIZE);
> +#endif

I don't remember whether that question has already arisen or not:
for debugging and health-checking purposes, would it make sense to link all
lcore_buffer values into a linked list?
That way a user, developer, or some tool could walk over it to check that a
provided handle value is really a valid lcore variable, etc.
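
A minimal sketch of what such bookkeeping could look like in
eal_common_lcore_var.c (names hypothetical; record_lcore_buffer()
would be called whenever a new lcore_buffer is allocated):

struct lcore_buffer_record {
	void *buffer;
	struct lcore_buffer_record *next;
};

static struct lcore_buffer_record *buffer_records;

static void
record_lcore_buffer(void *buffer)
{
	struct lcore_buffer_record *record = malloc(sizeof(*record));

	RTE_VERIFY(record != NULL);

	record->buffer = buffer;
	record->next = buffer_records;
	buffer_records = record;
}

/* a handle is valid if it points into the first per-lcore slot of
 * some recorded buffer
 */
static bool
lcore_var_handle_is_valid(const void *handle)
{
	const struct lcore_buffer_record *record;

	for (record = buffer_records; record != NULL; record = record->next)
		if (handle >= record->buffer &&
		    handle < RTE_PTR_ADD(record->buffer, RTE_MAX_LCORE_VAR))
			return true;

	return false;
}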

> +		RTE_VERIFY(lcore_buffer != NULL);
> +
> +		offset = 0;
> +	}
> +
> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
> +
> +	offset += size;
> +
> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
> +		memset(value, 0, size);
> +
> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> +		"%"PRIuPTR"-byte alignment", size, align);
> +
> +	return handle;
> +}
> +
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +{
> +	/* Having the per-lcore buffer size aligned on cache lines,
> +	 * as well as having the base pointer aligned on cache line
> +	 * size, assures that aligned offsets also translate to aligned
> +	 * pointers across all values.
> +	 */
> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> +	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
> +
> +	/* '0' means asking for worst-case alignment requirements */
> +	if (align == 0)
> +		align = alignof(max_align_t);
> +
> +	RTE_ASSERT(rte_is_power_of_2(align));
> +
> +	return lcore_var_alloc(size, align);
> +}

....

> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
> new file mode 100644
> index 0000000000..ec3ab714a8
> --- /dev/null
> +++ b/lib/eal/include/rte_lcore_var.h

... 

> +/**
> + * Given the lcore variable type, produces the type of the lcore
> + * variable handle.
> + */
> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
> +	type *
> +
> +/**
> + * Define an lcore variable handle.
> + *
> + * This macro defines a variable which is used as a handle to access
> + * the various instances of a per-lcore id variable.
> + *
> + * The aim with this macro is to make clear at the point of
> + * declaration that this is an lcore handle, rather than a regular
> + * pointer.
> + *
> + * Add @b static as a prefix in case the lcore variable is only to be
> + * accessed from a particular translation unit.
> + */
> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
> +	handle = rte_lcore_var_alloc(size, align)
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle,
> + * with values aligned for any type of object.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
> +
> +/**
> + * Allocate space for an lcore variable of the size and alignment requirements
> + * suggested by the handle pointer type, and initialize its handle.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_ALLOC(handle)					\
> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
> +				       alignof(typeof(*(handle))))
> +
> +/**
> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
> + * means of a @ref RTE_INIT constructor.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
> +	}
> +
> +/**
> + * Allocate an explicitly-sized lcore variable by means of a @ref
> + * RTE_INIT constructor.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
> +
> +/**
> + * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_INIT(name)					\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC(name);				\
> +	}
> +
> +/**
> + * Get void pointer to lcore variable instance with the specified
> + * lcore id.
> + *
> + * @param lcore_id
> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> + *   instances should be accessed. The lcore id need not be valid
> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
> + *   is also not valid (and thus should not be dereferenced).
> + * @param handle
> + *   The lcore variable handle.
> + */
> +static inline void *
> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> +{
> +	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
> +}
> +
> +/**
> + * Get pointer to lcore variable instance with the specified lcore id.
> + *
> + * @param lcore_id
> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> + *   instances should be accessed. The lcore id need not be valid
> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
> + *   is also not valid (and thus should not be dereferenced).
> + * @param handle
> + *   The lcore variable handle.
> + */
> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
> +
> +/**
> + * Get pointer to lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_VALUE(handle) \
> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)

Would it make sense to check that rte_lcore_id() != LCORE_ID_ANY?
After all, if people do not want this extra check, they can probably use
RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
explicitly.
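
A hypothetical sketch of such a checked variant, using a GCC-style
statement expression:

#define RTE_LCORE_VAR_VALUE_SAFE(handle)				\
	__extension__ ({						\
		unsigned int _lcore_id = rte_lcore_id();		\
		RTE_VERIFY(_lcore_id != LCORE_ID_ANY);			\
		RTE_LCORE_VAR_LCORE_VALUE(_lcore_id, handle);		\
	})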

> +
> +/**
> + * Iterate over each lcore id's value for an lcore variable.
> + *
> + * @param value
> + *   A pointer successively set to point to lcore variable value
> + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
> + * @param handle
> + *   The lcore variable handle.
> + */
> +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
> +	for (unsigned int lcore_id =					\
> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
> +	     lcore_id < RTE_MAX_LCORE;					\
> +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))

Might it be a bit better (and safer) to make lcore_id a macro parameter?
I.e.:
define RTE_LCORE_VAR_FOREACH_VALUE(value, handle, lcore_id) \
for ((lcore_id) = ... 
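
Spelled out, such a variant could look like this sketch, derived from
the macro quoted above:

#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle, lcore_id)		\
	for ((lcore_id) =						\
		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
	     (lcore_id) < RTE_MAX_LCORE;				\
	     (lcore_id)++,						\
		     (value) = RTE_LCORE_VAR_LCORE_VALUE((lcore_id), handle))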


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 4/7] random: keep PRNG state in lcore variable
  2024-09-16 10:52                                     ` [PATCH v4 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-09-16 16:11                                       ` Konstantin Ananyev
  0 siblings, 0 replies; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-16 16:11 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob



> -----Original Message-----
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Sent: Monday, September 16, 2024 11:52 AM
> To: dev@dpdk.org
> Cc: hofors@lysator.liu.se; Morten Brørup <mb@smartsharesystems.com>; Stephen Hemminger <stephen@networkplumber.org>;
> Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>; David Marchand <david.marchand@redhat.com>; Jerin Jacob
> <jerinj@marvell.com>; Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Subject: [PATCH v4 4/7] random: keep PRNG state in lcore variable
> 
> Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
> cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
> same state in a more cache-friendly lcore variable.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> --
> 
> RFC v3:
>  * Remove cache alignment on unregistered threads' rte_rand_state.
>    (Morten Brørup)
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com> 

> 2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 5/7] power: keep per-lcore state in lcore variable
  2024-09-16 10:52                                     ` [PATCH v4 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-09-16 16:12                                       ` Konstantin Ananyev
  0 siblings, 0 replies; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-16 16:12 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob


> Replace static array of cache-aligned structs with an lcore variable,
> to slightly benefit code simplicity and performance.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> --
> 
> RFC v3:
>  * Replace for loop with FOREACH macro.
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

> 2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v4 3/7] eal: add lcore variable performance test
  2024-09-16 11:54                                         ` Morten Brørup
@ 2024-09-16 16:12                                           ` Mattias Rönnblom
  2024-09-16 17:19                                             ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-16 16:12 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

On 2024-09-16 13:54, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 16 September 2024 13.13
>>
>> On 2024-09-16 12:52, Mattias Rönnblom wrote:
>>> Add basic micro benchmark for lcore variables, in an attempt to assure
>>> that the overhead isn't significantly greater than alternative
>>> approaches, in scenarios where the benefits aren't expected to show up
>>> (i.e., when plenty of cache is available compared to the working set
>>> size of the per-lcore data).
>>>
>>
>> Here are some test results for a Raptor Cove @ 3,2 GHz (GCC 11):
>>
>>    + ------------------------------------------------------- +
>>    + Test Suite : lcore variable perf autotest
>>    + ------------------------------------------------------- +
>> Latencies [TSC cycles/update]
>> Modules/Variables  Static array  Thread-local Storage  Lcore variables
>>                   1           3.9           5.5              3.7
>>                   2           3.8           5.5              3.8
>>                   4           4.9           5.5              3.7
>>                   8           3.8           5.5              3.8
>>                  16          11.3           5.5              3.7
>>                  32          20.9           5.5              3.7
>>                  64          23.5           5.5              3.7
>>                 128          23.2           5.5              3.7
>>                 256          23.5           5.5              3.7
>>                 512          24.1           5.5              3.7
>>                1024          25.3           5.5              3.9
>>    + TestCase [ 0] : test_lcore_var_access succeeded
>>    + ------------------------------------------------------- +
>>
>>
> >> The reason for TLS being slower than lcore variables (which in turn
> >> rely on TLS for the lcore id lookup) is the lazy initialization
> >> conditional imposed on that variant. Were that conditional avoided
> >> (which is module-dependent, I suppose), TLS would beat lcore
> >> variables at ~3.0 cycles/update.
> 
> I think you should not assume lazy initialization of TLS in your benchmark.
> > Our application uses TLS, and when spinning up a new thread, we call a per-lcore init function of each module before calling the per-lcore run function. This design pattern is also described in Figure 1.4 [1] in the Programmer's Guide.
> 
> [1]: https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html
> 

Per-lcore init functions may be an option, and also may not be, depending 
on what API you need to adhere to. But maybe I should add a non-lazy TLS 
variant as well.

I should probably add some information on lcore variables in the EAL 
programmer's guide as well.

Non-lazy TLS would be a more viable option if there were proper 
framework support for it. Now, I'm not sure there is a better way to do 
it in a DPDK library than how it's done for tracing, where there's an 
explicit call per thread created. Other DPDK-internal users of 
RTE_PER_LCORE seem to depend on lazy initialization.
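
For reference, a non-lazy variant of the benchmark module could look
like the sketch below, reusing the types from the patch; its init
function would be passed as init_fun to benchmark_access():

static RTE_DEFINE_PER_LCORE(struct mod_lcore_state,
			    tls_eager_lcore_state)[MAX_MODS];

static void
tls_eager_init(void)
{
	unsigned int mod;

	for (mod = 0; mod < MAX_MODS; mod++)
		mod_init(&RTE_PER_LCORE(tls_eager_lcore_state[mod]));
}

static __rte_noinline void
tls_eager_update(unsigned int mod)
{
	/* no initialized check needed here */
	mod_update(&RTE_PER_LCORE(tls_eager_lcore_state[mod]));
}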

>>
>> I must say I'm surprised to see lcore variables doing this well at
>> these very modest working set sizes. Probably, you can stay at near-zero
>> L1 misses with lcore variables (and TLS), but start missing the L1 with
>> static arrays.
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 6/7] service: keep per-lcore state in lcore variable
  2024-09-16 10:52                                     ` [PATCH v4 6/7] service: " Mattias Rönnblom
@ 2024-09-16 16:13                                       ` Konstantin Ananyev
  0 siblings, 0 replies; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-16 16:13 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob


> Replace static array of cache-aligned structs with an lcore variable,
> to slightly benefit code simplicity and performance.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> --
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com> 

> 2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-09-16 10:52                                     ` [PATCH v4 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
@ 2024-09-16 16:14                                       ` Konstantin Ananyev
  0 siblings, 0 replies; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-16 16:14 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob


> Keep per-lcore power intrinsics state in a lcore variable to reduce
> cache working set size and avoid any CPU next-line-prefetching causing
> false sharing.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

> 2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 3/7] eal: add lcore variable performance test
  2024-09-16 16:12                                           ` Mattias Rönnblom
@ 2024-09-16 17:19                                             ` Morten Brørup
  0 siblings, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-09-16 17:19 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 16 September 2024 18.13
> 
> On 2024-09-16 13:54, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Monday, 16 September 2024 13.13
> >>
> >> The reason for TLS being slower than lcore variables (which in turn
> >> rely on TLS for the lcore id lookup) is the lazy initialization
> >> conditional imposed on that variant. Were that conditional avoided
> >> (which is module-dependent, I suppose), TLS would beat lcore
> >> variables at ~3.0 cycles/update.
> >
> > I think you should not assume lazy initialization of TLS in your
> > benchmark.
> > Our application uses TLS, and when spinning up a new thread, we call
> > a per-lcore init function of each module before calling the per-lcore
> > run function. This design pattern is also described in Figure 1.4 [1]
> > in the Programmer's Guide.
> >
> > [1]: https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html
> >
> 
> Per-lcore init functions may be an option, and also may not be, depending
> on what API you need to adhere to. But maybe I should add a non-lazy TLS
> variant as well.

Certainly. Both, or just non-lazy is fine with me.

> 
> I should probably add some information on lcore variables in the EAL
> programmer's guide as well.

+1

> 
> Non-lazy TLS would be a more viable option if there were proper
> framework support for it.

The framework should provide RTE_LCORE_INIT macros for modules to define per-lcore init functions, which EAL should call whenever it creates additional threads. And they should obviously be called from within the newly created thread, not from the main thread.
And if some per-lcore init function only needs to do its work for worker threads, the init function can check the thread type as the first thing. A hypothetical sketch of such a framework follows below.
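
Sketch (no such API exists in DPDK today; all names are made up):

/* a module registers a per-lcore init callback at load time */
typedef void (rte_lcore_init_cb_t)(void);

void rte_lcore_register_init(rte_lcore_init_cb_t *cb);

#define RTE_LCORE_INIT(func)				\
	RTE_INIT(rte_lcore_init_register_ ## func)	\
	{						\
		rte_lcore_register_init(func);		\
	}

/* EAL would then invoke all registered callbacks from within each
 * newly created thread, before handing control to the run function
 */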

> Now, I'm not sure there is a better way to do
> it in a DPDK library than how it's done for tracing, where there's an
> explicit call per thread created. Other DPDK-internal users of
> RTE_PER_LCORE seem to depend on lazy initialization.

The framework lacks the per-thread init feature, so it's implemented differently in different modules. Don't get distracted by how the trace module does it. Just imagine the framework offering some generic mechanism to do it.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-16 14:02                                       ` Konstantin Ananyev
@ 2024-09-16 17:39                                         ` Morten Brørup
  2024-09-16 23:19                                           ` Konstantin Ananyev
  2024-09-17 14:28                                         ` Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-16 17:39 UTC (permalink / raw)
  To: Konstantin Ananyev, Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Monday, 16 September 2024 16.02
> 
> > Introduce DPDK per-lcore id variables, or lcore variables for short.
> >
> > An lcore variable has one value for every current and future lcore
> > id-equipped thread.
> >
> > The primary <rte_lcore_var.h> use case is for statically allocating
> > small, frequently-accessed data structures, for which one instance
> > should exist for each lcore.
> >
> > Lcore variables are similar to thread-local storage (TLS, e.g., C11
> > _Thread_local), but decouple the values' lifetime from that of the
> > threads.
> >
> > Lcore variables are also similar in terms of functionality to the
> > FreeBSD kernel's DPCPU_*() family of macros and the associated
> > build-time machinery. DPCPU uses linker scripts, which effectively
> > prevents the reuse of its otherwise seemingly viable approach.
> >
> > The currently-prevailing way to solve the same problem as lcore
> > variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
> > array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> > lcore variables over this approach is that data related to the same
> > lcore is now close (spatially, in memory), rather than data used by
> > the same module, which in turn avoids excessive use of padding,
> > polluting caches with unused data.
> >
> > Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> LGTM in general, a few small questions (mostly nits), see below.
> 
> > --- /dev/null
> > +++ b/lib/eal/common/eal_common_lcore_var.c
> > @@ -0,0 +1,78 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2024 Ericsson AB
> > + */
> > +
> > +#include <inttypes.h>
> > +#include <stdlib.h>
> > +
> > +#ifdef RTE_EXEC_ENV_WINDOWS
> > +#include <malloc.h>
> > +#endif
> > +
> > +#include <rte_common.h>
> > +#include <rte_debug.h>
> > +#include <rte_log.h>
> > +
> > +#include <rte_lcore_var.h>
> > +
> > +#include "eal_private.h"
> > +
> > +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> > +
> > +static void *lcore_buffer;
> > +static size_t offset = RTE_MAX_LCORE_VAR;
> > +
> > +static void *
> > +lcore_var_alloc(size_t size, size_t align)
> > +{
> > +	void *handle;
> > +	void *value;
> > +
> > +	offset = RTE_ALIGN_CEIL(offset, align);
> > +
> > +	if (offset + size > RTE_MAX_LCORE_VAR) {
> > +#ifdef RTE_EXEC_ENV_WINDOWS
> > +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> > +					       RTE_CACHE_LINE_SIZE);
> > +#else
> > +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> > +					     LCORE_BUFFER_SIZE);
> > +#endif
> 
> I don't remember whether that question has already arisen or not:
> for debugging and health-checking purposes, would it make sense to link
> all lcore_buffer values into a linked list?
> That way a user, developer, or some tool could walk over it to check
> that a provided handle value is really a valid lcore variable, etc.

Nice idea.
Such a list, along with an accompanying dump function, can be added later.

> 
> > +		RTE_VERIFY(lcore_buffer != NULL);
> > +
> > +		offset = 0;
> > +	}
> > +
> > +	handle = RTE_PTR_ADD(lcore_buffer, offset);
> > +
> > +	offset += size;
> > +
> > +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
> > +		memset(value, 0, size);
> > +
> > +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with
> a "
> > +		"%"PRIuPTR"-byte alignment", size, align);
> > +
> > +	return handle;
> > +}
> > +
> > +void *
> > +rte_lcore_var_alloc(size_t size, size_t align)
> > +{
> > +	/* Having the per-lcore buffer size aligned on cache lines,
> > +	 * as well as having the base pointer aligned on cache line
> > +	 * size, assures that aligned offsets also translate to aligned
> > +	 * pointers across all values.
> > +	 */
> > +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> > +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> > +	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
> > +
> > +	/* '0' means asking for worst-case alignment requirements */
> > +	if (align == 0)
> > +		align = alignof(max_align_t);
> > +
> > +	RTE_ASSERT(rte_is_power_of_2(align));
> > +
> > +	return lcore_var_alloc(size, align);
> > +}
> 
> ....
> 
> > diff --git a/lib/eal/include/rte_lcore_var.h
> b/lib/eal/include/rte_lcore_var.h
> > new file mode 100644
> > index 0000000000..ec3ab714a8
> > --- /dev/null
> > +++ b/lib/eal/include/rte_lcore_var.h
> 
> ...
> 
> > +/**
> > + * Given the lcore variable type, produces the type of the lcore
> > + * variable handle.
> > + */
> > +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
> > +	type *
> > +
> > +/**
> > + * Define an lcore variable handle.
> > + *
> > + * This macro defines a variable which is used as a handle to access
> > + * the various instances of a per-lcore id variable.
> > + *
> > + * The aim with this macro is to make clear at the point of
> > + * declaration that this is an lcore handle, rather than a regular
> > + * pointer.
> > + *
> > + * Add @b static as a prefix in case the lcore variable is only to be
> > + * accessed from a particular translation unit.
> > + */
> > +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> > +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> > +
> > +/**
> > + * Allocate space for an lcore variable, and initialize its handle.
> > + *
> > + * The values of the lcore variable are initialized to zero.
> > + */
> > +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
> > +	handle = rte_lcore_var_alloc(size, align)
> > +
> > +/**
> > + * Allocate space for an lcore variable, and initialize its handle,
> > + * with values aligned for any type of object.
> > + *
> > + * The values of the lcore variable are initialized to zero.
> > + */
> > +#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
> > +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
> > +
> > +/**
> > + * Allocate space for an lcore variable of the size and alignment
> requirements
> > + * suggested by the handle pointer type, and initialize its handle.
> > + *
> > + * The values of the lcore variable are initialized to zero.
> > + */
> > +#define RTE_LCORE_VAR_ALLOC(handle)					\
> > +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
> > +				       alignof(typeof(*(handle))))
> > +
> > +/**
> > + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
> > + * means of a @ref RTE_INIT constructor.
> > + *
> > + * The values of the lcore variable are initialized to zero.
> > + */
> > +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
> > +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> > +	{								\
> > +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
> > +	}
> > +
> > +/**
> > + * Allocate an explicitly-sized lcore variable by means of a @ref
> > + * RTE_INIT constructor.
> > + *
> > + * The values of the lcore variable are initialized to zero.
> > + */
> > +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
> > +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
> > +
> > +/**
> > + * Allocate an lcore variable by means of a @ref RTE_INIT
> constructor.
> > + *
> > + * The values of the lcore variable are initialized to zero.
> > + */
> > +#define RTE_LCORE_VAR_INIT(name)					\
> > +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> > +	{								\
> > +		RTE_LCORE_VAR_ALLOC(name);				\
> > +	}
> > +
> > +/**
> > + * Get void pointer to lcore variable instance with the specified
> > + * lcore id.
> > + *
> > + * @param lcore_id
> > + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> > + *   instances should be accessed. The lcore id need not be valid
> > + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the
> pointer
> > + *   is also not valid (and thus should not be dereferenced).
> > + * @param handle
> > + *   The lcore variable handle.
> > + */
> > +static inline void *
> > +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> > +{
> > +	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
> > +}
> > +
> > +/**
> > + * Get pointer to lcore variable instance with the specified lcore
> id.
> > + *
> > + * @param lcore_id
> > + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> > + *   instances should be accessed. The lcore id need not be valid
> > + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the
> pointer
> > + *   is also not valid (and thus should not be dereferenced).
> > + * @param handle
> > + *   The lcore variable handle.
> > + */
> > +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
> > +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
> > +
> > +/**
> > + * Get pointer to lcore variable instance of the current thread.
> > + *
> > + * May only be used by EAL threads and registered non-EAL threads.
> > + */
> > +#define RTE_LCORE_VAR_VALUE(handle) \
> > +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> 
> Would it make sense to check that rte_lcore_id() !=  LCORE_ID_ANY?
> After all if people do not want this extra check, they can probably use
> RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> explicitly.

Not generally. I prefer keeping it brief.
We could add a _SAFE variant with this extra check, like LIST_FOREACH has LIST_FOREACH_SAFE (although for a different purpose).

Come to think of it: In the name of brevity, consider renaming RTE_LCORE_VAR_VALUE to RTE_LCORE_VAR. (And RTE_LCORE_VAR_FOREACH_VALUE to RTE_LCORE_VAR_FOREACH.) We want to see these everywhere in the code.

> 
> > +
> > +/**
> > + * Iterate over each lcore id's value for an lcore variable.
> > + *
> > + * @param value
> > + *   A pointer successively set to point to lcore variable value
> > + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
> > + * @param handle
> > + *   The lcore variable handle.
> > + */
> > +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
> > +	for (unsigned int lcore_id =					\
> > +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0);
> \
> > +	     lcore_id < RTE_MAX_LCORE;					\
> > +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
> handle))
> 
> Might be a bit better (and safer) to make lcore_id a macro parameter?
> I.E.:
> define RTE_LCORE_VAR_FOREACH_VALUE(value, handle, lcore_id) \
> for ((lcore_id) = ...

The same thought has struck me, so I checked the scope of lcore_id.
The scope of lcore_id remains limited to the for loop, i.e. it is available inside the for loop, but not after it.
IMO this suffices, and lcore_id doesn't need to be a macro parameter.
Maybe renaming lcore_id to _lcore_id would be an improvement, if lcore_id is already defined and used for other purposes within the for loop.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-16 17:39                                         ` Morten Brørup
@ 2024-09-16 23:19                                           ` Konstantin Ananyev
  2024-09-17  7:12                                             ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-16 23:19 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob


> > > +/**
> > > + * Get pointer to lcore variable instance of the current thread.
> > > + *
> > > + * May only be used by EAL threads and registered non-EAL threads.
> > > + */
> > > +#define RTE_LCORE_VAR_VALUE(handle) \
> > > +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> >
> > Would it make sense to check that rte_lcore_id() !=  LCORE_ID_ANY?
> > After all if people do not want this extra check, they can probably use
> > RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> > explicitly.
> 
> Not generally. I prefer keeping it brief.
> We could add a _SAFE variant with this extra check, like LIST_FOREACH has LIST_FOREACH_SAFE (although for a different purpose).
> 
> Come to think of it: In the name of brevity, consider renaming RTE_LCORE_VAR_VALUE to RTE_LCORE_VAR. (And
> RTE_LCORE_VAR_FOREACH_VALUE to RTE_LCORE_VAR_FOREACH.) We want to see these everywhere in the code.

Well, it is not about brevity...
I just feel uncomfortable that our own public macro doesn't check the value
returned by rte_lcore_id() and introduces a possible out-of-bounds memory access. 

 
> >
> > > +
> > > +/**
> > > + * Iterate over each lcore id's value for an lcore variable.
> > > + *
> > > + * @param value
> > > + *   A pointer successively set to point to lcore variable value
> > > + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
> > > + * @param handle
> > > + *   The lcore variable handle.
> > > + */
> > > +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
> > > +	for (unsigned int lcore_id =					\
> > > +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0);
> > \
> > > +	     lcore_id < RTE_MAX_LCORE;					\
> > > +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
> > handle))
> >
> > Might be a bit better (and safer) to make lcore_id a macro parameter?
> > I.E.:
> > define RTE_LCORE_VAR_FOREACH_VALUE(value, handle, lcore_id) \
> > for ((lcore_id) = ...
> 
> The same thought has struck me, so I checked the scope of lcore_id.
> The scope of lcore_id remains limited to the for loop, i.e. it is available inside the for loop, but not after it.

A variable with the same name (and type) can be defined by the user before
the loop, with the intention of using it inside the loop.
Just as happens here (in patch #2):
+	unsigned int lcore_id;
.....
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+


> IMO this suffices, and lcore_id doesn't need to be a macro parameter.
> Maybe renaming lcore_id to _lcore_id would be an improvement, if lcore_id is already defined and used for other purposes within
> the for loop.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-16 23:19                                           ` Konstantin Ananyev
@ 2024-09-17  7:12                                             ` Morten Brørup
  2024-09-17  8:09                                               ` Konstantin Ananyev
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-17  7:12 UTC (permalink / raw)
  To: Konstantin Ananyev, Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Tuesday, 17 September 2024 01.20
> 
> > > > +/**
> > > > + * Get pointer to lcore variable instance of the current thread.
> > > > + *
> > > > + * May only be used by EAL threads and registered non-EAL threads.
> > > > + */
> > > > +#define RTE_LCORE_VAR_VALUE(handle) \
> > > > +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> > >
> > > Would it make sense to check that rte_lcore_id() !=  LCORE_ID_ANY?
> > > After all if people do not want this extra check, they can probably use
> > > RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> > > explicitly.
> >
> > Not generally. I prefer keeping it brief.
> > We could add a _SAFE variant with this extra check, like LIST_FOREACH has
> LIST_FOREACH_SAFE (although for a different purpose).
> >
> > Come to think of it: In the name of brevity, consider renaming
> RTE_LCORE_VAR_VALUE to RTE_LCORE_VAR. (And
> > RTE_LCORE_VAR_FOREACH_VALUE to RTE_LCORE_VAR_FOREACH.) We want to see these
> everywhere in the code.
> 
> Well, it is not about brevity...
> I just feel  uncomfortable that our own public macro doesn't check value
> returned by rte_lcore_id() and introduce a possible out-of-bound memory
> access.

For performance reasons, we generally don't check parameter validity in fast path functions/macros; lots of code in DPDK uses ptr->array[rte_lcore_id()] without checking rte_lcore_id() validity.
We shouldn't do it here either.

There's a secondary benefit:
RTE_LCORE_VAR_VALUE() returns a pointer, so this macro can always be used.
In particular, the pointer can be initialized together with other variables at the start of a function:
struct mystruct * const state = RTE_LCORE_VAR_VALUE(state_handle);
The out-of-bounds memory access will only occur if the pointer is dereferenced.
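
For example (sketch; mystruct, state_handle and the counter field are
placeholders):

static void
update_state(void)
{
	struct mystruct * const state = RTE_LCORE_VAR_VALUE(state_handle);

	if (rte_lcore_id() == LCORE_ID_ANY)
		return; /* invalid pointer; must not be dereferenced */

	state->counter++; /* safe: lcore id known to be valid */
}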

> 
> 
> > >
> > > > +
> > > > +/**
> > > > + * Iterate over each lcore id's value for an lcore variable.
> > > > + *
> > > > + * @param value
> > > > + *   A pointer successively set to point to lcore variable value
> > > > + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
> > > > + * @param handle
> > > > + *   The lcore variable handle.
> > > > + */
> > > > +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
> > > > +	for (unsigned int lcore_id =					\
> > > > +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)),
> 0);
> > > \
> > > > +	     lcore_id < RTE_MAX_LCORE;					\
> > > > +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
> > > handle))
> > >
> > > Might be a bit better (and safer) to make lcore_id a macro parameter?
> > > I.E.:
> > > define RTE_LCORE_VAR_FOREACH_VALUE(value, handle, lcore_id) \
> > > for ((lcore_id) = ...
> >
> > The same thought has struck me, so I checked the scope of lcore_id.
> > The scope of lcore_id remains limited to the for loop, i.e. it is available
> > inside the for loop, but not after it.
> 
> A variable with the same name (and type) can be defined by the user before
> the loop, with the intention of using it inside the loop.
> Just as happens here (in patch #2):
> +	unsigned int lcore_id;
> .....
> +	/* take the opportunity to test the foreach macro */
> +	int *v;
> +	lcore_id = 0;
> +	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
> +		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
> +				  "Unexpected value on lcore %d during "
> +				  "iteration", lcore_id);
> +		lcore_id++;
> +	}
> +
> 

You convinced me here, Konstantin.
Adding the iterator (lcore_id) as a macro parameter reduces the risk of bugs, and has no real disadvantages.

> 
> > IMO this suffices, and lcore_id doesn't need to be a macro parameter.
> > Maybe renaming lcore_id to _lcore_id would be an improvement, if lcore_id is
> already defined and used for other purposes within
> > the for loop.

PS:
We discussed the _VALUE postfix previously, Mattias, and I agreed to it. But now that I have become more familiar with the code, I think the _VALUE postfix should be dropped.
I'm usually in favor of long variable/function/macro names, arguing that they improve code readability.
But I don't think the _VALUE postfix really improves readability.
Especially when RTE_LCORE_VAR() has become widely used, and everyone is familiar with it, a long name (RTE_LCORE_VAR_VALUE()) will be more annoying than helpful.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-17  7:12                                             ` Morten Brørup
@ 2024-09-17  8:09                                               ` Konstantin Ananyev
  0 siblings, 0 replies; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-17  8:09 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob



> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Tuesday, 17 September 2024 01.20
> >
> > > > > +/**
> > > > > + * Get pointer to lcore variable instance of the current thread.
> > > > > + *
> > > > > + * May only be used by EAL threads and registered non-EAL threads.
> > > > > + */
> > > > > +#define RTE_LCORE_VAR_VALUE(handle) \
> > > > > +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> > > >
> > > > Would it make sense to check that rte_lcore_id() !=  LCORE_ID_ANY?
> > > > After all if people do not want this extra check, they can probably use
> > > > RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> > > > explicitly.
> > >
> > > Not generally. I prefer keeping it brief.
> > > We could add a _SAFE variant with this extra check, like LIST_FOREACH has
> > LIST_FOREACH_SAFE (although for a different purpose).
> > >
> > > Come to think of it: In the name of brevity, consider renaming
> > RTE_LCORE_VAR_VALUE to RTE_LCORE_VAR. (And
> > > RTE_LCORE_VAR_FOREACH_VALUE to RTE_LCORE_VAR_FOREACH.) We want to see these
> > everywhere in the code.
> >
> > Well, it is not about brevity...
> > I just feel uncomfortable that our own public macro doesn't check the
> > value returned by rte_lcore_id() and introduces a possible out-of-bounds
> > memory access.
> 
> For performance reasons, we generally don't check parameter validity in
> fast path functions/macros; lots of code in DPDK uses
> ptr->array[rte_lcore_id()] without checking rte_lcore_id() validity.

Yes, there are plenty of such places inside DPDK...
OK, I'll leave it for the author to decide; after all, there is a clear comment
in front of it forbidding the use of that macro by non-EAL threads.
Hope users will read it before using it ;)

> We shouldn't do it here either.
> 
> There's a secondary benefit:
> RTE_LCORE_VAR_VALUE() returns a pointer, so this macro can always be used.
> Especially, the pointer can be initialized with other variables at the start of a function:
> struct mystruct * const state = RTE_LCORE_VAR_VALUE(state_handle);
> The out-of-bound memory access will occur if dereferencing the pointer.
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-16 14:02                                       ` Konstantin Ananyev
  2024-09-16 17:39                                         ` Morten Brørup
@ 2024-09-17 14:28                                         ` Mattias Rönnblom
  2024-09-17 16:11                                           ` Konstantin Ananyev
  2024-09-17 16:29                                           ` Konstantin Ananyev
  1 sibling, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-17 14:28 UTC (permalink / raw)
  To: Konstantin Ananyev, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob

On 2024-09-16 16:02, Konstantin Ananyev wrote:
> 
> 
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small, frequently-accessed data structures, for which one instance
>> should exist for each lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decouple the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality to the
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its otherwise seemingly viable approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore is now close (spatially, in memory), rather than data used by
>> the same module, which in turn avoids excessive use of padding,
>> polluting caches with unused data.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> LGTM in general, a few small questions (mostly nits), see below.
>   
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,78 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +#include <stdlib.h>
>> +
>> +#ifdef RTE_EXEC_ENV_WINDOWS
>> +#include <malloc.h>
>> +#endif
>> +
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_log.h>
>> +
>> +#include <rte_lcore_var.h>
>> +
>> +#include "eal_private.h"
>> +
>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>> +
>> +static void *lcore_buffer;
>> +static size_t offset = RTE_MAX_LCORE_VAR;
>> +
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	void *handle;
>> +	void *value;
>> +
>> +	offset = RTE_ALIGN_CEIL(offset, align);
>> +
>> +	if (offset + size > RTE_MAX_LCORE_VAR) {
>> +#ifdef RTE_EXEC_ENV_WINDOWS
>> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
>> +					       RTE_CACHE_LINE_SIZE);
>> +#else
>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>> +					     LCORE_BUFFER_SIZE);
>> +#endif
> 
> I don't remember whether that question has already arisen or not:
> for debugging and health-checking purposes, would it make sense to link all
> lcore_buffer values into a linked list?
> That way a user, developer, or some tool could walk over it to check that a
> provided handle value is really a valid lcore variable, etc.
> 

At least you could add some basic statistics, like the total size 
allocated by lcore variables, and the number of variables.

One could also add tracing.
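
E.g., something along these lines in eal_common_lcore_var.c (sketch;
names hypothetical):

static size_t lcore_var_total_size; /* bytes allocated per lcore */
static unsigned int lcore_var_count; /* number of lcore variables */

/* at the end of lcore_var_alloc(), before returning the handle: */
	lcore_var_total_size += size;
	lcore_var_count++;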

>> +		RTE_VERIFY(lcore_buffer != NULL);
>> +
>> +		offset = 0;
>> +	}
>> +
>> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
>> +
>> +	offset += size;
>> +
>> +	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
>> +		memset(value, 0, size);
>> +
>> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>> +		"%"PRIuPTR"-byte alignment", size, align);
>> +
>> +	return handle;
>> +}
>> +
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	/* Having the per-lcore buffer size aligned on cache lines,
>> +	 * as well as having the base pointer aligned on cache line
>> +	 * size, assures that aligned offsets also translate to aligned
>> +	 * pointers across all values.
>> +	 */
>> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
>> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
>> +	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
>> +
>> +	/* '0' means asking for worst-case alignment requirements */
>> +	if (align == 0)
>> +		align = alignof(max_align_t);
>> +
>> +	RTE_ASSERT(rte_is_power_of_2(align));
>> +
>> +	return lcore_var_alloc(size, align);
>> +}
> 
> ....
> 
>> diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
>> new file mode 100644
>> index 0000000000..ec3ab714a8
>> --- /dev/null
>> +++ b/lib/eal/include/rte_lcore_var.h
> 
> ...
> 
>> +/**
>> + * Given the lcore variable type, produces the type of the lcore
>> + * variable handle.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
>> +	type *
>> +
>> +/**
>> + * Define an lcore variable handle.
>> + *
>> + * This macro defines a variable which is used as a handle to access
>> + * the various instances of a per-lcore id variable.
>> + *
>> + * The aim with this macro is to make clear at the point of
>> + * declaration that this is an lcore handle, rather than a regular
>> + * pointer.
>> + *
>> + * Add @b static as a prefix in case the lcore variable is only to be
>> + * accessed from a particular translation unit.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
>> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
>> +	handle = rte_lcore_var_alloc(size, align)
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle,
>> + * with values aligned for any type of object.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
>> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
>> +
>> +/**
>> + * Allocate space for an lcore variable of the size and alignment requirements
>> + * suggested by the handle pointer type, and initialize its handle.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC(handle)					\
>> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
>> +				       alignof(typeof(*(handle))))
>> +
>> +/**
>> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
>> + * means of a @ref RTE_INIT constructor.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
>> +	}
>> +
>> +/**
>> + * Allocate an explicitly-sized lcore variable by means of a @ref
>> + * RTE_INIT constructor.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
>> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
>> +
>> +/**
>> + * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_INIT(name)					\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC(name);				\
>> +	}
>> +
>> +/**
>> + * Get void pointer to lcore variable instance with the specified
>> + * lcore id.
>> + *
>> + * @param lcore_id
>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
>> + *   instances should be accessed. The lcore id need not be valid
>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
>> + *   is also not valid (and thus should not be dereferenced).
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +static inline void *
>> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
>> +{
>> +	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
>> +}
>> +
>> +/**
>> + * Get pointer to lcore variable instance with the specified lcore id.
>> + *
>> + * @param lcore_id
>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
>> + *   instances should be accessed. The lcore id need not be valid
>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
>> + *   is also not valid (and thus should not be dereferenced).
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
>> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
>> +
>> +/**
>> + * Get pointer to lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_VALUE(handle) \
>> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> 
> Would it make sense to check that rte_lcore_id() !=  LCORE_ID_ANY?
> After all if people do not want this extra check, they can probably use
> RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> explicitly.
> 

It would make sense if it were an RTE_ASSERT(). Otherwise, I don't think
so. Attempting to gracefully handle API violations is bad practice, imo.
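
If anything, it could look something like the below (a sketch only, and
the helper name is made up; not something this patch provides):

static inline void *
rte_lcore_var_checked_ptr(void *handle)
{
	unsigned int lcore_id = rte_lcore_id();

	RTE_ASSERT(lcore_id != LCORE_ID_ANY);

	return rte_lcore_var_lcore_ptr(lcore_id, handle);
}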

>> +
>> +/**
>> + * Iterate over each lcore id's value for an lcore variable.
>> + *
>> + * @param value
>> + *   A pointer successively set to point to lcore variable value
>> + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
>> +	for (unsigned int lcore_id =					\
>> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
>> +	     lcore_id < RTE_MAX_LCORE;					\
>> +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
> 
> Might be a bit better (and safer) to make lcore_id a macro parameter?
> I.e.:
> #define RTE_LCORE_VAR_FOREACH_VALUE(value, handle, lcore_id) \
> for ((lcore_id) = ...
> 

Why?

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v5 0/7] Lcore variables
  2024-09-16 10:52                                     ` [PATCH v4 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-16 14:02                                       ` Konstantin Ananyev
@ 2024-09-17 14:32                                       ` Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                           ` (6 more replies)
  1 sibling, 7 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-17 14:32 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
patch set, and to iron out some, but surely not all, wrinkles in the
API.

The question of how to best allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In
the author's opinion, they do however provide a reasonably simple,
clean and seemingly quite performant solution to a real problem.
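
For the impatient, below is a minimal sketch of what usage may look
like ("foo" being a hypothetical module):

struct foo_lcore_state {
	uint64_t count;
};

static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, foo_states);

RTE_INIT(foo_init)
{
	RTE_LCORE_VAR_ALLOC(foo_states);
}

static void
foo_count(void)
{
	struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(foo_states);

	state->count++;
}

See the <rte_lcore_var.h> API documentation in the first patch for the
details.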

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 432 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 257 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  78 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 115 ++---
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 385 ++++++++++++++++
 lib/eal/version.map                           |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  34 +-
 17 files changed, 1326 insertions(+), 93 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v5 1/7] eal: add static per-lcore memory allocation facility
  2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
@ 2024-09-17 14:32                                         ` Mattias Rönnblom
  2024-09-18  8:00                                           ` [PATCH v6 0/7] Lcore variables Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                           ` (5 subsequent siblings)
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-17 14:32 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but with the values' lifetime decoupled from that of
the threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU uses linker scripts, which effectively prevents the
reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
structs. The benefit of lcore variables over this approach is that
data related to the same lcore is now spatially close in memory,
rather than data used by the same module. This in turn avoids
excessive use of padding, which pollutes caches with unused data.
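
To illustrate, a condensed sketch of the two patterns ("foo" is a
hypothetical module; the variants would not coexist):

/* Today: per-module static array, where each element is padded
 * to avoid false sharing. */
struct __rte_cache_aligned foo_lcore_state {
	long bar;
	RTE_CACHE_GUARD;
};
static struct foo_lcore_state foo_states[RTE_MAX_LCORE];

versus

/* With lcore variables: no per-element padding required. */
struct foo_lcore_state {
	long bar;
};
static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, foo_states);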

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represent the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  78 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 385 ++++++++++++++++++
 lib/eal/version.map                           |   2 +
 10 files changed, 528 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c5a703b5c0..362d9a3f28 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index dd7bb0d35b..311692e498 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 9559c12a98..12b49672a6 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -433,12 +433,45 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, blocks allocated on the DPDK heap, and
+other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library or PMD may opt to keep per-thread
+state.
+
+Per-thread data may be maintained using either *lcore variables*
+(``rte_lcore_var.h``), *thread-local storage (TLS)*
+(``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE``
+elements, indexed by ``rte_lcore_id()``. These methods allow
+per-lcore data to be a largely module-internal affair, not directly
+visible in the module's API. Another possibility is to deal
+explicitly with per-thread aspects in the API (e.g., the ports of the
+Eventdev API).
+
+Lcore variables are suitable for small objects statically allocated
+at the time of module or application initialization. An lcore
+variable takes on one value for each lcore id-equipped thread (i.e.,
+for EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of lcore variables is
+detached from that of the owning threads, and they may thus be
+initialized prior to their owners having been created.
+
+Variables with thread-local storage are allocated at the time of
+thread creation, and exist until the thread terminates, for every
+thread in the process. Only very small objects should be allocated in
+TLS, since large TLS objects significantly slow down thread creation
+and may needlessly increase the memory footprint of applications that
+make extensive use of unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore id-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must both be cache-aligned and include a
+``RTE_CACHE_GUARD``. Such extensive use of padding causes internal
+fragmentation (i.e., unused space) and lowers cache hit rates.
+
+For a further discussion of per-lcore state, see the
+``rte_lcore_var.h`` API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..a3884f7491 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,20 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but with the values' lifetime decoupled from
+    that of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..309822039b
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+#ifdef RTE_EXEC_ENV_WINDOWS
+		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
+					       RTE_CACHE_LINE_SIZE);
+#else
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+#endif
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer aligned on cache line
+	 * size, assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..ec3ab714a8
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,385 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * instance for each current and future lcore id-equipped thread, with
+ * a total of RTE_MAX_LCORE instances. The value of an lcore variable
+ * for a particular lcore id is independent from other values (for
+ * other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for an @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time of
+ *     module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races between
+ * the owning thread and any non-owner threads accessing the same
+ * lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never goes on to allocate.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variable values are stored in a series of lcore buffers, which
+ * are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, the thread most recently accessing nearby
+ * data structures should almost always be the lcore variable's
+ * owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to an
+ * even number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions and for example next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between by using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of the particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. As a result, thread-local variables must be initialized in
+ *     a "lazy" manner (e.g., at the point of thread creation). Lcore
+ *     variables may be accessed immediately after having been
+ *     allocated (which may be prior to any thread beyond the main
+ *     thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable can be passed
+ *     to and successfully dereferenced by a non-owning thread) depends on
+ *     the details of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, such data sharing is supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param value
+ *   A pointer successively set to point to lcore variable value
+ *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
+	for (unsigned int lcore_id =					\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal to
+ *   or less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..0c80bf7331 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,8 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v5 2/7] eal: add lcore variable functional tests
  2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-17 14:32                                         ` Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                           ` (4 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-17 14:32 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..e07d13460f
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v5 3/7] eal: add lcore variable performance test
  2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-09-17 14:32                                         ` Mattias Rönnblom
  2024-09-17 15:40                                           ` Morten Brørup
  2024-09-17 14:32                                         ` [PATCH v5 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                           ` (3 subsequent siblings)
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-17 14:32 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).
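
For reference, one way to run the benchmark (the build directory path
is an assumption; adjust for your setup):

$ DPDK_TEST=lcore_var_perf_autotest ./build/app/test/dpdk-test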

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

--

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 257 +++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..538286d01b
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =
+		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such a scenario, using
+ * <N> dummy modules, each with a small, per-lcore state. Note however
+ * that these tests have very little non-lcore/thread-local state, which
+ * is unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v5 4/7] random: keep PRNG state in lcore variable
  2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
                                                           ` (2 preceding siblings ...)
  2024-09-17 14:32                                         ` [PATCH v5 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-09-17 14:32                                         ` Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 5/7] power: keep per-lcore " Mattias Rönnblom
                                                           ` (2 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-17 14:32 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v5 5/7] power: keep per-lcore state in lcore variable
  2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
                                                           ` (3 preceding siblings ...)
  2024-09-17 14:32                                         ` [PATCH v5 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-09-17 14:32                                         ` Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 6/7] service: " Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-17 14:32 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

--

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a5139dd4f7 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,21 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v5 6/7] service: keep per-lcore state in lcore variable
  2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
                                                           ` (4 preceding siblings ...)
  2024-09-17 14:32                                         ` [PATCH v5 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-09-17 14:32                                         ` Mattias Rönnblom
  2024-09-17 14:32                                         ` [PATCH v5 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-17 14:32 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

--

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 115 +++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 56379930b6..03379f1588 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,12 +102,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -122,7 +119,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +132,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +281,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +288,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +449,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +462,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +484,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +530,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +546,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +567,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +584,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +636,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +688,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +706,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +731,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +755,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +779,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +809,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +818,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +879,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +895,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +942,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +971,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +983,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1022,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v5 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
                                                           ` (5 preceding siblings ...)
  2024-09-17 14:32                                         ` [PATCH v5 6/7] service: " Mattias Rönnblom
@ 2024-09-17 14:32                                         ` Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-17 14:32 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Keep per-lcore power intrinsics state in an lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.
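
To illustrate the problem being avoided (a sketch, not code from this
patch): with a cache-aligned, RTE_MAX_LCORE-sized array, lcore N's
entry occupies the cache line immediately preceding lcore N+1's, so a
next-line hardware prefetcher serving lcore N may pull in, and thereby
contend for, lcore N+1's line. The old-style workaround pads each
element with guard lines, at the cost of a larger footprint:

struct __rte_cache_aligned old_style_wait_status {
	rte_spinlock_t lock;
	volatile void *monitor_addr;
	/* guard line(s) against next-line prefetcher false sharing */
	RTE_CACHE_GUARD;
};

With an lcore variable, neighboring lines largely belong to the same
lcore, so no such padding is needed.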

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v5 3/7] eal: add lcore variable performance test
  2024-09-17 14:32                                         ` [PATCH v5 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-09-17 15:40                                           ` Morten Brørup
  2024-09-18  6:05                                             ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-09-17 15:40 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

> +	start = rte_rdtsc();
> +
> +	for (i = 0; i < ITERATIONS; i++)
> +		update_fun(mods[i & num_mods_mask]);

This indexing adds more instructions to be executed than just the update function.
The added overhead is the same for all tested access methods, so the absolute difference in latency (i.e. measured in cycles) is still perfectly valid.
Just mentioning it; no change required.
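
Should one ever want to quantify that fixed overhead, one way (a
sketch only, not something this patch needs) would be to benchmark a
do-nothing update function with the same harness and subtract its
latency:

static __rte_noinline void
noop_update(unsigned int mod)
{
	/* empty on purpose: measures only loop, indexing and call cost */
	RTE_SET_USED(mod);
}

/* overhead = benchmark_access(mods, num_mods, NULL, noop_update); */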

> +
> +	end = rte_rdtsc();
> +
> +	latency = (end - start) / ITERATIONS;

This calculation is integer; add (double) somewhere to make it floating point.
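
E.g. (a sketch, assuming start and end hold the raw TSC counts):

	latency = (double)(end - start) / ITERATIONS;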

> +
> +	return latency;
> +}


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-17 14:28                                         ` Mattias Rönnblom
@ 2024-09-17 16:11                                           ` Konstantin Ananyev
  2024-09-18  7:00                                             ` Mattias Rönnblom
  2024-09-17 16:29                                           ` Konstantin Ananyev
  1 sibling, 1 reply; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-17 16:11 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob

> >> +
> >> +/**
> >> + * Get pointer to lcore variable instance with the specified lcore id.
> >> + *
> >> + * @param lcore_id
> >> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> >> + *   instances should be accessed. The lcore id need not be valid
> >> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
> >> + *   is also not valid (and thus should not be dereferenced).
> >> + * @param handle
> >> + *   The lcore variable handle.
> >> + */
> >> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
> >> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
> >> +
> >> +/**
> >> + * Get pointer to lcore variable instance of the current thread.
> >> + *
> >> + * May only be used by EAL threads and registered non-EAL threads.
> >> + */
> >> +#define RTE_LCORE_VAR_VALUE(handle) \
> >> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> >
> > Would it make sense to check that rte_lcore_id() !=  LCORE_ID_ANY?
> > After all if people do not want this extra check, they can probably use
> > RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> > explicitly.
> >
> 
> It would make sense, if it was an RTE_ASSERT(). Otherwise, I don't think
> so. Attempting to gracefully handle API violations is bad practice, imo.

Ok, RTE_ASSERT() might be a good compromise.
As I said in another mail for that thread, I wouldn't insist here.
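
Something along these lines (a sketch only, using a GCC-style
statement expression):

#define RTE_LCORE_VAR_VALUE(handle)					\
	__extension__({							\
		RTE_ASSERT(rte_lcore_id() != LCORE_ID_ANY);		\
		RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle);	\
	})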

> 
> >> +
> >> +/**
> >> + * Iterate over each lcore id's value for an lcore variable.
> >> + *
> >> + * @param value
> >> + *   A pointer successively set to point to lcore variable value
> >> + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
> >> + * @param handle
> >> + *   The lcore variable handle.
> >> + */
> >> +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
> >> +	for (unsigned int lcore_id =					\
> >> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
> >> +	     lcore_id < RTE_MAX_LCORE;					\
> >> +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
> >
> > Might be a bit better (and safer) to make lcore_id a macro parameter?
> > I.E.:
> > define RTE_LCORE_VAR_FOREACH_VALUE(value, handle, lcore_id) \
> > for ((lcore_id) = ...
> >
> 
> Why?

Variable with the same name (and type) can be defined by user before the loop,
With the intention to use it inside the loop.
Just like it happens here (in patch #2):
+	unsigned int lcore_id;
.....
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	lcore_id = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		lcore_id++;
+	}
+
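
Spelled out, the suggested signature would look something like the
below sketch (details and argument order to the author's taste):

#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
	for ((lcore_id) =						\
		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
	     (lcore_id) < RTE_MAX_LCORE;				\
	     (lcore_id)++,						\
		     (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))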
 




^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-17 14:28                                         ` Mattias Rönnblom
  2024-09-17 16:11                                           ` Konstantin Ananyev
@ 2024-09-17 16:29                                           ` Konstantin Ananyev
  2024-09-18  7:50                                             ` Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-17 16:29 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob


> >> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> >> +
> >> +static void *lcore_buffer;
> >> +static size_t offset = RTE_MAX_LCORE_VAR;
> >> +
> >> +static void *
> >> +lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +	void *handle;
> >> +	void *value;
> >> +
> >> +	offset = RTE_ALIGN_CEIL(offset, align);
> >> +
> >> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> >> +#ifdef RTE_EXEC_ENV_WINDOWS
> >> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> >> +					       RTE_CACHE_LINE_SIZE);
> >> +#else
> >> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> +					     LCORE_BUFFER_SIZE);
> >> +#endif
> >
> > Don't remember did that question already arise or not:
> > For debugging and health-checking purposes - would it make sense to link all
> > lcore_buffer values into a linked list?
> > So user/developer/some tool can walk over it to check that provided handle value
> > is really a valid lcore_var, etc.
> >
> 
> At least you could add some basic statistics, like the total size
> allocated by lcore variables, and the number of variables.

My thought was more about easing debugging/health-checking,
but yes, some stats can also be collected.
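
As a concrete sketch of the health-checking idea (hypothetical code,
not part of the patch set): each newly allocated lcore_buffer could be
prepended to a list, which a debugger or verification tool can walk to
check that a given handle points into some live buffer:

struct lcore_buffer_record {
	void *buffer;
	struct lcore_buffer_record *next;
};

static struct lcore_buffer_record *lcore_buffer_records;

/* called from lcore_var_alloc() after a new lcore_buffer is set up */
static void
record_lcore_buffer(void *buffer)
{
	/* libc malloc(), since this may run before the EAL heap exists */
	struct lcore_buffer_record *record = malloc(sizeof(*record));

	RTE_VERIFY(record != NULL);

	record->buffer = buffer;
	record->next = lcore_buffer_records;
	lcore_buffer_records = record;
}

A handle is then plausibly valid if it falls within
[buffer, buffer + LCORE_BUFFER_SIZE) of some recorded buffer.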

> One could also add tracing.
> 
> >> +		RTE_VERIFY(lcore_buffer != NULL);
> >> +
> >> +		offset = 0;
> >> +	}
> >> +

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v5 3/7] eal: add lcore variable performance test
  2024-09-17 15:40                                           ` Morten Brørup
@ 2024-09-18  6:05                                             ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  6:05 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

On 2024-09-17 17:40, Morten Brørup wrote:
>> +	start = rte_rdtsc();
>> +
>> +	for (i = 0; i < ITERATIONS; i++)
>> +		update_fun(mods[i & num_mods_mask]);
> 
> This indexing adds more instructions to be executed than just the update function.
> The added overhead is the same for all tested access methods, so the absolute difference in latency (i.e. measured in cycles) is still perfectly valid.
> Just mentioning it; no change required.
> 
>> +
>> +	end = rte_rdtsc();
>> +
>> +	latency = (end - start) / ITERATIONS;
> 
> This calculation is integer; add (double) somewhere to make it floating point.
> 

Indeed, it is. Will fix.

>> +
>> +	return latency;
>> +}
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-17 16:11                                           ` Konstantin Ananyev
@ 2024-09-18  7:00                                             ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  7:00 UTC (permalink / raw)
  To: Konstantin Ananyev, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob

On 2024-09-17 18:11, Konstantin Ananyev wrote:
>>>> +
>>>> +/**
>>>> + * Get pointer to lcore variable instance with the specified lcore id.
>>>> + *
>>>> + * @param lcore_id
>>>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
>>>> + *   instances should be accessed. The lcore id need not be valid
>>>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
>>>> + *   is also not valid (and thus should not be dereferenced).
>>>> + * @param handle
>>>> + *   The lcore variable handle.
>>>> + */
>>>> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
>>>> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
>>>> +
>>>> +/**
>>>> + * Get pointer to lcore variable instance of the current thread.
>>>> + *
>>>> + * May only be used by EAL threads and registered non-EAL threads.
>>>> + */
>>>> +#define RTE_LCORE_VAR_VALUE(handle) \
>>>> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
>>>
>>> Would it make sense to check that rte_lcore_id() !=  LCORE_ID_ANY?
>>> After all if people do not want this extra check, they can probably use
>>> RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
>>> explicitly.
>>>
>>
>> It would make sense, if it was an RTE_ASSERT(). Otherwise, I don't think
>> so. Attempting to gracefully handle API violations is bad practice, imo.
> 
> Ok, RTE_ASSERT() might be a good compromise.
> As I said in another mail for that thread, I wouldn't insist here.
> 

After having a closer look at this issue, I'm not so sure any more. 
Such an assertion would disallow the use of the macros to retrieve a 
potentially-invalid pointer, which is then never used, in case it is 
invalid.
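
That is, the following kind of usage (a sketch; the helper names are
made up) is allowed by the documented semantics, but would trip the
assertion:

	struct foo *foo = RTE_LCORE_VAR_VALUE(foo_var);

	if (unlikely(rte_lcore_id() == LCORE_ID_ANY)) {
		/* the pointer is invalid here, and never dereferenced */
		handle_unregistered_thread();
		return;
	}

	use_foo(foo);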

>>
>>>> +
>>>> +/**
>>>> + * Iterate over each lcore id's value for an lcore variable.
>>>> + *
>>>> + * @param value
>>>> + *   A pointer successively set to point to lcore variable value
>>>> + *   corresponding to every lcore id (up to @c RTE_MAX_LCORE).
>>>> + * @param handle
>>>> + *   The lcore variable handle.
>>>> + */
>>>> +#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
>>>> +	for (unsigned int lcore_id =					\
>>>> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
>>>> +	     lcore_id < RTE_MAX_LCORE;					\
>>>> +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
>>>
>>> Might be a bit better (and safer) to make lcore_id a macro parameter?
>>> I.E.:
>>> define RTE_LCORE_VAR_FOREACH_VALUE(value, handle, lcore_id) \
>>> for ((lcore_id) = ...
>>>
>>
>> Why?
> 
> Variable with the same name (and type) can be defined by user before the loop,
> With the intention to use it inside the loop.
> Just like it happens here (in patch #2):
> +	unsigned int lcore_id;
> .....
> +	/* take the opportunity to test the foreach macro */
> +	int *v;
> +	lcore_id = 0;
> +	RTE_LCORE_VAR_FOREACH_VALUE(v, test_int) {
> +		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
> +				  "Unexpected value on lcore %d during "
> +				  "iteration", lcore_id);
> +		lcore_id++;
> +	}
> +
>   
> 

Indeed. I'll change it. I suppose you could also have issues if you 
nested the macro, although those could be solved by using something like 
__COUNTER__ to create a unique name.

Supplying the variable name does defeat part of the purpose of the 
RTE_LCORE_VAR_FOREACH_VALUE.
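
For the record, the __COUNTER__ variant would look something like the
below sketch (the usual two-level concatenation is needed so that
__COUNTER__ expands before pasting):

#define __RTE_LVAR_CAT(a, b) a ## b
#define _RTE_LVAR_CAT(a, b) __RTE_LVAR_CAT(a, b)

#define __RTE_LCORE_VAR_FOREACH(lcore_id, value, handle)		\
	for (unsigned int lcore_id =					\
		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
	     lcore_id < RTE_MAX_LCORE;					\
	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))

#define RTE_LCORE_VAR_FOREACH_VALUE(value, handle)			\
	__RTE_LCORE_VAR_FOREACH(_RTE_LVAR_CAT(_lcore_id_, __COUNTER__), \
				value, handle)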

> 
> 

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v4 1/7] eal: add static per-lcore memory allocation facility
  2024-09-17 16:29                                           ` Konstantin Ananyev
@ 2024-09-18  7:50                                             ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  7:50 UTC (permalink / raw)
  To: Konstantin Ananyev, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob

On 2024-09-17 18:29, Konstantin Ananyev wrote:
> 
>>>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>>>> +
>>>> +static void *lcore_buffer;
>>>> +static size_t offset = RTE_MAX_LCORE_VAR;
>>>> +
>>>> +static void *
>>>> +lcore_var_alloc(size_t size, size_t align)
>>>> +{
>>>> +	void *handle;
>>>> +	void *value;
>>>> +
>>>> +	offset = RTE_ALIGN_CEIL(offset, align);
>>>> +
>>>> +	if (offset + size > RTE_MAX_LCORE_VAR) {
>>>> +#ifdef RTE_EXEC_ENV_WINDOWS
>>>> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
>>>> +					       RTE_CACHE_LINE_SIZE);
>>>> +#else
>>>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>>>> +					     LCORE_BUFFER_SIZE);
>>>> +#endif
>>>
>>> Don't remember did that question already arise or not:
>>> For debugging and health-checking purposes - would it make sense to link all
>>> lcore_buffer values into a linked list?
>>> So user/developer/some tool can walk over it to check that provided handle value
>>> is really a valid lcore_var, etc.
>>>
>>
>> At least you could add some basic statistics, like the total size
>> allocated by lcore variables, and the number of variables.
> 
> My thought was more about easing debugging/health-checking,
> but yes, some stats can also be collected.
> 

Statistics could be used for debugging and maybe some kind of 
rudimentary sanity check.
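
A minimal sketch of such statistics (hypothetical; not in this patch
set):

static size_t lcore_var_count;
static size_t lcore_var_total_size;

/* in lcore_var_alloc(), after the offset has been advanced: */
	lcore_var_count++;
	lcore_var_total_size += size;

These could then be exposed via, e.g., a telemetry endpoint or a dump
function.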

Maintaining per-variable state is not necessarily something you want to 
do, at least not close (spatially) to the lcore variable values.

In summary, I'm yet to form an opinion on what, if anything, we should
have here to help debugging. To avoid bloat, I would suggest deferring
this until we have more experience with lcore variables.

>> One could also add tracing.
>>
>>>> +		RTE_VERIFY(lcore_buffer != NULL);
>>>> +
>>>> +		offset = 0;
>>>> +	}
>>>> +

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v6 0/7] Lcore variables
  2024-09-17 14:32                                         ` [PATCH v5 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-18  8:00                                           ` Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                               ` (6 more replies)
  0 siblings, 7 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:00 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question on how to best allocate static per-lcore memory has been
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean,
and seemingly very performant solution to a real problem.

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 257 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  79 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 115 ++---
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 388 ++++++++++++++++
 lib/eal/version.map                           |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 17 files changed, 1335 insertions(+), 93 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v6 1/7] eal: add static per-lcore memory allocation facility
  2024-09-18  8:00                                           ` [PATCH v6 0/7] Lcore variables Mattias Rönnblom
@ 2024-09-18  8:00                                             ` Mattias Rönnblom
  2024-09-18  8:24                                               ` Konstantin Ananyev
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                               ` (5 subsequent siblings)
  6 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:00 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but with the values' life time decoupled from that of
the threads.

Lcore variables are also similar in terms of functionality to that
provided by the FreeBSD kernel's DPCPU_*() family of macros and the
associated build-time machinery. DPCPU uses linker scripts, which
effectively
prevents the reuse of its, otherwise seemingly viable, approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed structs.
The benefit of lcore variables over this approach is that data related
to the same lcore is now close (spatially, in memory), rather than
data used by the same module. This in turn avoids excessive use of
padding, which would otherwise pollute caches with unused data.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represent the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  79 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 388 ++++++++++++++++++
 lib/eal/version.map                           |   2 +
 10 files changed, 532 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c5a703b5c0..362d9a3f28 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index dd7bb0d35b..311692e498 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 9559c12a98..12b49672a6 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -433,12 +433,45 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, blocks allocated on the DPDK heap, and
+other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library or PMD may opt to keep per-thread
+state.
+
+Per-thread data may be maintained using either *lcore variables*
+(``rte_lcore_var.h``), *thread-local storage (TLS)*
+(``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE``
+elements, indexed by ``rte_lcore_id()``. These methods allow for
+per-lcore data to be a largely module-internal affair, and not
+directly visible in its API. Another possibility is to deal
+explicitly with per-thread aspects in the API (e.g., the ports of the
+Eventdev API).
+
+Lcore variables are suitable for small objects statically allocated at
+the time of module or application initialization. An lcore variable
+takes on one value for each lcore id-equipped thread (i.e., for EAL
+threads and registered non-EAL threads; in total ``RTE_MAX_LCORE``
+instances). The lifetime of lcore variables is detached from that of
+the owning threads, and they may thus be initialized prior to the
+owners having been created.
+
+Variables with thread-local storage are allocated at the time of
+thread creation, and exist until the thread terminates, for every
+thread in the process. Only very small objects should be allocated in
+TLS, since large TLS objects significantly slow down thread creation
+and may needlessly increase the memory footprint of applications that
+make extensive use of unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore id-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must both be cache-aligned and include a
+``RTE_CACHE_GUARD``. Such extensive use of padding causes internal
+fragmentation (i.e., unused space) and lowers cache hit rates.
+
+For more discussions on per-lcore state, see the ``rte_lcore_var.h``
+API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..a3884f7491 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,20 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but with the values' life time decoupled
+    from that of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..6b7690795e
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+#ifdef RTE_EXEC_ENV_WINDOWS
+		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
+					       RTE_CACHE_LINE_SIZE);
+#else
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+#endif
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %zu bytes of per-lcore data with a "
+		"%zu-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on the cache line
+	 * size, as well as having the base pointer cache-line
+	 * aligned, assures that aligned offsets also translate to
+	 * aligned pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..e8db1391fe
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,388 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * instance for each current and future lcore id-equipped thread, with
+ * a total of RTE_MAX_LCORE instances. The value of an lcore variable
+ * for a particular lcore id is independent from other values (for
+ * other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for an @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the
+ *     time of module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races between
+ * the owning thread and any non-owner threads accessing the same
+ * lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
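+ *
+ * As an example of non-owner access, a thread may read another
+ * lcore's counter using an atomic load (a minimal sketch, with a
+ * hypothetical @c lcore_counters variable):
+ *
+ * @code{.c}
+ * static RTE_LCORE_VAR_HANDLE(RTE_ATOMIC(uint64_t), lcore_counters);
+ *
+ * uint64_t
+ * foo_read_lcore_counter(unsigned int lcore_id)
+ * {
+ *         RTE_ATOMIC(uint64_t) *counter =
+ *                 RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_counters);
+ *
+ *         return rte_atomic_load_explicit(counter,
+ *                                         rte_memory_order_relaxed);
+ * }
+ * @endcode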
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never allocates.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * The lcore variable values are stored in a series of lcore buffers,
+ * which are allocated from the libc heap. Heap allocation failures
+ * are treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, the thread most recently accessing nearby
+ * data structures should almost always be the lcore variable's
+ * owner. Adding padding will increase the effective memory working
+ * set size, potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to an
+ * even number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions and for example next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of the particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. As a result, thread-local variables must be initialized in
+ *     a "lazy" manner (e.g., at the point of thread creation). Lcore
+ *     variables may be accessed immediately after having been
+ *     allocated (which may occur before any thread other than the
+ *     main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or an increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable may
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread and
+ *     GCC _Thread_local, such data sharing is supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
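+	/* The handle is the address of the lcore id 0 value instance;
+	 * values for other lcore ids follow at a stride of
+	 * RTE_MAX_LCORE_VAR bytes.
+	 */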
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to every
+ *   valid lcore id (up to @c RTE_MAX_LCORE).
+ * @param value
+ *   A pointer variable successively set to point to the lcore
+ *   variable value instance of the lcore id currently being processed.
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
+	for (lcore_id = (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     lcore_id < RTE_MAX_LCORE;					\
+	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..0c80bf7331 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,8 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v6 2/7] eal: add lcore variable functional tests
  2024-09-18  8:00                                           ` [PATCH v6 0/7] Lcore variables Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-18  8:00                                             ` Mattias Rönnblom
  2024-09-18  8:25                                               ` Konstantin Ananyev
  2024-09-18  8:00                                             ` [PATCH v6 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                               ` (4 subsequent siblings)
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:00 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 436 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 437 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..2a1f258548
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
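+/* the tests require at least one worker lcore, in addition to the main lcore */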
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* a private, larger struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
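+/* enough 32-bit variables to fill two full per-lcore buffers */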
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v6 3/7] eal: add lcore variable performance test
  2024-09-18  8:00                                           ` [PATCH v6 0/7] Lcore variables Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-09-18  8:00                                             ` Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                               ` (3 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:00 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

--

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 257 +++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..2680bfb6f7
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =
+		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
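+	/* a power-of-two mask avoids a costly modulo operation in the
+	 * hot loop
+	 */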
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each with a small, per-lcore state. Note however that
+ * these tests have very little non-lcore/thread-local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v6 4/7] random: keep PRNG state in lcore variable
  2024-09-18  8:00                                           ` [PATCH v6 0/7] Lcore variables Mattias Rönnblom
                                                               ` (2 preceding siblings ...)
  2024-09-18  8:00                                             ` [PATCH v6 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-09-18  8:00                                             ` Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 5/7] power: keep per-lcore " Mattias Rönnblom
                                                               ` (2 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:00 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
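+	/* lcore_id equals RTE_MAX_LCORE here, yielding a distinct seed
+	 * for the state shared by all unregistered non-EAL threads
+	 */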
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v6 5/7] power: keep per-lcore state in lcore variable
  2024-09-18  8:00                                           ` [PATCH v6 0/7] Lcore variables Mattias Rönnblom
                                                               ` (3 preceding siblings ...)
  2024-09-18  8:00                                             ` [PATCH v6 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-09-18  8:00                                             ` Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 6/7] service: " Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:00 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a981db4b39 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v6 6/7] service: keep per-lcore state in lcore variable
  2024-09-18  8:00                                           ` [PATCH v6 0/7] Lcore variables Mattias Rönnblom
                                                               ` (4 preceding siblings ...)
  2024-09-18  8:00                                             ` [PATCH v6 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-09-18  8:00                                             ` Mattias Rönnblom
  2024-09-18  8:00                                             ` [PATCH v6 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:00 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

--

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 115 +++++++++++++++++++----------------
 1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 56379930b6..03379f1588 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,12 +102,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -122,7 +119,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +132,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +281,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +288,10 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +449,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +462,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +484,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +530,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +546,11 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +567,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +584,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +636,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +688,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +706,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +731,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +755,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +779,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +809,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +818,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +879,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +895,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +942,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +971,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +983,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1022,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v6 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-09-18  8:00                                           ` [PATCH v6 0/7] Lcore variables Mattias Rönnblom
                                                               ` (5 preceding siblings ...)
  2024-09-18  8:00                                             ` [PATCH v6 6/7] service: " Mattias Rönnblom
@ 2024-09-18  8:00                                             ` Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:00 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Keep per-lcore power intrinsics state in an lcore variable to reduce
cache working set size and to avoid CPU next-line prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v6 1/7] eal: add static per-lcore memory allocation facility
  2024-09-18  8:00                                             ` [PATCH v6 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-18  8:24                                               ` Konstantin Ananyev
  2024-09-18  8:25                                                 ` Mattias Rönnblom
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-18  8:24 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob

> +/**
> + * Iterate over each lcore id's value for an lcore variable.
> + *
> + * @param lcore_id
> + *   An <code>unsigned int</code> variable successively set to the
> + *   lcore id of every valid lcore id (up to @c RTE_MAX_LCORE).
> + * @param value
> + *   A pointer variable successively set to point to lcore variable
> + *   value instance of the current lcore id being processed.
> + * @param handle
> + *   The lcore variable handle.
> + */
> +#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
> +	for (lcore_id =	(((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
> +	     lcore_id < RTE_MAX_LCORE;					\
> +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
> +

I think we need a '()' around references to lcore_id:
 for ((lcore_id) = (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
	     (lcore_id) < RTE_MAX_LCORE;					\
	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
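
To see why (a minimal sketch, not code from the patch; 'handle' is
assumed to be a previously allocated int lcore variable handle):

	unsigned int id;
	unsigned int *idp = &id;
	int *value;

	/* With the unparenthesized macro body, the 'lcore_id++' step
	 * expands to '*idp++'; since postfix ++ binds tighter than
	 * unary *, this advances the pointer rather than the counter
	 * it points to. With '(lcore_id)++', it expands to '(*idp)++',
	 * which increments the counter as intended.
	 */
	RTE_LCORE_VAR_FOREACH_VALUE(*idp, value, handle) {
		/* ... */
	}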

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v6 1/7] eal: add static per-lcore memory allocation facility
  2024-09-18  8:24                                               ` Konstantin Ananyev
@ 2024-09-18  8:25                                                 ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:25 UTC (permalink / raw)
  To: Konstantin Ananyev, Mattias Rönnblom, dev
  Cc: Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob

On 2024-09-18 10:24, Konstantin Ananyev wrote:
>> +/**
>> + * Iterate over each lcore id's value for an lcore variable.
>> + *
>> + * @param lcore_id
>> + *   An <code>unsigned int</code> variable successively set to the
>> + *   lcore id of every valid lcore id (up to @c RTE_MAX_LCORE).
>> + * @param value
>> + *   A pointer variable successively set to point to lcore variable
>> + *   value instance of the current lcore id being processed.
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
>> +	for (lcore_id =	(((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
>> +	     lcore_id < RTE_MAX_LCORE;					\
>> +	     lcore_id++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))
>> +
> 
> I think we need a '()' around references to lcore_id:
>   for ((lcore_id) = (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
> 	     (lcore_id) < RTE_MAX_LCORE;					\
> 	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle))

Yes, of course. Thanks.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v6 2/7] eal: add lcore variable functional tests
  2024-09-18  8:00                                             ` [PATCH v6 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-09-18  8:25                                               ` Konstantin Ananyev
  0 siblings, 0 replies; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-18  8:25 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob



> -----Original Message-----
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Sent: Wednesday, September 18, 2024 9:01 AM
> To: dev@dpdk.org
> Cc: hofors@lysator.liu.se; Morten Brørup <mb@smartsharesystems.com>; Stephen Hemminger <stephen@networkplumber.org>;
> Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>; David Marchand <david.marchand@redhat.com>; Jerin Jacob
> <jerinj@marvell.com>; Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Subject: [PATCH v6 2/7] eal: add lcore variable functional tests
> 
> Add functional test suite to exercise the <rte_lcore_var.h> API.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

> 2.34.1
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v7 0/7] Lcore variables
  2024-09-18  8:00                                             ` [PATCH v6 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-18  8:24                                               ` Konstantin Ananyev
@ 2024-09-18  8:26                                               ` Mattias Rönnblom
  2024-09-18  8:26                                                 ` [PATCH v7 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                   ` (8 more replies)
  1 sibling, 9 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:26 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
patch set, and to iron out some, but surely not all, wrinkles in the API.

The question of how best to allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In
the author's opinion, they do, however, provide a reasonably simple,
clean, and seemingly very performant solution to a real problem.

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 257 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  79 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 117 ++---
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 390 ++++++++++++++++
 lib/eal/version.map                           |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 17 files changed, 1339 insertions(+), 93 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
@ 2024-09-18  8:26                                                 ` Mattias Rönnblom
  2024-09-18  9:23                                                   ` Konstantin Ananyev
                                                                     ` (2 more replies)
  2024-09-18  8:26                                                 ` [PATCH v7 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:26 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in terms of the functionality
provided by the FreeBSD kernel's DPCPU_*() family of macros and the
associated build-time machinery. DPCPU uses linker scripts, which
effectively prevents the reuse of its otherwise seemingly viable
approach.

The currently prevailing way to solve the problem addressed by lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
structs. The benefit of lcore variables over this approach is that
data related to the same lcore is kept spatially close in memory,
rather than data used by the same module. This in turn avoids
excessive use of padding, which would otherwise pollute caches with
unused data.
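
To make the difference concrete, below is a minimal sketch of the two
declaration styles (the struct and its fields are illustrative only,
not taken from this patch):

/* Prevailing pattern: per-module array, with each element padded
 * out to occupy its own set of cache lines.
 */
struct __rte_cache_aligned foo_state_padded {
	int a;
	long b;
	RTE_CACHE_GUARD;
};
static struct foo_state_padded foo_states_array[RTE_MAX_LCORE];

/* Lcore variable equivalent: no alignment or guard padding needed,
 * since each lcore id's value is grouped with that lcore's other
 * lcore variable values, rather than with other lcores' foo state.
 */
struct foo_state {
	int a;
	long b;
};
static RTE_LCORE_VAR_HANDLE(struct foo_state, foo_states);
RTE_LCORE_VAR_INIT(foo_states);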

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow an
   arbitrary expression, not just a simple variable name, to be passed.
   (Konstantin Ananyev)

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  79 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 390 ++++++++++++++++++
 lib/eal/version.map                           |   2 +
 10 files changed, 534 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c5a703b5c0..362d9a3f28 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index dd7bb0d35b..311692e498 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 9559c12a98..12b49672a6 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -433,12 +433,45 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, blocks allocated on the DPDK heap, and
+other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library, or a PMD may opt to keep per-thread
+state.
+
+Per-thread data may be maintained using either *lcore variables*
+(``rte_lcore_var.h``), *thread-local storage (TLS)*
+(``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE``
+elements, indexed by ``rte_lcore_id()``. These methods allow
+per-lcore data to be a largely module-internal affair, not
+directly visible in the module's API. Another possibility is to deal
+explicitly with per-thread aspects in the API (e.g., the ports of the
+Eventdev API).
+
+Lcore variables are suitable for small objects statically allocated
+at the time of module or application initialization. An lcore
+variable takes on one value for each lcore id-equipped thread (i.e.,
+for EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of lcore variables is
+detached from that of the owning threads, and they may thus be
+initialized prior to their owners having been created.
+
+Variables with thread-local storage are allocated at the time of
+thread creation, and exist until the thread terminates, for every
+thread in the process. Only very small objects should be allocated in
+TLS, since large TLS objects significantly slow down thread creation
+and may needlessly increase the memory footprint of applications that
+make extensive use of unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore id-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must both be cache-aligned and include a
+``RTE_CACHE_GUARD``. Such extensive use of padding causes internal
+fragmentation (i.e., unused space) and lowers cache hit rates.
+
+For more discussions on per-lcore state, see the ``rte_lcore_var.h``
+API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 0ff70d9057..a3884f7491 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -55,6 +55,20 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..6b7690795e
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+#ifdef RTE_EXEC_ENV_WINDOWS
+		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
+					       RTE_CACHE_LINE_SIZE);
+#else
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+#endif
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer cache-line aligned,
+	 * assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+		align = alignof(max_align_t);
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..894100d1e4
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,390 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * instance for each current and future lcore id-equipped thread, with
+ * a total of RTE_MAX_LCORE instances. The value of an lcore variable
+ * for a particular lcore id is independent of other values (for
+ * other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for an @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the
+ *     time of module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races between
+ * the owning thread and any non-owner threads accessing the same
+ * lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never allocates.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * The lcore variable values are stored in a series of lcore buffers,
+ * which are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, the thread most recently accessing nearby
+ * data structures should almost always be the lcore variable's
+ * owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to an
+ * even number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (features which seem to be getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions, and, for example, next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follows that of the particular thread. The data cannot
+ *     be accessed before the thread has been created, nor after it has
+ *     exited. As a result, thread-local variables must be initialized
+ *     in a "lazy" manner (e.g., at the point of thread creation). Lcore
+ *     variables may be accessed immediately after having been
+ *     allocated (which may be prior to any thread other than the main
+ *     thread running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization, or an increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to, and successfully dereferenced by, a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread and GCC _Thread_local, such data sharing is
+ *     supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to every
+ *   valid lcore id (up to @c RTE_MAX_LCORE).
+ * @param value
+ *   A pointer variable successively set to point to the lcore
+ *   variable value instance of the lcore id currently being processed.
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..0c80bf7331 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,8 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread
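
As an illustration of the value placement implemented by the above
patch (a sketch only; the fixed stride is an internal detail of this
implementation, not an API guarantee):

#include <stdio.h>

#include <rte_lcore_var.h>

static RTE_LCORE_VAR_HANDLE(int, counter);

RTE_LCORE_VAR_INIT(counter);

static void
show_value_spacing(void)
{
	int *v0 = RTE_LCORE_VAR_LCORE_VALUE(0, counter);
	int *v1 = RTE_LCORE_VAR_LCORE_VALUE(1, counter);

	/* rte_lcore_var_lcore_ptr() places lcore id N's value at
	 * handle + N * RTE_MAX_LCORE_VAR, so with the default
	 * configuration this prints 1048576 bytes.
	 */
	printf("per-lcore value stride: %td bytes\n",
	       (char *)v1 - (char *)v0);
}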

* [PATCH v7 2/7] eal: add lcore variable functional tests
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
  2024-09-18  8:26                                                 ` [PATCH v7 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-18  8:26                                                 ` Mattias Rönnblom
  2024-09-18  8:26                                                 ` [PATCH v7 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:26 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 436 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 437 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..2a1f258548
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int
+check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void
+test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool
+test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void
+test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int
+check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v7 3/7] eal: add lcore variable performance test
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
  2024-09-18  8:26                                                 ` [PATCH v7 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-18  8:26                                                 ` [PATCH v7 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-09-18  8:26                                                 ` Mattias Rönnblom
  2024-10-09 20:46                                                   ` Morten Brørup
  2024-09-18  8:26                                                 ` [PATCH v7 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:26 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom

Add a basic micro benchmark for lcore variables, in an attempt to
ensure that the overhead isn't significantly greater than that of
alternative approaches, in scenarios where the benefits aren't
expected to show up (i.e., when plenty of cache is available compared
to the working set size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

--

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 257 +++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..2680bfb6f7
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =
+		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays are not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each with a small, per-lcore state. Note, however, that
+ * these tests have very little non-lcore/thread-local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v7 4/7] random: keep PRNG state in lcore variable
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
                                                                   ` (2 preceding siblings ...)
  2024-09-18  8:26                                                 ` [PATCH v7 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-09-18  8:26                                                 ` Mattias Rönnblom
  2024-09-18  8:26                                                 ` [PATCH v7 5/7] power: keep per-lcore " Mattias Rönnblom
                                                                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:26 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v7 5/7] power: keep per-lcore state in lcore variable
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
                                                                   ` (3 preceding siblings ...)
  2024-09-18  8:26                                                 ` [PATCH v7 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-09-18  8:26                                                 ` Mattias Rönnblom
  2024-09-18  8:26                                                 ` [PATCH v7 6/7] service: " Mattias Rönnblom
                                                                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:26 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a981db4b39 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v7 6/7] service: keep per-lcore state in lcore variable
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
                                                                   ` (4 preceding siblings ...)
  2024-09-18  8:26                                                 ` [PATCH v7 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-09-18  8:26                                                 ` Mattias Rönnblom
  2024-09-18  8:26                                                 ` [PATCH v7 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
                                                                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:26 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

--

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 117 +++++++++++++++++++----------------
 1 file changed, 65 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 56379930b6..59c4f77966 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -75,7 +76,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -101,12 +102,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -122,7 +119,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -136,7 +132,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -286,7 +281,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -294,9 +288,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -454,7 +450,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -464,7 +463,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -486,8 +485,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +531,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +547,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +569,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +586,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,30 +638,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -685,13 +690,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -702,14 +708,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -725,17 +733,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -747,7 +757,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -771,7 +781,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -801,6 +811,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -808,12 +820,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -834,7 +845,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -845,7 +856,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -853,7 +864,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -861,7 +872,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -870,7 +881,7 @@ lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -886,7 +897,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -930,12 +944,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -960,7 +973,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -971,12 +985,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1011,7 +1024,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v7 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
                                                                   ` (5 preceding siblings ...)
  2024-09-18  8:26                                                 ` [PATCH v7 6/7] service: " Mattias Rönnblom
@ 2024-09-18  8:26                                                 ` Mattias Rönnblom
  2024-09-18  9:30                                                 ` [PATCH v7 0/7] Lcore variables fengchengwen
  2024-10-10  5:06                                                 ` Stephen Hemminger
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-18  8:26 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob,
	Mattias Rönnblom, Konstantin Ananyev

Keep per-lcore power intrinsics state in a lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-09-18  8:26                                                 ` [PATCH v7 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-09-18  9:23                                                   ` Konstantin Ananyev
  2024-10-09 22:15                                                   ` Morten Brørup
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
  2 siblings, 0 replies; 313+ messages in thread
From: Konstantin Ananyev @ 2024-09-18  9:23 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob



> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small, frequently-accessed data structures, for which one instance
> should exist for each lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' life time with that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoid excessive use of padding,
> polluting caches with unused data.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>

> 2.34.1
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v7 0/7] Lcore variables
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
                                                                   ` (6 preceding siblings ...)
  2024-09-18  8:26                                                 ` [PATCH v7 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
@ 2024-09-18  9:30                                                 ` fengchengwen
  2024-10-10  5:06                                                 ` Stephen Hemminger
  8 siblings, 0 replies; 313+ messages in thread
From: fengchengwen @ 2024-09-18  9:30 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob

Series-acked-by: Chengwen Feng <fengchengwen@huawei.com>

On 2024/9/18 16:26, Mattias Rönnblom wrote:
> This patch set introduces a new API <rte_lcore_var.h> for static
> per-lcore id data allocation.
> 
> Please refer to the <rte_lcore_var.h> API documentation for both a
> rationale for this new API, and a comparison to the alternatives
> available.
> 
> The adoption of this API would affect many different DPDK modules, but
> the author updated only a few, mostly to serve as examples in this
> RFC, and to iron out some, but surely not all, wrinkles in the API.
> 
> The question on how to best allocate static per-lcore memory has been
> up several times on the dev mailing list, for example in the thread on
> "random: use per lcore state" RFC by Stephen Hemminger.
> 
> Lcore variables are surely not the answer to all your per-lcore-data
> needs, since it only allows for more-or-less static allocation. In the
> author's opinion, it does however provide a reasonably simple and
> clean and seemingly very much performant solution to a real problem.
> 
> Mattias Rönnblom (7):
>   eal: add static per-lcore memory allocation facility
>   eal: add lcore variable functional tests
>   eal: add lcore variable performance test
>   random: keep PRNG state in lcore variable
>   power: keep per-lcore state in lcore variable
>   service: keep per-lcore state in lcore variable
>   eal: keep per-lcore power intrinsics state in lcore variable
> 
>  MAINTAINERS                                   |   6 +
>  app/test/meson.build                          |   2 +
>  app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
>  app/test/test_lcore_var_perf.c                | 257 +++++++++++
>  config/rte_config.h                           |   1 +
>  doc/api/doxy-api-index.md                     |   1 +
>  .../prog_guide/env_abstraction_layer.rst      |  45 +-
>  doc/guides/rel_notes/release_24_11.rst        |  14 +
>  lib/eal/common/eal_common_lcore_var.c         |  79 ++++
>  lib/eal/common/meson.build                    |   1 +
>  lib/eal/common/rte_random.c                   |  28 +-
>  lib/eal/common/rte_service.c                  | 117 ++---
>  lib/eal/include/meson.build                   |   1 +
>  lib/eal/include/rte_lcore_var.h               | 390 ++++++++++++++++
>  lib/eal/version.map                           |   2 +
>  lib/eal/x86/rte_power_intrinsics.c            |  17 +-
>  lib/power/rte_power_pmd_mgmt.c                |  35 +-
>  17 files changed, 1339 insertions(+), 93 deletions(-)
>  create mode 100644 app/test/test_lcore_var.c
>  create mode 100644 app/test/test_lcore_var_perf.c
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v3 3/7] eal: add lcore variable performance test
  2024-09-16 10:50                                             ` Mattias Rönnblom
@ 2024-09-18 10:04                                               ` Jerin Jacob
  0 siblings, 0 replies; 313+ messages in thread
From: Jerin Jacob @ 2024-09-18 10:04 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Mattias Rönnblom, dev, Morten Brørup,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

On Mon, Sep 16, 2024 at 4:20 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
> On 2024-09-13 13:23, Jerin Jacob wrote:
> > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >>
> >> On 2024-09-12 17:11, Jerin Jacob wrote:
> >>> On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >>>>
> >>>> On 2024-09-12 15:09, Jerin Jacob wrote:
> >>>>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> >>>>> <mattias.ronnblom@ericsson.com> wrote:
> >>>>>>
> >>>>>> Add basic micro benchmark for lcore variables, in an attempt to assure
> >>>>>> that the overhead isn't significantly greater than alternative
> >>>>>> approaches, in scenarios where the benefits aren't expected to show up
> >>>>>> (i.e., when plenty of cache is available compared to the working set
> >>>>>> size of the per-lcore data).
> >>>>>>
> >>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>>>> ---
> >>>>>>     app/test/meson.build           |   1 +
> >>>>>>     app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
> >>>>>>     2 files changed, 161 insertions(+)
> >>>>>>     create mode 100644 app/test/test_lcore_var_perf.c
> >>>>>
> >>>>>
> >>>>>> +static double
> >>>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >>>>>> +{
> >>>>>> +       uint64_t i;
> >>>>>> +       uint64_t start;
> >>>>>> +       uint64_t end;
> >>>>>> +       double latency;
> >>>>>> +
> >>>>>> +       init_fun();
> >>>>>> +
> >>>>>> +       start = rte_get_timer_cycles();
> >>>>>> +
> >>>>>> +       for (i = 0; i < ITERATIONS; i++)
> >>>>>> +               update_fun();
> >>>>>> +
> >>>>>> +       end = rte_get_timer_cycles();
> >>>>>
> >>>>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> >>>>
> >>>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> >>>
> >>> I was thinking in another way: with 1e7 iterations, the additional
> >>> barrier on precise will be amortized, and we get more _deterministic_
> >>> behavior, especially if we print cycles and need to catch
> >>> regressions.
> >>
> >> If you time a section of code which spends ~40000000 cycles, it doesn't
> >> matter if you add or remove a few cycles at the beginning and the end.
> >>
> >> The rte_rdtsc_precise() is both better (more precise in the sense of
> >> more serialization), and worse (because it's more costly, and thus more
> >> intrusive).
> >
> > We can calibrate the overhead to remove the cost.
> >
> What you are interested in is primarily the impact on (instruction)
> throughput, not the latency of the sequence of instructions that must be
> retired in order to load the lcore variable values, when you switch from
> (say) lcore id-indexed static arrays to lcore variables in your module.
>
> Usually, there is no reason to make a distinction between latency and
> throughput in this context, but as you zoom into very short snippets of
> code being executed, the difference becomes relevant. For example,
> adding a div instruction won't necessarily add 12 cc to your program's
> execution time on a Zen 4, even though that is its latency. Rather, the
> effects may, depending on data dependencies and what other instructions
> are executed in parallel, be much smaller.
>
> So, one could argue the ILP you get with the loop is a feature, not a bug.
>
> With or without per-iteration latency measurements, these benchmarks are
> not very useful at best, and misleading at worst. I will rework them to
> include more than a single module/lcore variable, which I think would be
> somewhat of an improvement.

OK. A module parameter will remove the compiler optimization and be
more accurate. I was doing manual loop unrolling[1] in a trace test
case (for small inline functions); see the sketch after the link below.
Either way is fine. Thanks for the rework.

[1]
https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30
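
For illustration, a minimal sketch of that unrolling pattern (a
hypothetical helper, not taken from the linked test; it assumes the
ITERATIONS constant and the update_fun() signature from the benchmark
patch above). Repeating the measured call several times per loop
iteration amortizes the loop and TSC-read overhead across more
updates:

#include <stdint.h>

#include <rte_cycles.h>

#define REPEAT_8(expr)						\
	do {							\
		expr; expr; expr; expr;				\
		expr; expr; expr; expr;				\
	} while (0)

static inline double
benchmark_unrolled(void (*update_fun)(unsigned int))
{
	uint64_t start;
	uint64_t end;
	uint64_t i;

	start = rte_rdtsc_precise();

	for (i = 0; i < ITERATIONS / 8; i++)
		REPEAT_8(update_fun(0));

	end = rte_rdtsc_precise();

	/* eight updates retire per loop iteration */
	return (end - start) / (double)ITERATIONS;
}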


>
> Even better would be to have some real domain logic, instead of just a dummy
> multiplication.
>
> >>
> >> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> >> doesn't matter.
> >
> > Yes. In this setup it is pretty inaccurate PER iteration. Please
> > refer to the below patch to see the difference.
> >
> > Patch 1: Make nanoseconds to cycles per iteration
> > ------------------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..b8d25400f593 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > void (*update_fun)(void))
> >
> >          end = rte_get_timer_cycles();
> >
> > -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > +       latency = ((end - start)) / ITERATIONS;
> >
> >          return latency;
> >   }
> > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >          printf("Thread-local storage  Static array  Lcore variables\n");
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >          return TEST_SUCCESS;
> >   }
> >
> >
> > Patch 2: Change to precise with calibration
> > -----------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..8142ecd56241 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -96,23 +96,28 @@ lvar_update(void)
> >   static double
> >   benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >   {
> > -       uint64_t i;
> > +       double tsc_latency;
> > +       double latency;
> >          uint64_t start;
> >          uint64_t end;
> > -       double latency;
> > +       uint64_t i;
> >
> > -       init_fun();
> > +       /* calculate rte_rdtsc_precise overhead */
> > +       start = rte_rdtsc_precise();
> > +       end = rte_rdtsc_precise();
> > +       tsc_latency = (end - start);
> >
> > -       start = rte_get_timer_cycles();
> > +       init_fun();
> >
> > -       for (i = 0; i < ITERATIONS; i++)
> > +       latency = 0;
> > +       for (i = 0; i < ITERATIONS; i++) {
> > +               start = rte_rdtsc_precise();
> >                  update_fun();
> > +               end = rte_rdtsc_precise();
> > +               latency += (end - start) - tsc_latency;
> > +       }
> >
> > -       end = rte_get_timer_cycles();
> > -
> > -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > -
> > -       return latency;
> > +       return latency / (double)ITERATIONS;
> >   }
> >
> >   static int
> > @@ -135,10 +140,9 @@ test_lcore_var_access(void)
> >          sarray_latency = benchmark_access_method(sarray_init, sarray_update);
> >          lvar_latency = benchmark_access_method(lvar_init, lvar_update);
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >          printf("Thread-local storage  Static array  Lcore variables\n");
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >          return TEST_SUCCESS;
> >   }
> >
> > ARM N2 core with patch 1(aka current scheme)
> > -----------------------------------
> >
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage  Static array  Lcore variables
> >                   7.0           7.0              7.0
> >
> >
> > ARM N2 core with patch 2
> > -----------------------------------
> >
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage  Static array  Lcore variables
> >                  11.4          15.5             15.5
> >
> > x86 i9 core with patch 1(aka current scheme)
> > ------------------------------------------------------------
> >
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [ns/update]
> > Thread-local storage  Static array  Lcore variables
> >                   5.0           6.0              6.0
> >
> > x86 i9 core with patch 2
> > --------------------------------
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage  Static array  Lcore variables
> >                   5.3          10.6             11.7
> >
> >
> >
> >
> >
> >>
> >>> Furthermore, you may consider replacing rte_random() in fast path to
> >>> running number or so if it is not deterministic in cycle computation.
> >>
> >> rte_rand() is not used in the fast path. I don't understand what you
> >
> > I missed that. Ignore this comment.
> >
> >> mean by "running number".

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-12 15:22                                     ` Jerin Jacob
@ 2024-09-18 10:11                                       ` Jerin Jacob
  2024-09-19 19:31                                         ` Mattias Rönnblom
  2024-10-14  7:56                                         ` Morten Brørup
  0 siblings, 2 replies; 313+ messages in thread
From: Jerin Jacob @ 2024-09-18 10:11 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, dev, Chengwen Feng, Mattias Rönnblom,
	Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Anatoly Burakov

On Thu, Sep 12, 2024 at 8:52 PM Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> On Thu, Sep 12, 2024 at 7:11 PM Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > Sent: Thursday, 12 September 2024 15.17
> > >
> > > On Thu, Sep 12, 2024 at 2:40 PM Morten Brørup <mb@smartsharesystems.com>
> > > wrote:
> > > >
> > > > > +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> > > >
> > > > Considering hugepages...
> > > >
> > > > Lcore variables may be allocated before DPDK's memory allocator
> > > (rte_malloc()) is ready, so rte_malloc() cannot be used for lcore variables.
> > > >
> > > > And lcore variables are not usable (shared) for DPDK multi-process, so the
> > > lcore_buffer could be allocated through the O/S APIs as anonymous hugepages,
> > > instead of using rte_malloc().
> > > >
> > > > The alternative, using rte_malloc(), would disallow allocating lcore
> > > variables before DPDK's memory allocator has been initialized, which I think
> > > is too late.
> > >
> > > I thought it is not. A lot of the subsystems are initialized after the
> > > memory subsystem is initialized.
> > > [1] example given in documentation. I thought RTE_INIT needs to be
> > > replaced if the subsystem is called after memory is initialized (which is
> > > the case for most of the libraries)
> >
> > The list of RTE_INIT functions are called before main(). It is not very useful.
> >
> > Yes, it would be good to replace (or supplement) RTE_INIT_PRIO by something similar, which calls the list of "INIT" functions at the appropriate time during EAL initialization.
> >
> > DPDK should then use this "INIT" list for all its initialization, so the init function of new features (such as this, and trace) can be inserted at the correct location in the list.
> >
> > > Trace library had a similar situation. It is managed like [2]
> >
> > Yes, if we insist on using rte_malloc() for lcore variables, the alternative is to prohibit establishing lcore variables in functions called through RTE_INIT.
>
> I was not insisting on using ONLY rte_malloc(). Since rte_malloc() can
> be called before rte_eal_init() (it will return NULL), the alloc
> routine can first check whether rte_malloc() is available and, if not,
> switch over to glibc.


@Mattias Rönnblom This comment is not addressed in v7. Could you check?

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-18 10:11                                       ` Jerin Jacob
@ 2024-09-19 19:31                                         ` Mattias Rönnblom
  2024-10-14  7:56                                         ` Morten Brørup
  1 sibling, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-09-19 19:31 UTC (permalink / raw)
  To: Jerin Jacob, Morten Brørup
  Cc: Mattias Rönnblom, dev, Chengwen Feng, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Anatoly Burakov

On 2024-09-18 12:11, Jerin Jacob wrote:
> On Thu, Sep 12, 2024 at 8:52 PM Jerin Jacob <jerinjacobk@gmail.com> wrote:
>>
>> On Thu, Sep 12, 2024 at 7:11 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>>>
>>>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>>>> Sent: Thursday, 12 September 2024 15.17
>>>>
>>>> On Thu, Sep 12, 2024 at 2:40 PM Morten Brørup <mb@smartsharesystems.com>
>>>> wrote:
>>>>>
>>>>>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>>>>>
>>>>> Considering hugepages...
>>>>>
>>>>> Lcore variables may be allocated before DPDK's memory allocator
>>>> (rte_malloc()) is ready, so rte_malloc() cannot be used for lcore variables.
>>>>>
>>>>> And lcore variables are not usable (shared) for DPDK multi-process, so the
>>>> lcore_buffer could be allocated through the O/S APIs as anonymous hugepages,
>>>> instead of using rte_malloc().
>>>>>
>>>>> The alternative, using rte_malloc(), would disallow allocating lcore
>>>> variables before DPDK's memory allocator has been initialized, which I think
>>>> is too late.
>>>>
>>>> I thought it is not. A lot of the subsystems are initialized after the
>>>> memory subsystem is initialized.
>>>> [1] is an example given in the documentation. I thought RTE_INIT needs
>>>> to be replaced if the subsystem is called after memory is initialized
>>>> (which is the case for most of the libraries).
>>>
>>> The RTE_INIT functions are called before main(), which is not very useful here.
>>>
>>> Yes, it would be good to replace (or supplement) RTE_INIT_PRIO by something similar, which calls the list of "INIT" functions at the appropriate time during EAL initialization.
>>>
>>> DPDK should then use this "INIT" list for all its initialization, so the init function of new features (such as this, and trace) can be inserted at the correct location in the list.
>>>
>>>> Trace library had a similar situation. It is managed like [2]
>>>
>>> Yes, if we insist on using rte_malloc() for lcore variables, the alternative is to prohibit establishing lcore variables in functions called through RTE_INIT.
>>
>> I was not insisting on using ONLY rte_malloc(). Since rte_malloc() can
>> be called before rte_eal_init() (it will return NULL), the alloc
>> routine can first check whether rte_malloc() is available and, if not,
>> switch over to glibc.
> 
> 
> @Mattias Rönnblom This comment is not addressed in v7. Could you check?

Calling rte_malloc() and depending on it returning NULL if it's too 
early in the initialization process sounds a little fragile, but maybe 
it's fine.
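
For reference, a minimal sketch of such a hybrid allocation scheme
(assuming rte_malloc() really does return NULL before the EAL memory
subsystem is up; Windows would need _aligned_malloc() instead of
aligned_alloc(), as in the patch):

#include <stdlib.h>

#include <rte_malloc.h>

static void *
lcore_buffer_alloc(size_t size, size_t align)
{
	/* assumed to return NULL this early in initialization */
	void *buf = rte_zmalloc(NULL, size, align);

	if (buf == NULL)
		/* aligned_alloc() requires size to be a multiple of align */
		buf = aligned_alloc(align, size);

	return buf;
}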

One issue with lcore-variables-in-huge-pages, which I've failed to
mention this time around, is that it would increase memory usage by
something like RTE_MAX_LCORE * 0.5 MB (or, more probably, a little
more).

In the huge pages case, you can't rely on demand paging to avoid 
bringing in unused pages.
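
For scale: with the default RTE_MAX_LCORE_VAR of 1048576 bytes, each
lcore buffer is RTE_MAX_LCORE * 1 MiB, i.e., 128 MiB if RTE_MAX_LCORE
is 128. With 4 kB pages and demand paging, only the pages actually
touched are backed by physical memory, while a 2 MB anonymous hugepage
is faulted in as a whole.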

That said, I suspect some very latency-sensitive apps lock all pages in 
memory, and thus lose out on this OS feature.

I suggest we just leave the first incarnation of lcore variables in 
normal pages.

Thanks for the reminder.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v7 3/7] eal: add lcore variable performance test
  2024-09-18  8:26                                                 ` [PATCH v7 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-09 20:46                                                   ` Morten Brørup
  2024-10-10 14:17                                                     ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-10-09 20:46 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, 18 September 2024 10.26

A few corrections to a comment. Besides that,

Acked-by: Morten Brørup <mb@smartsharesystems.com>


> +/*
> + * The potential performance benefit of lcore variables compared to
> + * the use of statically sized, lcore id-indexed arrays are not

are not -> is not

> + * shorter latencies in a scenario with low cache pressure, but rather
> + * fewer cache misses in a real-world scenario, with extensive cache
> + * usage. These tests are a crude simulation of such, using <N> dummy
> + * modules, each wiht a small, per-lcore state. Note however that

wiht -> with

> + * these tests has very little non-lcore/thread local state, which is

has -> have

> + * unrealistic.
> + */
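
To make the simulation concrete, each such dummy module boils down to
roughly the following sketch (illustrative only; the actual
test_lcore_var_perf.c code may differ):

#include <stdint.h>

#include <rte_lcore_var.h>

struct dummy_module_state {
	uint64_t update_count;
};

static RTE_LCORE_VAR_HANDLE(struct dummy_module_state, dummy_state);

RTE_LCORE_VAR_INIT(dummy_state);

/* called repeatedly from the benchmarked lcore */
static void
dummy_module_update(void)
{
	struct dummy_module_state *state = RTE_LCORE_VAR_VALUE(dummy_state);

	state->update_count++;
}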


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-09-18  8:26                                                 ` [PATCH v7 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-18  9:23                                                   ` Konstantin Ananyev
@ 2024-10-09 22:15                                                   ` Morten Brørup
  2024-10-10 10:40                                                     ` Mattias Rönnblom
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
  2 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-10-09 22:15 UTC (permalink / raw)
  To: Mattias Rönnblom, dev, Tyler Retzlaff
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, 18 September 2024 10.26
> 
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small, frequently-accessed data structures, for which one instance
> should exist for each lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decouple the values' lifetime from that of the
> threads.
> 
> Lcore variables are also similar, in terms of functionality, to the
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its otherwise seemingly viable approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as an
> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed structs.
> The benefit of lcore variables over this approach is that data related
> to the same lcore is now close (spatially, in memory), rather than
> data used by the same module, which in turn avoids excessive use of
> padding, polluting caches with unused data.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>

> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_var.c
> @@ -0,0 +1,79 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +#include <stdlib.h>
> +
> +#ifdef RTE_EXEC_ENV_WINDOWS
> +#include <malloc.h>
> +#endif

From what I can read on the internet, max_align_t is missing in stddef.h in MSVC [1], so try adding this to fix the Windows CI compilation failure:

#ifdef RTE_TOOLCHAIN_MSVC
#include <cstddef>
#endif

[1]: https://learn.microsoft.com/en-ie/answers/questions/1726147/why-max-align-t-not-defined-in-stddef-h-in-windows


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v7 0/7] Lcore variables
  2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
                                                                   ` (7 preceding siblings ...)
  2024-09-18  9:30                                                 ` [PATCH v7 0/7] Lcore variables fengchengwen
@ 2024-10-10  5:06                                                 ` Stephen Hemminger
  8 siblings, 0 replies; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-10  5:06 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Konstantin Ananyev,
	David Marchand, Jerin Jacob

On Wed, 18 Sep 2024 10:26:07 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> This patch set introduces a new API <rte_lcore_var.h> for static
> per-lcore id data allocation.
> 
> Please refer to the <rte_lcore_var.h> API documentation for both a
> rationale for this new API, and a comparison to the alternatives
> available.
> 
> The adoption of this API would affect many different DPDK modules, but
> the author updated only a few, mostly to serve as examples in this
> RFC, and to iron out some, but surely not all, wrinkles in the API.
> 
> The question of how best to allocate static per-lcore memory has come
> up several times on the dev mailing list, for example in the thread on
> the "random: use per lcore state" RFC by Stephen Hemminger.
> 
> Lcore variables are surely not the answer to all your per-lcore-data
> needs, since they only allow for more-or-less static allocation. In
> the author's opinion, they do however provide a reasonably simple,
> clean, and seemingly very performant solution to a real problem.
> 
> Mattias Rönnblom (7):
>   eal: add static per-lcore memory allocation facility
>   eal: add lcore variable functional tests
>   eal: add lcore variable performance test
>   random: keep PRNG state in lcore variable
>   power: keep per-lcore state in lcore variable
>   service: keep per-lcore state in lcore variable
>   eal: keep per-lcore power intrinsics state in lcore variable
> 
>  MAINTAINERS                                   |   6 +
>  app/test/meson.build                          |   2 +
>  app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
>  app/test/test_lcore_var_perf.c                | 257 +++++++++++
>  config/rte_config.h                           |   1 +
>  doc/api/doxy-api-index.md                     |   1 +
>  .../prog_guide/env_abstraction_layer.rst      |  45 +-
>  doc/guides/rel_notes/release_24_11.rst        |  14 +
>  lib/eal/common/eal_common_lcore_var.c         |  79 ++++
>  lib/eal/common/meson.build                    |   1 +
>  lib/eal/common/rte_random.c                   |  28 +-
>  lib/eal/common/rte_service.c                  | 117 ++---
>  lib/eal/include/meson.build                   |   1 +
>  lib/eal/include/rte_lcore_var.h               | 390 ++++++++++++++++
>  lib/eal/version.map                           |   2 +
>  lib/eal/x86/rte_power_intrinsics.c            |  17 +-
>  lib/power/rte_power_pmd_mgmt.c                |  35 +-
>  17 files changed, 1339 insertions(+), 93 deletions(-)
>  create mode 100644 app/test/test_lcore_var.c
>  create mode 100644 app/test/test_lcore_var_perf.c
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h

Looks good, thanks for taking this on. It will help with scaling for
both small systems and mega-core beasts.

It would help if you could rebase/resubmit; there were some spelling
errors and typos that should be fixed before merging.


Series-Acked-by: Stephen Hemminger <stephen@networkplumber.org>

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-10-09 22:15                                                   ` Morten Brørup
@ 2024-10-10 10:40                                                     ` Mattias Rönnblom
  2024-10-10 11:47                                                       ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 10:40 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev, Tyler Retzlaff
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

On 2024-10-10 00:15, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 18 September 2024 10.26
>>
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small, frequently-accessed data structures, for which one instance
>> should exist for each lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decouple the values' lifetime from that of the
>> threads.
>>
>> Lcore variables are also similar, in terms of functionality, to the
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its otherwise seemingly viable approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as an
>> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
>> structs. The benefit of lcore variables over this approach is that
>> data related to the same lcore is now close (spatially, in memory),
>> rather than data used by the same module, which in turn avoids
>> excessive use of padding, polluting caches with unused data.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_var.c
>> @@ -0,0 +1,79 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2024 Ericsson AB
>> + */
>> +
>> +#include <inttypes.h>
>> +#include <stdlib.h>
>> +
>> +#ifdef RTE_EXEC_ENV_WINDOWS
>> +#include <malloc.h>
>> +#endif
> 
>  From what I can read on the internet, max_align_t is missing in stddef.h in MSVC [1], so try adding this to fix the Windows CI compilation failure:
> 
> #ifdef RTE_TOOLCHAIN_MSVC
> #include <cstddef>
> #endif

Please excuse my MSVC ignorance, but will this work in C? Looks like C++.

> 
> [1]: https://learn.microsoft.com/en-ie/answers/questions/1726147/why-max-align-t-not-defined-in-stddef-h-in-windows
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 10:40                                                     ` Mattias Rönnblom
@ 2024-10-10 11:47                                                       ` Morten Brørup
  2024-10-10 13:12                                                         ` Morten Brørup
  2024-10-10 13:40                                                         ` Mattias Rönnblom
  0 siblings, 2 replies; 313+ messages in thread
From: Morten Brørup @ 2024-10-10 11:47 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev, Tyler Retzlaff
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Thursday, 10 October 2024 12.40
> 
> On 2024-10-10 00:15, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Wednesday, 18 September 2024 10.26
> >>
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> >>
> >> The primary <rte_lcore_var.h> use case is for statically allocating
> >> small, frequently-accessed data structures, for which one instance
> >> should exist for each lcore.
> >>
> >> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> >> _Thread_local), but decouple the values' lifetime from that of the
> >> threads.
> >>
> >> Lcore variables are also similar, in terms of functionality, to the
> >> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >> build-time machinery. DPCPU uses linker scripts, which effectively
> >> prevents the reuse of its otherwise seemingly viable approach.
> >>
> >> The currently-prevailing way to solve the same problem as lcore
> >> variables is to keep a module's per-lcore data as an
> >> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
> >> structs. The benefit of lcore variables over this approach is that
> >> data related to the same lcore is now close (spatially, in memory),
> >> rather than data used by the same module, which in turn avoids
> >> excessive use of padding, polluting caches with unused data.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >
> >> --- /dev/null
> >> +++ b/lib/eal/common/eal_common_lcore_var.c
> >> @@ -0,0 +1,79 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2024 Ericsson AB
> >> + */
> >> +
> >> +#include <inttypes.h>
> >> +#include <stdlib.h>
> >> +
> >> +#ifdef RTE_EXEC_ENV_WINDOWS
> >> +#include <malloc.h>
> >> +#endif
> >
> >  From what I can read on the internet, max_align_t is missing in
> > stddef.h in MSVC [1], so try adding this to fix the Windows CI
> > compilation failure:
> >
> > #ifdef RTE_TOOLCHAIN_MSVC
> > #include <cstddef>
> > #endif
> 
> Please excuse my MSVC ignorance, but will this work in C? Looks like
> C++.

I have no clue. Just parroting what Microsoft says on the internet.

You can try it out and see if the CI accepts it.

> 
> >
> > [1]: https://learn.microsoft.com/en-ie/answers/questions/1726147/why-max-align-t-not-defined-in-stddef-h-in-windows
> >

I would like to see this series go into 24.11, and then it needs to work for MSVC too.

@Tyler, any better suggestions for fixing the missing max_align_t in stddef.h?


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 11:47                                                       ` Morten Brørup
@ 2024-10-10 13:12                                                         ` Morten Brørup
  2024-10-10 13:42                                                           ` Mattias Rönnblom
  2024-10-10 13:40                                                         ` Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-10-10 13:12 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev, Tyler Retzlaff
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, 10 October 2024 13.48
> 
> > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > Sent: Thursday, 10 October 2024 12.40
> >
> > On 2024-10-10 00:15, Morten Brørup wrote:
> > >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > >> Sent: Wednesday, 18 September 2024 10.26
> > >>


> > >  From what I can read on the internet, max_align_t is missing in
> > > stddef.h in MSVC [1], so try adding this to fix the Windows CI
> > > compilation failure:
> > >
> > > #ifdef RTE_TOOLCHAIN_MSVC
> > > #include <cstddef>
> > > #endif
> >
> > Please excuse my MSVC ignorance, but will this work in C? Looks like
> > C++.
> 
> I have no clue. Just parroting what Microsoft says on the internet.
> 
> You can try it out and see if the CI accepts it.

Alternative hack...
Add typedef, based on MS source code [2]:

#ifdef RTE_TOOLCHAIN_MSVC
typedef double max_align_t;
#endif

[2]: https://github.com/microsoft/STL/blob/main/stl/inc/cstddef#L30

> 
> >
> > >
> > > [1]: https://learn.microsoft.com/en-ie/answers/questions/1726147/why-max-align-t-not-defined-in-stddef-h-in-windows
> > >
> 
> I would like to see this series go into 24.11, and then it needs to
> work for MSVC too.
> 
> @Tyler, any better suggestions for fixing the missing max_align_t in
> stddef.h?


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 11:47                                                       ` Morten Brørup
  2024-10-10 13:12                                                         ` Morten Brørup
@ 2024-10-10 13:40                                                         ` Mattias Rönnblom
  2024-10-10 13:45                                                           ` Morten Brørup
  1 sibling, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 13:40 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev, Tyler Retzlaff
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

On 2024-10-10 13:47, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Thursday, 10 October 2024 12.40
>>
>> On 2024-10-10 00:15, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>> Sent: Wednesday, 18 September 2024 10.26
>>>>
>>>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>>>
>>>> An lcore variable has one value for every current and future lcore
>>>> id-equipped thread.
>>>>
>>>> The primary <rte_lcore_var.h> use case is for statically allocating
>>>> small, frequently-accessed data structures, for which one instance
>>>> should exist for each lcore.
>>>>
>>>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>>>> _Thread_local), but decouple the values' lifetime from that of the
>>>> threads.
>>>>
>>>> Lcore variables are also similar, in terms of functionality, to the
>>>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>>>> build-time machinery. DPCPU uses linker scripts, which effectively
>>>> prevents the reuse of its otherwise seemingly viable approach.
>>>>
>>>> The currently-prevailing way to solve the same problem as lcore
>>>> variables is to keep a module's per-lcore data as an
>>>> RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
>>>> structs. The benefit of lcore variables over this approach is that
>>>> data related to the same lcore is now close (spatially, in memory),
>>>> rather than data used by the same module, which in turn avoids
>>>> excessive use of padding, polluting caches with unused data.
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>>
>>>> --- /dev/null
>>>> +++ b/lib/eal/common/eal_common_lcore_var.c
>>>> @@ -0,0 +1,79 @@
>>>> +/* SPDX-License-Identifier: BSD-3-Clause
>>>> + * Copyright(c) 2024 Ericsson AB
>>>> + */
>>>> +
>>>> +#include <inttypes.h>
>>>> +#include <stdlib.h>
>>>> +
>>>> +#ifdef RTE_EXEC_ENV_WINDOWS
>>>> +#include <malloc.h>
>>>> +#endif
>>>
>>>   From what I can read on the internet, max_align_t is missing in
>>> stddef.h in MSVC [1], so try adding this to fix the Windows CI
>>> compilation failure:
>>>
>>> #ifdef RTE_TOOLCHAIN_MSVC
>>> #include <cstddef>
>>> #endif
>>
>> Please excuse my MSVC ignorance, but will this work in C? Looks like
>> C++.
> 
> I have no clue. Just parroting what Microsoft says on the internet.
> 
> You can try it out and see if the CI accepts it.
> 

It wouldn't make sense if that worked, so I'll go for this instead:

#ifdef RTE_TOOLCHAIN_MSVC
		/* MSVC <stddef.h> is missing the max_align_t typedef */
		align = alignof(double);
#else
		align = alignof(max_align_t);
#endif

Thanks for pointing out this issue.

>>
>>>
>>> [1]: https://learn.microsoft.com/en-ie/answers/questions/1726147/why-max-align-t-not-defined-in-stddef-h-in-windows
>>>
> 
> I would like to see this series go into 24.11, and then it needs to work for MSVC too.
> 
> @Tyler, any better suggestions for fixing the missing max_align_t in stddef.h?
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 13:12                                                         ` Morten Brørup
@ 2024-10-10 13:42                                                           ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 13:42 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev, Tyler Retzlaff
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

On 2024-10-10 15:12, Morten Brørup wrote:
>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>> Sent: Thursday, 10 October 2024 13.48
>>
>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>> Sent: Thursday, 10 October 2024 12.40
>>>
>>> On 2024-10-10 00:15, Morten Brørup wrote:
>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>>> Sent: Wednesday, 18 September 2024 10.26
>>>>>
> 
> 
>>>>   From what I can read on the internet, max_align_t is missing in
>>>> stddef.h in MSVC [1], so try adding this to fix the Windows CI
>>>> compilation failure:
>>>>
>>>> #ifdef RTE_TOOLCHAIN_MSVC
>>>> #include <cstddef>
>>>> #endif
>>>
>>> Please excuse my MSVC ignorance, but will this work in C? Looks like
>>> C++.
>>
>> I have no clue. Just parroting what Microsoft says on the internet.
>>
>> You can try it out and see if the CI accepts it.
> 
> Alternative hack...
> Add typedef, based on MS source code [2]:
> 
> #ifdef RTE_TOOLCHAIN_MSVC
> typedef double max_align_t;
> #endif
> 
> [2]: https://github.com/microsoft/STL/blob/main/stl/inc/cstddef#L30
> 

That will break the day Microsoft fixes this bug.

>>
>>>
>>>>
>>>> [1]: https://learn.microsoft.com/en-ie/answers/questions/1726147/why-max-align-t-not-defined-in-stddef-h-in-windows
>>>>
>>
>> I would like to see this series go into 24.11, and then it needs to
>> work for MSVC too.
>>
>> @Tyler, any better suggestions for fixing the missing max_align_t in
>> stddef.h?
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 13:40                                                         ` Mattias Rönnblom
@ 2024-10-10 13:45                                                           ` Morten Brørup
  2024-10-10 16:21                                                             ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-10-10 13:45 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev, Tyler Retzlaff
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Thursday, 10 October 2024 15.40
> 
> On 2024-10-10 13:47, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Thursday, 10 October 2024 12.40
> >>
> >> On 2024-10-10 00:15, Morten Brørup wrote:

> >>>   From what I can read on the internet, max_align_t is missing in
> >>> stddef.h in MSVC [1], so try adding this to fix the Windows CI
> >>> compilation failure:
> >>>
> >>> #ifdef RTE_TOOLCHAIN_MSVC
> >>> #include <cstddef>
> >>> #endif
> >>
> >> Please excuse my MSVC ignorance, but will this work in C? Looks like
> >> C++.
> >
> > I have no clue. Just parroting what Microsoft says on the internet.
> >
> > You can try it out and see if the CI accepts it.
> >
> 
> It wouldn't make sense if that worked, so I'll go for this instead:
> 
> #ifdef RTE_TOOLCHAIN_MSVC
> 		/* MSVC <stddef.h> is missing the max_align_t typedef */

Maybe also add some reference to why "double" is the max aligned type in MSVC.

> 		align = alignof(double);
> #else
> 		align = alignof(max_align_t);
> #endif

Even better!

It's a good workaround, and not an ugly hack like the ones suggested by me.

> 
> Thanks for pointing out this issue.

:-)


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v8 0/7] Lcore variables
  2024-09-18  8:26                                                 ` [PATCH v7 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-09-18  9:23                                                   ` Konstantin Ananyev
  2024-10-09 22:15                                                   ` Morten Brørup
@ 2024-10-10 14:13                                                   ` Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                       ` (7 more replies)
  2 siblings, 8 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:13 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how best to allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In
the author's opinion, they do however provide a reasonably simple,
clean, and seemingly very performant solution to a real problem.
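
As a quick taste of the API, a minimal sketch based on the macros in
patch 1/7 (the module and field names are made up):

#include <stdint.h>

#include <rte_lcore.h>
#include <rte_lcore_var.h>

struct mod_stats {
	uint64_t pkts;
};

static RTE_LCORE_VAR_HANDLE(struct mod_stats, mod_stats_var);

RTE_LCORE_VAR_INIT(mod_stats_var);

/* fast path; called by the owning lcore only */
static inline void
mod_stats_inc(void)
{
	RTE_LCORE_VAR_VALUE(mod_stats_var)->pkts++;
}

/* control path; may be called from any thread (see the header's
 * notes on synchronizing accesses to other lcores' values)
 */
static uint64_t
mod_stats_total(void)
{
	unsigned int lcore_id;
	struct mod_stats *stats;
	uint64_t total = 0;

	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, stats, mod_stats_var)
		total += stats->pkts;

	return total;
}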

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 257 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  84 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 117 ++---
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 390 ++++++++++++++++
 lib/eal/version.map                           |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 17 files changed, 1344 insertions(+), 93 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v8 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
@ 2024-10-10 14:13                                                     ` Mattias Rönnblom
  2024-10-10 14:21                                                       ` [PATCH v9 0/7] Lcore variables Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                                       ` (6 subsequent siblings)
  7 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:13 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar, in terms of functionality, to the
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed structs.
The benefit of lcore variables over this approach is that data related
to the same lcore is now close (spatially, in memory), rather than
data used by the same module, which in turn avoids excessive use of
padding, polluting caches with unused data.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v8:
 * Work around missing max_align_t definition in MSVC. (Morten Brørup)

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow an arbitrary
   expression, not just a simple variable name, to be passed.
   (Konstantin Ananyev)

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is that there no longer exists a fixed
   upper bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance that the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  84 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 390 ++++++++++++++++++
 lib/eal/version.map                           |   2 +
 10 files changed, 539 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 812463fe9f..61e5907fb5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index fd6f8a2f1a..498d509244 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b9fac1839d..57c0061c65 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -429,12 +429,45 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, blocks allocated on the DPDK heap, and
+other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library, or a PMD may opt to keep per-thread
+state.
+
+Per-thread data may be maintained using either *lcore variables*
+(``rte_lcore_var.h``), *thread-local storage (TLS)*
+(``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE``
+elements, indexed by ``rte_lcore_id()``. These methods allow for
+per-lcore data to be a largely module-internal affair, not
+directly visible in the module's API. Another possibility is to deal
+explicitly with per-thread aspects in the API (e.g., the ports of the
+Eventdev API).
+
+Lcore variables are suitable for small objects statically allocated
+at the time of module or application initialization. An lcore variable
+takes on one value for each lcore id-equipped thread (i.e., for EAL
+threads and registered non-EAL threads, in total ``RTE_MAX_LCORE``
+instances). The lifetime of lcore variables is detached from that of
+the owning threads, and they may thus be initialized prior to the
+owners having been created.
+
+Variables with thread-local storage are allocated at the time of
+thread creation, and exist until the thread terminates, for every
+thread in the process. Only very small objects should be allocated in
+TLS, since large TLS objects significantly slow down thread creation
+and may needlessly increase the memory footprint of applications that
+make extensive use of unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore id-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must be both cache-aligned and include a
+``RTE_CACHE_GUARD``. Such extensive use of padding causes internal
+fragmentation (i.e., unused space) and lowers cache hit rates.
+
+For more discussion of per-lcore state, see the ``rte_lcore_var.h``
+API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 507807a2b1..ec2bd39521 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -82,6 +82,20 @@ New Features
 
   The new statistics are useful for debugging and profiling.
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..7f437934df
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+#ifdef RTE_EXEC_ENV_WINDOWS
+		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
+					       RTE_CACHE_LINE_SIZE);
+#else
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+#endif
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer aligned on the cache
+	 * line size, assures that aligned offsets also translate to
+	 * aligned pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+#ifdef RTE_TOOLCHAIN_MSVC
+		/* MSVC <stddef.h> is missing the max_align_t typedef */
+		align = alignof(double);
+#else
+		align = alignof(max_align_t);
+#endif
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..894100d1e4
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,390 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * instance for each current and future lcore id-equipped thread, with
+ * a total of RTE_MAX_LCORE instances. The value of an lcore variable
+ * for a particular lcore id is independent from other values (for
+ * other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for an @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races
+ * between the owning thread and any non-owner threads accessing the
+ * same lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never allocates.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variables are stored in a series of lcore buffers, which
+ * are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, the thread most recently accessing nearby
+ * data structures should almost always be the lcore variable's owner.
+ * Adding padding will increase the effective memory working set size,
+ * potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to an
+ * even number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions; for example, next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of the particular thread. The data cannot
+ *     be accessed before the thread has been created, nor after it
+ *     has exited. As a result, thread-local variables must be
+ *     initialized in a "lazy" manner (e.g., at the point of thread
+ *     creation). Lcore variables may be accessed immediately after
+ *     having been allocated (which may be before any thread beyond
+ *     the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to, and successfully dereferenced by, a non-owning
+ *     thread) depends on the details of the TLS implementation. With
+ *     GCC __thread and GCC _Thread_local, such data sharing is
+ *     supported. In the C11 standard, the result of accessing another
+ *     thread's _Thread_local object is implementation-defined. Lcore
+ *     variable instances may be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to each
+ *   valid lcore id, from 0 up to (but not including) @c RTE_MAX_LCORE.
+ * @param value
+ *   A pointer variable successively set to point to the lcore
+ *   variable value instance of the lcore id currently being processed.
+ * @param handle
+ *   The lcore variable handle.
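+ *
+ * Below is a brief iteration sketch, summing up a hypothetical
+ * per-lcore <code>uint64_t</code> counter accessed through a
+ * previously allocated <code>count_handle</code> lcore variable
+ * handle:
+ *
+ * @code{.c}
+ * unsigned int lcore_id;
+ * uint64_t *count;
+ * uint64_t total = 0;
+ *
+ * RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, count, count_handle)
+ *         total += *count;
+ * @endcode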
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
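+ *
+ * For example, allocating cache line-aligned values for a
+ * hypothetical <code>struct foo_state</code> (normally done via the
+ * @ref RTE_LCORE_VAR_ALLOC_SIZE_ALIGN wrapper) might look like:
+ *
+ * @code{.c}
+ * foo_handle = rte_lcore_var_alloc(sizeof(struct foo_state),
+ *                                  RTE_CACHE_LINE_SIZE);
+ * @endcode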
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..0c80bf7331 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,8 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v8 2/7] eal: add lcore variable functional tests
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-10 14:13                                                     ` Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                                       ` (5 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:13 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 436 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 437 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..2a1f258548
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v8 3/7] eal: add lcore variable performance test
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-10-10 14:13                                                     ` Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                       ` (4 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:13 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add basic micro benchmark for lcore variables, in an attempt to verify
that the overhead isn't significantly greater than that of alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v8:
 * Fix spelling. (Morten Brørup)

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 257 +++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..2efb8342d1
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =
+		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each with a small, per-lcore state. Note however that
+ * these tests have very little non-lcore/thread local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v8 4/7] random: keep PRNG state in lcore variable
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
                                                                       ` (2 preceding siblings ...)
  2024-10-10 14:13                                                     ` [PATCH v8 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-10 14:13                                                     ` Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 5/7] power: keep per-lcore " Mattias Rönnblom
                                                                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:13 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v8 5/7] power: keep per-lcore state in lcore variable
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
                                                                       ` (3 preceding siblings ...)
  2024-10-10 14:13                                                     ` [PATCH v8 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-10-10 14:13                                                     ` Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 6/7] service: " Mattias Rönnblom
                                                                       ` (2 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:13 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a981db4b39 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v8 6/7] service: keep per-lcore state in lcore variable
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
                                                                       ` (4 preceding siblings ...)
  2024-10-10 14:13                                                     ` [PATCH v8 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-10-10 14:13                                                     ` Mattias Rönnblom
  2024-10-10 14:13                                                     ` [PATCH v8 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  2024-10-11 14:23                                                     ` [PATCH v8 0/7] Lcore variables Stephen Hemminger
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:13 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 117 +++++++++++++++++++----------------
 1 file changed, 65 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index a38c594ce4..3d2c12c39b 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -77,7 +78,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -103,12 +104,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -124,7 +121,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -138,7 +134,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -288,7 +283,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -296,9 +290,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -467,7 +463,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -477,7 +476,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -499,8 +498,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -546,13 +544,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -560,9 +560,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -579,7 +582,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -595,7 +599,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -647,30 +651,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -698,13 +703,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -715,14 +721,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -738,17 +746,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -760,7 +770,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -784,7 +794,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -814,6 +824,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -821,12 +833,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -847,7 +858,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -858,7 +869,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -866,7 +877,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -874,7 +885,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -901,7 +912,7 @@ lcore_attr_get_service_error_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -917,7 +928,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -979,12 +993,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1009,7 +1022,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -1020,12 +1034,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1060,7 +1073,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v8 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
                                                                       ` (5 preceding siblings ...)
  2024-10-10 14:13                                                     ` [PATCH v8 6/7] service: " Mattias Rönnblom
@ 2024-10-10 14:13                                                     ` Mattias Rönnblom
  2024-10-11 14:23                                                     ` [PATCH v8 0/7] Lcore variables Stephen Hemminger
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:13 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Keep per-lcore power intrinsics state in an lcore variable to reduce
the cache working set size and avoid any CPU next-line-prefetching
causing false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v7 3/7] eal: add lcore variable performance test
  2024-10-09 20:46                                                   ` Morten Brørup
@ 2024-10-10 14:17                                                     ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:17 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

On 2024-10-09 22:46, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 18 September 2024 10.26
> 
> A few corrections to a comment. Besides that,
> 
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 
> 
>> +/*
>> + * The potential performance benefit of lcore variables compared to
>> + * the use of statically sized, lcore id-indexed arrays are not
> 
> are not -> is not
> 
>> + * shorter latencies in a scenario with low cache pressure, but rather
>> + * fewer cache misses in a real-world scenario, with extensive cache
>> + * usage. These tests are a crude simulation of such, using <N> dummy
>> + * modules, each wiht a small, per-lcore state. Note however that
> 
> wiht -> with
> 
>> + * these tests has very little non-lcore/thread local state, which is
> 
> has -> have
> 
>> + * unrealistic.
>> + */
> 

All fixed. Thanks.



^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v9 0/7] Lcore variables
  2024-10-10 14:13                                                     ` [PATCH v8 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-10 14:21                                                       ` Mattias Rönnblom
  2024-10-10 14:21                                                         ` [PATCH v9 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                           ` (6 more replies)
  0 siblings, 7 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:21 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
patch set, and to iron out some, but surely not all, wrinkles in the API.

The question of how to best allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean,
and seemingly very performant solution to a real problem.
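
As a taste of the API's shape, a minimal sketch (using the macros
introduced in patch 1/7 below; the module name and state struct are
made up for illustration):

  #include <rte_lcore_var.h>

  struct foo_lcore_state {
          uint64_t count;
  };

  static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, foo_states);

  RTE_INIT(foo_init)
  {
          /* values for all lcore ids are allocated and zeroed */
          RTE_LCORE_VAR_ALLOC(foo_states);
  }

  /* called from an EAL thread or a registered non-EAL thread */
  static void
  foo_count(void)
  {
          RTE_LCORE_VAR_VALUE(foo_states)->count++;
  }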

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 257 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  84 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 117 ++---
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 390 ++++++++++++++++
 lib/eal/version.map                           |   2 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 17 files changed, 1344 insertions(+), 93 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 14:21                                                       ` [PATCH v9 0/7] Lcore variables Mattias Rönnblom
@ 2024-10-10 14:21                                                         ` Mattias Rönnblom
  2024-10-10 15:54                                                           ` Stephen Hemminger
                                                                             ` (2 more replies)
  2024-10-10 14:22                                                         ` [PATCH v9 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                                           ` (5 subsequent siblings)
  6 siblings, 3 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:21 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU uses linker scripts, which effectively prevents the
reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed structs.
The benefit of lcore variables over this approach is that data related
to the same lcore is now kept close (spatially, in memory), rather
than data used by the same module. This in turn avoids excessive use
of padding, which pollutes caches with unused data.
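
For illustration, a minimal sketch of the two layouts (the struct and
its fields are made up):

  /* Prevailing pattern: all lcores' state for one module is
   * adjacent in memory, so each element must be padded to avoid
   * false sharing.
   */
  struct __rte_cache_aligned foo_lcore_state {
          int a;
          long b;
          RTE_CACHE_GUARD;
  };
  static struct foo_lcore_state foo_states[RTE_MAX_LCORE];

  /* Lcore variable: no padding needed, since values belonging to
   * different lcore ids end up far apart in the per-lcore buffers.
   */
  struct foo_lcore_state_lv {
          int a;
          long b;
  };
  static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state_lv, foo_states_lv);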

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v9:
 * Fixed merge conflicts in release notes.

PATCH v8:
 * Work around missing max_align_t definition in MSVC. (Morten Brørup)

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow an
   arbitrary expression, not just a simple variable name, to be passed.
   (Konstantin Ananyev)

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  45 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  84 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 390 ++++++++++++++++++
 lib/eal/version.map                           |   2 +
 10 files changed, 539 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 812463fe9f..61e5907fb5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index fd6f8a2f1a..498d509244 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b9fac1839d..57c0061c65 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -429,12 +429,45 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, blocks allocated on the DPDK heap, and
+other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library or PMD may opt to keep per-thread
+state.
+
+Per-thread data may be maintained using either *lcore variables*
+(``rte_lcore_var.h``), *thread-local storage (TLS)*
+(``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE``
+elements, indexed by ``rte_lcore_id()``. These methods allow for
+per-lcore data to be a largely module-internal affair, not directly
+visible in the module's API. Another possibility is to deal
+explicitly with per-thread aspects in the API (e.g., the ports of the
+Eventdev API).
+
+Lcore variables are suitable for small objects statically allocated
+at the time of module or application initialization. An lcore
+variable takes on one value for each lcore id-equipped thread (i.e.,
+for EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of an lcore variable is
+detached from that of the owning thread, and it may thus be
+initialized prior to the owner having been created.
+
+Variables with thread-local storage are allocated at the time of
+thread creation, and exist until the thread terminates, for every
+thread in the process. Only very small objects should be allocated in
+TLS, since large TLS objects significantly slow down thread creation
+and may needlessly increase the memory footprint of applications that
+make extensive use of unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore id-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must both be cache-aligned and include a
+``RTE_CACHE_GUARD``. Such extensive use of padding causes internal
+fragmentation (i.e., unused space) and lowers cache hit rates.
+
+For more discussions on per-lcore state, see the ``rte_lcore_var.h``
+API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 915065a6f9..0e15767d41 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -113,6 +113,20 @@ New Features
 
   * Added independent enqueue feature.
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..7f437934df
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+#include <string.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+#ifdef RTE_EXEC_ENV_WINDOWS
+		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
+					       RTE_CACHE_LINE_SIZE);
+#else
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+#endif
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer aligned on cache line
+	 * size, assures that aligned offsets also translate to
+	 * aligned pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+#ifdef RTE_TOOLCHAIN_MSVC
+		/* MSVC <stddef.h> is missing the max_align_t typedef */
+		align = alignof(double);
+#else
+		align = alignof(max_align_t);
+#endif
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..894100d1e4
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,390 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * RTE Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) has one value
+ * for each EAL thread and registered non-EAL thread. There is one
+ * instance for each current and future lcore id-equipped thread, with
+ * a total of RTE_MAX_LCORE instances. The value of an lcore variable
+ * for a particular lcore id is independent from other values (for
+ * other lcore ids) within the same lcore variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for an @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle
+ * never has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs the time of
+ *     module initialization, but may be done at any time.
+ *
+ * An lcore variable is not tied to the owning thread's lifetime. It's
+ * available for use by any thread immediately after having been
+ * allocated, and continues to be available throughout the lifetime of
+ * the EAL.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable but owned by two different lcore
+ * ids may be frequently read or written by the owners without risking
+ * false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to assure there are no data races between
+ * the owning thread and any non-owner threads accessing the same
+ * lcore variable instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may choose to define an lcore variable handle, which
+ * it then never allocates.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variable values are stored in a series of lcore buffers, which
+ * are allocated from the libc heap. Heap allocation failures are
+ * treated as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, the thread most recently accessing nearby
+ * data structures should almost always be the lcore variable's
+ * owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values take on an initial value of zero.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to an
+ * even number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (features which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's.
+ *
+ * Lcore variables have the upside of working with, not against, the
+ * CPU's assumptions: for example, next-line prefetchers may well
+ * work the way their designers intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
+ * variables are:
+ *
+ *   * The existence and non-existence of a thread-local variable
+ *     instance follow that of the particular thread. The data cannot be
+ *     accessed before the thread has been created, nor after it has
+ *     exited. As a result, thread-local variables must be initialized in
+ *     a "lazy" manner (e.g., at the point of thread creation). Lcore
+ *     variables may be accessed immediately after having been
+ *     allocated (which may be prior to any thread beyond the main
+ *     thread running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable can be passed
+ *     to and successfully dereferenced by a non-owning thread) depends on
+ *     the details of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, such data sharing is supported. In the C11
+ *     standard, the result of accessing another thread's
+ *     _Thread_local object is implementation-defined. Lcore variable
+ *     instances may be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * The aim with this macro is to make clear at the point of
+ * declaration that this is an lcore handle, rather than a regular
+ * pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to every
+ *   valid lcore id (up to @c RTE_MAX_LCORE).
+ * @param value
+ *   A pointer variable successively set to point to the lcore
+ *   variable value instance of the lcore id currently being processed.
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal to
+ *   or less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..0c80bf7331 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,8 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v9 2/7] eal: add lcore variable functional tests
  2024-10-10 14:21                                                       ` [PATCH v9 0/7] Lcore variables Mattias Rönnblom
  2024-10-10 14:21                                                         ` [PATCH v9 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-10 14:22                                                         ` Mattias Rönnblom
  2024-10-10 14:22                                                         ` [PATCH v9 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                                           ` (4 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:22 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 436 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 437 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..2a1f258548
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v9 3/7] eal: add lcore variable performance test
  2024-10-10 14:21                                                       ` [PATCH v9 0/7] Lcore variables Mattias Rönnblom
  2024-10-10 14:21                                                         ` [PATCH v9 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-10 14:22                                                         ` [PATCH v9 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-10-10 14:22                                                         ` Mattias Rönnblom
  2024-10-10 14:22                                                         ` [PATCH v9 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                           ` (3 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:22 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).
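
In simplified form, the core of the measurement loop (taken from
benchmark_access() in the test below; mods[] is a shuffled array of
module indices, and num_mods is a power of two):

  start = rte_rdtsc();

  for (i = 0; i < ITERATIONS; i++)
          update_fun(mods[i & num_mods_mask]);

  end = rte_rdtsc();

  /* average per-update latency, in TSC cycles */
  latency = (end - start) / (double)ITERATIONS;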

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v8:
 * Fix spelling. (Morten Brørup)

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 257 +++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..2efb8342d1
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =
+		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each with a small, per-lcore state. Note however that
+ * these tests have very little non-lcore/thread local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v9 4/7] random: keep PRNG state in lcore variable
  2024-10-10 14:21                                                       ` [PATCH v9 0/7] Lcore variables Mattias Rönnblom
                                                                           ` (2 preceding siblings ...)
  2024-10-10 14:22                                                         ` [PATCH v9 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-10 14:22                                                         ` Mattias Rönnblom
  2024-10-10 14:22                                                         ` [PATCH v9 5/7] power: keep per-lcore " Mattias Rönnblom
                                                                           ` (2 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:22 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.
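
In sketch form, the per-thread access changes from (simplified from
the diff below):

  struct rte_rand_state *s = &rand_states[rte_lcore_id()];

to:

  struct rte_rand_state *s = RTE_LCORE_VAR_VALUE(rand_state);

with unregistered non-EAL threads now served by a separate, dedicated
state instance instead of a reserved array slot.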

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v9 5/7] power: keep per-lcore state in lcore variable
  2024-10-10 14:21                                                       ` [PATCH v9 0/7] Lcore variables Mattias Rönnblom
                                                                           ` (3 preceding siblings ...)
  2024-10-10 14:22                                                         ` [PATCH v9 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-10-10 14:22                                                         ` Mattias Rönnblom
  2024-10-10 14:22                                                         ` [PATCH v9 6/7] service: " Mattias Rönnblom
  2024-10-10 14:22                                                         ` [PATCH v9 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:22 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a981db4b39 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v9 6/7] service: keep per-lcore state in lcore variable
  2024-10-10 14:21                                                       ` [PATCH v9 0/7] Lcore variables Mattias Rönnblom
                                                                           ` (4 preceding siblings ...)
  2024-10-10 14:22                                                         ` [PATCH v9 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-10-10 14:22                                                         ` Mattias Rönnblom
  2024-10-10 14:22                                                         ` [PATCH v9 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:22 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 117 +++++++++++++++++++----------------
 1 file changed, 65 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index a38c594ce4..3d2c12c39b 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -77,7 +78,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -103,12 +104,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -124,7 +121,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -138,7 +134,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -288,7 +283,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -296,9 +290,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -467,7 +463,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -477,7 +476,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -499,8 +498,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -546,13 +544,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -560,9 +560,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -579,7 +582,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -595,7 +599,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -647,30 +651,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -698,13 +703,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -715,14 +721,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -738,17 +746,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -760,7 +770,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -784,7 +794,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -814,6 +824,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -821,12 +833,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -847,7 +858,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -858,7 +869,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -866,7 +877,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -874,7 +885,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -901,7 +912,7 @@ lcore_attr_get_service_error_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -917,7 +928,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -979,12 +993,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1009,7 +1022,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -1020,12 +1034,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1060,7 +1073,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v9 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-10 14:21                                                       ` [PATCH v9 0/7] Lcore variables Mattias Rönnblom
                                                                           ` (5 preceding siblings ...)
  2024-10-10 14:22                                                         ` [PATCH v9 6/7] service: " Mattias Rönnblom
@ 2024-10-10 14:22                                                         ` Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 14:22 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Keep per-lcore power intrinsics state in an lcore variable to reduce
the cache working set size and avoid any CPU next-line prefetching
causing false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 14:21                                                         ` [PATCH v9 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-10 15:54                                                           ` Stephen Hemminger
  2024-10-10 16:17                                                             ` Mattias Rönnblom
  2024-10-10 21:24                                                           ` Thomas Monjalon
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
  2 siblings, 1 reply; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-10 15:54 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic, Konstantin Ananyev,
	Chengwen Feng

On Thu, 10 Oct 2024 16:21:59 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.
> 
> The primary <rte_lcore_var.h> use case is for statically allocating
> small, frequently-accessed data structures, for which one instance
> should exist for each lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' life time with that of the
> threads.
> 
> Lcore variables are also similar in terms of functionality provided by
> FreeBSD kernel's DPCPU_*() family of macros and the associated
> build-time machinery. DPCPU uses linker scripts, which effectively
> prevents the reuse of its, otherwise seemingly viable, approach.
> 
> The currently-prevailing way to solve the same problem as lcore
> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> lcore variables over this approach is that data related to the same
> lcore now is close (spatially, in memory), rather than data used by
> the same module, which in turn avoid excessive use of padding,
> polluting caches with unused data.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> Acked-by: Stephen Hemminger <stephen@networkplumber.org>

If you need to send v10, fix this please.



WARNING:TYPO_SPELLING: 'varibles' may be misspelled - perhaps 'variables'?
#336: FILE: doc/guides/prog_guide/env_abstraction_layer.rst:447:
+Lcore varibles are suitable for small object statically allocated at

WARNING:TYPO_SPELLING: 'identifer' may be misspelled - perhaps 'identifier'?
#867: FILE: lib/eal/include/rte_lcore_var.h:360:
+ * The pointer returned is only an opaque identifer of the variable. To

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 15:54                                                           ` Stephen Hemminger
@ 2024-10-10 16:17                                                             ` Mattias Rönnblom
  2024-10-10 16:31                                                               ` Stephen Hemminger
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 16:17 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

On 2024-10-10 17:54, Stephen Hemminger wrote:
> On Thu, 10 Oct 2024 16:21:59 +0200
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
>>
>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small, frequently-accessed data structures, for which one instance
>> should exist for each lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' life time with that of the
>> threads.
>>
>> Lcore variables are also similar in terms of functionality provided by
>> FreeBSD kernel's DPCPU_*() family of macros and the associated
>> build-time machinery. DPCPU uses linker scripts, which effectively
>> prevents the reuse of its, otherwise seemingly viable, approach.
>>
>> The currently-prevailing way to solve the same problem as lcore
>> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
>> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
>> lcore variables over this approach is that data related to the same
>> lcore now is close (spatially, in memory), rather than data used by
>> the same module, which in turn avoid excessive use of padding,
>> polluting caches with unused data.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
>> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
>> Acked-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> If you need to send v10, fix this please.
> 
> 

OK, will do.

> 
> WARNING:TYPO_SPELLING: 'varibles' may be misspelled - perhaps 'variables'?
> #336: FILE: doc/guides/prog_guide/env_abstraction_layer.rst:447:
> +Lcore varibles are suitable for small object statically allocated at
> 
> WARNING:TYPO_SPELLING: 'identifer' may be misspelled - perhaps 'identifier'?
> #867: FILE: lib/eal/include/rte_lcore_var.h:360:
> + * The pointer returned is only an opaque identifer of the variable. To

I wonder why my checkpatch doesn't spot this, but the DPDK CI version does.



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v7 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 13:45                                                           ` Morten Brørup
@ 2024-10-10 16:21                                                             ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-10 16:21 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev, Tyler Retzlaff
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand, Jerin Jacob

On 2024-10-10 15:45, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Thursday, 10 October 2024 15.40
>>
>> On 2024-10-10 13:47, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Thursday, 10 October 2024 12.40
>>>>
>>>> On 2024-10-10 00:15, Morten Brørup wrote:
> 
>>>>>    From what I can read on the internet, max_align_t is missing in
>>>> stddef.h in MSVC [1], so try adding this to fix the Windows CI
>>>> compilation failure:
>>>>>
>>>>> #ifdef RTE_TOOLCHAIN_MSVC
>>>>> #include <cstddef>
>>>>> #endif
>>>>
>>>> Please excuse my MSVC ignorance, but will this work in C? Looks like
>>>> C++.
>>>
>>> I have no clue. Just parroting what Microsoft says on the internet.
>>>
>>> You can try it out and see if the CI accepts it.
>>>
>>
>> It wouldn't make sense if that worked, so I'll go for this instead:
>>
>> #ifdef RTE_TOOLCHAIN_MSVC
>> 		/* MSVC <stddef.h> is missing the max_align_t typedef */
> 
> Maybe also add some reference to why "double" is the max aligned type in MSVC.
> 
>> 		align = alignof(double);
>> #else
>> 		align = alignof(max_align_t);
>> #endif
> 
> Even better!
> 
> It's a good workaround, and not an ugly hack like the ones suggested by me.
> 

Still a hack, I would argue.

Maybe this is an issue meson could help solve? Detect the missing 
max_align_t, and then do something appropriate about it. (Not sure 
exactly what that would be.)
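
One possibility, sketched below. This is untested, and the union
members are only a guess at what covers the largest alignment
requirements on the platforms in question:

/* Hypothetical fallback for toolchains whose <stddef.h> lacks
 * max_align_t. Untested sketch.
 */
typedef union {
	long long ll;
	long double ld;
	void *p;
	void (*fp)(void);
} eal_max_align_t;

/* ...and then, in the allocator: align = alignof(eal_max_align_t); */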

>>
>> Thanks for pointing out this issue.
> 
> :-)
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 16:17                                                             ` Mattias Rönnblom
@ 2024-10-10 16:31                                                               ` Stephen Hemminger
  0 siblings, 0 replies; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-10 16:31 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Mattias Rönnblom, dev, Morten Brørup,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Konstantin Ananyev, Chengwen Feng

On Thu, 10 Oct 2024 18:17:56 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 2024-10-10 17:54, Stephen Hemminger wrote:
> > On Thu, 10 Oct 2024 16:21:59 +0200
> > Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> >   
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> >>
> >> The primary <rte_lcore_var.h> use case is for statically allocating
> >> small, frequently-accessed data structures, for which one instance
> >> should exist for each lcore.
> >>
> >> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> >> _Thread_local), but decoupling the values' life time with that of the
> >> threads.
> >>
> >> Lcore variables are also similar in terms of functionality provided by
> >> FreeBSD kernel's DPCPU_*() family of macros and the associated
> >> build-time machinery. DPCPU uses linker scripts, which effectively
> >> prevents the reuse of its, otherwise seemingly viable, approach.
> >>
> >> The currently-prevailing way to solve the same problem as lcore
> >> variables is to keep a module's per-lcore data as RTE_MAX_LCORE-sized
> >> array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
> >> lcore variables over this approach is that data related to the same
> >> lcore now is close (spatially, in memory), rather than data used by
> >> the same module, which in turn avoid excessive use of padding,
> >> polluting caches with unused data.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> >> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> >> Acked-by: Stephen Hemminger <stephen@networkplumber.org>  
> > 
> > If you need to send v10, fix this please.
> > 
> >   
> 
> OK, will do.
> 
> > 
> > WARNING:TYPO_SPELLING: 'varibles' may be misspelled - perhaps 'variables'?
> > #336: FILE: doc/guides/prog_guide/env_abstraction_layer.rst:447:
> > +Lcore varibles are suitable for small object statically allocated at
> > 
> > WARNING:TYPO_SPELLING: 'identifer' may be misspelled - perhaps 'identifier'?
> > #867: FILE: lib/eal/include/rte_lcore_var.h:360:
> > + * The pointer returned is only an opaque identifer of the variable. To  
> 
> I wonder why my checkpatch doesn't spot this, but the DPDK CI version does.
> 
> 

In order to get spell checks, you need codespell and to run the script
to make a local dictionary.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 14:21                                                         ` [PATCH v9 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-10 15:54                                                           ` Stephen Hemminger
@ 2024-10-10 21:24                                                           ` Thomas Monjalon
  2024-10-11  8:04                                                             ` Mattias Rönnblom
  2024-10-11  8:09                                                             ` Morten Brørup
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
  2 siblings, 2 replies; 313+ messages in thread
From: Thomas Monjalon @ 2024-10-10 21:24 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Hello,

This new feature looks to bring something interesting to DPDK.
There was a good amount of discussion and review,
and there is a real effort of documentation.

However, some choices made in this implementation
were not explained or advertised enough in the documentation,
in my opinion.

I think the first thing to add is an explanation of the memory layout.
Maybe an SVG drawing would help to show how it is stored.

We also need to explain why it is not using rte_malloc.

Also please could you re-read the doc and comments in detail?
I think some words are missing and there are typos.
While at it, please allow for easy update of the text
by starting each sentence on a new line.
Breaking lines logically is better for future patches.
One more piece of advice: avoid very long sentences.

Do you have benchmark results for the modules using such variables
(power, random, service)?
It would be interesting to compare time efficiency and memory usage
before/after, with different numbers of threads.

Adding more detailed comments below.


10/10/2024 16:21, Mattias Rönnblom:
> Introduce DPDK per-lcore id variables, or lcore variables for short.
> 
> An lcore variable has one value for every current and future lcore
> id-equipped thread.

I find it difficult to read "lcore id-equipped thread".
Can we just say "DPDK thread"?

> The primary <rte_lcore_var.h> use case is for statically allocating
> small, frequently-accessed data structures, for which one instance
> should exist for each lcore.
> 
> Lcore variables are similar to thread-local storage (TLS, e.g., C11
> _Thread_local), but decoupling the values' life time with that of the
> threads.

In which situation do we need the values of a dead thread?

[...]
> +An application, a DPDK library or PMD may keep opt to keep per-thread
> +state.

I don't understand this sentence.

> +
> +Per-thread data may be maintained using either *lcore variables*
> +(``rte_lcore_var.h``), *thread-local storage (TLS)*
> +(``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE``
> +elements, index by ``rte_lcore_id()``. These methods allows for

index*ed*

> +per-lcore data to be a largely module-internal affair, and not
> +directly visible in its API. Another possibility is to have deal

*to* deal ?

> +explicitly with per-thread aspects in the API (e.g., the ports of the
> +Eventdev API).
> +
> +Lcore varibles are suitable for small object statically allocated at

vari*a*bles

> +the time of module or application initialization. An lcore variable
> +take on one value for each lcore id-equipped thread (i.e., for EAL
> +threads and registered non-EAL threads, in total ``RTE_MAX_LCORE``
> +instances). The lifetime of lcore variables are detached from that of
> +the owning threads, and may thus be initialized prior to the owner
> +having been created.
> +
> +Variables with thread-local storage are allocated at the time of
> +thread creation, and exists until the thread terminates, for every
> +thread in the process. Only very small object should be allocated in
> +TLS, since large TLS objects significantly slows down thread creation
> +and may needlessly increase memory footprint for application that make
> +extensive use of unregistered threads.

I don't understand the relation with non-DPDK threads.

> +
> +A common but now largely obsolete DPDK pattern is to use a static
> +array sized according to the maximum number of lcore id-equipped
> +threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
> +sharing*, each element must both cache-aligned, and include a

must *be*
include*s*

> +``RTE_CACHE_GUARD``. Such extensive use of padding cause internal

cause*s*

> +fragmentation (i.e., unused space) and lower cache hit rates.
> +
> +For more discussions on per-lcore state, see the ``rte_lcore_var.h``
> +API documentation.

[...]
> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)

With #define RTE_MAX_LCORE_VAR 1048576,
LCORE_BUFFER_SIZE can be 100MB, right?

> +
> +static void *lcore_buffer;

It is the last buffer for all lcores.
The name suggests it is one single buffer per lcore.
What about "last_buffer" or "current_buffer"?

> +static size_t offset = RTE_MAX_LCORE_VAR;

A comment may be useful for this value: it triggers the first alloc?

> +
> +static void *
> +lcore_var_alloc(size_t size, size_t align)
> +{
> +	void *handle;
> +	unsigned int lcore_id;
> +	void *value;
> +
> +	offset = RTE_ALIGN_CEIL(offset, align);
> +
> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> +#ifdef RTE_EXEC_ENV_WINDOWS
> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> +					       RTE_CACHE_LINE_SIZE);
> +#else
> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> +					     LCORE_BUFFER_SIZE);
> +#endif
> +		RTE_VERIFY(lcore_buffer != NULL);

Please no panic in a lib.
You can return NULL.

> +
> +		offset = 0;
> +	}
> +
> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
> +
> +	offset += size;
> +
> +	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
> +		memset(value, 0, size);
> +
> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
> +		"%"PRIuPTR"-byte alignment", size, align);
> +
> +	return handle;
> +}

[...]
> +#ifndef _RTE_LCORE_VAR_H_
> +#define _RTE_LCORE_VAR_H_

Really we don't need the first and last underscores,
but it's a detail.

> +
> +/**
> + * @file
> + *
> + * RTE Lcore variables

Please don't say "RTE", it is just a prefix.
You can replace it with "DPDK" if you really want to be specific.

> + *
> + * This API provides a mechanism to create and access per-lcore id
> + * variables in a space- and cycle-efficient manner.
> + *
> + * A per-lcore id variable (or lcore variable for short) has one value
> + * for each EAL thread and registered non-EAL thread. There is one
> + * instance for each current and future lcore id-equipped thread, with
> + * a total of RTE_MAX_LCORE instances. The value of an lcore variable
> + * for a particular lcore id is independent from other values (for
> + * other lcore ids) within the same lcore variable.
> + *
> + * In order to access the values of an lcore variable, a handle is
> + * used. The type of the handle is a pointer to the value's type
> + * (e.g., for an @c uint32_t lcore variable, the handle is a
> + * <code>uint32_t *</code>. The handle type is used to inform the
> + * access macros the type of the values. A handle may be passed
> + * between modules and threads just like any pointer, but its value
> + * must be treated as a an opaque identifier. An allocated handle
> + * never has the value NULL.

Most of the explanations here would be better hosted in the prog guide.
The Doxygen API is better suited for short and direct explanations.

> + *
> + * @b Creation
> + *
> + * An lcore variable is created in two steps:
> + *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
> + *  2. Allocate lcore variable storage and initialize the handle with
> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs the time of

*at* the time

> + *     module initialization, but may be done at any time.

You mean it does not depend on EAL initialization?

> + *
> + * An lcore variable is not tied to the owning thread's lifetime. It's
> + * available for use by any thread immediately after having been
> + * allocated, and continues to be available throughout the lifetime of
> + * the EAL.
> + *
> + * Lcore variables cannot and need not be freed.

I'm curious about that.
If EAL is closed, and the application continues its life,
then we want all this memory to be cleaned as well.
Do you know rte_eal_cleanup()?

> + *
> + * @b Access
> + *
> + * The value of any lcore variable for any lcore id may be accessed
> + * from any thread (including unregistered threads), but it should
> + * only be *frequently* read from or written to by the owner.

Would be interesting to explain why.

> + *
> + * Values of the same lcore variable but owned by two different lcore
> + * ids may be frequently read or written by the owners without risking
> + * false sharing.

Again you could explain why if you explained the storage layout.
What is the minimum object size to avoid false sharing?

> + *
> + * An appropriate synchronization mechanism (e.g., atomic loads and
> + * stores) should employed to assure there are no data races between

should *be*

> + * the owning thread and any non-owner threads accessing the same
> + * lcore variable instance.
> + *
> + * The value of the lcore variable for a particular lcore id is
> + * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
> + *
> + * A common pattern is for an EAL thread or a registered non-EAL
> + * thread to access its own lcore variable value. For this purpose, a
> + * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.

shorthand without hyphen?

> + *
> + * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
> + * pointer with the same type as the value, it may not be directly
> + * dereferenced and must be treated as an opaque identifier.
> + *
> + * Lcore variable handles and value pointers may be freely passed
> + * between different threads.
> + *
> + * @b Storage
> + *
> + * An lcore variable's values may by of a primitive type like @c int,
> + * but would more typically be a @c struct.
> + *
> + * The lcore variable handle introduces a per-variable (not
> + * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
> + * there are some memory footprint gains to be made by organizing all
> + * per-lcore id data for a particular module as one lcore variable
> + * (e.g., as a struct).
> + *
> + * An application may choose to define an lcore variable handle, which
> + * it then it goes on to never allocate.

I don't understand this sentence.

> + *
> + * The size of an lcore variable's value must be less than the DPDK

size of variable, not size of value

> + * build-time constant @c RTE_MAX_LCORE_VAR.
> + *
> + * The lcore variable are stored in a series of lcore buffers, which

variable*s*

> + * are allocated from the libc heap. Heap allocation failures are
> + * treated as fatal.

Why not handle it as an error, so the app has a chance to clean up before the crash?

> + *
> + * Lcore variables should generally *not* be @ref __rte_cache_aligned
> + * and need *not* include a @ref RTE_CACHE_GUARD field, since the use
> + * of these constructs are designed to avoid false sharing. In the
> + * case of an lcore variable instance, the thread most recently
> + * accessing nearby data structures should almost-always be the lcore
> + * variables' owner. Adding padding will increase the effective memory
> + * working set size, potentially reducing performance.
> + *
> + * Lcore variable values take on an initial value of zero.
> + *
> + * @b Example
[...]
> + * @b Alternatives
> + *
> + * Lcore variables are designed to replace a pattern exemplified below:

Would be better in the introduction (in the prog guide).

> + * @code{.c}
> + * struct __rte_cache_aligned foo_lcore_state {
> + *         int a;
> + *         long b;
> + *         RTE_CACHE_GUARD;
> + * };
> + *
> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
> + * @endcode
[...]
> +/**
> + * Define an lcore variable handle.
> + *
> + * This macro defines a variable which is used as a handle to access
> + * the various instances of a per-lcore id variable.
> + *
> + * The aim with this macro is to make clear at the point of

This long sentence may be shortened.

> + * declaration that this is an lcore handle, rather than a regular
> + * pointer.
> + *
> + * Add @b static as a prefix in case the lcore variable is only to be
> + * accessed from a particular translation unit.
> + */
> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle.
> + *
> + * The values of the lcore variable are initialized to zero.

The lcore variables are initialized to zero, not the values.

Don't you mention 0 in align?

> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
> +	handle = rte_lcore_var_alloc(size, align)
> +
> +/**
> + * Allocate space for an lcore variable, and initialize its handle,
> + * with values aligned for any type of object.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
> +
> +/**
> + * Allocate space for an lcore variable of the size and alignment requirements
> + * suggested by the handle pointer type, and initialize its handle.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_ALLOC(handle)					\
> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
> +				       alignof(typeof(*(handle))))
> +
> +/**
> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
> + * means of a @ref RTE_INIT constructor.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
> +	}
> +
> +/**
> + * Allocate an explicitly-sized lcore variable by means of a @ref
> + * RTE_INIT constructor.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
> +
> +/**
> + * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
> + *
> + * The values of the lcore variable are initialized to zero.
> + */
> +#define RTE_LCORE_VAR_INIT(name)					\
> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> +	{								\
> +		RTE_LCORE_VAR_ALLOC(name);				\
> +	}

I don't get the need for RTE_INIT macros.
It does not cover RTE_INIT_PRIO and anyway
another RTE_INIT is probably already there in the module.

> +
> +/**
> + * Get void pointer to lcore variable instance with the specified
> + * lcore id.
> + *
> + * @param lcore_id
> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> + *   instances should be accessed. The lcore id need not be valid
> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
> + *   is also not valid (and thus should not be dereferenced).
> + * @param handle
> + *   The lcore variable handle.

handle pointer

> + */
> +static inline void *
> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)

What a long name!
What about rte_lcore_var() ?

> +{
> +	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
> +}
> +
> +/**
> + * Get pointer to lcore variable instance with the specified lcore id.

Same description as the function above.

> + *
> + * @param lcore_id
> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> + *   instances should be accessed. The lcore id need not be valid
> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
> + *   is also not valid (and thus should not be dereferenced).
> + * @param handle
> + *   The lcore variable handle.
> + */
> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
> +
> +/**
> + * Get pointer to lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_VALUE(handle) \

RTE_LCORE_VAR_LOCAL?

> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> +
> +/**
> + * Iterate over each lcore id's value for an lcore variable.
> + *
> + * @param lcore_id
> + *   An <code>unsigned int</code> variable successively set to the
> + *   lcore id of every valid lcore id (up to @c RTE_MAX_LCORE).
> + * @param value
> + *   A pointer variable successively set to point to lcore variable
> + *   value instance of the current lcore id being processed.
> + * @param handle
> + *   The lcore variable handle.
> + */
> +#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\

RTE_LCORE_VAR_FOREACH?

> +	for ((lcore_id) =						\
> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
> +	     (lcore_id) < RTE_MAX_LCORE;				\
> +	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \
> +							       handle))
> +
> +/**
> + * Allocate space in the per-lcore id buffers for an lcore variable.
> + *
> + * The pointer returned is only an opaque identifer of the variable. To
> + * get an actual pointer to a particular instance of the variable use
> + * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
> + *
> + * The lcore variable values' memory is set to zero.
> + *
> + * The allocation is always successful, barring a fatal exhaustion of
> + * the per-lcore id buffer space.
> + *
> + * rte_lcore_var_alloc() is not multi-thread safe.
> + *
> + * @param size
> + *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
> + * @param align
> + *   If 0, the values will be suitably aligned for any kind of type
> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
> + *   on a multiple of *align*, which must be a power of 2 and equal or
> + *   less than @c RTE_CACHE_LINE_SIZE.
> + * @return
> + *   The variable's handle, stored in a void pointer value. The value
> + *   is always non-NULL.
> + */
> +__rte_experimental
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align);

[...]
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -396,6 +396,8 @@ EXPERIMENTAL {
>  
>  	# added in 24.03
>  	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
> +

# added in 24.11

> +	rte_lcore_var_alloc;




^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 21:24                                                           ` Thomas Monjalon
@ 2024-10-11  8:04                                                             ` Mattias Rönnblom
  2024-10-11  8:46                                                               ` Morten Brørup
                                                                                 ` (2 more replies)
  2024-10-11  8:09                                                             ` Morten Brørup
  1 sibling, 3 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-11  8:04 UTC (permalink / raw)
  To: Thomas Monjalon, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic, Konstantin Ananyev,
	Chengwen Feng

On 2024-10-10 23:24, Thomas Monjalon wrote:
> Hello,
> 
> This new feature looks to bring something interesting to DPDK.
> There was a good amount of discussion and review,
> and there is a real effort of documentation.
> 
> However, some choices made in this implementation
> were not explained or advertised enough in the documentation,
> in my opinion.
> 

Are those of relevance to the API user?

> I think the first thing to add is an explanation of the memory layout.
> Maybe an SVG drawing would help to show how it is stored.
> 

That would be helpful to someone wanting to understand the internals. 
But where should that go? If it's put in the API, it will also obscure 
the *actual* API documentation.

I have some drawings already, and I agree they are very helpful - both
in explaining how things work, and in making it obvious why the memory
layout resulting from the use of lcore variables is superior to that of
the lcore id-indexed static array approach.
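
Roughly, and simplified (two lcore variables a and b, allocated from
the same lcore buffer; sizes not to scale):

         <------------ RTE_MAX_LCORE_VAR bytes ------------>
lcore 0: | a | b |                (free)                    |
lcore 1: | a | b |                (free)                    |
  ...
lcore N: | a | b |                (free)                    |

The handle of a variable points at its lcore id 0 instance, and the
instance for lcore id n is located at handle + n * RTE_MAX_LCORE_VAR.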

> We also need to explain why it is not using rte_malloc.
> 
> Also please could you re-read the doc and comments in detail?
> I think some words are missing and there are typos.
> While at it, please allow for easy update of the text
> by starting each sentence on a new line.
> Breaking lines logically is better for future patches.
> One more advice: avoid very long sentences.
> 

I've gone through the documentation and will post a new patch set.

There's been a lot of comments and discussion on this patch set. Did you 
have anything in particular in mind?

> Do you have benchmarks results of the modules using such variables
> (power, random, service)?
> It would be interesting to compare time efficiency and memory usage
> before/after, with different number of threads.
> 

I have the dummy modules of test_lcore_var_perf.c, which show the 
performance benefits as the number of modules using lcore variables 
increases.

That said, the gains are hard to quantify with micro benchmarks, and for 
real-world performance, one really has to start using the facility at 
scale before anything interesting may happen.

Keep in mind, however, that while this is new to DPDK, similar 
facilities already exist in your favorite UN*X kernel. The 
implementation is different, but I think it's accurate to say the goal 
and the effects should be the same.

One can also run the perf autotest for RTE random, but such tests only 
show that lcore variables don't make things significantly worse when the 
L1 cache is essentially unused. (In fact, the lcore variable-enabled 
rte_random.c somewhat counter-intuitively generates a 64-bit number 1 
TSC cycle faster than the old version on my system.)

Just to be clear: it's the footprint in the core-private caches we are 
attempting to reduce.
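
To make the footprint argument concrete, here is a minimal sketch, 
assuming a 64-byte cache line, the default RTE_CACHE_GUARD_LINES of 1, 
and a made-up module "foo" keeping 24 bytes of per-lcore state:

struct foo_state {
	uint64_t a;
	uint64_t b;
	uint64_t c;
};

/* Old pattern: each element is padded out to one cache line for the
 * data plus one guard line, i.e., 128 bytes per lcore, of which 104
 * bytes are padding.
 */
struct __rte_cache_aligned foo_state_padded {
	struct foo_state state;
	RTE_CACHE_GUARD;
};
static struct foo_state_padded old_states[RTE_MAX_LCORE];

/* Lcore variable: the 24 bytes are packed next to other modules'
 * per-lcore data for the same lcore id, so the owning core's cache
 * working set grows by 24 bytes, not 128.
 */
static RTE_LCORE_VAR_HANDLE(struct foo_state, new_states);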

> Adding more detailed comments below.
> 
> 
> 10/10/2024 16:21, Mattias Rönnblom:
>> Introduce DPDK per-lcore id variables, or lcore variables for short.
>>
>> An lcore variable has one value for every current and future lcore
>> id-equipped thread.
> 
> I find it difficult to read "lcore id-equipped thread".
> Can we just say "DPDK thread"?
> 

Sure, if you point me to a definition of what a DPDK thread is.

I can think of at least four potential definitions:
* An EAL thread
* An EAL thread or a registered non-EAL thread
* Any thread calling into DPDK APIs
* Any thread living in a DPDK process

>> The primary <rte_lcore_var.h> use case is for statically allocating
>> small, frequently-accessed data structures, for which one instance
>> should exist for each lcore.
>>
>> Lcore variables are similar to thread-local storage (TLS, e.g., C11
>> _Thread_local), but decoupling the values' life time with that of the
>> threads.
> 
> In which situation we need values of a dead thread?
> 

To clean up heap-allocated memory referenced by such variables, for 
example, or other resources.
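
A minimal sketch, with "bar" as a made-up module whose per-lcore state 
references heap memory:

struct bar_state {
	char *scratch; /* malloc()ed by the owning thread */
};

static RTE_LCORE_VAR_HANDLE(struct bar_state, bar_states);

/* May be called (e.g., from the main thread) after the worker
 * threads have terminated; the lcore variable values are still
 * valid at that point.
 */
static void
bar_cleanup(void)
{
	unsigned int lcore_id;
	struct bar_state *state;

	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, state, bar_states) {
		free(state->scratch);
		state->scratch = NULL;
	}
}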

> [...]
>> +An application, a DPDK library or PMD may keep opt to keep per-thread
>> +state.
> 
> I don't understand this sentence.
> 

Which part is unclear?

>> +
>> +Per-thread data may be maintained using either *lcore variables*
>> +(``rte_lcore_var.h``), *thread-local storage (TLS)*
>> +(``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE``
>> +elements, index by ``rte_lcore_id()``. These methods allows for
> 
> index*ed*
> 

Fixed.

>> +per-lcore data to be a largely module-internal affair, and not
>> +directly visible in its API. Another possibility is to have deal
> 
> *to* deal ?
> 
>> +explicitly with per-thread aspects in the API (e.g., the ports of the
>> +Eventdev API).
>> +
>> +Lcore varibles are suitable for small object statically allocated at
> 
> vari*a*bles
> 

Fixed.

>> +the time of module or application initialization. An lcore variable
>> +take on one value for each lcore id-equipped thread (i.e., for EAL
>> +threads and registered non-EAL threads, in total ``RTE_MAX_LCORE``
>> +instances). The lifetime of lcore variables are detached from that of
>> +the owning threads, and may thus be initialized prior to the owner
>> +having been created.
>> +
>> +Variables with thread-local storage are allocated at the time of
>> +thread creation, and exists until the thread terminates, for every
>> +thread in the process. Only very small object should be allocated in
>> +TLS, since large TLS objects significantly slows down thread creation
>> +and may needlessly increase memory footprint for application that make
>> +extensive use of unregistered threads.
> 
> I don't understand the relation with non-DPDK threads.
> 

__thread isn't just for "DPDK threads". It will allocate memory for 
every thread in the process.

>> +
>> +A common but now largely obsolete DPDK pattern is to use a static
>> +array sized according to the maximum number of lcore id-equipped
>> +threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
>> +sharing*, each element must both cache-aligned, and include a
> 
> must *be*

Fixed.

> include*s*
> 

No, it's "include".

>> +``RTE_CACHE_GUARD``. Such extensive use of padding cause internal
> 
> cause*s*
> 

Fixed.

>> +fragmentation (i.e., unused space) and lower cache hit rates.
>> +
>> +For more discussions on per-lcore state, see the ``rte_lcore_var.h``
>> +API documentation.
> 
> [...]
>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> 
> With #define RTE_MAX_LCORE_VAR 1048576,
> LCORE_BUFFER_SIZE can be 100MB, right?
> 

Sure. Unless you mlock the memory, it won't result in the DPDK process 
having 100MB worth of mostly-unused resident memory (RSS, in Linux 
speak). It would, were we to use huge pages, which effectively disable 
demand paging.

This is similar to how thread stacks generally work, where you often get 
a fairly sizable stack (e.g., 2MB) but as long as you don't use all of 
it, most of the pages won't be resident.

If you want to guard against such mlocked scenarios, you could consider 
lowering the max variable size. You could argue it's strange to have a 
large RTE_MAX_LCORE_VAR and yet tell the API user to only use it for 
small, often-used blocks of memory.

If RTE_MAX_LCORE_VAR should have a different value, what should it be?

>> +
>> +static void *lcore_buffer;
> 
> It is the last buffer for all lcores.
> The name suggests it is one single buffer per lcore.
> What about "last_buffer" or "current_buffer"?
> 

Would "value_buffer" be better? Or "values_buffer", although that sounds 
awkward. "current_value_buffer".

I agree lcore_buffer is very generic.

The buffer holds values for all lcore ids, for one or more (usually 
many) lcore variables.

>> +static size_t offset = RTE_MAX_LCORE_VAR;
> 
> A comment may be useful for this value: it triggers the first alloc?
> 

Yes. I will add a comment.

>> +
>> +static void *
>> +lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	void *handle;
>> +	unsigned int lcore_id;
>> +	void *value;
>> +
>> +	offset = RTE_ALIGN_CEIL(offset, align);
>> +
>> +	if (offset + size > RTE_MAX_LCORE_VAR) {
>> +#ifdef RTE_EXEC_ENV_WINDOWS
>> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
>> +					       RTE_CACHE_LINE_SIZE);
>> +#else
>> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
>> +					     LCORE_BUFFER_SIZE);
>> +#endif
>> +		RTE_VERIFY(lcore_buffer != NULL);
> 
> Please no panic in a lib.
> You can return NULL.
> 

One could, but it would come at a great cost to the API user.

Something is seriously broken if these kinds of allocations fail 
(considering when they occur and what size they are), just like 
something is seriously broken if the kernel fails (or is unwilling) to 
allocate pages used by static lcore id-indexed arrays.

>> +
>> +		offset = 0;
>> +	}
>> +
>> +	handle = RTE_PTR_ADD(lcore_buffer, offset);
>> +
>> +	offset += size;
>> +
>> +	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
>> +		memset(value, 0, size);
>> +
>> +	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
>> +		"%"PRIuPTR"-byte alignment", size, align);
>> +
>> +	return handle;
>> +}
> 
> [...]
>> +#ifndef _RTE_LCORE_VAR_H_
>> +#define _RTE_LCORE_VAR_H_
> 
> Really we don't need the first and last underscores,
> but it's a detail.
> 

I just follow the DPDK conventions here.

I agree the conventions are wrong.

>> +
>> +/**
>> + * @file
>> + *
>> + * RTE Lcore variables
> 
> Please don't say "RTE", it is just a prefix.

OK.

I just follow the DPDK conventions here as well, but sure, I'll change it.

> You can replace it with "DPDK" if you really want to be specific.
> 
>> + *
>> + * This API provides a mechanism to create and access per-lcore id
>> + * variables in a space- and cycle-efficient manner.
>> + *
>> + * A per-lcore id variable (or lcore variable for short) has one value
>> + * for each EAL thread and registered non-EAL thread. There is one
>> + * instance for each current and future lcore id-equipped thread, with
>> + * a total of RTE_MAX_LCORE instances. The value of an lcore variable
>> + * for a particular lcore id is independent from other values (for
>> + * other lcore ids) within the same lcore variable.
>> + *
>> + * In order to access the values of an lcore variable, a handle is
>> + * used. The type of the handle is a pointer to the value's type
>> + * (e.g., for an @c uint32_t lcore variable, the handle is a
>> + * <code>uint32_t *</code>. The handle type is used to inform the
>> + * access macros the type of the values. A handle may be passed
>> + * between modules and threads just like any pointer, but its value
>> + * must be treated as a an opaque identifier. An allocated handle
>> + * never has the value NULL.
> 
> Most of the explanations here would be better hosted in the prog guide.
> The Doxygen API is better suited for short and direct explanations.
> 

Yeah, maybe. Reworking this to the programming guide format and having 
that reviewed is a sizable undertaking though.

>> + *
>> + * @b Creation
>> + *
>> + * An lcore variable is created in two steps:
>> + *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
>> + *  2. Allocate lcore variable storage and initialize the handle with
>> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
>> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs the time of
> 
> *at* the time
> 
>> + *     module initialization, but may be done at any time.
> 
> You mean it does not depend on EAL initialization?
> 

Lcore variables may be used before any other part of the EAL has been 
initialized.

>> + *
>> + * An lcore variable is not tied to the owning thread's lifetime. It's
>> + * available for use by any thread immediately after having been
>> + * allocated, and continues to be available throughout the lifetime of
>> + * the EAL.
>> + *
>> + * Lcore variables cannot and need not be freed.
> 
> I'm curious about that.
> If EAL is closed, and the application continues its life,
> then we want all this memory to be cleaned as well.
> Do you know rte_eal_cleanup()?

I think the primary reason you would like to free the buffers is to 
avoid false positives from tools like valgrind memcheck (if anyone 
managed to get that working with DPDK).

rte_eal_cleanup() freeing the buffers and resetting the offset would 
make sense. That however would require the buffers to be tracked (e.g., 
as a linked list).

From a footprint point of view, TLS allocations and static arrays also 
aren't freed by rte_eal_cleanup().

> 
>> + *
>> + * @b Access
>> + *
>> + * The value of any lcore variable for any lcore id may be accessed
>> + * from any thread (including unregistered threads), but it should
>> + * only be *frequently* read from or written to by the owner.
> 
> Would be interesting to explain why.
> 

This is intended to be brief and false sharing is mentioned elsewhere.

>> + *
>> + * Values of the same lcore variable but owned by two different lcore
>> + * ids may be frequently read or written by the owners without risking
>> + * false sharing.
> 
> Again you could explain why if you explained the storage layout.
> What is the minimum object size to avoid false sharing?
> 

Your objects may be as small as you want, and you still do not risk 
false sharing. All objects for a particular lcore id are grouped 
together, spatially.
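
For reference, the accessor in the patch makes this guarantee visible; 
each lcore id's values live in a dedicated RTE_MAX_LCORE_VAR-sized 
region of the buffer:

static inline void *
rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
{
	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
}

Values belonging to two different lcore ids are thus always at least 
RTE_MAX_LCORE_VAR bytes apart, far more than a cache line, regardless 
of how small the individual objects are.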

>> + *
>> + * An appropriate synchronization mechanism (e.g., atomic loads and
>> + * stores) should employed to assure there are no data races between
> 
> should *be*
> 

Fixed.

>> + * the owning thread and any non-owner threads accessing the same
>> + * lcore variable instance.
>> + *
>> + * The value of the lcore variable for a particular lcore id is
>> + * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
>> + *
>> + * A common pattern is for an EAL thread or a registered non-EAL
>> + * thread to access its own lcore variable value. For this purpose, a
>> + * short-hand exists in the form of @ref RTE_LCORE_VAR_VALUE.
> 
> shorthand without hyphen?
> 

Both work, but I'll change.

>> + *
>> + * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
>> + * pointer with the same type as the value, it may not be directly
>> + * dereferenced and must be treated as an opaque identifier.
>> + *
>> + * Lcore variable handles and value pointers may be freely passed
>> + * between different threads.
>> + *
>> + * @b Storage
>> + *
>> + * An lcore variable's values may by of a primitive type like @c int,
>> + * but would more typically be a @c struct.
>> + *
>> + * The lcore variable handle introduces a per-variable (not
>> + * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
>> + * there are some memory footprint gains to be made by organizing all
>> + * per-lcore id data for a particular module as one lcore variable
>> + * (e.g., as a struct).
>> + *
>> + * An application may choose to define an lcore variable handle, which
>> + * it then it goes on to never allocate.
> 
> I don't understand this sentence.
> 

I have rephrased this.

>> + *
>> + * The size of an lcore variable's value must be less than the DPDK
> 
> size of variable, not size of value
> 

RTE_MAX_LCORE_VAR specifies the maximum size of a variable's value. The 
maximum amount of space required to hold an lcore variable is 
RTE_MAX_LCORE_VAR * RTE_MAX_LCORE.

>> + * build-time constant @c RTE_MAX_LCORE_VAR.
>> + *
>> + * The lcore variable are stored in a series of lcore buffers, which
> 
> variable*s*
> 

Fixed.

>> + * are allocated from the libc heap. Heap allocation failures are
>> + * treated as fatal.
> 
> Why not handling as an error, so the app has a chance to cleanup before crash?
> 

Because you don't want to put the burden on the user (app or 
DPDK-internal) to attempt to clean up such failures. In practice, they 
will never occur, and in case they do, they are just among several such 
early-memory-allocation failures where the application code has no say 
in what should occur.

What happens if the TLS allocations are so large that the main thread 
can't be created?

What happens if the BSS section is so large (because of all our 
RTE_MAX_LCORE-sized arrays) that its pages can't be made resident in 
memory?

Lcore variables aren't a dynamic allocation facility.

>> + *
>> + * Lcore variables should generally *not* be @ref __rte_cache_aligned
>> + * and need *not* include a @ref RTE_CACHE_GUARD field, since the use
>> + * of these constructs are designed to avoid false sharing. In the
>> + * case of an lcore variable instance, the thread most recently
>> + * accessing nearby data structures should almost-always be the lcore
>> + * variables' owner. Adding padding will increase the effective memory
>> + * working set size, potentially reducing performance.
>> + *
>> + * Lcore variable values take on an initial value of zero.
>> + *
>> + * @b Example
> [...]
>> + * @b Alternatives
>> + *
>> + * Lcore variables are designed to replace a pattern exemplified below:
> 
> Would be better in the introduction (in the prog guide).
> 

Yes.

>> + * @code{.c}
>> + * struct __rte_cache_aligned foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + *         RTE_CACHE_GUARD;
>> + * };
>> + *
>> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
>> + * @endcode
> [...]
>> +/**
>> + * Define an lcore variable handle.
>> + *
>> + * This macro defines a variable which is used as a handle to access
>> + * the various instances of a per-lcore id variable.
>> + *
>> + * The aim with this macro is to make clear at the point of
> 
> This long sentence may be shortened.
> 

Indeed. Will do.

>> + * declaration that this is an lcore handle, rather than a regular
>> + * pointer.
>> + *
>> + * Add @b static as a prefix in case the lcore variable is only to be
>> + * accessed from a particular translation unit.
>> + */
>> +#define RTE_LCORE_VAR_HANDLE(type, name)	\
>> +	RTE_LCORE_VAR_HANDLE_TYPE(type) name
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle.
>> + *
>> + * The values of the lcore variable are initialized to zero.
> 
> The lcore variables are initialized to zero, not the values.
> 

"The lcore variables are initialized to zero" is the same as "The lcore 
variables' values are initialized to zero" in my world, since the only 
thing that can be initialized in a lcore variable is its values (or 
"value instances" or just "instances", not sure I'm consistent here).

> Don't you mention 0 in align?
> 

I don't understand the question. Are you asking why objects are 
worst-case aligned when RTE_LCORE_VAR_ALLOC_SIZE() is used, rather than 
naturally aligned?

Good question, in that case. I guess it would make more sense if they 
were naturally aligned. I just thought in terms of malloc() semantics, 
but maybe that's wrong.

>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
>> +	handle = rte_lcore_var_alloc(size, align)
>> +
>> +/**
>> + * Allocate space for an lcore variable, and initialize its handle,
>> + * with values aligned for any type of object.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
>> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
>> +
>> +/**
>> + * Allocate space for an lcore variable of the size and alignment requirements
>> + * suggested by the handle pointer type, and initialize its handle.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_ALLOC(handle)					\
>> +	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
>> +				       alignof(typeof(*(handle))))
>> +
>> +/**
>> + * Allocate an explicitly-sized, explicitly-aligned lcore variable by
>> + * means of a @ref RTE_INIT constructor.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
>> +	}
>> +
>> +/**
>> + * Allocate an explicitly-sized lcore variable by means of a @ref
>> + * RTE_INIT constructor.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
>> +	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
>> +
>> +/**
>> + * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
>> + *
>> + * The values of the lcore variable are initialized to zero.
>> + */
>> +#define RTE_LCORE_VAR_INIT(name)					\
>> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
>> +	{								\
>> +		RTE_LCORE_VAR_ALLOC(name);				\
>> +	}
> 
> I don't get the need for RTE_INIT macros.

Check rte_power_intrinsics.c

I agree it's not obvious they are worth the API clutter.
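
For the record, the shorthand (here with a made-up module "baz")

static RTE_LCORE_VAR_HANDLE(struct baz_state, baz_states);
RTE_LCORE_VAR_INIT(baz_states);

expands to roughly the explicit form

RTE_INIT(rte_lcore_var_init_baz_states)
{
	RTE_LCORE_VAR_ALLOC(baz_states);
}

so a module which already has an RTE_INIT constructor (or needs 
RTE_INIT_PRIO) can just call RTE_LCORE_VAR_ALLOC() from there instead.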

> It does not cover RTE_INIT_PRIO and anyway
> another RTE_INIT is probably already there in the module.
> 
>> +
>> +/**
>> + * Get void pointer to lcore variable instance with the specified
>> + * lcore id.
>> + *
>> + * @param lcore_id
>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
>> + *   instances should be accessed. The lcore id need not be valid
>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
>> + *   is also not valid (and thus should not be dereferenced).
>> + * @param handle
>> + *   The lcore variable handle.
> 
> handle pointer
> 

No, handle. A handle pointer could be thought of as &handle.

>> + */
>> +static inline void *
>> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> 
> What a long name!
> What about rte_lcore_var() ?
> 

It's long but consistent with the rest of the API.

This is not a function you will see called often in API user code. 
Most will use the access macros.

>> +{
>> +	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
>> +}
>> +
>> +/**
>> + * Get pointer to lcore variable instance with the specified lcore id.
> 
> Same description as the function above.
> 

I don't understand this comment.

>> + *
>> + * @param lcore_id
>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
>> + *   instances should be accessed. The lcore id need not be valid
>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
>> + *   is also not valid (and thus should not be dereferenced).
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
>> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
>> +
>> +/**
>> + * Get pointer to lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_VALUE(handle) \
> 
> RTE_LCORE_VAR_LOCAL?
> 

Why is that better?

Maybe Morten can remind me here, but I think we had a discussion about 
RTE_LCORE_VAR() versus RTE_LCORE_VAR_VALUE() at some point, and 
RTE_LCORE_VAR_VALUE() was deemed more clear.

>> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
>> +
>> +/**
>> + * Iterate over each lcore id's value for an lcore variable.
>> + *
>> + * @param lcore_id
>> + *   An <code>unsigned int</code> variable successively set to the
>> + *   lcore id of every valid lcore id (up to @c RTE_MAX_LCORE).
>> + * @param value
>> + *   A pointer variable successively set to point to lcore variable
>> + *   value instance of the current lcore id being processed.
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
> 
> RTE_LCORE_VAR_FOREACH?
> 

Has been discussed already, and VALUE was deemed to improve readability.

RTE_LCORE_VAR_FOREACH could mean "iterate over all lcore variables", 
which is not what the function does.

>> +	for ((lcore_id) =						\
>> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
>> +	     (lcore_id) < RTE_MAX_LCORE;				\
>> +	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \
>> +							       handle))
>> +
>> +/**
>> + * Allocate space in the per-lcore id buffers for an lcore variable.
>> + *
>> + * The pointer returned is only an opaque identifer of the variable. To
>> + * get an actual pointer to a particular instance of the variable use
>> + * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
>> + *
>> + * The lcore variable values' memory is set to zero.
>> + *
>> + * The allocation is always successful, barring a fatal exhaustion of
>> + * the per-lcore id buffer space.
>> + *
>> + * rte_lcore_var_alloc() is not multi-thread safe.
>> + *
>> + * @param size
>> + *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
>> + * @param align
>> + *   If 0, the values will be suitably aligned for any kind of type
>> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
>> + *   on a multiple of *align*, which must be a power of 2 and equal or
>> + *   less than @c RTE_CACHE_LINE_SIZE.
>> + * @return
>> + *   The variable's handle, stored in a void pointer value. The value
>> + *   is always non-NULL.
>> + */
>> +__rte_experimental
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align);
> 
> [...]
>> --- a/lib/eal/version.map
>> +++ b/lib/eal/version.map
>> @@ -396,6 +396,8 @@ EXPERIMENTAL {
>>   
>>   	# added in 24.03
>>   	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
>> +
> 
> # added in 24.11
> 

Fixed.

Thanks for the review.

>> +	rte_lcore_var_alloc;
> 
> 
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-10 21:24                                                           ` Thomas Monjalon
  2024-10-11  8:04                                                             ` Mattias Rönnblom
@ 2024-10-11  8:09                                                             ` Morten Brørup
  2024-10-11  8:42                                                               ` Thomas Monjalon
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-10-11  8:09 UTC (permalink / raw)
  To: Thomas Monjalon, Mattias Rönnblom
  Cc: dev, hofors, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Mattias,
Please note that most of Thomas' questions are in the interest of the general public, and should be considered requests for further documentation.


> Do you have benchmarks results of the modules using such variables
> (power, random, service)?
> It would be interesting to compare time efficiency and memory usage
> before/after, with different number of threads.

IMO, the main benefit is the reduction of waste of CPU data cache (no need for RTE_CACHE_GUARD fillers in per-lcore data structures).

Mattias,
The PMU counters library is too new, so I suppose you cannot yet measure the primary benefit, but only derived benefits.
If you have any kind of perf data, please provide them.


> > Lcore variables are similar to thread-local storage (TLS, e.g., C11
> > _Thread_local), but decoupling the values' life time with that of the
> > threads.
> 
> In which situation we need values of a dead thread?

Values of dead threads are not the issue here.
This is:
1. TLS variables are allocated and initialized when ANY thread is created, so using TLS for variables increases the cost of creating new threads. This is relevant for applications that frequently create (and destroy) short-lived threads.
2. TLS variables use memory for ALL threads, regardless of whether those threads use the TLS variables or not. This increases the memory footprint for applications that create many (long-lived) threads. (See the sketch below.)
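
To illustrate the difference, a sketch with a made-up module "qux":

struct qux_state { int n; };

/* TLS: one instance per thread in the process (registered or not),
 * allocated as part of thread creation.
 */
static RTE_DEFINE_PER_LCORE(struct qux_state, tls_state);

/* Lcore variable: exactly RTE_MAX_LCORE instances, allocated once,
 * regardless of how many threads come and go.
 */
static RTE_LCORE_VAR_HANDLE(struct qux_state, lv_states);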

Mattias:
Thomas' question might still be relevant...
Is there any situation where the values of the lcore variables are relevant after the thread is dead?

> > +Variables with thread-local storage are allocated at the time of
> > +thread creation, and exists until the thread terminates, for every
> > +thread in the process. Only very small object should be allocated in
> > +TLS, since large TLS objects significantly slows down thread
> creation
> > +and may needlessly increase memory footprint for application that
> make
> > +extensive use of unregistered threads.
> 
> I don't understand the relation with non-DPDK threads.

The relationship here is the same as I described above: Using TLS for DPDK threads also has a cost for non-DPDK threads.


> [...]
> > +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> 
> With #define RTE_MAX_LCORE_VAR 1048576,
> LCORE_BUFFER_SIZE can be 100MB, right?

Mattias:
You should document what you have explained...
This huge amount of memory is not really consumed, but relies on demand paging; only a large virtual address range is reserved.
This also makes it somewhat advantageous that the lcore variables don't use hugepages.

> > +static size_t offset = RTE_MAX_LCORE_VAR;
> 
> A comment may be useful for this value: it triggers the first alloc?

Mattias:
If you recall, I also had a hard time understanding this design (instead of simply comparing lcore_buffer to NULL).
Please add a comment that this not only triggers the first allocation, but also additional allocations if a lot of memory is used for lcore variables. (See the condensed sketch below.)
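
Condensed, the trigger logic in the patch reads:

/* offset's initial value of RTE_MAX_LCORE_VAR makes the very first
 * allocation request "overflow" the (not yet existing) current
 * buffer, so "no buffer yet" and "buffer full" are handled by the
 * same branch.
 */
if (offset + size > RTE_MAX_LCORE_VAR) {
	/* allocate a fresh LCORE_BUFFER_SIZE buffer */
	offset = 0;
}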

> 
> > +
> > +static void *
> > +lcore_var_alloc(size_t size, size_t align)
> > +{
> > +	void *handle;
> > +	unsigned int lcore_id;
> > +	void *value;
> > +
> > +	offset = RTE_ALIGN_CEIL(offset, align);
> > +
> > +	if (offset + size > RTE_MAX_LCORE_VAR) {
> > +#ifdef RTE_EXEC_ENV_WINDOWS
> > +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> > +					       RTE_CACHE_LINE_SIZE);
> > +#else
> > +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> > +					     LCORE_BUFFER_SIZE);
> > +#endif
> > +		RTE_VERIFY(lcore_buffer != NULL);
> 
> Please no panic in a lib.
> You can return NULL.

I agree with Mattias' design here.
Lcore variables are like RTE_PER_LCORE variables and simple "static" variables.
If the system does not have enough memory for those, the application will not be able to deal with it.
Panic early (in this lib) is the correct way to deal with it.


> > +/**
> > + * @file
> > + *
> > + * RTE Lcore variables
> 
> Please don't say "RTE", it is just a prefix.
> You can replace it with "DPDK" if you really want to be specific.

The commit message says:
"Introduce DPDK per-lcore id variables, or lcore variables for short."

Use one of the two here.
I personally prefer the short variant, "Lcore variables", because that is the term we are going to use in conversations on the mailing list, documentation etc.
The long variant is mainly intended for explaining the library itself.


> > + * Lcore variables cannot and need not be freed.
> 
> I'm curious about that.
> If EAL is closed, and the application continues its life,
> then we want all this memory to be cleaned as well.
> Do you know rte_eal_cleanup()?

Good catch, Thomas! I missed that in my review.
Mattias, it seems you need a linked list of lcore_buffer allocations for this.


> > + *
> > + * The size of an lcore variable's value must be less than the DPDK
> 
> size of variable, not size of value

Initially, I thought the same as Thomas...
It is confusing, considering that the handle is the variable, and its instances hold the values.

However, during the review, Mattias convinced me of its correctness.

And by the way, RTE_PER_LCORE also does it:
https://elixir.bootlin.com/dpdk/v24.07/source/lib/eal/include/rte_per_lcore.h#L48


> 
> > + * build-time constant @c RTE_MAX_LCORE_VAR.
> > + *
> > + * The lcore variable are stored in a series of lcore buffers, which
> 
> variable*s*
> 
> > + * are allocated from the libc heap. Heap allocation failures are
> > + * treated as fatal.
> 
> Why not handling as an error, so the app has a chance to cleanup before
> crash?

Because allocation failures of similar variable types (RTE_PER_LCORE and "static") also don't offer a chance to clean up. Just following the same design pattern.


> > +static inline void *
> > +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> 
> What a long name!
> What about rte_lcore_var() ?

+1


> > + *
> > + * @param lcore_id
> > + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> > + *   instances should be accessed. The lcore id need not be valid
> > + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the
> pointer
> > + *   is also not valid (and thus should not be dereferenced).
> > + * @param handle
> > + *   The lcore variable handle.
> > + */
> > +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)

I hope to see a lot of these in the code, so keeping it short would be good.

Suggest:
RTE_LCORE_VAR_LCORE(lcore_id, handle)

> 	\
> > +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
> > +
> > +/**
> > + * Get pointer to lcore variable instance of the current thread.
> > + *
> > + * May only be used by EAL threads and registered non-EAL threads.
> > + */
> > +#define RTE_LCORE_VAR_VALUE(handle) \
> 
> RTE_LCORE_VAR_LOCAL?

Same comment as above, let's keep these popular ones short:
RTE_LCORE_VAR(handle)

> 
> > +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> > +
> > +/**
> > + * Iterate over each lcore id's value for an lcore variable.
> > + *
> > + * @param lcore_id
> > + *   An <code>unsigned int</code> variable successively set to the
> > + *   lcore id of every valid lcore id (up to @c RTE_MAX_LCORE).
> > + * @param value
> > + *   A pointer variable successively set to point to lcore variable
> > + *   value instance of the current lcore id being processed.
> > + * @param handle
> > + *   The lcore variable handle.
> > + */
> > +#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
> 	\
> 
> RTE_LCORE_VAR_FOREACH?

Generally, get rid of the _VALUE postfix.
RTE_PER_LCORE() doesn't have a _VALUE postfix, even though its description refers to the variable's value.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v10 0/7] Lcore variables
  2024-10-10 14:21                                                         ` [PATCH v9 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-10 15:54                                                           ` Stephen Hemminger
  2024-10-10 21:24                                                           ` Thomas Monjalon
@ 2024-10-11  8:18                                                           ` Mattias Rönnblom
  2024-10-11  8:18                                                             ` [PATCH v10 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                               ` (7 more replies)
  2 siblings, 8 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-11  8:18 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question of how best to allocate static per-lcore memory has come 
up several times on the dev mailing list, for example in the thread on 
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data 
needs, since they only allow for more-or-less static allocation. In the 
author's opinion, they do however provide a reasonably simple, clean, 
and seemingly very performant solution to a real problem.

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 257 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  85 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 117 ++---
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 389 ++++++++++++++++
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 17 files changed, 1343 insertions(+), 93 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v10 1/7] eal: add static per-lcore memory allocation facility
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
@ 2024-10-11  8:18                                                             ` Mattias Rönnblom
  2024-10-14  7:43                                                               ` [PATCH v11 0/7] Lcore variables Mattias Rönnblom
  2024-10-11  8:18                                                             ` [PATCH v10 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                                               ` (6 subsequent siblings)
  7 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-11  8:18 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar, in terms of functionality, to the
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its, otherwise seemingly viable, approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed structs.
The benefit of lcore variables over this approach is that data related
to the same lcore is now close (spatially, in memory), rather than data
used by the same module, which in turn avoids excessive use of padding
that pollutes caches with unused data.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v10:
 * Improve documentation grammar and spelling. (Stephen Hemminger,
   Thomas Monjalon)
 * Add version.map DPDK version comment. (Thomas Monjalon)

PATCH v9:
 * Fixed merge conflicts in release notes.

PATCH v8:
 * Work around missing max_align_t definition in MSVC. (Morten Brørup)

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow arbitrary
   expression, not just a simple variable name, being passed.
   (Konstantin Ananyev)

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and were thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         |  85 ++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 389 ++++++++++++++++++
 lib/eal/version.map                           |   3 +
 10 files changed, 538 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 812463fe9f..61e5907fb5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index fd6f8a2f1a..498d509244 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b9fac1839d..b659a1d085 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -429,12 +429,43 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, memory blocks allocated on the DPDK
+heap, and other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library, or a PMD may opt to keep per-thread state.
+
+Per-thread data can be maintained using either *lcore variables* (see
+``rte_lcore_var.h``), *thread-local storage (TLS)* (see
+``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE`` elements,
+indexed by ``rte_lcore_id()``. These methods allow per-lcore data to be
+largely internal to the module and not directly exposed in its
+API. Another approach is to explicitly handle per-thread aspects in
+the API (e.g., the ports in the Eventdev API).
+
+Lcore variables are suitable for small objects that are statically
+allocated at the time of module or application initialization. An
+lcore variable takes on one value for each lcore ID-equipped thread
+(i.e., for both EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of lcore variables is
+independent of the owning threads and can, therefore, be initialized
+before the threads are created.
+
+Variables with thread-local storage are allocated when the thread is
+created and exist until the thread terminates. These are applicable
+for every thread in the process. Only very small objects should be
+allocated in TLS, as large TLS objects can significantly slow down
+thread creation and may unnecessarily increase the memory footprint of
+applications that extensively use unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore ID-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must be both cache-aligned and include an
+``RTE_CACHE_GUARD``. This extensive use of padding causes internal
+fragmentation (i.e., unused space) and reduces cache hit rates.
+
+For more discussions on per-lcore state, refer to the
+``rte_lcore_var.h`` API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 915065a6f9..0e15767d41 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -113,6 +113,20 @@ New Features
 
   * Added independent enqueue feature.
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..fac6ab52b0
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,85 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+
+#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
+
+static void *lcore_buffer;
+/* initialized to trigger buffer allocation on first lcore_var_alloc() call */
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+#ifdef RTE_EXEC_ENV_WINDOWS
+		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
+					       RTE_CACHE_LINE_SIZE);
+#else
+		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					     LCORE_BUFFER_SIZE);
+#endif
+		RTE_VERIFY(lcore_buffer != NULL);
+
+		offset = 0;
+	}
+
+	handle = RTE_PTR_ADD(lcore_buffer, offset);
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines, as
+	 * well as having the base pointer aligned on cache line size,
+	 * assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+#ifdef RTE_TOOLCHAIN_MSVC
+		/* MSVC <stddef.h> is missing the max_align_t typedef */
+		align = alignof(double);
+#else
+		align = alignof(max_align_t);
+#endif
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..2f4d388732
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,389 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) holds a
+ * unique value for each EAL thread and registered non-EAL
+ * thread. There is one instance for each current and future lcore
+ * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
+ * value of the lcore variable for one lcore id is independent from
+ * the values assigned to other lcore ids within the same variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for an @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle never
+ * has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * The lifetime of an lcore variable is not tied to the thread that
+ * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
+ * available from the moment the lcore variable is created and
+ * continue to exist throughout the entire lifetime of the EAL,
+ * whether or not the lcore id is currently in use.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable, associated with different lcore
+ * ids, may be frequently read or written by their respective owners
+ * without risking false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to prevent data races between the owning
+ * thread and any other thread accessing the same value instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * shorthand exists as @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may define an lcore variable handle without ever
+ * allocating it.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variables are stored in a series of lcore buffers, which are
+ * allocated from the libc heap. Heap allocation failures are treated
+ * as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since the use
+ * of these constructs is designed to avoid false sharing. In the
+ * case of an lcore variable instance, the thread most recently
+ * accessing nearby data structures should almost-always be the lcore
+ * variable's owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values are initialized to zero by default.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to a
+ * whole number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (features which seem to grow more eager faster than they
+ * grow more intelligent), one or more "guard" cache lines may be
+ * required to separate one lcore's data from another's and prevent
+ * false sharing.
+ *
+ * Lcore variables offer the advantage of working with, rather than
+ * against, the CPU's assumptions. A next-line hardware prefetcher,
+ * for example, may function as intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore variables
+ * are listed below (see also the TLS sketch following the list):
+ *
+ *   * The lifecycle of a thread-local variable instance is tied to
+ *     that of the thread. The data cannot be accessed before the
+ *     thread has been created, nor after it has exited. As a result,
+ *     thread-local variables must be initialized in a "lazy" manner
+ *     (e.g., at the point of thread creation). Lcore variables may be
+ *     accessed immediately after having been allocated (which may occur
+ *     before any thread beyond the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications relying heavily on
+ *     multi-threading (in conjunction with DPDK's "one thread per
+ *     core" pattern), either by having many concurrent threads or by
+ *     creating/destroying threads at a high rate, excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization, or an increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the specifics of the TLS implementation.
+ *     With GCC __thread and GCC _Thread_local, data sharing between
+ *     threads is supported. In the C11 standard, accessing another
+ *     thread's _Thread_local object is implementation-defined. Lcore
+ *     variable instances may be accessed reliably by any thread.
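+ *
+ * As an illustration, a rough TLS-based analogue of the earlier foo
+ * example (a sketch only; the lifecycle and sharing caveats above
+ * still apply) could look like:
+ *
+ * @code{.c}
+ * static RTE_DEFINE_PER_LCORE(struct foo_lcore_state, foo_state);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = &RTE_PER_LCORE(foo_state);
+ *
+ *         return state->a + state->b;
+ * }
+ * @endcode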
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * This macro clarifies that the declaration is an lcore handle, not a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
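+ *
+ * For example (with an illustrative struct name):
+ *
+ * @code{.c}
+ * static RTE_LCORE_VAR_HANDLE(struct my_module_state, module_states);
+ * @endcode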
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
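+ *
+ * A typical use is at file scope, immediately following the handle
+ * definition. An illustrative sketch:
+ *
+ * @code{.c}
+ * static RTE_LCORE_VAR_HANDLE(struct my_module_state, module_states);
+ * RTE_LCORE_VAR_INIT(module_states);
+ * @endcode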
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
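+ *
+ * The handle encodes the location of the lcore id 0 value instance;
+ * the value instances of other lcore ids follow at strides of @c
+ * RTE_MAX_LCORE_VAR bytes.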
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to every
+ *   valid lcore id (i.e., from 0 up to @c RTE_MAX_LCORE - 1).
+ * @param value
+ *   A pointer variable successively set to point to the lcore
+ *   variable value instance of the lcore id currently being processed.
+ * @param handle
+ *   The lcore variable handle.
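+ *
+ * A minimal usage sketch, assuming a previously allocated handle
+ * @c counters of type <tt>int *</tt>:
+ *
+ * @code{.c}
+ * unsigned int lcore_id;
+ * int *counter;
+ *
+ * RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, counter, counters) {
+ *         *counter = 0;
+ * }
+ * @endcode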
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable, use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 equal to or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
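+ *
+ * An illustrative direct use, bypassing the helper macros (with a
+ * hypothetical @c struct big_state):
+ *
+ * @code{.c}
+ * struct big_state *handle =
+ *         rte_lcore_var_alloc(sizeof(struct big_state),
+ *                             alignof(struct big_state));
+ * @endcode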
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..77d3181087 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,9 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	# added in 24.11
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v10 2/7] eal: add lcore variable functional tests
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
  2024-10-11  8:18                                                             ` [PATCH v10 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-11  8:18                                                             ` Mattias Rönnblom
  2024-10-11  8:18                                                             ` [PATCH v10 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                                               ` (5 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-11  8:18 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 436 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 437 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..2a1f258548
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
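+/*
+ * The char-typed variables allocated immediately before and after the
+ * struct-typed variable act as canaries: the test verifies they
+ * remain zero, which should catch out-of-bounds writes from the
+ * neighboring values.
+ */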
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
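+/* Canary variables around the array, as in the struct test above. */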
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
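+/*
+ * With 4-byte values, this many variables should (in total) require
+ * twice the per-variable maximum of per-lcore buffer space, likely
+ * forcing the allocation of more than one backing lcore buffer.
+ */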
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v10 3/7] eal: add lcore variable performance test
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
  2024-10-11  8:18                                                             ` [PATCH v10 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-11  8:18                                                             ` [PATCH v10 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-10-11  8:18                                                             ` Mattias Rönnblom
  2024-10-11  8:18                                                             ` [PATCH v10 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                               ` (4 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-11  8:18 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add a basic micro benchmark for lcore variables, in an attempt to
verify that their overhead isn't significantly greater than that of
alternative approaches, in scenarios where the benefits aren't
expected to show up (i.e., when plenty of cache is available compared
to the working set size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v8:
 * Fix spelling. (Morten Brørup)

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 257 +++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..2efb8342d1
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
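+/*
+ * The volatile qualifier forces the compiler to perform the loads and
+ * the store on every invocation, instead of caching the state in
+ * registers.
+ */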
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local
+ * storage allocated in a real application, since it will incur a cost
+ * on thread creation and increase non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =
+		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
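+/* Fisher-Yates shuffle */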
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such a scenario, using
+ * <N> dummy modules, each with a small per-lcore state. Note however
+ * that these tests have very little non-lcore/thread-local state,
+ * which is unrealistic.
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v10 4/7] random: keep PRNG state in lcore variable
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
                                                                               ` (2 preceding siblings ...)
  2024-10-11  8:18                                                             ` [PATCH v10 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-11  8:18                                                             ` Mattias Rönnblom
  2024-10-11  8:18                                                             ` [PATCH v10 5/7] power: keep per-lcore " Mattias Rönnblom
                                                                               ` (3 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-11  8:18 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v10 5/7] power: keep per-lcore state in lcore variable
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
                                                                               ` (3 preceding siblings ...)
  2024-10-11  8:18                                                             ` [PATCH v10 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-10-11  8:18                                                             ` Mattias Rönnblom
  2024-10-11  8:19                                                             ` [PATCH v10 6/7] service: " Mattias Rönnblom
                                                                               ` (2 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-11  8:18 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a981db4b39 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v10 6/7] service: keep per-lcore state in lcore variable
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
                                                                               ` (4 preceding siblings ...)
  2024-10-11  8:18                                                             ` [PATCH v10 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-10-11  8:19                                                             ` Mattias Rönnblom
  2024-10-11  8:19                                                             ` [PATCH v10 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  2024-10-11 14:25                                                             ` [PATCH v10 0/7] Lcore variables Stephen Hemminger
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-11  8:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 117 +++++++++++++++++++----------------
 1 file changed, 65 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index a38c594ce4..3d2c12c39b 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -77,7 +78,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -103,12 +104,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -124,7 +121,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -138,7 +134,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -288,7 +283,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -296,9 +290,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -467,7 +463,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -477,7 +476,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -499,8 +498,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -546,13 +544,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -560,9 +560,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -579,7 +582,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -595,7 +599,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -647,30 +651,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -698,13 +703,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -715,14 +721,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -738,17 +746,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -760,7 +770,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -784,7 +794,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -814,6 +824,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -821,12 +833,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -847,7 +858,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -858,7 +869,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -866,7 +877,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -874,7 +885,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -901,7 +912,7 @@ lcore_attr_get_service_error_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -917,7 +928,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -979,12 +993,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1009,7 +1022,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -1020,12 +1034,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1060,7 +1073,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v10 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
                                                                               ` (5 preceding siblings ...)
  2024-10-11  8:19                                                             ` [PATCH v10 6/7] service: " Mattias Rönnblom
@ 2024-10-11  8:19                                                             ` Mattias Rönnblom
  2024-10-11 14:25                                                             ` [PATCH v10 0/7] Lcore variables Stephen Hemminger
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-11  8:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Keep per-lcore power intrinsics state in an lcore variable to reduce
cache working set size and to avoid false sharing caused by CPU
next-line prefetching.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
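
Visiting every instance of an lcore variable likewise goes through the
handle. A minimal sketch, assuming the wait_status handle from the
patch above; the dump function itself is illustrative and not part of
the patch:

	#include <stdio.h>

	static void
	dump_wait_status(FILE *f)
	{
		unsigned int lcore_id;
		struct power_wait_status *s;

		/* visit the instance for each lcore id in turn */
		RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, s, wait_status)
			fprintf(f, "lcore %u: %ssleeping\n", lcore_id,
				s->monitor_addr != NULL ? "" : "not ");
	}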

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-11  8:09                                                             ` Morten Brørup
@ 2024-10-11  8:42                                                               ` Thomas Monjalon
  0 siblings, 0 replies; 313+ messages in thread
From: Thomas Monjalon @ 2024-10-11  8:42 UTC (permalink / raw)
  To: Mattias Rönnblom, Morten Brørup
  Cc: dev, hofors, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

11/10/2024 10:09, Morten Brørup:
> > > +static void *
> > > +lcore_var_alloc(size_t size, size_t align)
> > > +{
> > > +	void *handle;
> > > +	unsigned int lcore_id;
> > > +	void *value;
> > > +
> > > +	offset = RTE_ALIGN_CEIL(offset, align);
> > > +
> > > +	if (offset + size > RTE_MAX_LCORE_VAR) {
> > > +#ifdef RTE_EXEC_ENV_WINDOWS
> > > +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> > > +					       RTE_CACHE_LINE_SIZE);
> > > +#else
> > > +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> > > +					     LCORE_BUFFER_SIZE);
> > > +#endif
> > > +		RTE_VERIFY(lcore_buffer != NULL);
> > 
> > Please no panic in a lib.
> > You can return NULL.
> 
> I agree with Mattias design here.
> Lcore variables are like RTE_PER_LCORE variables and simple "static" variables.
> If the system does not have enough memory for those, the application will not be able to deal with it.
> Panic early (in this lib) is the correct way to deal with it.

There were discussions in the past where we agreed to remove
as much panic as possible in our libs and drivers.
We want to allow the application a chance to clean up.

I don't think returning NULL in an allocation is something disruptive.

I understand you don't want to manage an error return
in variable declarations, so can we have RTE_VERIFY in declaration macros?


> > > + * Lcore variables cannot and need not be freed.
> > 
> > I'm curious about that.
> > If EAL is closed, and the application continues its life,
> > then we want all this memory to be cleaned as well.
> > Do you know rte_eal_cleanup()?
> 
> Good catch, Thomas! I missed that in my review.
> Mattias, it seems you need a chained list of lcore_buffer allocations for this.

Yes


> > > + * The size of an lcore variable's value must be less than the DPDK
> > 
> > size of variable, not size of value
> 
> Initially, I thought the same as Thomas...
> It is confusing considering the handle the variable, and its instances having values.
> 
> However, during the review, Mattias convinced me of its correctness.
> 
> And by the way, RTE_PER_LCORE also does it:
> https://elixir.bootlin.com/dpdk/v24.07/source/lib/eal/include/rte_per_lcore.h#L48

I understand your point of view and I accept it.




^ permalink raw reply	[flat|nested] 313+ messages in thread
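
One way to read the proposal above, sketched with the names of this
patch set. The exact macro body is hypothetical (the patch set's
RTE_LCORE_VAR_ALLOC() is paraphrased, not quoted), and alignof/typeof
are assumed available as elsewhere in EAL:

	/* the function reports heap exhaustion to its caller ... */
	void *rte_lcore_var_alloc(size_t size, size_t align); /* may return NULL */

	/* ... while the declaration-style macro keeps the current
	 * can't-fail semantics by verifying the result
	 */
	#define RTE_LCORE_VAR_ALLOC(handle)				\
		do {							\
			handle = rte_lcore_var_alloc(sizeof(*(handle)), \
					alignof(typeof(*(handle))));	\
			RTE_VERIFY(handle != NULL);			\
		} while (0)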

* RE: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-11  8:04                                                             ` Mattias Rönnblom
@ 2024-10-11  8:46                                                               ` Morten Brørup
  2024-10-11  9:11                                                               ` Thomas Monjalon
  2024-10-14  6:51                                                               ` Mattias Rönnblom
  2 siblings, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-10-11  8:46 UTC (permalink / raw)
  To: Mattias Rönnblom, Thomas Monjalon, Mattias Rönnblom
  Cc: dev, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

> >> +/**
> >> + * Get pointer to lcore variable instance of the current thread.
> >> + *
> >> + * May only be used by EAL threads and registered non-EAL threads.
> >> + */
> >> +#define RTE_LCORE_VAR_VALUE(handle) \
> >
> > RTE_LCORE_VAR_LOCAL?
> >
> 
> Why is that better?
> 
> Maybe Morten can remind me here, but I think we had a discussion about
> RTE_LCORE_VAR() versus RTE_LCORE_VAR_VALUE() at some point, and
> RTE_LCORE_VAR_VALUE() was deemed more clear.

Yes, we had the discussion, and reached this conclusion.

However, having been away from it for a while, and now coming back to it, I lean towards the shorter names, although they are not 100% correct.

I am usually a proponent of (very) long, self-explanatory variable names.
But in this case, brevity will be better for reviewing code using the library.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-11  8:04                                                             ` Mattias Rönnblom
  2024-10-11  8:46                                                               ` Morten Brørup
@ 2024-10-11  9:11                                                               ` Thomas Monjalon
  2024-10-14  6:51                                                               ` Mattias Rönnblom
  2 siblings, 0 replies; 313+ messages in thread
From: Thomas Monjalon @ 2024-10-11  9:11 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic, Konstantin Ananyev,
	Chengwen Feng, Mattias Rönnblom

11/10/2024 10:04, Mattias Rönnblom:
> On 2024-10-10 23:24, Thomas Monjalon wrote:
> > Hello,
> > 
> > This new feature looks to bring something interesting to DPDK.
> > There was a good amount of discussion and review,
> > and there is a real effort of documentation.
> > 
> > However, some choices done in this implementation
> > were not explained or advertised enough in the documentation,
> > in my opinion.
> > 
> 
> Are those of relevance to the API user?

I think it helps to understand when we should use this API.
Such design explanation may come in the prog guide RST file.


> > I think the first thing to add is an explanation of the memory layout.
> > Maybe that a SVG drawing would help to show how it is stored.
> 
> That would be helpful to someone wanting to understand the internals. 
> But where should that go? If it's put in the API, it will also obscure 
> the *actual* API documentation.

Of course not in API doc.
I'm talking about moving a lot of explanations in the prog guide,
and add a bit more about the layout.

> I have some drawings already, and I agree they are very helpful - both 
> in explaining how things work, and making obvious why the memory layout 
> resulting from the use of lcore variables are superior to that of the 
> lcore id-index static array approach.

Cool, please add some in the prog guide.

> > We also need to explain why it is not using rte_malloc.
> > 
> > Also please could you re-read the doc and comments in detail?
> > I think some words are missing and there are typos.
> > While at it, please allow for easy update of the text
> > by starting each sentence on a new line.
> > Breaking lines logically is better for future patches.
> > One more advice: avoid very long sentences.
> 
> I've gone through the documentation and will post a new patch set.

OK thanks.

> There's been a lot of comments and discussion on this patch set. Did you 
> have anything in particular in mind?

Nothing more than what I raised in this review.


> > Do you have benchmarks results of the modules using such variables
> > (power, random, service)?
> > It would be interesting to compare time efficiency and memory usage
> > before/after, with different number of threads.
> > 
> 
> I have the dummy modules of test_lcore_var_perf.c, which show the 
> performance benefits as the number of modules using lcore variables 
> increases.
> 
> That said, the gains are hard to quantify with micro benchmarks, and for 
> real-world performance, one really has to start using the facility at 
> scale before anything interesting may happen.
> 
> Keep in mind however, that while this is new to DPDK, similar facilities 
> already exists your favorite UN*X kernel. The implementation is 
> different, but I think it's accurate to say the goal and the effects 
> should be the same.
> 
> One can also run the perf autotest for RTE random, but such tests only 
> show lcore variables doesn't make things significantly worse when the L1 
> cache is essentially unused. (In fact, the lcore variable-enabled 
> rte_random.c somewhat counter-intuitively generates a 64-bit number 1 
> TSC cycle faster than the old version on my system.)
> 
> Just to be clear: it's the footprint in the core-private caches we are 
> attempting to reduce.

OK


> > 10/10/2024 16:21, Mattias Rönnblom:
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> > 
> > I find it difficult to read "lcore id-equipped thread".
> > Can we just say "DPDK thread"?
> 
> Sure, if you point me to a definition of what a DPDK thread is.
> 
> I can think of at least four potential definitions
> * An EAL thread
> * An EAL thread or a registered non-EAL thread
> * Any thread calling into DPDK APIs
> * Any thread living in a DPDK process

OK I understand your point.
If we move the design explanations in the prog guide,
we can explain this point in the introduction of the chapter.


> > [...]
> >> +An application, a DPDK library or PMD may keep opt to keep per-thread
> >> +state.
> > 
> > I don't understand this sentence.
> 
> Which part is unclear?

"keep opt to keep per-thread"
What do you mean?


[...]
> >> +Variables with thread-local storage are allocated at the time of
> >> +thread creation, and exists until the thread terminates, for every
> >> +thread in the process. Only very small object should be allocated in
> >> +TLS, since large TLS objects significantly slows down thread creation
> >> +and may needlessly increase memory footprint for application that make
> >> +extensive use of unregistered threads.
> > 
> > I don't understand the relation with non-DPDK threads.
> 
> __thread isn't just for "DPDK threads". It will allocate memory on all 
> threads in the process.

OK
May be good to add as a note.


> > [...]
> >> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> > 
> > With #define RTE_MAX_LCORE_VAR 1048576,
> > LCORE_BUFFER_SIZE can be 100MB, right?
> > 
> 
> Sure. Unless you mlock the memory, it won't result in the DPDK process 
> having 100MB worth of mostly-unused resident memory (RSS, in Linux 
> speak). It would, were we to use huge pages and thus effectively 
> disable demand paging.
> 
> This is similar to how thread stacks generally work, where you often get 
> a fairly sizable stack (e.g., 2MB) but as long as you don't use all of 
> it, most of the pages won't be resident.
> 
> If you want to guard against such mlocked scenarios, you could consider 
> lowering the max variable size. You could argue it's strange to have a 
> large RTE_MAX_LCORE_VAR and yet tell the API user to only use it for 
> small, often-used blocks of memory.
> 
> If RTE_MAX_LCORE_VAR should have a different value, what should it be?

That's fine


> >> +static void *lcore_buffer;
> > 
> > It is the last buffer for all lcores.
> > The name suggests it is one single buffer per lcore.
> > What about "last_buffer" or "current_buffer"?
> 
> Would "value_buffer" be better? Or "values_buffer", although that sounds 
> awkward. "current_value_buffer".
> 
> I agree lcore_buffer is very generic.
> 
> The buffer holds values for all lcore ids, for one or (usually many) 
> more lcore variables.

So you don't need to mention "lcore" in this variable.
The most important thing is that it is the last buffer allocated, IMHO.


> >> +static size_t offset = RTE_MAX_LCORE_VAR;
> > 
> > A comment may be useful for this value: it triggers the first alloc?
> 
> Yes. I will add a comment.
> 
> >> +
> >> +static void *
> >> +lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +	void *handle;
> >> +	unsigned int lcore_id;
> >> +	void *value;
> >> +
> >> +	offset = RTE_ALIGN_CEIL(offset, align);
> >> +
> >> +	if (offset + size > RTE_MAX_LCORE_VAR) {
> >> +#ifdef RTE_EXEC_ENV_WINDOWS
> >> +		lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> >> +					       RTE_CACHE_LINE_SIZE);
> >> +#else
> >> +		lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> +					     LCORE_BUFFER_SIZE);
> >> +#endif
> >> +		RTE_VERIFY(lcore_buffer != NULL);
> > 
> > Please no panic in a lib.
> > You can return NULL.
> 
> One could, but it would be a great cost to the API user.
> 
> Something is seriously broken if these kinds of allocations fail 
> (considering when they occur and what size they are), just like 
> something is seriously broken if the kernel fails (or is unwilling to) 
> allocate pages used by static lcore id index arrays.

As said in another email,
we may return NULL in this function and have RTE_VERIFY
in the declaration macros, for ease of use of that API.
So the user has a choice between an API which returns an error
and a simpler one with macros.


> > [...]
> >> +#ifndef _RTE_LCORE_VAR_H_
> >> +#define _RTE_LCORE_VAR_H_
> > 
> > Really we don't need the first and last underscores,
> > but it's a detail.
> 
> I just follow the DPDK conventions here.
> 
> I agree the conventions are wrong.

Such conventions are not consistent. Let's do the right thing.

> >> +
> >> +/**
> >> + * @file
> >> + *
> >> + * RTE Lcore variables
> > 
> > Please don't say "RTE", it is just a prefix.
> 
> OK.
> 
> I just follow the DPDK conventions here as well, but sure, I'll change it.

Not really a convention.

> > You can replace it with "DPDK" if you really want to be specific.
> > 
> >> + *
> >> + * This API provides a mechanism to create and access per-lcore id
> >> + * variables in a space- and cycle-efficient manner.
> >> + *
> >> + * A per-lcore id variable (or lcore variable for short) has one value
> >> + * for each EAL thread and registered non-EAL thread. There is one
> >> + * instance for each current and future lcore id-equipped thread, with
> >> + * a total of RTE_MAX_LCORE instances. The value of an lcore variable
> >> + * for a particular lcore id is independent from other values (for
> >> + * other lcore ids) within the same lcore variable.
> >> + *
> >> + * In order to access the values of an lcore variable, a handle is
> >> + * used. The type of the handle is a pointer to the value's type
> >> + * (e.g., for an @c uint32_t lcore variable, the handle is a
> >> + * <code>uint32_t *</code>. The handle type is used to inform the
> >> + * access macros the type of the values. A handle may be passed
> >> + * between modules and threads just like any pointer, but its value
> >> + * must be treated as a an opaque identifier. An allocated handle
> >> + * never has the value NULL.
> > 
> > Most of the explanations here would be better hosted in the prog guide.
> > The Doxygen API is better suited for short and direct explanations.
> 
> Yeah, maybe. Reworking this to the programming guide format and having 
> that reviewed is a sizable undertaking though.

It is mostly a matter of moving text.
I'm on it, I can review quickly.

> >> + *
> >> + * @b Creation
> >> + *
> >> + * An lcore variable is created in two steps:
> >> + *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
> >> + *  2. Allocate lcore variable storage and initialize the handle with
> >> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
> >> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs the time of
> > 
> > *at* the time
> > 
> >> + *     module initialization, but may be done at any time.
> > 
> > You mean it does not depend on EAL initialization?
> 
> Lcore variables may be used before any other parts of the EAL have 
> been initialized.

Please make it explicit.

> >> + *
> >> + * An lcore variable is not tied to the owning thread's lifetime. It's
> >> + * available for use by any thread immediately after having been
> >> + * allocated, and continues to be available throughout the lifetime of
> >> + * the EAL.
> >> + *
> >> + * Lcore variables cannot and need not be freed.
> > 
> > I'm curious about that.
> > If EAL is closed, and the application continues its life,
> > then we want all this memory to be cleaned as well.
> > Do you know rte_eal_cleanup()?
> 
> I think the primary reason you would like to free the buffers is to 
> avoid false positives from tools like valgrind memcheck (if anyone 
> managed to get that working with DPDK).
> 
> rte_eal_cleanup() freeing the buffers and resetting the offset would 
> make sense. That however would require the buffers to be tracked (e.g., 
> as a linked list).
> 
>  From a footprint point of view, TLS allocations and static arrays also 
> aren't freed by rte_eal_cleanup().

They are not dynamic like this one.

I still think it is required.
Think about an application starting and stopping some DPDK modules;
it would be a serious leak.


> >> + * @b Access
> >> + *
> >> + * The value of any lcore variable for any lcore id may be accessed
> >> + * from any thread (including unregistered threads), but it should
> >> + * only be *frequently* read from or written to by the owner.
> > 
> > Would be interesting to explain why.
> 
> This is intended to be brief and false sharing is mentioned elsewhere.
> 
> >> + *
> >> + * Values of the same lcore variable but owned by two different lcore
> >> + * ids may be frequently read or written by the owners without risking
> >> + * false sharing.
> > 
> > Again you could explain why if you explained the storage layout.
> > What is the minimum object size to avoid false sharing?
> 
> Your objects may be as small as you want, and you still do not risk 
> false sharing. All objects for a particular lcore id are grouped 
> together, spatially.

[...]
> >> + * are allocated from the libc heap. Heap allocation failures are
> >> + * treated as fatal.
> > 
> > Why not handling as an error, so the app has a chance to cleanup before crash?
> > 
> 
> Because you don't want to put the burden on the user (app or 
> DPDK-internal) to attempt to clean up such failures, which in practice 
> will never occur, and in case they do, they are just among several such 
> early-memory-allocation failures where the application code has no say 
> in what should occur.
> 
> What happens if the TLS allocations are so large, the main thread can't 
> be created?
> 
> What happens if the BSS section is so large (because of all our 
> RTE_MAX_LCORE-sized arrays) so its pages can't be made resident in memory?
> 
> Lcore variables aren't a dynamic allocation facility.

I understand that and I agree.
In case someone is using it as a dynamic facility with the function,
can we offer them a NULL return?

[...]
> >> +/**
> >> + * Allocate space for an lcore variable, and initialize its handle.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> > 
> > The lcore variables are initialized to zero, not the values.
> > 
> 
> "The lcore variables are initialized to zero" is the same as "The lcore 
> variables' values are initialized to zero" in my world, since the only 
> thing that can be initialized in a lcore variable is its values (or 
> "value instances" or just "instances", not sure I'm consistent here).

OK

> > Don't you mention 0 in align?
> 
> I don't understand the question. Are you asking why objects are 
> worst-case aligned when RTE_LCORE_VAR_ALLOC_SIZE() is used? Rather than 
> naturally aligned?

No, I just mention that the 0 align value is not documented here.

> Good question, in that case. I guess it would make more sense if they 
> were naturally aligned. I just thought in terms of malloc() semantics, 
> but maybe that's wrong.

[...]
> >> +#define RTE_LCORE_VAR_INIT(name)					\
> >> +	RTE_INIT(rte_lcore_var_init_ ## name)				\
> >> +	{								\
> >> +		RTE_LCORE_VAR_ALLOC(name);				\
> >> +	}
> > 
> > I don't get the need for RTE_INIT macros.
> 
> Check rte_power_intrinsics.c

I'll check later.

> I agree it's not obvious they are worth the API clutter.
> 
> > It does not cover RTE_INIT_PRIO and anyway
> > another RTE_INIT is probably already there in the module.


> >> + */
> >> +static inline void *
> >> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> > 
> > What a long name!
> > What about rte_lcore_var() ?
> > 
> 
> It's long but consistent with the rest of the API.
> 
> This is not a function you will see called often in API user code. 
> Most will use the access macros.

I let you discuss naming with Morten.
It seems he agrees with me about making it short.




^ permalink raw reply	[flat|nested] 313+ messages in thread
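
Given the RTE_LCORE_VAR_INIT() definition quoted above, the two
spellings below are equivalent; the convenience macro merely saves a
constructor in modules that, like rte_power_intrinsics.c, have no
other load-time initialization to do. A sketch, with either
alternative used on its own (the constructor name is illustrative):

	RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);

	/* with the convenience macro ... */
	RTE_LCORE_VAR_INIT(wait_status);

	/* ... or spelled out with a plain constructor */
	RTE_INIT(wait_status_init)
	{
		RTE_LCORE_VAR_ALLOC(wait_status);
	}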

* Re: [PATCH v8 0/7] Lcore variables
  2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
                                                                       ` (6 preceding siblings ...)
  2024-10-10 14:13                                                     ` [PATCH v8 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
@ 2024-10-11 14:23                                                     ` Stephen Hemminger
  2024-10-13  7:04                                                       ` Mattias Rönnblom
  7 siblings, 1 reply; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-11 14:23 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic

On Thu, 10 Oct 2024 16:13:42 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> This patch set introduces a new API <rte_lcore_var.h> for static
> per-lcore id data allocation.
> 
> Please refer to the <rte_lcore_var.h> API documentation for both a
> rationale for this new API, and a comparison to the alternatives
> available.
> 
> The adoption of this API would affect many different DPDK modules, but
> the author updated only a few, mostly to serve as examples in this
> RFC, and to iron out some, but surely not all, wrinkles in the API.
> 
> The question on how to best allocate static per-lcore memory has been
> up several times on the dev mailing list, for example in the thread on
> "random: use per lcore state" RFC by Stephen Hemminger.
> 
> Lcore variables are surely not the answer to all your per-lcore-data
> needs, since it only allows for more-or-less static allocation. In the
> author's opinion, it does however provide a reasonably simple and
> clean and seemingly very much performant solution to a real problem.

There should be a mention about whether this storage can be
shared with other threads and processes. Is it like huge page
memory or stack or heap?

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v10 0/7] Lcore variables
  2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
                                                                               ` (6 preceding siblings ...)
  2024-10-11  8:19                                                             ` [PATCH v10 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
@ 2024-10-11 14:25                                                             ` Stephen Hemminger
  2024-10-13  7:02                                                               ` Mattias Rönnblom
  7 siblings, 1 reply; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-11 14:25 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic

On Fri, 11 Oct 2024 10:18:54 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> This patch set introduces a new API <rte_lcore_var.h> for static
> per-lcore id data allocation.
> 
> Please refer to the <rte_lcore_var.h> API documentation for both a
> rationale for this new API, and a comparison to the alternatives
> available.
> 
> The adoption of this API would affect many different DPDK modules, but
> the author updated only a few, mostly to serve as examples in this
> RFC, and to iron out some, but surely not all, wrinkles in the API.
> 
> The question on how to best allocate static per-lcore memory has been
> up several times on the dev mailing list, for example in the thread on
> "random: use per lcore state" RFC by Stephen Hemminger.
> 
> Lcore variables are surely not the answer to all your per-lcore-data
> needs, since it only allows for more-or-less static allocation. In the
> author's opinion, it does however provide a reasonably simple and
> clean and seemingly very much performant solution to a real problem.
> 
> Mattias Rönnblom (7):
>   eal: add static per-lcore memory allocation facility
>   eal: add lcore variable functional tests
>   eal: add lcore variable performance test
>   random: keep PRNG state in lcore variable
>   power: keep per-lcore state in lcore variable
>   service: keep per-lcore state in lcore variable
>   eal: keep per-lcore power intrinsics state in lcore variable
> 
>  MAINTAINERS                                   |   6 +
>  app/test/meson.build                          |   2 +
>  app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
>  app/test/test_lcore_var_perf.c                | 257 +++++++++++
>  config/rte_config.h                           |   1 +
>  doc/api/doxy-api-index.md                     |   1 +
>  .../prog_guide/env_abstraction_layer.rst      |  43 +-
>  doc/guides/rel_notes/release_24_11.rst        |  14 +
>  lib/eal/common/eal_common_lcore_var.c         |  85 ++++
>  lib/eal/common/meson.build                    |   1 +
>  lib/eal/common/rte_random.c                   |  28 +-
>  lib/eal/common/rte_service.c                  | 117 ++---
>  lib/eal/include/meson.build                   |   1 +
>  lib/eal/include/rte_lcore_var.h               | 389 ++++++++++++++++
>  lib/eal/version.map                           |   3 +
>  lib/eal/x86/rte_power_intrinsics.c            |  17 +-
>  lib/power/rte_power_pmd_mgmt.c                |  35 +-
>  17 files changed, 1343 insertions(+), 93 deletions(-)
>  create mode 100644 app/test/test_lcore_var.c
>  create mode 100644 app/test/test_lcore_var_perf.c
>  create mode 100644 lib/eal/common/eal_common_lcore_var.c
>  create mode 100644 lib/eal/include/rte_lcore_var.h
> 


Are there any trace points in this code? Would be good to have.
Also some optional statistics for telemetry use.
I would presume this is not intended as a hot path API; therefore
it would be ok to always keep statistics.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v10 0/7] Lcore variables
  2024-10-11 14:25                                                             ` [PATCH v10 0/7] Lcore variables Stephen Hemminger
@ 2024-10-13  7:02                                                               ` Mattias Rönnblom
  2024-10-16  8:07                                                                 ` Thomas Monjalon
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-13  7:02 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic

On 2024-10-11 16:25, Stephen Hemminger wrote:
> On Fri, 11 Oct 2024 10:18:54 +0200
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
>> This patch set introduces a new API <rte_lcore_var.h> for static
>> per-lcore id data allocation.
>>
>> Please refer to the <rte_lcore_var.h> API documentation for both a
>> rationale for this new API, and a comparison to the alternatives
>> available.
>>
>> The adoption of this API would affect many different DPDK modules, but
>> the author updated only a few, mostly to serve as examples in this
>> RFC, and to iron out some, but surely not all, wrinkles in the API.
>>
>> The question on how to best allocate static per-lcore memory has been
>> up several times on the dev mailing list, for example in the thread on
>> "random: use per lcore state" RFC by Stephen Hemminger.
>>
>> Lcore variables are surely not the answer to all your per-lcore-data
>> needs, since it only allows for more-or-less static allocation. In the
>> author's opinion, it does however provide a reasonably simple and
>> clean and seemingly very much performant solution to a real problem.
>>
>> Mattias Rönnblom (7):
>>    eal: add static per-lcore memory allocation facility
>>    eal: add lcore variable functional tests
>>    eal: add lcore variable performance test
>>    random: keep PRNG state in lcore variable
>>    power: keep per-lcore state in lcore variable
>>    service: keep per-lcore state in lcore variable
>>    eal: keep per-lcore power intrinsics state in lcore variable
>>
>>   MAINTAINERS                                   |   6 +
>>   app/test/meson.build                          |   2 +
>>   app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
>>   app/test/test_lcore_var_perf.c                | 257 +++++++++++
>>   config/rte_config.h                           |   1 +
>>   doc/api/doxy-api-index.md                     |   1 +
>>   .../prog_guide/env_abstraction_layer.rst      |  43 +-
>>   doc/guides/rel_notes/release_24_11.rst        |  14 +
>>   lib/eal/common/eal_common_lcore_var.c         |  85 ++++
>>   lib/eal/common/meson.build                    |   1 +
>>   lib/eal/common/rte_random.c                   |  28 +-
>>   lib/eal/common/rte_service.c                  | 117 ++---
>>   lib/eal/include/meson.build                   |   1 +
>>   lib/eal/include/rte_lcore_var.h               | 389 ++++++++++++++++
>>   lib/eal/version.map                           |   3 +
>>   lib/eal/x86/rte_power_intrinsics.c            |  17 +-
>>   lib/power/rte_power_pmd_mgmt.c                |  35 +-
>>   17 files changed, 1343 insertions(+), 93 deletions(-)
>>   create mode 100644 app/test/test_lcore_var.c
>>   create mode 100644 app/test/test_lcore_var_perf.c
>>   create mode 100644 lib/eal/common/eal_common_lcore_var.c
>>   create mode 100644 lib/eal/include/rte_lcore_var.h
>>
> 
> 
> Are there any trace points in this code? Would be good to have.

No. Yes, for sure.

> Also some optional statistics for telemetry use.

I agree. It could potentially expose some of the internals of the 
implementation, subject to change, but that is a risk that we can take.

Who does the above two and when? Is this something that is required 
before 24.11 (assuming this feature will make it)?

> I would presume this is not intended as a hot path API; therefore
> it would be ok to always keep statistics.

The allocation functions are expected to be used only in the slowest of 
the slow paths.



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v8 0/7] Lcore variables
  2024-10-11 14:23                                                     ` [PATCH v8 0/7] Lcore variables Stephen Hemminger
@ 2024-10-13  7:04                                                       ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-13  7:04 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic

On 2024-10-11 16:23, Stephen Hemminger wrote:
> On Thu, 10 Oct 2024 16:13:42 +0200
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
>> This patch set introduces a new API <rte_lcore_var.h> for static
>> per-lcore id data allocation.
>>
>> Please refer to the <rte_lcore_var.h> API documentation for both a
>> rationale for this new API, and a comparison to the alternatives
>> available.
>>
>> The adoption of this API would affect many different DPDK modules, but
>> the author updated only a few, mostly to serve as examples in this
>> RFC, and to iron out some, but surely not all, wrinkles in the API.
>>
>> The question on how to best allocate static per-lcore memory has been
>> up several times on the dev mailing list, for example in the thread on
>> "random: use per lcore state" RFC by Stephen Hemminger.
>>
>> Lcore variables are surely not the answer to all your per-lcore-data
>> needs, since it only allows for more-or-less static allocation. In the
>> author's opinion, it does however provide a reasonably simple and
>> clean and seemingly very much performant solution to a real problem.
> 
> There should be a mention about whether this storage can be
> shared with other threads and processes. Is it like huge page
> memory or stack or heap?

Sure. I'll mention the memory is not in huge pages in the API docs.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-11  8:04                                                             ` Mattias Rönnblom
  2024-10-11  8:46                                                               ` Morten Brørup
  2024-10-11  9:11                                                               ` Thomas Monjalon
@ 2024-10-14  6:51                                                               ` Mattias Rönnblom
  2024-10-14 15:19                                                                 ` Stephen Hemminger
  2 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-14  6:51 UTC (permalink / raw)
  To: Thomas Monjalon, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic, Konstantin Ananyev,
	Chengwen Feng

On 2024-10-11 10:04, Mattias Rönnblom wrote:
> On 2024-10-10 23:24, Thomas Monjalon wrote:

<snip>

>>> + *
>>> + * An lcore variable is not tied to the owning thread's lifetime. It's
>>> + * available for use by any thread immediately after having been
>>> + * allocated, and continues to be available throughout the lifetime of
>>> + * the EAL.
>>> + *
>>> + * Lcore variables cannot and need not be freed.
>>
>> I'm curious about that.
>> If EAL is closed, and the application continues its life,
>> then we want all this memory to be cleaned as well.
>> Do you know rte_eal_cleanup()?
> 
> I think the primary reason you would like to free the buffers is to 
> avoid false positives from tools like valgrind memcheck (if anyone 
> managed to get that working with DPDK).
> 
> rte_eal_cleanup() freeing the buffers and resetting the offset would 
> make sense. That however would require the buffers to be tracked (e.g., 
> as a linked list).
> 

I had a quick look at this. Cleaning up the lcore var buffers is very 
straightforward.

One thing though: the rte_eal_cleanup() documentation says "After this 
call, no DPDK function calls may be made.". rte_eal_init() is a "DPDK 
function call". So DPDK/EAL can never be re-initialized, correct?

Cleaning up lcore var buffers would further cement this design, since 
there will be no way to re-initialize them other than changing the 
<rte_lcore_var.h> API.

>  From a footprint point of view, TLS allocations and static arrays also 
> aren't freed by rte_eal_cleanup().
> 

<snip>

^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v11 0/7] Lcore variables
  2024-10-11  8:18                                                             ` [PATCH v10 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-14  7:43                                                               ` Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                                   ` (6 more replies)
  0 siblings, 7 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-14  7:43 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The adoption of this API would affect many different DPDK modules, but
the author updated only a few, mostly to serve as examples in this
RFC, and to iron out some, but surely not all, wrinkles in the API.

The question on how to best allocate static per-lcore memory has been
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since it only allows for more-or-less static allocation. In the
author's opinion, it does however provide a reasonably simple and
clean and seemingly very much performant solution to a real problem.

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 257 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 136 ++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 117 ++---
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 391 ++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 20 files changed, 1411 insertions(+), 93 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v11 1/7] eal: add static per-lcore memory allocation facility
  2024-10-14  7:43                                                               ` [PATCH v11 0/7] Lcore variables Mattias Rönnblom
@ 2024-10-14  7:43                                                                 ` Mattias Rönnblom
  2024-10-14  8:17                                                                   ` Morten Brørup
  2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                                                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-14  7:43 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU uses linker scripts, which effectively prevents the
reuse of its, otherwise seemingly viable, approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
structs. The benefit of lcore variables over this approach is that
data related to the same lcore is now spatially close in memory,
rather than data used by the same module, which in turn avoids
excessive use of padding and the pollution of caches with unused data.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v11:
 * Add a note in the API docs on lcore variables and huge page memory.
   (Stephen Hemminger)
 * Free lcore var buffers at EAL cleanup. (Thomas Monjalon)
 * Tweak naming and include a short lcore var buffer usage overview
   in eal_common_lcore_var.c.

PATCH v10:
 * Improve documentation grammar and spelling. (Stephen Hemminger,
   Thomas Monjalon)
 * Add version.map DPDK version comment. (Thomas Monjalon)

PATCH v9:
 * Fixed merge conflicts in release notes.

PATCH v8:
 * Work around missing max_align_t definition in MSVC. (Morten Brørup)

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow an arbitrary
   expression, not just a simple variable name, to be passed.
   (Konstantin Ananyev)

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is that there no longer exists a fixed
   upper bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 136 ++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 391 ++++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   3 +
 13 files changed, 606 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 812463fe9f..61e5907fb5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index fd6f8a2f1a..498d509244 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b9fac1839d..b659a1d085 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -429,12 +429,43 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, memory blocks allocated on the DPDK
+heap, and other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library, or a PMD may opt to keep per-thread state.
+
+Per-thread data can be maintained using either *lcore variables* (see
+``rte_lcore_var.h``), *thread-local storage (TLS)* (see
+``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE`` elements,
+indexed by ``rte_lcore_id()``. These methods allow per-lcore data to be
+largely internal to the module and not directly exposed in its
+API. Another approach is to explicitly handle per-thread aspects in
+the API (e.g., the ports in the Eventdev API).
+
+Lcore variables are suitable for small objects that are statically
+allocated at the time of module or application initialization. An
+lcore variable takes on one value for each lcore ID-equipped thread
+(i.e., for both EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of lcore variables is
+independent of the owning threads and can, therefore, be initialized
+before the threads are created.
+
+Variables with thread-local storage are allocated when the thread is
+created and exist until the thread terminates. These are applicable
+for every thread in the process. Only very small objects should be
+allocated in TLS, as large TLS objects can significantly slow down
+thread creation and may unnecessarily increase the memory footprint of
+applications that extensively use unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore ID-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must be both cache-aligned and include an
+``RTE_CACHE_GUARD``. This extensive use of padding causes internal
+fragmentation (i.e., unused space) and reduces cache hit rates.
+
+For more discussions on per-lcore state, refer to the
+``rte_lcore_var.h`` API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 915065a6f9..0e15767d41 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -113,6 +113,20 @@ New Features
 
   * Added independent enqueue feature.
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..f8508ab61c
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+#include "eal_lcore_var.h"
+
+/*
+ * An lcore var buffer stores at a minimum one, but usually many,
+ * lcore variables. The value instances for all lcore ids are stored
+ * in the same buffer.
+ *
+ * The address of the value of a particular lcore variable associated
+ * with a particular lcore id is:
+ * buffer->data + offset + lcore_id * RTE_MAX_LCORE_VAR.
+ *
+ * In this way, the values associated with a particular lcore id are
+ * grouped spatially close (in the data array), and no padding is
+ * required to prevent false sharing.
+ *
+ * The (buffer->data + offset) base pointer is what is being returned
+ * to the API user as an opaque handle. The handle is a pointer to the
+ * value for lcore id 0, for that lcore variable.
+ *
+ * The implementation maintains a current lcore var buffer (being
+ * allocated from), and an offset representing the amount of data
+ * already allocated (in bytes) in that buffer.
+ *
+ * The offset is progressively incremented (by the size of the
+ * just-allocated lcore variable), as lcore variables are being
+ * allocated.
+ *
+ * When one lcore var buffer is full, a new one is allocated off the heap.
+ *
+ * The lcore var buffers are arranged in a singly linked list, to allow
+ * freeing them at the point of rte_eal_cleanup(), and thereby avoid
+ * false positives from tools like valgrind memcheck.
+ */
+struct lcore_var_buffer {
+	char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
+	struct lcore_var_buffer *prev;
+};
+
+static struct lcore_var_buffer *current_buffer;
+
+/* initialized to trigger buffer allocation on first allocation */
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		struct lcore_var_buffer *prev = current_buffer;
+#ifdef RTE_EXEC_ENV_WINDOWS
+		current_buffer = _aligned_malloc(sizeof(struct lcore_var_buffer),
+						 RTE_CACHE_LINE_SIZE);
+#else
+		current_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
+					       sizeof(struct lcore_var_buffer));
+#endif
+		RTE_VERIFY(current_buffer != NULL);
+
+		current_buffer->prev = prev;
+
+		offset = 0;
+	}
+
+	handle = &current_buffer->data[offset];
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer aligned on cache line
+	 * size, assures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
+	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+#ifdef RTE_TOOLCHAIN_MSVC
+		/* MSVC <stddef.h> is missing the max_align_t typedef */
+		align = alignof(double);
+#else
+		align = alignof(max_align_t);
+#endif
+
+	RTE_ASSERT(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
+
+void
+eal_lcore_var_cleanup(void)
+{
+	while (current_buffer != NULL) {
+		struct lcore_var_buffer *prev = current_buffer->prev;
+
+		free(current_buffer);
+
+		current_buffer = prev;
+	}
+}
diff --git a/lib/eal/common/eal_lcore_var.h b/lib/eal/common/eal_lcore_var.h
new file mode 100644
index 0000000000..de2c4e44a0
--- /dev/null
+++ b/lib/eal/common/eal_lcore_var.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2024 Ericsson AB.
+ */
+
+#ifndef EAL_LCORE_VAR_H
+#define EAL_LCORE_VAR_H
+
+void
+eal_lcore_var_cleanup(void);
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index 1229230063..796c9dbf2d 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -47,6 +47,7 @@
 
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -941,6 +942,7 @@ rte_eal_cleanup(void)
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_cleanup_config(internal_conf);
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..5af742b5d6
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,391 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) holds a
+ * unique value for each EAL thread and registered non-EAL
+ * thread. There is one instance for each current and future lcore
+ * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
+ * value of the lcore variable for one lcore id is independent from
+ * the values assigned to other lcore ids within the same variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for an @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle never
+ * has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * The lifetime of an lcore variable is not tied to the thread that
+ * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
+ * available from the moment the lcore variable is created and
+ * continue to exist throughout the entire lifetime of the EAL,
+ * whether or not the lcore id is currently in use.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable, associated with different lcore
+ * ids, may be frequently read or written by their respective owners
+ * without risking false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to prevent data races between the owning
+ * thread and any other thread accessing the same value instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * shorthand exists as @ref RTE_LCORE_VAR_VALUE.
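+ *
+ * As a minimal sketch (not part of the API; assumes
+ * <rte_stdatomic.h> and a hypothetical event_count variable), a
+ * non-owning thread could sample another lcore's counter, provided
+ * the owner updates it with atomic stores:
+ *
+ * @code{.c}
+ * static RTE_LCORE_VAR_HANDLE(RTE_ATOMIC(uint64_t), event_count);
+ *
+ * uint64_t
+ * read_lcore_events(unsigned int lcore_id)
+ * {
+ *         RTE_ATOMIC(uint64_t) *count =
+ *                 RTE_LCORE_VAR_LCORE_VALUE(lcore_id, event_count);
+ *
+ *         return rte_atomic_load_explicit(count,
+ *                                         rte_memory_order_relaxed);
+ * }
+ * @endcode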
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may define an lcore variable handle without ever
+ * allocating it.
+ *
+ * The size of an lcore variable's value must not exceed the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variables are stored in a series of lcore buffers, which are
+ * allocated from the libc heap. Heap allocation failures are treated
+ * as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs exist only to avoid false sharing, which the lcore
+ * variable layout already prevents. In the case of an lcore variable
+ * instance, the thread most recently accessing nearby data structures
+ * should almost always be the lcore variable's owner. Adding padding
+ * will increase the effective memory working set size, potentially
+ * reducing performance.
+ *
+ * Lcore variable values are initialized to zero by default.
+ *
+ * Lcore variables are not stored in huge page memory.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to a
+ * whole number of cache lines to avoid false sharing.
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's
+ * and prevent false sharing.
+ *
+ * Lcore variables offer the advantage of working with, rather than
+ * against, the CPU's assumptions. A next-line hardware prefetcher,
+ * for example, may function as intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are the following (a TLS-based counterpart of the
+ * earlier example is sketched after this list):
+ *
+ *   * The lifecycle of a thread-local variable instance is tied to
+ *     that of the thread. The data cannot be accessed before the
+ *     thread has been created, nor after it has exited. As a result,
+ *     thread-local variables must be initialized in a "lazy" manner
+ *     (e.g., at the point of thread creation). Lcore variables may be
+ *     accessed immediately after having been allocated (which may occur
+ *     before any thread beyond the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or an increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable
+ *     can be passed to and successfully dereferenced by a
+ *     non-owning thread) depends on the specifics of the TLS
+ *     implementation. With GCC __thread and GCC _Thread_local,
+ *     data sharing between threads is supported. In the C11
+ *     standard, accessing another thread's _Thread_local object is
+ *     implementation-defined. Lcore variable instances may be
+ *     accessed reliably by any thread.
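+ *
+ * For comparison, a TLS-based counterpart of the earlier example (a
+ * sketch for illustration only) could look like:
+ *
+ * @code{.c}
+ * static RTE_DEFINE_PER_LCORE(struct foo_lcore_state, lcore_state);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         const struct foo_lcore_state *state =
+ *                 &RTE_PER_LCORE(lcore_state);
+ *
+ *         return state->a + state->b;
+ * }
+ * @endcode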
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * This macro clarifies that the declaration is an lcore handle, not a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
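+ *
+ * For example (an illustrative sketch only):
+ *
+ * @code{.c}
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ * RTE_LCORE_VAR_INIT(lcore_states);
+ * @endcode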
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to every
+ *   valid lcore id (up to @c RTE_MAX_LCORE).
+ * @param value
+ *   A pointer variable successively set to point to lcore variable
+ *   value instance of the current lcore id being processed.
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
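+ *
+ * As an illustration (with a made-up size; alignment 0 requests
+ * worst-case alignment), a raw, explicitly sized allocation may be
+ * performed directly:
+ *
+ * @code{.c}
+ * unsigned char *scratch = rte_lcore_var_alloc(256, 0);
+ * @endcode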
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index d742cc98e2..ae4df07bcf 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -45,6 +45,7 @@
 #include <telemetry_internal.h>
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -1387,6 +1388,7 @@ rte_eal_cleanup(void)
 	rte_eal_malloc_heap_cleanup();
 	eal_cleanup_config(internal_conf);
 	rte_eal_log_cleanup();
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..77d3181087 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,9 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	# added in 24.11
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v11 2/7] eal: add lcore variable functional tests
  2024-10-14  7:43                                                               ` [PATCH v11 0/7] Lcore variables Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-14  7:43                                                                 ` Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                                                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-14  7:43 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 436 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 437 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..2a1f258548
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
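+/* Use enough variables to spill over into more than one lcore var
+ * buffer, so that the allocation of additional buffers is also
+ * exercised.
+ */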
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v11 3/7] eal: add lcore variable performance test
  2024-10-14  7:43                                                               ` [PATCH v11 0/7] Lcore variables Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-10-14  7:43                                                                 ` Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-14  7:43 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v8:
 * Fix spelling. (Morten Brørup)

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 257 +++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..2efb8342d1
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,257 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =
+		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each with a small, per-lcore state. Note however that
+ * these tests have very little non-lcore/thread local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v11 4/7] random: keep PRNG state in lcore variable
  2024-10-14  7:43                                                               ` [PATCH v11 0/7] Lcore variables Mattias Rönnblom
                                                                                   ` (2 preceding siblings ...)
  2024-10-14  7:43                                                                 ` [PATCH v11 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-14  7:43                                                                 ` Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 5/7] power: keep per-lcore " Mattias Rönnblom
                                                                                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-14  7:43 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
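+	/* the loop above leaves lcore_id == RTE_MAX_LCORE, giving the
+	 * shared unregistered-threads instance a seed distinct from
+	 * those of all lcore ids
+	 */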
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v11 5/7] power: keep per-lcore state in lcore variable
  2024-10-14  7:43                                                               ` [PATCH v11 0/7] Lcore variables Mattias Rönnblom
                                                                                   ` (3 preceding siblings ...)
  2024-10-14  7:43                                                                 ` [PATCH v11 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-10-14  7:43                                                                 ` Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 6/7] service: " Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-14  7:43 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a981db4b39 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v11 6/7] service: keep per-lcore state in lcore variable
  2024-10-14  7:43                                                               ` [PATCH v11 0/7] Lcore variables Mattias Rönnblom
                                                                                   ` (4 preceding siblings ...)
  2024-10-14  7:43                                                                 ` [PATCH v11 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-10-14  7:43                                                                 ` Mattias Rönnblom
  2024-10-14  7:43                                                                 ` [PATCH v11 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-14  7:43 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 117 +++++++++++++++++++----------------
 1 file changed, 65 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index a38c594ce4..3d2c12c39b 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -77,7 +78,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -103,12 +104,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -124,7 +121,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -138,7 +134,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -288,7 +283,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -296,9 +290,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -467,7 +463,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -477,7 +476,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -499,8 +498,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -546,13 +544,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -560,9 +560,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -579,7 +582,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -595,7 +599,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -647,30 +651,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -698,13 +703,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -715,14 +721,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate acts as a guard variable. Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -738,17 +746,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -760,7 +770,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -784,7 +794,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -814,6 +824,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -821,12 +833,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -847,7 +858,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -858,7 +869,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -866,7 +877,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -874,7 +885,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -901,7 +912,7 @@ lcore_attr_get_service_error_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -917,7 +928,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -979,12 +993,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1009,7 +1022,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -1020,12 +1034,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1060,7 +1073,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v11 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-14  7:43                                                               ` [PATCH v11 0/7] Lcore variables Mattias Rönnblom
                                                                                   ` (5 preceding siblings ...)
  2024-10-14  7:43                                                                 ` [PATCH v11 6/7] service: " Mattias Rönnblom
@ 2024-10-14  7:43                                                                 ` Mattias Rönnblom
  2024-10-14 16:30                                                                   ` Stephen Hemminger
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-14  7:43 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Keep per-lcore power intrinsics state in an lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-09-18 10:11                                       ` Jerin Jacob
  2024-09-19 19:31                                         ` Mattias Rönnblom
@ 2024-10-14  7:56                                         ` Morten Brørup
  2024-10-15  6:29                                           ` Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-10-14  7:56 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, Jerin Jacob, thomas
  Cc: dev, Chengwen Feng, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Anatoly Burakov

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Wednesday, 18 September 2024 12.12
> 
> On Thu, Sep 12, 2024 at 8:52 PM Jerin Jacob <jerinjacobk@gmail.com>
> wrote:
> >
> > On Thu, Sep 12, 2024 at 7:11 PM Morten Brørup
> <mb@smartsharesystems.com> wrote:
> > >
> > > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > > Sent: Thursday, 12 September 2024 15.17
> > > >
> > > > On Thu, Sep 12, 2024 at 2:40 PM Morten Brørup
> <mb@smartsharesystems.com>
> > > > wrote:
> > > > >
> > > > > > +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> > > > >
> > > > > Considering hugepages...
> > > > >
> > > > > Lcore variables may be allocated before DPDK's memory allocator
> > > > (rte_malloc()) is ready, so rte_malloc() cannot be used for lcore
> variables.
> > > > >
> > > > > And lcore variables are not usable (shared) for DPDK multi-
> process, so the
> > > > lcore_buffer could be allocated through the O/S APIs as anonymous
> hugepages,
> > > > instead of using rte_malloc().
> > > > >
> > > > > The alternative, using rte_malloc(), would disallow allocating
> lcore
> > > > variables before DPDK's memory allocator has been initialized,
> which I think
> > > > is too late.
> > > >
> > > > I thought it is not. A lot of the subsystems are initialized
> after the
> > > > memory subsystem is initialized.
> > > > [1] example given in documentation. I thought RTE_INIT needs to be
> > > > replaced if the subsystem is called after memory is initialized
> > > > (which is the case for most of the libraries)
> > >
> > > The list of RTE_INIT functions are called before main(). It is not
> very useful.
> > >
> > > Yes, it would be good to replace (or supplement) RTE_INIT_PRIO by
> something similar, which calls the list of "INIT" functions at the
> appropriate time during EAL initialization.
> > >
> > > DPDK should then use this "INIT" list for all its initialization,
> so the init function of new features (such as this, and trace) can be
> inserted at the correct location in the list.
> > >
> > > > Trace library had a similar situation. It is managed like [2]
> > >
> > > Yes, if we insist on using rte_malloc() for lcore variables, the
> alternative is to prohibit establishing lcore variables in functions
> called through RTE_INIT.
> >
> > I was not insisting on using ONLY rte_malloc(). Since rte_malloc() can
> > be called before rte_eal_init() (it will return NULL), the alloc routine
> > can first check whether rte_malloc() is available; if not, switch over
> > to glibc.
> 
> 
> @Mattias Rönnblom This comment is not addressed in v7. Could you check?

Mattias, following up on Jerin's suggestion:

When allocating an lcore variable, and the buffer holding lcore variables is out of space (or was never allocated), a new buffer is allocated.

Here's the twist I think Jerin is asking for:
You could check if rte_malloc() is available, and use that (instead of the heap) when allocating a new buffer holding lcore variables.
This check can be performed (aggressively) when allocating a new lcore variable, or (conservatively) only when allocating a new buffer.
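
An untested sketch of what that could look like, relying on rte_malloc() returning NULL before the EAL memory subsystem is up (as Jerin pointed out above):

	void *buf = rte_malloc(NULL, LCORE_BUFFER_SIZE, RTE_CACHE_LINE_SIZE);

	if (buf == NULL) /* DPDK heap not yet available; fall back to libc */
		buf = aligned_alloc(RTE_CACHE_LINE_SIZE, LCORE_BUFFER_SIZE);

The cleanup code would then also need to know which allocator each buffer came from, so it can call rte_free() or free() accordingly.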


Now, if using hugepages, the value of RTE_MAX_LCORE_VAR (the maximum size of one lcore variable instance) becomes more important.

Let's consider systems with 2 MB hugepages:

If it supports two lcores (RTE_MAX_LCORE is 2), the current RTE_MAX_LCORE_VAR default of 1 MB is a perfect match; it will use 2 MB of RAM as one 2 MB hugepage.

If it supports 128 lcores, the current RTE_MAX_LCORE_VAR default of 1 MB will use 128 MB of RAM.

If we scale it back, so it only uses one 2 MB hugepage, RTE_MAX_LCORE_VAR will have to be 2 MB / 128 lcores = 16 KB.
16 KB might be too small. E.g., a mempool cache uses 2 * 512 * sizeof(void *) = 8 KB, plus a few bytes for the information about the cache. So I can easily point at one example where 16 KB comes very close to the edge.

So, as you already asked, what is a reasonable default minimum value of RTE_MAX_LCORE_VAR?

Maybe we should just stick with your initial suggestion (1 MB) and see how it goes.


<roadmap>
At the recent DPDK Summit, we discussed memory consumption in one of the workshops.
One of the possible means for reducing memory consumption is making RTE_MAX_LCORE dynamic, so an application using only a few cores will scale its per-lcore tables to the actual number of lcores, instead of scaling to some hardcoded maximum.

With this in mind, I'm less worried about the RTE_MAX_LCORE multiplier.
</roadmap>


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v11 1/7] eal: add static per-lcore memory allocation facility
  2024-10-14  7:43                                                                 ` [PATCH v11 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-14  8:17                                                                   ` Morten Brørup
  2024-10-15  6:41                                                                     ` Mattias Rönnblom
  2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-10-14  8:17 UTC (permalink / raw)
  To: Mattias Rönnblom, dev, thomas
  Cc: hofors, Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Monday, 14 October 2024 09.44


> +struct lcore_var_buffer {
> +	char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
> +	struct lcore_var_buffer *prev;
> +};

In relation to Jerin's request for using hugepages when available, the "data" field should be a pointer to memory allocated either from the libc heap or through rte_malloc(). You would also need to add a flag indicating which it is, so the correct deallocation function can be used to free it on cleanup.

<feature creep>
Here's another (nice to have) idea, which does not need to be part of this series, but can be implemented in a separate patch:
If you move "offset" into this structure, new lcore variables can be allocated from any buffer, instead of only the most recently allocated buffer.
There might even be gains by picking the "optimal" buffer to allocate different size variables from.
</feature creep>
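
Combining the two ideas above, the buffer bookkeeping could look something along these lines (an illustrative sketch only; the field names are placeholders):

	struct lcore_var_buffer {
		char *data; /* RTE_MAX_LCORE_VAR * RTE_MAX_LCORE bytes */
		size_t offset; /* bytes already handed out from this buffer */
		bool hugepage_backed; /* free with rte_free(), not free() */
		struct lcore_var_buffer *prev;
	};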

> +
> +static struct lcore_var_buffer *current_buffer;
> +
> +/* initialized to trigger buffer allocation on first allocation */
> +static size_t offset = RTE_MAX_LCORE_VAR;


> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +{
> +	/* Having the per-lcore buffer size aligned on cache lines,
> +	 * as well as having the base pointer aligned on cache line
> +	 * size, assures that aligned offsets also translate to
> +	 * aligned pointers across all values.
> +	 */
> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
> +	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);

This is a very slow path; please use RTE_VERIFY instead of RTE_ASSERT in this function.


> +/**
> + * Get pointer to lcore variable instance with the specified lcore id.
> + *
> + * @param lcore_id
> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> + *   instances should be accessed. The lcore id need not be valid
> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
> + *   is also not valid (and thus should not be dereferenced).
> + * @param handle
> + *   The lcore variable handle.
> + */
> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))

Please remove the _VALUE suffix.

> +
> +/**
> + * Get pointer to lcore variable instance of the current thread.
> + *
> + * May only be used by EAL threads and registered non-EAL threads.
> + */
> +#define RTE_LCORE_VAR_VALUE(handle) \
> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)

Please remove the _VALUE suffix.

> +
> +/**
> + * Iterate over each lcore id's value for an lcore variable.
> + *
> + * @param lcore_id
> + *   An <code>unsigned int</code> variable successively set to the
> + *   lcore id of every valid lcore id (up to @c RTE_MAX_LCORE).
> + * @param value
> + *   A pointer variable successively set to point to lcore variable
> + *   value instance of the current lcore id being processed.
> + * @param handle
> + *   The lcore variable handle.
> + */
> +#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)	\

Please remove the _VALUE suffix.

> +	for ((lcore_id) =						\
> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
> +	     (lcore_id) < RTE_MAX_LCORE;				\
> +	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-14  6:51                                                               ` Mattias Rönnblom
@ 2024-10-14 15:19                                                                 ` Stephen Hemminger
  2024-10-16  8:05                                                                   ` Thomas Monjalon
  0 siblings, 1 reply; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-14 15:19 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Thomas Monjalon, Mattias Rönnblom, dev, Morten Brørup,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Konstantin Ananyev, Chengwen Feng

On Mon, 14 Oct 2024 08:51:09 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 2024-10-11 10:04, Mattias Rönnblom wrote:
> > On 2024-10-10 23:24, Thomas Monjalon wrote:  
> 
> <snip>
> 
> >>> + *
> >>> + * An lcore variable is not tied to the owning thread's lifetime. It's
> >>> + * available for use by any thread immediately after having been
> >>> + * allocated, and continues to be available throughout the lifetime of
> >>> + * the EAL.
> >>> + *
> >>> + * Lcore variables cannot and need not be freed.  
> >>
> >> I'm curious about that.
> >> If EAL is closed, and the application continues its life,
> >> then we want all this memory to be cleaned as well.
> >> Do you know rte_eal_cleanup()?  
> > 
> > I think the primary reason you would like to free the buffers is to 
> > avoid false positives from tools like valgrind memcheck (if anyone 
> > managed to get that working with DPDK).
> > 
> > rte_eal_cleanup() freeing the buffers and resetting the offset would 
> > make sense. That however would require the buffers to be tracked (e.g., 
> > as a linked list).
> >   
> 
> I had a quick look at this. Cleaning up the lcore var buffers is very 
> straightforward.
> 
> One thing though: the rte_eal_cleanup() documentation says "After this 
> call, no DPDK function calls may be made.". rte_eal_init() is a "DPDK 
> function call". So DPDK/EAL can never be re-initialized, correct?

In practice, calling rte_eal_init() again after cleanup is not tested, and
some of the drivers probably won't work.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v11 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-14  7:43                                                                 ` [PATCH v11 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
@ 2024-10-14 16:30                                                                   ` Stephen Hemminger
  2024-10-15  6:48                                                                     ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-14 16:30 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic, Konstantin Ananyev,
	Chengwen Feng

On Mon, 14 Oct 2024 09:43:48 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> Keep per-lcore power intrinsics state in an lcore variable to reduce
> cache working set size and avoid any CPU next-line-prefetching causing
> false sharing.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> Acked-by: Stephen Hemminger <stephen@networkplumber.org>

This looks like a problem.

-------------------------------BEGIN LOGS----------------------------
####################################################################################
#### [Begin job log] "ubuntu-22.04-clang-asan+doc+tests" at step Build and test
####################################################################################
+ configure_coredump
+ which gdb
+ ulimit -c unlimited
+ sudo sysctl -w kernel.core_pattern=/tmp/dpdk-core.%e.%p
kernel.core_pattern = /tmp/dpdk-core.%e.%p
+ devtools/test-null.sh
=================================================================
==67776==ERROR: AddressSanitizer: invalid alignment requested in aligned_alloc: 64, alignment must be a power of two and the requested size 0x8000008 must be a multiple of alignment (thread T0)
    #0 0x5562b2504042 in aligned_alloc (/home/runner/work/dpdk/dpdk/build/app/dpdk-testpmd+0xaad042) (BuildId: 731d8ec8ca4a6bf8e01bfd7548ebeb784aece6e3)
    #1 0x5562b37f671b in lcore_var_alloc /home/runner/work/dpdk/dpdk/build/../lib/eal/common/eal_common_lcore_var.c:77:20
    #2 0x5562b37f671b in rte_lcore_var_alloc /home/runner/work/dpdk/dpdk/build/../lib/eal/common/eal_common_lcore_var.c:123:9
    #3 0x5562b341b902 in rte_power_ethdev_pmgmt_init /home/runner/work/dpdk/dpdk/build/../lib/power/rte_power_pmd_mgmt.c:775:2
    #4 0x7f76b7829eba in call_init csu/../csu/libc-start.c:145:3
    #5 0x7f76b7829eba in __libc_start_main csu/../csu/libc-start.c:379:5

==67776==HINT: if you don't care about these errors you may set allocator_may_return_null=1
SUMMARY: AddressSanitizer: invalid-aligned-alloc-alignment (/home/runner/work/dpdk/dpdk/build/app/dpdk-testpmd+0xaad042) (BuildId: 731d8ec8ca4a6bf8e01bfd7548ebeb784aece6e3) in aligned_alloc
==67776==ABORTING

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v2 1/6] eal: add static per-lcore memory allocation facility
  2024-10-14  7:56                                         ` Morten Brørup
@ 2024-10-15  6:29                                           ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:29 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, Jerin Jacob, thomas
  Cc: dev, Chengwen Feng, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Anatoly Burakov

On 2024-10-14 09:56, Morten Brørup wrote:
>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>> Sent: Wednesday, 18 September 2024 12.12
>>
>> On Thu, Sep 12, 2024 at 8:52 PM Jerin Jacob <jerinjacobk@gmail.com>
>> wrote:
>>>
>>> On Thu, Sep 12, 2024 at 7:11 PM Morten Brørup
>> <mb@smartsharesystems.com> wrote:
>>>>
>>>>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>>>>> Sent: Thursday, 12 September 2024 15.17
>>>>>
>>>>> On Thu, Sep 12, 2024 at 2:40 PM Morten Brørup
>> <mb@smartsharesystems.com>
>>>>> wrote:
>>>>>>
>>>>>>> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
>>>>>>
>>>>>> Considering hugepages...
>>>>>>
>>>>>> Lcore variables may be allocated before DPDK's memory allocator
>>>>> (rte_malloc()) is ready, so rte_malloc() cannot be used for lcore
>> variables.
>>>>>>
>>>>>> And lcore variables are not usable (shared) for DPDK multi-
>> process, so the
>>>>> lcore_buffer could be allocated through the O/S APIs as anonymous
>> hugepages,
>>>>> instead of using rte_malloc().
>>>>>>
>>>>>> The alternative, using rte_malloc(), would disallow allocating
>> lcore
>>>>> variables before DPDK's memory allocator has been initialized,
>> which I think
>>>>> is too late.
>>>>>
>>>>> I thought it is not. A lot of the subsystems are initialized
>> after the
>>>>> memory subsystem is initialized.
>>>>> [1] example given in documentation. I thought RTE_INIT needs to be
>>>>> replaced if the subsystem is called after memory is initialized
>>>>> (which is the case for most of the libraries)
>>>>
>>>> The list of RTE_INIT functions are called before main(). It is not
>> very useful.
>>>>
>>>> Yes, it would be good to replace (or supplement) RTE_INIT_PRIO by
>> something similar, which calls the list of "INIT" functions at the
>> appropriate time during EAL initialization.
>>>>
>>>> DPDK should then use this "INIT" list for all its initialization,
>> so the init function of new features (such as this, and trace) can be
>> inserted at the correct location in the list.
>>>>
>>>>> Trace library had a similar situation. It is managed like [2]
>>>>
>>>> Yes, if we insist on using rte_malloc() for lcore variables, the
>> alternative is to prohibit establishing lcore variables in functions
>> called through RTE_INIT.
>>>
>>> I was not insisting on using ONLY rte_malloc(). Since rte_malloc() can
>>> be called before rte_eal_init() (it will return NULL), the alloc routine
>>> can first check whether rte_malloc() is available; if not, switch over
>>> to glibc.
>>
>>
>> @Mattias Rönnblom This comment is not addressed in v7. Could you check?
> 
> Mattias, following up on Jerin's suggestion:
> 
> When allocating an lcore variable, and the buffer holding lcore variables is out of space (or was never allocated), a new buffer is allocated.
> 
> Here's the twist I think Jerin is asking for:
> You could check if rte_malloc() is available, and use that (instead of the heap) when allocating a new buffer holding lcore variables.
> This check can be performed (aggressively) when allocating a new lcore variable, or (conservatively) only when allocating a new buffer.
> 
> 
> Now, if using hugepages, the value of RTE_MAX_LCORE_VAR (the maximum size of one lcore variable instance) becomes more important.
> 
> Let's consider systems with 2 MB hugepages:
> 
> If it supports two lcores (RTE_MAX_LCORE is 2), the current RTE_MAX_LCORE_VAR default of 1 MB is a perfect match; it will use 2 MB of RAM as one 2 MB hugepage.
> 
> If it supports 128 lcores, the current RTE_MAX_LCORE_VAR default of 1 MB will use 128 MB of RAM.
> 
> If we scale it back, so it only uses one 2 MB hugepage, RTE_MAX_LCORE_VAR will have to be 2 MB / 128 lcores = 16 KB.
> 16 KB might be too small. E.g. a mempool cache uses 2 * 512 * sizeof(void *) = 8 KB + a few bytes for the information about the cache. So I can easily point at one example where 16 KB is going very close to the edge.
> 
> So, as you already asked, what is a reasonable default minimum value of RTE_MAX_LCORE_VAR?
> 
> Maybe we should just stick with your initial suggestion (1 MB) and see how it goes.
> 

Sure. Let's stick with 1 MB.

I'm guessing that if/when someone takes a closer look at how to do 
per-lcore *dynamic* allocations, this API and its implementation will be 
revisited as well.

> 
> <roadmap>
> At the recent DPDK Summit, we discussed memory consumption in one of the workshops.
> One of the possible means for reducing memory consumption is making RTE_MAX_LCORE dynamic, so an application using only a few cores will scale its per-lcore tables to the actual number of lcores, instead of scaling to some hardcoded maximum.
> 
> With this in mind, I'm less worried about the RTE_MAX_LCORE multiplier.
> </roadmap>
> 

An interesting hack would be to disable huge page usage, set up a swap file 
on a zram device, and then MADV_PAGEOUT the DPDK process after startup.

I wonder how much smaller the DPDK process RSS would be, once it had paged 
back in all the pages that were actually required.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v11 1/7] eal: add static per-lcore memory allocation facility
  2024-10-14  8:17                                                                   ` Morten Brørup
@ 2024-10-15  6:41                                                                     ` Mattias Rönnblom
  2024-10-15  7:10                                                                       ` Mattias Rönnblom
  2024-10-15  8:19                                                                       ` Morten Brørup
  0 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:41 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev, thomas
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

On 2024-10-14 10:17, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Monday, 14 October 2024 09.44
> 
> 
>> +struct lcore_var_buffer {
>> +	char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
>> +	struct lcore_var_buffer *prev;
>> +};
> 
> In relation to Jerin's request for using hugepages when available, the "data" field should be a pointer to the memory allocated from either the heap or through rte_malloc. You would also need to add a flag to indicate which it is, so the correct deallocation function can be used to free it on cleanup.
> 

The typing (glibc heap or DPDK heap) could be in the buffers themselves, no?

> <feature creep>
> Here's another (nice to have) idea, which does not need to be part of this series, but can be implemented in a separate patch:
> If you move "offset" into this structure, new lcore variables can be allocated from any buffer, instead of only the most recently allocated buffer.
> There might even be gains by picking the "optimal" buffer to allocate different size variables from.
> </feature creep>
> 

If the max lcore variable size is much greater than the actual variable 
sizes, the amount of fragmentation (i.e., the space at the end) will be 
very small.

I don't think we should use huge pages for this facility, since they 
don't support demand paging.

The day we have a DPDK heap which supports lcore-affinitized allocations, 
eal_common_lcore_var.c could potentially use that, provided it's 
available (and there is a proper way to check [or get notified] whether 
it is available or not).

>> +
>> +static struct lcore_var_buffer *current_buffer;
>> +
>> +/* initialized to trigger buffer allocation on first allocation */
>> +static size_t offset = RTE_MAX_LCORE_VAR;
> 
> 
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	/* Having the per-lcore buffer size aligned on cache lines,
>> +	 * as well as having the base pointer aligned on cache line
>> +	 * size, assures that aligned offsets also translate to
>> +	 * aligned pointers across all values.
>> +	 */
>> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
>> +	RTE_ASSERT(align <= RTE_CACHE_LINE_SIZE);
>> +	RTE_ASSERT(size <= RTE_MAX_LCORE_VAR);
> 
> This is a very slow path; please use RTE_VERIFY instead of RTE_ASSERT in this function.
> 

Sure. (I think I rejected that before, but now I don't agree with my old 
self.)

> 
>> +/**
>> + * Get pointer to lcore variable instance with the specified lcore id.
>> + *
>> + * @param lcore_id
>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
>> + *   instances should be accessed. The lcore id need not be valid
>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
>> + *   is also not valid (and thus should not be dereferenced).
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
>> +	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
> 
> Please remove the _VALUE suffix.
> 

You changed your mind? I'm missing the rationale here.

>> +
>> +/**
>> + * Get pointer to lcore variable instance of the current thread.
>> + *
>> + * May only be used by EAL threads and registered non-EAL threads.
>> + */
>> +#define RTE_LCORE_VAR_VALUE(handle) \
>> +	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
> 
> Please remove the _VALUE suffix.
> 
>> +
>> +/**
>> + * Iterate over each lcore id's value for an lcore variable.
>> + *
>> + * @param lcore_id
>> + *   An <code>unsigned int</code> variable successively set to the
>> + *   lcore id of every valid lcore id (up to @c RTE_MAX_LCORE).
>> + * @param value
>> + *   A pointer variable successively set to point to lcore variable
>> + *   value instance of the current lcore id being processed.
>> + * @param handle
>> + *   The lcore variable handle.
>> + */
>> +#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)	\
> 
> Please remove the _VALUE suffix.
> 
>> +	for ((lcore_id) =						\
>> +		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
>> +	     (lcore_id) < RTE_MAX_LCORE;				\
>> +	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v11 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-14 16:30                                                                   ` Stephen Hemminger
@ 2024-10-15  6:48                                                                     ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:48 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

On 2024-10-14 18:30, Stephen Hemminger wrote:
> On Mon, 14 Oct 2024 09:43:48 +0200
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
>> Keep per-lcore power intrinsics state in an lcore variable to reduce
>> cache working set size and avoid any CPU next-line-prefetching causing
>> false sharing.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
>> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
>> Acked-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> This looks like a problem.
> 
> -------------------------------BEGIN LOGS----------------------------
> ####################################################################################
> #### [Begin job log] "ubuntu-22.04-clang-asan+doc+tests" at step Build and test
> ####################################################################################
> + configure_coredump
> + which gdb
> + ulimit -c unlimited
> + sudo sysctl -w kernel.core_pattern=/tmp/dpdk-core.%e.%p
> kernel.core_pattern = /tmp/dpdk-core.%e.%p
> + devtools/test-null.sh
> =================================================================
> ==67776==ERROR: AddressSanitizer: invalid alignment requested in aligned_alloc: 64, alignment must be a power of two and the requested size 0x8000008 must be a multiple of alignment (thread T0)
>      #0 0x5562b2504042 in aligned_alloc (/home/runner/work/dpdk/dpdk/build/app/dpdk-testpmd+0xaad042) (BuildId: 731d8ec8ca4a6bf8e01bfd7548ebeb784aece6e3)
>      #1 0x5562b37f671b in lcore_var_alloc /home/runner/work/dpdk/dpdk/build/../lib/eal/common/eal_common_lcore_var.c:77:20
>      #2 0x5562b37f671b in rte_lcore_var_alloc /home/runner/work/dpdk/dpdk/build/../lib/eal/common/eal_common_lcore_var.c:123:9
>      #3 0x5562b341b902 in rte_power_ethdev_pmgmt_init /home/runner/work/dpdk/dpdk/build/../lib/power/rte_power_pmd_mgmt.c:775:2
>      #4 0x7f76b7829eba in call_init csu/../csu/libc-start.c:145:3
>      #5 0x7f76b7829eba in __libc_start_main csu/../csu/libc-start.c:379:5
> 
> ==67776==HINT: if you don't care about these errors you may set allocator_may_return_null=1
> SUMMARY: AddressSanitizer: invalid-aligned-alloc-alignment (/home/runner/work/dpdk/dpdk/build/app/dpdk-testpmd+0xaad042) (BuildId: 731d8ec8ca4a6bf8e01bfd7548ebeb784aece6e3) in aligned_alloc
> ==67776==ABORTING

Yes. Thanks.
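
For the record, the root cause: sizeof(struct lcore_var_buffer) is the 128 MB data array plus the 8-byte prev pointer (the 0x8000008 bytes in the log above), which is not a multiple of the requested 64-byte alignment, something C11 aligned_alloc() requires. The fix in v12 is to round the allocation size up to a multiple of the cache line size:

	size_t alloc_size =
		RTE_ALIGN_CEIL(sizeof(struct lcore_var_buffer),
			       RTE_CACHE_LINE_SIZE);

	current_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE, alloc_size);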



^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v12 0/7] Lcore variables
  2024-10-14  7:43                                                                 ` [PATCH v11 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-14  8:17                                                                   ` Morten Brørup
@ 2024-10-15  6:54                                                                   ` Mattias Rönnblom
  2024-10-15  6:54                                                                     ` [PATCH v12 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                                       ` (6 more replies)
  1 sibling, 7 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:54 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The question on how to best allocate static per-lcore memory has been
up several times on the dev mailing list, for example in the thread on
"random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do however provide a reasonably simple, clean,
and seemingly quite performant solution to a real problem.

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 436 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 257 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 138 ++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 117 ++---
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 391 ++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 20 files changed, 1413 insertions(+), 93 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v12 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
@ 2024-10-15  6:54                                                                     ` Mattias Rönnblom
  2024-10-15  9:33                                                                       ` [PATCH v13 0/7] Lcore variables Mattias Rönnblom
  2024-10-15  6:55                                                                     ` [PATCH v12 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                                                       ` (5 subsequent siblings)
  6 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:54 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU uses linker scripts, which effectively prevents the
reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
structs. The benefit of lcore variables over this approach is that
data related to the same lcore is kept spatially close in memory,
rather than data used by the same module, which in turn avoids
excessive use of padding and keeps caches from being polluted with
unused data.
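
To give a feel for the API, a minimal usage sketch (the struct and
handle names are made up for illustration; the macros are the ones
this patch introduces):

	struct foo_lcore_state {
		uint64_t count;
	};

	RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, foo_states);

	RTE_LCORE_VAR_INIT(foo_states);

	static void
	foo_count(void)
	{
		struct foo_lcore_state *state =
			RTE_LCORE_VAR_VALUE(foo_states);

		state->count++;
	}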

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v12:
 * Replace RTE_ASSERT() with RTE_VERIFY(), since performance is not
   a concern. (Morten Brørup)
 * Fix issue (introduced in v11) where aligned_alloc() was provided
   an object size which wasn't a multiple of the alignment.
   (Stephen Hemminger)

PATCH v11:
 * Add a note in the API docs on lcore variables and huge page memory.
   (Stephen Hemminger)
 * Free lcore var buffers at EAL cleanup. (Thomas Monjalon)
 * Tweak naming and include short lcore var buffer use overview
   in eal_common_lcore_var.c.

PATCH v10:
 * Improve documentation grammar and spelling. (Stephen Hemminger,
   Thomas Monjalon)
 * Add version.map DPDK version comment. (Thomas Monjalon)

PATCH v9:
 * Fixed merge conflicts in release notes.

PATCH v8:
 * Work around missing max_align_t definition in MSVC. (Morten Brørup)

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow an
   arbitrary expression, not just a simple variable name, to be
   passed. (Konstantin Ananyev)

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and were thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is that there no longer exists a fixed
   upper bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represent the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 138 +++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 391 ++++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   3 +
 13 files changed, 608 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 812463fe9f..61e5907fb5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index fd6f8a2f1a..498d509244 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b9fac1839d..b659a1d085 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -429,12 +429,43 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, memory blocks allocated on the DPDK
+heap, and other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library, or a PMD may opt to keep per-thread state.
+
+Per-thread data can be maintained using either *lcore variables* (see
+``rte_lcore_var.h``), *thread-local storage (TLS)* (see
+``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE`` elements,
+indexed by ``rte_lcore_id()``. These methods allow per-lcore data to be
+largely internal to the module and not directly exposed in its
+API. Another approach is to explicitly handle per-thread aspects in
+the API (e.g., the ports in the Eventdev API).
+
+Lcore variables are suitable for small objects that are statically
+allocated at the time of module or application initialization. An
+lcore variable takes on one value for each lcore ID-equipped thread
+(i.e., for both EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of lcore variables is
+independent of the owning threads and can, therefore, be initialized
+before the threads are created.
+
+Variables with thread-local storage are allocated when the thread is
+created and exist until the thread terminates. These are applicable
+for every thread in the process. Only very small objects should be
+allocated in TLS, as large TLS objects can significantly slow down
+thread creation and may unnecessarily increase the memory footprint of
+applications that extensively use unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore ID-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must be both cache-aligned and include an
+``RTE_CACHE_GUARD``. This extensive use of padding causes internal
+fragmentation (i.e., unused space) and reduces cache hit rates.
+
+For more discussions on per-lcore state, refer to the
+``rte_lcore_var.h`` API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 915065a6f9..0e15767d41 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -113,6 +113,20 @@ New Features
 
   * Added independent enqueue feature.
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..09508f9281
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,138 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+#include "eal_lcore_var.h"
+
+/*
+ * An lcore var buffer stores at a minimum one, but usually many,
+ * lcore variables. The value instances for all lcore ids are stored
+ * in the same buffer.
+ *
+ * The address of the value of a particular lcore variable associated
+ * with a particular lcore id is:
+ * buffer->data + offset + lcore_id * RTE_MAX_LCORE_VAR.
+ *
+ * In this way, the values associated with a particular lcore id are
+ * grouped spatially close (in the data array), and no padding is
+ * required to prevent false sharing.
+ *
+ * The (buffer->data + offset) base pointer is what is being returned
+ * to the API user as an opaque handle. The handle is a pointer to the
+ * value for lcore id 0, for that lcore variable.
+ *
+ * The implementation maintains a current lcore var buffer (being
+ * allocated from), and an offset representing the amount of data
+ * already allocated (in bytes) in that buffer.
+ *
+ * The offset is progressively incremented (by the size of the
+ * just-allocated lcore variable), as lcore variables are being
+ * allocated.
+ *
+ * When one lcore var buffer is full, a new one is allocated off the heap.
+ *
+ * The lcore var buffers are arranged in a singly linked list, to allow
+ * freeing them at the point of rte_eal_cleanup(), and thereby avoid
+ * false positives from tools like valgrind memcheck.
+ */
+struct lcore_var_buffer {
+	char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
+	struct lcore_var_buffer *prev;
+};
+
+static struct lcore_var_buffer *current_buffer;
+
+/* initialized to trigger buffer allocation on first allocation */
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		struct lcore_var_buffer *prev = current_buffer;
+		size_t alloc_size =
+			RTE_ALIGN_CEIL(sizeof(struct lcore_var_buffer),
+				       RTE_CACHE_LINE_SIZE);
+#ifdef RTE_EXEC_ENV_WINDOWS
+		current_buffer = _aligned_malloc(alloc_size, RTE_CACHE_LINE_SIZE);
+#else
+		current_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE, alloc_size);
+#endif
+		RTE_VERIFY(current_buffer != NULL);
+
+		current_buffer->prev = prev;
+
+		offset = 0;
+	}
+
+	handle = &current_buffer->data[offset];
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having both the per-lcore buffer size and the buffer base
+	 * pointer aligned on the cache line size assures that aligned
+	 * offsets also translate to aligned pointers across all
+	 * values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_VERIFY(align <= RTE_CACHE_LINE_SIZE);
+	RTE_VERIFY(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+#ifdef RTE_TOOLCHAIN_MSVC
+		/* MSVC <stddef.h> is missing the max_align_t typedef */
+		align = alignof(double);
+#else
+		align = alignof(max_align_t);
+#endif
+
+	RTE_VERIFY(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
+
+void
+eal_lcore_var_cleanup(void)
+{
+	while (current_buffer != NULL) {
+		struct lcore_var_buffer *prev = current_buffer->prev;
+
+		free(current_buffer);
+
+		current_buffer = prev;
+	}
+}
diff --git a/lib/eal/common/eal_lcore_var.h b/lib/eal/common/eal_lcore_var.h
new file mode 100644
index 0000000000..de2c4e44a0
--- /dev/null
+++ b/lib/eal/common/eal_lcore_var.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2024 Ericsson AB.
+ */
+
+#ifndef EAL_LCORE_VAR_H
+#define EAL_LCORE_VAR_H
+
+void
+eal_lcore_var_cleanup(void);
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index 1229230063..796c9dbf2d 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -47,6 +47,7 @@
 
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -941,6 +942,7 @@ rte_eal_cleanup(void)
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_cleanup_config(internal_conf);
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..5af742b5d6
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,414 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) holds a
+ * unique value for each EAL thread and registered non-EAL
+ * thread. There is one instance for each current and future lcore
+ * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
+ * value of the lcore variable for one lcore id is independent from
+ * the values assigned to other lcore ids within the same variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle never
+ * has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * The lifetime of an lcore variable is not tied to the thread that
+ * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
+ * available from the moment the lcore variable is created and
+ * continue to exist throughout the entire lifetime of the EAL,
+ * whether or not the lcore id is currently in use.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable associated with different lcore
+ * ids may be frequently read or written by their respective owners
+ * without risking false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to prevent data races between the owning
+ * thread and any other thread accessing the same value instance.
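+ *
+ * For example, a non-owning thread may read another lcore's value
+ * using an atomic load. A sketch, with @c lcore_counts being a
+ * hypothetical handle defined as
+ * <code>RTE_LCORE_VAR_HANDLE(RTE_ATOMIC(uint64_t), lcore_counts)</code>:
+ *
+ * @code{.c}
+ * uint64_t count = rte_atomic_load_explicit(
+ *         RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_counts),
+ *         rte_memory_order_relaxed);
+ * @endcode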
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * shorthand exists as @ref RTE_LCORE_VAR_VALUE.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may define an lcore variable handle without ever
+ * allocating it.
+ *
+ * The size of an lcore variable's value must not exceed the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variables are stored in a series of lcore buffers, which are
+ * allocated from the libc heap. Heap allocation failures are treated
+ * as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, the thread most recently accessing nearby
+ * data structures should almost always be the lcore variable's
+ * owner. Adding padding will increase the effective memory working
+ * set size, potentially reducing performance.
+ *
+ * Lcore variable values are initialized to zero by default.
+ *
+ * Lcore variables are not stored in huge page memory.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR_VALUE(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to a
+ * whole multiple of the cache line size to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's
+ * and prevent false sharing.
+ *
+ * Lcore variables offer the advantage of working with, rather than
+ * against, the CPU's assumptions. A next-line hardware prefetcher,
+ * for example, may function as intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore variables
+ * are the following (see also the declaration sketch after the list):
+ *
+ *   * The lifecycle of a thread-local variable instance is tied to
+ *     that of the thread. The data cannot be accessed before the
+ *     thread has been created, nor after it has exited. As a result,
+ *     thread-local variables must be initialized in a "lazy" manner
+ *     (e.g., at the point of thread creation). Lcore variables may be
+ *     accessed immediately after having been allocated (which may occur
+ *     before any thread beyond the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and successfully dereferenced by a non-owning
+ *     thread) depends on the specifics of the TLS implementation.
+ *     With GCC __thread and GCC _Thread_local, data sharing between
+ *     threads is supported. In the C11 standard, accessing another
+ *     thread's _Thread_local object is implementation-defined. Lcore
+ *     variable instances may be accessed reliably by any thread.
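+ *
+ * As an illustration (a sketch only, reusing @c foo_lcore_state from
+ * the example above), the TLS counterpart of an lcore variable
+ * declaration would be:
+ *
+ * @code{.c}
+ * // one instance per thread, lazily initialized
+ * static RTE_DEFINE_PER_LCORE(struct foo_lcore_state, tls_states);
+ *
+ * // one instance per lcore id, usable as soon as it is allocated
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ * @endcode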
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * This macro clarifies that the declaration is an lcore handle, not a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR_VALUE(handle) \
+	RTE_LCORE_VAR_LCORE_VALUE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to every
+ *   valid lcore id (i.e., from 0 up to @c RTE_MAX_LCORE - 1).
+ * @param value
+ *   A pointer variable successively set to point to lcore variable
+ *   value instance of the current lcore id being processed.
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, value, handle)		\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE_VALUE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR_VALUE or @ref RTE_LCORE_VAR_LCORE_VALUE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index d742cc98e2..ae4df07bcf 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -45,6 +45,7 @@
 #include <telemetry_internal.h>
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -1387,6 +1388,7 @@ rte_eal_cleanup(void)
 	rte_eal_malloc_heap_cleanup();
 	eal_cleanup_config(internal_conf);
 	rte_eal_log_cleanup();
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..77d3181087 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,9 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	# added in 24.11
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v12 2/7] eal: add lcore variable functional tests
  2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
  2024-10-15  6:54                                                                     ` [PATCH v12 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-15  6:55                                                                     ` Mattias Rönnblom
  2024-10-15  6:55                                                                     ` [PATCH v12 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                                                       ` (4 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:55 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 436 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 437 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..2a1f258548
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR_VALUE(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR_VALUE(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int) =
+			state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR_VALUE(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR_VALUE(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+							   test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE_VALUE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE_VALUE(lcore_id,
+								handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v12 3/7] eal: add lcore variable performance test
  2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
  2024-10-15  6:54                                                                     ` [PATCH v12 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-15  6:55                                                                     ` [PATCH v12 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-10-15  6:55                                                                     ` Mattias Rönnblom
  2024-10-15  6:55                                                                     ` [PATCH v12 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                                       ` (3 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:55 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add a basic micro benchmark for lcore variables, in an attempt to
verify that their overhead isn't significantly greater than that of
alternative approaches, in scenarios where the benefits aren't
expected to show up (i.e., when plenty of cache is available compared
to the working set size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v8:
 * Fix spelling. (Morten Brørup)

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 257 +++++++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..2efb8342d1
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,261 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
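+/*
+ * The state pointer is qualified volatile so the compiler performs
+ * the loads and stores on every call, regardless of how the per-lcore
+ * state is stored.
+ */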
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =
+		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each with a small, per-lcore state. Note however that
+ * these tests have very little non-lcore/thread local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v12 4/7] random: keep PRNG state in lcore variable
  2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
                                                                                       ` (2 preceding siblings ...)
  2024-10-15  6:55                                                                     ` [PATCH v12 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-15  6:55                                                                     ` Mattias Rönnblom
  2024-10-15  6:55                                                                     ` [PATCH v12 5/7] power: keep per-lcore " Mattias Rönnblom
                                                                                       ` (2 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:55 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..a8d00308dd 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR_VALUE(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v12 5/7] power: keep per-lcore state in lcore variable
  2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
                                                                                       ` (3 preceding siblings ...)
  2024-10-15  6:55                                                                     ` [PATCH v12 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-10-15  6:55                                                                     ` Mattias Rönnblom
  2024-10-15  6:55                                                                     ` [PATCH v12 6/7] service: " Mattias Rönnblom
  2024-10-15  6:55                                                                     ` [PATCH v12 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:55 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..a981db4b39 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR_VALUE(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v12 6/7] service: keep per-lcore state in lcore variable
  2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
                                                                                       ` (4 preceding siblings ...)
  2024-10-15  6:55                                                                     ` [PATCH v12 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-10-15  6:55                                                                     ` Mattias Rönnblom
  2024-10-15  6:55                                                                     ` [PATCH v12 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:55 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 117 +++++++++++++++++++----------------
 1 file changed, 65 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index a38c594ce4..3d2c12c39b 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -77,7 +78,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -103,12 +104,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -124,7 +121,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -138,7 +134,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -288,7 +283,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -296,9 +290,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -467,7 +463,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -477,7 +476,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR_VALUE(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -499,8 +498,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_VALUE(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -546,13 +544,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -560,9 +560,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH_VALUE(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -579,7 +582,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -595,7 +599,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -647,30 +651,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -698,13 +703,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -715,14 +721,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate acts as a guard variable. Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -738,17 +746,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -760,7 +770,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -784,7 +794,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -814,6 +824,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -821,12 +833,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -847,7 +858,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -858,7 +869,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -866,7 +877,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -874,7 +885,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -901,7 +912,7 @@ lcore_attr_get_service_error_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -917,7 +928,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -979,12 +993,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1009,7 +1022,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -1020,12 +1034,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1060,7 +1073,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE_VALUE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v12 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
                                                                                       ` (5 preceding siblings ...)
  2024-10-15  6:55                                                                     ` [PATCH v12 6/7] service: " Mattias Rönnblom
@ 2024-10-15  6:55                                                                     ` Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  6:55 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Keep per-lcore power intrinsics state in an lcore variable to reduce
the cache working set size and to avoid false sharing caused by CPU
next-line prefetching.
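
A simplified sketch of the two layouts (placeholder struct and
variable names; see the diff below for the actual code):

#include <stdalign.h>

#include <rte_common.h>
#include <rte_lcore_var.h>

struct wait_state {
	volatile void *monitor_addr;
};

/* Old layout: RTE_MAX_LCORE cache-aligned entries, adjacent in
 * memory. A next-line hardware prefetcher serving the core running
 * lcore N may pull in the cache line owned by lcore N + 1, risking
 * false sharing.
 */
static alignas(RTE_CACHE_LINE_SIZE) struct wait_state
	wait_states_old[RTE_MAX_LCORE];

/* New layout: each lcore id's instance resides in that lcore's own
 * per-lcore buffer region, next to other per-lcore data owned by the
 * same lcore, so neighboring cache lines belong to the same owner.
 */
static RTE_LCORE_VAR_HANDLE(struct wait_state, wait_states);
RTE_LCORE_VAR_INIT(wait_states);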

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..f4ba2c8ecb 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE_VALUE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR_VALUE(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v11 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  6:41                                                                     ` Mattias Rönnblom
@ 2024-10-15  7:10                                                                       ` Mattias Rönnblom
  2024-10-15  7:39                                                                         ` Morten Brørup
  2024-10-16  8:10                                                                         ` Thomas Monjalon
  2024-10-15  8:19                                                                       ` Morten Brørup
  1 sibling, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  7:10 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev, thomas
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

On 2024-10-15 08:41, Mattias Rönnblom wrote:
> On 2024-10-14 10:17, Morten Brørup wrote:

<snip>

>>
>>> +/**
>>> + * Get pointer to lcore variable instance with the specified lcore id.
>>> + *
>>> + * @param lcore_id
>>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
>>> + *   instances should be accessed. The lcore id need not be valid
>>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
>>> + *   is also not valid (and thus should not be dereferenced).
>>> + * @param handle
>>> + *   The lcore variable handle.
>>> + */
>>> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)            \
>>> +    ((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
>>
>> Please remove the _VALUE suffix.
>>
> 
> You changed your mind? I'm missing the rationale here.
> 

I suppose this is a bit of subjective hairsplitting, but does anyone 
else have an opinion?

Short versus somewhat more readable name.

To get "your own" value should be something like

struct foo *lcore_foo = RTE_LCORE_VAR(foo);
versus
struct foo *lcore_foo = RTE_LCORE_VAR_VALUE(foo);

We should also strip "_VALUE" off of the RTE_LCORE_VAR_FOREACH_VALUE() macro 
name in case we change the names of the access macros.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v11 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  7:10                                                                       ` Mattias Rönnblom
@ 2024-10-15  7:39                                                                         ` Morten Brørup
  2024-10-15  9:09                                                                           ` Mattias Rönnblom
  2024-10-16  8:10                                                                         ` Thomas Monjalon
  1 sibling, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-10-15  7:39 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev, thomas
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 15 October 2024 09.11
> 
> On 2024-10-15 08:41, Mattias Rönnblom wrote:
> > On 2024-10-14 10:17, Morten Brørup wrote:
> 
> >> Please remove the _VALUE suffix.
> >>
> >
> > You changed your mind? I'm missing the rationale here.

Yes, I changed my mind.
Please revisit the discussion regarding patch v9.

> 
> I suppose this is a bit of subjective hairsplitting, but does anyone
> else have an opinion?
> 
> Short versus somewhat more readable name.

Thomas also suggested using shorter names, specifically renaming rte_lcore_var_lcore_ptr() to rte_lcore_var().
I would interpret that feedback as a request to shorten the macro names too.

> 
> To get "your own" value should be something like
> 
> struct foo *lcore_foo = RTE_LCORE_VAR(foo);
> versus
> struct foo *lcore_foo = RTE_LCORE_VAR_VALUE(foo);
> 
> We should also strip "_VALUE" off of the RTE_LCORE_FOREACH_VALUE()
> macro
> name in case we change the names of the access macros.

Agree. Remove "_VALUE" everywhere in the API.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v11 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  6:41                                                                     ` Mattias Rönnblom
  2024-10-15  7:10                                                                       ` Mattias Rönnblom
@ 2024-10-15  8:19                                                                       ` Morten Brørup
  1 sibling, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-10-15  8:19 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev, thomas
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 15 October 2024 08.42
> 
> On 2024-10-14 10:17, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Monday, 14 October 2024 09.44
> >
> >
> >> +struct lcore_var_buffer {
> >> +	char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
> >> +	struct lcore_var_buffer *prev;
> >> +};
> >
> > In relation to Jerin's request for using hugepages when available,
> the "data" field should be a pointer to the memory allocated from
> either the heap or through rte_malloc. You would also need to add a
> flag to indicate which it is, so the correct deallocation function can
> be used to free it on cleanup.
> >
> 
> The typing (glibc heap or DPDK heap) could be in the buffers
> themselves, no?

Yes, it would be a flag in the lcore_var_buffer structure.

Also, lcore_var_alloc() would use aligned_alloc() for allocating the "data" memory, not for allocating the lcore_var_buffer structure.
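
A hypothetical sketch of such a flagged buffer (all names invented for
illustration; not part of this series):

#include <stdbool.h>
#include <stdlib.h>

#include <rte_malloc.h>

struct flagged_var_buffer {
	void *data; /* value storage, allocated separately */
	bool from_rte_heap; /* true if allocated with rte_malloc() */
	struct flagged_var_buffer *prev;
};

static void
flagged_buffer_free(struct flagged_var_buffer *buf)
{
	/* pick the deallocator matching the allocator used */
	if (buf->from_rte_heap)
		rte_free(buf->data);
	else
		free(buf->data);

	free(buf);
}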

> 
> > <feature creep>
> > Here's another (nice to have) idea, which does not need to be part of
> this series, but can be implemented in a separate patch:
> > If you move "offset" into this structure, new lcore variables can be
> allocated from any buffer, instead of only the most recently allocated
> buffer.
> > There might even be gains by picking the "optimal" buffer to allocate
> different size variables from.
> > </feature creep>
> >
> 
> If the max lcore variable size is much greater than the actual variable
> sizes, the amount of fragmentation (i.e., the space at the end) will be
> very small.

Agree; the current design uses a very large RTE_MAX_LCORE_VAR, so fragmentation should be negligible.
It may become relevant when using a smaller RTE_MAX_LCORE_VAR and allocating many lcore_var_buffers.
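For example (simple arithmetic, assuming the default RTE_MAX_LCORE_VAR of 1048576 bytes): a new buffer is only allocated once a variable no longer fits, so the space wasted at the end of a buffer is always smaller than that variable's size. If no individual variable exceeds, say, 1 KiB, the per-buffer waste stays below 0.1 %.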


> I don't think we should use huge pages for this facility, since they
> don't support demand paging.

I understand your reasoning; the RAM consumption will be huge.

I am not going to insist.
If someone really needs it, they can provide a separate patch in the future, preferably with a build-time option in rte_config.h to enable/disable it.

> 
> The day we have a DPDK heap which supports lcore-affinitized
> allocations, eal_common_lcore_var.c could potentially use that,
> provided it's available (and there is a proper way to check [or get
> notified] whether it is available or not).


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v11 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  7:39                                                                         ` Morten Brørup
@ 2024-10-15  9:09                                                                           ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  9:09 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev, thomas
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

On 2024-10-15 09:39, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Tuesday, 15 October 2024 09.11
>>
>> On 2024-10-15 08:41, Mattias Rönnblom wrote:
>>> On 2024-10-14 10:17, Morten Brørup wrote:
>>
>>>> Please remove the _VALUE suffix.
>>>>
>>>
>>> You changed your mind? I'm missing the rationale here.
> 
> Yes, I changed my mind.
> Please revisit the discussion regarding patch v9.
> 
>>
>> I suppose this is a bit of subjective hairsplitting, but does anyone
>> else have an opinion?
>>
>> Short versus somewhat more readable name.
> 
> Thomas also suggested using shorter names, specifically renaming rte_lcore_var_lcore_ptr() to rte_lcore_var().
> I would interpret that feedback as a request to shorten the macro names too.
> 

If nobody objects I will change the names of the macros, and the 
above-mentioned function as well.

>>
>> To get "your own" value should be something like
>>
>> struct foo *lcore_foo = RTE_LCORE_VAR(foo);
>> versus
>> struct foo *lcore_foo = RTE_LCORE_VAR_VALUE(foo);
>>
>> We should also strip "_VALUE" off of the RTE_LCORE_VAR_FOREACH_VALUE()
>> macro
>> name in case we change the names of the access macros.
> 
> Agree. Remove "_VALUE" everywhere in the API.
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v13 0/7] Lcore variables
  2024-10-15  6:54                                                                     ` [PATCH v12 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-15  9:33                                                                       ` Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                                           ` (6 more replies)
  0 siblings, 7 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  9:33 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The question of how best to allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In the
author's opinion, they do, however, provide a reasonably simple, clean,
and seemingly highly performant solution to a real problem.
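
A minimal usage sketch, based on the macros introduced in this series
(the struct and function names are invented for illustration):

#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore_var.h>

struct my_lcore_state {
	uint64_t count;
};

static RTE_LCORE_VAR_HANDLE(struct my_lcore_state, my_states);

RTE_INIT(my_module_init)
{
	RTE_LCORE_VAR_ALLOC(my_states);
}

static void
my_module_count_event(void)
{
	/* access the calling lcore's own value instance */
	struct my_lcore_state *state = RTE_LCORE_VAR(my_states);

	state->count++;
}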

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 432 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 256 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 138 ++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 116 ++---
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 391 ++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   3 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 20 files changed, 1407 insertions(+), 93 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  9:33                                                                       ` [PATCH v13 0/7] Lcore variables Mattias Rönnblom
@ 2024-10-15  9:33                                                                         ` Mattias Rönnblom
  2024-10-15 10:13                                                                           ` Morten Brørup
                                                                                             ` (3 more replies)
  2024-10-15  9:33                                                                         ` [PATCH v13 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                                                           ` (5 subsequent siblings)
  6 siblings, 4 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  9:33 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU relies on linker scripts, which effectively prevents
the reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore is now spatially close in memory, rather than data used by the
same module. This in turn avoids the excessive use of padding, which
pollutes caches with unused data.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v13:
 * Remove _VALUE() suffix from value lookup and iterator macros.
   (Morten Brørup and Thomas Monjalon)
 * Remove the _ptr() suffix from the value lookup function.

PATCH v12:
 * Replace RTE_ASSERT() with RTE_VERIFY(), since performance is not
   a concern. (Morten Brørup)
 * Fix issue (introduced in v11) where aligned_malloc() was provided
   an object size which wasn't an even multiple of the alignment.
   (Stephen Hemminger)

PATCH v11:
 * Add a note in the API docs on lcore variables and huge page memory.
   (Stephen Hemminger)
 * Free lcore var buffers at EAL cleanup. (Thomas Monjalon)
 * Tweak naming and include short lcore var buffer use overview
   in eal_common_lcore_var.c.

PATCH v10:
 * Improve documentation grammar and spelling. (Stephen Hemminger,
   Thomas Monjalon)
 * Add version.map DPDK version comment. (Thomas Monjalon)

PATCH v9:
 * Fixed merge conflicts in release notes.

PATCH v8:
 * Work around missing max_align_t definition in MSVC. (Morten Brørup)

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow an arbitrary
   expression, not just a simple variable name, to be passed.
   (Konstantin Ananyev)

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 138 +++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 391 ++++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   3 +
 13 files changed, 608 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 812463fe9f..61e5907fb5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -282,6 +282,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index fd6f8a2f1a..498d509244 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index f9f0300126..ed577f14ee 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b9fac1839d..b659a1d085 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -429,12 +429,43 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, memory blocks allocated on the DPDK
+heap, and other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library, or a PMD may opt to keep per-thread state.
+
+Per-thread data can be maintained using either *lcore variables* (see
+``rte_lcore_var.h``), *thread-local storage (TLS)* (see
+``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE`` elements,
+indexed by ``rte_lcore_id()``. These methods allow per-lcore data to be
+largely internal to the module and not directly exposed in its
+API. Another approach is to explicitly handle per-thread aspects in
+the API (e.g., the ports in the Eventdev API).
+
+Lcore variables are suitable for small objects that are statically
+allocated at the time of module or application initialization. An
+lcore variable takes on one value for each lcore ID-equipped thread
+(i.e., for both EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of lcore variables is
+independent of the owning threads; they can, therefore, be
+initialized before the threads are created.
+
+Variables with thread-local storage are allocated when the thread is
+created and exist until the thread terminates. These are applicable
+for every thread in the process. Only very small objects should be
+allocated in TLS, as large TLS objects can significantly slow down
+thread creation and may unnecessarily increase the memory footprint of
+applications that extensively use unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore ID-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must be both cache-aligned and include an
+``RTE_CACHE_GUARD``. This extensive use of padding causes internal
+fragmentation (i.e., unused space) and reduces cache hit rates.
+
+For more discussions on per-lcore state, refer to the
+``rte_lcore_var.h`` API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 915065a6f9..0e15767d41 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -113,6 +113,20 @@ New Features
 
   * Added independent enqueue feature.
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..f4dd5b1a82
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,138 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+#include "eal_lcore_var.h"
+
+/*
+ * An lcore var buffer stores at a minimum one, but usually many,
+ * lcore variables. The value instances for all lcore ids are stored
+ * in the same buffer.
+ *
+ * The address of the value of a particular lcore variable associated
+ * with a particular lcore id is:
+ * buffer->data + offset + lcore_id * RTE_MAX_LCORE_VAR.
+ *
+ * In this way, the values associated with a particular lcore id are
+ * grouped spatially close (in the data array), and no padding is
+ * required to prevent false sharing.
+ *
+ * The (buffer->data + offset) base pointer is what is being returned
+ * to the API user as an opaque handle. The handle is a pointer to the
+ * value for lcore id 0, for that lcore variable.
+ *
+ * The implementation maintains a current lcore var buffer (being
+ * allocated from), and an offset representing the amount of data
+ * already allocated (in bytes) in that buffer.
+ *
+ * The offset is progressively incremented (by the size of the
+ * just-allocated lcore variable), as lcore variables are being
+ * allocated.
+ *
+ * When one lcore var buffer is full, a new one is allocated off the heap.
+ *
+ * The lcore var buffers are arranged in a singly linked list, to allow
+ * freeing them at the point of rte_eal_cleanup(), and thereby avoid
+ * false positives from tools like valgrind memcheck.
+ */
+struct lcore_var_buffer {
+	char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
+	struct lcore_var_buffer *prev;
+};
+
+static struct lcore_var_buffer *current_buffer;
+
+/* initialized to trigger buffer allocation on first allocation */
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		struct lcore_var_buffer *prev = current_buffer;
+		size_t alloc_size =
+			RTE_ALIGN_CEIL(sizeof(struct lcore_var_buffer),
+				       RTE_CACHE_LINE_SIZE);
+#ifdef RTE_EXEC_ENV_WINDOWS
+		current_buffer = _aligned_malloc(alloc_size, RTE_CACHE_LINE_SIZE);
+#else
+		current_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE, alloc_size);
+
+#endif
+		RTE_VERIFY(current_buffer != NULL);
+
+		current_buffer->prev = prev;
+
+		offset = 0;
+	}
+
+	handle = &current_buffer->data[offset];
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size, as well as the base
+	 * pointer, aligned on cache lines assures that aligned
+	 * offsets also translate to aligned pointers across all
+	 * values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_VERIFY(align <= RTE_CACHE_LINE_SIZE);
+	RTE_VERIFY(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+#ifdef RTE_TOOLCHAIN_MSVC
+		/* MSVC <stddef.h> is missing the max_align_t typedef */
+		align = alignof(double);
+#else
+		align = alignof(max_align_t);
+#endif
+
+	RTE_VERIFY(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
+
+void
+eal_lcore_var_cleanup(void)
+{
+	while (current_buffer != NULL) {
+		struct lcore_var_buffer *prev = current_buffer->prev;
+
+		free(current_buffer);
+
+		current_buffer = prev;
+	}
+}
diff --git a/lib/eal/common/eal_lcore_var.h b/lib/eal/common/eal_lcore_var.h
new file mode 100644
index 0000000000..de2c4e44a0
--- /dev/null
+++ b/lib/eal/common/eal_lcore_var.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2024 Ericsson AB.
+ */
+
+#ifndef EAL_LCORE_VAR_H
+#define EAL_LCORE_VAR_H
+
+void
+eal_lcore_var_cleanup(void);
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 22a626ba6f..d41403680b 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index 1229230063..796c9dbf2d 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -47,6 +47,7 @@
 
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -941,6 +942,7 @@ rte_eal_cleanup(void)
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_cleanup_config(internal_conf);
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..9449253e23 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -27,6 +27,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..032c1cd6e0
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,391 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) holds a
+ * unique value for each EAL thread and registered non-EAL
+ * thread. There is one instance for each current and future lcore
+ * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
+ * value of the lcore variable for one lcore id is independent from
+ * the values assigned to other lcore ids within the same variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for an @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle never
+ * has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * The lifetime of an lcore variable is not tied to the thread that
+ * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
+ * available from the moment the lcore variable is created and
+ * continue to exist throughout the entire lifetime of the EAL,
+ * whether or not the lcore id is currently in use.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable, associated with different lcore
+ * ids, may be frequently read or written by their respective owners
+ * without risking false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to prevent data races between the owning
+ * thread and any other thread accessing the same value instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * shorthand exists as @ref RTE_LCORE_VAR.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may define an lcore variable handle without ever
+ * allocating it.
+ *
+ * The size of an lcore variable's value must not exceed the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variables are stored in a series of lcore buffers, which are
+ * allocated from the libc heap. Heap allocation failures are treated
+ * as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, the thread most recently accessing nearby
+ * data structures should almost always be the lcore
+ * variable's owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values are initialized to zero by default.
+ *
+ * Lcore variables are not stored in huge page memory.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to a
+ * whole multiple of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's
+ * and prevent false sharing.
+ *
+ * Lcore variables offer the advantage of working with, rather than
+ * against, the CPU's assumptions. A next-line hardware prefetcher,
+ * for example, may function as intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
+ * variables are:
+ *
+ *   * The lifecycle of a thread-local variable instance is tied to
+ *     that of the thread. The data cannot be accessed before the
+ *     thread has been created, nor after it has exited. As a result,
+ *     thread-local variables must be initialized in a "lazy" manner
+ *     (e.g., at the point of thread creation). Lcore variables may be
+ *     accessed immediately after having been allocated (which may occur
+ *     before any thread beyond the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or an increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable can be passed
+ *     to and successfully dereferenced by a non-owning thread) depends on
+ *     the specifics of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, data sharing between threads is supported.
+ *     In the C11 standard, accessing another thread's _Thread_local
+ *     object is implementation-defined. Lcore variable instances may
+ *     be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * This macro clarifies that the declaration is an lcore handle, not a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR(handle)				\
+	RTE_LCORE_VAR_LCORE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to the
+ *   lcore id of every valid lcore id (up to @c RTE_MAX_LCORE).
+ * @param value
+ *   A pointer variable successively set to point to lcore variable
+ *   value instance of the current lcore id being processed.
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH(lcore_id, value, handle)			\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable, use
+ * @ref RTE_LCORE_VAR or @ref RTE_LCORE_VAR_LCORE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal
+ *   to or less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index d742cc98e2..ae4df07bcf 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -45,6 +45,7 @@
 #include <telemetry_internal.h>
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -1387,6 +1388,7 @@ rte_eal_cleanup(void)
 	rte_eal_malloc_heap_cleanup();
 	eal_cleanup_config(internal_conf);
 	rte_eal_log_cleanup();
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e3ff412683..77d3181087 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -396,6 +396,9 @@ EXPERIMENTAL {
 
 	# added in 24.03
 	rte_vfio_get_device_info; # WINDOWS_NO_EXPORT
+
+	# added in 24.11
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
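
To make the header's intended usage concrete, here is a minimal
sketch built only from the macros shown above. The module name, the
struct and its fields are illustrative, not part of the patch:

struct foo_lcore_state {
	uint64_t count;
};

static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, foo_state);

RTE_LCORE_VAR_INIT(foo_state);

/* fast path; run from an EAL thread or a registered non-EAL thread */
static void
foo_record_event(void)
{
	struct foo_lcore_state *state = RTE_LCORE_VAR(foo_state);

	state->count++;
}

/* control path; sums the (non-atomically read) per-lcore counters */
static uint64_t
foo_count_total(void)
{
	unsigned int lcore_id;
	struct foo_lcore_state *state;
	uint64_t total = 0;

	RTE_LCORE_VAR_FOREACH(lcore_id, state, foo_state)
		total += state->count;

	return total;
}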

* [PATCH v13 2/7] eal: add lcore variable functional tests
  2024-10-15  9:33                                                                       ` [PATCH v13 0/7] Lcore variables Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-15  9:33                                                                         ` Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                                                           ` (4 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  9:33 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..48279522f0 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -103,6 +103,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..ddf70b03a0
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE(lcore_id, test_int) = state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray = RTE_LCORE_VAR_LCORE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE(lcore_id, handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
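
A point the tests above exercise throughout, restated as a sketch
(mirroring the macro definitions from patch 1/7): the handle is an
opaque, pointer-typed identifier, and actual values are only ever
reached through the access macros:

RTE_LCORE_VAR_HANDLE(int, test_int);	/* expands to: int *test_int; */

/* after RTE_LCORE_VAR_INIT(test_int) or RTE_LCORE_VAR_ALLOC(test_int): */
int *mine = RTE_LCORE_VAR(test_int);		/* calling thread's instance */
int *zero = RTE_LCORE_VAR_LCORE(0, test_int);	/* lcore id 0's instance */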

* [PATCH v13 3/7] eal: add lcore variable performance test
  2024-10-15  9:33                                                                       ` [PATCH v13 0/7] Lcore variables Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-10-15  9:33                                                                         ` Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                                           ` (3 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  9:33 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v8:
 * Fix spelling. (Morten Brørup)

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 256 +++++++++++++++++++++++++++++++++
 2 files changed, 257 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..6d9869f873
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,256 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =	RTE_LCORE_VAR(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each with a small, per-lcore state. Note however that
+ * these tests have very little non-lcore/thread local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
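
One detail of benchmark_access() above worth spelling out: since
num_mods is verified to be a power of two, the per-iteration module
selection reduces to a bitwise AND, keeping a division out of the
measured loop:

	/* with num_mods a power of two, this equals mods[i % num_mods] */
	update_fun(mods[i & num_mods_mask]);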

* [PATCH v13 4/7] random: keep PRNG state in lcore variable
  2024-10-15  9:33                                                                       ` [PATCH v13 0/7] Lcore variables Mattias Rönnblom
                                                                                           ` (2 preceding siblings ...)
  2024-10-15  9:33                                                                         ` [PATCH v13 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-15  9:33                                                                         ` Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 5/7] power: keep per-lcore " Mattias Rönnblom
                                                                                           ` (2 subsequent siblings)
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  9:33 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..cf0756f26a 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
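
A subtlety in the rte_srand() hunk above: when the loop exits,
lcore_id equals RTE_MAX_LCORE, so the trailing call seeds the
unregistered threads' state with seed + RTE_MAX_LCORE, preserving the
old behavior of one distinct seed per state:

	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
		/* seeds the RTE_MAX_LCORE lcore id-indexed states */
	}

	/* here, lcore_id == RTE_MAX_LCORE */
	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);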

* [PATCH v13 5/7] power: keep per-lcore state in lcore variable
  2024-10-15  9:33                                                                       ` [PATCH v13 0/7] Lcore variables Mattias Rönnblom
                                                                                           ` (3 preceding siblings ...)
  2024-10-15  9:33                                                                         ` [PATCH v13 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-10-15  9:33                                                                         ` Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 6/7] service: " Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  9:33 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index b1c18a5f56..2efcab8287 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -518,7 +516,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -619,7 +617,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -769,21 +767,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
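
Note the ordering in the RTE_INIT hunk above: RTE_LCORE_VAR_ALLOC()
must run before the RTE_LCORE_VAR_FOREACH loop, since the iteration
macro derives each instance pointer from the handle's value:

	RTE_LCORE_VAR_ALLOC(lcore_cfgs);

	/* valid only once the handle has been initialized */
	RTE_LCORE_VAR_FOREACH(lcore_id, lcore_cfg, lcore_cfgs)
		TAILQ_INIT(&lcore_cfg->head);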

* [PATCH v13 6/7] service: keep per-lcore state in lcore variable
  2024-10-15  9:33                                                                       ` [PATCH v13 0/7] Lcore variables Mattias Rönnblom
                                                                                           ` (4 preceding siblings ...)
  2024-10-15  9:33                                                                         ` [PATCH v13 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-10-15  9:33                                                                         ` Mattias Rönnblom
  2024-10-15  9:33                                                                         ` [PATCH v13 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  9:33 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 116 +++++++++++++++++++----------------
 1 file changed, 64 insertions(+), 52 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index a38c594ce4..15df1dcc13 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
@@ -77,7 +78,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -103,12 +104,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -124,7 +121,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -138,7 +134,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -288,7 +283,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -296,9 +290,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		lcore_states[i].service_mask &= ~(UINT64_C(1) << id);
+	RTE_LCORE_VAR_FOREACH(lcore_id, cs, lcore_states)
+		cs->service_mask &= ~(UINT64_C(1) << id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -467,7 +463,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (lcore_states[ids[i]].service_active_on_lcore[id])
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(ids[i], lcore_states);
+
+		if (cs->service_active_on_lcore[id])
 			return 1;
 	}
 
@@ -477,7 +476,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -499,8 +498,7 @@ service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint8_t i;
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -546,13 +544,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -560,9 +560,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -579,7 +582,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -595,7 +599,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -647,30 +651,31 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		uint64_t lcore_mapped = cs->service_mask & sid_mask;
 
 		if (*set && !lcore_mapped) {
-			lcore_states[lcore].service_mask |= sid_mask;
+			cs->service_mask |= sid_mask;
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			lcore_states[lcore].service_mask &= ~(sid_mask);
+			cs->service_mask &= ~(sid_mask);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
+		*enabled = !!(cs->service_mask & (sid_mask));
 
 	return 0;
 }
@@ -698,13 +703,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -715,14 +721,15 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all to mask 0 */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			lcore_states[i].service_mask = 0;
+		struct core_state *cs =	RTE_LCORE_VAR_LCORE(i, lcore_states);
+		if (cs->is_service_core) {
+			cs->service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -738,17 +745,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	lcore_states[lcore].service_mask = 0;
+	cs->service_mask = 0;
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -760,7 +769,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -784,7 +793,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -814,6 +823,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -821,12 +832,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 	uint64_t service_mask = cs->service_mask;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -847,7 +857,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -858,7 +868,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -866,7 +876,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -874,7 +884,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -901,7 +911,7 @@ lcore_attr_get_service_error_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -917,7 +927,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -979,12 +992,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1009,7 +1021,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -1020,12 +1033,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1060,7 +1072,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
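
The init/finalize hunks above also change the state's lifetime: lcore
variable allocations last for the remainder of the process (reclaimed
only via eal_lcore_var_cleanup() in rte_eal_cleanup()), so
rte_service_finalize() no longer frees the per-lcore states, and the
NULL check lets repeated init/finalize cycles reuse the original
allocation:

	if (lcore_states == NULL)
		RTE_LCORE_VAR_ALLOC(lcore_states);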

* [PATCH v13 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-15  9:33                                                                       ` [PATCH v13 0/7] Lcore variables Mattias Rönnblom
                                                                                           ` (5 preceding siblings ...)
  2024-10-15  9:33                                                                         ` [PATCH v13 6/7] service: " Mattias Rönnblom
@ 2024-10-15  9:33                                                                         ` Mattias Rönnblom
  6 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15  9:33 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Keep per-lcore power intrinsics state in a lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..98a2cbc611 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
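
Note the use of the RTE_LCORE_VAR_INIT(wait_status) convenience macro
here, rather than an explicit allocation call; per the header in patch
1/7, it expands to a constructor of the form:

	RTE_INIT(rte_lcore_var_init_wait_status)
	{
		RTE_LCORE_VAR_ALLOC(wait_status);
	}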

* RE: [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  9:33                                                                         ` [PATCH v13 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-15 10:13                                                                           ` Morten Brørup
  2024-10-15 19:02                                                                             ` Mattias Rönnblom
  2024-10-15 22:33                                                                           ` Stephen Hemminger
                                                                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 313+ messages in thread
From: Morten Brørup @ 2024-10-15 10:13 UTC (permalink / raw)
  To: Mattias Rönnblom, dev, hofors
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +{
> +	/* Having the per-lcore buffer size aligned on cache lines,
> +	 * as well as having the base pointer aligned on cache line
> +	 * size, assures that aligned offsets also translate to aligned
> +	 * pointers across all values.
> +	 */
> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> +	RTE_VERIFY(align <= RTE_CACHE_LINE_SIZE);
> +	RTE_VERIFY(size <= RTE_MAX_LCORE_VAR);
> +
> +	/* '0' means asking for worst-case alignment requirements */
> +	if (align == 0)
> +#ifdef RTE_TOOLCHAIN_MSVC
> +		/* MSVC <stddef.h> is missing the max_align_t typedef */
> +		align = alignof(double);
> +#else
> +		align = alignof(max_align_t);
> +#endif

Do we need worst-case alignment, or does automatic alignment suffice:

	/* '0' means asking for automatic alignment requirements */
	if (align == 0) {
#ifdef RTE_ARCH_64
		align = rte_align64pow2(size);
#else
		align = rte_align32pow2(size);
#endif
#ifdef RTE_TOOLCHAIN_MSVC
		/* MSVC <stddef.h> is missing the max_align_t typedef */
		align = RTE_MIN(align, alignof(double));
#else
		align = RTE_MIN(align, alignof(max_align_t));
#endif
	}

It will pack small-size lcore variables even tighter.
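
As a worked example (assuming alignof(max_align_t) == 16): three
consecutive 4-byte allocations land at offsets 0, 16 and 32 with
worst-case alignment, but at offsets 0, 4 and 8 with the automatic
scheme, since:

	align = RTE_MIN(rte_align64pow2(4), 16); /* RTE_MIN(4, 16) = 4 */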


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15 10:13                                                                           ` Morten Brørup
@ 2024-10-15 19:02                                                                             ` Mattias Rönnblom
  2024-10-15 20:19                                                                               ` Morten Brørup
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-15 19:02 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

On 2024-10-15 12:13, Morten Brørup wrote:
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align)
>> +{
>> +	/* Having the per-lcore buffer size aligned on cache lines,
>> +	 * as well as having the base pointer aligned on cache line
>> +	 * size, assures that aligned offsets also translate to aligned
>> +	 * pointers across all values.
>> +	 */
>> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
>> +	RTE_VERIFY(align <= RTE_CACHE_LINE_SIZE);
>> +	RTE_VERIFY(size <= RTE_MAX_LCORE_VAR);
>> +
>> +	/* '0' means asking for worst-case alignment requirements */
>> +	if (align == 0)
>> +#ifdef RTE_TOOLCHAIN_MSVC
>> +		/* MSVC <stddef.h> is missing the max_align_t typedef */
>> +		align = alignof(double);
>> +#else
>> +		align = alignof(max_align_t);
>> +#endif
> 
> Do we need worst-case alignment, or does automatic alignment suffice:
> 

I think the term is "natural alignment." As I think I mentioned at some 
point, I don't really have an opinion.

Worst case (max_align_t) alignment is the same as malloc(), so
potentially what the user may expect. On the other hand, I can't see why 
natural alignment (or alignof(max_align_t), whichever is smallest) would 
not always suffice. It is a bit harder to explain in the API docs what 
alignment you actually get in case you don't go for worst-case alignment.

I think it doesn't matter much, because the user will very likely use 
the typed macros (and get whatever alignment the compiler deems 
appropriate for that type).
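
For illustration, a simplified sketch of how such a typed macro can
derive both size and alignment from the handle's value type (the name
is hypothetical, modeled on the patch's RTE_LCORE_VAR_ALLOC; the real
macro may differ):

/* Hypothetical simplification: forward the compiler-determined size
 * and alignment of the handle's value type to the allocator. Relies
 * on the GCC typeof extension and C11 alignof (<stdalign.h>). */
#define LCORE_VAR_ALLOC_SKETCH(handle)					\
	((handle) = rte_lcore_var_alloc(sizeof(*(handle)),		\
					alignof(typeof(*(handle)))))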

> 	/* '0' means asking for automatic alignment requirements */
> 	if (align == 0) {
> #ifdef RTE_ARCH_64
> 		align = rte_align64pow2(size);
> #else
> 		align = rte_align32pow2(size);
> #endif
> #ifdef RTE_TOOLCHAIN_MSVC
> 		/* MSVC <stddef.h> is missing the max_align_t typedef */
> 		align = RTE_MIN(align, alignof(double));
> #else
> 		align = RTE_MIN(align, alignof(max_align_t));
> #endif
> 	}
> 
> It will pack small-size lcore variables even tighter.
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* RE: [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15 19:02                                                                             ` Mattias Rönnblom
@ 2024-10-15 20:19                                                                               ` Morten Brørup
  0 siblings, 0 replies; 313+ messages in thread
From: Morten Brørup @ 2024-10-15 20:19 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 15 October 2024 21.03
> 
> On 2024-10-15 12:13, Morten Brørup wrote:
> >> +void *
> >> +rte_lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +	/* Having the per-lcore buffer size aligned on cache lines,
> >> +	 * as well as having the base pointer aligned on cache line
> >> +	 * size, assures that aligned offsets also translate to
> >> +	 * aligned pointers across all values.
> >> +	 */
> >> +	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
> >> +	RTE_VERIFY(align <= RTE_CACHE_LINE_SIZE);
> >> +	RTE_VERIFY(size <= RTE_MAX_LCORE_VAR);
> >> +
> >> +	/* '0' means asking for worst-case alignment requirements */
> >> +	if (align == 0)
> >> +#ifdef RTE_TOOLCHAIN_MSVC
> >> +		/* MSVC <stddef.h> is missing the max_align_t typedef */
> >> +		align = alignof(double);
> >> +#else
> >> +		align = alignof(max_align_t);
> >> +#endif
> >
> > Do we need worst-case alignment, or does automatic alignment suffice:
> >
> 
> I think the term is "natural alignment." As I think I mentioned at some
> point, I don't really have an opinion.

Exactly; "natural alignment" was the term I was looking for.

> 
> Worst case (max_align_t) alignment is the same as malloc(), so
> potentially what the user may expect.

For this type of variable, which is more like a "static" variable, I don't think the user expects malloc()-like alignment; I think the user expects natural alignment.
And if the user requires any special alignment, the user will specify it explicitly.

> On the other hand, I can't see why
> natural alignment (or alignof(max_align_t), whichever is smallest)
> would
> not always suffice. 

Yes, that was exactly my point.

> It is a bit harder to explain in the API docs what
> alignment you actually get in case you don't go for worst-case
> alignment.

Yeah... using "natural alignment" instead of "worst-case alignment" doesn't really cut it; e.g. if the lcore variable is a struct of two uint16_t, the natural alignment is 2 byte, but it will be 4 byte aligned due to the size.
Maybe "automatic alignment" could be used here... with an explanation that it is the minimum of the size, rounded up to a power of two, or max_align_t.
Anyway, in case of doubt, the developer can look at the implementation - it's one of the benefits of having the source code available. :-)
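
The struct-of-two-uint16_t case is easy to check; a minimal sketch
(standard C11, nothing DPDK-specific):

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

struct pair {
	uint16_t a;
	uint16_t b;
};

int
main(void)
{
	/* Natural alignment follows the widest member (2 bytes),
	 * while sizeof is 4, so a size-based policy would place
	 * the value on a 4-byte boundary. */
	printf("sizeof=%zu alignof=%zu\n",
	       sizeof(struct pair), alignof(struct pair));
	return 0;
}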

> 
> I think it doesn't matter much, because the user will very likely use
> the typed macros (and get whatever alignment the compiler deems
> appropriate for that type).

Probably.
But the function allowing alignment=0 should still behave 1) as expected by its users, and 2) optimally.

I hope this library is going to be a widely used core component in DPDK, and getting all the small details right will improve the probability of success.

> 
> > 	/* '0' means asking for automatic alignment requirements */
> > 	if (align == 0) {
> > #ifdef RTE_ARCH_64
> > 		align = rte_align64pow2(size);
> > #else
> > 		align = rte_align32pow2(size);
> > #endif
> > #ifdef RTE_TOOLCHAIN_MSVC
> > 		/* MSVC <stddef.h> is missing the max_align_t typedef */
> > 		align = RTE_MIN(align, alignof(double));
> > #else
> > 		align = RTE_MIN(align, alignof(max_align_t));
> > #endif
> > 	}
> >
> > It will pack small-size lcore variables even tighter.
> >


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  9:33                                                                         ` [PATCH v13 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-15 10:13                                                                           ` Morten Brørup
@ 2024-10-15 22:33                                                                           ` Stephen Hemminger
  2024-10-16  4:13                                                                             ` Mattias Rönnblom
  2024-10-15 22:35                                                                           ` Stephen Hemminger
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
  3 siblings, 1 reply; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-15 22:33 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic, Konstantin Ananyev,
	Chengwen Feng

On Tue, 15 Oct 2024 11:33:38 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> + * Lcore variables
> + *
> + * This API provides a mechanism to create and access per-lcore id
> + * variables in a space- and cycle-efficient manner.
> + *
> + * A per-lcore id variable (or lcore variable for short) holds a
> + * unique value for each EAL thread and registered non-EAL
> + * thread. There is one instance for each current and future lcore
> + * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
> + * value of the lcore variable for one lcore id is independent from
> + * the values assigned to other lcore ids within the same variable.
> + *
> + * In order to access the values of an lcore variable, a handle is
> + * used. The type of the handle is a pointer to the value's type
> + * (e.g., for an @c uint32_t lcore variable, the handle is a
> + * <code>uint32_t *</code>). The handle type is used to inform the
> + * access macros of the type of the values. A handle may be passed
> + * between modules and threads just like any pointer, but its value
> + * must be treated as an opaque identifier. An allocated handle never
> + * has the value NULL.
> + *
> + * @b Creation
> + *
> + * An lcore variable is created in two steps:
> + *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
> + *  2. Allocate lcore variable storage and initialize the handle with
> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
> + *     of module initialization, but may be done at any time.
> + *
> + * The lifetime of an lcore variable is not tied to the thread that
> + * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
> + * available from the moment the lcore variable is created and
> + * continue to exist throughout the entire lifetime of the EAL,
> + * whether or not the lcore id is currently in use.
> + *
> + * Lcore variables cannot and need not be freed.
> + *
> + * @b Access
> + *
> + * The value of any lcore variable for any lcore id may be accessed
> + * from any thread (including unregistered threads), but it should
> + * only be *frequently* read from or written to by the owner.
> + *
> + * Values of the same lcore variable, associated with different lcore
> + * ids may be frequently read or written by their respective owners
> + * without risking false sharing.
> + *
> + * An appropriate synchronization mechanism (e.g., atomic loads and
> + * stores) should be employed to prevent data races between the owning
> + * thread and any other thread accessing the same value instance.
> + *
> + * The value of the lcore variable for a particular lcore id is
> + * accessed using @ref RTE_LCORE_VAR_LCORE.
> + *
> + * A common pattern is for an EAL thread or a registered non-EAL
> + * thread to access its own lcore variable value. For this purpose, a
> + * shorthand exists as @ref RTE_LCORE_VAR.
> + *
> + * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
> + * pointer with the same type as the value, it may not be directly
> + * dereferenced and must be treated as an opaque identifier.
> + *
> + * Lcore variable handles and value pointers may be freely passed
> + * between different threads.
> + *
> + * @b Storage
> + *
> + * An lcore variable's values may be of a primitive type like @c int,
> + * but would more typically be a @c struct.
> + *
> + * The lcore variable handle introduces a per-variable (not
> + * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
> + * there are some memory footprint gains to be made by organizing all
> + * per-lcore id data for a particular module as one lcore variable
> + * (e.g., as a struct).
> + *
> + * An application may define an lcore variable handle without ever
> + * allocating it.
> + *
> + * The size of an lcore variable's value must be less than the DPDK
> + * build-time constant @c RTE_MAX_LCORE_VAR.
> + *
> + * Lcore variables are stored in a series of lcore buffers, which are
> + * allocated from the libc heap. Heap allocation failures are treated
> + * as fatal.
> + *
> + * Lcore variables should generally *not* be @ref __rte_cache_aligned
> + * and need *not* include a @ref RTE_CACHE_GUARD field, since these
> + * constructs are designed to avoid false sharing. In the
> + * case of an lcore variable instance, the thread most recently
> + * accessing nearby data structures should almost-always be the lcore
> + * variable's owner. Adding padding will increase the effective memory
> + * working set size, potentially reducing performance.
> + *
> + * Lcore variable values are initialized to zero by default.
> + *
> + * Lcore variables are not stored in huge page memory.
> + *
> + * @b Example
> + *
> + * Below is an example of the use of an lcore variable:
> + *
> + * @code{.c}
> + * struct foo_lcore_state {
> + *         int a;
> + *         long b;
> + * };
> + *
> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
> + *
> + * long foo_get_a_plus_b(void)
> + * {
> + *         struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);
> + *
> + *         return state->a + state->b;
> + * }
> + *
> + * RTE_INIT(rte_foo_init)
> + * {
> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
> + *
> + *         unsigned int lcore_id;
> + *         struct foo_lcore_state *state;
> + *         RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
> + *                 (initialize 'state')
> + *         }
> + *
> + *         (other initialization)
> + * }
> + * @endcode
> + *
> + *
> + * @b Alternatives
> + *
> + * Lcore variables are designed to replace a pattern exemplified below:
> + * @code{.c}
> + * struct __rte_cache_aligned foo_lcore_state {
> + *         int a;
> + *         long b;
> + *         RTE_CACHE_GUARD;
> + * };
> + *
> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
> + * @endcode
> + *
> + * This scheme is simple and effective, but has one drawback: the data
> + * is organized so that objects related to all lcores for a particular
> + * module are kept close in memory. At a bare minimum, this requires
> + * sizing data structures (e.g., using `__rte_cache_aligned`) to an
> + * even number of cache lines to avoid false sharing. With CPU
> + * hardware prefetching and memory loads resulting from speculative
> + * execution (functions which seemingly are getting more eager faster
> + * than they are getting more intelligent), one or more "guard" cache
> + * lines may be required to separate one lcore's data from another's
> + * and prevent false sharing.
> + *
> + * Lcore variables offer the advantage of working with, rather than
> + * against, the CPU's assumptions. A next-line hardware prefetcher,
> + * for example, may function as intended (i.e., to the benefit, not
> + * detriment, of system performance).
> + *
> + * Another alternative to @ref rte_lcore_var.h is the @ref
> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
> + * e.g., GCC __thread or C11 _Thread_local). The main differences
> + * between using the various forms of TLS (e.g., @ref
> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
> + * variables are:
> + *
> + *   * The lifecycle of a thread-local variable instance is tied to
> + *     that of the thread. The data cannot be accessed before the
> + *     thread has been created, nor after it has exited. As a result,
> + *     thread-local variables must be initialized in a "lazy" manner
> + *     (e.g., at the point of thread creation). Lcore variables may be
> + *     accessed immediately after having been allocated (which may occur
> + *     before any thread beyond the main thread is running).
> + *   * A thread-local variable is duplicated across all threads in the
> + *     process, including unregistered non-EAL threads (i.e.,
> + *     "regular" threads). For DPDK applications heavily relying on
> + *     multi-threading (in conjunction with DPDK's "one thread per core"
> + *     pattern), either by having many concurrent threads or
> + *     creating/destroying threads at a high rate, an excessive use of
> + *     thread-local variables may cause inefficiencies (e.g.,
> + *     increased thread creation overhead due to thread-local storage
> + *     initialization or increased total RAM footprint usage). Lcore
> + *     variables *only* exist for threads with an lcore id.
> + *   * Whether data in thread-local storage may be shared between threads
> + *     (i.e., whether a pointer to a thread-local variable can be passed to
> + *     and successfully dereferenced by a non-owning thread) depends on
> + *     the specifics of the TLS implementation. With GCC __thread and
> + *     GCC _Thread_local, data sharing between threads is supported.
> + *     In the C11 standard, accessing another thread's _Thread_local
> + *     object is implementation-defined. Lcore variable instances may
> + *     be accessed reliably by any thread.
> + */

For me this comment is too wordy for code and belongs in the documentation instead.
Could also be reduced to more precise, succinct language.



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  9:33                                                                         ` [PATCH v13 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-15 10:13                                                                           ` Morten Brørup
  2024-10-15 22:33                                                                           ` Stephen Hemminger
@ 2024-10-15 22:35                                                                           ` Stephen Hemminger
  2024-10-16  4:23                                                                             ` Mattias Rönnblom
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
  3 siblings, 1 reply; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-15 22:35 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic, Konstantin Ananyev,
	Chengwen Feng

On Tue, 15 Oct 2024 11:33:38 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> +/**
> + * Allocate space in the per-lcore id buffers for an lcore variable.
> + *
> + * The pointer returned is only an opaque identifier of the variable. To
> + * get an actual pointer to a particular instance of the variable use
> + * @ref RTE_LCORE_VAR or @ref RTE_LCORE_VAR_LCORE.
> + *
> + * The lcore variable values' memory is set to zero.
> + *
> + * The allocation is always successful, barring a fatal exhaustion of
> + * the per-lcore id buffer space.
> + *
> + * rte_lcore_var_alloc() is not multi-thread safe.
> + *
> + * @param size
> + *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
> + * @param align
> + *   If 0, the values will be suitably aligned for any kind of type
> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
> + *   on a multiple of *align*, which must be a power of 2 and equal or
> + *   less than @c RTE_CACHE_LINE_SIZE.
> + * @return
> + *   The variable's handle, stored in a void pointer value. The value
> + *   is always non-NULL.
> + */
> +__rte_experimental
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align);

This should have the similar function attributes as rte_malloc now does
where it tells the compiler the size, alignment, and aliasing.

Also there should be mention that there is no free function.

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15 22:33                                                                           ` Stephen Hemminger
@ 2024-10-16  4:13                                                                             ` Mattias Rönnblom
  2024-10-16  8:17                                                                               ` Thomas Monjalon
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16  4:13 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng



On 2024-10-16 00:33, Stephen Hemminger wrote:
> On Tue, 15 Oct 2024 11:33:38 +0200
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
>> + * Lcore variables
>> + *
>> + * This API provides a mechanism to create and access per-lcore id
>> + * variables in a space- and cycle-efficient manner.
>> + *
>> + * A per-lcore id variable (or lcore variable for short) holds a
>> + * unique value for each EAL thread and registered non-EAL
>> + * thread. There is one instance for each current and future lcore
>> + * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
>> + * value of the lcore variable for one lcore id is independent from
>> + * the values assigned to other lcore ids within the same variable.
>> + *
>> + * In order to access the values of an lcore variable, a handle is
>> + * used. The type of the handle is a pointer to the value's type
>> + * (e.g., for an @c uint32_t lcore variable, the handle is a
>> + * <code>uint32_t *</code>). The handle type is used to inform the
>> + * access macros of the type of the values. A handle may be passed
>> + * between modules and threads just like any pointer, but its value
>> + * must be treated as an opaque identifier. An allocated handle never
>> + * has the value NULL.
>> + *
>> + * @b Creation
>> + *
>> + * An lcore variable is created in two steps:
>> + *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
>> + *  2. Allocate lcore variable storage and initialize the handle with
>> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
>> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
>> + *     of module initialization, but may be done at any time.
>> + *
>> + * The lifetime of an lcore variable is not tied to the thread that
>> + * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
>> + * available from the moment the lcore variable is created and
>> + * continue to exist throughout the entire lifetime of the EAL,
>> + * whether or not the lcore id is currently in use.
>> + *
>> + * Lcore variables cannot and need not be freed.
>> + *
>> + * @b Access
>> + *
>> + * The value of any lcore variable for any lcore id may be accessed
>> + * from any thread (including unregistered threads), but it should
>> + * only be *frequently* read from or written to by the owner.
>> + *
>> + * Values of the same lcore variable, associated with different lcore
>> + * ids may be frequently read or written by their respective owners
>> + * without risking false sharing.
>> + *
>> + * An appropriate synchronization mechanism (e.g., atomic loads and
>> + * stores) should be employed to prevent data races between the owning
>> + * thread and any other thread accessing the same value instance.
>> + *
>> + * The value of the lcore variable for a particular lcore id is
>> + * accessed using @ref RTE_LCORE_VAR_LCORE.
>> + *
>> + * A common pattern is for an EAL thread or a registered non-EAL
>> + * thread to access its own lcore variable value. For this purpose, a
>> + * shorthand exists as @ref RTE_LCORE_VAR.
>> + *
>> + * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
>> + * pointer with the same type as the value, it may not be directly
>> + * dereferenced and must be treated as an opaque identifier.
>> + *
>> + * Lcore variable handles and value pointers may be freely passed
>> + * between different threads.
>> + *
>> + * @b Storage
>> + *
>> + * An lcore variable's values may be of a primitive type like @c int,
>> + * but would more typically be a @c struct.
>> + *
>> + * The lcore variable handle introduces a per-variable (not
>> + * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
>> + * there are some memory footprint gains to be made by organizing all
>> + * per-lcore id data for a particular module as one lcore variable
>> + * (e.g., as a struct).
>> + *
>> + * An application may define an lcore variable handle without ever
>> + * allocating it.
>> + *
>> + * The size of an lcore variable's value must be less than the DPDK
>> + * build-time constant @c RTE_MAX_LCORE_VAR.
>> + *
>> + * Lcore variables are stored in a series of lcore buffers, which are
>> + * allocated from the libc heap. Heap allocation failures are treated
>> + * as fatal.
>> + *
>> + * Lcore variables should generally *not* be @ref __rte_cache_aligned
>> + * and need *not* include a @ref RTE_CACHE_GUARD field, since these
>> + * constructs are designed to avoid false sharing. In the
>> + * case of an lcore variable instance, the thread most recently
>> + * accessing nearby data structures should almost-always be the lcore
>> + * variable's owner. Adding padding will increase the effective memory
>> + * working set size, potentially reducing performance.
>> + *
>> + * Lcore variable values are initialized to zero by default.
>> + *
>> + * Lcore variables are not stored in huge page memory.
>> + *
>> + * @b Example
>> + *
>> + * Below is an example of the use of an lcore variable:
>> + *
>> + * @code{.c}
>> + * struct foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + * };
>> + *
>> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
>> + *
>> + * long foo_get_a_plus_b(void)
>> + * {
>> + *         struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);
>> + *
>> + *         return state->a + state->b;
>> + * }
>> + *
>> + * RTE_INIT(rte_foo_init)
>> + * {
>> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
>> + *
>> + *         unsigned int lcore_id;
>> + *         struct foo_lcore_state *state;
>> + *         RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
>> + *                 (initialize 'state')
>> + *         }
>> + *
>> + *         (other initialization)
>> + * }
>> + * @endcode
>> + *
>> + *
>> + * @b Alternatives
>> + *
>> + * Lcore variables are designed to replace a pattern exemplified below:
>> + * @code{.c}
>> + * struct __rte_cache_aligned foo_lcore_state {
>> + *         int a;
>> + *         long b;
>> + *         RTE_CACHE_GUARD;
>> + * };
>> + *
>> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
>> + * @endcode
>> + *
>> + * This scheme is simple and effective, but has one drawback: the data
>> + * is organized so that objects related to all lcores for a particular
>> + * module are kept close in memory. At a bare minimum, this requires
>> + * sizing data structures (e.g., using `__rte_cache_aligned`) to an
>> + * even number of cache lines to avoid false sharing. With CPU
>> + * hardware prefetching and memory loads resulting from speculative
>> + * execution (functions which seemingly are getting more eager faster
>> + * than they are getting more intelligent), one or more "guard" cache
>> + * lines may be required to separate one lcore's data from another's
>> + * and prevent false sharing.
>> + *
>> + * Lcore variables offer the advantage of working with, rather than
>> + * against, the CPU's assumptions. A next-line hardware prefetcher,
>> + * for example, may function as intended (i.e., to the benefit, not
>> + * detriment, of system performance).
>> + *
>> + * Another alternative to @ref rte_lcore_var.h is the @ref
>> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
>> + * e.g., GCC __thread or C11 _Thread_local). The main differences
>> + * between using the various forms of TLS (e.g., @ref
>> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
>> + * variables are:
>> + *
>> + *   * The lifecycle of a thread-local variable instance is tied to
>> + *     that of the thread. The data cannot be accessed before the
>> + *     thread has been created, nor after it has exited. As a result,
>> + *     thread-local variables must be initialized in a "lazy" manner
>> + *     (e.g., at the point of thread creation). Lcore variables may be
>> + *     accessed immediately after having been allocated (which may occur
>> + *     before any thread beyond the main thread is running).
>> + *   * A thread-local variable is duplicated across all threads in the
>> + *     process, including unregistered non-EAL threads (i.e.,
>> + *     "regular" threads). For DPDK applications heavily relying on
>> + *     multi-threading (in conjunction with DPDK's "one thread per core"
>> + *     pattern), either by having many concurrent threads or
>> + *     creating/destroying threads at a high rate, an excessive use of
>> + *     thread-local variables may cause inefficiencies (e.g.,
>> + *     increased thread creation overhead due to thread-local storage
>> + *     initialization or increased total RAM footprint usage). Lcore
>> + *     variables *only* exist for threads with an lcore id.
>> + *   * Whether data in thread-local storage may be shared between threads
>> + *     (i.e., whether a pointer to a thread-local variable can be passed to
>> + *     and successfully dereferenced by a non-owning thread) depends on
>> + *     the specifics of the TLS implementation. With GCC __thread and
>> + *     GCC _Thread_local, data sharing between threads is supported.
>> + *     In the C11 standard, accessing another thread's _Thread_local
>> + *     object is implementation-defined. Lcore variable instances may
>> + *     be accessed reliably by any thread.
>> + */
> 
> For me this comment is too wordy for code and belongs in the documentation instead.
> Could also be reduced to more precise, succinct language.
> 
> 

Provided this makes it into RC1, I can move most of this and some of the 
information in eal_common_lcore_var.c comments into "the documentation" 
as a RC2 patch.

If "the documentation" is a the EAL programmer's guide, a description of 
lcore variables (with pictures!) in sufficient detail (both API and 
implementation) would make up a large fraction of it. That would look 
silly and in the way of more important things. Lcore variables is just a 
tiny bit of infrastructure. Other, more central EAL features, like the 
RTE spinlock, they have no mention at all in the EAL docs.

Another option I suppose is to document it separately from the
"main" EAL programmer's guide, but - correct me if I'm wrong here -
there seems to be no precedent for doing this.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15 22:35                                                                           ` Stephen Hemminger
@ 2024-10-16  4:23                                                                             ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16  4:23 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

On 2024-10-16 00:35, Stephen Hemminger wrote:
> On Tue, 15 Oct 2024 11:33:38 +0200
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
>> +/**
>> + * Allocate space in the per-lcore id buffers for an lcore variable.
>> + *
>> + * The pointer returned is only an opaque identifier of the variable. To
>> + * get an actual pointer to a particular instance of the variable use
>> + * @ref RTE_LCORE_VAR or @ref RTE_LCORE_VAR_LCORE.
>> + *
>> + * The lcore variable values' memory is set to zero.
>> + *
>> + * The allocation is always successful, barring a fatal exhaustion of
>> + * the per-lcore id buffer space.
>> + *
>> + * rte_lcore_var_alloc() is not multi-thread safe.
>> + *
>> + * @param size
>> + *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
>> + * @param align
>> + *   If 0, the values will be suitably aligned for any kind of type
>> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
>> + *   on a multiple of *align*, which must be a power of 2 and equal or
>> + *   less than @c RTE_CACHE_LINE_SIZE.
>> + * @return
>> + *   The variable's handle, stored in a void pointer value. The value
>> + *   is always non-NULL.
>> + */
>> +__rte_experimental
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align);
> 
> This should have the similar function attributes as rte_malloc now does
> where it tells the compiler the size, alignment, and aliasing.
> 
> Also there should be mention that there is no free function.

OK, both fixed. Thanks.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v9 1/7] eal: add static per-lcore memory allocation facility
  2024-10-14 15:19                                                                 ` Stephen Hemminger
@ 2024-10-16  8:05                                                                   ` Thomas Monjalon
  0 siblings, 0 replies; 313+ messages in thread
From: Thomas Monjalon @ 2024-10-16  8:05 UTC (permalink / raw)
  To: Mattias Rönnblom, Morten Brørup
  Cc: dev, Mattias Rönnblom, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng,
	Stephen Hemminger

14/10/2024 17:19, Stephen Hemminger:
> On Mon, 14 Oct 2024 08:51:09 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
> > On 2024-10-11 10:04, Mattias Rönnblom wrote:
> > > On 2024-10-10 23:24, Thomas Monjalon wrote:  
> > 
> > <snip>
> > 
> > >>> + *
> > >>> + * An lcore variable is not tied to the owning thread's lifetime. It's
> > >>> + * available for use by any thread immediately after having been
> > >>> + * allocated, and continues to be available throughout the lifetime of
> > >>> + * the EAL.
> > >>> + *
> > >>> + * Lcore variables cannot and need not be freed.  
> > >>
> > >> I'm curious about that.
> > >> If EAL is closed, and the application continues its life,
> > >> then we want all this memory to be cleaned as well.
> > >> Do you know rte_eal_cleanup()?  
> > > 
> > > I think the primary reason you would like to free the buffers is to 
> > > avoid false positives from tools like valgrind memcheck (if anyone 
> > > managed to get that working with DPDK).
> > > 
> > > rte_eal_cleanup() freeing the buffers and resetting the offset would 
> > > make sense. That however would require the buffers to be tracked (e.g., 
> > > as a linked list).
> > >   
> > 
> > I had a quick look at this. Cleaning up the lcore var buffers is very 
> > straightforward.
> > 
> > One thing though: the rte_eal_cleanup() documentation says "After this 
> > call, no DPDK function calls may be made.". rte_eal_init() is a "DPDK 
> > function call". So DPDK/EAL can never be re-initialized, correct?
> 
> In practice, calling rte_eal_init() is not tested, and some of the drivers
> probably won't work.

Yes, it is not tested; I have no idea whether restarting DPDK works or not.
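
As a sketch of the buffer tracking mentioned above (hypothetical names,
libc heap only; not the actual implementation), a cleanup hook could
look like:

#include <stdlib.h>

/* Hypothetical: each lcore buffer is recorded in a singly linked
 * list at allocation time so rte_eal_cleanup() could free them. */
struct lcore_var_buffer {
	void *data;
	struct lcore_var_buffer *next;
};

static struct lcore_var_buffer *buffers;

static void *
alloc_lcore_buffer(size_t size)
{
	struct lcore_var_buffer *buf = malloc(sizeof(*buf));

	if (buf == NULL)
		abort(); /* heap allocation failures are fatal */
	buf->data = calloc(1, size); /* values start zeroed */
	if (buf->data == NULL)
		abort();
	buf->next = buffers;
	buffers = buf;
	return buf->data;
}

static void
free_lcore_buffers(void) /* would be called from rte_eal_cleanup() */
{
	while (buffers != NULL) {
		struct lcore_var_buffer *next = buffers->next;

		free(buffers->data);
		free(buffers);
		buffers = next;
	}
}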



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v10 0/7] Lcore variables
  2024-10-13  7:02                                                               ` Mattias Rönnblom
@ 2024-10-16  8:07                                                                 ` Thomas Monjalon
  0 siblings, 0 replies; 313+ messages in thread
From: Thomas Monjalon @ 2024-10-16  8:07 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom, dev
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Mattias Rönnblom

13/10/2024 09:02, Mattias Rönnblom:
> On 2024-10-11 16:25, Stephen Hemminger wrote:
> > On Fri, 11 Oct 2024 10:18:54 +0200
> > Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> > Are there any trace points in this code? Would be good to have.
> 
> No. Yes, for sure.
> 
> > Also some optional statistics for telemetry use.
> 
> I agree. It could potentially expose some of the internals of the 
> implementation, subject to change, but that is a risk that we can take.
> 
> Who does the above two and when? Is this something that is required 
> before 24.11 (assuming this feature will make it)?

I don't see it as a strong requirement.
It could be added later.

> 
> > I would presume this is not intended as a hot path API; therefore
> > it would be ok to always keep statistics.
> 
> The allocation functions are expected to be used only in the slowest of 
> the slow paths.




^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v11 1/7] eal: add static per-lcore memory allocation facility
  2024-10-15  7:10                                                                       ` Mattias Rönnblom
  2024-10-15  7:39                                                                         ` Morten Brørup
@ 2024-10-16  8:10                                                                         ` Thomas Monjalon
  1 sibling, 0 replies; 313+ messages in thread
From: Thomas Monjalon @ 2024-10-16  8:10 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: Stephen Hemminger, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng,
	Mattias Rönnblom

15/10/2024 09:10, Mattias Rönnblom:
> On 2024-10-15 08:41, Mattias Rönnblom wrote:
> > On 2024-10-14 10:17, Morten Brørup wrote:
> 
> <snip>
> 
> >>
> >>> +/**
> >>> + * Get pointer to lcore variable instance with the specified lcore id.
> >>> + *
> >>> + * @param lcore_id
> >>> + *   The lcore id specifying which of the @c RTE_MAX_LCORE value
> >>> + *   instances should be accessed. The lcore id need not be valid
> >>> + *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
> >>> + *   is also not valid (and thus should not be dereferenced).
> >>> + * @param handle
> >>> + *   The lcore variable handle.
> >>> + */
> >>> +#define RTE_LCORE_VAR_LCORE_VALUE(lcore_id, handle)            \
> >>> +    ((typeof(handle))rte_lcore_var_lcore_ptr(lcore_id, handle))
> >>
> >> Please remove the _VALUE suffix.
> >>
> > 
> > You changed your mind? I'm missing the rationale here.
> > 
> 
> I suppose this is a bit of subjective hairsplitting, but does anyone
> else have an opinion?
> 
> Short versus somewhat more readable name.
> 
> To get "your own" value should be something like
> 
> struct foo *lcore_foo = RTE_LCORE_VAR(foo);
> versus
> struct foo *lcore_foo = RTE_LCORE_VAR_VALUE(foo);
> 
> We should also strip "_VALUE" off of the RTE_LCORE_FOREACH_VALUE() macro 
> name in case we change the names of the access macros.

I feel "_VALUE" is too much. I prefer the shorter version.



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-16  4:13                                                                             ` Mattias Rönnblom
@ 2024-10-16  8:17                                                                               ` Thomas Monjalon
  2024-10-16 12:47                                                                                 ` Mattias Rönnblom
  0 siblings, 1 reply; 313+ messages in thread
From: Thomas Monjalon @ 2024-10-16  8:17 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom, dev
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng,
	Mattias Rönnblom

16/10/2024 06:13, Mattias Rönnblom:
> 
> On 2024-10-16 00:33, Stephen Hemminger wrote:
> > On Tue, 15 Oct 2024 11:33:38 +0200
> > Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> > 
> >> + * Lcore variables
> >> + *
> >> + * This API provides a mechanism to create and access per-lcore id
> >> + * variables in a space- and cycle-efficient manner.
> >> + *
> >> + * A per-lcore id variable (or lcore variable for short) holds a
> >> + * unique value for each EAL thread and registered non-EAL
> >> + * thread. There is one instance for each current and future lcore
> >> + * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
> >> + * value of the lcore variable for one lcore id is independent from
> >> + * the values assigned to other lcore ids within the same variable.
> >> + *
> >> + * In order to access the values of an lcore variable, a handle is
> >> + * used. The type of the handle is a pointer to the value's type
> >> + * (e.g., for an @c uint32_t lcore variable, the handle is a
> >> + * <code>uint32_t *</code>). The handle type is used to inform the
> >> + * access macros of the type of the values. A handle may be passed
> >> + * between modules and threads just like any pointer, but its value
> >> + * must be treated as an opaque identifier. An allocated handle never
> >> + * has the value NULL.
> >> + *
> >> + * @b Creation
> >> + *
> >> + * An lcore variable is created in two steps:
> >> + *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
> >> + *  2. Allocate lcore variable storage and initialize the handle with
> >> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
> >> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
> >> + *     of module initialization, but may be done at any time.
> >> + *
> >> + * The lifetime of an lcore variable is not tied to the thread that
> >> + * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
> >> + * available from the moment the lcore variable is created and
> >> + * continue to exist throughout the entire lifetime of the EAL,
> >> + * whether or not the lcore id is currently in use.
> >> + *
> >> + * Lcore variables cannot and need not be freed.
> >> + *
> >> + * @b Access
> >> + *
> >> + * The value of any lcore variable for any lcore id may be accessed
> >> + * from any thread (including unregistered threads), but it should
> >> + * only be *frequently* read from or written to by the owner.
> >> + *
> >> + * Values of the same lcore variable, associated with different lcore
> >> + * ids may be frequently read or written by their respective owners
> >> + * without risking false sharing.
> >> + *
> >> + * An appropriate synchronization mechanism (e.g., atomic loads and
> >> + * stores) should be employed to prevent data races between the owning
> >> + * thread and any other thread accessing the same value instance.
> >> + *
> >> + * The value of the lcore variable for a particular lcore id is
> >> + * accessed using @ref RTE_LCORE_VAR_LCORE.
> >> + *
> >> + * A common pattern is for an EAL thread or a registered non-EAL
> >> + * thread to access its own lcore variable value. For this purpose, a
> >> + * shorthand exists as @ref RTE_LCORE_VAR.
> >> + *
> >> + * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
> >> + * pointer with the same type as the value, it may not be directly
> >> + * dereferenced and must be treated as an opaque identifier.
> >> + *
> >> + * Lcore variable handles and value pointers may be freely passed
> >> + * between different threads.
> >> + *
> >> + * @b Storage
> >> + *
> >> + * An lcore variable's values may be of a primitive type like @c int,
> >> + * but would more typically be a @c struct.
> >> + *
> >> + * The lcore variable handle introduces a per-variable (not
> >> + * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
> >> + * there are some memory footprint gains to be made by organizing all
> >> + * per-lcore id data for a particular module as one lcore variable
> >> + * (e.g., as a struct).
> >> + *
> >> + * An application may define an lcore variable handle without ever
> >> + * allocating it.
> >> + *
> >> + * The size of an lcore variable's value must be less than the DPDK
> >> + * build-time constant @c RTE_MAX_LCORE_VAR.
> >> + *
> >> + * Lcore variables are stored in a series of lcore buffers, which are
> >> + * allocated from the libc heap. Heap allocation failures are treated
> >> + * as fatal.
> >> + *
> >> + * Lcore variables should generally *not* be @ref __rte_cache_aligned
> >> + * and need *not* include a @ref RTE_CACHE_GUARD field, since these
> >> + * constructs are designed to avoid false sharing. In the
> >> + * case of an lcore variable instance, the thread most recently
> >> + * accessing nearby data structures should almost-always be the lcore
> >> + * variable's owner. Adding padding will increase the effective memory
> >> + * working set size, potentially reducing performance.
> >> + *
> >> + * Lcore variable values are initialized to zero by default.
> >> + *
> >> + * Lcore variables are not stored in huge page memory.
> >> + *
> >> + * @b Example
> >> + *
> >> + * Below is an example of the use of an lcore variable:
> >> + *
> >> + * @code{.c}
> >> + * struct foo_lcore_state {
> >> + *         int a;
> >> + *         long b;
> >> + * };
> >> + *
> >> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
> >> + *
> >> + * long foo_get_a_plus_b(void)
> >> + * {
> >> + *         struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);
> >> + *
> >> + *         return state->a + state->b;
> >> + * }
> >> + *
> >> + * RTE_INIT(rte_foo_init)
> >> + * {
> >> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
> >> + *
> >> + *         unsigned int lcore_id;
> >> + *         struct foo_lcore_state *state;
> >> + *         RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
> >> + *                 (initialize 'state')
> >> + *         }
> >> + *
> >> + *         (other initialization)
> >> + * }
> >> + * @endcode
> >> + *
> >> + *
> >> + * @b Alternatives
> >> + *
> >> + * Lcore variables are designed to replace a pattern exemplified below:
> >> + * @code{.c}
> >> + * struct __rte_cache_aligned foo_lcore_state {
> >> + *         int a;
> >> + *         long b;
> >> + *         RTE_CACHE_GUARD;
> >> + * };
> >> + *
> >> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
> >> + * @endcode
> >> + *
> >> + * This scheme is simple and effective, but has one drawback: the data
> >> + * is organized so that objects related to all lcores for a particular
> >> + * module are kept close in memory. At a bare minimum, this requires
> >> + * sizing data structures (e.g., using `__rte_cache_aligned`) to an
> >> + * even number of cache lines to avoid false sharing. With CPU
> >> + * hardware prefetching and memory loads resulting from speculative
> >> + * execution (functions which seemingly are getting more eager faster
> >> + * than they are getting more intelligent), one or more "guard" cache
> >> + * lines may be required to separate one lcore's data from another's
> >> + * and prevent false sharing.
> >> + *
> >> + * Lcore variables offer the advantage of working with, rather than
> >> + * against, the CPU's assumptions. A next-line hardware prefetcher,
> >> + * for example, may function as intended (i.e., to the benefit, not
> >> + * detriment, of system performance).
> >> + *
> >> + * Another alternative to @ref rte_lcore_var.h is the @ref
> >> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
> >> + * e.g., GCC __thread or C11 _Thread_local). The main differences
> >> + * between using the various forms of TLS (e.g., @ref
> >> + * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
> >> + * variables are:
> >> + *
> >> + *   * The lifecycle of a thread-local variable instance is tied to
> >> + *     that of the thread. The data cannot be accessed before the
> >> + *     thread has been created, nor after it has exited. As a result,
> >> + *     thread-local variables must be initialized in a "lazy" manner
> >> + *     (e.g., at the point of thread creation). Lcore variables may be
> >> + *     accessed immediately after having been allocated (which may occur
> >> + *     before any thread beyond the main thread is running).
> >> + *   * A thread-local variable is duplicated across all threads in the
> >> + *     process, including unregistered non-EAL threads (i.e.,
> >> + *     "regular" threads). For DPDK applications heavily relying on
> >> + *     multi-threading (in conjunction with DPDK's "one thread per core"
> >> + *     pattern), either by having many concurrent threads or
> >> + *     creating/destroying threads at a high rate, an excessive use of
> >> + *     thread-local variables may cause inefficiencies (e.g.,
> >> + *     increased thread creation overhead due to thread-local storage
> >> + *     initialization or increased total RAM footprint usage). Lcore
> >> + *     variables *only* exist for threads with an lcore id.
> >> + *   * Whether data in thread-local storage may be shared between threads
> >> + *     (i.e., whether a pointer to a thread-local variable can be passed to
> >> + *     and successfully dereferenced by a non-owning thread) depends on
> >> + *     the specifics of the TLS implementation. With GCC __thread and
> >> + *     GCC _Thread_local, data sharing between threads is supported.
> >> + *     In the C11 standard, accessing another thread's _Thread_local
> >> + *     object is implementation-defined. Lcore variable instances may
> >> + *     be accessed reliably by any thread.
> >> + */
> > 
> > For me this comment is too wordy for code and belongs in the documentation instead.
> > Could also be reduced to more precise, succinct language.

I agree, this is what I was asking for.


> Provided this makes it into RC1, I can move most of this and some of the 
> information in eal_common_lcore_var.c comments into "the documentation" 
> as a RC2 patch.
> 
> If "the documentation" is a the EAL programmer's guide, a description of 
> lcore variables (with pictures!) in sufficient detail (both API and 
> implementation) would make up a large fraction of it. That would look 
> silly and in the way of more important things. Lcore variables is just a 
> tiny bit of infrastructure. Other, more central EAL features, like the 
> RTE spinlock, they have no mention at all in the EAL docs.

Please don't take what exists and not exists as an absolute model.
We must improve the doc, split it better and fill the gaps.
In the meantime we want new features like this one to be properly documented.


> Another option I suppose is to document it separately from the
> "main" EAL programmer's guide, but - correct me if I'm wrong here -
> there seems to be no precedent for doing this.

For instance, the services cores are a separate chapter of the prog guide.
The lcore variables should be a separate chapter as well.




^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v13 1/7] eal: add static per-lcore memory allocation facility
  2024-10-16  8:17                                                                               ` Thomas Monjalon
@ 2024-10-16 12:47                                                                                 ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16 12:47 UTC (permalink / raw)
  To: Thomas Monjalon, Stephen Hemminger, Mattias Rönnblom, dev
  Cc: Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

On 2024-10-16 10:17, Thomas Monjalon wrote:
> 16/10/2024 06:13, Mattias Rönnblom:
>>
>> On 2024-10-16 00:33, Stephen Hemminger wrote:
>>> On Tue, 15 Oct 2024 11:33:38 +0200
>>> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>>>
>>>> + * Lcore variables
>>>> + *
>>>> + * This API provides a mechanism to create and access per-lcore id
>>>> + * variables in a space- and cycle-efficient manner.
>>>> + *
>>>> + * A per-lcore id variable (or lcore variable for short) holds a
>>>> + * unique value for each EAL thread and registered non-EAL
>>>> + * thread. There is one instance for each current and future lcore
>>>> + * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
>>>> + * value of the lcore variable for one lcore id is independent from
>>>> + * the values assigned to other lcore ids within the same variable.
>>>> + *
>>>> + * In order to access the values of an lcore variable, a handle is
>>>> + * used. The type of the handle is a pointer to the value's type
>>>> + * (e.g., for an @c uint32_t lcore variable, the handle is a
>>>> + * <code>uint32_t *</code>). The handle type is used to inform the
>>>> + * access macros of the type of the values. A handle may be passed
>>>> + * between modules and threads just like any pointer, but its value
>>>> + * must be treated as an opaque identifier. An allocated handle never
>>>> + * has the value NULL.
>>>> + *
>>>> + * @b Creation
>>>> + *
>>>> + * An lcore variable is created in two steps:
>>>> + *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
>>>> + *  2. Allocate lcore variable storage and initialize the handle with
>>>> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
>>>> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
>>>> + *     of module initialization, but may be done at any time.
>>>> + *
>>>> + * The lifetime of an lcore variable is not tied to the thread that
>>>> + * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
>>>> + * available from the moment the lcore variable is created and
>>>> + * continue to exist throughout the entire lifetime of the EAL,
>>>> + * whether or not the lcore id is currently in use.
>>>> + *
>>>> + * Lcore variables cannot and need not be freed.
>>>> + *
>>>> + * @b Access
>>>> + *
>>>> + * The value of any lcore variable for any lcore id may be accessed
>>>> + * from any thread (including unregistered threads), but it should
>>>> + * only be *frequently* read from or written to by the owner.
>>>> + *
>>>> + * Values of the same lcore variable, associated with different lcore
>>>> + * ids may be frequently read or written by their respective owners
>>>> + * without risking false sharing.
>>>> + *
>>>> + * An appropriate synchronization mechanism (e.g., atomic loads and
>>>> + * stores) should be employed to prevent data races between the owning
>>>> + * thread and any other thread accessing the same value instance.
>>>> + *
>>>> + * The value of the lcore variable for a particular lcore id is
>>>> + * accessed using @ref RTE_LCORE_VAR_LCORE.
>>>> + *
>>>> + * A common pattern is for an EAL thread or a registered non-EAL
>>>> + * thread to access its own lcore variable value. For this purpose, a
>>>> + * shorthand exists as @ref RTE_LCORE_VAR.
>>>> + *
>>>> + * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
>>>> + * pointer with the same type as the value, it may not be directly
>>>> + * dereferenced and must be treated as an opaque identifier.
>>>> + *
>>>> + * Lcore variable handles and value pointers may be freely passed
>>>> + * between different threads.
>>>> + *
>>>> + * @b Storage
>>>> + *
>>>> + * An lcore variable's values may be of a primitive type like @c int,
>>>> + * but would more typically be a @c struct.
>>>> + *
>>>> + * The lcore variable handle introduces a per-variable (not
>>>> + * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
>>>> + * there are some memory footprint gains to be made by organizing all
>>>> + * per-lcore id data for a particular module as one lcore variable
>>>> + * (e.g., as a struct).
>>>> + *
>>>> + * An application may define an lcore variable handle without ever
>>>> + * allocating it.
>>>> + *
>>>> + * The size of an lcore variable's value must be less than the DPDK
>>>> + * build-time constant @c RTE_MAX_LCORE_VAR.
>>>> + *
>>>> + * Lcore variables are stored in a series of lcore buffers, which are
>>>> + * allocated from the libc heap. Heap allocation failures are treated
>>>> + * as fatal.
>>>> + *
>>>> + * Lcore variables should generally *not* be @ref __rte_cache_aligned
>>>> + * and need *not* include a @ref RTE_CACHE_GUARD field, since these
>>>> + * constructs are designed to avoid false sharing. In the
>>>> + * case of an lcore variable instance, the thread most recently
>>>> + * accessing nearby data structures should almost-always be the lcore
>>>> + * variable's owner. Adding padding will increase the effective memory
>>>> + * working set size, potentially reducing performance.
>>>> + *
>>>> + * Lcore variable values are initialized to zero by default.
>>>> + *
>>>> + * Lcore variables are not stored in huge page memory.
>>>> + *
>>>> + * @b Example
>>>> + *
>>>> + * Below is an example of the use of an lcore variable:
>>>> + *
>>>> + * @code{.c}
>>>> + * struct foo_lcore_state {
>>>> + *         int a;
>>>> + *         long b;
>>>> + * };
>>>> + *
>>>> + * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
>>>> + *
>>>> + * long foo_get_a_plus_b(void)
>>>> + * {
>>>> + *         struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);
>>>> + *
>>>> + *         return state->a + state->b;
>>>> + * }
>>>> + *
>>>> + * RTE_INIT(rte_foo_init)
>>>> + * {
>>>> + *         RTE_LCORE_VAR_ALLOC(lcore_states);
>>>> + *
>>>> + *         unsigned int lcore_id;
>>>> + *         struct foo_lcore_state *state;
>>>> + *         RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
>>>> + *                 (initialize 'state')
>>>> + *         }
>>>> + *
>>>> + *         (other initialization)
>>>> + * }
>>>> + * @endcode
>>>> + *
>>>> + *
>>>> + * @b Alternatives
>>>> + *
>>>> + * Lcore variables are designed to replace a pattern exemplified below:
>>>> + * @code{.c}
>>>> + * struct __rte_cache_aligned foo_lcore_state {
>>>> + *         int a;
>>>> + *         long b;
>>>> + *         RTE_CACHE_GUARD;
>>>> + * };
>>>> + *
>>>> + * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
>>>> + * @endcode
>>>> + *
>>>> + * This scheme is simple and effective, but has one drawback: the data
>>>> + * is organized so that objects related to all lcores for a particular
>>>> + * module are kept close in memory. At a bare minimum, this requires
>>>> + * sizing data structures (e.g., using `__rte_cache_aligned`) to a
>>>> + * whole number of cache lines to avoid false sharing. With CPU
>>>> + * hardware prefetching and memory loads resulting from speculative
>>>> + * execution (functions which seemingly are getting more eager faster
>>>> + * than they are getting more intelligent), one or more "guard" cache
>>>> + * lines may be required to separate one lcore's data from another's
>>>> + * and prevent false sharing.
>>>> + *
>>>> + * Lcore variables offer the advantage of working with, rather than
>>>> + * against, the CPU's assumptions. A next-line hardware prefetcher,
>>>> + * for example, may function as intended (i.e., to the benefit, not
>>>> + * detriment, of system performance).
>>>> + *
>>>> + * Another alternative to @ref rte_lcore_var.h is the @ref
>>>> + * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
>>>> + * e.g., GCC __thread or C11 _Thread_local). The main differences
>>>> + * between using the various forms of TLS (e.g., @ref
>>>> + * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
>>>> + * variables are:
>>>> + *
>>>> + *   * The lifecycle of a thread-local variable instance is tied to
>>>> + *     that of the thread. The data cannot be accessed before the
>>>> + *     thread has been created, nor after it has exited. As a result,
>>>> + *     thread-local variables must be initialized in a "lazy" manner
>>>> + *     (e.g., at the point of thread creation). Lcore variables may be
>>>> + *     accessed immediately after having been allocated (which may occur
>>>> + *     before any thread beyond the main thread is running).
>>>> + *   * A thread-local variable is duplicated across all threads in the
>>>> + *     process, including unregistered non-EAL threads (i.e.,
>>>> + *     "regular" threads). For DPDK applications heavily relying on
>>>> + *     multi-threading (in conjunction with DPDK's "one thread per core"
>>>> + *     pattern), either by having many concurrent threads or
>>>> + *     creating/destroying threads at a high rate, an excessive use of
>>>> + *     thread-local variables may cause inefficiencies (e.g.,
>>>> + *     increased thread creation overhead due to thread-local storage
>>>> + *     initialization or increased total RAM footprint). Lcore
>>>> + *     variables *only* exist for threads with an lcore id.
>>>> + *   * Whether data in thread-local storage may be shared between threads
>>>> + *     (i.e., whether a pointer to a thread-local variable can be passed to
>>>> + *     and successfully dereferenced by a non-owning thread) depends on
>>>> + *     the specifics of the TLS implementation. With GCC __thread and
>>>> + *     GCC _Thread_local, data sharing between threads is supported.
>>>> + *     In the C11 standard, accessing another thread's _Thread_local
>>>> + *     object is implementation-defined. Lcore variable instances may
>>>> + *     be accessed reliably by any thread.
>>>> + */
>>>
>>> For me this comment is too wordy for code and belongs in the documentation instead.
>>> Could also be reduced to more precise, succinct language.
> 
> I agree, this is what I was asking for.
> 
> 
>> Provided this makes it into RC1, I can move most of this and some of the
>> information in eal_common_lcore_var.c comments into "the documentation"
>> as an RC2 patch.
>>
>> If "the documentation" is a the EAL programmer's guide, a description of
>> lcore variables (with pictures!) in sufficient detail (both API and
>> implementation) would make up a large fraction of it. That would look
>> silly and in the way of more important things. Lcore variables is just a
>> tiny bit of infrastructure. Other, more central EAL features, like the
>> RTE spinlock, they have no mention at all in the EAL docs.
> 
> Please don't take what exists and what doesn't as an absolute model.
> We must improve the doc, split it better and fill the gaps.
> In the meantime we want new features like this one to be properly documented.
> 

I don't have an issue with raising the bar for new features.

> 
>> Another option, I suppose, is to document it separately from the
>> "main" EAL programmer's guide, but - correct me if I'm wrong here -
>> there seems to be no precedent for doing this.
> 
> For instance, the service cores are a separate chapter of the prog guide.

Right, forgot about the service cores. I will follow that model.

> The lcore variables should be a separate chapter as well.
> 


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v14 0/7] Lcore variables
  2024-10-15  9:33                                                                         ` [PATCH v13 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                                             ` (2 preceding siblings ...)
  2024-10-15 22:35                                                                           ` Stephen Hemminger
@ 2024-10-16 13:19                                                                           ` Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                                               ` (7 more replies)
  3 siblings, 8 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16 13:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id data allocation.

Please refer to the <rte_lcore_var.h> API documentation for both a
rationale for this new API, and a comparison to the alternatives
available.

The question of how best to allocate static per-lcore memory has come
up several times on the dev mailing list, for example in the thread on
the "random: use per lcore state" RFC by Stephen Hemminger.

Lcore variables are surely not the answer to all your per-lcore-data
needs, since they only allow for more-or-less static allocation. In
the author's opinion, they do, however, provide a reasonably simple,
clean, and seemingly very performant solution to a real problem.

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 432 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 256 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 138 ++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 116 ++---
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 394 ++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   1 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 20 files changed, 1409 insertions(+), 92 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v14 1/7] eal: add static per-lcore memory allocation facility
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
@ 2024-10-16 13:19                                                                             ` Mattias Rönnblom
  2024-10-16 14:53                                                                               ` Stephen Hemminger
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                                                               ` (6 subsequent siblings)
  7 siblings, 2 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16 13:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar, in terms of functionality, to the
FreeBSD kernel's DPCPU_*() family of macros and the associated
build-time machinery. DPCPU uses linker scripts, which effectively
prevents the reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
structs. The benefit of lcore variables over this approach is that
data related to the same lcore is now spatially close in memory,
rather than data used by the same module. This in turn avoids
excessive use of padding, which pollutes caches with unused data.

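For illustration, consider the following sketch (not part of this
patch; the foo names are hypothetical) of what migrating a module
from the static array pattern to an lcore variable might look like:

    /* Before: per-module array, padded to avoid false sharing. */
    struct __rte_cache_aligned foo_lcore_state {
            uint64_t count;
            RTE_CACHE_GUARD;
    };
    static struct foo_lcore_state foo_states[RTE_MAX_LCORE];

    /* After: an lcore variable, requiring no padding, since values
     * belonging to the same lcore id are grouped together in memory.
     */
    static RTE_LCORE_VAR_HANDLE(uint64_t, foo_count);
    RTE_LCORE_VAR_INIT(foo_count);

    static void
    foo_count_inc(void)
    {
            (*RTE_LCORE_VAR(foo_count))++;
    }
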
Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v14:
 * Add note in rte_lcore_var_alloc() that the memory cannot be freed.
   (Stephen Hemminger)
 * Hint the compiler rte_lcore_var_alloc() is a memory allocation
   facility. (Stephen Hemminger)

PATCH v13:
 * Remove _VALUE() suffix from value lookup and iterator macros.
   (Morten Brørup and Thomas Monjalon)
 * Remove the _ptr() suffix from the value lookup function.

PATCH v12:
 * Replace RTE_ASSERT() with RTE_VERIFY(), since performance is not
   a concern. (Morten Brørup)
 * Fix issue (introduced in v11) where aligned_malloc() was provided
   an object size which wasn't a multiple of the alignment.
   (Stephen Hemminger)

PATCH v11:
 * Add a note in the API docs on lcore variables and huge page memory.
   (Stephen Hemminger)
 * Free lcore var buffers at EAL cleanup. (Thomas Monjalon)
 * Tweak naming and include short lcore var buffer use overview
   in eal_common_lcore_var.c.

PATCH v10:
 * Improve documentation grammar and spelling. (Stephen Hemminger,
   Thomas Monjalon)
 * Add version.map DPDK version comment. (Thomas Monjalon)

PATCH v9:
 * Fixed merge conflicts in release notes.

PATCH v8:
 * Work around missing max_align_t definition in MSVC. (Morten Brørup)

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow arbitrary
   expression, not just a simple variable name, being passed.
   (Konstantin Ananyev)

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is there no longer exists a fixed upper
   bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represent the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 138 ++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 394 ++++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   1 +
 13 files changed, 609 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 6814991735..84fe62d339 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -289,6 +289,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index fd6f8a2f1a..498d509244 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index abd44b1861..6306636357 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b9fac1839d..b659a1d085 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -429,12 +429,43 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, memory blocks allocated on the DPDK
+heap, and other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library, or a PMD may opt to keep per-thread state.
+
+Per-thread data can be maintained using either *lcore variables* (see
+``rte_lcore_var.h``), *thread-local storage (TLS)* (see
+``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE`` elements,
+indexed by ``rte_lcore_id()``. These methods allow per-lcore data to be
+largely internal to the module and not directly exposed in its
+API. Another approach is to explicitly handle per-thread aspects in
+the API (e.g., the ports in the Eventdev API).
+
+Lcore variables are suitable for small objects that are statically
+allocated at the time of module or application initialization. An
+lcore variable takes on one value for each lcore ID-equipped thread
+(i.e., for both EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of lcore variables is
+independent of the owning threads and can, therefore, be initialized
+before the threads are created.
+
+Variables with thread-local storage are allocated when the thread is
+created and exist until the thread terminates. These are applicable
+for every thread in the process. Only very small objects should be
+allocated in TLS, as large TLS objects can significantly slow down
+thread creation and may unnecessarily increase the memory footprint of
+applications that extensively use unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore ID-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must both be cache-aligned and include an
+``RTE_CACHE_GUARD``. This extensive use of padding causes internal
+fragmentation (i.e., unused space) and reduces cache hit rates.
+
+For more discussions on per-lcore state, refer to the
+``rte_lcore_var.h`` API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 5a6502820d..ba11ccc97e 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -212,6 +212,20 @@ New Features
   Added ability for node to advertise and update multiple xstat counters,
   that can be retrieved using ``rte_graph_cluster_stats_get``.
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..f4dd5b1a82
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,138 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+#include "eal_lcore_var.h"
+
+/*
+ * An lcore var buffer stores at a minimum one, but usually many,
+ * lcore variables. The value instances for all lcore ids are stored
+ * in the same buffer.
+ *
+ * The address of the value of a particular lcore variable associated
+ * with a particular lcore id is:
+ * buffer->data + offset + lcore_id * RTE_MAX_LCORE_VAR.
+ *
+ * In this way, the values associated with a particular lcore id are
+ * grouped spatially close (in the data array), and no padding is
+ * required to prevent false sharing.
+ *
+ * The (buffer->data + offset) base pointer is what is being returned
+ * to the API user as an opaque handle. The handle is a pointer to the
+ * value for lcore id 0, for that lcore variable.
+ *
+ * The implementation maintains a current lcore var buffer (being
+ * allocated from), and an offset representing the amount of data
+ * already allocated (in bytes) in that buffer.
+ *
+ * The offset is progressively incremented (by the size of the
+ * just-allocated lcore variable), as lcore variables are being
+ * allocated.
+ *
+ * When one lcore var buffer is full, a new one is allocated off the heap.
+ *
+ * The lcore var buffers are arranged in a singly linked list, to allow
+ * freeing them at the point of rte_eal_cleanup(), and thereby avoid
+ * false positives from tools like valgrind memcheck.
+ */
+struct lcore_var_buffer {
+	char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
+	struct lcore_var_buffer *prev;
+};
+
+static struct lcore_var_buffer *current_buffer;
+
+/* initialized to trigger buffer allocation on first allocation */
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		struct lcore_var_buffer *prev = current_buffer;
+		size_t alloc_size =
+			RTE_ALIGN_CEIL(sizeof(struct lcore_var_buffer),
+				       RTE_CACHE_LINE_SIZE);
+#ifdef RTE_EXEC_ENV_WINDOWS
+		current_buffer = _aligned_malloc(alloc_size, RTE_CACHE_LINE_SIZE);
+#else
+		current_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE, alloc_size);
+
+#endif
+		RTE_VERIFY(current_buffer != NULL);
+
+		current_buffer->prev = prev;
+
+		offset = 0;
+	}
+
+	handle = &current_buffer->data[offset];
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer aligned on the cache
+	 * line size, assures that aligned offsets also translate to
+	 * aligned pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_VERIFY(align <= RTE_CACHE_LINE_SIZE);
+	RTE_VERIFY(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+#ifdef RTE_TOOLCHAIN_MSVC
+		/* MSVC <stddef.h> is missing the max_align_t typedef */
+		align = alignof(double);
+#else
+		align = alignof(max_align_t);
+#endif
+
+	RTE_VERIFY(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
+
+void
+eal_lcore_var_cleanup(void)
+{
+	while (current_buffer != NULL) {
+		struct lcore_var_buffer *prev = current_buffer->prev;
+
+		free(current_buffer);
+
+		current_buffer = prev;
+	}
+}
diff --git a/lib/eal/common/eal_lcore_var.h b/lib/eal/common/eal_lcore_var.h
new file mode 100644
index 0000000000..de2c4e44a0
--- /dev/null
+++ b/lib/eal/common/eal_lcore_var.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2024 Ericsson AB.
+ */
+
+#ifndef EAL_LCORE_VAR_H
+#define EAL_LCORE_VAR_H
+
+void
+eal_lcore_var_cleanup(void);
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index c1bbf26654..e273745e93 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index 1229230063..796c9dbf2d 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -47,6 +47,7 @@
 
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -941,6 +942,7 @@ rte_eal_cleanup(void)
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_cleanup_config(internal_conf);
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index 474097f211..d903577caa 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -28,6 +28,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..3f5ae500bc
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,394 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) holds a
+ * unique value for each EAL thread and registered non-EAL
+ * thread. There is one instance for each current and future lcore
+ * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
+ * value of the lcore variable for one lcore id is independent from
+ * the values assigned to other lcore ids within the same variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle never
+ * has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * The lifetime of an lcore variable is not tied to the thread that
+ * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
+ * available from the moment the lcore variable is created and
+ * continue to exist throughout the entire lifetime of the EAL,
+ * whether or not the lcore id is currently in use.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable, associated with different lcore
+ * ids, may be frequently read or written by their respective owners
+ * without risking false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to prevent data races between the owning
+ * thread and any other thread accessing the same value instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * shorthand exists as @ref RTE_LCORE_VAR.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may define an lcore variable handle without ever
+ * allocating it.
+ *
+ * The size of an lcore variable's value must not exceed the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variables are stored in a series of lcore buffers, which are
+ * allocated from the libc heap. Heap allocation failures are treated
+ * as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the case of an
+ * lcore variable instance, the thread most recently accessing nearby
+ * data structures should almost always be the lcore
+ * variable's owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values are initialized to zero by default.
+ *
+ * Lcore variables are not stored in huge page memory.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to a
+ * whole number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's
+ * and prevent false sharing.
+ *
+ * Lcore variables offer the advantage of working with, rather than
+ * against, the CPU's assumptions. A next-line hardware prefetcher,
+ * for example, may function as intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and using lcore
+ * variables are:
+ *
+ *   * The lifecycle of a thread-local variable instance is tied to
+ *     that of the thread. The data cannot be accessed before the
+ *     thread has been created, nor after it has exited. As a result,
+ *     thread-local variables must be initialized in a "lazy" manner
+ *     (e.g., at the point of thread creation). Lcore variables may be
+ *     accessed immediately after having been allocated (which may occur
+ *     before any thread beyond the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction with DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between threads
+ *     (i.e., whether a pointer to a thread-local variable can be passed to
+ *     and successfully dereferenced by a non-owning thread) depends on
+ *     the specifics of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, data sharing between threads is supported.
+ *     In the C11 standard, accessing another thread's _Thread_local
+ *     object is implementation-defined. Lcore variable instances may
+ *     be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * This macro clarifies that the declaration is an lcore handle, not a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR(handle)				\
+	RTE_LCORE_VAR_LCORE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to every
+ *   valid lcore id (up to @c RTE_MAX_LCORE).
+ * @param value
+ *   A pointer variable successively set to point to the lcore
+ *   variable value instance of the current lcore id being processed.
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH(lcore_id, value, handle)			\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR or @ref RTE_LCORE_VAR_LCORE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * The allocated memory cannot be freed.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal to or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+	__rte_alloc_size(1);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 54577b7718..d0f27315b9 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -45,6 +45,7 @@
 #include <telemetry_internal.h>
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -1371,6 +1372,7 @@ rte_eal_cleanup(void)
 	rte_eal_malloc_heap_cleanup();
 	eal_cleanup_config(internal_conf);
 	rte_eal_log_cleanup();
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/version.map b/lib/eal/version.map
index f493cd1ca7..94dc5b17d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -399,6 +399,7 @@ EXPERIMENTAL {
 
 	# added in 24.11
 	rte_bitset_to_str;
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v14 2/7] eal: add lcore variable functional tests
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-16 13:19                                                                             ` Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                                                               ` (5 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16 13:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add functional test suite to exercise the <rte_lcore_var.h> API.

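A sketch of how the suite may be invoked (assuming the standard
dpdk-test runner and a conventional build directory layout; the suite
is skipped unless at least two lcores are available):

    $ ./build/app/test/dpdk-test -l 0-1
    RTE>>lcore_var_autotest
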
Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index fe248b786c..9060cfeb7a 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..ddf70b03a0
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE(lcore_id, test_int) = state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray = RTE_LCORE_VAR_LCORE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE(lcore_id, handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v14 3/7] eal: add lcore variable performance test
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-10-16 13:19                                                                             ` Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                                               ` (4 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16 13:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add a basic micro benchmark for lcore variables, in an attempt to
verify that their overhead isn't significantly greater than that of
alternative approaches, in scenarios where the benefits aren't
expected to show up (i.e., when plenty of cache is available compared
to the working set size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v8:
 * Fix spelling. (Morten Brørup)

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic the static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 256 +++++++++++++++++++++++++++++++++
 2 files changed, 257 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 9060cfeb7a..cf4908de5a 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -105,6 +105,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..6d9869f873
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,256 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =	RTE_LCORE_VAR(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each with a small, per-lcore state. Note however that
+ * these tests have very little non-lcore/thread local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
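
The commit message's cache argument can be made concrete with some
back-of-the-envelope arithmetic on the structures in this patch. The
assumed cache line size, RTE_MAX_LCORE, and guard width below are
typical build-time defaults, not values taken from this thread:

/*
 * Assuming a 64-byte cache line, RTE_MAX_LCORE == 128 and
 * RTE_CACHE_GUARD_LINES == 1:
 *
 * sizeof(struct mod_lcore_state) = 3 * 8 B = 24 B
 *
 * Static array (struct mod_lcore_state_aligned):
 *   per element:  64 B data line + 64 B guard line = 128 B
 *   total:        1024 mods * 128 lcores * 128 B = 16 MiB
 *   per-lcore working set: 1024 * 64 B = 64 KiB, spread over
 *   1024 cache lines with only 24 B of each line in use.
 *
 * Lcore variables:
 *   one lcore's values are packed: 1024 * 24 B = 24 KiB, i.e.,
 *   384 fully used cache lines.
 */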

* [PATCH v14 4/7] random: keep PRNG state in lcore variable
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
                                                                                               ` (2 preceding siblings ...)
  2024-10-16 13:19                                                                             ` [PATCH v14 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-16 13:19                                                                             ` Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 5/7] power: keep per-lcore " Mattias Rönnblom
                                                                                               ` (3 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16 13:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Keep the PRNG state in a more cache-friendly lcore variable, instead
of in an RTE_MAX_LCORE-sized static array of cache-aligned,
RTE_CACHE_GUARDed struct instances.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..cf0756f26a 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
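
One behavioral consequence of this patch is worth spelling out: which
state a caller ends up using depends on whether the calling thread has
an lcore id. A sketch, assuming rte_thread_register() from
<rte_lcore.h>; the function itself is illustrative:

#include <stdint.h>

#include <rte_lcore.h>
#include <rte_random.h>

static uint64_t
non_eal_thread_fn(void)
{
	/* Here rte_lcore_id() == LCORE_ID_ANY, so rte_rand() falls
	 * back to the single, shared unregistered_rand_state. */
	uint64_t r = rte_rand();

	/* After successful registration, the thread is lcore
	 * id-equipped, and rte_rand() uses the thread's own instance
	 * of the rand_state lcore variable. */
	if (rte_thread_register() == 0)
		r ^= rte_rand();

	return r;
}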

* [PATCH v14 5/7] power: keep per-lcore state in lcore variable
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
                                                                                               ` (3 preceding siblings ...)
  2024-10-16 13:19                                                                             ` [PATCH v14 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-10-16 13:19                                                                             ` Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 6/7] service: " Mattias Rönnblom
                                                                                               ` (2 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16 13:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 830a6c7a97..4bab2d5108 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -519,7 +517,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -620,7 +618,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -770,21 +768,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
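
The constructor in this patch illustrates what is likely the typical
module-init idiom: allocate the variable once, then iterate over all
values to set any non-zero defaults. A standalone sketch, with
illustrative identifiers:

#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore_var.h>

struct mod_cfg {
	uint32_t emptypoll_max;
};

static RTE_LCORE_VAR_HANDLE(struct mod_cfg, mod_cfgs);

RTE_INIT(mod_init)
{
	unsigned int lcore_id;
	struct mod_cfg *cfg;

	RTE_LCORE_VAR_ALLOC(mod_cfgs);

	/* Values are zeroed on allocation; only non-zero defaults
	 * need to be set explicitly. */
	RTE_LCORE_VAR_FOREACH(lcore_id, cfg, mod_cfgs)
		cfg->emptypoll_max = 512;
}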

* [PATCH v14 6/7] service: keep per-lcore state in lcore variable
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
                                                                                               ` (4 preceding siblings ...)
  2024-10-16 13:19                                                                             ` [PATCH v14 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-10-16 13:19                                                                             ` Mattias Rönnblom
  2024-10-16 13:19                                                                             ` [PATCH v14 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  2024-10-16 14:58                                                                             ` [PATCH v14 0/7] Lcore variables Stephen Hemminger
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16 13:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v14:
 * Merge with bitset-related changes.

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 116 ++++++++++++++++++++---------------
 1 file changed, 65 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 324471e897..dad3150df9 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_bitset.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
@@ -78,7 +79,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -99,12 +100,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -120,7 +117,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -134,7 +130,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -284,7 +279,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -292,9 +286,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		rte_bitset_clear(lcore_states[i].mapped_services, id);
+	RTE_LCORE_VAR_FOREACH(lcore_id, cs, lcore_states)
+		rte_bitset_clear(cs->mapped_services, id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -463,7 +459,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (rte_bitset_test(lcore_states[ids[i]].service_active_on_lcore, id))
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(ids[i], lcore_states);
+
+		if (rte_bitset_test(cs->service_active_on_lcore, id))
 			return 1;
 	}
 
@@ -473,7 +472,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -496,8 +495,7 @@ static int32_t
 service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +531,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +547,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +569,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +586,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,28 +638,30 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	if (set) {
-		uint64_t lcore_mapped = rte_bitset_test(lcore_states[lcore].mapped_services, sid);
+		bool lcore_mapped = rte_bitset_test(cs->mapped_services, sid);
 
 		if (*set && !lcore_mapped) {
-			rte_bitset_set(lcore_states[lcore].mapped_services, sid);
+			rte_bitset_set(cs->mapped_services, sid);
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			rte_bitset_clear(lcore_states[lcore].mapped_services, sid);
+			rte_bitset_clear(cs->mapped_services, sid);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = rte_bitset_test(lcore_states[lcore].mapped_services, sid);
+		*enabled = rte_bitset_test(cs->mapped_services, sid);
 
 	return 0;
 }
@@ -683,13 +689,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -700,14 +707,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all mapped services */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			rte_bitset_clear_all(lcore_states[i].mapped_services, RTE_SERVICE_NUM_MAX);
+		struct core_state *cs =	RTE_LCORE_VAR_LCORE(i, lcore_states);
+
+		if (cs->is_service_core) {
+			rte_bitset_clear_all(cs->mapped_services, RTE_SERVICE_NUM_MAX);
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -723,17 +732,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	rte_bitset_clear_all(lcore_states[lcore].mapped_services, RTE_SERVICE_NUM_MAX);
+	rte_bitset_clear_all(cs->mapped_services, RTE_SERVICE_NUM_MAX);
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -745,7 +756,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -769,7 +780,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -799,6 +810,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -806,12 +819,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
 		bool enabled = rte_bitset_test(cs->mapped_services, i);
@@ -831,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -842,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -850,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -858,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -885,7 +897,7 @@ lcore_attr_get_service_error_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -901,7 +913,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -963,12 +978,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -993,7 +1007,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -1004,12 +1019,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1044,7 +1058,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
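
Note the NULL check preceding RTE_LCORE_VAR_ALLOC() in
rte_service_init() above: lcore variable allocations cannot be freed
(see the API documentation quoted elsewhere in this thread), so an
init function that may run more than once must guard against
re-allocating. The idiom in isolation, with illustrative identifiers:

#include <rte_lcore_var.h>

struct mod_state {
	int dummy;
};

/* A static handle starts out NULL, and allocated handles are
 * guaranteed non-NULL, so NULL can safely mean "not yet allocated". */
static RTE_LCORE_VAR_HANDLE(struct mod_state, mod_states);

int
mod_init(void)
{
	if (mod_states == NULL)
		RTE_LCORE_VAR_ALLOC(mod_states);

	/* ... remaining, idempotent initialization ... */

	return 0;
}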

* [PATCH v14 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
                                                                                               ` (5 preceding siblings ...)
  2024-10-16 13:19                                                                             ` [PATCH v14 6/7] service: " Mattias Rönnblom
@ 2024-10-16 13:19                                                                             ` Mattias Rönnblom
  2024-10-16 14:58                                                                             ` [PATCH v14 0/7] Lcore variables Stephen Hemminger
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-16 13:19 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Keep per-lcore power intrinsics state in an lcore variable, to reduce
the cache working set size and avoid false sharing caused by CPU
next-line prefetching.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..98a2cbc611 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
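
This is the only patch in the series using RTE_LCORE_VAR_INIT()
instead of an explicit allocation call in an existing constructor.
Judging by its use here, it presumably expands to roughly the
following; this is an assumption, not the actual <rte_lcore_var.h>
definition:

#include <rte_common.h>
#include <rte_lcore_var.h>
#include <rte_spinlock.h>

struct power_wait_status {
	rte_spinlock_t lock;
	volatile void *monitor_addr;
};

/* The handle, as defined in the patch above. */
RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);

/* Assumed rough equivalent of RTE_LCORE_VAR_INIT(wait_status): a
 * constructor performing the allocation before main(). */
RTE_INIT(wait_status_init)
{
	RTE_LCORE_VAR_ALLOC(wait_status);
}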

* Re: [PATCH v14 1/7] eal: add static per-lcore memory allocation facility
  2024-10-16 13:19                                                                             ` [PATCH v14 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-16 14:53                                                                               ` Stephen Hemminger
  2024-10-17  5:38                                                                                 ` Mattias Rönnblom
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
  1 sibling, 1 reply; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-16 14:53 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic, Konstantin Ananyev,
	Chengwen Feng

On Wed, 16 Oct 2024 15:19:10 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> +
> +/**
> + * Allocate space in the per-lcore id buffers for an lcore variable.
> + *
> + * The pointer returned is only an opaque identifier of the variable. To
> + * get an actual pointer to a particular instance of the variable use
> + * @ref RTE_LCORE_VAR or @ref RTE_LCORE_VAR_LCORE.
> + *
> + * The lcore variable values' memory is set to zero.
> + *
> + * The allocation is always successful, barring a fatal exhaustion of
> + * the per-lcore id buffer space.
> + *
> + * rte_lcore_var_alloc() is not multi-thread safe.
> + *
> + * The allocated memory cannot be freed.
> + *
> + * @param size
> + *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
> + * @param align
> + *   If 0, the values will be suitably aligned for any kind of type
> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
> + *   on a multiple of *align*, which must be a power of 2 and equal or
> + *   less than @c RTE_CACHE_LINE_SIZE.
> + * @return
> + *   The variable's handle, stored in a void pointer value. The value
> + *   is always non-NULL.
> + */
> +__rte_experimental
> +void *
> +rte_lcore_var_alloc(size_t size, size_t align)
> +	__rte_alloc_size(2);

This is not right; the index indicates which arg holds the size, etc.
It should be:
		__rte_alloc_size(1) __rte_alloc_align(2) __rte_malloc;

^ permalink raw reply	[flat|nested] 313+ messages in thread
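
For readers unfamiliar with these annotations: the indices are 1-based
function argument positions. Assuming the __rte_* macros wrap the
standard GCC/Clang function attributes (an assumption; the actual
definitions live in rte_common.h), the suggestion corresponds to
roughly:

#include <stddef.h>

/* alloc_size(1) says argument 1 holds the returned object's size,
 * alloc_align(2) says argument 2 holds its alignment, and malloc says
 * the returned pointer aliases no other object. Together they let the
 * compiler diagnose out-of-bounds accesses through the returned
 * pointer and assume its alignment. */
void *
rte_lcore_var_alloc(size_t size, size_t align)
	__attribute__((alloc_size(1), alloc_align(2), malloc));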

* Re: [PATCH v14 0/7] Lcore variables
  2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
                                                                                               ` (6 preceding siblings ...)
  2024-10-16 13:19                                                                             ` [PATCH v14 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
@ 2024-10-16 14:58                                                                             ` Stephen Hemminger
  2024-10-17  5:40                                                                               ` Mattias Rönnblom
  7 siblings, 1 reply; 313+ messages in thread
From: Stephen Hemminger @ 2024-10-16 14:58 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic

On Wed, 16 Oct 2024 15:19:09 +0200
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:

> This patch set introduces a new API <rte_lcore_var.h> for static
> per-lcore id data allocation.
> 
> Please refer to the <rte_lcore_var.h> API documentation for both a
> rationale for this new API, and a comparison to the alternatives
> available.
> 
> The question on how to best allocate static per-lcore memory has been
> up several times on the dev mailing list, for example in the thread on
> "random: use per lcore state" RFC by Stephen Hemminger.
> 
> Lcore variables are surely not the answer to all your per-lcore-data
> needs, since it only allows for more-or-less static allocation. In the
> author's opinion, it does however provide a reasonably simple and
> clean and seemingly very much performant solution to a real problem.
> 
> Mattias Rönnblom (7):
>   eal: add static per-lcore memory allocation facility
>   eal: add lcore variable functional tests
>   eal: add lcore variable performance test
>   random: keep PRNG state in lcore variable
>   power: keep per-lcore state in lcore variable
>   service: keep per-lcore state in lcore variable
>   eal: keep per-lcore power intrinsics state in lcore variable

Still too wordy. Would you mind if I have a try at summarizing it and
running the text through an editor tool?

^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v14 1/7] eal: add static per-lcore memory allocation facility
  2024-10-16 14:53                                                                               ` Stephen Hemminger
@ 2024-10-17  5:38                                                                                 ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:38 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic, Konstantin Ananyev, Chengwen Feng

On 2024-10-16 16:53, Stephen Hemminger wrote:
> On Wed, 16 Oct 2024 15:19:10 +0200
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
>> +
>> +/**
>> + * Allocate space in the per-lcore id buffers for an lcore variable.
>> + *
>> + * The pointer returned is only an opaque identifier of the variable. To
>> + * get an actual pointer to a particular instance of the variable use
>> + * @ref RTE_LCORE_VAR or @ref RTE_LCORE_VAR_LCORE.
>> + *
>> + * The lcore variable values' memory is set to zero.
>> + *
>> + * The allocation is always successful, barring a fatal exhaustion of
>> + * the per-lcore id buffer space.
>> + *
>> + * rte_lcore_var_alloc() is not multi-thread safe.
>> + *
>> + * The allocated memory cannot be freed.
>> + *
>> + * @param size
>> + *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
>> + * @param align
>> + *   If 0, the values will be suitably aligned for any kind of type
>> + *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
>> + *   on a multiple of *align*, which must be a power of 2 and equal or
>> + *   less than @c RTE_CACHE_LINE_SIZE.
>> + * @return
>> + *   The variable's handle, stored in a void pointer value. The value
>> + *   is always non-NULL.
>> + */
>> +__rte_experimental
>> +void *
>> +rte_lcore_var_alloc(size_t size, size_t align)
>> +	__rte_alloc_size(2);
> 
> This is not right; the index indicates which arg holds the size, etc.
> It should be:
> 		__rte_alloc_size(1) __rte_alloc_align(2) __rte_malloc;

Oops. Will fix in v15. Thanks.



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v14 0/7] Lcore variables
  2024-10-16 14:58                                                                             ` [PATCH v14 0/7] Lcore variables Stephen Hemminger
@ 2024-10-17  5:40                                                                               ` Mattias Rönnblom
  0 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:40 UTC (permalink / raw)
  To: Stephen Hemminger, Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev, David Marchand,
	Jerin Jacob, Luka Jankovic

On 2024-10-16 16:58, Stephen Hemminger wrote:
> On Wed, 16 Oct 2024 15:19:09 +0200
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> 
>> This patch set introduces a new API <rte_lcore_var.h> for static
>> per-lcore id data allocation.
>>
>> Please refer to the <rte_lcore_var.h> API documentation for both a
>> rationale for this new API, and a comparison to the alternatives
>> available.
>>
>> The question on how to best allocate static per-lcore memory has been
>> up several times on the dev mailing list, for example in the thread on
>> "random: use per lcore state" RFC by Stephen Hemminger.
>>
>> Lcore variables are surely not the answer to all your per-lcore-data
>> needs, since it only allows for more-or-less static allocation. In the
>> author's opinion, it does however provide a reasonably simple and
>> clean and seemingly very much performant solution to a real problem.
>>
>> Mattias Rönnblom (7):
>>    eal: add static per-lcore memory allocation facility
>>    eal: add lcore variable functional tests
>>    eal: add lcore variable performance test
>>    random: keep PRNG state in lcore variable
>>    power: keep per-lcore state in lcore variable
>>    service: keep per-lcore state in lcore variable
>>    eal: keep per-lcore power intrinsics state in lcore variable
> 
> Still too wordy. Would you mind if I have a try at summarizing it and
> running the text through an editor tool?

I think you need to be a little more wordy here. What text? The cover 
text? That won't survive anyway.


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v15 0/7] Lcore variables
  2024-10-16 13:19                                                                             ` [PATCH v14 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-16 14:53                                                                               ` Stephen Hemminger
@ 2024-10-17  5:57                                                                               ` Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                                                   ` (8 more replies)
  1 sibling, 9 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:57 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id memory allocation.

Lcore variables are designed to replace static lcore id-indexed arrays
and thread-local storage.

See <rte_lcore_var.h> for the rationale and comparison with
alternatives.

Mattias Rönnblom (7):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 432 ++++++++++++++++++
 app/test/test_lcore_var_perf.c                | 256 +++++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 138 ++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 116 ++---
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 394 ++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   1 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 20 files changed, 1409 insertions(+), 92 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
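
The cover letter's claim that lcore variables replace static lcore
id-indexed arrays is easiest to see side by side. The "before" half
mirrors the rte_rand_state and benchmark structs earlier in this
thread; the identifiers are illustrative:

#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore.h>
#include <rte_lcore_var.h>

/* Before: lcore id-indexed static array. Avoiding false sharing
 * requires cache alignment plus a guard, padding a 24-byte payload
 * to (typically) two cache lines per lcore. */
struct __rte_cache_aligned mod_state_padded {
	uint64_t a, b, sum;
	RTE_CACHE_GUARD;
};
static struct mod_state_padded states_array[RTE_MAX_LCORE];

/* After: an lcore variable. No padding is needed, since the values
 * belonging to the same lcore id are packed together in memory. */
struct mod_state {
	uint64_t a, b, sum;
};
static RTE_LCORE_VAR_HANDLE(struct mod_state, states_lvar);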

* [PATCH v15 1/7] eal: add static per-lcore memory allocation facility
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
@ 2024-10-17  5:57                                                                                 ` Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 2/7] eal: add lcore variable functional tests Mattias Rönnblom
                                                                                                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:57 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but with the values' lifetime decoupled from that of
the threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. However, DPCPU relies on linker scripts, which effectively
prevents the reuse of its otherwise seemingly viable approach.

The currently prevailing way to solve the same problem lcore variables
address is to keep a module's per-lcore data in an RTE_MAX_LCORE-sized
array of cache-aligned, RTE_CACHE_GUARDed structs. The benefit of
lcore variables over this approach is that data related to the same
lcore is now kept close (spatially, in memory), rather than data used
by the same module. This in turn avoids excessive use of padding,
which would otherwise pollute caches with unused data.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v15:
 * Add alignment-related compiler hint. (Stephen Hemminger)
 * Have size-related compiler hint point toward the right function
   argument. (Stephen Hemminger)

PATCH v14:
 * Add note in rte_lcore_var_alloc() that the memory cannot be freed.
   (Stephen Hemminger)
 * Hint to the compiler that rte_lcore_var_alloc() is a memory
   allocation facility. (Stephen Hemminger)

PATCH v13:
 * Remove _VALUE() suffix from value lookup and iterator macros.
   (Morten Brørup and Thomas Monjalon)
 * Remove the _ptr() suffix from the value lookup function.

PATCH v12:
 * Replace RTE_ASSERT() with RTE_VERIFY(), since performance is not
   a concern. (Morten Brørup)
 * Fix issue (introduced in v11) where aligned_malloc() was provided
   an object size which wasn't an even number of the alignment.
   (Stephen Hemminger)

PATCH v11:
 * Add a note in the API docs on lcore variables and huge page memory.
   (Stephen Hemminger)
 * Free lcore var buffers at EAL cleanup. (Thomas Monjalon)
 * Tweak naming and include short lcore var buffer use overview
   in eal_common_lcore_var.c.

PATCH v10:
 * Improve documentation grammar and spelling. (Stephen Hemminger,
   Thomas Monjalon)
 * Add version.map DPDK version comment. (Thomas Monjalon)

PATCH v9:
 * Fixed merge conflicts in release notes.

PATCH v8:
 * Work around missing max_align_t definition in MSVC. (Morten Brørup)

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow an
   arbitrary expression, not just a simple variable name, to be
   passed. (Konstantin Ananyev)

PATCH v6:
 * Have the API user provide the loop variable in the FOREACH macro,
   to avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is that there no longer exists a fixed
   upper bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance that the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 138 ++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 394 ++++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   1 +
 13 files changed, 609 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 6814991735..84fe62d339 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -289,6 +289,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index fd6f8a2f1a..498d509244 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index abd44b1861..6306636357 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b9fac1839d..b659a1d085 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -429,12 +429,43 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, memory blocks allocated on the DPDK
+heap, and other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library, or a PMD may opt to keep per-thread state.
+
+Per-thread data can be maintained using either *lcore variables* (see
+``rte_lcore_var.h``), *thread-local storage (TLS)* (see
+``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE`` elements,
+indexed by ``rte_lcore_id()``. These methods allow per-lcore data to be
+largely internal to the module and not directly exposed in its
+API. Another approach is to explicitly handle per-thread aspects in
+the API (e.g., the ports in the Eventdev API).
+
+Lcore variables are suitable for small objects that are statically
+allocated at the time of module or application initialization. An
+lcore variable takes on one value for each lcore ID-equipped thread
+(i.e., for both EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of lcore variables is
+independent of the owning threads and can, therefore, be initialized
+before the threads are created.
+
+Variables with thread-local storage are allocated when the thread is
+created and exist until the thread terminates. These are applicable
+for every thread in the process. Only very small objects should be
+allocated in TLS, as large TLS objects can significantly slow down
+thread creation and may unnecessarily increase the memory footprint of
+applications that extensively use unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore ID-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must be both cache-aligned and include an
+``RTE_CACHE_GUARD``. This extensive use of padding causes internal
+fragmentation (i.e., unused space) and reduces cache hit rates.
+
+For more discussions on per-lcore state, refer to the
+``rte_lcore_var.h`` API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index 5a6502820d..ba11ccc97e 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -212,6 +212,20 @@ New Features
   Added ability for node to advertise and update multiple xstat counters,
   that can be retrieved using ``rte_graph_cluster_stats_get``.
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..f4dd5b1a82
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,138 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+#include "eal_lcore_var.h"
+
+/*
+ * An lcore var buffer stores at a minimum one, but usually many,
+ * lcore variables. The value instances for all lcore ids are stored
+ * in the same buffer.
+ *
+ * The address of the value of a particular lcore variable associated
+ * with a particular lcore id is:
+ * buffer->data + offset + lcore_id * RTE_MAX_LCORE_VAR.
+ *
+ * In this way, the values associated with a particular lcore id are
+ * grouped spatially close (in the data array), and no padding is
+ * required to prevent false sharing.
+ *
+ * The (buffer->data + offset) base pointer is what is being returned
+ * to the API user as an opaque handle. The handle is a pointer to the
+ * value for lcore id 0, for that lcore variable.
+ *
+ * The implementation maintains a current lcore var buffer (being
+ * allocated from), and an offset representing the amount of data
+ * already allocated (in bytes) in that buffer.
+ *
+ * The offset is progressively incremented (by the size of the
+ * just-allocated lcore variable), as lcore variables are being
+ * allocated.
+ *
+ * When one lcore var buffer is full, a new one is allocated off the heap.
+ *
+ * The lcore var buffers are arranged in a singly linked list, to allow
+ * freeing them at the point of rte_eal_cleanup(), and thereby avoid
+ * false positives from tools like valgrind memcheck.
+ */
+struct lcore_var_buffer {
+	char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
+	struct lcore_var_buffer *prev;
+};
+
+static struct lcore_var_buffer *current_buffer;
+
+/* initialized to trigger buffer allocation on first allocation */
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		struct lcore_var_buffer *prev = current_buffer;
+		size_t alloc_size =
+			RTE_ALIGN_CEIL(sizeof(struct lcore_var_buffer),
+				       RTE_CACHE_LINE_SIZE);
+#ifdef RTE_EXEC_ENV_WINDOWS
+		current_buffer = _aligned_malloc(alloc_size, RTE_CACHE_LINE_SIZE);
+#else
+		current_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE, alloc_size);
+
+#endif
+		RTE_VERIFY(current_buffer != NULL);
+
+		current_buffer->prev = prev;
+
+		offset = 0;
+	}
+
+	handle = &current_buffer->data[offset];
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer aligned on cache line
+	 * size, assures that aligned offsets also translate to
+	 * aligned pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_VERIFY(align <= RTE_CACHE_LINE_SIZE);
+	RTE_VERIFY(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+#ifdef RTE_TOOLCHAIN_MSVC
+		/* MSVC <stddef.h> is missing the max_align_t typedef */
+		align = alignof(double);
+#else
+		align = alignof(max_align_t);
+#endif
+
+	RTE_VERIFY(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
+
+void
+eal_lcore_var_cleanup(void)
+{
+	while (current_buffer != NULL) {
+		struct lcore_var_buffer *prev = current_buffer->prev;
+
+		free(current_buffer);
+
+		current_buffer = prev;
+	}
+}
diff --git a/lib/eal/common/eal_lcore_var.h b/lib/eal/common/eal_lcore_var.h
new file mode 100644
index 0000000000..de2c4e44a0
--- /dev/null
+++ b/lib/eal/common/eal_lcore_var.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2024 Ericsson AB.
+ */
+
+#ifndef EAL_LCORE_VAR_H
+#define EAL_LCORE_VAR_H
+
+void
+eal_lcore_var_cleanup(void);
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index c1bbf26654..e273745e93 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index 1229230063..796c9dbf2d 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -47,6 +47,7 @@
 
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -941,6 +942,7 @@ rte_eal_cleanup(void)
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_cleanup_config(internal_conf);
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index 474097f211..d903577caa 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -28,6 +28,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..38ab54940f
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,394 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * A per-lcore id variable (or lcore variable for short) holds a
+ * unique value for each EAL thread and registered non-EAL
+ * thread. There is one instance for each current and future lcore
+ * id-equipped thread, with a total of @c RTE_MAX_LCORE instances. The
+ * value of the lcore variable for one lcore id is independent from
+ * the values assigned to other lcore ids within the same variable.
+ *
+ * In order to access the values of an lcore variable, a handle is
+ * used. The type of the handle is a pointer to the value's type
+ * (e.g., for a @c uint32_t lcore variable, the handle is a
+ * <code>uint32_t *</code>). The handle type is used to inform the
+ * access macros of the type of the values. A handle may be passed
+ * between modules and threads just like any pointer, but its value
+ * must be treated as an opaque identifier. An allocated handle never
+ * has the value NULL.
+ *
+ * @b Creation
+ *
+ * An lcore variable is created in two steps:
+ *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
+ *  2. Allocate lcore variable storage and initialize the handle with
+ *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
+ *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs at the time
+ *     of module initialization, but may be done at any time.
+ *
+ * The lifetime of an lcore variable is not tied to the thread that
+ * created it. Its per lcore id values (up to @c RTE_MAX_LCORE) are
+ * available from the moment the lcore variable is created and
+ * continue to exist throughout the entire lifetime of the EAL,
+ * whether or not the lcore id is currently in use.
+ *
+ * Lcore variables cannot and need not be freed.
+ *
+ * @b Access
+ *
+ * The value of any lcore variable for any lcore id may be accessed
+ * from any thread (including unregistered threads), but it should
+ * only be *frequently* read from or written to by the owner.
+ *
+ * Values of the same lcore variable, associated with different
+ * lcore ids, may be frequently read or written by their respective
+ * owners without risking false sharing.
+ *
+ * An appropriate synchronization mechanism (e.g., atomic loads and
+ * stores) should be employed to prevent data races between the owning
+ * thread and any other thread accessing the same value instance.
+ *
+ * The value of the lcore variable for a particular lcore id is
+ * accessed using @ref RTE_LCORE_VAR_LCORE.
+ *
+ * A common pattern is for an EAL thread or a registered non-EAL
+ * thread to access its own lcore variable value. For this purpose, a
+ * shorthand exists as @ref RTE_LCORE_VAR.
+ *
+ * Although the handle (as defined by @ref RTE_LCORE_VAR_HANDLE) is a
+ * pointer with the same type as the value, it may not be directly
+ * dereferenced and must be treated as an opaque identifier.
+ *
+ * Lcore variable handles and value pointers may be freely passed
+ * between different threads.
+ *
+ * @b Storage
+ *
+ * An lcore variable's values may be of a primitive type like @c int,
+ * but would more typically be a @c struct.
+ *
+ * The lcore variable handle introduces a per-variable (not
+ * per-value/per-lcore id) overhead of @c sizeof(void *) bytes, so
+ * there are some memory footprint gains to be made by organizing all
+ * per-lcore id data for a particular module as one lcore variable
+ * (e.g., as a struct).
+ *
+ * An application may define an lcore variable handle without ever
+ * allocating it.
+ *
+ * The size of an lcore variable's value must be less than the DPDK
+ * build-time constant @c RTE_MAX_LCORE_VAR.
+ *
+ * Lcore variables are stored in a series of lcore buffers, which are
+ * allocated from the libc heap. Heap allocation failures are treated
+ * as fatal.
+ *
+ * Lcore variables should generally *not* be @ref __rte_cache_aligned
+ * and need *not* include a @ref RTE_CACHE_GUARD field, since these
+ * constructs are designed to avoid false sharing. In the
+ * case of an lcore variable instance, the thread most recently
+ * accessing nearby data structures should almost-always be the lcore
+ * variable's owner. Adding padding will increase the effective memory
+ * working set size, potentially reducing performance.
+ *
+ * Lcore variable values are initialized to zero by default.
+ *
+ * Lcore variables are not stored in huge page memory.
+ *
+ * @b Example
+ *
+ * Below is an example of the use of an lcore variable:
+ *
+ * @code{.c}
+ * struct foo_lcore_state {
+ *         int a;
+ *         long b;
+ * };
+ *
+ * static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+ *
+ * long foo_get_a_plus_b(void)
+ * {
+ *         struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);
+ *
+ *         return state->a + state->b;
+ * }
+ *
+ * RTE_INIT(rte_foo_init)
+ * {
+ *         RTE_LCORE_VAR_ALLOC(lcore_states);
+ *
+ *         unsigned int lcore_id;
+ *         struct foo_lcore_state *state;
+ *         RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
+ *                 (initialize 'state')
+ *         }
+ *
+ *         (other initialization)
+ * }
+ * @endcode
+ *
+ *
+ * @b Alternatives
+ *
+ * Lcore variables are designed to replace a pattern exemplified below:
+ * @code{.c}
+ * struct __rte_cache_aligned foo_lcore_state {
+ *         int a;
+ *         long b;
+ *         RTE_CACHE_GUARD;
+ * };
+ *
+ * static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+ * @endcode
+ *
+ * This scheme is simple and effective, but has one drawback: the data
+ * is organized so that objects related to all lcores for a particular
+ * module are kept close in memory. At a bare minimum, this requires
+ * sizing data structures (e.g., using `__rte_cache_aligned`) to an
+ * even number of cache lines to avoid false sharing. With CPU
+ * hardware prefetching and memory loads resulting from speculative
+ * execution (functions which seemingly are getting more eager faster
+ * than they are getting more intelligent), one or more "guard" cache
+ * lines may be required to separate one lcore's data from another's
+ * and prevent false sharing.
+ *
+ * Lcore variables offer the advantage of working with, rather than
+ * against, the CPU's assumptions. A next-line hardware prefetcher,
+ * for example, may function as intended (i.e., to the benefit, not
+ * detriment, of system performance).
+ *
+ * Another alternative to @ref rte_lcore_var.h is the @ref
+ * rte_per_lcore.h API, which makes use of thread-local storage (TLS,
+ * e.g., GCC __thread or C11 _Thread_local). The main differences
+ * between using the various forms of TLS (e.g., @ref
+ * RTE_DEFINE_PER_LCORE or _Thread_local) and the use of lcore
+ * variables are:
+ *
+ *   * The lifecycle of a thread-local variable instance is tied to
+ *     that of the thread. The data cannot be accessed before the
+ *     thread has been created, nor after it has exited. As a result,
+ *     thread-local variables must be initialized in a "lazy" manner
+ *     (e.g., at the point of thread creation). Lcore variables may be
+ *     accessed immediately after having been allocated (which may occur
+ *     before any thread beyond the main thread is running).
+ *   * A thread-local variable is duplicated across all threads in the
+ *     process, including unregistered non-EAL threads (i.e.,
+ *     "regular" threads). For DPDK applications heavily relying on
+ *     multi-threading (in conjunction to DPDK's "one thread per core"
+ *     pattern), either by having many concurrent threads or
+ *     creating/destroying threads at a high rate, an excessive use of
+ *     thread-local variables may cause inefficiencies (e.g.,
+ *     increased thread creation overhead due to thread-local storage
+ *     initialization or increased total RAM footprint usage). Lcore
+ *     variables *only* exist for threads with an lcore id.
+ *   * Whether data in thread-local storage may be shared between
+ *     threads (i.e., whether a pointer to a thread-local variable can
+ *     be passed to and dereferenced by a non-owning thread) depends on
+ *     the specifics of the TLS implementation. With GCC __thread and
+ *     GCC _Thread_local, data sharing between threads is supported.
+ *     In the C11 standard, accessing another thread's _Thread_local
+ *     object is implementation-defined. Lcore variable instances may
+ *     be accessed reliably by any thread.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * This macro clarifies that the declaration is an lcore handle, not a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR(handle)				\
+	RTE_LCORE_VAR_LCORE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to the
+ *   lcore id of every valid lcore id (up to @c RTE_MAX_LCORE).
+ * @param value
+ *   A pointer variable successively set to point to lcore variable
+ *   value instance of the current lcore id being processed.
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH(lcore_id, value, handle)			\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR or @ref RTE_LCORE_VAR_LCORE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * The allocated memory cannot be freed.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+	__rte_alloc_size(1) __rte_alloc_align(2);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 54577b7718..d0f27315b9 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -45,6 +45,7 @@
 #include <telemetry_internal.h>
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -1371,6 +1372,7 @@ rte_eal_cleanup(void)
 	rte_eal_malloc_heap_cleanup();
 	eal_cleanup_config(internal_conf);
 	rte_eal_log_cleanup();
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/version.map b/lib/eal/version.map
index f493cd1ca7..94dc5b17d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -399,6 +399,7 @@ EXPERIMENTAL {
 
 	# added in 24.11
 	rte_bitset_to_str;
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
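
A minimal sketch of the access rules documented in <rte_lcore_var.h>
above, not part of the patch itself and assuming DPDK's
<rte_stdatomic.h> wrappers: a per-lcore packet counter which the
owning thread updates frequently and which any thread, including
unregistered ones, may aggregate using atomic loads.

#include <stdint.h>

#include <rte_lcore.h>
#include <rte_lcore_var.h>
#include <rte_stdatomic.h>

struct pkt_stats {
	RTE_ATOMIC(uint64_t) rx_pkts;
};

static RTE_LCORE_VAR_HANDLE(struct pkt_stats, pkt_stats);

RTE_LCORE_VAR_INIT(pkt_stats);

/* called only by the owning (lcore id-equipped) thread */
static void
stats_add_rx(uint64_t n)
{
	struct pkt_stats *stats = RTE_LCORE_VAR(pkt_stats);

	rte_atomic_fetch_add_explicit(&stats->rx_pkts, n,
				      rte_memory_order_relaxed);
}

/* may be called from any thread */
static uint64_t
stats_total_rx(void)
{
	unsigned int lcore_id;
	struct pkt_stats *stats;
	uint64_t total = 0;

	RTE_LCORE_VAR_FOREACH(lcore_id, stats, pkt_stats)
		total += rte_atomic_load_explicit(&stats->rx_pkts,
						  rte_memory_order_relaxed);

	return total;
}

Since each instance is only ever written by its owner, relaxed memory
ordering suffices for a counter like this one.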

* [PATCH v15 2/7] eal: add lcore variable functional tests
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-17  5:57                                                                                 ` Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 3/7] eal: add lcore variable performance test Mattias Rönnblom
                                                                                                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:57 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index fe248b786c..9060cfeb7a 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..ddf70b03a0
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE(lcore_id, test_int) = state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray = RTE_LCORE_VAR_LCORE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE(lcore_id, handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
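
The functional tests above allocate some variables via RTE_INIT-backed
RTE_LCORE_VAR_INIT() and others explicitly inside the test cases. A
related pattern permitted by the API's guarantee that an allocated
handle is never NULL, sketched below and not taken from the patch set,
is lazy allocation on first use. Since rte_lcore_var_alloc() is not
multi-thread safe, the first call must not race with others (e.g., it
must occur during single-threaded initialization).

#include <rte_lcore_var.h>

struct foo_state {
	int a;
	long b;
};

static RTE_LCORE_VAR_HANDLE(struct foo_state, foo_states);

static struct foo_state *
foo_get_state(void)
{
	/* a NULL handle means "not yet allocated" */
	if (foo_states == NULL)
		RTE_LCORE_VAR_ALLOC(foo_states);

	return RTE_LCORE_VAR(foo_states);
}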

* [PATCH v15 3/7] eal: add lcore variable performance test
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 2/7] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-10-17  5:57                                                                                 ` Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                                                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:57 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Chengwen Feng

Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v8:
 * Fix spelling. (Morten Brørup)

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic that static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 256 +++++++++++++++++++++++++++++++++
 2 files changed, 257 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 9060cfeb7a..cf4908de5a 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -105,6 +105,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..6d9869f873
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,256 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =	RTE_LCORE_VAR(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such, using <N> dummy
+ * modules, each with a small, per-lcore state. Note however that
+ * these tests have very little non-lcore/thread local state, which is
+ * unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
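
To make the cache-footprint argument concrete, here is a worked
example, not part of the patch, assuming a 64-byte cache line and
RTE_CACHE_GUARD_LINES set to 1. The 24-byte struct mod_lcore_state
used in the benchmark occupies two full cache lines per lcore in the
guarded static-array pattern:

struct __rte_cache_aligned mod_lcore_state_aligned {
	struct mod_lcore_state mod_state; /* 24 bytes, padded to 64 */

	RTE_CACHE_GUARD; /* one further 64-byte cache line */
};

/* sizeof(struct mod_lcore_state_aligned) == 128 bytes, against the 24
 * bytes per lcore id the same state occupies in an lcore variable. At
 * MAX_MODS (1024) modules, that is 128 KB against 24 KB of per-lcore
 * working set, which is what lets cache effects show up as the module
 * count grows.
 */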

* [PATCH v15 4/7] random: keep PRNG state in lcore variable
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
                                                                                                   ` (2 preceding siblings ...)
  2024-10-17  5:57                                                                                 ` [PATCH v15 3/7] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-17  5:57                                                                                 ` Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 5/7] power: keep per-lcore " Mattias Rönnblom
                                                                                                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:57 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..cf0756f26a 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread
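
The conversion above follows a general shape, sketched here rather
than taken verbatim from the patch: per-lcore state moves into an
lcore variable, while a single separate static instance serves all
unregistered non-EAL threads, which have no lcore id of their own.

#include <stdint.h>

#include <rte_branch_prediction.h>
#include <rte_lcore.h>
#include <rte_lcore_var.h>

struct state {
	uint64_t counter;
};

static RTE_LCORE_VAR_HANDLE(struct state, lcore_states);

RTE_LCORE_VAR_INIT(lcore_states);

/* shared fallback instance; concurrent access by multiple
 * unregistered threads requires external synchronization */
static struct state unregistered_state;

static struct state *
get_state(void)
{
	if (unlikely(rte_lcore_id() == LCORE_ID_ANY))
		return &unregistered_state;

	return RTE_LCORE_VAR(lcore_states);
}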

* [PATCH v15 5/7] power: keep per-lcore state in lcore variable
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
                                                                                                   ` (3 preceding siblings ...)
  2024-10-17  5:57                                                                                 ` [PATCH v15 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-10-17  5:57                                                                                 ` Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 6/7] service: " Mattias Rönnblom
                                                                                                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:57 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 830a6c7a97..4bab2d5108 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -519,7 +517,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -620,7 +618,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -770,21 +768,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v15 6/7] service: keep per-lcore state in lcore variable
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
                                                                                                   ` (4 preceding siblings ...)
  2024-10-17  5:57                                                                                 ` [PATCH v15 5/7] power: keep per-lcore " Mattias Rönnblom
@ 2024-10-17  5:57                                                                                 ` Mattias Rönnblom
  2024-10-17  5:57                                                                                 ` [PATCH v15 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
                                                                                                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:57 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Replace the static array of cache-aligned structs with an lcore
variable, slightly benefiting code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v14:
 * Merge with bitset-related changes.

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 116 ++++++++++++++++++++---------------
 1 file changed, 65 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 324471e897..dad3150df9 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_bitset.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
@@ -78,7 +79,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -99,12 +100,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -120,7 +117,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -134,7 +130,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -284,7 +279,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -292,9 +286,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		rte_bitset_clear(lcore_states[i].mapped_services, id);
+	RTE_LCORE_VAR_FOREACH(lcore_id, cs, lcore_states)
+		rte_bitset_clear(cs->mapped_services, id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -463,7 +459,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (rte_bitset_test(lcore_states[ids[i]].service_active_on_lcore, id))
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(ids[i], lcore_states);
+
+		if (rte_bitset_test(cs->service_active_on_lcore, id))
 			return 1;
 	}
 
@@ -473,7 +472,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -496,8 +495,7 @@ static int32_t
 service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +531,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +547,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +569,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +586,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,28 +638,30 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	if (set) {
-		uint64_t lcore_mapped = rte_bitset_test(lcore_states[lcore].mapped_services, sid);
+		bool lcore_mapped = rte_bitset_test(cs->mapped_services, sid);
 
 		if (*set && !lcore_mapped) {
-			rte_bitset_set(lcore_states[lcore].mapped_services, sid);
+			rte_bitset_set(cs->mapped_services, sid);
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			rte_bitset_clear(lcore_states[lcore].mapped_services, sid);
+			rte_bitset_clear(cs->mapped_services, sid);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = rte_bitset_test(lcore_states[lcore].mapped_services, sid);
+		*enabled = rte_bitset_test(cs->mapped_services, sid);
 
 	return 0;
 }
@@ -683,13 +689,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -700,14 +707,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all mapped services */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			rte_bitset_clear_all(lcore_states[i].mapped_services, RTE_SERVICE_NUM_MAX);
+		struct core_state *cs =	RTE_LCORE_VAR_LCORE(i, lcore_states);
+
+		if (cs->is_service_core) {
+			rte_bitset_clear_all(cs->mapped_services, RTE_SERVICE_NUM_MAX);
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -723,17 +732,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	rte_bitset_clear_all(lcore_states[lcore].mapped_services, RTE_SERVICE_NUM_MAX);
+	rte_bitset_clear_all(cs->mapped_services, RTE_SERVICE_NUM_MAX);
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -745,7 +756,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -769,7 +780,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -799,6 +810,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -806,12 +819,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
 		bool enabled = rte_bitset_test(cs->mapped_services, i);
@@ -831,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -842,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -850,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -858,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -885,7 +897,7 @@ lcore_attr_get_service_error_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -901,7 +913,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -963,12 +978,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -993,7 +1007,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -1004,12 +1019,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1044,7 +1058,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v15 7/7] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
                                                                                                   ` (5 preceding siblings ...)
  2024-10-17  5:57                                                                                 ` [PATCH v15 6/7] service: " Mattias Rönnblom
@ 2024-10-17  5:57                                                                                 ` Mattias Rönnblom
  2024-10-18 15:37                                                                                 ` [PATCH v15 0/7] Lcore variables Thomas Monjalon
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
  8 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-17  5:57 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Mattias Rönnblom, Konstantin Ananyev, Chengwen Feng

Keep per-lcore power intrinsics state in an lcore variable to reduce
the cache working set size and avoid CPU next-line prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..98a2cbc611 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v15 0/7] Lcore variables
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
                                                                                                   ` (6 preceding siblings ...)
  2024-10-17  5:57                                                                                 ` [PATCH v15 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
@ 2024-10-18 15:37                                                                                 ` Thomas Monjalon
  2024-10-19  4:24                                                                                   ` Mattias Rönnblom
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
  8 siblings, 1 reply; 313+ messages in thread
From: Thomas Monjalon @ 2024-10-18 15:37 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic

17/10/2024 07:57, Mattias Rönnblom:
> Mattias Rönnblom (7):
>   eal: add static per-lcore memory allocation facility
>   eal: add lcore variable functional tests
>   eal: add lcore variable performance test
>   random: keep PRNG state in lcore variable
>   power: keep per-lcore state in lcore variable
>   service: keep per-lcore state in lcore variable
>   eal: keep per-lcore power intrinsics state in lcore variable

Would it be possible to have the doc reworded in RST
including an image of the layout?
I can help, don't hesitate to plan a meeting if needed.
If you could make a new version for this final touch,
we could merge it in 24.11-rc2.

Note: please reply to the cover letter of the first version
for the next one, it would reduce the indentation in the message list.



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v15 0/7] Lcore variables
  2024-10-18 15:37                                                                                 ` [PATCH v15 0/7] Lcore variables Thomas Monjalon
@ 2024-10-19  4:24                                                                                   ` Mattias Rönnblom
  2024-10-21  9:16                                                                                     ` Thomas Monjalon
  0 siblings, 1 reply; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-19  4:24 UTC (permalink / raw)
  To: Thomas Monjalon, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic

On 2024-10-18 17:37, Thomas Monjalon wrote:
> 17/10/2024 07:57, Mattias Rönnblom:
>> Mattias Rönnblom (7):
>>    eal: add static per-lcore memory allocation facility
>>    eal: add lcore variable functional tests
>>    eal: add lcore variable performance test
>>    random: keep PRNG state in lcore variable
>>    power: keep per-lcore state in lcore variable
>>    service: keep per-lcore state in lcore variable
>>    eal: keep per-lcore power intrinsics state in lcore variable
> 
> Would it be possible to have the doc reworded in RST
> including an image of the layout?

Sure. I'll submit a new version with a programmer's guide included mid 
next week. Would that work for RC2?

> I can help, don't hesitate to plan a meeting if needed.

Thanks! I don't think any meetings are required.

> If you could make a new version for this final touch,
> we could merge it in 24.11-rc2.
> 
> Note: please reply to the cover letter of the first version
> for the next one, it would reduce the indentation in the message list.
> 
> 

Can I find the message id of the cover letter on patchwork?



^ permalink raw reply	[flat|nested] 313+ messages in thread

* Re: [PATCH v15 0/7] Lcore variables
  2024-10-19  4:24                                                                                   ` Mattias Rönnblom
@ 2024-10-21  9:16                                                                                     ` Thomas Monjalon
  0 siblings, 0 replies; 313+ messages in thread
From: Thomas Monjalon @ 2024-10-21  9:16 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom
  Cc: dev, Morten Brørup, Stephen Hemminger, Konstantin Ananyev,
	David Marchand, Jerin Jacob, Luka Jankovic

19/10/2024 06:24, Mattias Rönnblom:
> On 2024-10-18 17:37, Thomas Monjalon wrote:
> > Note: please reply to the cover letter of the first version
> > for the next one, it would reduce the indentation in the message list.
> 
> Can I find the message id of the cover letter on patchwork?

Yes, by clicking on the "expand" button of a patch,
you can see the whole series, including the cover letter.
Then you click on the cover letter and you can see the "Message ID":
	20240910070344.699183-1-mattias.ronnblom@ericsson.com
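
The next version can then be threaded under that cover letter by
passing the ID to git send-email, for example (the patch file names
are only illustrative):

	git send-email --in-reply-to='20240910070344.699183-1-mattias.ronnblom@ericsson.com' v16-*.patch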




^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v16 0/8] Lcore variables
  2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
                                                                                                   ` (7 preceding siblings ...)
  2024-10-18 15:37                                                                                 ` [PATCH v15 0/7] Lcore variables Thomas Monjalon
@ 2024-10-23  7:52                                                                                 ` Mattias Rönnblom
  2024-10-23  7:52                                                                                   ` [PATCH v16 1/8] eal: add static per-lcore memory allocation facility Mattias Rönnblom
                                                                                                     ` (7 more replies)
  8 siblings, 8 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-23  7:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Thomas Monjalon, Mattias Rönnblom

This patch set introduces a new API <rte_lcore_var.h> for static
per-lcore id memory allocation.

Lcore variables are designed to replace static lcore id-indexed arrays
and thread-local storage.

Refer to the programmer's guide for a rationale and comparison with
alternatives.
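
As a rough sketch of the kind of replacement involved (the
foo_lcore_state struct and its fields are hypothetical; the macros
are those provided by <rte_lcore_var.h>):

/* Before: a static, cache-aligned, padded per-lcore array. */
struct __rte_cache_aligned foo_lcore_state {
	uint64_t count;
	RTE_CACHE_GUARD;
};

static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];

/* After: an lcore variable, requiring neither cache alignment nor
 * padding to avoid false sharing.
 */
struct foo_lcore_state {
	uint64_t count;
};

static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);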

Mattias Rönnblom (8):
  eal: add static per-lcore memory allocation facility
  eal: add lcore variable functional tests
  eal: add lcore variable performance test
  eal: add lcore variables' programmer's guide
  random: keep PRNG state in lcore variable
  power: keep per-lcore state in lcore variable
  service: keep per-lcore state in lcore variable
  eal: keep per-lcore power intrinsics state in lcore variable

 MAINTAINERS                                   |   6 +
 app/test/meson.build                          |   2 +
 app/test/test_lcore_var.c                     | 432 ++++++++++++++
 app/test/test_lcore_var_perf.c                | 256 ++++++++
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +-
 .../prog_guide/img/lcore_var_mem_layout.svg   | 310 ++++++++++
 .../img/static_array_mem_layout.svg           | 278 +++++++++
 doc/guides/prog_guide/index.rst               |   1 +
 doc/guides/prog_guide/lcore_var.rst           | 548 ++++++++++++++++++
 doc/guides/rel_notes/release_24_11.rst        |  14 +
 lib/eal/common/eal_common_lcore_var.c         | 112 ++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/common/rte_random.c                   |  28 +-
 lib/eal/common/rte_service.c                  | 116 ++--
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 207 +++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   1 +
 lib/eal/x86/rte_power_intrinsics.c            |  17 +-
 lib/power/rte_power_pmd_mgmt.c                |  35 +-
 24 files changed, 2333 insertions(+), 92 deletions(-)
 create mode 100644 app/test/test_lcore_var.c
 create mode 100644 app/test/test_lcore_var_perf.c
 create mode 100644 doc/guides/prog_guide/img/lcore_var_mem_layout.svg
 create mode 100644 doc/guides/prog_guide/img/static_array_mem_layout.svg
 create mode 100644 doc/guides/prog_guide/lcore_var.rst
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v16 1/8] eal: add static per-lcore memory allocation facility
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
@ 2024-10-23  7:52                                                                                   ` Mattias Rönnblom
  2024-10-23  7:52                                                                                   ` [PATCH v16 2/8] eal: add lcore variable functional tests Mattias Rönnblom
                                                                                                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-23  7:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Thomas Monjalon, Mattias Rönnblom, Konstantin Ananyev,
	Chengwen Feng

Introduce DPDK per-lcore id variables, or lcore variables for short.

An lcore variable has one value for every current and future lcore
id-equipped thread.

The primary <rte_lcore_var.h> use case is for statically allocating
small, frequently-accessed data structures, for which one instance
should exist for each lcore.

Lcore variables are similar to thread-local storage (TLS, e.g., C11
_Thread_local), but decouple the values' lifetime from that of the
threads.

Lcore variables are also similar in functionality to the FreeBSD
kernel's DPCPU_*() family of macros and the associated build-time
machinery. DPCPU uses linker scripts, which effectively prevents the
reuse of its otherwise seemingly viable approach.

The currently-prevailing way to solve the same problem as lcore
variables is to keep a module's per-lcore data as an
RTE_MAX_LCORE-sized array of cache-aligned, RTE_CACHE_GUARDed
structs. The benefit of lcore variables over this approach is that
data related to the same lcore is now close (spatially, in memory),
rather than data used by the same module. This in turn avoids
excessive use of padding and keeps caches from being polluted with
unused data.
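
As an illustration, a minimal usage sketch of the API added by this
patch (the foo module and its fields are made up for the example):

#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore_var.h>

struct foo_lcore_state {
	uint64_t poll_count;
};

static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);

RTE_INIT(foo_init)
{
	/* Allocate RTE_MAX_LCORE zeroed value instances and
	 * initialize the handle.
	 */
	RTE_LCORE_VAR_ALLOC(lcore_states);
}

static void
foo_poll(void)
{
	/* Pointer to the value instance of the calling thread. */
	struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);

	state->poll_count++;
}

static uint64_t
foo_total_polls(void)
{
	unsigned int lcore_id;
	struct foo_lcore_state *state;
	uint64_t total = 0;

	/* Iterate over all RTE_MAX_LCORE value instances. */
	RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states)
		total += state->poll_count;

	return total;
}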

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v16:
 * Move implementation overview-type information to the programmer's
   guide.

PATCH v15:
 * Add alignment-related compiler hint. (Stephen Hemminger)
 * Have size-related compiler hint point toward the right function
   argument. (Stephen Hemminger)

PATCH v14:
 * Add note in rte_lcore_var_alloc() that the memory cannot be freed.
   (Stephen Hemminger)
 * Hint to the compiler that rte_lcore_var_alloc() is a memory
   allocation facility. (Stephen Hemminger)

PATCH v13:
 * Remove _VALUE() suffix from value lookup and iterator macros.
   (Morten Brørup and Thomas Monjalon)
 * Remove the _ptr() suffix from the value lookup function.

PATCH v12:
 * Replace RTE_ASSERT() with RTE_VERIFY(), since performance is not
   a concern. (Morten Brørup)
 * Fix issue (introduced in v11) where aligned_malloc() was provided
   an object size which wasn't an even multiple of the alignment.
   (Stephen Hemminger)

PATCH v11:
 * Add a note in the API docs on lcore variables and huge page memory.
   (Stephen Hemminger)
 * Free lcore var buffers at EAL cleanup. (Thomas Monjalon)
 * Tweak naming and include a short overview of lcore var buffer
   use in eal_common_lcore_var.c.

PATCH v10:
 * Improve documentation grammar and spelling. (Stephen Hemminger,
   Thomas Monjalon)
 * Add version.map DPDK version comment. (Thomas Monjalon)

PATCH v9:
 * Fixed merge conflicts in release notes.

PATCH v8:
 * Work around missing max_align_t definition in MSVC. (Morten Brørup)

PATCH v7:
 * Add () to the FOREACH lcore id macro parameter, to allow an
   arbitrary expression, not just a simple variable name, to be
   passed. (Konstantin Ananyev)

PATCH v6:
 * Have API user provide the loop variable in the FOREACH macro, to
   avoid subtle bugs where the loop variable name clashes with some
   other user-defined variable. (Konstantin Ananyev)

PATCH v5:
 * Update EAL programming guide.

PATCH v2:
 * Add Windows support. (Morten Brørup)
 * Fix lcore variables API index reference. (Morten Brørup)
 * Various improvements of the API documentation. (Morten Brørup)
 * Elimination of unused symbol in version.map. (Morten Brørup)

PATCH:
 * Update MAINTAINERS and release notes.
 * Stop covering included files in extern "C" {}.

RFC v6:
 * Include <stdlib.h> to get aligned_alloc().
 * Tweak documentation (grammar).
 * Provide API-level guarantees that lcore variable values take on an
   initial value of zero.
 * Fix misplaced __rte_cache_aligned in the API doc example.

RFC v5:
 * In Doxygen, consistently use @<cmd> (and not \<cmd>).
 * The RTE_LCORE_VAR_GET() and SET() convenience access macros
   covered an uncommon use case, where the lcore value is of a
   primitive type, rather than a struct, and are thus eliminated
   from the API. (Morten Brørup)
 * In the wake of the GET()/SET() removal, rename RTE_LCORE_VAR_PTR()
   to RTE_LCORE_VAR_VALUE().
 * The underscores are removed from __rte_lcore_var_lcore_ptr() to
   signal that this function is a part of the public API.
 * Macro arguments are documented.

RFC v4:
 * Replace large static array with libc heap-allocated memory. One
   implication of this change is that there no longer exists a fixed
   upper bound for the total amount of memory used by lcore variables.
   RTE_MAX_LCORE_VAR has changed meaning, and now represents the
   maximum size of any individual lcore variable value.
 * Fix issues in example. (Morten Brørup)
 * Improve access macro type checking. (Morten Brørup)
 * Refer to the lcore variable handle as "handle" and not "name" in
   various macros.
 * Document lack of thread safety in rte_lcore_var_alloc().
 * Provide API-level assurance the lcore variable handle is
   always non-NULL, to allow applications to use NULL to mean
   "not yet allocated".
 * Note zero-sized allocations are not allowed.
 * Give API-level guarantee the lcore variable values are zeroed.

RFC v3:
 * Replace use of GCC-specific alignof(<expression>) with alignof(<type>).
 * Update example to reflect FOREACH macro name change (in RFC v2).

RFC v2:
 * Use alignof to derive alignment requirements. (Morten Brørup)
 * Change name of FOREACH to make it distinct from <rte_lcore.h>'s
   *per-EAL-thread* RTE_LCORE_FOREACH(). (Morten Brørup)
 * Allow user-specified alignment, but limit max to cache line size.
---
 MAINTAINERS                                   |   6 +
 config/rte_config.h                           |   1 +
 doc/api/doxy-api-index.md                     |   1 +
 .../prog_guide/env_abstraction_layer.rst      |  43 +++-
 doc/guides/rel_notes/release_24_11.rst        |  14 ++
 lib/eal/common/eal_common_lcore_var.c         | 112 ++++++++++
 lib/eal/common/eal_lcore_var.h                |  11 +
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   2 +
 lib/eal/include/meson.build                   |   1 +
 lib/eal/include/rte_lcore_var.h               | 207 ++++++++++++++++++
 lib/eal/linux/eal.c                           |   2 +
 lib/eal/version.map                           |   1 +
 13 files changed, 396 insertions(+), 6 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_var.c
 create mode 100644 lib/eal/common/eal_lcore_var.h
 create mode 100644 lib/eal/include/rte_lcore_var.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 6ea7850093..557474a38b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -289,6 +289,12 @@ F: lib/eal/include/rte_random.h
 F: lib/eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+Lcore Variables
+M: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
+F: lib/eal/include/rte_lcore_var.h
+F: lib/eal/common/eal_common_lcore_var.c
+F: app/test/test_lcore_var.c
+
 ARM v7
 M: Wathsala Vithanage <wathsala.vithanage@arm.com>
 F: config/arm/
diff --git a/config/rte_config.h b/config/rte_config.h
index fd6f8a2f1a..498d509244 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -41,6 +41,7 @@
 /* EAL defines */
 #define RTE_CACHE_GUARD_LINES 1
 #define RTE_MAX_HEAPS 32
+#define RTE_MAX_LCORE_VAR 1048576
 #define RTE_MAX_MEMSEG_LISTS 128
 #define RTE_MAX_MEMSEG_PER_LIST 8192
 #define RTE_MAX_MEM_MB_PER_LIST 32768
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 266c8b90dc..1d472c6ceb 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -99,6 +99,7 @@ The public API headers are grouped by topics:
   [interrupts](@ref rte_interrupts.h),
   [launch](@ref rte_launch.h),
   [lcore](@ref rte_lcore.h),
+  [lcore variables](@ref rte_lcore_var.h),
   [per-lcore](@ref rte_per_lcore.h),
   [service cores](@ref rte_service.h),
   [keepalive](@ref rte_keepalive.h),
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index b9fac1839d..b659a1d085 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -429,12 +429,43 @@ with them once they're registered.
 Per-lcore and Shared Variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
-
-    lcore refers to a logical execution unit of the processor, sometimes called a hardware *thread*.
-
-Shared variables are the default behavior.
-Per-lcore variables are implemented using *Thread Local Storage* (TLS) to provide per-thread local storage.
+By default, static variables, memory blocks allocated on the DPDK
+heap, and other types of memory are shared by all DPDK threads.
+
+An application, a DPDK library, or a PMD may opt to keep per-thread state.
+
+Per-thread data can be maintained using either *lcore variables* (see
+``rte_lcore_var.h``), *thread-local storage (TLS)* (see
+``rte_per_lcore.h``), or a static array of ``RTE_MAX_LCORE`` elements,
+indexed by ``rte_lcore_id()``. These methods allow per-lcore data to be
+largely internal to the module and not directly exposed in its
+API. Another approach is to explicitly handle per-thread aspects in
+the API (e.g., the ports in the Eventdev API).
+
+Lcore variables are suitable for small objects that are statically
+allocated at the time of module or application initialization. An
+lcore variable takes on one value for each lcore ID-equipped thread
+(i.e., for both EAL threads and registered non-EAL threads, in total
+``RTE_MAX_LCORE`` instances). The lifetime of lcore variables is
+independent of the owning threads and can, therefore, be initialized
+before the threads are created.
+
+Variables with thread-local storage are allocated when the thread is
+created and exist until the thread terminates. These are applicable
+for every thread in the process. Only very small objects should be
+allocated in TLS, as large TLS objects can significantly slow down
+thread creation and may unnecessarily increase the memory footprint of
+applications that extensively use unregistered threads.
+
+A common but now largely obsolete DPDK pattern is to use a static
+array sized according to the maximum number of lcore ID-equipped
+threads (i.e., with ``RTE_MAX_LCORE`` elements). To avoid *false
+sharing*, each element must be both cache-aligned and include an
+``RTE_CACHE_GUARD``. This extensive use of padding causes internal
+fragmentation (i.e., unused space) and reduces cache hit rates.
+
+For more discussions on per-lcore state, refer to the
+``rte_lcore_var.h`` API documentation.
 
 Logs
 ~~~~
diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index fa4822d928..18f2f37944 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -247,6 +247,20 @@ New Features
   Added ability for node to advertise and update multiple xstat counters,
   that can be retrieved using ``rte_graph_cluster_stats_get``.
 
+* **Added EAL per-lcore static memory allocation facility.**
+
+    Added EAL API <rte_lcore_var.h> for statically allocating small,
+    frequently-accessed data structures, for which one instance should
+    exist for each EAL thread and registered non-EAL thread.
+
+    With lcore variables, data is organized spatially on a per-lcore id
+    basis, rather than per library or PMD, avoiding the need for cache
+    aligning (or RTE_CACHE_GUARDing) data structures, which in turn
+    reduces CPU cache internal fragmentation, improving performance.
+
+    Lcore variables are similar to thread-local storage (TLS, e.g.,
+    C11 _Thread_local), but decouple the values' lifetime from that
+    of the threads.
 
 Removed Items
 -------------
diff --git a/lib/eal/common/eal_common_lcore_var.c b/lib/eal/common/eal_common_lcore_var.c
new file mode 100644
index 0000000000..3b0e0b89f7
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_var.c
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+
+#ifdef RTE_EXEC_ENV_WINDOWS
+#include <malloc.h>
+#endif
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_log.h>
+
+#include <rte_lcore_var.h>
+
+#include "eal_private.h"
+#include "eal_lcore_var.h"
+
+/*
+ * Refer to the programmer's guide for an overview of the lcore
+ * variables implementation.
+ */
+
+struct lcore_var_buffer {
+	char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
+	struct lcore_var_buffer *prev;
+};
+
+static struct lcore_var_buffer *current_buffer;
+
+/* initialized to trigger buffer allocation on first allocation */
+static size_t offset = RTE_MAX_LCORE_VAR;
+
+static void *
+lcore_var_alloc(size_t size, size_t align)
+{
+	void *handle;
+	unsigned int lcore_id;
+	void *value;
+
+	offset = RTE_ALIGN_CEIL(offset, align);
+
+	if (offset + size > RTE_MAX_LCORE_VAR) {
+		struct lcore_var_buffer *prev = current_buffer;
+		size_t alloc_size =
+			RTE_ALIGN_CEIL(sizeof(struct lcore_var_buffer),
+				       RTE_CACHE_LINE_SIZE);
+#ifdef RTE_EXEC_ENV_WINDOWS
+		current_buffer = _aligned_malloc(alloc_size, RTE_CACHE_LINE_SIZE);
+#else
+		current_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE, alloc_size);
+
+#endif
+		RTE_VERIFY(current_buffer != NULL);
+
+		current_buffer->prev = prev;
+
+		offset = 0;
+	}
+
+	handle = &current_buffer->data[offset];
+
+	offset += size;
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, value, handle)
+		memset(value, 0, size);
+
+	EAL_LOG(DEBUG, "Allocated %"PRIuPTR" bytes of per-lcore data with a "
+		"%"PRIuPTR"-byte alignment", size, align);
+
+	return handle;
+}
+
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+{
+	/* Having the per-lcore buffer size aligned on cache lines,
+	 * as well as having the base pointer cache-line aligned,
+	 * ensures that aligned offsets also translate to aligned
+	 * pointers across all values.
+	 */
+	RTE_BUILD_BUG_ON(RTE_MAX_LCORE_VAR % RTE_CACHE_LINE_SIZE != 0);
+	RTE_VERIFY(align <= RTE_CACHE_LINE_SIZE);
+	RTE_VERIFY(size <= RTE_MAX_LCORE_VAR);
+
+	/* '0' means asking for worst-case alignment requirements */
+	if (align == 0)
+#ifdef RTE_TOOLCHAIN_MSVC
+		/* MSVC <stddef.h> is missing the max_align_t typedef */
+		align = alignof(double);
+#else
+		align = alignof(max_align_t);
+#endif
+
+	RTE_VERIFY(rte_is_power_of_2(align));
+
+	return lcore_var_alloc(size, align);
+}
+
+void
+eal_lcore_var_cleanup(void)
+{
+	while (current_buffer != NULL) {
+		struct lcore_var_buffer *prev = current_buffer->prev;
+
+		free(current_buffer);
+
+		current_buffer = prev;
+	}
+}
diff --git a/lib/eal/common/eal_lcore_var.h b/lib/eal/common/eal_lcore_var.h
new file mode 100644
index 0000000000..de2c4e44a0
--- /dev/null
+++ b/lib/eal/common/eal_lcore_var.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2024 Ericsson AB.
+ */
+
+#ifndef EAL_LCORE_VAR_H
+#define EAL_LCORE_VAR_H
+
+void
+eal_lcore_var_cleanup(void);
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index c1bbf26654..e273745e93 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -18,6 +18,7 @@ sources += files(
         'eal_common_interrupts.c',
         'eal_common_launch.c',
         'eal_common_lcore.c',
+        'eal_common_lcore_var.c',
         'eal_common_mcfg.c',
         'eal_common_memalloc.c',
         'eal_common_memory.c',
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index 1229230063..796c9dbf2d 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -47,6 +47,7 @@
 
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -941,6 +942,7 @@ rte_eal_cleanup(void)
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_cleanup_config(internal_conf);
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index 474097f211..d903577caa 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -28,6 +28,7 @@ headers += files(
         'rte_keepalive.h',
         'rte_launch.h',
         'rte_lcore.h',
+        'rte_lcore_var.h',
         'rte_lock_annotations.h',
         'rte_malloc.h',
         'rte_mcslock.h',
diff --git a/lib/eal/include/rte_lcore_var.h b/lib/eal/include/rte_lcore_var.h
new file mode 100644
index 0000000000..ea8b61cf7d
--- /dev/null
+++ b/lib/eal/include/rte_lcore_var.h
@@ -0,0 +1,207 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#ifndef _RTE_LCORE_VAR_H_
+#define _RTE_LCORE_VAR_H_
+
+/**
+ * @file
+ *
+ * Lcore variables
+ *
+ * This API provides a mechanism to create and access per-lcore id
+ * variables in a space- and cycle-efficient manner.
+ *
+ * Please refer to the lcore variables' programmer's guide for an
+ * overview of this API and its implementation.
+ */
+
+#include <stddef.h>
+#include <stdalign.h>
+
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_lcore.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Given the lcore variable type, produces the type of the lcore
+ * variable handle.
+ */
+#define RTE_LCORE_VAR_HANDLE_TYPE(type)		\
+	type *
+
+/**
+ * Define an lcore variable handle.
+ *
+ * This macro defines a variable which is used as a handle to access
+ * the various instances of a per-lcore id variable.
+ *
+ * This macro clarifies that the declaration is an lcore handle, not a
+ * regular pointer.
+ *
+ * Add @b static as a prefix in case the lcore variable is only to be
+ * accessed from a particular translation unit.
+ */
+#define RTE_LCORE_VAR_HANDLE(type, name)	\
+	RTE_LCORE_VAR_HANDLE_TYPE(type) name
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align)	\
+	handle = rte_lcore_var_alloc(size, align)
+
+/**
+ * Allocate space for an lcore variable, and initialize its handle,
+ * with values aligned for any type of object.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC_SIZE(handle, size)	\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, 0)
+
+/**
+ * Allocate space for an lcore variable of the size and alignment requirements
+ * suggested by the handle pointer type, and initialize its handle.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_ALLOC(handle)					\
+	RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, sizeof(*(handle)),	\
+				       alignof(typeof(*(handle))))
+
+/**
+ * Allocate an explicitly-sized, explicitly-aligned lcore variable by
+ * means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, align)		\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(name, size, align);	\
+	}
+
+/**
+ * Allocate an explicitly-sized lcore variable by means of a @ref
+ * RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT_SIZE(name, size)		\
+	RTE_LCORE_VAR_INIT_SIZE_ALIGN(name, size, 0)
+
+/**
+ * Allocate an lcore variable by means of a @ref RTE_INIT constructor.
+ *
+ * The values of the lcore variable are initialized to zero.
+ */
+#define RTE_LCORE_VAR_INIT(name)					\
+	RTE_INIT(rte_lcore_var_init_ ## name)				\
+	{								\
+		RTE_LCORE_VAR_ALLOC(name);				\
+	}
+
+/**
+ * Get void pointer to lcore variable instance with the specified
+ * lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+static inline void *
+rte_lcore_var_lcore(unsigned int lcore_id, void *handle)
+{
+	return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+}
+
+/**
+ * Get pointer to lcore variable instance with the specified lcore id.
+ *
+ * @param lcore_id
+ *   The lcore id specifying which of the @c RTE_MAX_LCORE value
+ *   instances should be accessed. The lcore id need not be valid
+ *   (e.g., may be @ref LCORE_ID_ANY), but in such a case, the pointer
+ *   is also not valid (and thus should not be dereferenced).
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_LCORE(lcore_id, handle)			\
+	((typeof(handle))rte_lcore_var_lcore(lcore_id, handle))
+
+/**
+ * Get pointer to lcore variable instance of the current thread.
+ *
+ * May only be used by EAL threads and registered non-EAL threads.
+ */
+#define RTE_LCORE_VAR(handle)				\
+	RTE_LCORE_VAR_LCORE(rte_lcore_id(), handle)
+
+/**
+ * Iterate over each lcore id's value for an lcore variable.
+ *
+ * @param lcore_id
+ *   An <code>unsigned int</code> variable successively set to every
+ *   valid lcore id (up to @c RTE_MAX_LCORE).
+ * @param value
+ *   A pointer variable successively set to point to lcore variable
+ *   value instance of the current lcore id being processed.
+ * @param handle
+ *   The lcore variable handle.
+ */
+#define RTE_LCORE_VAR_FOREACH(lcore_id, value, handle)			\
+	for ((lcore_id) =						\
+		     (((value) = RTE_LCORE_VAR_LCORE(0, handle)), 0); \
+	     (lcore_id) < RTE_MAX_LCORE;				\
+	     (lcore_id)++, (value) = RTE_LCORE_VAR_LCORE(lcore_id, \
+							       handle))
+
+/**
+ * Allocate space in the per-lcore id buffers for an lcore variable.
+ *
+ * The pointer returned is only an opaque identifier of the variable. To
+ * get an actual pointer to a particular instance of the variable use
+ * @ref RTE_LCORE_VAR or @ref RTE_LCORE_VAR_LCORE.
+ *
+ * The lcore variable values' memory is set to zero.
+ *
+ * The allocation is always successful, barring a fatal exhaustion of
+ * the per-lcore id buffer space.
+ *
+ * rte_lcore_var_alloc() is not multi-thread safe.
+ *
+ * The allocated memory cannot be freed.
+ *
+ * @param size
+ *   The size (in bytes) of the variable's per-lcore id value. Must be > 0.
+ * @param align
+ *   If 0, the values will be suitably aligned for any kind of type
+ *   (i.e., alignof(max_align_t)). Otherwise, the values will be aligned
+ *   on a multiple of *align*, which must be a power of 2 and equal or
+ *   less than @c RTE_CACHE_LINE_SIZE.
+ * @return
+ *   The variable's handle, stored in a void pointer value. The value
+ *   is always non-NULL.
+ */
+__rte_experimental
+void *
+rte_lcore_var_alloc(size_t size, size_t align)
+	__rte_alloc_size(1) __rte_alloc_align(2);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LCORE_VAR_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 54577b7718..d0f27315b9 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -45,6 +45,7 @@
 #include <telemetry_internal.h>
 #include "eal_private.h"
 #include "eal_thread.h"
+#include "eal_lcore_var.h"
 #include "eal_internal_cfg.h"
 #include "eal_filesystem.h"
 #include "eal_hugepages.h"
@@ -1371,6 +1372,7 @@ rte_eal_cleanup(void)
 	rte_eal_malloc_heap_cleanup();
 	eal_cleanup_config(internal_conf);
 	rte_eal_log_cleanup();
+	eal_lcore_var_cleanup();
 	return 0;
 }
 
diff --git a/lib/eal/version.map b/lib/eal/version.map
index f493cd1ca7..94dc5b17d6 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -399,6 +399,7 @@ EXPERIMENTAL {
 
 	# added in 24.11
 	rte_bitset_to_str;
+	rte_lcore_var_alloc;
 };
 
 INTERNAL {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v16 2/8] eal: add lcore variable functional tests
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
  2024-10-23  7:52                                                                                   ` [PATCH v16 1/8] eal: add static per-lcore memory allocation facility Mattias Rönnblom
@ 2024-10-23  7:52                                                                                   ` Mattias Rönnblom
  2024-10-23  7:52                                                                                   ` [PATCH v16 3/8] eal: add lcore variable performance test Mattias Rönnblom
                                                                                                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-23  7:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Thomas Monjalon, Mattias Rönnblom, Chengwen Feng

Add functional test suite to exercise the <rte_lcore_var.h> API.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocations to match new API.

RFC v5:
 * Adapt tests to reflect the removal of the GET() and SET() macros.

RFC v4:
 * Check all lcore id's values for all variables in the many variables
   test case.
 * Introduce test case for max-sized lcore variables.

RFC v2:
 * Improve alignment-related test coverage.
---
 app/test/meson.build      |   1 +
 app/test/test_lcore_var.c | 432 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 433 insertions(+)
 create mode 100644 app/test/test_lcore_var.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 0f7e11969a..7dccd197ac 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@ source_file_deps = {
     'test_ipsec_sad.c': ['ipsec'],
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
+    'test_lcore_var.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var.c b/app/test/test_lcore_var.c
new file mode 100644
index 0000000000..ddf70b03a0
--- /dev/null
+++ b/app/test/test_lcore_var.c
@@ -0,0 +1,432 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_launch.h>
+#include <rte_lcore_var.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define MIN_LCORES 2
+
+RTE_LCORE_VAR_HANDLE(int, test_int);
+RTE_LCORE_VAR_HANDLE(char, test_char);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized);
+RTE_LCORE_VAR_HANDLE(short, test_short);
+RTE_LCORE_VAR_HANDLE(long, test_long_sized_aligned);
+
+struct int_checker_state {
+	int old_value;
+	int new_value;
+	bool success;
+};
+
+static void
+rand_blk(void *blk, size_t size)
+{
+	size_t i;
+
+	for (i = 0; i < size; i++)
+		((unsigned char *)blk)[i] = (unsigned char)rte_rand();
+}
+
+static bool
+is_ptr_aligned(const void *ptr, size_t align)
+{
+	return ptr != NULL ? (uintptr_t)ptr % align == 0 : false;
+}
+
+static int
+check_int(void *arg)
+{
+	struct int_checker_state *state = arg;
+
+	int *ptr = RTE_LCORE_VAR(test_int);
+
+	bool naturally_aligned = is_ptr_aligned(ptr, sizeof(int));
+
+	bool equal = *(RTE_LCORE_VAR(test_int)) == state->old_value;
+
+	state->success = equal && naturally_aligned;
+
+	*ptr = state->new_value;
+
+	return 0;
+}
+
+RTE_LCORE_VAR_INIT(test_int);
+RTE_LCORE_VAR_INIT(test_char);
+RTE_LCORE_VAR_INIT_SIZE(test_long_sized, 32);
+RTE_LCORE_VAR_INIT(test_short);
+RTE_LCORE_VAR_INIT_SIZE_ALIGN(test_long_sized_aligned, sizeof(long),
+			      RTE_CACHE_LINE_SIZE);
+
+static int
+test_int_lvar(void)
+{
+	unsigned int lcore_id;
+
+	struct int_checker_state states[RTE_MAX_LCORE] = {};
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+
+		state->old_value = (int)rte_rand();
+		state->new_value = (int)rte_rand();
+
+		*RTE_LCORE_VAR_LCORE(lcore_id, test_int) = state->old_value;
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_int, &states[lcore_id], lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct int_checker_state *state = &states[lcore_id];
+		int value;
+
+		TEST_ASSERT(state->success, "Unexpected value "
+			    "encountered on lcore %d", lcore_id);
+
+		value = *RTE_LCORE_VAR_LCORE(lcore_id, test_int);
+		TEST_ASSERT_EQUAL(state->new_value, value,
+				  "Lcore %d failed to update int", lcore_id);
+	}
+
+	/* take the opportunity to test the foreach macro */
+	int *v;
+	unsigned int i = 0;
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_int) {
+		TEST_ASSERT_EQUAL(i, lcore_id, "Encountered lcore id %d "
+				  "while expecting %d during iteration",
+				  lcore_id, i);
+		TEST_ASSERT_EQUAL(states[lcore_id].new_value, *v,
+				  "Unexpected value on lcore %d during "
+				  "iteration", lcore_id);
+		i++;
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_sized_alignment(void)
+{
+	unsigned int lcore_id;
+	long *v;
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_long_sized) {
+		TEST_ASSERT(is_ptr_aligned(v, alignof(long)),
+			    "Type-derived alignment failed");
+	}
+
+	RTE_LCORE_VAR_FOREACH(lcore_id, v, test_long_sized_aligned) {
+		TEST_ASSERT(is_ptr_aligned(v, RTE_CACHE_LINE_SIZE),
+			    "Explicit alignment failed");
+	}
+
+	return TEST_SUCCESS;
+}
+
+/* private, larger, struct */
+#define TEST_STRUCT_DATA_SIZE 1234
+
+struct test_struct {
+	uint8_t data[TEST_STRUCT_DATA_SIZE];
+};
+
+static RTE_LCORE_VAR_HANDLE(char, before_struct);
+static RTE_LCORE_VAR_HANDLE(struct test_struct, test_struct);
+static RTE_LCORE_VAR_HANDLE(char, after_struct);
+
+struct struct_checker_state {
+	struct test_struct old_value;
+	struct test_struct new_value;
+	bool success;
+};
+
+static int check_struct(void *arg)
+{
+	struct struct_checker_state *state = arg;
+
+	struct test_struct *lcore_struct = RTE_LCORE_VAR(test_struct);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_struct, alignof(struct test_struct));
+
+	bool equal = memcmp(lcore_struct->data, state->old_value.data,
+			    TEST_STRUCT_DATA_SIZE) == 0;
+
+	state->success = equal && properly_aligned;
+
+	memcpy(lcore_struct->data, state->new_value.data,
+	       TEST_STRUCT_DATA_SIZE);
+
+	return 0;
+}
+
+static int
+test_struct_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_struct);
+	RTE_LCORE_VAR_ALLOC(test_struct);
+	RTE_LCORE_VAR_ALLOC(after_struct);
+
+	struct struct_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+
+		rand_blk(state->old_value.data, TEST_STRUCT_DATA_SIZE);
+		rand_blk(state->new_value.data, TEST_STRUCT_DATA_SIZE);
+
+		memcpy(RTE_LCORE_VAR_LCORE(lcore_id, test_struct)->data,
+		       state->old_value.data, TEST_STRUCT_DATA_SIZE);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_struct, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct struct_checker_state *state = &states[lcore_id];
+		struct test_struct *lstruct =
+			RTE_LCORE_VAR_LCORE(lcore_id, test_struct);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = memcmp(lstruct->data, state->new_value.data,
+				    TEST_STRUCT_DATA_SIZE) == 0;
+
+		TEST_ASSERT(equal, "Lcore %d failed to update struct",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE(lcore_id, before_struct);
+		char after =
+			*RTE_LCORE_VAR_LCORE(lcore_id, after_struct);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "struct was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "struct was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define TEST_ARRAY_SIZE 99
+
+typedef uint16_t test_array_t[TEST_ARRAY_SIZE];
+
+static void test_array_init_rand(test_array_t a)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		a[i] = (uint16_t)rte_rand();
+}
+
+static bool test_array_equal(test_array_t a, test_array_t b)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++) {
+		if (a[i] != b[i])
+			return false;
+	}
+	return true;
+}
+
+static void test_array_copy(test_array_t dst, const test_array_t src)
+{
+	size_t i;
+	for (i = 0; i < TEST_ARRAY_SIZE; i++)
+		dst[i] = src[i];
+}
+
+static RTE_LCORE_VAR_HANDLE(char, before_array);
+static RTE_LCORE_VAR_HANDLE(test_array_t, test_array);
+static RTE_LCORE_VAR_HANDLE(char, after_array);
+
+struct array_checker_state {
+	test_array_t old_value;
+	test_array_t new_value;
+	bool success;
+};
+
+static int check_array(void *arg)
+{
+	struct array_checker_state *state = arg;
+
+	test_array_t *lcore_array = RTE_LCORE_VAR(test_array);
+
+	bool properly_aligned =
+		is_ptr_aligned(lcore_array, alignof(test_array_t));
+
+	bool equal = test_array_equal(*lcore_array, state->old_value);
+
+	state->success = equal && properly_aligned;
+
+	test_array_copy(*lcore_array, state->new_value);
+
+	return 0;
+}
+
+static int
+test_array_lvar(void)
+{
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC(before_array);
+	RTE_LCORE_VAR_ALLOC(test_array);
+	RTE_LCORE_VAR_ALLOC(after_array);
+
+	struct array_checker_state states[RTE_MAX_LCORE];
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+
+		test_array_init_rand(state->new_value);
+		test_array_init_rand(state->old_value);
+
+		test_array_copy(*RTE_LCORE_VAR_LCORE(lcore_id, test_array),
+				state->old_value);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		rte_eal_remote_launch(check_array, &states[lcore_id],
+				      lcore_id);
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		struct array_checker_state *state = &states[lcore_id];
+		test_array_t *larray = RTE_LCORE_VAR_LCORE(lcore_id, test_array);
+
+		TEST_ASSERT(state->success, "Unexpected value encountered on "
+			    "lcore %d", lcore_id);
+
+		bool equal = test_array_equal(*larray, state->new_value);
+
+		TEST_ASSERT(equal, "Lcore %d failed to update array",
+			    lcore_id);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		char before =
+			*RTE_LCORE_VAR_LCORE(lcore_id, before_array);
+		char after =
+			*RTE_LCORE_VAR_LCORE(lcore_id, after_array);
+
+		TEST_ASSERT_EQUAL(before, 0, "Lcore variable before test "
+				  "array was modified on lcore %d", lcore_id);
+		TEST_ASSERT_EQUAL(after, 0, "Lcore variable after test "
+				  "array was modified on lcore %d", lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+#define MANY_LVARS (2 * RTE_MAX_LCORE_VAR / sizeof(uint32_t))
+
+static int
+test_many_lvars(void)
+{
+	uint32_t **handlers = malloc(sizeof(uint32_t *) * MANY_LVARS);
+	unsigned int i;
+
+	TEST_ASSERT(handlers != NULL, "Unable to allocate memory");
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		RTE_LCORE_VAR_ALLOC(handlers[i]);
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t *v =
+				RTE_LCORE_VAR_LCORE(lcore_id, handlers[i]);
+			*v = (uint32_t)(i * lcore_id);
+		}
+	}
+
+	for (i = 0; i < MANY_LVARS; i++) {
+		unsigned int lcore_id;
+
+		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+			uint32_t v = *RTE_LCORE_VAR_LCORE(lcore_id, handlers[i]);
+			TEST_ASSERT_EQUAL((uint32_t)(i * lcore_id), v,
+					  "Unexpected lcore variable value on "
+					  "lcore %d", lcore_id);
+		}
+	}
+
+	free(handlers);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_large_lvar(void)
+{
+	RTE_LCORE_VAR_HANDLE(unsigned char, large);
+	unsigned int lcore_id;
+
+	RTE_LCORE_VAR_ALLOC_SIZE(large, RTE_MAX_LCORE_VAR);
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE(lcore_id, large);
+
+		memset(ptr, (unsigned char)lcore_id, RTE_MAX_LCORE_VAR);
+	}
+
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		unsigned char *ptr = RTE_LCORE_VAR_LCORE(lcore_id, large);
+		size_t i;
+
+		for (i = 0; i < RTE_MAX_LCORE_VAR; i++)
+			TEST_ASSERT_EQUAL(ptr[i], (unsigned char)lcore_id,
+					  "Large lcore variable value is "
+					  "corrupted on lcore %d.",
+					  lcore_id);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_int_lvar),
+		TEST_CASE(test_sized_alignment),
+		TEST_CASE(test_struct_lvar),
+		TEST_CASE(test_array_lvar),
+		TEST_CASE(test_many_lvars),
+		TEST_CASE(test_large_lvar),
+		TEST_CASES_END()
+	},
+};
+
+static int test_lcore_var(void)
+{
+	if (rte_lcore_count() < MIN_LCORES) {
+		printf("Not enough cores for lcore_var_autotest; expecting at "
+		       "least %d.\n", MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_FAST_TEST(lcore_var_autotest, true, false, test_lcore_var);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v16 3/8] eal: add lcore variable performance test
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
  2024-10-23  7:52                                                                                   ` [PATCH v16 1/8] eal: add static per-lcore memory allocation facility Mattias Rönnblom
  2024-10-23  7:52                                                                                   ` [PATCH v16 2/8] eal: add lcore variable functional tests Mattias Rönnblom
@ 2024-10-23  7:52                                                                                   ` Mattias Rönnblom
  2024-10-23  7:52                                                                                   ` [PATCH v16 4/8] eal: add lcore variables' programmer's guide Mattias Rönnblom
                                                                                                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-23  7:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Thomas Monjalon, Mattias Rönnblom, Chengwen Feng

Add a basic microbenchmark for lcore variables, in an attempt to
verify that their overhead isn't significantly greater than that of
alternative approaches, in scenarios where the benefits aren't
expected to show up (i.e., when plenty of cache is available relative
to the working set size of the per-lcore data).
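
The reported figures are per-update averages: the benchmark divides
the measured TSC delta by the number of updates, so a (hypothetical)
delta of 42 million cycles over the 10-million-update loop shows up
as 4.2 cycles/update in the result table.

For orientation, a sketch of the main access patterns being compared
(struct and variable names are hypothetical; the lazy TLS variant
additionally checks an initialized flag on each update):

  /* static array, indexed by lcore id */
  static struct state sarray_state[RTE_MAX_LCORE];
  sarray_state[rte_lcore_id()].sum++;

  /* thread-local storage */
  static RTE_DEFINE_PER_LCORE(struct state, tls_state);
  RTE_PER_LCORE(tls_state).sum++;

  /* lcore variable */
  static RTE_LCORE_VAR_HANDLE(struct state, lvar_state);
  RTE_LCORE_VAR(lvar_state)->sum++;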

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>

--

PATCH v8:
 * Fix spelling. (Morten Brørup)

PATCH v6:
 * Use floating point math when calculating per-update latency.
   (Morten Brørup)

PATCH v5:
 * Add variant of thread-local storage with initialization performed
   at the time of thread creation to the benchmark scenarios. (Morten
   Brørup)

PATCH v4:
 * Rework the tests to be a little less unrealistic. Instead of a
   single dummy module using a single variable, use a number of
   variables/modules. In this way, differences in cache effects may
   show up.
 * Add RTE_CACHE_GUARD to better mimic the static array pattern.
   (Morten Brørup)
 * Show latencies as TSC cycles. (Morten Brørup)
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 256 +++++++++++++++++++++++++++++++++
 2 files changed, 257 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 7dccd197ac..40f22a54d5 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -105,6 +105,7 @@ source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..6d9869f873
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,256 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#define MAX_MODS 1024
+
+#include <stdio.h>
+
+#include <rte_bitops.h>
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct mod_lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+mod_init(struct mod_lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+mod_update(volatile struct mod_lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+struct __rte_cache_aligned mod_lcore_state_aligned {
+	struct mod_lcore_state mod_state;
+
+	RTE_CACHE_GUARD;
+};
+
+static struct mod_lcore_state_aligned
+sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		struct mod_lcore_state *mod_state =
+			&sarray_lcore_state[mod][lcore_id].mod_state;
+
+		mod_init(mod_state);
+	}
+}
+
+static __rte_noinline void
+sarray_update(unsigned int mod)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct mod_lcore_state *mod_state =
+		&sarray_lcore_state[mod][lcore_id].mod_state;
+
+	mod_update(mod_state);
+}
+
+struct mod_lcore_state_lazy {
+	struct mod_lcore_state mod_state;
+	bool initialized;
+};
+
+/*
+ * Note: it's usually a bad idea to have this much thread-local storage
+ * allocated in a real application, since it will incur a cost on
+ * thread creation and non-lcore thread memory usage.
+ */
+static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
+			    tls_lcore_state)[MAX_MODS];
+
+static inline void
+tls_init(struct mod_lcore_state_lazy *state)
+{
+	mod_init(&state->mod_state);
+
+	state->initialized = true;
+}
+
+static __rte_noinline void
+tls_lazy_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	/* With thread-local storage, initialization must usually be lazy */
+	if (!state->initialized)
+		tls_init(state);
+
+	mod_update(&state->mod_state);
+}
+
+static __rte_noinline void
+tls_update(unsigned int mod)
+{
+	struct mod_lcore_state_lazy *state =
+		&RTE_PER_LCORE(tls_lcore_state[mod]);
+
+	mod_update(&state->mod_state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
+
+static void
+lvar_init(void)
+{
+	unsigned int mod;
+
+	for (mod = 0; mod < MAX_MODS; mod++) {
+		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
+
+		struct mod_lcore_state *state =
+			RTE_LCORE_VAR(lvar_lcore_state[mod]);
+
+		mod_init(state);
+	}
+}
+
+static __rte_noinline void
+lvar_update(unsigned int mod)
+{
+	struct mod_lcore_state *state =	RTE_LCORE_VAR(lvar_lcore_state[mod]);
+
+	mod_update(state);
+}
+
+static void
+shuffle(unsigned int *elems, size_t len)
+{
+	size_t i;
+
+	for (i = len - 1; i > 0; i--) {
+		unsigned int other = rte_rand_max(i + 1);
+
+		unsigned int tmp = elems[other];
+		elems[other] = elems[i];
+		elems[i] = tmp;
+	}
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static inline double
+benchmark_access(const unsigned int *mods, unsigned int num_mods,
+		 void (*init_fun)(void), void (*update_fun)(unsigned int))
+{
+	unsigned int i;
+	double start;
+	double end;
+	double latency;
+	unsigned int num_mods_mask = num_mods - 1;
+
+	RTE_VERIFY(rte_is_power_of_2(num_mods));
+
+	if (init_fun != NULL)
+		init_fun();
+
+	/* Warm up cache and make sure TLS variables are initialized */
+	for (i = 0; i < num_mods; i++)
+		update_fun(i);
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun(mods[i & num_mods_mask]);
+
+	end = rte_rdtsc();
+
+	latency = (end - start) / (double)ITERATIONS;
+
+	return latency;
+}
+
+static void
+test_lcore_var_access_n(unsigned int num_mods)
+{
+	double sarray_latency;
+	double tls_latency;
+	double lazy_tls_latency;
+	double lvar_latency;
+	unsigned int mods[num_mods];
+	unsigned int i;
+
+	for (i = 0; i < num_mods; i++)
+		mods[i] = i;
+
+	shuffle(mods, num_mods);
+
+	sarray_latency =
+		benchmark_access(mods, num_mods, sarray_init, sarray_update);
+
+	tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_update);
+
+	lazy_tls_latency =
+		benchmark_access(mods, num_mods, NULL, tls_lazy_update);
+
+	lvar_latency =
+		benchmark_access(mods, num_mods, lvar_init, lvar_update);
+
+	printf("%17u %8.1f %14.1f %15.1f %10.1f\n", num_mods, sarray_latency,
+	       tls_latency, lazy_tls_latency, lvar_latency);
+}
+
+/*
+ * The potential performance benefit of lcore variables compared to
+ * the use of statically sized, lcore id-indexed arrays is not
+ * shorter latencies in a scenario with low cache pressure, but rather
+ * fewer cache misses in a real-world scenario, with extensive cache
+ * usage. These tests are a crude simulation of such a scenario,
+ * using <N> dummy modules, each with a small, per-lcore state. Note
+ * however that these tests have very little non-lcore/thread-local
+ * state, which is unrealistic.
+ */
+
+static int
+test_lcore_var_access(void)
+{
+	unsigned int num_mods = 1;
+
+	printf("- Latencies [TSC cycles/update] -\n");
+	printf("Number of           Static   Thread-local    Thread-local      Lcore\n");
+	printf("Modules/Variables    Array        Storage  Storage (Lazy)  Variables\n");
+
+	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
+		test_lcore_var_access_n(num_mods);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v16 4/8] eal: add lcore variables' programmer's guide
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
                                                                                                     ` (2 preceding siblings ...)
  2024-10-23  7:52                                                                                   ` [PATCH v16 3/8] eal: add lcore variable performance test Mattias Rönnblom
@ 2024-10-23  7:52                                                                                   ` Mattias Rönnblom
  2024-10-23  7:52                                                                                   ` [PATCH v16 5/8] random: keep PRNG state in lcore variable Mattias Rönnblom
                                                                                                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-23  7:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Thomas Monjalon, Mattias Rönnblom

Add a programmer's guide for lcore variables. The guide gives an
overview of both the API and its implementation, as well as
alternatives to the use of lcore variables for maintaining per-lcore
id data.

It has pictures, too.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 .../prog_guide/img/lcore_var_mem_layout.svg   | 310 ++++++++++
 .../img/static_array_mem_layout.svg           | 278 +++++++++
 doc/guides/prog_guide/index.rst               |   1 +
 doc/guides/prog_guide/lcore_var.rst           | 548 ++++++++++++++++++
 4 files changed, 1137 insertions(+)
 create mode 100644 doc/guides/prog_guide/img/lcore_var_mem_layout.svg
 create mode 100644 doc/guides/prog_guide/img/static_array_mem_layout.svg
 create mode 100644 doc/guides/prog_guide/lcore_var.rst

diff --git a/doc/guides/prog_guide/img/lcore_var_mem_layout.svg b/doc/guides/prog_guide/img/lcore_var_mem_layout.svg
new file mode 100644
index 0000000000..ebb4fa2431
--- /dev/null
+++ b/doc/guides/prog_guide/img/lcore_var_mem_layout.svg
@@ -0,0 +1,310 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg version="1.2" width="187.63mm" height="184.65mm" viewBox="1286 2291 18763 18465" preserveAspectRatio="xMidYMid" fill-rule="evenodd" stroke-width="28.222" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg" xmlns:ooo="http://xml.openoffice.org/svg/export" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:presentation="http://sun.com/xmlns/staroffice/presentation" xmlns:smil="http://www.w3.org/2001/SMIL20/" xmlns:anim="urn:oasis:names:tc:opendocument:xmlns:animation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xml:space="preserve">
+ <defs>
+  <font id="EmbeddedFont_1" horiz-adv-x="2048">
+   <font-face font-family="Liberation Sans embedded" units-per-em="2048" font-weight="normal" font-style="normal" ascent="1852" descent="423"/>
+   <missing-glyph horiz-adv-x="2048" d="M 0,0 L 2047,0 2047,2047 0,2047 0,0 Z"/>
+   <glyph unicode="y" horiz-adv-x="1012" d="M 191,-425 C 142,-425 100,-421 67,-414 L 67,-279 C 92,-283 120,-285 151,-285 263,-285 352,-203 417,-38 L 434,5 5,1082 197,1082 425,484 C 428,475 432,464 437,451 442,438 457,394 482,320 507,246 521,205 523,196 L 593,393 830,1082 1020,1082 604,0 C 559,-115 518,-201 479,-258 440,-314 398,-356 351,-384 304,-411 250,-425 191,-425 Z"/>
+   <glyph unicode="x" horiz-adv-x="976" d="M 801,0 L 510,444 217,0 23,0 408,556 41,1082 240,1082 510,661 778,1082 979,1082 612,558 1002,0 801,0 Z"/>
+   <glyph unicode="v" horiz-adv-x="1007" d="M 613,0 L 400,0 7,1082 199,1082 437,378 C 446,351 469,272 506,141 L 541,258 580,376 826,1082 1017,1082 613,0 Z"/>
+   <glyph unicode="u" horiz-adv-x="867" d="M 314,1082 L 314,396 C 314,325 321,269 335,230 349,191 371,162 402,145 433,128 478,119 537,119 624,119 692,149 742,208 792,267 817,350 817,455 L 817,1082 997,1082 997,231 C 997,105 999,28 1003,0 L 833,0 C 832,3 832,12 831,27 830,42 830,59 829,78 828,97 826,132 825,185 L 822,185 C 781,110 733,58 679,27 624,-4 557,-20 476,-20 357,-20 271,10 216,69 161,128 133,225 133,361 L 133,1082 314,1082 Z"/>
+   <glyph unicode="t" horiz-adv-x="523" d="M 554,8 C 495,-8 434,-16 372,-16 228,-16 156,66 156,229 L 156,951 31,951 31,1082 163,1082 216,1324 336,1324 336,1082 536,1082 536,951 336,951 336,268 C 336,216 345,180 362,159 379,138 408,127 450,127 474,127 509,132 554,141 L 554,8 Z"/>
+   <glyph unicode="s" horiz-adv-x="891" d="M 950,299 C 950,197 912,118 835,63 758,8 650,-20 511,-20 376,-20 273,2 200,47 127,91 79,160 57,254 L 216,285 C 231,227 263,185 311,158 359,131 426,117 511,117 602,117 669,131 712,159 754,187 775,229 775,285 775,328 760,362 731,389 702,416 654,438 589,455 L 460,489 C 357,516 283,542 240,568 196,593 162,624 137,661 112,698 100,743 100,796 100,895 135,970 206,1022 276,1073 378,1099 513,1099 632,1099 727,1078 798,1036 868,994 912,927 931,834 L 769,814 C 759,862 732,899 689,925 645,950 586,963 513,963 432,963 372,951 333,926 294,901 275,864 275,814 275,783 283,758 299,738 315,718 339,701 370,687 401,673 467,654 568,629 663,605 732,583 774,563 816,542 849,520 874,495 898,470 917,442 930,410 943,377 950,340 950,299 Z"/>
+   <glyph unicode="r" horiz-adv-x="511" d="M 142,0 L 142,830 C 142,906 140,990 136,1082 L 306,1082 C 311,959 314,886 314,861 L 318,861 C 347,954 380,1017 417,1051 454,1085 507,1102 575,1102 599,1102 623,1099 648,1092 L 648,927 C 624,934 592,937 552,937 477,937 420,905 381,841 342,776 322,684 322,564 L 322,0 142,0 Z"/>
+   <glyph unicode="p" horiz-adv-x="918" d="M 1053,546 C 1053,169 920,-20 655,-20 488,-20 376,43 319,168 L 314,168 C 317,163 318,106 318,-2 L 318,-425 138,-425 138,861 C 138,972 136,1046 132,1082 L 306,1082 C 307,1079 308,1070 309,1054 310,1037 312,1012 314,978 315,944 316,921 316,908 L 320,908 C 352,975 394,1024 447,1055 500,1086 569,1101 655,1101 788,1101 888,1056 954,967 1020,878 1053,737 1053,546 Z M 864,542 C 864,693 844,800 803,865 762,930 698,962 609,962 538,962 482,947 442,917 401,887 371,840 350,777 329,713 318,630 318,528 318,386 341,281 386,214 431,147 505,113 607,113 696,113 762,146 803,212 844,277 864,387 864,542 Z"/>
+   <glyph unicode="o" horiz-adv-x="964" d="M 1053,542 C 1053,353 1011,212 928,119 845,26 724,-20 565,-20 407,-20 288,28 207,125 126,221 86,360 86,542 86,915 248,1102 571,1102 736,1102 858,1057 936,966 1014,875 1053,733 1053,542 Z M 864,542 C 864,691 842,800 798,868 753,935 679,969 574,969 469,969 393,935 346,866 299,797 275,689 275,542 275,399 298,292 345,221 391,149 464,113 563,113 671,113 748,148 795,217 841,286 864,395 864,542 Z"/>
+   <glyph unicode="n" horiz-adv-x="867" d="M 825,0 L 825,686 C 825,757 818,813 804,852 790,891 768,920 737,937 706,954 661,963 602,963 515,963 447,933 397,874 347,815 322,732 322,627 L 322,0 142,0 142,851 C 142,977 140,1054 136,1082 L 306,1082 C 307,1079 307,1070 308,1055 309,1040 310,1024 311,1005 312,986 313,950 314,897 L 317,897 C 358,972 406,1025 461,1056 515,1087 582,1102 663,1102 782,1102 869,1073 924,1014 979,955 1006,857 1006,721 L 1006,0 825,0 Z"/>
+   <glyph unicode="l" horiz-adv-x="181" d="M 138,0 L 138,1484 318,1484 318,0 138,0 Z"/>
+   <glyph unicode="i" horiz-adv-x="181" d="M 137,1312 L 137,1484 317,1484 317,1312 137,1312 Z M 137,0 L 137,1082 317,1082 317,0 137,0 Z"/>
+   <glyph unicode="h" horiz-adv-x="861" d="M 317,897 C 356,968 402,1020 457,1053 511,1086 580,1102 663,1102 780,1102 867,1073 923,1015 978,956 1006,858 1006,721 L 1006,0 825,0 825,686 C 825,762 818,819 804,856 790,893 767,920 735,937 703,954 659,963 602,963 517,963 450,934 399,875 348,816 322,737 322,638 L 322,0 142,0 142,1484 322,1484 322,1098 C 322,1057 321,1015 319,972 316,929 315,904 314,897 L 317,897 Z"/>
+   <glyph unicode="g" horiz-adv-x="918" d="M 548,-425 C 430,-425 336,-402 266,-356 196,-309 151,-243 131,-158 L 312,-132 C 324,-182 351,-220 392,-248 433,-274 486,-288 553,-288 732,-288 822,-183 822,27 L 822,201 820,201 C 786,132 739,80 680,45 621,10 551,-8 472,-8 339,-8 242,36 180,124 117,212 86,350 86,539 86,730 120,872 187,963 254,1054 355,1099 492,1099 569,1099 635,1082 692,1047 748,1012 791,962 822,897 L 824,897 C 824,917 825,952 828,1001 831,1050 833,1077 836,1082 L 1007,1082 C 1003,1046 1001,971 1001,858 L 1001,31 C 1001,-273 850,-425 548,-425 Z M 822,541 C 822,629 810,705 786,769 762,832 728,881 685,915 641,948 591,965 536,965 444,965 377,932 335,865 293,798 272,690 272,541 272,393 292,287 331,222 370,157 438,125 533,125 590,125 640,142 684,175 728,208 762,256 786,319 810,381 822,455 822,541 Z"/>
+   <glyph unicode="f" horiz-adv-x="543" d="M 361,951 L 361,0 181,0 181,951 29,951 29,1082 181,1082 181,1204 C 181,1303 203,1374 246,1417 289,1460 356,1482 445,1482 495,1482 537,1478 572,1470 L 572,1333 C 542,1338 515,1341 492,1341 446,1341 413,1329 392,1306 371,1283 361,1240 361,1179 L 361,1082 572,1082 572,951 361,951 Z"/>
+   <glyph unicode="e" horiz-adv-x="958" d="M 276,503 C 276,379 302,283 353,216 404,149 479,115 578,115 656,115 719,131 766,162 813,193 844,233 861,281 L 1019,236 C 954,65 807,-20 578,-20 418,-20 296,28 213,123 129,218 87,360 87,548 87,727 129,864 213,959 296,1054 416,1102 571,1102 889,1102 1048,910 1048,527 L 1048,503 276,503 Z M 862,641 C 852,755 823,838 775,891 727,943 658,969 568,969 481,969 412,940 361,882 310,823 282,743 278,641 L 862,641 Z"/>
+   <glyph unicode="d" horiz-adv-x="918" d="M 821,174 C 788,105 744,55 689,25 634,-5 565,-20 484,-20 347,-20 247,26 183,118 118,210 86,349 86,536 86,913 219,1102 484,1102 566,1102 634,1087 689,1057 744,1027 788,979 821,914 L 823,914 821,1035 821,1484 1001,1484 1001,223 C 1001,110 1003,36 1007,0 L 835,0 C 833,11 831,35 829,74 826,113 825,146 825,174 L 821,174 Z M 275,542 C 275,391 295,282 335,217 375,152 440,119 530,119 632,119 706,154 752,225 798,296 821,405 821,554 821,697 798,802 752,869 706,936 633,969 532,969 441,969 376,936 336,869 295,802 275,693 275,542 Z"/>
+   <glyph unicode="c" horiz-adv-x="880" d="M 275,546 C 275,402 298,295 343,226 388,157 457,122 548,122 612,122 666,139 709,174 752,209 778,262 788,334 L 970,322 C 956,218 912,135 837,73 762,11 668,-20 553,-20 402,-20 286,28 207,124 127,219 87,359 87,542 87,724 127,863 207,959 287,1054 402,1102 551,1102 662,1102 754,1073 827,1016 900,959 945,880 964,779 L 779,765 C 770,825 746,873 708,908 670,943 616,961 546,961 451,961 382,929 339,866 296,803 275,696 275,546 Z"/>
+   <glyph unicode="b" horiz-adv-x="918" d="M 1053,546 C 1053,169 920,-20 655,-20 573,-20 505,-5 451,25 396,54 352,102 318,168 L 316,168 C 316,147 315,116 312,74 309,31 307,7 306,0 L 132,0 C 136,36 138,110 138,223 L 138,1484 318,1484 318,1061 C 318,1018 317,967 314,908 L 318,908 C 351,977 396,1027 451,1057 506,1087 574,1102 655,1102 792,1102 892,1056 957,964 1021,872 1053,733 1053,546 Z M 864,540 C 864,691 844,800 804,865 764,930 699,963 609,963 508,963 434,928 388,859 341,790 318,680 318,529 318,387 341,282 386,215 431,147 505,113 607,113 698,113 763,147 804,214 844,281 864,389 864,540 Z"/>
+   <glyph unicode="a" horiz-adv-x="1049" d="M 414,-20 C 305,-20 224,9 169,66 114,123 87,202 87,302 87,414 124,500 198,560 271,620 390,652 554,656 L 797,660 797,719 C 797,807 778,870 741,908 704,946 645,965 565,965 484,965 426,951 389,924 352,897 330,853 323,793 L 135,810 C 166,1005 310,1102 569,1102 705,1102 807,1071 876,1009 945,946 979,856 979,738 L 979,272 C 979,219 986,179 1000,152 1014,125 1041,111 1080,111 1097,111 1117,113 1139,118 L 1139,6 C 1094,-5 1047,-10 1000,-10 933,-10 885,8 855,43 824,78 807,132 803,207 L 797,207 C 751,124 698,66 637,32 576,-3 501,-20 414,-20 Z M 455,115 C 521,115 580,130 631,160 682,190 723,231 753,284 782,336 797,390 797,445 L 797,534 600,530 C 515,529 451,520 408,504 364,488 330,463 307,430 284,397 272,353 272,299 272,240 288,195 320,163 351,131 396,115 455,115 Z"/>
+   <glyph unicode="_" horiz-adv-x="1188" d="M -31,-407 L -31,-277 1162,-277 1162,-407 -31,-407 Z"/>
+   <glyph unicode="X" horiz-adv-x="1273" d="M 1112,0 L 689,616 257,0 46,0 582,732 87,1409 298,1409 690,856 1071,1409 1282,1409 800,739 1323,0 1112,0 Z"/>
+   <glyph unicode="V" horiz-adv-x="1343" d="M 782,0 L 584,0 9,1409 210,1409 600,417 684,168 768,417 1156,1409 1357,1409 782,0 Z"/>
+   <glyph unicode="T" horiz-adv-x="1154" d="M 720,1253 L 720,0 530,0 530,1253 46,1253 46,1409 1204,1409 1204,1253 720,1253 Z"/>
+   <glyph unicode="R" horiz-adv-x="1211" d="M 1164,0 L 798,585 359,585 359,0 168,0 168,1409 831,1409 C 990,1409 1112,1374 1199,1303 1285,1232 1328,1133 1328,1006 1328,901 1298,813 1237,742 1176,671 1091,626 984,607 L 1384,0 1164,0 Z M 1136,1004 C 1136,1086 1108,1149 1053,1192 997,1235 917,1256 812,1256 L 359,1256 359,736 820,736 C 921,736 999,760 1054,807 1109,854 1136,919 1136,1004 Z"/>
+   <glyph unicode="O" horiz-adv-x="1393" d="M 1495,711 C 1495,564 1467,435 1411,324 1354,213 1273,128 1168,69 1063,10 938,-20 795,-20 650,-20 526,9 421,68 316,127 235,212 180,323 125,434 97,563 97,711 97,936 159,1113 282,1240 405,1367 577,1430 797,1430 940,1430 1065,1402 1170,1345 1275,1288 1356,1205 1412,1096 1467,987 1495,859 1495,711 Z M 1300,711 C 1300,886 1256,1024 1169,1124 1081,1224 957,1274 797,1274 636,1274 511,1225 423,1126 335,1027 291,889 291,711 291,534 336,394 425,291 514,187 637,135 795,135 958,135 1083,185 1170,286 1257,386 1300,528 1300,711 Z"/>
+   <glyph unicode="M" horiz-adv-x="1364" d="M 1366,0 L 1366,940 C 1366,1044 1369,1144 1375,1240 1342,1121 1313,1027 1287,960 L 923,0 789,0 420,960 364,1130 331,1240 334,1129 338,940 338,0 168,0 168,1409 419,1409 794,432 C 807,393 820,351 833,306 845,261 853,228 857,208 862,235 874,275 891,330 908,384 919,418 925,432 L 1293,1409 1538,1409 1538,0 1366,0 Z"/>
+   <glyph unicode="L" horiz-adv-x="900" d="M 168,0 L 168,1409 359,1409 359,156 1071,156 1071,0 168,0 Z"/>
+   <glyph unicode="H" horiz-adv-x="1140" d="M 1121,0 L 1121,653 359,653 359,0 168,0 168,1409 359,1409 359,813 1121,813 1121,1409 1312,1409 1312,0 1121,0 Z"/>
+   <glyph unicode="E" horiz-adv-x="1106" d="M 168,0 L 168,1409 1237,1409 1237,1253 359,1253 359,801 1177,801 1177,647 359,647 359,156 1278,156 1278,0 168,0 Z"/>
+   <glyph unicode="C" horiz-adv-x="1292" d="M 792,1274 C 636,1274 515,1224 428,1124 341,1023 298,886 298,711 298,538 343,400 434,295 524,190 646,137 800,137 997,137 1146,235 1245,430 L 1401,352 C 1343,231 1262,138 1157,75 1052,12 930,-20 791,-20 649,-20 526,10 423,69 319,128 240,212 186,322 131,431 104,561 104,711 104,936 165,1112 286,1239 407,1366 575,1430 790,1430 940,1430 1065,1401 1166,1342 1267,1283 1341,1196 1388,1081 L 1207,1021 C 1174,1103 1122,1166 1050,1209 977,1252 891,1274 792,1274 Z"/>
+   <glyph unicode="A" horiz-adv-x="1353" d="M 1167,0 L 1006,412 364,412 202,0 4,0 579,1409 796,1409 1362,0 1167,0 Z M 685,1265 L 676,1237 C 659,1182 635,1111 602,1024 L 422,561 949,561 768,1026 C 749,1072 731,1124 712,1182 L 685,1265 Z"/>
+   <glyph unicode="&gt;" horiz-adv-x="992" d="M 101,154 L 101,307 959,674 101,1040 101,1194 1096,776 1096,571 101,154 Z"/>
+   <glyph unicode="&lt;" horiz-adv-x="992" d="M 101,571 L 101,776 1096,1194 1096,1040 238,674 1096,307 1096,154 101,571 Z"/>
+   <glyph unicode=":" horiz-adv-x="196" d="M 187,875 L 187,1082 382,1082 382,875 187,875 Z M 187,0 L 187,207 382,207 382,0 187,0 Z"/>
+   <glyph unicode="9" horiz-adv-x="943" d="M 1042,733 C 1042,491 998,305 910,175 821,45 695,-20 532,-20 422,-20 334,3 268,50 201,96 154,171 125,274 L 297,301 C 333,184 412,125 535,125 638,125 718,173 775,269 832,365 861,502 864,680 837,620 792,572 727,536 662,499 591,481 514,481 387,481 286,524 210,611 134,698 96,813 96,956 96,1103 137,1219 220,1304 303,1388 418,1430 565,1430 722,1430 840,1372 921,1256 1002,1140 1042,966 1042,733 Z M 846,907 C 846,1020 820,1112 768,1181 716,1250 646,1284 559,1284 472,1284 404,1255 354,1196 304,1137 279,1057 279,956 279,853 304,772 354,713 404,653 472,623 557,623 609,623 657,635 702,659 747,682 782,716 808,759 833,802 846,852 846,907 Z"/>
+   <glyph unicode="8" horiz-adv-x="958" d="M 1050,393 C 1050,263 1009,162 926,89 843,16 725,-20 570,-20 419,-20 302,16 217,87 132,158 89,260 89,391 89,483 115,560 168,623 221,686 288,724 370,737 L 370,741 C 293,759 233,798 189,858 144,918 122,988 122,1069 122,1176 162,1263 243,1330 323,1397 431,1430 566,1430 705,1430 814,1397 895,1332 975,1267 1015,1178 1015,1067 1015,986 993,916 948,856 903,796 842,758 765,743 L 765,739 C 855,724 925,686 975,625 1025,563 1050,486 1050,393 Z M 828,1057 C 828,1216 741,1296 566,1296 481,1296 417,1276 373,1236 328,1196 306,1136 306,1057 306,976 329,915 375,873 420,830 485,809 568,809 653,809 717,829 762,868 806,907 828,970 828,1057 Z M 863,410 C 863,497 837,563 785,608 733,652 660,674 566,674 475,674 403,650 352,603 301,555 275,489 275,406 275,212 374,115 572,115 670,115 743,139 791,186 839,233 863,307 863,410 Z"/>
+   <glyph unicode="7" horiz-adv-x="928" d="M 1036,1263 C 892,1043 790,871 731,746 672,621 627,498 598,377 568,256 553,130 553,0 L 365,0 C 365,180 403,370 480,569 556,768 683,997 862,1256 L 105,1256 105,1409 1036,1409 1036,1263 Z"/>
+   <glyph unicode="6" horiz-adv-x="942" d="M 1049,461 C 1049,312 1009,195 928,109 847,23 736,-20 594,-20 435,-20 314,39 230,157 146,275 104,447 104,672 104,916 148,1103 235,1234 322,1365 447,1430 608,1430 821,1430 955,1334 1010,1143 L 838,1112 C 803,1227 725,1284 606,1284 503,1284 424,1236 368,1141 311,1045 283,906 283,725 316,786 362,832 421,864 480,895 548,911 625,911 755,911 858,870 935,789 1011,708 1049,598 1049,461 Z M 866,453 C 866,555 841,634 791,689 741,744 671,772 582,772 498,772 430,748 379,699 327,650 301,582 301,496 301,387 328,298 382,229 435,160 504,125 588,125 675,125 743,154 792,213 841,271 866,351 866,453 Z"/>
+   <glyph unicode="5" horiz-adv-x="968" d="M 1053,459 C 1053,310 1009,193 921,108 832,23 710,-20 553,-20 422,-20 316,9 235,66 154,123 103,206 82,315 L 264,336 C 302,197 400,127 557,127 654,127 729,156 784,215 839,273 866,353 866,455 866,544 839,615 784,670 729,725 654,752 561,752 512,752 467,744 425,729 383,714 341,688 299,651 L 123,651 170,1409 971,1409 971,1256 334,1256 307,809 C 385,869 482,899 598,899 737,899 847,858 930,777 1012,696 1053,590 1053,459 Z"/>
+   <glyph unicode="4" horiz-adv-x="1029" d="M 881,319 L 881,0 711,0 711,319 47,319 47,459 692,1409 881,1409 881,461 1079,461 1079,319 881,319 Z M 711,1206 C 710,1202 700,1184 683,1153 666,1122 653,1100 644,1087 L 283,555 229,481 213,461 711,461 711,1206 Z"/>
+   <glyph unicode="3" horiz-adv-x="968" d="M 1049,389 C 1049,259 1008,158 925,87 842,16 724,-20 571,-20 428,-20 315,12 230,77 145,141 94,236 78,362 L 264,379 C 288,212 390,129 571,129 662,129 733,151 785,196 836,241 862,307 862,395 862,472 833,532 774,575 715,618 629,639 518,639 L 416,639 416,795 514,795 C 613,795 689,817 744,860 798,903 825,962 825,1038 825,1113 803,1173 759,1217 714,1260 648,1282 561,1282 482,1282 418,1262 369,1221 320,1180 291,1123 283,1049 L 102,1063 C 115,1178 163,1268 246,1333 328,1398 434,1430 563,1430 704,1430 814,1397 893,1332 971,1266 1010,1174 1010,1057 1010,967 985,894 935,838 884,781 811,743 715,723 L 715,719 C 820,708 902,672 961,613 1020,554 1049,479 1049,389 Z"/>
+   <glyph unicode="2" horiz-adv-x="930" d="M 103,0 L 103,127 C 137,205 179,274 228,334 277,393 328,447 382,496 436,544 490,589 543,630 596,671 643,713 686,754 729,795 763,839 790,884 816,929 829,981 829,1038 829,1115 806,1175 761,1218 716,1261 653,1282 572,1282 495,1282 432,1261 383,1220 333,1178 304,1119 295,1044 L 111,1061 C 124,1174 172,1263 255,1330 337,1397 443,1430 572,1430 714,1430 823,1397 900,1330 976,1263 1014,1167 1014,1044 1014,989 1002,935 977,881 952,827 914,773 865,719 816,665 721,581 582,468 505,405 444,349 399,299 354,248 321,200 301,153 L 1036,153 1036,0 103,0 Z"/>
+   <glyph unicode="1" horiz-adv-x="880" d="M 156,0 L 156,153 515,153 515,1237 197,1010 197,1180 530,1409 696,1409 696,153 1039,153 1039,0 156,0 Z"/>
+   <glyph unicode="0" horiz-adv-x="976" d="M 1059,705 C 1059,470 1018,290 935,166 852,42 729,-20 567,-20 405,-20 283,42 202,165 121,288 80,468 80,705 80,947 120,1128 199,1249 278,1370 402,1430 573,1430 739,1430 862,1369 941,1247 1020,1125 1059,944 1059,705 Z M 876,705 C 876,908 853,1056 806,1147 759,1238 681,1284 573,1284 462,1284 383,1239 335,1149 286,1059 262,911 262,705 262,505 287,359 336,266 385,173 462,127 569,127 675,127 753,174 802,269 851,364 876,509 876,705 Z"/>
+   <glyph unicode="." horiz-adv-x="196" d="M 187,0 L 187,219 382,219 382,0 187,0 Z"/>
+   <glyph unicode="#" horiz-adv-x="1117" d="M 896,885 L 818,516 1078,516 1078,408 795,408 707,0 597,0 683,408 320,408 236,0 126,0 210,408 9,408 9,516 234,516 312,885 60,885 60,993 334,993 423,1401 533,1401 445,993 808,993 896,1401 1006,1401 918,993 1129,993 1129,885 896,885 Z M 425,885 L 345,516 707,516 785,885 425,885 Z"/>
+   <glyph unicode=" " horiz-adv-x="556"/>
+  </font>
+ </defs>
+ <defs class="EmbeddedBulletChars">
+  <g id="bullet-char-template-57356" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 580,1141 L 1163,571 580,0 -4,571 580,1141 Z"/>
+  </g>
+  <g id="bullet-char-template-57354" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 8,1128 L 1137,1128 1137,0 8,0 8,1128 Z"/>
+  </g>
+  <g id="bullet-char-template-10146" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 174,0 L 602,739 174,1481 1456,739 174,0 Z M 1358,739 L 309,1346 659,739 1358,739 Z"/>
+  </g>
+  <g id="bullet-char-template-10132" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 2015,739 L 1276,0 717,0 1260,543 174,543 174,936 1260,936 717,1481 1274,1481 2015,739 Z"/>
+  </g>
+  <g id="bullet-char-template-10007" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 0,-2 C -7,14 -16,27 -25,37 L 356,567 C 262,823 215,952 215,954 215,979 228,992 255,992 264,992 276,990 289,987 310,991 331,999 354,1012 L 381,999 492,748 772,1049 836,1024 860,1049 C 881,1039 901,1025 922,1006 886,937 835,863 770,784 769,783 710,716 594,584 L 774,223 C 774,196 753,168 711,139 L 727,119 C 717,90 699,76 672,76 641,76 570,178 457,381 L 164,-76 C 142,-110 111,-127 72,-127 30,-127 9,-110 8,-76 1,-67 -2,-52 -2,-32 -2,-23 -1,-13 0,-2 Z"/>
+  </g>
+  <g id="bullet-char-template-10004" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 285,-33 C 182,-33 111,30 74,156 52,228 41,333 41,471 41,549 55,616 82,672 116,743 169,778 240,778 293,778 328,747 346,684 L 369,508 C 377,444 397,411 428,410 L 1163,1116 C 1174,1127 1196,1133 1229,1133 1271,1133 1292,1118 1292,1087 L 1292,965 C 1292,929 1282,901 1262,881 L 442,47 C 390,-6 338,-33 285,-33 Z"/>
+  </g>
+  <g id="bullet-char-template-9679" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 813,0 C 632,0 489,54 383,161 276,268 223,411 223,592 223,773 276,916 383,1023 489,1130 632,1184 813,1184 992,1184 1136,1130 1245,1023 1353,916 1407,772 1407,592 1407,412 1353,268 1245,161 1136,54 992,0 813,0 Z"/>
+  </g>
+  <g id="bullet-char-template-8226" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 346,457 C 273,457 209,483 155,535 101,586 74,649 74,723 74,796 101,859 155,911 209,963 273,989 346,989 419,989 480,963 531,910 582,859 608,796 608,723 608,648 583,586 532,535 482,483 420,457 346,457 Z"/>
+  </g>
+  <g id="bullet-char-template-8211" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M -4,459 L 1135,459 1135,606 -4,606 -4,459 Z"/>
+  </g>
+  <g id="bullet-char-template-61548" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 173,740 C 173,903 231,1043 346,1159 462,1274 601,1332 765,1332 928,1332 1067,1274 1183,1159 1299,1043 1357,903 1357,740 1357,577 1299,437 1183,322 1067,206 928,148 765,148 601,148 462,206 346,322 231,437 173,577 173,740 Z"/>
+  </g>
+ </defs>
+ <g class="Page">
+  <g class="com.sun.star.drawing.TableShape">
+   <g>
+    <rect class="BoundingBox" stroke="none" fill="none" x="4168" y="3981" width="12419" height="15072"/>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,3999 L 5563,3999 5563,4886 4186,4886 4186,3999 Z"/>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 5563,3999 L 6940,3999 6940,4886 5563,4886 5563,3999 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="6135" y="4590"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">0</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 6940,3999 L 8317,3999 8317,4886 6940,4886 6940,3999 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="7512" y="4590"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">1</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 8317,3999 L 9694,3999 9694,4886 8317,4886 8317,3999 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="8889" y="4590"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">2</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 9694,3999 L 11073,3999 11073,4886 9694,4886 9694,3999 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="10267" y="4590"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">3</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 11073,3999 L 12452,3999 12452,4886 11073,4886 11073,3999 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="11646" y="4590"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">4</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 12452,3999 L 13831,3999 13831,4886 12452,4886 12452,3999 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="13025" y="4590"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">5</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 13831,3999 L 15208,3999 15208,4886 13831,4886 13831,3999 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="14403" y="4590"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">6</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 15208,3999 L 16568,3999 16568,4886 15208,4886 15208,3999 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="15772" y="4590"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">7</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,4886 L 5563,4886 5563,5773 4186,5773 4186,4886 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4758" y="5477"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">0</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 5563,4886 L 11073,4886 11073,5773 5563,5773 5563,4886 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="7918" y="5477"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">int a</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 11073,4886 L 12452,4886 12452,5773 11073,5773 11073,4886 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="11176" y="5477"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">char b</tspan></tspan></tspan></text>
+    <path fill="rgb(255,255,153)" stroke="none" d="M 12452,4886 L 16568,4886 16568,5773 12452,5773 12452,4886 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="13511" y="5477"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">&lt;padding&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,5773 L 5563,5773 5563,6660 4186,6660 4186,5773 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4758" y="6364"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">8</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 5563,5773 L 16568,5773 16568,6660 5563,6660 5563,5773 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="10503" y="6364"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">long c</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,6660 L 5563,6660 5563,7547 4186,7547 4186,6660 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="7251"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">16</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 5563,6660 L 16568,6660 16568,7547 5563,7547 5563,6660 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="10490" y="7251"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">long d</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,7547 L 5563,7547 5563,8434 4186,8434 4186,7547 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="8138"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">24</tspan></tspan></tspan></text>
+    <path fill="rgb(0,184,255)" stroke="none" d="M 5563,7547 L 16568,7547 16568,11970 5563,11970 5563,7547 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="9736" y="9906"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">&lt;unallocated&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,8434 L 5563,8434 5563,9321 4186,9321 4186,8434 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="9025"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">32</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,9321 L 5563,9321 5563,10204 4186,10204 4186,9321 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="9910"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">40</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,10204 L 5563,10204 5563,11087 4186,11087 4186,10204 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="10793"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">48</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,11087 L 5563,11087 5563,11970 4186,11970 4186,11087 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="11676"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">56</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,11970 L 5563,11970 5563,12853 4186,12853 4186,11970 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="12559"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">64</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 5563,11970 L 11073,11970 11073,12853 5563,12853 5563,11970 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="7918" y="12559"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">int a</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 11073,11970 L 12452,11970 12452,12853 11073,12853 11073,11970 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="11176" y="12559"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">char b</tspan></tspan></tspan></text>
+    <path fill="rgb(255,255,153)" stroke="none" d="M 12452,11970 L 16568,11970 16568,12853 12452,12853 12452,11970 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="13511" y="12559"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">&lt;padding&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,12853 L 5563,12853 5563,13736 4186,13736 4186,12853 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="13442"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">72</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 5563,12853 L 16568,12853 16568,13736 5563,13736 5563,12853 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="10503" y="13442"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">long c</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,13736 L 5563,13736 5563,14619 4186,14619 4186,13736 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="14325"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">80</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 5563,13736 L 16568,13736 16568,14619 5563,14619 5563,13736 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="10490" y="14325"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">long d</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,14619 L 5563,14619 5563,15502 4186,15502 4186,14619 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="15208"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">88</tspan></tspan></tspan></text>
+    <path fill="rgb(0,184,255)" stroke="none" d="M 5563,14619 L 16568,14619 16568,19034 5563,19034 5563,14619 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="9736" y="16974"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">&lt;unallocated&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,15502 L 5563,15502 5563,16385 4186,16385 4186,15502 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4640" y="16091"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">96</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,16385 L 5563,16385 5563,17268 4186,17268 4186,16385 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4523" y="16974"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">104</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,17268 L 5563,17268 5563,18151 4186,18151 4186,17268 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4538" y="17857"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">112</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 4186,18151 L 5563,18151 5563,19034 4186,19034 4186,18151 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4523" y="18740"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">120</tspan></tspan></tspan></text>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,3999 L 16581,3999"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4186,3986 L 4186,19047"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 5563,3986 L 5563,19047"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 6940,3986 L 6940,4899"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 8317,3986 L 8317,4899"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 9694,3986 L 9694,4899"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 11073,3986 L 11073,5786"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 12452,3986 L 12452,5786"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 13831,3986 L 13831,4899"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 15208,3986 L 15208,4899"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 16568,3986 L 16568,19047"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,4886 L 16581,4886"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,5773 L 16581,5773"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,6660 L 16581,6660"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,7547 L 16581,7547"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,8434 L 5576,8434"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,9321 L 5576,9321"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,10204 L 5576,10204"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,11087 L 5576,11087"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,11970 L 16581,11970"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 11073,11957 L 11073,12866"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 12452,11957 L 12452,12866"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,12853 L 16581,12853"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,13736 L 16581,13736"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,14619 L 16581,14619"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,15502 L 5576,15502"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,16385 L 5576,16385"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,17268 L 5576,17268"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,18151 L 5576,18151"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 4173,19034 L 16581,19034"/>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id3">
+    <rect class="BoundingBox" stroke="none" fill="none" x="16619" y="4904" width="317" height="868"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 16620,4905 C 16698,4905 16777,4941 16777,4977 L 16777,5265 C 16777,5301 16855,5337 16934,5337 16855,5337 16777,5373 16777,5409 L 16777,5697 C 16777,5733 16698,5770 16620,5770"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id4">
+    <rect class="BoundingBox" stroke="none" fill="none" x="17031" y="5010" width="3018" height="726"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="17281" y="5520"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">struct x_lcore</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id5">
+    <rect class="BoundingBox" stroke="none" fill="none" x="3421" y="4901" width="477" height="7064"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 3896,4902 C 3777,4902 3659,5196 3659,5490 L 3659,7844 C 3659,8138 3540,8432 3422,8432 3540,8432 3659,8726 3659,9020 L 3659,11374 C 3659,11668 3777,11963 3896,11963"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id6">
+    <rect class="BoundingBox" stroke="none" fill="none" x="2573" y="7325" width="726" height="2218"/>
+    <text class="SVGTextShape" transform="rotate(-90 3083 9292)"><tspan class="TextParagraph"><tspan class="TextPosition" x="3083" y="9292"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">lcore id 0</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id7">
+    <rect class="BoundingBox" stroke="none" fill="none" x="16619" y="5799" width="317" height="1747"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 16620,5800 C 16698,5800 16777,5872 16777,5945 L 16777,6526 C 16777,6599 16855,6672 16934,6672 16855,6672 16777,6744 16777,6817 L 16777,7398 C 16777,7471 16698,7544 16620,7544"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id8">
+    <rect class="BoundingBox" stroke="none" fill="none" x="17031" y="6325" width="3018" height="726"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="17281" y="6835"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">struct y_lcore</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id9">
+    <rect class="BoundingBox" stroke="none" fill="none" x="4290" y="19556" width="7357" height="1200"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4540" y="20066"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">#define RTE_MAX_LCORE 2</tspan></tspan></tspan><tspan class="TextParagraph"><tspan class="TextPosition" x="4540" y="20540"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">#define RTE_MAX_LCORE_VAR 64</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id10">
+    <rect class="BoundingBox" stroke="none" fill="none" x="3400" y="11964" width="477" height="7064"/>
+    <path fill="none" stroke="rgb(52,101,164)" d="M 3875,11965 C 3756,11965 3638,12259 3638,12553 L 3638,14907 C 3638,15201 3519,15495 3401,15495 3519,15495 3638,15789 3638,16083 L 3638,18437 C 3638,18731 3756,19026 3875,19026"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id11">
+    <rect class="BoundingBox" stroke="none" fill="none" x="2515" y="14347" width="726" height="2218"/>
+    <text class="SVGTextShape" transform="rotate(-90 3025 16314)"><tspan class="TextParagraph"><tspan class="TextPosition" x="3025" y="16314"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">lcore id 1</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id12">
+    <rect class="BoundingBox" stroke="none" fill="none" x="16584" y="11978" width="317" height="868"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 16585,11979 C 16663,11979 16742,12015 16742,12051 L 16742,12339 C 16742,12375 16820,12411 16899,12411 16820,12411 16742,12447 16742,12483 L 16742,12771 C 16742,12807 16663,12844 16585,12844"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id13">
+    <rect class="BoundingBox" stroke="none" fill="none" x="16996" y="12084" width="3018" height="726"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="17246" y="12594"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">struct x_lcore</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id14">
+    <rect class="BoundingBox" stroke="none" fill="none" x="16584" y="12873" width="317" height="1747"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 16585,12874 C 16663,12874 16742,12946 16742,13019 L 16742,13600 C 16742,13673 16820,13746 16899,13746 16820,13746 16742,13818 16742,13891 L 16742,14472 C 16742,14545 16663,14618 16585,14618"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id15">
+    <rect class="BoundingBox" stroke="none" fill="none" x="16996" y="13399" width="3018" height="726"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="17246" y="13909"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">struct y_lcore</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id16">
+    <rect class="BoundingBox" stroke="none" fill="none" x="2065" y="4892" width="851" height="14154"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 2914,4893 C 2702,4893 2490,5482 2490,6072 L 2490,10789 C 2490,11378 2278,11968 2066,11968 2278,11968 2490,12558 2490,13147 L 2490,17864 C 2490,18454 2702,19044 2914,19044"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id17">
+    <rect class="BoundingBox" stroke="none" fill="none" x="1286" y="8935" width="726" height="5982"/>
+    <text class="SVGTextShape" transform="rotate(-90 1796 14666)"><tspan class="TextParagraph"><tspan class="TextPosition" x="1796" y="14666"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">struct lcore_var_buffer.data</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id18">
+    <rect class="BoundingBox" stroke="none" fill="none" x="4965" y="2291" width="4159" height="1200"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="5215" y="2801"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">Handle pointers:</tspan></tspan></tspan><tspan class="TextParagraph"><tspan class="TextPosition" x="5215" y="3275"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">x_lcores  y_lcores</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.LineShape">
+   <g id="id19">
+    <rect class="BoundingBox" stroke="none" fill="none" x="5522" y="3452" width="682" height="1956"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 6202,3453 L 5663,5001"/>
+    <path fill="rgb(0,0,0)" stroke="none" d="M 5522,5407 L 5812,5031 5528,4933 5522,5407 Z"/>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.LineShape">
+   <g id="id20">
+    <rect class="BoundingBox" stroke="none" fill="none" x="5556" y="3418" width="2160" height="2900"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 7714,3419 L 5813,5972"/>
+    <path fill="rgb(0,0,0)" stroke="none" d="M 5556,6317 L 5945,6046 5704,5866 5556,6317 Z"/>
+   </g>
+  </g>
+ </g>
+</svg>
\ No newline at end of file
diff --git a/doc/guides/prog_guide/img/static_array_mem_layout.svg b/doc/guides/prog_guide/img/static_array_mem_layout.svg
new file mode 100644
index 0000000000..ed8bead826
--- /dev/null
+++ b/doc/guides/prog_guide/img/static_array_mem_layout.svg
@@ -0,0 +1,278 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg version="1.2" width="173.66mm" height="196.72mm" viewBox="2001 3124 17366 19672" preserveAspectRatio="xMidYMid" fill-rule="evenodd" stroke-width="28.222" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg" xmlns:ooo="http://xml.openoffice.org/svg/export" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:presentation="http://sun.com/xmlns/staroffice/presentation" xmlns:smil="http://www.w3.org/2001/SMIL20/" xmlns:anim="urn:oasis:names:tc:opendocument:xmlns:animation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xml:space="preserve">
+ <defs>
+  <font id="EmbeddedFont_1" horiz-adv-x="2048">
+   <font-face font-family="Liberation Sans embedded" units-per-em="2048" font-weight="normal" font-style="normal" ascent="1852" descent="423"/>
+   <missing-glyph horiz-adv-x="2048" d="M 0,0 L 2047,0 2047,2047 0,2047 0,0 Z"/>
+   <glyph unicode="x" horiz-adv-x="976" d="M 801,0 L 510,444 217,0 23,0 408,556 41,1082 240,1082 510,661 778,1082 979,1082 612,558 1002,0 801,0 Z"/>
+   <glyph unicode="u" horiz-adv-x="867" d="M 314,1082 L 314,396 C 314,325 321,269 335,230 349,191 371,162 402,145 433,128 478,119 537,119 624,119 692,149 742,208 792,267 817,350 817,455 L 817,1082 997,1082 997,231 C 997,105 999,28 1003,0 L 833,0 C 832,3 832,12 831,27 830,42 830,59 829,78 828,97 826,132 825,185 L 822,185 C 781,110 733,58 679,27 624,-4 557,-20 476,-20 357,-20 271,10 216,69 161,128 133,225 133,361 L 133,1082 314,1082 Z"/>
+   <glyph unicode="t" horiz-adv-x="523" d="M 554,8 C 495,-8 434,-16 372,-16 228,-16 156,66 156,229 L 156,951 31,951 31,1082 163,1082 216,1324 336,1324 336,1082 536,1082 536,951 336,951 336,268 C 336,216 345,180 362,159 379,138 408,127 450,127 474,127 509,132 554,141 L 554,8 Z"/>
+   <glyph unicode="s" horiz-adv-x="891" d="M 950,299 C 950,197 912,118 835,63 758,8 650,-20 511,-20 376,-20 273,2 200,47 127,91 79,160 57,254 L 216,285 C 231,227 263,185 311,158 359,131 426,117 511,117 602,117 669,131 712,159 754,187 775,229 775,285 775,328 760,362 731,389 702,416 654,438 589,455 L 460,489 C 357,516 283,542 240,568 196,593 162,624 137,661 112,698 100,743 100,796 100,895 135,970 206,1022 276,1073 378,1099 513,1099 632,1099 727,1078 798,1036 868,994 912,927 931,834 L 769,814 C 759,862 732,899 689,925 645,950 586,963 513,963 432,963 372,951 333,926 294,901 275,864 275,814 275,783 283,758 299,738 315,718 339,701 370,687 401,673 467,654 568,629 663,605 732,583 774,563 816,542 849,520 874,495 898,470 917,442 930,410 943,377 950,340 950,299 Z"/>
+   <glyph unicode="r" horiz-adv-x="511" d="M 142,0 L 142,830 C 142,906 140,990 136,1082 L 306,1082 C 311,959 314,886 314,861 L 318,861 C 347,954 380,1017 417,1051 454,1085 507,1102 575,1102 599,1102 623,1099 648,1092 L 648,927 C 624,934 592,937 552,937 477,937 420,905 381,841 342,776 322,684 322,564 L 322,0 142,0 Z"/>
+   <glyph unicode="p" horiz-adv-x="918" d="M 1053,546 C 1053,169 920,-20 655,-20 488,-20 376,43 319,168 L 314,168 C 317,163 318,106 318,-2 L 318,-425 138,-425 138,861 C 138,972 136,1046 132,1082 L 306,1082 C 307,1079 308,1070 309,1054 310,1037 312,1012 314,978 315,944 316,921 316,908 L 320,908 C 352,975 394,1024 447,1055 500,1086 569,1101 655,1101 788,1101 888,1056 954,967 1020,878 1053,737 1053,546 Z M 864,542 C 864,693 844,800 803,865 762,930 698,962 609,962 538,962 482,947 442,917 401,887 371,840 350,777 329,713 318,630 318,528 318,386 341,281 386,214 431,147 505,113 607,113 696,113 762,146 803,212 844,277 864,387 864,542 Z"/>
+   <glyph unicode="o" horiz-adv-x="964" d="M 1053,542 C 1053,353 1011,212 928,119 845,26 724,-20 565,-20 407,-20 288,28 207,125 126,221 86,360 86,542 86,915 248,1102 571,1102 736,1102 858,1057 936,966 1014,875 1053,733 1053,542 Z M 864,542 C 864,691 842,800 798,868 753,935 679,969 574,969 469,969 393,935 346,866 299,797 275,689 275,542 275,399 298,292 345,221 391,149 464,113 563,113 671,113 748,148 795,217 841,286 864,395 864,542 Z"/>
+   <glyph unicode="n" horiz-adv-x="867" d="M 825,0 L 825,686 C 825,757 818,813 804,852 790,891 768,920 737,937 706,954 661,963 602,963 515,963 447,933 397,874 347,815 322,732 322,627 L 322,0 142,0 142,851 C 142,977 140,1054 136,1082 L 306,1082 C 307,1079 307,1070 308,1055 309,1040 310,1024 311,1005 312,986 313,950 314,897 L 317,897 C 358,972 406,1025 461,1056 515,1087 582,1102 663,1102 782,1102 869,1073 924,1014 979,955 1006,857 1006,721 L 1006,0 825,0 Z"/>
+   <glyph unicode="l" horiz-adv-x="181" d="M 138,0 L 138,1484 318,1484 318,0 138,0 Z"/>
+   <glyph unicode="i" horiz-adv-x="181" d="M 137,1312 L 137,1484 317,1484 317,1312 137,1312 Z M 137,0 L 137,1082 317,1082 317,0 137,0 Z"/>
+   <glyph unicode="h" horiz-adv-x="861" d="M 317,897 C 356,968 402,1020 457,1053 511,1086 580,1102 663,1102 780,1102 867,1073 923,1015 978,956 1006,858 1006,721 L 1006,0 825,0 825,686 C 825,762 818,819 804,856 790,893 767,920 735,937 703,954 659,963 602,963 517,963 450,934 399,875 348,816 322,737 322,638 L 322,0 142,0 142,1484 322,1484 322,1098 C 322,1057 321,1015 319,972 316,929 315,904 314,897 L 317,897 Z"/>
+   <glyph unicode="g" horiz-adv-x="918" d="M 548,-425 C 430,-425 336,-402 266,-356 196,-309 151,-243 131,-158 L 312,-132 C 324,-182 351,-220 392,-248 433,-274 486,-288 553,-288 732,-288 822,-183 822,27 L 822,201 820,201 C 786,132 739,80 680,45 621,10 551,-8 472,-8 339,-8 242,36 180,124 117,212 86,350 86,539 86,730 120,872 187,963 254,1054 355,1099 492,1099 569,1099 635,1082 692,1047 748,1012 791,962 822,897 L 824,897 C 824,917 825,952 828,1001 831,1050 833,1077 836,1082 L 1007,1082 C 1003,1046 1001,971 1001,858 L 1001,31 C 1001,-273 850,-425 548,-425 Z M 822,541 C 822,629 810,705 786,769 762,832 728,881 685,915 641,948 591,965 536,965 444,965 377,932 335,865 293,798 272,690 272,541 272,393 292,287 331,222 370,157 438,125 533,125 590,125 640,142 684,175 728,208 762,256 786,319 810,381 822,455 822,541 Z"/>
+   <glyph unicode="e" horiz-adv-x="958" d="M 276,503 C 276,379 302,283 353,216 404,149 479,115 578,115 656,115 719,131 766,162 813,193 844,233 861,281 L 1019,236 C 954,65 807,-20 578,-20 418,-20 296,28 213,123 129,218 87,360 87,548 87,727 129,864 213,959 296,1054 416,1102 571,1102 889,1102 1048,910 1048,527 L 1048,503 276,503 Z M 862,641 C 852,755 823,838 775,891 727,943 658,969 568,969 481,969 412,940 361,882 310,823 282,743 278,641 L 862,641 Z"/>
+   <glyph unicode="d" horiz-adv-x="918" d="M 821,174 C 788,105 744,55 689,25 634,-5 565,-20 484,-20 347,-20 247,26 183,118 118,210 86,349 86,536 86,913 219,1102 484,1102 566,1102 634,1087 689,1057 744,1027 788,979 821,914 L 823,914 821,1035 821,1484 1001,1484 1001,223 C 1001,110 1003,36 1007,0 L 835,0 C 833,11 831,35 829,74 826,113 825,146 825,174 L 821,174 Z M 275,542 C 275,391 295,282 335,217 375,152 440,119 530,119 632,119 706,154 752,225 798,296 821,405 821,554 821,697 798,802 752,869 706,936 633,969 532,969 441,969 376,936 336,869 295,802 275,693 275,542 Z"/>
+   <glyph unicode="c" horiz-adv-x="880" d="M 275,546 C 275,402 298,295 343,226 388,157 457,122 548,122 612,122 666,139 709,174 752,209 778,262 788,334 L 970,322 C 956,218 912,135 837,73 762,11 668,-20 553,-20 402,-20 286,28 207,124 127,219 87,359 87,542 87,724 127,863 207,959 287,1054 402,1102 551,1102 662,1102 754,1073 827,1016 900,959 945,880 964,779 L 779,765 C 770,825 746,873 708,908 670,943 616,961 546,961 451,961 382,929 339,866 296,803 275,696 275,546 Z"/>
+   <glyph unicode="b" horiz-adv-x="918" d="M 1053,546 C 1053,169 920,-20 655,-20 573,-20 505,-5 451,25 396,54 352,102 318,168 L 316,168 C 316,147 315,116 312,74 309,31 307,7 306,0 L 132,0 C 136,36 138,110 138,223 L 138,1484 318,1484 318,1061 C 318,1018 317,967 314,908 L 318,908 C 351,977 396,1027 451,1057 506,1087 574,1102 655,1102 792,1102 892,1056 957,964 1021,872 1053,733 1053,546 Z M 864,540 C 864,691 844,800 804,865 764,930 699,963 609,963 508,963 434,928 388,859 341,790 318,680 318,529 318,387 341,282 386,215 431,147 505,113 607,113 698,113 763,147 804,214 844,281 864,389 864,540 Z"/>
+   <glyph unicode="a" horiz-adv-x="1049" d="M 414,-20 C 305,-20 224,9 169,66 114,123 87,202 87,302 87,414 124,500 198,560 271,620 390,652 554,656 L 797,660 797,719 C 797,807 778,870 741,908 704,946 645,965 565,965 484,965 426,951 389,924 352,897 330,853 323,793 L 135,810 C 166,1005 310,1102 569,1102 705,1102 807,1071 876,1009 945,946 979,856 979,738 L 979,272 C 979,219 986,179 1000,152 1014,125 1041,111 1080,111 1097,111 1117,113 1139,118 L 1139,6 C 1094,-5 1047,-10 1000,-10 933,-10 885,8 855,43 824,78 807,132 803,207 L 797,207 C 751,124 698,66 637,32 576,-3 501,-20 414,-20 Z M 455,115 C 521,115 580,130 631,160 682,190 723,231 753,284 782,336 797,390 797,445 L 797,534 600,530 C 515,529 451,520 408,504 364,488 330,463 307,430 284,397 272,353 272,299 272,240 288,195 320,163 351,131 396,115 455,115 Z"/>
+   <glyph unicode="_" horiz-adv-x="1188" d="M -31,-407 L -31,-277 1162,-277 1162,-407 -31,-407 Z"/>
+   <glyph unicode="]" horiz-adv-x="407" d="M 16,-425 L 16,-296 249,-296 249,1355 16,1355 16,1484 423,1484 423,-425 16,-425 Z"/>
+   <glyph unicode="[" horiz-adv-x="407" d="M 146,-425 L 146,1484 553,1484 553,1355 320,1355 320,-296 553,-296 553,-425 146,-425 Z"/>
+   <glyph unicode="X" horiz-adv-x="1273" d="M 1112,0 L 689,616 257,0 46,0 582,732 87,1409 298,1409 690,856 1071,1409 1282,1409 800,739 1323,0 1112,0 Z"/>
+   <glyph unicode="U" horiz-adv-x="1159" d="M 731,-20 C 616,-20 515,1 429,43 343,85 276,146 229,226 182,306 158,401 158,512 L 158,1409 349,1409 349,528 C 349,399 382,302 447,235 512,168 607,135 730,135 857,135 955,170 1026,239 1096,308 1131,408 1131,541 L 1131,1409 1321,1409 1321,530 C 1321,416 1297,318 1249,235 1200,152 1132,89 1044,46 955,2 851,-20 731,-20 Z"/>
+   <glyph unicode="T" horiz-adv-x="1154" d="M 720,1253 L 720,0 530,0 530,1253 46,1253 46,1409 1204,1409 1204,1253 720,1253 Z"/>
+   <glyph unicode="R" horiz-adv-x="1211" d="M 1164,0 L 798,585 359,585 359,0 168,0 168,1409 831,1409 C 990,1409 1112,1374 1199,1303 1285,1232 1328,1133 1328,1006 1328,901 1298,813 1237,742 1176,671 1091,626 984,607 L 1384,0 1164,0 Z M 1136,1004 C 1136,1086 1108,1149 1053,1192 997,1235 917,1256 812,1256 L 359,1256 359,736 820,736 C 921,736 999,760 1054,807 1109,854 1136,919 1136,1004 Z"/>
+   <glyph unicode="O" horiz-adv-x="1393" d="M 1495,711 C 1495,564 1467,435 1411,324 1354,213 1273,128 1168,69 1063,10 938,-20 795,-20 650,-20 526,9 421,68 316,127 235,212 180,323 125,434 97,563 97,711 97,936 159,1113 282,1240 405,1367 577,1430 797,1430 940,1430 1065,1402 1170,1345 1275,1288 1356,1205 1412,1096 1467,987 1495,859 1495,711 Z M 1300,711 C 1300,886 1256,1024 1169,1124 1081,1224 957,1274 797,1274 636,1274 511,1225 423,1126 335,1027 291,889 291,711 291,534 336,394 425,291 514,187 637,135 795,135 958,135 1083,185 1170,286 1257,386 1300,528 1300,711 Z"/>
+   <glyph unicode="M" horiz-adv-x="1364" d="M 1366,0 L 1366,940 C 1366,1044 1369,1144 1375,1240 1342,1121 1313,1027 1287,960 L 923,0 789,0 420,960 364,1130 331,1240 334,1129 338,940 338,0 168,0 168,1409 419,1409 794,432 C 807,393 820,351 833,306 845,261 853,228 857,208 862,235 874,275 891,330 908,384 919,418 925,432 L 1293,1409 1538,1409 1538,0 1366,0 Z"/>
+   <glyph unicode="L" horiz-adv-x="900" d="M 168,0 L 168,1409 359,1409 359,156 1071,156 1071,0 168,0 Z"/>
+   <glyph unicode="H" horiz-adv-x="1140" d="M 1121,0 L 1121,653 359,653 359,0 168,0 168,1409 359,1409 359,813 1121,813 1121,1409 1312,1409 1312,0 1121,0 Z"/>
+   <glyph unicode="G" horiz-adv-x="1332" d="M 103,711 C 103,940 164,1117 287,1242 410,1367 582,1430 804,1430 960,1430 1087,1404 1184,1351 1281,1298 1356,1214 1409,1098 L 1227,1044 C 1187,1124 1132,1182 1062,1219 991,1256 904,1274 799,1274 636,1274 512,1225 426,1127 340,1028 297,890 297,711 297,533 343,393 434,290 525,187 652,135 813,135 905,135 991,149 1071,177 1150,205 1215,243 1264,291 L 1264,545 843,545 843,705 1440,705 1440,219 C 1365,143 1274,84 1166,43 1057,1 940,-20 813,-20 666,-20 539,9 432,68 325,127 244,211 188,322 131,432 103,562 103,711 Z"/>
+   <glyph unicode="E" horiz-adv-x="1106" d="M 168,0 L 168,1409 1237,1409 1237,1253 359,1253 359,801 1177,801 1177,647 359,647 359,156 1278,156 1278,0 168,0 Z"/>
+   <glyph unicode="D" horiz-adv-x="1208" d="M 1381,719 C 1381,574 1353,447 1296,338 1239,229 1159,145 1055,87 951,29 831,0 695,0 L 168,0 168,1409 634,1409 C 873,1409 1057,1349 1187,1230 1316,1110 1381,940 1381,719 Z M 1189,719 C 1189,894 1141,1027 1046,1119 950,1210 811,1256 630,1256 L 359,1256 359,153 673,153 C 776,153 867,176 946,221 1024,266 1084,332 1126,417 1168,502 1189,603 1189,719 Z"/>
+   <glyph unicode="C" horiz-adv-x="1292" d="M 792,1274 C 636,1274 515,1224 428,1124 341,1023 298,886 298,711 298,538 343,400 434,295 524,190 646,137 800,137 997,137 1146,235 1245,430 L 1401,352 C 1343,231 1262,138 1157,75 1052,12 930,-20 791,-20 649,-20 526,10 423,69 319,128 240,212 186,322 131,431 104,561 104,711 104,936 165,1112 286,1239 407,1366 575,1430 790,1430 940,1430 1065,1401 1166,1342 1267,1283 1341,1196 1388,1081 L 1207,1021 C 1174,1103 1122,1166 1050,1209 977,1252 891,1274 792,1274 Z"/>
+   <glyph unicode="A" horiz-adv-x="1353" d="M 1167,0 L 1006,412 364,412 202,0 4,0 579,1409 796,1409 1362,0 1167,0 Z M 685,1265 L 676,1237 C 659,1182 635,1111 602,1024 L 422,561 949,561 768,1026 C 749,1072 731,1124 712,1182 L 685,1265 Z"/>
+   <glyph unicode="&gt;" horiz-adv-x="992" d="M 101,154 L 101,307 959,674 101,1040 101,1194 1096,776 1096,571 101,154 Z"/>
+   <glyph unicode="&lt;" horiz-adv-x="992" d="M 101,571 L 101,776 1096,1194 1096,1040 238,674 1096,307 1096,154 101,571 Z"/>
+   <glyph unicode="9" horiz-adv-x="943" d="M 1042,733 C 1042,491 998,305 910,175 821,45 695,-20 532,-20 422,-20 334,3 268,50 201,96 154,171 125,274 L 297,301 C 333,184 412,125 535,125 638,125 718,173 775,269 832,365 861,502 864,680 837,620 792,572 727,536 662,499 591,481 514,481 387,481 286,524 210,611 134,698 96,813 96,956 96,1103 137,1219 220,1304 303,1388 418,1430 565,1430 722,1430 840,1372 921,1256 1002,1140 1042,966 1042,733 Z M 846,907 C 846,1020 820,1112 768,1181 716,1250 646,1284 559,1284 472,1284 404,1255 354,1196 304,1137 279,1057 279,956 279,853 304,772 354,713 404,653 472,623 557,623 609,623 657,635 702,659 747,682 782,716 808,759 833,802 846,852 846,907 Z"/>
+   <glyph unicode="8" horiz-adv-x="958" d="M 1050,393 C 1050,263 1009,162 926,89 843,16 725,-20 570,-20 419,-20 302,16 217,87 132,158 89,260 89,391 89,483 115,560 168,623 221,686 288,724 370,737 L 370,741 C 293,759 233,798 189,858 144,918 122,988 122,1069 122,1176 162,1263 243,1330 323,1397 431,1430 566,1430 705,1430 814,1397 895,1332 975,1267 1015,1178 1015,1067 1015,986 993,916 948,856 903,796 842,758 765,743 L 765,739 C 855,724 925,686 975,625 1025,563 1050,486 1050,393 Z M 828,1057 C 828,1216 741,1296 566,1296 481,1296 417,1276 373,1236 328,1196 306,1136 306,1057 306,976 329,915 375,873 420,830 485,809 568,809 653,809 717,829 762,868 806,907 828,970 828,1057 Z M 863,410 C 863,497 837,563 785,608 733,652 660,674 566,674 475,674 403,650 352,603 301,555 275,489 275,406 275,212 374,115 572,115 670,115 743,139 791,186 839,233 863,307 863,410 Z"/>
+   <glyph unicode="7" horiz-adv-x="928" d="M 1036,1263 C 892,1043 790,871 731,746 672,621 627,498 598,377 568,256 553,130 553,0 L 365,0 C 365,180 403,370 480,569 556,768 683,997 862,1256 L 105,1256 105,1409 1036,1409 1036,1263 Z"/>
+   <glyph unicode="6" horiz-adv-x="942" d="M 1049,461 C 1049,312 1009,195 928,109 847,23 736,-20 594,-20 435,-20 314,39 230,157 146,275 104,447 104,672 104,916 148,1103 235,1234 322,1365 447,1430 608,1430 821,1430 955,1334 1010,1143 L 838,1112 C 803,1227 725,1284 606,1284 503,1284 424,1236 368,1141 311,1045 283,906 283,725 316,786 362,832 421,864 480,895 548,911 625,911 755,911 858,870 935,789 1011,708 1049,598 1049,461 Z M 866,453 C 866,555 841,634 791,689 741,744 671,772 582,772 498,772 430,748 379,699 327,650 301,582 301,496 301,387 328,298 382,229 435,160 504,125 588,125 675,125 743,154 792,213 841,271 866,351 866,453 Z"/>
+   <glyph unicode="5" horiz-adv-x="968" d="M 1053,459 C 1053,310 1009,193 921,108 832,23 710,-20 553,-20 422,-20 316,9 235,66 154,123 103,206 82,315 L 264,336 C 302,197 400,127 557,127 654,127 729,156 784,215 839,273 866,353 866,455 866,544 839,615 784,670 729,725 654,752 561,752 512,752 467,744 425,729 383,714 341,688 299,651 L 123,651 170,1409 971,1409 971,1256 334,1256 307,809 C 385,869 482,899 598,899 737,899 847,858 930,777 1012,696 1053,590 1053,459 Z"/>
+   <glyph unicode="4" horiz-adv-x="1029" d="M 881,319 L 881,0 711,0 711,319 47,319 47,459 692,1409 881,1409 881,461 1079,461 1079,319 881,319 Z M 711,1206 C 710,1202 700,1184 683,1153 666,1122 653,1100 644,1087 L 283,555 229,481 213,461 711,461 711,1206 Z"/>
+   <glyph unicode="3" horiz-adv-x="968" d="M 1049,389 C 1049,259 1008,158 925,87 842,16 724,-20 571,-20 428,-20 315,12 230,77 145,141 94,236 78,362 L 264,379 C 288,212 390,129 571,129 662,129 733,151 785,196 836,241 862,307 862,395 862,472 833,532 774,575 715,618 629,639 518,639 L 416,639 416,795 514,795 C 613,795 689,817 744,860 798,903 825,962 825,1038 825,1113 803,1173 759,1217 714,1260 648,1282 561,1282 482,1282 418,1262 369,1221 320,1180 291,1123 283,1049 L 102,1063 C 115,1178 163,1268 246,1333 328,1398 434,1430 563,1430 704,1430 814,1397 893,1332 971,1266 1010,1174 1010,1057 1010,967 985,894 935,838 884,781 811,743 715,723 L 715,719 C 820,708 902,672 961,613 1020,554 1049,479 1049,389 Z"/>
+   <glyph unicode="2" horiz-adv-x="930" d="M 103,0 L 103,127 C 137,205 179,274 228,334 277,393 328,447 382,496 436,544 490,589 543,630 596,671 643,713 686,754 729,795 763,839 790,884 816,929 829,981 829,1038 829,1115 806,1175 761,1218 716,1261 653,1282 572,1282 495,1282 432,1261 383,1220 333,1178 304,1119 295,1044 L 111,1061 C 124,1174 172,1263 255,1330 337,1397 443,1430 572,1430 714,1430 823,1397 900,1330 976,1263 1014,1167 1014,1044 1014,989 1002,935 977,881 952,827 914,773 865,719 816,665 721,581 582,468 505,405 444,349 399,299 354,248 321,200 301,153 L 1036,153 1036,0 103,0 Z"/>
+   <glyph unicode="1" horiz-adv-x="880" d="M 156,0 L 156,153 515,153 515,1237 197,1010 197,1180 530,1409 696,1409 696,153 1039,153 1039,0 156,0 Z"/>
+   <glyph unicode="0" horiz-adv-x="976" d="M 1059,705 C 1059,470 1018,290 935,166 852,42 729,-20 567,-20 405,-20 283,42 202,165 121,288 80,468 80,705 80,947 120,1128 199,1249 278,1370 402,1430 573,1430 739,1430 862,1369 941,1247 1020,1125 1059,944 1059,705 Z M 876,705 C 876,908 853,1056 806,1147 759,1238 681,1284 573,1284 462,1284 383,1239 335,1149 286,1059 262,911 262,705 262,505 287,359 336,266 385,173 462,127 569,127 675,127 753,174 802,269 851,364 876,509 876,705 Z"/>
+   <glyph unicode=" " horiz-adv-x="556"/>
+  </font>
+ </defs>
+ <defs class="EmbeddedBulletChars">
+  <g id="bullet-char-template-57356" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 580,1141 L 1163,571 580,0 -4,571 580,1141 Z"/>
+  </g>
+  <g id="bullet-char-template-57354" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 8,1128 L 1137,1128 1137,0 8,0 8,1128 Z"/>
+  </g>
+  <g id="bullet-char-template-10146" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 174,0 L 602,739 174,1481 1456,739 174,0 Z M 1358,739 L 309,1346 659,739 1358,739 Z"/>
+  </g>
+  <g id="bullet-char-template-10132" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 2015,739 L 1276,0 717,0 1260,543 174,543 174,936 1260,936 717,1481 1274,1481 2015,739 Z"/>
+  </g>
+  <g id="bullet-char-template-10007" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 0,-2 C -7,14 -16,27 -25,37 L 356,567 C 262,823 215,952 215,954 215,979 228,992 255,992 264,992 276,990 289,987 310,991 331,999 354,1012 L 381,999 492,748 772,1049 836,1024 860,1049 C 881,1039 901,1025 922,1006 886,937 835,863 770,784 769,783 710,716 594,584 L 774,223 C 774,196 753,168 711,139 L 727,119 C 717,90 699,76 672,76 641,76 570,178 457,381 L 164,-76 C 142,-110 111,-127 72,-127 30,-127 9,-110 8,-76 1,-67 -2,-52 -2,-32 -2,-23 -1,-13 0,-2 Z"/>
+  </g>
+  <g id="bullet-char-template-10004" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 285,-33 C 182,-33 111,30 74,156 52,228 41,333 41,471 41,549 55,616 82,672 116,743 169,778 240,778 293,778 328,747 346,684 L 369,508 C 377,444 397,411 428,410 L 1163,1116 C 1174,1127 1196,1133 1229,1133 1271,1133 1292,1118 1292,1087 L 1292,965 C 1292,929 1282,901 1262,881 L 442,47 C 390,-6 338,-33 285,-33 Z"/>
+  </g>
+  <g id="bullet-char-template-9679" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 813,0 C 632,0 489,54 383,161 276,268 223,411 223,592 223,773 276,916 383,1023 489,1130 632,1184 813,1184 992,1184 1136,1130 1245,1023 1353,916 1407,772 1407,592 1407,412 1353,268 1245,161 1136,54 992,0 813,0 Z"/>
+  </g>
+  <g id="bullet-char-template-8226" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 346,457 C 273,457 209,483 155,535 101,586 74,649 74,723 74,796 101,859 155,911 209,963 273,989 346,989 419,989 480,963 531,910 582,859 608,796 608,723 608,648 583,586 532,535 482,483 420,457 346,457 Z"/>
+  </g>
+  <g id="bullet-char-template-8211" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M -4,459 L 1135,459 1135,606 -4,606 -4,459 Z"/>
+  </g>
+  <g id="bullet-char-template-61548" transform="scale(0.00048828125,-0.00048828125)">
+   <path d="M 173,740 C 173,903 231,1043 346,1159 462,1274 601,1332 765,1332 928,1332 1067,1274 1183,1159 1299,1043 1357,903 1357,740 1357,577 1299,437 1183,322 1067,206 928,148 765,148 601,148 462,206 346,322 231,437 173,577 173,740 Z"/>
+  </g>
+ </defs>
+ <g class="Page">
+  <g class="com.sun.star.drawing.TableShape">
+   <g>
+    <rect class="BoundingBox" stroke="none" fill="none" x="3698" y="3124" width="13628" height="19672"/>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,3142 L 5226,3142 5226,3737 3716,3737 3716,3142 Z"/>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 5226,3142 L 6736,3142 6736,3737 5226,3737 5226,3142 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="5884" y="3560"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">0</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 6736,3142 L 8246,3142 8246,3737 6736,3737 6736,3142 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="7394" y="3560"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">1</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 8246,3142 L 9756,3142 9756,3737 8246,3737 8246,3142 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="8904" y="3560"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">2</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 9756,3142 L 11269,3142 11269,3737 9756,3737 9756,3142 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="10415" y="3560"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">3</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 11269,3142 L 12782,3142 12782,3737 11269,3737 11269,3142 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="11928" y="3560"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">4</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 12782,3142 L 14295,3142 14295,3737 12782,3737 12782,3142 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="13441" y="3560"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">5</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 14295,3142 L 15805,3142 15805,3737 14295,3737 14295,3142 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="14953" y="3560"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">6</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 15805,3142 L 17307,3142 17307,3737 15805,3737 15805,3142 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="16459" y="3560"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">7</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,3737 L 5226,3737 5226,4332 3716,4332 3716,3737 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4374" y="4155"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">0</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 5226,3737 L 11269,3737 11269,4332 5226,4332 5226,3737 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="7918" y="4155"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">int a</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 11269,3737 L 12782,3737 12782,4332 11269,4332 11269,3737 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="11539" y="4155"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">char b</tspan></tspan></tspan></text>
+    <path fill="rgb(255,255,153)" stroke="none" d="M 12782,3737 L 17307,3737 17307,4332 12782,4332 12782,3737 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="14215" y="4155"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">&lt;padding&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,4332 L 5226,4332 5226,4927 3716,4927 3716,4332 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4374" y="4750"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">8</tspan></tspan></tspan></text>
+    <path fill="rgb(255,255,153)" stroke="none" d="M 5226,4332 L 17307,4332 17307,8497 5226,8497 5226,4332 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="8756" y="6535"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">__rte_cache_aligned &lt;padding&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,4927 L 5226,4927 5226,5522 3716,5522 3716,4927 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="5345"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">16</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,5522 L 5226,5522 5226,6117 3716,6117 3716,5522 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="5940"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">24</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,6117 L 5226,6117 5226,6712 3716,6712 3716,6117 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="6535"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">32</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,6712 L 5226,6712 5226,7307 3716,7307 3716,6712 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="7130"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">40</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,7307 L 5226,7307 5226,7902 3716,7902 3716,7307 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="7725"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">48</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,7902 L 5226,7902 5226,8497 3716,8497 3716,7902 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="8320"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">56</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,8497 L 5226,8497 5226,9092 3716,9092 3716,8497 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="8915"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">64</tspan></tspan></tspan></text>
+    <path fill="rgb(255,255,153)" stroke="none" d="M 5226,8497 L 17307,8497 17307,13257 5226,13257 5226,8497 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="8596" y="10998"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">RTE_CACHE_GUARD &lt;padding&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,9092 L 5226,9092 5226,9687 3716,9687 3716,9092 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="9510"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">72</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,9687 L 5226,9687 5226,10282 3716,10282 3716,9687 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="10105"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">80</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,10282 L 5226,10282 5226,10877 3716,10877 3716,10282 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="10700"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">88</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,10877 L 5226,10877 5226,11472 3716,11472 3716,10877 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4277" y="11295"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">96</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,11472 L 5226,11472 5226,12067 3716,12067 3716,11472 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="11890"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">104</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,12067 L 5226,12067 5226,12662 3716,12662 3716,12067 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4192" y="12485"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">112</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,12662 L 5226,12662 5226,13257 3716,13257 3716,12662 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="13080"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">120</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,13257 L 5226,13257 5226,13852 3716,13852 3716,13257 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="13675"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">128</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 5226,13257 L 11269,13257 11269,13852 5226,13852 5226,13257 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="7918" y="13675"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">int a</tspan></tspan></tspan></text>
+    <path fill="rgb(51,204,102)" stroke="none" d="M 11269,13257 L 12782,13257 12782,13852 11269,13852 11269,13257 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="11539" y="13675"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">char b</tspan></tspan></tspan></text>
+    <path fill="rgb(255,255,153)" stroke="none" d="M 12782,13257 L 17307,13257 17307,13852 12782,13852 12782,13257 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="14215" y="13675"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">&lt;padding&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,13852 L 5226,13852 5226,14447 3716,14447 3716,13852 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="14270"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">136</tspan></tspan></tspan></text>
+    <path fill="rgb(255,255,153)" stroke="none" d="M 5226,13852 L 17307,13852 17307,18017 5226,18017 5226,13852 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="8756" y="16055"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">__rte_cache_aligned &lt;padding&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,14447 L 5226,14447 5226,15042 3716,15042 3716,14447 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="14865"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">144</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,15042 L 5226,15042 5226,15637 3716,15637 3716,15042 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="15460"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">152</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,15637 L 5226,15637 5226,16232 3716,16232 3716,15637 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="16055"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">160</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,16232 L 5226,16232 5226,16827 3716,16827 3716,16232 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="16650"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">168</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,16827 L 5226,16827 5226,17422 3716,17422 3716,16827 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="17245"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">176</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,17422 L 5226,17422 5226,18017 3716,18017 3716,17422 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="17840"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">184</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,18017 L 5226,18017 5226,18612 3716,18612 3716,18017 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="18435"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">192</tspan></tspan></tspan></text>
+    <path fill="rgb(255,255,153)" stroke="none" d="M 5226,18017 L 17307,18017 17307,22777 5226,22777 5226,18017 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="8596" y="20518"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">RTE_CACHE_GUARD &lt;padding&gt;</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,18612 L 5226,18612 5226,19207 3716,19207 3716,18612 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="19030"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">200</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,19207 L 5226,19207 5226,19802 3716,19802 3716,19207 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="19625"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">208</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,19802 L 5226,19802 5226,20397 3716,20397 3716,19802 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="20220"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">216</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,20397 L 5226,20397 5226,20992 3716,20992 3716,20397 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="20815"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">224</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,20992 L 5226,20992 5226,21587 3716,21587 3716,20992 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="21410"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">232</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,21587 L 5226,21587 5226,22182 3716,22182 3716,21587 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="22005"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">240</tspan></tspan></tspan></text>
+    <path fill="rgb(89,131,176)" stroke="none" d="M 3716,22182 L 5226,22182 5226,22777 3716,22777 3716,22182 Z"/>
+    <text class="SVGTextShape"><tspan class="TextParagraph"><tspan class="TextPosition" x="4179" y="22600"><tspan font-family="Liberation Sans, sans-serif" font-size="353px" font-weight="400" fill="rgb(255,255,255)" stroke="none" style="white-space: pre">248</tspan></tspan></tspan></text>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,3142 L 17320,3142"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3716,3129 L 3716,22790"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 5226,3129 L 5226,22790"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 6736,3129 L 6736,3750"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 8246,3129 L 8246,3750"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 9756,3129 L 9756,3750"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 11269,3129 L 11269,4345"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 12782,3129 L 12782,4345"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 14295,3129 L 14295,3750"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 15805,3129 L 15805,3750"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 17307,3129 L 17307,22790"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,3737 L 17320,3737"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,4332 L 17320,4332"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,4927 L 5239,4927"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,5522 L 5239,5522"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,6117 L 5239,6117"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,6712 L 5239,6712"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,7307 L 5239,7307"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,7902 L 5239,7902"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,8497 L 17320,8497"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,9092 L 5239,9092"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,9687 L 5239,9687"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,10282 L 5239,10282"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,10877 L 5239,10877"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,11472 L 5239,11472"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,12067 L 5239,12067"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,12662 L 5239,12662"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,13257 L 17320,13257"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 11269,13244 L 11269,13865"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 12782,13244 L 12782,13865"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,13852 L 17320,13852"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,14447 L 5239,14447"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,15042 L 5239,15042"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,15637 L 5239,15637"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,16232 L 5239,16232"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,16827 L 5239,16827"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,17422 L 5239,17422"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,18017 L 17320,18017"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,18612 L 5239,18612"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,19207 L 5239,19207"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,19802 L 5239,19802"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,20397 L 5239,20397"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,20992 L 5239,20992"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,21587 L 5239,21587"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,22182 L 5239,22182"/>
+    <path fill="none" stroke="rgb(255,255,255)" stroke-width="26" stroke-linejoin="round" d="M 3703,22777 L 17320,22777"/>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id3">
+    <rect class="BoundingBox" stroke="none" fill="none" x="17450" y="3781" width="474" height="9436"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 17451,3782 C 17568,3782 17686,4175 17686,4568 L 17686,7712 C 17686,8105 17804,8498 17922,8498 17804,8498 17686,8891 17686,9284 L 17686,12428 C 17686,12821 17568,13215 17451,13215"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id4">
+    <rect class="BoundingBox" stroke="none" fill="none" x="18113" y="6986" width="1200" height="3018"/>
+    <text class="SVGTextShape" transform="rotate(-90 18623 9753)"><tspan class="TextParagraph"><tspan class="TextPosition" x="18623" y="9753"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">struct x_lcore</tspan></tspan></tspan><tspan class="TextParagraph"><tspan class="TextPosition" x="19097" y="9353"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">lcore id 0</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id5">
+    <rect class="BoundingBox" stroke="none" fill="none" x="2793" y="3754" width="851" height="18961"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 3642,3755 C 3430,3755 3218,4544 3218,5334 L 3218,11654 C 3218,12444 3006,13234 2794,13234 3006,13234 3218,14023 3218,14813 L 3218,21133 C 3218,21923 3430,22713 3642,22713"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id6">
+    <rect class="BoundingBox" stroke="none" fill="none" x="2001" y="8847" width="726" height="8631"/>
+    <text class="SVGTextShape" transform="rotate(-90 2511 17227)"><tspan class="TextParagraph"><tspan class="TextPosition" x="2511" y="17227"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">struct x_lcore x_lcores[RTE_MAX_LCORE]</tspan></tspan></tspan></text>
+   </g>
+  </g>
+  <g class="com.sun.star.drawing.CustomShape">
+   <g id="id7">
+    <rect class="BoundingBox" stroke="none" fill="none" x="17459" y="13305" width="474" height="9436"/>
+    <path fill="none" stroke="rgb(0,0,0)" d="M 17460,13306 C 17577,13306 17695,13699 17695,14092 L 17695,17236 C 17695,17629 17813,18022 17931,18022 17813,18022 17695,18415 17695,18808 L 17695,21952 C 17695,22345 17577,22739 17460,22739"/>
+   </g>
+  </g>
+  <g class="TextShape">
+   <g id="id8">
+    <rect class="BoundingBox" stroke="none" fill="none" x="18167" y="16527" width="1200" height="3018"/>
+    <text class="SVGTextShape" transform="rotate(-90 18677 19294)"><tspan class="TextParagraph"><tspan class="TextPosition" x="18677" y="19294"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">struct x_lcore</tspan></tspan></tspan><tspan class="TextParagraph"><tspan class="TextPosition" x="19151" y="18894"><tspan font-family="Liberation Sans, sans-serif" font-size="423px" font-weight="400" fill="rgb(0,0,0)" stroke="none" style="white-space: pre">lcore id 1</tspan></tspan></tspan></text>
+   </g>
+  </g>
+ </g>
+</svg>
\ No newline at end of file
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index 7eb1a98d88..c4432c4b74 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -27,6 +27,7 @@ Memory Management
     mempool_lib
     mbuf_lib
     multi_proc_support
+    lcore_var
 
 
 CPU Management
diff --git a/doc/guides/prog_guide/lcore_var.rst b/doc/guides/prog_guide/lcore_var.rst
new file mode 100644
index 0000000000..b647ba7391
--- /dev/null
+++ b/doc/guides/prog_guide/lcore_var.rst
@@ -0,0 +1,548 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2024 Ericsson AB
+
+Lcore Variables
+===============
+
+The ``rte_lcore_var.h`` API provides a mechanism to allocate and
+access per-lcore id variables in a space- and cycle-efficient manner.
+
+Lcore Variables API
+-------------------
+
+A per-lcore id variable (or lcore variable for short) holds a unique
+value for each EAL thread and registered non-EAL thread. Thus, there
+is one distinct value for each past, current and future lcore
+id-equipped thread, with a total of ``RTE_MAX_LCORE`` instances.
+
+The value of the lcore variable for one lcore id is independent of the
+values associated with other lcore ids within the same variable.
+
+For detailed information on the lcore variables API, please refer to
+the ``rte_lcore_var.h`` API documentation.
+
+Lcore Variable Handle
+^^^^^^^^^^^^^^^^^^^^^
+
+To allocate and access an lcore variable's values, a *handle* is
+used. The handle is represented by an opaque pointer, only to be
+dereferenced using the appropriate ``<rte_lcore_var.h>`` macros.
+
+The handle is a pointer to the value's type (e.g., for a ``uint32_t``
+lcore variable, the handle is a ``uint32_t *``).
+
+The reason the handle is typed (i.e., it's not a void pointer or an
+integer) is to enable type checking when accessing values of the lcore
+variable.
+
+A handle may be passed between modules and threads just like any other
+pointer.
+
+A valid (i.e., allocated) handle never has the value NULL. Thus, a
+handle set to NULL may be used to signify that allocation has not yet
+been done.
+
+Lcore Variable Allocation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An lcore variable is created in two steps:
+
+1. Define an lcore variable handle by using ``RTE_LCORE_VAR_HANDLE``.
+2. Allocate lcore variable storage and initialize the handle by using
+   ``RTE_LCORE_VAR_ALLOC`` or ``RTE_LCORE_VAR_INIT``. Allocation
+   generally occurs at the time of module initialization, but may be
+   done at any time.
+
+The lifetime of an lcore variable is not tied to the thread that
+created it.
+
+Each lcore variable has ``RTE_MAX_LCORE`` values, one for each
+possible lcore id. All of an lcore variable's values may be accessed
+from the moment the lcore variable is created, throughout the lifetime
+of the EAL (i.e., until ``rte_eal_cleanup()``).
+
+Lcore variables do not need to be freed and cannot be freed.
+
+Access
+^^^^^^
+
+The value of any lcore variable for any lcore id may be accessed from
+any thread (including unregistered threads), but it should only be
+*frequently* read from or written to by the *owner*. A thread is
+considered the owner of a particular lcore variable value instance if
+it has the lcore id associated with that instance.
+
+Non-owner accesses result in *false sharing*. As long as non-owner
+accesses are rare, they will have only a very slight effect on
+performance. This property of the lcore variable memory organization
+is intentional. See the implementation section for more information.
+
+Values of the same lcore variable, associated with different lcore
+ids, may be frequently read or written by their respective owners
+without risking false sharing.
+
+An appropriate synchronization mechanism, such as atomic loads and
+stores, should be employed to prevent data races between the owning
+thread and any other thread accessing the same value instance.
+
+The value of the lcore variable for a particular lcore id is accessed
+via ``RTE_LCORE_VAR_LCORE``.
+
+A common pattern is for an EAL thread or a registered non-EAL
+thread to access its own lcore variable value. For this purpose, a
+shorthand exists as ``RTE_LCORE_VAR``.
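+
+Below is a short sketch of both access methods. The ``counter``
+handle is hypothetical and assumed to have already been allocated:
+
+.. code-block:: c
+
+    static RTE_LCORE_VAR_HANDLE(uint32_t, counter);
+
+    /* An EAL thread or registered non-EAL thread accessing the
+     * value instance it owns. */
+    uint32_t *own_value = RTE_LCORE_VAR(counter);
+
+    /* Any thread accessing the value instance of lcore id 7. */
+    uint32_t *other_value = RTE_LCORE_VAR_LCORE(7, counter);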
+
+The handle, defined by ``RTE_LCORE_VAR_HANDLE``, is a pointer of the
+same type as the value, but it must be treated as an opaque identifier
+and cannot be directly dereferenced.
+
+Lcore variable handles and value pointers may be freely passed
+between different threads.
+
+Storage
+^^^^^^^
+
+An lcore variable's values may be of a primitive type like ``int``,
+but are typically of a ``struct`` type.
+
+The lcore variable handle introduces a per-variable (not
+per-value/per-lcore id) overhead of ``sizeof(void *)`` bytes, so there
+are some memory footprint gains to be made by organizing all per-lcore
+id data for a particular module as one lcore variable (e.g., as a
+struct).
+
+An application may define an lcore variable handle without ever
+allocating the lcore variable.
+
+The size of an lcore variable's value cannot exceed the DPDK
+build-time constant ``RTE_MAX_LCORE_VAR``. An lcore variable's size is
+the size of one of its value instances, not the aggregate of all its
+``RTE_MAX_LCORE`` instances.
+
+Lcore variables should generally *not* be ``__rte_cache_aligned`` and
+need *not* include a ``RTE_CACHE_GUARD`` field, since these constructs
+are designed to avoid false sharing. With lcore variables, false
+sharing is largely avoided by other means. In the case of an lcore
+variable instance, the thread most recently accessing nearby data
+structures should almost always be the lcore variable's owner. Adding
+padding (e.g., with ``RTE_CACHE_GUARD``) will increase the effective
+memory working set size, potentially reducing performance.
+
+Lcore variable values are initialized to zero by default.
+
+Lcore variables are not stored in huge page memory.
+
+Example
+^^^^^^^
+
+Below is an example of the use of an lcore variable:
+
+.. code-block:: c
+
+    struct foo_lcore_state {
+            int a;
+            long b;
+    };
+    
+    static RTE_LCORE_VAR_HANDLE(struct foo_lcore_state, lcore_states);
+    
+    long foo_get_a_plus_b(void)
+    {
+            const struct foo_lcore_state *state = RTE_LCORE_VAR(lcore_states);
+    
+            return state->a + state->b;
+    }
+    
+    RTE_INIT(rte_foo_init)
+    {
+            RTE_LCORE_VAR_ALLOC(lcore_states);
+    
+            unsigned int lcore_id;
+            struct foo_lcore_state *state;
+            RTE_LCORE_VAR_FOREACH(lcore_id, state, lcore_states) {
+                    /* initialize state */
+            }
+    
+            /* other initialization */
+    }
+
+
+Implementation
+--------------
+
+This section gives an overview of the implementation of lcore
+variables, and some background to its design.
+
+Lcore Variable Buffers
+^^^^^^^^^^^^^^^^^^^^^^
+
+Lcore variable values are kept in a set of ``lcore_var_buffer`` structs.
+
+.. code-block:: c
+
+    struct lcore_var_buffer {
+            char data[RTE_MAX_LCORE_VAR * RTE_MAX_LCORE];
+            struct lcore_var_buffer *prev;
+    };
+
+An lcore var buffer stores at a minimum one, but usually many, lcore
+variables.
+
+The value instances for all lcore ids are stored in the same
+buffer. However, each lcore id has its own slice of the ``data``
+array. Such a slice is ``RTE_MAX_LCORE_VAR`` bytes in size.
+
+In this way, the values associated with a particular lcore id are
+kept spatially close in memory. No padding is required to prevent
+false sharing.
+
+.. code-block:: c
+
+    static struct lcore_var_buffer *current_buffer;
+    
+    /* initialized to trigger buffer allocation on first allocation */
+    static size_t offset = RTE_MAX_LCORE_VAR;
+
+The implementation maintains a current ``lcore_var_buffer`` and
+an ``offset``, where the latter tracks how many bytes of the
+current buffer have been allocated.
+
+The ``offset`` is progressively incremented (by the size of the
+just-allocated lcore variable), as lcore variables are being
+allocated.
+
+If the allocation of a variable would result in an ``offset`` larger
+than ``RTE_MAX_LCORE_VAR`` (i.e., the slice size), the buffer is
+full. In that case, a new buffer is allocated off the heap, and the
+``offset`` is reset.
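+
+Below is a simplified sketch of this allocation logic, using the
+structs introduced above. Details such as alignment handling are
+omitted, so the actual implementation differs:
+
+.. code-block:: c
+
+    static void *
+    lcore_var_alloc(size_t size)
+    {
+            void *handle;
+
+            if (offset + size > RTE_MAX_LCORE_VAR) {
+                    struct lcore_var_buffer *prev = current_buffer;
+
+                    /* Buffer full (or none allocated yet); get a new
+                     * one. calloc() zero-fills, matching the documented
+                     * default initialization of lcore variable values. */
+                    current_buffer = calloc(1, sizeof(struct lcore_var_buffer));
+                    current_buffer->prev = prev;
+
+                    offset = 0;
+            }
+
+            handle = &current_buffer->data[offset];
+
+            offset += size;
+
+            return handle;
+    }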
+
+The lcore var buffers are arranged in a linked list, to allow freeing
+them at the point of ``rte_eal_cleanup()``, thereby avoiding false
+positives from tools like Valgrind memcheck.
+
+The lcore variable buffers are allocated off the regular C heap. There
+are a number of reasons for not using ``<rte_malloc.h>`` and huge
+pages for lcore variables:
+
+- The libc heap is available at any time, including early in the
+  DPDK initialization.
+- The amount of data kept in lcore variables is projected to be small,
+  and thus is unlikely to induce translation lookaside buffer (TLB)
+  misses.
+- The last (and potentially only) lcore buffer in the chain will
+  likely be only partially in use. Huge pages of the sort used by DPDK
+  are always resident in memory, and their use would result in a
+  significant amount of memory going to waste. An example: ~256 kB
+  worth of lcore variables are allocated by DPDK libraries, PMDs and
+  the application. ``RTE_MAX_LCORE_VAR`` is set to 1 MB and
+  ``RTE_MAX_LCORE`` to 128. With 4 kB OS pages, only the first ~64
+  pages of each of the 128 per-lcore id slices in the (only)
+  ``lcore_var_buffer`` will actually be resident (paged in). Here,
+  demand paging saves ~96 MB of memory.
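+
+In other words, the only buffer spans 128 * 1 MB = 128 MB of virtual
+address space, of which merely 128 * 256 kB = 32 MB ends up resident,
+saving roughly 96 MB compared to keeping the whole buffer resident.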
+
+Since they do not reside in huge page memory, lcore variables cannot
+be accessed from secondary processes.
+
+Heap allocation failures are treated as fatal. The reason for this
+unorthodox design is that a majority of the allocations are deemed to
+happen at initialization. An early heap allocation failure for a fixed
+amount of data is a situation not unlike one where there is not enough
+memory available for static variables (i.e., the BSS or data
+sections).
+
+Provided these assumptions hold true, it's deemed acceptable to leave
+the application out of handling memory allocation failures.
+
+The upside of this approach is that no error handling code is required
+on the API user side.
+
+Lcore Variable Handles
+^^^^^^^^^^^^^^^^^^^^^^
+
+Upon lcore variable allocation, the lcore variables API returns an
+opaque *handle* in the form of a pointer. The value of the pointer is
+``buffer->data + offset``.
+
+Translating a handle base pointer to a pointer to a value associated
+with a particular lcore id is straightforward:
+
+.. code-block:: c
+
+    static inline void *
+    rte_lcore_var_lcore(unsigned int lcore_id, void *handle)
+    {
+            return RTE_PTR_ADD(handle, lcore_id * RTE_MAX_LCORE_VAR);
+    }
+
+``RTE_MAX_LCORE_VAR`` is a public macro to allow the compiler to
+optimize the ``lcore_id * RTE_MAX_LCORE_VAR`` expression, and replace
+the multiplication with a less expensive arithmetic operation (e.g.,
+a bitwise shift, in case the value is a power of two).
+
+To maintain type safety, the ``RTE_LCORE_VAR*()`` macros should be
+used, instead of directly invoking ``rte_lcore_var_lcore()``.  The
+macros return a pointer of the same type as the handle (i.e., a
+pointer to the value's type).
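+
+Below is a sketch of how these access macros may be defined; the
+actual definitions in ``<rte_lcore_var.h>`` may differ in details:
+
+.. code-block:: c
+
+    /* Pointer to the value of the handle's lcore variable for a
+     * particular lcore id, cast back to the handle's type. */
+    #define RTE_LCORE_VAR_LCORE(lcore_id, handle) \
+            ((typeof(handle))rte_lcore_var_lcore(lcore_id, handle))
+
+    /* Shorthand for the calling thread's own value. */
+    #define RTE_LCORE_VAR(handle) \
+            RTE_LCORE_VAR_LCORE(rte_lcore_id(), handle)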
+
+Memory Layout
+^^^^^^^^^^^^^
+
+This section describes how lcore variables are organized in memory.
+
+As an illustration, two example modules are used, ``rte_x`` and
+``rte_y``, both maintaining per-lcore id state as a part of their
+implementation.
+
+Two different methods will be used to maintain such state - lcore
+variables and, to serve as a reference, lcore id-indexed static
+arrays.
+
+Certain parameters are scaled down to make graphical depictions more
+practical.
+
+For the purpose of this exercise, a ``RTE_MAX_LCORE`` of 2 is
+assumed. In a real-world configuration the maximum number of EAL
+threads and registered threads will be much greater (e.g., 128).
+
+The lcore variables example assumes a ``RTE_MAX_LCORE_VAR`` of 64. In
+a real-world configuration (as controlled by ``rte_config.h``) the
+value of this compile-time constant will be much greater (e.g.,
+1048576).
+
+The per-lcore id state is also smaller than what most real-world
+modules would have.
+
+Lcore Variables Example
+"""""""""""""""""""""""
+
+When lcore variables are used, the parts of ``rte_x`` and ``rte_y``
+that deal with the declaration and allocation of per-lcore id data may
+look something like below.
+
+.. code-block:: c
+
+    /* -- Lcore variables -- */
+    
+    /* rte_x.c */
+    
+    struct x_lcore
+    {
+        int a;
+        char b;
+    };
+    
+    static RTE_LCORE_VAR_HANDLE(struct x_lcore, x_lcores);
+    RTE_LCORE_VAR_INIT(x_lcores);
+    
+    /../
+    
+    /* rte_y.c */
+    
+    struct y_lcore
+    {
+        long c;
+        long d;
+    };
+    
+    static RTE_LCORE_VAR_HANDLE(struct y_lcore, y_lcores);
+    RTE_LCORE_VAR_INIT(y_lcores);
+
+    /../
+
+The resulting memory layout will look something like the following:
+
+.. _figure_lcore_var_mem_layout:
+
+.. figure:: img/lcore_var_mem_layout.*
+
+The above figure assumes that ``x_lcores`` is allocated prior to
+``y_lcores``. ``RTE_LCORE_VAR_INIT()`` relies on constructors, which
+run prior to ``main()`` in an undefined order.
+
+The use of lcore variables ensures that per-lcore id data is kept in
+close proximity, within a designated region of memory. This proximity
+enhances data locality and can improve performance.
+
+Lcore Id Indexed Static Array Example
+"""""""""""""""""""""""""""""""""""""
+
+Below is an example of the struct and array declarations, and the
+resulting organization in memory, in case an lcore id-indexed static
+array of cache-line aligned, ``RTE_CACHE_GUARD``-ed structs is used to
+maintain per-lcore id state.
+
+This is a common pattern in DPDK, which lcore variables attempt to
+replace.
+
+.. code-block:: c
+
+    /* -- Cache-aligned static arrays -- */
+    
+    /* rte_x.c */
+    
+    struct x_lcore
+    {
+        int a;
+        char b;
+        RTE_CACHE_GUARD;
+    } __rte_cache_aligned;
+    
+    static struct x_lcore x_lcores[RTE_MAX_LCORE];
+
+    /../
+    
+    /* rte_y.c */
+    
+    struct y_lcore
+    {
+        long c;
+        long d;
+        RTE_CACHE_GUARD;
+    } __rte_cache_aligned;
+    
+    static struct y_lcore y_lcores[RTE_MAX_LCORE];
+
+    /../
+
+In this approach, accessing the state for a particular lcore id is
+merely a matter of retrieving the lcore id and looking up the correct
+struct instance.
+
+.. code-block:: c
+
+    struct x_lcore *my_lcore_state = &x_lcores[rte_lcore_id()];
+
+The address "0" at the top of the left-most column in the figure
+represent the base address for the ``x_lcores`` array (in the BSS
+segment in memory).
+
+The figure only includes the memory layout for the ``rte_x`` example
+module. ``rte_y`` would look very similar, with ``y_lcores`` being
+located at some other address in the BSS section.
+
+.. _figure_static_array_mem_layout:
+
+.. figure:: img/static_array_mem_layout.*
+
+The static array approach results in the per-lcore id data being
+organized around modules, not lcore ids. To avoid false sharing,
+extensive use of padding is employed, causing cache fragmentation.
+
+Because the padding is interspersed with the data, and each stretch
+of padding is smaller than a typical operating system memory page
+(usually 4 kB), demand paging is unlikely to reduce the actual
+resident DRAM memory footprint.
+
+Performance
+^^^^^^^^^^^
+
+One of the goals of lcore variables is to improve performance. This is
+achieved by packing often-used data into fewer cache lines, reducing
+fragmentation in CPU caches and thus somewhat improving the effective
+cache size and cache hit rates.
+
+The application-level gains depend on how much data is kept in lcore
+variables, how often it is accessed, and how much pressure the
+application exerts on the CPU caches (i.e., how much other memory it
+accesses).
+
+The ``lcore_var_perf_autotest`` is an attempt at exploring the
+performance benefits (or drawbacks) of lcore variables compared to
+their alternatives. Being a micro benchmark, its results should be
+taken with a grain of salt.
+
+Generally, one shouldn't expect more than some very modest gains in
+performance after a switch from lcore id indexed arrays to lcore
+variables.
+
+An additional benefit of the use of lcore variables is that they avoid
+certain tricky issues related to CPU core hardware prefetching (e.g.,
+next-N-lines prefetching) that may cause false sharing even when data
+used by two cores does not reside on the same cache line. Hardware
+prefetch behavior is generally not publicly documented and varies
+across CPU vendors, CPU generations and BIOS (or similar)
+configurations. For applications aiming to be portable, this may cause
+issues. Often, CPU hardware prefetch-induced issues are non-existent,
+except in some particular circumstances, where their adverse effects
+may be significant.
+
+Alternatives
+------------
+
+Lcore Id Indexed Static Arrays
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Lcore variables are designed to replace a pattern exemplified below:
+
+.. code-block:: c
+
+    struct __rte_cache_aligned foo_lcore_state {
+            int a;
+            long b;
+            RTE_CACHE_GUARD;
+    };
+    
+    static struct foo_lcore_state lcore_states[RTE_MAX_LCORE];
+
+This scheme is simple and effective, but has one drawback: the data is
+organized so that objects related to all lcores for a particular
+module are kept close in memory. At a bare minimum, this requires
+sizing data structures (e.g., using ``__rte_cache_aligned``) to a
+whole number of cache lines and ensuring that allocations of such
+objects are cache line aligned to avoid false sharing. With CPU
+hardware prefetching and memory loads resulting from speculative
+execution (functions which seemingly are getting more eager faster
+than they are getting more intelligent), one or more "guard" cache
+lines may be required to separate one lcore's data from another's and
+prevent false sharing.
+
+Lcore variables offer the advantage of working with, rather than
+against, the CPU's assumptions. A next-line hardware prefetcher,
+for example, may function as intended (i.e., to the benefit, not
+detriment, of system performance).
+
+Thread Local Storage
+^^^^^^^^^^^^^^^^^^^^
+
+An alternative to ``rte_lcore_var.h`` is the ``rte_per_lcore.h`` API,
+which makes use of thread-local storage (TLS, e.g., GCC ``__thread`` or
+C11 ``_Thread_local``).
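+
+As a point of comparison, a TLS-based counterpart of the earlier
+``foo_lcore_state`` example may look like the below sketch (the names
+are reused for illustration only):
+
+.. code-block:: c
+
+    #include <rte_per_lcore.h>
+
+    struct foo_lcore_state {
+            int a;
+            long b;
+    };
+
+    static RTE_DEFINE_PER_LCORE(struct foo_lcore_state, foo_state);
+
+    long foo_get_a_plus_b(void)
+    {
+            struct foo_lcore_state *state = &RTE_PER_LCORE(foo_state);
+
+            return state->a + state->b;
+    }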
+
+There are a number of differences between using TLS and using lcore
+variables.
+
+The lifecycle of a thread-local variable instance is tied to that of
+the thread. The data cannot be accessed before the thread has been
+created, nor after it has terminated. As a result, thread-local
+variables must be initialized in a "lazy" manner (e.g., at the point
+of thread creation). Lcore variables may be accessed immediately after
+having been allocated (which may occur before any thread beyond the
+main thread is running).
+
+A thread-local variable is duplicated across all threads in the
+process, including unregistered non-EAL threads (i.e., "regular"
+threads). For DPDK applications heavily relying on multi-threading (in
+conjunction with DPDK's "one thread per core" pattern), either by having
+many concurrent threads or creating/destroying threads at a high rate,
+an excessive use of thread-local variables may cause inefficiencies
+(e.g., increased thread creation overhead due to thread-local storage
+initialization or increased memory footprint). Lcore variables *only*
+exist for threads with an lcore id.
+
+Whether data in thread-local storage can be shared between threads
+(i.e., whether a pointer to a thread-local variable can be passed to
+and successfully dereferenced by a non-owning thread) depends on the
+specifics of the TLS implementation. With GCC ``__thread`` and
+``_Thread_local`` as implemented by GCC, data sharing between threads
+is supported. In the C11 standard, accessing another thread's
+``_Thread_local`` object is implementation-defined. Lcore variable
+instances may be accessed reliably by any thread.
+
+Lcore variables also rely on TLS to retrieve the thread's
+lcore id. However, the rest of the per-thread data is not kept in TLS.
+
+From a memory layout perspective, TLS is similar to lcore variables,
+and thus per-thread data structures need not be padded.
+
+In case the above-mentioned drawbacks of the use of TLS are of no
+significance to a particular application, TLS is a good alternative to
+lcore variables.
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v16 5/8] random: keep PRNG state in lcore variable
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
                                                                                                     ` (3 preceding siblings ...)
  2024-10-23  7:52                                                                                   ` [PATCH v16 4/8] eal: add lcore variables' programmer's guide Mattias Rönnblom
@ 2024-10-23  7:52                                                                                   ` Mattias Rönnblom
  2024-10-23  7:53                                                                                   ` [PATCH v16 6/8] power: keep per-lcore " Mattias Rönnblom
                                                                                                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-23  7:52 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Thomas Monjalon, Mattias Rönnblom, Konstantin Ananyev,
	Chengwen Feng

Replace keeping PRNG state in a RTE_MAX_LCORE-sized static array of
cache-aligned and RTE_CACHE_GUARDed struct instances with keeping the
same state in a more cache-friendly lcore variable.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

RFC v3:
 * Remove cache alignment on unregistered threads' rte_rand_state.
   (Morten Brørup)
---
 lib/eal/common/rte_random.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/lib/eal/common/rte_random.c b/lib/eal/common/rte_random.c
index 90e91b3c4f..cf0756f26a 100644
--- a/lib/eal/common/rte_random.c
+++ b/lib/eal/common/rte_random.c
@@ -11,6 +11,7 @@
 #include <rte_branch_prediction.h>
 #include <rte_cycles.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_random.h>
 
 struct __rte_cache_aligned rte_rand_state {
@@ -19,14 +20,12 @@ struct __rte_cache_aligned rte_rand_state {
 	uint64_t z3;
 	uint64_t z4;
 	uint64_t z5;
-	RTE_CACHE_GUARD;
 };
 
-/* One instance each for every lcore id-equipped thread, and one
- * additional instance to be shared by all others threads (i.e., all
- * unregistered non-EAL threads).
- */
-static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
+RTE_LCORE_VAR_HANDLE(struct rte_rand_state, rand_state);
+
+/* instance to be shared by all unregistered non-EAL threads */
+static struct rte_rand_state unregistered_rand_state;
 
 static uint32_t
 __rte_rand_lcg32(uint32_t *seed)
@@ -85,8 +84,14 @@ rte_srand(uint64_t seed)
 	unsigned int lcore_id;
 
 	/* add lcore_id to seed to avoid having the same sequence */
-	for (lcore_id = 0; lcore_id < RTE_DIM(rand_states); lcore_id++)
-		__rte_srand_lfsr258(seed + lcore_id, &rand_states[lcore_id]);
+	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+		struct rte_rand_state *lcore_state =
+			RTE_LCORE_VAR_LCORE(lcore_id, rand_state);
+
+		__rte_srand_lfsr258(seed + lcore_id, lcore_state);
+	}
+
+	__rte_srand_lfsr258(seed + lcore_id, &unregistered_rand_state);
 }
 
 static __rte_always_inline uint64_t
@@ -124,11 +129,10 @@ struct rte_rand_state *__rte_rand_get_state(void)
 
 	idx = rte_lcore_id();
 
-	/* last instance reserved for unregistered non-EAL threads */
 	if (unlikely(idx == LCORE_ID_ANY))
-		idx = RTE_MAX_LCORE;
+		return &unregistered_rand_state;
 
-	return &rand_states[idx];
+	return RTE_LCORE_VAR(rand_state);
 }
 
 uint64_t
@@ -228,6 +232,8 @@ RTE_INIT(rte_rand_init)
 {
 	uint64_t seed;
 
+	RTE_LCORE_VAR_ALLOC(rand_state);
+
 	seed = __rte_random_initial_seed();
 
 	rte_srand(seed);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v16 6/8] power: keep per-lcore state in lcore variable
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
                                                                                                     ` (4 preceding siblings ...)
  2024-10-23  7:52                                                                                   ` [PATCH v16 5/8] random: keep PRNG state in lcore variable Mattias Rönnblom
@ 2024-10-23  7:53                                                                                   ` Mattias Rönnblom
  2024-10-23  7:53                                                                                   ` [PATCH v16 7/8] service: " Mattias Rönnblom
  2024-10-23  7:53                                                                                   ` [PATCH v16 8/8] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-23  7:53 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Thomas Monjalon, Mattias Rönnblom, Konstantin Ananyev,
	Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v6:
 * Update FOREACH invocation to match new API.

RFC v3:
 * Replace for loop with FOREACH macro.
---
 lib/power/rte_power_pmd_mgmt.c | 35 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index 5e50613f5b..a2fff3b765 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -5,6 +5,7 @@
 #include <stdlib.h>
 
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_cycles.h>
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
@@ -69,7 +70,7 @@ struct __rte_cache_aligned pmd_core_cfg {
 	uint64_t sleep_target;
 	/**< Prevent a queue from triggering sleep multiple times */
 };
-static struct pmd_core_cfg lcore_cfgs[RTE_MAX_LCORE];
+static RTE_LCORE_VAR_HANDLE(struct pmd_core_cfg, lcore_cfgs);
 
 static inline bool
 queue_equal(const union queue *l, const union queue *r)
@@ -252,12 +253,11 @@ clb_multiwait(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 
 	/* early exit */
 	if (likely(!empty))
@@ -317,13 +317,12 @@ clb_pause(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	struct queue_list_entry *queue_conf = arg;
 	struct pmd_core_cfg *lcore_conf;
 	const bool empty = nb_rx == 0;
 	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
 
-	lcore_conf = &lcore_cfgs[lcore];
+	lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 
 	if (likely(!empty))
 		/* early exit */
@@ -358,9 +357,8 @@ clb_scale_freq(uint16_t port_id __rte_unused, uint16_t qidx __rte_unused,
 		struct rte_mbuf **pkts __rte_unused, uint16_t nb_rx,
 		uint16_t max_pkts __rte_unused, void *arg)
 {
-	const unsigned int lcore = rte_lcore_id();
 	const bool empty = nb_rx == 0;
-	struct pmd_core_cfg *lcore_conf = &lcore_cfgs[lcore];
+	struct pmd_core_cfg *lcore_conf = RTE_LCORE_VAR(lcore_cfgs);
 	struct queue_list_entry *queue_conf = arg;
 
 	if (likely(!empty)) {
@@ -519,7 +517,7 @@ rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		goto end;
 	}
 
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -620,7 +618,7 @@ rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 	}
 
 	/* no need to check queue id as wrong queue id would not be enabled */
-	lcore_cfg = &lcore_cfgs[lcore_id];
+	lcore_cfg = RTE_LCORE_VAR_LCORE(lcore_id, lcore_cfgs);
 
 	/* check if other queues are stopped as well */
 	ret = cfg_queues_stopped(lcore_cfg);
@@ -770,21 +768,22 @@ rte_power_pmd_mgmt_get_scaling_freq_max(unsigned int lcore)
 }
 
 RTE_INIT(rte_power_ethdev_pmgmt_init) {
-	size_t i;
-	int j;
+	unsigned int lcore_id;
+	struct pmd_core_cfg *lcore_cfg;
+	int i;
+
+	RTE_LCORE_VAR_ALLOC(lcore_cfgs);
 
 	/* initialize all tailqs */
-	for (i = 0; i < RTE_DIM(lcore_cfgs); i++) {
-		struct pmd_core_cfg *cfg = &lcore_cfgs[i];
-		TAILQ_INIT(&cfg->head);
-	}
+	RTE_LCORE_VAR_FOREACH(lcore_id, lcore_cfg, lcore_cfgs)
+		TAILQ_INIT(&lcore_cfg->head);
 
 	/* initialize config defaults */
 	emptypoll_max = 512;
 	pause_duration = 1;
 	/* scaling defaults out of range to ensure not used unless set by user or app */
-	for (j = 0; j < RTE_MAX_LCORE; j++) {
-		scale_freq_min[j] = 0;
-		scale_freq_max[j] = UINT32_MAX;
+	for (i = 0; i < RTE_MAX_LCORE; i++) {
+		scale_freq_min[i] = 0;
+		scale_freq_max[i] = UINT32_MAX;
 	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v16 7/8] service: keep per-lcore state in lcore variable
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
                                                                                                     ` (5 preceding siblings ...)
  2024-10-23  7:53                                                                                   ` [PATCH v16 6/8] power: keep per-lcore " Mattias Rönnblom
@ 2024-10-23  7:53                                                                                   ` Mattias Rönnblom
  2024-10-23  7:53                                                                                   ` [PATCH v16 8/8] eal: keep per-lcore power intrinsics " Mattias Rönnblom
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-23  7:53 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Thomas Monjalon, Mattias Rönnblom, Konstantin Ananyev,
	Chengwen Feng

Replace static array of cache-aligned structs with an lcore variable,
to slightly benefit code simplicity and performance.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>

--

PATCH v14:
 * Merge with bitset-related changes.

PATCH v7:
 * Update to match new FOREACH API.

RFC v6:
 * Remove a now-redundant lcore variable value memset().

RFC v5:
 * Fix lcore value pointer bug introduced by RFC v4.

RFC v4:
 * Remove strange-looking lcore value lookup potentially containing
   invalid lcore id. (Morten Brørup)
 * Replace misplaced tab with space. (Morten Brørup)
---
 lib/eal/common/rte_service.c | 116 ++++++++++++++++++++---------------
 1 file changed, 65 insertions(+), 51 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 324471e897..dad3150df9 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -11,6 +11,7 @@
 
 #include <eal_trace_internal.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_bitset.h>
 #include <rte_branch_prediction.h>
 #include <rte_common.h>
@@ -78,7 +79,7 @@ struct __rte_cache_aligned core_state {
 
 static uint32_t rte_service_count;
 static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static RTE_LCORE_VAR_HANDLE(struct core_state, lcore_states);
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -99,12 +100,8 @@ rte_service_init(void)
 		goto fail_mem;
 	}
 
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
+	if (lcore_states == NULL)
+		RTE_LCORE_VAR_ALLOC(lcore_states);
 
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
@@ -120,7 +117,6 @@ rte_service_init(void)
 	return 0;
 fail_mem:
 	rte_free(rte_services);
-	rte_free(lcore_states);
 	return -ENOMEM;
 }
 
@@ -134,7 +130,6 @@ rte_service_finalize(void)
 	rte_eal_mp_wait_lcore();
 
 	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
@@ -284,7 +279,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 int32_t
 rte_service_component_unregister(uint32_t id)
 {
-	uint32_t i;
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
@@ -292,9 +286,11 @@ rte_service_component_unregister(uint32_t id)
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
+	unsigned int lcore_id;
+	struct core_state *cs;
 	/* clear the run-bit in all cores */
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		rte_bitset_clear(lcore_states[i].mapped_services, id);
+	RTE_LCORE_VAR_FOREACH(lcore_id, cs, lcore_states)
+		rte_bitset_clear(cs->mapped_services, id);
 
 	memset(&rte_services[id], 0, sizeof(struct rte_service_spec_impl));
 
@@ -463,7 +459,10 @@ rte_service_may_be_active(uint32_t id)
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
-		if (rte_bitset_test(lcore_states[ids[i]].service_active_on_lcore, id))
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(ids[i], lcore_states);
+
+		if (rte_bitset_test(cs->service_active_on_lcore, id))
 			return 1;
 	}
 
@@ -473,7 +472,7 @@ rte_service_may_be_active(uint32_t id)
 int32_t
 rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 {
-	struct core_state *cs = &lcore_states[rte_lcore_id()];
+	struct core_state *cs =	RTE_LCORE_VAR(lcore_states);
 	struct rte_service_spec_impl *s;
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
@@ -496,8 +495,7 @@ static int32_t
 service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
-	const int lcore = rte_lcore_id();
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR(lcore_states);
 
 	rte_atomic_store_explicit(&cs->thread_active, 1, rte_memory_order_seq_cst);
 
@@ -533,13 +531,15 @@ service_runner_func(void *arg)
 int32_t
 rte_service_lcore_may_be_active(uint32_t lcore)
 {
-	if (lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	struct core_state *cs = RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
+	if (lcore >= RTE_MAX_LCORE || !cs->is_service_core)
 		return -EINVAL;
 
 	/* Load thread_active using ACQUIRE to avoid instructions dependent on
 	 * the result being re-ordered before this load completes.
 	 */
-	return rte_atomic_load_explicit(&lcore_states[lcore].thread_active,
+	return rte_atomic_load_explicit(&cs->thread_active,
 			       rte_memory_order_acquire);
 }
 
@@ -547,9 +547,12 @@ int32_t
 rte_service_lcore_count(void)
 {
 	int32_t count = 0;
-	uint32_t i;
-	for (i = 0; i < RTE_MAX_LCORE; i++)
-		count += lcore_states[i].is_service_core;
+
+	unsigned int lcore_id;
+	struct core_state *cs;
+	RTE_LCORE_VAR_FOREACH(lcore_id, cs, lcore_states)
+		count += cs->is_service_core;
+
 	return count;
 }
 
@@ -566,7 +569,8 @@ rte_service_lcore_list(uint32_t array[], uint32_t n)
 	uint32_t i;
 	uint32_t idx = 0;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		struct core_state *cs = &lcore_states[i];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(i, lcore_states);
 		if (cs->is_service_core) {
 			array[idx] = i;
 			idx++;
@@ -582,7 +586,7 @@ rte_service_lcore_count_services(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs = RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -634,28 +638,30 @@ rte_service_start_with_defaults(void)
 static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
 	/* validate ID, or return error value */
 	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
-			!lcore_states[lcore].is_service_core)
+			!cs->is_service_core)
 		return -EINVAL;
 
 	if (set) {
-		uint64_t lcore_mapped = rte_bitset_test(lcore_states[lcore].mapped_services, sid);
+		bool lcore_mapped = rte_bitset_test(cs->mapped_services, sid);
 
 		if (*set && !lcore_mapped) {
-			rte_bitset_set(lcore_states[lcore].mapped_services, sid);
+			rte_bitset_set(cs->mapped_services, sid);
 			rte_atomic_fetch_add_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 		if (!*set && lcore_mapped) {
-			rte_bitset_clear(lcore_states[lcore].mapped_services, sid);
+			rte_bitset_clear(cs->mapped_services, sid);
 			rte_atomic_fetch_sub_explicit(&rte_services[sid].num_mapped_cores,
 				1, rte_memory_order_relaxed);
 		}
 	}
 
 	if (enabled)
-		*enabled = rte_bitset_test(lcore_states[lcore].mapped_services, sid);
+		*enabled = rte_bitset_test(cs->mapped_services, sid);
 
 	return 0;
 }
@@ -683,13 +689,14 @@ set_lcore_state(uint32_t lcore, int32_t state)
 {
 	/* mark core state in hugepage backed config */
 	struct rte_config *cfg = rte_eal_get_configuration();
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	cfg->lcore_role[lcore] = state;
 
 	/* mark state in process local lcore_config */
 	lcore_config[lcore].core_role = state;
 
 	/* update per-lcore optimized state tracking */
-	lcore_states[lcore].is_service_core = (state == ROLE_SERVICE);
+	cs->is_service_core = (state == ROLE_SERVICE);
 
 	rte_eal_trace_service_lcore_state_change(lcore, state);
 }
@@ -700,14 +707,16 @@ rte_service_lcore_reset_all(void)
 	/* loop over cores, reset all mapped services */
 	uint32_t i;
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
-		if (lcore_states[i].is_service_core) {
-			rte_bitset_clear_all(lcore_states[i].mapped_services, RTE_SERVICE_NUM_MAX);
+		struct core_state *cs =	RTE_LCORE_VAR_LCORE(i, lcore_states);
+
+		if (cs->is_service_core) {
+			rte_bitset_clear_all(cs->mapped_services, RTE_SERVICE_NUM_MAX);
 			set_lcore_state(i, ROLE_RTE);
 			/* runstate act as guard variable Use
 			 * store-release memory order here to synchronize
 			 * with load-acquire in runstate read functions.
 			 */
-			rte_atomic_store_explicit(&lcore_states[i].runstate,
+			rte_atomic_store_explicit(&cs->runstate,
 				RUNSTATE_STOPPED, rte_memory_order_release);
 		}
 	}
@@ -723,17 +732,19 @@ rte_service_lcore_add(uint32_t lcore)
 {
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
-	if (lcore_states[lcore].is_service_core)
+
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+	if (cs->is_service_core)
 		return -EALREADY;
 
 	set_lcore_state(lcore, ROLE_SERVICE);
 
 	/* ensure that after adding a core the mask and state are defaults */
-	rte_bitset_clear_all(lcore_states[lcore].mapped_services, RTE_SERVICE_NUM_MAX);
+	rte_bitset_clear_all(cs->mapped_services, RTE_SERVICE_NUM_MAX);
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	return rte_eal_wait_lcore(lcore);
@@ -745,7 +756,7 @@ rte_service_lcore_del(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -769,7 +780,7 @@ rte_service_lcore_start(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 	if (!cs->is_service_core)
 		return -EINVAL;
 
@@ -799,6 +810,8 @@ rte_service_lcore_start(uint32_t lcore)
 int32_t
 rte_service_lcore_stop(uint32_t lcore)
 {
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
@@ -806,12 +819,11 @@ rte_service_lcore_stop(uint32_t lcore)
 	 * memory order here to synchronize with store-release
 	 * in runstate update functions.
 	 */
-	if (rte_atomic_load_explicit(&lcore_states[lcore].runstate, rte_memory_order_acquire) ==
+	if (rte_atomic_load_explicit(&cs->runstate, rte_memory_order_acquire) ==
 			RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
 		bool enabled = rte_bitset_test(cs->mapped_services, i);
@@ -831,7 +843,7 @@ rte_service_lcore_stop(uint32_t lcore)
 	/* Use store-release memory order here to synchronize with
 	 * load-acquire in runstate read functions.
 	 */
-	rte_atomic_store_explicit(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+	rte_atomic_store_explicit(&cs->runstate, RUNSTATE_STOPPED,
 		rte_memory_order_release);
 
 	rte_eal_trace_service_lcore_stop(lcore);
@@ -842,7 +854,7 @@ rte_service_lcore_stop(uint32_t lcore)
 static uint64_t
 lcore_attr_get_loops(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->loops, rte_memory_order_relaxed);
 }
@@ -850,7 +862,7 @@ lcore_attr_get_loops(unsigned int lcore)
 static uint64_t
 lcore_attr_get_cycles(unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->cycles, rte_memory_order_relaxed);
 }
@@ -858,7 +870,7 @@ lcore_attr_get_cycles(unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].calls,
 		rte_memory_order_relaxed);
@@ -885,7 +897,7 @@ lcore_attr_get_service_error_calls(uint32_t service_id, unsigned int lcore)
 static uint64_t
 lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
 {
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	return rte_atomic_load_explicit(&cs->service_stats[service_id].cycles,
 		rte_memory_order_relaxed);
@@ -901,7 +913,10 @@ attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
 	uint64_t sum = 0;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		if (lcore_states[lcore].is_service_core)
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(lcore, lcore_states);
+
+		if (cs->is_service_core)
 			sum += lcore_attr_get(id, lcore);
 	}
 
@@ -963,12 +978,11 @@ int32_t
 rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 			   uint64_t *attr_value)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE || !attr_value)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -993,7 +1007,8 @@ rte_service_attr_reset_all(uint32_t id)
 		return -EINVAL;
 
 	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
-		struct core_state *cs = &lcore_states[lcore];
+		struct core_state *cs =
+			RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 		cs->service_stats[id] = (struct service_stats) {};
 	}
@@ -1004,12 +1019,11 @@ rte_service_attr_reset_all(uint32_t id)
 int32_t
 rte_service_lcore_attr_reset_all(uint32_t lcore)
 {
-	struct core_state *cs;
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	cs = &lcore_states[lcore];
 	if (!cs->is_service_core)
 		return -ENOTSUP;
 
@@ -1044,7 +1058,7 @@ static void
 service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 {
 	uint32_t i;
-	struct core_state *cs = &lcore_states[lcore];
+	struct core_state *cs =	RTE_LCORE_VAR_LCORE(lcore, lcore_states);
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

* [PATCH v16 8/8] eal: keep per-lcore power intrinsics state in lcore variable
  2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
                                                                                                     ` (6 preceding siblings ...)
  2024-10-23  7:53                                                                                   ` [PATCH v16 7/8] service: " Mattias Rönnblom
@ 2024-10-23  7:53                                                                                   ` Mattias Rönnblom
  7 siblings, 0 replies; 313+ messages in thread
From: Mattias Rönnblom @ 2024-10-23  7:53 UTC (permalink / raw)
  To: dev
  Cc: hofors, Morten Brørup, Stephen Hemminger,
	Konstantin Ananyev, David Marchand, Jerin Jacob, Luka Jankovic,
	Thomas Monjalon, Mattias Rönnblom, Konstantin Ananyev,
	Chengwen Feng

Keep per-lcore power intrinsics state in a lcore variable to reduce
cache working set size and avoid any CPU next-line-prefetching causing
false sharing.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/eal/x86/rte_power_intrinsics.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
index 6d9b64240c..98a2cbc611 100644
--- a/lib/eal/x86/rte_power_intrinsics.c
+++ b/lib/eal/x86/rte_power_intrinsics.c
@@ -6,6 +6,7 @@
 
 #include <rte_common.h>
 #include <rte_lcore.h>
+#include <rte_lcore_var.h>
 #include <rte_rtm.h>
 #include <rte_spinlock.h>
 
@@ -14,10 +15,14 @@
 /*
  * Per-lcore structure holding current status of C0.2 sleeps.
  */
-static alignas(RTE_CACHE_LINE_SIZE) struct power_wait_status {
+struct power_wait_status {
 	rte_spinlock_t lock;
 	volatile void *monitor_addr; /**< NULL if not currently sleeping */
-} wait_status[RTE_MAX_LCORE];
+};
+
+RTE_LCORE_VAR_HANDLE(struct power_wait_status, wait_status);
+
+RTE_LCORE_VAR_INIT(wait_status);
 
 /*
  * This function uses UMONITOR/UMWAIT instructions and will enter C0.2 state.
@@ -172,7 +177,7 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 	if (pmc->fn == NULL)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE(lcore_id, wait_status);
 
 	/* update sleep address */
 	rte_spinlock_lock(&s->lock);
@@ -264,7 +269,7 @@ rte_power_monitor_wakeup(const unsigned int lcore_id)
 	if (lcore_id >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	s = &wait_status[lcore_id];
+	s = RTE_LCORE_VAR_LCORE(lcore_id, wait_status);
 
 	/*
 	 * There is a race condition between sleep, wakeup and locking, but we
@@ -303,8 +308,8 @@ int
 rte_power_monitor_multi(const struct rte_power_monitor_cond pmc[],
 		const uint32_t num, const uint64_t tsc_timestamp)
 {
-	const unsigned int lcore_id = rte_lcore_id();
-	struct power_wait_status *s = &wait_status[lcore_id];
+	struct power_wait_status *s = RTE_LCORE_VAR(wait_status);
+
 	uint32_t i, rc;
 
 	/* check if supported */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 313+ messages in thread

2024-09-16 14:02                                       ` Konstantin Ananyev
2024-09-16 17:39                                         ` Morten Brørup
2024-09-16 23:19                                           ` Konstantin Ananyev
2024-09-17  7:12                                             ` Morten Brørup
2024-09-17  8:09                                               ` Konstantin Ananyev
2024-09-17 14:28                                         ` Mattias Rönnblom
2024-09-17 16:11                                           ` Konstantin Ananyev
2024-09-18  7:00                                             ` Mattias Rönnblom
2024-09-17 16:29                                           ` Konstantin Ananyev
2024-09-18  7:50                                             ` Mattias Rönnblom
2024-09-17 14:32                                       ` [PATCH v5 0/7] Lcore variables Mattias Rönnblom
2024-09-17 14:32                                         ` [PATCH v5 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-09-18  8:00                                           ` [PATCH v6 0/7] Lcore variables Mattias Rönnblom
2024-09-18  8:00                                             ` [PATCH v6 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-09-18  8:24                                               ` Konstantin Ananyev
2024-09-18  8:25                                                 ` Mattias Rönnblom
2024-09-18  8:26                                               ` [PATCH v7 0/7] Lcore variables Mattias Rönnblom
2024-09-18  8:26                                                 ` [PATCH v7 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-09-18  9:23                                                   ` Konstantin Ananyev
2024-10-09 22:15                                                   ` Morten Brørup
2024-10-10 10:40                                                     ` Mattias Rönnblom
2024-10-10 11:47                                                       ` Morten Brørup
2024-10-10 13:12                                                         ` Morten Brørup
2024-10-10 13:42                                                           ` Mattias Rönnblom
2024-10-10 13:40                                                         ` Mattias Rönnblom
2024-10-10 13:45                                                           ` Morten Brørup
2024-10-10 16:21                                                             ` Mattias Rönnblom
2024-10-10 14:13                                                   ` [PATCH v8 0/7] Lcore variables Mattias Rönnblom
2024-10-10 14:13                                                     ` [PATCH v8 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-10-10 14:21                                                       ` [PATCH v9 0/7] Lcore variables Mattias Rönnblom
2024-10-10 14:21                                                         ` [PATCH v9 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-10-10 15:54                                                           ` Stephen Hemminger
2024-10-10 16:17                                                             ` Mattias Rönnblom
2024-10-10 16:31                                                               ` Stephen Hemminger
2024-10-10 21:24                                                           ` Thomas Monjalon
2024-10-11  8:04                                                             ` Mattias Rönnblom
2024-10-11  8:46                                                               ` Morten Brørup
2024-10-11  9:11                                                               ` Thomas Monjalon
2024-10-14  6:51                                                               ` Mattias Rönnblom
2024-10-14 15:19                                                                 ` Stephen Hemminger
2024-10-16  8:05                                                                   ` Thomas Monjalon
2024-10-11  8:09                                                             ` Morten Brørup
2024-10-11  8:42                                                               ` Thomas Monjalon
2024-10-11  8:18                                                           ` [PATCH v10 0/7] Lcore variables Mattias Rönnblom
2024-10-11  8:18                                                             ` [PATCH v10 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-10-14  7:43                                                               ` [PATCH v11 0/7] Lcore variables Mattias Rönnblom
2024-10-14  7:43                                                                 ` [PATCH v11 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-10-14  8:17                                                                   ` Morten Brørup
2024-10-15  6:41                                                                     ` Mattias Rönnblom
2024-10-15  7:10                                                                       ` Mattias Rönnblom
2024-10-15  7:39                                                                         ` Morten Brørup
2024-10-15  9:09                                                                           ` Mattias Rönnblom
2024-10-16  8:10                                                                         ` Thomas Monjalon
2024-10-15  8:19                                                                       ` Morten Brørup
2024-10-15  6:54                                                                   ` [PATCH v12 0/7] Lcore variables Mattias Rönnblom
2024-10-15  6:54                                                                     ` [PATCH v12 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-10-15  9:33                                                                       ` [PATCH v13 0/7] Lcore variables Mattias Rönnblom
2024-10-15  9:33                                                                         ` [PATCH v13 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-10-15 10:13                                                                           ` Morten Brørup
2024-10-15 19:02                                                                             ` Mattias Rönnblom
2024-10-15 20:19                                                                               ` Morten Brørup
2024-10-15 22:33                                                                           ` Stephen Hemminger
2024-10-16  4:13                                                                             ` Mattias Rönnblom
2024-10-16  8:17                                                                               ` Thomas Monjalon
2024-10-16 12:47                                                                                 ` Mattias Rönnblom
2024-10-15 22:35                                                                           ` Stephen Hemminger
2024-10-16  4:23                                                                             ` Mattias Rönnblom
2024-10-16 13:19                                                                           ` [PATCH v14 0/7] Lcore variables Mattias Rönnblom
2024-10-16 13:19                                                                             ` [PATCH v14 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-10-16 14:53                                                                               ` Stephen Hemminger
2024-10-17  5:38                                                                                 ` Mattias Rönnblom
2024-10-17  5:57                                                                               ` [PATCH v15 0/7] Lcore variables Mattias Rönnblom
2024-10-17  5:57                                                                                 ` [PATCH v15 1/7] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-10-17  5:57                                                                                 ` [PATCH v15 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-10-17  5:57                                                                                 ` [PATCH v15 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-10-17  5:57                                                                                 ` [PATCH v15 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-10-17  5:57                                                                                 ` [PATCH v15 5/7] power: keep per-lcore " Mattias Rönnblom
2024-10-17  5:57                                                                                 ` [PATCH v15 6/7] service: " Mattias Rönnblom
2024-10-17  5:57                                                                                 ` [PATCH v15 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-10-18 15:37                                                                                 ` [PATCH v15 0/7] Lcore variables Thomas Monjalon
2024-10-19  4:24                                                                                   ` Mattias Rönnblom
2024-10-21  9:16                                                                                     ` Thomas Monjalon
2024-10-23  7:52                                                                                 ` [PATCH v16 0/8] " Mattias Rönnblom
2024-10-23  7:52                                                                                   ` [PATCH v16 1/8] eal: add static per-lcore memory allocation facility Mattias Rönnblom
2024-10-23  7:52                                                                                   ` [PATCH v16 2/8] eal: add lcore variable functional tests Mattias Rönnblom
2024-10-23  7:52                                                                                   ` [PATCH v16 3/8] eal: add lcore variable performance test Mattias Rönnblom
2024-10-23  7:52                                                                                   ` [PATCH v16 4/8] eal: add lcore variables' programmer's guide Mattias Rönnblom
2024-10-23  7:52                                                                                   ` [PATCH v16 5/8] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-10-23  7:53                                                                                   ` [PATCH v16 6/8] power: keep per-lcore " Mattias Rönnblom
2024-10-23  7:53                                                                                   ` [PATCH v16 7/8] service: " Mattias Rönnblom
2024-10-23  7:53                                                                                   ` [PATCH v16 8/8] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-10-16 13:19                                                                             ` [PATCH v14 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-10-16 13:19                                                                             ` [PATCH v14 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-10-16 13:19                                                                             ` [PATCH v14 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-10-16 13:19                                                                             ` [PATCH v14 5/7] power: keep per-lcore " Mattias Rönnblom
2024-10-16 13:19                                                                             ` [PATCH v14 6/7] service: " Mattias Rönnblom
2024-10-16 13:19                                                                             ` [PATCH v14 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-10-16 14:58                                                                             ` [PATCH v14 0/7] Lcore variables Stephen Hemminger
2024-10-17  5:40                                                                               ` Mattias Rönnblom
2024-10-15  9:33                                                                         ` [PATCH v13 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-10-15  9:33                                                                         ` [PATCH v13 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-10-15  9:33                                                                         ` [PATCH v13 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-10-15  9:33                                                                         ` [PATCH v13 5/7] power: keep per-lcore " Mattias Rönnblom
2024-10-15  9:33                                                                         ` [PATCH v13 6/7] service: " Mattias Rönnblom
2024-10-15  9:33                                                                         ` [PATCH v13 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-10-15  6:55                                                                     ` [PATCH v12 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-10-15  6:55                                                                     ` [PATCH v12 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-10-15  6:55                                                                     ` [PATCH v12 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-10-15  6:55                                                                     ` [PATCH v12 5/7] power: keep per-lcore " Mattias Rönnblom
2024-10-15  6:55                                                                     ` [PATCH v12 6/7] service: " Mattias Rönnblom
2024-10-15  6:55                                                                     ` [PATCH v12 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-10-14  7:43                                                                 ` [PATCH v11 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-10-14  7:43                                                                 ` [PATCH v11 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-10-14  7:43                                                                 ` [PATCH v11 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-10-14  7:43                                                                 ` [PATCH v11 5/7] power: keep per-lcore " Mattias Rönnblom
2024-10-14  7:43                                                                 ` [PATCH v11 6/7] service: " Mattias Rönnblom
2024-10-14  7:43                                                                 ` [PATCH v11 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-10-14 16:30                                                                   ` Stephen Hemminger
2024-10-15  6:48                                                                     ` Mattias Rönnblom
2024-10-11  8:18                                                             ` [PATCH v10 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-10-11  8:18                                                             ` [PATCH v10 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-10-11  8:18                                                             ` [PATCH v10 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-10-11  8:18                                                             ` [PATCH v10 5/7] power: keep per-lcore " Mattias Rönnblom
2024-10-11  8:19                                                             ` [PATCH v10 6/7] service: " Mattias Rönnblom
2024-10-11  8:19                                                             ` [PATCH v10 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-10-11 14:25                                                             ` [PATCH v10 0/7] Lcore variables Stephen Hemminger
2024-10-13  7:02                                                               ` Mattias Rönnblom
2024-10-16  8:07                                                                 ` Thomas Monjalon
2024-10-10 14:22                                                         ` [PATCH v9 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-10-10 14:22                                                         ` [PATCH v9 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-10-10 14:22                                                         ` [PATCH v9 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-10-10 14:22                                                         ` [PATCH v9 5/7] power: keep per-lcore " Mattias Rönnblom
2024-10-10 14:22                                                         ` [PATCH v9 6/7] service: " Mattias Rönnblom
2024-10-10 14:22                                                         ` [PATCH v9 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-10-10 14:13                                                     ` [PATCH v8 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-10-10 14:13                                                     ` [PATCH v8 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-10-10 14:13                                                     ` [PATCH v8 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-10-10 14:13                                                     ` [PATCH v8 5/7] power: keep per-lcore " Mattias Rönnblom
2024-10-10 14:13                                                     ` [PATCH v8 6/7] service: " Mattias Rönnblom
2024-10-10 14:13                                                     ` [PATCH v8 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-10-11 14:23                                                     ` [PATCH v8 0/7] Lcore variables Stephen Hemminger
2024-10-13  7:04                                                       ` Mattias Rönnblom
2024-09-18  8:26                                                 ` [PATCH v7 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-09-18  8:26                                                 ` [PATCH v7 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-10-09 20:46                                                   ` Morten Brørup
2024-10-10 14:17                                                     ` Mattias Rönnblom
2024-09-18  8:26                                                 ` [PATCH v7 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-09-18  8:26                                                 ` [PATCH v7 5/7] power: keep per-lcore " Mattias Rönnblom
2024-09-18  8:26                                                 ` [PATCH v7 6/7] service: " Mattias Rönnblom
2024-09-18  8:26                                                 ` [PATCH v7 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-09-18  9:30                                                 ` [PATCH v7 0/7] Lcore variables fengchengwen
2024-10-10  5:06                                                 ` Stephen Hemminger
2024-09-18  8:00                                             ` [PATCH v6 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-09-18  8:25                                               ` Konstantin Ananyev
2024-09-18  8:00                                             ` [PATCH v6 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-09-18  8:00                                             ` [PATCH v6 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-09-18  8:00                                             ` [PATCH v6 5/7] power: keep per-lcore " Mattias Rönnblom
2024-09-18  8:00                                             ` [PATCH v6 6/7] service: " Mattias Rönnblom
2024-09-18  8:00                                             ` [PATCH v6 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-09-17 14:32                                         ` [PATCH v5 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-09-17 14:32                                         ` [PATCH v5 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-09-17 15:40                                           ` Morten Brørup
2024-09-18  6:05                                             ` Mattias Rönnblom
2024-09-17 14:32                                         ` [PATCH v5 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-09-17 14:32                                         ` [PATCH v5 5/7] power: keep per-lcore " Mattias Rönnblom
2024-09-17 14:32                                         ` [PATCH v5 6/7] service: " Mattias Rönnblom
2024-09-17 14:32                                         ` [PATCH v5 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-09-16 10:52                                     ` [PATCH v4 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-09-16 10:52                                     ` [PATCH v4 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-09-16 11:13                                       ` Mattias Rönnblom
2024-09-16 11:54                                         ` Morten Brørup
2024-09-16 16:12                                           ` Mattias Rönnblom
2024-09-16 17:19                                             ` Morten Brørup
2024-09-16 10:52                                     ` [PATCH v4 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-09-16 16:11                                       ` Konstantin Ananyev
2024-09-16 10:52                                     ` [PATCH v4 5/7] power: keep per-lcore " Mattias Rönnblom
2024-09-16 16:12                                       ` Konstantin Ananyev
2024-09-16 10:52                                     ` [PATCH v4 6/7] service: " Mattias Rönnblom
2024-09-16 16:13                                       ` Konstantin Ananyev
2024-09-16 10:52                                     ` [PATCH v4 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-09-16 16:14                                       ` Konstantin Ananyev
2024-09-12  8:44                                 ` [PATCH v3 2/7] eal: add lcore variable functional tests Mattias Rönnblom
2024-09-12  8:44                                 ` [PATCH v3 3/7] eal: add lcore variable performance test Mattias Rönnblom
2024-09-12  9:39                                   ` Morten Brørup
2024-09-12 13:01                                     ` Mattias Rönnblom
2024-09-12 13:09                                   ` Jerin Jacob
2024-09-12 13:20                                     ` Mattias Rönnblom
2024-09-12 15:11                                       ` Jerin Jacob
2024-09-13  6:47                                         ` Mattias Rönnblom
2024-09-13 11:23                                           ` Jerin Jacob
2024-09-13 14:40                                             ` Morten Brørup
2024-09-16  8:12                                               ` Jerin Jacob
2024-09-16  9:51                                                 ` Morten Brørup
2024-09-16 10:50                                             ` Mattias Rönnblom
2024-09-18 10:04                                               ` Jerin Jacob
2024-09-12  8:44                                 ` [PATCH v3 4/7] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-09-12  8:44                                 ` [PATCH v3 5/7] power: keep per-lcore " Mattias Rönnblom
2024-09-12  8:44                                 ` [PATCH v3 6/7] service: " Mattias Rönnblom
2024-09-12  8:44                                 ` [PATCH v3 7/7] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-09-12  9:10                               ` [PATCH v2 1/6] eal: add static per-lcore memory allocation facility Morten Brørup
2024-09-12 13:16                                 ` Jerin Jacob
2024-09-12 13:41                                   ` Morten Brørup
2024-09-12 15:22                                     ` Jerin Jacob
2024-09-18 10:11                                       ` Jerin Jacob
2024-09-19 19:31                                         ` Mattias Rönnblom
2024-10-14  7:56                                         ` Morten Brørup
2024-10-15  6:29                                           ` Mattias Rönnblom
2024-09-11 17:04                             ` [PATCH v2 2/6] eal: add lcore variable test suite Mattias Rönnblom
2024-09-12  7:35                               ` Jerin Jacob
2024-09-12  8:56                                 ` Mattias Rönnblom
2024-09-11 17:04                             ` [PATCH v2 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-09-11 17:04                             ` [PATCH v2 4/6] power: keep per-lcore " Mattias Rönnblom
2024-09-11 17:04                             ` [PATCH v2 5/6] service: " Mattias Rönnblom
2024-09-11 17:04                             ` [PATCH v2 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-09-10  7:03                         ` [PATCH 2/6] eal: add lcore variable test suite Mattias Rönnblom
2024-09-10  7:03                         ` [PATCH 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-09-10  7:03                         ` [PATCH 4/6] power: keep per-lcore " Mattias Rönnblom
2024-09-10  7:03                         ` [PATCH 5/6] service: " Mattias Rönnblom
2024-09-10  7:03                         ` [PATCH 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-05-06  8:27                     ` [RFC v6 2/6] eal: add lcore variable test suite Mattias Rönnblom
2024-05-06  8:27                     ` [RFC v6 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-05-06  8:27                     ` [RFC v6 4/6] power: keep per-lcore " Mattias Rönnblom
2024-05-06  8:27                     ` [RFC v6 5/6] service: " Mattias Rönnblom
2024-05-06  8:27                     ` [RFC v6 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-09-02 14:42                     ` [RFC v6 0/6] Lcore variables Morten Brørup
2024-09-10  6:41                       ` Mattias Rönnblom
2024-09-10 15:41                         ` Stephen Hemminger
2024-02-28 10:09                 ` [RFC v5 2/6] eal: add lcore variable test suite Mattias Rönnblom
2024-02-28 10:09                 ` [RFC v5 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-02-28 10:09                 ` [RFC v5 4/6] power: keep per-lcore " Mattias Rönnblom
2024-02-28 10:09                 ` [RFC v5 5/6] service: " Mattias Rönnblom
2024-02-28 10:09                 ` [RFC v5 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-02-25 15:03             ` [RFC v4 2/6] eal: add lcore variable test suite Mattias Rönnblom
2024-02-25 15:03             ` [RFC v4 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-02-25 15:03             ` [RFC v4 4/6] power: keep per-lcore " Mattias Rönnblom
2024-02-25 15:03             ` [RFC v4 5/6] service: " Mattias Rönnblom
2024-02-25 16:28               ` Mattias Rönnblom
2024-02-25 15:03             ` [RFC v4 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-02-20  8:49         ` [RFC v3 2/6] eal: add lcore variable test suite Mattias Rönnblom
2024-02-20  8:49         ` [RFC v3 3/6] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-02-20 15:31           ` Morten Brørup
2024-02-20  8:49         ` [RFC v3 4/6] power: keep per-lcore " Mattias Rönnblom
2024-02-20  8:49         ` [RFC v3 5/6] service: " Mattias Rönnblom
2024-02-22  9:42           ` Morten Brørup
2024-02-23 10:19             ` Mattias Rönnblom
2024-02-20  8:49         ` [RFC v3 6/6] eal: keep per-lcore power intrinsics " Mattias Rönnblom
2024-02-19  9:40     ` [RFC v2 2/5] eal: add lcore variable test suite Mattias Rönnblom
2024-02-19  9:40     ` [RFC v2 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-02-19 11:22       ` Morten Brørup
2024-02-19 14:04         ` Mattias Rönnblom
2024-02-19 15:10           ` Morten Brørup
2024-02-19  9:40     ` [RFC v2 4/5] power: keep per-lcore " Mattias Rönnblom
2024-02-19  9:40     ` [RFC v2 5/5] service: " Mattias Rönnblom
2024-02-08 18:16 ` [RFC 2/5] eal: add lcore variable test suite Mattias Rönnblom
2024-02-08 18:16 ` [RFC 3/5] random: keep PRNG state in lcore variable Mattias Rönnblom
2024-02-08 18:16 ` [RFC 4/5] power: keep per-lcore " Mattias Rönnblom
2024-02-08 18:16 ` [RFC 5/5] service: " Mattias Rönnblom
