From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id E6CE8459AD;
	Mon, 16 Sep 2024 13:13:32 +0200 (CEST)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 76FE34025F;
	Mon, 16 Sep 2024 13:13:32 +0200 (CEST)
Received: from mail.lysator.liu.se (mail.lysator.liu.se [130.236.254.3])
	by mails.dpdk.org (Postfix) with ESMTP id 89F4340150
	for ; Mon, 16 Sep 2024 13:13:31 +0200 (CEST)
Received: from mail.lysator.liu.se (localhost [127.0.0.1])
	by mail.lysator.liu.se (Postfix) with ESMTP id 84336156D4
	for ; Mon, 16 Sep 2024 13:13:30 +0200 (CEST)
Received: by mail.lysator.liu.se (Postfix, from userid 1004)
	id 7739B15732; Mon, 16 Sep 2024 13:13:30 +0200 (CEST)
X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on hermod.lysator.liu.se
X-Spam-Level:
X-Spam-Status: No, score=-1.2 required=5.0 tests=ALL_TRUSTED,AWL,
	T_SCC_BODY_TEXT_LINE autolearn=disabled version=4.0.0
X-Spam-Score: -1.2
Received: from [192.168.1.86] (h-62-63-215-114.A163.priv.bahnhof.se [62.63.215.114])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
	(No client certificate requested)
	by mail.lysator.liu.se (Postfix) with ESMTPSA id 5F9081566D;
	Mon, 16 Sep 2024 13:13:28 +0200 (CEST)
Message-ID: <73c62731-caae-41b2-b9ac-0a190d56073a@lysator.liu.se>
Date: Mon, 16 Sep 2024 13:13:27 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v4 3/7] eal: add lcore variable performance test
To: Mattias Rönnblom , dev@dpdk.org
Cc: Morten Brørup , Stephen Hemminger , Konstantin Ananyev ,
	David Marchand , Jerin Jacob
References: <20240912084429.703405-2-mattias.ronnblom@ericsson.com>
	<20240916105210.721315-1-mattias.ronnblom@ericsson.com>
	<20240916105210.721315-4-mattias.ronnblom@ericsson.com>
Content-Language: en-US
From: Mattias Rönnblom
In-Reply-To: <20240916105210.721315-4-mattias.ronnblom@ericsson.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Virus-Scanned: ClamAV using ClamSMTP
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dev-bounces@dpdk.org

On 2024-09-16 12:52, Mattias Rönnblom wrote:
> Add basic micro benchmark for lcore variables, in an attempt to assure
> that the overhead isn't significantly greater than alternative
> approaches, in scenarios where the benefits aren't expected to show up
> (i.e., when plenty of cache is available compared to the working set
> size of the per-lcore data).
> 

Here are some test results for a Raptor Cove @ 3.2 GHz (GCC 11):

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [TSC cycles/update]
Modules/Variables  Static array  Thread-local Storage  Lcore variables
                1           3.9                   5.5              3.7
                2           3.8                   5.5              3.8
                4           4.9                   5.5              3.7
                8           3.8                   5.5              3.8
               16          11.3                   5.5              3.7
               32          20.9                   5.5              3.7
               64          23.5                   5.5              3.7
              128          23.2                   5.5              3.7
              256          23.5                   5.5              3.7
              512          24.1                   5.5              3.7
             1024          25.3                   5.5              3.9
 + TestCase [ 0] : test_lcore_var_access succeeded
 + ------------------------------------------------------- +

The reason TLS is slower than lcore variables (which in turn rely on
TLS for the lcore id lookup) is the lazy initialization conditional
imposed on the TLS variant. If that conditional could be avoided
(whether it can be is module-dependent, I suppose), TLS beats lcore
variables at ~3.0 cycles/update.

I must say I'm surprised to see lcore variables doing this well at
these very modest working set sizes. Probably you can stay at
near-zero L1 misses with lcore variables (and TLS), but start missing
in the L1 with the static arrays (some back-of-the-envelope numbers on
this at the very end of this mail).
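
To illustrate the point about the conditional: a branch-free TLS
variant could look roughly like the sketch below, assuming the module
instead gets an explicit per-lcore init call (run on each lcore by
whatever code brings it up) rather than initializing lazily. The
tls_eager_* names are made up for this mail and not part of the patch.

static RTE_DEFINE_PER_LCORE(struct mod_lcore_state,
			    tls_lcore_state_eager)[MAX_MODS];

/* Hypothetical: called once per lcore, before any updates occur. */
static void
tls_eager_init(void)
{
	unsigned int mod;

	for (mod = 0; mod < MAX_MODS; mod++)
		mod_init(&RTE_PER_LCORE(tls_lcore_state_eager[mod]));
}

static __rte_noinline void
tls_eager_update(unsigned int mod)
{
	/* no "is this initialized?" branch left on the fast path */
	mod_update(&RTE_PER_LCORE(tls_lcore_state_eager[mod]));
}

Plugged into benchmark_access() with tls_eager_init as the init_fun,
something like this should land at roughly the ~3.0 cycles/update
mentioned above, at the cost of having to hook into lcore/thread
startup.
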
> Signed-off-by: Mattias Rönnblom
> 
> --
> 
> PATCH v4:
>  * Rework the tests to be a little less unrealistic. Instead of a
>    single dummy module using a single variable, use a number of
>    variables/modules. In this way, differences in cache effects may
>    show up.
>  * Add RTE_CACHE_GUARD to better mimic that static array pattern.
>    (Morten Brørup)
>  * Show latencies as TSC cycles. (Morten Brørup)
> ---
>  app/test/meson.build           |   1 +
>  app/test/test_lcore_var_perf.c | 244 +++++++++++++++++++++++++++++++++
>  2 files changed, 245 insertions(+)
>  create mode 100644 app/test/test_lcore_var_perf.c
> 
> diff --git a/app/test/meson.build b/app/test/meson.build
> index 48279522f0..d4e0c59900 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -104,6 +104,7 @@ source_file_deps = {
>          'test_kvargs.c': ['kvargs'],
>          'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
>          'test_lcore_var.c': [],
> +        'test_lcore_var_perf.c': [],
>          'test_lcores.c': [],
>          'test_link_bonding.c': ['ethdev', 'net_bond',
>                  'net'] + packet_burst_generator_deps + virtual_pmd_deps,
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> new file mode 100644
> index 0000000000..8b0abc771c
> --- /dev/null
> +++ b/app/test/test_lcore_var_perf.c
> @@ -0,0 +1,244 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Ericsson AB
> + */
> +
> +#define MAX_MODS 1024
> +
> +#include <stdio.h>
> +
> +#include <rte_cycles.h>
> +#include <rte_lcore.h>
> +#include <rte_lcore_var.h>
> +#include <rte_per_lcore.h>
> +#include <rte_random.h>
> +
> +#include "test.h"
> +
> +struct mod_lcore_state {
> +	uint64_t a;
> +	uint64_t b;
> +	uint64_t sum;
> +};
> +
> +static void
> +mod_init(struct mod_lcore_state *state)
> +{
> +	state->a = rte_rand();
> +	state->b = rte_rand();
> +	state->sum = 0;
> +}
> +
> +static __rte_always_inline void
> +mod_update(volatile struct mod_lcore_state *state)
> +{
> +	state->sum += state->a * state->b;
> +}
> +
> +struct __rte_cache_aligned mod_lcore_state_aligned {
> +	struct mod_lcore_state mod_state;
> +
> +	RTE_CACHE_GUARD;
> +};
> +
> +static struct mod_lcore_state_aligned
> +sarray_lcore_state[MAX_MODS][RTE_MAX_LCORE];
> +
> +static void
> +sarray_init(void)
> +{
> +	unsigned int lcore_id = rte_lcore_id();
> +	int mod;
> +
> +	for (mod = 0; mod < MAX_MODS; mod++) {
> +		struct mod_lcore_state *mod_state =
> +			&sarray_lcore_state[mod][lcore_id].mod_state;
> +
> +		mod_init(mod_state);
> +	}
> +}
> +
> +static __rte_noinline void
> +sarray_update(unsigned int mod)
> +{
> +	unsigned int lcore_id = rte_lcore_id();
> +	struct mod_lcore_state *mod_state =
> +		&sarray_lcore_state[mod][lcore_id].mod_state;
> +
> +	mod_update(mod_state);
> +}
> +
> +struct mod_lcore_state_lazy {
> +	struct mod_lcore_state mod_state;
> +	bool initialized;
> +};
> +
> +/*
> + * Note: it's usually a bad idea to have this much thread-local storage
> + * allocated in a real application, since it will incur a cost on
> + * thread creation and non-lcore thread memory usage.
> + */
> +static RTE_DEFINE_PER_LCORE(struct mod_lcore_state_lazy,
> +			    tls_lcore_state)[MAX_MODS];
> +
> +static inline void
> +tls_init(struct mod_lcore_state_lazy *state)
> +{
> +	mod_init(&state->mod_state);
> +
> +	state->initialized = true;
> +}
> +
> +static __rte_noinline void
> +tls_update(unsigned int mod)
> +{
> +	struct mod_lcore_state_lazy *state =
> +		&RTE_PER_LCORE(tls_lcore_state[mod]);
> +
> +	/* With thread-local storage, initialization must usually be lazy */
> +	if (!state->initialized)
> +		tls_init(state);
> +
> +	mod_update(&state->mod_state);
> +}
> +
> +RTE_LCORE_VAR_HANDLE(struct mod_lcore_state, lvar_lcore_state)[MAX_MODS];
> +
> +static void
> +lvar_init(void)
> +{
> +	unsigned int mod;
> +
> +	for (mod = 0; mod < MAX_MODS; mod++) {
> +		RTE_LCORE_VAR_ALLOC(lvar_lcore_state[mod]);
> +
> +		struct mod_lcore_state *state =
> +			RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
> +
> +		mod_init(state);
> +	}
> +}
> +
> +static __rte_noinline void
> +lvar_update(unsigned int mod)
> +{
> +	struct mod_lcore_state *state =
> +		RTE_LCORE_VAR_VALUE(lvar_lcore_state[mod]);
> +
> +	mod_update(state);
> +}
> +
> +static void
> +shuffle(unsigned int *elems, size_t len)
> +{
> +	size_t i;
> +
> +	for (i = len - 1; i > 0; i--) {
> +		unsigned int other = rte_rand_max(i + 1);
> +
> +		unsigned int tmp = elems[other];
> +		elems[other] = elems[i];
> +		elems[i] = tmp;
> +	}
> +}
> +
> +#define ITERATIONS UINT64_C(10000000)
> +
> +static inline double
> +benchmark_access(const unsigned int *mods, unsigned int num_mods,
> +		 void (*init_fun)(void), void (*update_fun)(unsigned int))
> +{
> +	unsigned int i;
> +	double start;
> +	double end;
> +	double latency;
> +	unsigned int num_mods_mask = num_mods - 1;
> +
> +	RTE_VERIFY(rte_is_power_of_2(num_mods));
> +
> +	if (init_fun != NULL)
> +		init_fun();
> +
> +	/* Warm up cache and make sure TLS variables are initialized */
> +	for (i = 0; i < num_mods; i++)
> +		update_fun(i);
> +
> +	start = rte_rdtsc();
> +
> +	for (i = 0; i < ITERATIONS; i++)
> +		update_fun(mods[i & num_mods_mask]);
> +
> +	end = rte_rdtsc();
> +
> +	latency = (end - start) / ITERATIONS;
> +
> +	return latency;
> +}
> +
> +static void
> +test_lcore_var_access_n(unsigned int num_mods)
> +{
> +	double sarray_latency;
> +	double tls_latency;
> +	double lvar_latency;
> +	unsigned int mods[num_mods];
> +	unsigned int i;
> +
> +	for (i = 0; i < num_mods; i++)
> +		mods[i] = i;
> +
> +	shuffle(mods, num_mods);
> +
> +	sarray_latency =
> +		benchmark_access(mods, num_mods, sarray_init, sarray_update);
> +
> +	tls_latency =
> +		benchmark_access(mods, num_mods, NULL, tls_update);
> +
> +	lvar_latency =
> +		benchmark_access(mods, num_mods, lvar_init, lvar_update);
> +
> +	printf("%17u %13.1f %13.1f %16.1f\n", num_mods, sarray_latency,
> +	       tls_latency, lvar_latency);
> +}
> +
> +/*
> + * The potential performance benefit of lcore variables compared to
> + * the use of statically sized, lcore id-indexed arrays is not
> + * shorter latencies in a scenario with low cache pressure, but rather
> + * fewer cache misses in a real-world scenario, with extensive cache
> + * usage. These tests are a crude simulation of such, using dummy
> + * modules, each with a small, per-lcore state. Note however that
> + * these tests have very little non-lcore/thread local state, which is
> + * unrealistic.
> + */
> +
> +static int
> +test_lcore_var_access(void)
> +{
> +	unsigned int num_mods = 1;
> +
> +	printf("Latencies [TSC cycles/update]\n");
> +	printf("Modules/Variables Static array Thread-local Storage "
> +	       "Lcore variables\n");
> +
> +	for (num_mods = 1; num_mods <= MAX_MODS; num_mods *= 2)
> +		test_lcore_var_access_n(num_mods);
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static struct unit_test_suite lcore_var_testsuite = {
> +	.suite_name = "lcore variable perf autotest",
> +	.unit_test_cases = {
> +		TEST_CASE(test_lcore_var_access),
> +		TEST_CASES_END()
> +	},
> +};
> +
> +static int
> +test_lcore_var_perf(void)
> +{
> +	return unit_test_suite_runner(&lcore_var_testsuite);
> +}
> +
> +REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);
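
One more note on the results, since the static array curve takes off
already at 16 modules, long before 1024 distinct cache lines could be
a capacity problem. My back-of-the-envelope explanation, assuming the
default RTE_MAX_LCORE of 128, 64-byte cache lines, a one-line
RTE_CACHE_GUARD and a 48 KiB/12-way L1D (which is what I believe
Raptor Cove has):

  sizeof(struct mod_lcore_state_aligned) = 24 B state + guard -> 128 B
  mod-to-mod stride for a given lcore    = 128 * 128 B = 16 KiB
  L1D set period                         = 48 KiB / 12 ways = 4 KiB

A 16 KiB stride is a multiple of the 4 KiB set period, so every
module's line maps to the same L1 set, and with 12 ways the shuffled
access pattern starts thrashing that set somewhere between 8 and 16
modules, which is where the static array latency starts to climb. The
TLS and lcore variable layouts instead pack the per-module states
densely (32 B and 24 B per module, respectively), so even at 1024
modules they should stay comfortably within L1.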