From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id BE77D459C8; Wed, 18 Sep 2024 12:05:26 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 8E2BA402C8; Wed, 18 Sep 2024 12:05:26 +0200 (CEST) Received: from mail-qt1-f182.google.com (mail-qt1-f182.google.com [209.85.160.182]) by mails.dpdk.org (Postfix) with ESMTP id A29DF4025C for ; Wed, 18 Sep 2024 12:05:25 +0200 (CEST) Received: by mail-qt1-f182.google.com with SMTP id d75a77b69052e-4582a0b438aso56336791cf.0 for ; Wed, 18 Sep 2024 03:05:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1726653925; x=1727258725; darn=dpdk.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=JMEaTU1TeMSfbZHkxL0OhelyUl95sULPPuR4w6uRQKI=; b=j9ILwbMvCxCyyusDXtkGUQEh8b37q+lqzaIsckU+8JmvCxFG09Sybz+61Nxjf/QI8E QxlWzvJUls9oijYubzV08PfnFdHXJs4uafaiIfhipBcG2phLAdSEKvj3a1cO2uW41KYQ 3cj8NuZBS4EC/EucqqXoE65GPBTkhSTJXnBTXvc/YFuZFYpfjo0k46XjHkTO1ydNClrD dCgvWCyQttHvCIpi4haG57JF5dLdviLI6pxgRBmuA846H/oEt+TB0ljYS9NjK+p6nk6b Swtw/1DCyfPEVf06Ka1pI3rxtv2rgNgeYPJO1b+0qjIS64EBjtEDBfgGZQz5sDqJp2tA sXXg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1726653925; x=1727258725; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=JMEaTU1TeMSfbZHkxL0OhelyUl95sULPPuR4w6uRQKI=; b=eEZc0kUyUb6Ud/xTaEiCDggxDIEtX2mZ5WyrKLZKMJtuiSXMNCbXBBOXtlthHnn4FD qzk/7SKWVhsHAzlmWb1dujLxTO9byocZwiK3Bl69ACp5D19d+M87vhqGqNvJquudCAnZ Yaz3hnH6/hABcfZ1ltbP3y2tXvet+j6DvpMSGoQTYC4FBeMUTdhkfNUuk6rDPew2gYAi Hy1s3vBxq+leES5SgmxXTW5uywmbLC+lLC8HfFjRoPiMDe6AsHXmGpUgPDLzZxHyhIBl rJS5mw3bhGw2VHLaShq0RZWjKV1PT8ZNNVqiEiO4Re6/Gic1iZGYvm2Wvy74O0h4x4O6 jCpQ== X-Forwarded-Encrypted: i=1; AJvYcCUHL9zFk2plzKssPc25yrwqqNXgKJfnbuKZZAEH4B7/GdsytLOZqsCvJ/wIImLLjCDhodY=@dpdk.org X-Gm-Message-State: AOJu0YwuCjyOcbA9SEenypBwU/kgGEJvPuWWn35zdHKK2SOt15ec3s2D CXc+0uBszNwXa7S4JTsSGA0U2gXB+9J1XmVTxnJYRyBQh4vpmfRXGipGpdweRy7AQz3HTgBlT8A v3LvBu8TSx2yEA8O1s3xmJjhchCw= X-Google-Smtp-Source: AGHT+IFUQk0a5SPbLr+Sr9iU08Ab9crIVdQgvEKE7RfIfcSPt3s7T7ad3MVeLNafUVBQSlwCfqMpbTnZarjy1fbIGFg= X-Received: by 2002:a05:622a:1449:b0:458:4b8b:1517 with SMTP id d75a77b69052e-458602e2304mr336523061cf.18.1726653924893; Wed, 18 Sep 2024 03:05:24 -0700 (PDT) MIME-Version: 1.0 References: <20240911170430.701685-2-mattias.ronnblom@ericsson.com> <20240912084429.703405-1-mattias.ronnblom@ericsson.com> <20240912084429.703405-4-mattias.ronnblom@ericsson.com> <88a778d3-e157-41cd-9da7-2d06864a654d@lysator.liu.se> <0a8dd454-976c-4f17-a870-09ba2d90c717@lysator.liu.se> <504ca550-f89f-44ce-b716-899b882590a3@lysator.liu.se> In-Reply-To: <504ca550-f89f-44ce-b716-899b882590a3@lysator.liu.se> From: Jerin Jacob Date: Wed, 18 Sep 2024 15:34:58 +0530 Message-ID: Subject: Re: [PATCH v3 3/7] eal: add lcore variable performance test To: =?UTF-8?Q?Mattias_R=C3=B6nnblom?= Cc: =?UTF-8?Q?Mattias_R=C3=B6nnblom?= , dev@dpdk.org, =?UTF-8?Q?Morten_Br=C3=B8rup?= , Stephen Hemminger , Konstantin Ananyev , David Marchand , Jerin Jacob Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org On Mon, Sep 16, 2024 at 4:20=E2=80=AFPM Mattias R=C3=B6nnblom wrote: > > On 2024-09-13 13:23, Jerin Jacob wrote: > > On Fri, Sep 13, 2024 at 12:17=E2=80=AFPM Mattias R=C3=B6nnblom wrote: > >> > >> On 2024-09-12 17:11, Jerin Jacob wrote: > >>> On Thu, Sep 12, 2024 at 6:50=E2=80=AFPM Mattias R=C3=B6nnblom wrote: > >>>> > >>>> On 2024-09-12 15:09, Jerin Jacob wrote: > >>>>> On Thu, Sep 12, 2024 at 2:34=E2=80=AFPM Mattias R=C3=B6nnblom > >>>>> wrote: > >>>>>> > >>>>>> Add basic micro benchmark for lcore variables, in an attempt to as= sure > >>>>>> that the overhead isn't significantly greater than alternative > >>>>>> approaches, in scenarios where the benefits aren't expected to sho= w up > >>>>>> (i.e., when plenty of cache is available compared to the working s= et > >>>>>> size of the per-lcore data). > >>>>>> > >>>>>> Signed-off-by: Mattias R=C3=B6nnblom > >>>>>> --- > >>>>>> app/test/meson.build | 1 + > >>>>>> app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++= ++++++++ > >>>>>> 2 files changed, 161 insertions(+) > >>>>>> create mode 100644 app/test/test_lcore_var_perf.c > >>>>> > >>>>> > >>>>>> +static double > >>>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun= )(void)) > >>>>>> +{ > >>>>>> + uint64_t i; > >>>>>> + uint64_t start; > >>>>>> + uint64_t end; > >>>>>> + double latency; > >>>>>> + > >>>>>> + init_fun(); > >>>>>> + > >>>>>> + start =3D rte_get_timer_cycles(); > >>>>>> + > >>>>>> + for (i =3D 0; i < ITERATIONS; i++) > >>>>>> + update_fun(); > >>>>>> + > >>>>>> + end =3D rte_get_timer_cycles(); > >>>>> > >>>>> Use precise variant. rte_rdtsc_precise() or so to be accurate > >>>> > >>>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not. > >>> > >>> I was thinking in another way, with 1e7 iteration, the additional > >>> barrier on precise will be amortized, and we get more _deterministic_ > >>> behavior e.s.p in case if we print cycles and if we need to catch > >>> regressions. > >> > >> If you time a section of code which spends ~40000000 cycles, it doesn'= t > >> matter if you add or remove a few cycles at the beginning and the end. > >> > >> The rte_rdtsc_precise() is both better (more precise in the sense of > >> more serialization), and worse (because it's more costly, and thus mor= e > >> intrusive). > > > > We can calibrate the overhead to remove the cost. > > > What you are interested is primarily the impact of (instruction) > throughput, not the latency of the sequence of instructions that must be > retired in order to load the lcore variable values, when you switch from > (say) lcore id-index static arrays to lcore variables in your module. > > Usually, there is not reason to make a distinction between latency and > throughput in this context, but as you zoom into very short snippets of > code being executed, the difference becomes relevant. For example, > adding an div instruction won't necessarily add 12 cc to your program's > execution time on a Zen 4, even though that is its latency. Rather, the > effects may, depending on data dependencies and what other instructions > are executed in parallel, be much smaller. > > So, one could argue the ILP you get with the loop is a feature, not a bug= . > > With or without per-iteration latency measurements, these benchmark are > not-very-useful at best, and misleading at worst. I will rework them to > include more than a single module/lcore variable, which I think would be > somewhat of an improvement. OK. Module parameter will remove the compiler optimization and more accurat= e. I was doing manual loop unrolling[1] in a trace test case(for small inline functions) Either way it fine. Thanks for the rework. [1] https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30 > > Even better would have some real domain logic, instead of just a dummy > multiplication. > > >> > >> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It > >> doesn't matter. > > > > Yes. In this setup and it is pretty inaccurate PER iteration. Please > > refer to the below patch to see the difference. > > > > Patch 1: Make nanoseconds to cycles per iteration > > ------------------------------------------------------------------ > > > > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_p= erf.c > > index ea1d7ba90b52..b8d25400f593 100644 > > --- a/app/test/test_lcore_var_perf.c > > +++ b/app/test/test_lcore_var_perf.c > > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void), > > void (*update_fun)(void)) > > > > end =3D rte_get_timer_cycles(); > > > > - latency =3D ((end - start) / (double)rte_get_timer_hz()) / ITER= ATIONS; > > + latency =3D ((end - start)) / ITERATIONS; > > > > return latency; > > } > > @@ -137,8 +137,7 @@ test_lcore_var_access(void) > > > > - printf("Latencies [ns/update]\n"); > > + printf("Latencies [cycles/update]\n"); > > printf("Thread-local storage Static array Lcore variables\n"= ); > > - printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9, > > - sarray_latency * 1e9, lvar_latency * 1e9); > > + printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, > > lvar_latency); > > > > return TEST_SUCCESS; > > } > > > > > > Patch 2: Change to precise with calibration > > ----------------------------------------------------------- > > > > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_p= erf.c > > index ea1d7ba90b52..8142ecd56241 100644 > > --- a/app/test/test_lcore_var_perf.c > > +++ b/app/test/test_lcore_var_perf.c > > @@ -96,23 +96,28 @@ lvar_update(void) > > static double > > benchmark_access_method(void (*init_fun)(void), void (*update_fun)(vo= id)) > > { > > - uint64_t i; > > + double tsc_latency; > > + double latency; > > uint64_t start; > > uint64_t end; > > - double latency; > > + uint64_t i; > > > > - init_fun(); > > + /* calculate rte_rdtsc_precise overhead */ > > + start =3D rte_rdtsc_precise(); > > + end =3D rte_rdtsc_precise(); > > + tsc_latency =3D (end - start); > > > > - start =3D rte_get_timer_cycles(); > > + init_fun(); > > > > - for (i =3D 0; i < ITERATIONS; i++) > > + latency =3D 0; > > + for (i =3D 0; i < ITERATIONS; i++) { > > + start =3D rte_rdtsc_precise(); > > update_fun(); > > + end =3D rte_rdtsc_precise(); > > + latency +=3D (end - start) - tsc_latency; > > + } > > > > - end =3D rte_get_timer_cycles(); > > - > > - latency =3D ((end - start) / (double)rte_get_timer_hz()) / ITER= ATIONS; > > - > > - return latency; > > + return latency / (double)ITERATIONS; > > } > > > > static int > > @@ -135,10 +140,9 @@ test_lcore_var_access(void) > > sarray_latency =3D benchmark_access_method(sarray_init, sarray= _update); > > lvar_latency =3D benchmark_access_method(lvar_init, lvar_updat= e); > > > > - printf("Latencies [ns/update]\n"); > > + printf("Latencies [cycles/update]\n"); > > printf("Thread-local storage Static array Lcore variables\n"= ); > > - printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9, > > - sarray_latency * 1e9, lvar_latency * 1e9); > > + printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, > > lvar_latency); > > > > return TEST_SUCCESS; > > } > > > > ARM N2 core with patch 1(aka current scheme) > > ----------------------------------- > > > > + ------------------------------------------------------- + > > + Test Suite : lcore variable perf autotest > > + ------------------------------------------------------- + > > Latencies [cycles/update] > > Thread-local storage Static array Lcore variables > > 7.0 7.0 7.0 > > > > > > ARM N2 core with patch 2 > > ----------------------------------- > > > > + ------------------------------------------------------- + > > + Test Suite : lcore variable perf autotest > > + ------------------------------------------------------- + > > Latencies [cycles/update] > > Thread-local storage Static array Lcore variables > > 11.4 15.5 15.5 > > > > x86 i9 core with patch 1(aka current scheme) > > ------------------------------------------------------------ > > > > + ------------------------------------------------------- + > > + Test Suite : lcore variable perf autotest > > + ------------------------------------------------------- + > > Latencies [ns/update] > > Thread-local storage Static array Lcore variables > > 5.0 6.0 6.0 > > > > x86 i9 core with patch 2 > > -------------------------------- > > + ------------------------------------------------------- + > > + Test Suite : lcore variable perf autotest > > + ------------------------------------------------------- + > > Latencies [cycles/update] > > Thread-local storage Static array Lcore variables > > 5.3 10.6 11.7 > > > > > > > > > > > >> > >>> Furthermore, you may consider replacing rte_random() in fast path to > >>> running number or so if it is not deterministic in cycle computation. > >> > >> rte_rand() is not used in the fast path. I don't understand what you > > > > I missed that. Ignore this comment. > > > >> mean by "running number".