From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id BE77D459C8;
	Wed, 18 Sep 2024 12:05:26 +0200 (CEST)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 8E2BA402C8;
	Wed, 18 Sep 2024 12:05:26 +0200 (CEST)
Received: from mail-qt1-f182.google.com (mail-qt1-f182.google.com
 [209.85.160.182])
 by mails.dpdk.org (Postfix) with ESMTP id A29DF4025C
 for <dev@dpdk.org>; Wed, 18 Sep 2024 12:05:25 +0200 (CEST)
Received: by mail-qt1-f182.google.com with SMTP id
 d75a77b69052e-4582a0b438aso56336791cf.0
 for <dev@dpdk.org>; Wed, 18 Sep 2024 03:05:25 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gmail.com; s=20230601; t=1726653925; x=1727258725; darn=dpdk.org;
 h=content-transfer-encoding:cc:to:subject:message-id:date:from
 :in-reply-to:references:mime-version:from:to:cc:subject:date
 :message-id:reply-to;
 bh=JMEaTU1TeMSfbZHkxL0OhelyUl95sULPPuR4w6uRQKI=;
 b=j9ILwbMvCxCyyusDXtkGUQEh8b37q+lqzaIsckU+8JmvCxFG09Sybz+61Nxjf/QI8E
 QxlWzvJUls9oijYubzV08PfnFdHXJs4uafaiIfhipBcG2phLAdSEKvj3a1cO2uW41KYQ
 3cj8NuZBS4EC/EucqqXoE65GPBTkhSTJXnBTXvc/YFuZFYpfjo0k46XjHkTO1ydNClrD
 dCgvWCyQttHvCIpi4haG57JF5dLdviLI6pxgRBmuA846H/oEt+TB0ljYS9NjK+p6nk6b
 Swtw/1DCyfPEVf06Ka1pI3rxtv2rgNgeYPJO1b+0qjIS64EBjtEDBfgGZQz5sDqJp2tA
 sXXg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1726653925; x=1727258725;
 h=content-transfer-encoding:cc:to:subject:message-id:date:from
 :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
 :subject:date:message-id:reply-to;
 bh=JMEaTU1TeMSfbZHkxL0OhelyUl95sULPPuR4w6uRQKI=;
 b=eEZc0kUyUb6Ud/xTaEiCDggxDIEtX2mZ5WyrKLZKMJtuiSXMNCbXBBOXtlthHnn4FD
 qzk/7SKWVhsHAzlmWb1dujLxTO9byocZwiK3Bl69ACp5D19d+M87vhqGqNvJquudCAnZ
 Yaz3hnH6/hABcfZ1ltbP3y2tXvet+j6DvpMSGoQTYC4FBeMUTdhkfNUuk6rDPew2gYAi
 Hy1s3vBxq+leES5SgmxXTW5uywmbLC+lLC8HfFjRoPiMDe6AsHXmGpUgPDLzZxHyhIBl
 rJS5mw3bhGw2VHLaShq0RZWjKV1PT8ZNNVqiEiO4Re6/Gic1iZGYvm2Wvy74O0h4x4O6
 jCpQ==
X-Forwarded-Encrypted: i=1;
 AJvYcCUHL9zFk2plzKssPc25yrwqqNXgKJfnbuKZZAEH4B7/GdsytLOZqsCvJ/wIImLLjCDhodY=@dpdk.org
X-Gm-Message-State: AOJu0YwuCjyOcbA9SEenypBwU/kgGEJvPuWWn35zdHKK2SOt15ec3s2D
 CXc+0uBszNwXa7S4JTsSGA0U2gXB+9J1XmVTxnJYRyBQh4vpmfRXGipGpdweRy7AQz3HTgBlT8A
 v3LvBu8TSx2yEA8O1s3xmJjhchCw=
X-Google-Smtp-Source: AGHT+IFUQk0a5SPbLr+Sr9iU08Ab9crIVdQgvEKE7RfIfcSPt3s7T7ad3MVeLNafUVBQSlwCfqMpbTnZarjy1fbIGFg=
X-Received: by 2002:a05:622a:1449:b0:458:4b8b:1517 with SMTP id
 d75a77b69052e-458602e2304mr336523061cf.18.1726653924893; Wed, 18 Sep 2024
 03:05:24 -0700 (PDT)
MIME-Version: 1.0
References: <20240911170430.701685-2-mattias.ronnblom@ericsson.com>
 <20240912084429.703405-1-mattias.ronnblom@ericsson.com>
 <20240912084429.703405-4-mattias.ronnblom@ericsson.com>
 <CALBAE1N-GDbTwNbZDVcQSUn3T-bVSVo4Ckeu3DMz=m2cZqFqYQ@mail.gmail.com>
 <88a778d3-e157-41cd-9da7-2d06864a654d@lysator.liu.se>
 <CALBAE1P=dKRm=UMoar9D6d-KyfasO9Q6ksk4_xXnK7E4jeDo2Q@mail.gmail.com>
 <0a8dd454-976c-4f17-a870-09ba2d90c717@lysator.liu.se>
 <CALBAE1OfqwEmg_3KwcwG7kUOM=1n-L7fvdvYCD6+SK9ZWrF0pw@mail.gmail.com>
 <504ca550-f89f-44ce-b716-899b882590a3@lysator.liu.se>
In-Reply-To: <504ca550-f89f-44ce-b716-899b882590a3@lysator.liu.se>
From: Jerin Jacob <jerinjacobk@gmail.com>
Date: Wed, 18 Sep 2024 15:34:58 +0530
Message-ID: <CALBAE1NkMQp3bshc9CQwik6_q0i6ozerFyjdDR4P+nZ8iU=3KQ@mail.gmail.com>
Subject: Re: [PATCH v3 3/7] eal: add lcore variable performance test
To: =?UTF-8?Q?Mattias_R=C3=B6nnblom?= <hofors@lysator.liu.se>
Cc: =?UTF-8?Q?Mattias_R=C3=B6nnblom?= <mattias.ronnblom@ericsson.com>, 
 dev@dpdk.org, =?UTF-8?Q?Morten_Br=C3=B8rup?= <mb@smartsharesystems.com>, 
 Stephen Hemminger <stephen@networkplumber.org>, 
 Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>,
 David Marchand <david.marchand@redhat.com>, 
 Jerin Jacob <jerinj@marvell.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

On Mon, Sep 16, 2024 at 4:20=E2=80=AFPM Mattias R=C3=B6nnblom <hofors@lysat=
or.liu.se> wrote:
>
> On 2024-09-13 13:23, Jerin Jacob wrote:
> > On Fri, Sep 13, 2024 at 12:17=E2=80=AFPM Mattias R=C3=B6nnblom <hofors@=
lysator.liu.se> wrote:
> >>
> >> On 2024-09-12 17:11, Jerin Jacob wrote:
> >>> On Thu, Sep 12, 2024 at 6:50=E2=80=AFPM Mattias R=C3=B6nnblom <hofors=
@lysator.liu.se> wrote:
> >>>>
> >>>> On 2024-09-12 15:09, Jerin Jacob wrote:
> >>>>> On Thu, Sep 12, 2024 at 2:34=E2=80=AFPM Mattias R=C3=B6nnblom
> >>>>> <mattias.ronnblom@ericsson.com> wrote:
> >>>>>>
> >>>>>> Add basic micro benchmark for lcore variables, in an attempt to as=
sure
> >>>>>> that the overhead isn't significantly greater than alternative
> >>>>>> approaches, in scenarios where the benefits aren't expected to sho=
w up
> >>>>>> (i.e., when plenty of cache is available compared to the working s=
et
> >>>>>> size of the per-lcore data).
> >>>>>>
> >>>>>> Signed-off-by: Mattias R=C3=B6nnblom <mattias.ronnblom@ericsson.co=
m>
> >>>>>> ---
> >>>>>>     app/test/meson.build           |   1 +
> >>>>>>     app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++=
++++++++
> >>>>>>     2 files changed, 161 insertions(+)
> >>>>>>     create mode 100644 app/test/test_lcore_var_perf.c
> >>>>>
> >>>>>
> >>>>>> +static double
> >>>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun=
)(void))
> >>>>>> +{
> >>>>>> +       uint64_t i;
> >>>>>> +       uint64_t start;
> >>>>>> +       uint64_t end;
> >>>>>> +       double latency;
> >>>>>> +
> >>>>>> +       init_fun();
> >>>>>> +
> >>>>>> +       start =3D rte_get_timer_cycles();
> >>>>>> +
> >>>>>> +       for (i =3D 0; i < ITERATIONS; i++)
> >>>>>> +               update_fun();
> >>>>>> +
> >>>>>> +       end =3D rte_get_timer_cycles();
> >>>>>
> >>>>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> >>>>
> >>>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> >>>
> >>> I was thinking in another way, with 1e7 iteration, the additional
> >>> barrier on precise will be amortized, and we get more _deterministic_
> >>> behavior e.s.p in case if we print cycles and if we need to catch
> >>> regressions.
> >>
> >> If you time a section of code which spends ~40000000 cycles, it doesn'=
t
> >> matter if you add or remove a few cycles at the beginning and the end.
> >>
> >> The rte_rdtsc_precise() is both better (more precise in the sense of
> >> more serialization), and worse (because it's more costly, and thus mor=
e
> >> intrusive).
> >
> > We can calibrate the overhead to remove the cost.
> >
> What you are interested is primarily the impact of (instruction)
> throughput, not the latency of the sequence of instructions that must be
> retired in order to load the lcore variable values, when you switch from
> (say) lcore id-index static arrays to lcore variables in your module.
>
> Usually, there is not reason to make a distinction between latency and
> throughput in this context, but as you zoom into very short snippets of
> code being executed, the difference becomes relevant. For example,
> adding an div instruction won't necessarily add 12 cc to your program's
> execution time on a Zen 4, even though that is its latency. Rather, the
> effects may, depending on data dependencies and what other instructions
> are executed in parallel, be much smaller.
>
> So, one could argue the ILP you get with the loop is a feature, not a bug=
.
>
> With or without per-iteration latency measurements, these benchmark are
> not-very-useful at best, and misleading at worst. I will rework them to
> include more than a single module/lcore variable, which I think would be
> somewhat of an improvement.

OK. Module parameter will remove the compiler optimization and more accurat=
e.
I was doing manual loop unrolling[1] in a trace test case(for small
inline functions)
Either way it fine. Thanks for the rework.

[1]
https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30


>
> Even better would have some real domain logic, instead of just a dummy
> multiplication.
>
> >>
> >> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> >> doesn't matter.
> >
> > Yes. In this setup and it is pretty inaccurate PER iteration. Please
> > refer to the below patch to see the difference.
> >
> > Patch 1: Make nanoseconds to cycles per iteration
> > ------------------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_p=
erf.c
> > index ea1d7ba90b52..b8d25400f593 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > void (*update_fun)(void))
> >
> >          end =3D rte_get_timer_cycles();
> >
> > -       latency =3D ((end - start) / (double)rte_get_timer_hz()) / ITER=
ATIONS;
> > +       latency =3D ((end - start)) / ITERATIONS;
> >
> >          return latency;
> >   }
> > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >          printf("Thread-local storage  Static array  Lcore variables\n"=
);
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >          return TEST_SUCCESS;
> >   }
> >
> >
> > Patch 2: Change to precise with calibration
> > -----------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_p=
erf.c
> > index ea1d7ba90b52..8142ecd56241 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -96,23 +96,28 @@ lvar_update(void)
> >   static double
> >   benchmark_access_method(void (*init_fun)(void), void (*update_fun)(vo=
id))
> >   {
> > -       uint64_t i;
> > +       double tsc_latency;
> > +       double latency;
> >          uint64_t start;
> >          uint64_t end;
> > -       double latency;
> > +       uint64_t i;
> >
> > -       init_fun();
> > +       /* calculate rte_rdtsc_precise overhead */
> > +       start =3D rte_rdtsc_precise();
> > +       end =3D rte_rdtsc_precise();
> > +       tsc_latency =3D (end - start);
> >
> > -       start =3D rte_get_timer_cycles();
> > +       init_fun();
> >
> > -       for (i =3D 0; i < ITERATIONS; i++)
> > +       latency =3D 0;
> > +       for (i =3D 0; i < ITERATIONS; i++) {
> > +               start =3D rte_rdtsc_precise();
> >                  update_fun();
> > +               end =3D rte_rdtsc_precise();
> > +               latency +=3D (end - start) - tsc_latency;
> > +       }
> >
> > -       end =3D rte_get_timer_cycles();
> > -
> > -       latency =3D ((end - start) / (double)rte_get_timer_hz()) / ITER=
ATIONS;
> > -
> > -       return latency;
> > +       return latency / (double)ITERATIONS;
> >   }
> >
> >   static int
> > @@ -135,10 +140,9 @@ test_lcore_var_access(void)
> >          sarray_latency =3D benchmark_access_method(sarray_init, sarray=
_update);
> >          lvar_latency =3D benchmark_access_method(lvar_init, lvar_updat=
e);
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >          printf("Thread-local storage  Static array  Lcore variables\n"=
);
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >          return TEST_SUCCESS;
> >   }
> >
> > ARM N2 core with patch 1(aka current scheme)
> > -----------------------------------
> >
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage  Static array  Lcore variables
> >                   7.0           7.0              7.0
> >
> >
> > ARM N2 core with patch 2
> > -----------------------------------
> >
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage  Static array  Lcore variables
> >                  11.4          15.5             15.5
> >
> > x86 i9 core with patch 1(aka current scheme)
> > ------------------------------------------------------------
> >
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [ns/update]
> > Thread-local storage  Static array  Lcore variables
> >                   5.0           6.0              6.0
> >
> > x86 i9 core with patch 2
> > --------------------------------
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage  Static array  Lcore variables
> >                   5.3          10.6             11.7
> >
> >
> >
> >
> >
> >>
> >>> Furthermore, you may consider replacing rte_random() in fast path to
> >>> running number or so if it is not deterministic in cycle computation.
> >>
> >> rte_rand() is not used in the fast path. I don't understand what you
> >
> > I missed that. Ignore this comment.
> >
> >> mean by "running number".