From: Mattias Rönnblom
Date: Mon, 16 Sep 2024 12:50:00 +0200
Subject: Re: [PATCH v3 3/7] eal: add lcore variable performance test
To: Jerin Jacob
Cc: Mattias Rönnblom, dev@dpdk.org, Morten Brørup, Stephen Hemminger,
 Konstantin Ananyev, David Marchand, Jerin Jacob
Message-ID: <504ca550-f89f-44ce-b716-899b882590a3@lysator.liu.se>

On 2024-09-13 13:23, Jerin Jacob wrote:
> On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom wrote:
>>
>> On 2024-09-12 17:11, Jerin Jacob wrote:
>>> On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom wrote:
>>>>
>>>> On 2024-09-12 15:09, Jerin Jacob wrote:
>>>>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom wrote:
>>>>>>
>>>>>> Add basic micro benchmark for lcore variables, in an attempt to assure
>>>>>> that the overhead isn't significantly greater than alternative
>>>>>> approaches, in scenarios where the benefits aren't expected to show up
>>>>>> (i.e., when plenty of cache is available compared to the working set
>>>>>> size of the per-lcore data).
>>>>>>
>>>>>> Signed-off-by: Mattias Rönnblom
>>>>>> ---
>>>>>>  app/test/meson.build           |   1 +
>>>>>>  app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
>>>>>>  2 files changed, 161 insertions(+)
>>>>>>  create mode 100644 app/test/test_lcore_var_perf.c
>>>>>
>>>>>
>>>>>> +static double
>>>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>>>>>> +{
>>>>>> +	uint64_t i;
>>>>>> +	uint64_t start;
>>>>>> +	uint64_t end;
>>>>>> +	double latency;
>>>>>> +
>>>>>> +	init_fun();
>>>>>> +
>>>>>> +	start = rte_get_timer_cycles();
>>>>>> +
>>>>>> +	for (i = 0; i < ITERATIONS; i++)
>>>>>> +		update_fun();
>>>>>> +
>>>>>> +	end = rte_get_timer_cycles();
>>>>>
>>>>> Use precise variant. rte_rdtsc_precise() or so to be accurate
>>>>
>>>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
>>>
>>> I was thinking in another way, with 1e7 iteration, the additional
>>> barrier on precise will be amortized, and we get more _deterministic_
>>> behavior e.s.p in case if we print cycles and if we need to catch
>>> regressions.
>>
>> If you time a section of code which spends ~40000000 cycles, it doesn't
>> matter if you add or remove a few cycles at the beginning and the end.
>>
>> The rte_rdtsc_precise() is both better (more precise in the sense of
>> more serialization), and worse (because it's more costly, and thus more
>> intrusive).
>
> We can calibrate the overhead to remove the cost.
>

What you are interested in is primarily the impact on (instruction)
throughput, not the latency of the sequence of instructions that must be
retired in order to load the lcore variable values, when you switch from
(say) lcore-id-indexed static arrays to lcore variables in your module.

Usually, there is no reason to make a distinction between latency and
throughput in this context, but as you zoom in on very short snippets of
code, the difference becomes relevant. For example, adding a div
instruction won't necessarily add 12 clock cycles to your program's
execution time on a Zen 4, even though that is its latency. Rather, the
effect may, depending on data dependencies and on what other instructions
are executed in parallel, be much smaller. So, one could argue that the
ILP you get with the loop is a feature, not a bug.

With or without per-iteration latency measurements, these benchmarks are
not very useful at best, and misleading at worst.

I will rework them to include more than a single module/lcore variable,
which I think would be somewhat of an improvement. Even better would be to
have some real domain logic, instead of just a dummy multiplication.
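
To make that a little more concrete, below is a rough sketch of what a
multi-module variant of the static-array case could look like (the TLS
and lcore variable variants would be extended along the same lines).
This is illustration only, not the actual rework: NUM_MODULES,
struct module_state and sarray_multi_update() are made-up names, and
cache-line padding of the per-lcore entries is left out for brevity.

#include <stdint.h>

#include <rte_lcore.h>

#define NUM_MODULES 16

struct module_state {
	uint64_t sum;
};

static struct module_state sarray_state[NUM_MODULES][RTE_MAX_LCORE];

static void
sarray_multi_update(void)
{
	unsigned int lcore_id = rte_lcore_id();
	unsigned int mod;

	/* Touch one per-lcore entry in each "module", so the per-lcore
	 * working set grows with the number of modules, instead of
	 * being a single cache line.
	 */
	for (mod = 0; mod < NUM_MODULES; mod++)
		sarray_state[mod][lcore_id].sum += mod;
}

An update function like this could then be handed to
benchmark_access_method(), just like the existing single-variable update
functions.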

>>
>> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
>> doesn't matter.
>
> Yes. In this setup and it is pretty inaccurate PER iteration. Please
> refer to the below patch to see the difference.
>
> Patch 1: Make nanoseconds to cycles per iteration
> ------------------------------------------------------------------
>
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..b8d25400f593 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>
>  	end = rte_get_timer_cycles();
>
> -	latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> +	latency = ((end - start)) / ITERATIONS;
>
>  	return latency;
>  }
> @@ -137,8 +137,7 @@ test_lcore_var_access(void)
>
> -	printf("Latencies [ns/update]\n");
> +	printf("Latencies [cycles/update]\n");
>  	printf("Thread-local storage  Static array  Lcore variables\n");
> -	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -	       sarray_latency * 1e9, lvar_latency * 1e9);
> +	printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, lvar_latency);
>
>  	return TEST_SUCCESS;
>  }
>
>
> Patch 2: Change to precise with calibration
> -----------------------------------------------------------
>
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..8142ecd56241 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -96,23 +96,28 @@ lvar_update(void)
>  static double
>  benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>  {
> -	uint64_t i;
> +	double tsc_latency;
> +	double latency;
>  	uint64_t start;
>  	uint64_t end;
> -	double latency;
> +	uint64_t i;
>
> -	init_fun();
> +	/* calculate rte_rdtsc_precise overhead */
> +	start = rte_rdtsc_precise();
> +	end = rte_rdtsc_precise();
> +	tsc_latency = (end - start);
>
> -	start = rte_get_timer_cycles();
> +	init_fun();
>
> -	for (i = 0; i < ITERATIONS; i++)
> +	latency = 0;
> +	for (i = 0; i < ITERATIONS; i++) {
> +		start = rte_rdtsc_precise();
>  		update_fun();
> +		end = rte_rdtsc_precise();
> +		latency += (end - start) - tsc_latency;
> +	}
>
> -	end = rte_get_timer_cycles();
> -
> -	latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> -
> -	return latency;
> +	return latency / (double)ITERATIONS;
>  }
>
>  static int
> @@ -135,10 +140,9 @@ test_lcore_var_access(void)
>  	sarray_latency = benchmark_access_method(sarray_init, sarray_update);
>  	lvar_latency = benchmark_access_method(lvar_init, lvar_update);
>
> -	printf("Latencies [ns/update]\n");
> +	printf("Latencies [cycles/update]\n");
>  	printf("Thread-local storage  Static array  Lcore variables\n");
> -	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -	       sarray_latency * 1e9, lvar_latency * 1e9);
> +	printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, lvar_latency);
>
>  	return TEST_SUCCESS;
>  }
>
>
> ARM N2 core with patch 1 (aka current scheme)
> ---------------------------------------------
>
>  + ------------------------------------------------------- +
>  + Test Suite : lcore variable perf autotest
>  + ------------------------------------------------------- +
> Latencies [cycles/update]
> Thread-local storage  Static array  Lcore variables
>                  7.0           7.0              7.0
>
>
> ARM N2 core with patch 2
> ------------------------
>
>  + ------------------------------------------------------- +
>  + Test Suite : lcore variable perf autotest
>  + ------------------------------------------------------- +
> Latencies [cycles/update]
> Thread-local storage  Static array  Lcore variables
>                 11.4          15.5             15.5
>
>
> x86 i9 core with patch 1 (aka current scheme)
> ---------------------------------------------
>
>  + ------------------------------------------------------- +
>  + Test Suite : lcore variable perf autotest
>  + ------------------------------------------------------- +
> Latencies [ns/update]
> Thread-local storage  Static array  Lcore variables
>                  5.0           6.0              6.0
>
>
> x86 i9 core with patch 2
> ------------------------
>
>  + ------------------------------------------------------- +
>  + Test Suite : lcore variable perf autotest
>  + ------------------------------------------------------- +
> Latencies [cycles/update]
> Thread-local storage  Static array  Lcore variables
>                  5.3          10.6             11.7
>
>>
>>> Furthermore, you may consider replacing rte_random() in fast path to
>>> running number or so if it is not deterministic in cycle computation.
>>
>> rte_rand() is not used in the fast path. I don't understand what you
>
> I missed that. Ignore this comment.
>
>> mean by "running number".