From: Jerin Jacob
Date: Fri, 13 Sep 2024 16:53:38 +0530
Subject: Re: [PATCH v3 3/7] eal: add lcore variable performance test
To: Mattias Rönnblom
Cc: Mattias Rönnblom, dev@dpdk.org, Morten Brørup, Stephen Hemminger,
 Konstantin Ananyev, David Marchand, Jerin Jacob
References: <20240911170430.701685-2-mattias.ronnblom@ericsson.com>
 <20240912084429.703405-1-mattias.ronnblom@ericsson.com>
 <20240912084429.703405-4-mattias.ronnblom@ericsson.com>
 <88a778d3-e157-41cd-9da7-2d06864a654d@lysator.liu.se>
 <0a8dd454-976c-4f17-a870-09ba2d90c717@lysator.liu.se>
In-Reply-To: <0a8dd454-976c-4f17-a870-09ba2d90c717@lysator.liu.se>

On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom wrote:
>
> On 2024-09-12 17:11, Jerin Jacob wrote:
> > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom wrote:
> >>
> >> On 2024-09-12 15:09, Jerin Jacob wrote:
> >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> >>> wrote:
> >>>>
> >>>> Add basic micro benchmark for lcore variables, in an attempt to assure
> >>>> that the overhead isn't significantly greater than alternative
> >>>> approaches, in scenarios where the benefits aren't expected to show up
> >>>> (i.e., when plenty of cache is available compared to the working set
> >>>> size of the per-lcore data).
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom
> >>>> ---
> >>>>  app/test/meson.build           |   1 +
> >>>>  app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
> >>>>  2 files changed, 161 insertions(+)
> >>>>  create mode 100644 app/test/test_lcore_var_perf.c
> >>>
> >>>
> >>>> +static double
> >>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >>>> +{
> >>>> +	uint64_t i;
> >>>> +	uint64_t start;
> >>>> +	uint64_t end;
> >>>> +	double latency;
> >>>> +
> >>>> +	init_fun();
> >>>> +
> >>>> +	start = rte_get_timer_cycles();
> >>>> +
> >>>> +	for (i = 0; i < ITERATIONS; i++)
> >>>> +		update_fun();
> >>>> +
> >>>> +	end = rte_get_timer_cycles();
> >>>
> >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> >>
> >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> >
> > I was thinking in another way, with 1e7 iteration, the additional
> > barrier on precise will be amortized, and we get more _deterministic_
> > behavior esp. in case if we print cycles and if we need to catch
> > regressions.
>
> If you time a section of code which spends ~40000000 cycles, it doesn't
> matter if you add or remove a few cycles at the beginning and the end.
>
> The rte_rdtsc_precise() is both better (more precise in the sense of
> more serialization), and worse (because it's more costly, and thus more
> intrusive).

We can calibrate the overhead to remove the cost.

>
> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> doesn't matter.

Yes. In this setup, it is pretty inaccurate PER iteration.
Please refer to the below patch to see the difference.

Patch 1: Make nanoseconds to cycles per iteration
-------------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..b8d25400f593 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))

 	end = rte_get_timer_cycles();

-	latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
+	latency = ((end - start)) / ITERATIONS;

 	return latency;
 }
@@ -137,8 +137,7 @@ test_lcore_var_access(void)

-	printf("Latencies [ns/update]\n");
+	printf("Latencies [cycles/update]\n");
 	printf("Thread-local storage  Static array  Lcore variables\n");
-	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-			sarray_latency * 1e9, lvar_latency * 1e9);
+	printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, lvar_latency);

 	return TEST_SUCCESS;
 }


Patch 2: Change to precise with calibration
-------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..8142ecd56241 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -96,23 +96,28 @@ lvar_update(void)
 static double
 benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
 {
-	uint64_t i;
+	double tsc_latency;
+	double latency;
 	uint64_t start;
 	uint64_t end;
-	double latency;
+	uint64_t i;

-	init_fun();
+	/* calculate rte_rdtsc_precise overhead */
+	start = rte_rdtsc_precise();
+	end = rte_rdtsc_precise();
+	tsc_latency = (end - start);

-	start = rte_get_timer_cycles();
+	init_fun();

-	for (i = 0; i < ITERATIONS; i++)
+	latency = 0;
+	for (i = 0; i < ITERATIONS; i++) {
+		start = rte_rdtsc_precise();
 		update_fun();
+		end = rte_rdtsc_precise();
+		latency += (end - start) - tsc_latency;
+	}

-	end = rte_get_timer_cycles();
-
-	latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
-
-	return latency;
+	return latency / (double)ITERATIONS;
 }

 static int
@@ -135,10 +140,9 @@ test_lcore_var_access(void)
 	sarray_latency = benchmark_access_method(sarray_init, sarray_update);
 	lvar_latency = benchmark_access_method(lvar_init, lvar_update);

-	printf("Latencies [ns/update]\n");
+	printf("Latencies [cycles/update]\n");
 	printf("Thread-local storage  Static array  Lcore variables\n");
-	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-			sarray_latency * 1e9, lvar_latency * 1e9);
+	printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency, lvar_latency);

 	return TEST_SUCCESS;
 }


ARM N2 core with patch 1 (aka current scheme)
---------------------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 7.0           7.0              7.0


ARM N2 core with patch 2
------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                11.4          15.5             15.5


x86 i9 core with patch 1 (aka current scheme)
---------------------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [ns/update]
Thread-local storage  Static array  Lcore variables
                 5.0           6.0              6.0


x86 i9 core with patch 2
------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 5.3          10.6             11.7

>
> > Furthermore, you may consider replacing rte_random() in fast path to
> > running number or so if it is not deterministic in cycle computation.
>
> rte_rand() is not used in the fast path. I don't understand what you

I missed that. Ignore this comment.

> mean by "running number".
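
For readers without a DPDK build at hand, the two measurement schemes
discussed in this thread can be sketched in plain C. This is only an
illustration under stated assumptions, not the code from either patch:
now_ns(), measure_amortized(), measure_per_iteration() and
counter_update() are hypothetical names, and POSIX clock_gettime() stands
in for rte_get_timer_cycles() / rte_rdtsc_precise(), so the results are in
nanoseconds rather than cycles.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a cycle counter: monotonic time in ns. */
static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Scheme A: read the timer once before and once after the whole loop.
 * The cost of the two timer reads is spread over all iterations.
 */
static double
measure_amortized(void (*update_fun)(void), uint64_t iterations)
{
	uint64_t start, end, i;

	assert(iterations > 0);

	start = now_ns();
	for (i = 0; i < iterations; i++)
		update_fun();
	end = now_ns();

	return (double)(end - start) / (double)iterations;
}

/*
 * Scheme B: time every iteration separately and subtract a timer
 * overhead estimated from two back-to-back reads. The estimate is a
 * single sample, so an individual difference may come out negative.
 */
static double
measure_per_iteration(void (*update_fun)(void), uint64_t iterations)
{
	uint64_t start, end, i;
	double overhead, total = 0.0;

	assert(iterations > 0);

	/* Estimate the cost of one timer read pair. */
	start = now_ns();
	end = now_ns();
	overhead = (double)(end - start);

	for (i = 0; i < iterations; i++) {
		start = now_ns();
		update_fun();
		end = now_ns();
		total += (double)(end - start) - overhead;
	}

	return total / (double)iterations;
}

/* Hypothetical workload: bump a counter, loosely like the update functions. */
static volatile uint64_t counter;

static void
counter_update(void)
{
	counter++;
}
```

Calling measure_amortized(counter_update, 10000000) and
measure_per_iteration(counter_update, 10000000) and printing both values
gives the same kind of side-by-side comparison as the result tables above,
with the caveat that the timer used in this sketch is far coarser and more
expensive than a TSC read.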