DPDK patches and discussions
* [PATCH 1/2] test/service: add perf measurements for with stats mode
@ 2022-07-08 12:56 Harry van Haaren
  2022-07-08 12:56 ` [PATCH 2/2] service: fix potential stats race-condition on MT services Harry van Haaren
  0 siblings, 1 reply; 43+ messages in thread
From: Harry van Haaren @ 2022-07-08 12:56 UTC (permalink / raw)
  To: dev
  Cc: Harry van Haaren, Mattias Rönnblom, Honnappa Nagarahalli,
	Morten Brørup

This commit improves the performance reporting of the service
cores polling loop to show results both with and without
statistics collection enabled. Collecting cycle statistics is
costly, due to calls to rte_rdtsc() per service iteration.

Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Suggested-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---

This is split out as a separate patch from the fix to allow
measuring the before/after of the service stats atomic fixup.

---
 app/test/test_service_cores.c | 36 ++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ced6ed0081..7415b6b686 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -777,6 +777,22 @@ service_run_on_app_core_func(void *arg)
 	return rte_service_run_iter_on_app_lcore(*delay_service_id, 1);
 }
 
+static float
+service_app_lcore_perf_measure(uint32_t id)
+{
+	/* Performance test: call in a loop, and measure tsc() */
+	const uint32_t perf_iters = (1 << 12);
+	uint64_t start = rte_rdtsc();
+	uint32_t i;
+	for (i = 0; i < perf_iters; i++) {
+		int err = service_run_on_app_core_func(&id);
+		TEST_ASSERT_EQUAL(0, err, "perf test: returned run failure");
+	}
+	uint64_t end = rte_rdtsc();
+
+	return (end - start)/(float)perf_iters;
+}
+
 static int
 service_app_lcore_poll_impl(const int mt_safe)
 {
@@ -828,17 +844,15 @@ service_app_lcore_poll_impl(const int mt_safe)
 				"MT Unsafe: App core1 didn't return -EBUSY");
 	}
 
-	/* Performance test: call in a loop, and measure tsc() */
-	const uint32_t perf_iters = (1 << 12);
-	uint64_t start = rte_rdtsc();
-	uint32_t i;
-	for (i = 0; i < perf_iters; i++) {
-		int err = service_run_on_app_core_func(&id);
-		TEST_ASSERT_EQUAL(0, err, "perf test: returned run failure");
-	}
-	uint64_t end = rte_rdtsc();
-	printf("perf test for %s: %0.1f cycles per call\n", mt_safe ?
-		"MT Safe" : "MT Unsafe", (end - start)/(float)perf_iters);
+	/* Measure performance of no-stats and with-stats. */
+	float cyc_no_stats = service_app_lcore_perf_measure(id);
+
+	TEST_ASSERT_EQUAL(0, rte_service_set_stats_enable(id, 1),
+				"failed to enable stats for service.");
+	float cyc_with_stats = service_app_lcore_perf_measure(id);
+
+	printf("perf test for %s, no stats: %0.1f, with stats %0.1f cycles/call\n",
+		mt_safe ? "MT Safe" : "MT Unsafe", cyc_no_stats, cyc_with_stats);
 
 	unregister_all();
 	return TEST_SUCCESS;
-- 
2.32.0
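
For reference, a minimal standalone sketch of the measurement approach the
new helper uses (TSC snapshot before and after a fixed number of calls,
averaged over the iteration count); do_work() and the iteration count are
placeholders, not part of the patch:

#include <stdint.h>
#include <rte_cycles.h>

/* Placeholder for the service call being measured. */
static void do_work(void) { }

/* Average cycles per call over a fixed number of iterations, measured
 * with the timestamp counter, mirroring the helper added above.
 */
static float
measure_cycles_per_call(uint32_t iters)
{
	uint64_t start = rte_rdtsc();
	uint32_t i;

	for (i = 0; i < iters; i++)
		do_work();

	return (rte_rdtsc() - start) / (float)iters;
}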


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 12:56 [PATCH 1/2] test/service: add perf measurements for with stats mode Harry van Haaren
@ 2022-07-08 12:56 ` Harry van Haaren
  2022-07-08 13:23   ` Morten Brørup
                     ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Harry van Haaren @ 2022-07-08 12:56 UTC (permalink / raw)
  To: dev
  Cc: Harry van Haaren, Mattias Rönnblom, Honnappa Nagarahalli,
	Morten Brørup

This commit fixes a potential racy add that could occur if
multiple service-lcores were executing the same MT-safe service
at the same time, with service statistics collection enabled.

Because multiple threads can run and execute the service, the
stats values can have multiple writer threads, so atomic
additions are required for correctness.

Note that when a MT unsafe service is executed, a spinlock is
held, so the stats increments are protected. This fact is used
to avoid executing atomic add instructions when not required.

This patch causes a 1.25x increase in cycle-cost for polling a
MT safe service when statistics are enabled. No change was seen
for MT unsafe services, or when statistics are disabled.

Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Suggested-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>

---
---
 lib/eal/common/rte_service.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index ef31b1f63c..f045e74ef3 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -363,9 +363,15 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 		uint64_t start = rte_rdtsc();
 		s->spec.callback(userdata);
 		uint64_t end = rte_rdtsc();
-		s->cycles_spent += end - start;
+		uint64_t cycles = end - start;
 		cs->calls_per_service[service_idx]++;
-		s->calls++;
+		if (service_mt_safe(s)) {
+			__atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED);
+			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
+		} else {
+			s->cycles_spent += cycles;
+			s->calls++;
+		}
 	} else
 		s->spec.callback(userdata);
 }
-- 
2.32.0
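
For illustration, a minimal sketch of the lost-update race the patch
addresses, and the atomic add that prevents it (placeholder names, not
the DPDK code):

#include <stdint.h>

static uint64_t calls;

/* What a plain "calls++" effectively does: a separate load, add and
 * store. Two lcores can both load the same old value and both store
 * old+1, so one increment is lost.
 */
static void
racy_increment(void)
{
	uint64_t tmp = calls;
	calls = tmp + 1;
}

/* The MT-safe path in the patch: an indivisible read-modify-write. */
static void
safe_increment(void)
{
	__atomic_fetch_add(&calls, 1, __ATOMIC_RELAXED);
}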


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 12:56 ` [PATCH 2/2] service: fix potential stats race-condition on MT services Harry van Haaren
@ 2022-07-08 13:23   ` Morten Brørup
  2022-07-08 13:44     ` Van Haaren, Harry
  2022-07-08 13:48   ` Mattias Rönnblom
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Morten Brørup @ 2022-07-08 13:23 UTC (permalink / raw)
  To: Harry van Haaren, dev; +Cc: Mattias Rönnblom, Honnappa Nagarahalli

> From: Harry van Haaren [mailto:harry.van.haaren@intel.com]
> Sent: Friday, 8 July 2022 14.57
> 
> This commit fixes a potential racey-add that could occur if
> multiple service-lcores were executing the same MT-safe service
> at the same time, with service statistics collection enabled.
> 
> Because multiple threads can run and execute the service, the
> stats values can have multiple writer threads, resulting in the
> requirement of using atomic addition for correctness.
> 
> Note that when a MT unsafe service is executed, a spinlock is
> held, so the stats increments are protected. This fact is used
> to avoid executing atomic add instructions when not required.
> 
> This patch causes a 1.25x increase in cycle-cost for polling a
> MT safe service when statistics are enabled. No change was seen
> for MT unsafe services, or when statistics are disabled.
> 
> Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> ---

[...]

> +		if (service_mt_safe(s)) {
> +			__atomic_fetch_add(&s->cycles_spent, cycles,
> __ATOMIC_RELAXED);
> +			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> +		} else {
> +			s->cycles_spent += cycles;
> +			s->calls++;
> +		}

Have you considered the performance cost of the __atomic_fetch_add(__ATOMIC_RELAXED) versus the performance cost of the branch to compare if the service is MT safe? It might be cheaper to just always use the atomic addition. I don't know, just mentioning that the compare-and-branch also has a cost.

I'm not familiar with the DPDK services library, so perhaps MT safe and MT unsafe services are never mixed, in which case the branch will always take the same path, so branch prediction will eliminate the cost of branching.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 13:23   ` Morten Brørup
@ 2022-07-08 13:44     ` Van Haaren, Harry
  2022-07-08 14:14       ` Morten Brørup
  0 siblings, 1 reply; 43+ messages in thread
From: Van Haaren, Harry @ 2022-07-08 13:44 UTC (permalink / raw)
  To: Morten Brørup, dev; +Cc: mattias.ronnblom, Honnappa Nagarahalli

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Friday, July 8, 2022 2:23 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Cc: mattias.ronnblom <mattias.ronnblom@ericsson.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT services

<snip commit message, focus on performance data>

> > This patch causes a 1.25x increase in cycle-cost for polling a
> > MT safe service when statistics are enabled. No change was seen
> > for MT unsafe services, or when statistics are disabled.
> >
> > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> >
> > ---
> 
> [...]
> 
> > +		if (service_mt_safe(s)) {
> > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > __ATOMIC_RELAXED);
> > +			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> > +		} else {
> > +			s->cycles_spent += cycles;
> > +			s->calls++;
> > +		}
> 
> Have you considered the performance cost of the
> __atomic_fetch_add(__ATOMIC_RELAXED) versus the performance cost of the
> branch to compare if the service is MT safe? It might be cheaper to just always use
> the atomic addition. I don't know, just mentioning that the compare-and-branch also
> has a cost.

Great question!

> I'm not familiar with the DPDK services library, so perhaps MT safe and MT unsafe
> services are never mixed, in which case the branch will always take the same path,
> so branch prediction will eliminate the cost of branching.

MT safe & unsafe can be mixed, yes, so you're right, there may be mis-predicts. Note that,
assuming a service is actually doing something useful, there are likely quite a few branches
between each call... so it is unknown how fresh/accurate the branch history will be.

The common case is for many services to be "mt unsafe" (think polling an ethdev queue).
In this case, it is nice to try to reduce cost. Given this is likely the highest quantity of services,
we'd like the performance here to be reduced the least. The branch method achieves that.

I did benchmark the "always use atomic" case, and it caused a ~30 cycle hit in the "mt unsafe" case,
where the atomic is not required (and hence the performance hit can be avoided by branching).
Given branch misses cost between 15-20 cycles (uarch dependent), attempting to avoid the
atomic still makes sense from a cycle-cost perspective too, I think.

I did spend the morning benchmarking solutions (and hence the patch split,
to allow easy benchmarking before & after), so thanks for asking!

Regards, -Harry
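
For concreteness, a minimal sketch of the two variants being compared
here, with placeholder names ("mt_safe" stands in for service_mt_safe(s),
the struct for the service's stats fields):

#include <stdint.h>
#include <stdbool.h>

struct stats {
	uint64_t calls;
	uint64_t cycles;
};

/* Branch-guarded update, as in the patch: the MT unsafe path avoids
 * locked read-modify-write instructions entirely.
 */
static void
update_branching(struct stats *st, bool mt_safe, uint64_t cycles)
{
	if (mt_safe) {
		__atomic_fetch_add(&st->cycles, cycles, __ATOMIC_RELAXED);
		__atomic_fetch_add(&st->calls, 1, __ATOMIC_RELAXED);
	} else {
		st->cycles += cycles;
		st->calls++;
	}
}

/* The "always use atomic" alternative benchmarked above. */
static void
update_always_atomic(struct stats *st, uint64_t cycles)
{
	__atomic_fetch_add(&st->cycles, cycles, __ATOMIC_RELAXED);
	__atomic_fetch_add(&st->calls, 1, __ATOMIC_RELAXED);
}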

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 12:56 ` [PATCH 2/2] service: fix potential stats race-condition on MT services Harry van Haaren
  2022-07-08 13:23   ` Morten Brørup
@ 2022-07-08 13:48   ` Mattias Rönnblom
  2022-07-08 15:16   ` Honnappa Nagarahalli
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
  3 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-07-08 13:48 UTC (permalink / raw)
  To: Harry van Haaren, dev; +Cc: Honnappa Nagarahalli, Morten Brørup

On 2022-07-08 14:56, Harry van Haaren wrote:
> This commit fixes a potential racey-add that could occur if
> multiple service-lcores were executing the same MT-safe service
> at the same time, with service statistics collection enabled.
> 
> Because multiple threads can run and execute the service, the
> stats values can have multiple writer threads, resulting in the
> requirement of using atomic addition for correctness.
> 
> Note that when a MT unsafe service is executed, a spinlock is
> held, so the stats increments are protected. This fact is used
> to avoid executing atomic add instructions when not required.
> 
> This patch causes a 1.25x increase in cycle-cost for polling a
> MT safe service when statistics are enabled. No change was seen
> for MT unsafe services, or when statistics are disabled.
> 
> Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> ---
> ---
>   lib/eal/common/rte_service.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> index ef31b1f63c..f045e74ef3 100644
> --- a/lib/eal/common/rte_service.c
> +++ b/lib/eal/common/rte_service.c
> @@ -363,9 +363,15 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
>   		uint64_t start = rte_rdtsc();
>   		s->spec.callback(userdata);
>   		uint64_t end = rte_rdtsc();
> -		s->cycles_spent += end - start;
> +		uint64_t cycles = end - start;
>   		cs->calls_per_service[service_idx]++;
> -		s->calls++;
> +		if (service_mt_safe(s)) {
> +			__atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED);
> +			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> +		} else {
> +			s->cycles_spent += cycles;
> +			s->calls++;
> +		}
>   	} else
>   		s->spec.callback(userdata);
>   }

Reviewed-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Thanks for the fix Harry.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 13:44     ` Van Haaren, Harry
@ 2022-07-08 14:14       ` Morten Brørup
  0 siblings, 0 replies; 43+ messages in thread
From: Morten Brørup @ 2022-07-08 14:14 UTC (permalink / raw)
  To: Van Haaren, Harry, dev; +Cc: mattias.ronnblom, Honnappa Nagarahalli

> From: Van Haaren, Harry [mailto:harry.van.haaren@intel.com]
> Sent: Friday, 8 July 2022 15.45
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Friday, July 8, 2022 2:23 PM
> 
> <snip commit message, focus on performance data>
> 
> > > This patch causes a 1.25x increase in cycle-cost for polling a
> > > MT safe service when statistics are enabled. No change was seen
> > > for MT unsafe services, or when statistics are disabled.
> > >
> > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > >
> > > ---
> >
> > [...]
> >
> > > +		if (service_mt_safe(s)) {
> > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > __ATOMIC_RELAXED);
> > > +			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> > > +		} else {
> > > +			s->cycles_spent += cycles;
> > > +			s->calls++;
> > > +		}
> >
> > Have you considered the performance cost of the
> > __atomic_fetch_add(__ATOMIC_RELAXED) versus the performance cost of
> the
> > branch to compare if the service is MT safe? It might be cheaper to
> just always use
> > the atomic addition. I don't know, just mentioning that the compare-
> and-branch also
> > has a cost.
> 
> Great question!
> 
> > I'm not familiar with the DPDK services library, so perhaps MT safe
> and MT unsafe
> > services are never mixed, in which case the branch will always take
> the same path,
> > so branch prediction will eliminate the cost of branching.
> 
> MT safe & unsafe can be mixed yes, so you're right, there may be mis-
> predicts. Note that
> assuming a service is actually doing something useful, there's likely
> quite a few branches
> between each call.. so unknown how fresh/accurate the branch history
> will be.
> 
> The common case is for many services to be "mt unsafe" (think polling
> an ethdev queue).
> In this case, it is nice to try reduce cost. Given this is likely the
> highest quantity of services,
> we'd like the performance here to be reduced the least. The branch
> method achieves that.
> 
> I did benchmark the "always use atomic" case, and it caused a ~30cycle
> hit in the "mt unsafe" case,
> where the atomic is not required (and hence the performance hit can be
> avoided by branching).
> Given branch-misses are handled between 15-20 cycles (uarch dependent),
> attempting to avoid the
> atomic still makes sense from cycle-cost perspective too I think..
> 
> I did spend the morning benchmarking solutions (and hence the patch
> split,
> to allow easy benchmarking before & after), so thanks for asking!
> 
> Regards, -Harry

Thank you for elaborating, Harry. I am impressed with the considerations you have put into this, and have no further concerns.

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 12:56 ` [PATCH 2/2] service: fix potential stats race-condition on MT services Harry van Haaren
  2022-07-08 13:23   ` Morten Brørup
  2022-07-08 13:48   ` Mattias Rönnblom
@ 2022-07-08 15:16   ` Honnappa Nagarahalli
  2022-07-08 15:31     ` Van Haaren, Harry
  2022-07-08 16:29     ` Morten Brørup
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
  3 siblings, 2 replies; 43+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-08 15:16 UTC (permalink / raw)
  To: Harry van Haaren, dev; +Cc: Mattias Rönnblom, Morten Brørup, nd, nd

<snip>
> 
> This commit fixes a potential racey-add that could occur if multiple service-
> lcores were executing the same MT-safe service at the same time, with
> service statistics collection enabled.
> 
> Because multiple threads can run and execute the service, the stats values
> can have multiple writer threads, resulting in the requirement of using
> atomic addition for correctness.
> 
> Note that when a MT unsafe service is executed, a spinlock is held, so the
> stats increments are protected. This fact is used to avoid executing atomic
> add instructions when not required.
> 
> This patch causes a 1.25x increase in cycle-cost for polling a MT safe service
> when statistics are enabled. No change was seen for MT unsafe services, or
> when statistics are disabled.
> 
> Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> ---
> ---
>  lib/eal/common/rte_service.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> index ef31b1f63c..f045e74ef3 100644
> --- a/lib/eal/common/rte_service.c
> +++ b/lib/eal/common/rte_service.c
> @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> rte_service_spec_impl *s,
>  		uint64_t start = rte_rdtsc();
>  		s->spec.callback(userdata);
>  		uint64_t end = rte_rdtsc();
> -		s->cycles_spent += end - start;
> +		uint64_t cycles = end - start;
>  		cs->calls_per_service[service_idx]++;
> -		s->calls++;
> +		if (service_mt_safe(s)) {
> +			__atomic_fetch_add(&s->cycles_spent, cycles,
> __ATOMIC_RELAXED);
> +			__atomic_fetch_add(&s->calls, 1,
> __ATOMIC_RELAXED);
> +		} else {
> +			s->cycles_spent += cycles;
> +			s->calls++;
This is still a problem from a reader perspective. It is possible that the writes could be split while a reader is reading the stats. These need to be atomic adds.

> +		}
>  	} else
>  		s->spec.callback(userdata);
>  }
> --
> 2.32.0


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 15:16   ` Honnappa Nagarahalli
@ 2022-07-08 15:31     ` Van Haaren, Harry
  2022-07-08 16:21       ` Bruce Richardson
  2022-07-08 16:29     ` Morten Brørup
  1 sibling, 1 reply; 43+ messages in thread
From: Van Haaren, Harry @ 2022-07-08 15:31 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev; +Cc: mattias.ronnblom, Morten Brørup, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Friday, July 8, 2022 4:16 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Cc: mattias.ronnblom <mattias.ronnblom@ericsson.com>; Morten Brørup
> <mb@smartsharesystems.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT services

<snip previous discussions>

> > diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> > index ef31b1f63c..f045e74ef3 100644
> > --- a/lib/eal/common/rte_service.c
> > +++ b/lib/eal/common/rte_service.c
> > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > rte_service_spec_impl *s,
> >  		uint64_t start = rte_rdtsc();
> >  		s->spec.callback(userdata);
> >  		uint64_t end = rte_rdtsc();
> > -		s->cycles_spent += end - start;
> > +		uint64_t cycles = end - start;
> >  		cs->calls_per_service[service_idx]++;
> > -		s->calls++;
> > +		if (service_mt_safe(s)) {
> > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > __ATOMIC_RELAXED);
> > +			__atomic_fetch_add(&s->calls, 1,
> > __ATOMIC_RELAXED);
> > +		} else {
> > +			s->cycles_spent += cycles;
> > +			s->calls++;
> This is still a problem from a reader perspective. It is possible that the writes could be
> split while a reader is reading the stats. These need to be atomic adds.

Thanks for pointing that out; I tend to "think" in x86 terms of load/store tearing, and on x86
naturally aligned load/stores will not tear. Apologies for missing the ARM angle here.

I'm not sure how best to encode the difference between tearing & "locked instructions"
to make things multi-writer safe. But they're not the same thing, and I'd prefer not to pay
the penalty for LOCK instructions (multi-writer) only to satisfy the non-tearing requirements.

Is there an rte_atomic-* type that is guaranteed as non-tearing?

In that case, changing the type of the calls/cycles_spent variables to such a type to ensure "non-tearing"
single-reader, single-writer behaviour is enough, instead of forcing __atomic_fetch_add() everywhere?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 15:31     ` Van Haaren, Harry
@ 2022-07-08 16:21       ` Bruce Richardson
  2022-07-08 16:33         ` Honnappa Nagarahalli
  2022-07-08 20:02         ` Mattias Rönnblom
  0 siblings, 2 replies; 43+ messages in thread
From: Bruce Richardson @ 2022-07-08 16:21 UTC (permalink / raw)
  To: Van Haaren, Harry
  Cc: Honnappa Nagarahalli, dev, mattias.ronnblom, Morten Brørup, nd

On Fri, Jul 08, 2022 at 03:31:01PM +0000, Van Haaren, Harry wrote:
> > -----Original Message-----
> > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Sent: Friday, July 8, 2022 4:16 PM
> > To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> > Cc: mattias.ronnblom <mattias.ronnblom@ericsson.com>; Morten Brørup
> > <mb@smartsharesystems.com>; nd <nd@arm.com>; nd <nd@arm.com>
> > Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> 
> <snip previous discussions>
> 
> > > diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> > > index ef31b1f63c..f045e74ef3 100644
> > > --- a/lib/eal/common/rte_service.c
> > > +++ b/lib/eal/common/rte_service.c
> > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > rte_service_spec_impl *s,
> > >  		uint64_t start = rte_rdtsc();
> > >  		s->spec.callback(userdata);
> > >  		uint64_t end = rte_rdtsc();
> > > -		s->cycles_spent += end - start;
> > > +		uint64_t cycles = end - start;
> > >  		cs->calls_per_service[service_idx]++;
> > > -		s->calls++;
> > > +		if (service_mt_safe(s)) {
> > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > __ATOMIC_RELAXED);
> > > +			__atomic_fetch_add(&s->calls, 1,
> > > __ATOMIC_RELAXED);
> > > +		} else {
> > > +			s->cycles_spent += cycles;
> > > +			s->calls++;
> > This is still a problem from a reader perspective. It is possible that the writes could be
> > split while a reader is reading the stats. These need to be atomic adds.
> 
> Thanks for pointing out; I do "think" in x86 in terms of load/store tearing; and on x86
> naturally aligned load/stores will not tear. Apologies for missing the ARM angle here.
> 
> I'm not sure how to best encode the difference between tearing & "locked instructions"
> to make things multi-writer safe. But they're not the same thing, and I'd prefer not pay
> the penalty for LOCK instructions (multi-writer) only to satisfy the non-tearing requirements.
> 
> Is there an rte_atomic-* type that is guaranteed as non-tearing?
> 
> In that case, changing the type of the calls/cycles_spent variables to such a type to ensure "non-tearing"
> single-reader, single-writer behaviour is enough, instead of forcing __atomic_fetch_add() everywhere?

Regular read, increment and then atomic store should work without locks
where alignment is correct on most architectures, and doing the store
atomically should prevent any tearing.
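
A minimal sketch of that pattern, assuming a single writer per counter
(placeholder names, not an existing DPDK API):

#include <stdint.h>

/* Single-writer increment without a locked read-modify-write: a plain
 * load (only one writer exists), then an atomic store so the 64-bit
 * value can never be observed half-written. Readers pair this with
 * __atomic_load_n(counter, __ATOMIC_RELAXED).
 */
static inline void
counter_inc(uint64_t *counter)
{
	uint64_t val = *counter;
	__atomic_store_n(counter, val + 1, __ATOMIC_RELAXED);
}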

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 15:16   ` Honnappa Nagarahalli
  2022-07-08 15:31     ` Van Haaren, Harry
@ 2022-07-08 16:29     ` Morten Brørup
  2022-07-08 16:45       ` Honnappa Nagarahalli
  1 sibling, 1 reply; 43+ messages in thread
From: Morten Brørup @ 2022-07-08 16:29 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Harry van Haaren, dev; +Cc: Mattias Rönnblom, nd, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 8 July 2022 17.16
> 
> <snip>
> >
> > This commit fixes a potential racey-add that could occur if multiple
> service-
> > lcores were executing the same MT-safe service at the same time, with
> > service statistics collection enabled.
> >
> > Because multiple threads can run and execute the service, the stats
> values
> > can have multiple writer threads, resulting in the requirement of
> using
> > atomic addition for correctness.
> >
> > Note that when a MT unsafe service is executed, a spinlock is held,
> so the
> > stats increments are protected. This fact is used to avoid executing
> atomic
> > add instructions when not required.
> >
> > This patch causes a 1.25x increase in cycle-cost for polling a MT
> safe service
> > when statistics are enabled. No change was seen for MT unsafe
> services, or
> > when statistics are disabled.
> >
> > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> >
> > ---
> > ---
> >  lib/eal/common/rte_service.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/eal/common/rte_service.c
> b/lib/eal/common/rte_service.c
> > index ef31b1f63c..f045e74ef3 100644
> > --- a/lib/eal/common/rte_service.c
> > +++ b/lib/eal/common/rte_service.c
> > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > rte_service_spec_impl *s,
> >  		uint64_t start = rte_rdtsc();
> >  		s->spec.callback(userdata);
> >  		uint64_t end = rte_rdtsc();
> > -		s->cycles_spent += end - start;
> > +		uint64_t cycles = end - start;
> >  		cs->calls_per_service[service_idx]++;
> > -		s->calls++;
> > +		if (service_mt_safe(s)) {
> > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > __ATOMIC_RELAXED);
> > +			__atomic_fetch_add(&s->calls, 1,
> > __ATOMIC_RELAXED);
> > +		} else {
> > +			s->cycles_spent += cycles;
> > +			s->calls++;
> This is still a problem from a reader perspective. It is possible that
> the writes could be split while a reader is reading the stats. These
> need to be atomic adds.

I don't understand what you suggest can go wrong here, Honnappa. If you are talking about 64 bit counters on 32 bit architectures, then I understand the problem (and have many years of direct experience with it myself). Otherwise, I hope you can elaborate or direct me to educational material about the issue, considering this a learning opportunity. :-)

> 
> > +		}
> >  	} else
> >  		s->spec.callback(userdata);
> >  }
> > --
> > 2.32.0


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 16:21       ` Bruce Richardson
@ 2022-07-08 16:33         ` Honnappa Nagarahalli
  2022-07-08 20:02         ` Mattias Rönnblom
  1 sibling, 0 replies; 43+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-08 16:33 UTC (permalink / raw)
  To: Bruce Richardson, Van Haaren, Harry
  Cc: dev, mattias.ronnblom, Morten Brørup, nd, nd

> > <snip previous discussions>
> >
> > > > diff --git a/lib/eal/common/rte_service.c
> > > > b/lib/eal/common/rte_service.c index ef31b1f63c..f045e74ef3 100644
> > > > --- a/lib/eal/common/rte_service.c
> > > > +++ b/lib/eal/common/rte_service.c
> > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > > rte_service_spec_impl *s,
> > > >  		uint64_t start = rte_rdtsc();
> > > >  		s->spec.callback(userdata);
> > > >  		uint64_t end = rte_rdtsc();
> > > > -		s->cycles_spent += end - start;
> > > > +		uint64_t cycles = end - start;
> > > >  		cs->calls_per_service[service_idx]++;
> > > > -		s->calls++;
> > > > +		if (service_mt_safe(s)) {
> > > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > > __ATOMIC_RELAXED);
> > > > +			__atomic_fetch_add(&s->calls, 1,
> > > > __ATOMIC_RELAXED);
> > > > +		} else {
> > > > +			s->cycles_spent += cycles;
> > > > +			s->calls++;
> > > This is still a problem from a reader perspective. It is possible
> > > that the writes could be split while a reader is reading the stats. These
> need to be atomic adds.
> >
> > Thanks for pointing out; I do "think" in x86 in terms of load/store
> > tearing; and on x86 naturally aligned load/stores will not tear. Apologies for
> missing the ARM angle here.
The Arm architecture has similar properties as well. I believe the compiler does not provide any guarantees that it will only generate non-tearing instructions. Refer to a past discussion [1] [2] where it was thought that the compiler would generate a non-tearing store (this is a slightly different scenario).

[1] http://inbox.dpdk.org/dev/d5d563ab-0411-3faf-39ec-4994f2bc9f6f@intel.com/
[2] Refer to commit '316095eb41ed22da808b5826404c398917c83c89'

> >
> > I'm not sure how to best encode the difference between tearing & "locked
> instructions"
> > to make things multi-writer safe. But they're not the same thing, and
> > I'd prefer not pay the penalty for LOCK instructions (multi-writer) only to
> satisfy the non-tearing requirements.
> >
> > Is there an rte_atomic-* type that is guaranteed as non-tearing?
Nothing that I know of.

> >
> > In that case, changing the type of the calls/cycles_spent variables to such a
> type to ensure "non-tearing"
> > single-reader, single-writer behaviour is enough, instead of forcing
> __atomic_fetch_add() everywhere?
> 
> Regular read, increment and then atomic store should work without locks
> where alignment is correct on most architectures, and doing the store
> atomically should prevent any tearing.
Agree, we could do this; it will provide more flexibility for the micro-architecture to work with. It would be good to understand the perf benefits vs the complexity and the branch cost.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 16:29     ` Morten Brørup
@ 2022-07-08 16:45       ` Honnappa Nagarahalli
  2022-07-08 17:22         ` Morten Brørup
  0 siblings, 1 reply; 43+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-08 16:45 UTC (permalink / raw)
  To: Morten Brørup, Harry van Haaren, dev; +Cc: Mattias Rönnblom, nd, nd

<snip>
> > >
> > > This commit fixes a potential racey-add that could occur if multiple
> > service-
> > > lcores were executing the same MT-safe service at the same time,
> > > with service statistics collection enabled.
> > >
> > > Because multiple threads can run and execute the service, the stats
> > values
> > > can have multiple writer threads, resulting in the requirement of
> > using
> > > atomic addition for correctness.
> > >
> > > Note that when a MT unsafe service is executed, a spinlock is held,
> > so the
> > > stats increments are protected. This fact is used to avoid executing
> > atomic
> > > add instructions when not required.
> > >
> > > This patch causes a 1.25x increase in cycle-cost for polling a MT
> > safe service
> > > when statistics are enabled. No change was seen for MT unsafe
> > services, or
> > > when statistics are disabled.
> > >
> > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > >
> > > ---
> > > ---
> > >  lib/eal/common/rte_service.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/eal/common/rte_service.c
> > b/lib/eal/common/rte_service.c
> > > index ef31b1f63c..f045e74ef3 100644
> > > --- a/lib/eal/common/rte_service.c
> > > +++ b/lib/eal/common/rte_service.c
> > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > rte_service_spec_impl *s,
> > >  		uint64_t start = rte_rdtsc();
> > >  		s->spec.callback(userdata);
> > >  		uint64_t end = rte_rdtsc();
> > > -		s->cycles_spent += end - start;
> > > +		uint64_t cycles = end - start;
> > >  		cs->calls_per_service[service_idx]++;
> > > -		s->calls++;
> > > +		if (service_mt_safe(s)) {
> > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > __ATOMIC_RELAXED);
> > > +			__atomic_fetch_add(&s->calls, 1,
> > > __ATOMIC_RELAXED);
> > > +		} else {
> > > +			s->cycles_spent += cycles;
> > > +			s->calls++;
> > This is still a problem from a reader perspective. It is possible that
> > the writes could be split while a reader is reading the stats. These
> > need to be atomic adds.
> 
> I don't understand what you suggest can go wrong here, Honnappa. If you
> talking about 64 bit counters on 32 bit architectures, then I understand the
> problem (and have many years of direct experience with it myself).
> Otherwise, I hope you can elaborate or direct me to educational material
> about the issue, considering this a learning opportunity. :-)
I am thinking of the case where the 64b write is split into two 32b (or more) write operations either by the compiler or the micro-architecture. If this were to happen, it would cause race conditions with the reader.

As far as I understand, the compiler does not provide any guarantees on generating non-tearing stores unless an atomic builtin/function is used. If we have to ensure the micro-architecture does not generate split writes, we need to be careful that future code additions do not change the alignment of the stats.

> 
> >
> > > +		}
> > >  	} else
> > >  		s->spec.callback(userdata);
> > >  }
> > > --
> > > 2.32.0
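
A minimal sketch of the reader side being discussed, with a placeholder
name: the statistic is read with an atomic load so a torn store could
never be observed half-way.

#include <stdint.h>

/* Relaxed atomic load of a 64-bit statistic. Combined with atomic (or
 * otherwise non-tearing) stores on the writer side, the reader cannot
 * see a half-updated value.
 */
static inline uint64_t
counter_read(const uint64_t *counter)
{
	return __atomic_load_n(counter, __ATOMIC_RELAXED);
}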


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 16:45       ` Honnappa Nagarahalli
@ 2022-07-08 17:22         ` Morten Brørup
  2022-07-08 17:39           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 43+ messages in thread
From: Morten Brørup @ 2022-07-08 17:22 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Harry van Haaren, dev; +Cc: Mattias Rönnblom, nd, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 8 July 2022 18.46
> 
> <snip>
> > > >
> > > > This commit fixes a potential racey-add that could occur if
> multiple
> > > service-
> > > > lcores were executing the same MT-safe service at the same time,
> > > > with service statistics collection enabled.
> > > >
> > > > Because multiple threads can run and execute the service, the
> stats
> > > values
> > > > can have multiple writer threads, resulting in the requirement of
> > > using
> > > > atomic addition for correctness.
> > > >
> > > > Note that when a MT unsafe service is executed, a spinlock is
> held,
> > > so the
> > > > stats increments are protected. This fact is used to avoid
> executing
> > > atomic
> > > > add instructions when not required.
> > > >
> > > > This patch causes a 1.25x increase in cycle-cost for polling a MT
> > > safe service
> > > > when statistics are enabled. No change was seen for MT unsafe
> > > services, or
> > > > when statistics are disabled.
> > > >
> > > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > > Suggested-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > > >
> > > > ---
> > > > ---
> > > >  lib/eal/common/rte_service.c | 10 ++++++++--
> > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/eal/common/rte_service.c
> > > b/lib/eal/common/rte_service.c
> > > > index ef31b1f63c..f045e74ef3 100644
> > > > --- a/lib/eal/common/rte_service.c
> > > > +++ b/lib/eal/common/rte_service.c
> > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > > rte_service_spec_impl *s,
> > > >  		uint64_t start = rte_rdtsc();
> > > >  		s->spec.callback(userdata);
> > > >  		uint64_t end = rte_rdtsc();
> > > > -		s->cycles_spent += end - start;
> > > > +		uint64_t cycles = end - start;
> > > >  		cs->calls_per_service[service_idx]++;
> > > > -		s->calls++;
> > > > +		if (service_mt_safe(s)) {
> > > > +			__atomic_fetch_add(&s->cycles_spent, cycles,
> > > > __ATOMIC_RELAXED);
> > > > +			__atomic_fetch_add(&s->calls, 1,
> > > > __ATOMIC_RELAXED);
> > > > +		} else {
> > > > +			s->cycles_spent += cycles;
> > > > +			s->calls++;
> > > This is still a problem from a reader perspective. It is possible
> that
> > > the writes could be split while a reader is reading the stats.
> These
> > > need to be atomic adds.
> >
> > I don't understand what you suggest can go wrong here, Honnappa. If
> you
> > talking about 64 bit counters on 32 bit architectures, then I
> understand the
> > problem (and have many years of direct experience with it myself).
> > Otherwise, I hope you can elaborate or direct me to educational
> material
> > about the issue, considering this a learning opportunity. :-)
> I am thinking of the case where the 64b write is split into two 32b (or
> more) write operations either by the compiler or the micro-
> architecture. If this were to happen, it causes race conditions with
> the reader.
> 
> As far as I understand, the compiler does not provide any guarantees on
> generating non-tearing stores unless an atomic builtin/function is
> used.

This seems like a generic problem for all 64b statistics counters in DPDK, and any other C code using 64 bit counters. Being a generic C problem, there is probably a generic solution to it.

Is any compiler going to do something that stupid (i.e. tearing a store into multiple write operations) to a simple 64b counter on any 64 bit architecture (assuming the counter is 64b aligned)? Otherwise, we might only need to take special precautions for 32 bit architectures.

> If we have to ensure the micro-architecture does not generate
> split writes, we need to be careful that future code additions do not
> change the alignment of the stats.

Unless the structure containing the stats counters is packed, the contained 64b counters will be 64b aligned (on 64 bit architecture). So we should not worry about alignment, except perhaps on 32 bit architectures.
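
A minimal sketch of how that alignment assumption could be made explicit
at compile time (illustrative placeholders, not the DPDK definitions):

#include <stdint.h>
#include <stddef.h>

/* Keep the 64-bit counters naturally aligned so aligned loads/stores
 * cannot tear, and fail the build if a future field addition breaks
 * that assumption.
 */
struct svc_stats {
	uint64_t calls;
	uint64_t cycles_spent;
} __attribute__((aligned(8)));

_Static_assert(offsetof(struct svc_stats, calls) % 8 == 0,
	"calls must be naturally aligned");
_Static_assert(offsetof(struct svc_stats, cycles_spent) % 8 == 0,
	"cycles_spent must be naturally aligned");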


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 17:22         ` Morten Brørup
@ 2022-07-08 17:39           ` Honnappa Nagarahalli
  2022-07-08 18:08             ` Morten Brørup
  0 siblings, 1 reply; 43+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-08 17:39 UTC (permalink / raw)
  To: Morten Brørup, Harry van Haaren, dev; +Cc: Mattias Rönnblom, nd, nd

<snip>
> > > > >
> > > > > This commit fixes a potential racey-add that could occur if
> > multiple
> > > > service-
> > > > > lcores were executing the same MT-safe service at the same time,
> > > > > with service statistics collection enabled.
> > > > >
> > > > > Because multiple threads can run and execute the service, the
> > stats
> > > > values
> > > > > can have multiple writer threads, resulting in the requirement
> > > > > of
> > > > using
> > > > > atomic addition for correctness.
> > > > >
> > > > > Note that when a MT unsafe service is executed, a spinlock is
> > held,
> > > > so the
> > > > > stats increments are protected. This fact is used to avoid
> > executing
> > > > atomic
> > > > > add instructions when not required.
> > > > >
> > > > > This patch causes a 1.25x increase in cycle-cost for polling a
> > > > > MT
> > > > safe service
> > > > > when statistics are enabled. No change was seen for MT unsafe
> > > > services, or
> > > > > when statistics are disabled.
> > > > >
> > > > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > > > Suggested-by: Honnappa Nagarahalli
> > > > > <Honnappa.Nagarahalli@arm.com>
> > > > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > > > >
> > > > > ---
> > > > > ---
> > > > >  lib/eal/common/rte_service.c | 10 ++++++++--
> > > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/eal/common/rte_service.c
> > > > b/lib/eal/common/rte_service.c
> > > > > index ef31b1f63c..f045e74ef3 100644
> > > > > --- a/lib/eal/common/rte_service.c
> > > > > +++ b/lib/eal/common/rte_service.c
> > > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > > > rte_service_spec_impl *s,
> > > > >  		uint64_t start = rte_rdtsc();
> > > > >  		s->spec.callback(userdata);
> > > > >  		uint64_t end = rte_rdtsc();
> > > > > -		s->cycles_spent += end - start;
> > > > > +		uint64_t cycles = end - start;
> > > > >  		cs->calls_per_service[service_idx]++;
> > > > > -		s->calls++;
> > > > > +		if (service_mt_safe(s)) {
> > > > > +			__atomic_fetch_add(&s->cycles_spent,
> cycles,
> > > > > __ATOMIC_RELAXED);
> > > > > +			__atomic_fetch_add(&s->calls, 1,
> > > > > __ATOMIC_RELAXED);
> > > > > +		} else {
> > > > > +			s->cycles_spent += cycles;
> > > > > +			s->calls++;
> > > > This is still a problem from a reader perspective. It is possible
> > that
> > > > the writes could be split while a reader is reading the stats.
> > These
> > > > need to be atomic adds.
> > >
> > > I don't understand what you suggest can go wrong here, Honnappa. If
> > you
> > > talking about 64 bit counters on 32 bit architectures, then I
> > understand the
> > > problem (and have many years of direct experience with it myself).
> > > Otherwise, I hope you can elaborate or direct me to educational
> > material
> > > about the issue, considering this a learning opportunity. :-)
> > I am thinking of the case where the 64b write is split into two 32b
> > (or
> > more) write operations either by the compiler or the micro-
> > architecture. If this were to happen, it causes race conditions with
> > the reader.
> >
> > As far as I understand, the compiler does not provide any guarantees
> > on generating non-tearing stores unless an atomic builtin/function is
> > used.
> 
> This seems like a generic problem for all 64b statistics counters in DPDK, and
> any other C code using 64 bit counters. Being a generic C problem, there is
> probably a generic solution to it.
Browsing through the code, I see similar problems elsewhere.

> 
> Is any compiler going to do something that stupid (i.e. tearing a store into
> multiple write operations) to a simple 64b counter on any 64 bit architecture
> (assuming the counter is 64b aligned)? Otherwise, we might only need to
> take special precautions for 32 bit architectures.
It is always a debate on who is stupid, compiler or programmer 😊

Though not the same case, you can look at this discussion where the compiler generated torn stores [1] when we all thought it had been generating a 64b store.

[1] http://inbox.dpdk.org/dev/d5d563ab-0411-3faf-39ec-4994f2bc9f6f@intel.com/

> 
> > If we have to ensure the micro-architecture does not generate split
> > writes, we need to be careful that future code additions do not change
> > the alignment of the stats.
> 
> Unless the structure containing the stats counters is packed, the contained
> 64b counters will be 64b aligned (on 64 bit architecture). So we should not
> worry about alignment, except perhaps on 32 bit architectures.
Agree, future code changes need to be aware of these issues and DPDK supports 32b architectures.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 17:39           ` Honnappa Nagarahalli
@ 2022-07-08 18:08             ` Morten Brørup
  0 siblings, 0 replies; 43+ messages in thread
From: Morten Brørup @ 2022-07-08 18:08 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Harry van Haaren, dev; +Cc: Mattias Rönnblom, nd, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 8 July 2022 19.40
> 
> <snip>
> > > > > >
> > > > > > This commit fixes a potential racey-add that could occur if
> > > multiple
> > > > > service-
> > > > > > lcores were executing the same MT-safe service at the same
> time,
> > > > > > with service statistics collection enabled.
> > > > > >
> > > > > > Because multiple threads can run and execute the service, the
> > > stats
> > > > > values
> > > > > > can have multiple writer threads, resulting in the
> requirement
> > > > > > of
> > > > > using
> > > > > > atomic addition for correctness.
> > > > > >
> > > > > > Note that when a MT unsafe service is executed, a spinlock is
> > > held,
> > > > > so the
> > > > > > stats increments are protected. This fact is used to avoid
> > > executing
> > > > > atomic
> > > > > > add instructions when not required.
> > > > > >
> > > > > > This patch causes a 1.25x increase in cycle-cost for polling
> a
> > > > > > MT
> > > > > safe service
> > > > > > when statistics are enabled. No change was seen for MT unsafe
> > > > > services, or
> > > > > > when statistics are disabled.
> > > > > >
> > > > > > Reported-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > > > > Suggested-by: Honnappa Nagarahalli
> > > > > > <Honnappa.Nagarahalli@arm.com>
> > > > > > Suggested-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
> > > > > >
> > > > > > ---
> > > > > > ---
> > > > > >  lib/eal/common/rte_service.c | 10 ++++++++--
> > > > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/eal/common/rte_service.c
> > > > > b/lib/eal/common/rte_service.c
> > > > > > index ef31b1f63c..f045e74ef3 100644
> > > > > > --- a/lib/eal/common/rte_service.c
> > > > > > +++ b/lib/eal/common/rte_service.c
> > > > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct
> > > > > > rte_service_spec_impl *s,
> > > > > >  		uint64_t start = rte_rdtsc();
> > > > > >  		s->spec.callback(userdata);
> > > > > >  		uint64_t end = rte_rdtsc();
> > > > > > -		s->cycles_spent += end - start;
> > > > > > +		uint64_t cycles = end - start;
> > > > > >  		cs->calls_per_service[service_idx]++;
> > > > > > -		s->calls++;
> > > > > > +		if (service_mt_safe(s)) {
> > > > > > +			__atomic_fetch_add(&s->cycles_spent,
> > cycles,
> > > > > > __ATOMIC_RELAXED);
> > > > > > +			__atomic_fetch_add(&s->calls, 1,
> > > > > > __ATOMIC_RELAXED);
> > > > > > +		} else {
> > > > > > +			s->cycles_spent += cycles;
> > > > > > +			s->calls++;
> > > > > This is still a problem from a reader perspective. It is
> possible
> > > that
> > > > > the writes could be split while a reader is reading the stats.
> > > These
> > > > > need to be atomic adds.
> > > >
> > > > I don't understand what you suggest can go wrong here, Honnappa.
> If
> > > you
> > > > talking about 64 bit counters on 32 bit architectures, then I
> > > understand the
> > > > problem (and have many years of direct experience with it
> myself).
> > > > Otherwise, I hope you can elaborate or direct me to educational
> > > material
> > > > about the issue, considering this a learning opportunity. :-)
> > > I am thinking of the case where the 64b write is split into two 32b
> > > (or
> > > more) write operations either by the compiler or the micro-
> > > architecture. If this were to happen, it causes race conditions
> with
> > > the reader.
> > >
> > > As far as I understand, the compiler does not provide any
> guarantees
> > > on generating non-tearing stores unless an atomic builtin/function
> is
> > > used.
> >
> > This seems like a generic problem for all 64b statistics counters in
> DPDK, and
> > any other C code using 64 bit counters. Being a generic C problem,
> there is
> > probably a generic solution to it.
> Browsing through the code, I see similar problems elsewhere.
> 
> >
> > Is any compiler going to do something that stupid (i.e. tearing a
> store into
> > multiple write operations) to a simple 64b counter on any 64 bit
> architecture
> > (assuming the counter is 64b aligned)? Otherwise, we might only need
> to
> > take special precautions for 32 bit architectures.
> It is always a debate on who is stupid, compiler or programmer 😊

Compilers will never stop surprising me. Thankfully, they are not as unreliable and full of bugs as they were 25 years ago. :-)

> 
> Though not the same case, you can look at this discussion where
> compiler generated torn stores [1] when we all thought it has been
> generating a 64b store.
> 
> [1] http://inbox.dpdk.org/dev/d5d563ab-0411-3faf-39ec-
> 4994f2bc9f6f@intel.com/

Good reference.

Technically, this sets a bunch of fields in the rte_lpm_tbl_entry structure (which happens to be 32b in total size), so it is not completely unreasonable for the compiler to store those fields individually. I wonder if using a union to cast the rte_lpm_tbl_entry struct to uint32_t (and ensuring 32b alignment) would have solved the problem, and the __atomic_store() could be avoided?
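
A minimal sketch of that union idea (purely hypothetical; the bit-fields
below are placeholders and do not reproduce the real rte_lpm_tbl_entry
layout):

#include <stdint.h>

/* Bit-fields grouped with a 32-bit word in a union, so the whole entry
 * can be published with one atomic 32-bit store instead of per-field
 * stores that the compiler might tear.
 */
union entry {
	struct {
		uint32_t data  : 24;
		uint32_t flags : 8;
	} fields;
	uint32_t word;
};

static inline void
entry_publish(union entry *dst, union entry val)
{
	__atomic_store_n(&dst->word, val.word, __ATOMIC_RELAXED);
}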

> 
> >
> > > If we have to ensure the micro-architecture does not generate split
> > > writes, we need to be careful that future code additions do not
> change
> > > the alignment of the stats.
> >
> > Unless the structure containing the stats counters is packed, the
> contained
> > 64b counters will be 64b aligned (on 64 bit architecture). So we
> should not
> > worry about alignment, except perhaps on 32 bit architectures.
> Agree, future code changes need to be aware of these issues and DPDK
> supports 32b architectures.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/2] service: fix potential stats race-condition on MT services
  2022-07-08 16:21       ` Bruce Richardson
  2022-07-08 16:33         ` Honnappa Nagarahalli
@ 2022-07-08 20:02         ` Mattias Rönnblom
  1 sibling, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-07-08 20:02 UTC (permalink / raw)
  To: Bruce Richardson, Van Haaren, Harry
  Cc: Honnappa Nagarahalli, dev, mattias.ronnblom, Morten Brørup, nd

On 2022-07-08 18:21, Bruce Richardson wrote:
> On Fri, Jul 08, 2022 at 03:31:01PM +0000, Van Haaren, Harry wrote:
>>> -----Original Message-----
>>> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
>>> Sent: Friday, July 8, 2022 4:16 PM
>>> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
>>> Cc: mattias.ronnblom <mattias.ronnblom@ericsson.com>; Morten Brørup
>>> <mb@smartsharesystems.com>; nd <nd@arm.com>; nd <nd@arm.com>
>>> Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
>>
>> <snip previous discussions>
>>
>>>> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
>>>> index ef31b1f63c..f045e74ef3 100644
>>>> --- a/lib/eal/common/rte_service.c
>>>> +++ b/lib/eal/common/rte_service.c
>>>> @@ -363,9 +363,15 @@ service_runner_do_callback(struct
>>>> rte_service_spec_impl *s,
>>>>   		uint64_t start = rte_rdtsc();
>>>>   		s->spec.callback(userdata);
>>>>   		uint64_t end = rte_rdtsc();
>>>> -		s->cycles_spent += end - start;
>>>> +		uint64_t cycles = end - start;
>>>>   		cs->calls_per_service[service_idx]++;
>>>> -		s->calls++;
>>>> +		if (service_mt_safe(s)) {
>>>> +			__atomic_fetch_add(&s->cycles_spent, cycles,
>>>> __ATOMIC_RELAXED);
>>>> +			__atomic_fetch_add(&s->calls, 1,
>>>> __ATOMIC_RELAXED);
>>>> +		} else {
>>>> +			s->cycles_spent += cycles;
>>>> +			s->calls++;
>>> This is still a problem from a reader perspective. It is possible that the writes could be
>>> split while a reader is reading the stats. These need to be atomic adds.
>>
>> Thanks for pointing out; I do "think" in x86 in terms of load/store tearing; and on x86
>> naturally aligned load/stores will not tear. Apologies for missing the ARM angle here.
>>
>> I'm not sure how to best encode the difference between tearing & "locked instructions"
>> to make things multi-writer safe. But they're not the same thing, and I'd prefer not pay
>> the penalty for LOCK instructions (multi-writer) only to satisfy the non-tearing requirements.
>>
>> Is there an rte_atomic-* type that is guaranteed as non-tearing?
>>
>> In that case, changing the type of the calls/cycles_spent variables to such a type to ensure "non-tearing"
>> single-reader, single-writer behaviour is enough, instead of forcing __atomic_fetch_add() everywhere?
> 
> Regular read, increment and then atomic store should work without locks
> where alignment is correct on most architectures, and doing the store
> atomically should prevent any tearing.

This is a good pattern, and provides a noticeable performance benefit
compared to using an atomic add (which is overkill in single-writer
scenarios), in my experience. The DPDK seqcount implementation
increments the count in this manner.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/6] service: reduce statistics overhead for parallel services
  2022-07-08 12:56 ` [PATCH 2/2] service: fix potential stats race-condition on MT services Harry van Haaren
                     ` (2 preceding siblings ...)
  2022-07-08 15:16   ` Honnappa Nagarahalli
@ 2022-09-06 16:13   ` Mattias Rönnblom
  2022-09-06 16:13     ` [PATCH 2/6] service: introduce per-lcore cycles counter Mattias Rönnblom
                       ` (7 more replies)
  3 siblings, 8 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-09-06 16:13 UTC (permalink / raw)
  To: Van Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, Mattias Rönnblom

Move the statistics from the service data structure to the per-lcore
struct. This eliminates contention for the counter cache lines, which
decreases the producer-side statistics overhead for services deployed
across many lcores.

Prior to this patch, enabling statistics for a service with a
per-service function call latency of 1000 clock cycles deployed across
16 cores on an Intel Xeon 6230N @ 2.3 GHz would incur a cost of ~10000
core clock cycles per service call. After this patch, the statistics
overhead is reduced to 22 clock cycles per call.
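
(A minimal sketch of the per-lcore idea, with placeholder names and sizes
rather than the actual DPDK structures: each lcore updates only its own
block, and a reader aggregates the counters with relaxed atomic loads.)

#include <stdint.h>

#define MAX_LCORES   128   /* stand-in for RTE_MAX_LCORE */
#define NUM_SERVICES 64    /* stand-in for RTE_SERVICE_NUM_MAX */

/* One block per lcore, cache-line aligned: a single writer per block,
 * so producers never contend for a counter cache line.
 */
struct lcore_stats {
	uint64_t calls[NUM_SERVICES];
} __attribute__((aligned(64)));

static struct lcore_stats lcore_stats[MAX_LCORES];

/* Reader: sum the per-lcore counters with relaxed atomic loads. */
static uint64_t
total_calls(uint32_t service_idx)
{
	uint64_t sum = 0;
	unsigned int i;

	for (i = 0; i < MAX_LCORES; i++)
		sum += __atomic_load_n(&lcore_stats[i].calls[service_idx],
				       __ATOMIC_RELAXED);

	return sum;
}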

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 182 +++++++++++++++++++++++------------
 1 file changed, 121 insertions(+), 61 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 94cb056196..b5103f2a20 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -50,17 +50,8 @@ struct rte_service_spec_impl {
 	 * on currently.
 	 */
 	uint32_t num_mapped_cores;
-
-	/* 32-bit builds won't naturally align a uint64_t, so force alignment,
-	 * allowing regular reads to be atomic.
-	 */
-	uint64_t calls __rte_aligned(8);
-	uint64_t cycles_spent __rte_aligned(8);
 } __rte_cache_aligned;
 
-/* Mask used to ensure uint64_t 8 byte vars are naturally aligned. */
-#define RTE_SERVICE_STAT_ALIGN_MASK (8 - 1)
-
 /* the internal values of a service core */
 struct core_state {
 	/* map of services IDs are run on this core */
@@ -71,6 +62,7 @@ struct core_state {
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
 	uint64_t calls_per_service[RTE_SERVICE_NUM_MAX];
+	uint64_t cycles_per_service[RTE_SERVICE_NUM_MAX];
 } __rte_cache_aligned;
 
 static uint32_t rte_service_count;
@@ -138,13 +130,16 @@ rte_service_finalize(void)
 	rte_service_library_initialized = 0;
 }
 
-/* returns 1 if service is registered and has not been unregistered
- * Returns 0 if service never registered, or has been unregistered
- */
-static inline int
+static inline bool
+service_registered(uint32_t id)
+{
+	return rte_services[id].internal_flags & SERVICE_F_REGISTERED;
+}
+
+static inline bool
 service_valid(uint32_t id)
 {
-	return !!(rte_services[id].internal_flags & SERVICE_F_REGISTERED);
+	return id < RTE_SERVICE_NUM_MAX && service_registered(id);
 }
 
 static struct rte_service_spec_impl *
@@ -155,7 +150,7 @@ service_get(uint32_t id)
 
 /* validate ID and retrieve service pointer, or return error value */
 #define SERVICE_VALID_GET_OR_ERR_RET(id, service, retval) do {          \
-	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))            \
+	if (!service_valid(id))                                         \
 		return retval;                                          \
 	service = &rte_services[id];                                    \
 } while (0)
@@ -217,7 +212,7 @@ rte_service_get_by_name(const char *name, uint32_t *service_id)
 
 	int i;
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (service_valid(i) &&
+		if (service_registered(i) &&
 				strcmp(name, rte_services[i].spec.name) == 0) {
 			*service_id = i;
 			return 0;
@@ -254,7 +249,7 @@ rte_service_component_register(const struct rte_service_spec *spec,
 		return -EINVAL;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i)) {
+		if (!service_registered(i)) {
 			free_slot = i;
 			break;
 		}
@@ -366,29 +361,24 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 {
 	void *userdata = s->spec.callback_userdata;
 
-	/* Ensure the atomically stored variables are naturally aligned,
-	 * as required for regular loads to be atomic.
-	 */
-	RTE_BUILD_BUG_ON((offsetof(struct rte_service_spec_impl, calls)
-		& RTE_SERVICE_STAT_ALIGN_MASK) != 0);
-	RTE_BUILD_BUG_ON((offsetof(struct rte_service_spec_impl, cycles_spent)
-		& RTE_SERVICE_STAT_ALIGN_MASK) != 0);
-
 	if (service_stats_enabled(s)) {
 		uint64_t start = rte_rdtsc();
 		s->spec.callback(userdata);
 		uint64_t end = rte_rdtsc();
 		uint64_t cycles = end - start;
-		cs->calls_per_service[service_idx]++;
-		if (service_mt_safe(s)) {
-			__atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED);
-			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
-		} else {
-			uint64_t cycles_new = s->cycles_spent + cycles;
-			uint64_t calls_new = s->calls++;
-			__atomic_store_n(&s->cycles_spent, cycles_new, __ATOMIC_RELAXED);
-			__atomic_store_n(&s->calls, calls_new, __ATOMIC_RELAXED);
-		}
+
+		/* The lcore service worker thread is the only writer,
+		 * and thus only a non-atomic load and an atomic store
+		 * is needed, and not the more expensive atomic
+		 * add.
+		 */
+		__atomic_store_n(&cs->calls_per_service[service_idx],
+				 cs->calls_per_service[service_idx] + 1,
+				 __ATOMIC_RELAXED);
+
+		__atomic_store_n(&cs->cycles_per_service[service_idx],
+				 cs->cycles_per_service[service_idx] + cycles,
+				 __ATOMIC_RELAXED);
 	} else
 		s->spec.callback(userdata);
 }
@@ -436,7 +426,7 @@ rte_service_may_be_active(uint32_t id)
 	int32_t lcore_count = rte_service_lcore_list(ids, RTE_MAX_LCORE);
 	int i;
 
-	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+	if (!service_valid(id))
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
@@ -483,16 +473,17 @@ service_runner_func(void *arg)
 	 */
 	while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
 			RUNSTATE_RUNNING) {
+
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-			if (!service_valid(i))
+			if (!service_registered(i))
 				continue;
 			/* return value ignored as no change to code flow */
 			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
-		cs->loops++;
+		__atomic_store_n(&cs->loops, cs->loops + 1, __ATOMIC_RELAXED);
 	}
 
 	/* Use SEQ CST memory ordering to avoid any re-ordering around
@@ -608,8 +599,8 @@ static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
 	/* validate ID, or return error value */
-	if (sid >= RTE_SERVICE_NUM_MAX || !service_valid(sid) ||
-	    lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
+	    !lcore_states[lcore].is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
@@ -813,21 +804,75 @@ rte_service_lcore_stop(uint32_t lcore)
 	return 0;
 }
 
+static uint64_t
+lcore_attr_get_loops(unsigned int lcore)
+{
+	struct core_state *cs = &lcore_states[lcore];
+
+	return __atomic_load_n(&cs->loops, __ATOMIC_RELAXED);
+}
+
+static uint64_t
+lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
+{
+	struct core_state *cs = &lcore_states[lcore];
+
+	return __atomic_load_n(&cs->calls_per_service[service_id],
+			       __ATOMIC_RELAXED);
+}
+
+static uint64_t
+lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
+{
+	struct core_state *cs = &lcore_states[lcore];
+
+	return __atomic_load_n(&cs->cycles_per_service[service_id],
+			       __ATOMIC_RELAXED);
+}
+
+typedef uint64_t (*lcore_attr_get_fun)(uint32_t service_id,
+				       unsigned int lcore);
+
+static uint64_t
+attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
+{
+	unsigned int lcore;
+	uint64_t sum = 0;
+
+	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++)
+		if (lcore_states[lcore].is_service_core)
+			sum += lcore_attr_get(id, lcore);
+
+	return sum;
+}
+
+static uint64_t
+attr_get_service_calls(uint32_t service_id)
+{
+	return attr_get(service_id, lcore_attr_get_service_calls);
+}
+
+static uint64_t
+attr_get_service_cycles(uint32_t service_id)
+{
+	return attr_get(service_id, lcore_attr_get_service_cycles);
+}
+
 int32_t
 rte_service_attr_get(uint32_t id, uint32_t attr_id, uint64_t *attr_value)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	if (!service_valid(id))
+		return -EINVAL;
 
 	if (!attr_value)
 		return -EINVAL;
 
 	switch (attr_id) {
-	case RTE_SERVICE_ATTR_CYCLES:
-		*attr_value = s->cycles_spent;
-		return 0;
 	case RTE_SERVICE_ATTR_CALL_COUNT:
-		*attr_value = s->calls;
+		*attr_value = attr_get_service_calls(id);
+		return 0;
+	case RTE_SERVICE_ATTR_CYCLES:
+		*attr_value = attr_get_service_cycles(id);
 		return 0;
 	default:
 		return -EINVAL;
@@ -849,7 +894,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 
 	switch (attr_id) {
 	case RTE_SERVICE_LCORE_ATTR_LOOPS:
-		*attr_value = cs->loops;
+		*attr_value = lcore_attr_get_loops(lcore);
 		return 0;
 	default:
 		return -EINVAL;
@@ -859,11 +904,18 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 int32_t
 rte_service_attr_reset_all(uint32_t id)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	unsigned int lcore;
+
+	if (!service_valid(id))
+		return -EINVAL;
+
+	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
+		struct core_state *cs = &lcore_states[lcore];
+
+		cs->calls_per_service[id] = 0;
+		cs->cycles_per_service[id] = 0;
+	}
 
-	s->cycles_spent = 0;
-	s->calls = 0;
 	return 0;
 }
 
@@ -885,17 +937,25 @@ rte_service_lcore_attr_reset_all(uint32_t lcore)
 }
 
 static void
-service_dump_one(FILE *f, struct rte_service_spec_impl *s)
+service_dump_one(FILE *f, uint32_t id)
 {
+	struct rte_service_spec_impl *s;
+	uint64_t service_calls;
+	uint64_t service_cycles;
+
+	service_calls = attr_get_service_calls(id);
+	service_cycles = attr_get_service_cycles(id);
+
 	/* avoid divide by zero */
-	int calls = 1;
+	if (service_calls == 0)
+		service_calls = 1;
+
+	s = service_get(id);
 
-	if (s->calls != 0)
-		calls = s->calls;
 	fprintf(f, "  %s: stats %d\tcalls %"PRIu64"\tcycles %"
 			PRIu64"\tavg: %"PRIu64"\n",
-			s->spec.name, service_stats_enabled(s), s->calls,
-			s->cycles_spent, s->cycles_spent / calls);
+			s->spec.name, service_stats_enabled(s), service_calls,
+			service_cycles, service_cycles / service_calls);
 }
 
 static void
@@ -906,7 +966,7 @@ service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i))
+		if (!service_registered(i))
 			continue;
 		fprintf(f, "%"PRIu64"\t", cs->calls_per_service[i]);
 	}
@@ -924,16 +984,16 @@ rte_service_dump(FILE *f, uint32_t id)
 		struct rte_service_spec_impl *s;
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
-		service_dump_one(f, s);
+		service_dump_one(f, id);
 		return 0;
 	}
 
 	/* print all services, as UINT32_MAX was passed as id */
 	fprintf(f, "Services Summary\n");
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i))
+		if (!service_registered(i))
 			continue;
-		service_dump_one(f, &rte_services[i]);
+		service_dump_one(f, i);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 2/6] service: introduce per-lcore cycles counter
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
@ 2022-09-06 16:13     ` Mattias Rönnblom
  2022-09-06 16:13     ` [PATCH 3/6] service: reduce average case service core overhead Mattias Rönnblom
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-09-06 16:13 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, Mattias Rönnblom

Introduce a per-lcore counter for the total time spent on processing
services on that core.

This counter is useful when measuring individual lcore load.
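
As an illustration only (the helper below is not part of this patch, and the blocking-delay sampling is just one way to do it), the new attribute could be used to estimate service lcore load roughly as follows:

#include <rte_cycles.h>
#include <rte_service.h>

/* Returns the fraction of the sampling interval the lcore spent running
 * services, or a negative value on error.
 */
static double
estimate_service_lcore_load(uint32_t lcore, unsigned int interval_us)
{
	uint64_t busy_before, busy_after, tsc_before, tsc_after;

	if (rte_service_lcore_attr_get(lcore, RTE_SERVICE_LCORE_ATTR_CYCLES,
				       &busy_before) != 0)
		return -1.0;
	tsc_before = rte_rdtsc();

	rte_delay_us_block(interval_us);

	if (rte_service_lcore_attr_get(lcore, RTE_SERVICE_LCORE_ATTR_CYCLES,
				       &busy_after) != 0)
		return -1.0;
	tsc_after = rte_rdtsc();

	return (double)(busy_after - busy_before) /
		(double)(tsc_after - tsc_before);
}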

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/test_service_cores.c |  2 +-
 lib/eal/common/rte_service.c  | 14 ++++++++++++++
 lib/eal/include/rte_service.h |  6 ++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index 7415b6b686..096405133b 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -403,7 +403,7 @@ service_lcore_attr_get(void)
 			"lcore_attr_get() failed to get loops "
 			"(expected > zero)");
 
-	lcore_attr_id++;  // invalid lcore attr id
+	lcore_attr_id = 42; /* invalid lcore attr id */
 	TEST_ASSERT_EQUAL(-EINVAL, rte_service_lcore_attr_get(slcore_id,
 			lcore_attr_id, &lcore_attr_value),
 			"Invalid lcore attr didn't return -EINVAL");
diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index b5103f2a20..87df04e3ac 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -61,6 +61,7 @@ struct core_state {
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
+	uint64_t cycles;
 	uint64_t calls_per_service[RTE_SERVICE_NUM_MAX];
 	uint64_t cycles_per_service[RTE_SERVICE_NUM_MAX];
 } __rte_cache_aligned;
@@ -372,6 +373,8 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 		 * is needed, and not the more expensive atomic
 		 * add.
 		 */
+		__atomic_store_n(&cs->cycles, cs->cycles + cycles,
+				 __ATOMIC_RELAXED);
 		__atomic_store_n(&cs->calls_per_service[service_idx],
 				 cs->calls_per_service[service_idx] + 1,
 				 __ATOMIC_RELAXED);
@@ -812,6 +815,14 @@ lcore_attr_get_loops(unsigned int lcore)
 	return __atomic_load_n(&cs->loops, __ATOMIC_RELAXED);
 }
 
+static uint64_t
+lcore_attr_get_cycles(unsigned int lcore)
+{
+	struct core_state *cs = &lcore_states[lcore];
+
+	return __atomic_load_n(&cs->cycles, __ATOMIC_RELAXED);
+}
+
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
@@ -896,6 +907,9 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 	case RTE_SERVICE_LCORE_ATTR_LOOPS:
 		*attr_value = lcore_attr_get_loops(lcore);
 		return 0;
+	case RTE_SERVICE_LCORE_ATTR_CYCLES:
+		*attr_value = lcore_attr_get_cycles(lcore);
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/lib/eal/include/rte_service.h b/lib/eal/include/rte_service.h
index 35d8018684..70deb6e53a 100644
--- a/lib/eal/include/rte_service.h
+++ b/lib/eal/include/rte_service.h
@@ -407,6 +407,12 @@ int32_t rte_service_attr_reset_all(uint32_t id);
  */
 #define RTE_SERVICE_LCORE_ATTR_LOOPS 0
 
+/**
+ * Returns the total number of cycles that the lcore has spent on
+ * running services.
+ */
+#define RTE_SERVICE_LCORE_ATTR_CYCLES 1
+
 /**
  * Get an attribute from a service core.
  *
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 3/6] service: reduce average case service core overhead
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
  2022-09-06 16:13     ` [PATCH 2/6] service: introduce per-lcore cycles counter Mattias Rönnblom
@ 2022-09-06 16:13     ` Mattias Rönnblom
  2022-10-03 13:33       ` Van Haaren, Harry
  2022-09-06 16:13     ` [PATCH 4/6] service: tweak cycle statistics semantics Mattias Rönnblom
                       ` (5 subsequent siblings)
  7 siblings, 1 reply; 43+ messages in thread
From: Mattias Rönnblom @ 2022-09-06 16:13 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, Mattias Rönnblom

Optimize service loop so that the starting point is the lowest-indexed
service mapped to the lcore in question, and terminate the loop at the
highest-indexed service.

While the worst case latency remains the same, this patch
significantly reduces the service framework overhead for the average
case. In particular, scenarios where an lcore only runs a single
service, or multiple services whose id values are close (e.g., three
services with ids 17, 18 and 22), show significant improvements.

The worst case is where the lcore has two services mapped to it; one
with service id 0 and the other with id 63.

On a service lcore serving a single service, the service loop overhead
is reduced from ~190 core clock cycles to ~46. (On an Intel Cascade
Lake generation Xeon.) On weakly ordered CPUs, the gain is larger,
since the loop included load-acquire atomic operations.
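
A small standalone illustration of the bit-scan bounds (using the example ids from above; the mask value is made up, and like the patch itself this assumes a 64-bit unsigned long):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t service_mask = (UINT64_C(1) << 17) | (UINT64_C(1) << 18) |
			(UINT64_C(1) << 22);

	if (service_mask != 0) {
		unsigned int start_id = __builtin_ctzl(service_mask);
		unsigned int end_id = 64 - __builtin_clzl(service_mask);

		/* prints "scanning service ids [17, 23)" */
		printf("scanning service ids [%u, %u)\n", start_id, end_id);
	}

	return 0;
}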

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 87df04e3ac..4cac866792 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -464,7 +464,6 @@ static int32_t
 service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
-	uint32_t i;
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
@@ -478,10 +477,17 @@ service_runner_func(void *arg)
 			RUNSTATE_RUNNING) {
 
 		const uint64_t service_mask = cs->service_mask;
+		uint8_t start_id;
+		uint8_t end_id;
+		uint8_t i;
 
-		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-			if (!service_registered(i))
-				continue;
+		if (service_mask == 0)
+			continue;
+
+		start_id = __builtin_ctzl(service_mask);
+		end_id = 64 - __builtin_clzl(service_mask);
+
+		for (i = start_id; i < end_id; i++) {
 			/* return value ignored as no change to code flow */
 			service_run(i, cs, service_mask, service_get(i), 1);
 		}
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 4/6] service: tweak cycle statistics semantics
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
  2022-09-06 16:13     ` [PATCH 2/6] service: introduce per-lcore cycles counter Mattias Rönnblom
  2022-09-06 16:13     ` [PATCH 3/6] service: reduce average case service core overhead Mattias Rönnblom
@ 2022-09-06 16:13     ` Mattias Rönnblom
  2022-09-07  8:41       ` Morten Brørup
  2022-09-06 16:13     ` [PATCH 5/6] event/sw: report idle when no work is performed Mattias Rönnblom
                       ` (4 subsequent siblings)
  7 siblings, 1 reply; 43+ messages in thread
From: Mattias Rönnblom @ 2022-09-06 16:13 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, Mattias Rönnblom

As a part of its service function, a service usually polls some kind
of source (e.g., an RX queue, a ring, an eventdev port, or a timer
wheel) to retrieve one or more items of work.

In low-load situations, the service framework reports a significant
amount of cycles spent for all running services, despite the fact they
have performed little or no actual work.

The per-call cycle expenditure for an idle service (i.e., a service
currently without pending jobs) is typically very low. Polling an
empty ring or RX queue is inexpensive. However, since the service
function call frequency on an idle or lightly loaded lcore is going to
be very high indeed, the service function calls' cycles add up to a
significant amount. The only thing preventing the idle services'
cycle counters from making up 100% of the available CPU cycles is the
overhead of the service framework itself.

If the RTE_SERVICE_ATTR_CYCLES or RTE_SERVICE_LCORE_ATTR_CYCLES are
used to estimate service core load, the cores may look very busy when
the system is mostly doing nothing useful at all.

This patch allows for an idle service to indicate that no actual work
was performed during a particular service function call (by returning
-EAGAIN). In such cases the RTE_SERVICE_ATTR_CYCLES and
RTE_SERVICE_LCORE_ATTR_CYCLES values are not incremented.

The convention of returning -EAGAIN for idle services may in the
future also be used to have the lcore enter a short sleep, or reduce
its operating frequency, in case all services are currently idle.

This change is backward-compatible.
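
A sketch of what a service callback following this convention could look like (the ring-based service and the burst size are made-up examples, not something this patch adds):

#include <errno.h>
#include <stdint.h>

#include <rte_common.h>
#include <rte_ring.h>

static int32_t
my_service_func(void *args)
{
	struct rte_ring *ring = args;
	void *objs[32];
	unsigned int n;

	n = rte_ring_dequeue_burst(ring, objs, RTE_DIM(objs), NULL);
	if (n == 0)
		return -EAGAIN; /* idle: these cycles won't be counted */

	/* ... process the n dequeued objects ... */

	return 0;
}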

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c            | 22 ++++++++++++++--------
 lib/eal/include/rte_service_component.h |  5 +++++
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 4cac866792..123610688c 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -10,6 +10,7 @@
 #include <rte_service_component.h>
 
 #include <rte_lcore.h>
+#include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
 #include <rte_atomic.h>
@@ -364,24 +365,29 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 
 	if (service_stats_enabled(s)) {
 		uint64_t start = rte_rdtsc();
-		s->spec.callback(userdata);
-		uint64_t end = rte_rdtsc();
-		uint64_t cycles = end - start;
+		int rc = s->spec.callback(userdata);
 
 		/* The lcore service worker thread is the only writer,
 		 * and thus only a non-atomic load and an atomic store
 		 * is needed, and not the more expensive atomic
 		 * add.
 		 */
-		__atomic_store_n(&cs->cycles, cs->cycles + cycles,
-				 __ATOMIC_RELAXED);
+
+		if (likely(rc != -EAGAIN)) {
+			uint64_t end = rte_rdtsc();
+			uint64_t cycles = end - start;
+
+			__atomic_store_n(&cs->cycles, cs->cycles + cycles,
+					 __ATOMIC_RELAXED);
+			__atomic_store_n(&cs->cycles_per_service[service_idx],
+					 cs->cycles_per_service[service_idx] +
+					 cycles, __ATOMIC_RELAXED);
+		}
+
 		__atomic_store_n(&cs->calls_per_service[service_idx],
 				 cs->calls_per_service[service_idx] + 1,
 				 __ATOMIC_RELAXED);
 
-		__atomic_store_n(&cs->cycles_per_service[service_idx],
-				 cs->cycles_per_service[service_idx] + cycles,
-				 __ATOMIC_RELAXED);
 	} else
 		s->spec.callback(userdata);
 }
diff --git a/lib/eal/include/rte_service_component.h b/lib/eal/include/rte_service_component.h
index 9e66ee7e29..9be49d698a 100644
--- a/lib/eal/include/rte_service_component.h
+++ b/lib/eal/include/rte_service_component.h
@@ -19,6 +19,11 @@ extern "C" {
 
 /**
  * Signature of callback function to run a service.
+ *
+ * A service function call resulting in no actual work being
+ * performed should return -EAGAIN. In that case, the (presumably few)
+ * cycles spent will not be counted toward the lcore or service-level
+ * cycles attributes.
  */
 typedef int32_t (*rte_service_func)(void *args);
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 5/6] event/sw: report idle when no work is performed
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
                       ` (2 preceding siblings ...)
  2022-09-06 16:13     ` [PATCH 4/6] service: tweak cycle statistics semantics Mattias Rönnblom
@ 2022-09-06 16:13     ` Mattias Rönnblom
  2022-09-06 16:13     ` [PATCH 6/6] service: provide links to functions in documentation Mattias Rönnblom
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-09-06 16:13 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, Mattias Rönnblom

Have the SW event device conform to the service core convention, where
-EAGAIN is returned in case no work was performed.

Prior to this patch, for an idle SW event device, a service lcore load
estimate based on RTE_SERVICE_ATTR_CYCLES would suggest 48% core
load.

At 7% of its maximum capacity, the SW event device needs about 15% of
the available CPU cycles* to perform its duties, but
RTE_SERVICE_ATTR_CYCLES would suggest the SW service used 48% of the
service core.

After this change, load deduced from RTE_SERVICE_ATTR_CYCLES will only
be a minor overestimation of the actual cycles used.

* The SW scheduler becomes more efficient at higher loads.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 drivers/event/sw/sw_evdev.c           | 3 +--
 drivers/event/sw/sw_evdev.h           | 2 +-
 drivers/event/sw/sw_evdev_scheduler.c | 6 ++++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/event/sw/sw_evdev.c b/drivers/event/sw/sw_evdev.c
index f93313b31b..f14c6427dd 100644
--- a/drivers/event/sw/sw_evdev.c
+++ b/drivers/event/sw/sw_evdev.c
@@ -933,8 +933,7 @@ set_refill_once(const char *key __rte_unused, const char *value, void *opaque)
 static int32_t sw_sched_service_func(void *args)
 {
 	struct rte_eventdev *dev = args;
-	sw_event_schedule(dev);
-	return 0;
+	return sw_event_schedule(dev);
 }
 
 static int
diff --git a/drivers/event/sw/sw_evdev.h b/drivers/event/sw/sw_evdev.h
index 4fd1054470..8542b7d34d 100644
--- a/drivers/event/sw/sw_evdev.h
+++ b/drivers/event/sw/sw_evdev.h
@@ -295,7 +295,7 @@ uint16_t sw_event_enqueue_burst(void *port, const struct rte_event ev[],
 uint16_t sw_event_dequeue(void *port, struct rte_event *ev, uint64_t wait);
 uint16_t sw_event_dequeue_burst(void *port, struct rte_event *ev, uint16_t num,
 			uint64_t wait);
-void sw_event_schedule(struct rte_eventdev *dev);
+int32_t sw_event_schedule(struct rte_eventdev *dev);
 int sw_xstats_init(struct sw_evdev *dev);
 int sw_xstats_uninit(struct sw_evdev *dev);
 int sw_xstats_get_names(const struct rte_eventdev *dev,
diff --git a/drivers/event/sw/sw_evdev_scheduler.c b/drivers/event/sw/sw_evdev_scheduler.c
index 809a54d731..8bc21944f5 100644
--- a/drivers/event/sw/sw_evdev_scheduler.c
+++ b/drivers/event/sw/sw_evdev_scheduler.c
@@ -506,7 +506,7 @@ sw_schedule_pull_port_dir(struct sw_evdev *sw, uint32_t port_id)
 	return pkts_iter;
 }
 
-void
+int32_t
 sw_event_schedule(struct rte_eventdev *dev)
 {
 	struct sw_evdev *sw = sw_pmd_priv(dev);
@@ -517,7 +517,7 @@ sw_event_schedule(struct rte_eventdev *dev)
 
 	sw->sched_called++;
 	if (unlikely(!sw->started))
-		return;
+		return -EAGAIN;
 
 	do {
 		uint32_t in_pkts_this_iteration = 0;
@@ -610,4 +610,6 @@ sw_event_schedule(struct rte_eventdev *dev)
 	sw->sched_last_iter_bitmask = cqs_scheds_last_iter;
 	if (unlikely(sw->port_count >= 64))
 		sw->sched_last_iter_bitmask = UINT64_MAX;
+
+	return work_done ? 0 : -EAGAIN;
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 6/6] service: provide links to functions in documentation
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
                       ` (3 preceding siblings ...)
  2022-09-06 16:13     ` [PATCH 5/6] event/sw: report idle when no work is performed Mattias Rönnblom
@ 2022-09-06 16:13     ` Mattias Rönnblom
  2022-10-03  8:06     ` [PATCH 1/6] service: reduce statistics overhead for parallel services David Marchand
                       ` (2 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-09-06 16:13 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, Mattias Rönnblom

Refer to API functions with parentheses, making Doxygen create
hyperlinks.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/include/rte_service.h | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/lib/eal/include/rte_service.h b/lib/eal/include/rte_service.h
index 70deb6e53a..90116a773a 100644
--- a/lib/eal/include/rte_service.h
+++ b/lib/eal/include/rte_service.h
@@ -37,7 +37,7 @@ extern "C" {
 
 /* Capabilities of a service.
  *
- * Use the *rte_service_probe_capability* function to check if a service is
+ * Use the rte_service_probe_capability() function to check if a service is
  * capable of a specific capability.
  */
 /** When set, the service is capable of having multiple threads run it at the
@@ -147,13 +147,13 @@ int32_t rte_service_map_lcore_get(uint32_t service_id, uint32_t lcore);
 int32_t rte_service_runstate_set(uint32_t id, uint32_t runstate);
 
 /**
- * Get the runstate for the service with *id*. See *rte_service_runstate_set*
+ * Get the runstate for the service with *id*. See rte_service_runstate_set()
  * for details of runstates. A service can call this function to ensure that
  * the application has indicated that it will receive CPU cycles. Either a
  * service-core is mapped (default case), or the application has explicitly
  * disabled the check that a service-cores is mapped to the service and takes
  * responsibility to run the service manually using the available function
- * *rte_service_run_iter_on_app_lcore* to do so.
+ * rte_service_run_iter_on_app_lcore() to do so.
  *
  * @retval 1 Service is running
  * @retval 0 Service is stopped
@@ -181,7 +181,7 @@ rte_service_may_be_active(uint32_t id);
 /**
  * Enable or disable the check for a service-core being mapped to the service.
  * An application can disable the check when takes the responsibility to run a
- * service itself using *rte_service_run_iter_on_app_lcore*.
+ * service itself using rte_service_run_iter_on_app_lcore().
  *
  * @param id The id of the service to set the check on
  * @param enable When zero, the check is disabled. Non-zero enables the check.
@@ -216,7 +216,7 @@ int32_t rte_service_set_runstate_mapped_check(uint32_t id, int32_t enable);
  *           atomics, applications can choose to enable or disable this feature
  *
  * Note that any thread calling this function MUST be a DPDK EAL thread, as
- * the *rte_lcore_id* function is used to access internal data structures.
+ * the rte_lcore_id() function is used to access internal data structures.
  *
  * @retval 0 Service was run on the calling thread successfully
  * @retval -EBUSY Another lcore is executing the service, and it is not a
@@ -232,7 +232,7 @@ int32_t rte_service_run_iter_on_app_lcore(uint32_t id,
  *
  * Starting a core makes the core begin polling. Any services assigned to it
  * will be run as fast as possible. The application must ensure that the lcore
- * is in a launchable state: e.g. call *rte_eal_lcore_wait* on the lcore_id
+ * is in a launchable state: e.g. call rte_eal_lcore_wait() on the lcore_id
  * before calling this function.
  *
  * @retval 0 Success
@@ -248,7 +248,7 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  * service core. Note that the service lcore thread may not have returned from
  * the service it is running when this API returns.
  *
- * The *rte_service_lcore_may_be_active* API can be used to check if the
+ * The rte_service_lcore_may_be_active() API can be used to check if the
  * service lcore is * still active.
  *
  * @retval 0 Success
@@ -265,7 +265,7 @@ int32_t rte_service_lcore_stop(uint32_t lcore_id);
  * Reports if a service lcore is currently running.
  *
  * This function returns if the core has finished service cores code, and has
- * returned to EAL control. If *rte_service_lcore_stop* has been called but
+ * returned to EAL control. If rte_service_lcore_stop() has been called but
  * the lcore has not returned to EAL yet, it might be required to wait and call
  * this function again. The amount of time to wait before the core returns
  * depends on the duration of the services being run.
@@ -293,7 +293,7 @@ int32_t rte_service_lcore_add(uint32_t lcore);
 /**
  * Removes lcore from the list of service cores.
  *
- * This can fail if the core is not stopped, see *rte_service_core_stop*.
+ * This can fail if the core is not stopped, see rte_service_core_stop().
  *
  * @retval 0 Success
  * @retval -EBUSY Lcore is not stopped, stop service core before removing.
@@ -308,7 +308,7 @@ int32_t rte_service_lcore_del(uint32_t lcore);
  * service core count can be used in mapping logic when creating mappings
  * from service cores to services.
  *
- * See *rte_service_lcore_list* for details on retrieving the lcore_id of each
+ * See rte_service_lcore_list() for details on retrieving the lcore_id of each
  * service core.
  *
  * @return The number of service cores currently configured.
@@ -344,14 +344,14 @@ int32_t rte_service_set_stats_enable(uint32_t id, int32_t enable);
  * indicating the lcore_id of a service core.
  *
  * Adding and removing service cores can be performed using
- * *rte_service_lcore_add* and *rte_service_lcore_del*.
- * @param [out] array An array of at least *rte_service_lcore_count* items.
+ * rte_service_lcore_add() and rte_service_lcore_del().
+ * @param [out] array An array of at least rte_service_lcore_count() items.
  *              If statically allocating the buffer, use RTE_MAX_LCORE.
  * @param [out] n The size of *array*.
  * @retval >=0 Number of service cores that have been populated in the array
  * @retval -ENOMEM The provided array is not large enough to fill in the
  *          service core list. No items have been populated, call this function
- *          with a size of at least *rte_service_core_count* items.
+ *          with a size of at least rte_service_core_count() items.
  */
 int32_t rte_service_lcore_list(uint32_t array[], uint32_t n);
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 4/6] service: tweak cycle statistics semantics
  2022-09-06 16:13     ` [PATCH 4/6] service: tweak cycle statistics semantics Mattias Rönnblom
@ 2022-09-07  8:41       ` Morten Brørup
  2022-10-03 13:45         ` Van Haaren, Harry
  0 siblings, 1 reply; 43+ messages in thread
From: Morten Brørup @ 2022-09-07  8:41 UTC (permalink / raw)
  To: Mattias Rönnblom, Harry van Haaren; +Cc: dev, Honnappa Nagarahalli, nd

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Tuesday, 6 September 2022 18.14
> 
> As a part of its service function, a service usually polls some kind
> of source (e.g., an RX queue, a ring, an eventdev port, or a timer
> wheel) to retrieve one or more items of work.
> 
> In low-load situations, the service framework reports a significant
> amount of cycles spent for all running services, despite the fact they
> have performed little or no actual work.
> 
> The per-call cycle expenditure for an idle service (i.e., a service
> currently without pending jobs) is typically very low. Polling an
> empty ring or RX queue is inexpensive. However, since the service
> function call frequency on an idle or lightly loaded lcore is going to
> be very high indeed, the service function calls' cycles adds up to a
> significant amount. The only thing preventing the idle services'
> cycles counters to make up 100% of the available CPU cycles is the
> overhead of the service framework itself.
> 
> If the RTE_SERVICE_ATTR_CYCLES or RTE_SERVICE_LCORE_ATTR_CYCLES are
> used to estimate service core load, the cores may look very busy when
> the system is mostly doing nothing useful at all.
> 
> This patch allows for an idle service to indicate that no actual work
> was performed during a particular service function call (by returning
> -EAGAIN). In such cases the RTE_SERVICE_ATTR_CYCLES and
> RTE_SERVICE_LCORE_ATTR_CYCLES values are not incremented.
> 
> The convention of returning -EAGAIN for idle services may in the
> future also be used to have the lcore enter a short sleep, or reduce
> its operating frequency, in case all services are currently idle.
> 
> This change is backward-compatible.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---

This entire series contains a bunch of good improvements.

Returning -EAGAIN is a step in the right direction towards measuring CPU usage, and a great way to make it backwards compatible.

Series-Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] service: reduce statistics overhead for parallel services
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
                       ` (4 preceding siblings ...)
  2022-09-06 16:13     ` [PATCH 6/6] service: provide links to functions in documentation Mattias Rönnblom
@ 2022-10-03  8:06     ` David Marchand
  2022-10-03  8:40       ` Mattias Rönnblom
  2022-10-03 13:33     ` Van Haaren, Harry
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
  7 siblings, 1 reply; 43+ messages in thread
From: David Marchand @ 2022-10-03  8:06 UTC (permalink / raw)
  To: Mattias Rönnblom, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd

On Tue, Sep 6, 2022 at 6:17 PM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> Move the statistics from the service data structure to the per-lcore
> struct. This eliminates contention for the counter cache lines, which
> decreases the producer-side statistics overhead for services deployed
> across many lcores.
>
> Prior to this patch, enabling statistics for a service with a
> per-service function call latency of 1000 clock cycles deployed across
> 16 cores on a Intel Xeon 6230N @ 2,3 GHz would incur a cost of ~10000
> core clock cycles per service call. After this patch, the statistics
> overhead is reduce to 22 clock cycles per call.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Re-reading the mail archive, I found that Morten acked the series
(replying on patch 4).
Mattias, such a series deserved a cover letter.

How does this series fit with other patches from Harry?
https://patchwork.dpdk.org/project/dpdk/list/?series=23959&state=*

Thanks.

-- 
David Marchand


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] service: reduce statistics overhead for parallel services
  2022-10-03  8:06     ` [PATCH 1/6] service: reduce statistics overhead for parallel services David Marchand
@ 2022-10-03  8:40       ` Mattias Rönnblom
  2022-10-03  9:53         ` David Marchand
  0 siblings, 1 reply; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-03  8:40 UTC (permalink / raw)
  To: David Marchand, Harry; +Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd

On 2022-10-03 10:06, David Marchand wrote:
> On Tue, Sep 6, 2022 at 6:17 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> Move the statistics from the service data structure to the per-lcore
>> struct. This eliminates contention for the counter cache lines, which
>> decreases the producer-side statistics overhead for services deployed
>> across many lcores.
>>
>> Prior to this patch, enabling statistics for a service with a
>> per-service function call latency of 1000 clock cycles deployed across
>> 16 cores on a Intel Xeon 6230N @ 2,3 GHz would incur a cost of ~10000
>> core clock cycles per service call. After this patch, the statistics
>> overhead is reduce to 22 clock cycles per call.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> 
> Re-reading the mail archive, I found that Morten acked the series
> (replying on patch 4).
> Mattias, such a series deserved a cover letter.
> 

Noted. Do you want a v2 with a cover letter?

> How does this series fit with other patches from Harry?
> https://patchwork.dpdk.org/project/dpdk/list/?series=23959&state=*
> 
> Thanks.
> 

The patchset I submitted is made against the above-linked patchset. 
Nothing in the patchset I submitted depends on those patches. I just 
assumed that Harry's patches would have been merged at the point my 
patchset was considered.

I can base a v2 against the current main, if you so prefer.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] service: reduce statistics overhead for parallel services
  2022-10-03  8:40       ` Mattias Rönnblom
@ 2022-10-03  9:53         ` David Marchand
  2022-10-03 11:37           ` Mattias Rönnblom
  0 siblings, 1 reply; 43+ messages in thread
From: David Marchand @ 2022-10-03  9:53 UTC (permalink / raw)
  To: Mattias Rönnblom, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd

On Mon, Oct 3, 2022 at 10:40 AM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> On 2022-10-03 10:06, David Marchand wrote:
> > On Tue, Sep 6, 2022 at 6:17 PM Mattias Rönnblom
> > <mattias.ronnblom@ericsson.com> wrote:
> >>
> >> Move the statistics from the service data structure to the per-lcore
> >> struct. This eliminates contention for the counter cache lines, which
> >> decreases the producer-side statistics overhead for services deployed
> >> across many lcores.
> >>
> >> Prior to this patch, enabling statistics for a service with a
> >> per-service function call latency of 1000 clock cycles deployed across
> >> 16 cores on a Intel Xeon 6230N @ 2,3 GHz would incur a cost of ~10000
> >> core clock cycles per service call. After this patch, the statistics
> >> overhead is reduce to 22 clock cycles per call.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >
> > Re-reading the mail archive, I found that Morten acked the series
> > (replying on patch 4).
> > Mattias, such a series deserved a cover letter.
> >
>
> Noted. Do you want a v2 with a cover letter?
>
> > How does this series fit with other patches from Harry?
> > https://patchwork.dpdk.org/project/dpdk/list/?series=23959&state=*
> >
> > Thanks.
> >
>
> The patchset I submitted is made against the above-linked patchset.
> Nothing in the patchset I submitted depends on those patches. I just
> assumed that Harry's patches would have been merged at the point my
> patchset was considered.

Harry's v1 patch was indeed sent a while ago.
But this v1 patch from Harry was superseded by a v2 and v3 series.
There were comments on the v3 series from Harry.


Now, looking at this series, without a cover letter, I had to guess.
I saw it is linked to the v1 patch.
I "assumed" it was then an alternative, since you had comments on
Harry's v3 patch, or at least Harry would reply and we could conclude.


So what do we do?
Should I understand that your comments on Harry's series can be ignored
and I proceed with all this?

I hope it applies cleanly.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] service: reduce statistics overhead for parallel services
  2022-10-03  9:53         ` David Marchand
@ 2022-10-03 11:37           ` Mattias Rönnblom
  2022-10-03 13:03             ` Van Haaren, Harry
  0 siblings, 1 reply; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-03 11:37 UTC (permalink / raw)
  To: David Marchand, Harry; +Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd

On 2022-10-03 11:53, David Marchand wrote:
> On Mon, Oct 3, 2022 at 10:40 AM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> On 2022-10-03 10:06, David Marchand wrote:
>>> On Tue, Sep 6, 2022 at 6:17 PM Mattias Rönnblom
>>> <mattias.ronnblom@ericsson.com> wrote:
>>>>
>>>> Move the statistics from the service data structure to the per-lcore
>>>> struct. This eliminates contention for the counter cache lines, which
>>>> decreases the producer-side statistics overhead for services deployed
>>>> across many lcores.
>>>>
>>>> Prior to this patch, enabling statistics for a service with a
>>>> per-service function call latency of 1000 clock cycles deployed across
>>>> 16 cores on a Intel Xeon 6230N @ 2,3 GHz would incur a cost of ~10000
>>>> core clock cycles per service call. After this patch, the statistics
>>>> overhead is reduce to 22 clock cycles per call.
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>
>>> Re-reading the mail archive, I found that Morten acked the series
>>> (replying on patch 4).
>>> Mattias, such a series deserved a cover letter.
>>>
>>
>> Noted. Do you want a v2 with a cover letter?
>>
>>> How does this series fit with other patches from Harry?
>>> https://patchwork.dpdk.org/project/dpdk/list/?series=23959&state=*
>>>
>>> Thanks.
>>>
>>
>> The patchset I submitted is made against the above-linked patchset.
>> Nothing in the patchset I submitted depends on those patches. I just
>> assumed that Harry's patches would have been merged at the point my
>> patchset was considered.
> 
> Harry v1 patch was indeed sent a while ago.
> But this v1 patch from Harry was superseded with a v2 and v3 series.
> There were comments on the v3 series from Harry.
> 
> 
> Now, looking at this series, without a cover letter, I had to guess.
> I saw it is linked to the v1 patch.
> I "assumed" it was then an alternative, since you had comments on
> Harry v3 patch, or at least Harry would reply and we could conclude.
> 

Sorry for failing to mention enough context in the patchset. I assumed 
it was Harry and not you that would resolve this issue. Harry's patchset 
fixes the statistics bug related to MT safe services, but does not 
address the performance issue discussed. So Harry's patchset makes sense 
on its own. It also adds a performance test case.

I believe the test case is the only thing left of Harry's improvements 
after my patchset is applied.

My patchset was meant as an improvement on what Harry already had done, 
not as an alternative.

This is all up to the maintainer of course, but it seems to me that Harry's 
patchset should go in first, and then mine.

If Harry or you so prefer, I can rework my patchset to apply cleanly 
against current main (i.e., w/o Harry's patches).

> 
> So what do we do?
> Should I understand that your comments on Harry series can be ignored
> and I proceed with all this?
> 

My comments were minor, except those that relates to the issue that my 
follow-up patchset addresses.

> I hope it applies cleanly.
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 1/6] service: reduce statistics overhead for parallel services
  2022-10-03 11:37           ` Mattias Rönnblom
@ 2022-10-03 13:03             ` Van Haaren, Harry
  0 siblings, 0 replies; 43+ messages in thread
From: Van Haaren, Harry @ 2022-10-03 13:03 UTC (permalink / raw)
  To: mattias.ronnblom, David Marchand
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd

> -----Original Message-----
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Sent: Monday, October 3, 2022 12:37 PM
> To: David Marchand <david.marchand@redhat.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> Morten Brørup <mb@smartsharesystems.com>; nd <nd@arm.com>
> Subject: Re: [PATCH 1/6] service: reduce statistics overhead for parallel services

<snip>

> > Now, looking at this series, without a cover letter, I had to guess.
> > I saw it is linked to the v1 patch.
> > I "assumed" it was then an alternative, since you had comments on
> > Harry v3 patch, or at least Harry would reply and we could conclude.
> >
> 
> Sorry for failing to mention enough context in the patchset. I assumed
> it was Harry and not you that would resolve this issue. Harry's patchset
> fixes the statistics bug related to MT safe services, but does not
> address the performance issue discussed. So Harry's patchset makes sense
> on its own. It also adds a performance test case.
> 
> I believe the test case is the only thing left of Harry's improvements
> after my patchset is applied.
> 
> My patchset was meant as an improvement on what Harry already had done,
> not as an alternative.
> 
> This is all up the maintainer of course, but it seems to me that Harry's
> patchset should go in first, and then mine.
> 
> If Harry or you so prefer, I can rework my patchset to apply cleanly
> against current main (i.e., w/o Harry's patches).

I'd like to keep the performance unit-test, but otherwise your patchset is good with me.
(Will test/review the series asap).

> > So what do we do?
> > Should I understand that your comments on Harry series can be ignored
> > and I proceed with all this?
> >
> 
> My comments were minor, except those that relates to the issue that my
> follow-up patchset addresses.
> 
> > I hope it applies cleanly.

I have no strong opinion here; whatever is easier for maintainers.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 3/6] service: reduce average case service core overhead
  2022-09-06 16:13     ` [PATCH 3/6] service: reduce average case service core overhead Mattias Rönnblom
@ 2022-10-03 13:33       ` Van Haaren, Harry
  2022-10-03 14:32         ` Mattias Rönnblom
  0 siblings, 1 reply; 43+ messages in thread
From: Van Haaren, Harry @ 2022-10-03 13:33 UTC (permalink / raw)
  To: mattias.ronnblom
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, mattias.ronnblom

> -----Original Message-----
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Sent: Tuesday, September 6, 2022 5:14 PM
> To: Van; Haaren; Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> Morten Brørup <mb@smartsharesystems.com>; nd <nd@arm.com>;
> mattias.ronnblom <mattias.ronnblom@ericsson.com>
> Subject: [PATCH 3/6] service: reduce average case service core overhead
> 
> Optimize service loop so that the starting point is the lowest-indexed
> service mapped to the lcore in question, and terminate the loop at the
> highest-indexed service.
> 
> While the worst case latency remains the same, this patch
> significantly reduces the service framework overhead for the average
> case. In particular, scenarios where an lcore only runs a single
> service, or multiple services which id values are close (e.g., three
> services with ids 17, 18 and 22), show significant improvements.
> 
> The worse case is a where the lcore two services mapped to it; one
> with service id 0 and the other with id 63.

I like the optimization - nice work. There is one caveat: with the
builtin_ctz() call, RTE_SERVICE_NUM_MAX *must* be 64 or lower.
Today it is defined as 64, but we must ensure that this value cannot
be changed "by accident" without explicit compilation failures and a
comment explaining that fact.

There are likely options around making it runtime-dynamic, but I don't
think the complexity is justified: suggest we use a compile-time
BUILD_BUG_ON() check and error out if it's > 64?
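
Something along these lines is what I have in mind (exact placement and wording up to you; as far as I can tell RTE_BUILD_BUG_ON() expands to an expression, so it has to sit inside a function, e.g. at the top of service_runner_func()):

static int32_t
service_runner_func(void *arg)
{
	/* The uint64_t service_mask and the __builtin_ctzl()/__builtin_clzl()
	 * calls assume at most 64 services; fail the build if the limit is
	 * ever raised without reworking this loop.
	 */
	RTE_BUILD_BUG_ON(RTE_SERVICE_NUM_MAX > 64);

	/* ... rest of the function as in the patch ... */
}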

Note in rte_service_component_register(), we *re-use* IDs when they
become available, so we can have up to 64 active services at a time, but
they can register/unregister more times than that. Continually
registering and unregistering services is a very unlikely usage of the
services API, though.

With the BUILD_BUG_ON() around the 64 MAX value with a comment:
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>


> On a service lcore serving a single service, the service loop overhead
> is reduced from ~190 core clock cycles to ~46. (On an Intel Cascade
> Lake generation Xeon.) On weakly ordered CPUs, the gain is larger,
> since the loop included load-acquire atomic operations.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  lib/eal/common/rte_service.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
> index 87df04e3ac..4cac866792 100644
> --- a/lib/eal/common/rte_service.c
> +++ b/lib/eal/common/rte_service.c
> @@ -464,7 +464,6 @@ static int32_t
>  service_runner_func(void *arg)
>  {
>  	RTE_SET_USED(arg);
> -	uint32_t i;
>  	const int lcore = rte_lcore_id();
>  	struct core_state *cs = &lcore_states[lcore];
> 
> @@ -478,10 +477,17 @@ service_runner_func(void *arg)
>  			RUNSTATE_RUNNING) {
> 
>  		const uint64_t service_mask = cs->service_mask;
> +		uint8_t start_id;
> +		uint8_t end_id;
> +		uint8_t i;
> 
> -		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> -			if (!service_registered(i))
> -				continue;
> +		if (service_mask == 0)
> +			continue;
> +
> +		start_id = __builtin_ctzl(service_mask);
> +		end_id = 64 - __builtin_clzl(service_mask);
> +
> +		for (i = start_id; i < end_id; i++) {
>  			/* return value ignored as no change to code flow */
>  			service_run(i, cs, service_mask, service_get(i), 1);
>  		}


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 1/6] service: reduce statistics overhead for parallel services
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
                       ` (5 preceding siblings ...)
  2022-10-03  8:06     ` [PATCH 1/6] service: reduce statistics overhead for parallel services David Marchand
@ 2022-10-03 13:33     ` Van Haaren, Harry
  2022-10-03 14:37       ` Mattias Rönnblom
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
  7 siblings, 1 reply; 43+ messages in thread
From: Van Haaren, Harry @ 2022-10-03 13:33 UTC (permalink / raw)
  To: mattias.ronnblom
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, mattias.ronnblom

> -----Original Message-----
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Sent: Tuesday, September 6, 2022 5:14 PM
> To: Van; Haaren; Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> Morten Brørup <mb@smartsharesystems.com>; nd <nd@arm.com>;
> mattias.ronnblom <mattias.ronnblom@ericsson.com>
> Subject: [PATCH 1/6] service: reduce statistics overhead for parallel services
> 
> Move the statistics from the service data structure to the per-lcore
> struct. This eliminates contention for the counter cache lines, which
> decreases the producer-side statistics overhead for services deployed
> across many lcores.

Agree with the approach, good stuff. One comment below, but no changes
necessary.

> Prior to this patch, enabling statistics for a service with a
> per-service function call latency of 1000 clock cycles deployed across
> 16 cores on a Intel Xeon 6230N @ 2,3 GHz would incur a cost of ~10000
> core clock cycles per service call. After this patch, the statistics
> overhead is reduce to 22 clock cycles per call.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Tested by applying the two patch fixes first (http://patches.dpdk.org/project/dpdk/list/?series=23959)
and this patchset from Mattias.

Acked-by: Harry van Haaren <harry.van.haaren@intel.com>

<snip>

>  static uint32_t rte_service_count;
> @@ -138,13 +130,16 @@ rte_service_finalize(void)
>  	rte_service_library_initialized = 0;
>  }
> 
> -/* returns 1 if service is registered and has not been unregistered
> - * Returns 0 if service never registered, or has been unregistered
> - */
> -static inline int
> +static inline bool
> +service_registered(uint32_t id)
> +{
> +	return rte_services[id].internal_flags & SERVICE_F_REGISTERED;
> +}

Before we had a !! to flip any value (e.g. binary 100 to just binary 1).
SERVICE_F_REGISTERED is binary 1, and internal_flags is u8, so it's fine as is.

<snip>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 4/6] service: tweak cycle statistics semantics
  2022-09-07  8:41       ` Morten Brørup
@ 2022-10-03 13:45         ` Van Haaren, Harry
  0 siblings, 0 replies; 43+ messages in thread
From: Van Haaren, Harry @ 2022-10-03 13:45 UTC (permalink / raw)
  To: Morten Brørup, mattias.ronnblom; +Cc: dev, Honnappa Nagarahalli, nd



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, September 7, 2022 9:41 AM
> To: mattias.ronnblom <mattias.ronnblom@ericsson.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd
> <nd@arm.com>
> Subject: RE: [PATCH 4/6] service: tweak cycle statistics semantics
> 
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Tuesday, 6 September 2022 18.14
> >
> > As a part of its service function, a service usually polls some kind
> > of source (e.g., an RX queue, a ring, an eventdev port, or a timer
> > wheel) to retrieve one or more items of work.
> >
> > In low-load situations, the service framework reports a significant
> > amount of cycles spent for all running services, despite the fact they
> > have performed little or no actual work.
> >
> > The per-call cycle expenditure for an idle service (i.e., a service
> > currently without pending jobs) is typically very low. Polling an
> > empty ring or RX queue is inexpensive. However, since the service
> > function call frequency on an idle or lightly loaded lcore is going to
> > be very high indeed, the service function calls' cycles adds up to a
> > significant amount. The only thing preventing the idle services'
> > cycles counters to make up 100% of the available CPU cycles is the
> > overhead of the service framework itself.
> >
> > If the RTE_SERVICE_ATTR_CYCLES or RTE_SERVICE_LCORE_ATTR_CYCLES are
> > used to estimate service core load, the cores may look very busy when
> > the system is mostly doing nothing useful at all.
> >
> > This patch allows for an idle service to indicate that no actual work
> > was performed during a particular service function call (by returning
> > -EAGAIN). In such cases the RTE_SERVICE_ATTR_CYCLES and
> > RTE_SERVICE_LCORE_ATTR_CYCLES values are not incremented.
> >
> > The convention of returning -EAGAIN for idle services may in the
> > future also be used to have the lcore enter a short sleep, or reduce
> > its operating frequency, in case all services are currently idle.
> >
> > This change is backward-compatible.
> >
> > Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > ---
> 
> This entire series contains a bunch of good improvements.
> 
> Returning -EAGAIN is a step in the right direction towards measuring CPU usage, and
> a great way to make it backwards compatible.
> 
> Series-Acked-by: Morten Brørup <mb@smartsharesystems.com>

Agreed, thanks Mattias for authoring & Morten for review/ack;

I've left 2 minor comments on individual patches, but for the remaining 4 patches;
Series-Acked-by: Harry van Haaren <harry.van.haaren@intel.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/6] service: reduce average case service core overhead
  2022-10-03 13:33       ` Van Haaren, Harry
@ 2022-10-03 14:32         ` Mattias Rönnblom
  0 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-03 14:32 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd

On 2022-10-03 15:33, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Sent: Tuesday, September 6, 2022 5:14 PM
>> To: Van; Haaren; Van Haaren, Harry <harry.van.haaren@intel.com>
>> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
>> Morten Brørup <mb@smartsharesystems.com>; nd <nd@arm.com>;
>> mattias.ronnblom <mattias.ronnblom@ericsson.com>
>> Subject: [PATCH 3/6] service: reduce average case service core overhead
>>
>> Optimize service loop so that the starting point is the lowest-indexed
>> service mapped to the lcore in question, and terminate the loop at the
>> highest-indexed service.
>>
>> While the worst case latency remains the same, this patch
>> significantly reduces the service framework overhead for the average
>> case. In particular, scenarios where an lcore only runs a single
>> service, or multiple services which id values are close (e.g., three
>> services with ids 17, 18 and 22), show significant improvements.
>>
>> The worst case is where the lcore has two services mapped to it: one
>> with service id 0 and the other with id 63.
> 
> I like the optimization - nice work. There is one caveat, that with the
> builtin_ctz() call, RTE_SERVICE_NUM_MAX *must* be 64 or lower.
> Today it is defined as 64, but we must ensure that this value cannot
> be changed "by accident" without explicit compilation failures and a
> comment explaining that fact.
> 
> There are likely options around making it runtime-dynamic, but I don't
> think the complexity is justified: suggest we use compile-time check
> BUILD_BUG_ON() and error if its > 64?
> 

Sounds like a good idea. The limitation is not new though; the use of 
a uint64_t-based bitmask limits the number of services to 64 already.
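
For illustration, a minimal standalone sketch (not DPDK code) of the
bounds computation being discussed, reusing the three-service example
from the commit message; it assumes a 64-bit unsigned long, as the
patch itself does:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical mask: services 17, 18 and 22 mapped to this lcore. */
	uint64_t service_mask = (UINT64_C(1) << 17) | (UINT64_C(1) << 18) |
				(UINT64_C(1) << 22);

	unsigned int start_id = __builtin_ctzl(service_mask);    /* 17 */
	unsigned int end_id = 64 - __builtin_clzl(service_mask); /* 23 */

	printf("scan ids [%u, %u): %u slots instead of 64\n",
	       start_id, end_id, end_id - start_id);
	return 0;
}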

> Note in rte_service_component_register(), we *re-use* IDs when they
> become available, so we can have up to 64 active services at a time, but
> they can register/unregister more times than that. This is a very unlikely
> usage of the services API to continually register-unregister services.
> 
> With the BUILD_BUG_ON() around the 64 MAX value with a comment:
> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
Thanks for your reviews Harry.

> 
>> On a service lcore serving a single service, the service loop overhead
>> is reduced from ~190 core clock cycles to ~46. (On an Intel Cascade
>> Lake generation Xeon.) On weakly ordered CPUs, the gain is larger,
>> since the loop included load-acquire atomic operations.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   lib/eal/common/rte_service.c | 14 ++++++++++----
>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
>> index 87df04e3ac..4cac866792 100644
>> --- a/lib/eal/common/rte_service.c
>> +++ b/lib/eal/common/rte_service.c
>> @@ -464,7 +464,6 @@ static int32_t
>>   service_runner_func(void *arg)
>>   {
>>   	RTE_SET_USED(arg);
>> -	uint32_t i;
>>   	const int lcore = rte_lcore_id();
>>   	struct core_state *cs = &lcore_states[lcore];
>>
>> @@ -478,10 +477,17 @@ service_runner_func(void *arg)
>>   			RUNSTATE_RUNNING) {
>>
>>   		const uint64_t service_mask = cs->service_mask;
>> +		uint8_t start_id;
>> +		uint8_t end_id;
>> +		uint8_t i;
>>
>> -		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
>> -			if (!service_registered(i))
>> -				continue;
>> +		if (service_mask == 0)
>> +			continue;
>> +
>> +		start_id = __builtin_ctzl(service_mask);
>> +		end_id = 64 - __builtin_clzl(service_mask);
>> +
>> +		for (i = start_id; i < end_id; i++) {
>>   			/* return value ignored as no change to code flow */
>>   			service_run(i, cs, service_mask, service_get(i), 1);
>>   		}
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/6] service: reduce statistics overhead for parallel services
  2022-10-03 13:33     ` Van Haaren, Harry
@ 2022-10-03 14:37       ` Mattias Rönnblom
  0 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-03 14:37 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd

On 2022-10-03 15:33, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Sent: Tuesday, September 6, 2022 5:14 PM
>> To: Van; Haaren; Van Haaren, Harry <harry.van.haaren@intel.com>
>> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
>> Morten Brørup <mb@smartsharesystems.com>; nd <nd@arm.com>;
>> mattias.ronnblom <mattias.ronnblom@ericsson.com>
>> Subject: [PATCH 1/6] service: reduce statistics overhead for parallel services
>>
>> Move the statistics from the service data structure to the per-lcore
>> struct. This eliminates contention for the counter cache lines, which
>> decreases the producer-side statistics overhead for services deployed
>> across many lcores.
> 
> Agree with the approach, good stuff. One comment below, but no changes
> necessary.
> 
>> Prior to this patch, enabling statistics for a service with a
>> per-service function call latency of 1000 clock cycles deployed across
>> 16 cores on a Intel Xeon 6230N @ 2,3 GHz would incur a cost of ~10000
>> core clock cycles per service call. After this patch, the statistics
>> overhead is reduce to 22 clock cycles per call.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> 
> Tested by applying the two patch fixes first (http://patches.dpdk.org/project/dpdk/list/?series=23959)
> and this patchset from Mattias.
> 
> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
> 
> <snip>
> 
>>   static uint32_t rte_service_count;
>> @@ -138,13 +130,16 @@ rte_service_finalize(void)
>>   	rte_service_library_initialized = 0;
>>   }
>>
>> -/* returns 1 if service is registered and has not been unregistered
>> - * Returns 0 if service never registered, or has been unregistered
>> - */
>> -static inline int
>> +static inline bool
>> +service_registered(uint32_t id)
>> +{
>> +	return rte_services[id].internal_flags & SERVICE_F_REGISTERED;
>> +}
> 
> Before we had a !! to flip any value (e.g. binary 100 to just binary 1).
> SERVICE_F_REGISTERED is binary 1,and internal_flags is u8, so its fine as is.
> 
> <snip>
> 

Since the function now returns a bool, it does not matter what size the 
internal_flags field has or which of its bits SERVICE_F_REGISTERED 
refers to.
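
A minimal standalone sketch of that point (not DPDK code; the flag
value is made up and deliberately not bit 0): the implicit conversion
to bool normalises the test, whereas an int-returning variant still
needs the !! idiom.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define F_REGISTERED (1 << 4)	/* hypothetical flag, not bit 0 */

static bool
registered_bool(uint8_t flags)
{
	return flags & F_REGISTERED;	/* 0x10 converts to true */
}

static int
registered_int(uint8_t flags)
{
	return !!(flags & F_REGISTERED);	/* !! needed to yield 0/1 */
}

int main(void)
{
	assert(registered_bool(0x10));
	assert(registered_int(0x10) == 1);
	return 0;
}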



^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 0/6] Service cores performance and statistics improvements
  2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
                       ` (6 preceding siblings ...)
  2022-10-03 13:33     ` Van Haaren, Harry
@ 2022-10-05  9:16     ` Mattias Rönnblom
  2022-10-05  9:16       ` [PATCH v2 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
                         ` (7 more replies)
  7 siblings, 8 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-05  9:16 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, hofors,
	Mattias Rönnblom

This series contains performance improvements and new statistics-
related functionality for the EAL service cores framework.

A new per-lcore TSC cycle counter is introduced, which reflects the
total amount of cycles spent by that lcore running services. This may
be used to estimate service lcore load.

The patchset introduces a backward-compatible convention, where a DPDK
service may signal to the framework that no useful work was performed,
which in turn is used to make the busy cycles statistics more
accurate.
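
As a usage sketch (assuming the RTE_SERVICE_LCORE_ATTR_CYCLES attribute
added later in this series, plus the existing
rte_service_lcore_attr_get(), rte_get_tsc_cycles() and rte_delay_ms()
helpers; the lcore id and sampling interval are arbitrary), lcore load
could be estimated by sampling the busy-cycle counter over a wall-clock
interval:

#include <stdint.h>
#include <stdio.h>

#include <rte_cycles.h>
#include <rte_service.h>

static void
print_service_lcore_load(uint32_t lcore_id, unsigned int interval_ms)
{
	uint64_t busy0, busy1, t0, t1;

	if (rte_service_lcore_attr_get(lcore_id,
			RTE_SERVICE_LCORE_ATTR_CYCLES, &busy0) != 0)
		return;
	t0 = rte_get_tsc_cycles();

	rte_delay_ms(interval_ms);

	if (rte_service_lcore_attr_get(lcore_id,
			RTE_SERVICE_LCORE_ATTR_CYCLES, &busy1) != 0)
		return;
	t1 = rte_get_tsc_cycles();

	/* busy TSC cycles divided by elapsed TSC cycles over the interval */
	printf("lcore %u service load: %.1f%%\n", lcore_id,
	       100.0 * (busy1 - busy0) / (t1 - t0));
}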

Depends-on: series-23959 ("test/service: add perf measurements for with stats mode ")

Mattias Rönnblom (6):
  service: reduce statistics overhead for parallel services
  service: introduce per-lcore cycles counter
  service: reduce average case service core overhead
  service: tweak cycle statistics semantics
  event/sw: report idle when no work is performed
  service: provide links to functions in documentation

 app/test/test_service_cores.c           |   2 +-
 drivers/event/sw/sw_evdev.c             |   3 +-
 drivers/event/sw/sw_evdev.h             |   2 +-
 drivers/event/sw/sw_evdev_scheduler.c   |   6 +-
 lib/eal/common/rte_service.c            | 228 +++++++++++++++++-------
 lib/eal/include/rte_service.h           |  32 ++--
 lib/eal/include/rte_service_component.h |   5 +
 7 files changed, 192 insertions(+), 86 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 1/6] service: reduce statistics overhead for parallel services
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
@ 2022-10-05  9:16       ` Mattias Rönnblom
  2022-10-05  9:16       ` [PATCH v2 2/6] service: introduce per-lcore cycles counter Mattias Rönnblom
                         ` (6 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-05  9:16 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, hofors,
	Mattias Rönnblom

Move the statistics from the service data structure to the per-lcore
struct. This eliminates contention for the counter cache lines, which
decreases the producer-side statistics overhead for services deployed
across many lcores.

Prior to this patch, enabling statistics for a service with a
per-service function call latency of 1000 clock cycles deployed across
16 cores on an Intel Xeon 6230N @ 2.3 GHz would incur a cost of ~10000
core clock cycles per service call. After this patch, the statistics
overhead is reduced to 22 clock cycles per call.
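
For reference, a short usage sketch (the service id is assumed to come
from an earlier registration or rte_service_get_by_name() lookup) of
reading the counters that this patch aggregates across lcores; the
calls are the existing rte_service_attr_get() ones and are unchanged:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#include <rte_service.h>

static void
print_service_stats(uint32_t service_id)
{
	uint64_t calls, cycles;

	if (rte_service_attr_get(service_id, RTE_SERVICE_ATTR_CALL_COUNT,
				 &calls) != 0 ||
	    rte_service_attr_get(service_id, RTE_SERVICE_ATTR_CYCLES,
				 &cycles) != 0)
		return;

	printf("service %u: %" PRIu64 " calls, %" PRIu64 " cycles, avg %" PRIu64 "\n",
	       service_id, calls, cycles, calls ? cycles / calls : 0);
}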

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

--

v2:
* Introduce a service stats struct which holds per-lcore service stats
  state, to improve spatial locality.
---
 lib/eal/common/rte_service.c | 190 +++++++++++++++++++++++------------
 1 file changed, 128 insertions(+), 62 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 94cb056196..6b807afacf 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -50,16 +50,12 @@ struct rte_service_spec_impl {
 	 * on currently.
 	 */
 	uint32_t num_mapped_cores;
-
-	/* 32-bit builds won't naturally align a uint64_t, so force alignment,
-	 * allowing regular reads to be atomic.
-	 */
-	uint64_t calls __rte_aligned(8);
-	uint64_t cycles_spent __rte_aligned(8);
 } __rte_cache_aligned;
 
-/* Mask used to ensure uint64_t 8 byte vars are naturally aligned. */
-#define RTE_SERVICE_STAT_ALIGN_MASK (8 - 1)
+struct service_stats {
+	uint64_t calls;
+	uint64_t cycles;
+};
 
 /* the internal values of a service core */
 struct core_state {
@@ -70,7 +66,7 @@ struct core_state {
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
-	uint64_t calls_per_service[RTE_SERVICE_NUM_MAX];
+	struct service_stats service_stats[RTE_SERVICE_NUM_MAX];
 } __rte_cache_aligned;
 
 static uint32_t rte_service_count;
@@ -138,13 +134,16 @@ rte_service_finalize(void)
 	rte_service_library_initialized = 0;
 }
 
-/* returns 1 if service is registered and has not been unregistered
- * Returns 0 if service never registered, or has been unregistered
- */
-static inline int
+static inline bool
+service_registered(uint32_t id)
+{
+	return rte_services[id].internal_flags & SERVICE_F_REGISTERED;
+}
+
+static inline bool
 service_valid(uint32_t id)
 {
-	return !!(rte_services[id].internal_flags & SERVICE_F_REGISTERED);
+	return id < RTE_SERVICE_NUM_MAX && service_registered(id);
 }
 
 static struct rte_service_spec_impl *
@@ -155,7 +154,7 @@ service_get(uint32_t id)
 
 /* validate ID and retrieve service pointer, or return error value */
 #define SERVICE_VALID_GET_OR_ERR_RET(id, service, retval) do {          \
-	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))            \
+	if (!service_valid(id))                                         \
 		return retval;                                          \
 	service = &rte_services[id];                                    \
 } while (0)
@@ -217,7 +216,7 @@ rte_service_get_by_name(const char *name, uint32_t *service_id)
 
 	int i;
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (service_valid(i) &&
+		if (service_registered(i) &&
 				strcmp(name, rte_services[i].spec.name) == 0) {
 			*service_id = i;
 			return 0;
@@ -254,7 +253,7 @@ rte_service_component_register(const struct rte_service_spec *spec,
 		return -EINVAL;
 
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i)) {
+		if (!service_registered(i)) {
 			free_slot = i;
 			break;
 		}
@@ -366,29 +365,27 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 {
 	void *userdata = s->spec.callback_userdata;
 
-	/* Ensure the atomically stored variables are naturally aligned,
-	 * as required for regular loads to be atomic.
-	 */
-	RTE_BUILD_BUG_ON((offsetof(struct rte_service_spec_impl, calls)
-		& RTE_SERVICE_STAT_ALIGN_MASK) != 0);
-	RTE_BUILD_BUG_ON((offsetof(struct rte_service_spec_impl, cycles_spent)
-		& RTE_SERVICE_STAT_ALIGN_MASK) != 0);
-
 	if (service_stats_enabled(s)) {
 		uint64_t start = rte_rdtsc();
 		s->spec.callback(userdata);
 		uint64_t end = rte_rdtsc();
 		uint64_t cycles = end - start;
-		cs->calls_per_service[service_idx]++;
-		if (service_mt_safe(s)) {
-			__atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED);
-			__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
-		} else {
-			uint64_t cycles_new = s->cycles_spent + cycles;
-			uint64_t calls_new = s->calls++;
-			__atomic_store_n(&s->cycles_spent, cycles_new, __ATOMIC_RELAXED);
-			__atomic_store_n(&s->calls, calls_new, __ATOMIC_RELAXED);
-		}
+
+		/* The lcore service worker thread is the only writer,
+		 * and thus only a non-atomic load and an atomic store
+		 * is needed, and not the more expensive atomic
+		 * add.
+		 */
+		struct service_stats *service_stats =
+			&cs->service_stats[service_idx];
+
+		__atomic_store_n(&service_stats->calls,
+				 service_stats->calls + 1,
+				 __ATOMIC_RELAXED);
+
+		__atomic_store_n(&service_stats->cycles,
+				 service_stats->cycles + cycles,
+				 __ATOMIC_RELAXED);
 	} else
 		s->spec.callback(userdata);
 }
@@ -436,7 +433,7 @@ rte_service_may_be_active(uint32_t id)
 	int32_t lcore_count = rte_service_lcore_list(ids, RTE_MAX_LCORE);
 	int i;
 
-	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+	if (!service_valid(id))
 		return -EINVAL;
 
 	for (i = 0; i < lcore_count; i++) {
@@ -483,16 +480,17 @@ service_runner_func(void *arg)
 	 */
 	while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
 			RUNSTATE_RUNNING) {
+
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-			if (!service_valid(i))
+			if (!service_registered(i))
 				continue;
 			/* return value ignored as no change to code flow */
 			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
-		cs->loops++;
+		__atomic_store_n(&cs->loops, cs->loops + 1, __ATOMIC_RELAXED);
 	}
 
 	/* Use SEQ CST memory ordering to avoid any re-ordering around
@@ -608,8 +606,8 @@ static int32_t
 service_update(uint32_t sid, uint32_t lcore, uint32_t *set, uint32_t *enabled)
 {
 	/* validate ID, or return error value */
-	if (sid >= RTE_SERVICE_NUM_MAX || !service_valid(sid) ||
-	    lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
+	if (!service_valid(sid) || lcore >= RTE_MAX_LCORE ||
+	    !lcore_states[lcore].is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
@@ -813,21 +811,75 @@ rte_service_lcore_stop(uint32_t lcore)
 	return 0;
 }
 
+static uint64_t
+lcore_attr_get_loops(unsigned int lcore)
+{
+	struct core_state *cs = &lcore_states[lcore];
+
+	return __atomic_load_n(&cs->loops, __ATOMIC_RELAXED);
+}
+
+static uint64_t
+lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
+{
+	struct core_state *cs = &lcore_states[lcore];
+
+	return __atomic_load_n(&cs->service_stats[service_id].calls,
+			       __ATOMIC_RELAXED);
+}
+
+static uint64_t
+lcore_attr_get_service_cycles(uint32_t service_id, unsigned int lcore)
+{
+	struct core_state *cs = &lcore_states[lcore];
+
+	return __atomic_load_n(&cs->service_stats[service_id].cycles,
+			       __ATOMIC_RELAXED);
+}
+
+typedef uint64_t (*lcore_attr_get_fun)(uint32_t service_id,
+				       unsigned int lcore);
+
+static uint64_t
+attr_get(uint32_t id, lcore_attr_get_fun lcore_attr_get)
+{
+	unsigned int lcore;
+	uint64_t sum = 0;
+
+	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++)
+		if (lcore_states[lcore].is_service_core)
+			sum += lcore_attr_get(id, lcore);
+
+	return sum;
+}
+
+static uint64_t
+attr_get_service_calls(uint32_t service_id)
+{
+	return attr_get(service_id, lcore_attr_get_service_calls);
+}
+
+static uint64_t
+attr_get_service_cycles(uint32_t service_id)
+{
+	return attr_get(service_id, lcore_attr_get_service_cycles);
+}
+
 int32_t
 rte_service_attr_get(uint32_t id, uint32_t attr_id, uint64_t *attr_value)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	if (!service_valid(id))
+		return -EINVAL;
 
 	if (!attr_value)
 		return -EINVAL;
 
 	switch (attr_id) {
-	case RTE_SERVICE_ATTR_CYCLES:
-		*attr_value = s->cycles_spent;
-		return 0;
 	case RTE_SERVICE_ATTR_CALL_COUNT:
-		*attr_value = s->calls;
+		*attr_value = attr_get_service_calls(id);
+		return 0;
+	case RTE_SERVICE_ATTR_CYCLES:
+		*attr_value = attr_get_service_cycles(id);
 		return 0;
 	default:
 		return -EINVAL;
@@ -849,7 +901,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 
 	switch (attr_id) {
 	case RTE_SERVICE_LCORE_ATTR_LOOPS:
-		*attr_value = cs->loops;
+		*attr_value = lcore_attr_get_loops(lcore);
 		return 0;
 	default:
 		return -EINVAL;
@@ -859,11 +911,17 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 int32_t
 rte_service_attr_reset_all(uint32_t id)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	unsigned int lcore;
+
+	if (!service_valid(id))
+		return -EINVAL;
+
+	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
+		struct core_state *cs = &lcore_states[lcore];
+
+		cs->service_stats[id] = (struct service_stats) {};
+	}
 
-	s->cycles_spent = 0;
-	s->calls = 0;
 	return 0;
 }
 
@@ -885,17 +943,25 @@ rte_service_lcore_attr_reset_all(uint32_t lcore)
 }
 
 static void
-service_dump_one(FILE *f, struct rte_service_spec_impl *s)
+service_dump_one(FILE *f, uint32_t id)
 {
+	struct rte_service_spec_impl *s;
+	uint64_t service_calls;
+	uint64_t service_cycles;
+
+	service_calls = attr_get_service_calls(id);
+	service_cycles = attr_get_service_cycles(id);
+
 	/* avoid divide by zero */
-	int calls = 1;
+	if (service_calls == 0)
+		service_calls = 1;
+
+	s = service_get(id);
 
-	if (s->calls != 0)
-		calls = s->calls;
 	fprintf(f, "  %s: stats %d\tcalls %"PRIu64"\tcycles %"
 			PRIu64"\tavg: %"PRIu64"\n",
-			s->spec.name, service_stats_enabled(s), s->calls,
-			s->cycles_spent, s->cycles_spent / calls);
+			s->spec.name, service_stats_enabled(s), service_calls,
+			service_cycles, service_cycles / service_calls);
 }
 
 static void
@@ -906,9 +972,9 @@ service_dump_calls_per_lcore(FILE *f, uint32_t lcore)
 
 	fprintf(f, "%02d\t", lcore);
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i))
+		if (!service_registered(i))
 			continue;
-		fprintf(f, "%"PRIu64"\t", cs->calls_per_service[i]);
+		fprintf(f, "%"PRIu64"\t", cs->service_stats[i].calls);
 	}
 	fprintf(f, "\n");
 }
@@ -924,16 +990,16 @@ rte_service_dump(FILE *f, uint32_t id)
 		struct rte_service_spec_impl *s;
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
-		service_dump_one(f, s);
+		service_dump_one(f, id);
 		return 0;
 	}
 
 	/* print all services, as UINT32_MAX was passed as id */
 	fprintf(f, "Services Summary\n");
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i))
+		if (!service_registered(i))
 			continue;
-		service_dump_one(f, &rte_services[i]);
+		service_dump_one(f, i);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.34.1
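
The single-writer counter update in the hunk above can be illustrated
with a standalone C11 sketch (plain stdatomic rather than the GCC
__atomic builtins used by DPDK): only the owning thread writes, so a
relaxed load plus a relaxed store suffices, and no atomic
read-modify-write is needed.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t calls;

/* Writer side: called only from the single owning thread. */
static void
record_call(void)
{
	uint64_t v = atomic_load_explicit(&calls, memory_order_relaxed);

	atomic_store_explicit(&calls, v + 1, memory_order_relaxed);
}

/* Reader side: may run on any thread; always sees a tear-free,
 * possibly slightly stale value. */
static uint64_t
read_calls(void)
{
	return atomic_load_explicit(&calls, memory_order_relaxed);
}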


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 2/6] service: introduce per-lcore cycles counter
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
  2022-10-05  9:16       ` [PATCH v2 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
@ 2022-10-05  9:16       ` Mattias Rönnblom
  2022-10-05  9:16       ` [PATCH v2 3/6] service: reduce average case service core overhead Mattias Rönnblom
                         ` (5 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-05  9:16 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, hofors,
	Mattias Rönnblom

Introduce a per-lcore counter for the total time spent on processing
services on that core.

This counter is useful when measuring individual lcore load.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/test_service_cores.c |  2 +-
 lib/eal/common/rte_service.c  | 15 +++++++++++++++
 lib/eal/include/rte_service.h |  6 ++++++
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index 7415b6b686..096405133b 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -403,7 +403,7 @@ service_lcore_attr_get(void)
 			"lcore_attr_get() failed to get loops "
 			"(expected > zero)");
 
-	lcore_attr_id++;  // invalid lcore attr id
+	lcore_attr_id = 42; /* invalid lcore attr id */
 	TEST_ASSERT_EQUAL(-EINVAL, rte_service_lcore_attr_get(slcore_id,
 			lcore_attr_id, &lcore_attr_value),
 			"Invalid lcore attr didn't return -EINVAL");
diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 6b807afacf..4d51de638d 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -66,6 +66,7 @@ struct core_state {
 	uint8_t is_service_core; /* set if core is currently a service core */
 	uint8_t service_active_on_lcore[RTE_SERVICE_NUM_MAX];
 	uint64_t loops;
+	uint64_t cycles;
 	struct service_stats service_stats[RTE_SERVICE_NUM_MAX];
 } __rte_cache_aligned;
 
@@ -376,6 +377,9 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 		 * is needed, and not the more expensive atomic
 		 * add.
 		 */
+		__atomic_store_n(&cs->cycles, cs->cycles + cycles,
+				 __ATOMIC_RELAXED);
+
 		struct service_stats *service_stats =
 			&cs->service_stats[service_idx];
 
@@ -819,6 +823,14 @@ lcore_attr_get_loops(unsigned int lcore)
 	return __atomic_load_n(&cs->loops, __ATOMIC_RELAXED);
 }
 
+static uint64_t
+lcore_attr_get_cycles(unsigned int lcore)
+{
+	struct core_state *cs = &lcore_states[lcore];
+
+	return __atomic_load_n(&cs->cycles, __ATOMIC_RELAXED);
+}
+
 static uint64_t
 lcore_attr_get_service_calls(uint32_t service_id, unsigned int lcore)
 {
@@ -903,6 +915,9 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 	case RTE_SERVICE_LCORE_ATTR_LOOPS:
 		*attr_value = lcore_attr_get_loops(lcore);
 		return 0;
+	case RTE_SERVICE_LCORE_ATTR_CYCLES:
+		*attr_value = lcore_attr_get_cycles(lcore);
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/lib/eal/include/rte_service.h b/lib/eal/include/rte_service.h
index 35d8018684..70deb6e53a 100644
--- a/lib/eal/include/rte_service.h
+++ b/lib/eal/include/rte_service.h
@@ -407,6 +407,12 @@ int32_t rte_service_attr_reset_all(uint32_t id);
  */
 #define RTE_SERVICE_LCORE_ATTR_LOOPS 0
 
+/**
+ * Returns the total number of cycles that the lcore has spent on
+ * running services.
+ */
+#define RTE_SERVICE_LCORE_ATTR_CYCLES 1
+
 /**
  * Get an attribute from a service core.
  *
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 3/6] service: reduce average case service core overhead
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
  2022-10-05  9:16       ` [PATCH v2 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
  2022-10-05  9:16       ` [PATCH v2 2/6] service: introduce per-lcore cycles counter Mattias Rönnblom
@ 2022-10-05  9:16       ` Mattias Rönnblom
  2022-10-05  9:16       ` [PATCH v2 4/6] service: tweak cycle statistics semantics Mattias Rönnblom
                         ` (4 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-05  9:16 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, hofors,
	Mattias Rönnblom

Optimize service loop so that the starting point is the lowest-indexed
service mapped to the lcore in question, and terminate the loop at the
highest-indexed service.

While the worst case latency remains the same, this patch
significantly reduces the service framework overhead for the average
case. In particular, scenarios where an lcore only runs a single
service, or multiple services whose id values are close (e.g., three
services with ids 17, 18 and 22), show significant improvements.

The worst case is where the lcore has two services mapped to it: one
with service id 0 and the other with id 63.

On a service lcore serving a single service, the service loop overhead
is reduced from ~190 core clock cycles to ~46, on an Intel Cascade
Lake generation Xeon. On weakly ordered CPUs, the gain is larger,
since the loop included load-acquire atomic operations.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

---

v2: Added build-time assertion to prevent the maximum number of
    services to accidentally be changed to a higher value than
    the implementation supports. (Harry van Haaren)
---
 lib/eal/common/rte_service.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 4d51de638d..035c36b8bb 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -78,6 +78,11 @@ static uint32_t rte_service_library_initialized;
 int32_t
 rte_service_init(void)
 {
+	/* Hard limit due to the use of an uint64_t-based bitmask (and the
+	 * clzl intrinsic).
+	 */
+	RTE_BUILD_BUG_ON(RTE_SERVICE_NUM_MAX > 64);
+
 	if (rte_service_library_initialized) {
 		RTE_LOG(NOTICE, EAL,
 			"service library init() called, init flag %d\n",
@@ -472,7 +477,6 @@ static int32_t
 service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
-	uint32_t i;
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
@@ -486,10 +490,17 @@ service_runner_func(void *arg)
 			RUNSTATE_RUNNING) {
 
 		const uint64_t service_mask = cs->service_mask;
+		uint8_t start_id;
+		uint8_t end_id;
+		uint8_t i;
 
-		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-			if (!service_registered(i))
-				continue;
+		if (service_mask == 0)
+			continue;
+
+		start_id = __builtin_ctzl(service_mask);
+		end_id = 64 - __builtin_clzl(service_mask);
+
+		for (i = start_id; i < end_id; i++) {
 			/* return value ignored as no change to code flow */
 			service_run(i, cs, service_mask, service_get(i), 1);
 		}
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 4/6] service: tweak cycle statistics semantics
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
                         ` (2 preceding siblings ...)
  2022-10-05  9:16       ` [PATCH v2 3/6] service: reduce average case service core overhead Mattias Rönnblom
@ 2022-10-05  9:16       ` Mattias Rönnblom
  2022-10-05  9:16       ` [PATCH v2 5/6] event/sw: report idle when no work is performed Mattias Rönnblom
                         ` (3 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-05  9:16 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, hofors,
	Mattias Rönnblom

As a part of its service function, a service usually polls some kind
of source (e.g., an RX queue, a ring, an eventdev port, or a timer
wheel) to retrieve one or more items of work.

In low-load situations, the service framework reports a significant
amount of cycles spent for all running services, despite the fact that
they have performed little or no actual work.

The per-call cycle expenditure for an idle service (i.e., a service
currently without pending jobs) is typically very low. Polling an
empty ring or RX queue is inexpensive. However, since the service
function call frequency on an idle or lightly loaded lcore is going to
be very high indeed, the service function calls' cycles add up to a
significant amount. The only thing preventing the idle services'
cycles counters from making up 100% of the available CPU cycles is the
overhead of the service framework itself.

If the RTE_SERVICE_ATTR_CYCLES or RTE_SERVICE_LCORE_ATTR_CYCLES are
used to estimate service core load, the cores may look very busy when
the system is mostly doing nothing useful at all.

This patch allows for an idle service to indicate that no actual work
was performed during a particular service function call (by returning
-EAGAIN). In such cases the RTE_SERVICE_ATTR_CYCLES and
RTE_SERVICE_LCORE_ATTR_CYCLES values are not incremented.

The convention of returning -EAGAIN for idle services may in the
future also be used to have the lcore enter a short sleep, or reduce
its operating frequency, in case all services are currently idle.

This change is backward-compatible.
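
A minimal sketch of a service callback following this convention (the
ring and the burst size are hypothetical, not part of the patch):

#include <errno.h>
#include <stdint.h>

#include <rte_ring.h>

static int32_t
ring_poll_service(void *args)
{
	struct rte_ring *r = args;
	void *objs[32];
	unsigned int n = rte_ring_dequeue_burst(r, objs, 32, NULL);

	if (n == 0)
		return -EAGAIN;	/* idle: these cycles are not counted */

	/* ... process the n dequeued objects ... */

	return 0;
}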

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c            | 26 +++++++++++++------------
 lib/eal/include/rte_service_component.h |  5 +++++
 2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 035c36b8bb..718371814f 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -10,6 +10,7 @@
 #include <rte_service_component.h>
 
 #include <rte_lcore.h>
+#include <rte_branch_prediction.h>
 #include <rte_common.h>
 #include <rte_cycles.h>
 #include <rte_atomic.h>
@@ -373,28 +374,29 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 
 	if (service_stats_enabled(s)) {
 		uint64_t start = rte_rdtsc();
-		s->spec.callback(userdata);
-		uint64_t end = rte_rdtsc();
-		uint64_t cycles = end - start;
+		int rc = s->spec.callback(userdata);
 
 		/* The lcore service worker thread is the only writer,
 		 * and thus only a non-atomic load and an atomic store
 		 * is needed, and not the more expensive atomic
 		 * add.
 		 */
-		__atomic_store_n(&cs->cycles, cs->cycles + cycles,
-				 __ATOMIC_RELAXED);
-
 		struct service_stats *service_stats =
 			&cs->service_stats[service_idx];
 
-		__atomic_store_n(&service_stats->calls,
-				 service_stats->calls + 1,
-				 __ATOMIC_RELAXED);
+		if (likely(rc != -EAGAIN)) {
+			uint64_t end = rte_rdtsc();
+			uint64_t cycles = end - start;
 
-		__atomic_store_n(&service_stats->cycles,
-				 service_stats->cycles + cycles,
-				 __ATOMIC_RELAXED);
+			__atomic_store_n(&cs->cycles, cs->cycles + cycles,
+					 __ATOMIC_RELAXED);
+			__atomic_store_n(&service_stats->cycles,
+					 service_stats->cycles + cycles,
+					 __ATOMIC_RELAXED);
+		}
+
+		__atomic_store_n(&service_stats->calls,
+				 service_stats->calls + 1, __ATOMIC_RELAXED);
 	} else
 		s->spec.callback(userdata);
 }
diff --git a/lib/eal/include/rte_service_component.h b/lib/eal/include/rte_service_component.h
index 9e66ee7e29..9be49d698a 100644
--- a/lib/eal/include/rte_service_component.h
+++ b/lib/eal/include/rte_service_component.h
@@ -19,6 +19,11 @@ extern "C" {
 
 /**
  * Signature of callback function to run a service.
+ *
+ * A service function call resulting in no actual work being
+ * performed should return -EAGAIN. In that case, the (presumably few)
+ * cycles spent will not be counted toward the lcore or service-level
+ * cycles attributes.
  */
 typedef int32_t (*rte_service_func)(void *args);
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 5/6] event/sw: report idle when no work is performed
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
                         ` (3 preceding siblings ...)
  2022-10-05  9:16       ` [PATCH v2 4/6] service: tweak cycle statistics semantics Mattias Rönnblom
@ 2022-10-05  9:16       ` Mattias Rönnblom
  2022-10-05  9:16       ` [PATCH v2 6/6] service: provide links to functions in documentation Mattias Rönnblom
                         ` (2 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-05  9:16 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, hofors,
	Mattias Rönnblom

Have the SW event device conform to the service core convention, where
-EAGAIN is returned in case no work was performed.

Prior to this patch, for an idle SW event device, a service lcore load
estimate based on RTE_SERVICE_ATTR_CYCLES would suggest 48% core
load.

At 7% of its maximum capacity, the SW event device needs about 15% of
the available CPU cycles* to perform its duties, but
RTE_SERVICE_ATTR_CYCLES would suggest the SW service used 48% of the
service core.

After this change, load deduced from RTE_SERVICE_ATTR_CYCLES will only
be a minor overestimation of the actual cycles used.

* The SW scheduler becomes more efficient at higher loads.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 drivers/event/sw/sw_evdev.c           | 3 +--
 drivers/event/sw/sw_evdev.h           | 2 +-
 drivers/event/sw/sw_evdev_scheduler.c | 6 ++++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/event/sw/sw_evdev.c b/drivers/event/sw/sw_evdev.c
index bfa9469e29..3531821dd4 100644
--- a/drivers/event/sw/sw_evdev.c
+++ b/drivers/event/sw/sw_evdev.c
@@ -934,8 +934,7 @@ set_refill_once(const char *key __rte_unused, const char *value, void *opaque)
 static int32_t sw_sched_service_func(void *args)
 {
 	struct rte_eventdev *dev = args;
-	sw_event_schedule(dev);
-	return 0;
+	return sw_event_schedule(dev);
 }
 
 static int
diff --git a/drivers/event/sw/sw_evdev.h b/drivers/event/sw/sw_evdev.h
index 4fd1054470..8542b7d34d 100644
--- a/drivers/event/sw/sw_evdev.h
+++ b/drivers/event/sw/sw_evdev.h
@@ -295,7 +295,7 @@ uint16_t sw_event_enqueue_burst(void *port, const struct rte_event ev[],
 uint16_t sw_event_dequeue(void *port, struct rte_event *ev, uint64_t wait);
 uint16_t sw_event_dequeue_burst(void *port, struct rte_event *ev, uint16_t num,
 			uint64_t wait);
-void sw_event_schedule(struct rte_eventdev *dev);
+int32_t sw_event_schedule(struct rte_eventdev *dev);
 int sw_xstats_init(struct sw_evdev *dev);
 int sw_xstats_uninit(struct sw_evdev *dev);
 int sw_xstats_get_names(const struct rte_eventdev *dev,
diff --git a/drivers/event/sw/sw_evdev_scheduler.c b/drivers/event/sw/sw_evdev_scheduler.c
index 809a54d731..8bc21944f5 100644
--- a/drivers/event/sw/sw_evdev_scheduler.c
+++ b/drivers/event/sw/sw_evdev_scheduler.c
@@ -506,7 +506,7 @@ sw_schedule_pull_port_dir(struct sw_evdev *sw, uint32_t port_id)
 	return pkts_iter;
 }
 
-void
+int32_t
 sw_event_schedule(struct rte_eventdev *dev)
 {
 	struct sw_evdev *sw = sw_pmd_priv(dev);
@@ -517,7 +517,7 @@ sw_event_schedule(struct rte_eventdev *dev)
 
 	sw->sched_called++;
 	if (unlikely(!sw->started))
-		return;
+		return -EAGAIN;
 
 	do {
 		uint32_t in_pkts_this_iteration = 0;
@@ -610,4 +610,6 @@ sw_event_schedule(struct rte_eventdev *dev)
 	sw->sched_last_iter_bitmask = cqs_scheds_last_iter;
 	if (unlikely(sw->port_count >= 64))
 		sw->sched_last_iter_bitmask = UINT64_MAX;
+
+	return work_done ? 0 : -EAGAIN;
 }
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v2 6/6] service: provide links to functions in documentation
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
                         ` (4 preceding siblings ...)
  2022-10-05  9:16       ` [PATCH v2 5/6] event/sw: report idle when no work is performed Mattias Rönnblom
@ 2022-10-05  9:16       ` Mattias Rönnblom
  2022-10-05  9:49       ` [PATCH v2 0/6] Service cores performance and statistics improvements Morten Brørup
  2022-10-05 13:39       ` David Marchand
  7 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-05  9:16 UTC (permalink / raw)
  To: Van, Haaren, Harry
  Cc: dev, Honnappa Nagarahalli, Morten Brørup, nd, hofors,
	Mattias Rönnblom

Refer to API functions with parentheses, making doxygen create
hyperlinks.
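
For example (a made-up doc comment, not a hunk from this patch):

/* Before: emphasis only; doxygen renders plain text, no hyperlink. */
/** See *rte_service_lcore_add* for details. */

/* After: the trailing "()" makes doxygen emit a link to the function. */
/** See rte_service_lcore_add() for details. */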

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/include/rte_service.h | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/lib/eal/include/rte_service.h b/lib/eal/include/rte_service.h
index 70deb6e53a..90116a773a 100644
--- a/lib/eal/include/rte_service.h
+++ b/lib/eal/include/rte_service.h
@@ -37,7 +37,7 @@ extern "C" {
 
 /* Capabilities of a service.
  *
- * Use the *rte_service_probe_capability* function to check if a service is
+ * Use the rte_service_probe_capability() function to check if a service is
  * capable of a specific capability.
  */
 /** When set, the service is capable of having multiple threads run it at the
@@ -147,13 +147,13 @@ int32_t rte_service_map_lcore_get(uint32_t service_id, uint32_t lcore);
 int32_t rte_service_runstate_set(uint32_t id, uint32_t runstate);
 
 /**
- * Get the runstate for the service with *id*. See *rte_service_runstate_set*
+ * Get the runstate for the service with *id*. See rte_service_runstate_set()
  * for details of runstates. A service can call this function to ensure that
  * the application has indicated that it will receive CPU cycles. Either a
  * service-core is mapped (default case), or the application has explicitly
  * disabled the check that a service-cores is mapped to the service and takes
  * responsibility to run the service manually using the available function
- * *rte_service_run_iter_on_app_lcore* to do so.
+ * rte_service_run_iter_on_app_lcore() to do so.
  *
  * @retval 1 Service is running
  * @retval 0 Service is stopped
@@ -181,7 +181,7 @@ rte_service_may_be_active(uint32_t id);
 /**
  * Enable or disable the check for a service-core being mapped to the service.
  * An application can disable the check when takes the responsibility to run a
- * service itself using *rte_service_run_iter_on_app_lcore*.
+ * service itself using rte_service_run_iter_on_app_lcore().
  *
  * @param id The id of the service to set the check on
  * @param enable When zero, the check is disabled. Non-zero enables the check.
@@ -216,7 +216,7 @@ int32_t rte_service_set_runstate_mapped_check(uint32_t id, int32_t enable);
  *           atomics, applications can choose to enable or disable this feature
  *
  * Note that any thread calling this function MUST be a DPDK EAL thread, as
- * the *rte_lcore_id* function is used to access internal data structures.
+ * the rte_lcore_id() function is used to access internal data structures.
  *
  * @retval 0 Service was run on the calling thread successfully
  * @retval -EBUSY Another lcore is executing the service, and it is not a
@@ -232,7 +232,7 @@ int32_t rte_service_run_iter_on_app_lcore(uint32_t id,
  *
  * Starting a core makes the core begin polling. Any services assigned to it
  * will be run as fast as possible. The application must ensure that the lcore
- * is in a launchable state: e.g. call *rte_eal_lcore_wait* on the lcore_id
+ * is in a launchable state: e.g. call rte_eal_lcore_wait() on the lcore_id
  * before calling this function.
  *
  * @retval 0 Success
@@ -248,7 +248,7 @@ int32_t rte_service_lcore_start(uint32_t lcore_id);
  * service core. Note that the service lcore thread may not have returned from
  * the service it is running when this API returns.
  *
- * The *rte_service_lcore_may_be_active* API can be used to check if the
+ * The rte_service_lcore_may_be_active() API can be used to check if the
  * service lcore is * still active.
  *
  * @retval 0 Success
@@ -265,7 +265,7 @@ int32_t rte_service_lcore_stop(uint32_t lcore_id);
  * Reports if a service lcore is currently running.
  *
  * This function returns if the core has finished service cores code, and has
- * returned to EAL control. If *rte_service_lcore_stop* has been called but
+ * returned to EAL control. If rte_service_lcore_stop() has been called but
  * the lcore has not returned to EAL yet, it might be required to wait and call
  * this function again. The amount of time to wait before the core returns
  * depends on the duration of the services being run.
@@ -293,7 +293,7 @@ int32_t rte_service_lcore_add(uint32_t lcore);
 /**
  * Removes lcore from the list of service cores.
  *
- * This can fail if the core is not stopped, see *rte_service_core_stop*.
+ * This can fail if the core is not stopped, see rte_service_core_stop().
  *
  * @retval 0 Success
  * @retval -EBUSY Lcore is not stopped, stop service core before removing.
@@ -308,7 +308,7 @@ int32_t rte_service_lcore_del(uint32_t lcore);
  * service core count can be used in mapping logic when creating mappings
  * from service cores to services.
  *
- * See *rte_service_lcore_list* for details on retrieving the lcore_id of each
+ * See rte_service_lcore_list() for details on retrieving the lcore_id of each
  * service core.
  *
  * @return The number of service cores currently configured.
@@ -344,14 +344,14 @@ int32_t rte_service_set_stats_enable(uint32_t id, int32_t enable);
  * indicating the lcore_id of a service core.
  *
  * Adding and removing service cores can be performed using
- * *rte_service_lcore_add* and *rte_service_lcore_del*.
- * @param [out] array An array of at least *rte_service_lcore_count* items.
+ * rte_service_lcore_add() and rte_service_lcore_del().
+ * @param [out] array An array of at least rte_service_lcore_count() items.
  *              If statically allocating the buffer, use RTE_MAX_LCORE.
  * @param [out] n The size of *array*.
  * @retval >=0 Number of service cores that have been populated in the array
  * @retval -ENOMEM The provided array is not large enough to fill in the
  *          service core list. No items have been populated, call this function
- *          with a size of at least *rte_service_core_count* items.
+ *          with a size of at least rte_service_core_count() items.
  */
 int32_t rte_service_lcore_list(uint32_t array[], uint32_t n);
 
-- 
2.34.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH v2 0/6] Service cores performance and statistics improvements
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
                         ` (5 preceding siblings ...)
  2022-10-05  9:16       ` [PATCH v2 6/6] service: provide links to functions in documentation Mattias Rönnblom
@ 2022-10-05  9:49       ` Morten Brørup
  2022-10-05 10:14         ` Mattias Rönnblom
  2022-10-05 13:39       ` David Marchand
  7 siblings, 1 reply; 43+ messages in thread
From: Morten Brørup @ 2022-10-05  9:49 UTC (permalink / raw)
  To: Mattias Rönnblom, Harry van Haaren
  Cc: dev, Honnappa Nagarahalli, nd, hofors

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, 5 October 2022 11.16
> 
> This series contains performance improvements and new statistics-
> related functionality for the EAL service cores framework.
> 
> A new per-lcore TSC cycle counter is introduced, which reflect the
> total amount of cycles spent by that lcore running services. This may
> be used to estimate service lcore load.
> 
> The patchset introduces a backward-compatible convention, where a DPDK
> service may signal to the framework that no useful work was performed,
> which in turn is used to make the busy cycles statistics more
> accurate.
> 
> Depends-on: series-23959 ("test/service: add perf measurements for with
> stats mode ")
> 
> Mattias Rönnblom (6):
>   service: reduce statistics overhead for parallel services
>   service: introduce per-lcore cycles counter
>   service: reduce average case service core overhead
>   service: tweak cycle statistics semantics
>   event/sw: report idle when no work is performed
>   service: provide links to functions in documentation
> 
>  app/test/test_service_cores.c           |   2 +-
>  drivers/event/sw/sw_evdev.c             |   3 +-
>  drivers/event/sw/sw_evdev.h             |   2 +-
>  drivers/event/sw/sw_evdev_scheduler.c   |   6 +-
>  lib/eal/common/rte_service.c            | 228 +++++++++++++++++-------
>  lib/eal/include/rte_service.h           |  32 ++--
>  lib/eal/include/rte_service_component.h |   5 +
>  7 files changed, 192 insertions(+), 86 deletions(-)
> 
> --
> 2.34.1
> 

In case it wasn't noticed with v1 of the patch series...

Series-Acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/6] Service cores performance and statistics improvements
  2022-10-05  9:49       ` [PATCH v2 0/6] Service cores performance and statistics improvements Morten Brørup
@ 2022-10-05 10:14         ` Mattias Rönnblom
  0 siblings, 0 replies; 43+ messages in thread
From: Mattias Rönnblom @ 2022-10-05 10:14 UTC (permalink / raw)
  To: Morten Brørup, Harry van Haaren
  Cc: dev, Honnappa Nagarahalli, nd, hofors

On 2022-10-05 11:49, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 5 October 2022 11.16
>>
>> This series contains performance improvements and new statistics-
>> related functionality for the EAL service cores framework.
>>
>> A new per-lcore TSC cycle counter is introduced, which reflect the
>> total amount of cycles spent by that lcore running services. This may
>> be used to estimate service lcore load.
>>
>> The patchset introduces a backward-compatible convention, where a DPDK
>> service may signal to the framework that no useful work was performed,
>> which in turn is used to make the busy cycles statistics more
>> accurate.
>>
>> Depends-on: series-23959 ("test/service: add perf measurements for with
>> stats mode ")
>>
>> Mattias Rönnblom (6):
>>    service: reduce statistics overhead for parallel services
>>    service: introduce per-lcore cycles counter
>>    service: reduce average case service core overhead
>>    service: tweak cycle statistics semantics
>>    event/sw: report idle when no work is performed
>>    service: provide links to functions in documentation
>>
>>   app/test/test_service_cores.c           |   2 +-
>>   drivers/event/sw/sw_evdev.c             |   3 +-
>>   drivers/event/sw/sw_evdev.h             |   2 +-
>>   drivers/event/sw/sw_evdev_scheduler.c   |   6 +-
>>   lib/eal/common/rte_service.c            | 228 +++++++++++++++++-------
>>   lib/eal/include/rte_service.h           |  32 ++--
>>   lib/eal/include/rte_service_component.h |   5 +
>>   7 files changed, 192 insertions(+), 86 deletions(-)
>>
>> --
>> 2.34.1
>>
> 
> In case it wasn't noticed with v1 of the patch series...
> 
> Series-Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 

It was noticed, but then forgotten. Just like I forgot to add the
Acked-by from Harry van Haaren.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 0/6] Service cores performance and statistics improvements
  2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
                         ` (6 preceding siblings ...)
  2022-10-05  9:49       ` [PATCH v2 0/6] Service cores performance and statistics improvements Morten Brørup
@ 2022-10-05 13:39       ` David Marchand
  7 siblings, 0 replies; 43+ messages in thread
From: David Marchand @ 2022-10-05 13:39 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Harry, dev, Honnappa Nagarahalli, Morten Brørup, nd, hofors

On Wed, Oct 5, 2022 at 11:20 AM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> This series contains performance improvements and new statistics-
> related functionality for the EAL service cores framework.
>
> A new per-lcore TSC cycle counter is introduced, which reflect the
> total amount of cycles spent by that lcore running services. This may
> be used to estimate service lcore load.
>
> The patchset introduces a backward-compatible convention, where a DPDK
> service may signal to the framework that no useful work was performed,
> which in turn is used to make the busy cycles statistics more
> accurate.
>
> Depends-on: series-23959 ("test/service: add perf measurements for with stats mode ")
>
> Mattias Rönnblom (6):
>   service: reduce statistics overhead for parallel services
>   service: introduce per-lcore cycles counter
>   service: reduce average case service core overhead
>   service: tweak cycle statistics semantics
>   event/sw: report idle when no work is performed
>   service: provide links to functions in documentation
>
>  app/test/test_service_cores.c           |   2 +-
>  drivers/event/sw/sw_evdev.c             |   3 +-
>  drivers/event/sw/sw_evdev.h             |   2 +-
>  drivers/event/sw/sw_evdev_scheduler.c   |   6 +-
>  lib/eal/common/rte_service.c            | 228 +++++++++++++++++-------
>  lib/eal/include/rte_service.h           |  32 ++--
>  lib/eal/include/rte_service_component.h |   5 +
>  7 files changed, 192 insertions(+), 86 deletions(-)

Added acks.
Series applied, thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2022-10-05 13:39 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-08 12:56 [PATCH 1/2] test/service: add perf measurements for with stats mode Harry van Haaren
2022-07-08 12:56 ` [PATCH 2/2] service: fix potential stats race-condition on MT services Harry van Haaren
2022-07-08 13:23   ` Morten Brørup
2022-07-08 13:44     ` Van Haaren, Harry
2022-07-08 14:14       ` Morten Brørup
2022-07-08 13:48   ` Mattias Rönnblom
2022-07-08 15:16   ` Honnappa Nagarahalli
2022-07-08 15:31     ` Van Haaren, Harry
2022-07-08 16:21       ` Bruce Richardson
2022-07-08 16:33         ` Honnappa Nagarahalli
2022-07-08 20:02         ` Mattias Rönnblom
2022-07-08 16:29     ` Morten Brørup
2022-07-08 16:45       ` Honnappa Nagarahalli
2022-07-08 17:22         ` Morten Brørup
2022-07-08 17:39           ` Honnappa Nagarahalli
2022-07-08 18:08             ` Morten Brørup
2022-09-06 16:13   ` [PATCH 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
2022-09-06 16:13     ` [PATCH 2/6] service: introduce per-lcore cycles counter Mattias Rönnblom
2022-09-06 16:13     ` [PATCH 3/6] service: reduce average case service core overhead Mattias Rönnblom
2022-10-03 13:33       ` Van Haaren, Harry
2022-10-03 14:32         ` Mattias Rönnblom
2022-09-06 16:13     ` [PATCH 4/6] service: tweak cycle statistics semantics Mattias Rönnblom
2022-09-07  8:41       ` Morten Brørup
2022-10-03 13:45         ` Van Haaren, Harry
2022-09-06 16:13     ` [PATCH 5/6] event/sw: report idle when no work is performed Mattias Rönnblom
2022-09-06 16:13     ` [PATCH 6/6] service: provide links to functions in documentation Mattias Rönnblom
2022-10-03  8:06     ` [PATCH 1/6] service: reduce statistics overhead for parallel services David Marchand
2022-10-03  8:40       ` Mattias Rönnblom
2022-10-03  9:53         ` David Marchand
2022-10-03 11:37           ` Mattias Rönnblom
2022-10-03 13:03             ` Van Haaren, Harry
2022-10-03 13:33     ` Van Haaren, Harry
2022-10-03 14:37       ` Mattias Rönnblom
2022-10-05  9:16     ` [PATCH v2 0/6] Service cores performance and statistics improvements Mattias Rönnblom
2022-10-05  9:16       ` [PATCH v2 1/6] service: reduce statistics overhead for parallel services Mattias Rönnblom
2022-10-05  9:16       ` [PATCH v2 2/6] service: introduce per-lcore cycles counter Mattias Rönnblom
2022-10-05  9:16       ` [PATCH v2 3/6] service: reduce average case service core overhead Mattias Rönnblom
2022-10-05  9:16       ` [PATCH v2 4/6] service: tweak cycle statistics semantics Mattias Rönnblom
2022-10-05  9:16       ` [PATCH v2 5/6] event/sw: report idle when no work is performed Mattias Rönnblom
2022-10-05  9:16       ` [PATCH v2 6/6] service: provide links to functions in documentation Mattias Rönnblom
2022-10-05  9:49       ` [PATCH v2 0/6] Service cores performance and statistics improvements Morten Brørup
2022-10-05 10:14         ` Mattias Rönnblom
2022-10-05 13:39       ` David Marchand
