patches for DPDK stable branches
* [PATCH] service: avoid worker lcore exit deadlock
@ 2024-10-01 16:26 Mattias Rönnblom
  2024-10-02 14:02 ` Tyler Retzlaff
  2024-10-03  6:57 ` [PATCH v2] service: fix deadlock on worker lcore exit David Marchand
  0 siblings, 2 replies; 6+ messages in thread
From: Mattias Rönnblom @ 2024-10-01 16:26 UTC (permalink / raw)
  To: Harry van Haaren, Stephen Hemminger
  Cc: hofors, dev, Suanming Mou, thomas, david.marchand,
	Mattias Rönnblom, stable

Calling rte_exit() from a worker lcore thread causes a deadlock in
rte_service_finalize().

This patch makes rte_service_finalize() deadlock-free by avoiding the
need to synchronize with service lcore threads, which in turn is
achieved by moving service and per-lcore state from the heap to being
statically allocated.

The BSS segment increases with ~156 kB (on x86_64 with default
RTE_MAX_LCORE and RTE_SERVICE_NUM_MAX).

According to the service perf autotest, this change also results in a
slight reduction of service framework overhead.

Fixes: 33666b448f15 ("service: fix crash on exit")
Cc: harry.van.haaren@intel.com
Cc: stable@dpdk.org

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/eal/common/rte_service.c | 28 ++--------------------------
 1 file changed, 2 insertions(+), 26 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index 94e872a08a..2ea0c0c0d9 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -15,7 +15,6 @@
 #include <rte_common.h>
 #include <rte_cycles.h>
 #include <rte_atomic.h>
-#include <rte_malloc.h>
 #include <rte_spinlock.h>
 #include <rte_trace_point.h>
 
@@ -74,8 +73,8 @@ struct core_state {
 } __rte_cache_aligned;
 
 static uint32_t rte_service_count;
-static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static struct rte_service_spec_impl rte_services[RTE_SERVICE_NUM_MAX];
+static struct core_state lcore_states[RTE_MAX_LCORE];
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -93,21 +92,6 @@ rte_service_init(void)
 		return -EALREADY;
 	}
 
-	rte_services = rte_calloc("rte_services", RTE_SERVICE_NUM_MAX,
-			sizeof(struct rte_service_spec_impl),
-			RTE_CACHE_LINE_SIZE);
-	if (!rte_services) {
-		RTE_LOG(ERR, EAL, "error allocating rte services array\n");
-		goto fail_mem;
-	}
-
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		RTE_LOG(ERR, EAL, "error allocating core states array\n");
-		goto fail_mem;
-	}
-
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
@@ -120,10 +104,6 @@ rte_service_init(void)
 
 	rte_service_library_initialized = 1;
 	return 0;
-fail_mem:
-	rte_free(rte_services);
-	rte_free(lcore_states);
-	return -ENOMEM;
 }
 
 void
@@ -133,10 +113,6 @@ rte_service_finalize(void)
 		return;
 
 	rte_service_lcore_reset_all();
-	rte_eal_mp_wait_lcore();
-
-	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
-- 
2.34.1
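
Editorial illustration, not part of the patch: a minimal single-threaded
sketch of the pattern the commit message describes, where state lives in
statically allocated arrays so that finalize has nothing to free and so
need not wait for worker threads. All demo_* names, the DEMO_* sizes and
the struct fields are hypothetical stand-ins; the real rte_service.c
structs hold more fields and use atomics for cross-thread access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for RTE_SERVICE_NUM_MAX and RTE_MAX_LCORE. */
#define DEMO_SERVICE_NUM_MAX 64
#define DEMO_MAX_LCORE 128

/* Simplified, hypothetical state (assumption: not the real layout). */
struct demo_service {
	uint32_t internal_flags;
};

struct demo_core_state {
	uint64_t service_mask;
	uint8_t runstate;
};

/* Static storage: zero-initialized at load time and never freed. */
static struct demo_service demo_services[DEMO_SERVICE_NUM_MAX];
static struct demo_core_state demo_lcore_states[DEMO_MAX_LCORE];
static uint32_t demo_service_count;
static uint32_t demo_initialized;

/* Returns 0 on success, -1 if already initialized. No allocation can fail. */
int
demo_init(void)
{
	if (demo_initialized)
		return -1;
	demo_initialized = 1;
	return 0;
}

/* Claim the next free service slot; returns its index, or -1 if full. */
int
demo_service_register(uint32_t flags)
{
	if (demo_service_count == DEMO_SERVICE_NUM_MAX)
		return -1;
	demo_services[demo_service_count].internal_flags = flags;
	return (int)demo_service_count++;
}

/* Mark an lcore as running the services selected by mask. */
void
demo_lcore_start(unsigned int lcore, uint64_t mask)
{
	assert(lcore < DEMO_MAX_LCORE);
	demo_lcore_states[lcore].service_mask = mask;
	demo_lcore_states[lcore].runstate = 1;
}

/* Read back the run state of an lcore. */
int
demo_lcore_runstate(unsigned int lcore)
{
	assert(lcore < DEMO_MAX_LCORE);
	return demo_lcore_states[lcore].runstate;
}

/*
 * Reset per-lcore state and return. Nothing is freed, so the storage
 * stays valid even if some other thread is still looking at it.
 */
void
demo_finalize(void)
{
	unsigned int i;

	if (!demo_initialized)
		return;
	for (i = 0; i < DEMO_MAX_LCORE; i++) {
		demo_lcore_states[i].service_mask = 0;
		demo_lcore_states[i].runstate = 0;
	}
	demo_initialized = 0;
}
```

The trade-off is the one the commit message names: the arrays are sized
for the compile-time maximums, so the memory is reserved in BSS whether
or not the service framework is used.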



* Re: [PATCH] service: avoid worker lcore exit deadlock
  2024-10-01 16:26 [PATCH] service: avoid worker lcore exit deadlock Mattias Rönnblom
@ 2024-10-02 14:02 ` Tyler Retzlaff
  2024-10-03  6:57 ` [PATCH v2] service: fix deadlock on worker lcore exit David Marchand
  1 sibling, 0 replies; 6+ messages in thread
From: Tyler Retzlaff @ 2024-10-02 14:02 UTC (permalink / raw)
  To: 20230629203720.682f90c3
  Cc: Harry van Haaren, Stephen Hemminger, hofors, dev, Suanming Mou,
	thomas, david.marchand, Mattias Rönnblom, stable

On Tue, Oct 01, 2024 at 06:26:03PM +0200, Mattias Rönnblom wrote:
> Calling rte_exit() from a worker lcore thread causes a deadlock in
> rte_service_finalize().
> 
> This patch makes rte_service_finalize() deadlock-free by avoiding the
> need to synchronize with service lcore threads, which in turn is
> achieved by moving service and per-lcore state from the heap to being
> statically allocated.
> 
> The BSS segment increases with ~156 kB (on x86_64 with default
> RTE_MAX_LCORE and RTE_SERVICE_NUM_MAX).
> 
> According to the service perf autotest, this change also results in a
> slight reduction of service framework overhead.
> 
> Fixes: 33666b448f15 ("service: fix crash on exit")
> Cc: harry.van.haaren@intel.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>



* [PATCH v2] service: fix deadlock on worker lcore exit
  2024-10-01 16:26 [PATCH] service: avoid worker lcore exit deadlock Mattias Rönnblom
  2024-10-02 14:02 ` Tyler Retzlaff
@ 2024-10-03  6:57 ` David Marchand
  2024-10-03  9:13   ` David Marchand
  1 sibling, 1 reply; 6+ messages in thread
From: David Marchand @ 2024-10-03  6:57 UTC (permalink / raw)
  To: dev
  Cc: stephen, suanmingm, thomas, Mattias Rönnblom, stable,
	Tyler Retzlaff, Harry van Haaren, Aaron Conole

From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Calling rte_exit() from a worker lcore thread causes a deadlock in
rte_service_finalize().

This patch makes rte_service_finalize() deadlock-free by avoiding the
need to synchronize with service lcore threads, which in turn is
achieved by moving service and per-lcore state from the heap to being
statically allocated.

The BSS segment increases with ~156 kB (on x86_64 with default
RTE_MAX_LCORE and RTE_SERVICE_NUM_MAX).

According to the service perf autotest, this change also results in a
slight reduction of service framework overhead.

Fixes: 33666b448f15 ("service: fix crash on exit")
Cc: stable@dpdk.org

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
---
Changes since v1:
- rebased,

---
 lib/eal/common/rte_service.c | 28 ++--------------------------
 1 file changed, 2 insertions(+), 26 deletions(-)

diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index a38c594ce4..5b6805b8d8 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -15,7 +15,6 @@
 #include <rte_common.h>
 #include <rte_cycles.h>
 #include <rte_atomic.h>
-#include <rte_malloc.h>
 #include <rte_spinlock.h>
 #include <rte_trace_point.h>
 
@@ -76,8 +75,8 @@ struct __rte_cache_aligned core_state {
 };
 
 static uint32_t rte_service_count;
-static struct rte_service_spec_impl *rte_services;
-static struct core_state *lcore_states;
+static struct rte_service_spec_impl rte_services[RTE_SERVICE_NUM_MAX];
+static struct core_state lcore_states[RTE_MAX_LCORE];
 static uint32_t rte_service_library_initialized;
 
 int32_t
@@ -95,21 +94,6 @@ rte_service_init(void)
 		return -EALREADY;
 	}
 
-	rte_services = rte_calloc("rte_services", RTE_SERVICE_NUM_MAX,
-			sizeof(struct rte_service_spec_impl),
-			RTE_CACHE_LINE_SIZE);
-	if (!rte_services) {
-		EAL_LOG(ERR, "error allocating rte services array");
-		goto fail_mem;
-	}
-
-	lcore_states = rte_calloc("rte_service_core_states", RTE_MAX_LCORE,
-			sizeof(struct core_state), RTE_CACHE_LINE_SIZE);
-	if (!lcore_states) {
-		EAL_LOG(ERR, "error allocating core states array");
-		goto fail_mem;
-	}
-
 	int i;
 	struct rte_config *cfg = rte_eal_get_configuration();
 	for (i = 0; i < RTE_MAX_LCORE; i++) {
@@ -122,10 +106,6 @@ rte_service_init(void)
 
 	rte_service_library_initialized = 1;
 	return 0;
-fail_mem:
-	rte_free(rte_services);
-	rte_free(lcore_states);
-	return -ENOMEM;
 }
 
 void
@@ -135,10 +115,6 @@ rte_service_finalize(void)
 		return;
 
 	rte_service_lcore_reset_all();
-	rte_eal_mp_wait_lcore();
-
-	rte_free(rte_services);
-	rte_free(lcore_states);
 
 	rte_service_library_initialized = 0;
 }
-- 
2.46.2



* Re: [PATCH v2] service: fix deadlock on worker lcore exit
  2024-10-03  6:57 ` [PATCH v2] service: fix deadlock on worker lcore exit David Marchand
@ 2024-10-03  9:13   ` David Marchand
  2024-10-03 15:50     ` Van Haaren, Harry
  0 siblings, 1 reply; 6+ messages in thread
From: David Marchand @ 2024-10-03  9:13 UTC (permalink / raw)
  To: Mattias Rönnblom, Harry van Haaren
  Cc: dev, stephen, suanmingm, thomas, stable, Tyler Retzlaff, Aaron Conole

On Thu, Oct 3, 2024 at 8:57 AM David Marchand <david.marchand@redhat.com> wrote:
>
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>
> Calling rte_exit() from a worker lcore thread causes a deadlock in
> rte_service_finalize().
>
> This patch makes rte_service_finalize() deadlock-free by avoiding the
> need to synchronize with service lcore threads, which in turn is
> achieved by moving service and per-lcore state from the heap to being
> statically allocated.
>
> The BSS segment increases with ~156 kB (on x86_64 with default
> RTE_MAX_LCORE and RTE_SERVICE_NUM_MAX).
>
> According to the service perf autotest, this change also results in a
> slight reduction of service framework overhead.
>
> Fixes: 33666b448f15 ("service: fix crash on exit")
> Cc: stable@dpdk.org
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
> ---
> Changes since v1:
> - rebased,

I can't merge this patch in its current state.

At the moment, two CI report a problem with the
eal_flags_file_prefix_autotest unit test.

-------------------------------------stdout-------------------------------------
RTE>>eal_flags_file_prefix_autotest
Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
'--proc-type=secondary' '-m' '18' '--file-prefix=memtest'
Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
'-m' '18' '--file-prefix=memtest1'
Error - hugepage files for memtest1 were not deleted!
Test Failed
RTE>>

Can you have a look?

Thanks.

-- 
David Marchand



* Re: [PATCH v2] service: fix deadlock on worker lcore exit
  2024-10-03  9:13   ` David Marchand
@ 2024-10-03 15:50     ` Van Haaren, Harry
  2024-10-11  8:50       ` David Marchand
  0 siblings, 1 reply; 6+ messages in thread
From: Van Haaren, Harry @ 2024-10-03 15:50 UTC (permalink / raw)
  To: Marchand, David, Mattias Rönnblom
  Cc: dev, stephen, suanmingm, thomas, stable, Tyler Retzlaff, Aaron Conole


> From: David Marchand <david.marchand@redhat.com>
> Sent: Thursday, October 3, 2024 10:13 AM
> To: Mattias Rönnblom <mattias.ronnblom@ericsson.com>; Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org <dev@dpdk.org>; stephen@networkplumber.org <stephen@networkplumber.org>; suanmingm@nvidia.com <suanmingm@nvidia.com>; thomas@monjalon.net <thomas@monjalon.net>; stable@dpdk.org <stable@dpdk.org>; Tyler Retzlaff <roretzla@linux.microsoft.com>; Aaron Conole <aconole@redhat.com>
> Subject: Re: [PATCH v2] service: fix deadlock on worker lcore exit
>
> On Thu, Oct 3, 2024 at 8:57 AM David Marchand <david.marchand@redhat.com> wrote:
> >
> > From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >
> > Calling rte_exit() from a worker lcore thread causes a deadlock in
> > rte_service_finalize().
> >
> > This patch makes rte_service_finalize() deadlock-free by avoiding the
> > need to synchronize with service lcore threads, which in turn is
> > achieved by moving service and per-lcore state from the heap to being
> > statically allocated.
> >
> > The BSS segment increases with ~156 kB (on x86_64 with default
> > RTE_MAX_LCORE and RTE_SERVICE_NUM_MAX).
> >
> > According to the service perf autotest, this change also results in a
> > slight reduction of service framework overhead.
> >
> > Fixes: 33666b448f15 ("service: fix crash on exit")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
> > ---
> > Changes since v1:
> > - rebased,
>
> I can't merge this patch in its current state.
>
> At the moment, two CI report a problem with the
> eal_flags_file_prefix_autotest unit test.
>
> -------------------------------------stdout-------------------------------------
> RTE>>eal_flags_file_prefix_autotest
> Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
> '--proc-type=secondary' '-m' '18' '--file-prefix=memtest'
> Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
> '-m' '18' '--file-prefix=memtest1'
> Error - hugepage files for memtest1 were not deleted!
> Test Failed
> RTE>>
>
> Can you have a look?

Not sure how the code change in question relates to the eal-flags failure, but I can reproduce the failure here.
I can reproduce the issue on *all* of the tags below; this indicates it's likely a board-config issue, and not a true issue (unless it's been there since 23.11??).

Tested commits were all bad:
b3485f4293 (HEAD, tag: v24.07) version: 24.07.0
a9778aad62 (HEAD, tag: v24.03) version: 24.03.0
eeb0605f11 (HEAD, tag: v23.11) version: 23.11.0

So I'm pretty sure this is a board/runner config issue, with the error output as follows here:
RTE>>eal_flags_file_prefix_autotest
Running binary with argv[]:'./app/test/dpdk-test' '--proc-type=secondary' '-m' '18' '--file-prefix=memtest'
EAL: Detected CPU lcores: 64
EAL: Detected NUMA nodes: 2
EAL: Detected static linkage of DPDK
EAL: Cannot open '/var/run/dpdk/memtest/config' for rte_mem_config
EAL: FATAL: Cannot init config
EAL: Cannot init config

FAIL:
DPDK_TEST=eal_flags_file_prefix_autotest ./app/test/dpdk-test  --no-pci

PASS:
DPDK_TEST=eal_flags_file_prefix_autotest ./app/test/dpdk-test

So seems like the eal-flags test is NOT able to handle args like "--no-pci"? I tend to run tests in no PCI mode to speed up things :)

In short, this service-cores patch is not the root cause. Perhaps some of the CI folks can confirm if there's extra args passed to the runner?

Regards, -Harry


* Re: [PATCH v2] service: fix deadlock on worker lcore exit
  2024-10-03 15:50     ` Van Haaren, Harry
@ 2024-10-11  8:50       ` David Marchand
  0 siblings, 0 replies; 6+ messages in thread
From: David Marchand @ 2024-10-11  8:50 UTC (permalink / raw)
  To: Van Haaren, Harry, ci
  Cc: Mattias Rönnblom, dev, stephen, suanmingm, thomas, stable,
	Tyler Retzlaff, Aaron Conole

On Thu, Oct 3, 2024 at 5:50 PM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
> > From: David Marchand <david.marchand@redhat.com>
> > Sent: Thursday, October 3, 2024 10:13 AM
> > To: Mattias Rönnblom <mattias.ronnblom@ericsson.com>; Van Haaren, Harry <harry.van.haaren@intel.com>
> > Cc: dev@dpdk.org <dev@dpdk.org>; stephen@networkplumber.org <stephen@networkplumber.org>; suanmingm@nvidia.com <suanmingm@nvidia.com>; thomas@monjalon.net <thomas@monjalon.net>; stable@dpdk.org <stable@dpdk.org>; Tyler Retzlaff <roretzla@linux.microsoft.com>; Aaron Conole <aconole@redhat.com>
> > Subject: Re: [PATCH v2] service: fix deadlock on worker lcore exit
> >
> > On Thu, Oct 3, 2024 at 8:57 AM David Marchand <david.marchand@redhat.com> wrote:
> > >
> > > From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > >
> > > Calling rte_exit() from a worker lcore thread causes a deadlock in
> > > rte_service_finalize().
> > >
> > > This patch makes rte_service_finalize() deadlock-free by avoiding the
> > > need to synchronize with service lcore threads, which in turn is
> > > achieved by moving service and per-lcore state from the heap to being
> > > statically allocated.
> > >
> > > The BSS segment increases with ~156 kB (on x86_64 with default
> > > RTE_MAX_LCORE and RTE_SERVICE_NUM_MAX).
> > >
> > > According to the service perf autotest, this change also results in a
> > > slight reduction of service framework overhead.
> > >
> > > Fixes: 33666b448f15 ("service: fix crash on exit")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > > Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
> > > ---
> > > Changes since v1:
> > > - rebased,
> >
> > I can't merge this patch in its current state.
> >
> > At the moment, two CI report a problem with the
> > eal_flags_file_prefix_autotest unit test.
> >
> > -------------------------------------stdout-------------------------------------
> > RTE>>eal_flags_file_prefix_autotest
> > Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
> > '--proc-type=secondary' '-m' '18' '--file-prefix=memtest'
> > Running binary with argv[]:'/home/zhoumin/gh_dpdk/build/app/dpdk-test'
> > '-m' '18' '--file-prefix=memtest1'
> > Error - hugepage files for memtest1 were not deleted!
> > Test Failed
> > RTE>>
> >
> > Can you have a look?
>
> Not sure how the code change in question relates to the eal-flags failure, but I can reproduce the failure here.
> I can reproduce the issue on *all* of the tags below; this indicates it's likely a board-config issue, and not a true issue (unless it's been there since 23.11??).
>
> Tested commits were all bad:
> b3485f4293 (HEAD, tag: v24.07) version: 24.07.0
> a9778aad62 (HEAD, tag: v24.03) version: 24.03.0
> eeb0605f11 (HEAD, tag: v23.11) version: 23.11.0
>
> So I'm pretty sure this is a board/runner config issue, with the error output as follows here:
> RTE>>eal_flags_file_prefix_autotest
> Running binary with argv[]:'./app/test/dpdk-test' '--proc-type=secondary' '-m' '18' '--file-prefix=memtest'
> EAL: Detected CPU lcores: 64
> EAL: Detected NUMA nodes: 2
> EAL: Detected static linkage of DPDK
> EAL: Cannot open '/var/run/dpdk/memtest/config' for rte_mem_config
> EAL: FATAL: Cannot init config
> EAL: Cannot init config
>
> FAIL:
> DPDK_TEST=eal_flags_file_prefix_autotest ./app/test/dpdk-test  --no-pci
>
> PASS:
> DPDK_TEST=eal_flags_file_prefix_autotest ./app/test/dpdk-test
>
> So seems like the eal-flags test is NOT able to handle args like "--no-pci"? I tend to run tests in no PCI mode to speed up things :)

Well, speeding up, or hiding the issue, I guess.

> In short, this service-cores patch is not the root cause. Perhaps some of the CI folks can confirm if there's extra args passed to the runner?

To be clear, I can't merge this patch because of this (systematic)
failure in many CI env (GHA, LoongArch, UNH).

Adding CI ml in the loop.


-- 
David Marchand


