* [dpdk-dev] [PATCH] eal/service: fix exit by resetting service lcores @ 2020-03-10 13:33 Harry van Haaren 2020-03-10 16:31 ` David Marchand 2020-03-11 14:39 ` [dpdk-dev] [PATCH v2] " Harry van Haaren 0 siblings, 2 replies; 15+ messages in thread From: Harry van Haaren @ 2020-03-10 13:33 UTC (permalink / raw) To: dev; +Cc: david.marchand, aconole, Harry van Haaren This commit releases all service cores from their role, returning them to ROLE_RTE on rte_service_finalize(). This may fix an issue relating to the service cores causing a race condition on eal_cleanup(), where the service core could still be executing while the main thread has already freed the service memory, leading to a segfault. Fixes: 21698354c832 ("service: introduce service cores concept") Reported-by: David Marchand <david.marchand@redhat.com> Reported-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> --- Please note that this patch is being sent to the community for testing as I cannot reliably reproduce the reported issue on my local setup (despite code changes in attempts to make the problem more visible, and instructions from David on how he can reproduce it). Email discussion on this topic here: https://mails.dpdk.org/archives/dev/2020-March/159584.html --- lib/librte_eal/common/rte_service.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c index 7e537b8cd..d400ccf79 100644 --- a/lib/librte_eal/common/rte_service.c +++ b/lib/librte_eal/common/rte_service.c @@ -122,6 +122,8 @@ rte_service_finalize(void) if (!rte_service_library_initialized) return; + rte_service_lcore_reset_all(); + rte_free(rte_services); rte_free(lcore_states); -- 2.17.1 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH] eal/service: fix exit by resetting service lcores 2020-03-10 13:33 [dpdk-dev] [PATCH] eal/service: fix exit by resetting service lcores Harry van Haaren @ 2020-03-10 16:31 ` David Marchand 2020-03-10 16:38 ` Van Haaren, Harry 2020-03-11 14:39 ` [dpdk-dev] [PATCH v2] " Harry van Haaren 1 sibling, 1 reply; 15+ messages in thread From: David Marchand @ 2020-03-10 16:31 UTC (permalink / raw) To: Harry van Haaren; +Cc: dev, Aaron Conole On Tue, Mar 10, 2020 at 2:32 PM Harry van Haaren <harry.van.haaren@intel.com> wrote: > > This commit releases all service cores from thier role, > returning them to ROLE_RTE on rte_service_finalize(). > > This may fix an issue relating to the service cores causing > a race-condition on eal_cleanup(), where the service core > could still be executing while the main thread has already > free-d the service memory, leading to a segfault. Adding rte_service_lcore_reset_all() just tells a (remaining) service lcore to quit its loop, but does not close the race on lcore_states. The backtrace shows the same. (gdb) bt full #0 rte_service_runner_func (arg=<optimized out>) at ../lib/librte_eal/common/rte_service.c:455 service_mask = 1 i = <optimized out> lcore = 1 cs = 0x1003ea200 #1 0x00007ffff72030ef in eal_thread_loop (arg=<optimized out>) at ../lib/librte_eal/linux/eal/eal_thread.c:153 fct_arg = <optimized out> c = 0 '\000' n = <optimized out> ret = <optimized out> lcore_id = <optimized out> thread_id = 140737203603200 m2s = 14 s2m = 22 cpuset = "1", '\000' <repeats 175 times>, "\200\000\000\000\000\000\000\000\221\354e\360\377\177", '\000' <repeats 65 times> __func__ = "eal_thread_loop" #2 0x00007ffff065ddd5 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00007ffff038702d in clone () from /lib64/libc.so.6 No symbol table info available. I added a rte_eal_mp_wait_lcore(), to ensure that each service lcore _did_ quit its loop. 
@@ -123,6 +123,7 @@ rte_service_finalize(void) return; rte_service_lcore_reset_all(); + rte_eal_mp_wait_lcore(); rte_free(rte_services); rte_free(lcore_states); I can't reproduce with this. -- David Marchand ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH] eal/service: fix exit by resetting service lcores 2020-03-10 16:31 ` David Marchand @ 2020-03-10 16:38 ` Van Haaren, Harry 2020-03-10 17:44 ` Aaron Conole 2020-03-11 9:09 ` David Marchand 0 siblings, 2 replies; 15+ messages in thread From: Van Haaren, Harry @ 2020-03-10 16:38 UTC (permalink / raw) To: David Marchand; +Cc: dev, Aaron Conole > -----Original Message----- > From: David Marchand <david.marchand@redhat.com> > Sent: Tuesday, March 10, 2020 4:31 PM > To: Van Haaren, Harry <harry.van.haaren@intel.com> > Cc: dev <dev@dpdk.org>; Aaron Conole <aconole@redhat.com> > Subject: Re: [PATCH] eal/service: fix exit by resetting service lcores > > On Tue, Mar 10, 2020 at 2:32 PM Harry van Haaren > <harry.van.haaren@intel.com> wrote: > > > > This commit releases all service cores from thier role, > > returning them to ROLE_RTE on rte_service_finalize(). > > > > This may fix an issue relating to the service cores causing > > a race-condition on eal_cleanup(), where the service core > > could still be executing while the main thread has already > > free-d the service memory, leading to a segfault. > > Adding rte_service_lcore_reset_all() just tells a (remaining) service > lcore to quit its loop, but does not close the race on lcore_states. > > The backtrace shows the same. 
> > (gdb) bt full > #0 rte_service_runner_func (arg=<optimized out>) at > ../lib/librte_eal/common/rte_service.c:455 > service_mask = 1 > i = <optimized out> > lcore = 1 > cs = 0x1003ea200 > #1 0x00007ffff72030ef in eal_thread_loop (arg=<optimized out>) at > ../lib/librte_eal/linux/eal/eal_thread.c:153 > fct_arg = <optimized out> > c = 0 '\000' > n = <optimized out> > ret = <optimized out> > lcore_id = <optimized out> > thread_id = 140737203603200 > m2s = 14 > s2m = 22 > cpuset = "1", '\000' <repeats 175 times>, > "\200\000\000\000\000\000\000\000\221\354e\360\377\177", '\000' > <repeats 65 times> > __func__ = "eal_thread_loop" > #2 0x00007ffff065ddd5 in start_thread () from /lib64/libpthread.so.0 > No symbol table info available. > #3 0x00007ffff038702d in clone () from /lib64/libc.so.6 > No symbol table info available. > > > I added a rte_eal_mp_wait_lcore(), to ensure that each service lcore > _did_ quit its loop. > @@ -123,6 +123,7 @@ rte_service_finalize(void) > return; > > rte_service_lcore_reset_all(); > + rte_eal_mp_wait_lcore(); > > rte_free(rte_services); > rte_free(lcore_states); > > > I can't reproduce with this. OK - that's good news, thanks for the quick testing & feedback. Agree with your analysis of the above, indeed waiting for the cores explicitly seems the right solution to remove the race. Will I spin up a v2 patchset with your co-authored-by added and the above change included? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH] eal/service: fix exit by resetting service lcores 2020-03-10 16:38 ` Van Haaren, Harry @ 2020-03-10 17:44 ` Aaron Conole 2020-03-10 19:14 ` Aaron Conole 2020-03-11 9:09 ` David Marchand 1 sibling, 1 reply; 15+ messages in thread From: Aaron Conole @ 2020-03-10 17:44 UTC (permalink / raw) To: Van Haaren, Harry; +Cc: David Marchand, dev "Van Haaren, Harry" <harry.van.haaren@intel.com> writes: >> -----Original Message----- >> From: David Marchand <david.marchand@redhat.com> >> Sent: Tuesday, March 10, 2020 4:31 PM >> To: Van Haaren, Harry <harry.van.haaren@intel.com> >> Cc: dev <dev@dpdk.org>; Aaron Conole <aconole@redhat.com> >> Subject: Re: [PATCH] eal/service: fix exit by resetting service lcores >> >> On Tue, Mar 10, 2020 at 2:32 PM Harry van Haaren >> <harry.van.haaren@intel.com> wrote: >> > >> > This commit releases all service cores from thier role, >> > returning them to ROLE_RTE on rte_service_finalize(). >> > >> > This may fix an issue relating to the service cores causing >> > a race-condition on eal_cleanup(), where the service core >> > could still be executing while the main thread has already >> > free-d the service memory, leading to a segfault. >> >> Adding rte_service_lcore_reset_all() just tells a (remaining) service >> lcore to quit its loop, but does not close the race on lcore_states. >> >> The backtrace shows the same. 
>> >> (gdb) bt full >> #0 rte_service_runner_func (arg=<optimized out>) at >> ../lib/librte_eal/common/rte_service.c:455 >> service_mask = 1 >> i = <optimized out> >> lcore = 1 >> cs = 0x1003ea200 >> #1 0x00007ffff72030ef in eal_thread_loop (arg=<optimized out>) at >> ../lib/librte_eal/linux/eal/eal_thread.c:153 >> fct_arg = <optimized out> >> c = 0 '\000' >> n = <optimized out> >> ret = <optimized out> >> lcore_id = <optimized out> >> thread_id = 140737203603200 >> m2s = 14 >> s2m = 22 >> cpuset = "1", '\000' <repeats 175 times>, >> "\200\000\000\000\000\000\000\000\221\354e\360\377\177", '\000' >> <repeats 65 times> >> __func__ = "eal_thread_loop" >> #2 0x00007ffff065ddd5 in start_thread () from /lib64/libpthread.so.0 >> No symbol table info available. >> #3 0x00007ffff038702d in clone () from /lib64/libc.so.6 >> No symbol table info available. >> >> >> I added a rte_eal_mp_wait_lcore(), to ensure that each service lcore >> _did_ quit its loop. >> @@ -123,6 +123,7 @@ rte_service_finalize(void) >> return; >> >> rte_service_lcore_reset_all(); >> + rte_eal_mp_wait_lcore(); >> >> rte_free(rte_services); >> rte_free(lcore_states); >> >> >> I can't reproduce with this. > > OK - that's good news, thanks for the quick testing & feedback. > > Agree with your analysis of the above, indeed waiting for the cores > explicitly seems the right solution to remove the race. > > Will I spin up a v2 patchset with your co-authored-by added and the above > change included? Please spin the v2 - I am currently testing with David's incremental on my setup now. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH] eal/service: fix exit by resetting service lcores 2020-03-10 17:44 ` Aaron Conole @ 2020-03-10 19:14 ` Aaron Conole 0 siblings, 0 replies; 15+ messages in thread From: Aaron Conole @ 2020-03-10 19:14 UTC (permalink / raw) To: Van Haaren, Harry; +Cc: David Marchand, dev Aaron Conole <aconole@redhat.com> writes: > "Van Haaren, Harry" <harry.van.haaren@intel.com> writes: > >>> -----Original Message----- >>> From: David Marchand <david.marchand@redhat.com> >>> Sent: Tuesday, March 10, 2020 4:31 PM >>> To: Van Haaren, Harry <harry.van.haaren@intel.com> >>> Cc: dev <dev@dpdk.org>; Aaron Conole <aconole@redhat.com> >>> Subject: Re: [PATCH] eal/service: fix exit by resetting service lcores >>> >>> On Tue, Mar 10, 2020 at 2:32 PM Harry van Haaren >>> <harry.van.haaren@intel.com> wrote: >>> > >>> > This commit releases all service cores from thier role, >>> > returning them to ROLE_RTE on rte_service_finalize(). >>> > >>> > This may fix an issue relating to the service cores causing >>> > a race-condition on eal_cleanup(), where the service core >>> > could still be executing while the main thread has already >>> > free-d the service memory, leading to a segfault. >>> >>> Adding rte_service_lcore_reset_all() just tells a (remaining) service >>> lcore to quit its loop, but does not close the race on lcore_states. >>> >>> The backtrace shows the same. 
>>> >>> (gdb) bt full >>> #0 rte_service_runner_func (arg=<optimized out>) at >>> ../lib/librte_eal/common/rte_service.c:455 >>> service_mask = 1 >>> i = <optimized out> >>> lcore = 1 >>> cs = 0x1003ea200 >>> #1 0x00007ffff72030ef in eal_thread_loop (arg=<optimized out>) at >>> ../lib/librte_eal/linux/eal/eal_thread.c:153 >>> fct_arg = <optimized out> >>> c = 0 '\000' >>> n = <optimized out> >>> ret = <optimized out> >>> lcore_id = <optimized out> >>> thread_id = 140737203603200 >>> m2s = 14 >>> s2m = 22 >>> cpuset = "1", '\000' <repeats 175 times>, >>> "\200\000\000\000\000\000\000\000\221\354e\360\377\177", '\000' >>> <repeats 65 times> >>> __func__ = "eal_thread_loop" >>> #2 0x00007ffff065ddd5 in start_thread () from /lib64/libpthread.so.0 >>> No symbol table info available. >>> #3 0x00007ffff038702d in clone () from /lib64/libc.so.6 >>> No symbol table info available. >>> >>> >>> I added a rte_eal_mp_wait_lcore(), to ensure that each service lcore >>> _did_ quit its loop. >>> @@ -123,6 +123,7 @@ rte_service_finalize(void) >>> return; >>> >>> rte_service_lcore_reset_all(); >>> + rte_eal_mp_wait_lcore(); >>> >>> rte_free(rte_services); >>> rte_free(lcore_states); >>> >>> >>> I can't reproduce with this. >> >> OK - that's good news, thanks for the quick testing & feedback. >> >> Agree with your analysis of the above, indeed waiting for the cores >> explicitly seems the right solution to remove the race. >> >> Will I spin up a v2 patchset with your co-authored-by added and the above >> change included? > > Please spin the v2 - I am currently testing with David's incremental on > my setup now. Additionally, with the incremental: Acked-by: Aaron Conole <aconole@redhat.com> Please make sure to cc stable@. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH] eal/service: fix exit by resetting service lcores 2020-03-10 16:38 ` Van Haaren, Harry 2020-03-10 17:44 ` Aaron Conole @ 2020-03-11 9:09 ` David Marchand 1 sibling, 0 replies; 15+ messages in thread From: David Marchand @ 2020-03-11 9:09 UTC (permalink / raw) To: Van Haaren, Harry; +Cc: dev, Aaron Conole On Tue, Mar 10, 2020 at 5:38 PM Van Haaren, Harry <harry.van.haaren@intel.com> wrote: > > > -----Original Message----- > > From: David Marchand <david.marchand@redhat.com> > > Sent: Tuesday, March 10, 2020 4:31 PM > > To: Van Haaren, Harry <harry.van.haaren@intel.com> > > Cc: dev <dev@dpdk.org>; Aaron Conole <aconole@redhat.com> > > Subject: Re: [PATCH] eal/service: fix exit by resetting service lcores > > > > On Tue, Mar 10, 2020 at 2:32 PM Harry van Haaren > > <harry.van.haaren@intel.com> wrote: > > > > > > This commit releases all service cores from thier role, > > > returning them to ROLE_RTE on rte_service_finalize(). > > > > > > This may fix an issue relating to the service cores causing > > > a race-condition on eal_cleanup(), where the service core > > > could still be executing while the main thread has already > > > free-d the service memory, leading to a segfault. > > > > Adding rte_service_lcore_reset_all() just tells a (remaining) service > > lcore to quit its loop, but does not close the race on lcore_states. > > > > The backtrace shows the same. 
> > > > (gdb) bt full > > #0 rte_service_runner_func (arg=<optimized out>) at > > ../lib/librte_eal/common/rte_service.c:455 > > service_mask = 1 > > i = <optimized out> > > lcore = 1 > > cs = 0x1003ea200 > > #1 0x00007ffff72030ef in eal_thread_loop (arg=<optimized out>) at > > ../lib/librte_eal/linux/eal/eal_thread.c:153 > > fct_arg = <optimized out> > > c = 0 '\000' > > n = <optimized out> > > ret = <optimized out> > > lcore_id = <optimized out> > > thread_id = 140737203603200 > > m2s = 14 > > s2m = 22 > > cpuset = "1", '\000' <repeats 175 times>, > > "\200\000\000\000\000\000\000\000\221\354e\360\377\177", '\000' > > <repeats 65 times> > > __func__ = "eal_thread_loop" > > #2 0x00007ffff065ddd5 in start_thread () from /lib64/libpthread.so.0 > > No symbol table info available. > > #3 0x00007ffff038702d in clone () from /lib64/libc.so.6 > > No symbol table info available. > > > > > > I added a rte_eal_mp_wait_lcore(), to ensure that each service lcore > > _did_ quit its loop. > > @@ -123,6 +123,7 @@ rte_service_finalize(void) > > return; > > > > rte_service_lcore_reset_all(); > > + rte_eal_mp_wait_lcore(); > > > > rte_free(rte_services); > > rte_free(lcore_states); > > > > > > I can't reproduce with this. > > OK - that's good news, thanks for the quick testing & feedback. > > Agree with your analysis of the above, indeed waiting for the cores > explicitly seems the right solution to remove the race. Another thing that seemed odd with your patch is that the unit test already calls rte_service_lcore_reset_all() as part of the unregister_all() helper. Why don't we ensure that calling rte_service_lcore_start|stop|reset_all guarantee the service lcores status? Putting explicit (and documented) synchronisation points in the rte_service API seems the right fix to me and could help remove those rte_delay we have in the unit test. -- David Marchand ^ permalink raw reply [flat|nested] 15+ messages in thread
* [dpdk-dev] [PATCH v2] eal/service: fix exit by resetting service lcores 2020-03-10 13:33 [dpdk-dev] [PATCH] eal/service: fix exit by resetting service lcores Harry van Haaren 2020-03-10 16:31 ` David Marchand @ 2020-03-11 14:39 ` Harry van Haaren 2020-03-11 16:15 ` David Marchand 2020-03-13 10:04 ` David Marchand 1 sibling, 2 replies; 15+ messages in thread From: Harry van Haaren @ 2020-03-11 14:39 UTC (permalink / raw) To: dev; +Cc: david.marchand, aconole, Harry van Haaren, stable This commit releases all service cores from their role, returning them to ROLE_RTE on rte_service_finalize(). This may fix an issue relating to the service cores causing a race-condition on eal_cleanup(), where the service core could still be executing while the main thread has already free-d the service memory, leading to a segfault. Fixes: 21698354c832 ("service: introduce service cores concept") Cc: stable@dpdk.org Reported-by: David Marchand <david.marchand@redhat.com> Reported-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: Aaron Conole <aconole@redhat.com> --- v2: - Added rte_eal_mp_wait_lcore() after reset (David) - Added Signed-off and Acked from mailing list (David, Aaron) --- lib/librte_eal/common/rte_service.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c index 7e537b8cd..b0b78baab 100644 --- a/lib/librte_eal/common/rte_service.c +++ b/lib/librte_eal/common/rte_service.c @@ -122,6 +122,9 @@ rte_service_finalize(void) if (!rte_service_library_initialized) return; + rte_service_lcore_reset_all(); + rte_eal_mp_wait_lcore(); + rte_free(rte_services); rte_free(lcore_states); -- 2.17.1 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH v2] eal/service: fix exit by resetting service lcores 2020-03-11 14:39 ` [dpdk-dev] [PATCH v2] " Harry van Haaren @ 2020-03-11 16:15 ` David Marchand 2020-03-11 16:21 ` Van Haaren, Harry 2020-03-11 17:08 ` Aaron Conole 2020-03-13 10:04 ` David Marchand 1 sibling, 2 replies; 15+ messages in thread From: David Marchand @ 2020-03-11 16:15 UTC (permalink / raw) To: Harry van Haaren; +Cc: dev, Aaron Conole, dpdk stable On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren <harry.van.haaren@intel.com> wrote: > > This commit releases all service cores from their role, > returning them to ROLE_RTE on rte_service_finalize(). > > This may fix an issue relating to the service cores causing You don't seem convinced. > a race-condition on eal_cleanup(), where the service core > could still be executing while the main thread has already > free-d the service memory, leading to a segfault. > > Fixes: 21698354c832 ("service: introduce service cores concept") > Cc: stable@dpdk.org > > Reported-by: David Marchand <david.marchand@redhat.com> > Reported-by: Aaron Conole <aconole@redhat.com> > Signed-off-by: David Marchand <david.marchand@redhat.com> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> > Acked-by: Aaron Conole <aconole@redhat.com> I am okay with merging this so that we stop getting random failures of the unit test. I will leave this patch on the ml and apply it on Friday at worst. Please take the time to reply to my question. Thanks. -- David Marchand ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH v2] eal/service: fix exit by resetting service lcores 2020-03-11 16:15 ` David Marchand @ 2020-03-11 16:21 ` Van Haaren, Harry 2020-03-12 8:59 ` David Marchand 2020-03-11 17:08 ` Aaron Conole 1 sibling, 1 reply; 15+ messages in thread From: Van Haaren, Harry @ 2020-03-11 16:21 UTC (permalink / raw) To: David Marchand; +Cc: dev, Aaron Conole, dpdk stable > -----Original Message----- > From: David Marchand <david.marchand@redhat.com> > Sent: Wednesday, March 11, 2020 4:16 PM > To: Van Haaren, Harry <harry.van.haaren@intel.com> > Cc: dev <dev@dpdk.org>; Aaron Conole <aconole@redhat.com>; dpdk stable > <stable@dpdk.org> > Subject: Re: [PATCH v2] eal/service: fix exit by resetting service lcores > > On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren > <harry.van.haaren@intel.com> wrote: > > > > This commit releases all service cores from their role, > > returning them to ROLE_RTE on rte_service_finalize(). > > > > This may fix an issue relating to the service cores causing > > You don't seem convinced. Apologies - kept from v1 of commit message, should have removed "may" for v2. Issue was that service cores can remain running while main thread has freed service-core memory, later racy return of service lcore then causes use-after-free. This commit fixes it by A) resetting all service cores to return B) waiting for them to return C) freeing memory I am confident in the fix. > > a race-condition on eal_cleanup(), where the service core > > could still be executing while the main thread has already > > free-d the service memory, leading to a segfault. 
> > > > Fixes: 21698354c832 ("service: introduce service cores concept") > > Cc: stable@dpdk.org > > > > Reported-by: David Marchand <david.marchand@redhat.com> > > Reported-by: Aaron Conole <aconole@redhat.com> > > Signed-off-by: David Marchand <david.marchand@redhat.com> > > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> > > Acked-by: Aaron Conole <aconole@redhat.com> > > I am okay with merging this so that we stop getting random failures of the > ut. I will let this patch on the ml and apply on Friday at worse. > > Please take the time to reply to my question. > Thanks. Thanks, -Harry ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH v2] eal/service: fix exit by resetting service lcores 2020-03-11 16:21 ` Van Haaren, Harry @ 2020-03-12 8:59 ` David Marchand 0 siblings, 0 replies; 15+ messages in thread From: David Marchand @ 2020-03-12 8:59 UTC (permalink / raw) To: Van Haaren, Harry; +Cc: dev, Aaron Conole, dpdk stable Hello, On Wed, Mar 11, 2020 at 5:21 PM Van Haaren, Harry <harry.van.haaren@intel.com> wrote: > Issue was that service cores can remain running while main thread > has freed service-core memory, later racy return of service lcore > then causes use-after-free. > > This commit fixes it by > A) resetting all service cores to return > B) waiting for them to return > C) freeing memory > > I am confident in the fix. Ok. > > > a race-condition on eal_cleanup(), where the service core > > > could still be executing while the main thread has already > > > free-d the service memory, leading to a segfault. > > > > > > Fixes: 21698354c832 ("service: introduce service cores concept") The race per se was introduced with: da23f0aa87d8 ("service: fix memory leak with new function") -- David Marchand ^ permalink raw reply [flat|nested] 15+ messages in thread
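[Editor's sketch] The A/B/C ordering quoted above (ask the worker to return, wait for it to return, only then free the memory) can be illustrated without DPDK. The following is a minimal sketch using plain pthreads and C11 atomics, not the rte_service implementation: struct core_state, service_runner() and shutdown_demo() are hypothetical names, and the comments only map each step loosely onto rte_service_lcore_reset_all() and rte_eal_mp_wait_lcore() from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-lcore state freed at finalize time. */
struct core_state {
    atomic_int runstate;   /* 1 = keep running, 0 = asked to stop */
    unsigned long loops;   /* written only by the worker thread */
};

struct core_state *states; /* plays the role of lcore_states */

/* Worker loop: dereferences the shared state until told to stop. */
void *service_runner(void *arg)
{
    struct core_state *cs = arg;

    while (atomic_load(&cs->runstate))
        cs->loops++;
    return NULL;
}

/* Runs one start/stop cycle; returns 0 on success, -1 on failure. */
int shutdown_demo(void)
{
    pthread_t worker;

    states = calloc(1, sizeof(*states));
    if (states == NULL)
        return -1;
    atomic_store(&states->runstate, 1);

    if (pthread_create(&worker, NULL, service_runner, states) != 0) {
        free(states);
        states = NULL;
        return -1;
    }

    /* A) ask the worker to leave its loop (the "reset" step) */
    atomic_store(&states->runstate, 0);

    /* B) wait until the worker has really returned (the "wait" step) */
    if (pthread_join(worker, NULL) != 0)
        return -1;

    /* C) no thread can touch the state any more, so freeing is safe */
    free(states);
    states = NULL;
    return 0;
}
```

Skipping step B is what leaves the window open: the worker may still evaluate its loop condition on memory that step C has already released.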
* Re: [dpdk-dev] [PATCH v2] eal/service: fix exit by resetting service lcores 2020-03-11 16:15 ` David Marchand 2020-03-11 16:21 ` Van Haaren, Harry @ 2020-03-11 17:08 ` Aaron Conole 2020-03-12 9:03 ` David Marchand 1 sibling, 1 reply; 15+ messages in thread From: Aaron Conole @ 2020-03-11 17:08 UTC (permalink / raw) To: David Marchand; +Cc: Harry van Haaren, dev, dpdk stable David Marchand <david.marchand@redhat.com> writes: > On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren > <harry.van.haaren@intel.com> wrote: >> >> This commit releases all service cores from their role, >> returning them to ROLE_RTE on rte_service_finalize(). >> >> This may fix an issue relating to the service cores causing > > You don't seem convinced. > > >> a race-condition on eal_cleanup(), where the service core >> could still be executing while the main thread has already >> free-d the service memory, leading to a segfault. >> >> Fixes: 21698354c832 ("service: introduce service cores concept") >> Cc: stable@dpdk.org >> >> Reported-by: David Marchand <david.marchand@redhat.com> >> Reported-by: Aaron Conole <aconole@redhat.com> >> Signed-off-by: David Marchand <david.marchand@redhat.com> >> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> >> Acked-by: Aaron Conole <aconole@redhat.com> > > I am okay with merging this so that we stop getting random failures of the ut. I think it could also potentially cause errors in user applications that regularly exit, and which use the service core architecture. So it's worth getting in now, anyway. > I will let this patch on the ml and apply on Friday at worse. > > Please take the time to reply to my question. > Thanks. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH v2] eal/service: fix exit by resetting service lcores 2020-03-11 17:08 ` Aaron Conole @ 2020-03-12 9:03 ` David Marchand 0 siblings, 0 replies; 15+ messages in thread From: David Marchand @ 2020-03-12 9:03 UTC (permalink / raw) To: Aaron Conole; +Cc: Harry van Haaren, dev, dpdk stable On Wed, Mar 11, 2020 at 6:08 PM Aaron Conole <aconole@redhat.com> wrote: > > David Marchand <david.marchand@redhat.com> writes: > > > On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren > > <harry.van.haaren@intel.com> wrote: > >> > >> This commit releases all service cores from their role, > >> returning them to ROLE_RTE on rte_service_finalize(). > >> > >> This may fix an issue relating to the service cores causing > > > > You don't seem convinced. > > > > > >> a race-condition on eal_cleanup(), where the service core > >> could still be executing while the main thread has already > >> free-d the service memory, leading to a segfault. > >> > >> Fixes: 21698354c832 ("service: introduce service cores concept") > >> Cc: stable@dpdk.org > >> > >> Reported-by: David Marchand <david.marchand@redhat.com> > >> Reported-by: Aaron Conole <aconole@redhat.com> > >> Signed-off-by: David Marchand <david.marchand@redhat.com> > >> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> > >> Acked-by: Aaron Conole <aconole@redhat.com> > > > > I am okay with merging this so that we stop getting random failures of the ut. > > I think it could also potentially cause errors in user applications that > regularly exit, and which use the service core architecture. So it's > worth getting in now, anyway. Indeed, thanks for the clarification. In my defense, we did not get reports of such crashes outside of the CI. The CI is the main reason why I (selfishly :-)) have been pressing on this issue. -- David Marchand ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH v2] eal/service: fix exit by resetting service lcores 2020-03-11 14:39 ` [dpdk-dev] [PATCH v2] " Harry van Haaren 2020-03-11 16:15 ` David Marchand @ 2020-03-13 10:04 ` David Marchand 2020-04-06 10:30 ` Burakov, Anatoly 1 sibling, 1 reply; 15+ messages in thread From: David Marchand @ 2020-03-13 10:04 UTC (permalink / raw) To: Harry van Haaren; +Cc: dev, Aaron Conole, dpdk stable On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren <harry.van.haaren@intel.com> wrote: > > This commit releases all service cores from their role, > returning them to ROLE_RTE on rte_service_finalize(). > > This may fix an issue relating to the service cores causing s/may fix/fixes/ > a race-condition on eal_cleanup(), where the service core > could still be executing while the main thread has already > free-d the service memory, leading to a segfault. > > Fixes: 21698354c832 ("service: introduce service cores concept") Replaced with: Fixes: da23f0aa87d8 ("service: fix memory leak with new function") > Cc: stable@dpdk.org > > Reported-by: David Marchand <david.marchand@redhat.com> > Reported-by: Aaron Conole <aconole@redhat.com> > Signed-off-by: David Marchand <david.marchand@redhat.com> > Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> > Acked-by: Aaron Conole <aconole@redhat.com> Applied, thanks. -- David Marchand ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdk-dev] [PATCH v2] eal/service: fix exit by resetting service lcores 2020-03-13 10:04 ` David Marchand @ 2020-04-06 10:30 ` Burakov, Anatoly 2020-04-14 13:22 ` Aaron Conole 0 siblings, 1 reply; 15+ messages in thread From: Burakov, Anatoly @ 2020-04-06 10:30 UTC (permalink / raw) To: David Marchand, Harry van Haaren, Ananyev, Konstantin Cc: dev, Aaron Conole, dpdk stable On 13-Mar-20 10:04 AM, David Marchand wrote: > On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren > <harry.van.haaren@intel.com> wrote: >> >> This commit releases all service cores from their role, >> returning them to ROLE_RTE on rte_service_finalize(). >> >> This may fix an issue relating to the service cores causing > > s/may fix/fixes/ > >> a race-condition on eal_cleanup(), where the service core >> could still be executing while the main thread has already >> free-d the service memory, leading to a segfault. >> >> Fixes: 21698354c832 ("service: introduce service cores concept") > > Replaced with: > Fixes: da23f0aa87d8 ("service: fix memory leak with new function") > >> Cc: stable@dpdk.org >> >> Reported-by: David Marchand <david.marchand@redhat.com> >> Reported-by: Aaron Conole <aconole@redhat.com> >> Signed-off-by: David Marchand <david.marchand@redhat.com> >> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> >> Acked-by: Aaron Conole <aconole@redhat.com> > > Applied, thanks. > > This patch breaks a couple of apps (or rather the apps were broken to begin with, but the brokenness has been exposed with this patch). A "good" way to handle a SIGINT is to catch it, set some kind of global exit flag, and exit the signal handler, so that all of the threads see the exit flag, stop spinning, and exit the main loop and proceed to gracefully shutdown. That's what majority of our apps do. A bad way to handle SIGINT is to call rte_exit() inside the signal handler, without setting any global exit flags. 
Since rte_exit() now waits for all of the threads to stop, the exit never actually happens: the threads cannot stop without an exit signal, and the signal handler provides none. Affected apps: * l3fwd-power (I'm preparing a patch) * ip_reassembly (see main.c:988) - +Konstantin There are also a number of apps that simply call exit(0) and shut down uncleanly without DPDK cleanup, and apps whose intent I don't understand (calling kill() on themselves in the SIGINT handler? l3fwd-cat does that, as do several others), but this is probably a bigger problem that should be addressed separately. -- Thanks, Anatoly ^ permalink raw reply [flat|nested] 15+ messages in thread
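[Editor's sketch] The "good" SIGINT pattern described in the message above (the handler only sets a global exit flag, and every loop polls it) can be sketched in standard C as follows. The names force_quit, on_sigint(), install_sigint_handler() and poll_loop() are illustrative assumptions, not code taken from any DPDK example; a real application would run its packet-processing work inside the loop and call its cleanup after the loop returns.

```c
#include <assert.h>
#include <signal.h>

/* Global exit flag; sig_atomic_t is the type the C standard allows a
 * signal handler to write safely. */
volatile sig_atomic_t force_quit;

/* The handler does nothing except record that an exit was requested. */
void on_sigint(int signum)
{
    (void)signum;
    force_quit = 1;
}

/* Installs the handler; returns 0 on success, -1 on failure. */
int install_sigint_handler(void)
{
    return signal(SIGINT, on_sigint) == SIG_ERR ? -1 : 0;
}

/* Stand-in for a worker main loop: runs until the flag is set or until
 * max_iters iterations have been done (the bound only keeps the sketch
 * finite). Returns the number of iterations executed. */
unsigned long poll_loop(unsigned long max_iters)
{
    unsigned long n = 0;

    while (!force_quit && n < max_iters)
        n++; /* real work would go here */
    return n;
}
```

Because the handler returns immediately, the main thread gets the chance to see its workers leave their loops and then perform an orderly shutdown, instead of blocking inside the handler on threads that were never told to stop.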
* Re: [dpdk-dev] [PATCH v2] eal/service: fix exit by resetting service lcores 2020-04-06 10:30 ` Burakov, Anatoly @ 2020-04-14 13:22 ` Aaron Conole 0 siblings, 0 replies; 15+ messages in thread From: Aaron Conole @ 2020-04-14 13:22 UTC (permalink / raw) To: Burakov, Anatoly Cc: David Marchand, Harry van Haaren, Ananyev, Konstantin, dev, dpdk stable "Burakov, Anatoly" <anatoly.burakov@intel.com> writes: > On 13-Mar-20 10:04 AM, David Marchand wrote: >> On Wed, Mar 11, 2020 at 3:39 PM Harry van Haaren >> <harry.van.haaren@intel.com> wrote: >>> >>> This commit releases all service cores from their role, >>> returning them to ROLE_RTE on rte_service_finalize(). >>> >>> This may fix an issue relating to the service cores causing >> >> s/may fix/fixes/ >> >>> a race-condition on eal_cleanup(), where the service core >>> could still be executing while the main thread has already >>> free-d the service memory, leading to a segfault. >>> >>> Fixes: 21698354c832 ("service: introduce service cores concept") >> >> Replaced with: >> Fixes: da23f0aa87d8 ("service: fix memory leak with new function") >> >>> Cc: stable@dpdk.org >>> >>> Reported-by: David Marchand <david.marchand@redhat.com> >>> Reported-by: Aaron Conole <aconole@redhat.com> >>> Signed-off-by: David Marchand <david.marchand@redhat.com> >>> Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> >>> Acked-by: Aaron Conole <aconole@redhat.com> >> >> Applied, thanks. >> >> > > This patch breaks a couple of apps (or rather the apps were broken to > begin with, but the brokenness has been exposed with this patch). > > A "good" way to handle a SIGINT is to catch it, set some kind of > global exit flag, and exit the signal handler, so that all of the > threads see the exit flag, stop spinning, and exit the main loop and > proceed to gracefully shutdown. That's what majority of our apps do. > > A bad way to handle SIGINT is to call rte_exit() inside the signal > handler, without setting any global exit flags. 
Since rte_exit() now > waits for all of the threads to stop, the exit will never actually > happen because threads can't stop without an exit signal, and no exit > signal was provided by the signal handler. Yes, I don't consider it 'breaking' anything - calling exit in a signal handler is always a bad idea. I guess we should correct the examples to show this. > Affected apps: > > * l3fwd-power (i'm preparing a patch) > * ip_reassembly (see main.c:988) - +Konstantin > > There are also a bunch of apps that simply call exit(0) and do unclean > shutdown without DPDK cleanup, and also apps i have no idea what > they're doing (call kill() on themselves in the SIGINT handler? > l3fwd-cat does that, so do a bunch of others), but this is probably a > bigger problem that should be addressed separately. I think one way to mitigate this is to register an atexit() function that will check if EAL is currently initialized and do the needed cleanup call. I don't know if there are any side-effects that we need to consider for it, though. ^ permalink raw reply [flat|nested] 15+ messages in thread
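[Editor's sketch] The exit-handler mitigation suggested above could look roughly like the following, using the standard C atexit() function. This is only a sketch under stated assumptions: eal_initialized, cleanup_calls, cleanup_if_initialized() and app_init() are hypothetical names, and the place where a real application would call rte_eal_cleanup() is marked with a comment rather than implemented. Whether such a hook has side effects inside EAL is left open here, as it is in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

bool eal_initialized; /* hypothetical "EAL is up" flag kept by the app */
int cleanup_calls;    /* counts how often cleanup really ran (demo only) */

/* Exit handler: does the cleanup only if initialization happened and
 * cleanup has not already been done, so an explicit cleanup followed by
 * a normal exit does not clean up twice. */
void cleanup_if_initialized(void)
{
    if (!eal_initialized)
        return;
    /* a real application would call rte_eal_cleanup() here */
    eal_initialized = false;
    cleanup_calls++;
}

/* Marks the app as initialized and registers the exit handler.
 * Returns 0 on success, nonzero if atexit() could not register it. */
int app_init(void)
{
    eal_initialized = true;
    return atexit(cleanup_if_initialized);
}
```

Note that atexit() handlers run on exit() and on return from main(), but not when the process is terminated by an unhandled signal or leaves through _exit(), so this complements rather than replaces a signal handler that sets an exit flag.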