DPDK patches and discussions
* [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
@ 2020-01-16 19:50 Aaron Conole
  2020-01-17  8:17 ` David Marchand
  0 siblings, 1 reply; 10+ messages in thread
From: Aaron Conole @ 2020-01-16 19:50 UTC (permalink / raw)
  To: dev; +Cc: Harry Van Haaren, David Marchand

I've noticed an occasional segfault from the build system in the
service_autotest and after talking with David (CC'd), it seems like it's
due to the rte_service_finalize deleting the lcore_states object while
active lcores are running.

The below patch is an attempt to solve it by first reassigning all the
lcores back to ROLE_RTE before releasing the memory.  There is probably
a larger question for DPDK proper about actually closing the pending
lcore threads, but that's a separate issue.  I've been running with the
patch for a while, and haven't seen the crash anymore on my system.

Thoughts?  Is it acceptable as-is?
---
diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 7e537b8cd2..7d13287bee 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -71,6 +71,8 @@ static struct rte_service_spec_impl *rte_services;
 static struct core_state *lcore_states;
 static uint32_t rte_service_library_initialized;
 
+static void service_lcore_uninit(void);
+
 int32_t
 rte_service_init(void)
 {
@@ -122,6 +124,9 @@ rte_service_finalize(void)
 	if (!rte_service_library_initialized)
 		return;
 
+	/* Ensure that all service threads are returned to the ROLE_RTE
+	 */
+	service_lcore_uninit();
 	rte_free(rte_services);
 	rte_free(lcore_states);
 
@@ -897,3 +902,14 @@ rte_service_dump(FILE *f, uint32_t id)
 
 	return 0;
 }
+
+static void service_lcore_uninit(void)
+{
+	unsigned lcore_id;
+	RTE_LCORE_FOREACH(lcore_id) {
+		if (!lcore_states[lcore_id].is_service_core)
+			continue;
+
+		while (rte_service_lcore_del(lcore_id) == -EBUSY);
+	}
+}
---



* Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
  2020-01-16 19:50 [dpdk-dev] [RFC] service: stop lcore threads before 'finalize' Aaron Conole
@ 2020-01-17  8:17 ` David Marchand
  2020-02-04 13:34   ` David Marchand
  0 siblings, 1 reply; 10+ messages in thread
From: David Marchand @ 2020-01-17  8:17 UTC (permalink / raw)
  To: Aaron Conole, Harry Van Haaren; +Cc: dev

On Thu, Jan 16, 2020 at 8:50 PM Aaron Conole <aconole@redhat.com> wrote:
>
> I've noticed an occasional segfault from the build system in the
> service_autotest and after talking with David (CC'd), it seems like it's
> due to the rte_service_finalize deleting the lcore_states object while
> active lcores are running.
>
> The below patch is an attempt to solve it by first reassigning all the
> lcores back to ROLE_RTE before releasing the memory.  There is probably
> a larger question for DPDK proper about actually closing the pending
> lcore threads, but that's a separate issue.  I've been running with the
> patch for a while, and haven't seen the crash anymore on my system.
>
> Thoughts?  Is it acceptable as-is?

Added this patch to my env, still reproducing the same issue after ~10-20 tries.
I added a breakpoint to service_lcore_uninit that is indeed caught
when exiting the test application (just wanted to make sure your
change was in my binary).


To reproduce:

I modified app/test/meson.build to have an explicit "-l 0-1" +
compiled with your patch.
Then, I started a dummy busyloop "while true; do true; done" in a
shell that I had pinned to core 1 (taskset -pc 1 $$).
Finally, started another shell (as root), pinned to cores 0-1 on my
laptop (taskset -pc 0,1 $$) and ran meson test --gdb  --repeat=10000
service_autotest

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff4922700 (LWP 8572)]
rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:458
458            cs->loops++;
A debugging session is active.

    Inferior 1 [process 8566] will be killed.

Quit anyway? (y or n) n
Not confirmed.
Missing separate debuginfos, use: debuginfo-install
elfutils-libelf-0.172-2.el7.x86_64 glibc-2.17-260.el7_6.6.x86_64
libgcc-4.8.5-36.el7_6.2.x86_64 libibverbs-17.2-3.el7.x86_64
libnl3-3.2.28-4.el7.x86_64 libpcap-1.5.3-11.el7.x86_64
numactl-libs-2.0.9-7.el7.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) info threads
  Id   Target Id         Frame
* 4    Thread 0x7ffff4922700 (LWP 8572) "lcore-slave-1"
rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:458
  3    Thread 0x7ffff5123700 (LWP 8571) "rte_mp_handle"
0x00007ffff63a4b4d in recvmsg () from /lib64/libpthread.so.0
  2    Thread 0x7ffff5924700 (LWP 8570) "eal-intr-thread"
0x00007ffff60c7603 in epoll_wait () from /lib64/libc.so.6
  1    Thread 0x7ffff7fd2c00 (LWP 8566) "dpdk-test" 0x00007ffff7deb96f
in _dl_name_match_p () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0  rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:458
#1  0x0000000000b2c84f in eal_thread_loop (arg=<optimized out>) at
../lib/librte_eal/linux/eal/eal_thread.c:153
#2  0x00007ffff639ddd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff60c702d in clone () from /lib64/libc.so.6
(gdb) f 0
#0  rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:458
458            cs->loops++;
(gdb) p *cs
$1 = {service_mask = 0, runstate = 0 '\000', is_service_core = 0
'\000', service_active_on_lcore = '\000' <repeats 63 times>, loops =
0, calls_per_service = {0 <repeats 64 times>}}
(gdb) p lcore_config[1]
$2 = {thread_id = 140737296606976, pipe_master2slave = {14, 20},
pipe_slave2master = {21, 22}, f = 0xb26ec0 <rte_service_runner_func>,
arg = 0x0, ret = 0, state = RUNNING, socket_id = 0, core_id = 1,
  core_index = 1, core_role = 0 '\000', detected = 1 '\001', cpuset =
{__bits = {2, 0 <repeats 15 times>}}}
(gdb) p lcore_config[0]
$3 = {thread_id = 0, pipe_master2slave = {0, 0}, pipe_slave2master =
{0, 0}, f = 0x0, arg = 0x0, ret = 0, state = WAIT, socket_id = 0,
core_id = 0, core_index = 0, core_role = 0 '\000', detected = 1
'\001',
  cpuset = {__bits = {1, 0 <repeats 15 times>}}}

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fd2c00 (LWP 8566))]
#0  0x00007ffff7deb96f in _dl_name_match_p () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0  0x00007ffff7deb96f in _dl_name_match_p () from /lib64/ld-linux-x86-64.so.2
#1  0x00007ffff7de4756 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ffff7de4fcf in _dl_lookup_symbol_x () from
/lib64/ld-linux-x86-64.so.2
#3  0x00007ffff7de9d1e in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff7df19da in _dl_runtime_resolve_xsavec () from
/lib64/ld-linux-x86-64.so.2
#5  0x00007ffff7deafba in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff6002c29 in __run_exit_handlers () from /lib64/libc.so.6
#7  0x00007ffff6002c77 in exit () from /lib64/libc.so.6
#8  0x00007ffff5feb49c in __libc_start_main () from /lib64/libc.so.6
#9  0x00000000004fa126 in _start ()


--
David Marchand



* Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
  2020-01-17  8:17 ` David Marchand
@ 2020-02-04 13:34   ` David Marchand
  2020-02-04 14:50     ` Aaron Conole
  0 siblings, 1 reply; 10+ messages in thread
From: David Marchand @ 2020-02-04 13:34 UTC (permalink / raw)
  To: Harry Van Haaren; +Cc: dev, Aaron Conole

On Fri, Jan 17, 2020 at 9:17 AM David Marchand
<david.marchand@redhat.com> wrote:
>
> On Thu, Jan 16, 2020 at 8:50 PM Aaron Conole <aconole@redhat.com> wrote:
> >
> > I've noticed an occasional segfault from the build system in the
> > service_autotest and after talking with David (CC'd), it seems like it's
> > due to the rte_service_finalize deleting the lcore_states object while
> > active lcores are running.
> >
> > The below patch is an attempt to solve it by first reassigning all the
> > lcores back to ROLE_RTE before releasing the memory.  There is probably
> > a larger question for DPDK proper about actually closing the pending
> > lcore threads, but that's a separate issue.  I've been running with the
> > patch for a while, and haven't seen the crash anymore on my system.
> >
> > Thoughts?  Is it acceptable as-is?
>
> Added this patch to my env, still reproducing the same issue after ~10-20 tries.
> I added a breakpoint to service_lcore_uninit that is indeed caught
> when exiting the test application (just wanted to make sure your
> change was in my binary).

Harry,

We need a fix for this issue.

Interestingly, Stephen's patch that joins all pthreads at
rte_eal_cleanup [1] makes this issue disappear.
So my understanding is that we are missing an API (well, I could not
find a way) to synchronously stop service lcores.


1: https://patchwork.dpdk.org/patch/64201/
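
To illustrate the gap: this is roughly what an application has to do today,
with the existing public API, to stop its service lcores synchronously before
cleanup (sketch only, untested; it assumes the services mapped on those lcores
were already stopped with rte_service_runstate_set(id, 0), otherwise the stop
call keeps returning -EBUSY):

#include <rte_common.h>
#include <rte_lcore.h>
#include <rte_launch.h>
#include <rte_pause.h>
#include <rte_service.h>

/* Sketch: synchronously stop and release every service lcore from the
 * application, assuming the services mapped on them are already stopped. */
static void
app_stop_service_lcores(void)
{
	uint32_t ids[RTE_MAX_LCORE];
	int32_t i, n = rte_service_lcore_list(ids, RTE_DIM(ids));

	for (i = 0; i < n; i++) {
		/* Ask the runner loop on this lcore to exit. */
		while (rte_service_lcore_stop(ids[i]) == -EBUSY)
			rte_pause();
		/* Block until rte_service_runner_func() has actually
		 * returned on that lcore: this is the "synchronous" part
		 * that has no single API call today. */
		rte_eal_wait_lcore(ids[i]);
		/* Give the lcore back to ROLE_RTE. */
		rte_service_lcore_del(ids[i]);
	}
}

Nothing equivalent runs inside rte_service_finalize() itself, which is why the
frees there can race with the runner loops.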

-- 
David Marchand



* Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
  2020-02-04 13:34   ` David Marchand
@ 2020-02-04 14:50     ` Aaron Conole
  2020-02-10 14:16       ` Van Haaren, Harry
  0 siblings, 1 reply; 10+ messages in thread
From: Aaron Conole @ 2020-02-04 14:50 UTC (permalink / raw)
  To: David Marchand; +Cc: Harry Van Haaren, dev

David Marchand <david.marchand@redhat.com> writes:

> On Fri, Jan 17, 2020 at 9:17 AM David Marchand
> <david.marchand@redhat.com> wrote:
>>
>> On Thu, Jan 16, 2020 at 8:50 PM Aaron Conole <aconole@redhat.com> wrote:
>> >
>> > I've noticed an occasional segfault from the build system in the
>> > service_autotest and after talking with David (CC'd), it seems like it's
>> > due to the rte_service_finalize deleting the lcore_states object while
>> > active lcores are running.
>> >
>> > The below patch is an attempt to solve it by first reassigning all the
>> > lcores back to ROLE_RTE before releasing the memory.  There is probably
>> > a larger question for DPDK proper about actually closing the pending
>> > lcore threads, but that's a separate issue.  I've been running with the
>> > patch for a while, and haven't seen the crash anymore on my system.
>> >
>> > Thoughts?  Is it acceptable as-is?
>>
>> Added this patch to my env, still reproducing the same issue after ~10-20 tries.
>> I added a breakpoint to service_lcore_uninit that is indeed caught
>> when exiting the test application (just wanted to make sure your
>> change was in my binary).
>
> Harry,
>
> We need a fix for this issue.

+1

> Interestingly, Stephen patch that joins all pthreads at
> rte_eal_cleanup [1] makes this issue disappear.
> So my understanding is that we are missing a api (well, I could not
> find a way) to synchronously stop service lcores.

Maybe we can take that patch as a fix.  I hate to see this segfault
in the field.  I need to figure out what I missed in my cleanup
(probably missed a synchronization point).

>
> 1: https://patchwork.dpdk.org/patch/64201/



* Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
  2020-02-04 14:50     ` Aaron Conole
@ 2020-02-10 14:16       ` Van Haaren, Harry
  2020-02-10 14:42         ` David Marchand
  2020-02-20 13:25         ` David Marchand
  0 siblings, 2 replies; 10+ messages in thread
From: Van Haaren, Harry @ 2020-02-10 14:16 UTC (permalink / raw)
  To: Aaron Conole, David Marchand; +Cc: dev

> -----Original Message-----
> From: Aaron Conole <aconole@redhat.com>
> Sent: Tuesday, February 4, 2020 2:51 PM
> To: David Marchand <david.marchand@redhat.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev <dev@dpdk.org>
> Subject: Re: [RFC] service: stop lcore threads before 'finalize'
> 
> David Marchand <david.marchand@redhat.com> writes:
> 
> > On Fri, Jan 17, 2020 at 9:17 AM David Marchand
> > <david.marchand@redhat.com> wrote:
> >>
> >> On Thu, Jan 16, 2020 at 8:50 PM Aaron Conole <aconole@redhat.com> wrote:
> >> >
> >> > I've noticed an occasional segfault from the build system in the
> >> > service_autotest and after talking with David (CC'd), it seems like
> it's
> >> > due to the rte_service_finalize deleting the lcore_states object while
> >> > active lcores are running.
> >> >
> >> > The below patch is an attempt to solve it by first reassigning all the
> >> > lcores back to ROLE_RTE before releasing the memory.  There is probably
> >> > a larger question for DPDK proper about actually closing the pending
> >> > lcore threads, but that's a separate issue.  I've been running with the
> >> > patch for a while, and haven't seen the crash anymore on my system.
> >> >
> >> > Thoughts?  Is it acceptable as-is?
> >>
> >> Added this patch to my env, still reproducing the same issue after ~10-20
> tries.
> >> I added a breakpoint to service_lcore_uninit that is indeed caught
> >> when exiting the test application (just wanted to make sure your
> >> change was in my binary).
> >
> > Harry,
> >
> > We need a fix for this issue.
> 
> +1

Hi All,

> > Interestingly, Stephen patch that joins all pthreads at
> > rte_eal_cleanup [1] makes this issue disappear.
> > So my understanding is that we are missing a api (well, I could not
> > find a way) to synchronously stop service lcores.
> 
> Maybe we can take that patch as a fix.  I hate to see this segfault
> in the field.  I need to figure out what I missed in my cleanup
> (probably missed a synchronization point).

I haven't easily reproduced this yet - so I'll investigate a way to 
reproduce with close to 100% rate, then we can identify the root cause
and actually get a clean fix. If you have pointers to reproduce easily,
please let me know.

-H

> > 1: https://patchwork.dpdk.org/patch/64201/


* Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
  2020-02-10 14:16       ` Van Haaren, Harry
@ 2020-02-10 14:42         ` David Marchand
  2020-02-20 13:25         ` David Marchand
  1 sibling, 0 replies; 10+ messages in thread
From: David Marchand @ 2020-02-10 14:42 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: Aaron Conole, dev

On Mon, Feb 10, 2020 at 3:16 PM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
> I haven't easily reproduced this yet - so I'll investigate a way to
> reproduce with close to 100% rate, then we can identify the root cause
> and actually get a clean fix. If you have pointers to reproduce easily,
> please let me know.

- In shell #1:

$ git reset --hard v20.02-rc2
HEAD is now at 2636c2a23 version: 20.02-rc2
$ rm -rf build

$ git diff
diff --git a/app/test/meson.build b/app/test/meson.build
index 3675ffb5c..23c00a618 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -400,7 +400,7 @@ timeout_seconds = 600
 timeout_seconds_fast = 10

 get_coremask = find_program('get-coremask.sh')
-num_cores_arg = '-l ' + run_command(get_coremask).stdout().strip()
+num_cores_arg = '-l 0,1'

 test_args = [num_cores_arg]
 foreach arg : fast_test_names

$ meson --werror --buildtype=debugoptimized build
The Meson build system
Version: 0.47.2
Source dir: /home/dmarchan/dpdk
Build dir: /home/dmarchan/dpdk/build
Build type: native build
Program cat found: YES (/usr/bin/cat)
Project name: DPDK
Project version: 20.02.0-rc2
...

$ ninja-build -C build
ninja: Entering directory `build'
[2081/2081] Linking target app/test/dpdk-test.

$ taskset -pc 1 $$
pid 11143's current affinity list: 0-7
pid 11143's new affinity list: 1

$ while true; do true; done


- Now, in shell #2, as root:

# taskset -pc 0,1 $$
pid 22233's current affinity list: 0-7
pid 22233's new affinity list: 0,1

# meson test --gdb  --repeat=10000 service_autotest
...

 + ------------------------------------------------------- +
 + Test Suite Summary
 + Tests Total :       16
 + Tests Skipped :      3
 + Tests Executed :    16
 + Tests Unsupported:   0
 + Tests Passed :      13
 + Tests Failed :       0
 + ------------------------------------------------------- +

Test OK
RTE>>
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff4922700 (LWP 31194)]
rte_service_runner_func (arg=<optimized out>) at
../lib/librte_eal/common/rte_service.c:453
453            cs->loops++;
A debugging session is active.

    Inferior 1 [process 31187] will be killed.

Quit anyway? (y or n)


I get the crash in like 30s, often less.
In my test right now, I got the crash on the 3rd try.



-- 
David Marchand



* Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
  2020-02-10 14:16       ` Van Haaren, Harry
  2020-02-10 14:42         ` David Marchand
@ 2020-02-20 13:25         ` David Marchand
  2020-02-21 12:28           ` Van Haaren, Harry
  1 sibling, 1 reply; 10+ messages in thread
From: David Marchand @ 2020-02-20 13:25 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: Aaron Conole, dev

On Mon, Feb 10, 2020 at 3:16 PM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
> > > We need a fix for this issue.
> >
> > +1
>
> > > Interestingly, Stephen patch that joins all pthreads at
> > > rte_eal_cleanup [1] makes this issue disappear.
> > > So my understanding is that we are missing a api (well, I could not
> > > find a way) to synchronously stop service lcores.
> >
> > Maybe we can take that patch as a fix.  I hate to see this segfault
> > in the field.  I need to figure out what I missed in my cleanup
> > (probably missed a synchronization point).
>
> I haven't easily reproduced this yet - so I'll investigate a way to
> reproduce with close to 100% rate, then we can identify the root cause
> and actually get a clean fix. If you have pointers to reproduce easily,
> please let me know.
>

ping.
I want a fix in 20.05, or I will start considering how to drop this thing.


-- 
David Marchand



* Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
  2020-02-20 13:25         ` David Marchand
@ 2020-02-21 12:28           ` Van Haaren, Harry
  2020-03-10 13:04             ` David Marchand
  0 siblings, 1 reply; 10+ messages in thread
From: Van Haaren, Harry @ 2020-02-21 12:28 UTC (permalink / raw)
  To: David Marchand; +Cc: Aaron Conole, dev

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Thursday, February 20, 2020 1:25 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: Aaron Conole <aconole@redhat.com>; dev <dev@dpdk.org>
> Subject: Re: [RFC] service: stop lcore threads before 'finalize'
> 
> On Mon, Feb 10, 2020 at 3:16 PM Van Haaren, Harry
> <harry.van.haaren@intel.com> wrote:
> > > > We need a fix for this issue.
> > >
> > > +1
> >
> > > > Interestingly, Stephen patch that joins all pthreads at
> > > > rte_eal_cleanup [1] makes this issue disappear.
> > > > So my understanding is that we are missing a api (well, I could not
> > > > find a way) to synchronously stop service lcores.
> > >
> > > Maybe we can take that patch as a fix.  I hate to see this segfault
> > > in the field.  I need to figure out what I missed in my cleanup
> > > (probably missed a synchronization point).
> >
> > I haven't easily reproduced this yet - so I'll investigate a way to
> > reproduce with close to 100% rate, then we can identify the root cause
> > and actually get a clean fix. If you have pointers to reproduce easily,
> > please let me know.
> >
> 
> ping.
> I want a fix in 20.05, or I will start considering how to drop this thing.

Hi David,

I have been attempting to reproduce, unfortunately without success.

I attempted your suggested meson test approach (thanks for suggesting!), but
I haven't had a segfault with that approach (yet, and it's done a lot of iterations...).

I've made the service-cores unit tests delay before exit, in an attempt
to have them access previously rte_free()-ed memory, but had no luck reproducing.

Thinking perhaps we need it on exit, I've also POCed a unit test that leaves
service cores active on exit on purpose, to try to have them poll after exit;
still no luck.
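
For concreteness, that scenario boils down to something like the sketch below
(not the actual unit test; the service name and flow are made up, and it
assumes a spare lcore is available): register a dummy service, run it on a
service lcore, and return without stopping anything, so the lcore is still
polling when the exit path reaches rte_service_finalize().

#include <rte_common.h>
#include <rte_lcore.h>
#include <rte_service.h>
#include <rte_service_component.h>

static int32_t
dummy_service_cb(void *args)
{
	RTE_SET_USED(args);
	return 0;
}

/* Sketch: leave a service lcore actively polling at process exit. */
static int
leave_service_core_running(void)
{
	struct rte_service_spec spec = {
		.name = "dummy_service",
		.callback = dummy_service_cb,
	};
	uint32_t id;
	/* first worker lcore after the master (assumes one exists) */
	uint32_t lcore = rte_get_next_lcore(rte_get_master_lcore(), 1, 0);

	if (rte_service_component_register(&spec, &id) != 0)
		return -1;
	rte_service_component_runstate_set(id, 1);
	rte_service_runstate_set(id, 1);
	rte_service_lcore_add(lcore);
	rte_service_map_lcore_set(id, lcore, 1);
	rte_service_lcore_start(lcore);
	/* Intentionally no stop/del here: exercise the exit path. */
	return 0;
}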

Simplifying the problem and using the hello-world sample app with a rte_eal_cleanup()
call at the end also doesn't easily trigger the problem.

From code inspection, I agree there is an issue. It seems like a call to
rte_service_lcore_reset_all() from rte_service_finalize() is enough...
But without reproducing it, it is hard to have good confidence in a fix.
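
Roughly, that would look like the below (sketch only, untested; the rest of
the function unchanged):

void
rte_service_finalize(void)
{
	if (!rte_service_library_initialized)
		return;

	/* Return every service lcore to ROLE_RTE and stop its runner
	 * loop before the backing arrays are released. */
	rte_service_lcore_reset_all();

	rte_free(rte_services);
	rte_free(lcore_states);

	/* remainder of the function unchanged */
}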

If you have cycles to help, could you investigate whether the above reset_all()
call fixes it on your side? Otherwise I'll continue trying to reproduce reliably.

Regards, -HvH




* Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
  2020-02-21 12:28           ` Van Haaren, Harry
@ 2020-03-10 13:04             ` David Marchand
  2020-03-10 13:27               ` Van Haaren, Harry
  0 siblings, 1 reply; 10+ messages in thread
From: David Marchand @ 2020-03-10 13:04 UTC (permalink / raw)
  To: Van Haaren, Harry; +Cc: Aaron Conole, dev

On Fri, Feb 21, 2020 at 1:28 PM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
>
> > -----Original Message-----
> > From: David Marchand <david.marchand@redhat.com>
> > Sent: Thursday, February 20, 2020 1:25 PM
> > To: Van Haaren, Harry <harry.van.haaren@intel.com>
> > Cc: Aaron Conole <aconole@redhat.com>; dev <dev@dpdk.org>
> > Subject: Re: [RFC] service: stop lcore threads before 'finalize'
> >
> > On Mon, Feb 10, 2020 at 3:16 PM Van Haaren, Harry
> > <harry.van.haaren@intel.com> wrote:
> > > > > We need a fix for this issue.
> > > >
> > > > +1
> > >
> > > > > Interestingly, Stephen patch that joins all pthreads at
> > > > > rte_eal_cleanup [1] makes this issue disappear.
> > > > > So my understanding is that we are missing a api (well, I could not
> > > > > find a way) to synchronously stop service lcores.
> > > >
> > > > Maybe we can take that patch as a fix.  I hate to see this segfault
> > > > in the field.  I need to figure out what I missed in my cleanup
> > > > (probably missed a synchronization point).
> > >
> > > I haven't easily reproduced this yet - so I'll investigate a way to
> > > reproduce with close to 100% rate, then we can identify the root cause
> > > and actually get a clean fix. If you have pointers to reproduce easily,
> > > please let me know.
> > >
> >
> > ping.
> > I want a fix in 20.05, or I will start considering how to drop this thing.
>
> Hi David,
>
> I have been attempting to reproduce, unfortunately without success.
>
> Attempted you suggested meson test approach (thanks for suggesting!), but
> I haven't had a segfault with that approach (yet, and its done a lot of iterations..)

I reproduced it on the first try, just now.
Travis catches it every once in a while (look at the ovsrobot).

For the reproduction, this is on my laptop (core i7-8650U), baremetal,
no fancy stuff.
FWIW, the cores are ruled by the "powersave" governor.
I can see the frequency oscillates between 3.5GHz and 3.7GHz while the
max frequency is 4.2GHz.

Travis runs virtual machines with 2 cores, and there must be quite
some overprovisioning on those servers.
We can expect some cycles being stolen or at least something happening
on the various cores.


>
> I've made the service-cores unit tests delay before exit, in an attempt
> to have them access previously rte_free()-ed memory, no luck to reproduce.

Ok, let's forget about the segfault, what do you think of the
backtrace I caught?
A service lcore thread is still in the service loop.
The master thread of the application is in the libc exiting code.

This is what I get in all crashes.


>
> Thinking perhaps we need it on exit, I've also POCed a unit test that leaves
> service cores active on exit on purpose, to try have them poll after exit,
> still no luck.
>
> Simplifying the problem, and using hello-world sample app with a rte_eal_cleaup()
> call at the end also doesn't easily aggravate the problem.
>
> From code inspection, I agree there is an issue. It seems like a call to
> rte_service_lcore_reset_all() from rte_service_finalize() is enough...
> But without reproducing it is hard to have good confidence in a fix.

You promised a doc update on the services API.
Thanks.


--
David Marchand



* Re: [dpdk-dev] [RFC] service: stop lcore threads before 'finalize'
  2020-03-10 13:04             ` David Marchand
@ 2020-03-10 13:27               ` Van Haaren, Harry
  0 siblings, 0 replies; 10+ messages in thread
From: Van Haaren, Harry @ 2020-03-10 13:27 UTC (permalink / raw)
  To: David Marchand; +Cc: Aaron Conole, dev

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Tuesday, March 10, 2020 1:05 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>
> Cc: Aaron Conole <aconole@redhat.com>; dev <dev@dpdk.org>
> Subject: Re: [RFC] service: stop lcore threads before 'finalize'
> 
> On Fri, Feb 21, 2020 at 1:28 PM Van Haaren, Harry
> <harry.van.haaren@intel.com> wrote:
<snip>
> >
> > Hi David,
> >
> > I have been attempting to reproduce, unfortunately without success.
> >
> > Attempted you suggested meson test approach (thanks for suggesting!), but
> > I haven't had a segfault with that approach (yet, and its done a lot of
> iterations..)
> 
> I reproduced it on the first try, just now.
> Travis catches it every once in a while (look at the ovsrobot).
> 
> For the reproduction, this is on my laptop (core i7-8650U), baremetal,
> no fancy stuff.
> FWIW, the cores are ruled by the "powersave" governor.
> I can see the frequency oscillates between 3.5GHz and 3.7Ghz while the
> max frequency is 4.2GHz.
> 
> Travis runs virtual machines with 2 cores, and there must be quite
> some overprovisioning on those servers.
> We can expect some cycles being stolen or at least something happening
> on the various cores.
> 
> 
> >
> > I've made the service-cores unit tests delay before exit, in an attempt
> > to have them access previously rte_free()-ed memory, no luck to reproduce.
> 
> Ok, let's forget about the segfault, what do you think of the
> backtrace I caught?
> A service lcore thread is still in the service loop.
> The master thread of the application is in the libc exiting code.
> 
> This is what I get in all crashes.

Hi,

I was actually coding up the above as a patch to send to ML for testing.
I've tried to reproduce - it doesn't happen here. I don't like sending
patches for fixes that I haven't been able to reliably reproduce and fix
locally - but in this case I don't see any other option.

I'll post the fix patch to the mailing list ASAP; your and Aaron's
help in testing would be greatly appreciated.


> > Thinking perhaps we need it on exit, I've also POCed a unit test that
> leaves
> > service cores active on exit on purpose, to try have them poll after exit,
> > still no luck.
> >
> > Simplifying the problem, and using hello-world sample app with a
> rte_eal_cleaup()
> > call at the end also doesn't easily aggravate the problem.
> >
> > From code inspection, I agree there is an issue. It seems like a call to
> > rte_service_lcore_reset_all() from rte_service_finalize() is enough...
> > But without reproducing it is hard to have good confidence in a fix.
> 
> You promised a doc update on the services API.
> Thanks.

Yes, I heard there are some questions around what service cores are useful for.
Having reviewed the programmer guide and doxygen of the API, I'm not sure
what needs to change. Do you have specific questions you'd like to see
addressed here, or what do you feel needs to change?

https://doc.dpdk.org/guides/prog_guide/service_cores.html
http://doc.dpdk.org/api/rte__service_8h.html


Regards, -Harry
