DPDK patches and discussions
* [dpdk-dev] Random failure in service_autotest
@ 2020-07-15 10:14 David Marchand
  2020-07-15 10:41 ` Ferruh Yigit
  2020-07-15 12:56 ` Aaron Conole
  0 siblings, 2 replies; 23+ messages in thread
From: David Marchand @ 2020-07-15 10:14 UTC (permalink / raw)
  To: Van Haaren Harry, Lukasz Wojciechowski, Igor Romanov,
	Honnappa Nagarahalli, Phil Yang
  Cc: dev, Aaron Conole

Hello Harry and guys who touched the service code recently :-)

I spotted a failure for the service UT in Travis:
https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697

I found only a single instance of this failure and tried to reproduce
it with my usual "brute" active loop with no success so far.

Any chance it could be due to recent changes?
https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22



-- 
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-15 10:14 [dpdk-dev] Random failure in service_autotest David Marchand
@ 2020-07-15 10:41 ` Ferruh Yigit
  2020-07-17  8:56   ` David Marchand
  2020-07-15 12:56 ` Aaron Conole
  1 sibling, 1 reply; 23+ messages in thread
From: Ferruh Yigit @ 2020-07-15 10:41 UTC (permalink / raw)
  To: David Marchand, Van Haaren Harry, Lukasz Wojciechowski,
	Igor Romanov, Honnappa Nagarahalli, Phil Yang
  Cc: dev, Aaron Conole

On 7/15/2020 11:14 AM, David Marchand wrote:
> Hello Harry and guys who touched the service code recently :-)
> 
> I spotted a failure for the service UT in Travis:
> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> 
> I found only a single instance of this failure and tried to reproduce
> it with my usual "brute" active loop with no success so far.

+1, I wasn't able to reproduce it in my environment but observed it in the
Travis CI.

> 
> Any chance it could be due to recent changes?
> https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
> https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> 
> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-15 10:14 [dpdk-dev] Random failure in service_autotest David Marchand
  2020-07-15 10:41 ` Ferruh Yigit
@ 2020-07-15 12:56 ` Aaron Conole
  2020-07-15 13:02   ` David Marchand
  1 sibling, 1 reply; 23+ messages in thread
From: Aaron Conole @ 2020-07-15 12:56 UTC (permalink / raw)
  To: David Marchand
  Cc: Van Haaren Harry, Lukasz Wojciechowski, Igor Romanov,
	Honnappa Nagarahalli, Phil Yang, dev

David Marchand <david.marchand@redhat.com> writes:

> Hello Harry and guys who touched the service code recently :-)
>
> I spotted a failure for the service UT in Travis:
> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
>
> I found only a single instance of this failure and tried to reproduce
> it with my usual "brute" active loop with no success so far.
>
> Any chance it could be due to recent changes?
> https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
> https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22

I guess the service_lcore_attr_get test failed, but with no useful
information.  We should have the test suite app turn the log level all
the way up.  I realize it will increase the amount of log data even further,
but without it, we have no idea why this test failed.

Something like the following
---

diff --git a/app/test/test.c b/app/test/test.c
index 94d26ab1f6..c47cb075f9 100644
--- a/app/test/test.c
+++ b/app/test/test.c
@@ -150,6 +150,9 @@ main(int argc, char **argv)
 
        prgname = argv[0];
 
+       rte_log_set_global_level(RTE_LOG_DEBUG);
+       rte_log_set_level(RTE_LOGTYPE_EAL, RTE_LOG_DEBUG);
+
        recursive_call = getenv(RECURSIVE_ENV_VAR);
        if (recursive_call != NULL) {
                ret = do_recursive_call();
---

That way we can at least debug when it happens.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-15 12:56 ` Aaron Conole
@ 2020-07-15 13:02   ` David Marchand
  2020-07-15 13:09     ` Lukasz Wojciechowski
  0 siblings, 1 reply; 23+ messages in thread
From: David Marchand @ 2020-07-15 13:02 UTC (permalink / raw)
  To: Aaron Conole
  Cc: Van Haaren Harry, Lukasz Wojciechowski, Igor Romanov,
	Honnappa Nagarahalli, Phil Yang, dev

On Wed, Jul 15, 2020 at 2:56 PM Aaron Conole <aconole@redhat.com> wrote:
> I guess the service_lcore_attr_get failed, but with no useful
> information.  We should have the test suite app turn the log level all
> the way up.  I realize it will increase the log data even higher, but
> without it, we have no idea why this test failed.
>
> Something like the following
> ---
>
> diff --git a/app/test/test.c b/app/test/test.c
> index 94d26ab1f6..c47cb075f9 100644
> --- a/app/test/test.c
> +++ b/app/test/test.c
> @@ -150,6 +150,9 @@ main(int argc, char **argv)
>
>         prgname = argv[0];
>
> +       rte_log_set_global_level(RTE_LOG_DEBUG);
> +       rte_log_set_level(RTE_LOGTYPE_EAL, RTE_LOG_DEBUG);
> +
>         recursive_call = getenv(RECURSIVE_ENV_VAR);
>         if (recursive_call != NULL) {
>                 ret = do_recursive_call();
> ---
>
> That way we can at least debug when it happens.

Debugging is one thing.
But here service_lcore_attr_get() has a lot of asserts that should
trigger a straight error.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-15 13:02   ` David Marchand
@ 2020-07-15 13:09     ` Lukasz Wojciechowski
  2020-07-15 13:28       ` David Marchand
  0 siblings, 1 reply; 23+ messages in thread
From: Lukasz Wojciechowski @ 2020-07-15 13:09 UTC (permalink / raw)
  To: David Marchand, Aaron Conole
  Cc: Van Haaren Harry, Igor Romanov, Honnappa Nagarahalli, Phil Yang, dev


W dniu 15.07.2020 o 15:02, David Marchand pisze:
> On Wed, Jul 15, 2020 at 2:56 PM Aaron Conole <aconole@redhat.com> wrote:
>> I guess the service_lcore_attr_get failed, but with no useful
Why do you suspect service_lcore_attr_get() ?
>> information.  We should have the test suite app turn the log level all
>> the way up.  I realize it will increase the log data even higher, but
>> without it, we have no idea why this test failed.
>>
>> Something like the following
>> ---
>>
>> diff --git a/app/test/test.c b/app/test/test.c
>> index 94d26ab1f6..c47cb075f9 100644
>> --- a/app/test/test.c
>> +++ b/app/test/test.c
>> @@ -150,6 +150,9 @@ main(int argc, char **argv)
>>
>>          prgname = argv[0];
>>
>> +       rte_log_set_global_level(RTE_LOG_DEBUG);
>> +       rte_log_set_level(RTE_LOGTYPE_EAL, RTE_LOG_DEBUG);
>> +
>>          recursive_call = getenv(RECURSIVE_ENV_VAR);
>>          if (recursive_call != NULL) {
>>                  ret = do_recursive_call();
>> ---
>>
>> That way we can at least debug when it happens.
> Debugging is one thing.
> But here service_lcore_attr_get() has a lot of asserts that should
> trigger a straight error.
>
Yes, but without debug-level logs enabled, the assert message won't be printed out.
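
For reference, those assert macros log through the EAL logtype at DEBUG
level, which is why nothing shows up with the default log level. A rough,
paraphrased sketch (from memory, the exact definition in app/test/test.h may
differ slightly):

#include <rte_log.h>

#define TEST_ASSERT(cond, msg, ...) do {                              \
	if (!(cond)) {                                                \
		RTE_LOG(DEBUG, EAL, "Test assert %s line %d failed: " \
			msg "\n", __func__, __LINE__, ##__VA_ARGS__); \
		return -1;                                            \
	}                                                             \
} while (0)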

-- 
Lukasz Wojciechowski
Principal Software Engineer

Samsung R&D Institute Poland
Samsung Electronics
Office +48 22 377 88 25
l.wojciechow@partner.samsung.com


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-15 13:09     ` Lukasz Wojciechowski
@ 2020-07-15 13:28       ` David Marchand
  2020-07-15 13:39         ` Aaron Conole
  0 siblings, 1 reply; 23+ messages in thread
From: David Marchand @ 2020-07-15 13:28 UTC (permalink / raw)
  To: Lukasz Wojciechowski
  Cc: Aaron Conole, Van Haaren Harry, Igor Romanov,
	Honnappa Nagarahalli, Phil Yang, dev

On Wed, Jul 15, 2020 at 3:09 PM Lukasz Wojciechowski
<l.wojciechow@partner.samsung.com> wrote:
>
>
> W dniu 15.07.2020 o 15:02, David Marchand pisze:
> > On Wed, Jul 15, 2020 at 2:56 PM Aaron Conole <aconole@redhat.com> wrote:
> >> I guess the service_lcore_attr_get failed, but with no useful
> Why do you suspect service_lcore_attr_get() ?

https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697

RTE>>service_autotest
 + ------------------------------------------------------- +
 + Test Suite : service core test suite
 + ------------------------------------------------------- +
 + TestCase [ 0] : unregister_all succeeded
 + TestCase [ 1] : service_name succeeded
 + TestCase [ 2] : service_get_by_name succeeded
Service dummy_service Summary
  dummy_service: stats 1 calls 0 cycles 0 avg: 0
Service dummy_service Summary
  dummy_service: stats 0 calls 0 cycles 0 avg: 0
 + TestCase [ 3] : service_dump succeeded
 + TestCase [ 4] : service_attr_get succeeded

***
 + TestCase [ 5] : service_lcore_attr_get failed
***

 + TestCase [ 6] : service_probe_capability succeeded
 + TestCase [ 7] : service_start_stop succeeded
 + TestCase [ 8] : service_lcore_add_del skipped
 + TestCase [ 9] : service_lcore_start_stop succeeded
 + TestCase [10] : service_lcore_en_dis_able succeeded
 + TestCase [11] : service_mt_unsafe_poll skipped
 + TestCase [12] : service_mt_safe_poll skipped
perf test for MT Safe: 56.9 cycles per call
 + TestCase [13] : service_app_lcore_mt_safe succeeded
perf test for MT Unsafe: 83.4 cycles per call
 + TestCase [14] : service_app_lcore_mt_unsafe succeeded
 + TestCase [15] : service_may_be_active succeeded
 + TestCase [16] : service_active_two_cores skipped
 + ------------------------------------------------------- +


> >> information.  We should have the test suite app turn the log level all
> >> the way up.  I realize it will increase the log data even higher, but
> >> without it, we have no idea why this test failed.
> >>
> >> Something like the following
> >> ---
> >>
> >> diff --git a/app/test/test.c b/app/test/test.c
> >> index 94d26ab1f6..c47cb075f9 100644
> >> --- a/app/test/test.c
> >> +++ b/app/test/test.c
> >> @@ -150,6 +150,9 @@ main(int argc, char **argv)
> >>
> >>          prgname = argv[0];
> >>
> >> +       rte_log_set_global_level(RTE_LOG_DEBUG);
> >> +       rte_log_set_level(RTE_LOGTYPE_EAL, RTE_LOG_DEBUG);
> >> +
> >>          recursive_call = getenv(RECURSIVE_ENV_VAR);
> >>          if (recursive_call != NULL) {
> >>                  ret = do_recursive_call();
> >> ---
> >>
> >> That way we can at least debug when it happens.
> > Debugging is one thing.
> > But here service_lcore_attr_get() has a lot of asserts that should
> > trigger a straight error.
> >
> Yes, but without debugs enabled, the assert message won't be printed out.

Changing the whole debug levels could have side effects on the
libraries being tested: maybe hide races (too bad) or reveal races
(that would be interesting ;-)).

On the other hand, what I am saying is that using debug level logs for
test asserts might not be the best solution.

-- 
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-15 13:28       ` David Marchand
@ 2020-07-15 13:39         ` Aaron Conole
  2020-07-15 20:26           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 23+ messages in thread
From: Aaron Conole @ 2020-07-15 13:39 UTC (permalink / raw)
  To: David Marchand
  Cc: Lukasz Wojciechowski, Van Haaren Harry, Igor Romanov,
	Honnappa Nagarahalli, Phil Yang, dev

David Marchand <david.marchand@redhat.com> writes:

> On Wed, Jul 15, 2020 at 3:09 PM Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com> wrote:
>>
>>
>> W dniu 15.07.2020 o 15:02, David Marchand pisze:
>> > On Wed, Jul 15, 2020 at 2:56 PM Aaron Conole <aconole@redhat.com> wrote:
>> >> I guess the service_lcore_attr_get failed, but with no useful
>> Why do you suspect service_lcore_attr_get() ?
>
> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
>
> RTE>>service_autotest
>  + ------------------------------------------------------- +
>  + Test Suite : service core test suite
>  + ------------------------------------------------------- +
>  + TestCase [ 0] : unregister_all succeeded
>  + TestCase [ 1] : service_name succeeded
>  + TestCase [ 2] : service_get_by_name succeeded
> Service dummy_service Summary
>   dummy_service: stats 1 calls 0 cycles 0 avg: 0
> Service dummy_service Summary
>   dummy_service: stats 0 calls 0 cycles 0 avg: 0
>  + TestCase [ 3] : service_dump succeeded
>  + TestCase [ 4] : service_attr_get succeeded
>
> ***
>  + TestCase [ 5] : service_lcore_attr_get failed
> ***
>
>  + TestCase [ 6] : service_probe_capability succeeded
>  + TestCase [ 7] : service_start_stop succeeded
>  + TestCase [ 8] : service_lcore_add_del skipped
>  + TestCase [ 9] : service_lcore_start_stop succeeded
>  + TestCase [10] : service_lcore_en_dis_able succeeded
>  + TestCase [11] : service_mt_unsafe_poll skipped
>  + TestCase [12] : service_mt_safe_poll skipped
> perf test for MT Safe: 56.9 cycles per call
>  + TestCase [13] : service_app_lcore_mt_safe succeeded
> perf test for MT Unsafe: 83.4 cycles per call
>  + TestCase [14] : service_app_lcore_mt_unsafe succeeded
>  + TestCase [15] : service_may_be_active succeeded
>  + TestCase [16] : service_active_two_cores skipped
>  + ------------------------------------------------------- +
>
>
>> >> information.  We should have the test suite app turn the log level all
>> >> the way up.  I realize it will increase the log data even higher, but
>> >> without it, we have no idea why this test failed.
>> >>
>> >> Something like the following
>> >> ---
>> >>
>> >> diff --git a/app/test/test.c b/app/test/test.c
>> >> index 94d26ab1f6..c47cb075f9 100644
>> >> --- a/app/test/test.c
>> >> +++ b/app/test/test.c
>> >> @@ -150,6 +150,9 @@ main(int argc, char **argv)
>> >>
>> >>          prgname = argv[0];
>> >>
>> >> +       rte_log_set_global_level(RTE_LOG_DEBUG);
>> >> +       rte_log_set_level(RTE_LOGTYPE_EAL, RTE_LOG_DEBUG);
>> >> +
>> >>          recursive_call = getenv(RECURSIVE_ENV_VAR);
>> >>          if (recursive_call != NULL) {
>> >>                  ret = do_recursive_call();
>> >> ---
>> >>
>> >> That way we can at least debug when it happens.
>> > Debugging is one thing.
>> > But here service_lcore_attr_get() has a lot of asserts that should
>> > trigger a straight error.
>> >
>> Yes, but without debugs enabled, the assert message won't be printed out.
>
> Changing the whole debug levels could have side effects on the
> libraries being tested: maybe hide races (too bad) or reveal races
> (that would be interesting ;-)).

6 of one, 1/2 dozen of the other.

Maybe there are bugs where people run code inside log-level checks, and that
code gets omitted otherwise.

> On the other hand, what I am saying is that using debug level logs for
> test asserts might not be the best solution.

Yeah, that's also a good idea :)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-15 13:39         ` Aaron Conole
@ 2020-07-15 20:26           ` Honnappa Nagarahalli
  0 siblings, 0 replies; 23+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-15 20:26 UTC (permalink / raw)
  To: Aaron Conole, David Marchand
  Cc: Lukasz Wojciechowski, Van Haaren Harry, Igor Romanov, Phil Yang,
	dev, nd, Honnappa Nagarahalli, nd

<snip>

> Subject: Re: Random failure in service_autotest
> 
> David Marchand <david.marchand@redhat.com> writes:
> 
> > On Wed, Jul 15, 2020 at 3:09 PM Lukasz Wojciechowski
> > <l.wojciechow@partner.samsung.com> wrote:
> >>
> >>
> >> W dniu 15.07.2020 o 15:02, David Marchand pisze:
> >> > On Wed, Jul 15, 2020 at 2:56 PM Aaron Conole <aconole@redhat.com>
> wrote:
> >> >> I guess the service_lcore_attr_get failed, but with no useful
> >> Why do you suspect service_lcore_attr_get() ?
> >
> > https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> >
> > RTE>>service_autotest
> >  + ------------------------------------------------------- +
> >  + Test Suite : service core test suite
> >  + ------------------------------------------------------- +
> >  + TestCase [ 0] : unregister_all succeeded
> >  + TestCase [ 1] : service_name succeeded
> >  + TestCase [ 2] : service_get_by_name succeeded
> > Service dummy_service Summary
> >   dummy_service: stats 1 calls 0 cycles 0 avg: 0
> > Service dummy_service Summary
> >   dummy_service: stats 0 calls 0 cycles 0 avg: 0
> >  + TestCase [ 3] : service_dump succeeded
> >  + TestCase [ 4] : service_attr_get succeeded
> >
> > ***
> >  + TestCase [ 5] : service_lcore_attr_get failed
> > ***
> >
> >  + TestCase [ 6] : service_probe_capability succeeded
> >  + TestCase [ 7] : service_start_stop succeeded
> >  + TestCase [ 8] : service_lcore_add_del skipped
> >  + TestCase [ 9] : service_lcore_start_stop succeeded
> >  + TestCase [10] : service_lcore_en_dis_able succeeded
> >  + TestCase [11] : service_mt_unsafe_poll skipped
> >  + TestCase [12] : service_mt_safe_poll skipped
> > perf test for MT Safe: 56.9 cycles per call
> >  + TestCase [13] : service_app_lcore_mt_safe succeeded
> > perf test for MT Unsafe: 83.4 cycles per call
> >  + TestCase [14] : service_app_lcore_mt_unsafe succeeded
> >  + TestCase [15] : service_may_be_active succeeded
> >  + TestCase [16] : service_active_two_cores skipped
> >  + ------------------------------------------------------- +
> >
> >
> >> >> information.  We should have the test suite app turn the log level
> >> >> all the way up.  I realize it will increase the log data even
> >> >> higher, but without it, we have no idea why this test failed.
> >> >>
> >> >> Something like the following
> >> >> ---
> >> >>
> >> >> diff --git a/app/test/test.c b/app/test/test.c index
> >> >> 94d26ab1f6..c47cb075f9 100644
> >> >> --- a/app/test/test.c
> >> >> +++ b/app/test/test.c
> >> >> @@ -150,6 +150,9 @@ main(int argc, char **argv)
> >> >>
> >> >>          prgname = argv[0];
> >> >>
> >> >> +       rte_log_set_global_level(RTE_LOG_DEBUG);
> >> >> +       rte_log_set_level(RTE_LOGTYPE_EAL, RTE_LOG_DEBUG);
> >> >> +
> >> >>          recursive_call = getenv(RECURSIVE_ENV_VAR);
> >> >>          if (recursive_call != NULL) {
> >> >>                  ret = do_recursive_call();
> >> >> ---
> >> >>
> >> >> That way we can at least debug when it happens.
> >> > Debugging is one thing.
> >> > But here service_lcore_attr_get() has a lot of asserts that should
> >> > trigger a straight error.
> >> >
> >> Yes, but without debugs enabled, the assert message won't be printed out.
> >
> > Changing the whole debug levels could have side effects on the
> > libraries being tested: maybe hide races (too bad) or reveal races
> > (that would be interesting ;-)).
> 
> 6 of one, 1/2 dozen of the other.
> 
> Maybe there are bugs where people run code inside log level tests that get
> omitted otherwise.
> 
> > On the other hand, what I am saying is that using debug level logs for
> > test asserts might not be the best solution.
Agree, at least error level is required.
Please check: https://patches.dpdk.org/patch/74134/
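
Roughly speaking, the idea is to have the assert macros log at a level that
is visible by default, along the lines of the sketch below (this is only the
gist, see the patch itself for the actual change):

-		RTE_LOG(DEBUG, EAL, "Test assert %s line %d failed: "
+		RTE_LOG(ERR, EAL, "Test assert %s line %d failed: "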

> 
> Yeah, that's also a good idea :)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-15 10:41 ` Ferruh Yigit
@ 2020-07-17  8:56   ` David Marchand
  2020-07-17 15:19     ` David Marchand
  0 siblings, 1 reply; 23+ messages in thread
From: David Marchand @ 2020-07-17  8:56 UTC (permalink / raw)
  To: Van Haaren Harry
  Cc: Lukasz Wojciechowski, Igor Romanov, Honnappa Nagarahalli,
	Phil Yang, dev, Aaron Conole, Ferruh Yigit

On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
>
> On 7/15/2020 11:14 AM, David Marchand wrote:
> > Hello Harry and guys who touched the service code recently :-)
> >
> > I spotted a failure for the service UT in Travis:
> > https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> >
> > I found only a single instance of this failure and tried to reproduce
> > it with my usual "brute" active loop with no success so far.
>
> +1, I didn't able to reproduce it in my environment but observed it in the
> Travis CI.
>
> >
> > Any chance it could be due to recent changes?
> > https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
> > https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22

I can see more occurrences of the issue in the CI.
I just applied the patch changing the log level for test assert, in
the hope it will help.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-17  8:56   ` David Marchand
@ 2020-07-17 15:19     ` David Marchand
  2020-07-17 20:31       ` Lukasz Wojciechowski
  0 siblings, 1 reply; 23+ messages in thread
From: David Marchand @ 2020-07-17 15:19 UTC (permalink / raw)
  To: Van Haaren Harry, Lukasz Wojciechowski
  Cc: Igor Romanov, Honnappa Nagarahalli, Phil Yang, dev, Aaron Conole,
	Ferruh Yigit

On Fri, Jul 17, 2020 at 10:56 AM David Marchand
<david.marchand@redhat.com> wrote:
>
> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
> >
> > On 7/15/2020 11:14 AM, David Marchand wrote:
> > > Hello Harry and guys who touched the service code recently :-)
> > >
> > > I spotted a failure for the service UT in Travis:
> > > https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> > >
> > > I found only a single instance of this failure and tried to reproduce
> > > it with my usual "brute" active loop with no success so far.
> >
> > +1, I didn't able to reproduce it in my environment but observed it in the
> > Travis CI.
> >
> > >
> > > Any chance it could be due to recent changes?
> > > https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
> > > https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22
>
> I can see more occurrences of the issue in the CI.
> I just applied the patch changing the log level for test assert, in
> the hope it will help.

And... we just got one with logs:
https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948

EAL: Test assert service_lcore_attr_get line 396 failed:
lcore_attr_get() didn't get correct loop count (zero)

It looks like a race between the service core still running and the
core resetting the loops attr.
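
To make the suspected interleaving concrete, here is a minimal standalone
program (plain pthreads, not DPDK code) showing the same class of race: the
worker can still perform one increment after the stop flag is set and the
counter has been reset, and the window is tiny, which would also explain why
this is so hard to reproduce:

/* worker plays the role of service_runner_func() incrementing cs->loops,
 * main plays the role of the test doing lcore_stop() + attr_reset_all() */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int run = 1;   /* analogue of cs->runstate */
static atomic_long loops;    /* analogue of cs->loops */

static void *worker(void *arg)
{
	(void)arg;
	while (atomic_load_explicit(&run, memory_order_acquire))
		atomic_fetch_add(&loops, 1);    /* cs->loops++ */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	atomic_store_explicit(&run, 0, memory_order_release); /* "stop" */
	atomic_store(&loops, 0);                               /* "reset_all" */
	pthread_join(t, NULL);
	/* if the worker was between its run check and the increment when the
	 * reset happened, loops is 1 here instead of 0 */
	printf("loops after reset: %ld\n", (long)atomic_load(&loops));
	return 0;
}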


-- 
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-17 15:19     ` David Marchand
@ 2020-07-17 20:31       ` Lukasz Wojciechowski
  2020-07-17 22:38         ` Aaron Conole
  0 siblings, 1 reply; 23+ messages in thread
From: Lukasz Wojciechowski @ 2020-07-17 20:31 UTC (permalink / raw)
  To: David Marchand, Van Haaren Harry
  Cc: Igor Romanov, Honnappa Nagarahalli, Phil Yang, dev, Aaron Conole,
	Ferruh Yigit


W dniu 17.07.2020 o 17:19, David Marchand pisze:
> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> <david.marchand@redhat.com> wrote:
>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
>>> On 7/15/2020 11:14 AM, David Marchand wrote:
>>>> Hello Harry and guys who touched the service code recently :-)
>>>>
>>>> I spotted a failure for the service UT in Travis:
>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
>>>>
>>>> I found only a single instance of this failure and tried to reproduce
>>>> it with my usual "brute" active loop with no success so far.
>>> +1, I didn't able to reproduce it in my environment but observed it in the
>>> Travis CI.
>>>
>>>> Any chance it could be due to recent changes?
>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-0cc47a31ce4e-231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcommit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-0cc47a31ce4e-d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcommit%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
>> I can see more occurrences of the issue in the CI.
>> I just applied the patch changing the log level for test assert, in
>> the hope it will help.
> And... we just got one with logs:
> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
>
> EAL: Test assert service_lcore_attr_get line 396 failed:
> lcore_attr_get() didn't get correct loop count (zero)
>
> It looks like a race between the service core still running and the
> core resetting the loops attr.
>
Yes, it seems to be just a lack of patience in the test. It should wait a 
bit for the lcore to stop before resetting the attrs.
Something like this should help:
@@ -384,6 +384,9 @@ service_lcore_attr_get(void)

         rte_service_lcore_stop(slcore_id);

+       /* wait for the service lcore to stop */
+       rte_delay_ms(200);
+
         TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
                           "Valid lcore_attr_reset_all() didn't return success");

-- 
Lukasz Wojciechowski
Principal Software Engineer

Samsung R&D Institute Poland
Samsung Electronics
Office +48 22 377 88 25
l.wojciechow@partner.samsung.com


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-17 20:31       ` Lukasz Wojciechowski
@ 2020-07-17 22:38         ` Aaron Conole
  2020-07-17 22:43           ` Honnappa Nagarahalli
  2020-07-18  8:34           ` Phil Yang
  0 siblings, 2 replies; 23+ messages in thread
From: Aaron Conole @ 2020-07-17 22:38 UTC (permalink / raw)
  To: Lukasz Wojciechowski
  Cc: David Marchand, Van Haaren Harry, Igor Romanov,
	Honnappa Nagarahalli, Phil Yang, dev, Ferruh Yigit

Lukasz Wojciechowski <l.wojciechow@partner.samsung.com> writes:

> W dniu 17.07.2020 o 17:19, David Marchand pisze:
>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
>> <david.marchand@redhat.com> wrote:
>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
>>>> On 7/15/2020 11:14 AM, David Marchand wrote:
>>>>> Hello Harry and guys who touched the service code recently :-)
>>>>>
>>>>> I spotted a failure for the service UT in Travis:
>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
>>>>>
>>>>> I found only a single instance of this failure and tried to reproduce
>>>>> it with my usual "brute" active loop with no success so far.
>>>> +1, I didn't able to reproduce it in my environment but observed it in the
>>>> Travis CI.
>>>>
>>>>> Any chance it could be due to recent changes?
>>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-0cc47a31ce4e-231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcommit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
>>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-0cc47a31ce4e-d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcommit%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
>>> I can see more occurrences of the issue in the CI.
>>> I just applied the patch changing the log level for test assert, in
>>> the hope it will help.
>> And... we just got one with logs:
>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
>>
>> EAL: Test assert service_lcore_attr_get line 396 failed:
>> lcore_attr_get() didn't get correct loop count (zero)
>>
>> It looks like a race between the service core still running and the
>> core resetting the loops attr.
>>
> Yes, it seems to be just lack of patience of the test. It should wait a 
> bit for lcore to stop before resetting attrs.
> Something like this should help:
> @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
>
>          rte_service_lcore_stop(slcore_id);
>
> +       /* wait for the service lcore to stop */
> +       rte_delay_ms(200);
> +
>          TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
>                            "Valid lcore_attr_reset_all() didn't return 
> success");

Would an rte_eal_wait_lcore make sense?  Overall, I really dislike
sleeps because they can hide racy synchronization points.
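
i.e. something like the fragment below (an untested sketch, assuming the
service lcore actually returns from its runner function once the stop takes
effect):

	rte_service_lcore_stop(slcore_id);
	/* join the lcore instead of sleeping: rte_eal_wait_lcore() blocks
	 * until the function launched on that lcore has returned */
	rte_eal_wait_lcore(slcore_id);

	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
			  "Valid lcore_attr_reset_all() didn't return success");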


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-17 22:38         ` Aaron Conole
@ 2020-07-17 22:43           ` Honnappa Nagarahalli
  2020-07-18  8:34           ` Phil Yang
  1 sibling, 0 replies; 23+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-17 22:43 UTC (permalink / raw)
  To: Aaron Conole, Lukasz Wojciechowski
  Cc: David Marchand, Van Haaren Harry, Igor Romanov, Phil Yang, dev,
	Ferruh Yigit, nd, Honnappa Nagarahalli, nd

<snip>

> Subject: Re: [dpdk-dev] Random failure in service_autotest
> 
> Lukasz Wojciechowski <l.wojciechow@partner.samsung.com> writes:
> 
> > W dniu 17.07.2020 o 17:19, David Marchand pisze:
> >> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> >> <david.marchand@redhat.com> wrote:
> >>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit@intel.com>
> wrote:
> >>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> >>>>> Hello Harry and guys who touched the service code recently :-)
> >>>>>
> >>>>> I spotted a failure for the service UT in Travis:
> >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> >>>>>
> >>>>> I found only a single instance of this failure and tried to
> >>>>> reproduce it with my usual "brute" active loop with no success so far.
> >>>> +1, I didn't able to reproduce it in my environment but observed it
> >>>> +in the
> >>>> Travis CI.
> >>>>
> >>>>> Any chance it could be due to recent changes?
> >>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-0cc4
> >>>>> 7a31ce4e-
> 231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%
> >>>>> 2Fcommit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
> >>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-0cc4
> >>>>> 7a31ce4e-
> d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%
> >>>>> 2Fcommit%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> >>> I can see more occurrences of the issue in the CI.
> >>> I just applied the patch changing the log level for test assert, in
> >>> the hope it will help.
> >> And... we just got one with logs:
> >> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> >>
> >> EAL: Test assert service_lcore_attr_get line 396 failed:
> >> lcore_attr_get() didn't get correct loop count (zero)
> >>
> >> It looks like a race between the service core still running and the
> >> core resetting the loops attr.
> >>
> > Yes, it seems to be just lack of patience of the test. It should wait
> > a bit for lcore to stop before resetting attrs.
> > Something like this should help:
> > @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> >
> >          rte_service_lcore_stop(slcore_id);
> >
> > +       /* wait for the service lcore to stop */
> > +       rte_delay_ms(200);
> > +
> >          TEST_ASSERT_EQUAL(0,
> > rte_service_lcore_attr_reset_all(slcore_id),
> >                            "Valid lcore_attr_reset_all() didn't return
> > success");
> 
> Would an rte_eal_wait_lcore make sense?  Overall, I really dislike sleeps
> because they can hide racy synchronization points.
I think something like below might be better.

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ef1d8fcb9..f0bedbe5e 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -384,6 +384,16 @@ service_lcore_attr_get(void)

        rte_service_lcore_stop(slcore_id);

+       /* give the service 200ms to stop running */
+       for (i = 0; i < 200; i++) {
+               if (!rte_service_may_be_active(sid))
+                       break;
+               rte_delay_ms(SERVICE_DELAY);
+       }
+
+       TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
+                         "Error: Service not stopped after 200ms");
+
        TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
                          "Valid lcore_attr_reset_all() didn't return success");



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-17 22:38         ` Aaron Conole
  2020-07-17 22:43           ` Honnappa Nagarahalli
@ 2020-07-18  8:34           ` Phil Yang
  2020-07-20 12:09             ` Van Haaren, Harry
  1 sibling, 1 reply; 23+ messages in thread
From: Phil Yang @ 2020-07-18  8:34 UTC (permalink / raw)
  To: Aaron Conole, Lukasz Wojciechowski
  Cc: David Marchand, Van Haaren Harry, Igor Romanov,
	Honnappa Nagarahalli, dev, Ferruh Yigit, nd, nd

> -----Original Message-----
> From: Aaron Conole <aconole@redhat.com>
> Sent: Saturday, July 18, 2020 6:39 AM
> To: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
> Cc: David Marchand <david.marchand@redhat.com>; Van Haaren Harry
> <harry.van.haaren@intel.com>; Igor Romanov
> <igor.romanov@oktetlabs.ru>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Phil Yang <Phil.Yang@arm.com>; dev
> <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>
> Subject: Re: [dpdk-dev] Random failure in service_autotest
> 
> Lukasz Wojciechowski <l.wojciechow@partner.samsung.com> writes:
> 
> > W dniu 17.07.2020 o 17:19, David Marchand pisze:
> >> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> >> <david.marchand@redhat.com> wrote:
> >>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit@intel.com>
> wrote:
> >>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> >>>>> Hello Harry and guys who touched the service code recently :-)
> >>>>>
> >>>>> I spotted a failure for the service UT in Travis:
> >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> >>>>>
> >>>>> I found only a single instance of this failure and tried to reproduce
> >>>>> it with my usual "brute" active loop with no success so far.
> >>>> +1, I didn't able to reproduce it in my environment but observed it in
> the
> >>>> Travis CI.
> >>>>
> >>>>> Any chance it could be due to recent changes?
> >>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-
> 0cc47a31ce4e-
> 231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcom
> mit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
> >>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-
> 0cc47a31ce4e-
> d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcomm
> it%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> >>> I can see more occurrences of the issue in the CI.
> >>> I just applied the patch changing the log level for test assert, in
> >>> the hope it will help.
> >> And... we just got one with logs:
> >> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> >>
> >> EAL: Test assert service_lcore_attr_get line 396 failed:
> >> lcore_attr_get() didn't get correct loop count (zero)
> >>
> >> It looks like a race between the service core still running and the
> >> core resetting the loops attr.
> >>
> > Yes, it seems to be just lack of patience of the test. It should wait a
> > bit for lcore to stop before resetting attrs.
> > Something like this should help:
> > @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> >
> >          rte_service_lcore_stop(slcore_id);
> >
> > +       /* wait for the service lcore to stop */
> > +       rte_delay_ms(200);
> > +
> >          TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
> >                            "Valid lcore_attr_reset_all() didn't return
> > success");
> 
> Would an rte_eal_wait_lcore make sense?  Overall, I really dislike
> sleeps because they can hide racy synchronization points.


The rte_service_lcore_stop() operation changes the status to RUNSTATE_STOPPED.
However, it does not terminate the service_run() procedure (there is a spinlock in the service_run() MT_UNSAFE path).
So 'cs->loops' might increase after calling rte_service_lcore_attr_reset_all(), which leads to this failure.
I think that moving the loop counter update before the service_run() call avoids this conflict.

I cannot reproduce this issue on my testbed, so I am not sure whether something like this helps or not.
Please check the code below.

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6a0e0ff..7b703dd 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -464,6 +464,7 @@ service_runner_func(void *arg)
         while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
                         RUNSTATE_RUNNING) {
                 const uint64_t service_mask = cs->service_mask;
+                cs->loops++;
 
                 for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
                         if (!service_valid(i))
@@ -471,8 +472,6 @@ service_runner_func(void *arg)
                         /* return value ignored as no change to code flow */
                         service_run(i, cs, service_mask, service_get(i), 1);
                 }
-
-                cs->loops++;
         }
 
         return 0;


Thanks,
Phil




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-18  8:34           ` Phil Yang
@ 2020-07-20 12:09             ` Van Haaren, Harry
  2020-07-20 12:47               ` Lukasz Wojciechowski
  0 siblings, 1 reply; 23+ messages in thread
From: Van Haaren, Harry @ 2020-07-20 12:09 UTC (permalink / raw)
  To: Phil Yang, Aaron Conole, Lukasz Wojciechowski
  Cc: David Marchand, Igor Romanov, Honnappa Nagarahalli, dev, Yigit,
	Ferruh, nd, nd

> -----Original Message-----
> From: Phil Yang <Phil.Yang@arm.com>
> Sent: Saturday, July 18, 2020 9:34 AM
> To: Aaron Conole <aconole@redhat.com>; Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com>
> Cc: David Marchand <david.marchand@redhat.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Igor Romanov <igor.romanov@oktetlabs.ru>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev
> <dev@dpdk.org>; Yigit, Ferruh <ferruh.yigit@intel.com>; nd <nd@arm.com>; nd
> <nd@arm.com>
> Subject: RE: [dpdk-dev] Random failure in service_autotest
> 
> > -----Original Message-----
> > From: Aaron Conole <aconole@redhat.com>
> > Sent: Saturday, July 18, 2020 6:39 AM
> > To: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
> > Cc: David Marchand <david.marchand@redhat.com>; Van Haaren Harry
> > <harry.van.haaren@intel.com>; Igor Romanov
> > <igor.romanov@oktetlabs.ru>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Phil Yang <Phil.Yang@arm.com>; dev
> > <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>
> > Subject: Re: [dpdk-dev] Random failure in service_autotest
> >
> > Lukasz Wojciechowski <l.wojciechow@partner.samsung.com> writes:
> >
> > > W dniu 17.07.2020 o 17:19, David Marchand pisze:
> > >> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> > >> <david.marchand@redhat.com> wrote:
> > >>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit@intel.com>
> > wrote:
> > >>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> > >>>>> Hello Harry and guys who touched the service code recently :-)
> > >>>>>
> > >>>>> I spotted a failure for the service UT in Travis:
> > >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> > >>>>>
> > >>>>> I found only a single instance of this failure and tried to reproduce
> > >>>>> it with my usual "brute" active loop with no success so far.
> > >>>> +1, I didn't able to reproduce it in my environment but observed it in
> > the
> > >>>> Travis CI.
> > >>>>
> > >>>>> Any chance it could be due to recent changes?
> > >>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-
> > 0cc47a31ce4e-
> > 231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcom
> > mit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
> > >>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-
> > 0cc47a31ce4e-
> > d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcomm
> > it%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> > >>> I can see more occurrences of the issue in the CI.
> > >>> I just applied the patch changing the log level for test assert, in
> > >>> the hope it will help.
> > >> And... we just got one with logs:
> > >> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> > >>
> > >> EAL: Test assert service_lcore_attr_get line 396 failed:
> > >> lcore_attr_get() didn't get correct loop count (zero)
> > >>
> > >> It looks like a race between the service core still running and the
> > >> core resetting the loops attr.
> > >>
> > > Yes, it seems to be just lack of patience of the test. It should wait a
> > > bit for lcore to stop before resetting attrs.
> > > Something like this should help:
> > > @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> > >
> > >          rte_service_lcore_stop(slcore_id);
> > >
> > > +       /* wait for the service lcore to stop */
> > > +       rte_delay_ms(200);
> > > +
> > >          TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
> > >                            "Valid lcore_attr_reset_all() didn't return
> > > success");
> >
> > Would an rte_eal_wait_lcore make sense?  Overall, I really dislike
> > sleeps because they can hide racy synchronization points.
> 
> 
> The rte_service_lcore_stop() operation changes the status to RUNSTATE_STOPED.
> However, it will not terminate the service_run() procedure (there is a spinlock in
> service_run() MT_UNSAFE path).
> So the 'cs->loop' might increase after calling rte_service_lcore_attr_reset_all(),
> which leads to this failure.
> I think if we move the loop counter update operation before the service_run()
> procedure, it can avoid this conflict.
> 
> I cannot reproduce this issue on my testbed, so not sure something like this can
> help or not.
> Please check the code below.
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 6a0e0ff..7b703dd 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -464,6 +464,7 @@ service_runner_func(void *arg)
>          while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
>                          RUNSTATE_RUNNING) {
>                  const uint64_t service_mask = cs->service_mask;
> +                cs->loops++;
> 
>                  for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
>                          if (!service_valid(i))
> @@ -471,8 +472,6 @@ service_runner_func(void *arg)
>                          /* return value ignored as no change to code flow */
>                          service_run(i, cs, service_mask, service_get(i), 1);
>                  }
> -
> -                cs->loops++;
>          }
> 
>          return 0;

Changing the implementation loop counting is one option - but changing the
implementation as a workaround for a unit test seems like the wrong way around.

Honnappa's suggestion in the other reply seems to not work here, as the "service_may_be_active"
won't get updated if the service core is stopped after running its final iteration..?

I'd prefer to see the implementation actually improve, or a workaround in the test code itself.
For improving the implementation, I can suggest this patch: http://patches.dpdk.org/patch/74493/


That leaves us with 3 options:
1) Add a sleep, allowing the service core to finish its delay() as part of the service it's running, and give it some grace time to finish
2) Relax the check in the unit test to allow for value <= (expected + 1), which would ignore the last cs->loops++ call.
    (Another way to think about option 2 is that it allows for the non-determinism of the threading in the test.)
3) Use the linked patch to implement/add/fix a wait time in the stop() API

Strong opinions either way?

Regards, -Harry

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-20 12:09             ` Van Haaren, Harry
@ 2020-07-20 12:47               ` Lukasz Wojciechowski
  2020-07-21  5:39                 ` Honnappa Nagarahalli
  0 siblings, 1 reply; 23+ messages in thread
From: Lukasz Wojciechowski @ 2020-07-20 12:47 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, Aaron Conole
  Cc: David Marchand, Igor Romanov, Honnappa Nagarahalli, dev, Yigit,
	Ferruh, nd


W dniu 20.07.2020 o 14:09, Van Haaren, Harry pisze:
>> -----Original Message-----
>> From: Phil Yang <Phil.Yang@arm.com>
>> Sent: Saturday, July 18, 2020 9:34 AM
>> To: Aaron Conole <aconole@redhat.com>; Lukasz Wojciechowski
>> <l.wojciechow@partner.samsung.com>
>> Cc: David Marchand <david.marchand@redhat.com>; Van Haaren, Harry
>> <harry.van.haaren@intel.com>; Igor Romanov <igor.romanov@oktetlabs.ru>;
>> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev
>> <dev@dpdk.org>; Yigit, Ferruh <ferruh.yigit@intel.com>; nd <nd@arm.com>; nd
>> <nd@arm.com>
>> Subject: RE: [dpdk-dev] Random failure in service_autotest
>>
>>> -----Original Message-----
>>> From: Aaron Conole <aconole@redhat.com>
>>> Sent: Saturday, July 18, 2020 6:39 AM
>>> To: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
>>> Cc: David Marchand <david.marchand@redhat.com>; Van Haaren Harry
>>> <harry.van.haaren@intel.com>; Igor Romanov
>>> <igor.romanov@oktetlabs.ru>; Honnappa Nagarahalli
>>> <Honnappa.Nagarahalli@arm.com>; Phil Yang <Phil.Yang@arm.com>; dev
>>> <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>
>>> Subject: Re: [dpdk-dev] Random failure in service_autotest
>>>
>>> Lukasz Wojciechowski <l.wojciechow@partner.samsung.com> writes:
>>>
>>>> W dniu 17.07.2020 o 17:19, David Marchand pisze:
>>>>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
>>>>> <david.marchand@redhat.com> wrote:
>>>>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit@intel.com>
>>> wrote:
>>>>>>> On 7/15/2020 11:14 AM, David Marchand wrote:
>>>>>>>> Hello Harry and guys who touched the service code recently :-)
>>>>>>>>
>>>>>>>> I spotted a failure for the service UT in Travis:
>>>>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
>>>>>>>>
>>>>>>>> I found only a single instance of this failure and tried to reproduce
>>>>>>>> it with my usual "brute" active loop with no success so far.
>>>>>>> +1, I didn't able to reproduce it in my environment but observed it in
>>> the
>>>>>>> Travis CI.
>>>>>>>
>>>>>>>> Any chance it could be due to recent changes?
>>>>>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-
>>> 0cc47a31ce4e-
>>> 231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcom
>>> mit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
>>>>>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-
>>> 0cc47a31ce4e-
>>> d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcomm
>>> it%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
>>>>>> I can see more occurrences of the issue in the CI.
>>>>>> I just applied the patch changing the log level for test assert, in
>>>>>> the hope it will help.
>>>>> And... we just got one with logs:
>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
>>>>>
>>>>> EAL: Test assert service_lcore_attr_get line 396 failed:
>>>>> lcore_attr_get() didn't get correct loop count (zero)
>>>>>
>>>>> It looks like a race between the service core still running and the
>>>>> core resetting the loops attr.
>>>>>
>>>> Yes, it seems to be just lack of patience of the test. It should wait a
>>>> bit for lcore to stop before resetting attrs.
>>>> Something like this should help:
>>>> @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
>>>>
>>>>           rte_service_lcore_stop(slcore_id);
>>>>
>>>> +       /* wait for the service lcore to stop */
>>>> +       rte_delay_ms(200);
>>>> +
>>>>           TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
>>>>                             "Valid lcore_attr_reset_all() didn't return
>>>> success");
>>> Would an rte_eal_wait_lcore make sense?  Overall, I really dislike
>>> sleeps because they can hide racy synchronization points.
>>
>> The rte_service_lcore_stop() operation changes the status to RUNSTATE_STOPED.
>> However, it will not terminate the service_run() procedure (there is a spinlock in
>> service_run() MT_UNSAFE path).
>> So the 'cs->loop' might increase after calling rte_service_lcore_attr_reset_all(),
>> which leads to this failure.
>> I think if we move the loop counter update operation before the service_run()
>> procedure, it can avoid this conflict.
>>
>> I cannot reproduce this issue on my testbed, so not sure something like this can
>> help or not.
>> Please check the code below.
>>
>> diff --git a/lib/librte_eal/common/rte_service.c
>> b/lib/librte_eal/common/rte_service.c
>> index 6a0e0ff..7b703dd 100644
>> --- a/lib/librte_eal/common/rte_service.c
>> +++ b/lib/librte_eal/common/rte_service.c
>> @@ -464,6 +464,7 @@ service_runner_func(void *arg)
>>           while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
>>                           RUNSTATE_RUNNING) {
>>                   const uint64_t service_mask = cs->service_mask;
>> +                cs->loops++;
>>
>>                   for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
>>                           if (!service_valid(i))
>> @@ -471,8 +472,6 @@ service_runner_func(void *arg)
>>                           /* return value ignored as no change to code flow */
>>                           service_run(i, cs, service_mask, service_get(i), 1);
>>                   }
>> -
>> -                cs->loops++;
>>           }
>>
>>           return 0;
> Changing the implementation loop counting is one option - but changing the
> implementation as a workaround for a unit test seems like the wrong way around.
I agree. We should fix the test, not the service implementation. Of 
course it would be nice to do so without inserting sleeps, as they are 
just a workaround for proper synchronization.
>
> Honnappa's suggestion in the other reply seems to not work here, as the "service_may_be_active"
> won't get updated if the service core is stopped after running its final iteration..?
>
> I'd prefer to see the implementation actually improve, or workaround in the test code itself.
> For improving the implementation, I can suggest this patch: https://protect2.fireeye.com/url?k=8ee70138-d3351a60-8ee68a77-0cc47a31c8b4-f3f6ba587be54162&q=1&u=http%3A%2F%2Fpatches.dpdk.org%2Fpatch%2F74493%2F
>
>
> That leaves us with 3 options;
> 1) Add a sleep, allowing the service core to finish its delay() as part of the service its running, and give it some grace time to finish
I'm not a fan of sleeps, but the test code is already full of them and 
this would be the easiest fix.
> 2) Relax the check in the unit test, to allow for  value <= (expected + 1)   which would ignore the last cs->loops++; call.
>      (Another way to think about option 2, is to allow for the non-determinism of the threading into the test).
I don't like this one.
> 3) Use the linked patch to implement/add/fix a wait time in the stop() API
The patch adds new functionality to the public API function. I don't 
know if it's a good step, and I also don't like the solution. I think that 
defining a 1000*1ms = 1s time for the service to finish is also bad, as we 
hardcode some static number in the code.

My proposal is to keep the thread_active flag and add a new function for 
checking it. This way the final user (in our case the test app) would be 
able to check that flag in a loop by itself. The final user who 
delegated the service to run on an lcore knows what type of service it is 
and how long it should wait for the service to stop.
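
A rough sketch of what I mean (the function name below is invented just for
illustration, it is not an existing API):

	/* hypothetical: reports whether the runner thread on that lcore is
	 * still inside its service loop */
	int rte_service_lcore_thread_active(uint32_t lcore_id);

	/* the caller (here the unit test) decides how long it is willing
	 * to wait */
	rte_service_lcore_stop(slcore_id);
	int timeout_ms = 100;
	while (rte_service_lcore_thread_active(slcore_id) && timeout_ms-- > 0)
		rte_delay_ms(1);

	TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
			  "Valid lcore_attr_reset_all() didn't return success");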

Or we can provide a synchronous function using conditionals to know if 
the service has stopped, but that would also require changing the 
behavior of the API.
>
> Strong options either way?
>
> Regards, -Harry

-- 
Lukasz Wojciechowski
Principal Software Engineer

Samsung R&D Institute Poland
Samsung Electronics
Office +48 22 377 88 25
l.wojciechow@partner.samsung.com


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-20 12:47               ` Lukasz Wojciechowski
@ 2020-07-21  5:39                 ` Honnappa Nagarahalli
  2020-07-21  8:01                   ` Van Haaren, Harry
  0 siblings, 1 reply; 23+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-21  5:39 UTC (permalink / raw)
  To: Lukasz Wojciechowski, Van Haaren, Harry, Phil Yang, Aaron Conole
  Cc: David Marchand, Igor Romanov, dev, Yigit, Ferruh, nd,
	Honnappa Nagarahalli, nd

<snip>

> >>> Subject: Re: [dpdk-dev] Random failure in service_autotest
> >>>
> >>> Lukasz Wojciechowski <l.wojciechow@partner.samsung.com> writes:
> >>>
> >>>> W dniu 17.07.2020 o 17:19, David Marchand pisze:
> >>>>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> >>>>> <david.marchand@redhat.com> wrote:
> >>>>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit
> >>>>>> <ferruh.yigit@intel.com>
> >>> wrote:
> >>>>>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> >>>>>>>> Hello Harry and guys who touched the service code recently :-)
> >>>>>>>>
> >>>>>>>> I spotted a failure for the service UT in Travis:
> >>>>>>>> https://travis-
> ci.com/github/ovsrobot/dpdk/jobs/361097992#L1869
> >>>>>>>> 7
> >>>>>>>>
> >>>>>>>> I found only a single instance of this failure and tried to
> >>>>>>>> reproduce it with my usual "brute" active loop with no success so
> far.
> >>>>>>> +1, I didn't able to reproduce it in my environment but observed
> >>>>>>> +it in
> >>> the
> >>>>>>> Travis CI.
> >>>>>>>
> >>>>>>>> Any chance it could be due to recent changes?
> >>>>>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-
> >>> 0cc47a31ce4e-
> >>>
> 231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcom
> >>> mit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
> >>>>>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-
> >>> 0cc47a31ce4e-
> >>>
> d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcom
> m
> >>> it%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> >>>>>> I can see more occurrences of the issue in the CI.
> >>>>>> I just applied the patch changing the log level for test assert,
> >>>>>> in the hope it will help.
> >>>>> And... we just got one with logs:
> >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> >>>>>
> >>>>> EAL: Test assert service_lcore_attr_get line 396 failed:
> >>>>> lcore_attr_get() didn't get correct loop count (zero)
> >>>>>
> >>>>> It looks like a race between the service core still running and
> >>>>> the core resetting the loops attr.
> >>>>>
> >>>> Yes, it seems to be just lack of patience of the test. It should
> >>>> wait a bit for lcore to stop before resetting attrs.
> >>>> Something like this should help:
> >>>> @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> >>>>
> >>>>           rte_service_lcore_stop(slcore_id);
> >>>>
> >>>> +       /* wait for the service lcore to stop */
> >>>> +       rte_delay_ms(200);
> >>>> +
> >>>>           TEST_ASSERT_EQUAL(0,
> rte_service_lcore_attr_reset_all(slcore_id),
> >>>>                             "Valid lcore_attr_reset_all() didn't
> >>>> return success");
> >>> Would an rte_eal_wait_lcore make sense?  Overall, I really dislike
> >>> sleeps because they can hide racy synchronization points.
> >>
> >> The rte_service_lcore_stop() operation changes the status to
> RUNSTATE_STOPED.
> >> However, it will not terminate the service_run() procedure (there is
> >> a spinlock in
> >> service_run() MT_UNSAFE path).
> >> So the 'cs->loop' might increase after calling
> >> rte_service_lcore_attr_reset_all(),
> >> which leads to this failure.
> >> I think if we move the loop counter update operation before the
> >> service_run() procedure, it can avoid this conflict.
> >>
> >> I cannot reproduce this issue on my testbed, so not sure something
> >> like this can help or not.
> >> Please check the code below.
> >>
> >> diff --git a/lib/librte_eal/common/rte_service.c
> >> b/lib/librte_eal/common/rte_service.c
> >> index 6a0e0ff..7b703dd 100644
> >> --- a/lib/librte_eal/common/rte_service.c
> >> +++ b/lib/librte_eal/common/rte_service.c
> >> @@ -464,6 +464,7 @@ service_runner_func(void *arg)
> >>           while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
> >>                           RUNSTATE_RUNNING) {
> >>                   const uint64_t service_mask = cs->service_mask;
> >> +                cs->loops++;
> >>
> >>                   for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> >>                           if (!service_valid(i)) @@ -471,8 +472,6 @@
> >> service_runner_func(void *arg)
> >>                           /* return value ignored as no change to code flow */
> >>                           service_run(i, cs, service_mask, service_get(i), 1);
> >>                   }
> >> -
> >> -                cs->loops++;
> >>           }
> >>
> >>           return 0;
> > Changing the implementation loop counting is one option - but changing
> > the implementation as a workaround for a unit test seems like the wrong
> way around.
> I agree. We should fix the test not the service implementation. Of course it
> would be nice to do so without inserting sleeps as it's a workaround for true
> synchronization.
I think the service shutdown sequence in the test case is not correct. We cannot
call 'rte_service_lcore_stop' without first shutting down the service using
'rte_service_runstate_set' and 'rte_service_component_runstate_set'. The
'rte_service_lcore_stop' has checks to validate that the service is not running
currently (among other things). In fact, the call to 'rte_service_lcore_stop' API is
returning -EBUSY currently in the test case. We are not checking the return status.

If I understand the intent of the test case correctly, the sequence of the calls needs to be:
rte_service_runstate_set(id, 0)
rte_service_component_runstate_set(id, 0);
rte_service_may_be_active - loop till the service is not active
rte_service_lcore_stop(slcore_id);
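
For illustration, a minimal sketch of that sequence inside the existing test
function, reusing its 'id' and 'slcore_id' variables (the 100 ms poll budget
below is an arbitrary assumption, not taken from any patch):

        /* Stop the service before stopping its lcore. */
        TEST_ASSERT_EQUAL(0, rte_service_runstate_set(id, 0),
                        "Failed to stop the service");
        TEST_ASSERT_EQUAL(0, rte_service_component_runstate_set(id, 0),
                        "Failed to stop the service component");

        /* Poll until the service reports it can no longer be active. */
        int timeout_ms = 100;
        while (rte_service_may_be_active(id) && timeout_ms-- > 0)
                rte_delay_ms(1);
        TEST_ASSERT_EQUAL(0, rte_service_may_be_active(id),
                        "Service still reported active after shutdown");

        TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
                        "Failed to stop the service lcore");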

> >
> > Honnappa's suggestion in the other reply seems to not work here, as the
> "service_may_be_active"
> > won't get updated if the service core is stopped after running its final
> iteration..?
This also is a problem.
This needs to be fixed by setting 'service_active_on_lcore' immediately after the
service completes running in 'service_run' (rather than waiting for the next
iteration of service_run).

> >
> > I'd prefer to see the implementation actually improve, or workaround in
> the test code itself.
> > For improving the implementation, I can suggest this patch:
> > http://patches.dpdk.org/patch/74493/
> >
> >
> > That leaves us with 3 options:
> > 1) Add a sleep, allowing the service core to finish its delay() as
> > part of the service it's running, and give it some grace time to finish
> I'm not a fan of sleeps, but the test code is already full of them and this
> would be the easiest fix.
> > 2) Relax the check in the unit test, to allow for a value <= (expected + 1),
> > which would ignore the last cs->loops++; call.
> >      (Another way to think about option 2 is to allow for the non-
> > determinism of the threading in the test).
> I don't like this one.
> > 3) Use the linked patch to implement/add/fix a wait time in the stop()
> > API
> The patch adds new functionality to the public API function. I don't know if it's
> a good step, and I also don't like the solution. I think that defining a
> 1000*1ms = 1s time for the service to finish is also bad, as we define some static
> number in the code.
> 
> My proposal is to keep the thread_active flag and add a new function for
> checking it. This way the final user (in our case, the test app) would be able to
> check that flag in a loop by itself. The final user who delegated the service to
> be run on an lcore knows what type of service it is and how long it should wait
> for the service to stop.
> 
> Or we can provide a synchronous function using conditionals to know if the
> service has stopped, but that would also require changing the behavior of the
> API.
> >
> > Strong options either way?
> >
> > Regards, -Harry
> 
> --
> Lukasz Wojciechowski
> Principal Software Engineer
> 
> Samsung R&D Institute Poland
> Samsung Electronics
> Office +48 22 377 88 25
> l.wojciechow@partner.samsung.com


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-21  5:39                 ` Honnappa Nagarahalli
@ 2020-07-21  8:01                   ` Van Haaren, Harry
  2020-07-21  8:07                     ` David Marchand
  2020-07-21 15:09                     ` Honnappa Nagarahalli
  0 siblings, 2 replies; 23+ messages in thread
From: Van Haaren, Harry @ 2020-07-21  8:01 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Lukasz Wojciechowski, Phil Yang, Aaron Conole
  Cc: David Marchand, Igor Romanov, dev, Yigit, Ferruh, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Tuesday, July 21, 2020 6:40 AM
> To: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>; Van Haaren,
> Harry <harry.van.haaren@intel.com>; Phil Yang <Phil.Yang@arm.com>; Aaron
> Conole <aconole@redhat.com>
> Cc: David Marchand <david.marchand@redhat.com>; Igor Romanov
> <igor.romanov@oktetlabs.ru>; dev <dev@dpdk.org>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] Random failure in service_autotest
> 
> <snip>
<more snip>

Thanks Lukasz & Phil for v2 reviews & Acks.

> > >>           return 0;
> > > Changing the implementation loop counting is one option - but changing
> > > the implementation as a workaround for a unit test seems like the wrong
> > way around.
> > I agree. We should fix the test not the service implementation. Of course it
> > would be nice to do so without inserting sleeps as it's a workaround for true
> > synchronization.
> I think the service shutdown sequence in the test case is not correct. We cannot
> call 'rte_service_lcore_stop' without first shutting down the service using
> 'rte_service_runstate_set' and 'rte_service_component_runstate_set'. The
> 'rte_service_lcore_stop' has checks to validate that the service is not running
> currently (among other things). In fact, the call to 'rte_service_lcore_stop' API is
> returning -EBUSY currently in the test case. We are not checking the return
> status.
> 
> If I understand the intent of the test case correctly, the sequence of the calls
> needs to be:
> rte_service_runstate_set(id, 0)
> rte_service_component_runstate_set(id, 0);
> rte_service_may_be_active - loop till the service is not active
> rte_service_lcore_stop(slcore_id);

No need to change service runstates, unmapping the service lcore from the service
allows service_lcore_stop() to work as expected, and not return -EBUSY. This
change to add an unmap() is integrated in the test case in the v2 patch.
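
A minimal sketch of that unmap-then-stop step, assuming the test case's existing
'id' and 'slcore_id' variables (illustrative only, not a copy of the v2 patch):

        /* Unmap the service from the lcore so the lcore is no longer
         * considered busy with it; rte_service_lcore_stop() can then
         * succeed instead of returning -EBUSY. */
        TEST_ASSERT_EQUAL(0, rte_service_map_lcore_set(id, slcore_id, 0),
                        "Failed to unmap service from lcore");
        TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
                        "Failed to stop the service lcore");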


> > > Honnappa's suggestion in the other reply seems to not work here, as the
> > "service_may_be_active"
> > > won't get updated if the service core is stopped after running its final
> > iteration..?
> This also is a problem.
> This needs to be fixed by setting 'service_active_on_lcore' immediately after the
> service completes running in 'service_run' (rather than waiting for the next
> iteration of service_run).

This is a different bug - not directly related to the service_autotest issue reported
here. I'll keep focus & resolve that issue first as it has higher priority due to CI failures.

<snip previous discussion>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-21  8:01                   ` Van Haaren, Harry
@ 2020-07-21  8:07                     ` David Marchand
  2020-07-21  8:16                       ` Lukasz Wojciechowski
  2020-07-21 15:09                     ` Honnappa Nagarahalli
  1 sibling, 1 reply; 23+ messages in thread
From: David Marchand @ 2020-07-21  8:07 UTC (permalink / raw)
  To: Van Haaren, Harry
  Cc: Honnappa Nagarahalli, Lukasz Wojciechowski, Phil Yang,
	Aaron Conole, Igor Romanov, dev, Yigit, Ferruh, nd

On Tue, Jul 21, 2020 at 10:01 AM Van Haaren, Harry
<harry.van.haaren@intel.com> wrote:
> > > > Honnappa's suggestion in the other reply seems to not work here, as the
> > > "service_may_be_active"
> > > > won't get updated if the service core is stopped after running its final
> > > iteration..?
> > This also is a problem.
> > This needs to be fixed by setting 'service_active_on_lcore' immediately after the
> > service completes running in 'service_run' (rather than waiting for the next
> > iteration of service_run).
>
> This is a different bug - not directly related to the service_autotest issue reported
> > here. I'll keep focus & resolve that issue first as it has higher priority due to CI failures.

Is the current bug caught by the CI something introduced by the recent changes?


-- 
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-21  8:07                     ` David Marchand
@ 2020-07-21  8:16                       ` Lukasz Wojciechowski
  0 siblings, 0 replies; 23+ messages in thread
From: Lukasz Wojciechowski @ 2020-07-21  8:16 UTC (permalink / raw)
  To: David Marchand, Van Haaren, Harry
  Cc: Honnappa Nagarahalli, Phil Yang, Aaron Conole, Igor Romanov, dev,
	Yigit, Ferruh, nd


W dniu 21.07.2020 o 10:07, David Marchand pisze:
> On Tue, Jul 21, 2020 at 10:01 AM Van Haaren, Harry
> <harry.van.haaren@intel.com> wrote:
>>>>> Honnappa's suggestion in the other reply seems to not work here, as the
>>>> "service_may_be_active"
>>>>> won't get updated if the service core is stopped after running its final
>>>> iteration..?
>>> This also is a problem.
>>> This needs to be fixed by setting 'service_active_on_lcore' immediately after the
>>> service completes running in 'service_run' (rather than waiting for the next
>>> iteration of service_run).
>> This is a different bug - not directly related to the service_autotest issue reported
>> here. I'll keep focus & resolve that issue first as it has higher priority due to CI failures.
> Is the current bug caught by the CI something introduced by the recent changes?
No, I believe it's not. It's a lack-of-synchronization issue in the test, 
between the test procedure and the service core, that happens to show up very 
rarely. That's why it was so hard to reproduce.
>
-- 
Lukasz Wojciechowski
Principal Software Engineer

Samsung R&D Institute Poland
Samsung Electronics
Office +48 22 377 88 25
l.wojciechow@partner.samsung.com


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-21  8:01                   ` Van Haaren, Harry
  2020-07-21  8:07                     ` David Marchand
@ 2020-07-21 15:09                     ` Honnappa Nagarahalli
  2020-07-21 15:38                       ` Van Haaren, Harry
  1 sibling, 1 reply; 23+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-21 15:09 UTC (permalink / raw)
  To: Van Haaren, Harry, Lukasz Wojciechowski, Phil Yang, Aaron Conole
  Cc: David Marchand, Igor Romanov, dev, Yigit, Ferruh, nd,
	Honnappa Nagarahalli, nd

<snip>

> <more snip>
> 
> Thanks Lukasz & Phil for v2 reviews & Acks.
> 
> > > >>           return 0;
> > > > Changing the implementation loop counting is one option - but
> > > > changing the implementation as a workaround for a unit test seems
> > > > like the wrong
> > > way around.
> > > I agree. We should fix the test not the service implementation. Of
> > > course it would be nice to do so without inserting sleeps as it's a
> > > workaround for true synchronization.
> > I think the service shutdown sequence in the test case is not correct.
> > We cannot call 'rte_service_lcore_stop' without first shutting down
> > the service using 'rte_service_runstate_set' and
> > 'rte_service_component_runstate_set'. The 'rte_service_lcore_stop' has
> > checks to validate that the service is not running currently (among
> > other things). In fact, the call to 'rte_service_lcore_stop' API is
> > returning -EBUSY currently in the test case. We are not checking the return
> status.
> >
> > If I understand the intent of the test case correctly, the sequence of
> > the calls needs to be:
> > rte_service_runstate_set(id, 0)
> > rte_service_component_runstate_set(id, 0); rte_service_may_be_active -
> > loop till the service is not active rte_service_lcore_stop(slcore_id);
> 
> No need to change service runstates, unmapping the service lcore to the
> service allows service_lcore_stop() to work as expected, and not return -
> EBUSY. This change to add an unmap() is integrated in the test case in the v2
> patch.
Ok, understood.
Looking at the patch, why not use the 'rte_service_lcore_stop' to provide the status of the lcore?
For ex: the 'thread_active' can be used in 'rte_service_lcore_stop' to indicate that the lcore is still busy?

> 
> 
> > > > Honnappa's suggestion in the other reply seems to not work here,
> > > > as the
> > > "service_may_be_active"
> > > > won't get updated if the service core is stopped after running its
> > > > final
> > > iteration..?
> > This also is a problem.
> > This needs to be fixed by setting 'service_active_on_lcore'
> > immediately after the service completes running in 'service_run'
> > (rather than waiting for the next iteration of service_run).
> 
> This is a different bug - not directly related to the service_autotest issue
> reported here. I'll keep focus & resolve that issue first as it has higher priority
> due to CI failures.
Sure.

> 
> <snip previous discsussion>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-21 15:09                     ` Honnappa Nagarahalli
@ 2020-07-21 15:38                       ` Van Haaren, Harry
  2020-07-21 16:21                         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 23+ messages in thread
From: Van Haaren, Harry @ 2020-07-21 15:38 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Lukasz Wojciechowski, Phil Yang, Aaron Conole
  Cc: David Marchand, Igor Romanov, dev, Yigit, Ferruh, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Tuesday, July 21, 2020 4:09 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com>; Phil Yang <Phil.Yang@arm.com>; Aaron
> Conole <aconole@redhat.com>
> Cc: David Marchand <david.marchand@redhat.com>; Igor Romanov
> <igor.romanov@oktetlabs.ru>; dev <dev@dpdk.org>; Yigit, Ferruh
> <ferruh.yigit@intel.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] Random failure in service_autotest
> 
> <snip>
> 
> > <more snip>

<triple snip>

> > > If I understand the intent of the test case correctly, the sequence of
> > > the calls needs to be:
> > > rte_service_runstate_set(id, 0)
> > > rte_service_component_runstate_set(id, 0); rte_service_may_be_active -
> > > loop till the service is not active rte_service_lcore_stop(slcore_id);
> >
> > No need to change service runstates, unmapping the service lcore to the
> > service allows service_lcore_stop() to work as expected, and not return -
> > EBUSY. This change to add an unmap() is integrated in the test case in the v2
> > patch.
> Ok, understood.
> Looking at the patch, why not use the 'rte_service_lcore_stop' to provide the
> status of the lcore?
> For ex: the 'thread_active' can be used in 'rte_service_lcore_stop' to indicate that
> the lcore is still busy?

Heh - we think alike it seems, v1 of the patch had that :) http://patches.dpdk.org/patch/74493/
Actually, the looping mechanics and things were inspired by your initial solution.

Based on Lukasz's feedback, changing the API behavior isn't nice, and indeed I kind of
agree looking at it now. Adding a concept to check if the core is actually done is useful,
and allows the user to do custom checks. Lukasz's feedback on the v1 sums it up better
than I'll type here.
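
As a rough caller-side illustration of that concept (the function name
rte_service_lcore_may_be_active() is an assumption of this sketch, not a quote
of the v2 patch):

        /* Stop the service lcore, then poll until its thread has really left
         * the service loop before touching the per-lcore attributes. */
        rte_service_lcore_stop(slcore_id);
        while (rte_service_lcore_may_be_active(slcore_id) == 1)
                rte_delay_ms(1);

        TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
                        "Valid lcore_attr_reset_all() didn't return success");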

<snip>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] Random failure in service_autotest
  2020-07-21 15:38                       ` Van Haaren, Harry
@ 2020-07-21 16:21                         ` Honnappa Nagarahalli
  0 siblings, 0 replies; 23+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-21 16:21 UTC (permalink / raw)
  To: Van Haaren, Harry, Lukasz Wojciechowski, Phil Yang, Aaron Conole
  Cc: David Marchand, Igor Romanov, dev, Yigit, Ferruh, nd,
	Honnappa Nagarahalli, nd

> > <snip>
> >
> > > <more snip>
> 
> <triple snip>
> 
> > > > If I understand the intent of the test case correctly, the
> > > > sequence of the calls needs to be:
> > > > rte_service_runstate_set(id, 0)
> > > > rte_service_component_runstate_set(id, 0);
> > > > rte_service_may_be_active - loop till the service is not active
> > > > rte_service_lcore_stop(slcore_id);
> > >
> > > No need to change service runstates, unmapping the service lcore to
> > > the service allows service_lcore_stop() to work as expected, and not
> > > return - EBUSY. This change to add an unmap() is integrated in the
> > > test case in the v2 patch.
> > Ok, understood.
> > Looking at the patch, why not use the 'rte_service_lcore_stop' to
> > provide the status of the lcore?
> > For ex: the 'thread_active' can be used in 'rte_service_lcore_stop' to
> > indicate that the lcore is still busy?
> 
> Heh - we think alike it seems, v1 of patch had that :)
> http://patches.dpdk.org/patch/74493/
> Actually, the looping mechanics and things was inspired by your initial
> solution.
> 
> Based on Lukasz's feedback, changing the API behavior isn't nice, and indeed I
> kind of agree looking at it now. Adding a concept to check if the core is
> actually done is useful, and allows the user to do custom checks. Lukasz's
> feedback on the v1 sums it up better than I'll type here.
Thanks for pointing that out. A separate API sounds better. This is also in line
with the 'rte_service_may_be_active' API.
I have a few comments on v2 and will send them out.

> 
> <snip>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-07-21 16:21 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-15 10:14 [dpdk-dev] Random failure in service_autotest David Marchand
2020-07-15 10:41 ` Ferruh Yigit
2020-07-17  8:56   ` David Marchand
2020-07-17 15:19     ` David Marchand
2020-07-17 20:31       ` Lukasz Wojciechowski
2020-07-17 22:38         ` Aaron Conole
2020-07-17 22:43           ` Honnappa Nagarahalli
2020-07-18  8:34           ` Phil Yang
2020-07-20 12:09             ` Van Haaren, Harry
2020-07-20 12:47               ` Lukasz Wojciechowski
2020-07-21  5:39                 ` Honnappa Nagarahalli
2020-07-21  8:01                   ` Van Haaren, Harry
2020-07-21  8:07                     ` David Marchand
2020-07-21  8:16                       ` Lukasz Wojciechowski
2020-07-21 15:09                     ` Honnappa Nagarahalli
2020-07-21 15:38                       ` Van Haaren, Harry
2020-07-21 16:21                         ` Honnappa Nagarahalli
2020-07-15 12:56 ` Aaron Conole
2020-07-15 13:02   ` David Marchand
2020-07-15 13:09     ` Lukasz Wojciechowski
2020-07-15 13:28       ` David Marchand
2020-07-15 13:39         ` Aaron Conole
2020-07-15 20:26           ` Honnappa Nagarahalli
