DPDK patches and discussions
From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
To: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>,
	"Van Haaren, Harry" <harry.van.haaren@intel.com>,
	Phil Yang <Phil.Yang@arm.com>, Aaron Conole <aconole@redhat.com>
Cc: David Marchand <david.marchand@redhat.com>,
	Igor Romanov <igor.romanov@oktetlabs.ru>, dev <dev@dpdk.org>,
	"Yigit, Ferruh" <ferruh.yigit@intel.com>, nd <nd@arm.com>,
	Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
	nd <nd@arm.com>
Subject: Re: [dpdk-dev] Random failure in service_autotest
Date: Tue, 21 Jul 2020 05:39:37 +0000	[thread overview]
Message-ID: <DB6PR0802MB2216CF809EDD9D969E7436BD98780@DB6PR0802MB2216.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <3bc99c95-1736-635e-e8af-c9ae47da6d9c@partner.samsung.com>

<snip>

> >>> Subject: Re: [dpdk-dev] Random failure in service_autotest
> >>>
> >>> Lukasz Wojciechowski <l.wojciechow@partner.samsung.com> writes:
> >>>
> >>>> W dniu 17.07.2020 o 17:19, David Marchand pisze:
> >>>>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> >>>>> <david.marchand@redhat.com> wrote:
> >>>>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit
> >>>>>> <ferruh.yigit@intel.com>
> >>> wrote:
> >>>>>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> >>>>>>>> Hello Harry and guys who touched the service code recently :-)
> >>>>>>>>
> >>>>>>>> I spotted a failure for the service UT in Travis:
> >>>>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> >>>>>>>>
> >>>>>>>> I found only a single instance of this failure and tried to
> >>>>>>>> reproduce it with my usual "brute" active loop with no success so
> far.
> >>>>>>> +1, I wasn't able to reproduce it in my environment but observed
> >>>>>>> it in the Travis CI.
> >>>>>>>
> >>>>>>>> Any chance it could be due to recent changes?
> >>>>>>>> https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
> >>>>>>>> https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> >>>>>> I can see more occurrences of the issue in the CI.
> >>>>>> I just applied the patch changing the log level for test assert,
> >>>>>> in the hope it will help.
> >>>>> And... we just got one with logs:
> >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> >>>>>
> >>>>> EAL: Test assert service_lcore_attr_get line 396 failed:
> >>>>> lcore_attr_get() didn't get correct loop count (zero)
> >>>>>
> >>>>> It looks like a race between the service core still running and
> >>>>> the core resetting the loops attr.
> >>>>>
> >>>> Yes, it seems the test just isn't patient enough. It should
> >>>> wait a bit for the lcore to stop before resetting attrs.
> >>>> Something like this should help:
> >>>> @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> >>>>
> >>>>           rte_service_lcore_stop(slcore_id);
> >>>>
> >>>> +       /* wait for the service lcore to stop */
> >>>> +       rte_delay_ms(200);
> >>>> +
> >>>>           TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
> >>>>                             "Valid lcore_attr_reset_all() didn't
> >>>> return success");
> >>> Would an rte_eal_wait_lcore make sense?  Overall, I really dislike
> >>> sleeps because they can hide racy synchronization points.
> >>
> >> The rte_service_lcore_stop() operation changes the status to
> >> RUNSTATE_STOPPED. However, it does not terminate the service_run()
> >> procedure (there is a spinlock in the service_run() MT_UNSAFE path).
> >> So 'cs->loops' might increase after calling
> >> rte_service_lcore_attr_reset_all(), which leads to this failure.
> >> I think if we move the loop counter update operation before the
> >> service_run() procedure, it can avoid this conflict.
> >>
> >> I cannot reproduce this issue on my testbed, so not sure something
> >> like this can help or not.
> >> Please check the code below.
> >>
> >> diff --git a/lib/librte_eal/common/rte_service.c
> >> b/lib/librte_eal/common/rte_service.c
> >> index 6a0e0ff..7b703dd 100644
> >> --- a/lib/librte_eal/common/rte_service.c
> >> +++ b/lib/librte_eal/common/rte_service.c
> >> @@ -464,6 +464,7 @@ service_runner_func(void *arg)
> >>           while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
> >>                           RUNSTATE_RUNNING) {
> >>                   const uint64_t service_mask = cs->service_mask;
> >> +                cs->loops++;
> >>
> >>                   for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> >>                           if (!service_valid(i))
> >> @@ -471,8 +472,6 @@ service_runner_func(void *arg)
> >>                           /* return value ignored as no change to code flow */
> >>                           service_run(i, cs, service_mask, service_get(i), 1);
> >>                   }
> >> -
> >> -                cs->loops++;
> >>           }
> >>
> >>           return 0;
> > Changing the implementation's loop counting is one option - but changing
> > the implementation as a workaround for a unit test seems like the wrong
> > way around.
> I agree. We should fix the test, not the service implementation. Of course it
> would be nice to do so without inserting sleeps, as they are a workaround for
> true synchronization.
I think the service shutdown sequence in the test case is not correct. We cannot call 'rte_service_lcore_stop' without first shutting down the service using 'rte_service_runstate_set' and 'rte_service_component_runstate_set'. 'rte_service_lcore_stop' validates that the service is not currently running (among other things); in fact, the call to the 'rte_service_lcore_stop' API is returning -EBUSY in the test case today, and we are not checking the return status.

If I understand the intent of the test case correctly, the sequence of the calls needs to be:
rte_service_runstate_set(id, 0)
rte_service_component_runstate_set(id, 0);
rte_service_may_be_active - loop till the service is not active
rte_service_lcore_stop(slcore_id);

> >
> > Honnappa's suggestion in the other reply seems to not work here, as the
> "service_may_be_active"
> > won't get updated if the service core is stopped after running its final
> iteration..?
This also is a problem.
This needs to be fixed by setting 'service_active_on_lcore' immediately after the service completes running in 'service_run' (rather than waiting for the next iteration of service_run).

> >
> > I'd prefer to see the implementation actually improve, or workaround in
> the test code itself.
> > For improving the implementation, I can suggest this patch:
> > http://patches.dpdk.org/patch/74493/
> >
> >
> > That leaves us with 3 options;
> > 1) Add a sleep, allowing the service core to finish its delay() as
> > part of the service its running, and give it some grace time to finish
> I'm not a fan of sleeps, but the test code is already full of them and this
> would be the easiest fix.
> > 2) Relax the check in the unit test, to allow for value <= (expected + 1),
> > which would ignore the last cs->loops++ call.
> >      (Another way to think about option 2 is to allow for the
> > non-determinism of threading in the test.)
> I don't like this one.
> > 3) Use the linked patch to implement/add/fix a wait time in the stop()
> > API
> The patch adds new functionality to the public API function. I don't know if
> that's a good step, and I also don't like the solution. I think that defining
> a 1000*1ms = 1s timeout for the service to finish is also bad, as it hardcodes
> a static number in the code.
> 
> My proposal is to keep the thread_active flag and add a new function for
> checking it. This way the final user (in our case the test app) would be able
> to check that flag in a loop by itself. The final user who delegated the
> service to run on an lcore knows what type of service it is and how long it
> should wait for the service to stop.
> 
> Or we can provide a synchronous function using conditionals to know if the
> service has stopped, but that would also require changing the behavior of the
> API.
> >
> > Strong options either way?
> >
> > Regards, -Harry
> 
> --
> Lukasz Wojciechowski
> Principal Software Engineer
> 
> Samsung R&D Institute Poland
> Samsung Electronics
> Office +48 22 377 88 25
> l.wojciechow@partner.samsung.com

