DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Van Haaren, Harry" <harry.van.haaren@intel.com>
To: Phil Yang <Phil.Yang@arm.com>, Aaron Conole <aconole@redhat.com>,
	"Lukasz Wojciechowski" <l.wojciechow@partner.samsung.com>
Cc: David Marchand <david.marchand@redhat.com>,
	Igor Romanov <igor.romanov@oktetlabs.ru>,
	Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
	dev <dev@dpdk.org>, "Yigit, Ferruh" <ferruh.yigit@intel.com>,
	nd <nd@arm.com>, nd <nd@arm.com>
Subject: Re: [dpdk-dev] Random failure in service_autotest
Date: Mon, 20 Jul 2020 12:09:39 +0000	[thread overview]
Message-ID: <BYAPR11MB31432147D3121F86BF2AA337D77B0@BYAPR11MB3143.namprd11.prod.outlook.com> (raw)
In-Reply-To: <VE1PR08MB4640AE63D9D825EBCE6729E5E97D0@VE1PR08MB4640.eurprd08.prod.outlook.com>

> -----Original Message-----
> From: Phil Yang <Phil.Yang@arm.com>
> Sent: Saturday, July 18, 2020 9:34 AM
> To: Aaron Conole <aconole@redhat.com>; Lukasz Wojciechowski
> <l.wojciechow@partner.samsung.com>
> Cc: David Marchand <david.marchand@redhat.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Igor Romanov <igor.romanov@oktetlabs.ru>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev
> <dev@dpdk.org>; Yigit, Ferruh <ferruh.yigit@intel.com>; nd <nd@arm.com>; nd
> <nd@arm.com>
> Subject: RE: [dpdk-dev] Random failure in service_autotest
> 
> > -----Original Message-----
> > From: Aaron Conole <aconole@redhat.com>
> > Sent: Saturday, July 18, 2020 6:39 AM
> > To: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
> > Cc: David Marchand <david.marchand@redhat.com>; Van Haaren Harry
> > <harry.van.haaren@intel.com>; Igor Romanov
> > <igor.romanov@oktetlabs.ru>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Phil Yang <Phil.Yang@arm.com>; dev
> > <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>
> > Subject: Re: [dpdk-dev] Random failure in service_autotest
> >
> > Lukasz Wojciechowski <l.wojciechow@partner.samsung.com> writes:
> >
> > > W dniu 17.07.2020 o 17:19, David Marchand pisze:
> > >> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> > >> <david.marchand@redhat.com> wrote:
> > >>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yigit@intel.com>
> > wrote:
> > >>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> > >>>>> Hello Harry and guys who touched the service code recently :-)
> > >>>>>
> > >>>>> I spotted a failure for the service UT in Travis:
> > >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> > >>>>>
> > >>>>> I found only a single instance of this failure and tried to reproduce
> > >>>>> it with my usual "brute" active loop with no success so far.
> > >>>> +1, I didn't able to reproduce it in my environment but observed it in
> > the
> > >>>> Travis CI.
> > >>>>
> > >>>>> Any chance it could be due to recent changes?
> > >>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-
> > 0cc47a31ce4e-
> > 231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcom
> > mit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
> > >>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-
> > 0cc47a31ce4e-
> > d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcomm
> > it%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> > >>> I can see more occurrences of the issue in the CI.
> > >>> I just applied the patch changing the log level for test assert, in
> > >>> the hope it will help.
> > >> And... we just got one with logs:
> > >> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> > >>
> > >> EAL: Test assert service_lcore_attr_get line 396 failed:
> > >> lcore_attr_get() didn't get correct loop count (zero)
> > >>
> > >> It looks like a race between the service core still running and the
> > >> core resetting the loops attr.
> > >>
> > > Yes, it seems to be just lack of patience of the test. It should wait a
> > > bit for lcore to stop before resetting attrs.
> > > Something like this should help:
> > > @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> > >
> > >          rte_service_lcore_stop(slcore_id);
> > >
> > > +       /* wait for the service lcore to stop */
> > > +       rte_delay_ms(200);
> > > +
> > >          TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
> > >                            "Valid lcore_attr_reset_all() didn't return
> > > success");
> >
> > Would an rte_eal_wait_lcore make sense?  Overall, I really dislike
> > sleeps because they can hide racy synchronization points.
> 
> 
> The rte_service_lcore_stop() operation changes the status to RUNSTATE_STOPED.
> However, it will not terminate the service_run() procedure (there is a spinlock in
> service_run() MT_UNSAFE path).
> So the 'cs->loop' might increase after calling rte_service_lcore_attr_reset_all(),
> which leads to this failure.
> I think if we move the loop counter update operation before the service_run()
> procedure, it can avoid this conflict.
> 
> I cannot reproduce this issue on my testbed, so not sure something like this can
> help or not.
> Please check the code below.
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 6a0e0ff..7b703dd 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -464,6 +464,7 @@ service_runner_func(void *arg)
>          while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
>                          RUNSTATE_RUNNING) {
>                  const uint64_t service_mask = cs->service_mask;
> +                cs->loops++;
> 
>                  for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
>                          if (!service_valid(i))
> @@ -471,8 +472,6 @@ service_runner_func(void *arg)
>                          /* return value ignored as no change to code flow */
>                          service_run(i, cs, service_mask, service_get(i), 1);
>                  }
> -
> -                cs->loops++;
>          }
> 
>          return 0;

Changing the implementation loop counting is one option - but changing the
implementation as a workaround for a unit test seems like the wrong way around.

Honnappa's suggestion in the other reply seems to not work here, as the "service_may_be_active"
won't get updated if the service core is stopped after running its final iteration..?

I'd prefer to see the implementation actually improve, or workaround in the test code itself.
For improving the implementation, I can suggest this patch: http://patches.dpdk.org/patch/74493/


That leaves us with 3 options;
1) Add a sleep, allowing the service core to finish its delay() as part of the service its running, and give it some grace time to finish
2) Relax the check in the unit test, to allow for  value <= (expected + 1)   which would ignore the last cs->loops++; call.
    (Another way to think about option 2, is to allow for the non-determinism of the threading into the test).
3) Use the linked patch to implement/add/fix a wait time in the stop() API

Strong options either way?

Regards, -Harry

  reply	other threads:[~2020-07-20 12:09 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-15 10:14 David Marchand
2020-07-15 10:41 ` Ferruh Yigit
2020-07-17  8:56   ` David Marchand
2020-07-17 15:19     ` David Marchand
2020-07-17 20:31       ` Lukasz Wojciechowski
2020-07-17 22:38         ` Aaron Conole
2020-07-17 22:43           ` Honnappa Nagarahalli
2020-07-18  8:34           ` Phil Yang
2020-07-20 12:09             ` Van Haaren, Harry [this message]
2020-07-20 12:47               ` Lukasz Wojciechowski
2020-07-21  5:39                 ` Honnappa Nagarahalli
2020-07-21  8:01                   ` Van Haaren, Harry
2020-07-21  8:07                     ` David Marchand
2020-07-21  8:16                       ` Lukasz Wojciechowski
2020-07-21 15:09                     ` Honnappa Nagarahalli
2020-07-21 15:38                       ` Van Haaren, Harry
2020-07-21 16:21                         ` Honnappa Nagarahalli
2020-07-15 12:56 ` Aaron Conole
2020-07-15 13:02   ` David Marchand
2020-07-15 13:09     ` Lukasz Wojciechowski
2020-07-15 13:28       ` David Marchand
2020-07-15 13:39         ` Aaron Conole
2020-07-15 20:26           ` Honnappa Nagarahalli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BYAPR11MB31432147D3121F86BF2AA337D77B0@BYAPR11MB3143.namprd11.prod.outlook.com \
    --to=harry.van.haaren@intel.com \
    --cc=Honnappa.Nagarahalli@arm.com \
    --cc=Phil.Yang@arm.com \
    --cc=aconole@redhat.com \
    --cc=david.marchand@redhat.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@intel.com \
    --cc=igor.romanov@oktetlabs.ru \
    --cc=l.wojciechow@partner.samsung.com \
    --cc=nd@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).