DPDK patches and discussions
 help / color / mirror / Atom feed
From: Aaron Conole <aconole@redhat.com>
To: "Van Haaren\, Harry" <harry.van.haaren@intel.com>
Cc: David Marchand <david.marchand@redhat.com>,
	"dev\@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
Date: Mon, 14 Oct 2019 10:53:48 -0400	[thread overview]
Message-ID: <f7ttv8bs7ib.fsf@dhcp-25.97.bos.redhat.com> (raw)
In-Reply-To: <f7tblusn31q.fsf@dhcp-25.97.bos.redhat.com> (Aaron Conole's message of "Mon, 07 Oct 2019 08:38:41 -0400")

Aaron Conole <aconole@redhat.com> writes:

> "Van Haaren, Harry" <harry.van.haaren@intel.com> writes:
>
>>> -----Original Message-----
>>> From: Aaron Conole [mailto:aconole@redhat.com]
>>> Sent: Wednesday, September 4, 2019 8:56 PM
>>> To: David Marchand <david.marchand@redhat.com>
>>> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
>>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest
>>> failing
>>> 
>>> David Marchand <david.marchand@redhat.com> writes:
>>> 
>>> > On Wed, Sep 4, 2019 at 12:04 PM David Marchand
>>> > <david.marchand@redhat.com> wrote:
>>> >>
>>> >> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
>>> >> <harry.van.haaren@intel.com> wrote:
>>> >> >
>>> >> > > -----Original Message-----
>>> >> > > From: Aaron Conole [mailto:aconole@redhat.com]
>>> >> > > Sent: Tuesday, September 3, 2019 3:46 PM
>>> >> > > To: Van Haaren, Harry <harry.van.haaren@intel.com>
>>> >> > > Cc: dev@dpdk.org
>>> >> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
>>> >> > >
>>> >> > > Hi Harry,
>>> >> >
>>> >> > Hey Aaron,
>>> >> >
>>> >> > > I noticed as part of series_6218
>>> >> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the
>>> travis
>>> >> > > build had a single failure, in service_autotest but it doesn't seem
>>> >> > > related to the series at all.
>>> >> > >
>>> >> > >   https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
>>> >> > >
>>> >> > > Not sure if there's some kind of debugging we can add or look at to
>>> help
>>> >> > > diagnose failures when they occur.  Do you have time to have a look?
>>> >> >
>>> >> > Thanks for flagging this.
>>> >> >
>>> >> > I've just re-run the unit tests here multiple times to see if I can
>>> >> > reproduce something strange, no luck on reproducing the issue.
>>> >> >
>>> >> > Attempted with clang-6 and clang-7 (travis error on clang-7),
>>> >> > still no issues found.
>>> >> >
>>> >> > Building with Clang-7 and Shared libs (instead of default static)
>>> >> > still no issues found.
>>> >> >
>>> >> > If somebody can reproduce please send an update to here and I'll
>>> >> > attempt to replicate that setup. Right now I can't reproduce the issue.
>>> >>
>>> >> You have to be patient, but I caught it on my laptop:
>>> >>
>>> >
>>> > Ok, and now with the logs:
>>> >
>>> >
>>> > # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
>>> > -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1
>>> >>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f
>>> > $log)
>>> > EAL: Detected lcore 0 as core 0 on socket 0
>>> > EAL: Detected lcore 1 as core 1 on socket 0
>>> > EAL: Detected lcore 2 as core 2 on socket 0
>> <snip verbose logs>
>>> > EAL:   probe driver: 8086:15d7 net_e1000_em
>>> > EAL:   Not managed by a supported kernel driver, skipped
>>> > EAL: Module /sys/module/vfio not found! error 2 (No such file or directory)
>>> > APP: HPET is not enabled, using TSC as default timer
>>> > RTE>>service_autotest
>>> >  + ------------------------------------------------------- +
>>> >  + Test Suite : service core test suite
>>> >  + ------------------------------------------------------- +
>>> >  + TestCase [ 0] : unregister_all succeeded
>>> >  + TestCase [ 1] : service_name succeeded
>>> >  + TestCase [ 2] : service_get_by_name succeeded
>>> > Service dummy_service Summary
>>> >   dummy_service: stats 1    calls 0    cycles 0    avg: 0
>>> > Service dummy_service Summary
>>> >   dummy_service: stats 0    calls 0    cycles 0    avg: 0
>>> >  + TestCase [ 3] : service_dump succeeded
>>> >  + TestCase [ 4] : service_attr_get succeeded
>>> >  + TestCase [ 5] : service_lcore_attr_get succeeded
>>> >  + TestCase [ 6] : service_probe_capability succeeded
>>> >  + TestCase [ 7] : service_start_stop succeeded
>>> >  + TestCase [ 8] : service_lcore_add_del skipped
>>> >  + TestCase [ 9] : service_lcore_start_stop succeeded
>>> > EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service
>>> > core function call had no effect.
>>> >  + TestCase [10] : service_lcore_en_dis_able failed
>>> >  + TestCase [11] : service_mt_unsafe_poll skipped
>>> >  + TestCase [12] : service_mt_safe_poll skipped
>>> >  + TestCase [13] : service_app_lcore_mt_safe succeeded
>>> >  + TestCase [14] : service_app_lcore_mt_unsafe succeeded
>>> >  + TestCase [15] : service_may_be_active succeeded
>>> >  + ------------------------------------------------------- +
>>> >  + Test Suite Summary
>>> >  + Tests Total :       16
>>> >  + Tests Skipped :      3
>>> >  + Tests Executed :    16
>>> >  + Tests Unsupported:   0
>>> >  + Tests Passed :      12
>>> >  + Tests Failed :       1
>>> >  + ------------------------------------------------------- +
>>> > Test Failed
>>> > RTE>>EAL: request: mp_malloc_sync
>>> > EAL: Heap on socket 0 was shrunk by 2MB
>>> >
>>> > real    2m42.884s
>>> > user    5m1.902s
>>> > sys    0m2.208s
>>> 
>>> I can confirm - takes about 1m to fail.
>>
>>
>> Hi Aaron and David,
>>
>> I've been attempting to reproduce this, still no errors here.
>>
>> Given the nature of service-cores, and the difficulty to reproduce
>> here this feels like a race-condition - one that may not exist in all
>> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
>>
>> I'm using Meson to build, so reproducing using this instead of the command
>> as provided above. There should be no difference in reproducing due to this:
>
> The command runs far more iterations than meson does (I think).
>
> I still see it periodically occur in the travis environment.
>
> I did see at least one missing memory barrier (I believe).  Please
> review the following code change (and if you agree I can submit it
> formally):
>
> -----
> --- a/lib/librte_eal/common/eal_common_launch.c
> +++ b/lib/librte_eal/common/eal_common_launch.c
> @@ -21,8 +21,10 @@
>  int
>  rte_eal_wait_lcore(unsigned slave_id)
>  {
> -       if (lcore_config[slave_id].state == WAIT)
> +       if (lcore_config[slave_id].state == WAIT) {
> +               rte_rmb();
>                 return 0;
> +       }
>  
>         while (lcore_config[slave_id].state != WAIT &&
>                lcore_config[slave_id].state != FINISHED)
> -----
>
> This is because in lib/librte_eal/linux/eal/eal_thread.c:
>
> -----
> 		/* when a service core returns, it should go directly to WAIT
> 		 * state, because the application will not lcore_wait() for it.
> 		 */
> 		if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> 			lcore_config[lcore_id].state = WAIT;
> 		else
> 			lcore_config[lcore_id].state = FINISHED;
> -----
>
> NOTE that the service core skips the rte_eal_wait_lcore() code from
> making the FINISHED->WAIT transition.  So I think at least that read
> barrier will be needed (maybe I miss the pairing, though?).
>
> Additionally, I'm wondering if there is an additional write or sync
> barrier needed to ensure that some of the transitions are properly
> recorded when using lcore as a service lcore function.  The fact that
> this only happens occasionally tells me that it's either a race (which
> is possible... because the variable update in the test might not be
> sync'd across cores or something), or some other missing
> synchronization.
>
>> $ meson test service_autotest --repeat 50
>>
>> 1/1 DPDK:fast-tests / service_autotest      OK       3.86 s
>> 1/1 DPDK:fast-tests / service_autotest      OK       3.87 s
>> ...
>> 1/1 DPDK:fast-tests / service_autotest      OK       3.84 s
>>
>> OK:        50
>> FAIL:       0
>> SKIP:       0
>> TIMEOUT:    0
>>
>> I'll keep it running for a few hours but I have little faith if it only
>> takes 1 minute on your machines...
>
> Please try the flat command.

Not sure if you've had any time to look at this.

I think there's a change we can make, but not sure about how it fits in
the overall service lcore design.

The proposal is to use a pthread_cond variable which blocks the thread
requesting the service function to run.  The service function merely
sets the condition.  The requesting thread does a timed wait (up to 5s?)
and if the timeout is exceeded can throw an error.  Otherwise, it will
unblock and can assume that the test passes.  WDYT?  I think it works
better than the racy code in the test case for now.

>> Regards, -Harry

  reply	other threads:[~2019-10-14 14:53 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-03 14:45 Aaron Conole
2019-09-04  9:41 ` Van Haaren, Harry
2019-09-04 10:04   ` David Marchand
2019-09-04 10:38     ` David Marchand
2019-09-04 19:56       ` Aaron Conole
2019-10-07  9:50         ` Van Haaren, Harry
2019-10-07 12:38           ` Aaron Conole
2019-10-14 14:53             ` Aaron Conole [this message]
2019-10-14 16:48               ` Van Haaren, Harry
2019-10-15 16:42                 ` Van Haaren, Harry
2019-09-04  9:55 ` David Marchand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f7ttv8bs7ib.fsf@dhcp-25.97.bos.redhat.com \
    --to=aconole@redhat.com \
    --cc=david.marchand@redhat.com \
    --cc=dev@dpdk.org \
    --cc=harry.van.haaren@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).