From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 56844A2EFC for ; Mon, 14 Oct 2019 16:53:56 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 40BE31C29F; Mon, 14 Oct 2019 16:53:55 +0200 (CEST) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by dpdk.org (Postfix) with ESMTP id 3F3D31C25E for ; Mon, 14 Oct 2019 16:53:53 +0200 (CEST) Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 9C30B18CB8F4; Mon, 14 Oct 2019 14:53:52 +0000 (UTC) Received: from dhcp-25.97.bos.redhat.com (unknown [10.18.25.121]) by smtp.corp.redhat.com (Postfix) with ESMTPS id A3184196AE; Mon, 14 Oct 2019 14:53:49 +0000 (UTC) From: Aaron Conole To: "Van Haaren\, Harry" Cc: David Marchand , "dev\@dpdk.org" References: Date: Mon, 14 Oct 2019 10:53:48 -0400 In-Reply-To: (Aaron Conole's message of "Mon, 07 Oct 2019 08:38:41 -0400") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.6.2 (mx1.redhat.com [10.5.110.63]); Mon, 14 Oct 2019 14:53:52 +0000 (UTC) Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Aaron Conole writes: > "Van Haaren, Harry" writes: > >>> -----Original Message----- >>> From: Aaron Conole [mailto:aconole@redhat.com] >>> Sent: Wednesday, September 4, 2019 8:56 PM >>> To: David Marchand >>> Cc: Van Haaren, Harry ; dev@dpdk.org >>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest >>> failing >>> >>> David Marchand writes: >>> >>> > On Wed, Sep 4, 2019 at 12:04 PM David Marchand >>> > wrote: >>> >> >>> >> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry >>> >> wrote: >>> >> > >>> >> > > -----Original Message----- >>> >> > > From: Aaron Conole [mailto:aconole@redhat.com] >>> >> > > Sent: Tuesday, September 3, 2019 3:46 PM >>> >> > > To: Van Haaren, Harry >>> >> > > Cc: dev@dpdk.org >>> >> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing >>> >> > > >>> >> > > Hi Harry, >>> >> > >>> >> > Hey Aaron, >>> >> > >>> >> > > I noticed as part of series_6218 >>> >> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the >>> travis >>> >> > > build had a single failure, in service_autotest but it doesn't seem >>> >> > > related to the series at all. >>> >> > > >>> >> > > https://travis-ci.com/ovsrobot/dpdk/jobs/230358460 >>> >> > > >>> >> > > Not sure if there's some kind of debugging we can add or look at to >>> help >>> >> > > diagnose failures when they occur. Do you have time to have a look? >>> >> > >>> >> > Thanks for flagging this. >>> >> > >>> >> > I've just re-run the unit tests here multiple times to see if I can >>> >> > reproduce something strange, no luck on reproducing the issue. >>> >> > >>> >> > Attempted with clang-6 and clang-7 (travis error on clang-7), >>> >> > still no issues found. >>> >> > >>> >> > Building with Clang-7 and Shared libs (instead of default static) >>> >> > still no issues found. >>> >> > >>> >> > If somebody can reproduce please send an update to here and I'll >>> >> > attempt to replicate that setup. Right now I can't reproduce the issue. >>> >> >>> >> You have to be patient, but I caught it on my laptop: >>> >> >>> > >>> > Ok, and now with the logs: >>> > >>> > >>> > # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset >>> > -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1 >>> >>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f >>> > $log) >>> > EAL: Detected lcore 0 as core 0 on socket 0 >>> > EAL: Detected lcore 1 as core 1 on socket 0 >>> > EAL: Detected lcore 2 as core 2 on socket 0 >> >>> > EAL: probe driver: 8086:15d7 net_e1000_em >>> > EAL: Not managed by a supported kernel driver, skipped >>> > EAL: Module /sys/module/vfio not found! error 2 (No such file or directory) >>> > APP: HPET is not enabled, using TSC as default timer >>> > RTE>>service_autotest >>> > + ------------------------------------------------------- + >>> > + Test Suite : service core test suite >>> > + ------------------------------------------------------- + >>> > + TestCase [ 0] : unregister_all succeeded >>> > + TestCase [ 1] : service_name succeeded >>> > + TestCase [ 2] : service_get_by_name succeeded >>> > Service dummy_service Summary >>> > dummy_service: stats 1 calls 0 cycles 0 avg: 0 >>> > Service dummy_service Summary >>> > dummy_service: stats 0 calls 0 cycles 0 avg: 0 >>> > + TestCase [ 3] : service_dump succeeded >>> > + TestCase [ 4] : service_attr_get succeeded >>> > + TestCase [ 5] : service_lcore_attr_get succeeded >>> > + TestCase [ 6] : service_probe_capability succeeded >>> > + TestCase [ 7] : service_start_stop succeeded >>> > + TestCase [ 8] : service_lcore_add_del skipped >>> > + TestCase [ 9] : service_lcore_start_stop succeeded >>> > EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service >>> > core function call had no effect. >>> > + TestCase [10] : service_lcore_en_dis_able failed >>> > + TestCase [11] : service_mt_unsafe_poll skipped >>> > + TestCase [12] : service_mt_safe_poll skipped >>> > + TestCase [13] : service_app_lcore_mt_safe succeeded >>> > + TestCase [14] : service_app_lcore_mt_unsafe succeeded >>> > + TestCase [15] : service_may_be_active succeeded >>> > + ------------------------------------------------------- + >>> > + Test Suite Summary >>> > + Tests Total : 16 >>> > + Tests Skipped : 3 >>> > + Tests Executed : 16 >>> > + Tests Unsupported: 0 >>> > + Tests Passed : 12 >>> > + Tests Failed : 1 >>> > + ------------------------------------------------------- + >>> > Test Failed >>> > RTE>>EAL: request: mp_malloc_sync >>> > EAL: Heap on socket 0 was shrunk by 2MB >>> > >>> > real 2m42.884s >>> > user 5m1.902s >>> > sys 0m2.208s >>> >>> I can confirm - takes about 1m to fail. >> >> >> Hi Aaron and David, >> >> I've been attempting to reproduce this, still no errors here. >> >> Given the nature of service-cores, and the difficulty to reproduce >> here this feels like a race-condition - one that may not exist in all >> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here). >> >> I'm using Meson to build, so reproducing using this instead of the command >> as provided above. There should be no difference in reproducing due to this: > > The command runs far more iterations than meson does (I think). > > I still see it periodically occur in the travis environment. > > I did see at least one missing memory barrier (I believe). Please > review the following code change (and if you agree I can submit it > formally): > > ----- > --- a/lib/librte_eal/common/eal_common_launch.c > +++ b/lib/librte_eal/common/eal_common_launch.c > @@ -21,8 +21,10 @@ > int > rte_eal_wait_lcore(unsigned slave_id) > { > - if (lcore_config[slave_id].state == WAIT) > + if (lcore_config[slave_id].state == WAIT) { > + rte_rmb(); > return 0; > + } > > while (lcore_config[slave_id].state != WAIT && > lcore_config[slave_id].state != FINISHED) > ----- > > This is because in lib/librte_eal/linux/eal/eal_thread.c: > > ----- > /* when a service core returns, it should go directly to WAIT > * state, because the application will not lcore_wait() for it. > */ > if (lcore_config[lcore_id].core_role == ROLE_SERVICE) > lcore_config[lcore_id].state = WAIT; > else > lcore_config[lcore_id].state = FINISHED; > ----- > > NOTE that the service core skips the rte_eal_wait_lcore() code from > making the FINISHED->WAIT transition. So I think at least that read > barrier will be needed (maybe I miss the pairing, though?). > > Additionally, I'm wondering if there is an additional write or sync > barrier needed to ensure that some of the transitions are properly > recorded when using lcore as a service lcore function. The fact that > this only happens occasionally tells me that it's either a race (which > is possible... because the variable update in the test might not be > sync'd across cores or something), or some other missing > synchronization. > >> $ meson test service_autotest --repeat 50 >> >> 1/1 DPDK:fast-tests / service_autotest OK 3.86 s >> 1/1 DPDK:fast-tests / service_autotest OK 3.87 s >> ... >> 1/1 DPDK:fast-tests / service_autotest OK 3.84 s >> >> OK: 50 >> FAIL: 0 >> SKIP: 0 >> TIMEOUT: 0 >> >> I'll keep it running for a few hours but I have little faith if it only >> takes 1 minute on your machines... > > Please try the flat command. Not sure if you've had any time to look at this. I think there's a change we can make, but not sure about how it fits in the overall service lcore design. The proposal is to use a pthread_cond variable which blocks the thread requesting the service function to run. The service function merely sets the condition. The requesting thread does a timed wait (up to 5s?) and if the timeout is exceeded can throw an error. Otherwise, it will unblock and can assume that the test passes. WDYT? I think it works better than the racy code in the test case for now. >> Regards, -Harry