From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id B592AA2EEB for ; Mon, 7 Oct 2019 14:38:47 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id E50401C1BC; Mon, 7 Oct 2019 14:38:46 +0200 (CEST) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by dpdk.org (Postfix) with ESMTP id B43B41BF57 for ; Mon, 7 Oct 2019 14:38:45 +0200 (CEST) Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 0E93D2102; Mon, 7 Oct 2019 12:38:45 +0000 (UTC) Received: from dhcp-25.97.bos.redhat.com (unknown [10.18.25.121]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 2B747608C2; Mon, 7 Oct 2019 12:38:42 +0000 (UTC) From: Aaron Conole To: "Van Haaren\, Harry" Cc: David Marchand , "dev\@dpdk.org" References: Date: Mon, 07 Oct 2019 08:38:41 -0400 In-Reply-To: (Van Haaren's message of "Mon, 7 Oct 2019 09:50:46 +0000") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.6.2 (mx1.redhat.com [10.5.110.71]); Mon, 07 Oct 2019 12:38:45 +0000 (UTC) Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" "Van Haaren, Harry" writes: >> -----Original Message----- >> From: Aaron Conole [mailto:aconole@redhat.com] >> Sent: Wednesday, September 4, 2019 8:56 PM >> To: David Marchand >> Cc: Van Haaren, Harry ; dev@dpdk.org >> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest >> failing >> >> David Marchand writes: >> >> > On Wed, Sep 4, 2019 at 12:04 PM David Marchand >> > wrote: >> >> >> >> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry >> >> wrote: >> >> > >> >> > > -----Original Message----- >> >> > > From: Aaron Conole [mailto:aconole@redhat.com] >> >> > > Sent: Tuesday, September 3, 2019 3:46 PM >> >> > > To: Van Haaren, Harry >> >> > > Cc: dev@dpdk.org >> >> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing >> >> > > >> >> > > Hi Harry, >> >> > >> >> > Hey Aaron, >> >> > >> >> > > I noticed as part of series_6218 >> >> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the >> travis >> >> > > build had a single failure, in service_autotest but it doesn't seem >> >> > > related to the series at all. >> >> > > >> >> > > https://travis-ci.com/ovsrobot/dpdk/jobs/230358460 >> >> > > >> >> > > Not sure if there's some kind of debugging we can add or look at to >> help >> >> > > diagnose failures when they occur. Do you have time to have a look? >> >> > >> >> > Thanks for flagging this. >> >> > >> >> > I've just re-run the unit tests here multiple times to see if I can >> >> > reproduce something strange, no luck on reproducing the issue. >> >> > >> >> > Attempted with clang-6 and clang-7 (travis error on clang-7), >> >> > still no issues found. >> >> > >> >> > Building with Clang-7 and Shared libs (instead of default static) >> >> > still no issues found. >> >> > >> >> > If somebody can reproduce please send an update to here and I'll >> >> > attempt to replicate that setup. Right now I can't reproduce the issue. >> >> >> >> You have to be patient, but I caught it on my laptop: >> >> >> > >> > Ok, and now with the logs: >> > >> > >> > # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset >> > -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1 >> >>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f >> > $log) >> > EAL: Detected lcore 0 as core 0 on socket 0 >> > EAL: Detected lcore 1 as core 1 on socket 0 >> > EAL: Detected lcore 2 as core 2 on socket 0 > >> > EAL: probe driver: 8086:15d7 net_e1000_em >> > EAL: Not managed by a supported kernel driver, skipped >> > EAL: Module /sys/module/vfio not found! error 2 (No such file or directory) >> > APP: HPET is not enabled, using TSC as default timer >> > RTE>>service_autotest >> > + ------------------------------------------------------- + >> > + Test Suite : service core test suite >> > + ------------------------------------------------------- + >> > + TestCase [ 0] : unregister_all succeeded >> > + TestCase [ 1] : service_name succeeded >> > + TestCase [ 2] : service_get_by_name succeeded >> > Service dummy_service Summary >> > dummy_service: stats 1 calls 0 cycles 0 avg: 0 >> > Service dummy_service Summary >> > dummy_service: stats 0 calls 0 cycles 0 avg: 0 >> > + TestCase [ 3] : service_dump succeeded >> > + TestCase [ 4] : service_attr_get succeeded >> > + TestCase [ 5] : service_lcore_attr_get succeeded >> > + TestCase [ 6] : service_probe_capability succeeded >> > + TestCase [ 7] : service_start_stop succeeded >> > + TestCase [ 8] : service_lcore_add_del skipped >> > + TestCase [ 9] : service_lcore_start_stop succeeded >> > EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service >> > core function call had no effect. >> > + TestCase [10] : service_lcore_en_dis_able failed >> > + TestCase [11] : service_mt_unsafe_poll skipped >> > + TestCase [12] : service_mt_safe_poll skipped >> > + TestCase [13] : service_app_lcore_mt_safe succeeded >> > + TestCase [14] : service_app_lcore_mt_unsafe succeeded >> > + TestCase [15] : service_may_be_active succeeded >> > + ------------------------------------------------------- + >> > + Test Suite Summary >> > + Tests Total : 16 >> > + Tests Skipped : 3 >> > + Tests Executed : 16 >> > + Tests Unsupported: 0 >> > + Tests Passed : 12 >> > + Tests Failed : 1 >> > + ------------------------------------------------------- + >> > Test Failed >> > RTE>>EAL: request: mp_malloc_sync >> > EAL: Heap on socket 0 was shrunk by 2MB >> > >> > real 2m42.884s >> > user 5m1.902s >> > sys 0m2.208s >> >> I can confirm - takes about 1m to fail. > > > Hi Aaron and David, > > I've been attempting to reproduce this, still no errors here. > > Given the nature of service-cores, and the difficulty to reproduce > here this feels like a race-condition - one that may not exist in all > binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here). > > I'm using Meson to build, so reproducing using this instead of the command > as provided above. There should be no difference in reproducing due to this: The command runs far more iterations than meson does (I think). I still see it periodically occur in the travis environment. I did see at least one missing memory barrier (I believe). Please review the following code change (and if you agree I can submit it formally): ----- --- a/lib/librte_eal/common/eal_common_launch.c +++ b/lib/librte_eal/common/eal_common_launch.c @@ -21,8 +21,10 @@ int rte_eal_wait_lcore(unsigned slave_id) { - if (lcore_config[slave_id].state == WAIT) + if (lcore_config[slave_id].state == WAIT) { + rte_rmb(); return 0; + } while (lcore_config[slave_id].state != WAIT && lcore_config[slave_id].state != FINISHED) ----- This is because in lib/librte_eal/linux/eal/eal_thread.c: ----- /* when a service core returns, it should go directly to WAIT * state, because the application will not lcore_wait() for it. */ if (lcore_config[lcore_id].core_role == ROLE_SERVICE) lcore_config[lcore_id].state = WAIT; else lcore_config[lcore_id].state = FINISHED; ----- NOTE that the service core skips the rte_eal_wait_lcore() code from making the FINISHED->WAIT transition. So I think at least that read barrier will be needed (maybe I miss the pairing, though?). Additionally, I'm wondering if there is an additional write or sync barrier needed to ensure that some of the transitions are properly recorded when using lcore as a service lcore function. The fact that this only happens occasionally tells me that it's either a race (which is possible... because the variable update in the test might not be sync'd across cores or something), or some other missing synchronization. > $ meson test service_autotest --repeat 50 > > 1/1 DPDK:fast-tests / service_autotest OK 3.86 s > 1/1 DPDK:fast-tests / service_autotest OK 3.87 s > ... > 1/1 DPDK:fast-tests / service_autotest OK 3.84 s > > OK: 50 > FAIL: 0 > SKIP: 0 > TIMEOUT: 0 > > I'll keep it running for a few hours but I have little faith if it only > takes 1 minute on your machines... Please try the flat command. > Regards, -Harry