From: "Van Haaren, Harry"
To: "Van Haaren, Harry", Aaron Conole
Cc: David Marchand, "dev@dpdk.org"
Date: Tue, 15 Oct 2019 16:42:54 +0000
Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Van Haaren, Harry
> Sent: Monday, October 14, 2019 5:49 PM
> To: Aaron Conole
> Cc: David Marchand; dev@dpdk.org
> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest
> failing
>
> > -----Original Message-----
> > From: Aaron Conole [mailto:aconole@redhat.com]
> > Sent: Monday, October 14, 2019 3:54 PM
> > To: Van Haaren, Harry
> > Cc: David Marchand; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> > service_autotest failing
> >
> > Aaron Conole writes:
> >
> > > "Van Haaren, Harry" writes:
> > >
> > >>> -----Original Message-----
> > >>> From: Aaron Conole [mailto:aconole@redhat.com]
> > >>> Sent: Wednesday, September 4, 2019 8:56 PM
> > >>> To: David Marchand
> > >>> Cc: Van Haaren, Harry; dev@dpdk.org
> > >>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> > >>> service_autotest failing
> > >
> > >>> > real    2m42.884s
> > >>> > user    5m1.902s
> > >>> > sys     0m2.208s
> > >>>
> > >>> I can confirm - takes about 1m to fail.
> > >>
> > >>
> > >> Hi Aaron and David,
> > >>
> > >> I've been attempting to reproduce this, still no errors here.
> > >>
> > >> Given the nature of service-cores, and the difficulty to reproduce it
> > >> here, this feels like a race-condition - one that may not exist in all
> > >> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
> > >>
> > >> I'm using Meson to build, so I'm reproducing using this instead of the
> > >> command as provided above. There should be no difference in reproducing
> > >> due to this:
> > >
> > > The command runs far more iterations than meson does (I think).
> > >
> > > I still see it periodically occur in the travis environment.
> > >
> > > I did see at least one missing memory barrier (I believe). Please
> > > review the following code change (and if you agree I can submit it
> > > formally):
> > >
> > > -----
> > > --- a/lib/librte_eal/common/eal_common_launch.c
> > > +++ b/lib/librte_eal/common/eal_common_launch.c
> > > @@ -21,8 +21,10 @@
> > >  int
> > >  rte_eal_wait_lcore(unsigned slave_id)
> > >  {
> > > -        if (lcore_config[slave_id].state == WAIT)
> > > +        if (lcore_config[slave_id].state == WAIT) {
> > > +                rte_rmb();
> > >                  return 0;
> > > +        }
> > >
> > >          while (lcore_config[slave_id].state != WAIT &&
> > >                 lcore_config[slave_id].state != FINISHED)
> > > -----
> > >
> > > This is because in lib/librte_eal/linux/eal/eal_thread.c:
> > >
> > > -----
> > >         /* when a service core returns, it should go directly to WAIT
> > >          * state, because the application will not lcore_wait() for it.
> > >          */
> > >         if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> > >                 lcore_config[lcore_id].state = WAIT;
> > >         else
> > >                 lcore_config[lcore_id].state = FINISHED;
> > > -----
> > >
> > > NOTE that the service core skips the rte_eal_wait_lcore() code from
> > > making the FINISHED->WAIT transition. So I think at least that read
> > > barrier will be needed (maybe I miss the pairing, though?).
> > >
> > > Additionally, I'm wondering if there is an additional write or sync
> > > barrier needed to ensure that some of the transitions are properly
> > > recorded when using lcore as a service lcore function. The fact that
> > > this only happens occasionally tells me that it's either a race (which
> > > is possible... because the variable update in the test might not be
> > > sync'd across cores or something), or some other missing
> > > synchronization.
> > >
> > >> $ meson test service_autotest --repeat 50
> > >>
> > >> 1/1 DPDK:fast-tests / service_autotest  OK  3.86 s
> > >> 1/1 DPDK:fast-tests / service_autotest  OK  3.87 s
> > >> ...
> > >> 1/1 DPDK:fast-tests / service_autotest  OK  3.84 s
> > >>
> > >> OK:      50
> > >> FAIL:     0
> > >> SKIP:     0
> > >> TIMEOUT:  0
> > >>
> > >> I'll keep it running for a few hours but I have little faith if it only
> > >> takes 1 minute on your machines...
> > >
> > > Please try the flat command.
> >
> > Not sure if you've had any time to look at this.
>
> Apologies for the delay in response - I've run the existing tests a few
> thousand times during the week, with one reproduction. That's not enough for
> confidence in a debug/fix for me.
>
>
> > I think there's a change we can make, but not sure about how it fits in
> > the overall service lcore design.
>
> This suggestion is only changing the test code, correct?
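[On the pairing question raised above: a minimal sketch of how the store side
(the worker lcore going back to WAIT) and the load side (rte_eal_wait_lcore())
might pair an rte_wmb() with that rte_rmb(). This is illustrative only - the
helper names are made up and it is not the in-tree eal_thread.c or
eal_common_launch.c code.]

-----
/* Illustrative barrier-pairing sketch, not the actual EAL code. */
#include <rte_atomic.h>
#include <rte_lcore.h>

/* Worker/service lcore side: order any earlier stores before the state
 * transition that the waiter polls on. */
static void
worker_mark_done(unsigned int lcore_id)
{
        /* ... results written to shared memory here ... */
        rte_wmb();                           /* publish prior stores ...     */
        lcore_config[lcore_id].state = WAIT; /* ... before the state change  */
}

/* Waiting side: once WAIT is observed, the read barrier pairs with the
 * rte_wmb() above, so data read after this point is up to date. */
static int
waiter_poll(unsigned int slave_id)
{
        if (lcore_config[slave_id].state == WAIT) {
                rte_rmb();
                return 0;
        }
        return -1; /* not finished yet */
}
-----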
>
>
> > The proposal is to use a pthread_cond variable which blocks the thread
> > requesting the service function to run. The service function merely
> > sets the condition. The requesting thread does a timed wait (up to 5s?)
> > and if the timeout is exceeded can throw an error. Otherwise, it will
> > unblock and can assume that the test passes. WDYT? I think it works
> > better than the racy code in the test case for now.
>
> The idea/concept is right above, but I think that's what the test is
> approximating anyway? The main thread does an "mp_wait_lcore()" until
> the service core has returned, essentially a blocking call.
>
> The test fails if the flag is not == 1 (as that indicates failure in
> launching an application function on a previously-used-as-service-core
> lcore thread).
>
> I think your RMB suggestion is likely to be correct, but I'd like to dig
> into it a bit more.
>
> Thanks for the ping on this thread.

Good news - adding a rte_delay_ms() to the start of service_remote_launch_func()
makes this 100% (so far) reproducible here. So yes it's a race condition, and I
think I have a handle on why/what - it's an (lcore.state == WAIT) race.

To be continued...
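[For reference, a minimal sketch of the instrumentation described above - a
delay at the start of the launched function so the (lcore.state == WAIT) window
is easy to hit. The flag name and function body are assumptions for
illustration; the real function in app/test/test_service_cores.c may differ.]

-----
#include <rte_common.h>
#include <rte_cycles.h>   /* rte_delay_ms() */

static int service_remote_launch_flag; /* assumed test flag */

static int
service_remote_launch_func(void *arg __rte_unused)
{
        /* Widen the race window before setting the flag, so the main lcore's
         * state check runs first and the failure reproduces reliably. */
        rte_delay_ms(100);

        service_remote_launch_flag = 1;
        return 0;
}
-----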
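[And on the condition-variable proposal earlier in the thread, a rough sketch of
how a timed wait could replace polling a plain flag - plain pthreads, a 5 second
timeout as suggested, and illustrative names; this is not the actual test code.]

-----
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t launch_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t launch_cond = PTHREAD_COND_INITIALIZER;
static int launch_done;

/* Runs on the remote lcore: signal that the launched function executed. */
static int
remote_launch_func(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&launch_mtx);
        launch_done = 1;
        pthread_cond_signal(&launch_cond);
        pthread_mutex_unlock(&launch_mtx);
        return 0;
}

/* Runs on the main lcore: block for up to 5 seconds waiting for the signal,
 * returning -1 on timeout (i.e. the service/launch never ran). */
static int
wait_for_launch(void)
{
        struct timespec ts;
        int ret = 0;

        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += 5;

        pthread_mutex_lock(&launch_mtx);
        while (!launch_done && ret == 0)
                ret = pthread_cond_timedwait(&launch_cond, &launch_mtx, &ts);
        pthread_mutex_unlock(&launch_mtx);

        return (ret == ETIMEDOUT) ? -1 : 0;
}
-----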