From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id 61F1AA0540; Mon, 20 Jul 2020 14:47:32 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id DFEC329CB; Mon, 20 Jul 2020 14:47:30 +0200 (CEST) Received: from mailout2.w1.samsung.com (mailout2.w1.samsung.com [210.118.77.12]) by dpdk.org (Postfix) with ESMTP id 49B0B1DBB for ; Mon, 20 Jul 2020 14:47:29 +0200 (CEST) Received: from eucas1p2.samsung.com (unknown [182.198.249.207]) by mailout2.w1.samsung.com (KnoxPortal) with ESMTP id 20200720124728euoutp02ef7dcaca6ea8da886e31512a9f1cd35c~jdr1Hr9RF1063010630euoutp02e for ; Mon, 20 Jul 2020 12:47:28 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 mailout2.w1.samsung.com 20200720124728euoutp02ef7dcaca6ea8da886e31512a9f1cd35c~jdr1Hr9RF1063010630euoutp02e DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=samsung.com; s=mail20170921; t=1595249248; bh=C9DQho/I/kQlo72vD1BJrZ3R9z6TPytvNk8J1MXX3uU=; h=Subject:To:Cc:From:Date:In-Reply-To:References:From; b=XtrW1XUj4GGm1cTlfeoZnpN8CAgidIZW7cUXrhrivVTkirZHnLdT4YgdLvr8/1tER Qrc6XbZt/j74kqVdJEylw/WVgnPcDxzgn8IwwccHUVlRxaxbz4ejgE0g2Peufd4fxl 1E/q7dw4zmsR36H/disl5RWGdl+qyZIoD7mvqcvI= Received: from eusmges3new.samsung.com (unknown [203.254.199.245]) by eucas1p1.samsung.com (KnoxPortal) with ESMTP id 20200720124728eucas1p1b1c5378a6c2528b8304a7853896ed77b~jdr1BjxC_1156111561eucas1p1A; Mon, 20 Jul 2020 12:47:28 +0000 (GMT) Received: from eucas1p1.samsung.com ( [182.198.249.206]) by eusmges3new.samsung.com (EUCPMTA) with SMTP id 43.DF.06318.062951F5; Mon, 20 Jul 2020 13:47:28 +0100 (BST) Received: from eusmtrp2.samsung.com (unknown [182.198.249.139]) by eucas1p1.samsung.com (KnoxPortal) with ESMTPA id 20200720124727eucas1p12fd3836b1a8b01308675b00fdb8369a2~jdr0sCKOz0768807688eucas1p1N; Mon, 20 Jul 2020 12:47:27 +0000 (GMT) Received: from eusmgms1.samsung.com (unknown [182.198.249.179]) by eusmtrp2.samsung.com (KnoxPortal) with ESMTP id 20200720124727eusmtrp2f660b1fe52611d790d1d26e3a68c7224~jdr0rUfFG2332323323eusmtrp2e; Mon, 20 Jul 2020 12:47:27 +0000 (GMT) X-AuditID: cbfec7f5-38bff700000018ae-e7-5f1592604859 Received: from eusmtip1.samsung.com ( [203.254.199.221]) by eusmgms1.samsung.com (EUCPMTA) with SMTP id 1F.EE.06314.F52951F5; Mon, 20 Jul 2020 13:47:27 +0100 (BST) Received: from [106.210.88.70] (unknown [106.210.88.70]) by eusmtip1.samsung.com (KnoxPortal) with ESMTPA id 20200720124726eusmtip16ad2d5f4592b483f0b7e4322f31da4c3~jdrzsXgTJ2934629346eusmtip1V; Mon, 20 Jul 2020 12:47:26 +0000 (GMT) To: "Van Haaren, Harry" , Phil Yang , Aaron Conole Cc: David Marchand , Igor Romanov , Honnappa Nagarahalli , dev , "Yigit, Ferruh" , nd From: Lukasz Wojciechowski Message-ID: <3bc99c95-1736-635e-e8af-c9ae47da6d9c@partner.samsung.com> Date: Mon, 20 Jul 2020 14:47:25 +0200 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: Content-Transfer-Encoding: 8bit Content-Language: en-US X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFtrIKsWRmVeSWpSXmKPExsWy7djPc7oJk0TjDW61cFj8evOA3WL7ii42 i3eftjNZ3Nl7mt2isf8bi8XMpy3sFocmZ1ucWd7DbLHmfTezA6fHmnlrGD1+LVjK6rF4z0sm j+tfX7F6vN93lS2ANYrLJiU1J7MstUjfLoErY3dTO1vBFtuKKccSGxgbjLoYOTkkBEwkrm/6 zdLFyMUhJLCCUeL5gwmMEM4XRomV3/dBOZ8ZJd603WeBaTm57TE7iC0ksJxR4sR3c4iit4wS f5e0ghUJC1hJzP32EqxIRKBCYv/GY6wgNrPAI0aJ7kVgcTYBW4kjM7+CxXkF3CTO9e5mA7FZ BFQl7n28yghiiwrESax/uZ0JokZQ4uTMJ0Dz2Tk4BWIl9olBTJSXaN46mxnCFpe49WQ+E8g5 EgLH2CWWL3zBCHGzi8SX01uhbGGJV8e3sEPYMhL/d8I0bGOUuPr7JyOEs59R4nrvCqgqa4nD /34DHccBtEJTYv0ufRBTQsBR4vzvWgiTT+LGW0GIG/gkJm2bzgwR5pXoaBOCmKEn8bRnKiPM 1j9rn7BMYFSaheSvWUi+mYXkm1kIaxcwsqxiFE8tLc5NTy02zkst1ytOzC0uzUvXS87P3cQI TEin/x3/uoNx35+kQ4wCHIxKPLwzOkXjhVgTy4orcw8xSnAwK4nwOp09HSfEm5JYWZValB9f VJqTWnyIUZqDRUmc13jRy1ghgfTEktTs1NSC1CKYLBMHp1QDI3dj++rC3Fsa1ZL8VqX37/be cpGZumBTboeQqUvLplNbk269+bL32Zo30/97ZfwxfPilvcd916zaz1Ml7glytySfSL/4vPT1 jF7hszdZ+5+9+NP3s26zk+QXe8PlM+KOHbGceHf5mRSjGaXr9/QtNJjMK/nN6MQx8zfTy5Pi pz19d/Q7U1scW7cSS3FGoqEWc1FxIgDIjAnvRAMAAA== X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFjrKIsWRmVeSWpSXmKPExsVy+t/xu7rxk0TjDY7MNLD49eYBu8X2FV1s Fu8+bWeyuLP3NLtFY/83FouZT1vYLQ5NzrY4s7yH2WLN+25mB06PNfPWMHr8WrCU1WPxnpdM Hte/vmL1eL/vKlsAa5SeTVF+aUmqQkZ+cYmtUrShhZGeoaWFnpGJpZ6hsXmslZGpkr6dTUpq TmZZapG+XYJexu6mdraCLbYVU44lNjA2GHUxcnJICJhInNz2mL2LkYtDSGApo8TB91NYuhg5 gBIyEh8uCUDUCEv8udbFBlHzmlHi0YnJ7CAJYQEribnfXoLZIgIVEg+/zWMGKWIWeMQoce3p LiaIjm8sEmfPdDGCVLEJ2EocmfmVFcTmFXCTONe7mw3EZhFQlbj38SpYjahAnMTyLfPZIWoE JU7OfAJ0ETsHp0CsxD4xkCizgJnEvM0PmSFseYnmrbOhbHGJW0/mM01gFJqFpHkWkpZZSFpm IWlZwMiyilEktbQ4Nz232FCvODG3uDQvXS85P3cTIzAKtx37uXkH46WNwYcYBTgYlXh4P3SL xguxJpYVV+YeYpTgYFYS4XU6ezpOiDclsbIqtSg/vqg0J7X4EKMp0GsTmaVEk/OBCSKvJN7Q 1NDcwtLQ3Njc2MxCSZy3Q+BgjJBAemJJanZqakFqEUwfEwenVAOjnvT72+tuXMvZqVjBozxZ ZZ8k66u53El1M6esbp+f2POJNWjprp0THjxJkmE4eOb5m03Kv18ci/r3KdJNO/ncnhXbld1k bu/vUWS9f8q+IG3Oy+9s/trGNjfeqbBW8ZZN7TEJ2fXyid3EmX3p/tfn/LpgVmxbcStB8Rvn wUt5Gb4zAk5q5XkkKrEUZyQaajEXFScCAB+ye8jYAgAA X-CMS-MailID: 20200720124727eucas1p12fd3836b1a8b01308675b00fdb8369a2 X-Msg-Generator: CA Content-Type: text/plain; charset="utf-8" X-RootMTR: 20200717151927eucas1p200a2d921ccf1fd5ab1d62c0f964b7594 X-EPHeader: CA CMS-TYPE: 201P X-CMS-RootMailID: 20200717151927eucas1p200a2d921ccf1fd5ab1d62c0f964b7594 References: <6d0bf076-c9bd-564c-20f6-ba8a49bc2389@intel.com> <88ee4323-1336-4bc7-414d-3dd7f19d684f@partner.samsung.com> Subject: Re: [dpdk-dev] Random failure in service_autotest X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" W dniu 20.07.2020 o 14:09, Van Haaren, Harry pisze: >> -----Original Message----- >> From: Phil Yang >> Sent: Saturday, July 18, 2020 9:34 AM >> To: Aaron Conole ; Lukasz Wojciechowski >> >> Cc: David Marchand ; Van Haaren, Harry >> ; Igor Romanov ; >> Honnappa Nagarahalli ; dev >> ; Yigit, Ferruh ; nd ; nd >> >> Subject: RE: [dpdk-dev] Random failure in service_autotest >> >>> -----Original Message----- >>> From: Aaron Conole >>> Sent: Saturday, July 18, 2020 6:39 AM >>> To: Lukasz Wojciechowski >>> Cc: David Marchand ; Van Haaren Harry >>> ; Igor Romanov >>> ; Honnappa Nagarahalli >>> ; Phil Yang ; dev >>> ; Ferruh Yigit >>> Subject: Re: [dpdk-dev] Random failure in service_autotest >>> >>> Lukasz Wojciechowski writes: >>> >>>> W dniu 17.07.2020 o 17:19, David Marchand pisze: >>>>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand >>>>> wrote: >>>>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit >>> wrote: >>>>>>> On 7/15/2020 11:14 AM, David Marchand wrote: >>>>>>>> Hello Harry and guys who touched the service code recently :-) >>>>>>>> >>>>>>>> I spotted a failure for the service UT in Travis: >>>>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697 >>>>>>>> >>>>>>>> I found only a single instance of this failure and tried to reproduce >>>>>>>> it with my usual "brute" active loop with no success so far. >>>>>>> +1, I didn't able to reproduce it in my environment but observed it in >>> the >>>>>>> Travis CI. >>>>>>> >>>>>>>> Any chance it could be due to recent changes? >>>>>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc- >>> 0cc47a31ce4e- >>> 231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcom >>> mit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe >>>>>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c- >>> 0cc47a31ce4e- >>> d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcomm >>> it%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22 >>>>>> I can see more occurrences of the issue in the CI. >>>>>> I just applied the patch changing the log level for test assert, in >>>>>> the hope it will help. >>>>> And... we just got one with logs: >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948 >>>>> >>>>> EAL: Test assert service_lcore_attr_get line 396 failed: >>>>> lcore_attr_get() didn't get correct loop count (zero) >>>>> >>>>> It looks like a race between the service core still running and the >>>>> core resetting the loops attr. >>>>> >>>> Yes, it seems to be just lack of patience of the test. It should wait a >>>> bit for lcore to stop before resetting attrs. >>>> Something like this should help: >>>> @@ -384,6 +384,9 @@ service_lcore_attr_get(void) >>>> >>>>         rte_service_lcore_stop(slcore_id); >>>> >>>> +       /* wait for the service lcore to stop */ >>>> +       rte_delay_ms(200); >>>> + >>>>         TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id), >>>>                           "Valid lcore_attr_reset_all() didn't return >>>> success"); >>> Would an rte_eal_wait_lcore make sense? Overall, I really dislike >>> sleeps because they can hide racy synchronization points. >> >> The rte_service_lcore_stop() operation changes the status to RUNSTATE_STOPED. >> However, it will not terminate the service_run() procedure (there is a spinlock in >> service_run() MT_UNSAFE path). >> So the 'cs->loop' might increase after calling rte_service_lcore_attr_reset_all(), >> which leads to this failure. >> I think if we move the loop counter update operation before the service_run() >> procedure, it can avoid this conflict. >> >> I cannot reproduce this issue on my testbed, so not sure something like this can >> help or not. >> Please check the code below. >> >> diff --git a/lib/librte_eal/common/rte_service.c >> b/lib/librte_eal/common/rte_service.c >> index 6a0e0ff..7b703dd 100644 >> --- a/lib/librte_eal/common/rte_service.c >> +++ b/lib/librte_eal/common/rte_service.c >> @@ -464,6 +464,7 @@ service_runner_func(void *arg) >> while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) == >> RUNSTATE_RUNNING) { >> const uint64_t service_mask = cs->service_mask; >> + cs->loops++; >> >> for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) { >> if (!service_valid(i)) >> @@ -471,8 +472,6 @@ service_runner_func(void *arg) >> /* return value ignored as no change to code flow */ >> service_run(i, cs, service_mask, service_get(i), 1); >> } >> - >> - cs->loops++; >> } >> >> return 0; > Changing the implementation loop counting is one option - but changing the > implementation as a workaround for a unit test seems like the wrong way around. I agree. We should fix the test not the service implementation. Of course it would be nice to do so without inserting sleeps as it's a workaround for true synchronization. > > Honnappa's suggestion in the other reply seems to not work here, as the "service_may_be_active" > won't get updated if the service core is stopped after running its final iteration..? > > I'd prefer to see the implementation actually improve, or workaround in the test code itself. > For improving the implementation, I can suggest this patch: https://protect2.fireeye.com/url?k=8ee70138-d3351a60-8ee68a77-0cc47a31c8b4-f3f6ba587be54162&q=1&u=http%3A%2F%2Fpatches.dpdk.org%2Fpatch%2F74493%2F > > > That leaves us with 3 options; > 1) Add a sleep, allowing the service core to finish its delay() as part of the service its running, and give it some grace time to finish I'm not a fan of sleeps, but the test code is already full of them ant this would be the easiest fix. > 2) Relax the check in the unit test, to allow for value <= (expected + 1) which would ignore the last cs->loops++; call. > (Another way to think about option 2, is to allow for the non-determinism of the threading into the test). I don't like this one. > 3) Use the linked patch to implement/add/fix a wait time in the stop() API The patch adds new functionality to the public API function. I don't know if's a good step. And also I don't like the solution. I think that defining a 1000*1ms = 1s time for service to finish is also bad as we define some static number in the code. My proposal is to keep the thread_active flag and add a new function for checking it. This way the final user (in our case test app) would be able to do checking that flag in a loop by itself. The final user who delegated service to be run on lcore knows what type of service it is and how long should it wait for the service to stop. Or we can provide a synchronous function using conditionals to know if the service has stopped, but that would also require to change the behavior of API. > > Strong options either way? > > Regards, -Harry -- Lukasz Wojciechowski Principal Software Engineer Samsung R&D Institute Poland Samsung Electronics Office +48 22 377 88 25 l.wojciechow@partner.samsung.com