From: Thomas Monjalon <thomas@monjalon.net>
To: "Van Haaren, Harry" <harry.van.haaren@intel.com>
Cc: David Marchand <david.marchand@redhat.com>, dev@dpdk.org,
 "dpdklab@iol.unh.edu" <dpdklab@iol.unh.edu>, "ci@dpdk.org" <ci@dpdk.org>,
 "Honnappa.Nagarahalli@arm.com" <Honnappa.Nagarahalli@arm.com>,
 "mattias. ronnblom" <mattias.ronnblom@ericsson.com>,
 Morten =?ISO-8859-1?Q?Br=F8rup?= <mb@smartsharesystems.com>,
 Tyler Retzlaff <roretzla@linux.microsoft.com>,
 Aaron Conole <aconole@redhat.com>, bruce.richardson@intel.com
Subject: Re: [PATCH v3] test/service: fix spurious failures by extending
 timeout
Date: Thu, 23 Feb 2023 21:15:03 +0100
Message-ID: <4205390.Fh7cpCN91P@thomas>
In-Reply-To: <BN0PR11MB571285C339B02AD6D6EE71E7D7D79@BN0PR11MB5712.namprd11.prod.outlook.com>
References: <20221006081729.578475-1-harry.van.haaren@intel.com>
 <21760850.EfDdHjke4D@thomas>
 <BN0PR11MB571285C339B02AD6D6EE71E7D7D79@BN0PR11MB5712.namprd11.prod.outlook.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 7Bit
Content-Type: text/plain; charset="us-ascii"
List-Id: DPDK CI discussions <ci.dpdk.org>

03/02/2023 17:09, Van Haaren, Harry:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 03/02/2023 16:03, Van Haaren, Harry:
> > > From: Van Haaren, Harry
> > > > > The timeout approach just has no place in a functional test.
> > > > > Either this test is rewritten, or it must go to the performance tests
> > > > > list so that we stop getting false positives.
> > > > > Can you work on this?
> > > >
> > > > I'll investigate various approaches on Thursday and reply here with suggested
> > > > next steps.
> > >
> > > I've identified 3 checks that fail in CI (from the above log outputs); all 3 cases
> > > have different delays: 100 ms, 200 ms and 1000 ms.
> > > In the CI, the service-core just hasn't been scheduled (yet), which causes the
> > > "failure".
> > >
> > > Option 1)
> > > One option is to while(1) loop, waiting for the service-thread to be scheduled.
> > > This can be seen as "increasing the timeout", however in this case the failure
> > > would be reported not in the test code but by the meson test runner as a
> > > timeout (with a 10 sec default?).
> > > The benefit here is that massively increasing the window (from ~1 sec or less
> > > to 10 sec) will cover all/many of the CI timeouts.
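
A minimal sketch of that option, reusing the hypothetical
service_ran_at_least_once() helper from the sketch above: the test itself can
no longer time out, and a genuinely stuck service lcore is caught by the meson
runner's per-test timeout instead:

#include <rte_pause.h>  /* rte_pause() */

/* Option 1 sketch: no timeout in the test itself.  If the service lcore is
 * never scheduled, the meson test-runner timeout kills the test instead. */
static void
wait_for_service_to_run(void)
{
	while (!service_ran_at_least_once())
		rte_pause();  /* relax the CPU while spinning */
}
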
> > >
> > > Option 2)
> > > Move to perf-tests, and not run these in a noisy-CI environment where the
> > > results are not consistent enough to have value. This would mean that these
> > > tests are not run in CI. The 3 checks in question are below; they all *require*
> > > the service core to be scheduled:
> > > service_attr_get() -> requires the service core to run for service stats to increment
> > > service_lcore_attr_get() -> requires the service core to run for lcore stats to
> > > increment
> > > service_lcore_start_stop() -> requires the service to run to ensure the service
> > > function itself executes.
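
For context, the dependency in the first of those checks looks roughly like
this. It is only a sketch: RTE_SERVICE_ATTR_CALL_COUNT and
rte_service_attr_get() are from rte_service.h as I recall them, so the exact
prototype should be double-checked against the tree:

#include <rte_service.h>

/* The call count is incremented by the service lcore itself, so until the OS
 * actually schedules that lcore the counter stays at 0 and a "calls > 0"
 * check after a fixed delay fails spuriously. */
static int
check_service_was_called(uint32_t service_id)
{
	uint64_t calls = 0;  /* attr value type as per rte_service.h (to verify) */

	if (rte_service_attr_get(service_id, RTE_SERVICE_ATTR_CALL_COUNT,
			&calls) != 0)
		return -1;

	return calls > 0 ? 0 : -1;
}
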
> > >
> > > I don't see how we can "improve" option 2 to not require the service-thread
> > > to be scheduled by the OS...
> > > And the only way to make the OS schedule it in the CI more consistently is to
> > > give it more time?
> > 
> > We are talking about seconds.
> > There are setups where scheduling a thread is taking seconds?
> 
> Apparently so - otherwise these tests would always pass.
> 
> They *only* fail on random runs in CI, and reliably pass everywhere else. I've not had
> them fail locally, and that includes running in a loop for hours on a busy system...
> but not on a low-priority CI VM in a busy datacenter.
> 
> 
> [Bruce wrote in a separate mail]

Bruce was not Cc'ed in this reply.

> >>> For me, the question is - why hasn't the service-core been scheduled? Can
> >>> we use sched-yield or some other mechanism to force a wakeup of it?
> 
> I'm not aware of a way to make *a specific other pthread* wake up. We could sacrifice
> the current lcore that's waiting for the service-lcore, with a sched_yield() as you suggest.
> It would potentially "churn" the scheduler enough to give the service core some CPU?
> It's a guess/gamble in the end, kind of like the timeouts we have today...
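
A sketch of that gamble, again using the hypothetical
service_ran_at_least_once() helper: the waiting lcore gives up its timeslice
rather than burning it, which makes it more likely, but still not guaranteed,
that the kernel runs the service lcore soon:

#include <sched.h>  /* sched_yield() */

/* Yield the waiting lcore's timeslice in the hope that the scheduler picks
 * the service lcore next; better odds than a busy spin, but no guarantee. */
while (!service_ran_at_least_once())
	sched_yield();
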
> 
> > > Thoughts and input welcomed; I'm happy to make the code changes
> > > themselves, it's a small effort for both option 1 & 2.
> > 
> > For time-sensitive tests, yes they should be in perf tests category.
> > As David said earlier, no timeout approach in functional tests.
> 
> Ok, as before, option 1) is to while(1) loop and wait for "success". Then there's
> no timeout in the test code, but our meson test runner will time out/fail after ~10 sec IIRC.
> 
> Or we move the tests to perf-tests, as per Option 2), and these simply won't run in CI.
> 
> I'm OK with all 3 (including testing with sched_yield() for a month or two to see if that helps?)

Did you send a patch to go in one direction or the other?
If not, please move the test to the perf tests as suggested before.
We are still hitting these issues in the CI and it is *very* annoying.
It is consuming the time of a lot of people, on a lot of patches,
just to check that it is once again an issue with this test.

Please let's remove this test from the CI now.