From: Aaron Conole
To: Morten Brørup
Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry, David Marchand,
    dpdklab, Honnappa Nagarahalli, dev
Subject: Re: rte_service unit test failing randomly
Date: Thu, 06 Oct 2022 09:51:39 -0400
List-Id: DPDK CI discussions

Morten Brørup writes:

>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>>
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>> > 05/10/2022 22:33, Mattias Rönnblom:
>> >> On 2022-10-05 21:14, David Marchand wrote:
>> >>> Hello,
>> >>>
>> >>> The service_autotest unit test has been failing randomly.
>> >>> This is not something new.
>
> [...]
>
>> >>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>> >>> not stopped after 100ms
>> >>>
>> >>> Ideas?
>> >>>
>> >>>
>> >>> Thanks.
>> >>
>> >> Do you run the test suite in a controlled environment? I.e., one where
>> >> you can trust that the lcore threads aren't interrupted for long periods
>> >> of time.
>> >>
>> >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>> >> CPU with other threads.
>> >
>> > You mean the tests cannot be interrupted?
>>
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot, be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>>
>> > Then it looks very fragile.
>>
>> Tests like this are by their very nature racy. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>>
>> If you grep for "sleep" or "delay" in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but
>> they allow for longer interruptions.
>>
>> That said, 100 ms sounds very short. I don't see why this can't be a
>> lot longer.
>>
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests. If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racy tests for that kind of environment will make them have very long
>> run times.
>
> Forgive me if I am sidetracking a bit here... The issue discussed
> seems to be related to some threads waiting for other threads, and my
> question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I
> see what is being done to ensure that the EAL worker threads are fully
> isolated, and never interrupted by the O/S scheduler or similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a
> NIC is configured with 4096 Rx descriptors, packet loss will occur
> after ca. 70 us (microseconds!) if not servicing the ingress queue
> when receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker
> threads [1].
> What should I do if I want to run that code in the
> automated test environment? It would be for informational purposes
> only, i.e. I would manually look at the test output to see the result.

One hacky way is to post a PATCH that introduces your test case, state
clearly that it should never be merged, and then look at the logs.

> I would write a test application that simply starts the O/S noise
> monitor thread as an isolated EAL worker thread, the main thread would
> then wait for 10 minutes (or some other duration), dump the result to
> the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
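
For illustration only, the test application described in the quoted text could look roughly like the sketch below. It is not the code posted in [1], and it uses plain POSIX threads rather than an EAL worker lcore, so it does no core pinning or isolation by itself. All names (noise_stats, noise_monitor, run_noise_test, rx_ring_budget_us, NOISE_DEMO_MAIN) are hypothetical. The budget helper assumes minimum-size 64-byte frames plus 20 bytes of preamble and inter-frame gap on the wire, which is what gives 59.52 M pkt/s at 40 Gbit/s.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <inttypes.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical names throughout; this is not the code posted in [1]. */
struct noise_stats {
    atomic_bool stop;     /* set by the main thread to end the measurement */
    uint64_t max_gap_ns;  /* longest gap seen between two clock reads */
    uint64_t iterations;  /* number of clock reads performed */
};

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-poll the clock; a large gap means this thread was not running. */
static void *noise_monitor(void *arg)
{
    struct noise_stats *s = arg;
    uint64_t prev = now_ns();

    while (!atomic_load_explicit(&s->stop, memory_order_relaxed)) {
        uint64_t cur = now_ns();

        if (cur - prev > s->max_gap_ns)
            s->max_gap_ns = cur - prev;
        prev = cur;
        s->iterations++;
    }
    return NULL;
}

/* Microseconds until an Rx ring of `descs` entries fills at line rate,
 * assuming 64-byte frames + 20 bytes preamble/inter-frame gap. */
static double rx_ring_budget_us(double link_bps, unsigned descs)
{
    const double wire_bits = (64 + 20) * 8;

    assert(link_bps > 0.0);
    return descs / (link_bps / wire_bits) * 1e6;
}

/* Run the monitor for `duration_ms`; returns 0 on success.
 * The plain fields are read only after pthread_join(), which orders them.
 * An interrupted nanosleep() is not retried in this sketch. */
static int run_noise_test(unsigned duration_ms, struct noise_stats *out)
{
    pthread_t tid;
    struct timespec d = {
        .tv_sec = duration_ms / 1000,
        .tv_nsec = (long)(duration_ms % 1000) * 1000000L,
    };

    atomic_init(&out->stop, false);
    out->max_gap_ns = 0;
    out->iterations = 0;
    if (pthread_create(&tid, NULL, noise_monitor, out) != 0)
        return -1;
    nanosleep(&d, NULL);
    atomic_store(&out->stop, true);
    pthread_join(tid, NULL);
    return 0;
}

#ifdef NOISE_DEMO_MAIN
int main(void)
{
    struct noise_stats s;

    if (run_noise_test(10u * 60u * 1000u, &s) != 0) /* 10 minutes */
        return 1;
    printf("max stall: %" PRIu64 " ns over %" PRIu64 " reads; "
           "40G/4096-descriptor budget: %.1f us\n",
           s.max_gap_ns, s.iterations, rx_ring_budget_us(40e9, 4096));
    return 0;
}
#endif
```

Building with -DNOISE_DEMO_MAIN -pthread gives a standalone program that runs for 10 minutes and prints the longest observed stall next to the Rx ring budget, which works out to about 68.8 us for 4096 descriptors at 40 Gbit/s. In a real DPDK test the monitor would presumably be launched on a worker lcore instead, so that the measurement reflects how well the lcores are isolated.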