From: David Marchand
Date: Thu, 6 Oct 2022 09:04:30 +0200
Subject: Re: rte_service unit test failing randomly
To: Morten Brørup
Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry, dpdklab, ci@dpdk.org, Honnappa Nagarahalli, Aaron Conole, dev
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87399@smartserver.smartshare.dk>
List-Id: DPDK CI discussions

On Thu, Oct 6, 2022 at 8:53 AM Morten Brørup wrote:
>
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Wednesday, 5 October 2022 23.34
> >
> > On 2022-10-05 22:52, Thomas Monjalon wrote:
> > > 05/10/2022 22:33, Mattias Rönnblom:
> > >> On 2022-10-05 21:14, David Marchand wrote:
> > >>> Hello,
> > >>>
> > >>> The service_autotest unit test has been failing randomly.
> > >>> This is not something new.
> > [...]
> > >>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
> > >>> not stopped after 100ms
> > >>>
> > >>> Ideas?
> > >>>
> > >>>
> > >>> Thanks.
> > >>
> > >> Do you run the test suite in a controlled environment? I.e., one where
> > >> you can trust that the lcore threads aren't interrupted for long periods
> > >> of time.
> > >>
> > >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
> > >> CPU with other threads.
> > >
> > > You mean the tests cannot be interrupted?
> >
> > I just took a very quick look, but it seems like the main thread can,
> > but the worker lcore thread cannot be interrupted for anything close to
> > 100 ms, or you risk a test failure.
> >
> > > Then it looks very fragile.
> >
> > Tests like this are by their very nature racy. If a test thread sends a
> > request to another thread, there is no way for it to decide when a
> > non-response should result in a test failure, unless the scheduling
> > latency of the receiving thread has an upper bound.
> >
> > If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
> > a lot of matches. I bet there are more like the service core one, but they
> > allow for longer interruptions.
> >
> > That said, 100 ms sounds very short. I don't see why this can't be a
> > lot longer.
> >
> > ...and that said, I would argue you still need a reasonably controlled
> > environment for the autotests. If you have a server that is arbitrarily
> > overloaded, maybe also with high memory pressure (and associated
> > instruction page faults and god-knows-what), the real-world worst-case
> > interruptions could be very long indeed. Seconds.
> > Designing inherently racy tests for that kind of environment will make
> > them have very long run times.
>
> Forgive me, if I am sidetracking a bit here... The issue discussed seems to be related to some threads waiting for other threads, and my question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I see what is being done to ensure that the EAL worker threads are fully isolated, and never interrupted by the O/S scheduler or similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a NIC is configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us (microseconds!) if not servicing the ingress queue when receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker threads [1]. What should I do if I want to run that code in the automated test environment? It would be for informational purposes only, i.e. I would manually look at the test output to see the result.
>
> I would write a test application that simply starts the O/S noise monitor thread as an isolated EAL worker thread, the main thread would then wait for 10 minutes (or some other duration), dump the result to the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/

Maybe we could do some hack, like finding a test in the current CI
that matches the test requirement: number of cores, ports, setup
params etc. (retrieving the output could be another challenge).

But if you think this is something we should have in the long run, I'd
suggest writing a new DTS test.

-- 
David Marchand
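
The 59.52 M pkt/s and ca. 70 us figures quoted above can be reproduced with a quick calculation. This is only a sketch under assumptions the mail does not spell out: minimum-size 64-byte Ethernet frames, plus 8 bytes of preamble/SFD and a 12-byte inter-frame gap per packet on the wire. The function names are made up for illustration.

```python
# Assumptions (not stated in the mail): minimum-size 64-byte frames,
# 8 bytes preamble/SFD and 12 bytes inter-frame gap per packet on the wire.
LINK_RATE_BPS = 40e9
FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 8 + 12
RX_DESCRIPTORS = 4096


def max_packet_rate(link_rate_bps, frame_bytes=FRAME_BYTES,
                    overhead_bytes=WIRE_OVERHEAD_BYTES):
    """Packets per second when the link is saturated with equal-size frames."""
    bits_per_packet = (frame_bytes + overhead_bytes) * 8
    return link_rate_bps / bits_per_packet


def ring_fill_time_s(descriptors, packet_rate):
    """Seconds until an unserviced Rx ring of the given size is full."""
    return descriptors / packet_rate


rate = max_packet_rate(LINK_RATE_BPS)
fill_time = ring_fill_time_s(RX_DESCRIPTORS, rate)

print(f"max packet rate: {rate / 1e6:.2f} Mpkt/s")                            # 59.52 Mpkt/s
print(f"time to fill {RX_DESCRIPTORS} descriptors: {fill_time * 1e6:.1f} us")  # 68.8 us
```

Under these assumptions, a worker thread that is descheduled for longer than roughly 69 us at line rate would start dropping packets, which is about three orders of magnitude shorter than the 100 ms timeout discussed for the service test.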