From: David Marchand
Date: Thu, 6 Oct 2022 09:04:30 +0200
Subject: Re: rte_service unit test failing randomly
To: Morten Brørup
Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry, dpdklab, ci@dpdk.org, Honnappa Nagarahalli, Aaron Conole, dev

On Thu, Oct 6, 2022 at 8:53 AM Morten Brørup wrote:
>
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Wednesday, 5 October 2022 23.34
> >
> > On 2022-10-05 22:52, Thomas Monjalon wrote:
> > > 05/10/2022 22:33, Mattias Rönnblom:
> > >> On 2022-10-05 21:14, David Marchand wrote:
> > >>> Hello,
> > >>>
> > >>> The service_autotest unit test has been failing randomly.
> > >>> This is not something new.
> > [...]
> > >>> EAL: Test assert service_may_be_active line 960 failed: Error: Service not stopped after 100ms
> > >>>
> > >>> Ideas?
> > >>>
> > >>>
> > >>> Thanks.
> > >>
> > >> Do you run the test suite in a controlled environment? I.e., one
> > >> where you can trust that the lcore threads aren't interrupted for
> > >> long periods of time.
> > >>
> > >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes
> > >> for the CPU with other threads.
> > >
> > > You mean the tests cannot be interrupted?
> >
> > I just took a very quick look, but it seems like the main thread can,
> > but the worker lcore thread cannot be interrupted for anything close
> > to 100 ms, or you risk a test failure.
> >
> > > Then it looks very fragile.
> >
> > Tests like this are by their very nature racy. If a test thread sends
> > a request to another thread, there is no way for it to decide when a
> > non-response should result in a test failure, unless the scheduling
> > latency of the receiving thread has an upper bound.
> >
> > If you grep for "sleep", or "delay", in app/test/test_*.c, you will
> > get a lot of matches. I bet there are more like the service core one,
> > but they allow for longer interruptions.
> >
> > That said, 100 ms sounds very short. I don't see why this can't be a
> > lot longer.
> >
> > ...and that said, I would argue you still need a reasonably controlled
> > environment for the autotests. If you have a server that is
> > arbitrarily overloaded, maybe also with high memory pressure (and
> > associated instruction page faults and god-knows-what), the real-world
> > worst-case interruptions could be very long indeed. Seconds. Designing
> > inherently racy tests for that kind of environment will make them have
> > very long run times.
>
> Forgive me if I am sidetracking a bit here... The issue discussed seems
> to be related to some threads waiting for other threads, and my question
> is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I see
> what is being done to ensure that the EAL worker threads are fully
> isolated, and never interrupted by the O/S scheduler or similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a
> NIC is configured with 4096 Rx descriptors, packet loss will occur after
> ca. 70 us (microseconds!) if not servicing the ingress queue when
> receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker
> threads [1]. What should I do if I want to run that code in the
> automated test environment? It would be for informational purposes only,
> i.e. I would manually look at the test output to see the result.
>
> I would write a test application that simply starts the O/S noise
> monitor thread as an isolated EAL worker thread, the main thread would
> then wait for 10 minutes (or some other duration), dump the result to
> the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/

Maybe we could do some hack, like finding a test in the current CI
that matches the test requirement: number of cores, ports, setup
params etc... (retrieving the output could be another challenge).

But if you think this is something we should have in the long run, I'd
suggest writing a new DTS test.

-- 
David Marchand
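
For reference, the figures Morten quotes can be reproduced with a short
calculation. This is only a sketch in Python. It assumes minimum-size
64-byte Ethernet frames and 20 bytes of per-frame wire overhead
(preamble, start-of-frame delimiter, inter-frame gap), neither of which
is spelled out in the thread, and the function names are made up for
illustration.

```python
# Assumptions (not stated in the thread): minimum-size 64-byte frames,
# plus 20 bytes of wire overhead per frame (7 preamble + 1 SFD + 12 IFG).
MIN_FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 20


def max_packet_rate(link_bps: float, frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Packets per second the link can carry at the given frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def ring_fill_time_us(descriptors: int, pps: float) -> float:
    """Microseconds until an unserviced Rx ring of this size is full."""
    return descriptors / pps * 1e6


if __name__ == "__main__":
    pps = max_packet_rate(40e9)
    print(f"{pps / 1e6:.2f} Mpkt/s")                  # 59.52 Mpkt/s
    print(f"{ring_fill_time_us(4096, pps):.1f} us")   # 68.8 us
```

Under those assumptions the ring fills in just under 69 us, which is
consistent with the "ca. 70 us" above.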
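
Mattias's point about inherently racy tests can be illustrated with a
small sketch of the pattern: one thread asks another to finish and polls
for confirmation against a fixed deadline. This is not the
service_autotest code. It is hypothetical Python, with a sleep (stall_s)
standing in for the time the worker spends descheduled, and all names
are illustrative.

```python
import threading
import time


def wait_until(predicate, timeout_s: float, poll_s: float = 0.001) -> bool:
    """Poll predicate() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return predicate()  # one last check at the deadline


def run_worker(stopped: threading.Event, stall_s: float) -> None:
    """Hypothetical worker; stall_s models time lost to the scheduler."""
    time.sleep(stall_s)
    stopped.set()


def service_stops_within(timeout_s: float, stall_s: float) -> bool:
    """Return True if the worker reports 'stopped' before the deadline."""
    stopped = threading.Event()
    worker = threading.Thread(target=run_worker, args=(stopped, stall_s))
    worker.start()
    ok = wait_until(stopped.is_set, timeout_s)
    worker.join()
    return ok


if __name__ == "__main__":
    print(service_stops_within(timeout_s=0.1, stall_s=0.0))
    print(service_stops_within(timeout_s=0.1, stall_s=0.3))
    print(service_stops_within(timeout_s=1.0, stall_s=0.3))
```

On a lightly loaded machine I would expect this to print True, False,
True. The only thing separating a pass from a failure is whether the
stall fits inside the timeout, so a longer timeout buys tolerance to
scheduling noise at the cost of a longer worst-case run time.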