From: "Mattias Rönnblom" <mattias.ronnblom@ericsson.com>
To: "Morten Brørup" <mb@smartsharesystems.com>,
"Thomas Monjalon" <thomas@monjalon.net>,
"Van Haaren Harry" <harry.van.haaren@intel.com>
Cc: David Marchand <david.marchand@redhat.com>,
dpdklab <dpdklab@iol.unh.edu>, "ci@dpdk.org" <ci@dpdk.org>,
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
Aaron Conole <aconole@redhat.com>, dev <dev@dpdk.org>
Subject: Re: rte_service unit test failing randomly
Date: Thu, 6 Oct 2022 07:50:53 +0000
Message-ID: <d9aa3d6b-fa53-42c1-9089-9568b1c78095@ericsson.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87399@smartserver.smartshare.dk>
On 2022-10-06 08:53, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>>
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>>> 05/10/2022 22:33, Mattias Rönnblom:
>>>> On 2022-10-05 21:14, David Marchand wrote:
>>>>> Hello,
>>>>>
>>>>> The service_autotest unit test has been failing randomly.
>>>>> This is not something new.
>
> [...]
>
>>>>> EAL: Test assert service_may_be_active line 960 failed: Error: Service not stopped after 100ms
>>>>>
>>>>> Ideas?
>>>>>
>>>>>
>>>>> Thanks.
>>>>
>>>> Do you run the test suite in a controlled environment? I.e., one where
>>>> you can trust that the lcore threads aren't interrupted for long periods
>>>> of time.
>>>>
>>>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>>>> CPU with other threads.
>>>
>>> You mean the tests cannot be interrupted?
>>
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>>
>>> Then it looks very fragile.
>>
>> Tests like this are by their very nature racy. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>>
>> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but
>> they allow for longer interruptions.
>>
>> That said, 100 ms sounds very short. I don't see why this can't be a
>> lot longer.
>>
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests. If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racy tests for that kind of environment will make them have very long
>> run times.
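
To illustrate the point about timeouts in racy tests, here is a minimal sketch of polling a condition against a generous deadline rather than a fixed 100 ms. It is not the code in the service autotest; wait_until() and flag_is_set() are hypothetical names, and in a real test the predicate would check whatever "service stopped" condition the test cares about. A longer deadline costs nothing in the common case, because the loop returns as soon as the condition holds.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock in nanoseconds (not affected by wall-clock changes). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: poll done(arg) roughly every millisecond until it
 * returns true or timeout_ms has elapsed. Returns the final result of
 * done(arg), so a condition that becomes true right at the deadline is
 * still reported as success.
 */
static bool wait_until(bool (*done)(void *), void *arg, uint64_t timeout_ms)
{
	const struct timespec poll = { .tv_sec = 0, .tv_nsec = 1000000 };
	uint64_t deadline = now_ns() + timeout_ms * 1000000ULL;

	do {
		if (done(arg))
			return true;
		nanosleep(&poll, NULL);
	} while (now_ns() < deadline);

	return done(arg);
}

/*
 * Hypothetical example predicate: true once the int that arg points to is
 * non-zero. A flag shared between threads should use atomics instead.
 */
static bool flag_is_set(void *arg)
{
	return *(volatile int *)arg != 0;
}
```

With this shape, raising the timeout from 100 ms to several seconds only lengthens the run when the test is about to fail anyway.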
>
> Forgive me, if I am sidetracking a bit here... The issue discussed seems to be related to some threads waiting for other threads, and my question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I see what is being done to ensure that the EAL worker threads are fully isolated, and never interrupted by the O/S scheduler or similar?
>
There are kernel-level counters for how many times a thread has been
involuntarily interrupted, and also, if I recall correctly, the amount
of wall-clock time the thread has been runnable, but not running (i.e.,
waiting to be scheduled). The latter may require some scheduler debug
kernel option being enabled on the kernel build.
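
As a rough sketch of how a thread could read the counters mentioned above on Linux: the code below assumes procfs is mounted, that /proc/thread-self exists (Linux 3.17 or later), and that /proc/thread-self/schedstat is populated, which depends on the kernel's scheduler statistics support. The function names are made up for this example, and both functions return -1 when the data is not available.

```c
#include <stdio.h>

/*
 * Involuntary context switches of the calling thread, taken from the
 * "nonvoluntary_ctxt_switches" line of /proc/thread-self/status.
 * Returns -1 if the file or the line is missing.
 */
static long thread_involuntary_switches(void)
{
	char line[256];
	long count = -1;
	FILE *f = fopen("/proc/thread-self/status", "r");

	if (f == NULL)
		return -1;
	while (fgets(line, sizeof(line), f) != NULL)
		if (sscanf(line, "nonvoluntary_ctxt_switches: %ld", &count) == 1)
			break;
	fclose(f);
	return count;
}

/*
 * Nanoseconds the calling thread has spent runnable but waiting for a
 * CPU: the second field of /proc/thread-self/schedstat (the first field
 * is the time spent running). Returns -1 if unavailable.
 */
static long long thread_runqueue_wait_ns(void)
{
	unsigned long long run_ns, wait_ns;
	long long result = -1;
	FILE *f = fopen("/proc/thread-self/schedstat", "r");

	if (f == NULL)
		return -1;
	if (fscanf(f, "%llu %llu", &run_ns, &wait_ns) == 2)
		result = (long long)wait_ns;
	fclose(f);
	return result;
}
```

Sampling both values in the worker thread before and after a test, and printing the difference, would show whether the thread was preempted during the run.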
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a NIC is configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us (microseconds!) if not servicing the ingress queue when receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker threads [1]. What should I do if I want to run that code in the automated test environment? It would be for informational purposes only, i.e. I would manually look at the test output to see the result.
>
> I would write a test application that simply starts the O/S noise monitor thread as an isolated EAL worker thread, the main thread would then wait for 10 minutes (or some other duration), dump the result to the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
>
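
For reference, the figures quoted above can be reproduced with a short calculation. This sketch assumes the 59.52 M pkt/s rate refers to minimum-size 64-byte Ethernet frames, each costing an extra 20 bytes on the wire (preamble, start-of-frame delimiter and inter-frame gap); the function names are made up for the example.

```c
/*
 * Maximum packet rate on an Ethernet link for a given frame size.
 * Each frame occupies frame_bytes plus 20 bytes of per-frame overhead
 * (7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap).
 */
static double max_pkt_rate(double link_bps, unsigned int frame_bytes)
{
	const unsigned int wire_overhead_bytes = 20;

	return link_bps / ((frame_bytes + wire_overhead_bytes) * 8.0);
}

/*
 * Time in microseconds until an Rx ring with n_desc descriptors is full
 * if nobody services it while packets arrive at pkt_rate packets/s.
 */
static double ring_fill_time_us(unsigned int n_desc, double pkt_rate)
{
	return n_desc / pkt_rate * 1e6;
}
```

Under those assumptions, max_pkt_rate(40e9, 64) is about 59.52 million packets per second, and ring_fill_time_us(4096, ...) at that rate is about 68.8 us, in line with the "ca. 70 us" above.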