Re: rte_service unit test failing randomly

From: Aaron Conole <aconole@redhat.com>
To: "Morten Brørup" <mb@smartsharesystems.com>
Cc: "Mattias Rönnblom" <mattias.ronnblom@ericsson.com>,
	"Thomas Monjalon" <thomas@monjalon.net>,
	"Van Haaren Harry" <harry.van.haaren@intel.com>,
	"David Marchand" <david.marchand@redhat.com>,
	dpdklab <dpdklab@iol.unh.edu>,
	ci@dpdk.org,
	"Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com>,
	dev <dev@dpdk.org>
Subject: Re: rte_service unit test failing randomly
Date: Thu, 06 Oct 2022 09:51:39 -0400
Message-ID: <f7t1qrlm55g.fsf@redhat.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87399@smartserver.smartshare.dk> (Morten Brørup's message of "Thu, 6 Oct 2022 08:53:32 +0200")

Morten Brørup <mb@smartsharesystems.com> writes:

>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>> 
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>> > 05/10/2022 22:33, Mattias Rönnblom:
>> >> On 2022-10-05 21:14, David Marchand wrote:
>> >>> Hello,
>> >>>
>> >>> The service_autotest unit test has been failing randomly.
>> >>> This is not something new.
>
> [...]
>
>> >>> EAL: Test assert service_may_be_active line 960 failed: Error: Service not stopped after 100ms
>> >>>
>> >>> Ideas?
>> >>>
>> >>>
>> >>> Thanks.
>> >>
>> >> Do you run the test suite in a controlled environment? I.e., one where
>> >> you can trust that the lcore threads aren't interrupted for long periods
>> >> of time.
>> >>
>> >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>> >> CPU with other threads.
>> >
>> > You mean the tests cannot be interrupted?
>> 
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>> 
>> > Then it looks very fragile.
>> 
>> Tests like this are by their very nature racy. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>> 
>> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but
>> they allow for longer interruptions.
>> 
>> That said, 100 ms sounds very short. I don't see why this can't be a
>> lot longer.
>> 
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests. If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racy tests for that kind of environment will make them have very long
>> run times.
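
To illustrate the trade-off described above, here is a minimal sketch of a deadline-based wait, written in Python rather than the C used in app/test. The helper name `wait_until`, the 1 ms poll interval and the 50 ms worker are assumptions made for the example; this is not the code in the service autotest. It shows that a generous timeout only costs run time when the condition is never met, because the wait returns as soon as the other thread responds.

```python
import threading
import time


def wait_until(predicate, timeout_s, poll_interval_s=0.001):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True if the predicate became true in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval_s)
    # One last check so a condition met right at the deadline still counts.
    return predicate()


if __name__ == "__main__":
    stopped = threading.Event()
    # Hypothetical worker that needs about 50 ms to acknowledge a stop request.
    worker = threading.Timer(0.05, stopped.set)
    worker.start()
    # A 5 s budget does not slow the passing case: we return once the flag is set.
    print("stopped in time:", wait_until(stopped.is_set, timeout_s=5.0))
    worker.join()
```

With this structure, raising the timeout from 100 ms to several seconds changes only how long a genuinely failing run takes to report the failure.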
>
> Forgive me, if I am sidetracking a bit here... The issue discussed
> seems to be related to some threads waiting for other threads, and my
> question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I
> see what is being done to ensure that the EAL worker threads are fully
> isolated, and never interrupted by the O/S scheduler or similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a
> NIC is configured with 4096 Rx descriptors, packet loss will occur
> after ca. 70 us (microseconds!) if not servicing the ingress queue
> when receiving at max packet rate.
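
The figures quoted above can be checked with a short calculation. This sketch assumes minimum-size 64-byte Ethernet frames plus 20 bytes of per-frame wire overhead (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap); the function names are made up for the example.

```python
LINK_RATE_BPS = 40e9          # 40 Gbit/s link
MIN_FRAME_BYTES = 64          # minimum Ethernet frame, including FCS
WIRE_OVERHEAD_BYTES = 20      # 7 preamble + 1 SFD + 12 inter-frame gap
RX_DESCRIPTORS = 4096


def max_packet_rate(link_bps, frame_bytes=MIN_FRAME_BYTES,
                    overhead_bytes=WIRE_OVERHEAD_BYTES):
    """Packets per second at line rate for the given frame size."""
    bits_per_packet = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_per_packet


def ring_fill_time_s(descriptors, packet_rate):
    """Seconds until an unserviced Rx ring of `descriptors` entries is full."""
    return descriptors / packet_rate


if __name__ == "__main__":
    rate = max_packet_rate(LINK_RATE_BPS)
    fill = ring_fill_time_s(RX_DESCRIPTORS, rate)
    print(f"{rate / 1e6:.2f} Mpkt/s")   # prints "59.52 Mpkt/s"
    print(f"{fill * 1e6:.1f} us")       # prints "68.8 us"
```

So an lcore that is descheduled for roughly 69 us at line rate is enough to fill a 4096-entry ring, which matches the "ca. 70 us" quoted above.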
>
> I recently posted some code for monitoring the O/S noise in EAL worker
> threads [1]. What should I do if I want to run that code in the
> automated test environment? It would be for informational purposes
> only, i.e. I would manually look at the test output to see the result.

One hacky way is to post a patch that introduces your test case, clearly
marked as never to be merged, and then look at the logs.

> I would write a test application that simply starts the O/S noise
> monitor thread as an isolated EAL worker thread, the main thread would
> then wait for 10 minutes (or some other duration), dump the result to
> the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
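
As a rough illustration of the kind of measurement described above, the sketch below busy-polls a monotonic clock and records the largest gap between two consecutive reads. It is not the code posted in [1]: it is plain Python with a hypothetical function name, and the interpreter adds jitter of its own, so the numbers only show the idea. A real monitor would be C code running on an isolated EAL worker lcore.

```python
import time


def measure_max_gap_ns(duration_s):
    """Busy-poll the clock for duration_s seconds.

    Returns (max_gap_ns, iterations), where max_gap_ns is the largest
    interval observed between two consecutive clock reads. A large gap
    suggests the thread was interrupted or descheduled in between.
    """
    start = time.perf_counter_ns()
    end = start + int(duration_s * 1e9)
    prev = start
    max_gap = 0
    iterations = 0
    while prev < end:
        now = time.perf_counter_ns()
        gap = now - prev
        if gap > max_gap:
            max_gap = gap
        prev = now
        iterations += 1
    return max_gap, iterations


if __name__ == "__main__":
    max_gap_ns, samples = measure_max_gap_ns(0.5)
    print(f"{samples} samples, largest gap {max_gap_ns / 1000:.1f} us")
```

The main thread would run such a loop on the worker for the chosen duration and then dump the largest gap to standard output, so the result shows up in the test logs.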

Thread overview: 15+ messages
2022-10-05 19:14 David Marchand
2022-10-05 20:33 ` Mattias Rönnblom
2022-10-05 20:52   ` Thomas Monjalon
2022-10-05 21:33     ` Mattias Rönnblom
2022-10-06  6:53       ` Morten Brørup
2022-10-06  7:04         ` David Marchand
2022-10-06  7:50           ` Morten Brørup
2022-10-06  7:50         ` Mattias Rönnblom
2022-10-06  8:18           ` Morten Brørup
2022-10-06  8:59             ` Mattias Rönnblom
2022-10-06  9:49               ` Morten Brørup
2022-10-06 11:07                 ` [dpdklab] " Lincoln Lavoie
2022-10-06 12:00                   ` Morten Brørup
2022-10-06 17:52                     ` Honnappa Nagarahalli
2022-10-06 13:51         ` Aaron Conole [this message]
