Re: rte_service unit test failing randomly

DPDK CI discussions

From: "Mattias Rönnblom" <mattias.ronnblom@ericsson.com>
To: Thomas Monjalon <thomas@monjalon.net>,
	Van Haaren Harry <harry.van.haaren@intel.com>
Cc: "David Marchand" <david.marchand@redhat.com>,
	dpdklab <dpdklab@iol.unh.edu>, "ci@dpdk.org" <ci@dpdk.org>,
	"Honnappa Nagarahalli" <Honnappa.Nagarahalli@arm.com>,
	"Morten Brørup" <mb@smartsharesystems.com>,
	"Aaron Conole" <aconole@redhat.com>, dev <dev@dpdk.org>
Subject: Re: rte_service unit test failing randomly
Date: Wed, 5 Oct 2022 21:33:56 +0000
Message-ID: <e4aca5cb-4805-9391-c73e-6ba8b8d5982a@ericsson.com>
In-Reply-To: <3000673.mvXUDI8C0e@thomas>

On 2022-10-05 22:52, Thomas Monjalon wrote:
> 05/10/2022 22:33, Mattias Rönnblom:
>> On 2022-10-05 21:14, David Marchand wrote:
>>> Hello,
>>>
>>> The service_autotest unit test has been failing randomly.
>>> This is not something new.
>>> We have been fixing this unit test and the service code, here and there.
>>> For some time we were "fine": the failures were rare.
>>>
>>> But recently (for the last two weeks at least), it started failing more
>>> frequently in UNH lab.
>>>
>>> The symptoms are linked to places where the unit test code is "waiting
>>> for some time":
>>>
>>> -  service_lcore_attr_get:
>>> + TestCase [ 5] : service_lcore_attr_get failed
>>> EAL: Test assert service_lcore_attr_get line 422 failed: Service lcore
>>> not stopped after waiting.
>>>
>>>
>>> -  service_may_be_active:
>>> + TestCase [15] : service_may_be_active failed
>>> ...
>>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>>> not stopped after 100ms
>>>
>>> Ideas?
>>>
>>>
>>> Thanks.
>>
>> Do you run the test suite in a controlled environment? I.e., one where
>> you can trust that the lcore threads aren't interrupted for long periods
>> of time.
>>
>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>> CPU with other threads.
> 
> You mean the tests cannot be interrupted?

I just took a very quick look, but it seems like the main thread can,
but the worker lcore thread cannot be interrupted for anything close to
100 ms, or you risk a test failure.

> Then it looks very fragile.

Tests like this are by their very nature racy. If a test thread sends a
request to another thread, there is no way for it to decide when a
non-response should result in a test failure, unless the scheduling
latency of the receiving thread has an upper bound.

If you grep for "sleep" or "delay" in app/test/test_*.c, you will get
a lot of matches. I bet there are more like the service core one, but
they allow for longer interruptions.

That said, 100 ms sounds very short. I don't see why this can't be a
lot longer.
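
A minimal sketch of why a longer timeout need not slow down the passing
case: poll the condition against a generous deadline and return as soon
as it holds. This is plain POSIX C, not the actual test code; now_ms()
and wait_for_flag() are hypothetical helpers rather than DPDK APIs, and
the 1 ms poll interval is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not a DPDK API): monotonic time in milliseconds. */
static uint64_t now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: poll 'flag' until it reads true or 'timeout_ms'
 * milliseconds have elapsed. Returns true if the flag was observed set,
 * false on timeout.
 */
static bool wait_for_flag(atomic_bool *flag, uint64_t timeout_ms)
{
	/* Assumed poll interval of 1 ms. */
	const struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };
	uint64_t deadline = now_ms() + timeout_ms;

	do {
		if (atomic_load(flag))
			return true;
		nanosleep(&pause, NULL);
	} while (now_ms() < deadline);

	/* One last check, in case the flag was set during the final sleep. */
	return atomic_load(flag);
}
```

With a helper along these lines, a call such as
wait_for_flag(&stopped, 5000) returns shortly after the other thread
sets the flag, and only a genuinely failing run waits the full five
seconds.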

...and that said, I would argue you still need a reasonably controlled
environment for the autotests. If you have a server that is arbitrarily
overloaded, maybe also with high memory pressure (and associated
instruction page faults and god-knows-what), the real-world worst-case
interruptions could be very long indeed. Seconds. Designing inherently
racy tests for that kind of environment will make them have very long
run times.

> Could you please help make it more robust?
> 

I can send a patch, if Harry can't.
