DPDK CI discussions
* rte_service unit test failing randomly
@ 2022-10-05 19:14 David Marchand
  2022-10-05 20:33 ` Mattias Rönnblom
  0 siblings, 1 reply; 15+ messages in thread
From: David Marchand @ 2022-10-05 19:14 UTC (permalink / raw)
  To: Van Haaren Harry, dpdklab
  Cc: Mattias Rönnblom, Honnappa Nagarahalli, Morten Brørup,
	Aaron Conole, dev, ci

Hello,

The service_autotest unit test has been failing randomly.
This is not something new.
We have been fixing this unit test and the service code, here and there.
For some time we were "fine": the failures were rare.

But recently (for the last two weeks at least), it started failing more
frequently in the UNH lab.

The symptoms are linked to places where the unit test code is "waiting
for some time":

-  service_lcore_attr_get:
+ TestCase [ 5] : service_lcore_attr_get failed
EAL: Test assert service_lcore_attr_get line 422 failed: Service lcore
not stopped after waiting.


-  service_may_be_active:
+ TestCase [15] : service_may_be_active failed
...
EAL: Test assert service_may_be_active line 960 failed: Error: Service
not stopped after 100ms
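
As far as I can tell, both assertions sit at the end of bounded polling
loops in app/test/test_service_cores.c. Roughly this shape (a paraphrase,
not the exact test code; the helper name is made up):

    #include <rte_service.h>
    #include <rte_cycles.h>

    /* Poll for the service to stop, for at most ~100 ms. */
    static int
    wait_service_stopped(uint32_t id)
    {
            int ms;

            for (ms = 0; ms < 100; ms++) {
                    if (!rte_service_may_be_active(id))
                            return 0; /* stopped in time */
                    rte_delay_ms(1);
            }
            return -1; /* timed out: reported as the failures above */
    }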

Ideas?


Thanks.
-- 
David Marchand



* Re: rte_service unit test failing randomly
  2022-10-05 19:14 rte_service unit test failing randomly David Marchand
@ 2022-10-05 20:33 ` Mattias Rönnblom
  2022-10-05 20:52   ` Thomas Monjalon
  0 siblings, 1 reply; 15+ messages in thread
From: Mattias Rönnblom @ 2022-10-05 20:33 UTC (permalink / raw)
  To: David Marchand, Van Haaren Harry, dpdklab
  Cc: Honnappa Nagarahalli, Morten Brørup, Aaron Conole, dev, ci

On 2022-10-05 21:14, David Marchand wrote:
> Hello,
> 
> The service_autotest unit test has been failing randomly.
> This is not something new.
> We have been fixing this unit test and the service code, here and there.
> For some time we were "fine": the failures were rare.
> 
> But recently (for the last two weeks at least), it started failing more
> frequently in the UNH lab.
> 
> The symptoms are linked to places where the unit test code is "waiting
> for some time":
> 
> -  service_lcore_attr_get:
> + TestCase [ 5] : service_lcore_attr_get failed
> EAL: Test assert service_lcore_attr_get line 422 failed: Service lcore
> not stopped after waiting.
> 
> 
> -  service_may_be_active:
> + TestCase [15] : service_may_be_active failed
> ...
> EAL: Test assert service_may_be_active line 960 failed: Error: Service
> not stopped after 100ms
> 
> Ideas?
> 
> 
> Thanks.

Do you run the test suite in a controlled environment? I.e., one where 
you can trust that the lcore threads aren't interrupted for long periods 
of time.

100 ms is not a long time if a SCHED_OTHER lcore thread competes for the 
CPU with other threads.
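
(If in doubt, a quick way to check what policy the lcore threads actually
run under -- plain POSIX, nothing DPDK-specific:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void
    print_sched_policy(void)
    {
            int policy;
            struct sched_param sp;

            if (pthread_getschedparam(pthread_self(), &policy, &sp) == 0)
                    printf("policy=%d (SCHED_OTHER=%d), priority=%d\n",
                           policy, SCHED_OTHER, sp.sched_priority);
    }

Calling that from a launched lcore function will tell you whether the
workers are plain SCHED_OTHER threads.)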



* Re: rte_service unit test failing randomly
  2022-10-05 20:33 ` Mattias Rönnblom
@ 2022-10-05 20:52   ` Thomas Monjalon
  2022-10-05 21:33     ` Mattias Rönnblom
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Monjalon @ 2022-10-05 20:52 UTC (permalink / raw)
  To: Van Haaren Harry, Mattias Rönnblom
  Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli,
	Morten Brørup, Aaron Conole, dev, ci

05/10/2022 22:33, Mattias Rönnblom:
> On 2022-10-05 21:14, David Marchand wrote:
> > Hello,
> > 
> > The service_autotest unit test has been failing randomly.
> > This is not something new.
> > We have been fixing this unit test and the service code, here and there.
> > For some time we were "fine": the failures were rare.
> > 
> > But recently (for the last two weeks at least), it started failing more
> > frequently in the UNH lab.
> > 
> > The symptoms are linked to places where the unit test code is "waiting
> > for some time":
> > 
> > -  service_lcore_attr_get:
> > + TestCase [ 5] : service_lcore_attr_get failed
> > EAL: Test assert service_lcore_attr_get line 422 failed: Service lcore
> > not stopped after waiting.
> > 
> > 
> > -  service_may_be_active:
> > + TestCase [15] : service_may_be_active failed
> > ...
> > EAL: Test assert service_may_be_active line 960 failed: Error: Service
> > not stopped after 100ms
> > 
> > Ideas?
> > 
> > 
> > Thanks.
> 
> Do you run the test suite in a controlled environment? I.e., one where 
> you can trust that the lcore threads aren't interrupted for long periods 
> of time.
> 
> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the 
> CPU with other threads.

You mean the tests cannot be interrupted?
Then it looks very fragile.
Could you please help make it more robust?




* Re: rte_service unit test failing randomly
  2022-10-05 20:52   ` Thomas Monjalon
@ 2022-10-05 21:33     ` Mattias Rönnblom
  2022-10-06  6:53       ` Morten Brørup
  0 siblings, 1 reply; 15+ messages in thread
From: Mattias Rönnblom @ 2022-10-05 21:33 UTC (permalink / raw)
  To: Thomas Monjalon, Van Haaren Harry
  Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli,
	Morten Brørup, Aaron Conole, dev

On 2022-10-05 22:52, Thomas Monjalon wrote:
> 05/10/2022 22:33, Mattias Rönnblom:
>> On 2022-10-05 21:14, David Marchand wrote:
>>> Hello,
>>>
>>> The service_autotest unit test has been failing randomly.
>>> This is not something new.
>>> We have been fixing this unit test and the service code, here and there.
>>> For some time we were "fine": the failures were rare.
>>>
>>> But recently (for the last two weeks at least), it started failing more
>>> frequently in the UNH lab.
>>>
>>> The symptoms are linked to places where the unit test code is "waiting
>>> for some time":
>>>
>>> -  service_lcore_attr_get:
>>> + TestCase [ 5] : service_lcore_attr_get failed
>>> EAL: Test assert service_lcore_attr_get line 422 failed: Service lcore
>>> not stopped after waiting.
>>>
>>>
>>> -  service_may_be_active:
>>> + TestCase [15] : service_may_be_active failed
>>> ...
>>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>>> not stopped after 100ms
>>>
>>> Ideas?
>>>
>>>
>>> Thanks.
>>
>> Do you run the test suite in a controlled environment? I.e., one where
>> you can trust that the lcore threads aren't interrupted for long periods
>> of time.
>>
>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>> CPU with other threads.
> 
> You mean the tests cannot be interrupted?

I just took a very quick look, but it seems like the main thread can, 
but the worker lcore thread cannot be interrupted for anything close to 
100 ms, or you risk a test failure.

> Then it looks very fragile.

Tests like this are by their very nature racy. If a test thread sends a 
request to another thread, there is no way for it to decide when a 
non-response should result in a test failure, unless the scheduling 
latency of the receiving thread has an upper bound.

If you grep for "sleep", or "delay", in app/test/test_*.c, you will get 
a lot of matches. I bet there are more like the service core one, but they 
allow for longer interruptions.

That said, 100 ms sounds very short. I don't see why this couldn't be a 
lot longer.
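
A sketch of the direction I'd take (an illustration, not an actual patch;
the helper is hypothetical): keep the 1 ms polling granularity, so a
passing run is still fast, and only pay a much larger budget on the
failure path:

    /* Poll cond() every 1 ms (rte_delay_ms() is in rte_cycles.h);
     * tolerate seconds of scheduling noise before failing. */
    #define WAIT_BUDGET_MS 5000

    static int
    wait_for_condition(int (*cond)(void *), void *arg)
    {
            unsigned int ms;

            for (ms = 0; ms < WAIT_BUDGET_MS; ms++) {
                    if (cond(arg))
                            return 0;
                    rte_delay_ms(1);
            }
            return -1;
    }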

...and that said, I would argue you still need a reasonably controlled 
environment for the autotests. If you have a server that is arbitrarily 
overloaded, maybe also with high memory pressure (and associated 
instruction page faults and god-knows-what), the real-world worst-case 
interruptions could be very long indeed. Seconds. Designing inherently 
racy tests for that kind of environment will make them have very long 
run times.

> Could you please help make it more robust?
> 

I can send a patch, if Harry can't.



* RE: rte_service unit test failing randomly
  2022-10-05 21:33     ` Mattias Rönnblom
@ 2022-10-06  6:53       ` Morten Brørup
  2022-10-06  7:04         ` David Marchand
                           ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Morten Brørup @ 2022-10-06  6:53 UTC (permalink / raw)
  To: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry
  Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, 5 October 2022 23.34
> 
> On 2022-10-05 22:52, Thomas Monjalon wrote:
> > 05/10/2022 22:33, Mattias Rönnblom:
> >> On 2022-10-05 21:14, David Marchand wrote:
> >>> Hello,
> >>>
> >>> The service_autotest unit test has been failing randomly.
> >>> This is not something new.

[...]

> >>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
> >>> not stopped after 100ms
> >>>
> >>> Ideas?
> >>>
> >>>
> >>> Thanks.
> >>
> >> Do you run the test suite in a controlled environment? I.e., one where
> >> you can trust that the lcore threads aren't interrupted for long periods
> >> of time.
> >>
> >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
> >> CPU with other threads.
> >
> > You mean the tests cannot be interrupted?
> 
> I just took a very quick look, but it seems like the main thread can,
> but the worker lcore thread cannot be interrupted for anything close to
> 100 ms, or you risk a test failure.
> 
> > Then it looks very fragile.
> 
> Tests like this are by their very nature racy. If a test thread sends a
> request to another thread, there is no way for it to decide when a
> non-response should result in a test failure, unless the scheduling
> latency of the receiving thread has an upper bound.
> 
> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
> a lot of matches. I bet there are more like the service core one, but they
> allow for longer interruptions.
> 
> That said, 100 ms sounds very short. I don't see why this couldn't be a
> lot longer.
> 
> ...and that said, I would argue you still need a reasonably controlled
> environment for the autotests. If you have a server that is arbitrarily
> overloaded, maybe also with high memory pressure (and associated
> instruction page faults and god-knows-what), the real-world worst-case
> interruptions could be very long indeed. Seconds. Designing inherently
> racy tests for that kind of environment will make them have very long
> run times.

Forgive me, if I am sidetracking a bit here... The issue discussed seems to be related to some threads waiting for other threads, and my question is not directly related to that.

I have been wondering how accurate the tests really are. Where can I see what is being done to ensure that the EAL worker threads are fully isolated, and never interrupted by the O/S scheduler or similar?

For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a NIC is configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us (microseconds!) if not servicing the ingress queue when receiving at max packet rate.
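
(The arithmetic behind those numbers: a minimum-size 64 byte frame plus
8 bytes of preamble and 12 bytes of inter-frame gap occupies 84 bytes =
672 bits on the wire, so 40e9 / 672 = 59.52 M pkt/s; and 4096 descriptors
/ 59.52 M pkt/s = ca. 69 us until the Rx ring overflows.)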

I recently posted some code for monitoring the O/S noise in EAL worker threads [1]. What should I do if I want to run that code in the automated test environment? It would be for informational purposes only, i.e. I would manually look at the test output to see the result.

I would write a test application that simply starts the O/S noise monitor thread as an isolated EAL worker thread, the main thread would then wait for 10 minutes (or some other duration), dump the result to the standard output, and exit the application.

[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
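
For the curious, the heart of the monitor in [1] is a worker that
timestamps itself in a tight loop and records the gaps (a minimal sketch;
the stop flag, threshold and reporting are my simplifications, the posted
code differs):

    #include <stdio.h>
    #include <rte_common.h>
    #include <rte_cycles.h>

    static volatile int stop; /* set by the main thread when done */

    /* Launch with rte_eal_remote_launch() on an isolated lcore. */
    static int
    os_noise_monitor(void *arg __rte_unused)
    {
            uint64_t threshold = rte_get_tsc_hz() / 100000; /* ~10 us */
            uint64_t prev = rte_rdtsc();
            uint64_t max_gap = 0;

            while (!stop) {
                    uint64_t now = rte_rdtsc();

                    if (now - prev > threshold && now - prev > max_gap)
                            max_gap = now - prev; /* we were interrupted */
                    prev = now;
            }
            printf("longest interruption: %lu TSC cycles\n",
                   (unsigned long)max_gap);
            return 0;
    }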



* Re: rte_service unit test failing randomly
  2022-10-06  6:53       ` Morten Brørup
@ 2022-10-06  7:04         ` David Marchand
  2022-10-06  7:50           ` Morten Brørup
  2022-10-06  7:50         ` Mattias Rönnblom
  2022-10-06 13:51         ` Aaron Conole
  2 siblings, 1 reply; 15+ messages in thread
From: David Marchand @ 2022-10-06  7:04 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
	dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev

On Thu, Oct 6, 2022 at 8:53 AM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Wednesday, 5 October 2022 23.34
> >
> > On 2022-10-05 22:52, Thomas Monjalon wrote:
> > > 05/10/2022 22:33, Mattias Rönnblom:
> > >> On 2022-10-05 21:14, David Marchand wrote:
> > >>> Hello,
> > >>>
> > >>> The service_autotest unit test has been failing randomly.
> > >>> This is not something new.
>
> [...]
>
> > >>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
> > >>> not stopped after 100ms
> > >>>
> > >>> Ideas?
> > >>>
> > >>>
> > >>> Thanks.
> > >>
> > >> Do you run the test suite in a controlled environment? I.e., one where
> > >> you can trust that the lcore threads aren't interrupted for long periods
> > >> of time.
> > >>
> > >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
> > >> CPU with other threads.
> > >
> > > You mean the tests cannot be interrupted?
> >
> > I just took a very quick look, but it seems like the main thread can,
> > but the worker lcore thread cannot be interrupted for anything close to
> > 100 ms, or you risk a test failure.
> >
> > > Then it looks very fragile.
> >
> > Tests like this are by their very nature racy. If a test thread sends a
> > request to another thread, there is no way for it to decide when a
> > non-response should result in a test failure, unless the scheduling
> > latency of the receiving thread has an upper bound.
> >
> > If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
> > a lot of matches. I bet there are more like the service core one, but they
> > allow for longer interruptions.
> >
> > That said, 100 ms sounds very short. I don't see why this couldn't be a
> > lot longer.
> >
> > ...and that said, I would argue you still need a reasonably controlled
> > environment for the autotests. If you have a server that is arbitrarily
> > overloaded, maybe also with high memory pressure (and associated
> > instruction page faults and god-knows-what), the real-world worst-case
> > interruptions could be very long indeed. Seconds. Designing inherently
> > racy tests for that kind of environment will make them have very long
> > run times.
>
> Forgive me, if I am sidetracking a bit here... The issue discussed seems to be related to some threads waiting for other threads, and my question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I see what is being done to ensure that the EAL worker threads are fully isolated, and never interrupted by the O/S scheduler or similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a NIC is configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us (microseconds!) if not servicing the ingress queue when receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker threads [1]. What should I do if I want to run that code in the automated test environment? It would be for informational purposes only, i.e. I would manually look at the test output to see the result.
>
> I would write a test application that simply starts the O/S noise monitor thread as an isolated EAL worker thread, the main thread would then wait for 10 minutes (or some other duration), dump the result to the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/

Maybe we could do some hack, like finding a test in the current CI
that matches the test requirements: number of cores, ports, setup
params, etc. (retrieving the output could be another challenge).
But if you think this is something we should have in the long run, I'd
suggest writing a new DTS test.


-- 
David Marchand



* RE: rte_service unit test failing randomly
  2022-10-06  7:04         ` David Marchand
@ 2022-10-06  7:50           ` Morten Brørup
  0 siblings, 0 replies; 15+ messages in thread
From: Morten Brørup @ 2022-10-06  7:50 UTC (permalink / raw)
  To: David Marchand
  Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
	dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev, dts

+CC: DTS mailing list.

> From: David Marchand [mailto:david.marchand@redhat.com]
> Sent: Thursday, 6 October 2022 09.05
> 
> On Thu, Oct 6, 2022 at 8:53 AM Morten Brørup <mb@smartsharesystems.com>
> wrote:

[...]

> > Forgive me, if I am sidetracking a bit here... The issue discussed
> > seems to be related to some threads waiting for other threads, and my
> > question is not directly related to that.
> >
> > I have been wondering how accurate the tests really are. Where can I
> > see what is being done to ensure that the EAL worker threads are fully
> > isolated, and never interrupted by the O/S scheduler or similar?
> >
> > For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If
> > a NIC is configured with 4096 Rx descriptors, packet loss will occur
> > after ca. 70 us (microseconds!) if not servicing the ingress queue when
> > receiving at max packet rate.
> >
> > I recently posted some code for monitoring the O/S noise in EAL
> > worker threads [1]. What should I do if I want to run that code in the
> > automated test environment? It would be for informational purposes
> > only, i.e. I would manually look at the test output to see the result.
> >
> > I would write a test application that simply starts the O/S noise
> > monitor thread as an isolated EAL worker thread, the main thread would
> > then wait for 10 minutes (or some other duration), dump the result to
> > the standard output, and exit the application.
> >
> > [1]:
> > http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
> 
> Maybe we could do some hack, like finding a test in the current CI
> that matches the test requirements: number of cores, ports, setup
> params, etc. (retrieving the output could be another challenge).
> But if you think this is something we should have in the long run, I'd
> suggest writing a new DTS test.

It would be useful to have in the long run - it could catch sudden anomalies in the test environment.

Where can I find documentation on how to add a new test case to DTS?

-Morten



* Re: rte_service unit test failing randomly
  2022-10-06  6:53       ` Morten Brørup
  2022-10-06  7:04         ` David Marchand
@ 2022-10-06  7:50         ` Mattias Rönnblom
  2022-10-06  8:18           ` Morten Brørup
  2022-10-06 13:51         ` Aaron Conole
  2 siblings, 1 reply; 15+ messages in thread
From: Mattias Rönnblom @ 2022-10-06  7:50 UTC (permalink / raw)
  To: Morten Brørup, Thomas Monjalon, Van Haaren Harry
  Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev

On 2022-10-06 08:53, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>>
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>>> 05/10/2022 22:33, Mattias Rönnblom:
>>>> On 2022-10-05 21:14, David Marchand wrote:
>>>>> Hello,
>>>>>
>>>>> The service_autotest unit test has been failing randomly.
>>>>> This is not something new.
> 
> [...]
> 
>>>>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>>>>> not stopped after 100ms
>>>>>
>>>>> Ideas?
>>>>>
>>>>>
>>>>> Thanks.
>>>>
>>>> Do you run the test suite in a controlled environment? I.e., one where
>>>> you can trust that the lcore threads aren't interrupted for long periods
>>>> of time.
>>>>
>>>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>>>> CPU with other threads.
>>>
>>> You mean the tests cannot be interrupted?
>>
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>>
>>> Then it looks very fragile.
>>
>> Tests like this are by their very nature racy. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>>
>> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but they
>> allow for longer interruptions.
>>
>> That said, 100 ms sounds very short. I don't see why this couldn't be a
>> lot longer.
>>
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests. If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racy tests for that kind of environment will make them have very long
>> run times.
> 
> Forgive me, if I am sidetracking a bit here... The issue discussed seems to be related to some threads waiting for other threads, and my question is not directly related to that.
> 
> I have been wondering how accurate the tests really are. Where can I see what is being done to ensure that the EAL worker threads are fully isolated, and never interrupted by the O/S scheduler or similar?
> 

There are kernel-level counters for how many times a thread has been 
involuntarily interrupted, and also, if I recall correctly, the amount 
of wall-time the thread has been runnable, but not running (i.e., 
waiting to be scheduled). The latter may require some scheduler debug 
kernel option to be enabled on the kernel build.
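
Concretely, on Linux, something along these lines dumps the counters for
the calling thread (a sketch; /proc/thread-self needs kernel 3.17+, and
schedstat needs schedstats support in the kernel):

    #include <stdio.h>
    #include <string.h>

    static void
    dump_preemption_counters(void)
    {
            char line[256];
            FILE *f = fopen("/proc/thread-self/status", "r");

            if (f == NULL)
                    return;
            while (fgets(line, sizeof(line), f) != NULL)
                    if (strstr(line, "ctxt_switches") != NULL)
                            fputs(line, stdout); /* (non)voluntary_ctxt_switches */
            fclose(f);
    }

If I read proc(5) right, /proc/thread-self/schedstat holds the other part:
time on-CPU (ns), time runnable but waiting (ns), and number of timeslices.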

> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a NIC is configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us (microseconds!) if not servicing the ingress queue when receiving at max packet rate.
> 
> I recently posted some code for monitoring the O/S noise in EAL worker threads [1]. What should I do if I want to run that code in the automated test environment? It would be for informational purposes only, i.e. I would manually look at the test output to see the result.
> 
> I would write a test application that simply starts the O/S noise monitor thread as an isolated EAL worker thread, the main thread would then wait for 10 minutes (or some other duration), dump the result to the standard output, and exit the application.
> 
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
> 



* RE: rte_service unit test failing randomly
  2022-10-06  7:50         ` Mattias Rönnblom
@ 2022-10-06  8:18           ` Morten Brørup
  2022-10-06  8:59             ` Mattias Rönnblom
  0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2022-10-06  8:18 UTC (permalink / raw)
  To: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry
  Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Thursday, 6 October 2022 09.51
> 
> On 2022-10-06 08:53, Morten Brørup wrote:

[...]

> > I have been wondering how accurate the tests really are. Where can I
> > see what is being done to ensure that the EAL worker threads are fully
> > isolated, and never interrupted by the O/S scheduler or similar?
> >
> 
> There are kernel-level counters for how many times a thread has been
> involuntarily interrupted,

Thanks, Mattias. I will look into that.

Old kernels (2.4 and 2.6) ascribed the time spent in interrupt handlers to the CPU usage of the running process, instead of counting the time spent in interrupt handlers separately. Does anyone know if this has been fixed?

> and also, if I recall correctly, the amount
> of wall-time the thread has been runnable, but not running (i.e.,
> waiting to be scheduled). The latter may require some scheduler debug
> kernel option to be enabled on the kernel build.




* Re: rte_service unit test failing randomly
  2022-10-06  8:18           ` Morten Brørup
@ 2022-10-06  8:59             ` Mattias Rönnblom
  2022-10-06  9:49               ` Morten Brørup
  0 siblings, 1 reply; 15+ messages in thread
From: Mattias Rönnblom @ 2022-10-06  8:59 UTC (permalink / raw)
  To: Morten Brørup, Thomas Monjalon, Van Haaren Harry
  Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev

On 2022-10-06 10:18, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Thursday, 6 October 2022 09.51
>>
>> On 2022-10-06 08:53, Morten Brørup wrote:
> 
> [...]
> 
>>> I have been wondering how accurate the tests really are. Where can I
>>> see what is being done to ensure that the EAL worker threads are fully
>>> isolated, and never interrupted by the O/S scheduler or similar?
>>>
>>
>> There are kernel-level counters for how many times a thread has been
>> involuntarily interrupted,
> 
> Thanks, Mattias. I will look into that.
> 
> Old kernels (2.4 and 2.6) ascribed the time spent in interrupt handlers to the CPU usage of the running process, instead of counting the time spent in interrupt handlers separately. Does anyone know if this has been fixed?
> 

If you mean the top half interrupt handler, my guess would be it does not 
matter, except in some strange corner cases or with faulty hardware. An ISR 
should have a very short run time, and not be run *that* often (after 
NAPI). With isolated cores, it should be even less of a problem, but 
then you may not have that.

Bottom halves are not attributed to the process, I believe. (In old 
kernels, the time spent in soft IRQs was not attributed to anything, 
which could create situations where the system was very busy indeed 
[e.g., with network stack bottom halves doing IP forwarding], but 
looking idle in 'top'.)

>> and also, if I recall correctly, the amount
>> of wall-time the thread has been runnable, but not running (i.e.,
>> waiting to be scheduled). The latter may require some scheduler debug
>> kernel option to be enabled on the kernel build.
> 
> 



* RE: rte_service unit test failing randomly
  2022-10-06  8:59             ` Mattias Rönnblom
@ 2022-10-06  9:49               ` Morten Brørup
  2022-10-06 11:07                 ` [dpdklab] " Lincoln Lavoie
  0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2022-10-06  9:49 UTC (permalink / raw)
  To: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry
  Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Thursday, 6 October 2022 10.59
> 
> On 2022-10-06 10:18, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Thursday, 6 October 2022 09.51
> >>
> >> On 2022-10-06 08:53, Morten Brørup wrote:
> >
> > [...]
> >
> >>> I have been wondering how accurate the tests really are. Where can I
> >> see what is being done to ensure that the EAL worker threads are fully
> >> isolated, and never interrupted by the O/S scheduler or similar?
> >>>
> >>
> >> There are kernel-level counters for how many times a thread has been
> >> involuntarily interrupted,
> >
> > Thanks, Mattias. I will look into that.
> >
> > Old kernels (2.4 and 2.6) ascribed the time spent in interrupt
> > handlers to the CPU usage of the running process, instead of counting
> > the time spent in interrupt handlers separately. Does anyone know if
> > this has been fixed?
> >
> 
> If you mean the top half interrupt handler, my guess would be it does not
> matter, except in some strange corner cases or with faulty hardware. An ISR
> should have a very short run time, and not be run *that* often (after
> NAPI). With isolated cores, it should be even less of a problem, but
> then you may not have that.
> 

Many years ago, we used a NIC that didn't have DMA, and only 4 RX descriptors, so it had to be serviced in the top half.

> Bottom halves are not attributed to the process, I believe.

This is an improvement.

> (In old
> kernels, the time spent in soft IRQs was not attributed to anything,
> which could create situations where the system was very busy indeed
> [e.g., with network stack bottom halves doing IP forwarding], but
> looking idle in 'top'.)

We also experienced that. The kernel's scheduling information was completely useless, so eventually we removed the CPU Utilization information from our GUI. ;-)

And IIRC, it wasn't fixed in kernel 2.6.

> 
> >> and also, if I recall correctly, the amount
> >> of wall-time the thread has been runnable, but not running (i.e.,
> >> waiting to be scheduled). The latter may require some scheduler debug
> >> kernel option to be enabled on the kernel build.
> >
> >



* Re: [dpdklab] RE: rte_service unit test failing randomly
  2022-10-06  9:49               ` Morten Brørup
@ 2022-10-06 11:07                 ` Lincoln Lavoie
  2022-10-06 12:00                   ` Morten Brørup
  0 siblings, 1 reply; 15+ messages in thread
From: Lincoln Lavoie @ 2022-10-06 11:07 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
	David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole,
	dev


On Thu, Oct 6, 2022 at 5:49 AM Morten Brørup <mb@smartsharesystems.com>
wrote:

> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Thursday, 6 October 2022 10.59
> >
> > On 2022-10-06 10:18, Morten Brørup wrote:
> > >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > >> Sent: Thursday, 6 October 2022 09.51
> > >>
> > >> On 2022-10-06 08:53, Morten Brørup wrote:
> > >
> > > [...]
> > >
> > >>> I have been wondering how accurate the tests really are. Where can I
> > >> see what is being done to ensure that the EAL worker threads are fully
> > >> isolated, and never interrupted by the O/S scheduler or similar?
> > >>>
> > >>
> > >> There are kernel-level counters for how many times a thread has been
> > >> involuntarily interrupted,
> > >
> > > Thanks, Mattias. I will look into that.
> > >
> > > Old kernels (2.4 and 2.6) ascribed the time spent in interrupt
> > > handlers to the CPU usage of the running process, instead of counting
> > > the time spent in interrupt handlers separately. Does anyone know if
> > > this has been fixed?
> > >
> >
> > If you mean the top half interrupt handler, my guess would be it does not
> > matter, except in some strange corner cases or with faulty hardware. An ISR
> > should have a very short run time, and not be run *that* often (after
> > NAPI). With isolated cores, it should be even less of a problem, but
> > then you may not have that.
> >
>
> Many years ago, we used a NIC that didn't have DMA, and only 4 RX
> descriptors, so it had to be serviced in the top half.
>
> > Bottom halves are not attributed to the process, I believe.
>
> This is an improvement.
>
> > (In old
> > kernels, the time spent in soft IRQs was not attributed to anything,
> > which could create situations where the system was very busy indeed
> > [e.g., with network stack bottom halves doing IP forwarding], but
> > looking idle in 'top'.)
>
> We also experienced that. The kernel's scheduling information was
> completely useless, so eventually we removed the CPU Utilization
> information from our GUI. ;-)
>
> And IIRC, it wasn't fixed in kernel 2.6.
>
> >
> > >> and also, if I recall correctly, the amount
> > >> of wall-time the thread has been runnable, but not running (i.e.,
> > >> waiting to be scheduled). The latter may require some scheduler debug
> > >> kernel option to be enabled on the kernel build.
> > >
> > >

Back to the topic of unit testing, I think we need to consider their
purpose and where we expect them to run.  Unit tests are run in automated
environments, across multiple CI systems, i.e. UNH-IOL Community Lab,
GitHub, etc.  Those environments are typically virtualized, and I don't
think the unit tests should require tuning down to the level of CPU clock
ticks.  Those tests are likely better suited to dedicated performance
environments, where the complete host is tightly controlled, for the
purpose of repeatable and deterministic results on things like packet
throughput, etc.

Cheers,
Lincoln


-- 
*Lincoln Lavoie*
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)



* RE: [dpdklab] RE: rte_service unit test failing randomly
  2022-10-06 11:07                 ` [dpdklab] " Lincoln Lavoie
@ 2022-10-06 12:00                   ` Morten Brørup
  2022-10-06 17:52                     ` Honnappa Nagarahalli
  0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2022-10-06 12:00 UTC (permalink / raw)
  To: Lincoln Lavoie
  Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
	David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole,
	dev, dts

From: Lincoln Lavoie [mailto:lylavoie@iol.unh.edu] 
Sent: Thursday, 6 October 2022 13.07
> On Thu, Oct 6, 2022 at 5:49 AM Morten Brørup <mb@smartsharesystems.com> wrote:
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Thursday, 6 October 2022 10.59
> > 
> > On 2022-10-06 10:18, Morten Brørup wrote:
> > >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > >> Sent: Thursday, 6 October 2022 09.51
> > >>
> > >> On 2022-10-06 08:53, Morten Brørup wrote:
> > >
> > > [...]
> > >
> > >>> I have been wondering how accurate the tests really are. Where can I
> > >> see what is being done to ensure that the EAL worker threads are fully
> > >> isolated, and never interrupted by the O/S scheduler or similar?

[...]

> Back to the topic of unit testing, I think we need to consider their purpose and where we expect them to run.  Unit tests are run in automated environments, across multiple CI systems, i.e. UNH-IOL Community Lab, GitHub, etc.  Those environments are typically virtualized, and I don't think the unit tests should require tuning down to the level of CPU clock ticks.  Those tests are likely better suited to dedicated performance environments, where the complete host is tightly controlled, for the purpose of repeatable and deterministic results on things like packet throughput, etc.

Excellent point, Lincoln. Verifying the performance of the surrounding runtime environment - i.e. the host running the tests - is not unit test material; it belongs with the performance tests.

So, how do I add a test case to the performance test suite? Is there a guide, a reference example, or any other documentation I can look at for inspiration?

NB: This is a good example of yesterday's techboard meeting discussion... I'm a DTS newbie wanting to contribute a new test case; how do I get started?

If no such documentation exists, could someone please point at a couple of files representing what needs to be added?




* Re: rte_service unit test failing randomly
  2022-10-06  6:53       ` Morten Brørup
  2022-10-06  7:04         ` David Marchand
  2022-10-06  7:50         ` Mattias Rönnblom
@ 2022-10-06 13:51         ` Aaron Conole
  2 siblings, 0 replies; 15+ messages in thread
From: Aaron Conole @ 2022-10-06 13:51 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
	David Marchand, dpdklab, ci, Honnappa Nagarahalli, dev

Morten Brørup <mb@smartsharesystems.com> writes:

>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>> 
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>> > 05/10/2022 22:33, Mattias Rönnblom:
>> >> On 2022-10-05 21:14, David Marchand wrote:
>> >>> Hello,
>> >>>
>> >>> The service_autotest unit test has been failing randomly.
>> >>> This is not something new.
>
> [...]
>
>> >>> EAL: Test assert service_may_be_active line 960 failed: Error: Service
>> >>> not stopped after 100ms
>> >>>
>> >>> Ideas?
>> >>>
>> >>>
>> >>> Thanks.
>> >>
>> >> Do you run the test suite in a controlled environment? I.e., one where
>> >> you can trust that the lcore threads aren't interrupted for long periods
>> >> of time.
>> >>
>> >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for the
>> >> CPU with other threads.
>> >
>> > You mean the tests cannot be interrupted?
>> 
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>> 
>> > Then it looks very fragile.
>> 
>> Tests like this are by their very nature racy. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>> 
>> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but they
>> allow for longer interruptions.
>> 
>> That said, 100 ms sounds very short. I don't see why this couldn't be a
>> lot longer.
>> 
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests. If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racy tests for that kind of environment will make them have very long
>> run times.
>
> Forgive me, if I am sidetracking a bit here... The issue discussed
> seems to be related to some threads waiting for other threads, and my
> question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I
> see what is being done to ensure that the EAL worker threads are fully
> isolated, and never interrupted by the O/S scheduler or similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a
> NIC is configured with 4096 Rx descriptors, packet loss will occur
> after ca. 70 us (microseconds!) if not servicing the ingress queue
> when receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker
> threads [1]. What should I do if I want to run that code in the
> automated test environment? It would be for informational purposes
> only, i.e. I would manually look at the test output to see the result.

One hacky way is to post a PATCH, noting that it should never be merged,
that introduces your test case, and then look at the logs.

> I would write a test application that simply starts the O/S noise
> monitor thread as an isolated EAL worker thread, the main thread would
> then wait for 10 minutes (or some other duration), dump the result to
> the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/



* RE: [dpdklab] RE: rte_service unit test failing randomly
  2022-10-06 12:00                   ` Morten Brørup
@ 2022-10-06 17:52                     ` Honnappa Nagarahalli
  0 siblings, 0 replies; 15+ messages in thread
From: Honnappa Nagarahalli @ 2022-10-06 17:52 UTC (permalink / raw)
  To: Morten Brørup, Lincoln Lavoie
  Cc: Mattias Rönnblom, thomas, Van Haaren Harry, David Marchand,
	dpdklab, ci, Aaron Conole, dev, dts, nd, nd

<snip>

> > >
> > > On 2022-10-06 10:18, Morten Brørup wrote:
> > > >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > > >> Sent: Thursday, 6 October 2022 09.51
> > > >>
> > > >> On 2022-10-06 08:53, Morten Brørup wrote:
> > > >
> > > > [...]
> > > >
> > > >>> I have been wondering how accurate the tests really are. Where can I
> > > >> see what is being done to ensure that the EAL worker threads are fully
> > > >> isolated, and never interrupted by the O/S scheduler or similar?
> 
> [...]
> 
> > Back to the topic of unit testing, I think we need to consider their
> > purpose and where we expect them to run.  Unit tests are run in automated
> > environments, across multiple CI systems, i.e. UNH-IOL Community Lab,
> > GitHub, etc.  Those environments are typically virtualized, and I don't
> > think the unit tests should require tuning down to the level of CPU clock
> > ticks.  Those tests are likely better suited to dedicated performance
> > environments, where the complete host is tightly controlled, for the
> > purpose of repeatable and deterministic results on things like packet
> > throughput, etc.
> 
> Excellent point, Lincoln. Verifying the performance of the surrounding runtime
> environment - i.e. the host running the tests - is not unit test material; it belongs
> with the performance tests.
IIRC, the unit tests were separated into perf tests and non-perf tests. The perf tests were meant to be run on bare metal systems in the lab.

> 
> So, how do I add a test case to the performance test suite? Is there a guide, a
> reference example, or any other documentation I can look at for inspiration?
> 
> NB: This is a good example of yesterday's techboard meeting discussion... I'm a
> DTS newbie wanting to contribute a new test case; how do I get started?
> 
> If no such documentation exists, could someone please point at a couple of files
> representing what needs to be added?
In the current DTS, you can look at the documentation in doc/dts_gsg/usr_guide/intro.rst and at the hello world test case. You can also look at tests/TestSuite_hello_world.py and test_plans/hello_world_test_plan.rst.

> 



End of thread (newest message: 2022-10-06 17:52 UTC).

Thread overview: 15+ messages
2022-10-05 19:14 rte_service unit test failing randomly David Marchand
2022-10-05 20:33 ` Mattias Rönnblom
2022-10-05 20:52   ` Thomas Monjalon
2022-10-05 21:33     ` Mattias Rönnblom
2022-10-06  6:53       ` Morten Brørup
2022-10-06  7:04         ` David Marchand
2022-10-06  7:50           ` Morten Brørup
2022-10-06  7:50         ` Mattias Rönnblom
2022-10-06  8:18           ` Morten Brørup
2022-10-06  8:59             ` Mattias Rönnblom
2022-10-06  9:49               ` Morten Brørup
2022-10-06 11:07                 ` [dpdklab] " Lincoln Lavoie
2022-10-06 12:00                   ` Morten Brørup
2022-10-06 17:52                     ` Honnappa Nagarahalli
2022-10-06 13:51         ` Aaron Conole
