* Re: rte_service unit test failing randomly
2022-10-06 6:53 ` Morten Brørup
@ 2022-10-06 7:04 ` David Marchand
2022-10-06 7:50 ` Morten Brørup
2022-10-06 7:50 ` Mattias Rönnblom
2022-10-06 13:51 ` Aaron Conole
2 siblings, 1 reply; 15+ messages in thread
From: David Marchand @ 2022-10-06 7:04 UTC (permalink / raw)
To: Morten Brørup
Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev
On Thu, Oct 6, 2022 at 8:53 AM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Wednesday, 5 October 2022 23.34
> >
> > On 2022-10-05 22:52, Thomas Monjalon wrote:
> > > 05/10/2022 22:33, Mattias Rönnblom:
> > >> On 2022-10-05 21:14, David Marchand wrote:
> > >>> Hello,
> > >>>
> > >>> The service_autotest unit test has been failing randomly.
> > >>> This is not something new.
>
> [...]
>
> > >>> EAL: Test assert service_may_be_active line 960 failed: Error:
> > Service
> > >>> not stopped after 100ms
> > >>>
> > >>> Ideas?
> > >>>
> > >>>
> > >>> Thanks.
> > >>
> > >> Do you run the test suite in a controlled environment? I.e., one
> > where
> > >> you can trust that the lcore threads aren't interrupted for long
> > periods
> > >> of time.
> > >>
> > >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for
> > the
> > >> CPU with other threads.
> > >
> > > You mean the tests cannot be interrupted?
> >
> > I just took a very quick look, but it seems like the main thread can,
> > but the worker lcore thread cannot be interrupted for anything close to
> > 100 ms, or you risk a test failure.
> >
> > > Then it looks very fragile.
> >
> > Tests like this are by their very nature racy. If a test thread sends a
> > request to another thread, there is no way for it to decide when a
> > non-response should result in a test failure, unless the scheduling
> > latency of the receiving thread has an upper bound.
> >
> > If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
> > a lot of matches. I bet there are more like the service core one, but they
> > allow for longer interruptions.
> >
> > That said, 100 ms sounds very short. I don't see why this couldn't be a
> > lot longer.
> >
> > ...and that said, I would argue you still need a reasonably controlled
> > environment for the autotests. If you have a server that is arbitrarily
> > overloaded, maybe also with high memory pressure (and associated
> > instruction page faults and god-knows-what), the real-world worst-case
> > interruptions could be very long indeed. Seconds. Designing inherently
> > racy tests for that kind of environment will make them have very long
> > run times.
>
> Forgive me, if I am sidetracking a bit here... The issue discussed seems to be related to some threads waiting for other threads, and my question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I see what is being done to ensure that the EAL worker threads are fully isolated, and never interrupted by the O/S scheduler or similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a NIC is configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us (microseconds!) if not servicing the ingress queue when receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker threads [1]. What should I do if I want to run that code in the automated test environment? It would be for informational purposes only, i.e. I would manually look at the test output to see the result.
>
> I would write a test application that simply starts the O/S noise monitor thread as an isolated EAL worker thread, the main thread would then wait for 10 minutes (or some other duration), dump the result to the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
Maybe we could do some hack, like finding a test in the current CI
that matches the test requirements: number of cores, ports, setup
params, etc. (retrieving the output could be another challenge).
But if you think this is something we should have in the long run, I'd
suggest writing a new DTS test.
--
David Marchand
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: rte_service unit test failing randomly
2022-10-06 7:04 ` David Marchand
@ 2022-10-06 7:50 ` Morten Brørup
0 siblings, 0 replies; 15+ messages in thread
From: Morten Brørup @ 2022-10-06 7:50 UTC (permalink / raw)
To: David Marchand
Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev, dts
+CC: DTS mailing list.
> From: David Marchand [mailto:david.marchand@redhat.com]
> Sent: Thursday, 6 October 2022 09.05
>
> On Thu, Oct 6, 2022 at 8:53 AM Morten Brørup <mb@smartsharesystems.com>
> wrote:
[...]
> > Forgive me, if I am sidetracking a bit here... The issue discussed
> seems to be related to some threads waiting for other threads, and my
> question is not directly related to that.
> >
> > I have been wondering how accurate the tests really are. Where can I
> see what is being done to ensure that the EAL worker threads are fully
> isolated, and never interrupted by the O/S scheduler or similar?
> >
> > For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If
> a NIC is configured with 4096 Rx descriptors, packet loss will occur
> after ca. 70 us (microseconds!) if not servicing the ingress queue when
> receiving at max packet rate.
> >
> > I recently posted some code for monitoring the O/S noise in EAL
> worker threads [1]. What should I do if I want to run that code in the
> automated test environment? It would be for informational purposes
> only, i.e. I would manually look at the test output to see the result.
> >
> > I would write a test application that simply starts the O/S noise
> monitor thread as an isolated EAL worker thread, the main thread would
> then wait for 10 minutes (or some other duration), dump the result to
> the standard output, and exit the application.
> >
> > [1]:
> http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
>
> Maybe we could do some hack, like finding a test in the current CI
> that matches the test requirements: number of cores, ports, setup
> params, etc. (retrieving the output could be another challenge).
> But if you think this is something we should have in the long run, I'd
> suggest writing a new DTS test.
It would be useful to have in the long run - it could catch sudden anomalies in the test environment.
Where can I find documentation on how to add a new test case to DTS?
-Morten
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: rte_service unit test failing randomly
2022-10-06 6:53 ` Morten Brørup
2022-10-06 7:04 ` David Marchand
@ 2022-10-06 7:50 ` Mattias Rönnblom
2022-10-06 8:18 ` Morten Brørup
2022-10-06 13:51 ` Aaron Conole
2 siblings, 1 reply; 15+ messages in thread
From: Mattias Rönnblom @ 2022-10-06 7:50 UTC (permalink / raw)
To: Morten Brørup, Thomas Monjalon, Van Haaren Harry
Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev
On 2022-10-06 08:53, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>>
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>>> 05/10/2022 22:33, Mattias Rönnblom:
>>>> On 2022-10-05 21:14, David Marchand wrote:
>>>>> Hello,
>>>>>
>>>>> The service_autotest unit test has been failing randomly.
>>>>> This is not something new.
>
> [...]
>
>>>>> EAL: Test assert service_may_be_active line 960 failed: Error:
>> Service
>>>>> not stopped after 100ms
>>>>>
>>>>> Ideas?
>>>>>
>>>>>
>>>>> Thanks.
>>>>
>>>> Do you run the test suite in a controlled environment? I.e., one
>> where
>>>> you can trust that the lcore threads aren't interrupted for long
>> periods
>>>> of time.
>>>>
>>>> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for
>> the
>>>> CPU with other threads.
>>>
>>> You mean the tests cannot be interrupted?
>>
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>>
>>> Then it looks very fragile.
>>
>> Tests like this are by their very nature racy. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>>
>> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but they
>> allow for longer interruptions.
>>
>> That said, 100 ms sounds very short. I don't see why this couldn't be a
>> lot longer.
>>
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests. If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racy tests for that kind of environment will make them have very long
>> run times.
>
> Forgive me, if I am sidetracking a bit here... The issue discussed seems to be related to some threads waiting for other threads, and my question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I see what is being done to ensure that the EAL worker threads are fully isolated, and never interrupted by the O/S scheduler or similar?
>
There are kernel-level counters for how many times a thread has been
involuntarily interrupted, and also, if I recall correctly, the amount
of wall-time the thread has been runnable, but not running (i.e.,
waiting to be scheduled). The latter may require a scheduler debug
option to be enabled in the kernel build.
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a NIC is configured with 4096 Rx descriptors, packet loss will occur after ca. 70 us (microseconds!) if not servicing the ingress queue when receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker threads [1]. What should I do if I want to run that code in the automated test environment? It would be for informational purposes only, i.e. I would manually look at the test output to see the result.
>
> I would write a test application that simply starts the O/S noise monitor thread as an isolated EAL worker thread, the main thread would then wait for 10 minutes (or some other duration), dump the result to the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: rte_service unit test failing randomly
2022-10-06 7:50 ` Mattias Rönnblom
@ 2022-10-06 8:18 ` Morten Brørup
2022-10-06 8:59 ` Mattias Rönnblom
0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2022-10-06 8:18 UTC (permalink / raw)
To: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry
Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Thursday, 6 October 2022 09.51
>
> On 2022-10-06 08:53, Morten Brørup wrote:
[...]
> > I have been wondering how accurate the tests really are. Where can I
> see what is being done to ensure that the EAL worker threads are fully
> isolated, and never interrupted by the O/S scheduler or similar?
> >
>
> There are kernel-level counters for how many times a thread has been
> involuntarily interrupted,
Thanks, Mattias. I will look into that.
Old kernels (2.4 and 2.6) ascribed the time spent in interrupt handlers to the CPU usage of the running process, instead of counting the time spent in interrupt handlers separately. Does anyone know if this has been fixed?
> and also, if I recall correctly, the amount
> of wall-time the thread has been runnable, but not running (i.e.,
> waiting to be scheduled). The latter may require some scheduler debug
> kernel option being enabled on the kernel build.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: rte_service unit test failing randomly
2022-10-06 8:18 ` Morten Brørup
@ 2022-10-06 8:59 ` Mattias Rönnblom
2022-10-06 9:49 ` Morten Brørup
0 siblings, 1 reply; 15+ messages in thread
From: Mattias Rönnblom @ 2022-10-06 8:59 UTC (permalink / raw)
To: Morten Brørup, Thomas Monjalon, Van Haaren Harry
Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev
On 2022-10-06 10:18, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Thursday, 6 October 2022 09.51
>>
>> On 2022-10-06 08:53, Morten Brørup wrote:
>
> [...]
>
>>> I have been wondering how accurate the tests really are. Where can I
>> see what is being done to ensure that the EAL worker threads are fully
>> isolated, and never interrupted by the O/S scheduler or similar?
>>>
>>
>> There are kernel-level counters for how many times a thread has been
>> involuntarily interrupted,
>
> Thanks, Mattias. I will look into that.
>
> Old kernels (2.4 and 2.6) ascribed the time spent in interrupt handlers to the CPU usage of the running process, instead of counting the time spent in interrupt handlers separately. Does anyone know if this has been fixed?
>
If you mean the top half interrupt handler, my guess would be that it does
not matter, except in some strange corner cases or with faulty hardware. An
ISR should have a very short run time, and not be run *that* often (after
NAPI). With isolated cores, it should be even less of a problem, but
then you may not have that.
Bottom halves are not attributed to the process, I believe. (In old
kernels, the time spent in soft IRQs was not attributed to anything,
which could create situations where the system was very busy indeed
[e.g., with network stack bottom halves doing IP forwarding], but
looking idle in 'top'.)
>> and also, if I recall correctly, the amount
>> of wall-time the thread have been runnable, but not running (i.e.,
>> waiting to be scheduled). The latter may require some scheduler debug
>> kernel option being enabled on the kernel build.
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: rte_service unit test failing randomly
2022-10-06 8:59 ` Mattias Rönnblom
@ 2022-10-06 9:49 ` Morten Brørup
2022-10-06 11:07 ` [dpdklab] " Lincoln Lavoie
0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2022-10-06 9:49 UTC (permalink / raw)
To: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry
Cc: David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole, dev
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Thursday, 6 October 2022 10.59
>
> On 2022-10-06 10:18, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Thursday, 6 October 2022 09.51
> >>
> >> On 2022-10-06 08:53, Morten Brørup wrote:
> >
> > [...]
> >
> >>> I have been wondering how accurate the tests really are. Where can
> I
> >> see what is being done to ensure that the EAL worker threads are
> fully
> >> isolated, and never interrupted by the O/S scheduler or similar?
> >>>
> >>
> >> There are kernel-level counters for how many times a thread has
> been
> >> involuntarily interrupted,
> >
> > Thanks, Mattias. I will look into that.
> >
> > Old kernels (2.4 and 2.6) ascribed the time spent in interrupt
> handlers to the CPU usage of the running process, instead of counting
> the time spent in interrupt handlers separately. Does anyone know if
> this has been fixed?
> >
>
> If you mean top half interrupt handler, my guess would be it does not
> matter, except in some strange corner cases or faulty hardware. An ISR
> should have a very short run time, and not be run *that* often (after
> NAPI). With isolated cores, it should be even less of a problem, but
> then you may not have that.
>
Many years ago, we used a NIC that didn't have DMA and had only 4 RX descriptors, so it had to be serviced in the top half.
> Bottom halves are not attributed to the process, I believe.
This is an improvement.
> (In old
> kernels, the time spent in soft IRQs was not attributed to anything,
> which could create situations where the system was very busy indeed
> [e.g., with network stack bottom halves doing IP forwarding], but
> looking idle in 'top'.)
We also experienced that. The kernel's scheduling information was completely useless, so eventually we removed the CPU Utilization information from our GUI. ;-)
And IIRC, it wasn't fixed in kernel 2.6.
>
> >> and also, if I recall correctly, the amount
> >> of wall-time the thread has been runnable, but not running (i.e.,
> >> waiting to be scheduled). The latter may require some scheduler
> debug
> >> kernel option being enabled on the kernel build.
> >
> >
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [dpdklab] RE: rte_service unit test failing randomly
2022-10-06 9:49 ` Morten Brørup
@ 2022-10-06 11:07 ` Lincoln Lavoie
2022-10-06 12:00 ` Morten Brørup
0 siblings, 1 reply; 15+ messages in thread
From: Lincoln Lavoie @ 2022-10-06 11:07 UTC (permalink / raw)
To: Morten Brørup
Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole,
dev
On Thu, Oct 6, 2022 at 5:49 AM Morten Brørup <mb@smartsharesystems.com>
wrote:
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Thursday, 6 October 2022 10.59
> >
> > On 2022-10-06 10:18, Morten Brørup wrote:
> > >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > >> Sent: Thursday, 6 October 2022 09.51
> > >>
> > >> On 2022-10-06 08:53, Morten Brørup wrote:
> > >
> > > [...]
> > >
> > >>> I have been wondering how accurate the tests really are. Where can
> > I
> > >> see what is being done to ensure that the EAL worker threads are
> > fully
> > >> isolated, and never interrupted by the O/S scheduler or similar?
> > >>>
> > >>
> > >> There are kernel-level counters for how many times a thread has
> > been
> > >> involuntarily interrupted,
> > >
> > > Thanks, Mattias. I will look into that.
> > >
> > > Old kernels (2.4 and 2.6) ascribed the time spent in interrupt
> > handlers to the CPU usage of the running process, instead of counting
> > the time spent in interrupt handlers separately. Does anyone know if
> > this has been fixed?
> > >
> >
> > If you mean top half interrupt handler, my guess would be it does not
> > matter, except in some strange corner cases or faulty hardware. An ISR
> > should have a very short run time, and not be run *that* often (after
> > NAPI). With isolated cores, it should be even less of a problem, but
> > then you may not have that.
> >
>
> Many years ago, we used a NIC that didn't have DMA and had only 4 RX
> descriptors, so it had to be serviced in the top half.
>
> > Bottom halves are not attributed to the process, I believe.
>
> This is an improvement.
>
> > (In old
> > kernels, the time spent in soft IRQs was not attributed to anything,
> > which could create situations where the system was very busy indeed
> > [e.g., with network stack bottom halves doing IP forwarding], but
> > looking idle in 'top'.)
>
> We also experienced that. The kernel's scheduling information was
> completely useless, so eventually we removed the CPU Utilization
> information from our GUI. ;-)
>
> And IIRC, it wasn't fixed in kernel 2.6.
>
> >
> > >> and also, if I recall correctly, the amount
> > >> of wall-time the thread has been runnable, but not running (i.e.,
> > >> waiting to be scheduled). The latter may require some scheduler
> > debug
> > >> kernel option being enabled on the kernel build.
> > >
> > >
>
> Back to the topic of unit testing, I think we need to consider their
purpose and where we expect them to run. Unit tests are run in automated
environments, across multiple CI systems, i.e. UNH-IOL Community Lab,
GitHub, etc. Those environments are typically virtualized and I don't
think the unit tests should require turning down to the level of CPU clock
ticks. Those tests are likely better suited to dedicated performance
environments, where the complete host is tightly controlled, for the
purpose of repeatable and deterministic results on things like packet
throughput, etc.
Cheers,
Lincoln
--
*Lincoln Lavoie*
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [dpdklab] RE: rte_service unit test failing randomly
2022-10-06 11:07 ` [dpdklab] " Lincoln Lavoie
@ 2022-10-06 12:00 ` Morten Brørup
2022-10-06 17:52 ` Honnappa Nagarahalli
0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2022-10-06 12:00 UTC (permalink / raw)
To: Lincoln Lavoie
Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
David Marchand, dpdklab, ci, Honnappa Nagarahalli, Aaron Conole,
dev, dts
From: Lincoln Lavoie [mailto:lylavoie@iol.unh.edu]
Sent: Thursday, 6 October 2022 13.07
> On Thu, Oct 6, 2022 at 5:49 AM Morten Brørup <mb@smartsharesystems.com> wrote:
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Thursday, 6 October 2022 10.59
> >
> > On 2022-10-06 10:18, Morten Brørup wrote:
> > >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > >> Sent: Thursday, 6 October 2022 09.51
> > >>
> > >> On 2022-10-06 08:53, Morten Brørup wrote:
> > >
> > > [...]
> > >
> > >>> I have been wondering how accurate the tests really are. Where can
> > I
> > >> see what is being done to ensure that the EAL worker threads are
> > fully
> > >> isolated, and never interrupted by the O/S scheduler or similar?
[...]
> Back to the topic of unit testing, I think we need to consider their purpose and where we expect them to run. Unit tests are run in automated environments, across multiple CI systems, i.e. UNH-IOL Community Lab, GitHub, etc. Those environments are typically virtualized and I don't think the unit tests should require turning down to the level of CPU clock ticks. Those tests are likely better suited to dedicated performance environments, where the complete host is tightly controlled, for the purpose of repeatable and deterministic results on things like packet throughput, etc.
Excellent point, Lincoln. Verifying the performance of the surrounding runtime environment - i.e. the host running the tests - is not unit test material; it belongs with the performance tests.
So, how do I add a test case to the performance test suite? Is there a guide, a reference example, or any other documentation I can look at for inspiration?
NB: This is a good example of yesterday's techboard meeting discussion... I'm a DTS newbie wanting to contribute a new test case - how do I get started?
If no such documentation exists, could someone please point at a couple of files representing what needs to be added?
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [dpdklab] RE: rte_service unit test failing randomly
2022-10-06 12:00 ` Morten Brørup
@ 2022-10-06 17:52 ` Honnappa Nagarahalli
0 siblings, 0 replies; 15+ messages in thread
From: Honnappa Nagarahalli @ 2022-10-06 17:52 UTC (permalink / raw)
To: Morten Brørup, Lincoln Lavoie
Cc: Mattias Rönnblom, thomas, Van Haaren Harry, David Marchand,
dpdklab, ci, Aaron Conole, dev, dts, nd, nd
<snip>
> > >
> > > On 2022-10-06 10:18, Morten Brørup wrote:
> > > >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > > >> Sent: Thursday, 6 October 2022 09.51
> > > >>
> > > >> On 2022-10-06 08:53, Morten Brørup wrote:
> > > >
> > > > [...]
> > > >
> > > >>> I have been wondering how accurate the tests really are. Where can
> > > I
> > > >> see what is being done to ensure that the EAL worker threads are
> > > fully
> > > >> isolated, and never interrupted by the O/S scheduler or similar?
>
> [...]
>
> > Back to the topic of unit testing, I think we need to consider their purpose and
> where we expect them to run. Unit tests are run in automated environments,
> across multiple CI systems, i.e. UNH-IOL Community Lab, GitHub, etc. Those
> environments are typically virtualized and I don't think the unit tests should
> require turning down to the level of CPU clock ticks. Those tests are likely better
> suited to dedicated performance environments, where the complete host is
> tightly controlled, for the purpose of repeatable and deterministic results on
> things like packet throughput, etc.
>
> Excellent point, Lincoln. Verifying the performance of the surrounding runtime
> environment - i.e. the host running the tests - is not unit test material, it belongs
> with the performance tests.
IIRC, the unit tests were separated into perf tests and non-perf tests. The perf tests were meant to be run on bare metal systems in the lab.
>
> So, how do I add a test case to the performance test suite? Is there a guide, a
> reference example, or any other documentation I can look at for inspiration?
>
> NB: This is a good example of yesterday's techboard meeting discussion... I'm a
> DTS newbie wanting to contribute with a new test case, how do I get started?
>
> If no such documentation exists, could someone please point at a couple of files
> representing what needs to be added?
In the current DTS, you can find some documentation at doc/dts_gsg/usr_guide/intro.rst and look at the hello world test case. You can also look at tests/TestSuite_hello_world.py and test_plans/hello_world_test_plan.rst.
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: rte_service unit test failing randomly
2022-10-06 6:53 ` Morten Brørup
2022-10-06 7:04 ` David Marchand
2022-10-06 7:50 ` Mattias Rönnblom
@ 2022-10-06 13:51 ` Aaron Conole
2 siblings, 0 replies; 15+ messages in thread
From: Aaron Conole @ 2022-10-06 13:51 UTC (permalink / raw)
To: Morten Brørup
Cc: Mattias Rönnblom, Thomas Monjalon, Van Haaren Harry,
David Marchand, dpdklab, ci, Honnappa Nagarahalli, dev
Morten Brørup <mb@smartsharesystems.com> writes:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 5 October 2022 23.34
>>
>> On 2022-10-05 22:52, Thomas Monjalon wrote:
>> > 05/10/2022 22:33, Mattias Rönnblom:
>> >> On 2022-10-05 21:14, David Marchand wrote:
>> >>> Hello,
>> >>>
>> >>> The service_autotest unit test has been failing randomly.
>> >>> This is not something new.
>
> [...]
>
>> >>> EAL: Test assert service_may_be_active line 960 failed: Error:
>> Service
>> >>> not stopped after 100ms
>> >>>
>> >>> Ideas?
>> >>>
>> >>>
>> >>> Thanks.
>> >>
>> >> Do you run the test suite in a controlled environment? I.e., one
>> where
>> >> you can trust that the lcore threads aren't interrupted for long
>> periods
>> >> of time.
>> >>
>> >> 100 ms is not a long time if a SCHED_OTHER lcore thread competes for
>> the
>> >> CPU with other threads.
>> >
>> > You mean the tests cannot be interrupted?
>>
>> I just took a very quick look, but it seems like the main thread can,
>> but the worker lcore thread cannot be interrupted for anything close to
>> 100 ms, or you risk a test failure.
>>
>> > Then it looks very fragile.
>>
>> Tests like this are by their very nature racy. If a test thread sends a
>> request to another thread, there is no way for it to decide when a
>> non-response should result in a test failure, unless the scheduling
>> latency of the receiving thread has an upper bound.
>>
>> If you grep for "sleep", or "delay", in app/test/test_*.c, you will get
>> a lot of matches. I bet there are more like the service core one, but they
>> allow for longer interruptions.
>>
>> That said, 100 ms sounds very short. I don't see why this couldn't be a
>> lot longer.
>>
>> ...and that said, I would argue you still need a reasonably controlled
>> environment for the autotests. If you have a server that is arbitrarily
>> overloaded, maybe also with high memory pressure (and associated
>> instruction page faults and god-knows-what), the real-world worst-case
>> interruptions could be very long indeed. Seconds. Designing inherently
>> racy tests for that kind of environment will make them have very long
>> run times.
>
> Forgive me, if I am sidetracking a bit here... The issue discussed
> seems to be related to some threads waiting for other threads, and my
> question is not directly related to that.
>
> I have been wondering how accurate the tests really are. Where can I
> see what is being done to ensure that the EAL worker threads are fully
> isolated, and never interrupted by the O/S scheduler or similar?
>
> For reference, the max packet rate at 40 Gbit/s is 59.52 M pkt/s. If a
> NIC is configured with 4096 Rx descriptors, packet loss will occur
> after ca. 70 us (microseconds!) if not servicing the ingress queue
> when receiving at max packet rate.
>
> I recently posted some code for monitoring the O/S noise in EAL worker
> threads [1]. What should I do if I want to run that code in the
> automated test environment? It would be for informational purposes
> only, i.e. I would manually look at the test output to see the result.
One hacky way is to post a PATCH stating that it should never be merged,
but that introduces your test case, and then look at the logs.
> I would write a test application that simply starts the O/S noise
> monitor thread as an isolated EAL worker thread, the main thread would
> then wait for 10 minutes (or some other duration), dump the result to
> the standard output, and exit the application.
>
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87352@smartserver.smartshare.dk/
^ permalink raw reply [flat|nested] 15+ messages in thread