DPDK CI discussions
* pcapng_autotest unit test false positive
From: David Marchand @ 2024-03-20 18:02 UTC
  To: Stephen Hemminger; +Cc: dev, Patrick Robb, Aaron Conole, ci

Hello Stephen,

I noticed a (time based?) failure of the pcapng unit test in some UNH
Debian 11 container.
Please have a look.

https://lab.dpdk.org/results/dashboard/patchsets/29604/

----------------------------------- stdout -----------------------------------
RTE>>pcapng_autotest
 + ------------------------------------------------------- +
 + Test Suite : Test Pcapng Unit Test Suite
 + ------------------------------------------------------- +
pcapng: output file /tmp/pcapng_test_oIueHb.pcapng
 + TestCase [ 0] : test_add_interface succeeded
pcapng: output file /tmp/pcapng_test_4hbuWV.pcapng
16:51:22.955616600: EE:47:6C:93:DE:F0 -> FF:FF:FF:FF:FF:FF type 800 length 200
 + TestCase [ 1] : test_write_packets failed
 + ------------------------------------------------------- +
 + Test Suite Summary : Test Pcapng Unit Test Suite
 + ------------------------------------------------------- +
 + Tests Total :        2
 + Tests Skipped :      0
 + Tests Executed :     2
 + Tests Unsupported:   0
 + Tests Passed :       1
 + Tests Failed :       1
 + ------------------------------------------------------- +
Test Failed
RTE>>
----------------------------------- stderr -----------------------------------
EAL: Detected CPU lcores: 16
EAL: Detected NUMA nodes: 2
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: VFIO support initialized
EAL: Device 0000:03:00.0 is not NUMA-aware
APP: HPET is not enabled, using TSC as default timer
Timestamp out of range [16:51:16.481074161 .. 16:51:22.953203736]
pcap_dispatch: failed:


-- 
David Marchand



* Re: pcapng_autotest unit test false positive
From: Patrick Robb @ 2024-03-22 17:12 UTC
  To: David Marchand; +Cc: Stephen Hemminger, dev, Aaron Conole, ci, Cody Cheng

Hi David,

Yes, I'm seeing pcapng_autotest fail intermittently on Debian 11
recently. It also got flagged on Slack, where Stephen indicated it is
likely a lab infra failure.

Based on the logs above, I guess it is a timestamp error from TSC
(as you can see, HPET is not used). Indeed, the write_packets test
that is failing does call rte_get_tsc_cycles().
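
To put a number on the log above: the packet timestamp printed on stdout
(16:51:22.955616600) can be compared against the window reported on stderr
([16:51:16.481074161 .. 16:51:22.953203736]). The following is only a
small arithmetic sketch; tod_to_ns() and ns_outside_window() are
hypothetical helper names, not functions from DPDK or from the unit test.

```c
#include <stdint.h>

#define NS_PER_S 1000000000ULL

/* Nanoseconds since midnight for an hh:mm:ss.nnnnnnnnn time of day. */
uint64_t tod_to_ns(unsigned int h, unsigned int m, unsigned int s,
                   uint64_t frac_ns)
{
    return ((uint64_t)h * 3600 + (uint64_t)m * 60 + s) * NS_PER_S + frac_ns;
}

/*
 * Distance of ts from the window [start, end]:
 * 0 if inside, negative if before start, positive if after end.
 */
int64_t ns_outside_window(uint64_t ts, uint64_t start, uint64_t end)
{
    if (ts < start)
        return -(int64_t)(start - ts);
    if (ts > end)
        return (int64_t)(ts - end);
    return 0;
}
```

With the three values from the log, ns_outside_window() returns 2412864,
i.e. the packet timestamp is about 2.4 ms past the end of the window.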

So, here are some steps I think we should take:

1. Refresh the Debian 11 test container image we are using in CI testing.
2. Reset the VM the containers are running on (which also resets the
TSC cycle counter, of course).
3. Re-image the VMs that the test containers run on, bringing them to
Ubuntu 22.04 and a newer kernel version.

If that fails, I guess we can also look at using HPET in place of TSC.
It looks like (provided you have set the right bootloader option) you
pass -Duse_hpet=true to meson for this. But it looks like the unit
test is written to use TSC rather than HPET anyway, so I don't think
this is relevant.

Does this sound reasonable? If so we will proceed.

@Cody Cheng Since I know you are refreshing the lab container images
anyway, let's bring Debian 11 to the front of the queue. Thanks.


* Re: pcapng_autotest unit test false positive
From: Stephen Hemminger @ 2024-03-22 23:34 UTC
  To: Patrick Robb; +Cc: David Marchand, dev, Aaron Conole, ci, Cody Cheng

Could you build a simple test to see if TSC ever runs backwards on
this machine? Or there could be yet another math error.
Or maybe the container TSC is huge and wrapping around?

The point of the test is to make sure that there weren't wraparound errors.
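
As an illustration of the kind of math error meant here, this is a
minimal sketch of a cycles-to-nanoseconds conversion. It is not taken
from the pcapng library; the function names and the 3 GHz rate used in
the note below are assumptions for the example. The naive form
multiplies first, so the product wraps modulo 2^64 once the cycle count
exceeds roughly 1.8e10; the split form keeps each product small.

```c
#include <stdint.h>

#define NS_PER_S 1000000000ULL

/* Naive: cycles * NS_PER_S wraps modulo 2^64 for large cycle counts. */
uint64_t cycles_to_ns_naive(uint64_t cycles, uint64_t hz)
{
    return cycles * NS_PER_S / hz;
}

/*
 * Split into whole seconds and a sub-second remainder. Since rem < hz,
 * rem * NS_PER_S cannot overflow for any hz below about 18 GHz.
 */
uint64_t cycles_to_ns_split(uint64_t cycles, uint64_t hz)
{
    uint64_t secs = cycles / hz;
    uint64_t rem = cycles % hz;

    return secs * NS_PER_S + rem * NS_PER_S / hz;
}
```

At an assumed 3 GHz, 30000000000 cycles is 10 seconds: the split version
returns 10000000000 ns, while the naive version wraps and returns a
smaller, wrong value.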


* Re: pcapng_autotest unit test false positive
From: Patrick Robb @ 2024-04-01 22:26 UTC
  To: Stephen Hemminger; +Cc: David Marchand, dev, Aaron Conole, ci, Cody Cheng

On Fri, Mar 22, 2024 at 7:34 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
>
> Could you build a simple test to see if TSC ever runs backwards on
> this machine? Or there could be yet another math error.
> Or maybe the container TSC is huge and wrapping around?
>
> The point of the test is to make sure that there weren't wraparound errors.

Sorry about the wait on this one, but we did write that simple C
program to check whether TSC ever runs backwards on this system.
It reads TSC using __rdtsc() because that's the same approach as in the
x86 rte_cycles.c. It loops for 10 seconds or so, comparing each TSC
reading to the previous one, and if a reading is ever less than the
previous one it prints a message saying so. Otherwise, at the end it
prints that TSC is working normally. On the first run, it showed TSC
never running backwards. Another thing I can do is trigger a full set
of testing (so that the system is under normal load) and then run the
TSC checking program concurrently.
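
For anyone who wants to repeat the check, here is a minimal sketch of
that kind of program. It is not the lab's actual code: the function
names are made up for illustration, and the non-x86 fallback to clock()
is an assumption added only so the sketch builds elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the raw time stamp counter. */
static inline uint64_t read_counter(void) { return __rdtsc(); }
#else
/* Assumed stand-in for hosts without __rdtsc(): processor time ticks. */
static inline uint64_t read_counter(void) { return (uint64_t)clock(); }
#endif

/* Count positions where a sample is smaller than the one before it. */
size_t count_backward_steps(const uint64_t *samples, size_t n)
{
    size_t backward = 0;

    for (size_t i = 1; i < n; i++)
        if (samples[i] < samples[i - 1])
            backward++;
    return backward;
}

/* Poll the counter for roughly `seconds` and report backward steps. */
size_t poll_counter(unsigned int seconds)
{
    time_t end = time(NULL) + (time_t)seconds;
    uint64_t prev = read_counter();
    size_t backward = 0;

    while (time(NULL) < end) {
        uint64_t cur = read_counter();

        if (cur < prev) {
            printf("counter went backwards: %llu -> %llu\n",
                   (unsigned long long)prev, (unsigned long long)cur);
            backward++;
        }
        prev = cur;
    }
    return backward;
}
```

Calling poll_counter(10) from main() and printing "TSC is working
normally" when it returns 0 gives the behaviour described above. Note
that a single thread like this only observes the core it happens to be
scheduled on.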

Another idea - maybe multiple timestamps are gathered from different
CPU registers during the same test, and they are misaligned for that
reason. Maybe we can try reducing the cores for each unit test to 1
and checking whether the issue persists.

Or there could be another math error as you say.

And I should mention that, now that I'm looking at this more closely, I
see that all these failing results are unfortunately coming from a
new Debian 12 x86 environment which was added a few weeks ago but
mistakenly labeled as Debian 11 x86. So the fact that failures started
recently can be explained by our adding this new Debian 12 container.

So, I'll try a few more things and keep y'all updated.


* Re: pcapng_autotest unit test false positive
From: Stephen Hemminger @ 2024-04-02  0:46 UTC
  To: Patrick Robb; +Cc: David Marchand, dev, Aaron Conole, ci, Cody Cheng

On Mon, 1 Apr 2024 18:26:44 -0400
Patrick Robb <probb@iol.unh.edu> wrote:

> Another idea - maybe multiple timestamps are gathered from different
> CPU registers during the same test, and they are misaligned for that
> reason. Maybe we can try reducing the cores for each unit test to 1
> and checking whether the issue persists.

TSC is expected to be sync'd between cores. But of course packets can
arrive out of order on different cores.
