DPDK CI discussions
* [dpdk-ci] CI reliability
@ 2020-05-24  9:50 Thomas Monjalon
  2020-05-26 20:27 ` Lincoln Lavoie
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Monjalon @ 2020-05-24  9:50 UTC (permalink / raw)
  To: ci; +Cc: j.hendergart, Lincoln Lavoie

Hi all,

I think we have a CI reliability issue in general.
Perhaps we lack an alert mechanism to warn test platform maintainers
when too many tests are failing.

Recent example: the community lab compilation test has been failing on
Fedora 31 for at least 2 weeks, and I don't see any action to fix it:
	https://lab.dpdk.org/results/dashboard/patchsets/11040/

Because of such recurring errors, the whole CI becomes irrelevant.
Please, we need to take action to avoid such issues in the near future.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [dpdk-ci] CI reliability
  2020-05-24  9:50 [dpdk-ci] CI reliability Thomas Monjalon
@ 2020-05-26 20:27 ` Lincoln Lavoie
  2020-05-26 21:10   ` Thomas Monjalon
  0 siblings, 1 reply; 6+ messages in thread
From: Lincoln Lavoie @ 2020-05-26 20:27 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: ci, James Hendergart


Hi Thomas,

This has been fixed as of yesterday.  The failure was caused by a commit to
the SPDK repos changing how they pull in their dependencies, in a way that
is not compatible with Docker.  The team created a workaround, so that case
is fixed, but there is always a risk that other commits of that type could
cause failures in the containers.

I asked Brandon to change the scripts that run the tests in the containers
to catch Docker failures separately, so they can be flagged as
infrastructure failures rather than build failures.

I'm also very surprised this was not raised during the CI meeting, or by
anyone else.  I'm wondering if this is because the actual error logs are a
little removed from the emails, i.e. they are a link and a zip file away
from the email text itself, so maybe folks are not looking into the output
as closely as they should be.  Is this something we can improve by
including more detail in the email text, so issues are caught more quickly?

Cheers,
Lincoln

On Sun, May 24, 2020 at 5:50 AM Thomas Monjalon <thomas@monjalon.net> wrote:

> Hi all,
>
> I think we have a CI reliability issue in general.
> Perhaps we lack an alert mechanism to warn test platform maintainers
> when too many tests are failing.
>
> Recent example: the community lab compilation test has been failing on
> Fedora 31 for at least 2 weeks, and I don't see any action to fix it:
>         https://lab.dpdk.org/results/dashboard/patchsets/11040/
>
> Because of such recurring errors, the whole CI becomes irrelevant.
> Please, we need to take action to avoid such issues in the near future.
>
>
>

-- 
*Lincoln Lavoie*
Senior Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [dpdk-ci] CI reliability
  2020-05-26 20:27 ` Lincoln Lavoie
@ 2020-05-26 21:10   ` Thomas Monjalon
  0 siblings, 0 replies; 6+ messages in thread
From: Thomas Monjalon @ 2020-05-26 21:10 UTC (permalink / raw)
  To: Lincoln Lavoie; +Cc: ci, James Hendergart

26/05/2020 22:27, Lincoln Lavoie:
> On Sun, May 24, 2020 at 5:50 AM Thomas Monjalon <thomas@monjalon.net> wrote:
> 
> > Hi all,
> >
> > I think we have a CI reliability issue in general.
> > Perhaps we lack an alert mechanism to warn test platform maintainers
> > when too many tests are failing.
> >
> > Recent example: the community lab compilation test has been failing on
> > Fedora 31 for at least 2 weeks, and I don't see any action to fix it:
> >         https://lab.dpdk.org/results/dashboard/patchsets/11040/
> >
> > Because of such recurring errors, the whole CI becomes irrelevant.
> 
> This has been fixed as of yesterday.  The failure was caused by a commit to
> the SPDK repos changing how they pull in their dependencies, in a way that
> is not compatible with Docker.  The team created a workaround, so that case
> is fixed, but there is always a risk that other commits of that type could
> cause failures in the containers.

Thanks for fixing it.

> I asked Brandon to change the scripts that run the tests in the containers
> to catch Docker failures separately, so they can be flagged as
> infrastructure failures rather than build failures.

Yes, good idea.

When compiling external projects, we can see some errors which
are not due to the DPDK patch.
I guess we validate any upgrade of the external projects
before making it live?


> I'm also very surprised this was not raised during the CI meeting, or by
> anyone else.  I'm wondering if this is because the actual error logs are a
> little removed from the emails, i.e. they are a link and a zip file away
> from the email text itself, so maybe folks are not looking into the output
> as closely as they should be.  Is this something we can improve by
> including more detail in the email text, so issues are caught more quickly?

I think the table in the report is already quite expressive.

As I proposed above, I think we need better monitoring.
If the same test is failing on many DPDK patches, it should raise an alarm.
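
For example, a small monitoring sketch along these lines could count how
many consecutive patches fail the same test in the same environment and
alert past a threshold (the dashboard endpoint and JSON field names below
are assumptions for the sketch, not the lab's real API):

    import collections
    import json
    import urllib.request

    # Hypothetical endpoint and field names, for illustration only.
    RESULTS_URL = "https://lab.dpdk.org/results/api/recent-results.json"
    THRESHOLD = 5  # consecutive failures before raising an alarm

    def failing_streaks(results):
        """Count consecutive failures per (test, environment), newest first."""
        streaks = collections.Counter()
        closed = set()
        for r in results:  # assumed ordered newest to oldest
            key = (r["test_name"], r["environment"])
            if key in closed:
                continue
            if r["result"] == "FAIL":
                streaks[key] += 1
            else:
                closed.add(key)  # streak broken by a passing run
        return streaks

    def main():
        with urllib.request.urlopen(RESULTS_URL) as resp:
            results = json.load(resp)
        for (test, env), count in failing_streaks(results).items():
            if count >= THRESHOLD:
                print(f"ALERT: {test} on {env} failed {count} patches in a row")

    if __name__ == "__main__":
        main()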




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [dpdk-ci] CI reliability
  2021-06-02 12:55 ` Lincoln Lavoie
@ 2021-06-02 13:34   ` Thomas Monjalon
  0 siblings, 0 replies; 6+ messages in thread
From: Thomas Monjalon @ 2021-06-02 13:34 UTC (permalink / raw)
  To: Lincoln Lavoie; +Cc: ci, aconole

02/06/2021 14:55, Lincoln Lavoie:
> Hi Thomas,
> 
> The unit tests that fail are nearly always the same specific unit tests.
> Aaron addressed some of these (cycles_autotest and test_alarm) in a patch
> that has yet to be applied to DPDK.  The other one that still consistently
> fails is func_reentrancy_autotest.  That test can pass in one run and fail
> in the next, and we have not been able to determine a root cause for it
> yet.  Maybe that is something the devs could help look into.

Yes, we should definitely help and apply fixes in DPDK.

> Other failures have been caused by DTS.  As part of the plan, we've been
> trying to upgrade the DTS deployments on the system, so that as the other
> changes are made, we can easily pull them in.  However, pulling in the new
> DTS version has also pulled in bugs that exist in that version.  For
> example, the stats test suite was changed to no longer skip
> test_xstats_check_vf when no VMs are configured on the system, so when the
> overall suite was run, it failed on the bare-metal hosts where no VMs are
> configured right now.  Every time the lab has to upgrade DTS, we run the
> risk of introducing these types of failures, which then take time to debug
> and fix.

For non-transient issues like this, we should not deploy a new DTS version
if it introduces regressions.
Is it possible to deploy an older, patched version of DTS?


> On Wed, Jun 2, 2021 at 3:27 AM Thomas Monjalon <thomas@monjalon.net> wrote:
> > I see a lot of failures in the CI, especially unit tests run in UNH IOL.
> > It seems to have been failing for several weeks, but I did not investigate further.
> > What is the cause and what is the plan?
> > Should we rely on CI results?




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [dpdk-ci] CI reliability
  2021-06-02  7:27 Thomas Monjalon
@ 2021-06-02 12:55 ` Lincoln Lavoie
  2021-06-02 13:34   ` Thomas Monjalon
  0 siblings, 1 reply; 6+ messages in thread
From: Lincoln Lavoie @ 2021-06-02 12:55 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: ci, aconole


Hi Thomas,

The unit tests that fail are nearly always the same specific unit tests.
Aaron addressed some of these (cycles_autotest and test_alarm) in a patch
that has yet to be applied to DPDK.  The other one that still consistently
fails is func_reentrancy_autotest.  That test can pass in one run and fail
in the next, and we have not been able to determine a root cause for it
yet.  Maybe that is something the devs could help look into.

Other failures have been caused by DTS.  As part of the plan, we've been
trying to upgrade the DTS deployments on the system, so that as the other
changes are made, we can easily pull them in.  However, pulling in the new
DTS version has also pulled in bugs that exist in that version.  For
example, the stats test suite was changed to no longer skip
test_xstats_check_vf when no VMs are configured on the system, so when the
overall suite was run, it failed on the bare-metal hosts where no VMs are
configured right now.  Every time the lab has to upgrade DTS, we run the
risk of introducing these types of failures, which then take time to debug
and fix.
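
As an illustration of the kind of guard that was lost, a generic
unittest-style sketch (not DTS's actual API) would skip the VF xstats case
on hosts without VFs/VMs instead of failing it:

    import os
    import unittest

    def vf_devices_available():
        # Rough check for SR-IOV VFs via sysfs; in DTS the equivalent
        # information would come from the test bed configuration.
        sysfs = "/sys/bus/pci/devices"
        if not os.path.isdir(sysfs):
            return False
        return any(os.path.exists(os.path.join(sysfs, dev, "physfn"))
                   for dev in os.listdir(sysfs))

    class TestXstats(unittest.TestCase):
        def test_xstats_check_vf(self):
            if not vf_devices_available():
                # Skip rather than fail on bare-metal hosts with no
                # VFs/VMs configured, as the older DTS behaviour did.
                self.skipTest("no VF devices configured on this system")
            # ... the actual VF xstats checks would go here ...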

Cheers,
Lincoln

On Wed, Jun 2, 2021 at 3:27 AM Thomas Monjalon <thomas@monjalon.net> wrote:

> I see a lot of failures in the CI, especially unit tests run in UNH IOL.
> It seems to have been failing for several weeks, but I did not investigate further.
> What is the cause and what is the plan?
> Should we rely on CI results?
>
>
>

-- 
*Lincoln Lavoie*
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [dpdk-ci] CI reliability
@ 2021-06-02  7:27 Thomas Monjalon
  2021-06-02 12:55 ` Lincoln Lavoie
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Monjalon @ 2021-06-02  7:27 UTC (permalink / raw)
  To: ci; +Cc: aconole

I see a lot of failures in the CI, especially unit tests run in UNH IOL.
It seems to have been failing for several weeks, but I did not investigate further.
What is the cause and what is the plan?
Should we rely on CI results?



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-06-02 13:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-24  9:50 [dpdk-ci] CI reliability Thomas Monjalon
2020-05-26 20:27 ` Lincoln Lavoie
2020-05-26 21:10   ` Thomas Monjalon
2021-06-02  7:27 Thomas Monjalon
2021-06-02 12:55 ` Lincoln Lavoie
2021-06-02 13:34   ` Thomas Monjalon
