DPDK CI discussions
* [dpdk-ci] CI reliability
@ 2020-05-24  9:50 Thomas Monjalon
  2020-05-26 20:27 ` Lincoln Lavoie
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Monjalon @ 2020-05-24  9:50 UTC (permalink / raw)
  To: ci; +Cc: j.hendergart, Lincoln Lavoie

Hi all,

I think we have a CI reliability issue in general.
Perhaps we lack some alert mechanism warning test platform maintainers
when too many tests are failing.

Recent example: the community lab compilation test has been failing on
Fedora 31 for at least 2 weeks, and I don't see any action to fix it:
	https://lab.dpdk.org/results/dashboard/patchsets/11040/

Because of such recurring errors, the whole CI becomes irrelevant.
Please, we need to take action to avoid such issues in the near future.




* Re: [dpdk-ci] CI reliability
  2020-05-24  9:50 [dpdk-ci] CI reliability Thomas Monjalon
@ 2020-05-26 20:27 ` Lincoln Lavoie
  2020-05-26 21:10   ` Thomas Monjalon
  0 siblings, 1 reply; 3+ messages in thread
From: Lincoln Lavoie @ 2020-05-26 20:27 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: ci, James Hendergart


Hi Thomas,

This has been fixed as of yesterday.  The failure was caused by a commit to
the SPDK repos that changed how they pull in their dependencies, in a way
that is not compatible with Docker.  The team created a workaround, so
that case is fixed, but there is always a risk that other commits of
that type could cause a failure in the containers.

I asked Brandon to change the scripts that run the testing in the
containers to try to catch failures from Docker separately, so they can be
flagged as infrastructure failures, as opposed to failures of the build.
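
The separation described above could be sketched roughly as follows. This is only an illustration, not the lab's actual scripts: the split into a "setup" step and a "build" step, and the names run_step, classify_build and run_job, are assumptions made for the example. The one fact relied on is that `docker run` reserves exit codes 125, 126 and 127 for its own errors, so those are treated as infrastructure rather than build failures.

```python
import subprocess
from typing import Callable, Sequence

# Exit codes that `docker run` reserves for itself: 125 (Docker daemon
# error), 126 (command cannot be invoked), 127 (command not found).
DOCKER_RESERVED_CODES = {125, 126, 127}


def run_step(cmd: Sequence[str]) -> int:
    """Run one step and return its exit code (hypothetical helper)."""
    try:
        return subprocess.run(cmd).returncode
    except FileNotFoundError:
        # The executable itself is missing on the host: report it the
        # same way a shell would, so it is classified as infrastructure.
        return 127


def classify_build(build_rc: int) -> str:
    """Map the exit code of the containerized build to a result label."""
    if build_rc == 0:
        return "pass"
    if build_rc in DOCKER_RESERVED_CODES:
        return "infrastructure"
    return "build"


def run_job(
    setup_cmd: Sequence[str],
    build_cmd: Sequence[str],
    runner: Callable[[Sequence[str]], int] = run_step,
) -> str:
    """Run setup first; any setup failure is an infrastructure failure."""
    if runner(setup_cmd) != 0:
        return "infrastructure"
    return classify_build(runner(build_cmd))
```

One caveat with this approach: a build script that itself exits with 126 or 127 would also be labelled as infrastructure, which may or may not be what is wanted.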

I'm also very surprised that this was not raised during the CI meeting, or by
anyone else.  I'm wondering if this is caused by the actual error logs
being a little abstracted from the emails, i.e. they are a link and a zip
file away from the actual email text, so maybe folks are not really looking
into the output as closely as they should be.  Is this something we can
make better by including more detail in the email text, so issues are
caught more quickly?
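
As a rough idea of what "more detail in the email text" might look like, here is a minimal sketch that pulls a short excerpt out of a build log so it can be pasted into the report body. The function name error_excerpt, the "error" keyword match and the 20-line limit are all assumptions for illustration; the real report generator may work quite differently.

```python
def error_excerpt(log_text: str, max_lines: int = 20) -> str:
    """Return a short excerpt of a build log for inclusion in an email.

    Lines mentioning "error" (case-insensitive) are preferred; if there
    are none, the tail of the log is used instead.
    """
    lines = log_text.splitlines()
    hits = [line for line in lines if "error" in line.lower()]
    if hits:
        selected = hits[:max_lines]
    else:
        selected = lines[-max_lines:]
    return "\n".join(selected)
```

The excerpt would complement the link and zip file rather than replace them, since the full log is still needed for real debugging.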

Cheers,
Lincoln


-- 
*Lincoln Lavoie*
Senior Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)


* Re: [dpdk-ci] CI reliability
  2020-05-26 20:27 ` Lincoln Lavoie
@ 2020-05-26 21:10   ` Thomas Monjalon
  0 siblings, 0 replies; 3+ messages in thread
From: Thomas Monjalon @ 2020-05-26 21:10 UTC (permalink / raw)
  To: Lincoln Lavoie; +Cc: ci, James Hendergart

26/05/2020 22:27, Lincoln Lavoie:
> On Sun, May 24, 2020 at 5:50 AM Thomas Monjalon <thomas@monjalon.net> wrote:
> 
> > Hi all,
> >
> > I think we have a CI reliability issue in general.
> > Perhaps we lack some alert mechanism warning test platform maintainers
> > when too many tests are failing.
> >
> > Recent example: the community lab compilation test has been failing on
> > Fedora 31 for at least 2 weeks, and I don't see any action to fix it:
> >         https://lab.dpdk.org/results/dashboard/patchsets/11040/
> >
> > Because of such recurring errors, the whole CI becomes irrelevant.
> 
> This has been fixed as of yesterday.  The failure was caused by a commit to
> the SPDK repos that changed how they pull in their dependencies, in a way
> that is not compatible with Docker.  The team created a workaround, so
> that case is fixed, but there is always a risk that other commits of
> that type could cause a failure in the containers.

Thanks for fixing it.

> I asked Brandon to change the scripts that run the testing in the
> containers to try to catch failures from Docker separately, so they can be
> flagged as infrastructure failures, as opposed to failures of the build.

Yes, good idea.

When compiling external projects, we can see some errors which
are not due to the DPDK patch.
I guess we validate any upgrade of the external projects
before making them live?


> I'm also very surprised that this was not raised during the CI meeting, or by
> anyone else.  I'm wondering if this is caused by the actual error logs
> being a little abstracted from the emails, i.e. they are a link and a zip
> file away from the actual email text, so maybe folks are not really looking
> into the output as closely as they should be.  Is this something we can
> make better by including more detail in the email text, so issues are
> caught more quickly?

I think the table in the report is already quite expressive.

As I proposed above, I think we need better monitoring.
If the same test is failing on many DPDK patches, it should raise an alarm.



