DPDK CI discussions
From: Thomas Monjalon <thomas@monjalon.net>
To: Lincoln Lavoie <lylavoie@iol.unh.edu>
Cc: ci@dpdk.org
Subject: Re: [dpdk-ci] UNH-IOL Community Lab Downtime Post Mortem
Date: Mon, 11 Oct 2021 10:36:53 +0200
Message-ID: <1933604.Tb0Rl1FMR3@thomas>
In-Reply-To: <CAOE1vsNMSBid3VzduNrfHjnHNS_9ByRbRE-td2O1AU10ARtKxw@mail.gmail.com>

I have two main concerns:
1/ We were not notified of the issue.
2/ Restoring the system took 7 days.

08/10/2021 22:29, Lincoln Lavoie:
> Hello All,
> During the CI meeting, there was a request to provide the post mortem
> review of the recent unplanned downtime.
> Timeline:
> * September 27, 8:30am - What should have been a routine upgrade to the
> Jenkins server failed, triggering the downtime.
> * September 27, 8:40am - Failed upgrade detected through combination of
> automated notifications and job failures in Jenkins.
> * September 27 - October 3 - UNH Team worked to restore the system to the
> original configuration.
> * October 3, 4pm - Server functionality restored
> * October 4, 11:30am - Jenkins pipelines re-enabled for compile and unit
> testing
> * October 5, 11am - Jenkins pipeline for bare-metal performance and
> functional testing re-enabled, after nominal debug / trial run.
> Root Cause:
> The ansible script / playbook used to maintain the lab (including the
> Jenkins server) caused a trust failure of kerberos (between the server and
> the IPA domain controller) used to secure the NFS mounts hosting the
> Jenkins databases, configuration, log output, etc.  This prevented Jenkins
> from starting properly and complicated the restoration of the Jenkins
> service.
> Changes:
> 1. Per the community request, UNH will provide notice to the CI email list
> prior to upgrades, even for routine maintenance upgrades.
> 2. The UNH-IOL notification / monitoring server will be configured to also
> send notifications to the CI email list.  Note, you will see all
> notifications, including routine maintenance, e.g. host reboots.  This
> was indicated as acceptable during the CI meeting.
> 3. This email summary.
> As of Friday afternoon, Jenkins has "caught up" and has a queue of about
> 20 jobs, which is about 1 patch's worth of testing.  Please let me know
> if there are any questions or if anything else looks incorrect in the test
> results.  We apologize for the inconvenience this caused, while waiting for
> the automated testing to be restored.
> Cheers,
> Lincoln
