From: Thomas Monjalon <thomas@monjalon.net>
To: Lincoln Lavoie <lylavoie@iol.unh.edu>
Cc: ci@dpdk.org
Subject: Re: [dpdk-ci] UNH-IOL Community Lab Downtime Post Mortem
Date: Mon, 11 Oct 2021 10:36:53 +0200
Message-ID: <1933604.Tb0Rl1FMR3@thomas>
In-Reply-To: <CAOE1vsNMSBid3VzduNrfHjnHNS_9ByRbRE-td2O1AU10ARtKxw@mail.gmail.com>
I have two main concerns:
1/ We were not notified of the issue.
2/ Restoring the system took 7 days.
08/10/2021 22:29, Lincoln Lavoie:
> Hello All,
>
> During the CI meeting, there was a request to provide the post mortem
> review of the recent unplanned downtime.
>
> Timeline:
> * September 27, 8:30am - What should have been a routine upgrade to the
> Jenkins server failed, triggering the downtime.
> * September 27, 8:40am - Failed upgrade detected through combination of
> automated notifications and job failures in Jenkins.
> * September 27 - October 3 - UNH Team worked to restore the system to the
> original configuration.
> * October 3, 4pm - Server functionality restored
> * October 4, 11:30am - Jenkins pipelines re-enabled for compile and unit
> testing
> * October 5, 11am - Jenkins pipeline for bare-metal performance and
> functional testing re-enabled, after nominal debug / trial run.
>
> Root Cause:
> The Ansible script / playbook used to maintain the lab (including the
> Jenkins server) caused a trust failure of Kerberos (between the server and
> the IPA domain controller) used to secure the NFS mounts hosting the
> Jenkins databases, configuration, log output, etc. This prevented Jenkins
> from starting properly and complicated the restoration of the Jenkins
> service.
>
> Changes:
> 1. Per the community request, UNH will provide notice to the CI email list
> prior to upgrades, even for routine maintenance upgrades.
> 2. The UNH-IOL notification / monitoring server will be configured to also
> send notifications to the CI email list. Note, you will see all
> notifications, including routine maintenance, e.g. host reboots. This
> was indicated as acceptable during the CI meeting.
> 3. This email summary.
>
> As of Friday afternoon, Jenkins has "caught up" and has a queue of about
> 20 jobs, which is about 1 patch worth of testing. Please let me know
> if there are any questions or if anything else looks incorrect in the test
> results. We apologize for the inconvenience this caused, while waiting for
> the automated testing to be restored.
>
> Cheers,
> Lincoln