DPDK CI discussions
From: Lincoln Lavoie <lylavoie@iol.unh.edu>
To: ci@dpdk.org
Subject: [dpdk-ci] UNH-IOL Community Lab Downtime Post Mortem
Date: Fri, 8 Oct 2021 16:29:48 -0400
Message-ID: <CAOE1vsNMSBid3VzduNrfHjnHNS_9ByRbRE-td2O1AU10ARtKxw@mail.gmail.com>

Hello All,

During the CI meeting, there was a request to provide the post mortem
review of the recent unplanned downtime.

Timeline:
* September 27, 8:30am - What should have been a routine upgrade to the
Jenkins server failed, triggering the downtime.
* September 27, 8:40am - Failed upgrade detected through a combination of
automated notifications and job failures in Jenkins.
* September 27 - October 3 - UNH Team worked to restore the system to the
original configuration.
* October 3, 4pm - Server functionality restored
* October 4, 11:30am - Jenkins pipelines re-enabled for compile and unit
testing
* October 5, 11am - Jenkins pipeline for bare-metal performance and
functional testing re-enabled, after a nominal debug / trial run.
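
For reference, the elapsed times implied by the timeline above can be
worked out as in the sketch below. It assumes the year is 2021 (from the
date of this email) and that all times share one local time zone; the
milestone labels are illustrative only.

```python
from datetime import datetime

# Times are taken from the timeline above. The year (2021) and a single
# shared local time zone are assumptions.
start = datetime(2021, 9, 27, 8, 30)  # failed Jenkins upgrade
milestones = {
    "failure detected": datetime(2021, 9, 27, 8, 40),
    "server functionality restored": datetime(2021, 10, 3, 16, 0),
    "compile and unit testing re-enabled": datetime(2021, 10, 4, 11, 30),
    "bare-metal testing re-enabled": datetime(2021, 10, 5, 11, 0),
}


def elapsed(since, until):
    """Return (days, hours, minutes) between two datetimes."""
    delta = until - since
    hours, rem = divmod(delta.seconds, 3600)
    return delta.days, hours, rem // 60


for label, when in milestones.items():
    d, h, m = elapsed(start, when)
    print(f"{label}: {d}d {h}h {m}m after the failed upgrade")
```

Under those assumptions, the server was restored roughly 6 days 7.5 hours
after the failed upgrade, and the last pipeline was re-enabled about 8 days
2.5 hours after it.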

Root Cause:
The Ansible script / playbook used to maintain the lab (including the
Jenkins server) broke the Kerberos trust between the server and the IPA
domain controller. That trust is used to secure the NFS mounts hosting the
Jenkins databases, configuration, log output, etc.  This prevented Jenkins
from starting properly and complicated the restoration of the Jenkins
service.
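
As an illustration of the failure mode above (not a description of the
lab's actual tooling), a pre-flight check along the following lines could
confirm that the storage a service depends on is mounted and usable before
the service is started. The function name and its parameters are
hypothetical, and a Kerberos trust failure on an NFS mount may surface
differently depending on how the mount is configured.

```python
import os
import tempfile


def check_mount(path, require_mount=True):
    """Return a list of problems found for `path`; an empty list means OK."""
    if not os.path.isdir(path):
        return [f"{path}: not a directory or not reachable"]

    problems = []
    if require_mount and not os.path.ismount(path):
        problems.append(f"{path}: not a mount point")

    # Listing the directory exercises read access. On a Kerberos-secured
    # NFS mount, a broken trust may show up here as a permission error.
    try:
        os.listdir(path)
    except OSError as exc:
        problems.append(f"{path}: cannot list ({exc.strerror})")

    # Creating a temporary file exercises write access; the file is
    # removed automatically when the context manager exits.
    try:
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError as exc:
        problems.append(f"{path}: cannot write ({exc.strerror})")

    return problems
```

A wrapper could call check_mount() for each directory the service needs
and refuse to start the service if any call returns a non-empty list,
turning a silent start-up failure into an explicit, early error.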

Changes:
1. Per the community's request, UNH will provide notice to the CI email list
prior to upgrades, even for routine maintenance upgrades.
2. The UNH-IOL notification / monitoring server will be configured to also
send notifications to the CI email list.  Note that the list will receive
all notifications, including routine maintenance (e.g. host reboots).  This
was indicated as acceptable during the CI meeting.
3. This email summary.

As of Friday afternoon, Jenkins has "caught up" and has a queue of about
20 jobs, which is about one patch's worth of testing.  Please let me know
if there are any questions or if anything looks incorrect in the test
results.  We apologize for the inconvenience caused while the automated
testing was being restored.

Cheers,
Lincoln
-- 
*Lincoln Lavoie*
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)
