DPDK CI discussions
* [dpdk-ci] UNH-IOL Community Lab Downtime Post Mortem
@ 2021-10-08 20:29 Lincoln Lavoie
  2021-10-11  8:36 ` Thomas Monjalon
  0 siblings, 1 reply; 2+ messages in thread
From: Lincoln Lavoie @ 2021-10-08 20:29 UTC (permalink / raw)
  To: ci


Hello All,

During the CI meeting, there was a request to provide the post mortem
review of the recent unplanned downtime.

* September 27, 8:30am - What should have been a routine upgrade to the
Jenkins server failed, triggering the downtime.
* September 27, 8:40am - Failed upgrade detected through combination of
automated notifications and job failures in Jenkins.
* September 27 - October 3 - UNH Team worked to restore the system to the
original configuration.
* October 3, 4pm - Server functionality restored.
* October 4, 11:30am - Jenkins pipelines re-enabled for compile and unit
testing.
* October 5, 11am - Jenkins pipeline for bare-metal performance and
functional testing re-enabled, after nominal debug / trial run.

Root Cause:
The Ansible script / playbook used to maintain the lab (including the
Jenkins server) broke the Kerberos trust between the server and the IPA
domain controller.  Kerberos secures the NFS mounts hosting the Jenkins
databases, configuration, log output, etc.  This prevented Jenkins
from starting properly and complicated the restoration of the Jenkins

1. Per the community request, UNH will provide notice to the CI email list
prior to upgrades, even for routine maintenance upgrades.
2. The UNH-IOL notification / monitoring server will be configured to also
send notifications to the CI email list.  Note that you will see all
notifications, including routine maintenance, e.g. host reboots.  This
was indicated as acceptable during the CI meeting.
3. This email summary.

As of Friday afternoon, Jenkins has "caught up" and has a queue of about
20 jobs, which is about one patch's worth of testing.  Please let me know
if there are any questions or if anything looks incorrect in the test
results.  We apologize for the inconvenience this caused while automated
testing was being restored.

*Lincoln Lavoie*
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
+1-603-674-2755 (m)

