DPDK CI discussions
* [dpdk-ci] UNH-IOL Community Lab Downtime Post Mortem
From: Lincoln Lavoie @ 2021-10-08 20:29 UTC
  To: ci

Hello All,

During the CI meeting, there was a request to provide the post mortem
review of the recent unplanned downtime.

Timeline:
* September 27, 8:30am - What should have been a routine upgrade to the
Jenkins server failed, triggering the downtime.
* September 27, 8:40am - Failed upgrade detected through combination of
automated notifications and job failures in Jenkins.
* September 27 - October 3 - UNH Team worked to restore the system to the
original configuration.
* October 3, 4pm - Server functionality restored
* October 4, 11:30am - Jenkins pipelines re-enabled for compile and unit
testing
* October 5, 11am - Jenkins pipeline for bare-metal performance and
functional testing re-enabled, after nominal debug / trial run.
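
For reference, the elapsed times implied by the timeline above can be
worked out with a short sketch. The year 2021 is taken from the date of
this email, the times are treated as local times with no time zone
(an assumption), and the event names are illustrative labels, not names
used by the lab.

```python
from datetime import datetime

# Timeline entries from the post mortem. The year comes from the email
# date; the naive local times are an assumption.
EVENTS = {
    "upgrade_failed": datetime(2021, 9, 27, 8, 30),
    "failure_detected": datetime(2021, 9, 27, 8, 40),
    "server_restored": datetime(2021, 10, 3, 16, 0),
    "compile_unit_enabled": datetime(2021, 10, 4, 11, 30),
    "bare_metal_enabled": datetime(2021, 10, 5, 11, 0),
}


def elapsed(start, end):
    """Return (whole days, remaining hours) between two named events."""
    delta = EVENTS[end] - EVENTS[start]
    return delta.days, delta.seconds / 3600


if __name__ == "__main__":
    for name in list(EVENTS)[1:]:
        days, hours = elapsed("upgrade_failed", name)
        print(f"{name}: {days} d {hours:.1f} h after the failed upgrade")
```

Measured from the failed upgrade, server functionality returned after
6 days 7.5 hours, and the last pipeline was re-enabled after 8 days
2.5 hours.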

Root Cause:
The Ansible script / playbook used to maintain the lab (including the
Jenkins server) caused a trust failure of Kerberos (between the server and
the IPA domain controller) used to secure the NFS mounts hosting the
Jenkins databases, configuration, log output, etc.  This prevented Jenkins
from starting properly and complicated the restoration of the Jenkins
service.
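
As an illustration only, a failure of this kind could be caught before
the service is started by probing the storage directories. This is a
minimal sketch, not the lab's actual tooling: the function name and the
path in the usage example are hypothetical, and it only checks that a
directory can be listed and written to, which is what a broken Kerberos
trust on an NFS mount would typically prevent.

```python
import os
import tempfile


def check_storage(path):
    """Return a list of problems found with a storage directory.

    An empty list means the directory exists, can be listed, and
    accepts a temporary probe file.
    """
    if not os.path.isdir(path):
        return [f"{path}: not a directory (mount missing?)"]

    problems = []
    try:
        os.listdir(path)
    except OSError as exc:
        problems.append(f"{path}: cannot list: {exc}")

    try:
        # The probe file is removed automatically when closed.
        with tempfile.TemporaryFile(dir=path) as probe:
            probe.write(b"probe")
    except OSError as exc:
        problems.append(f"{path}: cannot write: {exc}")

    return problems


if __name__ == "__main__":
    # Hypothetical path; the real mount points are not given in this email.
    for issue in check_storage("/var/lib/jenkins"):
        print(issue)
```

Running a check like this at the end of a maintenance playbook, before
restarting the service, would surface an unreadable mount as an explicit
error rather than as a service that fails to start.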

Changes:
1. Per the community request, UNH will provide notice to the CI email list
prior to upgrades, even for routine maintenance upgrades.
2. The UNH-IOL notification / monitoring server will be configured to also
send notifications to the CI email list.  Note, you will see all
notifications, including routine maintenance, e.g. host reboots.  This
was indicated as acceptable during the CI meeting.
3. This email summary.

As of Friday afternoon, Jenkins has "caught up" and has a queue of about
20 jobs, which is about one patch's worth of testing.  Please let me know
if there are any questions or if anything else looks incorrect in the test
results.  We apologize for the inconvenience this caused while waiting for
the automated testing to be restored.

Cheers,
Lincoln
-- 
*Lincoln Lavoie*
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)

* Re: [dpdk-ci] UNH-IOL Community Lab Downtime Post Mortem
From: Thomas Monjalon @ 2021-10-11  8:36 UTC
  To: Lincoln Lavoie; +Cc: ci

I have two main concerns:
1/ We were not notified of the issue.
2/ Restoring the system took 7 days.

