Hello All,
During the CI meeting, there was a request to provide a post-mortem review of the recent unplanned downtime.
Timeline:
* September 27, 8:30am - What should have been a routine upgrade to the Jenkins server failed, triggering the downtime.
* September 27, 8:40am - Failed upgrade detected through a combination of automated notifications and job failures in Jenkins.
* September 27 - October 3 - UNH Team worked to restore the system to the original configuration.
* October 3, 4pm - Server functionality restored
* October 4, 11:30am - Jenkins pipelines re-enabled for compile and unit testing
* October 5, 11am - Jenkins pipeline for bare-metal performance and functional testing re-enabled, after a nominal debug / trial run.
Root Cause:
The Ansible script / playbook used to maintain the lab (including the Jenkins server) caused a Kerberos trust failure between the server and the IPA domain controller. Kerberos is used to secure the NFS mounts hosting the Jenkins databases, configuration, log output, etc. This prevented Jenkins from starting properly and complicated the restoration of the Jenkins service.
Changes:
1. Per the community request, UNH will provide notice to the CI email list prior to upgrades, even for routine maintenance upgrades.
2. The UNH-IOL notification / monitoring server will be configured to also send notifications to the CI email list. Note that you will see all notifications, including those for routine maintenance, e.g. host reboots. This was indicated as acceptable during the CI meeting.
3. This email summary.
As of Friday afternoon, Jenkins has "caught up" and has a queue of about 20 jobs, which is about one patch's worth of testing. Please let me know if there are any questions or if anything else looks incorrect in the test results. We apologize for the inconvenience this caused while automated testing was being restored.
Cheers,
Lincoln
--
Lincoln Lavoie
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
+1-603-674-2755 (m)