From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
 by inbox.dpdk.org (Postfix) with ESMTP id BB5AAA0C47
 for ; Fri, 8 Oct 2021 22:30:00 +0200 (CEST)
Received: from [217.70.189.124] (localhost [127.0.0.1])
 by mails.dpdk.org (Postfix) with ESMTP id AC6AF40DDB;
 Fri, 8 Oct 2021 22:30:00 +0200 (CEST)
Received: from mail-ed1-f51.google.com (mail-ed1-f51.google.com [209.85.208.51])
 by mails.dpdk.org (Postfix) with ESMTP id CE4F340DDA
 for ; Fri, 8 Oct 2021 22:29:59 +0200 (CEST)
Received: by mail-ed1-f51.google.com with SMTP id p13so41289907edw.0
 for ; Fri, 08 Oct 2021 13:29:59 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=iol.unh.edu; s=unh-iol;
 h=mime-version:from:date:message-id:subject:to;
 bh=J5pglUcf5+eU8FhyBTrGnHMSj5SyKKJkAquuGblaaX4=;
 b=aYhuNCrSVE2VKzFDjmB8vPAYe45IpFsqLYkstxKc6Bd0JU+bSoZZDZbu/Z2IgV02y6
 omUL95kSmViHOvP5DpU741QswNdvQev/1Py/kyEW9lPp/jru/PesUnZ+i6W/OW4iiJW9
 midUDZuxl0NJWCysumDZnx4xd1PZGkY/u39Yk=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=x-gm-message-state:mime-version:from:date:message-id:subject:to;
 bh=J5pglUcf5+eU8FhyBTrGnHMSj5SyKKJkAquuGblaaX4=;
 b=TjJ8i7/l4M5HWxSLK4roD/mCsCATitNYLTwHOHYz9tptmfcWK87TIx1XF9lmRY0KOV
 u09Efz/O2BpA2UI/Cgcr4piLR0TuuFLn2903XI5v+7XFDzzA1cmXBpBcZfkZnuATJKoj
 YGz9EqgMUZNcxebC1wLWXINRbNXCmbv57Nof/UujMPzVdYUkAgOCrLxRCLWIkodYSQMv
 MaMO6sCLTOvzzL5HFoeCGF7nOX1xQSAgyk42Z+a+NJ7ijXYDNmBf8G0nnEH+IdtaiNIg
 h8wcPqGKfjnqXSwwuM1+rCGYhkY1K3Az3qXlJxvKJuwOSpfyRLZnugahr3Jq9sdU5W9C
 TUZw==
X-Gm-Message-State: AOAM532LrInfXXeFcDXTYfwQ8wO8gPaYXSPt6Y1eOiFHJ5/n09kbHiWp
 fcMrHIUj5xbORzfaEnTloDi/yp0K+iT4coE77xOfV6Z/tpw=
X-Google-Smtp-Source: ABdhPJwT/OsrqdcWpScrpeGEllOMo24UnqWVPTR3ymbY1MI6bGwfoHpp/rXu5CgdY1ZQ95BtUOWxvTVF28aSpxyvtoY=
X-Received: by 2002:a17:906:e104:: with SMTP id gj4mr5134109ejb.358.1633724999008;
 Fri, 08 Oct 2021 13:29:59 -0700 (PDT)
MIME-Version: 1.0
From: Lincoln Lavoie
Date: Fri, 8 Oct 2021 16:29:48 -0400
Message-ID:
To: ci@dpdk.org
Content-Type: multipart/alternative; boundary="0000000000003a224c05cddd3e08"
Subject: [dpdk-ci] UNH-IOL Community Lab Downtime Post Mortem
X-BeenThere: ci@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK CI discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: ci-bounces@dpdk.org
Sender: "ci"

--0000000000003a224c05cddd3e08
Content-Type: text/plain; charset="UTF-8"

Hello All,

During the CI meeting, there was a request to provide the post mortem
review of the recent unplanned downtime.

Timeline:
* September 27, 8:30am - What should have been a routine upgrade to the
  Jenkins server failed, triggering the downtime.
* September 27, 8:40am - Failed upgrade detected through a combination of
  automated notifications and job failures in Jenkins.
* September 27 - October 3 - UNH team worked to restore the system to the
  original configuration.
* October 3, 4pm - Server functionality restored.
* October 4, 11:30am - Jenkins pipelines re-enabled for compile and unit
  testing.
* October 5, 11am - Jenkins pipeline for bare-metal performance and
  functional testing re-enabled, after a nominal debug / trial run.

Root Cause:
The Ansible script / playbook used to maintain the lab (including the
Jenkins server) broke the Kerberos trust between the server and the IPA
domain controller. Kerberos secures the NFS mounts that host the Jenkins
databases, configuration, log output, etc. This prevented Jenkins from
starting properly and complicated the restoration of the Jenkins service.

Changes:
1. Per the community request, UNH will provide notice to the CI email list
   prior to upgrades, even for routine maintenance upgrades.
2. The UNH-IOL notification / monitoring server will be configured to also
   send notifications to the CI email list. Note that you will see all
   notifications, including those for routine maintenance, e.g. host
   reboots. This was indicated as acceptable during the CI meeting.
3. This email summary.

As of Friday afternoon, Jenkins has "caught up" and has a queue of about
20 jobs, which is about one patch's worth of testing. Please let me know
if there are any questions or if anything else looks incorrect in the test
results. We apologize for the inconvenience this caused while you waited
for the automated testing to be restored.

Cheers,
Lincoln

--
*Lincoln Lavoie*
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)

--0000000000003a224c05cddd3e08
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--0000000000003a224c05cddd3e08--