From: Thomas Monjalon
To: Lincoln Lavoie
Cc: ci@dpdk.org
Date: Mon, 11 Oct 2021 10:36:53 +0200
Message-ID: <1933604.Tb0Rl1FMR3@thomas>
Subject: Re: [dpdk-ci] UNH-IOL Community Lab Downtime Post Mortem

I have two main concerns:
1/ We were not notified of the issue.
2/ Restoring the system took 7 days.

08/10/2021 22:29, Lincoln Lavoie:
> Hello All,
>
> During the CI meeting, there was a request to provide the post mortem
> review of the recent unplanned downtime.
>
> Timeline:
> * September 27, 8:30am - What should have been a routine upgrade to the
>   Jenkins server failed, triggering the downtime.
> * September 27, 8:40am - Failed upgrade detected through a combination of
>   automated notifications and job failures in Jenkins.
> * September 27 - October 3 - UNH Team worked to restore the system to the
>   original configuration.
> * October 3, 4pm - Server functionality restored
> * October 4, 11:30am - Jenkins pipelines re-enabled for compile and unit
>   testing
> * October 5, 11am - Jenkins pipeline for bare-metal performance and
>   functional testing re-enabled, after nominal debug / trial run.
>
> Root Cause:
> The ansible script / playbook used to maintain the lab (including the
> Jenkins server) caused a trust failure of kerberos (between the server and
> the IPA domain controller) used to secure the NFS mounts hosting the
> Jenkins databases, configuration, log output, etc. This prevented Jenkins
> from starting properly and complicated the restoration of the Jenkins
> service.
>
> Changes:
> 1. Per the community request, UNH will provide notice to the CI email list
>    prior to upgrades, even for routine maintenance upgrades.
> 2. The UNH-IOL notification / monitoring server will be configured to also
>    send notifications to the CI email list. Note, you will see all
>    notifications, including routine maintenance, i.e. host reboots, etc. This
>    was indicated as acceptable during the CI meeting.
> 3. This email summary.
>
> As of Friday afternoon, Jenkins has "caught up" and has a queue of about
> 20'ish jobs, which is about 1 patch worth of testing. Please let me know
> if there are any questions or if anything else looks incorrect in the test
> results. We apologize for the inconvenience this caused, while waiting for
> the automated testing to be restored.
>
> Cheers,
> Lincoln