From: Thomas Monjalon
To: fengchengwen, Ferruh Yigit
Cc: "dev@dpdk.org", Andrew Rybchenko
Date: Fri, 23 Jul 2021 14:51:55 +0200
Message-ID: <10636274.k5fJVezR7q@thomas>
In-Reply-To: <6e220d0b-5683-ee12-bdab-1ef78d19ebdc@intel.com>
References: <0bc940bb-65e6-1acb-d026-7a2a08a0ad8b@huawei.com> <4435152.k7BQ785f6v@thomas> <6e220d0b-5683-ee12-bdab-1ef78d19ebdc@intel.com>
Subject: Re: [dpdk-dev] Question about hardware error handling policy

23/07/2021 14:33, Ferruh Yigit:
> On 7/22/2021 4:46 PM, Thomas Monjalon wrote:
> > 22/07/2021 15:50, fengchengwen:
> >> Hi, all
> >>
> >> I notice ethdev support dev_reset ops, which could be used to recover from
> >> errors, and only 13+ drivers support this function.
>
> 'rte_eth_dev_reset()' can be used to reset device config to defaults, not have
> to be for error recovering.
>
> >> And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
> >> drivers support it (most of them are VF).
> >>
> >> This provides users with two ways to handle hardware errors:
> >> a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
> >> b. application detect errors (the detection method is unclear), and call
> >> reset ops to recover.
> >>
> >> According to the design of this API, error handling is assigned to the
> >> application, and the driver is only responsible for reporting events. This
> >> simplifies the driver design (for example, the driver does not need to maintain
> >> mutex locks).
> >>
> >> As we know, many modern NICs come with firmware, have PCIE interfaces,
> >> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
> >> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
> >> most drivers.
> >>
> >> Question 1: what do we think of these errors(particularly firmware/PF)? Do
> >> we think that the probability is very low and that there is no need to deal with
> >> them?
> >
> > Even rare errors must be managed.
> >
>
> +1
>
> >> Question 2: I prefer to put error handling in the application layer, because
> >> doing it in the driver can make the driver complex, but there is no app to
> >> register the INTR_RESET event handler. I think we can build a standard handler
> >> in testpmd, What do you think?
> >
> > Absolutely. As any ethdev API, it must be tested with testpmd.
> >
>
> Testpmd registers for RESET event, but when event received all it does is print
> a log, so there is not logic behind it.
>
> If the intention is to add a error handling logic into testpmd, my concern is it
> being too complex or too device specific.

It shows a problem in the API.
We don't have a clear generic recovering process.

> And if there is something to cleanup, or recover etc in application level, it
> makes sense application to receive the event and act on it.
> But if the reset/recover can be handled in the PMD, if possible transparently,
> I think that is better choice.
>
> Another thing is I am not sure if what the applications should do on the reset
> event clear or same for all PMDs, which is not good.

Indeed we should improve this area, and implement a logic in testpmd.
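
For illustration only, option (a) from the thread (driver reports the event, application does the reset) could be shaped roughly as below. This is a minimal sketch and not testpmd code: `port_ops`, `on_reset_event()`, `service_pending_resets()`, `MAX_PORTS` and the `demo_*` stubs are all hypothetical names. It assumes the event callback may run in a driver or interrupt context, so the callback only records the request and the control thread performs the stop/reset/restart sequence later. In a real application the callback would be registered with rte_eth_dev_callback_register() for RTE_ETH_EVENT_INTR_RESET, and the ops would wrap rte_eth_dev_stop(), rte_eth_dev_reset() and the application's own reconfiguration code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PORTS 4 /* hypothetical port limit for this sketch */

/* Hypothetical stand-ins for the per-port control operations.
 * In a DPDK application these would wrap rte_eth_dev_stop(),
 * rte_eth_dev_reset() and the application's reconfigure/start code.
 * Each returns 0 on success. */
struct port_ops {
	int (*stop)(uint16_t port_id);
	int (*reset)(uint16_t port_id);
	int (*restart)(uint16_t port_id);
};

/* One flag per port: set by the event callback, consumed by the
 * control thread. Static storage, so all flags start as false. */
static atomic_bool reset_pending[MAX_PORTS];

/* Event callback body: only records that a reset was requested. */
static int on_reset_event(uint16_t port_id)
{
	if (port_id >= MAX_PORTS)
		return -1;
	atomic_store(&reset_pending[port_id], true);
	return 0;
}

/* Called periodically from the control thread.
 * Returns the number of ports recovered, or -1 on the first failure
 * (the pending flag of the failing port has already been cleared). */
static int service_pending_resets(const struct port_ops *ops)
{
	int recovered = 0;

	for (uint16_t p = 0; p < MAX_PORTS; p++) {
		/* Atomically take and clear the request for this port. */
		if (!atomic_exchange(&reset_pending[p], false))
			continue;
		if (ops->stop(p) != 0 || ops->reset(p) != 0 ||
		    ops->restart(p) != 0) {
			fprintf(stderr, "port %u: recovery failed\n",
				(unsigned)p);
			return -1;
		}
		recovered++;
	}
	return recovered;
}

/* Demo stubs so the sketch can run without any hardware or driver. */
static int demo_stop(uint16_t p)
{
	printf("port %u: stop\n", (unsigned)p);
	return 0;
}

static int demo_reset(uint16_t p)
{
	printf("port %u: reset\n", (unsigned)p);
	return 0;
}

static int demo_restart(uint16_t p)
{
	printf("port %u: reconfigure and start\n", (unsigned)p);
	return 0;
}

static const struct port_ops demo_ops = {
	demo_stop, demo_reset, demo_restart
};
```

Calling on_reset_event(1) and then service_pending_resets(&demo_ops) runs the three steps once for port 1; a second call finds nothing pending. What belongs in the restart step is exactly the part that is not generic across PMDs today, which is the gap discussed above.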