From: Andrew Rybchenko
To: Ferruh Yigit, Thomas Monjalon, fengchengwen
Cc: "dev@dpdk.org"
Date: Fri, 23 Jul 2021 16:04:09 +0300
Subject: Re: [dpdk-dev] Question about hardware error handling policy
References: <0bc940bb-65e6-1acb-d026-7a2a08a0ad8b@huawei.com> <4435152.k7BQ785f6v@thomas> <6e220d0b-5683-ee12-bdab-1ef78d19ebdc@intel.com>
In-Reply-To: <6e220d0b-5683-ee12-bdab-1ef78d19ebdc@intel.com>

On 7/23/21 3:33 PM, Ferruh Yigit wrote:
> On 7/22/2021 4:46 PM, Thomas Monjalon wrote:
>> 22/07/2021 15:50, fengchengwen:
>>> Hi, all
>>>
>>> I notice ethdev support dev_reset ops, which could be used to recover from
>>> errors, and only 13+ drivers support this function.
>
> 'rte_eth_dev_reset()' can be used to reset device config to defaults, not have
> to be for error recovering.
>
>>> And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
>>> drivers support it (most of them are VF).
>>>
>>> This provides users with two ways to handle hardware errors:
>>> a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
>>> b. application detect errors (the detection method is unclear), and call
>>> reset ops to recover.
>>>
>>> According to the design of this API, error handling is assigned to the
>>> application, and the driver is only responsible for reporting events. This
>>> simplifies the driver design (for example, the driver does not need to maintain
>>> mutex locks).
>>>
>>> As we know, many modern NICs come with firmware, have PCIE interfaces,
>>> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
>>> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
>>> most drivers.
>>>
>>> Question 1: what do we think of these errors(particularly firmware/PF)? Do
>>> we think that the probability is very low and that there is no need to deal with
>>> them?
>>
>> Even rare errors must be managed.
>>
>
> +1

+1

>>> Question 2: I prefer to put error handling in the application layer, because
>>> doing it in the driver can make the driver complex, but there is no app to
>>> register the INTR_RESET event handler. I think we can build a standard handler
>>> in testpmd, What do you think?
>>
>> Absolutely.
>> As any ethdev API, it must be tested with testpmd.
>>
>
> Testpmd registers for RESET event, but when event received all it does is print
> a log, so there is not logic behind it.
>
> If the intention is to add a error handling logic into testpmd, my concern is it
> being too complex or too device specific.

Error recovery must not be device specific. Otherwise, every
application needs device-specific handling and will be hard to
port across different drivers.

> And if there is something to cleanup, or recover etc in application level, it
> makes sense application to receive the event and act on it. But if the
> reset/recover can be handled in the PMD, if possible transparently, I think that
> is better choice.

The application should at least be notified so that it can stop
the datapath. IMHO it would be better if the application controls
the recovery using a simple, well-defined algorithm:
 - register to be notified when recovery is needed
 - receive the event
 - stop the datapath
 - do the reset
 - restore the configuration and start the device
 - restart the datapath

> Another thing is I am not sure if what the applications should do on the reset
> event clear or same for all PMDs, which is not good.
>
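
To make the application-controlled recovery sequence from the reply above
concrete (register, receive event, stop datapath, reset, restore
configuration, start), here is a minimal sketch in C. It is only an
illustration under stated assumptions: every type and function in it
(mock_port, app_cfg, mock_dev_*, on_reset_event, recover_port) is a
hypothetical stand-in so that the block compiles without DPDK. In a real
application the corresponding ethdev calls would presumably be
rte_eth_dev_callback_register() with RTE_ETH_EVENT_INTR_RESET,
rte_eth_dev_stop(), rte_eth_dev_reset(), rte_eth_dev_configure() plus queue
setup, and rte_eth_dev_start(). The sketch also assumes the callback only
records the request and the control thread does the actual work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Everything below is a hypothetical mock, not DPDK code. */
enum port_state { PORT_STOPPED, PORT_STARTED, PORT_FAULTY };

/* Configuration the application keeps so it can restore it after a reset. */
struct app_cfg { int nb_rxq; int nb_txq; };

struct mock_port;
typedef void (*reset_cb_t)(struct mock_port *port);

struct mock_port {
    enum port_state state;
    bool configured;
    struct app_cfg cfg;     /* configuration currently held by the "device" */
    reset_cb_t reset_cb;    /* handler registered by the application */
    bool reset_pending;     /* set by the handler, consumed by recover_port() */
    bool datapath_on;       /* application flag: workers may poll Rx/Tx */
};

/* Step 1: register to be notified when recovery is needed. */
static void mock_register_reset_cb(struct mock_port *p, reset_cb_t cb)
{
    assert(p != NULL);
    p->reset_cb = cb;
}

/* Simulates the driver detecting a hardware error and reporting it. */
static void mock_raise_reset_event(struct mock_port *p)
{
    p->state = PORT_FAULTY;
    if (p->reset_cb != NULL)
        p->reset_cb(p);
}

static int mock_dev_stop(struct mock_port *p)
{
    if (p->state == PORT_STARTED)
        p->state = PORT_STOPPED;
    return 0;
}

/* Reset clears the fault and wipes the device configuration. */
static int mock_dev_reset(struct mock_port *p)
{
    p->configured = false;
    p->cfg = (struct app_cfg){ 0, 0 };
    p->state = PORT_STOPPED;
    return 0;
}

static int mock_dev_configure(struct mock_port *p, const struct app_cfg *cfg)
{
    if (p->state != PORT_STOPPED)
        return -1;
    p->cfg = *cfg;
    p->configured = true;
    return 0;
}

static int mock_dev_start(struct mock_port *p)
{
    if (!p->configured || p->state != PORT_STOPPED)
        return -1;
    p->state = PORT_STARTED;
    return 0;
}

/* Step 2: receive the event. The handler only records the request. */
static void on_reset_event(struct mock_port *p)
{
    p->reset_pending = true;
}

/*
 * Steps 3-6, run from the application's control thread.
 * Returns 1 if a recovery was performed, 0 if none was pending, -1 on error.
 */
static int recover_port(struct mock_port *p, const struct app_cfg *saved)
{
    if (!p->reset_pending)
        return 0;
    p->datapath_on = false;                      /* stop datapath */
    if (mock_dev_stop(p) != 0 ||
        mock_dev_reset(p) != 0 ||                /* do reset */
        mock_dev_configure(p, saved) != 0 ||     /* restore configuration */
        mock_dev_start(p) != 0)                  /* start the device */
        return -1;
    p->reset_pending = false;
    p->datapath_on = true;                       /* restart datapath */
    return 1;
}
```

Nothing in recover_port() depends on which driver raised the event, which is
the portability property argued for above. The application would call it
periodically from its control loop after registering on_reset_event once at
startup.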