From: fengchengwen
Date: Mon, 26 Jul 2021 14:21:23 +0800
To: Matan Azrad, NBU-Contact-Thomas Monjalon
CC: Ferruh Yigit, "dev@dpdk.org", "beilei.xing@intel.com"
Subject: Re: [dpdk-dev] Question about hardware error handling policy
Message-ID: <5c477b6c-9bdc-388b-5183-dcc2acb4e571@huawei.com>
References: <0bc940bb-65e6-1acb-d026-7a2a08a0ad8b@huawei.com> <4435152.k7BQ785f6v@thomas>

Hi

On 2021/7/25 23:12, Matan Azrad wrote:
> Hi
>
> From: fengchengwen:
>> On 2021/7/22 23:46, Thomas Monjalon wrote:
>>> 22/07/2021 15:50, fengchengwen:
>>>> Hi, all
>>>>
>>>> I notice ethdev supports the dev_reset op, which could be used to
>>>> recover from errors, and only 13+ drivers support this function.
>>>> There is also an event for reset, RTE_ETH_EVENT_INTR_RESET, and
>>>> only 6 drivers support it (most of them are VF).
>>>>
>>>> This provides users with two ways to handle hardware errors:
>>>> a. the driver reports RTE_ETH_EVENT_INTR_RESET, and the application
>>>>    performs the reset op.
>>>> b. the application detects errors (the detection method is unclear)
>>>>    and calls the reset op to recover.
>>>>
>>>> According to the design of this API, error handling is assigned
>>>> to the application, and the driver is only responsible for reporting
>>>> events. This simplifies the driver design (for example, the driver
>>>> does not need to maintain mutex locks).
>>>>
>>>> As we know, many modern NICs come with firmware, have PCIe
>>>> interfaces and support SR-IOV. Hardware errors can include firmware
>>>> reboot, PF reset, VF reset and FLR, but these errors (particularly
>>>> firmware/PF) are not addressed in most drivers.
>>>>
>>>> Question 1: what do we think of these errors (particularly
>>>> firmware/PF)? Do we think that the probability is very low and that
>>>> there is no need to deal with them?
>>>
>>> Even rare errors must be managed.
>>
>> Because Intel and mlx NICs are widely used, I looked at the i40e/mlx5
>> driver code and found:
>> 1) The i40e PF driver only prints logs when it detects a global reset
>>    or other errors:
>>      if (icr0 & I40E_PFINT_ICR0_GRST_MASK)
>>          PMD_DRV_LOG(INFO, "ICR0: global reset requested");
>>      if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK)
>>          PMD_DRV_LOG(INFO, "ICR0: PCI exception activated");
>>      if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK)
>>          PMD_DRV_LOG(INFO, "ICR0: a change in the storm control state");
>>    @Beilei Why not report RESET_EVENT in these cases? Or are these
>>    errors very rare, so they are just reported in the log? Also, the
>>    application may still send or receive packets. Could these resets,
>>    if not handled correctly, cause the hardware or driver to hang?
>>
>> 2) In the mlx5 PF driver, I notice there is a
>>    mlx5_dev_interrupt_device_fatal which will remove the device.
>>    @Matan Why not report RESET_EVENT in these cases, so that the
>>    application can recover quickly?
>
> RTE_ETH_EVENT_INTR_RMV is reported by the driver to notify the app that
> the device was physically unplugged from the PCI slot.
> So, when the app sees this event, it should free all the SW resources
> of this device (call port close to the PMD by the way).
>
> RTE_ETH_EVENT_INTR_RESET, /**< reset interrupt event, sent to VF on PF reset */
> Looks like VF-PF communication; this is not our case of being
> unplugged, which is more fatal.

I think it can be changed so that it can also be used for the PF.

>
> Matan
>
>
>>>
>>>> Question 2: I prefer to put error handling in the application
>>>> layer, because doing it in the driver can make the driver complex,
>>>> but there is no app that registers an INTR_RESET event handler. I
>>>> think we can build a standard handler in testpmd. What do you think?
>>>
>>> Absolutely. As with any ethdev API, it must be tested with testpmd.
>>>