From: fengchengwen
To: Thomas Monjalon
CC: Ferruh Yigit, "dev@dpdk.org", Matan Azrad
Date: Fri, 23 Jul 2021 10:18:08 +0800
Subject: Re: [dpdk-dev] Question about hardware error handling policy
References: <0bc940bb-65e6-1acb-d026-7a2a08a0ad8b@huawei.com> <4435152.k7BQ785f6v@thomas>
In-Reply-To: <4435152.k7BQ785f6v@thomas>

On 2021/7/22 23:46, Thomas Monjalon wrote:
> 22/07/2021 15:50, fengchengwen:
>> Hi, all
>>
>> I notice ethdev support dev_reset ops, which could be used to recover from
>> errors, and only 13+ drivers support this function.
>> And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
>> drivers support it (most of them are VF).
>>
>> This provides users with two ways to handle hardware errors:
>> a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
>> b. application detect errors (the detection method is unclear), and call
>> reset ops to recover.
>>
>> According to the design of this API, error handling is assigned to the
>> application, and the driver is only responsible for reporting events. This
>> simplifies the driver design (for example, the driver does not need to maintain
>> mutex locks).
>>
>> As we know, many modern NICs come with firmware, have PCIE interfaces,
>> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
>> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
>> most drivers.
>>
>> Question 1: what do we think of these errors(particularly firmware/PF)? Do
>> we think that the probability is very low and that there is no need to deal with
>> them?
>
> Even rare errors must be managed.

Because Intel and mlx NICs are widely used, I looked at the i40e and mlx5
driver code and found:

1) The i40e PF driver only logs a message when it detects a global reset or
   one of the other errors below:

    if (icr0 & I40E_PFINT_ICR0_GRST_MASK)
        PMD_DRV_LOG(INFO, "ICR0: global reset requested");
    if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK)
        PMD_DRV_LOG(INFO, "ICR0: PCI exception activated");
    if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK)
        PMD_DRV_LOG(INFO, "ICR0: a change in the storm control state");

@Beilei Why not report RESET_EVENT in these cases?
Or are these errors so rare that logging them is considered sufficient? Also,
the application may still be sending or receiving packets when they occur.
Won't these resets, if not handled correctly, cause the hardware or driver to
hang?

2) In the mlx5 PF driver, I noticed mlx5_dev_interrupt_device_fatal, which
   removes the device.

@Matan Why not report RESET_EVENT in this case, so that the application can
recover quickly?

>
>> Question 2: I prefer to put error handling in the application layer, because
>> doing it in the driver can make the driver complex, but there is no app to
>> register the INTR_RESET event handler. I think we can build a standard handler
>> in testpmd, What do you think?
>
> Absolutely. As any ethdev API, it must be tested with testpmd.
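
To make option (a) and the proposed testpmd handler more concrete, here is a
rough, self-contained sketch of the application side. It is only a model and
not DPDK code: enum eth_event, on_reset_event(), stub_dev_reset(),
service_pending_resets() and MAX_PORTS are all hypothetical names. In a real
application the callback would be registered with
rte_eth_dev_callback_register() for RTE_ETH_EVENT_INTR_RESET, and the stub
would be replaced by rte_eth_dev_reset(). The sketch assumes the reset should
not run inside the event callback itself, so the callback only records the
request and a control loop performs the reset later.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 4 /* hypothetical port limit for this sketch */

/* Hypothetical stand-in for enum rte_eth_event_type (one value only). */
enum eth_event { ETH_EVENT_INTR_RESET };

/* Set by the event callback, cleared by the control loop. */
static atomic_bool reset_pending[MAX_PORTS];

/* Hypothetical stand-in for rte_eth_dev_reset(): counts calls, always succeeds. */
static int reset_calls[MAX_PORTS];

static int stub_dev_reset(uint16_t port_id)
{
    reset_calls[port_id]++;
    return 0;
}

/*
 * Event callback with the same parameter shape as rte_eth_dev_cb_fn.
 * It only records that a reset was requested; it does not reset the port.
 */
static int on_reset_event(uint16_t port_id, enum eth_event event,
                          void *cb_arg, void *ret_param)
{
    (void)cb_arg;
    (void)ret_param;
    assert(port_id < MAX_PORTS);
    if (event == ETH_EVENT_INTR_RESET)
        atomic_store(&reset_pending[port_id], true);
    return 0;
}

/*
 * Called from the application's control loop.
 * Resets every port with a pending request and returns how many were reset.
 */
static int service_pending_resets(void)
{
    int done = 0;

    for (uint16_t p = 0; p < MAX_PORTS; p++) {
        /* Claim the request; skip ports with nothing pending. */
        if (!atomic_exchange(&reset_pending[p], false))
            continue;
        /* A real application would stop Rx/Tx on the port before this
         * call, then reconfigure and restart the port afterwards. */
        if (stub_dev_reset(p) == 0)
            done++;
        else
            atomic_store(&reset_pending[p], true); /* retry next pass */
    }
    return done;
}
```

Calling on_reset_event(1, ETH_EVENT_INTR_RESET, NULL, NULL) marks port 1, and
the next call to service_pending_resets() resets it once and clears the flag.
Keeping the callback this small means the driver side needs no extra locking,
which matches the design argument earlier in the thread.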