From: fengchengwen
To: Thomas Monjalon, Ferruh Yigit, "dev@dpdk.org"
Date: Thu, 22 Jul 2021 21:50:02 +0800
Message-ID: <0bc940bb-65e6-1acb-d026-7a2a08a0ad8b@huawei.com>
Subject: [dpdk-dev] Question about hardware error handling policy

Hi all,

I noticed that ethdev supports a dev_reset op, which can be used to recover from errors, but only 13+ drivers implement it. There is also a reset event, RTE_ETH_EVENT_INTR_RESET, which only 6 drivers support (most of them VF drivers).

This gives users two ways to handle hardware errors:
a. The driver reports RTE_ETH_EVENT_INTR_RESET, and the application performs the reset.
b. The application detects the error itself (the detection method is unclear) and calls the reset op to recover.

By the design of this API, error handling is assigned to the application, and the driver is only responsible for reporting events. This simplifies the driver design (for example, the driver does not need to maintain mutex locks).

As we know, many modern NICs run firmware, have PCIe interfaces, and support SR-IOV. The hardware errors they can hit include firmware reboot, PF reset, VF reset, and FLR, but most drivers do not handle these errors (particularly the firmware and PF ones).

Question 1: What do we think of these errors (particularly firmware/PF)? Do we consider their probability so low that there is no need to deal with them?

Question 2: I prefer to put error handling in the application layer, because doing it in the driver makes the driver complex. However, no application registers a handler for the INTR_RESET event. I think we could build a standard handler in testpmd. What do you think?

Thanks
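
To make option (a) concrete, here is a minimal, self-contained sketch of the application-side pattern. It is a model, not DPDK code: mock_dev_reset(), on_intr_reset(), service_pending_resets() and MAX_PORTS are hypothetical names invented for illustration. In a real application the handler would be registered with rte_eth_dev_callback_register() for RTE_ETH_EVENT_INTR_RESET, and the reset would be done with rte_eth_dev_reset(). Deferring the reset from the event handler to the main loop is an assumption of this sketch, chosen so that the handler itself stays trivial.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PORTS 4 /* hypothetical port limit for this sketch */

/* Counts resets per port so the behaviour can be observed. */
static int reset_calls[MAX_PORTS];

/* Hypothetical stand-in for the reset op; a real application
 * would call rte_eth_dev_reset(port_id) here instead. */
static int mock_dev_reset(uint16_t port_id)
{
    reset_calls[port_id]++;
    return 0;
}

/* Set by the event handler, consumed by the main loop. */
static volatile bool reset_pending[MAX_PORTS];

/* Event handler: only records that the port needs a reset.
 * It performs no recovery work itself. */
static void on_intr_reset(uint16_t port_id)
{
    assert(port_id < MAX_PORTS);
    reset_pending[port_id] = true;
}

/* Called periodically from the application's main loop.
 * Returns the number of ports that were reset successfully. */
static int service_pending_resets(void)
{
    int done = 0;

    for (uint16_t p = 0; p < MAX_PORTS; p++) {
        if (!reset_pending[p])
            continue;
        reset_pending[p] = false;
        /* A real application would stop Rx/Tx on the port before
         * this call, then reconfigure and restart it afterwards. */
        if (mock_dev_reset(p) == 0)
            done++;
    }
    return done;
}
```

With this split, the driver only has to report the event, and all recovery logic lives in one place in the application, which is the division of responsibility described above.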