From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id A2118A00BE;
	Thu, 16 Jun 2022 11:48:07 +0200 (CEST)
Received: from [217.70.189.124] (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id E82E242C26;
	Thu, 16 Jun 2022 11:47:56 +0200 (CEST)
Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [45.249.212.188])
 by mails.dpdk.org (Postfix) with ESMTP id 5FA8042BF3
 for <dev@dpdk.org>; Thu, 16 Jun 2022 11:47:52 +0200 (CEST)
Received: from dggpeml500024.china.huawei.com (unknown [172.30.72.57])
 by szxga02-in.huawei.com (SkyGuard) with ESMTP id 4LNy1q1h9BzSgxv;
 Thu, 16 Jun 2022 17:44:31 +0800 (CST)
Received: from localhost.localdomain (10.67.165.24) by
 dggpeml500024.china.huawei.com (7.185.36.10) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2375.24; Thu, 16 Jun 2022 17:47:50 +0800
From: Chengwen Feng <fengchengwen@huawei.com>
To: <thomas@monjalon.net>, <ferruh.yigit@xilinx.com>
CC: <dev@dpdk.org>, <kalesh-anakkur.purayil@broadcom.com>,
 <somnath.kotur@broadcom.com>, <ajit.khaparde@broadcom.com>, <mdr@ashroe.eu>,
 <Andrew.Rybchenko@oktetlabs.ru>
Subject: [PATCH v8 1/4] ethdev: support device error recovery notification
Date: Thu, 16 Jun 2022 17:41:19 +0800
Message-ID: <20220616094122.1909-2-fengchengwen@huawei.com>
X-Mailer: git-send-email 2.33.0
In-Reply-To: <20220616094122.1909-1-fengchengwen@huawei.com>
References: <20220128124830.427-1-kalesh-anakkur.purayil@broadcom.com>
 <20220616094122.1909-1-fengchengwen@huawei.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
X-Originating-IP: [10.67.165.24]
X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To
 dggpeml500024.china.huawei.com (7.185.36.10)
X-CFilter-Loop: Reflected
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Some PMDs (e.g. hns3) can detect hardware or firmware errors and try to
recover from them. During recovery, the PMD sets the data path pointers
to dummy functions (which prevents a crash) and makes sure that control
path operations fail with return code -EBUSY.

While recovery is in progress, services are affected from the
perspective of the application. For example, the Rx/Tx burst APIs cannot
receive or send packets, and the control plane APIs return failure.

In some service scenarios, the application needs to be aware of the
event to determine whether to migrate services. So three events are
introduced:

1. RTE_ETH_EVENT_ERR_RECOVERING: the PMD must trigger this event to
notify the application that it detected a hardware or firmware error
and is trying to recover.
2. RTE_ETH_EVENT_RECOVER_SUCCESS: the PMD must trigger this event to
notify the application that it has recovered from the error. By then
the PMD has already re-configured the port to its state prior to the
error.
3. RTE_ETH_EVENT_RECOVER_FAILED: the PMD must trigger this event to
notify the application that it has failed to recover from the error.
The port may not be usable anymore.

Note: with these events, error recovery is mainly performed by the PMD,
unlike RTE_ETH_EVENT_INTR_RESET, where error recovery is performed by
the application. The PMD must ensure that the two error handling
methods are not used at the same time.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
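Note: the event sequence described above can be sketched from the
application side as a small state machine. This is only an illustration
under stated assumptions, not part of the patch: the app_* names and the
APP_EV_* enum are hypothetical stand-ins for the three new
rte_eth_event_type values, and the "migrate only on failure" policy is an
assumed example. In a real application, app_port_on_event() would
presumably be called from a callback registered with
rte_eth_dev_callback_register().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the three new rte_eth_event_type values. */
enum app_recovery_event {
	APP_EV_ERR_RECOVERING,  /* mirrors RTE_ETH_EVENT_ERR_RECOVERING */
	APP_EV_RECOVER_SUCCESS, /* mirrors RTE_ETH_EVENT_RECOVER_SUCCESS */
	APP_EV_RECOVER_FAILED,  /* mirrors RTE_ETH_EVENT_RECOVER_FAILED */
};

/* Hypothetical per-port state tracked by the application. */
enum app_port_state {
	APP_PORT_NORMAL,     /* data path and control path usable */
	APP_PORT_RECOVERING, /* PMD is recovering: no Rx/Tx, control ops fail */
	APP_PORT_FAILED,     /* recovery failed: port may not be usable */
};

/* Compute the next port state from the current state and an event. */
static enum app_port_state
app_port_on_event(enum app_port_state cur, enum app_recovery_event ev)
{
	switch (ev) {
	case APP_EV_ERR_RECOVERING:
		return APP_PORT_RECOVERING;
	case APP_EV_RECOVER_SUCCESS:
		/* Only meaningful while recovering; otherwise keep the state. */
		return cur == APP_PORT_RECOVERING ? APP_PORT_NORMAL : cur;
	case APP_EV_RECOVER_FAILED:
		return APP_PORT_FAILED;
	}
	return cur;
}

/* The application should only issue Rx/Tx and control calls when true. */
static bool
app_port_usable(enum app_port_state s)
{
	return s == APP_PORT_NORMAL;
}

/* Assumed policy: migrate services only once recovery has failed. */
static bool
app_port_should_migrate(enum app_port_state s)
{
	return s == APP_PORT_FAILED;
}
```

An application could keep one app_port_state per port, update it in its
event callback, and check app_port_usable() before calling control path
APIs that would otherwise fail with -EBUSY during recovery.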
 doc/guides/prog_guide/poll_mode_drv.rst | 32 +++++++++++++++++++++++++
 doc/guides/rel_notes/release_22_07.rst  | 11 +++++++++
 lib/ethdev/rte_ethdev.h                 |  6 +++++
 3 files changed, 49 insertions(+)

diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 9d081b1cba..6398917485 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -627,3 +627,35 @@ by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+Error Recovery Notification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Some PMDs (e.g. hns3) can detect hardware or firmware errors and try to
+recover from them. During recovery, the PMD sets the data path pointers to
+dummy functions (which prevents a crash) and makes sure that control path
+operations fail with return code -EBUSY.
+
+While recovery is in progress, services are affected from the perspective of
+the application. For example, the Rx/Tx burst APIs cannot receive or send
+packets, and the control plane APIs return failure.
+
+In some service scenarios, the application needs to be aware of the event to
+determine whether to migrate services. So three events were introduced.
+
+The PMD must trigger the RTE_ETH_EVENT_ERR_RECOVERING event to notify the
+application that it detected a hardware or firmware error and is recovering.
+
+The PMD must trigger the RTE_ETH_EVENT_RECOVER_SUCCESS event to notify the
+application that it has recovered from the error. By then the PMD has already
+re-configured the port to its state prior to the error.
+
+The PMD must trigger the RTE_ETH_EVENT_RECOVER_FAILED event to notify the
+application that it has failed to recover from the error. The port may not be
+usable anymore.
+
+.. note::
+        With these events, error recovery is mainly performed by the PMD,
+        unlike RTE_ETH_EVENT_INTR_RESET, where error recovery is performed by
+        the application. The PMD must ensure that the two error handling
+        methods are not used at the same time.
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 6fc044edaa..b237bd3303 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -108,6 +108,17 @@ New Features
 
   Added an API which can get the device type of vDPA device.
 
+* **Added error recovery notification.**
+
+  Added error recovery notification to the application, including:
+
+  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
+    that the port is recovering from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVER_SUCCESS`` for the PMD to report
+    that the port has recovered successfully from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVER_FAILED`` for the PMD to report
+    that the port has failed to recover from an error.
+
 * **Updated Amazon ena driver.**
 
   The new driver version (v2.7.0) includes:
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 045ee64747..6998f6f0be 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -3928,6 +3928,12 @@ enum rte_eth_event_type {
 	 * @see rte_eth_rx_avail_thresh_set()
 	 */
 	RTE_ETH_EVENT_RX_AVAIL_THRESH,
+	/** Port recovering from a hardware or firmware error */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+	/** Port recovered successfully from the error */
+	RTE_ETH_EVENT_RECOVER_SUCCESS,
+	/** Port failed to recover from the error */
+	RTE_ETH_EVENT_RECOVER_FAILED,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.33.0