From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v12 2/5] ethdev: support proactive error handling mode
To: Andrew Rybchenko
References: <20220128124831.427-1-kalesh-anakkur.purayil@broadcom.com> <20221012034555.9781-1-fengchengwen@huawei.com> <20221012034555.9781-3-fengchengwen@huawei.com>
From: fengchengwen
Date: Thu, 13 Oct 2022 20:50:35 +0800
List-Id: DPDK patches and discussions

Hi Andrew,

I reworked part of the rst according to your comments and sent it as v13. Please take a look. Thanks.

On 2022/10/13 16:58, Andrew Rybchenko wrote:
> On 10/12/22 06:45, Chengwen Feng wrote:
>> From: Kalesh AP
>>
>> Some PMDs (e.g. hns3) could detect hardware or firmware errors, one
>> error recovery mode is to report RTE_ETH_EVENT_INTR_RESET event, and
>> wait for application invoke rte_eth_dev_reset() to recover the port,
>> however, this mode has the following weaknesses:
>>
>> 1) Due to different hardware and software design, some NIC port recovery
>> process requires multiple handshakes with the firmware and PF (when the
>> port is VF). It takes a long time to complete the entire operation for
>> one port, If multiple ports (for example, multiple VFs of a PF) are
>> reset at the same time, other VFs may fail to be reset. (Because the
>> reset processing is serial, the previous VFs must be processed before
>> the subsequent VFs).
>>
>> 2) The impact on the application layer is great, and it should stop
>> working queues, stop calling Rx and Tx functions, and then call
>> rte_eth_dev_reset(), and re-setup all again.
>>
>> This patch introduces proactive error handling mode, the PMD will try
>> to recover from the errors itself. In this process, the PMD sets the
>> data path pointers to dummy functions (which will prevent the crash),
>> and also make sure the control path operations failed with retcode
>> -EBUSY.
>>
>> Because the PMD recovers automatically, the application can only sense
>> that the data flow is disconnected for a while and the control API
>> returns an error in this period.
>>
>> In order to sense the error happening/recovering, three events were
>> introduced:
>>
>> 1) RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it
>> detected an error and the recovery is being started. Upon receiving the
>> event, the application should not invoke any control path APIs until
>> receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>
>> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that
>> it recovers successful from the error, the PMD already re-configures the
>> port, and the effect is the same as that of the restart operation.
>>
>> 3) RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
>> recovers failed from the error, the port should not usable anymore. The
>> application should close the port.
>>
>> Signed-off-by: Kalesh AP
>> Signed-off-by: Somnath Kotur
>> Signed-off-by: Chengwen Feng
>> Reviewed-by: Ajit Khaparde
>
> With few nits below,
>
> Acked-by: Andrew Rybchenko
>
> [snip]
>
>> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
>> index 9d081b1cba..73941a74bd 100644
>> --- a/doc/guides/prog_guide/poll_mode_drv.rst
>> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
>> @@ -627,3 +627,41 @@ by application.
>>   The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
>>   the application to handle reset event. It is duty of application to
>>   handle all synchronization before it calls rte_eth_dev_reset().
>> +
>> +The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
>> +
>> +Proactive Error Handling Mode
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +If PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, it means once detect
>> +hardware or firmware errors, the PMD will try to recover from the errors. In
>> +this process, the PMD sets the data path pointers to dummy functions (which
>> +will prevent the crash), and also make sure the control path operations failed
>> +with retcode -EBUSY.
>> +
>> +Also in this process, from the perspective of application, services are
>> +affected. For example, the Rx/Tx bust APIs cannot receive and send packets,
>
> bust -> burst
>
>> +and the control plane API return failure.
>
> I think we need to highlight here that the key advantage of the
> proactive error recover that it requires nothing from PMD by
> default.
> The recover simply happens.
>
>> +
>> +In some service scenarios, application needs to be aware of the event to
>> +determine whether to migrate services. So three events were introduced:
>> +
>> +* RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it detected
>> +  an error and the recovery is being started. Upon receiving the event, the
>> +  application should not invoke any control path APIs until receiving
>> +  RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
>> +
>> +* RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that it
>> +  recovers successful from the error, the PMD already re-configures the port,
>> +  and the effect is the same as that of the restart operation.
>> +
>> +* RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
>> +  recovers failed from the error, the port should not usable anymore. the
>> +  application should close the port.
>> +
>> +.. note::
>> +        * Before the PMD reports the recovery result, the PMD may report the
>> +          ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
>> +          may occur during the recovery.
>> +        * The error handling mode supported by the PMD can be reported through
>> +          the ``rte_eth_dev_info_get`` API.
>> diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
>> +     *    - LRO configuration
>> +     *    - LSC configuration
>> +     *    - MTU
>> +     *    - Mac address (default and those supplied by MAC address array)
>> +     *    - Promiscuous and allmulticast mode
>> +     *    - PTP configuration
>> +     *    - Queue (Rx/Tx) settings
>> +     *    - Queue statistics mappings
>> +     *    - RSS configuration by rte_eth_dev_rss_xxx() family
>> +     *    - Rx checksum configuration
>> +     *    - Rx interrupt settings
>> +     *    - Traffic management configuration
>> +     *    - VLAN configuration (including filtering, tpid, strip, pvid)
>> +     *    - VMDq configuration
>> +     * b) the following configuration maybe retained or not depending on the
>> +     *    device capabilities:
>> +     *    - flow rules
>> +     *      @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
>> +     *    - shared flow objects
>> +     *      @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
>> +     * c) the other configuration will not be stored and will need to be
>> +     *    re-configured.
>> +     */
>> +    RTE_ETH_EVENT_RECOVERY_SUCCESS,
>> +    /** Port recovers failed from the error.
>> +     * It means that the port should not usable anymore. The application
>> +     * should close the port.
>> +     */
>> +    RTE_ETH_EVENT_RECOVERY_FAILED,
>>       RTE_ETH_EVENT_MAX       /**< max value of this enum */
>>   };
>
> [snip]
>
>
> .
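
For illustration only, the application-side reaction to the three events described in the quoted patch could be sketched roughly as below. This is not code from the patch. The enum, the struct and the function names are hypothetical local stand-ins so that the sketch builds without DPDK; a real application would presumably use the RTE_ETH_EVENT_* values from rte_ethdev.h and hook its handler up with rte_eth_dev_callback_register().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins for RTE_ETH_EVENT_ERR_RECOVERING,
 * RTE_ETH_EVENT_RECOVERY_SUCCESS and RTE_ETH_EVENT_RECOVERY_FAILED.
 * They are defined locally so the sketch compiles without DPDK headers.
 */
enum sketch_event {
	SKETCH_EVENT_ERR_RECOVERING,
	SKETCH_EVENT_RECOVERY_SUCCESS,
	SKETCH_EVENT_RECOVERY_FAILED,
};

/* Hypothetical per-port bookkeeping kept by the application. */
struct sketch_port {
	bool ctrl_path_allowed;        /* may the app call control path APIs? */
	bool must_close;               /* recovery failed, port should be closed */
	unsigned int recovering_count; /* total ERR_RECOVERING events seen */
};

void sketch_port_init(struct sketch_port *port)
{
	assert(port != NULL);
	port->ctrl_path_allowed = true;
	port->must_close = false;
	port->recovering_count = 0;
}

/* Returns 0 if the event was handled, -1 for an unknown event. */
int sketch_on_event(struct sketch_port *port, enum sketch_event ev)
{
	assert(port != NULL);

	switch (ev) {
	case SKETCH_EVENT_ERR_RECOVERING:
		/*
		 * May arrive more than once before a result is reported,
		 * so the handler only counts it and keeps the gate closed.
		 */
		port->ctrl_path_allowed = false;
		port->recovering_count++;
		return 0;
	case SKETCH_EVENT_RECOVERY_SUCCESS:
		/* The PMD has re-configured the port; control path is usable. */
		port->ctrl_path_allowed = true;
		return 0;
	case SKETCH_EVENT_RECOVERY_FAILED:
		/* The port is not usable anymore; remember to close it. */
		port->ctrl_path_allowed = false;
		port->must_close = true;
		return 0;
	}
	return -1;
}
```

The point of the sketch is only that the application gates its own control path calls between ERR_RECOVERING and the recovery result, and that a repeated ERR_RECOVERING event does not need special handling.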