From: fengchengwen <fengchengwen@huawei.com>
To: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>,
<thomas@monjalon.net>, <ferruh.yigit@xilinx.com>,
<ferruh.yigit@amd.com>
Cc: <dev@dpdk.org>, <kalesh-anakkur.purayil@broadcom.com>,
<somnath.kotur@broadcom.com>, <ajit.khaparde@broadcom.com>,
<mdr@ashroe.eu>
Subject: Re: [PATCH v12 2/5] ethdev: support proactive error handling mode
Date: Thu, 13 Oct 2022 20:50:35 +0800
Message-ID: <c8f8a43c-389e-9370-7f43-e670edede013@huawei.com>
In-Reply-To: <f2c881de-32c8-f402-8b73-24d70865bca1@oktetlabs.ru>

Hi Andrew,

I reworked part of the rst according to your comments and sent it in v13; please take a look.

Thanks.

On 2022/10/13 16:58, Andrew Rybchenko wrote:
> On 10/12/22 06:45, Chengwen Feng wrote:
>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>
>> Some PMDs (e.g. hns3) can detect hardware or firmware errors. One
>> error recovery mode is to report the RTE_ETH_EVENT_INTR_RESET event and
>> wait for the application to invoke rte_eth_dev_reset() to recover the
>> port. However, this mode has the following weaknesses:
>>
>> 1) Due to different hardware and software designs, the recovery process
>> of some NIC ports requires multiple handshakes with the firmware and the
>> PF (when the port is a VF). It takes a long time to complete the entire
>> operation for one port. If multiple ports (for example, multiple VFs of
>> a PF) are reset at the same time, other VFs may fail to be reset,
>> because the reset processing is serial: the previous VFs must be
>> processed before the subsequent VFs.
>>
>> 2) The impact on the application layer is great: it must stop the
>> working queues, stop calling the Rx and Tx functions, call
>> rte_eth_dev_reset(), and then set everything up again.
>>
>> This patch introduces a proactive error handling mode, in which the PMD
>> tries to recover from the errors itself. In this process, the PMD sets
>> the data path pointers to dummy functions (which prevents a crash), and
>> also makes sure that control path operations fail with return code
>> -EBUSY.
>>
>> Because the PMD recovers automatically, the application can only sense
>> that the data flow is disconnected for a while and the control API
>> returns an error in this period.
>>
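To illustrate the mechanism described above, here is a minimal standalone sketch. It is not the hns3 or bnxt code and does not use the real ethdev structures: every `sketch_` name, the port structure and the MTU control operation are hypothetical stand-ins chosen only for illustration, and concurrency between the data path and the recovery thread is ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical burst function type; the real ethdev prototypes differ. */
typedef uint16_t (*sketch_burst_fn)(void *queue, void **pkts, uint16_t nb_pkts);

/* Hypothetical per-port state kept by the driver. */
struct sketch_port {
	sketch_burst_fn rx_burst;      /* pointer used by the data path */
	sketch_burst_fn tx_burst;      /* pointer used by the data path */
	sketch_burst_fn real_rx_burst; /* saved working Rx function */
	sketch_burst_fn real_tx_burst; /* saved working Tx function */
	bool recovering;               /* true while recovery is running */
	uint16_t mtu;
};

/* Stand-in for a working burst function: pretends all packets were handled. */
static uint16_t sketch_real_burst(void *queue, void **pkts, uint16_t nb_pkts)
{
	(void)queue;
	(void)pkts;
	return nb_pkts;
}

/* Dummy burst function: touches no hardware and handles no packets. */
static uint16_t sketch_dummy_burst(void *queue, void **pkts, uint16_t nb_pkts)
{
	(void)queue;
	(void)pkts;
	(void)nb_pkts;
	return 0;
}

/* Error detected: switch the data path to the dummy functions. */
static void sketch_recovery_begin(struct sketch_port *port)
{
	assert(port != NULL);
	port->rx_burst = sketch_dummy_burst;
	port->tx_burst = sketch_dummy_burst;
	port->recovering = true;
}

/* Recovery finished successfully: restore the working functions. */
static void sketch_recovery_end(struct sketch_port *port)
{
	assert(port != NULL);
	port->rx_burst = port->real_rx_burst;
	port->tx_burst = port->real_tx_burst;
	port->recovering = false;
}

/* Example control path operation: refused with -EBUSY during recovery. */
static int sketch_set_mtu(struct sketch_port *port, uint16_t mtu)
{
	assert(port != NULL);
	if (port->recovering)
		return -EBUSY;
	port->mtu = mtu;
	return 0;
}
```

In this sketch the application keeps calling the same burst pointers throughout; it only sees zero packets and -EBUSY until the pointers are restored.
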
>> In order to sense the error happening/recovering, three events were
>> introduced:
>>
>> 1) RTE_ETH_EVENT_ERR_RECOVERING: notifies the application that the PMD
>> detected an error and that recovery is starting. Upon receiving the
>> event, the application should not invoke any control path APIs until it
>> receives the RTE_ETH_EVENT_RECOVERY_SUCCESS or
>> RTE_ETH_EVENT_RECOVERY_FAILED event.
>>
>> 2) RTE_ETH_EVENT_RECOVERY_SUCCESS: notifies the application that the
>> port recovered successfully from the error. The PMD has already
>> re-configured the port, and the effect is the same as that of a restart
>> operation.
>>
>> 3) RTE_ETH_EVENT_RECOVERY_FAILED: notifies the application that recovery
>> from the error failed and the port should not be used anymore. The
>> application should close the port.
>>
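From the application side, the three events above can be tracked with a small state machine. The following is only a sketch under stated assumptions: the `sketch_` enums are hypothetical stand-ins for the ethdev event values (so that it compiles without DPDK headers), and keeping a failed port in the failed state on any later event is my assumption, not something the patch specifies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three events introduced by the patch. */
enum sketch_event {
	SKETCH_EVENT_ERR_RECOVERING,
	SKETCH_EVENT_RECOVERY_SUCCESS,
	SKETCH_EVENT_RECOVERY_FAILED,
};

/* Application-side view of one port. */
enum sketch_port_state {
	SKETCH_PORT_READY,      /* control path APIs may be called */
	SKETCH_PORT_RECOVERING, /* PMD is recovering; do not call control APIs */
	SKETCH_PORT_FAILED,     /* recovery failed; the port should be closed */
};

/* Compute the next state for a port given the event just received. */
static enum sketch_port_state
sketch_on_event(enum sketch_port_state cur, enum sketch_event ev)
{
	/* Assumption: a failed port stays failed until the application closes it. */
	if (cur == SKETCH_PORT_FAILED)
		return SKETCH_PORT_FAILED;

	switch (ev) {
	case SKETCH_EVENT_ERR_RECOVERING:
		/* May arrive more than once before the result is reported. */
		return SKETCH_PORT_RECOVERING;
	case SKETCH_EVENT_RECOVERY_SUCCESS:
		return SKETCH_PORT_READY;
	case SKETCH_EVENT_RECOVERY_FAILED:
		return SKETCH_PORT_FAILED;
	}
	return cur;
}

/* Control path APIs should only be invoked while the port is ready. */
static bool sketch_control_path_allowed(enum sketch_port_state state)
{
	assert(state == SKETCH_PORT_READY ||
	       state == SKETCH_PORT_RECOVERING ||
	       state == SKETCH_PORT_FAILED);
	return state == SKETCH_PORT_READY;
}
```

In a real application, a function like `sketch_on_event()` would presumably be driven from a callback registered with rte_eth_dev_callback_register() for each of the three events.
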
>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
>
> With a few nits below,
>
> Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>
> [snip]
>
>> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
>> index 9d081b1cba..73941a74bd 100644
>> --- a/doc/guides/prog_guide/poll_mode_drv.rst
>> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
>> @@ -627,3 +627,41 @@ by application.
>> The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
>> the application to handle reset event. It is duty of application to
>> handle all synchronization before it calls rte_eth_dev_reset().
>> +
>> +The above error handling mode is known as ``RTE_ETH_ERROR_HANDLE_MODE_PASSIVE``.
>> +
>> +Proactive Error Handling Mode
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +If PMD supports ``RTE_ETH_ERROR_HANDLE_MODE_PROACTIVE``, it means once detect
>> +hardware or firmware errors, the PMD will try to recover from the errors. In
>> +this process, the PMD sets the data path pointers to dummy functions (which
>> +will prevent the crash), and also make sure the control path operations failed
>> +with retcode -EBUSY.
>> +
>> +Also in this process, from the perspective of application, services are
>> +affected. For example, the Rx/Tx bust APIs cannot receive and send packets,
>
> bust -> burst
>
>> +and the control plane API return failure.
>
> I think we need to highlight here that the key advantage of
> proactive error recovery is that it requires nothing from PMD by
> default. The recovery simply happens.
>
>> +
>> +In some service scenarios, application needs to be aware of the event to
>> +determine whether to migrate services. So three events were introduced:
>> +
>> +* RTE_ETH_EVENT_ERR_RECOVERING: used to notify the application that it detected
>> + an error and the recovery is being started. Upon receiving the event, the
>> + application should not invoke any control path APIs until receiving
>> + RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
>> +
>> +* RTE_ETH_EVENT_RECOVERY_SUCCESS: used to notify the application that it
>> + recovers successful from the error, the PMD already re-configures the port,
>> + and the effect is the same as that of the restart operation.
>> +
>> +* RTE_ETH_EVENT_RECOVERY_FAILED: used to notify the application that it
>> + recovers failed from the error, the port should not usable anymore. the
>> + application should close the port.
>> +
>> +.. note::
>> + * Before the PMD reports the recovery result, the PMD may report the
>> + ``RTE_ETH_EVENT_ERR_RECOVERING`` event again, because a larger error
>> + may occur during the recovery.
>> + * The error handling mode supported by the PMD can be reported through
>> + the ``rte_eth_dev_info_get`` API.
>> diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
>> + * - LRO configuration
>> + * - LSC configuration
>> + * - MTU
>> + * - Mac address (default and those supplied by MAC address array)
>> + * - Promiscuous and allmulticast mode
>> + * - PTP configuration
>> + * - Queue (Rx/Tx) settings
>> + * - Queue statistics mappings
>> + * - RSS configuration by rte_eth_dev_rss_xxx() family
>> + * - Rx checksum configuration
>> + * - Rx interrupt settings
>> + * - Traffic management configuration
>> + * - VLAN configuration (including filtering, tpid, strip, pvid)
>> + * - VMDq configuration
>> + * b) the following configuration maybe retained or not depending on the
>> + * device capabilities:
>> + * - flow rules
>> + * @see RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP
>> + * - shared flow objects
>> + * @see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP
>> + * c) the other configuration will not be stored and will need to be
>> + * re-configured.
>> + */
>> + RTE_ETH_EVENT_RECOVERY_SUCCESS,
>> + /** Port recovers failed from the error.
>> + * It means that the port should not usable anymore. The application
>> + * should close the port.
>> + */
>> + RTE_ETH_EVENT_RECOVERY_FAILED,
>> RTE_ETH_EVENT_MAX /**< max value of this enum */
>> };
>
> [snip]
>
>
> .