DPDK patches and discussions
From: Ray Kinsella <mdr@ashroe.eu>
To: Chengwen Feng <fengchengwen@huawei.com>
Cc: thomas@monjalon.net, ferruh.yigit@xilinx.com, dev@dpdk.org,
	kalesh-anakkur.purayil@broadcom.com, somnath.kotur@broadcom.com,
	ajit.khaparde@broadcom.com, Andrew.Rybchenko@oktetlabs.ru
Subject: Re: [PATCH v8 1/4] ethdev: support device error recovery notification
Date: Thu, 23 Jun 2022 16:58:33 +0100
Message-ID: <87edzf4bi9.fsf@mdr78.vserver.site>
In-Reply-To: <20220616094122.1909-2-fengchengwen@huawei.com>


Chengwen Feng <fengchengwen@huawei.com> writes:

> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>
> Some PMDs (e.g. hns3) can detect hardware or firmware errors and try
> to recover from them. In this process, the PMD sets the data path
> pointers to dummy functions (which prevents a crash), and also
> makes sure the control path operations fail with retcode -EBUSY.
>
> Also in this process, from the perspective of the application, services
> are affected. For example, the Rx/Tx burst APIs cannot receive and send
> packets, and the control plane APIs return failure.
>
> In some service scenarios, application needs to be aware of the event
> to determine whether to migrate services. So three events were
> introduced:
>
> 1. RTE_ETH_EVENT_ERR_RECOVERING: the PMD must trigger this event to
> notify the application that it detected a hardware or firmware error
> and tries to recover.
> 2. RTE_ETH_EVENT_RECOVER_SUCCESS: the PMD must trigger this event to
> notify the application that it has recovered from the error. And PMD
> already re-configures the port to the state prior to the error.
> 3. RTE_ETH_EVENT_RECOVER_FAILED: the PMD must trigger this event to
> notify the application that it has failed to recover from the error.
> The port may not be usable anymore.
>
> Note: for these events, error recovery is mainly performed by the
> PMD, unlike RTE_ETH_EVENT_INTR_RESET, where error recovery is
> performed by the application. The PMD must ensure that these two
> error handling methods cannot be used at the same time.
>
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> ---
>  doc/guides/prog_guide/poll_mode_drv.rst | 32 +++++++++++++++++++++++++
>  doc/guides/rel_notes/release_22_07.rst  | 11 +++++++++
>  lib/ethdev/rte_ethdev.h                 |  6 +++++
>  3 files changed, 49 insertions(+)
>
> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
> index 9d081b1cba..6398917485 100644
> --- a/doc/guides/prog_guide/poll_mode_drv.rst
> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> @@ -627,3 +627,35 @@ by application.
>  The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
>  the application to handle reset event. It is duty of application to
>  handle all synchronization before it calls rte_eth_dev_reset().
> +
> +Error Recovery Notification
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Some PMDs (e.g. hns3) can detect hardware or firmware errors and try to
> +recover from them. In this process, the PMD sets the data path pointers
> +to dummy functions (which prevents a crash), and also makes sure the
> +control path operations fail with retcode -EBUSY.
> +
> +Also in this process, from the perspective of the application, services are
> +affected. For example, the Rx/Tx burst APIs cannot receive and send packets,
> +and the control plane APIs return failure.
> +
> +In some service scenarios, the application needs to be aware of the event to
> +determine whether to migrate services. So three events were introduced.
> +
> +The PMD must trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
> +application that it detected a hardware or firmware error and tries to recover.
> +
> +The PMD must trigger RTE_ETH_EVENT_RECOVER_SUCCESS event to notify the
> +application that it has recovered from the error. And PMD already re-configures
> +the port to the state prior to the error.
> +
> +The PMD must trigger RTE_ETH_EVENT_RECOVER_FAILED event to notify the
> +application that it has failed to recover from the error. The port may not be
> +usable anymore.
> +
> +.. note::
> +        The error recovery of these events is mainly performed by the PMD.
> +        Unlike the RTE_ETH_EVENT_INTR_RESET which the error recovery is
> +        performed by the application. The PMD must ensure that the above two
> +        error handling methods cannot be used at the same time.
> diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
> index 6fc044edaa..b237bd3303 100644
> --- a/doc/guides/rel_notes/release_22_07.rst
> +++ b/doc/guides/rel_notes/release_22_07.rst
> @@ -108,6 +108,17 @@ New Features
>  
>    Added an API which can get the device type of vDPA device.
>  
> +* **Added error recover notification.**
> +
> +  Added error recover notification to application including:
> +
> +  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the PMD to report
> +    that the port is recovering from an error.
> +  * Added new event: ``RTE_ETH_EVENT_RECOVER_SUCCESS`` for the PMD to report
> +    that the port recover successful from an error.

RTE_ETH_EVENT_RECOVERY_SUCCESS

> +  * Added new event: ``RTE_ETH_EVENT_RECOVER_FAILED`` for the PMD to report

RTE_ETH_EVENT_RECOVERY_FAILED

> +    that the prot recover failed from an error

to report that port recovery failed

> +
>  * **Updated Amazon ena driver.**
>  
>    The new driver version (v2.7.0) includes:
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 045ee64747..6998f6f0be 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -3928,6 +3928,12 @@ enum rte_eth_event_type {
>  	 * @see rte_eth_rx_avail_thresh_set()
>  	 */
>  	RTE_ETH_EVENT_RX_AVAIL_THRESH,
> +	/** Port recovering from a hardware or firmware error */
> +	RTE_ETH_EVENT_ERR_RECOVERING,
> +	/** Port recovers successful from the error */
> +	RTE_ETH_EVENT_RECOVER_SUCCESS,

RTE_ETH_EVENT_RECOVERY_SUCCESS

> +	/** Port recovers failed from the error */
> +	RTE_ETH_EVENT_RECOVER_FAILED,

RTE_ETH_EVENT_RECOVERY_FAILED

>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>  };
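
For illustration only, here is a rough sketch of how an application might
track port state across the three events described in the patch. Every name
below (the demo_* enums and functions) is hypothetical and not part of
ethdev; a real application would use the enum rte_eth_event_type values
from rte_ethdev.h in an event callback. The transition rules are also an
assumption, not something the patch mandates.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the three events proposed in the patch. */
enum demo_event {
	DEMO_EVENT_ERR_RECOVERING,
	DEMO_EVENT_RECOVER_SUCCESS,
	DEMO_EVENT_RECOVER_FAILED,
};

/* Application-side view of a port (assumed, not defined by ethdev). */
enum demo_port_state {
	DEMO_PORT_RUNNING,
	DEMO_PORT_RECOVERING,
	DEMO_PORT_FAILED,
};

/* Return the new port state after an event is delivered. */
static enum demo_port_state
demo_on_event(enum demo_port_state cur, enum demo_event ev)
{
	switch (ev) {
	case DEMO_EVENT_ERR_RECOVERING:
		/* The PMD detected an error and started recovering. */
		return DEMO_PORT_RECOVERING;
	case DEMO_EVENT_RECOVER_SUCCESS:
		/* Only meaningful if a recovery was in progress. */
		return cur == DEMO_PORT_RECOVERING ? DEMO_PORT_RUNNING : cur;
	case DEMO_EVENT_RECOVER_FAILED:
		/* The port may not be usable anymore. */
		return DEMO_PORT_FAILED;
	}
	return cur;
}

/* The application should only call Rx/Tx and control APIs when true. */
static bool
demo_port_usable(enum demo_port_state s)
{
	return s == DEMO_PORT_RUNNING;
}
```

While the state is DEMO_PORT_RECOVERING the application would skip the port
(or migrate services), and resume only after the success event arrives.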



-- 
Regards, Ray K

