DPDK patches and discussions
 help / color / mirror / Atom feed
From: Ajit Khaparde <ajit.khaparde@broadcom.com>
To: Ferruh Yigit <ferruh.yigit@intel.com>
Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>,
	dpdk-dev <dev@dpdk.org>, Thomas Monjalon <thomas@monjalon.net>,
	Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>,
	 Konstantin Ananyev <konstantin.ananyev@intel.com>,
	Asaf Penso <asafp@nvidia.com>,
	 Stephen Hemminger <stephen@networkplumber.org>,
	"jerinj@marvell.com" <jerinj@marvell.com>
Subject: Re: [dpdk-dev] [PATCH v7 1/4] ethdev: support device reset and recovery events
Date: Thu, 3 Feb 2022 12:28:11 -0800	[thread overview]
Message-ID: <CACZ4nhuTX4dvDJvfq7edSbbJ7X+sE1uvTS0zS6uU8+qRhQCE6A@mail.gmail.com> (raw)
In-Reply-To: <ff901ac4-afac-0153-a3c6-945ed2f62611@intel.com>

[-- Attachment #1: Type: text/plain, Size: 8887 bytes --]

On Tue, Feb 1, 2022 at 5:20 AM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
>
> On 2/1/2022 1:09 PM, Kalesh Anakkur Purayil wrote:
> > Thank you Ferruh for the review. Please see inline.
> >
> > On Tue, Feb 1, 2022 at 5:41 PM Ferruh Yigit <ferruh.yigit@intel.com <mailto:ferruh.yigit@intel.com>> wrote:
> >
> >     On 1/28/2022 12:48 PM, Kalesh A P wrote:
> >      > From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com <mailto:kalesh-anakkur.purayil@broadcom.com>>
> >      >
> >      > Adding support for the device reset and recovery events in the
> >      > rte_eth_event framework. FW error and FW reset conditions would be
> >      > managed internally by the PMD without needing application intervention.
> >      > In such cases, PMD would need reset/recovery events to notify application
> >      > that PMD is undergoing a reset.
> >      >
> >      > While most of the recovery process is transparent to the application since
> >      > most of the driver ensures recovery from FW reset or FW error conditions,
> >      > the application will have to reprogram any flows which were offloaded to
> >      > the underlying hardware.
> >      >
> >      > Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com <mailto:kalesh-anakkur.purayil@broadcom.com>>
> >      > Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com <mailto:somnath.kotur@broadcom.com>>
> >      > Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com <mailto:ajit.khaparde@broadcom.com>>
> >
> >     More developer cc'ed.
> >
> >      > ---
> >      >   doc/guides/prog_guide/poll_mode_drv.rst | 24 ++++++++++++++++++++++++
> >      >   lib/ethdev/rte_ethdev.h                 | 18 ++++++++++++++++++
> >      >   2 files changed, 42 insertions(+)
> >      >
> >      > diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
> >      > index 6831289..9ecc0e4 100644
> >      > --- a/doc/guides/prog_guide/poll_mode_drv.rst
> >      > +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> >      > @@ -623,3 +623,27 @@ by application.
> >      >   The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
> >      >   the application to handle reset event. It is duty of application to
> >      >   handle all synchronization before it calls rte_eth_dev_reset().
> >      > +
> >      > +Error recovery support
> >      > +~~~~~~~~~~~~~~~~~~~~~~
> >      > +
> >      > +When the PMD detects a FW reset or error condition, it may try to recover
> >      > +from the error without needing the application intervention. In such cases,
> >      > +PMD would need events to notify the application that it is undergoing
> >      > +an error recovery.
> >      > +
> >      > +The PMD should trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
> >      > +application that PMD detected a FW reset or FW error condition. PMD may
> >      > +try to recover from the error by itself. Data path may be quiesced and
> >      > +control path operations may fail during the recovery period. The application
> >      > +should stop polling till it receives RTE_ETH_EVENT_RECOVERED event from the PMD.
> >      > +
> >
> >     Between the time FW error occurred and the application receive the RECOVERING event,
> >     datapath will continue to poll and application may call control APIs, so the event
> >     really is not solving the issue and driver somehow should be sure this won't crash
> >     the application, in that case not sure about the benefit of this event.
> >
> > [Kalesh]: As soon as the driver detects a FW dead or reset condition, it sets the fastpath pointers to dummy functions. This will prevent the crash. All control path operations would fail with -EBUSY. This change is already there in bnxt PMD. This event is a notification to the application that the PMD is recovering from a FW error condition so that it can stop polling and issue control path operations.
> >
>
> This was my point indeed. PMD can't rely on this event and should take actions (for both
> datapath and control path) to prevent possible crash/error. If that is the case event
> doesn't have that much benefit since PMD already has to protect itself.

This is an attempt to prevent PMD and applications from encountering
unexpected situations because of an error in hardware or firmware.
The unexpected scenarios can include lockups to segfaults apart from
other situations.

When hardware or firmware encounters and error, propagating it to the
application for it to react can be slow, time consuming and could be
vendor specific.

Instead in the case of Broadcom devices we are trying to contain the
detection and recovery within the hardware, firmware mostly and to a
small extent driver framework.
In the case of kernel drivers, this allows avoiding a system reset as
much as possible while in the case of DPDK PMD, it can prevent
application reloads or ungraceful exits.

The PMD generated notification to the application will be more like a
"For Your Information" for most of the applications.
But applications can decide to act on the notification and take
specific steps - like flushing or deleting all the existing flow
rules, meters etc.. without waiting for success or failure indication.
Applications can also re-offload the flow rules, recreate meters based
on the notification allowing them to sync with the fresh hardware,
firmware state.

We think other PMD can also take advantage of these notifications to
provide better reliability, accessibility and serviceability in
general.

Thanks

Ajit
>
> >
> >      > +The PMD should trigger RTE_ETH_EVENT_RECOVERED event to notify the application
> >      > +that the it has recovered from the error condition. PMD re-configures the port
> >      > +to the state prior to the error condition. Control path and data path are up now.
> >      > +Since the device has undergone a reset, flow rules offloaded prior to reset
> >      > +may be lost and the application should recreate the rules again.
> >      > +
> >
> >     I think the most difficult part here is clarify what application should do
> >     when this event received consistent for all devices, "flow rules may be lost"
> >     looks very vague to me.
> >     Unless it is not clear for application what to do when this event is received,
> >     it is not that useful or it will be specific to some PMDs. And I can see it is
> >     hard to clarify this but perhaps we can define a set of common behavior.
> >     Not sure what others are thinking.
> >
> > [Kalesh]: Sure, let's wait for others' opinions as well.
> >
> >
> >      > +The PMD should trigger RTE_ETH_EVENT_INTR_RMV event to notify the application
> >      > +that it has failed to recover from the error condition. The device may not be
> >      > +usable anymore.
> >      > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> >      > index 147cc1c..a46819f 100644
> >      > --- a/lib/ethdev/rte_ethdev.h
> >      > +++ b/lib/ethdev/rte_ethdev.h
> >      > @@ -3818,6 +3818,24 @@ enum rte_eth_event_type {
> >      >       RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >      >       RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> >      >       RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> >      > +     RTE_ETH_EVENT_ERR_RECOVERING,
> >      > +                     /**< port recovering from an error
> >      > +                      *
> >      > +                      * PMD detected a FW reset or error condition.
> >      > +                      * PMD will try to recover from the error.
> >      > +                      * Data path may be quiesced and Control path operations
> >      > +                      * may fail at this time.
> >      > +                      */
> >
> >     Please put multi line comments before enum, Andrew did a set of cleanups for these.
> >
> > [Kalesh]: Sure, will do.
> >
> >
> >      > +     RTE_ETH_EVENT_RECOVERED,
> >      > +                     /**< port recovered from an error
> >      > +                      *
> >      > +                      * PMD has recovered from the error condition.
> >      > +                      * Control path and Data path are up now.
> >      > +                      * PMD re-configures the port to the state prior to the error.
> >      > +                      * Since the device has undergone a reset, flow rules
> >      > +                      * offloaded prior to reset may be lost and
> >      > +                      * the application should recreate the rules again.
> >      > +                      */
> >      >       RTE_ETH_EVENT_MAX       /**< max value of this enum */
> >      >   };
> >      >
> >
> >
> >
> > --
> > Regards,
> > Kalesh A P
>

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4218 bytes --]

  reply	other threads:[~2022-02-03 20:28 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event Kalesh A P
2020-03-11 13:20   ` Thomas Monjalon
2020-03-12  3:31     ` Kalesh Anakkur Purayil
2020-03-12  7:29       ` Thomas Monjalon
2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 2/3] net/bnxt: notify applications about device reset Kalesh A P
2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-03-11 13:19 ` [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Thomas Monjalon
2020-03-12  3:25   ` Kalesh Anakkur Purayil
2020-03-12  7:34     ` Thomas Monjalon
2020-07-03 16:12       ` Ferruh Yigit
2020-09-30  7:03 ` [dpdk-dev] [RFC V2 " Kalesh A P
2020-09-30  7:03   ` [dpdk-dev] [RFC V2 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30  7:03   ` [dpdk-dev] [RFC V2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30  7:03   ` [dpdk-dev] [RFC V2 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-09-30  7:07 ` [dpdk-dev] [RFC PATCH v2 0/3] librte_ethdev: error recovery support Kalesh A P
2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-09-30  7:12 ` [dpdk-dev] [RFC PATCH v3 0/3] librte_ethdev: error recovery support Kalesh A P
2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30  7:50     ` Thomas Monjalon
2020-09-30  8:35       ` Kalesh Anakkur Purayil
2020-09-30  9:31         ` Thomas Monjalon
2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-10-08 10:53     ` Asaf Penso
2020-09-30 12:33 ` [dpdk-dev] [RFC PATCH v4 0/3] librte_ethdev: error recovery support Kalesh A P
2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-10-06 17:25     ` Ophir Munk
2020-10-07  4:46       ` Kalesh Anakkur Purayil
2020-10-07  8:36         ` Ophir Munk
2020-10-07  9:37         ` Ferruh Yigit
2020-10-07 18:42           ` Ajit Khaparde
2020-10-07 16:49 ` [dpdk-dev] [PATCH v5 0/3] librte_ethdev: error recovery support Kalesh A P
2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-10-08 10:49     ` Asaf Penso
2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-10-09  3:48 ` [dpdk-dev] [PATCH v6 0/3] librte_ethdev: error recovery support Kalesh A P
2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-10-11 21:29     ` Thomas Monjalon
2020-10-12  8:09       ` Andrew Rybchenko
2021-02-18 15:32         ` Ferruh Yigit
2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 3/3] app/testpmd: handle device recovery event Kalesh A P
2022-01-28 12:48   ` [dpdk-dev] [PATCH v7 0/4] ethdev: error recovery support Kalesh A P
2022-01-28 12:48     ` [dpdk-dev] [PATCH v7 1/4] ethdev: support device reset and recovery events Kalesh A P
2022-02-01 12:11       ` Ferruh Yigit
2022-02-01 13:09         ` Kalesh Anakkur Purayil
2022-02-01 13:19           ` Ferruh Yigit
2022-02-03 20:28             ` Ajit Khaparde [this message]
2022-02-10 22:42               ` Thomas Monjalon
2022-02-01 12:52       ` Ferruh Yigit
2022-02-02 11:44         ` Ray Kinsella
2022-02-10 22:16           ` Thomas Monjalon
2022-02-11 10:09             ` Ray Kinsella
2022-02-14 10:16               ` Ray Kinsella
2022-02-14 11:15                 ` Thomas Monjalon
2022-02-14 16:06                   ` Ray Kinsella
2022-02-14 16:25                     ` Thomas Monjalon
2022-02-14 18:27                       ` Ray Kinsella
2022-02-15 13:55                         ` Ray Kinsella
2022-02-15 15:12                           ` Thomas Monjalon
2022-02-15 16:12                             ` Ray Kinsella
2022-05-21 10:33                     ` fengchengwen
2022-05-24 15:11                       ` Ray Kinsella
2022-06-10  0:16                         ` fengchengwen
2022-01-28 12:48     ` [dpdk-dev] [PATCH v7 2/4] app/testpmd: handle device recovery event Kalesh A P
2022-01-28 12:48     ` [dpdk-dev] [PATCH v7 3/4] net/bnxt: notify applications about device reset/recovery Kalesh A P
2022-01-28 12:48     ` [dpdk-dev] [PATCH v7 4/4] doc: update release notes Kalesh A P
2022-02-01 12:12       ` Ferruh Yigit
2022-06-16  9:41     ` [PATCH v8 0/4] ethdev: support error recovery notification Chengwen Feng
2022-06-16  9:41       ` [PATCH v8 1/4] ethdev: support device " Chengwen Feng
2022-06-20 17:42         ` Thomas Monjalon
2022-06-21  1:38           ` fengchengwen
2022-06-21  7:04             ` Thomas Monjalon
2022-09-22  7:53               ` fengchengwen
2022-06-23 15:58         ` Ray Kinsella
2022-06-16  9:41       ` [PATCH v8 2/4] app/testpmd: handle error recovery notification event Chengwen Feng
2022-06-16  9:41       ` [PATCH v8 3/4] net/hns3: support " Chengwen Feng
2022-06-16  9:41       ` [PATCH v8 4/4] net/bnxt: notify applications about device reset/recovery Chengwen Feng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CACZ4nhuTX4dvDJvfq7edSbbJ7X+sE1uvTS0zS6uU8+qRhQCE6A@mail.gmail.com \
    --to=ajit.khaparde@broadcom.com \
    --cc=andrew.rybchenko@oktetlabs.ru \
    --cc=asafp@nvidia.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@intel.com \
    --cc=jerinj@marvell.com \
    --cc=kalesh-anakkur.purayil@broadcom.com \
    --cc=konstantin.ananyev@intel.com \
    --cc=stephen@networkplumber.org \
    --cc=thomas@monjalon.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).