From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id DBF2BA034E; Tue, 1 Feb 2022 14:09:51 +0100 (CET)
Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 6857940698; Tue, 1 Feb 2022 14:09:51 +0100 (CET)
Received: from mail-ej1-f41.google.com (mail-ej1-f41.google.com [209.85.218.41]) by mails.dpdk.org (Postfix) with ESMTP id 3037D40691 for ; Tue, 1 Feb 2022 14:09:50 +0100 (CET)
Received: by mail-ej1-f41.google.com with SMTP id ka4so53925777ejc.11 for ; Tue, 01 Feb 2022 05:09:50 -0800 (PST)
MIME-Version: 1.0
References: <20201009034832.10302-1-kalesh-anakkur.purayil@broadcom.com> <20220128124830.427-1-kalesh-anakkur.purayil@broadcom.com> <20220128124830.427-2-kalesh-anakkur.purayil@broadcom.com> <1bd15a49-b53c-b1e7-bed4-2f8aef55a0f9@intel.com>
In-Reply-To: <1bd15a49-b53c-b1e7-bed4-2f8aef55a0f9@intel.com>
From: Kalesh Anakkur Purayil
Date: Tue, 1 Feb 2022 18:39:38 +0530
Subject: Re: [dpdk-dev] [PATCH v7 1/4] ethdev: support device reset and recovery events
To: Ferruh Yigit
Cc: dpdk-dev, Thomas Monjalon, Andrew Rybchenko, Konstantin Ananyev, Ajit Kumar Khaparde, asafp@nvidia.com, Stephen Hemminger, "jerinj@marvell.com"
Content-Type: multipart/alternative; boundary="000000000000b2561d05d6f49d2b"
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions
Errors-To: dev-bounces@dpdk.org

--000000000000b2561d05d6f49d2b
Content-Type: text/plain; charset="UTF-8"

Thank you Ferruh for the review. Please see inline.

On Tue, Feb 1, 2022 at 5:41 PM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
On 1/28/2022 12:48 PM, Kalesh A P wrote:
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>
> Adding support for the device reset and recovery events in the
> rte_eth_event framework. FW error and FW reset conditions would be
> managed internally by the PMD without needing application intervention.
> In such cases, PMD would need reset/recovery events to notify application
> that PMD is undergoing a reset.
>
> While most of the recovery process is transparent to the application since
> most of the driver ensures recovery from FW reset or FW error conditions,
> the application will have to reprogram any flows which were offloaded to
> the underlying hardware.
>
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

More developer cc'ed.

> ---
>   doc/guides/prog_guide/poll_mode_drv.rst | 24 ++++++++++++++++++++++++
>   lib/ethdev/rte_ethdev.h                 | 18 ++++++++++++++++++
>   2 files changed, 42 insertions(+)
>
> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
> index 6831289..9ecc0e4 100644
> --- a/doc/guides/prog_guide/poll_mode_drv.rst
> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
> @@ -623,3 +623,27 @@ by application.
>   The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
>   the application to handle reset event. It is duty of application to
>   handle all synchronization before it calls rte_eth_dev_reset().
> +
> +Error recovery support
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +When the PMD detects a FW reset or error condition, it may try to recover
> +from the error without needing the application intervention. In such cases,
> +PMD would need events to notify the application that it is undergoing
> +an error recovery.
> +
> +The PMD should trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
> +application that PMD detected a FW reset or FW error condition. PMD may
> +try to recover from the error by itself. Data path may be quiesced and
> +control path operations may fail during the recovery period. The application
> +should stop polling till it receives RTE_ETH_EVENT_RECOVERED event from the PMD.
> +

Between the time FW error occurred and the application receive the RECOVERING event,
datapath will continue to poll and application may call control APIs, so the event
really is not solving the issue and driver somehow should be sure this won't crash
the application, in that case not sure about the benefit of this event.
[Kalesh]: As soon as the driver detects a FW dead or reset condition, it sets
the fastpath pointers to dummy functions. This will prevent the crash. All
control path operations would fail with -EBUSY. This change is already there
in bnxt PMD. This event is a notification to the application that the PMD is
recovering from a FW error condition, so that it can stop polling and stop
issuing control path operations until it receives the RECOVERED event.
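To illustrate the idea, here is a minimal standalone sketch of the pointer swap.
It is not the bnxt code: `struct sketch_dev`, `sketch_enter_recovery()`,
`sketch_exit_recovery()`, `sketch_set_mtu()` and `fake_hw_rx_burst()` are
hypothetical names made up for this mail, only the Rx side is shown, and a real
driver would also have to deal with lcores that are already inside the old
burst function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real ethdev or bnxt structures. */
struct pkt;	/* opaque packet type */
typedef uint16_t (*rx_burst_t)(void *rxq, struct pkt **pkts, uint16_t n);

struct sketch_dev {
	rx_burst_t rx_burst;		/* fast path entry used by the application */
	rx_burst_t real_rx_burst;	/* saved handler, restored after recovery */
	int recovering;			/* non-zero while FW recovery is in progress */
};

/* Stand-in for the normal burst function: pretends n packets were received. */
static uint16_t
fake_hw_rx_burst(void *rxq, struct pkt **pkts, uint16_t n)
{
	(void)rxq;
	(void)pkts;
	return n;
}

/* Dummy burst function: touches no hardware and reports zero packets. */
static uint16_t
dummy_rx_burst(void *rxq, struct pkt **pkts, uint16_t n)
{
	(void)rxq;
	(void)pkts;
	(void)n;
	return 0;
}

/* Called when the driver detects a FW dead or reset condition. */
static void
sketch_enter_recovery(struct sketch_dev *dev)
{
	assert(dev != NULL);
	if (dev->recovering)
		return;	/* already swapped, keep the saved handler */
	dev->real_rx_burst = dev->rx_burst;
	dev->rx_burst = dummy_rx_burst;
	dev->recovering = 1;
}

/* Called once the driver has finished recovering. */
static void
sketch_exit_recovery(struct sketch_dev *dev)
{
	assert(dev != NULL);
	if (!dev->recovering)
		return;
	dev->rx_burst = dev->real_rx_burst;
	dev->recovering = 0;
}

/* Example control path operation: refuses to run during recovery. */
static int
sketch_set_mtu(struct sketch_dev *dev, uint16_t mtu)
{
	(void)mtu;
	if (dev->recovering)
		return -EBUSY;
	return 0;
}
```

With this arrangement an application that keeps polling before it has seen the
event only gets empty bursts, and a control call made in the same window gets
-EBUSY instead of touching a device that is being reset.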

> +The PMD should trigger RTE_ETH_EVENT_RECOVERED event to notify the application
> +that the it has recovered from the error condition. PMD re-configures the port
> +to the state prior to the error condition. Control path and data path are up now.
> +Since the device has undergone a reset, flow rules offloaded prior to reset
> +may be lost and the application should recreate the rules again.
> +

I think the most difficult part here is clarify what application should do
when this event received consistent for all devices, "flow rules may be lost"
looks very vague to me.
Unless it is not clear for application what to do when this event is received,
it is not that useful or it will be specific to some PMDs. And I can see it is
hard to clarify this but perhaps we can define a set of common behavior.
Not sure what others are thinking.
[Kalesh]: Sure, let's wait for others' opinions as well.
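As a starting point for that discussion, the application-side behavior that the
doc text in this patch describes for the three events could be sketched roughly
as below. This is only an illustration under assumptions: the `SKETCH_EVENT_*`
values are local stand-ins for RTE_ETH_EVENT_ERR_RECOVERING,
RTE_ETH_EVENT_RECOVERED and RTE_ETH_EVENT_INTR_RMV, and
`struct sketch_port_state` and `sketch_handle_event()` are hypothetical names,
not ethdev API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the event names discussed in this thread; a real
 * application would use the values from rte_ethdev.h instead. */
enum sketch_event {
	SKETCH_EVENT_ERR_RECOVERING,
	SKETCH_EVENT_RECOVERED,
	SKETCH_EVENT_INTR_RMV,
};

/* Hypothetical per-port state kept by the application. A real application
 * would likely need atomics here, since the event callback may run in a
 * different thread than the polling lcores. */
struct sketch_port_state {
	bool polling_enabled;	/* data path checks this before rx/tx burst */
	bool flows_need_replay;	/* offloaded flow rules must be recreated */
	bool removed;		/* device is not usable anymore */
};

static void
sketch_handle_event(struct sketch_port_state *st, enum sketch_event ev)
{
	assert(st != NULL);

	switch (ev) {
	case SKETCH_EVENT_ERR_RECOVERING:
		/* PMD is recovering: stop polling, avoid control path calls. */
		st->polling_enabled = false;
		break;
	case SKETCH_EVENT_RECOVERED:
		/* Port is back; flow rules may be lost, so replay them. */
		st->flows_need_replay = true;
		st->polling_enabled = true;
		break;
	case SKETCH_EVENT_INTR_RMV:
		/* Recovery failed: treat the device as gone. */
		st->polling_enabled = false;
		st->removed = true;
		break;
	}
}
```

Whether recreating the flow rules should be unconditional after RECOVERED, as
in this sketch, is exactly the kind of common behavior that still needs to be
agreed on.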

> +The PMD should trigger RTE_ETH_EVENT_INTR_RMV event to notify the application
> +that it has failed to recover from the error condition. The device may not be
> +usable anymore.
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 147cc1c..a46819f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -3818,6 +3818,24 @@ enum rte_eth_event_type {
>       RTE_ETH_EVENT_DESTROY,  /**< port is released */
>       RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>       RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> +     RTE_ETH_EVENT_ERR_RECOVERING,
> +                     /**< port recovering from an error
> +                      *
> +                      * PMD detected a FW reset or error condition.
> +                      * PMD will try to recover from the error.
> +                      * Data path may be quiesced and Control path operations
> +                      * may fail at this time.
> +                      */

Please put multi line comments before enum, Andrew did a set of cleanups for these.
[Kalesh]: Sure, will do.
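Just to confirm the layout being asked for, something along these lines, with
each multi-line comment placed before its enum value. This is a standalone
excerpt for illustration only: the `SKETCH_` names are placeholders so that it
compiles without rte_ethdev.h, the real enum has more members, and the comment
wording may still change in the next revision.

```c
#include <assert.h>

/* Hypothetical excerpt; not the actual rte_eth_event_type definition. */
enum sketch_eth_event_type {
	/** New aged-out flows is detected */
	SKETCH_ETH_EVENT_FLOW_AGED,
	/**
	 * Port recovering from an error.
	 *
	 * PMD detected a FW reset or error condition.
	 * PMD will try to recover from the error.
	 * Data path may be quiesced and control path operations
	 * may fail at this time.
	 */
	SKETCH_ETH_EVENT_ERR_RECOVERING,
	/**
	 * Port recovered from an error.
	 *
	 * PMD has recovered from the error condition.
	 * Control path and data path are up now.
	 */
	SKETCH_ETH_EVENT_RECOVERED,
	/** max value of this enum */
	SKETCH_ETH_EVENT_MAX
};

/* New values go before the MAX sentinel, so they stay below it. */
static_assert(SKETCH_ETH_EVENT_RECOVERED < SKETCH_ETH_EVENT_MAX,
	      "new events must be added before the MAX sentinel");
```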

> +     RTE_ETH_EVENT_RECOVERED,
> +                     /**< port recovered from an error
> +                      *
> +                      * PMD has recovered from the error condition.
> +                      * Control path and Data path are up now.
> +                      * PMD re-configures the port to the state prior to the error.
> +                      * Since the device has undergone a reset, flow rules
> +                      * offloaded prior to reset may be lost and
> +                      * the application should recreate the rules again.
> +                      */
>       RTE_ETH_EVENT_MAX       /**< max value of this enum */
>   };
>



--
Regards,
Kalesh A P
--000000000000b2561d05d6f49d2b--