DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev]  [RFC PATCH 0/3] librte_ethdev: error recovery support
@ 2020-01-22 10:16 Kalesh A P
  2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event Kalesh A P
                   ` (9 more replies)
  0 siblings, 10 replies; 48+ messages in thread
From: Kalesh A P @ 2020-01-22 10:16 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, declan.doherty

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

This patch adds support for recovery event in rte_eth_event framework.
FW error and FW reset conditions would be managed by PMD. Driver uses
RTE_ETH_EVENT_INTR_RESET event to notify the applications about the
FW reset or error. In such cases, PMD would need recovery events to
notify application about PMD has recovered from FW reset or FW error.

Kalesh AP (3):
  librte_ethdev: support device recovery event
  net/bnxt: notify applications about device reset
  app/testpmd: handle device recovery event

 app/test-pmd/testpmd.c         |  7 ++++++-
 drivers/net/bnxt/bnxt_cpr.c    |  3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 10 ++++++++++
 lib/librte_ethdev/rte_ethdev.h |  1 +
 4 files changed, 20 insertions(+), 1 deletion(-)

-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-01-22 10:16 ` Kalesh A P
  2020-03-11 13:20   ` Thomas Monjalon
  2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 2/3] net/bnxt: notify applications about device reset Kalesh A P
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Kalesh A P @ 2020-01-22 10:16 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, declan.doherty

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Adding support for recovery event in rte_eth_event framework.
FW error and FW reset conditions would be managed by PMD.
In such cases, PMD would need recovery events to notify application
about PMD has recovered from FW reset or FW error.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
---
 lib/librte_ethdev/rte_ethdev.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index d1a593a..0989f41 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_NEW,      /**< port is probed */
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
+	RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH 2/3] net/bnxt: notify applications about device reset
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event Kalesh A P
@ 2020-01-22 10:16 ` Kalesh A P
  2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 3/3] app/testpmd: handle device recovery event Kalesh A P
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-01-22 10:16 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, declan.doherty

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

When the driver receives RESET_NOTIFY async event from FW or detected
an error condition, it should update the application that FW is going
to reset. Once the driver recoveres from the reset, update the reset
recovery status to application as well.

The recovery process is transparent to the application as the driver
itself tries to recover from FW reset or FW error conditions.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
---
 drivers/net/bnxt/bnxt_cpr.c    |  3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index bb316b9..02eeeb9 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -76,6 +76,9 @@ void bnxt_handle_async_event(struct bnxt *bp,
 		PMD_DRV_LOG(INFO, "Port conn async event\n");
 		break;
 	case HWRM_ASYNC_EVENT_CMPL_EVENT_ID_RESET_NOTIFY:
+		_rte_eth_dev_callback_process(bp->eth_dev,
+					      RTE_ETH_EVENT_INTR_RESET,
+					      NULL);
 		event_data = rte_le_to_cpu_32(async_cmp->event_data1);
 		/* timestamp_lo/hi values are in units of 100ms */
 		bp->fw_reset_max_msecs = async_cmp->timestamp_hi ?
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index 2ef1169..93e67b1 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -4021,11 +4021,17 @@ static void bnxt_dev_recover(void *arg)
 		goto err;
 
 	PMD_DRV_LOG(INFO, "Recovered from FW reset\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_RECOVERED,
+				      NULL);
 	return;
 err:
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
 	PMD_DRV_LOG(ERR, "Failed to recover from FW reset\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_INTR_RMV,
+				      NULL);
 }
 
 void bnxt_dev_reset_and_resume(void *arg)
@@ -4160,6 +4166,10 @@ static void bnxt_check_fw_health(void *arg)
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bp->flags |= BNXT_FLAG_FW_RESET;
 
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_INTR_RESET,
+				      NULL);
+
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
 
 	if (bnxt_is_master_func(bp))
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH 3/3] app/testpmd: handle device recovery event
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event Kalesh A P
  2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 2/3] net/bnxt: notify applications about device reset Kalesh A P
@ 2020-01-22 10:16 ` Kalesh A P
  2020-03-11 13:19 ` [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Thomas Monjalon
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-01-22 10:16 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, declan.doherty

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Added code to handle device recovery event in testpmd.
This is an indication from PMD that it has recovered from a FW
reset or FW error condition.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/testpmd.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index f9f4cd1..d1f063e 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -375,6 +375,7 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_INTR_RMV] = "device removal",
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
+	[RTE_ETH_EVENT_RECOVERED] = "device recovered",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -388,7 +389,8 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RESET) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV);
+			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
 /*
  * Decide if all memory are locked for performance.
  */
@@ -2920,6 +2922,9 @@ eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
 				rmv_port_callback, (void *)(intptr_t)port_id))
 			fprintf(stderr, "Could not set up deferred device removal\n");
 		break;
+	case RTE_ETH_EVENT_RECOVERED:
+		printf("Please recreate any rules offloaded prior to this\n");
+		break;
 	default:
 		break;
 	}
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (2 preceding siblings ...)
  2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 3/3] app/testpmd: handle device recovery event Kalesh A P
@ 2020-03-11 13:19 ` Thomas Monjalon
  2020-03-12  3:25   ` Kalesh Anakkur Purayil
  2020-09-30  7:03 ` [dpdk-dev] [RFC V2 " Kalesh A P
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2020-03-11 13:19 UTC (permalink / raw)
  To: Kalesh A P; +Cc: dev, ferruh.yigit, declan.doherty, arybchenko

22/01/2020 11:16, Kalesh A P:
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> This patch adds support for recovery event in rte_eth_event framework.
> FW error and FW reset conditions would be managed by PMD. Driver uses

"Driver"? THE driver? :)

> RTE_ETH_EVENT_INTR_RESET event to notify the applications about the
> FW reset or error.

Which drivers doe that?

> In such cases, PMD would need recovery events to
> notify application about PMD has recovered from FW reset or FW error.

Sorry I don't understand. You said application is notified of any error.
But the PMD can recover from this error? So what is the error at the end?
If the error is recovered why notifying the application?



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event
  2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event Kalesh A P
@ 2020-03-11 13:20   ` Thomas Monjalon
  2020-03-12  3:31     ` Kalesh Anakkur Purayil
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2020-03-11 13:20 UTC (permalink / raw)
  To: Kalesh A P; +Cc: dev, ferruh.yigit, declan.doherty, arybchenko

22/01/2020 11:16, Kalesh A P:
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> Adding support for recovery event in rte_eth_event framework.
> FW error and FW reset conditions would be managed by PMD.
> In such cases, PMD would need recovery events to notify application
> about PMD has recovered from FW reset or FW error.
> 
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
> ---
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
>  	RTE_ETH_EVENT_NEW,      /**< port is probed */
>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> +	RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */

What the application is supposed to do when receiving such event?



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support
  2020-03-11 13:19 ` [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Thomas Monjalon
@ 2020-03-12  3:25   ` Kalesh Anakkur Purayil
  2020-03-12  7:34     ` Thomas Monjalon
  0 siblings, 1 reply; 48+ messages in thread
From: Kalesh Anakkur Purayil @ 2020-03-12  3:25 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, ferruh.yigit, declan.doherty, arybchenko

Hi Thomas,

On Wed, Mar 11, 2020 at 6:49 PM Thomas Monjalon <thomas@monjalon.net> wrote:

> 22/01/2020 11:16, Kalesh A P:
> > From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> >
> > This patch adds support for recovery event in rte_eth_event framework.
> > FW error and FW reset conditions would be managed by PMD. Driver uses
>
> "Driver"? THE driver? :)
>
> > RTE_ETH_EVENT_INTR_RESET event to notify the applications about the
> > FW reset or error.
>
> Which drivers doe that?
>
[Kalesh]: Second patch in this series implements this behavior in bnxt PMD.
Error recovery is a new feature added in bnxt PMD in 19.11. This change is
needed to support error recovery functionality.

>
> > In such cases, PMD would need recovery events to
> > notify application about PMD has recovered from FW reset or FW error.
>
> Sorry I don't understand. You said application is notified of any error.
> But the PMD can recover from this error? So what is the error at the end?
> If the error is recovered why notifying the application?
>
[Kalesh] : Let me give you some insight on this.

The error recovery solution is a protocol implemented between firmware and
bnxt PMD to recover from the fatal errors without a system reboot. There is
an alarm thread which constantly monitors the health of the firmware and
initiates a recovery when needed.

There are two scenarios here:

1. Hardware or firmware encountered an error which firmware detected.
Firmware is in operational status here. In this case, firmware can reset
the chip and notify the driver about the reset.
2. Hardware or firmware encountered an error but firmware is dead/hung.
Firmware is not in operational status. In this case, the only possible way
to recover the adapter is through host driver(bnxt PMD).

In both cases, bnxt PMD reinitializes with the FW again after the reset.
During that recovery process, data path will be halted and any control path
operation would fail. So, bnxt PMD has to notify the application about this
reset/error event to prevent any activities from application during this
time.

-- 
Regards,
Kalesh A P

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event
  2020-03-11 13:20   ` Thomas Monjalon
@ 2020-03-12  3:31     ` Kalesh Anakkur Purayil
  2020-03-12  7:29       ` Thomas Monjalon
  0 siblings, 1 reply; 48+ messages in thread
From: Kalesh Anakkur Purayil @ 2020-03-12  3:31 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, ferruh.yigit, declan.doherty, arybchenko

Hi Thomas,

On Wed, Mar 11, 2020 at 6:50 PM Thomas Monjalon <thomas@monjalon.net> wrote:

> 22/01/2020 11:16, Kalesh A P:
> > From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> >
> > Adding support for recovery event in rte_eth_event framework.
> > FW error and FW reset conditions would be managed by PMD.
> > In such cases, PMD would need recovery events to notify application
> > about PMD has recovered from FW reset or FW error.
> >
> > Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> > Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> > Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
> > ---
> > --- a/lib/librte_ethdev/rte_ethdev.h
> > +++ b/lib/librte_ethdev/rte_ethdev.h
> > @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
> >       RTE_ETH_EVENT_NEW,      /**< port is probed */
> >       RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >       RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > +     RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */
>
> What the application is supposed to do when receiving such event?
>
[Kalesh]:  RTE_ETH_EVENT_RECOVERED event is an indication to the
application that PMD has recovered from the reset and is functional
again(i.e control path and data path is up again). On receiving this event
from PMD, application has to re-create any flow rules created prior to the
reset.

-- 
Regards,
Kalesh A P

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event
  2020-03-12  3:31     ` Kalesh Anakkur Purayil
@ 2020-03-12  7:29       ` Thomas Monjalon
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Monjalon @ 2020-03-12  7:29 UTC (permalink / raw)
  To: Kalesh Anakkur Purayil; +Cc: dev, ferruh.yigit, declan.doherty, arybchenko

12/03/2020 04:31, Kalesh Anakkur Purayil:
> On Wed, Mar 11, 2020 at 6:50 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > 22/01/2020 11:16, Kalesh A P:
> > > From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> > >
> > > Adding support for recovery event in rte_eth_event framework.
> > > FW error and FW reset conditions would be managed by PMD.
> > > In such cases, PMD would need recovery events to notify application
> > > about PMD has recovered from FW reset or FW error.
> > >
> > > Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> > > Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> > > Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
> > > ---
> > > --- a/lib/librte_ethdev/rte_ethdev.h
> > > +++ b/lib/librte_ethdev/rte_ethdev.h
> > > @@ -3015,6 +3015,7 @@ enum rte_eth_event_type {
> > >       RTE_ETH_EVENT_NEW,      /**< port is probed */
> > >       RTE_ETH_EVENT_DESTROY,  /**< port is released */
> > >       RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> > > +     RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */
> >
> > What the application is supposed to do when receiving such event?
> >
> [Kalesh]:  RTE_ETH_EVENT_RECOVERED event is an indication to the
> application that PMD has recovered from the reset and is functional
> again(i.e control path and data path is up again). On receiving this event
> from PMD, application has to re-create any flow rules created prior to the
> reset.

How the application knows that flow rules were resetted?
Is there any other configuration resetted?
These informations must be explicit in the doxygen comments.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support
  2020-03-12  3:25   ` Kalesh Anakkur Purayil
@ 2020-03-12  7:34     ` Thomas Monjalon
  2020-07-03 16:12       ` Ferruh Yigit
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2020-03-12  7:34 UTC (permalink / raw)
  To: Kalesh Anakkur Purayil; +Cc: dev, ferruh.yigit, declan.doherty, arybchenko

12/03/2020 04:25, Kalesh Anakkur Purayil:
> Hi Thomas,
> 
> On Wed, Mar 11, 2020 at 6:49 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> 
> > 22/01/2020 11:16, Kalesh A P:
> > > From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> > >
> > > This patch adds support for recovery event in rte_eth_event framework.
> > > FW error and FW reset conditions would be managed by PMD. Driver uses
> >
> > "Driver"? THE driver? :)
> >
> > > RTE_ETH_EVENT_INTR_RESET event to notify the applications about the
> > > FW reset or error.
> >
> > Which drivers doe that?
> >
> [Kalesh]: Second patch in this series implements this behavior in bnxt PMD.
> Error recovery is a new feature added in bnxt PMD in 19.11. This change is
> needed to support error recovery functionality.
> 
> >
> > > In such cases, PMD would need recovery events to
> > > notify application about PMD has recovered from FW reset or FW error.
> >
> > Sorry I don't understand. You said application is notified of any error.
> > But the PMD can recover from this error? So what is the error at the end?
> > If the error is recovered why notifying the application?
> >
> [Kalesh] : Let me give you some insight on this.
> 
> The error recovery solution is a protocol implemented between firmware and
> bnxt PMD to recover from the fatal errors without a system reboot. There is
> an alarm thread which constantly monitors the health of the firmware and
> initiates a recovery when needed.
> 
> There are two scenarios here:
> 
> 1. Hardware or firmware encountered an error which firmware detected.
> Firmware is in operational status here. In this case, firmware can reset
> the chip and notify the driver about the reset.
> 2. Hardware or firmware encountered an error but firmware is dead/hung.
> Firmware is not in operational status. In this case, the only possible way
> to recover the adapter is through host driver(bnxt PMD).
> 
> In both cases, bnxt PMD reinitializes with the FW again after the reset.
> During that recovery process, data path will be halted and any control path
> operation would fail. So, bnxt PMD has to notify the application about this
> reset/error event to prevent any activities from application during this
> time.

I think you are changing the meaning of the reset event.
It was described like this:
RTE_ETH_EVENT_INTR_RESET,
            /**< reset interrupt event, sent to VF on PF reset */

Please update this description as well.

Of course, we'll need approval from other PMD maintainers
to accept the new recovery API.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support
  2020-03-12  7:34     ` Thomas Monjalon
@ 2020-07-03 16:12       ` Ferruh Yigit
  0 siblings, 0 replies; 48+ messages in thread
From: Ferruh Yigit @ 2020-07-03 16:12 UTC (permalink / raw)
  To: Thomas Monjalon, Kalesh Anakkur Purayil
  Cc: dev, declan.doherty, arybchenko, Ajit Khaparde

On 3/12/2020 7:34 AM, Thomas Monjalon wrote:
> 12/03/2020 04:25, Kalesh Anakkur Purayil:
>> Hi Thomas,
>>
>> On Wed, Mar 11, 2020 at 6:49 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>>
>>> 22/01/2020 11:16, Kalesh A P:
>>>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>>>
>>>> This patch adds support for recovery event in rte_eth_event framework.
>>>> FW error and FW reset conditions would be managed by PMD. Driver uses
>>>
>>> "Driver"? THE driver? :)
>>>
>>>> RTE_ETH_EVENT_INTR_RESET event to notify the applications about the
>>>> FW reset or error.
>>>
>>> Which drivers doe that?
>>>
>> [Kalesh]: Second patch in this series implements this behavior in bnxt PMD.
>> Error recovery is a new feature added in bnxt PMD in 19.11. This change is
>> needed to support error recovery functionality.
>>
>>>
>>>> In such cases, PMD would need recovery events to
>>>> notify application about PMD has recovered from FW reset or FW error.
>>>
>>> Sorry I don't understand. You said application is notified of any error.
>>> But the PMD can recover from this error? So what is the error at the end?
>>> If the error is recovered why notifying the application?
>>>
>> [Kalesh] : Let me give you some insight on this.
>>
>> The error recovery solution is a protocol implemented between firmware and
>> bnxt PMD to recover from the fatal errors without a system reboot. There is
>> an alarm thread which constantly monitors the health of the firmware and
>> initiates a recovery when needed.
>>
>> There are two scenarios here:
>>
>> 1. Hardware or firmware encountered an error which firmware detected.
>> Firmware is in operational status here. In this case, firmware can reset
>> the chip and notify the driver about the reset.
>> 2. Hardware or firmware encountered an error but firmware is dead/hung.
>> Firmware is not in operational status. In this case, the only possible way
>> to recover the adapter is through host driver(bnxt PMD).
>>
>> In both cases, bnxt PMD reinitializes with the FW again after the reset.
>> During that recovery process, data path will be halted and any control path
>> operation would fail. So, bnxt PMD has to notify the application about this
>> reset/error event to prevent any activities from application during this
>> time.
> 
> I think you are changing the meaning of the reset event.
> It was described like this:
> RTE_ETH_EVENT_INTR_RESET,
>             /**< reset interrupt event, sent to VF on PF reset */
> 
> Please update this description as well.
> 
> Of course, we'll need approval from other PMD maintainers
> to accept the new recovery API.
> 

Hi Kalesh,

Is this RFC still relevant/valid?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev]  [RFC V2 0/3] librte_ethdev: error recovery support
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (3 preceding siblings ...)
  2020-03-11 13:19 ` [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Thomas Monjalon
@ 2020-09-30  7:03 ` Kalesh A P
  2020-09-30  7:03   ` [dpdk-dev] [RFC V2 1/3] ethdev: support device reset and recovery events Kalesh A P
                     ` (2 more replies)
  2020-09-30  7:07 ` [dpdk-dev] [RFC PATCH v2 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (4 subsequent siblings)
  9 siblings, 3 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:03 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

The error recovery solution is a protocol implemented between firmware
and bnxt PMD to recover from the fatal errors without a system reboot.
There is an alarm thread which constantly monitors the health of the
firmware and initiates a recovery when needed.

There are two scenarios here:

1. Hardware or firmware encountered an error which firmware detected.
   Firmware is in operational status here. In this case, firmware can
   reset the chip and notify the driver about the reset.
2. Hardware or firmware encountered an error but firmware is dead/hung.
   Firmware is not in operational status. In this case, the only possible
   way to recover the adapter is through host driver(bnxt PMD).

In both cases, bnxt PMD reinitializes with the FW again after the reset.
During that recovery process, data path will be halted and any control path
operation would fail. So, the PMD has to notify the application about this
reset/error event to prevent any activities from the application while
the PMD is recovering from the error.

This patch set adds support for the reset and recovery event in
the rte_eth_event framework. FW error and FW reset conditions would be
managed by the PMD. Driver uses RTE_ETH_EVENT_RESET event to notify
the applications about the FW reset or error. In such cases,
PMD use the RTE_ETH_EVENT_RECOVERED event to notify application about
PMD has recovered from FW reset or FW error.

V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

Kalesh AP (3):
  ethdev: support device reset and recovery events
  net/bnxt: notify applications about device reset/recovery
  app/testpmd: handle device recovery event

 app/test-pmd/testpmd.c         | 6 +++++-
 drivers/net/bnxt/bnxt_cpr.c    | 3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 9 +++++++++
 lib/librte_ethdev/rte_ethdev.h | 2 ++
 4 files changed, 19 insertions(+), 1 deletion(-)

-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC V2 1/3] ethdev: support device reset and recovery events
  2020-09-30  7:03 ` [dpdk-dev] [RFC V2 " Kalesh A P
@ 2020-09-30  7:03   ` Kalesh A P
  2020-09-30  7:03   ` [dpdk-dev] [RFC V2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
  2020-09-30  7:03   ` [dpdk-dev] [RFC V2 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:03 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Adding support for device reset and recovery events in the
rte_eth_event framework. FW error and FW reset conditions would be
managed internally by PMD without needing application intervention.
In such cases, PMD would need reset/recovery events to notify application
that PMD is undergoing a reset.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 lib/librte_ethdev/rte_ethdev.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 9759f13..b9dd14c 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3207,6 +3207,8 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	RTE_ETH_EVENT_RESET,    /**< port resetting from an error */
+	RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC V2 2/3] net/bnxt: notify applications about device reset/recovery
  2020-09-30  7:03 ` [dpdk-dev] [RFC V2 " Kalesh A P
  2020-09-30  7:03   ` [dpdk-dev] [RFC V2 1/3] ethdev: support device reset and recovery events Kalesh A P
@ 2020-09-30  7:03   ` Kalesh A P
  2020-09-30  7:03   ` [dpdk-dev] [RFC V2 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:03 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

When the driver receives RESET_NOTIFY async event from FW or detected
an error condition, it should update the application that FW is going
to reset. Once the driver recoveres from the reset, update the reset
recovery status to application as well.

The recovery process is transparent to the application as the driver
itself tries to recover from FW reset or FW error conditions.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
---
 drivers/net/bnxt/bnxt_cpr.c    | 3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 8311e26..f2fddf7 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -129,6 +129,9 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			bp->flags |= BNXT_FLAG_FATAL_ERROR;
 			return;
 		}
+		_rte_eth_dev_callback_process(bp->eth_dev,
+					      RTE_ETH_EVENT_RESET,
+					      NULL);
 
 		event_data = rte_le_to_cpu_32(async_cmp->event_data1);
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index b99c712..609bf46 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -4566,6 +4566,9 @@ static void bnxt_dev_recover(void *arg)
 		goto err_start;
 
 	PMD_DRV_LOG(INFO, "Recovered from FW reset\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_RECOVERED,
+				      NULL);
 	return;
 err_start:
 	bnxt_dev_stop_op(bp->eth_dev);
@@ -4573,6 +4576,9 @@ static void bnxt_dev_recover(void *arg)
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
 	PMD_DRV_LOG(ERR, "Failed to recover from FW reset\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_INTR_RMV,
+				      NULL);
 }
 
 void bnxt_dev_reset_and_resume(void *arg)
@@ -4708,6 +4714,9 @@ static void bnxt_check_fw_health(void *arg)
 	bp->flags |= BNXT_FLAG_FW_RESET;
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_RESET,
+				      NULL);
 
 	if (bnxt_is_master_func(bp))
 		wait_msec = info->master_func_wait_period;
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC V2 3/3] app/testpmd: handle device recovery event
  2020-09-30  7:03 ` [dpdk-dev] [RFC V2 " Kalesh A P
  2020-09-30  7:03   ` [dpdk-dev] [RFC V2 1/3] ethdev: support device reset and recovery events Kalesh A P
  2020-09-30  7:03   ` [dpdk-dev] [RFC V2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
@ 2020-09-30  7:03   ` Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:03 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Added code to handle device reset and recovery event in testpmd.
This is an indication from the PMD that device has resetted and
recovered error condition.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/testpmd.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index fe6450c..1c8fb46 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RESET] = "device reset",
+	[RTE_ETH_EVENT_RECOVERED] = "device recovery",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v2 0/3] librte_ethdev: error recovery support
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (4 preceding siblings ...)
  2020-09-30  7:03 ` [dpdk-dev] [RFC V2 " Kalesh A P
@ 2020-09-30  7:07 ` Kalesh A P
  2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 1/3] ethdev: support device reset and recovery events Kalesh A P
                     ` (2 more replies)
  2020-09-30  7:12 ` [dpdk-dev] [RFC PATCH v3 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (3 subsequent siblings)
  9 siblings, 3 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:07 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

The error recovery solution is a protocol implemented between firmware
and bnxt PMD to recover from the fatal errors without a system reboot.
There is an alarm thread which constantly monitors the health of the
firmware and initiates a recovery when needed.

There are two scenarios here:

1. Hardware or firmware encountered an error which firmware detected.
   Firmware is in operational status here. In this case, firmware can
   reset the chip and notify the driver about the reset.
2. Hardware or firmware encountered an error but firmware is dead/hung.
   Firmware is not in operational status. In this case, the only possible
   way to recover the adapter is through host driver(bnxt PMD).

In both cases, bnxt PMD reinitializes with the FW again after the reset.
During that recovery process, data path will be halted and any control path
operation would fail. So, the PMD has to notify the application about this
reset/error event to prevent any activities from the application while
the PMD is recovering from the error.

This patch set adds support for the reset and recovery event in
the rte_eth_event framework. FW error and FW reset conditions would be
managed by the PMD. Driver uses RTE_ETH_EVENT_RESET event to notify
the applications about the FW reset or error. In such cases,
PMD use the RTE_ETH_EVENT_RECOVERED event to notify application about
PMD has recovered from FW reset or FW error.

V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

Kalesh AP (3):
  ethdev: support device reset and recovery events
  net/bnxt: notify applications about device reset/recovery
  app/testpmd: handle device recovery event

 app/test-pmd/testpmd.c         | 6 +++++-
 drivers/net/bnxt/bnxt_cpr.c    | 3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 9 +++++++++
 lib/librte_ethdev/rte_ethdev.h | 2 ++
 4 files changed, 19 insertions(+), 1 deletion(-)

-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v2 1/3] ethdev: support device reset and recovery events
  2020-09-30  7:07 ` [dpdk-dev] [RFC PATCH v2 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-09-30  7:07   ` Kalesh A P
  2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
  2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:07 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Adding support for device reset and recovery events in the
rte_eth_event framework. FW error and FW reset conditions would be
managed internally by PMD without needing application intervention.
In such cases, PMD would need reset/recovery events to notify application
that PMD is undergoing a reset.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 lib/librte_ethdev/rte_ethdev.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 9759f13..b9dd14c 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3207,6 +3207,8 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	RTE_ETH_EVENT_RESET,    /**< port resetting from an error */
+	RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v2 2/3] net/bnxt: notify applications about device reset/recovery
  2020-09-30  7:07 ` [dpdk-dev] [RFC PATCH v2 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 1/3] ethdev: support device reset and recovery events Kalesh A P
@ 2020-09-30  7:07   ` Kalesh A P
  2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:07 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

When the driver receives RESET_NOTIFY async event from FW or detected
an error condition, it should update the application that FW is going
to reset. Once the driver recoveres from the reset, update the reset
recovery status to application as well.

The recovery process is transparent to the application as the driver
itself tries to recover from FW reset or FW error conditions.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
---
 drivers/net/bnxt/bnxt_cpr.c    | 3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 8311e26..f2fddf7 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -129,6 +129,9 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			bp->flags |= BNXT_FLAG_FATAL_ERROR;
 			return;
 		}
+		_rte_eth_dev_callback_process(bp->eth_dev,
+					      RTE_ETH_EVENT_RESET,
+					      NULL);
 
 		event_data = rte_le_to_cpu_32(async_cmp->event_data1);
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index b99c712..609bf46 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -4566,6 +4566,9 @@ static void bnxt_dev_recover(void *arg)
 		goto err_start;
 
 	PMD_DRV_LOG(INFO, "Recovered from FW reset\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_RECOVERED,
+				      NULL);
 	return;
 err_start:
 	bnxt_dev_stop_op(bp->eth_dev);
@@ -4573,6 +4576,9 @@ static void bnxt_dev_recover(void *arg)
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
 	PMD_DRV_LOG(ERR, "Failed to recover from FW reset\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_INTR_RMV,
+				      NULL);
 }
 
 void bnxt_dev_reset_and_resume(void *arg)
@@ -4708,6 +4714,9 @@ static void bnxt_check_fw_health(void *arg)
 	bp->flags |= BNXT_FLAG_FW_RESET;
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_RESET,
+				      NULL);
 
 	if (bnxt_is_master_func(bp))
 		wait_msec = info->master_func_wait_period;
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v2 3/3] app/testpmd: handle device recovery event
  2020-09-30  7:07 ` [dpdk-dev] [RFC PATCH v2 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 1/3] ethdev: support device reset and recovery events Kalesh A P
  2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
@ 2020-09-30  7:07   ` Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:07 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Added code to handle device reset and recovery event in testpmd.
This is an indication from the PMD that device has resetted and
recovered error condition.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/testpmd.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index fe6450c..1c8fb46 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RESET] = "device reset",
+	[RTE_ETH_EVENT_RECOVERED] = "device recovery",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v3 0/3] librte_ethdev: error recovery support
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (5 preceding siblings ...)
  2020-09-30  7:07 ` [dpdk-dev] [RFC PATCH v2 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-09-30  7:12 ` Kalesh A P
  2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events Kalesh A P
                     ` (2 more replies)
  2020-09-30 12:33 ` [dpdk-dev] [RFC PATCH v4 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:12 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

The error recovery solution is a protocol implemented between firmware
and bnxt PMD to recover from the fatal errors without a system reboot.
There is an alarm thread which constantly monitors the health of the
firmware and initiates a recovery when needed.

There are two scenarios here:

1. Hardware or firmware encountered an error which firmware detected.
   Firmware is in operational status here. In this case, firmware can
   reset the chip and notify the driver about the reset.
2. Hardware or firmware encountered an error but firmware is dead/hung.
   Firmware is not in operational status. In this case, the only possible
   way to recover the adapter is through host driver(bnxt PMD).

In both cases, bnxt PMD reinitializes with the FW again after the reset.
During that recovery process, data path will be halted and any control path
operation would fail. So, the PMD has to notify the application about this
reset/error event to prevent any activities from the application while
the PMD is recovering from the error.

This patch set adds support for the reset and recovery event in
the rte_eth_event framework. FW error and FW reset conditions would be
managed by the PMD. Driver uses RTE_ETH_EVENT_RESET event to notify
the applications about the FW reset or error. In such cases,
PMD use the RTE_ETH_EVENT_RECOVERED event to notify application about
PMD has recovered from FW reset or FW error.

V3: fixed a typo in commit log.
V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

Kalesh AP (3):
  ethdev: support device reset and recovery events
  net/bnxt: notify applications about device reset/recovery
  app/testpmd: handle device recovery event

 app/test-pmd/testpmd.c         | 6 +++++-
 drivers/net/bnxt/bnxt_cpr.c    | 3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 9 +++++++++
 lib/librte_ethdev/rte_ethdev.h | 2 ++
 4 files changed, 19 insertions(+), 1 deletion(-)

-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events
  2020-09-30  7:12 ` [dpdk-dev] [RFC PATCH v3 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-09-30  7:12   ` Kalesh A P
  2020-09-30  7:50     ` Thomas Monjalon
  2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
  2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 1 reply; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:12 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Adding support for device reset and recovery events in the
rte_eth_event framework. FW error and FW reset conditions would be
managed internally by PMD without needing application intervention.
In such cases, PMD would need reset/recovery events to notify application
that PMD is undergoing a reset.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 lib/librte_ethdev/rte_ethdev.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 9759f13..b9dd14c 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3207,6 +3207,8 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	RTE_ETH_EVENT_RESET,    /**< port resetting from an error */
+	RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v3 2/3] net/bnxt: notify applications about device reset/recovery
  2020-09-30  7:12 ` [dpdk-dev] [RFC PATCH v3 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events Kalesh A P
@ 2020-09-30  7:12   ` Kalesh A P
  2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:12 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

When the driver receives RESET_NOTIFY async event from FW or detected
an error condition, it should update the application that FW is going
to reset. Once the driver recoveres from the reset, update the reset
recovery status to application as well.

The recovery process is transparent to the application as the driver
itself tries to recover from FW reset or FW error conditions.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
---
 drivers/net/bnxt/bnxt_cpr.c    | 3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 8311e26..f2fddf7 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -129,6 +129,9 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			bp->flags |= BNXT_FLAG_FATAL_ERROR;
 			return;
 		}
+		_rte_eth_dev_callback_process(bp->eth_dev,
+					      RTE_ETH_EVENT_RESET,
+					      NULL);
 
 		event_data = rte_le_to_cpu_32(async_cmp->event_data1);
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index b99c712..609bf46 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -4566,6 +4566,9 @@ static void bnxt_dev_recover(void *arg)
 		goto err_start;
 
 	PMD_DRV_LOG(INFO, "Recovered from FW reset\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_RECOVERED,
+				      NULL);
 	return;
 err_start:
 	bnxt_dev_stop_op(bp->eth_dev);
@@ -4573,6 +4576,9 @@ static void bnxt_dev_recover(void *arg)
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
 	PMD_DRV_LOG(ERR, "Failed to recover from FW reset\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_INTR_RMV,
+				      NULL);
 }
 
 void bnxt_dev_reset_and_resume(void *arg)
@@ -4708,6 +4714,9 @@ static void bnxt_check_fw_health(void *arg)
 	bp->flags |= BNXT_FLAG_FW_RESET;
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
+	_rte_eth_dev_callback_process(bp->eth_dev,
+				      RTE_ETH_EVENT_RESET,
+				      NULL);
 
 	if (bnxt_is_master_func(bp))
 		wait_msec = info->master_func_wait_period;
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v3 3/3] app/testpmd: handle device recovery event
  2020-09-30  7:12 ` [dpdk-dev] [RFC PATCH v3 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events Kalesh A P
  2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
@ 2020-09-30  7:12   ` Kalesh A P
  2020-10-08 10:53     ` Asaf Penso
  2 siblings, 1 reply; 48+ messages in thread
From: Kalesh A P @ 2020-09-30  7:12 UTC (permalink / raw)
  To: dev; +Cc: thomas, ferruh.yigit, ajit.khaparde

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Added code to handle device reset and recovery event in testpmd.
This is an indication from the PMD that device has reset and
recovered error condition.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/testpmd.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index fe6450c..1c8fb46 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RESET] = "device reset",
+	[RTE_ETH_EVENT_RECOVERED] = "device recovery",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events
  2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events Kalesh A P
@ 2020-09-30  7:50     ` Thomas Monjalon
  2020-09-30  8:35       ` Kalesh Anakkur Purayil
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2020-09-30  7:50 UTC (permalink / raw)
  To: dev, Kalesh A P; +Cc: ferruh.yigit, ajit.khaparde, arybchenko

Hi,

Please use --cc-cmd devtools/get-maintainer.sh so all relevant
maintainers are Cc'ed. Adding Andrew.

> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> Adding support for device reset and recovery events in the
> rte_eth_event framework. FW error and FW reset conditions would be
> managed internally by PMD without needing application intervention.
> In such cases, PMD would need reset/recovery events to notify application
> that PMD is undergoing a reset.

We already have this event:
	
    RTE_ETH_EVENT_INTR_RESET,
            /**< reset interrupt event, sent to VF on PF reset */

I don't know why "INTR" is in the name of this event,
and I think it does not need to be restricted to VF.
The application does not need to know whether the reset
is caused by the PF, the FW or the HW.
That's why I think you could share the same event.

> +       RTE_ETH_EVENT_RESET,    /**< port resetting from an error */
> +       RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */

You ignored my previous comments:
"
What the application is supposed to do when receiving such event?
How the application knows that flow rules were resetted?
Is there any other configuration resetted?
These informations must be explicit in the doxygen comments.
"



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events
  2020-09-30  7:50     ` Thomas Monjalon
@ 2020-09-30  8:35       ` Kalesh Anakkur Purayil
  2020-09-30  9:31         ` Thomas Monjalon
  0 siblings, 1 reply; 48+ messages in thread
From: Kalesh Anakkur Purayil @ 2020-09-30  8:35 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Yigit, Ferruh, Ajit Kumar Khaparde, arybchenko

Hi Thomas,

Please see my response inline.

Regards,
Kalesh

On Wed, Sep 30, 2020 at 1:21 PM Thomas Monjalon <thomas@monjalon.net> wrote:

> Hi,
>
> Please use --cc-cmd devtools/get-maintainer.sh so all relevant
> maintainers are Cc'ed. Adding Andrew.
>
[Kalesh]: Thank you. Sure, I will follow next time.

>
> > From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> >
> > Adding support for device reset and recovery events in the
> > rte_eth_event framework. FW error and FW reset conditions would be
> > managed internally by PMD without needing application intervention.
> > In such cases, PMD would need reset/recovery events to notify application
> > that PMD is undergoing a reset.
>
> We already have this event:
>
>     RTE_ETH_EVENT_INTR_RESET,
>             /**< reset interrupt event, sent to VF on PF reset */
>
> I don't know why "INTR" is in the name of this event,
> and I think it does not need to be restricted to VF.
> The application does not need to know whether the reset
> is caused by the PF, the FW or the HW.
> That's why I think you could share the same event.
>

[Kalesh]: Yes. As you mentioned, this event is used for some other purpose.
I did not want to break the existing usage/purpose of this event.
For example, upon receiving the RTE_ETH_EVENT_INTR_RESET event OVS
application invokes rte_eth_dev_reset() to reset the port.
The aim here is to recover from the device error condition without the
intervention of Applications. PMD itself will recover from the error using
the protocol with FW.

>
> > +       RTE_ETH_EVENT_RESET,    /**< port resetting from an error */
> > +       RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */
>
> You ignored my previous comments:
> "
> What the application is supposed to do when receiving such event?
> How the application knows that flow rules were resetted?
> Is there any other configuration resetted?
> These informations must be explicit in the doxygen comments.
> "
>
[Kalesh]: Sorry, I missed it.
I am not sure what you meant by "These information must be explicit in the
doxygen comments ".
Could you please elaborate a little how to/where to put these details?


-- 
Regards,
Kalesh A P

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events
  2020-09-30  8:35       ` Kalesh Anakkur Purayil
@ 2020-09-30  9:31         ` Thomas Monjalon
  0 siblings, 0 replies; 48+ messages in thread
From: Thomas Monjalon @ 2020-09-30  9:31 UTC (permalink / raw)
  To: Kalesh Anakkur Purayil
  Cc: dev, Yigit, Ferruh, Ajit Kumar Khaparde, arybchenko

30/09/2020 10:35, Kalesh Anakkur Purayil:
> On Wed, Sep 30, 2020 at 1:21 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> > > From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> > >
> > > Adding support for device reset and recovery events in the
> > > rte_eth_event framework. FW error and FW reset conditions would be
> > > managed internally by PMD without needing application intervention.
> > > In such cases, PMD would need reset/recovery events to notify application
> > > that PMD is undergoing a reset.
> >
> > We already have this event:
> >
> >     RTE_ETH_EVENT_INTR_RESET,
> >             /**< reset interrupt event, sent to VF on PF reset */
> >
> > I don't know why "INTR" is in the name of this event,
> > and I think it does not need to be restricted to VF.
> > The application does not need to know whether the reset
> > is caused by the PF, the FW or the HW.
> > That's why I think you could share the same event.
> >
> 
> [Kalesh]: Yes. As you mentioned, this event is used for some other purpose.
> I did not want to break the existing usage/purpose of this event.
> For example, upon receiving the RTE_ETH_EVENT_INTR_RESET event OVS
> application invokes rte_eth_dev_reset() to reset the port.
> The aim here is to recover from the device error condition without the
> intervention of Applications. PMD itself will recover from the error using
> the protocol with FW.
> 
> >
> > > +       RTE_ETH_EVENT_RESET,    /**< port resetting from an error */
> > > +       RTE_ETH_EVENT_RECOVERED, /**< port recovered from an error */
> >
> > You ignored my previous comments:
> > "
> > What the application is supposed to do when receiving such event?
> > How the application knows that flow rules were resetted?
> > Is there any other configuration resetted?
> > These informations must be explicit in the doxygen comments.
> > "
> >
> [Kalesh]: Sorry, I missed it.
> I am not sure what you meant by "These information must be explicit in the
> doxygen comments ".
> Could you please elaborate a little how to/where to put these details?

/** is the start of a doxygen comment.
This is the place (in the .h file) to explain to application
developer what to do with the event.
The code + the comments is what we call "the API".

You should complete the description of RTE_ETH_EVENT_INTR_RESET
as well: the need for calling rte_eth_dev_reset() was not explicit.



^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v4 0/3] librte_ethdev: error recovery support
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (6 preceding siblings ...)
  2020-09-30  7:12 ` [dpdk-dev] [RFC PATCH v3 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-09-30 12:33 ` Kalesh A P
  2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 1/3] ethdev: support device reset and recovery events Kalesh A P
                     ` (2 more replies)
  2020-10-07 16:49 ` [dpdk-dev] [PATCH v5 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-10-09  3:48 ` [dpdk-dev] [PATCH v6 0/3] librte_ethdev: error recovery support Kalesh A P
  9 siblings, 3 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30 12:33 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

The error recovery solution is a protocol implemented between firmware
and bnxt PMD to recover from the fatal errors without a system reboot.
There is an alarm thread which constantly monitors the health of the
firmware and initiates a recovery when needed.

There are two scenarios here:

1. Hardware or firmware encountered an error which firmware detected.
   Firmware is in operational status here. In this case, firmware can
   reset the chip and notify the driver about the reset.
2. Hardware or firmware encountered an error but firmware is dead/hung.
   Firmware is not in operational status. In this case, the only possible
   way to recover the adapter is through host driver(bnxt PMD).

In both cases, bnxt PMD reinitializes with the FW again after the reset.
During that recovery process, data path will be halted and any control path
operation would fail. So, the PMD has to notify the application about this
reset/error event to prevent any activities from the application while
the PMD is recovering from the error.

This patch set adds support for the reset and recovery event in
the rte_eth_event framework. FW error and FW reset conditions would be
managed by the PMD. Driver uses RTE_ETH_EVENT_RESET event to notify
the applications about the FW reset or error. In such cases,
PMD use the RTE_ETH_EVENT_RECOVERED event to notify application about
PMD has recovered from FW reset or FW error.

v4: Added doxygen comments about new events.
V3: Fixed a typo in commit log.
V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

Kalesh AP (3):
  ethdev: support device reset and recovery events
  net/bnxt: notify applications about device reset/recovery
  app/testpmd: handle device recovery event

 app/test-pmd/testpmd.c         |  6 +++++-
 drivers/net/bnxt/bnxt_cpr.c    |  3 +++
 drivers/net/bnxt/bnxt_ethdev.c |  9 +++++++++
 lib/librte_ethdev/rte_ethdev.h | 17 +++++++++++++++++
 4 files changed, 34 insertions(+), 1 deletion(-)

-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v4 1/3] ethdev: support device reset and recovery events
  2020-09-30 12:33 ` [dpdk-dev] [RFC PATCH v4 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-09-30 12:33   ` Kalesh A P
  2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
  2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30 12:33 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Adding support for device reset and recovery events in the
rte_eth_event framework. FW error and FW reset conditions would be
managed internally by PMD without needing application intervention.
In such cases, PMD would need reset/recovery events to notify application
that PMD is undergoing a reset.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 lib/librte_ethdev/rte_ethdev.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 9759f13..c0c90b7 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	RTE_ETH_EVENT_RESET,
+			/**< port resetting from an error
+			 *
+			 * PMD detected a FW reset or error condition.
+			 * PMD will try to recover from the error.
+			 * Data path will be halted and Control path operations
+			 * would fail at this time.
+			 */
+	RTE_ETH_EVENT_RECOVERED,
+			/**< port recovered from an error
+			 *
+			 * PMD has recovered from the error condition.
+			 * Control path and Data path are up now.
+			 * Since the device undergone a reset, flow rules
+			 * offloaded prior to the reset will be lost and
+			 * the application has to recreate the rules again.
+			 */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v4 2/3] net/bnxt: notify applications about device reset/recovery
  2020-09-30 12:33 ` [dpdk-dev] [RFC PATCH v4 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 1/3] ethdev: support device reset and recovery events Kalesh A P
@ 2020-09-30 12:33   ` Kalesh A P
  2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-09-30 12:33 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

When the driver receives RESET_NOTIFY async event from FW or detected
an error condition, it should update the application that FW is going
to reset. Once the driver recoveres from the reset, update the reset
recovery status to application as well.

The recovery process is transparent to the application as the driver
itself tries to recover from FW reset or FW error conditions.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
---
 drivers/net/bnxt/bnxt_cpr.c    | 3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 8311e26..3b19ef7 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -129,6 +129,9 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			bp->flags |= BNXT_FLAG_FATAL_ERROR;
 			return;
 		}
+		rte_eth_dev_callback_process(bp->eth_dev,
+					     RTE_ETH_EVENT_RESET,
+					     NULL);
 
 		event_data = rte_le_to_cpu_32(async_cmp->event_data1);
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index b99c712..9a47bfd 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -4566,6 +4566,9 @@ static void bnxt_dev_recover(void *arg)
 		goto err_start;
 
 	PMD_DRV_LOG(INFO, "Recovered from FW reset\n");
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERED,
+				     NULL);
 	return;
 err_start:
 	bnxt_dev_stop_op(bp->eth_dev);
@@ -4573,6 +4576,9 @@ static void bnxt_dev_recover(void *arg)
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
 	PMD_DRV_LOG(ERR, "Failed to recover from FW reset\n");
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_INTR_RMV,
+				     NULL);
 }
 
 void bnxt_dev_reset_and_resume(void *arg)
@@ -4708,6 +4714,9 @@ static void bnxt_check_fw_health(void *arg)
 	bp->flags |= BNXT_FLAG_FW_RESET;
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RESET,
+				     NULL);
 
 	if (bnxt_is_master_func(bp))
 		wait_msec = info->master_func_wait_period;
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event
  2020-09-30 12:33 ` [dpdk-dev] [RFC PATCH v4 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 1/3] ethdev: support device reset and recovery events Kalesh A P
  2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
@ 2020-09-30 12:33   ` Kalesh A P
  2020-10-06 17:25     ` Ophir Munk
  2 siblings, 1 reply; 48+ messages in thread
From: Kalesh A P @ 2020-09-30 12:33 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Added code to handle device reset and recovery event in testpmd.
This is an indication from the PMD that device has reset and
recovered error condition.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/testpmd.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index fe6450c..1c8fb46 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RESET] = "device reset",
+	[RTE_ETH_EVENT_RECOVERED] = "device recovery",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event
  2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event Kalesh A P
@ 2020-10-06 17:25     ` Ophir Munk
  2020-10-07  4:46       ` Kalesh Anakkur Purayil
  0 siblings, 1 reply; 48+ messages in thread
From: Ophir Munk @ 2020-10-06 17:25 UTC (permalink / raw)
  To: Kalesh A P, dev; +Cc: Ophir Munk, NBU-Contact-Thomas Monjalon

Hi Kalesh,
Please find a few comments.
The name you gave to the event (EVENT_RESET) is very close to an already existing one: "EVENT_INTR_RESET".
But they are different.
EVENT_INTR_RESET originates from a port reset. It requires application reaction. It is widely used. It is documented in *.rst files.
EVENT_RESET originates from FW error (or maybe any error). It requires no application reaction (PMD manages by itself). It is not documented.
I therefore suggest renaming it (maybe EVENT_ERR_RECOVERING) and please document it in *.rst files.
More comments below:

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Kalesh A P
> Sent: Wednesday, September 30, 2020 3:33 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery
> event
> 
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> Added code to handle device reset and recovery event in testpmd.
> This is an indication from the PMD that device has reset and recovered error
> condition.
> 
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
> ---
>  app/test-pmd/testpmd.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> fe6450c..1c8fb46 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
>  	[RTE_ETH_EVENT_NEW] = "device probed",
>  	[RTE_ETH_EVENT_DESTROY] = "device released",
>  	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
> +	[RTE_ETH_EVENT_RESET] = "device reset",

"device reset" is similar to the existing "reset" string. Can you suggest a different one? Maybe "error under recovery" ?

> +	[RTE_ETH_EVENT_RECOVERED] = "device recovery",

Wouldn't you prefer "device recovered" ?

>  	[RTE_ETH_EVENT_MAX] = NULL,
>  };
> 
> @@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) <<
> RTE_ETH_EVENT_UNKNOWN) |
>  			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
>  			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
>  			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
> -			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
> +			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
> +			    (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
> +			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
>  /*
>   * Decide if all memory are locked for performance.
>   */
> --
> 2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event
  2020-10-06 17:25     ` Ophir Munk
@ 2020-10-07  4:46       ` Kalesh Anakkur Purayil
  2020-10-07  8:36         ` Ophir Munk
  2020-10-07  9:37         ` Ferruh Yigit
  0 siblings, 2 replies; 48+ messages in thread
From: Kalesh Anakkur Purayil @ 2020-10-07  4:46 UTC (permalink / raw)
  To: Ophir Munk; +Cc: dev, NBU-Contact-Thomas Monjalon

Hi Ophir,

Thank you for the comments. I will address them in the next version.

I will push these changes as Patches next time and not as an RFC. Hope that
is OK.

Regards,
Kalesh

On Tue, Oct 6, 2020 at 10:55 PM Ophir Munk <ophirmu@nvidia.com> wrote:

> Hi Kalesh,
> Please find a few comments.
> The name you gave to the event (EVENT_RESET) is very close to an already
> existing one: "EVENT_INTR_RESET".
> But they are different.
> EVENT_INTR_RESET originates from a port reset. It requires application
> reaction. It is widely used. It is documented in *.rst files.
> EVENT_RESET originates from FW error (or maybe any error). It requires no
> application reaction (PMD manages by itself). It is not documented.
> I therefore suggest renaming it (maybe EVENT_ERR_RECOVERING) and please
> document it in *.rst files.
> More comments below:
>
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Kalesh A P
> > Sent: Wednesday, September 30, 2020 3:33 PM
> > To: dev@dpdk.org
> > Subject: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device
> recovery
> > event
> >
> > From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> >
> > Added code to handle device reset and recovery event in testpmd.
> > This is an indication from the PMD that device has reset and recovered
> error
> > condition.
> >
> > Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> > Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
> > ---
> >  app/test-pmd/testpmd.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> > fe6450c..1c8fb46 100644
> > --- a/app/test-pmd/testpmd.c
> > +++ b/app/test-pmd/testpmd.c
> > @@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
> >       [RTE_ETH_EVENT_NEW] = "device probed",
> >       [RTE_ETH_EVENT_DESTROY] = "device released",
> >       [RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
> > +     [RTE_ETH_EVENT_RESET] = "device reset",
>
> "device reset" is similar to the existing "reset" string. Can you suggest
> a different one? Maybe "error under recovery" ?
>
> > +     [RTE_ETH_EVENT_RECOVERED] = "device recovery",
>
> Wouldn't you prefer "device recovered" ?
>
> >       [RTE_ETH_EVENT_MAX] = NULL,
> >  };
> >
> > @@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) <<
> > RTE_ETH_EVENT_UNKNOWN) |
> >                           (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
> >                           (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
> >                           (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
> > -                         (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
> > +                         (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
> > +                         (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
> > +                         (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
> >  /*
> >   * Decide if all memory are locked for performance.
> >   */
> > --
> > 2.10.1
>
>

-- 
Regards,
Kalesh A P

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event
  2020-10-07  4:46       ` Kalesh Anakkur Purayil
@ 2020-10-07  8:36         ` Ophir Munk
  2020-10-07  9:37         ` Ferruh Yigit
  1 sibling, 0 replies; 48+ messages in thread
From: Ophir Munk @ 2020-10-07  8:36 UTC (permalink / raw)
  To: Kalesh Anakkur Purayil
  Cc: dev, NBU-Contact-Thomas Monjalon, ferruh.yigit,
	Ajit Khaparde (ajit.khaparde

Adding Ferruh and Ajit

From: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>
Sent: Wednesday, October 7, 2020 7:47 AM
To: Ophir Munk <ophirmu@nvidia.com>
Cc: dev@dpdk.org; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>
Subject: Re: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event

Hi Ophir,

Thank you for the comments. I will address them in the next version.

I will push these changes as Patches next time and not as an RFC. Hope that is OK.

Regards,
Kalesh

On Tue, Oct 6, 2020 at 10:55 PM Ophir Munk <ophirmu@nvidia.com<mailto:ophirmu@nvidia.com>> wrote:
Hi Kalesh,
Please find a few comments.
The name you gave to the event (EVENT_RESET) is very close to an already existing one: "EVENT_INTR_RESET".
But they are different.
EVENT_INTR_RESET originates from a port reset. It requires application reaction. It is widely used. It is documented in *.rst files.
EVENT_RESET originates from FW error (or maybe any error). It requires no application reaction (PMD manages by itself). It is not documented.
I therefore suggest renaming it (maybe EVENT_ERR_RECOVERING) and please document it in *.rst files.
More comments below:

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org<mailto:dev-bounces@dpdk.org>> On Behalf Of Kalesh A P
> Sent: Wednesday, September 30, 2020 3:33 PM
> To: dev@dpdk.org<mailto:dev@dpdk.org>
> Subject: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery
> event
>
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com<mailto:kalesh-anakkur.purayil@broadcom.com>>
>
> Added code to handle device reset and recovery event in testpmd.
> This is an indication from the PMD that device has reset and recovered error
> condition.
>
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com<mailto:kalesh-anakkur.purayil@broadcom.com>>
> Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com<mailto:ajit.khaparde@broadcom.com>>
> ---
>  app/test-pmd/testpmd.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> fe6450c..1c8fb46 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
>       [RTE_ETH_EVENT_NEW] = "device probed",
>       [RTE_ETH_EVENT_DESTROY] = "device released",
>       [RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
> +     [RTE_ETH_EVENT_RESET] = "device reset",

"device reset" is similar to the existing "reset" string. Can you suggest a different one? Maybe "error under recovery" ?

> +     [RTE_ETH_EVENT_RECOVERED] = "device recovery",

Wouldn't you prefer "device recovered" ?

>       [RTE_ETH_EVENT_MAX] = NULL,
>  };
>
> @@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) <<
> RTE_ETH_EVENT_UNKNOWN) |
>                           (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
>                           (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
>                           (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
> -                         (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
> +                         (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
> +                         (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
> +                         (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
>  /*
>   * Decide if all memory are locked for performance.
>   */
> --
> 2.10.1


--
Regards,
Kalesh A P

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event
  2020-10-07  4:46       ` Kalesh Anakkur Purayil
  2020-10-07  8:36         ` Ophir Munk
@ 2020-10-07  9:37         ` Ferruh Yigit
  2020-10-07 18:42           ` Ajit Khaparde
  1 sibling, 1 reply; 48+ messages in thread
From: Ferruh Yigit @ 2020-10-07  9:37 UTC (permalink / raw)
  To: Kalesh Anakkur Purayil, Ophir Munk; +Cc: dev, NBU-Contact-Thomas Monjalon

On 10/7/2020 5:46 AM, Kalesh Anakkur Purayil wrote:
> Hi Ophir,
> 
> Thank you for the comments. I will address them in the next version.
> 
> I will push these changes as Patches next time and not as an RFC. Hope that
> is OK.
> 
> Regards,
> Kalesh
> 
> On Tue, Oct 6, 2020 at 10:55 PM Ophir Munk <ophirmu@nvidia.com> wrote:
> 
>> Hi Kalesh,
>> Please find a few comments.
>> The name you gave to the event (EVENT_RESET) is very close to an already
>> existing one: "EVENT_INTR_RESET".
>> But they are different.
>> EVENT_INTR_RESET originates from a port reset. It requires application
>> reaction. It is widely used. It is documented in *.rst files.
>> EVENT_RESET originates from FW error (or maybe any error). It requires no
>> application reaction (PMD manages by itself). It is not documented.
>> I therefore suggest renaming it (maybe EVENT_ERR_RECOVERING) and please
>> document it in *.rst files.

+1 to renaming and documenting the event.

And agree to proceed as regular patch instead of RFC.


>> More comments below:
>>
>>> -----Original Message-----
>>> From: dev <dev-bounces@dpdk.org> On Behalf Of Kalesh A P
>>> Sent: Wednesday, September 30, 2020 3:33 PM
>>> To: dev@dpdk.org
>>> Subject: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device
>> recovery
>>> event
>>>
>>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>>
>>> Added code to handle device reset and recovery event in testpmd.
>>> This is an indication from the PMD that device has reset and recovered
>> error
>>> condition.
>>>
>>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>> Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
>>> ---
>>>   app/test-pmd/testpmd.c | 6 +++++-
>>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
>>> fe6450c..1c8fb46 100644
>>> --- a/app/test-pmd/testpmd.c
>>> +++ b/app/test-pmd/testpmd.c
>>> @@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
>>>        [RTE_ETH_EVENT_NEW] = "device probed",
>>>        [RTE_ETH_EVENT_DESTROY] = "device released",
>>>        [RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
>>> +     [RTE_ETH_EVENT_RESET] = "device reset",
>>
>> "device reset" is similar to the existing "reset" string. Can you suggest
>> a different one? Maybe "error under recovery" ?
>>
>>> +     [RTE_ETH_EVENT_RECOVERED] = "device recovery",
>>
>> Wouldn't you prefer "device recovered" ?
>>
>>>        [RTE_ETH_EVENT_MAX] = NULL,
>>>   };
>>>
>>> @@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) <<
>>> RTE_ETH_EVENT_UNKNOWN) |
>>>                            (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
>>>                            (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
>>>                            (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
>>> -                         (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
>>> +                         (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
>>> +                         (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
>>> +                         (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
>>>   /*
>>>    * Decide if all memory are locked for performance.
>>>    */
>>> --
>>> 2.10.1
>>
>>
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev]  [PATCH v5 0/3] librte_ethdev: error recovery support
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (7 preceding siblings ...)
  2020-09-30 12:33 ` [dpdk-dev] [RFC PATCH v4 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-10-07 16:49 ` Kalesh A P
  2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and recovery events Kalesh A P
                     ` (2 more replies)
  2020-10-09  3:48 ` [dpdk-dev] [PATCH v6 0/3] librte_ethdev: error recovery support Kalesh A P
  9 siblings, 3 replies; 48+ messages in thread
From: Kalesh A P @ 2020-10-07 16:49 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

The error recovery solution is a protocol implemented between firmware
and bnxt PMD to recover from the fatal errors without a system reboot.
There is an alarm thread which constantly monitors the health of the
firmware and initiates a recovery when needed.

There are two scenarios here:

1. Hardware or firmware encountered an error which firmware detected.
   Firmware is in operational status here. In this case, firmware can
   reset the chip and notify the driver about the reset.
2. Hardware or firmware encountered an error but firmware is dead/hung.
   Firmware is not in operational status. In this case, the only possible
   way to recover the adapter is through host driver(bnxt PMD).

In both cases, bnxt PMD reinitializes with the FW again after the reset.
During that recovery process, data path will be halted and any control path
operation would fail. So, the PMD has to notify the application about this
reset/error event to prevent any activities from the application while
the PMD is recovering from the error.

This patch set adds support for the reset and recovery event in
the rte_eth_event framework. FW error and FW reset conditions would be
managed by the PMD. Driver uses RTE_ETH_EVENT_RESET event to notify
the applications about the FW reset or error. In such cases,
PMD use the RTE_ETH_EVENT_RECOVERED event to notify application about
PMD has recovered from FW reset or FW error.

v5: Addressed comments from Ophir.
    1. Renamed the new event name to RTE_ETH_EVENT_ERR_RECOVERING.
    2. Fixed testpmd logs.
    3. Documented the new recovery events.
v4: Added doxygen comments about new events.
V3: Fixed a typo in commit log.
V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
    RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

Kalesh AP (3):
  ethdev: support device reset and recovery events
  net/bnxt: notify applications about device reset/recovery
  app/testpmd: handle device recovery event

 app/test-pmd/testpmd.c                  |  6 +++++-
 doc/guides/prog_guide/poll_mode_drv.rst | 18 ++++++++++++++++++
 drivers/net/bnxt/bnxt_cpr.c             |  3 +++
 drivers/net/bnxt/bnxt_ethdev.c          |  9 +++++++++
 lib/librte_ethdev/rte_ethdev.h          | 17 +++++++++++++++++
 5 files changed, 52 insertions(+), 1 deletion(-)

-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and recovery events
  2020-10-07 16:49 ` [dpdk-dev] [PATCH v5 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-10-07 16:49   ` Kalesh A P
  2020-10-08 10:49     ` Asaf Penso
  2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
  2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 1 reply; 48+ messages in thread
From: Kalesh A P @ 2020-10-07 16:49 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Adding support for device reset and recovery events in the
rte_eth_event framework. FW error and FW reset conditions would be
managed internally by PMD without needing application intervention.
In such cases, PMD would need reset/recovery events to notify application
that PMD is undergoing a reset.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
---
 doc/guides/prog_guide/poll_mode_drv.rst | 18 ++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h          | 17 +++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 86e0a14..c03f0ef 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -615,3 +615,21 @@ by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+Error recovery support
+~~~~~~~~~~~~~~~~~~~~~~
+
+When the PMD detects a FW reset or error condition, it will try to recover
+from the error without needing the application intervention. In such cases,
+PMD would need events to notify the application that it is undergoing
+an error recovery.
+
+The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
+application that PMD detected a FW reset or FW error condition. PMD will
+try to recover from the error by itself. Data path will be halted and
+control path operations would fail during the recovery period.
+
+The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the application
+that the it has recovered from the error condition. Control path and data path
+are up now. Since the device undergone a reset, flow rules offloaded prior to
+the reset will be lost and the application has to recreate the rules again.
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 9759f13..9b4b015 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+			/**< port recovering from an error
+			 *
+			 * PMD detected a FW reset or error condition.
+			 * PMD will try to recover from the error.
+			 * Data path will be halted and Control path operations
+			 * would fail at this time.
+			 */
+	RTE_ETH_EVENT_RECOVERED,
+			/**< port recovered from an error
+			 *
+			 * PMD has recovered from the error condition.
+			 * Control path and Data path are up now.
+			 * Since the device undergone a reset, flow rules
+			 * offloaded prior to the reset will be lost and
+			 * the application has to recreate the rules again.
+			 */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v5 2/3] net/bnxt: notify applications about device reset/recovery
  2020-10-07 16:49 ` [dpdk-dev] [PATCH v5 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and recovery events Kalesh A P
@ 2020-10-07 16:49   ` Kalesh A P
  2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-10-07 16:49 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

When the driver receives RESET_NOTIFY async event from FW or detected
an error condition, it should update the application that FW is going
to reset. Once the driver recoveres from the reset, update the reset
recovery status to application as well.

The recovery process is transparent to the application as the driver
itself tries to recover from FW reset or FW error conditions.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
---
 drivers/net/bnxt/bnxt_cpr.c    | 3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 8311e26..987c010 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -129,6 +129,9 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			bp->flags |= BNXT_FLAG_FATAL_ERROR;
 			return;
 		}
+		rte_eth_dev_callback_process(bp->eth_dev,
+					     RTE_ETH_EVENT_ERR_RECOVERING,
+					     NULL);
 
 		event_data = rte_le_to_cpu_32(async_cmp->event_data1);
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index b99c712..e3798de 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -4566,6 +4566,9 @@ static void bnxt_dev_recover(void *arg)
 		goto err_start;
 
 	PMD_DRV_LOG(INFO, "Recovered from FW reset\n");
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERED,
+				     NULL);
 	return;
 err_start:
 	bnxt_dev_stop_op(bp->eth_dev);
@@ -4573,6 +4576,9 @@ static void bnxt_dev_recover(void *arg)
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
 	PMD_DRV_LOG(ERR, "Failed to recover from FW reset\n");
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_INTR_RMV,
+				     NULL);
 }
 
 void bnxt_dev_reset_and_resume(void *arg)
@@ -4708,6 +4714,9 @@ static void bnxt_check_fw_health(void *arg)
 	bp->flags |= BNXT_FLAG_FW_RESET;
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_ERR_RECOVERING,
+				     NULL);
 
 	if (bnxt_is_master_func(bp))
 		wait_msec = info->master_func_wait_period;
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v5 3/3] app/testpmd: handle device recovery event
  2020-10-07 16:49 ` [dpdk-dev] [PATCH v5 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and recovery events Kalesh A P
  2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
@ 2020-10-07 16:49   ` Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-10-07 16:49 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Added code to handle error recovery events in testpmd.
This is an indication from the PMD that it is undergoing
an error recovery and recovered from the error condition.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/testpmd.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index fe6450c..80ae3fa 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_ERR_RECOVERING] = "device error under recovery",
+	[RTE_ETH_EVENT_RECOVERED] = "device recovered",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
 /*
  * Decide if all memory are locked for performance.
  */
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event
  2020-10-07  9:37         ` Ferruh Yigit
@ 2020-10-07 18:42           ` Ajit Khaparde
  0 siblings, 0 replies; 48+ messages in thread
From: Ajit Khaparde @ 2020-10-07 18:42 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: Kalesh Anakkur Purayil, Ophir Munk, dev, NBU-Contact-Thomas Monjalon

On Wed, Oct 7, 2020 at 2:37 AM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
>
> On 10/7/2020 5:46 AM, Kalesh Anakkur Purayil wrote:
> > Hi Ophir,
> >
> > Thank you for the comments. I will address them in the next version.
> >
> > I will push these changes as Patches next time and not as an RFC. Hope that
> > is OK.
> >
> > Regards,
> > Kalesh
> >
> > On Tue, Oct 6, 2020 at 10:55 PM Ophir Munk <ophirmu@nvidia.com> wrote:
> >
> >> Hi Kalesh,
> >> Please find a few comments.
> >> The name you gave to the event (EVENT_RESET) is very close to an already
> >> existing one: "EVENT_INTR_RESET".
> >> But they are different.
> >> EVENT_INTR_RESET originates from a port reset. It requires application
> >> reaction. It is widely used. It is documented in *.rst files.
> >> EVENT_RESET originates from FW error (or maybe any error). It requires no
> >> application reaction (PMD manages by itself). It is not documented.
> >> I therefore suggest renaming it (maybe EVENT_ERR_RECOVERING) and please
> >> document it in *.rst files.
>
> +1 to renaming and documenting the event.
>
> And agree to proceed as regular patch instead of RFC.
Ferruh,
If/when the new version of patch is good,
Can you pick the bnxt PMD patch along with the ethdev and testpmd patch?
Let me know.


>
>
> >> More comments below:
> >>
> >>> -----Original Message-----
> >>> From: dev <dev-bounces@dpdk.org> On Behalf Of Kalesh A P
> >>> Sent: Wednesday, September 30, 2020 3:33 PM
> >>> To: dev@dpdk.org
> >>> Subject: [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device
> >> recovery
> >>> event
> >>>
> >>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> >>>
> >>> Added code to handle device reset and recovery event in testpmd.
> >>> This is an indication from the PMD that device has reset and recovered
> >> error
> >>> condition.
> >>>
> >>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> >>> Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
> >>> ---
> >>>   app/test-pmd/testpmd.c | 6 +++++-
> >>>   1 file changed, 5 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> >>> fe6450c..1c8fb46 100644
> >>> --- a/app/test-pmd/testpmd.c
> >>> +++ b/app/test-pmd/testpmd.c
> >>> @@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
> >>>        [RTE_ETH_EVENT_NEW] = "device probed",
> >>>        [RTE_ETH_EVENT_DESTROY] = "device released",
> >>>        [RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
> >>> +     [RTE_ETH_EVENT_RESET] = "device reset",
> >>
> >> "device reset" is similar to the existing "reset" string. Can you suggest
> >> a different one? Maybe "error under recovery" ?
> >>
> >>> +     [RTE_ETH_EVENT_RECOVERED] = "device recovery",
> >>
> >> Wouldn't you prefer "device recovered" ?
> >>
> >>>        [RTE_ETH_EVENT_MAX] = NULL,
> >>>   };
> >>>
> >>> @@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) <<
> >>> RTE_ETH_EVENT_UNKNOWN) |
> >>>                            (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
> >>>                            (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
> >>>                            (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
> >>> -                         (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
> >>> +                         (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
> >>> +                         (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
> >>> +                         (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
> >>>   /*
> >>>    * Decide if all memory are locked for performance.
> >>>    */
> >>> --
> >>> 2.10.1
> >>
> >>
> >
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and recovery events
  2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and recovery events Kalesh A P
@ 2020-10-08 10:49     ` Asaf Penso
  0 siblings, 0 replies; 48+ messages in thread
From: Asaf Penso @ 2020-10-08 10:49 UTC (permalink / raw)
  To: Kalesh A P, dev; +Cc: Matan Azrad

>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Kalesh A P
>Sent: Wednesday, October 7, 2020 7:49 PM
>To: dev@dpdk.org
>Subject: [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and
>recovery events
>
>From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>
>Adding support for device reset and recovery events in the rte_eth_event
>framework. FW error and FW reset conditions would be managed internally
>by PMD without needing application intervention.
>In such cases, PMD would need reset/recovery events to notify application
>that PMD is undergoing a reset.
>
>Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
>Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
>---
> doc/guides/prog_guide/poll_mode_drv.rst | 18 ++++++++++++++++++
> lib/librte_ethdev/rte_ethdev.h          | 17 +++++++++++++++++
> 2 files changed, 35 insertions(+)
>
>diff --git a/doc/guides/prog_guide/poll_mode_drv.rst
>b/doc/guides/prog_guide/poll_mode_drv.rst
>index 86e0a14..c03f0ef 100644
>--- a/doc/guides/prog_guide/poll_mode_drv.rst
>+++ b/doc/guides/prog_guide/poll_mode_drv.rst
>@@ -615,3 +615,21 @@ by application.
> The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger  the
>application to handle reset event. It is duty of application to  handle all
>synchronization before it calls rte_eth_dev_reset().
>+
>+Error recovery support
>+~~~~~~~~~~~~~~~~~~~~~~
>+
>+When the PMD detects a FW reset or error condition, it will try to
>+recover from the error without needing the application intervention. In
>+such cases, PMD would need events to notify the application that it is
>+undergoing an error recovery.
>+
>+The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
>+application that PMD detected a FW reset or FW error condition. PMD
>+will try to recover from the error by itself. Data path will be halted
>+and control path operations would fail during the recovery period.
>+
>+The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the
>+application that the it has recovered from the error condition. Control
>+path and data path are up now. Since the device undergone a reset, flow
>+rules offloaded prior to the reset will be lost and the application has to
>recreate the rules again.
>diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
>index 9759f13..9b4b015 100644
>--- a/lib/librte_ethdev/rte_ethdev.h
>+++ b/lib/librte_ethdev/rte_ethdev.h
>@@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
> 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
> 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected
>*/
>+	RTE_ETH_EVENT_ERR_RECOVERING,
>+			/**< port recovering from an error
>+			 *
>+			 * PMD detected a FW reset or error condition.
>+			 * PMD will try to recover from the error.
>+			 * Data path will be halted and Control path operations
>+			 * would fail at this time.
>+			 */
>+	RTE_ETH_EVENT_RECOVERED,
>+			/**< port recovered from an error
>+			 *
>+			 * PMD has recovered from the error condition.
>+			 * Control path and Data path are up now.
>+			 * Since the device undergone a reset, flow rules
>+			 * offloaded prior to the reset will be lost and
>+			 * the application has to recreate the rules again.
>+			 */
> 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
> };
>
>--
>2.10.1

LGTM
Reviewed-by: Asaf Penso <asafp@nvidia.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [RFC PATCH v3 3/3] app/testpmd: handle device recovery event
  2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 3/3] app/testpmd: handle device recovery event Kalesh A P
@ 2020-10-08 10:53     ` Asaf Penso
  0 siblings, 0 replies; 48+ messages in thread
From: Asaf Penso @ 2020-10-08 10:53 UTC (permalink / raw)
  To: Kalesh A P, dev; +Cc: NBU-Contact-Thomas Monjalon, ferruh.yigit, ajit.khaparde

>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Kalesh A P
>Sent: Wednesday, September 30, 2020 10:13 AM
>To: dev@dpdk.org
>Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
>ferruh.yigit@intel.com; ajit.khaparde@broadcom.com
>Subject: [dpdk-dev] [RFC PATCH v3 3/3] app/testpmd: handle device recovery
>event
>
>From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>
>Added code to handle device reset and recovery event in testpmd.
>This is an indication from the PMD that device has reset and recovered error
>condition.
>
>Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
>---
> app/test-pmd/testpmd.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
>diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
>fe6450c..1c8fb46 100644
>--- a/app/test-pmd/testpmd.c
>+++ b/app/test-pmd/testpmd.c
>@@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
> 	[RTE_ETH_EVENT_NEW] = "device probed",
> 	[RTE_ETH_EVENT_DESTROY] = "device released",
> 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
>+	[RTE_ETH_EVENT_RESET] = "device reset",
>+	[RTE_ETH_EVENT_RECOVERED] = "device recovery",
> 	[RTE_ETH_EVENT_MAX] = NULL,
> };
>
>@@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) <<
>RTE_ETH_EVENT_UNKNOWN) |
> 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
> 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
> 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
>-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
>+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
>+			    (UINT32_C(1) << RTE_ETH_EVENT_RESET) |
>+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
> /*
>  * Decide if all memory are locked for performance.
>  */
>--
>2.10.1

I think you would need also to update parse_event_printing_config function in parameters.c file.
Do you think the release_notes_20.11.rst should be updated with these new events?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev]  [PATCH v6 0/3] librte_ethdev: error recovery support
  2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
                   ` (8 preceding siblings ...)
  2020-10-07 16:49 ` [dpdk-dev] [PATCH v5 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-10-09  3:48 ` Kalesh A P
  2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events Kalesh A P
                     ` (2 more replies)
  9 siblings, 3 replies; 48+ messages in thread
From: Kalesh A P @ 2020-10-09  3:48 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

The error recovery solution is a protocol implemented between firmware
and bnxt PMD to recover from the fatal errors without a system reboot.
There is an alarm thread which constantly monitors the health of the
firmware and initiates a recovery when needed.

There are two scenarios here:

1. Hardware or firmware encountered an error which firmware detected.
   Firmware is in operational status here. In this case, firmware can
   reset the chip and notify the driver about the reset.
2. Hardware or firmware encountered an error but firmware is dead/hung.
   Firmware is not in operational status. In this case, the only possible
   way to recover the adapter is through host driver(bnxt PMD).

In both cases, bnxt PMD reinitializes with the FW again after the reset.
During that recovery process, data path will be halted and any control path
operation would fail. So, the PMD has to notify the application about this
reset/error event to prevent any activities from the application while
the PMD is recovering from the error.

This patch set adds support for the reset and recovery event in
the rte_eth_event framework. FW error and FW reset conditions would be
managed by the PMD. Driver uses RTE_ETH_EVENT_RESET event to notify
the applications about the FW reset or error. In such cases,
PMD use the RTE_ETH_EVENT_RECOVERED event to notify application about
PMD has recovered from FW reset or FW error.

v6: Addressed comments from Asaf Penso.
    1. Updated 20.11 release notes with the new events added.
    2. updated testpmd parse_event_printing_config function.
v5: Addressed comments from Ophir Munk.
    1. Renamed the new event name to RTE_ETH_EVENT_ERR_RECOVERING.
    2. Fixed testpmd logs.
    3. Documented the new recovery events.
v4: Addressed comments from Thomas Monjalon
    1. Added doxygen comments about new events.
V3: Fixed a typo in commit log.
V2: Added a new event RTE_ETH_EVENT_RESET instead of using the
    RTE_ETH_EVENT_INTR_RESET to notify applications about device reset.

Kalesh AP (3):
  ethdev: support device reset and recovery events
  net/bnxt: notify applications about device reset/recovery
  app/testpmd: handle device recovery event

 app/test-pmd/parameters.c               |  8 ++++++--
 app/test-pmd/testpmd.c                  |  6 +++++-
 doc/guides/prog_guide/poll_mode_drv.rst | 18 ++++++++++++++++++
 doc/guides/rel_notes/release_20_11.rst  | 10 ++++++++++
 drivers/net/bnxt/bnxt_cpr.c             |  3 +++
 drivers/net/bnxt/bnxt_ethdev.c          |  9 +++++++++
 lib/librte_ethdev/rte_ethdev.h          | 17 +++++++++++++++++
 7 files changed, 68 insertions(+), 3 deletions(-)

-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events
  2020-10-09  3:48 ` [dpdk-dev] [PATCH v6 0/3] librte_ethdev: error recovery support Kalesh A P
@ 2020-10-09  3:48   ` Kalesh A P
  2020-10-11 21:29     ` Thomas Monjalon
  2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
  2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 1 reply; 48+ messages in thread
From: Kalesh A P @ 2020-10-09  3:48 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Adding support for device reset and recovery events in the
rte_eth_event framework. FW error and FW reset conditions would be
managed internally by PMD without needing application intervention.
In such cases, PMD would need reset/recovery events to notify application
that PMD is undergoing a reset.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Asaf Penso <asafp@nvidia.com>
---
 doc/guides/prog_guide/poll_mode_drv.rst | 18 ++++++++++++++++++
 lib/librte_ethdev/rte_ethdev.h          | 17 +++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
index 86e0a14..c03f0ef 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -615,3 +615,21 @@ by application.
 The PMD itself should not call rte_eth_dev_reset(). The PMD can trigger
 the application to handle reset event. It is duty of application to
 handle all synchronization before it calls rte_eth_dev_reset().
+
+Error recovery support
+~~~~~~~~~~~~~~~~~~~~~~
+
+When the PMD detects a FW reset or error condition, it will try to recover
+from the error without needing the application intervention. In such cases,
+PMD would need events to notify the application that it is undergoing
+an error recovery.
+
+The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
+application that PMD detected a FW reset or FW error condition. PMD will
+try to recover from the error by itself. Data path will be halted and
+control path operations would fail during the recovery period.
+
+The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the application
+that the it has recovered from the error condition. Control path and data path
+are up now. Since the device undergone a reset, flow rules offloaded prior to
+the reset will be lost and the application has to recreate the rules again.
diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
index 9759f13..9b4b015 100644
--- a/lib/librte_ethdev/rte_ethdev.h
+++ b/lib/librte_ethdev/rte_ethdev.h
@@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	RTE_ETH_EVENT_ERR_RECOVERING,
+			/**< port recovering from an error
+			 *
+			 * PMD detected a FW reset or error condition.
+			 * PMD will try to recover from the error.
+			 * Data path will be halted and Control path operations
+			 * would fail at this time.
+			 */
+	RTE_ETH_EVENT_RECOVERED,
+			/**< port recovered from an error
+			 *
+			 * PMD has recovered from the error condition.
+			 * Control path and Data path are up now.
+			 * Since the device undergone a reset, flow rules
+			 * offloaded prior to the reset will be lost and
+			 * the application has to recreate the rules again.
+			 */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v6 2/3] net/bnxt: notify applications about device reset/recovery
  2020-10-09  3:48 ` [dpdk-dev] [PATCH v6 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events Kalesh A P
@ 2020-10-09  3:48   ` Kalesh A P
  2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 3/3] app/testpmd: handle device recovery event Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-10-09  3:48 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

When the driver receives RESET_NOTIFY async event from FW or detected
an error condition, it should update the application that FW is going
to reset. Once the driver recoveres from the reset, update the reset
recovery status to application as well.

The recovery process is transparent to the application as the driver
itself tries to recover from FW reset or FW error conditions.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
---
 drivers/net/bnxt/bnxt_cpr.c    | 3 +++
 drivers/net/bnxt/bnxt_ethdev.c | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/bnxt/bnxt_cpr.c b/drivers/net/bnxt/bnxt_cpr.c
index 8311e26..987c010 100644
--- a/drivers/net/bnxt/bnxt_cpr.c
+++ b/drivers/net/bnxt/bnxt_cpr.c
@@ -129,6 +129,9 @@ void bnxt_handle_async_event(struct bnxt *bp,
 			bp->flags |= BNXT_FLAG_FATAL_ERROR;
 			return;
 		}
+		rte_eth_dev_callback_process(bp->eth_dev,
+					     RTE_ETH_EVENT_ERR_RECOVERING,
+					     NULL);
 
 		event_data = rte_le_to_cpu_32(async_cmp->event_data1);
 		/* timestamp_lo/hi values are in units of 100ms */
diff --git a/drivers/net/bnxt/bnxt_ethdev.c b/drivers/net/bnxt/bnxt_ethdev.c
index b99c712..e3798de 100644
--- a/drivers/net/bnxt/bnxt_ethdev.c
+++ b/drivers/net/bnxt/bnxt_ethdev.c
@@ -4566,6 +4566,9 @@ static void bnxt_dev_recover(void *arg)
 		goto err_start;
 
 	PMD_DRV_LOG(INFO, "Recovered from FW reset\n");
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_RECOVERED,
+				     NULL);
 	return;
 err_start:
 	bnxt_dev_stop_op(bp->eth_dev);
@@ -4573,6 +4576,9 @@ static void bnxt_dev_recover(void *arg)
 	bp->flags |= BNXT_FLAG_FATAL_ERROR;
 	bnxt_uninit_resources(bp, false);
 	PMD_DRV_LOG(ERR, "Failed to recover from FW reset\n");
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_INTR_RMV,
+				     NULL);
 }
 
 void bnxt_dev_reset_and_resume(void *arg)
@@ -4708,6 +4714,9 @@ static void bnxt_check_fw_health(void *arg)
 	bp->flags |= BNXT_FLAG_FW_RESET;
 
 	PMD_DRV_LOG(ERR, "Detected FW dead condition\n");
+	rte_eth_dev_callback_process(bp->eth_dev,
+				     RTE_ETH_EVENT_ERR_RECOVERING,
+				     NULL);
 
 	if (bnxt_is_master_func(bp))
 		wait_msec = info->master_func_wait_period;
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [dpdk-dev] [PATCH v6 3/3] app/testpmd: handle device recovery event
  2020-10-09  3:48 ` [dpdk-dev] [PATCH v6 0/3] librte_ethdev: error recovery support Kalesh A P
  2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events Kalesh A P
  2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
@ 2020-10-09  3:48   ` Kalesh A P
  2 siblings, 0 replies; 48+ messages in thread
From: Kalesh A P @ 2020-10-09  3:48 UTC (permalink / raw)
  To: dev

From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>

Added code to handle error recovery events in testpmd.
This is an indication from the PMD that it is undergoing
an error recovery and recovered from the error condition.

Updated 20.11 release notes as well.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Ajit Kumar Khaparde <ajit.khaparde@broadcom.com>
---
 app/test-pmd/parameters.c              |  8 ++++++--
 app/test-pmd/testpmd.c                 |  6 +++++-
 doc/guides/rel_notes/release_20_11.rst | 10 ++++++++++
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 1ead595..560f9ba 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -192,9 +192,9 @@ usage(char* progname)
 	printf("  --no-rmv-interrupt: disable device removal interrupt.\n");
 	printf("  --bitrate-stats=N: set the logical core N to perform "
 		"bit-rate calculation.\n");
-	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --print-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovered|all>: "
 	       "enable print of designated event or all of them.\n");
-	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|all>: "
+	printf("  --mask-event <unknown|intr_lsc|queue_state|intr_reset|vf_mbox|macsec|intr_rmv|flow_aged|err_recovering|recovered|all>: "
 	       "disable print of designated event or all of them.\n");
 	printf("  --flow-isolate-all: "
 	       "requests flow API isolated mode on all ports at initialization time.\n");
@@ -556,6 +556,10 @@ parse_event_printing_config(const char *optarg, int enable)
 		mask = UINT32_C(1) << RTE_ETH_EVENT_DESTROY;
 	else if (!strcmp(optarg, "flow_aged"))
 		mask = UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED;
+	else if (!strcmp(optarg, "err_recovering"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING;
+	else if (!strcmp(optarg, "recovered"))
+		mask = UINT32_C(1) << RTE_ETH_EVENT_RECOVERED;
 	else if (!strcmp(optarg, "all"))
 		mask = ~UINT32_C(0);
 	else {
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index fe6450c..80ae3fa 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -380,6 +380,8 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_ERR_RECOVERING] = "device error under recovery",
+	[RTE_ETH_EVENT_RECOVERED] = "device recovered",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -394,7 +396,9 @@ uint32_t event_print_mask = (UINT32_C(1) << RTE_ETH_EVENT_UNKNOWN) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_IPSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_MACSEC) |
 			    (UINT32_C(1) << RTE_ETH_EVENT_INTR_RMV) |
-			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED);
+			    (UINT32_C(1) << RTE_ETH_EVENT_FLOW_AGED) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_ERR_RECOVERING) |
+			    (UINT32_C(1) << RTE_ETH_EVENT_RECOVERED);
 /*
  * Decide if all memory are locked for performance.
  */
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 4bcf220..f732ff6 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -78,11 +78,21 @@ New Features
     ``--portmask=N``
     where N represents the hexadecimal bitmask of ports used.
 
+* **Added error recovery support.**
+
+  Added error recovery support to detect and recover from device errors including:
+
+  * Added new event: ``RTE_ETH_EVENT_ERR_RECOVERING`` for the driver to report
+    that the port is recovering from an error.
+  * Added new event: ``RTE_ETH_EVENT_RECOVERED`` for the driver to report
+    that the port has recovered from an error.
+
 * **Updated Broadcom bnxt driver.**
 
   Updated the Broadcom bnxt driver with new features and improvements, including:
 
   * Added support for 200G PAM4 link speed.
+  * Added support to handle device recovery events.
 
 
 Removed Items
-- 
2.10.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events
  2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events Kalesh A P
@ 2020-10-11 21:29     ` Thomas Monjalon
  2020-10-12  8:09       ` Andrew Rybchenko
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Monjalon @ 2020-10-11 21:29 UTC (permalink / raw)
  To: Kalesh A P; +Cc: dev, ferruh.yigit, arybchenko

09/10/2020 05:48, Kalesh A P:
> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> 
> Adding support for device reset and recovery events in the
> rte_eth_event framework. FW error and FW reset conditions would be
> managed internally by PMD without needing application intervention.
> In such cases, PMD would need reset/recovery events to notify application
> that PMD is undergoing a reset.
> 
> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
> Reviewed-by: Asaf Penso <asafp@nvidia.com>

The ethdev maintainers are not Cc'ed.
Please use the option --cc-cmd devtools/get-maintainer.sh


> +Error recovery support
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +When the PMD detects a FW reset or error condition, it will try to recover
> +from the error without needing the application intervention. In such cases,
> +PMD would need events to notify the application that it is undergoing
> +an error recovery.
> +
> +The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
> +application that PMD detected a FW reset or FW error condition. PMD will
> +try to recover from the error by itself. Data path will be halted and
> +control path operations would fail during the recovery period.
> +
> +The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the application
> +that the it has recovered from the error condition. Control path and data path
> +are up now. Since the device undergone a reset, flow rules offloaded prior to
> +the reset will be lost and the application has to recreate the rules again.
> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
> index 9759f13..9b4b015 100644
> --- a/lib/librte_ethdev/rte_ethdev.h
> +++ b/lib/librte_ethdev/rte_ethdev.h
> @@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>  	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> +	RTE_ETH_EVENT_ERR_RECOVERING,
> +			/**< port recovering from an error
> +			 *
> +			 * PMD detected a FW reset or error condition.
> +			 * PMD will try to recover from the error.
> +			 * Data path will be halted and Control path operations
> +			 * would fail at this time.
> +			 */

Does it mean the application has nothing to do when receiving this event?
I think the app should stop polling at least.

> +	RTE_ETH_EVENT_RECOVERED,
> +			/**< port recovered from an error
> +			 *
> +			 * PMD has recovered from the error condition.
> +			 * Control path and Data path are up now.
> +			 * Since the device undergone a reset, flow rules
> +			 * offloaded prior to the reset will be lost and
> +			 * the application has to recreate the rules again.
> +			 */

Please be more precise.
Should the app re-configure the port, setup the queues, start the port?




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events
  2020-10-11 21:29     ` Thomas Monjalon
@ 2020-10-12  8:09       ` Andrew Rybchenko
  2021-02-18 15:32         ` Ferruh Yigit
  0 siblings, 1 reply; 48+ messages in thread
From: Andrew Rybchenko @ 2020-10-12  8:09 UTC (permalink / raw)
  To: Thomas Monjalon, Kalesh A P; +Cc: dev, ferruh.yigit

On 10/12/20 12:29 AM, Thomas Monjalon wrote:
> 09/10/2020 05:48, Kalesh A P:
>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>
>> Adding support for device reset and recovery events in the
>> rte_eth_event framework. FW error and FW reset conditions would be
>> managed internally by PMD without needing application intervention.
>> In such cases, PMD would need reset/recovery events to notify application
>> that PMD is undergoing a reset.
>>
>> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
>> Reviewed-by: Asaf Penso <asafp@nvidia.com>
> 
> The ethdev maintainers are not Cc'ed.
> Please use the option --cc-cmd devtools/get-maintainer.sh
> 
> 
>> +Error recovery support
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +When the PMD detects a FW reset or error condition, it will try to recover
>> +from the error without needing the application intervention. In such cases,
>> +PMD would need events to notify the application that it is undergoing
>> +an error recovery.
>> +
>> +The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
>> +application that PMD detected a FW reset or FW error condition. PMD will
>> +try to recover from the error by itself. Data path will be halted and
>> +control path operations would fail during the recovery period.
>> +
>> +The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the application
>> +that the it has recovered from the error condition. Control path and data path
>> +are up now. Since the device undergone a reset, flow rules offloaded prior to
>> +the reset will be lost and the application has to recreate the rules again.

What should be done if the state is not recoverable?

>> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
>> index 9759f13..9b4b015 100644
>> --- a/lib/librte_ethdev/rte_ethdev.h
>> +++ b/lib/librte_ethdev/rte_ethdev.h
>> @@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
>>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>>  	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
>> +	RTE_ETH_EVENT_ERR_RECOVERING,
>> +			/**< port recovering from an error
>> +			 *
>> +			 * PMD detected a FW reset or error condition.
>> +			 * PMD will try to recover from the error.
>> +			 * Data path will be halted and Control path operations
>> +			 * would fail at this time.
>> +			 */
> 
> Does it mean the application has nothing to do when receiving this event?
> I think the app should stop polling at least.
> 
>> +	RTE_ETH_EVENT_RECOVERED,
>> +			/**< port recovered from an error
>> +			 *
>> +			 * PMD has recovered from the error condition.
>> +			 * Control path and Data path are up now.
>> +			 * Since the device undergone a reset, flow rules
>> +			 * offloaded prior to the reset will be lost and
>> +			 * the application has to recreate the rules again.
>> +			 */
> 
> Please be more precise.
> Should the app re-configure the port, setup the queues, start the port?
> 
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events
  2020-10-12  8:09       ` Andrew Rybchenko
@ 2021-02-18 15:32         ` Ferruh Yigit
  0 siblings, 0 replies; 48+ messages in thread
From: Ferruh Yigit @ 2021-02-18 15:32 UTC (permalink / raw)
  To: Andrew Rybchenko, Thomas Monjalon, Kalesh A P
  Cc: dev, Ajit Khaparde, Ori Kam, Asaf Penso

On 10/12/2020 9:09 AM, Andrew Rybchenko wrote:
> On 10/12/20 12:29 AM, Thomas Monjalon wrote:
>> 09/10/2020 05:48, Kalesh A P:
>>> From: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>>
>>> Adding support for device reset and recovery events in the
>>> rte_eth_event framework. FW error and FW reset conditions would be
>>> managed internally by PMD without needing application intervention.
>>> In such cases, PMD would need reset/recovery events to notify application
>>> that PMD is undergoing a reset.
>>>
>>> Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
>>> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
>>> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
>>> Reviewed-by: Asaf Penso <asafp@nvidia.com>
>>
>> The ethdev maintainers are not Cc'ed.
>> Please use the option --cc-cmd devtools/get-maintainer.sh
>>
>>
>>> +Error recovery support
>>> +~~~~~~~~~~~~~~~~~~~~~~
>>> +
>>> +When the PMD detects a FW reset or error condition, it will try to recover
>>> +from the error without needing the application intervention. In such cases,
>>> +PMD would need events to notify the application that it is undergoing
>>> +an error recovery.
>>> +
>>> +The PMD will trigger RTE_ETH_EVENT_ERR_RECOVERING event to notify the
>>> +application that PMD detected a FW reset or FW error condition. PMD will
>>> +try to recover from the error by itself. Data path will be halted and
>>> +control path operations would fail during the recovery period.
>>> +
>>> +The PMD will trigger RTE_ETH_EVENT_RECOVERED event to notify the application
>>> +that the it has recovered from the error condition. Control path and data path
>>> +are up now. Since the device undergone a reset, flow rules offloaded prior to
>>> +the reset will be lost and the application has to recreate the rules again.
> 
> What should be done if the state is not recoverable?
> 
>>> diff --git a/lib/librte_ethdev/rte_ethdev.h b/lib/librte_ethdev/rte_ethdev.h
>>> index 9759f13..9b4b015 100644
>>> --- a/lib/librte_ethdev/rte_ethdev.h
>>> +++ b/lib/librte_ethdev/rte_ethdev.h
>>> @@ -3207,6 +3207,23 @@ enum rte_eth_event_type {
>>>   	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>>>   	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>>>   	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
>>> +	RTE_ETH_EVENT_ERR_RECOVERING,
>>> +			/**< port recovering from an error
>>> +			 *
>>> +			 * PMD detected a FW reset or error condition.
>>> +			 * PMD will try to recover from the error.
>>> +			 * Data path will be halted and Control path operations
>>> +			 * would fail at this time.
>>> +			 */
>>
>> Does it mean the application has nothing to do when receiving this event?
>> I think the app should stop polling at least.
>>
>>> +	RTE_ETH_EVENT_RECOVERED,
>>> +			/**< port recovered from an error
>>> +			 *
>>> +			 * PMD has recovered from the error condition.
>>> +			 * Control path and Data path are up now.
>>> +			 * Since the device undergone a reset, flow rules
>>> +			 * offloaded prior to the reset will be lost and
>>> +			 * the application has to recreate the rules again.
>>> +			 */
>>
>> Please be more precise.
>> Should the app re-configure the port, setup the queues, start the port?
>>
>>
> 

Hi Kalesh Anakkur,

The mechanics of notifying the application looks good, but the concerns seems 
more about what application should do with this information.

PMD notifies the application on the FW/HW reset and pushes some 
tasks/responsibilities to the application, but for this to be useful, these 
tasks should be clear to application.

Think yourself in a situation that you are developing an application and you 
received these events from a device that you don't know its internals, what will 
you do?

Both Thomas and Andrew put cases that needs more clarification for application. 
Can you please send a new version with those clarifications?


Thanks,
ferruh

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2021-02-18 15:32 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-22 10:16 [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Kalesh A P
2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 1/3] librte_ethdev: support device recovery event Kalesh A P
2020-03-11 13:20   ` Thomas Monjalon
2020-03-12  3:31     ` Kalesh Anakkur Purayil
2020-03-12  7:29       ` Thomas Monjalon
2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 2/3] net/bnxt: notify applications about device reset Kalesh A P
2020-01-22 10:16 ` [dpdk-dev] [RFC PATCH 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-03-11 13:19 ` [dpdk-dev] [RFC PATCH 0/3] librte_ethdev: error recovery support Thomas Monjalon
2020-03-12  3:25   ` Kalesh Anakkur Purayil
2020-03-12  7:34     ` Thomas Monjalon
2020-07-03 16:12       ` Ferruh Yigit
2020-09-30  7:03 ` [dpdk-dev] [RFC V2 " Kalesh A P
2020-09-30  7:03   ` [dpdk-dev] [RFC V2 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30  7:03   ` [dpdk-dev] [RFC V2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30  7:03   ` [dpdk-dev] [RFC V2 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-09-30  7:07 ` [dpdk-dev] [RFC PATCH v2 0/3] librte_ethdev: error recovery support Kalesh A P
2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30  7:07   ` [dpdk-dev] [RFC PATCH v2 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-09-30  7:12 ` [dpdk-dev] [RFC PATCH v3 0/3] librte_ethdev: error recovery support Kalesh A P
2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30  7:50     ` Thomas Monjalon
2020-09-30  8:35       ` Kalesh Anakkur Purayil
2020-09-30  9:31         ` Thomas Monjalon
2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30  7:12   ` [dpdk-dev] [RFC PATCH v3 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-10-08 10:53     ` Asaf Penso
2020-09-30 12:33 ` [dpdk-dev] [RFC PATCH v4 0/3] librte_ethdev: error recovery support Kalesh A P
2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-09-30 12:33   ` [dpdk-dev] [RFC PATCH v4 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-10-06 17:25     ` Ophir Munk
2020-10-07  4:46       ` Kalesh Anakkur Purayil
2020-10-07  8:36         ` Ophir Munk
2020-10-07  9:37         ` Ferruh Yigit
2020-10-07 18:42           ` Ajit Khaparde
2020-10-07 16:49 ` [dpdk-dev] [PATCH v5 0/3] librte_ethdev: error recovery support Kalesh A P
2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-10-08 10:49     ` Asaf Penso
2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-10-07 16:49   ` [dpdk-dev] [PATCH v5 3/3] app/testpmd: handle device recovery event Kalesh A P
2020-10-09  3:48 ` [dpdk-dev] [PATCH v6 0/3] librte_ethdev: error recovery support Kalesh A P
2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 1/3] ethdev: support device reset and recovery events Kalesh A P
2020-10-11 21:29     ` Thomas Monjalon
2020-10-12  8:09       ` Andrew Rybchenko
2021-02-18 15:32         ` Ferruh Yigit
2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 2/3] net/bnxt: notify applications about device reset/recovery Kalesh A P
2020-10-09  3:48   ` [dpdk-dev] [PATCH v6 3/3] app/testpmd: handle device recovery event Kalesh A P

DPDK patches and discussions

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://inbox.dpdk.org/dev/0 dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dev dev/ https://inbox.dpdk.org/dev \
		dev@dpdk.org
	public-inbox-index dev

Example config snippet for mirrors.
Newsgroup available over NNTP:
	nntp://inbox.dpdk.org/inbox.dpdk.dev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git