DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] Question about hardware error handling policy
@ 2021-07-22 13:50 fengchengwen
  2021-07-22 15:46 ` Thomas Monjalon
  0 siblings, 1 reply; 8+ messages in thread
From: fengchengwen @ 2021-07-22 13:50 UTC (permalink / raw)
  To: Thomas Monjalon, Ferruh Yigit, dev

Hi, all

    I notice ethdev support dev_reset ops, which could be used to recover from
errors, and only 13+ drivers support this function.
    And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
drivers support it (most of them are VF).

    This provides users with two ways to handle hardware errors:
    a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
    b. application detect errors (the detection method is unclear), and call
    reset ops to recover.

    According to the design of this API, error handling is assigned to the
application, and the driver is only responsible for reporting events. This
simplifies the driver design (for example, the driver does not need to maintain
mutex locks).

    As we know, many modern NICs come with firmware, have PCIE interfaces,
support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
most drivers.

    Question 1: what do we think of these errors(particularly firmware/PF)? Do
we think that the probability is very low and that there is no need to deal with
them?
    Question 2: I prefer to put error handling in the application layer, because
doing it in the driver can make the driver complex, but there is no app to
register the INTR_RESET event handler. I think we can build a standard handler
in testpmd, What do you think?

Thanks

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] Question about hardware error handling policy
  2021-07-22 13:50 [dpdk-dev] Question about hardware error handling policy fengchengwen
@ 2021-07-22 15:46 ` Thomas Monjalon
  2021-07-23  2:18   ` fengchengwen
  2021-07-23 12:33   ` Ferruh Yigit
  0 siblings, 2 replies; 8+ messages in thread
From: Thomas Monjalon @ 2021-07-22 15:46 UTC (permalink / raw)
  To: fengchengwen; +Cc: Ferruh Yigit, dev

22/07/2021 15:50, fengchengwen:
> Hi, all
> 
>     I notice ethdev support dev_reset ops, which could be used to recover from
> errors, and only 13+ drivers support this function.
>     And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
> drivers support it (most of them are VF).
> 
>     This provides users with two ways to handle hardware errors:
>     a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
>     b. application detect errors (the detection method is unclear), and call
>     reset ops to recover.
> 
>     According to the design of this API, error handling is assigned to the
> application, and the driver is only responsible for reporting events. This
> simplifies the driver design (for example, the driver does not need to maintain
> mutex locks).
> 
>     As we know, many modern NICs come with firmware, have PCIE interfaces,
> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
> most drivers.
> 
>     Question 1: what do we think of these errors(particularly firmware/PF)? Do
> we think that the probability is very low and that there is no need to deal with
> them?

Even rare errors must be managed.

>     Question 2: I prefer to put error handling in the application layer, because
> doing it in the driver can make the driver complex, but there is no app to
> register the INTR_RESET event handler. I think we can build a standard handler
> in testpmd, What do you think?

Absolutely. As any ethdev API, it must be tested with testpmd.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] Question about hardware error handling policy
  2021-07-22 15:46 ` Thomas Monjalon
@ 2021-07-23  2:18   ` fengchengwen
  2021-07-25 15:12     ` Matan Azrad
  2021-07-23 12:33   ` Ferruh Yigit
  1 sibling, 1 reply; 8+ messages in thread
From: fengchengwen @ 2021-07-23  2:18 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Ferruh Yigit, dev, beilei.xing, Matan Azrad

On 2021/7/22 23:46, Thomas Monjalon wrote:
> 22/07/2021 15:50, fengchengwen:
>> Hi, all
>>
>>     I notice ethdev support dev_reset ops, which could be used to recover from
>> errors, and only 13+ drivers support this function.
>>     And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
>> drivers support it (most of them are VF).
>>
>>     This provides users with two ways to handle hardware errors:
>>     a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
>>     b. application detect errors (the detection method is unclear), and call
>>     reset ops to recover.
>>
>>     According to the design of this API, error handling is assigned to the
>> application, and the driver is only responsible for reporting events. This
>> simplifies the driver design (for example, the driver does not need to maintain
>> mutex locks).
>>
>>     As we know, many modern NICs come with firmware, have PCIE interfaces,
>> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
>> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
>> most drivers.
>>
>>     Question 1: what do we think of these errors(particularly firmware/PF)? Do
>> we think that the probability is very low and that there is no need to deal with
>> them?
> 
> Even rare errors must be managed.

Because intel and mlx NIC are widely used, I look at i40e/mlx5 driver code, and found:
1) i40e PF driver, it only show logs when detect global reset and other error:
	if (icr0 & I40E_PFINT_ICR0_GRST_MASK)
		PMD_DRV_LOG(INFO, "ICR0: global reset requested");
	if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK)
		PMD_DRV_LOG(INFO, "ICR0: PCI exception activated");
	if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK)
		PMD_DRV_LOG(INFO, "ICR0: a change in the storm control state");
   @Beilei Why not report RESET_EVENT in these cases ? or these error are very rarely
   so just report it. And also, the application may still send or recv packet, These
   resets, if not handled correctly, do not cause the hardware/driver to hang ?

2) mlx5 PF driver, I notice there is a mlx5_dev_interrupt_device_fatal which will
remove the device.
   @Matan Why not report RESET_EVENT in these cases ? so the application can be
recovered quickly.

> 
>>     Question 2: I prefer to put error handling in the application layer, because
>> doing it in the driver can make the driver complex, but there is no app to
>> register the INTR_RESET event handler. I think we can build a standard handler
>> in testpmd, What do you think?
> 
> Absolutely. As any ethdev API, it must be tested with testpmd.
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] Question about hardware error handling policy
  2021-07-22 15:46 ` Thomas Monjalon
  2021-07-23  2:18   ` fengchengwen
@ 2021-07-23 12:33   ` Ferruh Yigit
  2021-07-23 12:51     ` Thomas Monjalon
  2021-07-23 13:04     ` Andrew Rybchenko
  1 sibling, 2 replies; 8+ messages in thread
From: Ferruh Yigit @ 2021-07-23 12:33 UTC (permalink / raw)
  To: Thomas Monjalon, fengchengwen; +Cc: dev, Andrew Rybchenko

On 7/22/2021 4:46 PM, Thomas Monjalon wrote:
> 22/07/2021 15:50, fengchengwen:
>> Hi, all
>>
>>     I notice ethdev support dev_reset ops, which could be used to recover from
>> errors, and only 13+ drivers support this function.

'rte_eth_dev_reset()' can be used to reset device config to defaults, not have
to be for error recovering.

>>     And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
>> drivers support it (most of them are VF).
>>
>>     This provides users with two ways to handle hardware errors:
>>     a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
>>     b. application detect errors (the detection method is unclear), and call
>>     reset ops to recover.
>>
>>     According to the design of this API, error handling is assigned to the
>> application, and the driver is only responsible for reporting events. This
>> simplifies the driver design (for example, the driver does not need to maintain
>> mutex locks).
>>
>>     As we know, many modern NICs come with firmware, have PCIE interfaces,
>> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
>> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
>> most drivers.
>>
>>     Question 1: what do we think of these errors(particularly firmware/PF)? Do
>> we think that the probability is very low and that there is no need to deal with
>> them?
> 
> Even rare errors must be managed.
> 

+1

>>     Question 2: I prefer to put error handling in the application layer, because
>> doing it in the driver can make the driver complex, but there is no app to
>> register the INTR_RESET event handler. I think we can build a standard handler
>> in testpmd, What do you think?
> 
> Absolutely. As any ethdev API, it must be tested with testpmd.
> 

Testpmd registers for RESET event, but when event received all it does is print
a log, so there is not logic behind it.

If the intention is to add a error handling logic into testpmd, my concern is it
being too complex or too device specific.


And if there is something to cleanup, or recover etc in application level, it
makes sense application to receive the event and act on it. But if the
reset/recover can be handled in the PMD, if possible transparently, I think that
is better choice.

Another thing is I am not sure if what the applications should do on the reset
event clear or same for all PMDs, which is not good.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] Question about hardware error handling policy
  2021-07-23 12:33   ` Ferruh Yigit
@ 2021-07-23 12:51     ` Thomas Monjalon
  2021-07-23 13:04     ` Andrew Rybchenko
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Monjalon @ 2021-07-23 12:51 UTC (permalink / raw)
  To: fengchengwen, Ferruh Yigit; +Cc: dev, Andrew Rybchenko

23/07/2021 14:33, Ferruh Yigit:
> On 7/22/2021 4:46 PM, Thomas Monjalon wrote:
> > 22/07/2021 15:50, fengchengwen:
> >> Hi, all
> >>
> >>     I notice ethdev support dev_reset ops, which could be used to recover from
> >> errors, and only 13+ drivers support this function.
> 
> 'rte_eth_dev_reset()' can be used to reset device config to defaults, not have
> to be for error recovering.
> 
> >>     And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
> >> drivers support it (most of them are VF).
> >>
> >>     This provides users with two ways to handle hardware errors:
> >>     a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
> >>     b. application detect errors (the detection method is unclear), and call
> >>     reset ops to recover.
> >>
> >>     According to the design of this API, error handling is assigned to the
> >> application, and the driver is only responsible for reporting events. This
> >> simplifies the driver design (for example, the driver does not need to maintain
> >> mutex locks).
> >>
> >>     As we know, many modern NICs come with firmware, have PCIE interfaces,
> >> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
> >> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
> >> most drivers.
> >>
> >>     Question 1: what do we think of these errors(particularly firmware/PF)? Do
> >> we think that the probability is very low and that there is no need to deal with
> >> them?
> > 
> > Even rare errors must be managed.
> > 
> 
> +1
> 
> >>     Question 2: I prefer to put error handling in the application layer, because
> >> doing it in the driver can make the driver complex, but there is no app to
> >> register the INTR_RESET event handler. I think we can build a standard handler
> >> in testpmd, What do you think?
> > 
> > Absolutely. As any ethdev API, it must be tested with testpmd.
> > 
> 
> Testpmd registers for RESET event, but when event received all it does is print
> a log, so there is not logic behind it.
> 
> If the intention is to add a error handling logic into testpmd, my concern is it
> being too complex or too device specific.

It shows a problem in the API.
We don't have a clear generic recovering process.

> And if there is something to cleanup, or recover etc in application level, it
> makes sense application to receive the event and act on it. But if the
> reset/recover can be handled in the PMD, if possible transparently, I think that
> is better choice.
> 
> Another thing is I am not sure if what the applications should do on the reset
> event clear or same for all PMDs, which is not good.

Indeed we should improve this area,
and implement a logic in testpmd.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] Question about hardware error handling policy
  2021-07-23 12:33   ` Ferruh Yigit
  2021-07-23 12:51     ` Thomas Monjalon
@ 2021-07-23 13:04     ` Andrew Rybchenko
  1 sibling, 0 replies; 8+ messages in thread
From: Andrew Rybchenko @ 2021-07-23 13:04 UTC (permalink / raw)
  To: Ferruh Yigit, Thomas Monjalon, fengchengwen; +Cc: dev

On 7/23/21 3:33 PM, Ferruh Yigit wrote:
> On 7/22/2021 4:46 PM, Thomas Monjalon wrote:
>> 22/07/2021 15:50, fengchengwen:
>>> Hi, all
>>>
>>>      I notice ethdev support dev_reset ops, which could be used to recover from
>>> errors, and only 13+ drivers support this function.
> 
> 'rte_eth_dev_reset()' can be used to reset device config to defaults, not have
> to be for error recovering.
> 
>>>      And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and only 6
>>> drivers support it (most of them are VF).
>>>
>>>      This provides users with two ways to handle hardware errors:
>>>      a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset ops.
>>>      b. application detect errors (the detection method is unclear), and call
>>>      reset ops to recover.
>>>
>>>      According to the design of this API, error handling is assigned to the
>>> application, and the driver is only responsible for reporting events. This
>>> simplifies the driver design (for example, the driver does not need to maintain
>>> mutex locks).
>>>
>>>      As we know, many modern NICs come with firmware, have PCIE interfaces,
>>> support SR-IOV, the hardware errors can have: firmware reboot/PF reset/
>>> VF reset/FLR, but these errors(particularly firmware/PF) are not addressed in
>>> most drivers.
>>>
>>>      Question 1: what do we think of these errors(particularly firmware/PF)? Do
>>> we think that the probability is very low and that there is no need to deal with
>>> them?
>>
>> Even rare errors must be managed.
>>
> 
> +1

+1

>>>      Question 2: I prefer to put error handling in the application layer, because
>>> doing it in the driver can make the driver complex, but there is no app to
>>> register the INTR_RESET event handler. I think we can build a standard handler
>>> in testpmd, What do you think?
>>
>> Absolutely. As any ethdev API, it must be tested with testpmd.
>>
> 
> Testpmd registers for RESET event, but when event received all it does is print
> a log, so there is not logic behind it.
> 
> If the intention is to add a error handling logic into testpmd, my concern is it
> being too complex or too device specific.

Error recovery must not be device specific. Otherwise, every application
needs the specific and will be hard to port across different drivers.

> And if there is something to cleanup, or recover etc in application level, it
> makes sense application to receive the event and act on it. But if the
> reset/recover can be handled in the PMD, if possible transparently, I think that
> is better choice.

Application should be notified to stop datapath at least. IMHO it would
be better if application controls the recovery using easy and well
defined algorithm:
  - register to be notified about necessity to do the recovery
  - receive event
  - stop datapath
  - do reset
  - restore configuration, start
  - start datapath

> Another thing is I am not sure if what the applications should do on the reset
> event clear or same for all PMDs, which is not good.
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] Question about hardware error handling policy
  2021-07-23  2:18   ` fengchengwen
@ 2021-07-25 15:12     ` Matan Azrad
  2021-07-26  6:21       ` fengchengwen
  0 siblings, 1 reply; 8+ messages in thread
From: Matan Azrad @ 2021-07-25 15:12 UTC (permalink / raw)
  To: fengchengwen, NBU-Contact-Thomas Monjalon; +Cc: Ferruh Yigit, dev, beilei.xing

Hi

From: fengchengwen:
> On 2021/7/22 23:46, Thomas Monjalon wrote:
> > 22/07/2021 15:50, fengchengwen:
> >> Hi, all
> >>
> >>     I notice ethdev support dev_reset ops, which could be used to
> >> recover from errors, and only 13+ drivers support this function.
> >>     And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and
> >> only 6 drivers support it (most of them are VF).
> >>
> >>     This provides users with two ways to handle hardware errors:
> >>     a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset
> ops.
> >>     b. application detect errors (the detection method is unclear), and call
> >>     reset ops to recover.
> >>
> >>     According to the design of this API, error handling is assigned
> >> to the application, and the driver is only responsible for reporting
> >> events. This simplifies the driver design (for example, the driver
> >> does not need to maintain mutex locks).
> >>
> >>     As we know, many modern NICs come with firmware, have PCIE
> >> interfaces, support SR-IOV, the hardware errors can have: firmware
> >> reboot/PF reset/ VF reset/FLR, but these errors(particularly
> >> firmware/PF) are not addressed in most drivers.
> >>
> >>     Question 1: what do we think of these errors(particularly
> >> firmware/PF)? Do we think that the probability is very low and that
> >> there is no need to deal with them?
> >
> > Even rare errors must be managed.
> 
> Because intel and mlx NIC are widely used, I look at i40e/mlx5 driver code,
> and found:
> 1) i40e PF driver, it only show logs when detect global reset and other error:
>         if (icr0 & I40E_PFINT_ICR0_GRST_MASK)
>                 PMD_DRV_LOG(INFO, "ICR0: global reset requested");
>         if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK)
>                 PMD_DRV_LOG(INFO, "ICR0: PCI exception activated");
>         if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK)
>                 PMD_DRV_LOG(INFO, "ICR0: a change in the storm control state");
>    @Beilei Why not report RESET_EVENT in these cases ? or these error are
> very rarely
>    so just report it. And also, the application may still send or recv packet,
> These
>    resets, if not handled correctly, do not cause the hardware/driver to hang ?
> 
> 2) mlx5 PF driver, I notice there is a mlx5_dev_interrupt_device_fatal which
> will remove the device.
>    @Matan Why not report RESET_EVENT in these cases ? so the application
> can be recovered quickly.

RTE_ETH_EVENT_INTR_RMV is reported by the driver to notify the app that the device was physically plugged out from the PCI slot.
So, when the app sees this event, it should free all the SW resources of this device(call port close to the PMD by the way).

RTE_ETH_EVENT_INTR_RESET,	/**< reset interrupt event, sent to VF on PF reset */
Looks like VF-PF communication, this is not our case of plugged out which is more fatal.

Matan


> >
> >>     Question 2: I prefer to put error handling in the application
> >> layer, because doing it in the driver can make the driver complex,
> >> but there is no app to register the INTR_RESET event handler. I think
> >> we can build a standard handler in testpmd, What do you think?
> >
> > Absolutely. As any ethdev API, it must be tested with testpmd.
> >
> >
> > .
> >

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] Question about hardware error handling policy
  2021-07-25 15:12     ` Matan Azrad
@ 2021-07-26  6:21       ` fengchengwen
  0 siblings, 0 replies; 8+ messages in thread
From: fengchengwen @ 2021-07-26  6:21 UTC (permalink / raw)
  To: Matan Azrad, NBU-Contact-Thomas Monjalon; +Cc: Ferruh Yigit, dev, beilei.xing

Hi

On 2021/7/25 23:12, Matan Azrad wrote:
> Hi
> 
> From: fengchengwen:
>> On 2021/7/22 23:46, Thomas Monjalon wrote:
>>> 22/07/2021 15:50, fengchengwen:
>>>> Hi, all
>>>>
>>>>     I notice ethdev support dev_reset ops, which could be used to
>>>> recover from errors, and only 13+ drivers support this function.
>>>>     And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and
>>>> only 6 drivers support it (most of them are VF).
>>>>
>>>>     This provides users with two ways to handle hardware errors:
>>>>     a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset
>> ops.
>>>>     b. application detect errors (the detection method is unclear), and call
>>>>     reset ops to recover.
>>>>
>>>>     According to the design of this API, error handling is assigned
>>>> to the application, and the driver is only responsible for reporting
>>>> events. This simplifies the driver design (for example, the driver
>>>> does not need to maintain mutex locks).
>>>>
>>>>     As we know, many modern NICs come with firmware, have PCIE
>>>> interfaces, support SR-IOV, the hardware errors can have: firmware
>>>> reboot/PF reset/ VF reset/FLR, but these errors(particularly
>>>> firmware/PF) are not addressed in most drivers.
>>>>
>>>>     Question 1: what do we think of these errors(particularly
>>>> firmware/PF)? Do we think that the probability is very low and that
>>>> there is no need to deal with them?
>>>
>>> Even rare errors must be managed.
>>
>> Because intel and mlx NIC are widely used, I look at i40e/mlx5 driver code,
>> and found:
>> 1) i40e PF driver, it only show logs when detect global reset and other error:
>>         if (icr0 & I40E_PFINT_ICR0_GRST_MASK)
>>                 PMD_DRV_LOG(INFO, "ICR0: global reset requested");
>>         if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK)
>>                 PMD_DRV_LOG(INFO, "ICR0: PCI exception activated");
>>         if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK)
>>                 PMD_DRV_LOG(INFO, "ICR0: a change in the storm control state");
>>    @Beilei Why not report RESET_EVENT in these cases ? or these error are
>> very rarely
>>    so just report it. And also, the application may still send or recv packet,
>> These
>>    resets, if not handled correctly, do not cause the hardware/driver to hang ?
>>
>> 2) mlx5 PF driver, I notice there is a mlx5_dev_interrupt_device_fatal which
>> will remove the device.
>>    @Matan Why not report RESET_EVENT in these cases ? so the application
>> can be recovered quickly.
> 
> RTE_ETH_EVENT_INTR_RMV is reported by the driver to notify the app that the device was physically plugged out from the PCI slot.
> So, when the app sees this event, it should free all the SW resources of this device(call port close to the PMD by the way).
> 
> RTE_ETH_EVENT_INTR_RESET,	/**< reset interrupt event, sent to VF on PF reset */
> Looks like VF-PF communication, this is not our case of plugged out which is more fatal.

I think it can be changed so that the PF can also be used.

> 
> Matan
> 
> 
>>>
>>>>     Question 2: I prefer to put error handling in the application
>>>> layer, because doing it in the driver can make the driver complex,
>>>> but there is no app to register the INTR_RESET event handler. I think
>>>> we can build a standard handler in testpmd, What do you think?
>>>
>>> Absolutely. As any ethdev API, it must be tested with testpmd.
>>>
>>>
>>> .
>>>

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-07-26  6:21 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-22 13:50 [dpdk-dev] Question about hardware error handling policy fengchengwen
2021-07-22 15:46 ` Thomas Monjalon
2021-07-23  2:18   ` fengchengwen
2021-07-25 15:12     ` Matan Azrad
2021-07-26  6:21       ` fengchengwen
2021-07-23 12:33   ` Ferruh Yigit
2021-07-23 12:51     ` Thomas Monjalon
2021-07-23 13:04     ` Andrew Rybchenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).