DPDK patches and discussions
From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
To: Ferruh Yigit <ferruh.yigit@intel.com>,
	Thomas Monjalon <thomas@monjalon.net>,
	fengchengwen <fengchengwen@huawei.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] Question about hardware error handling policy
Date: Fri, 23 Jul 2021 16:04:09 +0300
Message-ID: <e4d5b798-7303-2d52-6e1c-d8b5b1f20f92@oktetlabs.ru>
In-Reply-To: <6e220d0b-5683-ee12-bdab-1ef78d19ebdc@intel.com>

On 7/23/21 3:33 PM, Ferruh Yigit wrote:
> On 7/22/2021 4:46 PM, Thomas Monjalon wrote:
>> 22/07/2021 15:50, fengchengwen:
>>> Hi, all
>>>
>>>      I notice that ethdev supports the dev_reset op, which can be used to recover
>>> from errors, and that only 13+ drivers support this function.
> 
> 'rte_eth_dev_reset()' can be used to reset device config to defaults; it does not
> have to be for error recovery.
> 
>>>      There is also an event for reset, RTE_ETH_EVENT_INTR_RESET, and only 6
>>> drivers support it (most of them are VF).
>>>
>>>      This provides users with two ways to handle hardware errors:
>>>      a. the driver reports RTE_ETH_EVENT_INTR_RESET, and the application does the reset.
>>>      b. the application detects errors (the detection method is unclear) and calls
>>>      the reset op to recover.
>>>
>>>      According to the design of this API, error handling is assigned to the
>>> application, and the driver is only responsible for reporting events. This
>>> simplifies the driver design (for example, the driver does not need to maintain
>>> mutex locks).
>>>
>>>      As we know, many modern NICs come with firmware, have PCIe interfaces, and
>>> support SR-IOV. The hardware errors can include firmware reboot, PF reset,
>>> VF reset and FLR, but these errors (particularly firmware/PF) are not addressed
>>> in most drivers.
>>>
>>>      Question 1: what do we think of these errors (particularly firmware/PF)? Do
>>> we think that the probability is very low and that there is no need to deal with
>>> them?
>>
>> Even rare errors must be managed.
>>
> 
> +1

+1

>>>      Question 2: I prefer to put error handling in the application layer, because
>>> doing it in the driver can make the driver complex, but there is no app that
>>> registers an INTR_RESET event handler. I think we can build a standard handler
>>> in testpmd. What do you think?
>>
>> Absolutely. Like any ethdev API, it must be tested with testpmd.
>>
> 
> Testpmd registers for the RESET event, but when the event is received all it does
> is print a log, so there is no logic behind it.
> 
> If the intention is to add error handling logic to testpmd, my concern is it
> being too complex or too device specific.

Error recovery must not be device specific. Otherwise, every application
needs device-specific handling and will be hard to port across different
drivers.

> And if there is something to clean up, recover etc. at the application level, it
> makes sense for the application to receive the event and act on it. But if the
> reset/recovery can be handled in the PMD, if possible transparently, I think that
> is the better choice.

The application should at least be notified so that it can stop the
datapath. IMHO it would be better if the application controls the recovery
using a simple and well-defined algorithm:
  - register to be notified about necessity to do the recovery
  - receive event
  - stop datapath
  - do reset
  - restore configuration, start
  - start datapath
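
The steps above could be sketched roughly as follows. This is only an
illustration and does not link against DPDK: struct recovery_ops,
struct port_recovery, on_reset_event(), recover_if_needed() and the demo_*
hooks are hypothetical names, and each hook stands in for application code
that would wrap the real ethdev calls named in the comments. It also assumes
that the reset should not run inside the event callback itself, so the
callback only sets a flag and a control thread does the rest.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application hooks, one per step of the algorithm. */
struct recovery_ops {
    int (*stop_datapath)(void *ctx);     /* make Rx/Tx lcores stop polling */
    int (*reset_port)(void *ctx);        /* would wrap rte_eth_dev_reset() */
    int (*restore_and_start)(void *ctx); /* reconfigure port/queues, rte_eth_dev_start() */
    int (*start_datapath)(void *ctx);    /* let Rx/Tx lcores poll again */
};

struct port_recovery {
    atomic_bool reset_pending;           /* set by the event callback */
    const struct recovery_ops *ops;
    void *ctx;
};

/* Body of the handler an application would register for
 * RTE_ETH_EVENT_INTR_RESET: it only records that recovery is needed. */
static void on_reset_event(struct port_recovery *pr)
{
    atomic_store(&pr->reset_pending, true);
}

/* Called periodically from a control thread. Returns 0 if nothing was
 * pending or recovery succeeded; otherwise returns the failing hook's
 * error and leaves the request pending. */
static int recover_if_needed(struct port_recovery *pr)
{
    int ret;

    assert(pr != NULL && pr->ops != NULL);
    if (!atomic_load(&pr->reset_pending))
        return 0;
    if ((ret = pr->ops->stop_datapath(pr->ctx)) != 0)
        return ret;
    if ((ret = pr->ops->reset_port(pr->ctx)) != 0)
        return ret;
    if ((ret = pr->ops->restore_and_start(pr->ctx)) != 0)
        return ret;
    if ((ret = pr->ops->start_datapath(pr->ctx)) != 0)
        return ret;
    atomic_store(&pr->reset_pending, false);
    return 0;
}

/* Demo hooks that only record the order in which they are called. */
static char demo_trace[8];
static size_t demo_len;

static int demo_step(char c)
{
    if (demo_len < sizeof(demo_trace) - 1)
        demo_trace[demo_len++] = c;
    return 0;
}

static int demo_stop(void *ctx)    { (void)ctx; return demo_step('S'); }
static int demo_reset(void *ctx)   { (void)ctx; return demo_step('R'); }
static int demo_restore(void *ctx) { (void)ctx; return demo_step('C'); }
static int demo_start(void *ctx)   { (void)ctx; return demo_step('D'); }

static const struct recovery_ops demo_ops = {
    demo_stop, demo_reset, demo_restore, demo_start,
};
```

Nothing in the sequence itself is device specific: the same
recover_if_needed() loop would serve any port, and only the hooks know how
the application's configuration is restored.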

> Another thing is I am not sure whether what the applications should do on the
> reset event is clear or the same for all PMDs, which is not good.
> 


