DPDK patches and discussions
 help / color / mirror / Atom feed
From: fengchengwen <fengchengwen@huawei.com>
To: Matan Azrad <matan@nvidia.com>,
	NBU-Contact-Thomas Monjalon <thomas@monjalon.net>
Cc: Ferruh Yigit <ferruh.yigit@intel.com>,
	"dev@dpdk.org" <dev@dpdk.org>,
	"beilei.xing@intel.com" <beilei.xing@intel.com>
Subject: Re: [dpdk-dev] Question about hardware error handling policy
Date: Mon, 26 Jul 2021 14:21:23 +0800	[thread overview]
Message-ID: <5c477b6c-9bdc-388b-5183-dcc2acb4e571@huawei.com> (raw)
In-Reply-To: <DM4PR12MB5389B41F923CAC9B32D5B7A0DFE79@DM4PR12MB5389.namprd12.prod.outlook.com>

Hi

On 2021/7/25 23:12, Matan Azrad wrote:
> Hi
> 
> From: fengchengwen:
>> On 2021/7/22 23:46, Thomas Monjalon wrote:
>>> 22/07/2021 15:50, fengchengwen:
>>>> Hi, all
>>>>
>>>>     I notice ethdev support dev_reset ops, which could be used to
>>>> recover from errors, and only 13+ drivers support this function.
>>>>     And also there is event for reset: RTE_ETH_EVENT_INTR_RESET, and
>>>> only 6 drivers support it (most of them are VF).
>>>>
>>>>     This provides users with two ways to handle hardware errors:
>>>>     a. driver report RTE_ETH_EVENT_INTR_RESET, and application do reset
>> ops.
>>>>     b. application detect errors (the detection method is unclear), and call
>>>>     reset ops to recover.
>>>>
>>>>     According to the design of this API, error handling is assigned
>>>> to the application, and the driver is only responsible for reporting
>>>> events. This simplifies the driver design (for example, the driver
>>>> does not need to maintain mutex locks).
>>>>
>>>>     As we know, many modern NICs come with firmware, have PCIE
>>>> interfaces, support SR-IOV, the hardware errors can have: firmware
>>>> reboot/PF reset/ VF reset/FLR, but these errors(particularly
>>>> firmware/PF) are not addressed in most drivers.
>>>>
>>>>     Question 1: what do we think of these errors(particularly
>>>> firmware/PF)? Do we think that the probability is very low and that
>>>> there is no need to deal with them?
>>>
>>> Even rare errors must be managed.
>>
>> Because intel and mlx NIC are widely used, I look at i40e/mlx5 driver code,
>> and found:
>> 1) i40e PF driver, it only show logs when detect global reset and other error:
>>         if (icr0 & I40E_PFINT_ICR0_GRST_MASK)
>>                 PMD_DRV_LOG(INFO, "ICR0: global reset requested");
>>         if (icr0 & I40E_PFINT_ICR0_PCI_EXCEPTION_MASK)
>>                 PMD_DRV_LOG(INFO, "ICR0: PCI exception activated");
>>         if (icr0 & I40E_PFINT_ICR0_STORM_DETECT_MASK)
>>                 PMD_DRV_LOG(INFO, "ICR0: a change in the storm control state");
>>    @Beilei Why not report RESET_EVENT in these cases ? or these error are
>> very rarely
>>    so just report it. And also, the application may still send or recv packet,
>> These
>>    resets, if not handled correctly, do not cause the hardware/driver to hang ?
>>
>> 2) mlx5 PF driver, I notice there is a mlx5_dev_interrupt_device_fatal which
>> will remove the device.
>>    @Matan Why not report RESET_EVENT in these cases ? so the application
>> can be recovered quickly.
> 
> RTE_ETH_EVENT_INTR_RMV is reported by the driver to notify the app that the device was physically plugged out from the PCI slot.
> So, when the app sees this event, it should free all the SW resources of this device(call port close to the PMD by the way).
> 
> RTE_ETH_EVENT_INTR_RESET,	/**< reset interrupt event, sent to VF on PF reset */
> Looks like VF-PF communication, this is not our case of plugged out which is more fatal.

I think it can be changed so that the PF can also be used.

> 
> Matan
> 
> 
>>>
>>>>     Question 2: I prefer to put error handling in the application
>>>> layer, because doing it in the driver can make the driver complex,
>>>> but there is no app to register the INTR_RESET event handler. I think
>>>> we can build a standard handler in testpmd, What do you think?
>>>
>>> Absolutely. As any ethdev API, it must be tested with testpmd.
>>>
>>>
>>> .
>>>

  reply	other threads:[~2021-07-26  6:21 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-22 13:50 fengchengwen
2021-07-22 15:46 ` Thomas Monjalon
2021-07-23  2:18   ` fengchengwen
2021-07-25 15:12     ` Matan Azrad
2021-07-26  6:21       ` fengchengwen [this message]
2021-07-23 12:33   ` Ferruh Yigit
2021-07-23 12:51     ` Thomas Monjalon
2021-07-23 13:04     ` Andrew Rybchenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5c477b6c-9bdc-388b-5183-dcc2acb4e571@huawei.com \
    --to=fengchengwen@huawei.com \
    --cc=beilei.xing@intel.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@intel.com \
    --cc=matan@nvidia.com \
    --cc=thomas@monjalon.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).