[dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add

DPDK patches and discussions
 help / color / mirror / Atom feed

* [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add
@ 2020-10-03  0:01 Long Li
  2020-10-05  9:42 ` Gaëtan Rivet
  2020-10-15 10:11 ` Gaëtan Rivet
  0 siblings, 2 replies; 7+ messages in thread
From: Long Li @ 2020-10-03  0:01 UTC (permalink / raw)
  To: Gaetan Rivet; +Cc: dev, Long Li, stable

From: Long Li <longli@microsoft.com>

When adding a sub-device, it's possible that the sub-device is configured
successfully but later fails to start. This error should not be masked.
The driver needs to check the error status to prevent endless loop of
trying to start the sub-device.

fixes (ae80146c5a1b net/failsafe: fix removed device handling)

cc: stable@dpdk.org
Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/failsafe/failsafe_private.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index 651578a..c58c0de 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -497,7 +497,7 @@ int failsafe_eth_new_event_callback(uint16_t port_id,
 fs_err(struct sub_device *sdev, int err)
 {
 	/* A device removal shouldn't be reported as an error. */
-	if (sdev->remove == 1 || err == -EIO)
+	if (sdev->remove == 1 && err == -EIO)
 		return rte_errno = 0;
 	return err;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add
  2020-10-03  0:01 [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add Long Li
@ 2020-10-05  9:42 ` Gaëtan Rivet
  2020-10-09 16:20   ` Gaëtan Rivet
  2020-10-15 10:11 ` Gaëtan Rivet
  1 sibling, 1 reply; 7+ messages in thread
From: Gaëtan Rivet @ 2020-10-05  9:42 UTC (permalink / raw)
  To: Long Li; +Cc: dev, Long Li, stable

Hi,

On 02/10/20 17:01 -0700, Long Li wrote:
> From: Long Li <longli@microsoft.com>
> 
> When adding a sub-device, it's possible that the sub-device is configured
> successfully but later fails to start. This error should not be masked.

Some of those errors are meant to be masked: -EIO, when the device is
marked as removed at the ethdev level (see eth_err() in rte_ethdev.c:819).

> The driver needs to check the error status to prevent endless loop of
> trying to start the sub-device.

If the ethdev layer error is due to the device being removed, and
failsafe loops on trying to sync the eth device to its own state, then
an RMV event should have been emitted but wasn't or it was missed by
failsafe.

If the ethdev layer error is *not* due to the device being removed, the
error should be != -EIO, and sdev->remove should not be set, so fs_err()
should not mask it and it should be seen by the app.

Can you provide the following details:

 * What is the return code of rte_eth_dev_start() that is masked in your
   start loop?

 * Is the device marked as removed in failsafe?

 * Is the device marked as removed in ethdev?

 * Was there an RMV event generated for the device? Whether yes or no,
   is it correct?

Thanks,

> 
> fixes (ae80146c5a1b net/failsafe: fix removed device handling)
> 
> cc: stable@dpdk.org
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  drivers/net/failsafe/failsafe_private.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
> index 651578a..c58c0de 100644
> --- a/drivers/net/failsafe/failsafe_private.h
> +++ b/drivers/net/failsafe/failsafe_private.h
> @@ -497,7 +497,7 @@ int failsafe_eth_new_event_callback(uint16_t port_id,
>  fs_err(struct sub_device *sdev, int err)
>  {
>  	/* A device removal shouldn't be reported as an error. */
> -	if (sdev->remove == 1 || err == -EIO)
> +	if (sdev->remove == 1 && err == -EIO)
>  		return rte_errno = 0;
>  	return err;
>  }
> -- 
> 1.8.3.1
> 

-- 
Gaëtan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add
  2020-10-05  9:42 ` Gaëtan Rivet
@ 2020-10-09 16:20   ` Gaëtan Rivet
  2020-10-09 20:30     ` Long Li
  0 siblings, 1 reply; 7+ messages in thread
From: Gaëtan Rivet @ 2020-10-09 16:20 UTC (permalink / raw)
  To: Long Li; +Cc: dev, Long Li, stable

On 05/10/20 11:42 +0200, Gaëtan Rivet wrote:
> Hi,
> 
> On 02/10/20 17:01 -0700, Long Li wrote:
> > From: Long Li <longli@microsoft.com>
> > 
> > When adding a sub-device, it's possible that the sub-device is configured
> > successfully but later fails to start. This error should not be masked.
> 
> Some of those errors are meant to be masked: -EIO, when the device is
> marked as removed at the ethdev level (see eth_err() in rte_ethdev.c:819).
> 
> > The driver needs to check the error status to prevent endless loop of
> > trying to start the sub-device.
> 
> If the ethdev layer error is due to the device being removed, and
> failsafe loops on trying to sync the eth device to its own state, then
> an RMV event should have been emitted but wasn't or it was missed by
> failsafe.
> 
> If the ethdev layer error is *not* due to the device being removed, the
> error should be != -EIO, and sdev->remove should not be set, so fs_err()
> should not mask it and it should be seen by the app.
> 
> Can you provide the following details:
> 
>  * What is the return code of rte_eth_dev_start() that is masked in your
>    start loop?
> 
>  * Is the device marked as removed in failsafe?
> 
>  * Is the device marked as removed in ethdev?
> 
>  * Was there an RMV event generated for the device? Whether yes or no,
>    is it correct?
> 
> Thanks,
> 

Hello Li,

I've found the previous mail thread [1] where you described how you got this
error. In your description, you say that you try unplug then quick
replug, before any event is processed?

If that's the case, it seems a clear race condition, and an issue of
missing the removal event of the device. I would not say yet that the
bug is in failsafe, but it could be in ethdev.

Can you please check whether the device removal event was properly
generated in rte_ethdev? Failsafe (and any other hotplug support layer
actually) will depend on it so it should be first checked to work.

Thanks,

[1]: http://mails.dpdk.org/archives/dev/2020-September/182977.html

-- 
Gaëtan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add
  2020-10-09 16:20   ` Gaëtan Rivet
@ 2020-10-09 20:30     ` Long Li
  2020-10-12 14:22       ` Gaëtan Rivet
  0 siblings, 1 reply; 7+ messages in thread
From: Long Li @ 2020-10-09 20:30 UTC (permalink / raw)
  To: Gaëtan Rivet, Long Li; +Cc: dev, stable

>Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while
>handling sub-device add
>
>On 05/10/20 11:42 +0200, Gaëtan Rivet wrote:
>> Hi,
>>
>> On 02/10/20 17:01 -0700, Long Li wrote:
>> > From: Long Li <longli@microsoft.com>
>> >
>> > When adding a sub-device, it's possible that the sub-device is
>> > configured successfully but later fails to start. This error should not be
>masked.
>>
>> Some of those errors are meant to be masked: -EIO, when the device is
>> marked as removed at the ethdev level (see eth_err() in rte_ethdev.c:819).
>>
>> > The driver needs to check the error status to prevent endless loop
>> > of trying to start the sub-device.
>>
>> If the ethdev layer error is due to the device being removed, and
>> failsafe loops on trying to sync the eth device to its own state, then
>> an RMV event should have been emitted but wasn't or it was missed by
>> failsafe.
>>
>> If the ethdev layer error is *not* due to the device being removed,
>> the error should be != -EIO, and sdev->remove should not be set, so
>> fs_err() should not mask it and it should be seen by the app.
>>
>> Can you provide the following details:
>>
>>  * What is the return code of rte_eth_dev_start() that is masked in your
>>    start loop?
>>
>>  * Is the device marked as removed in failsafe?
>>
>>  * Is the device marked as removed in ethdev?
>>
>>  * Was there an RMV event generated for the device? Whether yes or no,
>>    is it correct?
>>
>> Thanks,
>>
>
>Hello Li,
>
>I've found the previous mail thread [1] where you described how you got this
>error. In your description, you say that you try unplug then quick replug,
>before any event is processed?

Hi Gaëtan,

Sorry for getting back late. I had trouble with my email.

I think the issue is that: when the failsafe driver tries to start the sub device, it may fail. The failsafe driver needs to deal with the failure. Here is another log that I captured:

net_failsafe: Starting sub_device 0
net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory
net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory
net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory
net_failsafe: Starting sub_device 0
net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory
net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory
net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory

It's called from fs_dev_start(). The MLX4 can't start, it's in a state being removed in ethdev. But this error is being masked by fs_err():

static inline int
fs_err(struct sub_device *sdev, int err)
{
        /* A device removal shouldn't be reported as an error. */
        if (sdev->remove == 1 || err == -EIO)
                return rte_errno = 0;
        return err;
} 

So fs_dev_start() always return a success, returned to failsafe_eth_dev_state_sync(), which is repeated called from the alarm fs_hotplug_alarm(). This goes to an endless loop.

I believe failsafe_eth_rmv_event_callback() has been called prior to this. It sets sdev->remove = 1, but it didn't help with exiting the loop. fs_err() always returns 0, and fs_dev_start() is called from fs_hotplug_alarm(), that keeps re-alarming itself.

Why do we check (sdev->remove == 1 || err == -EIO) in fs_err()? If sdev->remove is set, this will always return 0, regardless of err.

It's hard to reproduce with gdb. Hopefully I can get a good trace in gdb.

Thanks,

Long

>
>If that's the case, it seems a clear race condition, and an issue of missing the
>removal event of the device. I would not say yet that the bug is in failsafe, but
>it could be in ethdev.
>
>Can you please check whether the device removal event was properly
>generated in rte_ethdev? Failsafe (and any other hotplug support layer
>actually) will depend on it so it should be first checked to work.
>
>Thanks,
>
>[1]:
>https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails.
>dpdk.org%2Farchives%2Fdev%2F2020-
>September%2F182977.html&amp;data=02%7C01%7Clongli%40microsoft.com
>%7C3e044201096c4f48508108d86c6f319d%7C72f988bf86f141af91ab2d7cd011d
>b47%7C1%7C0%7C637378572568404734&amp;sdata=DAUVxBJQtJx8mUSWlPI
>DtenhpMmdiMPpIczmxOdrCqE%3D&amp;reserved=0
>
>--
>Gaëtan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add
  2020-10-09 20:30     ` Long Li
@ 2020-10-12 14:22       ` Gaëtan Rivet
  2020-10-13 17:14         ` Long Li
  0 siblings, 1 reply; 7+ messages in thread
From: Gaëtan Rivet @ 2020-10-12 14:22 UTC (permalink / raw)
  To: Long Li; +Cc: Long Li, dev, stable

On 09/10/20 20:30 +0000, Long Li wrote:
> >Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while
> >handling sub-device add
> >
> >On 05/10/20 11:42 +0200, Gaëtan Rivet wrote:
> >> Hi,
> >>
> >> On 02/10/20 17:01 -0700, Long Li wrote:
> >> > From: Long Li <longli@microsoft.com>
> >> >
> >> > When adding a sub-device, it's possible that the sub-device is
> >> > configured successfully but later fails to start. This error should not be
> >masked.
> >>
> >> Some of those errors are meant to be masked: -EIO, when the device is
> >> marked as removed at the ethdev level (see eth_err() in rte_ethdev.c:819).
> >>
> >> > The driver needs to check the error status to prevent endless loop
> >> > of trying to start the sub-device.
> >>
> >> If the ethdev layer error is due to the device being removed, and
> >> failsafe loops on trying to sync the eth device to its own state, then
> >> an RMV event should have been emitted but wasn't or it was missed by
> >> failsafe.
> >>
> >> If the ethdev layer error is *not* due to the device being removed,
> >> the error should be != -EIO, and sdev->remove should not be set, so
> >> fs_err() should not mask it and it should be seen by the app.
> >>
> >> Can you provide the following details:
> >>
> >>  * What is the return code of rte_eth_dev_start() that is masked in your
> >>    start loop?
> >>
> >>  * Is the device marked as removed in failsafe?
> >>
> >>  * Is the device marked as removed in ethdev?
> >>
> >>  * Was there an RMV event generated for the device? Whether yes or no,
> >>    is it correct?
> >>
> >> Thanks,
> >>
> >
> >Hello Li,
> >
> >I've found the previous mail thread [1] where you described how you got this
> >error. In your description, you say that you try unplug then quick replug,
> >before any event is processed?
> 
> Hi Gaëtan,
> 
> Sorry for getting back late. I had trouble with my email.
> 
> I think the issue is that: when the failsafe driver tries to start the sub device, it may fail. The failsafe driver needs to deal with the failure. Here is another log that I captured:
> 
> net_failsafe: Starting sub_device 0
> net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory
> net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory
> net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory
> net_failsafe: Starting sub_device 0
> net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory
> net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory
> net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory
> 
> It's called from fs_dev_start(). The MLX4 can't start, it's in a state being removed in ethdev. But this error is being masked by fs_err():
> 
> static inline int
> fs_err(struct sub_device *sdev, int err)
> {
>         /* A device removal shouldn't be reported as an error. */
>         if (sdev->remove == 1 || err == -EIO)
>                 return rte_errno = 0;
>         return err;
> } 
> 
> So fs_dev_start() always return a success, returned to failsafe_eth_dev_state_sync(), which is repeated called from the alarm fs_hotplug_alarm(). This goes to an endless loop.
> 
> I believe failsafe_eth_rmv_event_callback() has been called prior to this. It sets sdev->remove = 1, but it didn't help with exiting the loop. fs_err() always returns 0, and fs_dev_start() is called from fs_hotplug_alarm(), that keeps re-alarming itself.

Ok, the issue here is that even with sdev->remove == 1, the device is
not actually removed, but it should.

I think we discussed it with Stephen off-list not too long ago actually.
In his case the actual bug was in vdev_netvsc, but there was still this
issue in failsafe. It is happening there:

   void
   failsafe_dev_remove(struct rte_eth_dev *dev)
   {
           struct sub_device *sdev;
           uint8_t i;

           FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE)
                                         /* ^^^^^^^^^^^^ HERE is the bug */
                   if (sdev->remove && fs_rxtx_clean(sdev)) {
                           if (fs_lock(dev, 1) != 0)
                                   return;
                           fs_dev_stats_save(sdev);
                           fs_dev_remove(sdev);
                           fs_unlock(dev, 1);
                   }
   }

A device with the remove flag will be cleaned up only if it is already
in an active state. This is not correct, it should be removed in all
cases -- even if it stayed stuck in PROBED state.

Here is a fix I propose for this issue:
http://mails.dpdk.org/archives/dev/2020-October/185921.html

Could you please test it and verify it solves your issue?

> 
> Why do we check (sdev->remove == 1 || err == -EIO) in fs_err()? If sdev->remove is set, this will always return 0, regardless of err.
> 

if sdev->remove == 1, there should not even be a call to sub-device
start ops actually. The device is flagged for removal, next thing to
happen is to remove it, even if the device reappeared on the bus
in-between. The failsafe needs to consume the removed flag (close the device
then remove it), then it will be ready to detect the new device and
will probe it back from the beginning.

> It's hard to reproduce with gdb. Hopefully I can get a good trace in gdb.
> 
> Thanks,
> 
> Long
> 
> >
> >If that's the case, it seems a clear race condition, and an issue of missing the
> >removal event of the device. I would not say yet that the bug is in failsafe, but
> >it could be in ethdev.
> >
> >Can you please check whether the device removal event was properly
> >generated in rte_ethdev? Failsafe (and any other hotplug support layer
> >actually) will depend on it so it should be first checked to work.
> >
> >Thanks,
> >
> >[1]:
> >https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails.
> >dpdk.org%2Farchives%2Fdev%2F2020-
> >September%2F182977.html&amp;data=02%7C01%7Clongli%40microsoft.com
> >%7C3e044201096c4f48508108d86c6f319d%7C72f988bf86f141af91ab2d7cd011d
> >b47%7C1%7C0%7C637378572568404734&amp;sdata=DAUVxBJQtJx8mUSWlPI
> >DtenhpMmdiMPpIczmxOdrCqE%3D&amp;reserved=0
> >
> >--
> >Gaëtan

Regards,
-- 
Gaëtan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add
  2020-10-12 14:22       ` Gaëtan Rivet
@ 2020-10-13 17:14         ` Long Li
  0 siblings, 0 replies; 7+ messages in thread
From: Long Li @ 2020-10-13 17:14 UTC (permalink / raw)
  To: Gaëtan Rivet; +Cc: Long Li, dev, stable

>Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while
>handling sub-device add
>
>On 09/10/20 20:30 +0000, Long Li wrote:
>> >Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error
>> >code while handling sub-device add
>> >
>> >On 05/10/20 11:42 +0200, Gaëtan Rivet wrote:
>> >> Hi,
>> >>
>> >> On 02/10/20 17:01 -0700, Long Li wrote:
>> >> > From: Long Li <longli@microsoft.com>
>> >> >
>> >> > When adding a sub-device, it's possible that the sub-device is
>> >> > configured successfully but later fails to start. This error
>> >> > should not be
>> >masked.
>> >>
>> >> Some of those errors are meant to be masked: -EIO, when the device
>> >> is marked as removed at the ethdev level (see eth_err() in
>rte_ethdev.c:819).
>> >>
>> >> > The driver needs to check the error status to prevent endless
>> >> > loop of trying to start the sub-device.
>> >>
>> >> If the ethdev layer error is due to the device being removed, and
>> >> failsafe loops on trying to sync the eth device to its own state,
>> >> then an RMV event should have been emitted but wasn't or it was
>> >> missed by failsafe.
>> >>
>> >> If the ethdev layer error is *not* due to the device being removed,
>> >> the error should be != -EIO, and sdev->remove should not be set, so
>> >> fs_err() should not mask it and it should be seen by the app.
>> >>
>> >> Can you provide the following details:
>> >>
>> >>  * What is the return code of rte_eth_dev_start() that is masked in your
>> >>    start loop?
>> >>
>> >>  * Is the device marked as removed in failsafe?
>> >>
>> >>  * Is the device marked as removed in ethdev?
>> >>
>> >>  * Was there an RMV event generated for the device? Whether yes or
>no,
>> >>    is it correct?
>> >>
>> >> Thanks,
>> >>
>> >
>> >Hello Li,
>> >
>> >I've found the previous mail thread [1] where you described how you
>> >got this error. In your description, you say that you try unplug then
>> >quick replug, before any event is processed?
>>
>> Hi Gaëtan,
>>
>> Sorry for getting back late. I had trouble with my email.
>>
>> I think the issue is that: when the failsafe driver tries to start the sub device,
>it may fail. The failsafe driver needs to deal with the failure. Here is another
>log that I captured:
>>
>> net_failsafe: Starting sub_device 0
>> net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation
>> failure: Cannot allocate memory
>> net_mlx4: cannot initialize common RSS resources (queue 0): unable to
>> create Rx queue resources: Cannot allocate memory
>> net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate
>> memory
>> net_failsafe: Starting sub_device 0
>> net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation
>> failure: Cannot allocate memory
>> net_mlx4: cannot initialize common RSS resources (queue 0): unable to
>> create Rx queue resources: Cannot allocate memory
>> net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate
>> memory
>>
>> It's called from fs_dev_start(). The MLX4 can't start, it's in a state being
>removed in ethdev. But this error is being masked by fs_err():
>>
>> static inline int
>> fs_err(struct sub_device *sdev, int err) {
>>         /* A device removal shouldn't be reported as an error. */
>>         if (sdev->remove == 1 || err == -EIO)
>>                 return rte_errno = 0;
>>         return err;
>> }
>>
>> So fs_dev_start() always return a success, returned to
>failsafe_eth_dev_state_sync(), which is repeated called from the alarm
>fs_hotplug_alarm(). This goes to an endless loop.
>>
>> I believe failsafe_eth_rmv_event_callback() has been called prior to this. It
>sets sdev->remove = 1, but it didn't help with exiting the loop. fs_err() always
>returns 0, and fs_dev_start() is called from fs_hotplug_alarm(), that keeps re-
>alarming itself.
>
>Ok, the issue here is that even with sdev->remove == 1, the device is not
>actually removed, but it should.
>
>I think we discussed it with Stephen off-list not too long ago actually.
>In his case the actual bug was in vdev_netvsc, but there was still this issue in
>failsafe. It is happening there:
>
>   void
>   failsafe_dev_remove(struct rte_eth_dev *dev)
>   {
>           struct sub_device *sdev;
>           uint8_t i;
>
>           FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE)
>                                         /* ^^^^^^^^^^^^ HERE is the bug */
>                   if (sdev->remove && fs_rxtx_clean(sdev)) {
>                           if (fs_lock(dev, 1) != 0)
>                                   return;
>                           fs_dev_stats_save(sdev);
>                           fs_dev_remove(sdev);
>                           fs_unlock(dev, 1);
>                   }
>   }
>
>A device with the remove flag will be cleaned up only if it is already in an active
>state. This is not correct, it should be removed in all cases -- even if it stayed
>stuck in PROBED state.
>
>Here is a fix I propose for this issue:
>https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails.
>dpdk.org%2Farchives%2Fdev%2F2020-
>October%2F185921.html&amp;data=04%7C01%7Clongli%40microsoft.com%7C
>0cfccf19cf3d42ee549008d86eba3d6e%7C72f988bf86f141af91ab2d7cd011db47
>%7C1%7C0%7C637381093902558861%7CUnknown%7CTWFpbGZsb3d8eyJWIjo
>iMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C300
>0&amp;sdata=j13foFEbNV7DqWxl2ahLloEZSbCO96lmPaqSfjU5KsQ%3D&amp;r
>eserved=0
>
>Could you please test it and verify it solves your issue?

Thank you!

I have been testing this patch for 24 hours and it worked well.

>
>>
>> Why do we check (sdev->remove == 1 || err == -EIO) in fs_err()? If sdev-
>>remove is set, this will always return 0, regardless of err.
>>
>
>if sdev->remove == 1, there should not even be a call to sub-device start ops
>actually. The device is flagged for removal, next thing to happen is to remove
>it, even if the device reappeared on the bus in-between. The failsafe needs to
>consume the removed flag (close the device then remove it), then it will be
>ready to detect the new device and will probe it back from the beginning.
>
>> It's hard to reproduce with gdb. Hopefully I can get a good trace in gdb.
>>
>> Thanks,
>>
>> Long
>>
>> >
>> >If that's the case, it seems a clear race condition, and an issue of
>> >missing the removal event of the device. I would not say yet that the
>> >bug is in failsafe, but it could be in ethdev.
>> >
>> >Can you please check whether the device removal event was properly
>> >generated in rte_ethdev? Failsafe (and any other hotplug support
>> >layer
>> >actually) will depend on it so it should be first checked to work.
>> >
>> >Thanks,
>> >
>> >[1]:
>>
>>https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails
>.
>> >dpdk.org%2Farchives%2Fdev%2F2020-
>>
>>September%2F182977.html&amp;data=02%7C01%7Clongli%40microsoft.co
>m
>>
>>%7C3e044201096c4f48508108d86c6f319d%7C72f988bf86f141af91ab2d7cd011
>d
>>
>>b47%7C1%7C0%7C637378572568404734&amp;sdata=DAUVxBJQtJx8mUSWlP
>I
>> >DtenhpMmdiMPpIczmxOdrCqE%3D&amp;reserved=0
>> >
>> >--
>> >Gaëtan
>
>Regards,
>--
>Gaëtan

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add
  2020-10-03  0:01 [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add Long Li
  2020-10-05  9:42 ` Gaëtan Rivet
@ 2020-10-15 10:11 ` Gaëtan Rivet
  1 sibling, 0 replies; 7+ messages in thread
From: Gaëtan Rivet @ 2020-10-15 10:11 UTC (permalink / raw)
  To: Long Li; +Cc: dev, Long Li, stable

On 02/10/20 17:01 -0700, Long Li wrote:
> From: Long Li <longli@microsoft.com>
> 
> When adding a sub-device, it's possible that the sub-device is configured
> successfully but later fails to start. This error should not be masked.
> The driver needs to check the error status to prevent endless loop of
> trying to start the sub-device.
> 
> fixes (ae80146c5a1b net/failsafe: fix removed device handling)
> 
> cc: stable@dpdk.org
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  drivers/net/failsafe/failsafe_private.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
> index 651578a..c58c0de 100644
> --- a/drivers/net/failsafe/failsafe_private.h
> +++ b/drivers/net/failsafe/failsafe_private.h
> @@ -497,7 +497,7 @@ int failsafe_eth_new_event_callback(uint16_t port_id,
>  fs_err(struct sub_device *sdev, int err)
>  {
>  	/* A device removal shouldn't be reported as an error. */
> -	if (sdev->remove == 1 || err == -EIO)
> +	if (sdev->remove == 1 && err == -EIO)
>  		return rte_errno = 0;
>  	return err;
>  }
> -- 
> 1.8.3.1
> 

This bug has been fixed by the following commit:
http://mails.dpdk.org/archives/dev/2020-October/185921.html

Thanks for validating this other fix Li, it has been
integrated.

This patch can be dropped (nack).

-- 
Gaëtan

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-10-15 10:12 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-03  0:01 [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add Long Li
2020-10-05  9:42 ` Gaëtan Rivet
2020-10-09 16:20   ` Gaëtan Rivet
2020-10-09 20:30     ` Long Li
2020-10-12 14:22       ` Gaëtan Rivet
2020-10-13 17:14         ` Long Li
2020-10-15 10:11 ` Gaëtan Rivet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).