DPDK patches and discussions
 help / color / mirror / Atom feed
From: Long Li <longli@microsoft.com>
To: "Gaëtan Rivet" <grive@u256.net>, "Long Li" <longli@linuxonhyperv.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, "stable@dpdk.org" <stable@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add
Date: Fri, 9 Oct 2020 20:30:55 +0000	[thread overview]
Message-ID: <BN8PR21MB11556CB2041982D497B8A8CFCE080@BN8PR21MB1155.namprd21.prod.outlook.com> (raw)
In-Reply-To: <20201009162003.5ucroctwjpwhv64f@u256.net>

>Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while
>handling sub-device add
>
>On 05/10/20 11:42 +0200, Gaëtan Rivet wrote:
>> Hi,
>>
>> On 02/10/20 17:01 -0700, Long Li wrote:
>> > From: Long Li <longli@microsoft.com>
>> >
>> > When adding a sub-device, it's possible that the sub-device is
>> > configured successfully but later fails to start. This error should not be
>masked.
>>
>> Some of those errors are meant to be masked: -EIO, when the device is
>> marked as removed at the ethdev level (see eth_err() in rte_ethdev.c:819).
>>
>> > The driver needs to check the error status to prevent endless loop
>> > of trying to start the sub-device.
>>
>> If the ethdev layer error is due to the device being removed, and
>> failsafe loops on trying to sync the eth device to its own state, then
>> an RMV event should have been emitted but wasn't or it was missed by
>> failsafe.
>>
>> If the ethdev layer error is *not* due to the device being removed,
>> the error should be != -EIO, and sdev->remove should not be set, so
>> fs_err() should not mask it and it should be seen by the app.
>>
>> Can you provide the following details:
>>
>>  * What is the return code of rte_eth_dev_start() that is masked in your
>>    start loop?
>>
>>  * Is the device marked as removed in failsafe?
>>
>>  * Is the device marked as removed in ethdev?
>>
>>  * Was there an RMV event generated for the device? Whether yes or no,
>>    is it correct?
>>
>> Thanks,
>>
>
>Hello Li,
>
>I've found the previous mail thread [1] where you described how you got this
>error. In your description, you say that you try unplug then quick replug,
>before any event is processed?

Hi Gaëtan,

Sorry for getting back late. I had trouble with my email.

I think the issue is that: when the failsafe driver tries to start the sub device, it may fail. The failsafe driver needs to deal with the failure. Here is another log that I captured:

net_failsafe: Starting sub_device 0
net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory
net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory
net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory
net_failsafe: Starting sub_device 0
net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory
net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory
net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory

It's called from fs_dev_start(). The MLX4 can't start, it's in a state being removed in ethdev. But this error is being masked by fs_err():

static inline int
fs_err(struct sub_device *sdev, int err)
{
        /* A device removal shouldn't be reported as an error. */
        if (sdev->remove == 1 || err == -EIO)
                return rte_errno = 0;
        return err;
} 

So fs_dev_start() always return a success, returned to failsafe_eth_dev_state_sync(), which is repeated called from the alarm fs_hotplug_alarm(). This goes to an endless loop.

I believe failsafe_eth_rmv_event_callback() has been called prior to this. It sets sdev->remove = 1, but it didn't help with exiting the loop. fs_err() always returns 0, and fs_dev_start() is called from fs_hotplug_alarm(), that keeps re-alarming itself.

Why do we check (sdev->remove == 1 || err == -EIO) in fs_err()? If sdev->remove is set, this will always return 0, regardless of err.

It's hard to reproduce with gdb. Hopefully I can get a good trace in gdb.

Thanks,

Long

>
>If that's the case, it seems a clear race condition, and an issue of missing the
>removal event of the device. I would not say yet that the bug is in failsafe, but
>it could be in ethdev.
>
>Can you please check whether the device removal event was properly
>generated in rte_ethdev? Failsafe (and any other hotplug support layer
>actually) will depend on it so it should be first checked to work.
>
>Thanks,
>
>[1]:
>https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails.
>dpdk.org%2Farchives%2Fdev%2F2020-
>September%2F182977.html&amp;data=02%7C01%7Clongli%40microsoft.com
>%7C3e044201096c4f48508108d86c6f319d%7C72f988bf86f141af91ab2d7cd011d
>b47%7C1%7C0%7C637378572568404734&amp;sdata=DAUVxBJQtJx8mUSWlPI
>DtenhpMmdiMPpIczmxOdrCqE%3D&amp;reserved=0
>
>--
>Gaëtan

  reply	other threads:[~2020-10-09 20:31 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-03  0:01 Long Li
2020-10-05  9:42 ` Gaëtan Rivet
2020-10-09 16:20   ` Gaëtan Rivet
2020-10-09 20:30     ` Long Li [this message]
2020-10-12 14:22       ` Gaëtan Rivet
2020-10-13 17:14         ` Long Li
2020-10-15 10:11 ` Gaëtan Rivet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BN8PR21MB11556CB2041982D497B8A8CFCE080@BN8PR21MB1155.namprd21.prod.outlook.com \
    --to=longli@microsoft.com \
    --cc=dev@dpdk.org \
    --cc=grive@u256.net \
    --cc=longli@linuxonhyperv.com \
    --cc=stable@dpdk.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).