From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by inbox.dpdk.org (Postfix) with ESMTP id A2ABDA04B6; Mon, 12 Oct 2020 16:22:28 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id 35C511D71E; Mon, 12 Oct 2020 16:22:27 +0200 (CEST) Received: from relay12.mail.gandi.net (relay12.mail.gandi.net [217.70.178.232]) by dpdk.org (Postfix) with ESMTP id 55BEE1D712; Mon, 12 Oct 2020 16:22:24 +0200 (CEST) Received: from u256.net (lfbn-poi-1-843-59.w86-254.abo.wanadoo.fr [86.254.165.59]) (Authenticated sender: grive@u256.net) by relay12.mail.gandi.net (Postfix) with ESMTPSA id 46C1020000C; Mon, 12 Oct 2020 14:22:21 +0000 (UTC) Date: Mon, 12 Oct 2020 16:22:17 +0200 From: =?utf-8?Q?Ga=C3=ABtan?= Rivet To: Long Li Cc: Long Li , "dev@dpdk.org" , "stable@dpdk.org" Message-ID: <20201012142217.4s3l6wrt2qyiezcu@u256.net> References: <1601683308-18738-1-git-send-email-longli@linuxonhyperv.com> <20201005094215.u4kt64ycbk35kbeg@u256.net> <20201009162003.5ucroctwjpwhv64f@u256.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On 09/10/20 20:30 +0000, Long Li wrote: > >Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while > >handling sub-device add > > > >On 05/10/20 11:42 +0200, Gaëtan Rivet wrote: > >> Hi, > >> > >> On 02/10/20 17:01 -0700, Long Li wrote: > >> > From: Long Li > >> > > >> > When adding a sub-device, it's possible that the sub-device is > >> > configured successfully but later fails to start. This error should not be > >masked. > >> > >> Some of those errors are meant to be masked: -EIO, when the device is > >> marked as removed at the ethdev level (see eth_err() in rte_ethdev.c:819). > >> > >> > The driver needs to check the error status to prevent endless loop > >> > of trying to start the sub-device. > >> > >> If the ethdev layer error is due to the device being removed, and > >> failsafe loops on trying to sync the eth device to its own state, then > >> an RMV event should have been emitted but wasn't or it was missed by > >> failsafe. > >> > >> If the ethdev layer error is *not* due to the device being removed, > >> the error should be != -EIO, and sdev->remove should not be set, so > >> fs_err() should not mask it and it should be seen by the app. > >> > >> Can you provide the following details: > >> > >> * What is the return code of rte_eth_dev_start() that is masked in your > >> start loop? > >> > >> * Is the device marked as removed in failsafe? > >> > >> * Is the device marked as removed in ethdev? > >> > >> * Was there an RMV event generated for the device? Whether yes or no, > >> is it correct? > >> > >> Thanks, > >> > > > >Hello Li, > > > >I've found the previous mail thread [1] where you described how you got this > >error. In your description, you say that you try unplug then quick replug, > >before any event is processed? > > Hi Gaëtan, > > Sorry for getting back late. I had trouble with my email. > > I think the issue is that: when the failsafe driver tries to start the sub device, it may fail. The failsafe driver needs to deal with the failure. Here is another log that I captured: > > net_failsafe: Starting sub_device 0 > net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory > net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory > net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory > net_failsafe: Starting sub_device 0 > net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory > net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory > net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory > > It's called from fs_dev_start(). The MLX4 can't start, it's in a state being removed in ethdev. But this error is being masked by fs_err(): > > static inline int > fs_err(struct sub_device *sdev, int err) > { > /* A device removal shouldn't be reported as an error. */ > if (sdev->remove == 1 || err == -EIO) > return rte_errno = 0; > return err; > } > > So fs_dev_start() always return a success, returned to failsafe_eth_dev_state_sync(), which is repeated called from the alarm fs_hotplug_alarm(). This goes to an endless loop. > > I believe failsafe_eth_rmv_event_callback() has been called prior to this. It sets sdev->remove = 1, but it didn't help with exiting the loop. fs_err() always returns 0, and fs_dev_start() is called from fs_hotplug_alarm(), that keeps re-alarming itself. Ok, the issue here is that even with sdev->remove == 1, the device is not actually removed, but it should. I think we discussed it with Stephen off-list not too long ago actually. In his case the actual bug was in vdev_netvsc, but there was still this issue in failsafe. It is happening there: void failsafe_dev_remove(struct rte_eth_dev *dev) { struct sub_device *sdev; uint8_t i; FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) /* ^^^^^^^^^^^^ HERE is the bug */ if (sdev->remove && fs_rxtx_clean(sdev)) { if (fs_lock(dev, 1) != 0) return; fs_dev_stats_save(sdev); fs_dev_remove(sdev); fs_unlock(dev, 1); } } A device with the remove flag will be cleaned up only if it is already in an active state. This is not correct, it should be removed in all cases -- even if it stayed stuck in PROBED state. Here is a fix I propose for this issue: http://mails.dpdk.org/archives/dev/2020-October/185921.html Could you please test it and verify it solves your issue? > > Why do we check (sdev->remove == 1 || err == -EIO) in fs_err()? If sdev->remove is set, this will always return 0, regardless of err. > if sdev->remove == 1, there should not even be a call to sub-device start ops actually. The device is flagged for removal, next thing to happen is to remove it, even if the device reappeared on the bus in-between. The failsafe needs to consume the removed flag (close the device then remove it), then it will be ready to detect the new device and will probe it back from the beginning. > It's hard to reproduce with gdb. Hopefully I can get a good trace in gdb. > > Thanks, > > Long > > > > >If that's the case, it seems a clear race condition, and an issue of missing the > >removal event of the device. I would not say yet that the bug is in failsafe, but > >it could be in ethdev. > > > >Can you please check whether the device removal event was properly > >generated in rte_ethdev? Failsafe (and any other hotplug support layer > >actually) will depend on it so it should be first checked to work. > > > >Thanks, > > > >[1]: > >https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails. > >dpdk.org%2Farchives%2Fdev%2F2020- > >September%2F182977.html&data=02%7C01%7Clongli%40microsoft.com > >%7C3e044201096c4f48508108d86c6f319d%7C72f988bf86f141af91ab2d7cd011d > >b47%7C1%7C0%7C637378572568404734&sdata=DAUVxBJQtJx8mUSWlPI > >DtenhpMmdiMPpIczmxOdrCqE%3D&reserved=0 > > > >-- > >Gaëtan Regards, -- Gaëtan