From: Gaëtan Rivet
To: Long Li
Cc: dev@dpdk.org, Long Li, stable@dpdk.org
Date: Fri, 9 Oct 2020 18:20:03 +0200
Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] net/failsafe: check correct error code while handling sub-device add
Message-ID: <20201009162003.5ucroctwjpwhv64f@u256.net>
In-Reply-To: <20201005094215.u4kt64ycbk35kbeg@u256.net>
References: <1601683308-18738-1-git-send-email-longli@linuxonhyperv.com> <20201005094215.u4kt64ycbk35kbeg@u256.net>

On 05/10/20 11:42 +0200, Gaëtan Rivet wrote:
> Hi,
>
> On 02/10/20 17:01 -0700, Long Li wrote:
> > From: Long Li
> >
> > When adding a sub-device, it's possible that the sub-device is configured
> > successfully but later fails to start. This error should not be masked.
>
> Some of those errors are meant to be masked: -EIO, when the device is
> marked as removed at the ethdev level (see eth_err() in rte_ethdev.c:819).
>
> > The driver needs to check the error status to prevent endless loop of
> > trying to start the sub-device.
>
> If the ethdev layer error is due to the device being removed, and
> failsafe loops on trying to sync the eth device to its own state, then
> an RMV event should have been emitted but wasn't or it was missed by
> failsafe.
>
> If the ethdev layer error is *not* due to the device being removed, the
> error should be != -EIO, and sdev->remove should not be set, so fs_err()
> should not mask it and it should be seen by the app.
>
> Can you provide the following details:
>
> * What is the return code of rte_eth_dev_start() that is masked in your
> start loop?
>
> * Is the device marked as removed in failsafe?
>
> * Is the device marked as removed in ethdev?
>
> * Was there an RMV event generated for the device? Whether yes or no,
> is it correct?
>
> Thanks,
>

Hello Li,

I've found the previous mail thread [1] where you described how you got
this error.

In your description, you say that you try to unplug then quickly replug
the device, before any event is processed? If that's the case, it looks
like a clear race condition, and an issue of missing the removal event
of the device.

I would not say yet that the bug is in failsafe; it could be in ethdev.
Can you please check whether the device removal event was properly
generated in rte_ethdev? Failsafe (and any other hotplug support layer,
actually) depends on it, so it should be checked first.

Thanks,

[1]: http://mails.dpdk.org/archives/dev/2020-September/182977.html

-- 
Gaëtan
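
For reference, the masking rules discussed in the quoted reply can be
summarized with a small standalone model. This is only a sketch of the
behaviour as described in this thread, not the DPDK source: the struct
and the model_eth_err() / model_fs_err() names are hypothetical stand-ins
for eth_err() and fs_err(), and the two flags stand in for the "removed"
state at the ethdev and failsafe levels.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a failsafe sub-device; not the DPDK struct. */
struct model_sdev {
	bool removed_in_ethdev;   /* ethdev layer considers the port removed */
	bool removed_in_failsafe; /* failsafe flagged it (models sdev->remove) */
};

/*
 * Models the ethdev-level translation described above: a failure on a
 * device marked as removed is reported as -EIO, anything else is
 * passed through unchanged.
 */
static int
model_eth_err(const struct model_sdev *sdev, int ret)
{
	if (ret == 0)
		return 0;
	if (sdev->removed_in_ethdev)
		return -EIO;
	return ret;
}

/*
 * Models the failsafe-level masking described above: the error is
 * hidden from the application only when it is -EIO or when failsafe
 * already marked the sub-device as removed. Any other error is
 * expected to reach the application.
 */
static int
model_fs_err(const struct model_sdev *sdev, int err)
{
	if (sdev->removed_in_failsafe || err == -EIO)
		return 0;
	return err;
}
```

With this model, a start failure such as -EINVAL on a device that is not
marked as removed at either level passes through both helpers unchanged.
A masked start failure would therefore suggest that one of the two flags
is set, which is what the questions above try to establish.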