From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from EUR02-HE1-obe.outbound.protection.outlook.com (mail-eopbgr10067.outbound.protection.outlook.com [40.107.1.67]) by dpdk.org (Postfix) with ESMTP id 1FC3B3250; Sat, 3 Feb 2018 20:42:17 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Mellanox.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version; bh=2l7EeewP39nLF1JahKUD4J/qhyXoYeV7Ax6fb4XGfm0=; b=SbFjVAuEnQ75aPJo5lWI/Lz7SD5SRdrJVAHfLCKLk6lEbpvmV19ZIuqZnmnW6dpsBHriPvd1JbC3L1IL/XHKrtO7gvr7mEe1mZOwVk3KFuiJrvdCcZxwtWqDqPGdJ8M9dvcZHdbWK8M6ZRLvRiSgzdjNgIPyPKVoj7D3Readgn8= Received: from AM4PR0501MB2657.eurprd05.prod.outlook.com (10.172.215.19) by AM4PR0501MB2244.eurprd05.prod.outlook.com (10.165.82.27) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P256) id 15.20.464.11; Sat, 3 Feb 2018 19:42:15 +0000 Received: from AM4PR0501MB2657.eurprd05.prod.outlook.com ([fe80::50a5:cd88:b3d8:763e]) by AM4PR0501MB2657.eurprd05.prod.outlook.com ([fe80::50a5:cd88:b3d8:763e%17]) with mapi id 15.20.0464.012; Sat, 3 Feb 2018 19:42:15 +0000 From: Matan Azrad To: Adrien Mazarguil CC: Shahaf Shuler , Mordechay Haimovsky , "dev@dpdk.org" , "stable@dpdk.org" Thread-Topic: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop Thread-Index: AQHTmoBrLBnxYvtDvEq0X4rYUAyrUKON6+DwgAAfnYCAABjRgIADZZSAgAGCfUA= Date: Sat, 3 Feb 2018 19:42:14 +0000 Message-ID: References: <1517214877-126768-1-git-send-email-motih@mellanox.com> <20180130093958.GE4256@6wind.com> <20180131091513.GS4256@6wind.com> <20180131104352.GT4256@6wind.com> <20180131143157.GV4256@6wind.com> <20180202195307.GD4256@6wind.com> In-Reply-To: <20180202195307.GD4256@6wind.com> Accept-Language: en-US, he-IL Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: authentication-results: spf=none (sender IP is ) smtp.mailfrom=matan@mellanox.com; x-originating-ip: [85.64.136.190] x-ms-publictraffictype: Email x-microsoft-exchange-diagnostics: 1; AM4PR0501MB2244; 7:II6tlcV1P9trJupE32oJTPeQ9wDNUnl7/54vdAaeQrZ/hdZo7EeXk5wGH6s5hnDZnrWQ1C9H2D/Zlh3JUX3teC2PDeJncHdNEuCU1d0pbana6BuQbgS5/0+4CAe6MIlAa7zwsRfJEHZ3g/mWhUZLbQEBEsx78hANeKKFceqtFI66MfrDsQIOmROpW5OxEtCGeGK4WkiwUXYBTz1AISNGnYQTwq/dpis4mKu+YuShbeleBxyIDNpoWQqMHHbeaxPa x-ms-exchange-antispam-srfa-diagnostics: SSOS; x-ms-office365-filtering-ht: Tenant x-ms-office365-filtering-correlation-id: ef594774-ec5c-4d84-eed2-08d56b3e3a1e x-microsoft-antispam: UriScan:; BCL:0; PCL:0; RULEID:(7020095)(4652020)(48565401081)(4534165)(4627221)(201703031133081)(201702281549075)(5600026)(4604075)(3008032)(2017052603307)(7153060)(7193020); SRVR:AM4PR0501MB2244; x-ms-traffictypediagnostic: AM4PR0501MB2244: x-ld-processed: a652971c-7d2e-4d9b-a6a4-d149256f461b,ExtAddr x-microsoft-antispam-prvs: x-exchange-antispam-report-test: UriScan:(278428928389397); x-exchange-antispam-report-cfa-test: BCL:0; PCL:0; RULEID:(6040501)(2401047)(5005006)(8121501046)(93006095)(93001095)(3002001)(3231101)(2400082)(944501161)(10201501046)(6055026)(6041288)(201703131423095)(201702281528075)(20161123555045)(201703061421075)(201703061406153)(20161123560045)(20161123564045)(20161123562045)(20161123558120)(6072148)(201708071742011); SRVR:AM4PR0501MB2244; BCL:0; PCL:0; RULEID:; SRVR:AM4PR0501MB2244; x-forefront-prvs: 05724A8921 x-forefront-antispam-report: SFV:NSPM; SFS:(10009020)(376002)(39860400002)(396003)(346002)(366004)(39380400002)(52314003)(55674003)(189003)(199004)(106356001)(4326008)(25786009)(6506007)(186003)(59450400001)(2906002)(2900100001)(8936002)(81166006)(81156014)(8676002)(7696005)(97736004)(54906003)(86362001)(99286004)(6116002)(3846002)(102836004)(26005)(229853002)(93886005)(66066001)(316002)(3660700001)(33656002)(5250100002)(305945005)(74316002)(3280700002)(68736007)(105586002)(76176011)(14454004)(9686003)(6436002)(55016002)(7736002)(478600001)(6916009)(2950100002)(6246003)(5660300001)(53936002); DIR:OUT; SFP:1101; SCL:1; SRVR:AM4PR0501MB2244; H:AM4PR0501MB2657.eurprd05.prod.outlook.com; FPR:; SPF:None; PTR:InfoNoRecords; MX:1; A:1; LANG:en; received-spf: None (protection.outlook.com: mellanox.com does not designate permitted sender hosts) x-microsoft-antispam-message-info: pNiFLntDql9POdm5FkkUPGgDNzAObzlzYe4L2ndcla0JvRn0Oq/hriU6rCYAij2vqH9KixprrA3h1lz72soIPA== spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: Mellanox.com X-MS-Exchange-CrossTenant-Network-Message-Id: ef594774-ec5c-4d84-eed2-08d56b3e3a1e X-MS-Exchange-CrossTenant-originalarrivaltime: 03 Feb 2018 19:42:14.8480 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: a652971c-7d2e-4d9b-a6a4-d149256f461b X-MS-Exchange-Transport-CrossTenantHeadersStamped: AM4PR0501MB2244 Subject: Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Feb 2018 19:42:17 -0000 Hi Adrien > From: Adrien Mazarguil , Sent: Friday, February 2, 2018 9:53 PM > Hi Matan, >=20 > On Wed, Jan 31, 2018 at 05:07:56PM +0000, Matan Azrad wrote: > > Hi Adrien > > > > From: Adrien Mazarguil , Sent: Wednesday, January 31, 2018 4:32 PM > > > Hi Matan, > > > > > > On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote: > > > > Hi Adrien > > > > > I don't know what any application does but for me it is a mistake > > > > to stop all event processes in dev_stop(), Maybe for other > > > > application > > > maintainers too. > > > > > > Just like you, I don't know either what all the applications ever > > > written for DPDK expect out of dev_stop(). What's for sure is that > > > currently, LSC/RMV don't occcur afterward, the same way these events > > > do not occur before dev_start(). > > > > Why not? RMV event can occur any time after probe. >=20 > LSC as well (keep in mind this patch modifies the behavior for both event= s). > RMV events may also occur before application has a chance to register a > handler for it, in which case this approach fails to solve the problem it= 's > supposed to solve. Mitigate all you want, the application still can't rel= y on that > event only. Again, the application should register to event when it want to be notified= about it by callback and unregister to it when it doesn't want to. So, if application still didn't has chance to register to it, its ok not to= get notification about it. Obviously, Application can check both RMV and LCS statuses after the first = registration. =20 > > > Any application possibly relying on this fact will break. In such a > > > situation, a conservative approach is better. > > > > If an application should fail to get event in stopped state it may fail= in the > previous code too: > > The interrupt run from host thread so the next race may occur: > > dev_start() : master thread. > > Context switch. > > RMV interrupt started to run callbacks: host thread. > > Context switch. > > dev_stop(): master thread. > > Start reconfiguration: master thread. > > Context switch. > > Callback running. > > > > So, the only thing which can disable callback running after dev_stop() = is > callback unregistration before it. >=20 > After dev_stop() returns, new events cannot be triggered by the PMD which > is what matters. No, from application point of view it can happen even if the PMD will stop = to trigger it from dev_stop() (as the old code did). This is exactly what the above scenario says. =20 >Obviously a callback that already started to run before that > will eventually have to complete. What's your point? It can even start after dev_stop() in spite of the PMD will stop to trigger= the event, read the above scenario again. So, my point is: If application is not safe to get notification after dev_stop() it may fail= in both old and new versions. So every PMD which stops to trigger event from dev_stop() because of your r= easons makes a mistake and fails for its purpose.=20 =20 > There's a race only if an application performs multiple simultaneous cont= rol > operations on the underlying device, but this has always been unsafe (not > only during RMV) because there are no locks, it's documented as such. So, it is probably safe. > > > > > Setting up RMV/LSC callbacks is not the only configuration an > > > > > application usually performs before calling dev_start(). Think > > > > > about setting up flow rules, MAC addresses, VLANs, and so on, > > > > > this on multiple ports before starting them up all at once. > > > > > Previously it could be done in an unspecified order, now they > > > > > have to take special care > > > for RMV/LSC. > > > > > > > > Or maybe there callbacks code are already safe for it. > > > > Or they manages the unregister\register calls in the right places. > > > > > > That's my point, these "maybes" don't argue in favor of changing thin= gs. > > > > What I'm saying is that callbacks should be safe or not registered in t= he right > time. >=20 > I understand that, though it's not a valid counter argument :) With the above scenario it is :) > > > > > Many devops are only safe when called while a device is stopped. > > > > > It's even documented in rte_ethdev.h. > > > > > > > > > > > > > And? > > > > > > ...And applications therefore often do all their configuration in an > > > unspecified order while a port is stopped as a measure of safety. No > > > extra care is taken for RMV/LSC. This uncertainty can be addressed > > > by not modifying the current behavior. > > > > Or they expect to get interrupt and the corner case will come later if = we will > not change it. >=20 > Look, we're throwing opposite use cases at each other in order to make a > point, and I don't see an end to this since we're both stubborn. Let's th= us > assume applications use a bit of both. >=20 > Now we're left with a problem, before this patch neither use cases were > broken. No, again, The above scenario shows it can be broken even before this patch= . =20 > Now it's applied, mine is broken so let's agree something needs to > be done. Either all affected applications need to be updated, or we can > simply revert this and properly fix fail-safe instead. I think no, applications which will fail because of this patch may fail bef= ore this patch. So, probably applications should be safe for this patch. > > > > > So, at least for RMV event, we need the notification also in stoppe= d > state. > > > > > > You sent the rte_eth_dev_is_removed() series. You're aware that PMDs > > > implementing this call benefit from an automatic is_removed() check > > > on all remaining devops whenever some error occur. > > > In short, an application will get notified simply by getting > > > dev_start() (or any other callback) return -EIO and not being able to= use > the device. > > > > Yes, but between dev_stop to dev_start may not be any ethdev API callin= g. >=20 > So what, if an application is not using the device, why does it need to k= now > it's been removed? Data path will not check synchronously the removal status - this is must co= me by asynchronic notification. >If it's that important, why can't it run its own periodic > rte_eth_dev_is_removed() probe? Because ETHDEV have event for it, why to do similar mechanism again? > > > PMDs that do not implement is_removed() (or in addition to it) could > > > also artificially trigger a RMV event after dev_start() is called. > > > As long as the PMD remains quiet while the device is stopped, it's fi= ne. > > > > How can the PMD do it after dev_start()? Initiate alarm in dev start fu= nction > to do it later And entering into race again? >=20 > What race? All the PMD needs to to is trigger an event by registering one > with immediate effect, it won't make any difference to the application. I= f it > performs racy control operations from the handler, it's never been a PMD'= s > problem. See the race scenario above, it should be similar. > Anyway I just provided this idea as a counter argument, it's not really > needed; relying on rte_eth_dev_is_removed() is the safest approach in my > opinion. Not always. We need them both to be really hotplug aware from all sides. > > I think it is not worth the effort and this patch is doing the right th= ing(also > some other PMDs) . >=20 > Well, it's probably too late to revert this patch for 18.02 since one wou= ld also > have to come up with the proper fix for fail-safe, however that doesn't m= ean > breaking things and expecting applications to deal with it because it's n= ever > been properly documented is OK either. Nothing breaking for my opinion. > I'm post-NACKing this patch, it will have to be reverted and a fix submit= ted > for the upcoming 18.02 stable branch. In the meantime, it's not a fix for > mlx4 and as such must not be backported. I can't agree with you about it now. But you know, You are the maintainer :) do whatever you want. Anyway, I think we are all agree that all PMDs should do the same thing and= documentation somewhere must be added. My opinion, as you probably know, is to continue to trigger events even aft= er dev_stop(). Thanks, Matan.=20 > -- > Adrien Mazarguil > 6WIND