From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <matan@mellanox.com>
Received: from EUR02-HE1-obe.outbound.protection.outlook.com
 (mail-eopbgr10067.outbound.protection.outlook.com [40.107.1.67])
 by dpdk.org (Postfix) with ESMTP id 1FC3B3250;
 Sat,  3 Feb 2018 20:42:17 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Mellanox.com;
 s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version;
 bh=2l7EeewP39nLF1JahKUD4J/qhyXoYeV7Ax6fb4XGfm0=;
 b=SbFjVAuEnQ75aPJo5lWI/Lz7SD5SRdrJVAHfLCKLk6lEbpvmV19ZIuqZnmnW6dpsBHriPvd1JbC3L1IL/XHKrtO7gvr7mEe1mZOwVk3KFuiJrvdCcZxwtWqDqPGdJ8M9dvcZHdbWK8M6ZRLvRiSgzdjNgIPyPKVoj7D3Readgn8=
Received: from AM4PR0501MB2657.eurprd05.prod.outlook.com (10.172.215.19) by
 AM4PR0501MB2244.eurprd05.prod.outlook.com (10.165.82.27) with Microsoft SMTP
 Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P256) id
 15.20.464.11; Sat, 3 Feb 2018 19:42:15 +0000
Received: from AM4PR0501MB2657.eurprd05.prod.outlook.com
 ([fe80::50a5:cd88:b3d8:763e]) by AM4PR0501MB2657.eurprd05.prod.outlook.com
 ([fe80::50a5:cd88:b3d8:763e%17]) with mapi id 15.20.0464.012; Sat, 3 Feb 2018
 19:42:15 +0000
From: Matan Azrad <matan@mellanox.com>
To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
CC: Shahaf Shuler <shahafs@mellanox.com>, Mordechay Haimovsky
 <motih@mellanox.com>, "dev@dpdk.org" <dev@dpdk.org>, "stable@dpdk.org"
 <stable@dpdk.org>
Thread-Topic: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after
 port stop
Thread-Index: AQHTmoBrLBnxYvtDvEq0X4rYUAyrUKON6+DwgAAfnYCAABjRgIADZZSAgAGCfUA=
Date: Sat, 3 Feb 2018 19:42:14 +0000
Message-ID: <AM4PR0501MB26573B5C171596A50E5E3D18D2F80@AM4PR0501MB2657.eurprd05.prod.outlook.com>
References: <1517214877-126768-1-git-send-email-motih@mellanox.com>
 <VI1PR05MB314912AF24111F75B1EC7BCBC3E50@VI1PR05MB3149.eurprd05.prod.outlook.com>
 <20180130093958.GE4256@6wind.com>
 <VI1PR05MB3149E091672CE4DB20071B9FC3E40@VI1PR05MB3149.eurprd05.prod.outlook.com>
 <20180131091513.GS4256@6wind.com>
 <AM4PR0501MB2657356B2CA46A039B96F7F6D2FB0@AM4PR0501MB2657.eurprd05.prod.outlook.com>
 <20180131104352.GT4256@6wind.com>
 <AM4PR0501MB2657F359FA9A25BEF2CBA14AD2FB0@AM4PR0501MB2657.eurprd05.prod.outlook.com>
 <20180131143157.GV4256@6wind.com>
 <AM4PR0501MB2657E999EF07D6A9C8E4E730D2FB0@AM4PR0501MB2657.eurprd05.prod.outlook.com>
 <20180202195307.GD4256@6wind.com>
In-Reply-To: <20180202195307.GD4256@6wind.com>
Accept-Language: en-US, he-IL
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
authentication-results: spf=none (sender IP is )
 smtp.mailfrom=matan@mellanox.com; 
x-originating-ip: [85.64.136.190]
x-ms-publictraffictype: Email
x-microsoft-exchange-diagnostics: 1; AM4PR0501MB2244;
 7:II6tlcV1P9trJupE32oJTPeQ9wDNUnl7/54vdAaeQrZ/hdZo7EeXk5wGH6s5hnDZnrWQ1C9H2D/Zlh3JUX3teC2PDeJncHdNEuCU1d0pbana6BuQbgS5/0+4CAe6MIlAa7zwsRfJEHZ3g/mWhUZLbQEBEsx78hANeKKFceqtFI66MfrDsQIOmROpW5OxEtCGeGK4WkiwUXYBTz1AISNGnYQTwq/dpis4mKu+YuShbeleBxyIDNpoWQqMHHbeaxPa
x-ms-exchange-antispam-srfa-diagnostics: SSOS;
x-ms-office365-filtering-ht: Tenant
x-ms-office365-filtering-correlation-id: ef594774-ec5c-4d84-eed2-08d56b3e3a1e
x-microsoft-antispam: UriScan:; BCL:0; PCL:0;
 RULEID:(7020095)(4652020)(48565401081)(4534165)(4627221)(201703031133081)(201702281549075)(5600026)(4604075)(3008032)(2017052603307)(7153060)(7193020);
 SRVR:AM4PR0501MB2244; 
x-ms-traffictypediagnostic: AM4PR0501MB2244:
x-ld-processed: a652971c-7d2e-4d9b-a6a4-d149256f461b,ExtAddr
x-microsoft-antispam-prvs: <AM4PR0501MB2244BE66B5FAD3FF29589294D2F80@AM4PR0501MB2244.eurprd05.prod.outlook.com>
x-exchange-antispam-report-test: UriScan:(278428928389397);
x-exchange-antispam-report-cfa-test: BCL:0; PCL:0;
 RULEID:(6040501)(2401047)(5005006)(8121501046)(93006095)(93001095)(3002001)(3231101)(2400082)(944501161)(10201501046)(6055026)(6041288)(201703131423095)(201702281528075)(20161123555045)(201703061421075)(201703061406153)(20161123560045)(20161123564045)(20161123562045)(20161123558120)(6072148)(201708071742011);
 SRVR:AM4PR0501MB2244; BCL:0; PCL:0; RULEID:; SRVR:AM4PR0501MB2244; 
x-forefront-prvs: 05724A8921
x-forefront-antispam-report: SFV:NSPM;
 SFS:(10009020)(376002)(39860400002)(396003)(346002)(366004)(39380400002)(52314003)(55674003)(189003)(199004)(106356001)(4326008)(25786009)(6506007)(186003)(59450400001)(2906002)(2900100001)(8936002)(81166006)(81156014)(8676002)(7696005)(97736004)(54906003)(86362001)(99286004)(6116002)(3846002)(102836004)(26005)(229853002)(93886005)(66066001)(316002)(3660700001)(33656002)(5250100002)(305945005)(74316002)(3280700002)(68736007)(105586002)(76176011)(14454004)(9686003)(6436002)(55016002)(7736002)(478600001)(6916009)(2950100002)(6246003)(5660300001)(53936002);
 DIR:OUT; SFP:1101; SCL:1; SRVR:AM4PR0501MB2244;
 H:AM4PR0501MB2657.eurprd05.prod.outlook.com; FPR:; SPF:None; PTR:InfoNoRecords;
 MX:1; A:1; LANG:en; 
received-spf: None (protection.outlook.com: mellanox.com does not designate
 permitted sender hosts)
x-microsoft-antispam-message-info: pNiFLntDql9POdm5FkkUPGgDNzAObzlzYe4L2ndcla0JvRn0Oq/hriU6rCYAij2vqH9KixprrA3h1lz72soIPA==
spamdiagnosticoutput: 1:99
spamdiagnosticmetadata: NSPM
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-OriginatorOrg: Mellanox.com
X-MS-Exchange-CrossTenant-Network-Message-Id: ef594774-ec5c-4d84-eed2-08d56b3e3a1e
X-MS-Exchange-CrossTenant-originalarrivaltime: 03 Feb 2018 19:42:14.8480 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: a652971c-7d2e-4d9b-a6a4-d149256f461b
X-MS-Exchange-Transport-CrossTenantHeadersStamped: AM4PR0501MB2244
Subject: Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after
 port stop
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Feb 2018 19:42:17 -0000

Hi Adrien

> From: Adrien Mazarguil , Sent: Friday, February 2, 2018 9:53 PM
> Hi Matan,
>=20
> On Wed, Jan 31, 2018 at 05:07:56PM +0000, Matan Azrad wrote:
> > Hi Adrien
> >
> > From: Adrien Mazarguil , Sent: Wednesday, January 31, 2018 4:32 PM
> > > Hi Matan,
> > >
> > > On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote:
> > > > Hi Adrien
> <snip>
> > > > I don't know what any application does but for me it is a mistake
> > > > to stop all event processes in dev_stop(), Maybe for other
> > > > application
> > > maintainers too.
> > >
> > > Just like you, I don't know either what all the applications ever
> > > written for DPDK expect out of dev_stop(). What's for sure is that
> > > currently, LSC/RMV don't occcur afterward, the same way these events
> > > do not occur before dev_start().
> >
> > Why not? RMV event can occur any time after probe.
>=20
> LSC as well (keep in mind this patch modifies the behavior for both event=
s).
> RMV events may also occur before application has a chance to register a
> handler for it, in which case this approach fails to solve the problem it=
's
> supposed to solve. Mitigate all you want, the application still can't rel=
y on that
> event only.

Again, the application should register to event when it want to be notified=
 about it by callback and unregister to it when it doesn't want to.
So, if application still didn't has chance to register to it, its ok not to=
 get notification about it.
Obviously, Application can check both RMV and LCS statuses after the first =
registration.
=20

> > > Any application possibly relying on this fact will break. In such a
> > > situation, a conservative approach is better.
> >
> > If an application should fail to get event in stopped state it may fail=
 in the
> previous code too:
> > The interrupt run from host thread so the next race may occur:
> > dev_start() : master thread.
> > Context switch.
> > RMV interrupt started to run callbacks: host thread.
> > Context switch.
> > dev_stop(): master thread.
> > Start reconfiguration: master thread.
> > Context switch.
> > Callback running.
> >
> > So, the only thing which can disable callback running after dev_stop() =
is
> callback unregistration before it.
>=20
> After dev_stop() returns, new events cannot be triggered by the PMD which
> is what matters.

No, from application point of view it can happen even if the PMD will stop =
to trigger it from dev_stop() (as the old code did).
This is exactly what the above scenario says.
=20
 >Obviously a callback that already started to run before that
> will eventually have to complete. What's your point?

It can even start after dev_stop() in spite of the PMD will stop to trigger=
 the event, read the above scenario again.

So, my point is:
If application is not safe to get notification after dev_stop() it may fail=
 in both old and new versions.
So every PMD which stops to trigger event from dev_stop() because of your r=
easons makes a mistake and fails for its purpose.=20

=20
> There's a race only if an application performs multiple simultaneous cont=
rol
> operations on the underlying device, but this has always been unsafe (not
> only during RMV) because there are no locks, it's documented as such.

So, it is probably safe.

> > > > > Setting up RMV/LSC callbacks is not the only configuration an
> > > > > application usually performs before calling dev_start(). Think
> > > > > about setting up flow rules, MAC addresses, VLANs, and so on,
> > > > > this on multiple ports before starting them up all at once.
> > > > > Previously it could be done in an unspecified order, now they
> > > > > have to take special care
> > > for RMV/LSC.
> > > >
> > > > Or maybe there callbacks code are already safe for it.
> > > > Or they manages the unregister\register calls in the right places.
> > >
> > > That's my point, these "maybes" don't argue in favor of changing thin=
gs.
> >
> > What I'm saying is that callbacks should be safe or not registered in t=
he right
> time.
>=20
> I understand that, though it's not a valid counter argument :)

With the above scenario it is :)

> > > > > Many devops are only safe when called while a device is stopped.
> > > > > It's even documented in rte_ethdev.h.
> > > > >
> > > >
> > > > And?
> > >
> > > ...And applications therefore often do all their configuration in an
> > > unspecified order while a port is stopped as a measure of safety. No
> > > extra care is taken for RMV/LSC. This uncertainty can be addressed
> > > by not modifying the current behavior.
> >
> > Or they expect to get interrupt and the corner case will come later if =
we will
> not change it.
>=20
> Look, we're throwing opposite use cases at each other in order to make a
> point, and I don't see an end to this since we're both stubborn. Let's th=
us
> assume applications use a bit of both.
>=20
> Now we're left with a problem, before this patch neither use cases were
> broken.

No, again, The above scenario shows it can be broken even before this patch=
.
=20
> Now it's applied, mine is broken so let's agree something needs to
> be done. Either all affected applications need to be updated, or we can
> simply revert this and properly fix fail-safe instead.

I think no, applications which will fail because of this patch may fail bef=
ore this patch.
So, probably applications should be safe for this patch.

> <snip>
> > > > So, at least for RMV event, we need the notification also in stoppe=
d
> state.
> > >
> > > You sent the rte_eth_dev_is_removed() series. You're aware that PMDs
> > > implementing this call benefit from an automatic is_removed() check
> > > on all remaining devops whenever some error occur.
> > > In short, an application will get notified simply by getting
> > > dev_start() (or any other callback) return -EIO and not being able to=
 use
> the device.
> >
> > Yes, but between dev_stop to dev_start may not be any ethdev API callin=
g.
>=20
> So what, if an application is not using the device, why does it need to k=
now
> it's been removed?

Data path will not check synchronously the removal status - this is must co=
me by asynchronic  notification.

 >If it's that important, why can't it run its own periodic
> rte_eth_dev_is_removed() probe?

Because ETHDEV have event for it, why to do similar mechanism again?

> > > PMDs that do not implement is_removed() (or in addition to it) could
> > > also artificially trigger a RMV event after dev_start() is called.
> > > As long as the PMD remains quiet while the device is stopped, it's fi=
ne.
> >
> > How can the PMD do it after dev_start()? Initiate alarm in dev start fu=
nction
> to do it later And entering into race again?
>=20
> What race? All the PMD needs to to is trigger an event by registering one
> with immediate effect, it won't make any difference to the application. I=
f it
> performs racy control operations from the handler, it's never been a PMD'=
s
> problem.

See the race scenario above, it should be similar.

> Anyway I just provided this idea as a counter argument, it's not really
> needed; relying on rte_eth_dev_is_removed() is the safest approach in my
> opinion.

Not always.
We need them both to be really hotplug aware from all sides.

> > I think it is not worth the effort and this patch is doing the right th=
ing(also
> some other PMDs) .
>=20
> Well, it's probably too late to revert this patch for 18.02 since one wou=
ld also
> have to come up with the proper fix for fail-safe, however that doesn't m=
ean
> breaking things and expecting applications to deal with it because it's n=
ever
> been properly documented is OK either.

Nothing breaking for my opinion.

> I'm post-NACKing this patch, it will have to be reverted and a fix submit=
ted
> for the upcoming 18.02 stable branch. In the meantime, it's not a fix for
> mlx4 and as such must not be backported.

I can't agree with you about it now.
But you know, You are the maintainer :) do whatever you want.

Anyway, I think we are all agree that all PMDs should do the same thing and=
 documentation somewhere must be added.
My opinion, as you probably know, is to continue to trigger events even aft=
er dev_stop().

Thanks,
Matan.=20
> --
> Adrien Mazarguil
> 6WIND