From: Thomas Monjalon
To: "Guo, Jia"
Cc: dev@dpdk.org, Bruce Richardson, techboard@dpdk.org, "Ananyev, Konstantin", "stephen@networkplumber.org", "Yigit, Ferruh", "gaetan.rivet@6wind.com"
, "Wu, Jingjing", "motih@mellanox.com", "matan@mellanox.com", "Van Haaren, Harry", "Zhang, Qi Z", "Zhang, Helin", "shreyansh.jain@nxp.com"
Date: Thu, 14 Jun 2018 23:37:17 +0200
Message-ID: <2204635.oFaBryakOT@xps>
In-Reply-To:
References: <01BA8470C017D6468C8290E4B9C5E1E83B379B43@shsmsx102.ccr.corp.intel.com> <20180606125451.GA2960@bricha3-MOBL.ger.corp.intel.com>
Subject: Re: [dpdk-dev] [RFC] hot plug failure handle mechanism

Hi,

I am sorry, it is very hard to be sure we understand your thoughts
correctly. I like the proposal, but I want to be sure my understanding
was not biased by what I would like to read :)
So I try to reword below. Please confirm it matches your intent.

Hot unplug can happen when a hardware device is removed physically,
or when the software disables it.
In both cases, the datapath will fail.
When the unplug is detected, we must stop and close the related
instance of the driver.
The detection can be done with hotplug monitoring (like uevent)
- this is RTE_DEV_EVENT_REMOVE - or by handling the failure in the
control path or data path - this is RTE_ETH_EVENT_INTR_RMV.

Between the unplug event and its detection, we need to manage any
related failure. That's why you propose a sigbus handler which will
avoid the crash, and can be used to detect the unplug.

Please confirm this is what you thought.
If not, do you agree, or am I missing something?
I would like to be sure the sigbus handler will not hide any other
unrelated failure.
07/06/2018 04:14, Guo, Jia:
> On 6/6/2018 8:54 PM, Bruce Richardson wrote:
> > +Tech-board, as I think this should have more input at the design stage
> > ahead of any code patches being pushed.
> >
> > On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
> >> hi, bruce
> >>
> >> On 5/29/2018 7:20 PM, Bruce Richardson wrote:
> >>> On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
> >>>> The hot plug failure handling mechanism works as below:
> >>>>
> >>>> 1. Add a new bus ops "handle_hot_unplug" in the bus to handle bus
> >>>> read/write errors; it is bus-specific, and each kind of bus can
> >>>> implement its own logic.
> >>>>
> >>>> 2. Implement the PCI-bus-specific ops "pci_handle_hot_unplug"; in
> >>>> that function, based on the failure address, remap the memory
> >>>> belonging to the corresponding unplugged device.
> >>>>
> >>>> 3. Implement a new sigbus handler, and register it when device event
> >>>> monitoring starts. Once an MMIO sigbus error occurs, it triggers the
> >>>> hot plug failure handling mechanism above, so an app that is busy
> >>>> processing packets will not break or crash, and can continue with
> >>>> its clean-up, fail-safe, or other work.
> >>>>
> >>>> 4. Also introduce a solution using testpmd to show an example of the
> >>>> whole procedure, like this: device unplug -> failure handle -> stop
> >>>> forwarding -> stop port -> close port -> detach port.
> >>>
> >>> Hi Jeff,
> >>>
> >>> So if I understand this correctly, the proposal is that we need two
> >>> parallel solutions to handle safe removal of a device.
> >>>
> >>> 1. We need a solution to support unplugging of the device at the bus
> >>>    level, i.e. remove the device from the list of devices and make
> >>>    access to that device invalid.
> >>> 2. Since the removal of the device from the software lists is not
> >>>    going to be instantaneous, we need a mechanism to handle any
> >>>    accesses to the device from the data path until such time as the
> >>>    removal is complete. To support that, you propose to add a sigbus
> >>>    handler which will automatically replace any MMIO BAR mappings
> >>>    with some other memory that is OK to access - presumably zero
> >>>    memory or similar.
> >>>
> >>> Is this understanding correct?
> >>
> >> i think you are correct about that.
> >>
> >>> Point #2 seems reasonably clear to me, but for #1, presumably the
> >>> trigger to the bus needs to come from the kernel. What is planned to
> >>> be used there?
> >>
> >> about point #1, i should clarify that we will use the device event
> >> monitor mechanism to detect the hot unplug event.
> >> the monitor is enabled by the app (or fail-safe driver), and the app
> >> (fail-safe driver) registers the event callback. Once the hot unplug
> >> behavior is detected, the user's callback is triggered to let the app
> >> (fail-safe driver) know about the event and manage the process; it
> >> will call APIs to stop the device and detach it from the bus.
> >
> > Ok. If there is no failsafe driver, and the app does not set up a
> > handler, does nothing happen when we get a removal event? Will the app
> > just crash?
>
> when the device event monitor is enabled by the app, the handler is set
> up automatically; the app or fail-safe driver does not need to, and
> cannot, do it directly.
> so if an app wants to process this hot plug event, all it needs to do
> is enable the hot plug event monitor and register its own callback;
> then the app will not crash when hotplug behavior occurs.
>
> >>> You also talk about using testpmd as a reference for this, but you
> >>> don't explain how an application can be notified of a device removal,
> >>> or even why that is necessary. Since all applications should now be
> >>> using the proper macros when iterating device lists, and not just
> >>> assuming devices 0-N are valid, what changes would you see a normal
> >>> app having to make to be hotplug-safe?
> >>
> >> an app or the fail-safe driver could use these mechanisms, but at
> >> this stage i will first use testpmd as a reference.
> >> as in the reply above, testpmd should enable the device event
> >> mechanism to monitor device removal, and register a callback;
> >> the device BDF list is managed by the bus, and the hotplug failure
> >> handler will be processed by the EAL layer; then the app would be
> >> hotplug-safe.
> >>
> >> is there anything i missed clarifying? please shout. and i think i
> >> will detail more later.
> >
> > This is becoming clearer now, thanks. Just the one question above I
> > have at this point.
> > Given how long-running this issue of hotplug is, I'm hoping others on
> > the technical board can also review this proposal.
> >
> > Regards,
> > /Bruce