From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <bruce.richardson@intel.com>
Received: from mga06.intel.com (mga06.intel.com [134.134.136.31])
 by dpdk.org (Postfix) with ESMTP id 86D6B2C02;
 Wed,  6 Jun 2018 14:55:00 +0200 (CEST)
X-Amp-Result: UNSCANNABLE
X-Amp-File-Uploaded: False
Received: from fmsmga007.fm.intel.com ([10.253.24.52])
 by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 06 Jun 2018 05:54:58 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.49,483,1520924400"; d="scan'208";a="45073645"
Received: from bricha3-mobl.ger.corp.intel.com ([10.237.221.133])
 by fmsmga007.fm.intel.com with SMTP; 06 Jun 2018 05:54:54 -0700
Received: by  (sSMTP sendmail emulation); Wed, 06 Jun 2018 13:54:52 +0100
Date: Wed, 6 Jun 2018 13:54:52 +0100
From: Bruce Richardson <bruce.richardson@intel.com>
To: "Guo, Jia" <jia.guo@intel.com>, techboard@dpdk.org
Cc: "dev@dpdk.org" <dev@dpdk.org>,
 "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
 "stephen@networkplumber.org" <stephen@networkplumber.org>,
 "Yigit, Ferruh" <ferruh.yigit@intel.com>,
 "gaetan.rivet@6wind.com" <gaetan.rivet@6wind.com>,
 "Wu, Jingjing" <jingjing.wu@intel.com>,
 "thomas@monjalon.net" <thomas@monjalon.net>,
 "motih@mellanox.com" <motih@mellanox.com>,
 "matan@mellanox.com" <matan@mellanox.com>,
 "Van Haaren, Harry" <harry.van.haaren@intel.com>,
 "Zhang, Qi Z" <qi.z.zhang@intel.com>,
 "Zhang, Helin" <helin.zhang@intel.com>,
 "jblunck@infradead.org" <jblunck@infradead.org>,
 "shreyansh.jain@nxp.com" <shreyansh.jain@nxp.com>
Message-ID: <20180606125451.GA2960@bricha3-MOBL.ger.corp.intel.com>
References: <01BA8470C017D6468C8290E4B9C5E1E83B379B43@shsmsx102.ccr.corp.intel.com>
 <20180529112011.GA22740@bricha3-MOBL.ger.corp.intel.com>
 <851b7fb1-bb7c-b277-ee85-6c27cef67238@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <851b7fb1-bb7c-b277-ee85-6c27cef67238@intel.com>
Organization: Intel Research and Development Ireland Ltd.
User-Agent: Mutt/1.10.0 (2018-05-17)
Subject: Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Wed, 06 Jun 2018 12:55:02 -0000

+Tech-board as I think that this should have more input at the design stage
ahead of any code patches being pushed.

On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
> Hi Bruce,
> 
> 
> On 5/29/2018 7:20 PM, Bruce Richardson wrote:
> > On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
> > <snip>
> > >     The hot plug failure handle mechanism should work as below:
> > > 
> > >     1.      Add a new bus op “handle_hot-unplug” to the bus to handle bus
> > >     read/write errors. It is bus-specific, and each kind of bus can
> > >     implement its own logic.
> > > 
> > >     2.      Implement the PCI bus-specific op “pci_handle_hot_unplug”.
> > >     Based on the failure address, this function remaps the memory that
> > >     belongs to the corresponding unplugged device.
> > > 
> > >     3.      Implement a new sigbus handler and register it when device
> > >     event monitoring starts. Once an MMIO sigbus error occurs, it
> > >     triggers the above hot plug failure handle mechanism. That keeps an
> > >     app that is processing packets from breaking and crashing, so it
> > >     can carry on with cleanup, fail-safe, or other tasks.
> > > 
> > >     4.      Also introduce the solution in testpmd to show an example of
> > >     the whole procedure, like this:
> > > 
> > >     device unplug -> failure handle -> stop forwarding -> stop port ->
> > >     close port -> detach port.
> > > 
> > Hi Jeff,
> > 
> > so if I understand this correctly the proposal is that we need two parallel
> > solutions to handle safe removal of a device.
> > 
> > 1. We need a solution to support unplugging of the device at the bus level,
> >     ie. remove the device from the list of devices and to make access to
> >     that device invalid.
> > 2. Since the removal of the device from the software lists is not going to
> >     be instantaneous, we need a mechanism to handle any accesses to the
> >     device from the data path until such time as the removal is complete. To
> >     support that, you propose to add a sigbus handler which will
> >     automatically replace any mmio bar mappings with some other memory that is
> >     ok to access - presumably zero memory or similar.
> > 
> > Is this understanding correct?
> 
> i think you are correct about that.
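
To make sure we are talking about the same thing for point #2, here is a
minimal, Linux-only sketch of what I understand the fallback to be. It is
not DPDK code: struct fake_bar, g_bar, g_remapped, sigbus_handler() and
simulate_unplug_and_read() are all made-up names, and a truncated temp file
stands in for a BAR whose device has gone away. The real handler would have
to look up the faulting address in the bus's list of mappings.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical record of one device mapping; a real bus would keep a list. */
struct fake_bar {
    void  *addr;
    size_t len;
};

static struct fake_bar g_bar;
static volatile sig_atomic_t g_remapped;

/* On SIGBUS inside the tracked range, overlay zero-filled anonymous memory
 * at the same virtual address so the faulting access can be retried.
 * mmap() is not on the POSIX async-signal-safe list; this sketch assumes
 * it behaves as a plain system call on Linux. */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t fault = (uintptr_t)info->si_addr;
    uintptr_t start = (uintptr_t)g_bar.addr;

    (void)sig;
    (void)ctx;
    if (g_bar.addr == NULL || fault < start || fault >= start + g_bar.len)
        _exit(1); /* not a tracked mapping: give up */

    if (mmap(g_bar.addr, g_bar.len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
        _exit(2);
    g_remapped = 1;
}

/* Simulate an unplug: map a temp file, truncate it so the mapping loses
 * its backing store, then read through the stale pointer. Returns the
 * value read after recovery, or -1 on setup failure. */
static int simulate_unplug_and_read(void)
{
    char path[] = "/tmp/fake_bar_XXXXXX";
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    struct sigaction sa;
    uint32_t val;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0)
        return -1;

    g_bar.addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (g_bar.addr == MAP_FAILED)
        return -1;
    g_bar.len = len;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, NULL) != 0)
        return -1;

    if (ftruncate(fd, 0) != 0) /* "unplug": the backing store disappears */
        return -1;

    /* This load raises SIGBUS; the handler remaps and the load is retried. */
    val = *(volatile uint32_t *)g_bar.addr;
    close(fd);
    return (int)val;
}
```

The point of the sketch is only that the data path keeps running on dummy
memory until the control path has finished removing the device; it says
nothing about how the bus decides which device a faulting address belongs
to.
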
> 
> > Point #2 seems reasonably clear to me, but for #1, presumably the trigger
> > to the bus needs to come from the kernel. What is planned to be used there?
> 
> About point #1, I should clarify that we will use the device event
> monitor mechanism to detect the hot unplug event.
> The monitor is enabled by the app (or fail-safe driver), and the app
> (fail-safe driver) registers the event callback. Once a hot unplug is
> detected, the user's callback is triggered to let the app (fail-safe
> driver) know about the event and manage the process; it will call APIs to
> stop the device and detach the device from the bus.

Ok. If there is no failsafe driver, and the app does not set up a handler,
does nothing happen when we get a removal event? Will the app just crash?
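
To illustrate why I ask, below is a rough, self-contained model of the
callback flow as I understand it. None of these names come from DPDK or
from the RFC; dev_event_cb_register(), dev_event_dispatch(), port_step()
and app_on_event() are invented for the example. The dispatch function
reports how many callbacks ran, so a return of 0 is exactly the "nobody
registered" case where some default behaviour would need to be defined.

```c
#include <stddef.h>
#include <string.h>

enum dev_event { DEV_EVENT_ADD, DEV_EVENT_REMOVE };

typedef void (*dev_event_cb)(const char *dev_name, enum dev_event ev,
                             void *arg);

#define MAX_CBS 8

struct cb_entry {
    const char  *dev_name; /* NULL matches every device */
    dev_event_cb fn;
    void        *arg;
};

static struct cb_entry cbs[MAX_CBS];
static size_t n_cbs;

/* Register a callback for one device name (or all devices if NULL). */
static int dev_event_cb_register(const char *dev_name, dev_event_cb fn,
                                 void *arg)
{
    if (fn == NULL || n_cbs == MAX_CBS)
        return -1;
    cbs[n_cbs++] = (struct cb_entry){ dev_name, fn, arg };
    return 0;
}

/* Deliver an event. Returns the number of callbacks invoked; 0 means
 * that no one handled the event and a fallback policy is needed. */
static int dev_event_dispatch(const char *dev_name, enum dev_event ev)
{
    int handled = 0;

    for (size_t i = 0; i < n_cbs; i++) {
        if (cbs[i].dev_name == NULL ||
            strcmp(cbs[i].dev_name, dev_name) == 0) {
            cbs[i].fn(dev_name, ev, cbs[i].arg);
            handled++;
        }
    }
    return handled;
}

/* Port states matching the teardown order quoted from the RFC. */
enum port_state {
    PORT_FORWARDING, PORT_STARTED, PORT_STOPPED, PORT_CLOSED, PORT_DETACHED
};

/* Move to the next state only if the port is in the expected one. */
static int port_step(enum port_state *st, enum port_state expect,
                     enum port_state next)
{
    if (*st != expect)
        return -1;
    *st = next;
    return 0;
}

/* Example application callback: on removal, walk the teardown sequence. */
static void app_on_event(const char *dev_name, enum dev_event ev, void *arg)
{
    enum port_state *st = arg;

    (void)dev_name;
    if (ev != DEV_EVENT_REMOVE)
        return;
    if (port_step(st, PORT_FORWARDING, PORT_STARTED) || /* stop forwarding */
        port_step(st, PORT_STARTED, PORT_STOPPED) ||    /* stop port */
        port_step(st, PORT_STOPPED, PORT_CLOSED) ||     /* close port */
        port_step(st, PORT_CLOSED, PORT_DETACHED))      /* detach port */
        return; /* port was not in the expected state; leave it alone */
}
```

With a callback registered for the device, a removal event drives the port
from forwarding to detached. Without one, dispatch returns 0 and nothing
else happens, which is the case my question is about.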

> 
> > You also talk about using testpmd as a reference for this, but you don't
> > explain how an application can be notified of a device removal, or even why
> > that is necessary. Since all applications should now be using the proper
> > macros when iterating device lists, and not just assuming devices 0-N are
> > valid, what changes would you see a normal app having to make to be
> > hotplug-safe?
> 
> Either an app or the fail-safe driver could use this mechanism, but at
> this stage I will first use testpmd as a reference.
> As in the reply above, testpmd should enable the device event mechanism to
> monitor device removal, and register a callback.
> The device BDF list is managed by the bus and the hotplug failure handler
> will be processed by the EAL layer, so the app would be hotplug-safe.
> 
> Is there anything I have missed clarifying? Please shout. I think I will
> add more detail later.
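
For reference, this is the kind of iteration I had in mind when I mentioned
the proper macros. It is a standalone sketch in the spirit of
RTE_ETH_FOREACH_DEV, not the real implementation: port_attached[],
find_next_port(), FOREACH_PORT() and count_ports() are hypothetical names.
The loop skips holes in the port table, so an app written this way does not
assume that ports 0-N are all valid after a device has been detached.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_PORTS 8

/* Hypothetical port table: true means the port is currently attached. */
static bool port_attached[MAX_PORTS];

/* Return the first attached port id >= port_id, or MAX_PORTS if none. */
static uint16_t find_next_port(uint16_t port_id)
{
    while (port_id < MAX_PORTS && !port_attached[port_id])
        port_id++;
    return port_id;
}

/* Visit only attached ports, whatever their numbering. */
#define FOREACH_PORT(p)                          \
    for ((p) = find_next_port(0);                \
         (p) < MAX_PORTS;                        \
         (p) = find_next_port((uint16_t)((p) + 1)))

/* Count the ports that are attached right now. */
static unsigned int count_ports(void)
{
    uint16_t p;
    unsigned int n = 0;

    FOREACH_PORT(p)
        n++;
    return n;
}
```

If ports 0, 2 and 5 are attached and port 2 is then detached, the loop
simply visits 0 and 5 on the next pass, with no change needed in the app.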

This is becoming clearer now, thanks. Just the one question above I have at
this point.
Given how long-running this issue of hotplug is, I'm hoping others on the
technical board can also review this proposal.

Regards,
/Bruce