From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mga06.intel.com (mga06.intel.com [134.134.136.31])
	by dpdk.org (Postfix) with ESMTP id 86D6B2C02;
	Wed, 6 Jun 2018 14:55:00 +0200 (CEST)
X-Amp-Result: UNSCANNABLE
X-Amp-File-Uploaded: False
Received: from fmsmga007.fm.intel.com ([10.253.24.52])
	by orsmga104.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
	06 Jun 2018 05:54:58 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.49,483,1520924400"; d="scan'208";a="45073645"
Received: from bricha3-mobl.ger.corp.intel.com ([10.237.221.133])
	by fmsmga007.fm.intel.com with SMTP; 06 Jun 2018 05:54:54 -0700
Received: by (sSMTP sendmail emulation); Wed, 06 Jun 2018 13:54:52 +0100
Date: Wed, 6 Jun 2018 13:54:52 +0100
From: Bruce Richardson
To: "Guo, Jia", techboard@dpdk.org
Cc: "dev@dpdk.org", "Ananyev, Konstantin", "stephen@networkplumber.org",
	"Yigit, Ferruh", "gaetan.rivet@6wind.com", "Wu, Jingjing",
	"thomas@monjalon.net", "motih@mellanox.com", "matan@mellanox.com",
	"Van Haaren, Harry", "Zhang, Qi Z", "Zhang, Helin",
	"jblunck@infradead.org", "shreyansh.jain@nxp.com"
Message-ID: <20180606125451.GA2960@bricha3-MOBL.ger.corp.intel.com>
References: <01BA8470C017D6468C8290E4B9C5E1E83B379B43@shsmsx102.ccr.corp.intel.com>
	<20180529112011.GA22740@bricha3-MOBL.ger.corp.intel.com>
	<851b7fb1-bb7c-b277-ee85-6c27cef67238@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <851b7fb1-bb7c-b277-ee85-6c27cef67238@intel.com>
Organization: Intel Research and Development Ireland Ltd.
User-Agent: Mutt/1.10.0 (2018-05-17)
Subject: Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions
X-List-Received-Date: Wed, 06 Jun 2018 12:55:02 -0000

+Tech-board as I think that this should have more input at the design stage
ahead of any code patches being pushed.

On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
> hi, bruce
>
>
> On 5/29/2018 7:20 PM, Bruce Richardson wrote:
> > On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
> >
> > > The hot plug failure handling mechanism should work as below:
> > >
> > > 1. Add a new bus op “handle_hot-unplug” to handle bus read/write
> > >    errors. It is bus-specific, and each kind of bus can implement
> > >    its own logic.
> > >
> > > 2. Implement the PCI-bus-specific op “pci_handle_hot_unplug”. In this
> > >    function, based on the failure address, remap the memory that
> > >    belongs to the corresponding unplugged device.
> > >
> > > 3. Implement a new sigbus handler, and register it when device event
> > >    monitoring starts. Once an MMIO sigbus error occurs, it triggers
> > >    the above hot plug failure handling mechanism, so that an app doing
> > >    packet processing is not broken and does not crash, and can go on
> > >    to clean up, fail-safe, or do other work.
> > >
> > > 4. Also introduce the solution using testpmd, to show an example of
> > >    the whole procedure, like this:
> > >
> > >    device unplug -> failure handle -> stop forwarding -> stop port ->
> > >    close port -> detach port.
> > >
> > Hi Jeff,
> >
> > so if I understand this correctly the proposal is that we need two parallel
> > solutions to handle safe removal of a device.
> >
> > 1. We need a solution to support unplugging of the device at the bus level,
> >    i.e. remove the device from the list of devices and make access to
> >    that device invalid.
> > 2. Since the removal of the device from the software lists is not going to
> >    be instantaneous, we need a mechanism to handle any accesses to the
> >    device from the data path until such time as the removal is complete. To
> >    support that, you propose to add a sigbus handler which will
> >    automatically replace any mmio bar mappings with some other memory that is
> >    ok to access - presumably zero memory or similar.
> >
> > Is this understanding correct?
>
> I think you are correct about that.
>
> > Point #2 seems reasonably clear to me, but for #1, presumably the trigger
> > to the bus needs to come from the kernel. What is planned to be used there?
>
> About point #1, I should clarify that we will use the device event
> monitor mechanism to detect the hot unplug event.
> The monitor is enabled by the app (or fail-safe driver), and the app
> (fail-safe driver) registers the event callback. Once a hot unplug is
> detected, the user's callback is triggered to let the app (fail-safe
> driver) know about the event and manage the process; it will call APIs to
> stop the device and detach the device from the bus.

Ok. If there is no failsafe driver, and the app does not set up a handler,
does nothing happen when we get a removal event? Will the app just crash?

>
> > You also talk about using testpmd as a reference for this, but you don't
> > explain how an application can be notified of a device removal, or even why
> > that is necessary. Since all applications should now be using the proper
> > macros when iterating device lists, and not just assuming devices 0-N are
> > valid, what changes would you see a normal app having to make to be
> > hotplug-safe?
>
> Either an app or the fail-safe driver could use this mechanism, but at this
> stage I will first use testpmd as a reference.
> As in the reply above, testpmd should enable the device event mechanism to
> monitor device removal, and register a callback.
> The device BDF list is managed by the bus and the hotplug failure handling
> is done by the EAL layer, so the app would be hotplug-safe.
>
> Is there anything I missed clarifying? Please shout. I think I will give
> more detail later.

This is becoming clearer now, thanks. Just the one question above I have at
this point. Given how long-running this issue of hotplug is, I'm hoping
others on the technical board can also review this proposal.

Regards,
/Bruce
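
For reference, the idea in point 2 of the summary quoted above (a sigbus
handler that replaces a dead mapping with zero memory) can be sketched in
plain C as below. This is only an illustration under stated assumptions and
not the proposed DPDK implementation: bar_base, bar_len, sigbus_handler and
simulate_unplug_read are hypothetical names, no DPDK API is used, and a
truncated temporary file stands in for an unplugged device, since on Linux
touching a file-backed page beyond end-of-file raises SIGBUS. Note also that
mmap() is not on the POSIX list of async-signal-safe functions, so calling
it from a handler relies on Linux behaviour.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bookkeeping: the single "BAR" mapping the handler may patch. */
static volatile uint8_t *volatile bar_base;
static volatile size_t bar_len;

/* SIGBUS handler: if the fault lies inside the tracked mapping, overlay the
 * whole range with anonymous zero pages so the faulting access can retry. */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)bar_base;

    (void)sig;
    (void)ctx;

    /* Fault outside our mapping: exit rather than loop on the same fault. */
    if (base == 0 || addr < base || addr >= base + bar_len)
        _exit(EXIT_FAILURE);

    if (mmap((void *)base, bar_len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
        _exit(EXIT_FAILURE);
}

/* Map a one-page temp file, "unplug" it by truncating the file to zero
 * length, then read through the stale mapping. Returns the byte read after
 * the handler has patched the mapping, or -1 on a setup error. */
int simulate_unplug_read(void)
{
    char path[] = "/tmp/fake_bar_XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    struct sigaction sa = {0};
    void *map;
    int fd, value;

    assert(page > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    bar_base = map;
    bar_len = (size_t)page;
    bar_base[0] = 0xAB;            /* the "device" is alive and readable */

    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, NULL) != 0 || ftruncate(fd, 0) != 0) {
        munmap(map, (size_t)page);
        close(fd);
        return -1;
    }

    /* The backing page is gone: this read faults, the handler remaps the
     * range to zero memory, and the read is retried. */
    value = bar_base[0];

    munmap(map, (size_t)page);
    bar_base = NULL;
    close(fd);
    return value;
}
```

Calling simulate_unplug_read() from a small main() should return 0 rather
than the 0xAB written earlier, because the retried read lands on the zero
page. A real handler would also have to track every mapped BAR, restore the
previous SIGBUS disposition, and cope with multiple threads, none of which
is shown here.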