From: "Guo, Jia"
To: Thomas Monjalon
Cc: dev@dpdk.org, Bruce Richardson, techboard@dpdk.org,
 "Ananyev, Konstantin", "stephen@networkplumber.org", "Yigit, Ferruh",
 "gaetan.rivet@6wind.com", "Wu, Jingjing", "motih@mellanox.com",
 "matan@mellanox.com", "Van Haaren, Harry", "Zhang, Qi Z",
 "Zhang, Helin", "shreyansh.jain@nxp.com"
Date: Fri, 15 Jun 2018 16:31:16 +0800
Subject: Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
Message-ID: <5fac4426-6a7e-7745-0501-f5854a6f5791@intel.com>
In-Reply-To: <2204635.oFaBryakOT@xps>
References: <01BA8470C017D6468C8290E4B9C5E1E83B379B43@shsmsx102.ccr.corp.intel.com>
 <20180606125451.GA2960@bricha3-MOBL.ger.corp.intel.com>
 <2204635.oFaBryakOT@xps>

On 6/15/2018 5:37 AM, Thomas Monjalon wrote:
> Hi,
>
> I am sorry, it is very hard to be sure we understand correctly
> your thoughts.
> I like the proposal, but I want to be sure
> my understanding was not biased by what I would like to read :)
> So I try to reword below. Please confirm it matches your intent.
>
> Hot unplug can happen when a hardware device is removed physically,
> or when the software disables it. In both cases, the datapath will fail.
> When the unplug is detected, we must stop and close the related instance
> of the driver.
> The detection can be done with hotplug monitoring (like uevent)
> - this is RTE_DEV_EVENT_REMOVE - or by handling the failure
> in control path or data path - this is RTE_ETH_EVENT_INTR_RMV.
> Between the unplug event and its detection, we need to manage
> any related failure. That's why you propose a sigbus handler
> which will avoid the crash, and can be used to detect the unplug.
>
> Please confirm this is what you thought.
> If not, do you agree, or am I missing something?

Yes, I think that is what I want to propose here.

> I would like to be sure the sigbus handler will not hide any other
> unrelated failure.

I agree. Even though the sigbus handler is used as a hot plug exception
handler in this case, it definitely should not affect the handling of any
other failure. I will cover all of this in my patch.

>
> 07/06/2018 04:14, Guo, Jia:
>> On 6/6/2018 8:54 PM, Bruce Richardson wrote:
>>> +Tech-board as I think that this should have more input at the design stage
>>> ahead of any code patches being pushed.
>>>
>>> On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
>>>> hi, bruce
>>>>
>>>>
>>>> On 5/29/2018 7:20 PM, Bruce Richardson wrote:
>>>>> On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
>>>>>
>>>>>> The hot plug failure handle mechanism should work as below:
>>>>>>
>>>>>> 1. Add a new bus op “handle_hot-unplug” to the bus to handle bus
>>>>>>    read/write errors. It is bus-specific, and each kind of bus can
>>>>>>    implement its own logic.
>>>>>>
>>>>>> 2.
>>>>>>    Implement the PCI bus-specific op “pci_handle_hot_unplug”. In this
>>>>>>    function, the memory belonging to the unplugged device is remapped,
>>>>>>    based on the failure address.
>>>>>>
>>>>>> 3. Implement a new sigbus handler, and register it when device event
>>>>>>    monitoring starts. Once an MMIO sigbus error occurs, it will
>>>>>>    trigger the above hot plug failure handle mechanism. That keeps
>>>>>>    the app, which is busy with packet processing, from breaking and
>>>>>>    crashing, so it can then carry on with cleanup, fail-safe or
>>>>>>    other work.
>>>>>>
>>>>>> 4. I will also introduce the solution in testpmd, to show an example
>>>>>>    of the whole procedure, like this:
>>>>>>
>>>>>>    device unplug -> failure handle -> stop forwarding -> stop port
>>>>>>    -> close port -> detach port.
>>>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> so if I understand this correctly the proposal is that we need two parallel
>>>>> solutions to handle safe removal of a device.
>>>>>
>>>>> 1. We need a solution to support unplugging of the device at the bus level,
>>>>>    ie. remove the device from the list of devices and to make access to
>>>>>    that device invalid.
>>>>> 2. Since the removal of the device from the software lists is not going to
>>>>>    be instantaneous, we need a mechanism to handle any accesses to the
>>>>>    device from the data path until such time as the removal is complete. To
>>>>>    support that, you propose to add a sigbus handler which will
>>>>>    automatically replace any mmio bar mappings with some other memory that is
>>>>>    ok to access - presumably zero memory or similar.
>>>>>
>>>>> Is this understanding correct?
>>>> I think you are correct about that.
>>>>
>>>>> Point #2 seems reasonably clear to me, but for #1, presumably the trigger
>>>>> to the bus needs to come from the kernel. What is planned to be used there?
>>>> About point #1, I should clarify that we will use the device event
>>>> monitor mechanism to detect the hot unplug event.
>>>> The monitor is enabled by the app (or fail-safe driver), and the app
>>>> (or fail-safe driver) registers the event callback. Once hot unplug is
>>>> detected, the user's callback is triggered to let the app (or
>>>> fail-safe driver) know about the event and manage the process: it
>>>> calls APIs to stop the device and detach the device from the bus.
>>> Ok. If there is no failsafe driver, and the app does not set up a handler,
>>> does nothing happen when we get a removal event? Will the app just crash?
>> When the device event monitor is enabled by the app, the handler is set
>> up automatically; the app or fail-safe driver does not need to, and
>> cannot, set it up directly.
>> So if the app wants to process this hot plug event, all it needs to do
>> is enable the hot plug event monitor and register its own callback;
>> then the app will not crash when a hotplug event occurs.
>>
>>>>> You also talk about using testpmd as a reference for this, but you don't
>>>>> explain how an application can be notified of a device removal, or even why
>>>>> that is necessary. Since all applications should now be using the proper
>>>>> macros when iterating device lists, and not just assuming devices 0-N are
>>>>> valid, what changes would you see a normal app having to make to be
>>>>> hotplug-safe?
>>>> We could have the app or fail-safe driver use this mechanism, but at
>>>> this stage I will first use testpmd as a reference.
>>>> As in the reply above, testpmd should enable the device event mechanism
>>>> to monitor device removal, and register a callback.
>>>> The device BDF list is managed by the bus and the hotplug failure
>>>> handler is processed by the EAL layer, so the app would be hotplug-safe.
>>>>
>>>> Is there anything I missed clarifying? Please shout. I think I will
>>>> give more detail later.
>>> This is becoming clearer now, thanks.
>>> Just the one question above I have at
>>> this point.
>>> Given how long-running this issue of hotplug is, I'm hoping others on the
>>> technical board can also review this proposal.
>>>
>>> Regards,
>>> /Bruce
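
To make the sigbus handling discussed in this thread concrete, here is a
minimal sketch in C. It is not the DPDK patch and not the EAL API. It only
illustrates the idea summarized above: on SIGBUS, check whether the failing
address belongs to a known device mapping, replace that mapping with zero
memory, and pass any unrelated SIGBUS on to the default action so that other
failures are not hidden. All names (struct dev_region, g_region,
sigbus_handler, install_sigbus_handler, simulate_hot_unplug) are hypothetical.
Further assumptions: Linux, a single tracked region, and a truncated temporary
file standing in for the BAR of an unplugged device. Note that POSIX does not
list mmap() as async-signal-safe, so calling it from a handler relies on
platform behavior.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical record of one device mapping; it stands in for a PCI BAR. */
struct dev_region {
    void *addr;
    size_t len;
    volatile sig_atomic_t unplugged;
};

/* A single global region keeps the sketch short (assumption). */
static struct dev_region g_region;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t fault = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)g_region.addr;

    (void)ctx;
    if (g_region.addr != NULL && fault >= base &&
        fault - base < g_region.len) {
        /* Failure address is inside the device mapping: put anonymous
         * zero pages at the same address so later accesses succeed. */
        void *p = mmap(g_region.addr, g_region.len,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p != MAP_FAILED) {
            g_region.unplugged = 1;
            return; /* the faulting access is retried */
        }
    }
    /* Unrelated failure: do not hide it, fall back to the default action. */
    signal(sig, SIG_DFL);
    raise(sig);
}

int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO; /* needed to receive si_addr */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Simulates an unplug: map a one-page file, truncate it to zero length,
 * then read the mapping. Returns the value read after recovery, or -1 if
 * the setup fails. */
int simulate_hot_unplug(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/hotplug_sketch_XXXXXX";
    volatile uint32_t *bar;
    uint32_t value;
    int fd;

    assert(page > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }
    bar = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) {
        close(fd);
        return -1;
    }
    g_region.addr = (void *)bar;
    g_region.len = (size_t)page;
    g_region.unplugged = 0;

    bar[0] = 0xdeadbeefu;          /* "device" present: access works */
    if (ftruncate(fd, 0) != 0) {   /* "unplug": backing store goes away */
        munmap((void *)bar, (size_t)page);
        close(fd);
        return -1;
    }
    value = bar[0];                /* raises SIGBUS; handler remaps */

    munmap((void *)bar, (size_t)page);
    close(fd);
    return (int)value;
}
```

A caller would run install_sigbus_handler() once and then
simulate_hot_unplug(). In this sketch the read after the simulated unplug
comes from the zero-filled replacement mapping, and g_region.unplugged is the
flag a real application could poll (or turn into an event) to start the
stop/close/detach sequence.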