From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jia.guo@intel.com>
Received: from mga03.intel.com (mga03.intel.com [134.134.136.65])
 by dpdk.org (Postfix) with ESMTP id 586571D8D8;
 Fri, 15 Jun 2018 10:31:23 +0200 (CEST)
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from fmsmga006.fm.intel.com ([10.253.24.20])
 by orsmga103.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 15 Jun 2018 01:31:22 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.51,226,1526367600"; d="scan'208";a="237671246"
Received: from jguo15x-mobl3.ccr.corp.intel.com (HELO [10.67.68.54])
 ([10.67.68.54])
 by fmsmga006.fm.intel.com with ESMTP; 15 Jun 2018 01:31:19 -0700
To: Thomas Monjalon <thomas@monjalon.net>
References: <01BA8470C017D6468C8290E4B9C5E1E83B379B43@shsmsx102.ccr.corp.intel.com>
 <20180606125451.GA2960@bricha3-MOBL.ger.corp.intel.com>
 <b7e01d0e-f37a-c7e4-e894-127cd6f40563@intel.com> <2204635.oFaBryakOT@xps>
Cc: dev@dpdk.org, Bruce Richardson <bruce.richardson@intel.com>,
 techboard@dpdk.org, "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
 "stephen@networkplumber.org" <stephen@networkplumber.org>,
 "Yigit, Ferruh" <ferruh.yigit@intel.com>,
 "gaetan.rivet@6wind.com" <gaetan.rivet@6wind.com>,
 "Wu, Jingjing" <jingjing.wu@intel.com>,
 "motih@mellanox.com" <motih@mellanox.com>,
 "matan@mellanox.com" <matan@mellanox.com>,
 "Van Haaren, Harry" <harry.van.haaren@intel.com>,
 "Zhang, Qi Z" <qi.z.zhang@intel.com>, "Zhang, Helin"
 <helin.zhang@intel.com>, "shreyansh.jain@nxp.com" <shreyansh.jain@nxp.com>
From: "Guo, Jia" <jia.guo@intel.com>
Message-ID: <5fac4426-6a7e-7745-0501-f5854a6f5791@intel.com>
Date: Fri, 15 Jun 2018 16:31:16 +0800
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.4.0
MIME-Version: 1.0
In-Reply-To: <2204635.oFaBryakOT@xps>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Fri, 15 Jun 2018 08:31:24 -0000


On 6/15/2018 5:37 AM, Thomas Monjalon wrote:
> Hi,
>
> I am sorry, it is very hard to be sure we understand correctly
> your thoughts. I like the proposal, but I want to be sure
> my understanding was not biased by what I would like to read :)
> So I try to reword below. Please confirm it matches your intent.
>
> Hot unplug can happen when a hardware device is removed physically,
> or when the software disables it. In both cases, the datapath will fail.
> When the unplug is detected, we must stop and close the related instance
> of the driver.
> The detection can be done with hotplug monitoring (like uevent)
> - this is RTE_DEV_EVENT_REMOVE - or by handling the failure
> in control path or data path - this is RTE_ETH_EVENT_INTR_RMV.
> Between the unplug event and its detection, we need to manage
> any related failure. That's why you propose a sigbus handler
> which will avoid the crash, and can be used to detect the unplug.
>
> Please confirm this is what you thought.
> If not, do you agree, or am I missing something?

Yes, I think that is what I want to propose here.

> I would like to be sure the sigbus handler will not hide any other
> unrelated failure.

I agree. Even though the sigbus handler is used as a hot plug exception
handler in this case, it definitely should not affect the handling of any
other failure. I will cover all of this in my patch.
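
To make that concrete, here is a minimal sketch of a sigbus handler that
only claims faults inside known device mappings and passes everything else
on to the previous disposition. It is an illustration under assumptions,
not the patch itself: the names region_register, region_find, region_remap
and sigbus_handler_install are hypothetical and are not existing DPDK APIs,
and the fixed-size table stands in for whatever the bus layer really tracks.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define MAX_REGIONS 8

/* Hypothetical record of one device's MMIO mapping (not a DPDK type). */
struct dev_region {
	void *base;
	size_t len;
};

static struct dev_region regions[MAX_REGIONS];
static int nb_regions;
static struct sigaction old_sigbus;

/* Remember a mapping that belongs to a hot-pluggable device. */
int region_register(void *base, size_t len)
{
	assert(base != NULL && len > 0);
	if (nb_regions == MAX_REGIONS)
		return -1;
	regions[nb_regions].base = base;
	regions[nb_regions].len = len;
	return nb_regions++;
}

/* Return the index of the region containing addr, or -1 if none does. */
int region_find(const void *addr)
{
	uintptr_t a = (uintptr_t)addr;

	for (int i = 0; i < nb_regions; i++) {
		uintptr_t b = (uintptr_t)regions[i].base;

		if (a >= b && a - b < regions[i].len)
			return i;
	}
	return -1;
}

/* Replace the region with anonymous zero-filled memory at the same address. */
int region_remap(int idx)
{
	void *p = mmap(regions[idx].base, regions[idx].len,
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	return p == MAP_FAILED ? -1 : 0;
}

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	int idx = region_find(info->si_addr);

	(void)ctx;
	/* Fault inside a device mapping: swap in dummy memory and return,
	 * so the faulting access is retried against the new mapping.
	 * (mmap is not on the POSIX async-signal-safe list; using it here
	 * is an assumption of this sketch.) */
	if (idx >= 0 && region_remap(idx) == 0)
		return;

	/* Not a device fault: restore the previous disposition. */
	sigaction(sig, &old_sigbus, NULL);
	/* A signal sent by software (si_code <= 0) is re-raised; a real
	 * fault re-triggers by itself when the handler returns. */
	if (info->si_code <= 0)
		raise(sig);
}

/* Install the handler, saving the previous one so it can be restored. */
int sigbus_handler_install(void)
{
	struct sigaction sa = { .sa_flags = SA_SIGINFO };

	sa.sa_sigaction = sigbus_handler;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, &old_sigbus);
}
```

The point of the address lookup is that a sigbus from anything other than
a registered device mapping is never swallowed: the old handler (or the
default action) still sees it.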

>
> 07/06/2018 04:14, Guo, Jia:
>> On 6/6/2018 8:54 PM, Bruce Richardson wrote:
>>> +Tech-board as I think that this should have more input at the design stage
>>> ahead of any code patches being pushed.
>>>
>>> On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
>>>> Hi, Bruce
>>>>
>>>>
>>>> On 5/29/2018 7:20 PM, Bruce Richardson wrote:
>>>>> On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
>>>>> <snip>
>>>>>>       The hot plug failure handling mechanism would work as below:
>>>>>>
>>>>>>       1.      Add a new bus op “handle_hot-unplug” to the bus to handle bus
>>>>>>       read/write errors. It is bus-specific, and each kind of bus can
>>>>>>       implement its own logic.
>>>>>>
>>>>>>       2.      Implement the PCI bus specific op “pci_handle_hot_unplug”. In
>>>>>>       this function, based on the failure address, remap the memory which
>>>>>>       belongs to the corresponding device that was unplugged.
>>>>>>
>>>>>>       3.      Implement a new sigbus handler, and register it when device
>>>>>>       event monitoring starts. Once an MMIO sigbus error occurs, it will
>>>>>>       trigger the above hot plug failure handling mechanism, so that the
>>>>>>       app doing packet processing will not break and crash, and can then
>>>>>>       go on to clean up, fail-safe, or do other work.
>>>>>>
>>>>>>       4.      Also, testpmd will be used to show an example of the whole
>>>>>>       procedure, like this:
>>>>>>
>>>>>>       device unplug -> failure handle -> stop forwarding -> stop port ->
>>>>>>       close port -> detach port.
>>>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> so if I understand this correctly the proposal is that we need two parallel
>>>>> solutions to handle safe removal of a device.
>>>>>
>>>>> 1. We need a solution to support unplugging of the device at the bus level,
>>>>>       ie. remove the device from the list of devices and to make access to
>>>>>       that device invalid.
>>>>> 2. Since the removal of the device from the software lists is not going to
>>>>>       be instantaneous, we need a mechanism to handle any accesses to the
>>>>>       device from the data path until such time as the removal is complete. To
>>>>>       support that, you propose to add a sigbus handler which will
>>>>>       automatically replace any mmio bar mappings with some other memory that is
>>>>>       ok to access - presumably zero memory or similar.
>>>>>
>>>>> Is this understanding correct?
>>>> I think you are correct about that.
>>>>
>>>>> Point #2 seems reasonably clear to me, but for #1, presumably the trigger
>>>>> to the bus needs to come from the kernel. What is planned to be used there?
>>>> About point #1, I should clarify that we will use the device event
>>>> monitor mechanism to detect the hot unplug event.
>>>> The monitor is enabled by the app (or fail-safe driver), and the app
>>>> (or fail-safe driver) registers the event callback. Once a hot unplug is
>>>> detected, the user's callback is triggered to let the app (or fail-safe
>>>> driver) know about the event and manage the process; it will call APIs to
>>>> stop the device and detach the device from the bus.
>>> Ok. If there is no failsafe driver, and the app does not set up a handler,
>>> does nothing happen when we get a removal event? Will the app just crash?
>> When the device event monitor is enabled by the app, the handler is set
>> up automatically; the app or fail-safe driver does not need to, and cannot,
>> set it up directly.
>> So if an app wants to process this hot plug event, all it needs to do is
>> enable the hot plug event monitor and register its own callback;
>> then the app will not crash when a hotplug event occurs.
>>
>>>>> You also talk about using testpmd as a reference for this, but you don't
>>>>> explain how an application can be notified of a device removal, or even why
>>>>> that is necessary. Since all applications should now be using the proper
>>>>> macros when iterating device lists, and not just assuming devices 0-N are
>>>>> valid, what changes would you see a normal app having to make to be
>>>>> hotplug-safe?
>>>> An app or the fail-safe driver could use this mechanism, but at this
>>>> stage I will first use testpmd as a reference.
>>>> As in the reply above, testpmd should enable the device event mechanism to
>>>> monitor device removal, and register a callback.
>>>> The device BDF list is managed by the bus and the hotplug failure handler
>>>> will be processed by the EAL layer, so the app would be hotplug-safe.
>>>>
>>>> Is there anything I have missed clarifying? Please shout. I think I will
>>>> give more detail later.
>>> This is becoming clearer now, thanks. Just the one question above I have at
>>> this point.
>>> Given how long-running this issue of hotplug is, I'm hoping others on the
>>> technical board can also review this proposal.
>>>
>>> Regards,
>>> /Bruce
>>
>
>
>
>