DPDK patches and discussions
* [dpdk-dev]  [RFC] hot plug failure handle mechanism
@ 2018-05-24  6:55 Guo, Jia
  2018-05-24 14:57 ` Matan Azrad
  2018-05-29 11:20 ` Bruce Richardson
  0 siblings, 2 replies; 10+ messages in thread
From: Guo, Jia @ 2018-05-24  6:55 UTC (permalink / raw)
  To: dev
  Cc: Ananyev, Konstantin, stephen, Richardson, Bruce, Yigit, Ferruh,
	gaetan.rivet, Wu, Jingjing, thomas, motih, matan, Van Haaren,
	Harry, Zhang, Qi Z, Zhang, Helin, jblunck, shreyansh.jain, Guo,
	Jia

As we know, hot plug is an important feature, whether it is used for fail-safe
and consumption management of datacenter devices, or for dynamic deployment and
SR-IOV live migration in SDN/NFV. It can bring greater flexibility and
continuity to networking services across many industry use cases.

So let us look at what DPDK, as an important networking framework that combines
packet control path/fast path libraries with a diverse set of PMD drivers, can
do to help an application that wants to build its own hot plug solution while
it is processing packets with DPDK.

We already have a general device event mechanism, the failsafe driver, the
bonding driver and the hot plug/unplug API in the framework, and an application
can use these APIs to develop its functionality. What is lacking is a mechanism
to handle hot plug failure: removing a device at run time causes the
application to trigger an MMIO error and crash. At present the kernel only
guarantees safe hotplug handling on the kernel side. On the user-mode side, no
third-party tool such as udev or driverctl specifically covers this part.
Considering feasibility of implementation, runtime performance and generality
across almost all user-mode PMD drivers, I propose a general hot plug failure
handling mechanism in the DPDK framework.

The hot plug failure handling mechanism would work as follows:

1.      Add a new bus op, "handle_hot_unplug", to handle bus read/write errors.
        It is bus-specific, and each kind of bus can implement its own logic.

2.      Implement the PCI bus specific op, "pci_handle_hot_unplug". Based on
        the failure address, this function remaps the memory that belongs to
        the corresponding unplugged device.

3.      Implement a new SIGBUS handler and register it when device event
        monitoring starts. Once an MMIO SIGBUS error occurs, it triggers the
        hot plug failure handling mechanism above, so that an application in
        the middle of packet processing does not break and crash, and can go
        on with cleanup, fail-safe handling or other work.

4.      Introduce the solution in testpmd as an example of the whole procedure:
        device unplug -> failure handle -> stop forwarding -> stop port ->
        close port -> detach port.
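
As a rough illustration of steps 1 and 2, the per-bus op could be dispatched
along the following lines. This is only a sketch under stated assumptions: the
toy_ types and functions are hypothetical stand-ins rather than the real DPDK
bus structures, and the "remap" is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified bus descriptor; not the real DPDK bus struct. */
struct toy_bus {
    const char *name;
    /* Returns 0 if this bus owns the failing address and handled it, else -1. */
    int (*handle_hot_unplug)(const void *failure_addr);
};

/* Toy PCI bus that owns a single fake BAR range. */
static uint8_t toy_pci_bar[4096];
static int toy_pci_handled;

static int toy_pci_handle_hot_unplug(const void *failure_addr)
{
    uintptr_t addr = (uintptr_t)failure_addr;
    uintptr_t base = (uintptr_t)toy_pci_bar;

    if (addr < base || addr - base >= sizeof(toy_pci_bar))
        return -1;
    toy_pci_handled++; /* a real handler would remap the BAR here */
    return 0;
}

/* Toy virtual-device bus: has no MMIO, so it never claims an address. */
static int toy_vdev_handle_hot_unplug(const void *failure_addr)
{
    (void)failure_addr;
    return -1;
}

static struct toy_bus toy_buses[] = {
    { "vdev", toy_vdev_handle_hot_unplug },
    { "pci", toy_pci_handle_hot_unplug },
};

/* Ask each bus in turn; return the bus that handled the failure, or NULL. */
static struct toy_bus *toy_dispatch_hot_unplug(const void *failure_addr)
{
    size_t i;

    for (i = 0; i < sizeof(toy_buses) / sizeof(toy_buses[0]); i++) {
        if (toy_buses[i].handle_hot_unplug != NULL &&
            toy_buses[i].handle_hot_unplug(failure_addr) == 0)
            return &toy_buses[i];
    }
    return NULL;
}
```

Keeping the op behind a function pointer lets a bus without MMIO simply decline
the address, while the caller stays bus-agnostic.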

Best regards,

Jeff Guo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-05-24  6:55 [dpdk-dev] [RFC] hot plug failure handle mechanism Guo, Jia
@ 2018-05-24 14:57 ` Matan Azrad
  2018-05-25  7:49   ` Guo, Jia
  2018-05-29 11:20 ` Bruce Richardson
  1 sibling, 1 reply; 10+ messages in thread
From: Matan Azrad @ 2018-05-24 14:57 UTC (permalink / raw)
  To: Guo, Jia, dev
  Cc: Ananyev, Konstantin, stephen, Richardson, Bruce, Yigit, Ferruh,
	gaetan.rivet, Wu, Jingjing, Thomas Monjalon, Mordechay Haimovsky,
	Van Haaren,  Harry, Zhang, Qi Z, Zhang, Helin, jblunck,
	shreyansh.jain

Hi Guo

Some questions.

From: Guo Jia
> As we know, hot plug is an importance feature whenever it use for the
> datacenter device's fail-safe and consumption management , or use for the
> dynamic deployment  and SRIOV Live Migration in SDN/NFV, it could be bring
> the higher flexibility and continuality of the networking services in multiple use
> case in industry.
> 
> So let we see, dpdk as an importance networking combine framework with
> packet control path/fast path lib and multiple diversity PMD drivers, what can it
> do to help if application want to achieve their hot plug solution when they are
> working in packet processing by dpdk.
> 
> We already have a general device event mechanism, failsafe driver, bonding
> driver and hot plug/unplug api in framework, app could use these api to
> develop functional, but for the case of hot plug failure handle, that is removing
> a device at run-time will cause app trigger MMIO error and crash out, it is lack
> of a mechanism to handle the failure when hot unplug device. At present,
> kernel only guantiy the hotplug handle safer on the kernel side, but for the user
> mode side, no more specific 3rd tools such as udev/driverctl have especially
> cover about these part of mechanism, and considerate feasibility of the
> implementation, runtime performance and the general for almost user mode
> PMD driver, here a general hot plug failure handle mechanism in dpdk
> framework would be proposed.
> 
> The hot plug failure handle mechanism should be come across as bellow:
> 1. Add a new bus ops "handle_hot-unplug"in bus to handle bus read/write
> error, it is bus-specific and each kind of bus can implement its own logic.
> 2. Implement pci bus specific ops"pci_handle_hot_unplug", in the function,
> base on the failure address to remap memory which belong to the
> corresponding device that unplugged.
> 3. Implement a new sigbus handler, and register it when start device event
> monitoring, once the MMIO sigbus error exposure, it will trigger the above hot
> plug failure handle mechanism, that will keep app, that working on packet
> processing, would not be broken and crash, then could keep going clean, fail-
> safe or other working task.

Can you explain more about what happens with all the threads (master thread, host thread, data-path threads)?
Can the signal be raised only in a data-path thread, or also in a control thread?

What about resource leaks? (Mainly relevant for control threads.)
If you jump from the signal address to the restart address, how can you clean up the work that had started when the signal arrived?

Matan.
> 4. Also also will introduce the solution by use testpmd to show the example of
> the whole procedure like that:
> device unplug ->failure handle->stop forwarding->stop port->close port->detach
> port.
> 
> Best regards,
> 
> Jeff Guo

* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-05-24 14:57 ` Matan Azrad
@ 2018-05-25  7:49   ` Guo, Jia
  0 siblings, 0 replies; 10+ messages in thread
From: Guo, Jia @ 2018-05-25  7:49 UTC (permalink / raw)
  To: Matan Azrad, dev
  Cc: Ananyev, Konstantin, stephen, Richardson, Bruce, Yigit, Ferruh,
	gaetan.rivet, Wu, Jingjing, Thomas Monjalon, Mordechay Haimovsky,
	Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, jblunck,
	shreyansh.jain

Hi Matan,


On 5/24/2018 10:57 PM, Matan Azrad wrote:
> Hi Guo
>
> Some questions.
>
> From: Guo Jia
>> As we know, hot plug is an importance feature whenever it use for the
>> datacenter device's fail-safe and consumption management , or use for the
>> dynamic deployment  and SRIOV Live Migration in SDN/NFV, it could be bring
>> the higher flexibility and continuality of the networking services in multiple use
>> case in industry.
>>
>> So let we see, dpdk as an importance networking combine framework with
>> packet control path/fast path lib and multiple diversity PMD drivers, what can it
>> do to help if application want to achieve their hot plug solution when they are
>> working in packet processing by dpdk.
>>
>> We already have a general device event mechanism, failsafe driver, bonding
>> driver and hot plug/unplug api in framework, app could use these api to
>> develop functional, but for the case of hot plug failure handle, that is removing
>> a device at run-time will cause app trigger MMIO error and crash out, it is lack
>> of a mechanism to handle the failure when hot unplug device. At present,
>> kernel only guantiy the hotplug handle safer on the kernel side, but for the user
>> mode side, no more specific 3rd tools such as udev/driverctl have especially
>> cover about these part of mechanism, and considerate feasibility of the
>> implementation, runtime performance and the general for almost user mode
>> PMD driver, here a general hot plug failure handle mechanism in dpdk
>> framework would be proposed.
>>
>> The hot plug failure handle mechanism should be come across as bellow:
>> 1. Add a new bus ops "handle_hot-unplug"in bus to handle bus read/write
>> error, it is bus-specific and each kind of bus can implement its own logic.
>> 2. Implement pci bus specific ops"pci_handle_hot_unplug", in the function,
>> base on the failure address to remap memory which belong to the
>> corresponding device that unplugged.
>> 3. Implement a new sigbus handler, and register it when start device event
>> monitoring, once the MMIO sigbus error exposure, it will trigger the above hot
>> plug failure handle mechanism, that will keep app, that working on packet
>> processing, would not be broken and crash, then could keep going clean, fail-
>> safe or other working task.
> Can you explain more what's happened with all the threads? Master thread, host thread, data-path threads,
> The signal may happened only in a datapath thread or even from a control thread?
Let me explain that first. The SIGBUS handler is registered per process.
Because of how signal delivery works, either a control thread or a
data-path thread may receive the SIGBUS error, but both end up in the
common SIGBUS handler. The handler finds the device from the failure
address and then remaps the memory for that device.
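
The lookup from failure address to device could be sketched as below. This is
an assumption-laden illustration: the toy_ structures and the two-BAR layout
are hypothetical, not the actual PCI bus data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BARS 2

/* Hypothetical record of one mapped memory region of a device. */
struct toy_mem_region {
    uintptr_t start;
    size_t len; /* 0 means the slot is unused */
};

/* Hypothetical device entry, identified by its BDF string. */
struct toy_device {
    const char *bdf;
    struct toy_mem_region bars[TOY_MAX_BARS];
};

/* Return the device whose mapped region contains addr, or NULL if none does. */
static const struct toy_device *
toy_find_device_by_addr(const struct toy_device *devs, size_t n, uintptr_t addr)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        for (j = 0; j < TOY_MAX_BARS; j++) {
            const struct toy_mem_region *r = &devs[i].bars[j];

            /* Written as a subtraction so start + len cannot overflow. */
            if (r->len != 0 && addr >= r->start && addr - r->start < r->len)
                return &devs[i];
        }
    }
    return NULL;
}
```

A NULL result means the fault did not come from a known device mapping, in
which case the handler should not try to hide it.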
> What's about resource leak?  (mainly relevant for control threads):
> If you jump from the signal address to the restart address, how can you clean the process which was started and got the signal?
It does not use a long jump back to a restart address. It just catches
the SIGBUS event, does the failure handling, and then lets the thread
continue from its current position.
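
A minimal Linux-only sketch of that behaviour follows. It simulates the unplug
by truncating a file that backs a shared mapping, which makes the next access
raise SIGBUS; the handler overlays zero-filled anonymous memory and returns, so
the faulting load is simply retried. All toy_ names are hypothetical, and
calling mmap() from a signal handler is an assumption that holds on Linux but
is not guaranteed by POSIX.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* MAP_ANONYMOUS and mkstemp under strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapped "BAR"; not a DPDK structure. */
static void *toy_bar_base;
static size_t toy_bar_len;
static volatile sig_atomic_t toy_sigbus_count;

static void toy_sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)toy_bar_base;

    (void)sig;
    (void)ctx;
    /* Not our mapping: nothing sensible to do in this sketch. */
    if (toy_bar_base == NULL || addr < base || addr - base >= toy_bar_len)
        _exit(EXIT_FAILURE);
    /* Replace the dead mapping in place with zero-filled anonymous memory. */
    if (mmap(toy_bar_base, toy_bar_len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
        _exit(EXIT_FAILURE);
    toy_sigbus_count++;
    /* Returning re-executes the faulting access; no longjmp is involved. */
}

/* Returns the value read after the simulated unplug, or -1 on setup failure. */
int toy_unplug_demo(void)
{
    char path[] = "/tmp/toy_bar_XXXXXX";
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    struct sigaction sa, old;
    volatile uint32_t *reg;
    uint32_t value;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    reg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (reg == MAP_FAILED) {
        close(fd);
        return -1;
    }
    reg[0] = 0xdeadbeefu; /* the "register" holds data while plugged in */
    toy_bar_base = (void *)reg;
    toy_bar_len = len;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = toy_sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0 || ftruncate(fd, 0) != 0) {
        munmap((void *)reg, len);
        close(fd);
        return -1;
    }
    /* The file is now empty, so this load faults once and is then retried. */
    value = reg[0];

    sigaction(SIGBUS, &old, NULL);
    munmap((void *)reg, len);
    close(fd);
    toy_bar_base = NULL;
    return (int)value;
}
```

The thread that faulted never leaves its call stack, so nothing it had
allocated is skipped over; it just reads zeros from then on.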
> Matan.
>> 4. Also also will introduce the solution by use testpmd to show the example of
>> the whole procedure like that:
>> device unplug ->failure handle->stop forwarding->stop port->close port->detach
>> port.
>>
>> Best regards,
>>
>> Jeff Guo

* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-05-24  6:55 [dpdk-dev] [RFC] hot plug failure handle mechanism Guo, Jia
  2018-05-24 14:57 ` Matan Azrad
@ 2018-05-29 11:20 ` Bruce Richardson
  2018-06-04  1:56   ` Guo, Jia
  1 sibling, 1 reply; 10+ messages in thread
From: Bruce Richardson @ 2018-05-29 11:20 UTC (permalink / raw)
  To: Guo, Jia
  Cc: dev, Ananyev, Konstantin, stephen, Yigit, Ferruh, gaetan.rivet,
	Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang,
	Qi Z, Zhang, Helin, jblunck, shreyansh.jain

On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
<snip> 
>    The hot plug failure handle mechanism should be come across as bellow:
> 
>    1.      Add a new bus ops “handle_hot-unplug”in bus to handle bus
>    read/write error, it is bus-specific and each
> 
>    kind of bus can implement its own logic.
> 
>    2.      Implement pci bus specific ops“pci_handle_hot_unplug”, in the
>    function, base on the
> 
>    failure address to remap memory which belong to the corresponding
>    device that unplugged.
> 
>    3.      Implement a new sigbus handler, and register it when start
>    device event monitoring,
> 
>    once the MMIO sigbus error exposure, it will trigger the above hot plug
>    failure handle mechanism,
> 
>    that will keep app, that working on packet processing, would not be
>    broken and crash, then could
> 
>    keep going clean, fail-safe or other working task.
> 
>    4.      Also also will introduce the solution by use testpmd to show
>    the example of the whole procedure like that:
> 
>    device unplug ->failure handle->stop forwarding->stop port->close
>    port->detach port.
> 

Hi Jeff, 

so if I understand this correctly the proposal is that we need two parallel
solutions to handle safe removal of a device.

1. We need a solution to support unplugging of the device at the bus level,
   ie. remove the device from the list of devices and to make access to
   that device invalid.
2. Since the removal of the device from the software lists is not going to
   be instantaneous, we need a mechanism to handle any accesses to the
   device from the data path until such time as the removal is complete. To
   support that, you propose to add a sigbus handler which will
   automatically replace any mmio bar mappings with some other memory that is
   ok to access - presumably zero memory or similar.

Is this understanding correct?

Point #2 seems reasonably clear to me, but for #1, presumably the trigger
to the bus needs to come from the kernel. What is planned to be used there?

You also talk about using testpmd as a reference for this, but you don't
explain how an application can be notified of a device removal, or even why
that is necessary. Since all applications should now be using the proper
macros when iterating device lists, and not just assuming devices 0-N are
valid, what changes would you see a normal app having to make to be
hotplug-safe?
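
For reference, the kind of iteration meant here can be sketched as follows. It
is written in the spirit of DPDK's RTE_ETH_FOREACH_DEV, but every toy_ name
below is a hypothetical stand-in so that the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PORTS 8

/* Hypothetical port table: non-zero means the port is currently attached. */
static uint8_t toy_port_attached[TOY_MAX_PORTS];

/* Return the first attached port id >= port_id, or TOY_MAX_PORTS if none. */
static uint16_t toy_find_next_port(uint16_t port_id)
{
    while (port_id < TOY_MAX_PORTS && !toy_port_attached[port_id])
        port_id++;
    return port_id;
}

/* Visit only attached ports instead of assuming ids 0..N are all valid. */
#define TOY_FOREACH_PORT(p) \
    for ((p) = toy_find_next_port(0); \
         (p) < TOY_MAX_PORTS; \
         (p) = toy_find_next_port((uint16_t)((p) + 1)))

/* Count the ports the loop visits. */
static int toy_count_ports(void)
{
    uint16_t p;
    int n = 0;

    TOY_FOREACH_PORT(p)
        n++;
    return n;
}
```

An application that loops this way skips a port as soon as it is marked
detached, with no change to the loop itself.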

Regards,
/Bruce

* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-05-29 11:20 ` Bruce Richardson
@ 2018-06-04  1:56   ` Guo, Jia
  2018-06-06 12:54     ` Bruce Richardson
  0 siblings, 1 reply; 10+ messages in thread
From: Guo, Jia @ 2018-06-04  1:56 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, Ananyev, Konstantin, stephen, Yigit, Ferruh, gaetan.rivet,
	Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang,
	Qi Z, Zhang, Helin, jblunck, shreyansh.jain

Hi Bruce,


On 5/29/2018 7:20 PM, Bruce Richardson wrote:
> On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
> <snip>
>>     The hot plug failure handle mechanism should be come across as bellow:
>>
>>     1.      Add a new bus ops “handle_hot-unplug”in bus to handle bus
>>     read/write error, it is bus-specific and each
>>
>>     kind of bus can implement its own logic.
>>
>>     2.      Implement pci bus specific ops“pci_handle_hot_unplug”, in the
>>     function, base on the
>>
>>     failure address to remap memory which belong to the corresponding
>>     device that unplugged.
>>
>>     3.      Implement a new sigbus handler, and register it when start
>>     device event monitoring,
>>
>>     once the MMIO sigbus error exposure, it will trigger the above hot plug
>>     failure handle mechanism,
>>
>>     that will keep app, that working on packet processing, would not be
>>     broken and crash, then could
>>
>>     keep going clean, fail-safe or other working task.
>>
>>     4.      Also also will introduce the solution by use testpmd to show
>>     the example of the whole procedure like that:
>>
>>     device unplug ->failure handle->stop forwarding->stop port->close
>>     port->detach port.
>>
> Hi Jeff,
>
> so if I understand this correctly the proposal is that we need two parallel
> solutions to handle safe removal of a device.
>
> 1. We need a solution to support unpluging of the device at the bus level,
>     ie. remove the device from the list of devices and to make access to
>     that device invalid.
> 2. Since the removal of the device from the software lists is not going to
>     be instantaneous, we need a mechanism to handle any accesses to the
>     device from the data path until such time as the removal is complete. To
>     support that, you propose to add a sigbus handler which will
>     automatically replace any mmio bar mappings with some other memory that is
>     ok to access - presumable zero memory or similar.
>
> Is this understanding correct?

I think you are correct about that.

> Point #2 seems reasonably clear to me, but for #1, presumably the trigger
> to the bus needs to come from the kernel. What is planned to be used there?

About point #1, I should clarify that we will use the device event
monitor mechanism to detect the hot unplug event.
The monitor is enabled by the app (or the fail-safe driver), and the app
(or fail-safe driver) registers the event callback. Once hot unplug is
detected, the user's callback is triggered to let the app (or fail-safe
driver) know about the event and manage the process: it calls the APIs
to stop the device and detach it from the bus.
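
The callback flow described above might look roughly like this. It is a sketch
only: the toy_ registry, event enum and port structure are hypothetical and do
not reproduce the EAL device event API, and the teardown steps are reduced to
flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_dev_event { TOY_DEV_EVENT_ADD, TOY_DEV_EVENT_REMOVE };

typedef void (*toy_dev_event_cb)(const char *dev_name,
                                 enum toy_dev_event ev, void *arg);

#define TOY_MAX_CBS 8

struct toy_cb_entry {
    const char *dev_name; /* NULL matches every device */
    toy_dev_event_cb cb;
    void *arg;
};

static struct toy_cb_entry toy_cbs[TOY_MAX_CBS];
static size_t toy_cb_count;

/* Register a callback for one device; returns 0 on success, -1 on error. */
int toy_dev_event_callback_register(const char *dev_name,
                                    toy_dev_event_cb cb, void *arg)
{
    if (cb == NULL || toy_cb_count == TOY_MAX_CBS)
        return -1;
    toy_cbs[toy_cb_count].dev_name = dev_name;
    toy_cbs[toy_cb_count].cb = cb;
    toy_cbs[toy_cb_count].arg = arg;
    toy_cb_count++;
    return 0;
}

/* Called by the monitor on an event; returns how many callbacks ran. */
int toy_dev_event_dispatch(const char *dev_name, enum toy_dev_event ev)
{
    int called = 0;
    size_t i;

    for (i = 0; i < toy_cb_count; i++) {
        if (toy_cbs[i].dev_name == NULL ||
            strcmp(toy_cbs[i].dev_name, dev_name) == 0) {
            toy_cbs[i].cb(dev_name, ev, toy_cbs[i].arg);
            called++;
        }
    }
    return called;
}

/* Hypothetical application-side port state. */
struct toy_port {
    int forwarding, started, open, attached;
};

/* Application callback: tear the port down in the order given in the RFC. */
void toy_on_removal(const char *dev_name, enum toy_dev_event ev, void *arg)
{
    struct toy_port *port = arg;

    (void)dev_name;
    if (ev != TOY_DEV_EVENT_REMOVE)
        return;
    port->forwarding = 0; /* stop forwarding */
    port->started = 0;    /* stop port */
    port->open = 0;       /* close port */
    port->attached = 0;   /* detach port */
}
```

In this sketch an event for a different device leaves the port untouched, which
is why the callback is registered per device name.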

> You also talk about using testpmd as a reference for this, but you don't
> explain how an application can be notified of a device removal, or even why
> that is necessary. Since all applications should now be using the proper
> macros when iterating device lists, and not just assuming devices 0-N are
> valid, what changes would you see a normal app having to make to be
> hotplug-safe?

Either an app or the fail-safe driver can use this mechanism, but at
this stage I will use testpmd as the reference first.
As in the reply above, testpmd should enable the device event mechanism
to monitor device removal, and register a callback.
The device BDF list is managed by the bus and the hotplug failure handler
is processed by the EAL layer, so the app will be hotplug-safe.

Is there anything I have missed? Please shout. I will give more detail
later.
> Regards,
> /Bruce

* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-06-04  1:56   ` Guo, Jia
@ 2018-06-06 12:54     ` Bruce Richardson
  2018-06-06 13:11       ` Ananyev, Konstantin
  2018-06-07  2:14       ` Guo, Jia
  0 siblings, 2 replies; 10+ messages in thread
From: Bruce Richardson @ 2018-06-06 12:54 UTC (permalink / raw)
  To: Guo, Jia, techboard
  Cc: dev, Ananyev, Konstantin, stephen, Yigit, Ferruh, gaetan.rivet,
	Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang,
	Qi Z, Zhang, Helin, jblunck, shreyansh.jain

+Tech-board as I think that this should have more input at the design stage
ahead of any code patches being pushed.

On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
> hi,bruce
> 
> 
> On 5/29/2018 7:20 PM, Bruce Richardson wrote:
> > On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
> > <snip>
> > >     The hot plug failure handle mechanism should be come across as bellow:
> > > 
> > >     1.      Add a new bus ops “handle_hot-unplug”in bus to handle bus
> > >     read/write error, it is bus-specific and each
> > > 
> > >     kind of bus can implement its own logic.
> > > 
> > >     2.      Implement pci bus specific ops“pci_handle_hot_unplug”, in the
> > >     function, base on the
> > > 
> > >     failure address to remap memory which belong to the corresponding
> > >     device that unplugged.
> > > 
> > >     3.      Implement a new sigbus handler, and register it when start
> > >     device event monitoring,
> > > 
> > >     once the MMIO sigbus error exposure, it will trigger the above hot plug
> > >     failure handle mechanism,
> > > 
> > >     that will keep app, that working on packet processing, would not be
> > >     broken and crash, then could
> > > 
> > >     keep going clean, fail-safe or other working task.
> > > 
> > >     4.      Also also will introduce the solution by use testpmd to show
> > >     the example of the whole procedure like that:
> > > 
> > >     device unplug ->failure handle->stop forwarding->stop port->close
> > >     port->detach port.
> > > 
> > Hi Jeff,
> > 
> > so if I understand this correctly the proposal is that we need two parallel
> > solutions to handle safe removal of a device.
> > 
> > 1. We need a solution to support unpluging of the device at the bus level,
> >     ie. remove the device from the list of devices and to make access to
> >     that device invalid.
> > 2. Since the removal of the device from the software lists is not going to
> >     be instantaneous, we need a mechanism to handle any accesses to the
> >     device from the data path until such time as the removal is complete. To
> >     support that, you propose to add a sigbus handler which will
> >     automatically replace any mmio bar mappings with some other memory that is
> >     ok to access - presumable zero memory or similar.
> > 
> > Is this understanding correct?
> 
> i think you are correct about that.
> 
> > Point #2 seems reasonably clear to me, but for #1, presumably the trigger
> > to the bus needs to come from the kernel. What is planned to be used there?
> 
> about point #1, i should clarify here is that, we will use the device event
> monitor mechanism to detect the hot unplug event.
> the monitor be enabled by app(or fail-safe driver), and app(fail-safe
> driver) register the event callback. Once the hot unplug behave be detected,
> the user's callback could be triggered to let app(fail-safe driver) know the
> event and manage the process, it will call APIs to stop the device
> and detach the device from the bus.

Ok. If there is no failsafe driver, and the app does not set up a handler,
does nothing happen when we get a removal event? Will the app just crash?

> 
> > You also talk about using testpmd as a reference for this, but you don't
> > explain how an application can be notified of a device removal, or even why
> > that is necessary. Since all applications should now be using the proper
> > macros when iterating device lists, and not just assuming devices 0-N are
> > valid, what changes would you see a normal app having to make to be
> > hotplug-safe?
> 
> we could use app or fail-safe driver to use these mechanism , but at this
> stage i will firstly use testpmd as a reference.
> as above reply, testpmd should enable device event mechanism to monitor the
> device removal, and register callback,
> the device bdf list is managed by bus and the hoplug fail handler will be
> process by the eal layer, then the app would be hotplug-safe.
> 
> is there anything i miss to clarify? please shout. and i think i will detail
> more later.

This is becoming clearer now, thanks. Just the one question above I have at
this point.
Given how long-running this issue of hotplug is, I'm hoping others on the
technical board can also review this proposal.

Regards,
/Bruce

* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-06-06 12:54     ` Bruce Richardson
@ 2018-06-06 13:11       ` Ananyev, Konstantin
  2018-06-07  2:14       ` Guo, Jia
  1 sibling, 0 replies; 10+ messages in thread
From: Ananyev, Konstantin @ 2018-06-06 13:11 UTC (permalink / raw)
  To: Richardson, Bruce, Guo, Jia, techboard
  Cc: dev, stephen, Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, thomas,
	motih, matan, Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin,
	jblunck, shreyansh.jain



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Wednesday, June 6, 2018 1:55 PM
> To: Guo, Jia <jia.guo@intel.com>; techboard@dpdk.org
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>; stephen@networkplumber.org; Yigit, Ferruh
> <ferruh.yigit@intel.com>; gaetan.rivet@6wind.com; Wu, Jingjing <jingjing.wu@intel.com>; thomas@monjalon.net;
> motih@mellanox.com; matan@mellanox.com; Van Haaren, Harry <harry.van.haaren@intel.com>; Zhang, Qi Z
> <qi.z.zhang@intel.com>; Zhang, Helin <helin.zhang@intel.com>; jblunck@infradead.org; shreyansh.jain@nxp.com
> Subject: Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
> 
> +Tech-board as I think that this should have more input at the design stage
> ahead of any code patches being pushed.
> 
> On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
> > hi,bruce
> >
> >
> > On 5/29/2018 7:20 PM, Bruce Richardson wrote:
> > > On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
> > > <snip>
> > > >     The hot plug failure handle mechanism should be come across as bellow:
> > > >
> > > >     1.      Add a new bus ops “handle_hot-unplug”in bus to handle bus
> > > >     read/write error, it is bus-specific and each
> > > >
> > > >     kind of bus can implement its own logic.
> > > >
> > > >     2.      Implement pci bus specific ops“pci_handle_hot_unplug”, in the
> > > >     function, base on the
> > > >
> > > >     failure address to remap memory which belong to the corresponding
> > > >     device that unplugged.
> > > >
> > > >     3.      Implement a new sigbus handler, and register it when start
> > > >     device event monitoring,
> > > >
> > > >     once the MMIO sigbus error exposure, it will trigger the above hot plug
> > > >     failure handle mechanism,
> > > >
> > > >     that will keep app, that working on packet processing, would not be
> > > >     broken and crash, then could
> > > >
> > > >     keep going clean, fail-safe or other working task.
> > > >
> > > >     4.      Also also will introduce the solution by use testpmd to show
> > > >     the example of the whole procedure like that:
> > > >
> > > >     device unplug ->failure handle->stop forwarding->stop port->close
> > > >     port->detach port.
> > > >
> > > Hi Jeff,
> > >
> > > so if I understand this correctly the proposal is that we need two parallel
> > > solutions to handle safe removal of a device.
> > >
> > > 1. We need a solution to support unpluging of the device at the bus level,
> > >     ie. remove the device from the list of devices and to make access to
> > >     that device invalid.
> > > 2. Since the removal of the device from the software lists is not going to
> > >     be instantaneous, we need a mechanism to handle any accesses to the
> > >     device from the data path until such time as the removal is complete. To
> > >     support that, you propose to add a sigbus handler which will
> > >     automatically replace any mmio bar mappings with some other memory that is
> > >     ok to access - presumable zero memory or similar.
> > >
> > > Is this understanding correct?
> >
> > i think you are correct about that.
> >
> > > Point #2 seems reasonably clear to me, but for #1, presumably the trigger
> > > to the bus needs to come from the kernel. What is planned to be used there?
> >
> > about point #1, i should clarify here is that, we will use the device event
> > monitor mechanism to detect the hot unplug event.
> > the monitor be enabled by app(or fail-safe driver), and app(fail-safe
> > driver) register the event callback. Once the hot unplug behave be detected,
> > the user's callback could be triggered to let app(fail-safe driver) know the
> > event and manage the process, it will call APIs to stop the device
> > and detach the device from the bus.
> 
> Ok. If there is no failsafe driver, and the app does not set up a handler,
> does nothing happen when we get a removal event? Will the app just crash?
> 
> >
> > > You also talk about using testpmd as a reference for this, but you don't
> > > explain how an application can be notified of a device removal, or even why
> > > that is necessary. Since all applications should now be using the proper
> > > macros when iterating device lists, and not just assuming devices 0-N are
> > > valid, what changes would you see a normal app having to make to be
> > > hotplug-safe?
> >
> > we could use app or fail-safe driver to use these mechanism , but at this
> > stage i will firstly use testpmd as a reference.
> > as above reply, testpmd should enable device event mechanism to monitor the
> > device removal, and register callback,
> > the device bdf list is managed by bus and the hoplug fail handler will be
> > process by the eal layer, then the app would be hotplug-safe.
> >
> > is there anything i miss to clarify? please shout. and i think i will detail
> > more later.
> 
> This is becoming clearer now, thanks. Just the one question above I have at
> this point.
> Given how long-running this issue of hotplug is, I'm hoping others on the
> technical board can also review this proposal.

I looked at the actual code a bit for 18.05.
It seems ok to me in general, though I provided a few comments regarding
particular implementation details.
Konstantin



* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-06-06 12:54     ` Bruce Richardson
  2018-06-06 13:11       ` Ananyev, Konstantin
@ 2018-06-07  2:14       ` Guo, Jia
  2018-06-14 21:37         ` Thomas Monjalon
  1 sibling, 1 reply; 10+ messages in thread
From: Guo, Jia @ 2018-06-07  2:14 UTC (permalink / raw)
  To: Bruce Richardson, techboard
  Cc: dev, Ananyev, Konstantin, stephen, Yigit, Ferruh, gaetan.rivet,
	Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang,
	Qi Z, Zhang, Helin, jblunck, shreyansh.jain



On 6/6/2018 8:54 PM, Bruce Richardson wrote:
> +Tech-board as I think that this should have more input at the design stage
> ahead of any code patches being pushed.
>
> On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
>> hi,bruce
>>
>>
>> On 5/29/2018 7:20 PM, Bruce Richardson wrote:
>>> On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
>>> <snip>
>>>>      The hot plug failure handle mechanism should be come across as bellow:
>>>>
>>>>      1.      Add a new bus ops “handle_hot-unplug”in bus to handle bus
>>>>      read/write error, it is bus-specific and each
>>>>
>>>>      kind of bus can implement its own logic.
>>>>
>>>>      2.      Implement pci bus specific ops“pci_handle_hot_unplug”, in the
>>>>      function, base on the
>>>>
>>>>      failure address to remap memory which belong to the corresponding
>>>>      device that unplugged.
>>>>
>>>>      3.      Implement a new sigbus handler, and register it when start
>>>>      device event monitoring,
>>>>
>>>>      once the MMIO sigbus error exposure, it will trigger the above hot plug
>>>>      failure handle mechanism,
>>>>
>>>>      that will keep app, that working on packet processing, would not be
>>>>      broken and crash, then could
>>>>
>>>>      keep going clean, fail-safe or other working task.
>>>>
>>>>      4.      Also also will introduce the solution by use testpmd to show
>>>>      the example of the whole procedure like that:
>>>>
>>>>      device unplug ->failure handle->stop forwarding->stop port->close
>>>>      port->detach port.
>>>>
>>> Hi Jeff,
>>>
>>> so if I understand this correctly the proposal is that we need two parallel
>>> solutions to handle safe removal of a device.
>>>
>>> 1. We need a solution to support unpluging of the device at the bus level,
>>>      ie. remove the device from the list of devices and to make access to
>>>      that device invalid.
>>> 2. Since the removal of the device from the software lists is not going to
>>>      be instantaneous, we need a mechanism to handle any accesses to the
>>>      device from the data path until such time as the removal is complete. To
>>>      support that, you propose to add a sigbus handler which will
>>>      automatically replace any mmio bar mappings with some other memory that is
>>>      ok to access - presumable zero memory or similar.
>>>
>>> Is this understanding correct?
>> i think you are correct about that.
>>
>>> Point #2 seems reasonably clear to me, but for #1, presumably the trigger
>>> to the bus needs to come from the kernel. What is planned to be used there?
>> about point #1, i should clarify here is that, we will use the device event
>> monitor mechanism to detect the hot unplug event.
>> the monitor be enabled by app(or fail-safe driver), and app(fail-safe
>> driver) register the event callback. Once the hot unplug behave be detected,
>> the user's callback could be triggered to let app(fail-safe driver) know the
>> event and manage the process, it will call APIs to stop the device
>> and detach the device from the bus.
> Ok. If there is no failsafe driver, and the app does not set up a handler,
> does nothing happen when we get a removal event? Will the app just crash?

When the app enables the device event monitor, the handler is set up
automatically; the app or fail-safe driver does not need to, and cannot,
set it up directly.
So if an app wants to process this hot plug event, all it needs to do is
enable the hot plug event monitor and register its own callback;
the app will then not crash when a hot unplug occurs.
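
To illustrate the flow described above, here is a minimal, self-contained
sketch of "enable the monitor, register a callback, tear the port down on
removal". All names in it (monitor_start, callback_register,
monitor_dispatch, struct port_state) are hypothetical stand-ins invented for
this example; it is not the DPDK API and only models the ordering
stop forwarding -> stop port -> close port -> detach port.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical event types, loosely modelled on device add/remove events. */
enum dev_event { DEV_EVENT_ADD, DEV_EVENT_REMOVE };

typedef void (*dev_event_cb)(const char *dev_name, enum dev_event ev,
			     void *arg);

#define MAX_CBS 8

struct cb_entry {
	dev_event_cb fn;
	void *arg;
};

static struct cb_entry cbs[MAX_CBS];
static int n_cbs;
static int monitor_on;

/* The app only has to switch the monitor on; no events flow before that. */
void monitor_start(void)
{
	monitor_on = 1;
}

/* Register an application callback; returns 0 on success, -1 if full. */
int callback_register(dev_event_cb fn, void *arg)
{
	if (n_cbs == MAX_CBS)
		return -1;
	cbs[n_cbs].fn = fn;
	cbs[n_cbs].arg = arg;
	n_cbs++;
	return 0;
}

/* Simulated monitor: deliver one event, return how many callbacks ran. */
int monitor_dispatch(const char *dev_name, enum dev_event ev)
{
	int i, called = 0;

	if (!monitor_on)
		return 0;
	for (i = 0; i < n_cbs; i++) {
		cbs[i].fn(dev_name, ev, cbs[i].arg);
		called++;
	}
	return called;
}

/* Hypothetical per-port state kept by the application. */
struct port_state {
	int forwarding;
	int started;
	int open;
	int attached;
};

/* Application callback: on removal, walk the teardown steps in order. */
void app_on_event(const char *dev_name, enum dev_event ev, void *arg)
{
	struct port_state *p = arg;

	if (ev != DEV_EVENT_REMOVE)
		return;
	printf("%s removed, tearing down\n", dev_name);
	p->forwarding = 0;	/* stop forwarding */
	p->started = 0;		/* stop port */
	p->open = 0;		/* close port */
	p->attached = 0;	/* detach port */
}
```

An application would call monitor_start() once, register app_on_event with
its port state, and then rely on the monitor to invoke the callback when a
removal is reported.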

>>> You also talk about using testpmd as a reference for this, but you don't
>>> explain how an application can be notified of a device removal, or even why
>>> that is necessary. Since all applications should now be using the proper
>>> macros when iterating device lists, and not just assuming devices 0-N are
>>> valid, what changes would you see a normal app having to make to be
>>> hotplug-safe?
>> The app or the fail-safe driver could use this mechanism, but at this
>> stage I will first use testpmd as a reference.
>> As in the reply above, testpmd should enable the device event mechanism to
>> monitor device removal, and register a callback.
>> The device BDF list is managed by the bus and the hotplug failure handler
>> will be processed by the EAL layer, so the app would be hotplug-safe.
>>
>> Is there anything I missed clarifying? Please shout. I think I will give
>> more detail later.
> This is becoming clearer now, thanks. Just the one question above I have at
> this point.
> Given how long-running this issue of hotplug is, I'm hoping others on the
> technical board can also review this proposal.
>
> Regards,
> /Bruce


* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-06-07  2:14       ` Guo, Jia
@ 2018-06-14 21:37         ` Thomas Monjalon
  2018-06-15  8:31           ` Guo, Jia
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Monjalon @ 2018-06-14 21:37 UTC (permalink / raw)
  To: Guo, Jia
  Cc: dev, Bruce Richardson, techboard, Ananyev, Konstantin, stephen,
	Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, motih, matan,
	Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, shreyansh.jain

Hi,

I am sorry, it is very hard to be sure we correctly understand
your thoughts. I like the proposal, but I want to be sure
my understanding was not biased by what I would like to read :)
So I try to reword it below. Please confirm it matches your intent.

Hot unplug can happen when a hardware device is removed physically,
or when the software disables it. In both cases, the datapath will fail.
When the unplug is detected, we must stop and close the related instance
of the driver.
The detection can be done with hotplug monitoring (like uevent)
- this is RTE_DEV_EVENT_REMOVE - or by handling the failure
in control path or data path - this is RTE_ETH_EVENT_INTR_RMV.
Between the unplug event and its detection, we need to manage
any related failure. That's why you propose a sigbus handler
which will avoid the crash, and can be used to detect the unplug.

Please confirm this is what you thought.
If not, do you agree, or am I missing something?

I would like to be sure the sigbus handler will not hide any other
unrelated failure.
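
As a rough illustration of the sigbus idea summarised above, the sketch
below simulates an unplug on Linux by mapping a temporary file and then
truncating it, so that the next access raises SIGBUS. The handler replaces
the mapping with anonymous zero memory only when the fault address lies
inside the recorded region; any other SIGBUS is passed on to the default
action. The names (region_base, region_len, demo_unplug) are assumptions
made for this example, not part of any patch. Note that mmap() is not on
the POSIX list of async-signal-safe functions, so calling it from a handler
is itself an assumption of this sketch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical record of one device mapping (stand-in for a BAR). */
static volatile uintptr_t region_base;
static volatile size_t region_len;
static volatile sig_atomic_t unplug_seen;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;

	(void)ctx;
	if (region_len != 0 && addr >= region_base &&
	    addr - region_base < region_len) {
		/* Replace the dead mapping with zero-filled anonymous pages. */
		void *p = mmap((void *)region_base, region_len,
			       PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
		if (p != MAP_FAILED) {
			unplug_seen = 1;
			return;	/* the faulting access is retried */
		}
	}
	/* Not a known device region: do not hide the failure. */
	signal(sig, SIG_DFL);
	raise(sig);
}

/* Returns 0 if the simulated unplug was survived and detected, else -1. */
int demo_unplug(void)
{
	char path[] = "/tmp/hotunplug-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa;
	volatile uint8_t *bar;
	uint8_t val;
	int fd = mkstemp(path);

	if (fd < 0 || ftruncate(fd, page) != 0)
		return -1;
	unlink(path);
	bar = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, 0);
	if (bar == MAP_FAILED)
		return -1;
	bar[0] = 0xab;			/* "device" is present and writable */

	region_base = (uintptr_t)bar;
	region_len = (size_t)page;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, NULL) != 0)
		return -1;

	if (ftruncate(fd, 0) != 0)	/* "unplug": backing pages vanish */
		return -1;
	val = bar[0];			/* faults, handler remaps, read retried */

	munmap((void *)bar, (size_t)page);
	close(fd);
	return (unplug_seen == 1 && val == 0) ? 0 : -1;
}
```

The address check in the handler is what keeps unrelated bus errors visible:
a SIGBUS outside the recorded region still terminates the process with the
default action.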




* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-06-14 21:37         ` Thomas Monjalon
@ 2018-06-15  8:31           ` Guo, Jia
  0 siblings, 0 replies; 10+ messages in thread
From: Guo, Jia @ 2018-06-15  8:31 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Bruce Richardson, techboard, Ananyev, Konstantin, stephen,
	Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, motih, matan,
	Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, shreyansh.jain



On 6/15/2018 5:37 AM, Thomas Monjalon wrote:
> Hi,
>
> I am sorry, it is very hard to be sure we correctly understand
> your thoughts. I like the proposal, but I want to be sure
> my understanding was not biased by what I would like to read :)
> So I try to reword below. Please confirm it matches your intent.
>
> Hot unplug can happen when a hardware device is removed physically,
> or when the software disables it. In both cases, the datapath will fail.
> When the unplug is detected, we must stop and close the related instance
> of the driver.
> The detection can be done with hotplug monitoring (like uevent)
> - this is RTE_DEV_EVENT_REMOVE - or by handling the failure
> in control path or data path - this is RTE_ETH_EVENT_INTR_RMV.
> Between the unplug event and its detection, we need to manage
> any related failure. That's why you propose a sigbus handler
> which will avoid the crash, and can be used to detect the unplug.
>
> Please confirm this is what you thought.
> If not, do you agree, or am I missing something?

I think that is what I want to propose here.

> I would like to be sure the sigbus handler will not hide any other
> unrelated failure.

I agree. Even though the sigbus handler is used as a hot plug exception
handler in this case, it definitely should not affect the handling of any
other failure. I will cover all of this in my patch.
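
One possible shape for that check, sketched here under assumptions: each bus
exposes a hot-unplug op that claims a failure address only if it falls
inside a region of one of its devices, and the caller treats "nobody claimed
it" as an unrelated failure to be passed to the previous or default handler.
The structures and the helper handle_bus_error are hypothetical, and the
remap step is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one mapped region of a device, e.g. a PCI BAR. */
struct dev_region {
	const char *dev_name;
	uintptr_t base;
	size_t len;
	int remapped;
};

/* Hypothetical bus with a bus-specific hot-unplug op. */
struct bus {
	const char *name;
	struct dev_region *regions;
	size_t n_regions;
	/* Returns 0 if fail_addr belongs to this bus and was handled. */
	int (*handle_hot_unplug)(struct bus *bus, uintptr_t fail_addr);
};

/* PCI flavour: find the region containing the failure address. */
int pci_handle_hot_unplug(struct bus *bus, uintptr_t fail_addr)
{
	size_t i;

	for (i = 0; i < bus->n_regions; i++) {
		struct dev_region *r = &bus->regions[i];

		if (fail_addr >= r->base && fail_addr - r->base < r->len) {
			r->remapped = 1; /* a real op would remap memory here */
			return 0;
		}
	}
	return -1;
}

/*
 * Ask every bus in turn. Returns 0 if some bus claimed the address, -1 if
 * none did, in which case the caller should not swallow the error.
 */
int handle_bus_error(struct bus *buses, size_t n_buses, uintptr_t fail_addr)
{
	size_t i;

	for (i = 0; i < n_buses; i++) {
		if (buses[i].handle_hot_unplug != NULL &&
		    buses[i].handle_hot_unplug(&buses[i], fail_addr) == 0)
			return 0;
	}
	return -1;
}
```

With this split, the signal handler stays generic and each bus keeps its own
logic for deciding whether an address is one of its device mappings.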



end of thread, other threads:[~2018-06-15  8:31 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
2018-05-24  6:55 [dpdk-dev] [RFC] hot plug failure handle mechanism Guo, Jia
2018-05-24 14:57 ` Matan Azrad
2018-05-25  7:49   ` Guo, Jia
2018-05-29 11:20 ` Bruce Richardson
2018-06-04  1:56   ` Guo, Jia
2018-06-06 12:54     ` Bruce Richardson
2018-06-06 13:11       ` Ananyev, Konstantin
2018-06-07  2:14       ` Guo, Jia
2018-06-14 21:37         ` Thomas Monjalon
2018-06-15  8:31           ` Guo, Jia
