* [dpdk-dev] [RFC] hot plug failure handle mechanism
From: Guo, Jia @ 2018-05-24 6:55 UTC
To: dev
Cc: Ananyev, Konstantin, stephen, Richardson, Bruce, Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, jblunck, shreyansh.jain, Guo, Jia

As we know, hot plug is an important feature, whether it is used for
fail-safe and consumption management of datacenter devices, or for dynamic
deployment and SR-IOV live migration in SDN/NFV. It can bring higher
flexibility and continuity to networking services in multiple industry use
cases.

So let us see: DPDK is an important networking framework that combines
packet control path/fast path libraries with a diverse set of PMD drivers.
What can it do to help applications that want to build a hot plug solution
while they are processing packets with DPDK?

We already have a general device event mechanism, the failsafe driver, the
bonding driver and hot plug/unplug APIs in the framework, and applications
can use these APIs to develop their functionality. But for hot plug failure
handling, that is, the case where removing a device at run time causes the
application to trigger an MMIO error and crash, there is no mechanism to
handle the failure when a device is hot unplugged. At present, the kernel
only guarantees safer hotplug handling on the kernel side. On the user mode
side, no specific third-party tool such as udev/driverctl covers this part
of the mechanism. Considering the feasibility of the implementation, the
runtime performance, and generality across almost all user mode PMD
drivers, a general hot plug failure handling mechanism in the DPDK
framework is proposed here.

The hot plug failure handling mechanism would work as below:

1. Add a new bus op "handle_hot_unplug" to the bus to handle bus
   read/write errors. It is bus-specific, and each kind of bus can
   implement its own logic.
2. Implement the PCI bus specific op "pci_handle_hot_unplug". In this
   function, based on the failure address, remap the memory that belongs
   to the corresponding unplugged device.
3. Implement a new SIGBUS handler and register it when device event
   monitoring starts. Once an MMIO SIGBUS error occurs, it triggers the
   hot plug failure handling mechanism above, so that an application
   working on packet processing is not broken and does not crash, and can
   then go on to clean up, fail-safe, or do other work.
4. Also introduce the solution in testpmd to show an example of the whole
   procedure:
   device unplug -> failure handle -> stop forwarding -> stop port ->
   close port -> detach port.

Best regards,
Jeff Guo
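Steps 1 and 2 of the proposal can be sketched in plain C as follows. This is only an illustrative model, not DPDK code: every `demo_*` name and the structure layout are hypothetical, and the real remapping of the BAR is reduced to setting a flag. The point it shows is the lookup by failure address, and that the op reports "not mine" for an address outside every known BAR so that an unrelated fault is not swallowed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a device with one mapped BAR (not a DPDK type). */
struct demo_device {
	const char *name;
	uintptr_t bar_start;   /* start of the mapped BAR in process memory */
	size_t bar_len;        /* length of the mapping in bytes */
	int unplugged;         /* set once a failure has been handled */
};

/* Hypothetical bus carrying the proposed per-bus failure op. */
struct demo_bus {
	const char *name;
	struct demo_device *devs;
	size_t nb_devs;
	int (*handle_hot_unplug)(struct demo_bus *bus, const void *failure_addr);
};

/*
 * PCI flavour of the op: find the device whose BAR contains the failing
 * address. A real implementation would remap that BAR here; the sketch
 * only marks the device. Returns 0 if handled, -1 if the address does
 * not belong to any device on this bus.
 */
int demo_pci_handle_hot_unplug(struct demo_bus *bus, const void *failure_addr)
{
	uintptr_t addr = (uintptr_t)failure_addr;
	size_t i;

	for (i = 0; i < bus->nb_devs; i++) {
		struct demo_device *dev = &bus->devs[i];

		if (addr >= dev->bar_start && addr - dev->bar_start < dev->bar_len) {
			dev->unplugged = 1;
			return 0;
		}
	}
	return -1;
}

/* What a SIGBUS handler would call: try each bus until one claims the fault. */
int demo_bus_failure_dispatch(struct demo_bus *buses, size_t nb_buses,
			      const void *failure_addr)
{
	size_t i;

	for (i = 0; i < nb_buses; i++) {
		if (buses[i].handle_hot_unplug != NULL &&
		    buses[i].handle_hot_unplug(&buses[i], failure_addr) == 0)
			return 0;
	}
	return -1; /* unrelated fault: the caller should not hide it */
}
```

Keeping the op behind a function pointer on the bus mirrors the "each kind of bus can implement its own logic" wording of step 1; a non-PCI bus would simply plug in a different function.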
* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
From: Matan Azrad @ 2018-05-24 14:57 UTC
To: Guo, Jia, dev
Cc: Ananyev, Konstantin, stephen, Richardson, Bruce, Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, Thomas Monjalon, Mordechay Haimovsky, Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, jblunck, shreyansh.jain

Hi Guo

Some questions.

From: Guo Jia
<snip>
> 3. Implement a new sigbus handler, and register it when start device event
> monitoring, once the MMIO sigbus error exposure, it will trigger the above hot
> plug failure handle mechanism, that will keep app, that working on packet
> processing, would not be broken and crash, then could keep going clean, fail-
> safe or other working task.

Can you explain more about what happens with all the threads (master
thread, host thread, data-path threads)?
Can the signal happen only in a data-path thread, or also in a control
thread?

What about resource leaks (mainly relevant for control threads)?
If you jump from the signal address to the restart address, how can you
clean up the process which was started and got the signal?

Matan.
* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
From: Guo, Jia @ 2018-05-25 7:49 UTC
To: Matan Azrad, dev
Cc: Ananyev, Konstantin, stephen, Richardson, Bruce, Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, Thomas Monjalon, Mordechay Haimovsky, Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, jblunck, shreyansh.jain

hi, matan

On 5/24/2018 10:57 PM, Matan Azrad wrote:
<snip>
> Can you explain more what's happened with all the threads? Master thread,
> host thread, data-path threads,
> The signal may happened only in a datapath thread or even from a control
> thread?

Let me explain. The SIGBUS handler is registered per process. Because of
how the signal mechanism works, any thread, control or data-path, may be
the one that receives the SIGBUS error, but each of them ends up in the
common SIGBUS handler. The handler finds the device according to the
failure address and then remaps the memory for that device.

> What's about resource leak? (mainly relevant for control threads):
> If you jump from the signal address to the restart address, how can you
> clean the process which was started and got the signal?

It will not use a long jump to go back to a restart address. It just
captures the SIGBUS event, does the failure handling, and then lets the
thread keep going from its current position.

> Matan.
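The "handle the fault, then keep going from the current position" behaviour described in this reply can be tried out with a small Linux-only sketch. It assumes nothing from DPDK: a file-backed mapping stands in for a device BAR, and truncating the file stands in for the unplug. All `demo_*` names are hypothetical. Two general facts it relies on: the handler disposition is process-wide, while a SIGBUS caused by a memory access is delivered to the thread that made the access; and returning normally from the handler restarts the faulting instruction, so no long jump is needed. Note that POSIX does not list mmap() as async-signal-safe, so calling it from a handler is an assumption that holds in practice on Linux rather than a guarantee.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t demo_sigbus_hits; /* faults handled so far */
static long demo_page_size;

/* Replace the faulting page with anonymous zero memory, then return so
 * the interrupted load/store is retried at the same position. */
static void demo_sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t page = (uintptr_t)info->si_addr &
			 ~((uintptr_t)demo_page_size - 1);

	(void)sig;
	(void)ctx;
	if (mmap((void *)page, (size_t)demo_page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
		_exit(1);
	demo_sigbus_hits++;
}

/* Install the process-wide handler; must be called before the demo below. */
int demo_install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = demo_sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	demo_page_size = sysconf(_SC_PAGESIZE);
	return sigaction(SIGBUS, &sa, NULL);
}

/* Map one page of a temp file, "unplug" it by truncating the file, then
 * read it again. Returns the value read after the fault, or -1 on error. */
int demo_unplug_and_read(void)
{
	char path[] = "/tmp/hotplug_demoXXXXXX";
	volatile uint32_t *bar;
	uint32_t value;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, demo_page_size) != 0) {
		close(fd);
		return -1;
	}
	bar = mmap(NULL, (size_t)demo_page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (bar == MAP_FAILED) {
		close(fd);
		return -1;
	}
	bar[0] = 0xdeadbeef;           /* the "device" is alive */
	if (ftruncate(fd, 0) != 0) {   /* the "device" goes away */
		munmap((void *)bar, (size_t)demo_page_size);
		close(fd);
		return -1;
	}
	value = bar[0];                /* SIGBUS, remap, load retried */
	munmap((void *)bar, (size_t)demo_page_size);
	close(fd);
	return (int)value;
}
```

After the handler runs, the retried read sees the zero-filled replacement page instead of crashing the process, which is the window the application needs to stop and detach the port.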
* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
From: Bruce Richardson @ 2018-05-29 11:20 UTC
To: Guo, Jia
Cc: dev, Ananyev, Konstantin, stephen, Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, jblunck, shreyansh.jain

On Thu, May 24, 2018 at 07:55:43AM +0100, Guo, Jia wrote:
<snip>
> The hot plug failure handle mechanism should be come across as bellow:
>
> 1. Add a new bus ops "handle_hot-unplug" in bus to handle bus read/write
>    error, it is bus-specific and each kind of bus can implement its own
>    logic.
> 2. Implement pci bus specific ops "pci_handle_hot_unplug", in the
>    function, base on the failure address to remap memory which belong to
>    the corresponding device that unplugged.
> 3. Implement a new sigbus handler, and register it when start device
>    event monitoring, once the MMIO sigbus error exposure, it will trigger
>    the above hot plug failure handle mechanism, that will keep app, that
>    working on packet processing, would not be broken and crash, then
>    could keep going clean, fail-safe or other working task.
> 4. Also also will introduce the solution by use testpmd to show the
>    example of the whole procedure like that:
>    device unplug ->failure handle->stop forwarding->stop port->close
>    port->detach port.
>

Hi Jeff,

so if I understand this correctly the proposal is that we need two parallel
solutions to handle safe removal of a device.

1. We need a solution to support unplugging of the device at the bus level,
   i.e. remove the device from the list of devices and make access to
   that device invalid.
2. Since the removal of the device from the software lists is not going to
   be instantaneous, we need a mechanism to handle any accesses to the
   device from the data path until such time as the removal is complete. To
   support that, you propose to add a sigbus handler which will
   automatically replace any MMIO BAR mappings with some other memory that
   is ok to access - presumably zero memory or similar.

Is this understanding correct?

Point #2 seems reasonably clear to me, but for #1, presumably the trigger
to the bus needs to come from the kernel. What is planned to be used there?

You also talk about using testpmd as a reference for this, but you don't
explain how an application can be notified of a device removal, or even why
that is necessary. Since all applications should now be using the proper
macros when iterating device lists, and not just assuming devices 0-N are
valid, what changes would you see a normal app having to make to be
hotplug-safe?

Regards,
/Bruce
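The remark about "proper macros when iterating device lists" can be illustrated with a small self-contained sketch. DPDK has a macro for this purpose, RTE_ETH_FOREACH_DEV; the code below does not use it and only models the idea with hypothetical `demo_*` names: the iterator skips slots that are not attached, so a port detached after a hot unplug silently drops out of every loop instead of being touched through a stale index.

```c
#include <stdint.h>

#define DEMO_MAX_PORTS 8

enum demo_port_state { DEMO_PORT_UNUSED = 0, DEMO_PORT_ATTACHED };

/* Hypothetical port table; slots may be freed at run time by a detach. */
static enum demo_port_state demo_ports[DEMO_MAX_PORTS];

/* Return the first attached port id >= id, or DEMO_MAX_PORTS if none. */
static uint16_t demo_find_next_port(uint16_t id)
{
	while (id < DEMO_MAX_PORTS && demo_ports[id] != DEMO_PORT_ATTACHED)
		id++;
	return id;
}

/* Iterate over attached ports only, never over the raw 0..N range. */
#define DEMO_FOREACH_PORT(p)                                  \
	for ((p) = demo_find_next_port(0);                    \
	     (p) < DEMO_MAX_PORTS;                            \
	     (p) = demo_find_next_port((uint16_t)((p) + 1)))

void demo_port_attach(uint16_t id) { demo_ports[id] = DEMO_PORT_ATTACHED; }
void demo_port_detach(uint16_t id) { demo_ports[id] = DEMO_PORT_UNUSED; }

/* Example consumer: count the ports an application would poll. */
int demo_count_ports(void)
{
	uint16_t p;
	int n = 0;

	DEMO_FOREACH_PORT(p)
		n++;
	return n;
}
```

An application that loops this way needs no change when a port in the middle of the range disappears, which is the point of the question about what else a normal app would have to do.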
* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
From: Guo, Jia @ 2018-06-04 1:56 UTC
To: Bruce Richardson
Cc: dev, Ananyev, Konstantin, stephen, Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, jblunck, shreyansh.jain

hi, bruce

On 5/29/2018 7:20 PM, Bruce Richardson wrote:
<snip>
> Hi Jeff,
>
> so if I understand this correctly the proposal is that we need two parallel
> solutions to handle safe removal of a device.
>
> 1. We need a solution to support unpluging of the device at the bus level,
>    ie. remove the device from the list of devices and to make access to
>    that device invalid.
> 2. Since the removal of the device from the software lists is not going to
>    be instantaneous, we need a mechanism to handle any accesses to the
>    device from the data path until such time as the removal is complete. To
>    support that, you propose to add a sigbus handler which will
>    automatically replace any mmio bar mappings with some other memory that is
>    ok to access - presumable zero memory or similar.
>
> Is this understanding correct?

I think you are correct about that.

> Point #2 seems reasonably clear to me, but for #1, presumably the trigger
> to the bus needs to come from the kernel. What is planned to be used there?

About point #1, I should clarify that we will use the device event monitor
mechanism to detect the hot unplug event. The monitor is enabled by the app
(or the fail-safe driver), and the app (or fail-safe driver) registers the
event callback. Once the hot unplug is detected, the user's callback is
triggered to let the app (or fail-safe driver) know about the event and
manage the process; it will call APIs to stop the device and detach the
device from the bus.

> You also talk about using testpmd as a reference for this, but you don't
> explain how an application can be notified of a device removal, or even why
> that is necessary. Since all applications should now be using the proper
> macros when iterating device lists, and not just assuming devices 0-N are
> valid, what changes would you see a normal app having to make to be
> hotplug-safe?

An app or the fail-safe driver could use this mechanism, but at this stage
I will first use testpmd as a reference. As in the reply above, testpmd
should enable the device event mechanism to monitor device removal and
register a callback. The device BDF list is managed by the bus and the
hotplug failure handler is processed by the EAL layer, so the app will be
hotplug-safe.

Is there anything I missed? Please shout. I think I will give more detail
later.

> Regards,
> /Bruce
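The callback flow described in this reply (enable the monitor, register a callback, let the callback stop and detach the device) can be modelled in a few lines of plain C. In DPDK the corresponding entry points are rte_dev_event_monitor_start() and rte_dev_event_callback_register(); the sketch below deliberately does not call them and uses hypothetical `demo_*` names throughout, with the port reduced to a few flags so that the ordering from step 4 of the RFC is visible.

```c
#include <stddef.h>
#include <string.h>

enum demo_event_type { DEMO_EVENT_ADD, DEMO_EVENT_REMOVE };

typedef void (*demo_event_cb_fn)(const char *dev_name,
				 enum demo_event_type ev, void *cb_arg);

struct demo_cb_entry {
	const char *dev_name;   /* NULL means "any device" */
	demo_event_cb_fn fn;
	void *arg;
};

#define DEMO_MAX_CALLBACKS 8
static struct demo_cb_entry demo_callbacks[DEMO_MAX_CALLBACKS];
static size_t demo_nb_callbacks;

/* App side: ask to be told about events on one device (or all of them). */
int demo_event_callback_register(const char *dev_name, demo_event_cb_fn fn,
				 void *arg)
{
	if (fn == NULL || demo_nb_callbacks == DEMO_MAX_CALLBACKS)
		return -1;
	demo_callbacks[demo_nb_callbacks].dev_name = dev_name;
	demo_callbacks[demo_nb_callbacks].fn = fn;
	demo_callbacks[demo_nb_callbacks].arg = arg;
	demo_nb_callbacks++;
	return 0;
}

/* Monitor side: called when an event for dev_name has been detected. */
void demo_event_dispatch(const char *dev_name, enum demo_event_type ev)
{
	size_t i;

	for (i = 0; i < demo_nb_callbacks; i++) {
		struct demo_cb_entry *e = &demo_callbacks[i];

		if (e->dev_name == NULL || strcmp(e->dev_name, dev_name) == 0)
			e->fn(dev_name, ev, e->arg);
	}
}

/* Hypothetical stand-in for a testpmd port; steps[] records the order. */
struct demo_port {
	int forwarding, started, opened, attached;
	char steps[64];
};

static void demo_step(struct demo_port *p, const char *name)
{
	if (p->steps[0] != '\0')
		strncat(p->steps, ">", sizeof(p->steps) - strlen(p->steps) - 1);
	strncat(p->steps, name, sizeof(p->steps) - strlen(p->steps) - 1);
}

/* Removal callback: stop forwarding -> stop port -> close port -> detach. */
void demo_removal_cb(const char *dev_name, enum demo_event_type ev,
		     void *cb_arg)
{
	struct demo_port *p = cb_arg;

	(void)dev_name;
	if (ev != DEMO_EVENT_REMOVE)
		return;
	p->forwarding = 0; demo_step(p, "stop_fwd");
	p->started = 0;    demo_step(p, "stop_port");
	p->opened = 0;     demo_step(p, "close_port");
	p->attached = 0;   demo_step(p, "detach_port");
}
```

The callback is only a notification path; the memory fault itself is expected to be absorbed separately by the failure handler in the EAL layer, as the reply says.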
* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
From: Bruce Richardson @ 2018-06-06 12:54 UTC
To: Guo, Jia, techboard
Cc: dev, Ananyev, Konstantin, stephen, Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, jblunck, shreyansh.jain

+Tech-board as I think that this should have more input at the design stage
ahead of any code patches being pushed.

On Mon, Jun 04, 2018 at 09:56:10AM +0800, Guo, Jia wrote:
<snip>
> about point #1, i should clarify here is that, we will use the device event
> monitor mechanism to detect the hot unplug event.
> the monitor be enabled by app(or fail-safe driver), and app(fail-safe
> driver) register the event callback. Once the hot unplug behave be detected,
> the user's callback could be triggered to let app(fail-safe driver) know the
> event and manage the process, it will call APIs to stop the device
> and detach the device from the bus.

Ok. If there is no failsafe driver, and the app does not set up a handler,
does nothing happen when we get a removal event? Will the app just crash?

<snip>
> is there anything i miss to clarify? please shout. and i think i will detail
> more later.

This is becoming clearer now, thanks. Just the one question above I have at
this point.
Given how long-running this issue of hotplug is, I'm hoping others on the
technical board can also review this proposal.

Regards,
/Bruce
* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
From: Ananyev, Konstantin @ 2018-06-06 13:11 UTC
To: Richardson, Bruce, Guo, Jia, techboard
Cc: dev, stephen, Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, jblunck, shreyansh.jain

> -----Original Message-----
> From: Richardson, Bruce
> Sent: Wednesday, June 6, 2018 1:55 PM
> To: Guo, Jia <jia.guo@intel.com>; techboard@dpdk.org
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>; stephen@networkplumber.org; Yigit, Ferruh
> <ferruh.yigit@intel.com>; gaetan.rivet@6wind.com; Wu, Jingjing <jingjing.wu@intel.com>; thomas@monjalon.net;
> motih@mellanox.com; matan@mellanox.com; Van Haaren, Harry <harry.van.haaren@intel.com>; Zhang, Qi Z
> <qi.z.zhang@intel.com>; Zhang, Helin <helin.zhang@intel.com>; jblunck@infradead.org; shreyansh.jain@nxp.com
> Subject: Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
>
> +Tech-board as I think that this should have more input at the design stage
> ahead of any code patches being pushed.
>
<snip>
> Ok. If there is no failsafe driver, and the app does not set up a handler,
> does nothing happen when we get a removal event? Will the app just crash?
>
<snip>
> This is becoming clearer now, thanks. Just the one question above I have at
> this point.
> Given how long-running this issue of hotplug is, I'm hoping others on the
> technical board can also review this proposal.

I looked at the actual code a bit for 18.05.
It seems ok to me in general, though I provided a few comments regarding
particular implementation details.
Konstantin
* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
From: Guo, Jia @ 2018-06-07 2:14 UTC
To: Bruce Richardson, techboard
Cc: dev, Ananyev, Konstantin, stephen, Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, thomas, motih, matan, Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, jblunck, shreyansh.jain

On 6/6/2018 8:54 PM, Bruce Richardson wrote:
> +Tech-board as I think that this should have more input at the design stage
> ahead of any code patches being pushed.
>
<snip>
> Ok. If there is no failsafe driver, and the app does not set up a handler,
> does nothing happen when we get a removal event? Will the app just crash?

When the device event monitor is enabled by the app, the handler is set up
automatically. The app or the fail-safe driver does not need to set it up
directly, and cannot do so. So if an app wants to process this hot plug
event, all it needs to do is enable the hot plug event monitor and register
its own callback; then the app will not crash when a hotplug event occurs.

<snip>
> This is becoming clearer now, thanks. Just the one question above I have at
> this point.
> Given how long-running this issue of hotplug is, I'm hoping others on the
> technical board can also review this proposal.
>
> Regards,
> /Bruce
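The "handler is set up automatically when the monitor is enabled" answer can be pictured with the following POSIX sketch. It is an assumption about how such coupling could look, not the DPDK implementation, and all `demo_*` names are hypothetical. Starting the monitor installs the SIGBUS handler and remembers the previous disposition; stopping it restores that disposition. Because this sketch has no device table, its handler treats every fault as unrelated: it puts the previous disposition back and returns, so the retried access gets whatever behaviour the process had before.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

static struct sigaction demo_old_sigbus;  /* disposition before the monitor */
static int demo_monitor_running;

/* Placeholder handler: a real one would first try the bus failure ops.
 * Here every fault is "not ours", so hand it back to the old disposition. */
static void demo_sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)info;
	(void)ctx;
	sigaction(SIGBUS, &demo_old_sigbus, NULL); /* async-signal-safe */
}

/* Enabling the monitor is the only step the app takes; the handler
 * comes with it. */
int demo_event_monitor_start(void)
{
	struct sigaction sa;

	if (demo_monitor_running)
		return 0;
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = demo_sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &demo_old_sigbus) != 0)
		return -1;
	demo_monitor_running = 1;
	return 0;
}

/* Stopping the monitor gives SIGBUS back to its previous owner. */
int demo_event_monitor_stop(void)
{
	if (!demo_monitor_running)
		return 0;
	if (sigaction(SIGBUS, &demo_old_sigbus, NULL) != 0)
		return -1;
	demo_monitor_running = 0;
	return 0;
}

/* Query helper: is our handler the current SIGBUS disposition? */
int demo_sigbus_handler_installed(void)
{
	struct sigaction cur;

	if (sigaction(SIGBUS, NULL, &cur) != 0)
		return 0;
	return (cur.sa_flags & SA_SIGINFO) &&
	       cur.sa_sigaction == demo_sigbus_handler;
}
```

Saving and restoring the old disposition keeps the monitor from permanently taking SIGBUS away from an application that had its own handler.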
* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-06-07  2:14 ` Guo, Jia
@ 2018-06-14 21:37 ` Thomas Monjalon
  2018-06-15  8:31 ` Guo, Jia
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Monjalon @ 2018-06-14 21:37 UTC
To: Guo, Jia
Cc: dev, Bruce Richardson, techboard, Ananyev, Konstantin, stephen,
    Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, motih, matan,
    Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, shreyansh.jain

Hi,

I am sorry, it is very hard to be sure we understand correctly
your thoughts. I like the proposal, but I want to be sure
my understanding was not biased by what I would like to read :)
So I try to reword below. Please confirm it matches your intent.

Hot unplug can happen when a hardware device is removed physically,
or when the software disables it. In both cases, the datapath will fail.
When the unplug is detected, we must stop and close the related instance
of the driver.
The detection can be done with hotplug monitoring (like uevent)
- this is RTE_DEV_EVENT_REMOVE - or by handling the failure
in control path or data path - this is RTE_ETH_EVENT_INTR_RMV.
Between the unplug event and its detection, we need to manage
any related failure. That's why you propose a sigbus handler
which will avoid the crash, and can be used to detect the unplug.

Please confirm this is what you thought.
If not, do you agree, or am I missing something?

I would like to be sure the sigbus handler will not hide any other
unrelated failure.

07/06/2018 04:14, Guo, Jia:
>
> On 6/6/2018 8:54 PM, Bruce Richardson wrote:
> > +Tech-board as I think that this should have more input at the design stage
> > ahead of any code patches being pushed.
* Re: [dpdk-dev] [RFC] hot plug failure handle mechanism
  2018-06-14 21:37 ` Thomas Monjalon
@ 2018-06-15  8:31 ` Guo, Jia
  0 siblings, 0 replies; 10+ messages in thread
From: Guo, Jia @ 2018-06-15 8:31 UTC
To: Thomas Monjalon
Cc: dev, Bruce Richardson, techboard, Ananyev, Konstantin, stephen,
    Yigit, Ferruh, gaetan.rivet, Wu, Jingjing, motih, matan,
    Van Haaren, Harry, Zhang, Qi Z, Zhang, Helin, shreyansh.jain

On 6/15/2018 5:37 AM, Thomas Monjalon wrote:
> Hi,
>
> I am sorry, it is very hard to be sure we understand correctly
> your thoughts. I like the proposal, but I want to be sure
> my understanding was not biased by what I would like to read :)
> So I try to reword below. Please confirm it matches your intent.
>
> Hot unplug can happen when a hardware device is removed physically,
> or when the software disables it. In both cases, the datapath will fail.
> When the unplug is detected, we must stop and close the related instance
> of the driver.
> The detection can be done with hotplug monitoring (like uevent)
> - this is RTE_DEV_EVENT_REMOVE - or by handling the failure
> in control path or data path - this is RTE_ETH_EVENT_INTR_RMV.
> Between the unplug event and its detection, we need to manage
> any related failure. That's why you propose a sigbus handler
> which will avoid the crash, and can be used to detect the unplug.
>
> Please confirm this is what you thought.
> If not, do you agree, or am I missing something?

I think that is what I want to propose here.

> I would like to be sure the sigbus handler will not hide any other
> unrelated failure.

I agree. Even though the sigbus handler is used as a hotplug exception
handler in this case, it definitely should not affect the handling of any
other failure. I will cover all of this in my patch.

>
> 07/06/2018 04:14, Guo, Jia:
>> On 6/6/2018 8:54 PM, Bruce Richardson wrote:
>>> +Tech-board as I think that this should have more input at the design stage
>>> ahead of any code patches being pushed.
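The application-side flow described in the thread (enable monitoring, register a callback, then tear the port down on removal) can be modeled in a few lines of standalone C. Everything below is a mock with hypothetical names (event_callback_register, event_dispatch, on_dev_event, struct port); it does not use or reproduce the DPDK event API. It only illustrates the ordering from the RFC: stop forwarding, stop port, close port, detach port.

```c
#include <stddef.h>
#include <string.h>

enum dev_event { DEV_EVENT_ADD, DEV_EVENT_REMOVE };

/* Port life cycle, listed in the teardown order given in the RFC. */
enum port_state {
	PORT_FORWARDING, PORT_STARTED, PORT_STOPPED, PORT_CLOSED, PORT_DETACHED
};

struct port {
	const char *dev_name; /* e.g. a PCI BDF string */
	enum port_state state;
};

typedef void (*dev_event_cb)(const char *dev_name, enum dev_event ev,
			     void *arg);

static dev_event_cb registered_cb;
static void *registered_arg;

/* Mock of "register the event callback". */
void event_callback_register(dev_event_cb cb, void *arg)
{
	registered_cb = cb;
	registered_arg = arg;
}

/* Mock of the monitor delivering a kernel event to the application. */
void event_dispatch(const char *dev_name, enum dev_event ev)
{
	if (registered_cb != NULL)
		registered_cb(dev_name, ev, registered_arg);
}

/* Move to the next state only if the port is in the expected one. */
static int step(struct port *p, enum port_state from, enum port_state to)
{
	if (p->state != from)
		return -1;
	p->state = to;
	return 0;
}

/* Application callback: tear the port down when its device is removed. */
void on_dev_event(const char *dev_name, enum dev_event ev, void *arg)
{
	struct port *p = arg;

	if (ev != DEV_EVENT_REMOVE || strcmp(dev_name, p->dev_name) != 0)
		return;
	if (step(p, PORT_FORWARDING, PORT_STARTED) != 0) /* stop forwarding */
		return;
	if (step(p, PORT_STARTED, PORT_STOPPED) != 0)    /* stop port */
		return;
	if (step(p, PORT_STOPPED, PORT_CLOSED) != 0)     /* close port */
		return;
	step(p, PORT_CLOSED, PORT_DETACHED);             /* detach port */
}
```

In this model a removal event for a different device leaves the port untouched, and a removal event for the port's own device walks it through every intermediate state before it ends up detached.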
end of thread, other threads:[~2018-06-15 8:31 UTC | newest]

Thread overview: 10+ messages
2018-05-24  6:55 [dpdk-dev] [RFC] hot plug failure handle mechanism Guo, Jia
2018-05-24 14:57 ` Matan Azrad
2018-05-25  7:49 ` Guo, Jia
2018-05-29 11:20 ` Bruce Richardson
2018-06-04  1:56 ` Guo, Jia
2018-06-06 12:54 ` Bruce Richardson
2018-06-06 13:11 ` Ananyev, Konstantin
2018-06-07  2:14 ` Guo, Jia
2018-06-14 21:37 ` Thomas Monjalon
2018-06-15  8:31 ` Guo, Jia