patches for DPDK stable branches
* [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear
@ 2019-11-12  8:47 Matan Azrad
  2019-11-12  8:47 ` [dpdk-stable] [PATCH 2/2] app/testpmd: fix invalid port detaching Matan Azrad
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Matan Azrad @ 2019-11-12  8:47 UTC (permalink / raw)
  To: dev; +Cc: Gaetan Rivet, Bernard Iremonger, mukawa, stable

When a rte_device is unplugged, the driver should be detached from the
device.

The PCI driver detach operation wrongly did not clear the driver from the
device structure, which left the device in the probed state from the EAL
point of view.

Clear the driver when the driver detach operation succeeds.

Fixes: dbe6b4b61b0e ("pci: probe or close device")
Cc: mukawa@igel.co.jp
Cc: stable@dpdk.org

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/bus/pci/pci_common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
index 6b46b4f..3f55420 100644
--- a/drivers/bus/pci/pci_common.c
+++ b/drivers/bus/pci/pci_common.c
@@ -247,6 +247,7 @@ static struct rte_devargs *pci_devargs_lookup(struct rte_pci_device *dev)
 
 	/* clear driver structure */
 	dev->driver = NULL;
+	dev->device.driver = NULL;
 
 	if (dr->drv_flags & RTE_PCI_DRV_NEED_MAPPING)
 		/* unmap resources for devices that use igb_uio */
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [dpdk-stable] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2019-11-12  8:47 [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear Matan Azrad
@ 2019-11-12  8:47 ` Matan Azrad
  2019-11-12 11:20   ` Iremonger, Bernard
  2020-01-23 13:19   ` [dpdk-stable] [dpdk-dev] " Yigit, Ferruh
  2019-11-19 22:40 ` [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear Thomas Monjalon
  2019-11-20  9:47 ` [dpdk-stable] [PATCH v2] " Matan Azrad
  2 siblings, 2 replies; 23+ messages in thread
From: Matan Azrad @ 2019-11-12  8:47 UTC (permalink / raw)
  To: dev; +Cc: Gaetan Rivet, Bernard Iremonger, thomas, stable

The port was not validated before detaching.

Ignore port detach operation when the port is not valid.

Fixes: f8e5baa2662d ("app/testpmd: check not detaching device twice")
Cc: thomas@monjalon.net
Cc: stable@dpdk.org

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 app/test-pmd/testpmd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 4444346..370eefe 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -2545,6 +2545,9 @@ struct extmem_param {
 
 	printf("Removing a device...\n");
 
+	if (port_id_is_invalid(port_id, ENABLED_WARN))
+		return;
+
 	dev = rte_eth_devices[port_id].device;
 	if (dev == NULL) {
 		printf("Device already removed\n");
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2019-11-12  8:47 ` [dpdk-stable] [PATCH 2/2] app/testpmd: fix invalid port detaching Matan Azrad
@ 2019-11-12 11:20   ` Iremonger, Bernard
  2019-11-20 22:52     ` David Marchand
  2020-01-23 13:19   ` [dpdk-stable] [dpdk-dev] " Yigit, Ferruh
  1 sibling, 1 reply; 23+ messages in thread
From: Iremonger, Bernard @ 2019-11-12 11:20 UTC (permalink / raw)
  To: Matan Azrad, dev; +Cc: Gaetan Rivet, thomas, stable

> -----Original Message-----
> From: Matan Azrad <matan@mellanox.com>
> Sent: Tuesday, November 12, 2019 8:48 AM
> To: dev@dpdk.org
> Cc: Gaetan Rivet <gaetan.rivet@6wind.com>; Iremonger, Bernard
> <bernard.iremonger@intel.com>; thomas@monjalon.net; stable@dpdk.org
> Subject: [PATCH 2/2] app/testpmd: fix invalid port detaching
> 
> The port was not validated before detaching.
> 
> Ignore port detach operation when the port is not valid.
> 
> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device twice")
> Cc: thomas@monjalon.net
> Cc: stable@dpdk.org
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

Acked-by: Bernard Iremonger <bernard.iremonger@intel.com>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear
  2019-11-12  8:47 [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear Matan Azrad
  2019-11-12  8:47 ` [dpdk-stable] [PATCH 2/2] app/testpmd: fix invalid port detaching Matan Azrad
@ 2019-11-19 22:40 ` Thomas Monjalon
  2019-11-20  9:02   ` Matan Azrad
  2019-11-20  9:47 ` [dpdk-stable] [PATCH v2] " Matan Azrad
  2 siblings, 1 reply; 23+ messages in thread
From: Thomas Monjalon @ 2019-11-19 22:40 UTC (permalink / raw)
  To: Matan Azrad
  Cc: stable, dev, Gaetan Rivet, Bernard Iremonger, mukawa, david.marchand

12/11/2019 09:47, Matan Azrad:
> When a rte_device is unplugged, the driver should be detached from the
> device.

Yes

> The PCI detach driver operation wrongly didn't clear the driver from the
> device structure what remain the device in probe state from the EAL
> point of view.

Are you aware of a use case which is broken because of that?


> --- a/drivers/bus/pci/pci_common.c
> +++ b/drivers/bus/pci/pci_common.c
> @@ -247,6 +247,7 @@ static struct rte_devargs *pci_devargs_lookup(struct rte_pci_device *dev)

The git context above is wrong, it should show the function rte_pci_detach_dev.

>  	/* clear driver structure */
>  	dev->driver = NULL;
> +	dev->device.driver = NULL;

It looks like a good fix.
Acked-by: Thomas Monjalon <thomas@monjalon.net>

I am wondering if there could be a risk for any test application
if applied in 19.11-rc3.
I think we should try to get it and revert if a side effect is discovered.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear
  2019-11-19 22:40 ` [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear Thomas Monjalon
@ 2019-11-20  9:02   ` Matan Azrad
  0 siblings, 0 replies; 23+ messages in thread
From: Matan Azrad @ 2019-11-20  9:02 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: stable, dev, Gaetan Rivet, Bernard Iremonger, mukawa, david.marchand

Hi

From: Thomas Monjalon
> 12/11/2019 09:47, Matan Azrad:
> > When a rte_device is unplugged, the driver should be detached from the
> > device.
> 
> Yes
> 
> > The PCI detach driver operation wrongly didn't clear the driver from
> > the device structure what remain the device in probe state from the
> > EAL point of view.
> 
> Are you aware of an use case which is broken because of that?

Yes, I will add a small example.

> 
> > --- a/drivers/bus/pci/pci_common.c
> > +++ b/drivers/bus/pci/pci_common.c
> > @@ -247,6 +247,7 @@ static struct rte_devargs *pci_devargs_lookup(struct rte_pci_device *dev)
> 
> The git context above is wrong, it should show the function
> rte_pci_detach_dev.
> 
> >  	/* clear driver structure */
> >  	dev->driver = NULL;
> > +	dev->device.driver = NULL;
> 
> It looks a good fix.
> Acked-by: Thomas Monjalon <thomas@monjalon.net>
> 
> I am wondering if there could be a risk for any test application if applied in
> 19.11-rc3.
> I think we should try to get it and revert if a side effect is discovered.
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [dpdk-stable] [PATCH v2] bus/pci: fix driver detach clear
  2019-11-12  8:47 [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear Matan Azrad
  2019-11-12  8:47 ` [dpdk-stable] [PATCH 2/2] app/testpmd: fix invalid port detaching Matan Azrad
  2019-11-19 22:40 ` [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear Thomas Monjalon
@ 2019-11-20  9:47 ` Matan Azrad
  2019-11-20 13:03   ` David Marchand
  2019-11-20 22:52   ` David Marchand
  2 siblings, 2 replies; 23+ messages in thread
From: Matan Azrad @ 2019-11-20  9:47 UTC (permalink / raw)
  To: dev; +Cc: mukawa, stable

When a rte_device is unplugged, the driver should be detached from the
device.

The PCI driver detach operation wrongly did not clear the driver from the
device structure, which left the device in the probed state from the EAL
point of view.

For example, when a device is removed twice using rte_dev_remove, it
causes a crash in EAL.

Clear the driver when the driver detach operation succeeds.

Fixes: dbe6b4b61b0e ("pci: probe or close device")
Cc: mukawa@igel.co.jp
Cc: stable@dpdk.org

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/bus/pci/pci_common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
index 6b46b4f..3f55420 100644
--- a/drivers/bus/pci/pci_common.c
+++ b/drivers/bus/pci/pci_common.c
@@ -247,6 +247,7 @@ static struct rte_devargs *pci_devargs_lookup(struct rte_pci_device *dev)
 
 	/* clear driver structure */
 	dev->driver = NULL;
+	dev->device.driver = NULL;
 
 	if (dr->drv_flags & RTE_PCI_DRV_NEED_MAPPING)
 		/* unmap resources for devices that use igb_uio */
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [PATCH v2] bus/pci: fix driver detach clear
  2019-11-20  9:47 ` [dpdk-stable] [PATCH v2] " Matan Azrad
@ 2019-11-20 13:03   ` David Marchand
  2019-11-20 13:44     ` Matan Azrad
  2019-11-20 13:51     ` Thomas Monjalon
  2019-11-20 22:52   ` David Marchand
  1 sibling, 2 replies; 23+ messages in thread
From: David Marchand @ 2019-11-20 13:03 UTC (permalink / raw)
  To: Matan Azrad, Thomas Monjalon; +Cc: dev, mukawa, dpdk stable

On Wed, Nov 20, 2019 at 10:48 AM Matan Azrad <matan@mellanox.com> wrote:
>
> When a rte_device is unplugged, the driver should be detached from the
> device.
>
> The PCI detach driver operation wrongly didn't clear the driver from the
> device structure what remain the device in probe state from the EAL
> point of view.
>
> For example, when a device is removed twice using rte_dev_remove, it
> cause a crash in EAL.

I can see a crash when using port detach in testpmd with a virtio pci device.

testpmd> port attach 0000:07:00.0
Attaching a new port...
EAL: PCI device 0000:07:00.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 1af4:1041 net_virtio
Port 1 is attached. Now total ports is 2
Done
testpmd> port close 1
Closing ports...
EAL: Releasing pci mapped resource for 0000:07:00.0
EAL: Calling pci_unmap_resource for 0000:07:00.0 at 0x2200006000
Done
testpmd> port detach 1
Removing a device...

Breakpoint 1, local_dev_remove (dev=0x1de64b0) at
/root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
315        if (dev->bus->unplug == NULL) {
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-292.el7.x86_64 libgcc-4.8.5-39.el7.x86_64
libpcap-1.5.3-11.el7.x86_64 numactl-libs-2.0.12-3.el7.x86_64
(gdb) p *dev
$1 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x1cf8078
"0000:07:00.0", driver = 0x16c68f0 <rte_virtio_pmd+16>, bus =
0x16b2640 <rte_pci_bus>, numa_node = 0, devargs = 0x1cf8060}
(gdb) c
Continuing.
Device of port 1 is detached
Now total ports is 1
Done


On the first detach, the pci bus frees the rte_pci_device which embeds
the rte_device object.

static int
pci_unplug(struct rte_device *dev)
{
        struct rte_pci_device *pdev;
        int ret;

        pdev = RTE_DEV_TO_PCI(dev);
        ret = rte_pci_detach_dev(pdev);
        if (ret == 0) {
                rte_pci_remove_device(pdev);
                rte_devargs_remove(dev->devargs);
                free(pdev);
        }
        return ret;
}



testpmd> port detach 1
Removing a device...

Breakpoint 1, local_dev_remove (dev=0x1de64b0) at
/root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
315        if (dev->bus->unplug == NULL) {
(gdb) p *dev
$2 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0xa <Address 0xa
out of bounds>, driver = 0x0, bus = 0x4637, numa_node = 1, devargs =
0x40000002e040018}
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00000000007c1ddd in local_dev_remove (dev=0x1de64b0) at
/root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
315        if (dev->bus->unplug == NULL) {


On the second detach, testpmd passes the same rte_device pointer it
extracts from rte_eth_devices, but the malloc'd location has been
reused (with watchpoint on the location, I found somewhere around
rte_mp_request_sync/opendir()), and then *crunch* on dev->bus.


From my pov:
- testpmd is wrongly reusing a pointer coming from rte_eth_devices[],
without caring about the port state (this is what your second patch
fixes),
- testpmd is directly kicking pointers in rte_eth_devices[] (setting
->device = NULL for its own logic), which is bad too,
- this patch just hides the reuse of a freed pointer,
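
A minimal, self-contained sketch of the first point, assuming toy types
rather than the real DPDK structures: every toy_* name below is
hypothetical, and the port state here only tracks attachment. The table
owner records the state of each slot, and detach consults that state
before it reads the cached pointer, so a second detach never touches the
freed container.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for rte_device and rte_pci_device; these are not DPDK types. */
struct toy_device {
	const char *name;
};

struct toy_pci_device {
	int bus_private;		/* placeholder for bus-specific fields */
	struct toy_device device;	/* embedded: freed with the container */
};

/* Recover the container from a pointer to its embedded member. */
#define TOY_DEV_TO_PCI(ptr) \
	((struct toy_pci_device *)((char *)(ptr) - \
		offsetof(struct toy_pci_device, device)))

enum toy_port_state { TOY_PORT_UNUSED = 0, TOY_PORT_ATTACHED };

struct toy_port {
	enum toy_port_state state;
	struct toy_device *device;	/* only meaningful while ATTACHED */
};

#define TOY_MAX_PORTS 4
static struct toy_port toy_ports[TOY_MAX_PORTS];

/* A port is valid only if the index is in range and the slot is attached. */
static int
toy_port_is_valid(int port)
{
	return port >= 0 && port < TOY_MAX_PORTS &&
		toy_ports[port].state == TOY_PORT_ATTACHED;
}

/* Allocate a device and publish its embedded part in the port table. */
static int
toy_attach(int port, const char *name)
{
	struct toy_pci_device *pdev;

	if (port < 0 || port >= TOY_MAX_PORTS ||
	    toy_ports[port].state != TOY_PORT_UNUSED)
		return -1;
	pdev = calloc(1, sizeof(*pdev));
	if (pdev == NULL)
		return -1;
	pdev->device.name = name;
	toy_ports[port].device = &pdev->device;
	toy_ports[port].state = TOY_PORT_ATTACHED;
	return 0;
}

/* Detach a port; the state check comes before any pointer is read. */
static int
toy_detach(int port)
{
	struct toy_device *dev;

	if (!toy_port_is_valid(port))
		return -1;	/* never read a pointer from a stale slot */
	dev = toy_ports[port].device;
	assert(dev != NULL);	/* invariant of an ATTACHED slot */
	toy_ports[port].state = TOY_PORT_UNUSED;
	toy_ports[port].device = NULL;
	free(TOY_DEV_TO_PCI(dev));
	return 0;
}
```

With this layout, calling toy_detach() twice on the same port returns 0
and then -1, instead of dereferencing memory that malloc may already have
handed to someone else.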


-- 
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [PATCH v2] bus/pci: fix driver detach clear
  2019-11-20 13:03   ` David Marchand
@ 2019-11-20 13:44     ` Matan Azrad
  2019-11-20 13:51     ` Thomas Monjalon
  1 sibling, 0 replies; 23+ messages in thread
From: Matan Azrad @ 2019-11-20 13:44 UTC (permalink / raw)
  To: David Marchand, Thomas Monjalon; +Cc: dev, mukawa, dpdk stable

Hi David

From: David Marchand
> On Wed, Nov 20, 2019 at 10:48 AM Matan Azrad <matan@mellanox.com>
> wrote:
> >
> > When a rte_device is unplugged, the driver should be detached from the
> > device.
> >
> > The PCI detach driver operation wrongly didn't clear the driver from
> > the device structure what remain the device in probe state from the
> > EAL point of view.
> >
> > For example, when a device is removed twice using rte_dev_remove, it
> > cause a crash in EAL.
> 
> I can see a crash when using port detach in testpmd with a virtio pci device.
> 
> testpmd> port attach 0000:07:00.0
> Attaching a new port...
> EAL: PCI device 0000:07:00.0 on NUMA socket -1
> EAL:   Invalid NUMA socket, default to 0
> EAL:   probe driver: 1af4:1041 net_virtio
> Port 1 is attached. Now total ports is 2
> Done
> testpmd> port close 1
> Closing ports...
> EAL: Releasing pci mapped resource for 0000:07:00.0
> EAL: Calling pci_unmap_resource for 0000:07:00.0 at 0x2200006000
> Done
> testpmd> port detach 1
> Removing a device...
> 
> Breakpoint 1, local_dev_remove (dev=0x1de64b0) at
> /root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
> 315        if (dev->bus->unplug == NULL) {
> Missing separate debuginfos, use: debuginfo-install
> glibc-2.17-292.el7.x86_64 libgcc-4.8.5-39.el7.x86_64
> libpcap-1.5.3-11.el7.x86_64 numactl-libs-2.0.12-3.el7.x86_64
> (gdb) p *dev
> $1 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x1cf8078
> "0000:07:00.0", driver = 0x16c68f0 <rte_virtio_pmd+16>, bus =
> 0x16b2640 <rte_pci_bus>, numa_node = 0, devargs = 0x1cf8060}
> (gdb) c
> Continuing.
> Device of port 1 is detached
> Now total ports is 1
> Done
> 
> 
> On the first detach, the pci bus frees the rte_pci_device which embeds the
> rte_device object.
> 
> static int
> pci_unplug(struct rte_device *dev)
> {
>         struct rte_pci_device *pdev;
>         int ret;
> 
>         pdev = RTE_DEV_TO_PCI(dev);
>         ret = rte_pci_detach_dev(pdev);
>         if (ret == 0) {
>                 rte_pci_remove_device(pdev);
>                 rte_devargs_remove(dev->devargs);
>                 free(pdev);
>         }
>         return ret;
> }
> 
> 
> 
> testpmd> port detach 1
> Removing a device...
> 
> Breakpoint 1, local_dev_remove (dev=0x1de64b0) at
> /root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
> 315        if (dev->bus->unplug == NULL) {
> (gdb) p *dev
> $2 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0xa <Address 0xa out
> of bounds>, driver = 0x0, bus = 0x4637, numa_node = 1, devargs =
> 0x40000002e040018}
> (gdb) c
> Continuing.
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x00000000007c1ddd in local_dev_remove (dev=0x1de64b0) at
> /root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
> 315        if (dev->bus->unplug == NULL) {
> 
> 
> On the second detach, testpmd passes the same rte_device pointer it
> extracts from rte_eth_devices, but the malloc'd location has been reused
> (with watchpoint on the location, I found somewhere around
> rte_mp_request_sync/opendir()), and then *crunch* on dev->bus.
> 
> 
> From my pov:
> - testpmd is wrongly reusing a pointer coming from rte_eth_devices[],
> without caring about the port state (this is what your second patch fixes),
> - testpmd is directly kicking pointers in rte_eth_devices[] (setting
> ->device = NULL for its own logic), which is bad too,
> - this patch just hides the reuse of a freed pointer,

Yes, you are right.

This patch is not needed since the rte_device is freed in remove.

Thanks.

> 
> 
> --
> David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [PATCH v2] bus/pci: fix driver detach clear
  2019-11-20 13:03   ` David Marchand
  2019-11-20 13:44     ` Matan Azrad
@ 2019-11-20 13:51     ` Thomas Monjalon
  2019-11-20 17:22       ` David Marchand
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Monjalon @ 2019-11-20 13:51 UTC (permalink / raw)
  To: David Marchand, dpdk stable; +Cc: Matan Azrad, dev

20/11/2019 14:03, David Marchand:
> On Wed, Nov 20, 2019 at 10:48 AM Matan Azrad <matan@mellanox.com> wrote:
> >
> > When a rte_device is unplugged, the driver should be detached from the
> > device.
> >
> > The PCI detach driver operation wrongly didn't clear the driver from the
> > device structure what remain the device in probe state from the EAL
> > point of view.
> >
> > For example, when a device is removed twice using rte_dev_remove, it
> > cause a crash in EAL.
> 
> I can see a crash when using port detach in testpmd with a virtio pci device.
> 
> testpmd> port attach 0000:07:00.0
> Attaching a new port...
> EAL: PCI device 0000:07:00.0 on NUMA socket -1
> EAL:   Invalid NUMA socket, default to 0
> EAL:   probe driver: 1af4:1041 net_virtio
> Port 1 is attached. Now total ports is 2
> Done
> testpmd> port close 1
> Closing ports...
> EAL: Releasing pci mapped resource for 0000:07:00.0
> EAL: Calling pci_unmap_resource for 0000:07:00.0 at 0x2200006000
> Done
> testpmd> port detach 1
> Removing a device...
> 
> Breakpoint 1, local_dev_remove (dev=0x1de64b0) at
> /root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
> 315        if (dev->bus->unplug == NULL) {
> Missing separate debuginfos, use: debuginfo-install
> glibc-2.17-292.el7.x86_64 libgcc-4.8.5-39.el7.x86_64
> libpcap-1.5.3-11.el7.x86_64 numactl-libs-2.0.12-3.el7.x86_64
> (gdb) p *dev
> $1 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x1cf8078
> "0000:07:00.0", driver = 0x16c68f0 <rte_virtio_pmd+16>, bus =
> 0x16b2640 <rte_pci_bus>, numa_node = 0, devargs = 0x1cf8060}
> (gdb) c
> Continuing.
> Device of port 1 is detached
> Now total ports is 1
> Done
> 
> 
> On the first detach, the pci bus frees the rte_pci_device which embeds
> the rte_device object.
> 
> static int
> pci_unplug(struct rte_device *dev)
> {
>         struct rte_pci_device *pdev;
>         int ret;
> 
>         pdev = RTE_DEV_TO_PCI(dev);
>         ret = rte_pci_detach_dev(pdev);
>         if (ret == 0) {
>                 rte_pci_remove_device(pdev);
>                 rte_devargs_remove(dev->devargs);
>                 free(pdev);
>         }
>         return ret;
> }
> 
> 
> 
> testpmd> port detach 1
> Removing a device...
> 
> Breakpoint 1, local_dev_remove (dev=0x1de64b0) at
> /root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
> 315        if (dev->bus->unplug == NULL) {
> (gdb) p *dev
> $2 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0xa <Address 0xa
> out of bounds>, driver = 0x0, bus = 0x4637, numa_node = 1, devargs =
> 0x40000002e040018}
> (gdb) c
> Continuing.
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x00000000007c1ddd in local_dev_remove (dev=0x1de64b0) at
> /root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
> 315        if (dev->bus->unplug == NULL) {
> 
> 
> On the second detach, testpmd passes the same rte_device pointer it
> extracts from rte_eth_devices, but the malloc'd location has been
> reused (with watchpoint on the location, I found somewhere around
> rte_mp_request_sync/opendir()), and then *crunch* on dev->bus.
> 
> 
> From my pov:
> - testpmd is wrongly reusing a pointer coming from rte_eth_devices[],
> without caring about the port state (this is what your second patch
> fixes),
> - testpmd is directly kicking pointers in rte_eth_devices[] (setting
> ->device = NULL for its own logic), which is bad too,
> - this patch just hides the reuse of a freed pointer,

I agree with most of your analysis.
So we agree that patch 2 is a real fix.
We agree that testpmd should be fixed in the next release to not update
the .device pointer. But keep it for now as it may be a workaround for
some drivers (this needs a deeper analysis).

But about this patch 1, it is resetting rte_device.driver,
which is used by the function rte_dev_is_probed().
It says rte_device has no rte_driver attached anymore.
This patch is the same idea as
391797f04208 ("drivers/bus: move driver assignment to end of probing")
So I consider this a real fix.
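
To make the point concrete, here is a small sketch under stated
assumptions: the toy_* types are hypothetical simplifications, and
toy_dev_is_probed() only models the idea that the probed state is derived
from the generic driver pointer. It shows that clearing only the
bus-level pointer leaves the generic device looking probed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the generic and PCI structures. */
struct toy_driver {
	const char *name;
};

struct toy_device {
	const struct toy_driver *driver;	/* generic driver pointer */
};

struct toy_pci_driver {
	struct toy_driver driver;		/* embedded generic driver */
};

struct toy_pci_device {
	struct toy_device device;		/* embedded generic device */
	const struct toy_pci_driver *driver;	/* bus-level driver pointer */
};

/* Models a probed-state query that looks only at the generic pointer. */
static int
toy_dev_is_probed(const struct toy_device *dev)
{
	return dev->driver != NULL;
}

/* A successful probe sets both the bus-level and the generic pointer. */
static void
toy_pci_probe(struct toy_pci_device *pdev, const struct toy_pci_driver *dr)
{
	assert(dr != NULL);
	pdev->driver = dr;
	pdev->device.driver = &dr->driver;
}

/*
 * Detach. With clear_generic == 0 this mimics the behaviour before the
 * patch (only the bus-level pointer is reset); with clear_generic != 0
 * it mimics the behaviour after the patch.
 */
static void
toy_pci_detach(struct toy_pci_device *pdev, int clear_generic)
{
	pdev->driver = NULL;
	if (clear_generic)
		pdev->device.driver = NULL;
}
```

After toy_pci_detach(&pdev, 0) the query still reports the device as
probed; after toy_pci_detach(&pdev, 1) it does not.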



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [PATCH v2] bus/pci: fix driver detach clear
  2019-11-20 13:51     ` Thomas Monjalon
@ 2019-11-20 17:22       ` David Marchand
  0 siblings, 0 replies; 23+ messages in thread
From: David Marchand @ 2019-11-20 17:22 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dpdk stable, Matan Azrad, dev

On Wed, Nov 20, 2019 at 2:54 PM Thomas Monjalon <thomas@monjalon.net> wrote:
> But about this patch 1, it is resetting rte_device.driver,
> which is used by the function rte_dev_is_probed().
> It says rte_device has no rte_driver attached anymore.
> This patch is the same idea as
> 391797f04208 ("drivers/bus: move driver assignment to end of probing")
> So I consider this is a real fix.

But the device should not be used after a rte_dev_remove().
This is more a documentation patch than a fix.

The commitlog must be rewritten to reflect this.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [PATCH v2] bus/pci: fix driver detach clear
  2019-11-20  9:47 ` [dpdk-stable] [PATCH v2] " Matan Azrad
  2019-11-20 13:03   ` David Marchand
@ 2019-11-20 22:52   ` David Marchand
  1 sibling, 0 replies; 23+ messages in thread
From: David Marchand @ 2019-11-20 22:52 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev, dpdk stable, Thomas Monjalon

On Wed, Nov 20, 2019 at 10:48 AM Matan Azrad <matan@mellanox.com> wrote:
>
> When a rte_device is unplugged, the driver should be detached from the
> device.
>
> The PCI detach driver operation wrongly didn't clear the driver from the
> device structure what remain the device in probe state from the EAL
> point of view.
>
> For example, when a device is removed twice using rte_dev_remove, it
> cause a crash in EAL.
>
> Clear the driver in driver detach successful operation.
>
> Signed-off-by: Matan Azrad <matan@mellanox.com>
Acked-by: Thomas Monjalon <thomas@monjalon.net>

Applied with updated commitlog.
Thanks.


--
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2019-11-12 11:20   ` Iremonger, Bernard
@ 2019-11-20 22:52     ` David Marchand
  0 siblings, 0 replies; 23+ messages in thread
From: David Marchand @ 2019-11-20 22:52 UTC (permalink / raw)
  To: Matan Azrad; +Cc: dev, Gaetan Rivet, thomas, stable, Iremonger, Bernard

On Tue, Nov 12, 2019 at 12:21 PM Iremonger, Bernard
<bernard.iremonger@intel.com> wrote:
>
> > -----Original Message-----
> > From: Matan Azrad <matan@mellanox.com>
> > Sent: Tuesday, November 12, 2019 8:48 AM
> > To: dev@dpdk.org
> > Cc: Gaetan Rivet <gaetan.rivet@6wind.com>; Iremonger, Bernard
> > <bernard.iremonger@intel.com>; thomas@monjalon.net; stable@dpdk.org
> > Subject: [PATCH 2/2] app/testpmd: fix invalid port detaching
> >
> > The port was not validated before detaching.
> >
> > Ignore port detach operation when the port is not valid.
> >
> > Fixes: f8e5baa2662d ("app/testpmd: check not detaching device twice")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
>
> Acked-by: Bernard Iremonger <bernard.iremonger@intel.com>
>

Applied, thanks.



--
David Marchand


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2019-11-12  8:47 ` [dpdk-stable] [PATCH 2/2] app/testpmd: fix invalid port detaching Matan Azrad
  2019-11-12 11:20   ` Iremonger, Bernard
@ 2020-01-23 13:19   ` Yigit, Ferruh
  2020-01-23 14:05     ` Matan Azrad
  1 sibling, 1 reply; 23+ messages in thread
From: Yigit, Ferruh @ 2020-01-23 13:19 UTC (permalink / raw)
  To: Matan Azrad, dev, Bernard Iremonger
  Cc: Gaetan Rivet, thomas, stable, David Marchand, Jeff Guo, Qi Zhang

On 11/12/2019 8:47 AM, Matan Azrad wrote:
> The port was not validated before detaching.
> 
> Ignore port detach operation when the port is not valid.
> 
> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device twice")
> Cc: thomas@monjalon.net
> Cc: stable@dpdk.org
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>
> ---
>  app/test-pmd/testpmd.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index 4444346..370eefe 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -2545,6 +2545,9 @@ struct extmem_param {
>  
>  	printf("Removing a device...\n");
>  
> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
> +		return;
> +
>  	dev = rte_eth_devices[port_id].device;
>  	if (dev == NULL) {
>  		printf("Device already removed\n");
> 

The patch is already in 19.11 [1], but it breaks the testpmd hotplug support.

Before 'detach_port_device()' is called, the port has already been stopped and
closed [2], which makes the port fail the 'port_id_is_invalid()' check, so the
device removal path is never fully executed.
The implication is that, since the device is not detached, the vfio request
interrupt keeps being triggered and restarts the detach path, but because of the
half-cleaned device it fails and the app gets stuck with a continuous log [3].

I wonder if actual hotplug has been tested with this patch. The commit log
is not clear about the motivation and implication of the patch, and I am not
clear why this check was added, but I will send a patch soon to remove it.
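
The sequence can be sketched with a toy model. This is only an
illustration under assumptions: the toy_* names are hypothetical and
merely mimic the order of calls listed in [2], with a validity check that
rejects closed ports.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy port model; names are hypothetical and only mimic the testpmd flow. */
struct toy_port {
	bool started;
	bool closed;
	bool device_attached;
};

static void
toy_stop_port(struct toy_port *p)
{
	p->started = false;
}

static void
toy_close_port(struct toy_port *p)
{
	assert(!p->started);	/* a port is stopped before it is closed */
	p->closed = true;
}

/* Mimics a validity check that treats a closed port as invalid. */
static bool
toy_port_is_invalid(const struct toy_port *p)
{
	return p->closed;
}

static void
toy_detach_port_device(struct toy_port *p, bool check_validity)
{
	if (check_validity && toy_port_is_invalid(p))
		return;	/* early return: the device is never detached */
	p->device_attached = false;
}

/* Same order as the removal callback: stop, close, then detach. */
static bool
toy_rmv_port_callback(struct toy_port *p, bool check_validity)
{
	toy_stop_port(p);
	toy_close_port(p);
	toy_detach_port_device(p, check_validity);
	return !p->device_attached;	/* true if the device was detached */
}
```

With check_validity set, the callback always returns false because the
port was closed one step earlier; without it, the device is detached.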

Regards,
ferruh


[1]
https://git.dpdk.org/dpdk/commit/?id=43d0e304980a1527bcac92dc679057b189e2545a

[2]
rmv_port_callback
  stop_port(port_id);
  close_port(port_id);
  detach_port_device(port_id);

[3]
EAL: can not get port by device 0000:00:05.0!
EAL: can not get port by device 0000:00:05.0!
EAL: can not get port by device 0000:00:05.0!
EAL: can not get port by device 0000:00:05.0!
EAL: can not get port by device 0000:00:05.0!
EAL: can not get port by device 0000:00:05.0!
...

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-01-23 13:19   ` [dpdk-stable] [dpdk-dev] " Yigit, Ferruh
@ 2020-01-23 14:05     ` Matan Azrad
  2020-01-23 14:48       ` Ferruh Yigit
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Azrad @ 2020-01-23 14:05 UTC (permalink / raw)
  To: Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang

Hi

From: Yigit, Ferruh
> On 11/12/2019 8:47 AM, Matan Azrad wrote:
> > The port was not validated before detaching.
> >
> > Ignore port detach operation when the port is not valid.
> >
> > Fixes: f8e5baa2662d ("app/testpmd: check not detaching device twice")
> > Cc: thomas@monjalon.net
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > ---
> >  app/test-pmd/testpmd.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> > index 4444346..370eefe 100644
> > --- a/app/test-pmd/testpmd.c
> > +++ b/app/test-pmd/testpmd.c
> > @@ -2545,6 +2545,9 @@ struct extmem_param {
> >
> >  	printf("Removing a device...\n");
> >
> > +	if (port_id_is_invalid(port_id, ENABLED_WARN))
> > +		return;
> > +
> >  	dev = rte_eth_devices[port_id].device;
> >  	if (dev == NULL) {
> >  		printf("Device already removed\n");
> >
> 
> The patch is already in 19.11 [1] but it is breaking the testpmd hotplug
> support.
> Before 'detach_port_device()' called, the port has been stopped and closed
> [2], which will make port fail from 'port_id_is_invalid()' check and the device
> removal path never fully called.
> The implication is, since device not detached, vfio request interrupt keeps
> triggered continuously and re-starts the detach path, but because of the half
> cleaned device it fails and app gets stuck with a continuous log [3].
> 
> I wonder if the actual hotplug has been tested with this patch, the commit
> log is not clear about the motivation and implication of the patch, I am not
> clear why this check is added but I am sending a patch soon to remove it
> back.

The motivation of this patch was to prevent a double detach of the same port, so that the user cannot call detach on an invalid port.

I agree this patch is not good and we need a fix, but I think the bug is conceptual.

Testpmd tries to detach by port_id, which is the ethdev port ID, while detach works with rte_device.

For example:
you can see in the line above, after the '+' lines: dev = rte_eth_devices[port_id].device,
Testpmd may access an invalid or reallocated ethdev structure to get the device name and may even detach an unwanted rte_device.

So, detach is broken with and without this patch.


I think Testpmd should change the concept of rte_device mapping and pay attention to the following:
1. Don't detach by ethdev port ID.
2. Multiple ethdev port IDs may be related to the same rte_device.

The Testpmd user should be sure that all the port IDs of the rte_device are released before the detach call, and Testpmd may need to validate it.
And like attach, detach should be triggered by PCI address / rte_device name.
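
A rough sketch of that idea, assuming a toy port table (all toy_* names
are hypothetical and this is not the testpmd code): detach is keyed by
device name and is refused while any port of that device is still open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PORTS 4

/* Several ports may carry the same device name. */
struct toy_port {
	const char *dev_name;	/* NULL when the slot is unused */
	bool open;
};

static struct toy_port toy_ports[TOY_MAX_PORTS];

/* True when the slot is in use and belongs to the named device. */
static bool
toy_port_matches(const struct toy_port *p, const char *dev_name)
{
	return p->dev_name != NULL && strcmp(p->dev_name, dev_name) == 0;
}

/* Count the ports of the named device that are still open. */
static int
toy_open_ports_of(const char *dev_name)
{
	int i, n = 0;

	for (i = 0; i < TOY_MAX_PORTS; i++)
		if (toy_port_matches(&toy_ports[i], dev_name) &&
		    toy_ports[i].open)
			n++;
	return n;
}

/*
 * Detach by device name. Returns -1 if any port of the device is still
 * open, otherwise the number of port slots released.
 */
static int
toy_detach_by_name(const char *dev_name)
{
	int i, released = 0;

	assert(dev_name != NULL);
	if (toy_open_ports_of(dev_name) > 0)
		return -1;
	for (i = 0; i < TOY_MAX_PORTS; i++) {
		if (toy_port_matches(&toy_ports[i], dev_name)) {
			toy_ports[i].dev_name = NULL;
			released++;
		}
	}
	return released;
}
```

The caller names the device once, and all port IDs mapped to it are
validated and released together rather than one ethdev port at a time.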


Matan

> Regards,
> ferruh
> 
> 
> [1]
> https://git.dpdk.org/dpdk/commit/?id=43d0e304980a1527bcac92dc679057b189e2545a
> 
> [2]
> rmv_port_callback
>   stop_port(port_id);
>   close_port(port_id);
>   detach_port_device(port_id);
> 
> [3]
> EAL: can not get port by device 0000:00:05.0!
> EAL: can not get port by device 0000:00:05.0!
> EAL: can not get port by device 0000:00:05.0!
> EAL: can not get port by device 0000:00:05.0!
> EAL: can not get port by device 0000:00:05.0!
> EAL: can not get port by device 0000:00:05.0!
> ...

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-01-23 14:05     ` Matan Azrad
@ 2020-01-23 14:48       ` Ferruh Yigit
  2020-01-23 15:29         ` Matan Azrad
  0 siblings, 1 reply; 23+ messages in thread
From: Ferruh Yigit @ 2020-01-23 14:48 UTC (permalink / raw)
  To: Matan Azrad, Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang

On 1/23/2020 2:05 PM, Matan Azrad wrote:
> Hi
> 
> From: Yigit, Ferruh
>> On 11/12/2019 8:47 AM, Matan Azrad wrote:
>>> The port was not validated before detaching.
>>>
>>> Ignore port detach operation when the port is not valid.
>>>
>>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device twice")
>>> Cc: thomas@monjalon.net
>>> Cc: stable@dpdk.org
>>>
>>> Signed-off-by: Matan Azrad <matan@mellanox.com>
>>> ---
>>>  app/test-pmd/testpmd.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
>>> index 4444346..370eefe 100644
>>> --- a/app/test-pmd/testpmd.c
>>> +++ b/app/test-pmd/testpmd.c
>>> @@ -2545,6 +2545,9 @@ struct extmem_param {
>>>
>>>  	printf("Removing a device...\n");
>>>
>>> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
>>> +		return;
>>> +
>>>  	dev = rte_eth_devices[port_id].device;
>>>  	if (dev == NULL) {
>>>  		printf("Device already removed\n");
>>>
>>
>> The patch is already in 19.11 [1] but it is breaking the testpmd hotplug
>> support.
>> Before 'detach_port_device()' called, the port has been stopped and closed
>> [2], which will make port fail from 'port_id_is_invalid()' check and the device
>> removal path never fully called.
>> The implication is, since device not detached, vfio request interrupt keeps
>> triggered continuously and re-starts the detach path, but because of the half
>> cleaned device it fails and app gets stuck with a continuous log [3].
>>
>> I wonder if the actual hotplug has been tested with this patch, the commit
>> log is not clear about the motivation and implication of the patch, I am not
>> clear why this check is added but I am sending a patch soon to remove it
>> back.
> 
> The motivation of this patch was to prevent double detach on same port, so the user cannot call detach of invalid port.

What is the definition of an 'invalid port'? If you mean the case where the
device is already detached, then in the second call of the function the
"if (dev == NULL)" check should prevent it from going forward.
But according to the 'port_id_is_invalid()' API, a closed port is an invalid
port; I think that is wrong in this context.
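
The difference between the two checks can be sketched with a small stand-alone
model. This is only an illustration under stated assumptions: every `model_*`
name below is hypothetical, the structures are not the real ethdev or testpmd
ones, and the model assumes (as argued in this thread) that a successful
removal leaves the port's device pointer NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; these are NOT the real DPDK structures. */
enum model_port_state { MODEL_UNUSED = 0, MODEL_ATTACHED };

struct model_device {
	bool probed;	/* still on the bus "probed" list */
};

struct model_port {
	enum model_port_state state;
	struct model_device *device;
};

#define MODEL_MAX_PORTS 4
struct model_port model_ports[MODEL_MAX_PORTS];

/* Close: ethdev-level resources go away, the device stays probed. */
void model_close_port(unsigned int port_id)
{
	assert(port_id < MODEL_MAX_PORTS);
	model_ports[port_id].state = MODEL_UNUSED;
}

/* Stand-in for a validity check that rejects closed (UNUSED) ports. */
bool model_port_is_invalid(unsigned int port_id)
{
	return port_id >= MODEL_MAX_PORTS ||
	       model_ports[port_id].state == MODEL_UNUSED;
}

/* Detach guarded only by the NULL check; true if a device was removed. */
bool model_detach(unsigned int port_id)
{
	struct model_device *dev;

	assert(port_id < MODEL_MAX_PORTS);
	dev = model_ports[port_id].device;
	if (dev == NULL)
		return false;	/* "Device already removed" */
	dev->probed = false;
	model_ports[port_id].device = NULL;
	return true;
}

/* Detach with the extra validity check in front of the NULL check. */
bool model_detach_checked(unsigned int port_id)
{
	if (model_port_is_invalid(port_id))
		return false;	/* a closed port never reaches the removal */
	return model_detach(port_id);
}
```

In this model, after a port is attached and then closed,
`model_detach_checked()` returns without removing the device, while
`model_detach()` removes it once and rejects a second call through the NULL
check.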

> 
> I agree this patch is not good and we need a fix but I think the bug is conceptual.
> 
> Testpmd tries to do detach by port_id which is derived by ethdev port id while detach work with rte_device.
> 
> For example:
> you can see in the line above after +++: dev = rte_eth_devices[port_id].device,
> Testpmd may access invalid  or reallocated ethdev structure to get the device name and may even detach unwanted rte_device.

I think whichever function calls 'detach_port_device()' should check the port
validity.
'detach_port_device()' doesn't know whether the port was reallocated or not; it
will free the given port_id, and when the freeing is done
'rte_eth_devices[port_id].device' will be NULL, so this looks to me like a
valid check.
The caller of 'detach_port_device()' should ensure the correct port_id is
passed to the function.

> 
> So, detach is broken with and without this patch.

I can't see how it is broken without the check. How can the problem you
mentioned be reproduced? Or is it a theoretical issue?
But with this check, hotplug support is reproducibly broken 100% of the time.

> 
> 
> I think Testpmd should change the concept of rte_device mapping and put attention to next:
> 1. Don't detach by ethdev port ID.
> 2. Multiple ethdev port IDs may related to the same rte_device.
> 
> The Testpmd user should be sure that all the port IDs of the rte_device are released before the detach call and Testpmd maybe need to validate it.
> And like attach, detach should be triggered by PCI address \ rte_device name.
> 

We need to know the port_id too, to be able to stop/close it.
And sure, no objection to improving the hotplug support, but it is broken now;
let's fix it first.

> 
>> Regards,
>> ferruh
>>
>>
>> [1]
>> https://git.dpdk.org/dpdk/commit/?id=43d0e304980a1527bcac92dc679057b189e2545a
>>
>> [2]
>> rmv_port_callback
>>   stop_port(port_id);
>>   close_port(port_id);
>>   detach_port_device(port_id);
>>
>> [3]
>> EAL: can not get port by device 0000:00:05.0!
>> EAL: can not get port by device 0000:00:05.0!
>> EAL: can not get port by device 0000:00:05.0!
>> EAL: can not get port by device 0000:00:05.0!
>> EAL: can not get port by device 0000:00:05.0!
>> EAL: can not get port by device 0000:00:05.0!
>> ...


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-01-23 14:48       ` Ferruh Yigit
@ 2020-01-23 15:29         ` Matan Azrad
  2020-01-23 18:14           ` Ferruh Yigit
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Azrad @ 2020-01-23 15:29 UTC (permalink / raw)
  To: Ferruh Yigit, Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang


Hi

From: Ferruh Yigit
> On 1/23/2020 2:05 PM, Matan Azrad wrote:
> > Hi
> >
> > From: Yigit, Ferruh
> >> On 11/12/2019 8:47 AM, Matan Azrad wrote:
> >>> The port was not validated before detaching.
> >>>
> >>> Ignore port detach operation when the port is not valid.
> >>>
> >>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device
> >>> twice")
> >>> Cc: thomas@monjalon.net
> >>> Cc: stable@dpdk.org
> >>>
> >>> Signed-off-by: Matan Azrad <matan@mellanox.com>
> >>> ---
> >>>  app/test-pmd/testpmd.c | 3 +++
> >>>  1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> >>> 4444346..370eefe 100644
> >>> --- a/app/test-pmd/testpmd.c
> >>> +++ b/app/test-pmd/testpmd.c
> >>> @@ -2545,6 +2545,9 @@ struct extmem_param {
> >>>
> >>>  	printf("Removing a device...\n");
> >>>
> >>> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
> >>> +		return;
> >>> +
> >>>  	dev = rte_eth_devices[port_id].device;
> >>>  	if (dev == NULL) {
> >>>  		printf("Device already removed\n");
> >>>
> >>
> >> The patch is already in 19.11 [1] but it is breaking the testpmd
> >> hotplug support.
> >> Before 'detach_port_device()' called, the port has been stopped and
> >> closed [2], which will make port fail from 'port_id_is_invalid()'
> >> check and the device removal path never fully called.
> >> The implication is, since device not detached, vfio request interrupt
> >> keeps triggered continuously and re-starts the detach path, but
> >> because of the half cleaned device it fails and app gets stuck with a
> continuous log [3].
> >>
> >> I wonder if the actual hotplug has been tested with this patch, the
> >> commit log is not clear about the motivation and implication of the
> >> patch, I am not clear why this check is added but I am sending a
> >> patch soon to remove it back.
> >
> > The motivation of this patch was to prevent double detach on same port,
> so the user cannot call detach of invalid port.
> 
> What is the definition of the 'invalid port', if you mean device already
> detached case, in the second call of the function "if (dev == NULL)" check
> should prevent it going forward.

No, ethdev doesn't zero the device pointer when it releases a port.
So even if the port is already in the unused state (which means invalid), the device pointer may still be valid and point to the last port that used the same id.


> But according the 'port_id_is_invalid()' API, a closed port is an invalid port, I
> think that is wrong in this context.

Why?

You are going to look at the ethdev port_id structure; don't you think we should validate the port before using its structure?


> >
> > I agree this patch is not good and we need a fix but I think the bug is
> conceptual.
> >
> > Testpmd tries to do detach by port_id which is derived by ethdev port id
> while detach work with rte_device.
> >
> > For example:
> > you can see in the line above after +++: dev =
> > rte_eth_devices[port_id].device, Testpmd may access invalid  or
> reallocated ethdev structure to get the device name and may even detach
> unwanted rte_device.
> 
> I thinks whichever function calling 'detach_port_device()' should check the
> port validity.
> 'detach_port_device()' doesn't know if port reallocated or not, it will free the
> given port_id, and when freeing done 'rte_eth_devices[port_id].device' will
> be NULL, this looks to me a valid check.

Please verify this against ethdev; I don't think so. 'rte_eth_devices[port_id].device' is still valid after detach.

> The caller of the 'detach_port_device()' should ensure correct port_id
> passed to the function.

What is a correct port id? If the port was released, is it correct?

> >
> > So, detach is broken with and without this patch.
> 
> I can't see how it is broken without the check, how the problem you
> mentioned can be reproduced? Or is it a theoretical issue?
> But with this check hotplug support is %100 reproducible broken.
> 
> >
> >
> > I think Testpmd should change the concept of rte_device mapping and put
> attention to next:
> > 1. Don't detach by ethdev port ID.
> > 2. Multiple ethdev port IDs may related to the same rte_device.
> >
> > The Testpmd user should be sure that all the port IDs of the rte_device are
> released before the detach call and Testpmd maybe need to validate it.
> > And like attach, detach should be triggered by PCI address \ rte_device
> name.
> >
> 
> We need to know about port_id too to be able to stop/close it.
> And sure no objection to improve the hotplug support but it is broken now,
> lets fix it first.
> 
> >
> >> Regards,
> >> ferruh
> >>
> >>
> >> [1]
> >> https://git.dpdk.org/dpdk/commit/?id=43d0e304980a1527bcac92dc679057b189e2545a
> >>
> >> [2]
> >> rmv_port_callback
> >>   stop_port(port_id);
> >>   close_port(port_id);
> >>   detach_port_device(port_id);
> >>
> >> [3]
> >> EAL: can not get port by device 0000:00:05.0!
> >> EAL: can not get port by device 0000:00:05.0!
> >> EAL: can not get port by device 0000:00:05.0!
> >> EAL: can not get port by device 0000:00:05.0!
> >> EAL: can not get port by device 0000:00:05.0!
> >> EAL: can not get port by device 0000:00:05.0!
> >> ...


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-01-23 15:29         ` Matan Azrad
@ 2020-01-23 18:14           ` Ferruh Yigit
  2020-01-23 19:25             ` Matan Azrad
  0 siblings, 1 reply; 23+ messages in thread
From: Ferruh Yigit @ 2020-01-23 18:14 UTC (permalink / raw)
  To: Matan Azrad, Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang

On 1/23/2020 3:29 PM, Matan Azrad wrote:
> 
> Hi
> 
> From: Ferruh Yigit
>> On 1/23/2020 2:05 PM, Matan Azrad wrote:
>>> Hi
>>>
>>> From: Yigit, Ferruh
>>>> On 11/12/2019 8:47 AM, Matan Azrad wrote:
>>>>> The port was not validated before detaching.
>>>>>
>>>>> Ignore port detach operation when the port is not valid.
>>>>>
>>>>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device
>>>>> twice")
>>>>> Cc: thomas@monjalon.net
>>>>> Cc: stable@dpdk.org
>>>>>
>>>>> Signed-off-by: Matan Azrad <matan@mellanox.com>
>>>>> ---
>>>>>  app/test-pmd/testpmd.c | 3 +++
>>>>>  1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
>>>>> 4444346..370eefe 100644
>>>>> --- a/app/test-pmd/testpmd.c
>>>>> +++ b/app/test-pmd/testpmd.c
>>>>> @@ -2545,6 +2545,9 @@ struct extmem_param {
>>>>>
>>>>>  	printf("Removing a device...\n");
>>>>>
>>>>> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
>>>>> +		return;
>>>>> +
>>>>>  	dev = rte_eth_devices[port_id].device;
>>>>>  	if (dev == NULL) {
>>>>>  		printf("Device already removed\n");
>>>>>
>>>>
>>>> The patch is already in 19.11 [1] but it is breaking the testpmd
>>>> hotplug support.
>>>> Before 'detach_port_device()' called, the port has been stopped and
>>>> closed [2], which will make port fail from 'port_id_is_invalid()'
>>>> check and the device removal path never fully called.
>>>> The implication is, since device not detached, vfio request interrupt
>>>> keeps triggered continuously and re-starts the detach path, but
>>>> because of the half cleaned device it fails and app gets stuck with a
>> continuous log [3].
>>>>
>>>> I wonder if the actual hotplug has been tested with this patch, the
>>>> commit log is not clear about the motivation and implication of the
>>>> patch, I am not clear why this check is added but I am sending a
>>>> patch soon to remove it back.
>>>
>>> The motivation of this patch was to prevent double detach on same port,
>> so the user cannot call detach of invalid port.
>>
>> What is the definition of the 'invalid port', if you mean device already
>> detached case, in the second call of the function "if (dev == NULL)" check
>> should prevent it going forward.
> 
> No, ethdev doesn't zero the device pointer when it release a port.

As far as I can see it does, please see below.

> So even if the port is in unused state already - means invalid, the device pointer still may be valid and point to the last port that used the same id.

If the port is closed, it is in the unused state and the ethdev layer resources
are freed, but as you said the device-related structures are still there: the
device pointer is still valid, it is still in the probed device list, etc. We
need to be able to detach the device even after the port is in the unused state.

"stop -> close -> detach" is a normal order, we shouldn't prevent it, but your
check does prevent it.

I am not very clear about your concern here, "point to the last port that used
the same id", can you please clarify?

> 
> 
>> But according the 'port_id_is_invalid()' API, a closed port is an invalid port, I
>> think that is wrong in this context.
> 
> Why?

A closed port is 'invalid' for use, because its ethdev resources are freed. But
it is not 'invalid' to detach it; why should a port being closed prevent
freeing its device layer resources?

> 
> You are going to look on ethdev portid structure, don't you think we should valid the port before using its structure?

Is your main concern that "rte_eth_devices[port_id].device" can be a dangling
pointer?

1) It is not.
2) The check you added to replace it is not the correct check.

> 
> 
>>>
>>> I agree this patch is not good and we need a fix but I think the bug is
>> conceptual.
>>>
>>> Testpmd tries to do detach by port_id which is derived by ethdev port id
>> while detach work with rte_device.
>>>
>>> For example:
>>> you can see in the line above after +++: dev =
>>> rte_eth_devices[port_id].device, Testpmd may access invalid  or
>> reallocated ethdev structure to get the device name and may even detach
>> unwanted rte_device.
>>
>> I thinks whichever function calling 'detach_port_device()' should check the
>> port validity.
>> 'detach_port_device()' doesn't know if port reallocated or not, it will free the
>> given port_id, and when freeing done 'rte_eth_devices[port_id].device' will
>> be NULL, this looks to me a valid check.
> 
> Please validate me, check ethdev, I don't think so, 'rte_eth_devices[port_id].device still valid after detach.

This is a long stack trace, but what happens is:

rte_dev_remove
  bus unplug
    driver remove
      rte_eth_dev_pci_release
        eth_dev->device = NULL;

Please check that the remove() op (rte_pci_driver.remove()) of the driver you
are testing cleans the ethdev fields.

A little more detailed stack trace for my environment:
#0  rte_eth_dev_pci_release (eth_dev=..) at  rte_ethdev_pci.h:143
#1  rte_eth_dev_pci_generic_remove (pci_dev=.., dev_uninit=..) at rte_ethdev_pci.h:199
#2  eth_i40e_pci_remove (pci_dev=..) at i40e_ethdev.c:710
#3  rte_pci_detach_dev (dev=..) at pci_common.c:243
#4  pci_unplug (dev=..) at pci_common.c:537
#5  local_dev_remove (dev=..) at eal_common_dev.c:321
#6  rte_dev_remove (dev=..) at eal_common_dev.c:402
#7  detach_port_device (port_id=0) at testpmd.c:2663
#8  cmd_operate_detach_port_parsed (parsed_result=.., cl=.., data=0x0) at cmdline.c:1501
#9  cmdline_parse (cl=.., buf=.."port detach 0\n") at cmdline_parse.c:295
#10 cmdline_valid_buffer (rdl=.., buf="port detach 0\n", size=15) at  cmdline.c:31
#11 rdline_char_in (rdl=.., c=10 '\n') at  cmdline_rdline.c:421
#12 cmdline_in (cl=.., buf=.."\n", size=1) at cmdline.c:148
#13 cmdline_interact (cl=..) at cmdline.c:227
#14 prompt () at cmdline.c:19644
#15 main (argc=3, argv=..) at testpmd.c:3617

> 
>> The caller of the 'detach_port_device()' should ensure correct port_id
>> passed to the function.
> 
> What is correct port id, if the port was released , is it correct?

You are right, there is no good answer for it. I was thinking application state
information could be used, but no, ethdev should be able to provide this
information. We need an 'is_freed' kind of check for it; currently
'rte_eth_devices[port_id].device' is used for that purpose.

> 
>>>
>>> So, detach is broken with and without this patch.
>>
>> I can't see how it is broken without the check, how the problem you
>> mentioned can be reproduced? Or is it a theoretical issue?
>> But with this check hotplug support is %100 reproducible broken.
>>
>>>
>>>
>>> I think Testpmd should change the concept of rte_device mapping and put
>> attention to next:
>>> 1. Don't detach by ethdev port ID.
>>> 2. Multiple ethdev port IDs may related to the same rte_device.
>>>
>>> The Testpmd user should be sure that all the port IDs of the rte_device are
>> released before the detach call and Testpmd maybe need to validate it.
>>> And like attach, detach should be triggered by PCI address \ rte_device
>> name.
>>>
>>
>> We need to know about port_id too to be able to stop/close it.
>> And sure no objection to improve the hotplug support but it is broken now,
>> lets fix it first.
>>
>>>
>>>> Regards,
>>>> ferruh
>>>>
>>>>
>>>> [1]
>>>> https://git.dpdk.org/dpdk/commit/?id=43d0e304980a1527bcac92dc679057b189e2545a
>>>>
>>>> [2]
>>>> rmv_port_callback
>>>>   stop_port(port_id);
>>>>   close_port(port_id);
>>>>   detach_port_device(port_id);
>>>>
>>>> [3]
>>>> EAL: can not get port by device 0000:00:05.0!
>>>> EAL: can not get port by device 0000:00:05.0!
>>>> EAL: can not get port by device 0000:00:05.0!
>>>> EAL: can not get port by device 0000:00:05.0!
>>>> EAL: can not get port by device 0000:00:05.0!
>>>> EAL: can not get port by device 0000:00:05.0!
>>>> ...
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-01-23 18:14           ` Ferruh Yigit
@ 2020-01-23 19:25             ` Matan Azrad
  2020-01-24 16:28               ` Ferruh Yigit
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Azrad @ 2020-01-23 19:25 UTC (permalink / raw)
  To: Ferruh Yigit, Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang

Hi

From: Ferruh Yigit
> On 1/23/2020 3:29 PM, Matan Azrad wrote:
> >
> > Hi
> >
> > From: Ferruh Yigit
> >> On 1/23/2020 2:05 PM, Matan Azrad wrote:
> >>> Hi
> >>>
> >>> From: Yigit, Ferruh
> >>>> On 11/12/2019 8:47 AM, Matan Azrad wrote:
> >>>>> The port was not validated before detaching.
> >>>>>
> >>>>> Ignore port detach operation when the port is not valid.
> >>>>>
> >>>>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device
> >>>>> twice")
> >>>>> Cc: thomas@monjalon.net
> >>>>> Cc: stable@dpdk.org
> >>>>>
> >>>>> Signed-off-by: Matan Azrad <matan@mellanox.com>
> >>>>> ---
> >>>>>  app/test-pmd/testpmd.c | 3 +++
> >>>>>  1 file changed, 3 insertions(+)
> >>>>>
> >>>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> >>>>> 4444346..370eefe 100644
> >>>>> --- a/app/test-pmd/testpmd.c
> >>>>> +++ b/app/test-pmd/testpmd.c
> >>>>> @@ -2545,6 +2545,9 @@ struct extmem_param {
> >>>>>
> >>>>>  	printf("Removing a device...\n");
> >>>>>
> >>>>> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
> >>>>> +		return;
> >>>>> +
> >>>>>  	dev = rte_eth_devices[port_id].device;
> >>>>>  	if (dev == NULL) {
> >>>>>  		printf("Device already removed\n");
> >>>>>
> >>>>
> >>>> The patch is already in 19.11 [1] but it is breaking the testpmd
> >>>> hotplug support.
> >>>> Before 'detach_port_device()' called, the port has been stopped and
> >>>> closed [2], which will make port fail from 'port_id_is_invalid()'
> >>>> check and the device removal path never fully called.
> >>>> The implication is, since device not detached, vfio request
> >>>> interrupt keeps triggered continuously and re-starts the detach
> >>>> path, but because of the half cleaned device it fails and app gets
> >>>> stuck with a
> >> continuous log [3].
> >>>>
> >>>> I wonder if the actual hotplug has been tested with this patch, the
> >>>> commit log is not clear about the motivation and implication of the
> >>>> patch, I am not clear why this check is added but I am sending a
> >>>> patch soon to remove it back.
> >>>
> >>> The motivation of this patch was to prevent double detach on same
> >>> port,
> >> so the user cannot call detach of invalid port.
> >>
> >> What is the definition of the 'invalid port', if you mean device
> >> already detached case, in the second call of the function "if (dev ==
> >> NULL)" check should prevent it going forward.
> >
> > No, ethdev doesn't zero the device pointer when it release a port.
> 
> As far as I can see it does, please see below.

The code below is problematic because:

1. It is very bad that the application is changing the ethdev structure directly.
2. The code below runs only on a valid port, not on an invalid port (UNUSED state).

So, the device pointer will still be valid if the port is invalid.

All of this shows that this function tries to detach only a valid port (probably mainly because it is called by the Testpmd detach command).

> > So even if the port is in unused state already - means invalid, the device
> pointer still may be valid and point to the last port that used the same id.
> 
> If the port is closed, it is unused state, and ethdev layer resources freed but
> as you said device related structures are still there, device pointer is still valid
> and it is still in probed device list etc.. We need to able to detach the device
> even after it is unused state.

Yes, but detach is for a device, not for a port.
The device pointer must be taken only while the port is in a valid state.
Why?
Because if the port is in the UNUSED state, it is free to be allocated again by the ethdev layer for another device; then the device pointer may point to that other device.

> "stop -> close -> detach" is a normal order, we shouldn't prevent it, but your
> check does prevent it.

Yes, this is a good order, but the pointer to the device should be taken before close.
My patch prevents access to an invalid structure.
And yes, Testpmd detach stays broken after my patch, and after this patch too.


> 
> I am not very clear about your concern here, "point to the last port that used
> the same id", can you please clarify?

Yes. When the ethdev layer allocates a port ID for a new device, it tries to find an UNUSED port.
When one is found, the port moves to ATTACHED after the PMD finishes its probing function.

So, any UNUSED port may be allocated to another device, and then the device pointer points to that other device.
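
The reuse scenario described above can be sketched with a small stand-alone
model. It is only an illustration: all `sketch_*` names are hypothetical, the
table is not the real ethdev port array, and the release function deliberately
models the case where the device pointer is left untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a port table; NOT the real ethdev code. */
enum sketch_state { SKETCH_UNUSED = 0, SKETCH_ATTACHED };

struct sketch_device {
	const char *name;
};

struct sketch_port {
	enum sketch_state state;
	struct sketch_device *device;
};

#define SKETCH_MAX_PORTS 2
struct sketch_port sketch_ports[SKETCH_MAX_PORTS];

/* Give the first UNUSED slot to dev; returns the port id, or -1 if full. */
int sketch_port_allocate(struct sketch_device *dev)
{
	int i;

	assert(dev != NULL);
	for (i = 0; i < SKETCH_MAX_PORTS; i++) {
		if (sketch_ports[i].state == SKETCH_UNUSED) {
			sketch_ports[i].state = SKETCH_ATTACHED;
			sketch_ports[i].device = dev;
			return i;
		}
	}
	return -1;
}

/* Release the slot but leave the device pointer as it was. */
void sketch_port_release(int port_id)
{
	assert(port_id >= 0 && port_id < SKETCH_MAX_PORTS);
	sketch_ports[port_id].state = SKETCH_UNUSED;
}
```

In this model, if device A is allocated and released and device B is then
allocated, B receives the same port id, so reading the device pointer through
the old port id now yields B rather than A.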

> 
> >
> >
> >> But according the 'port_id_is_invalid()' API, a closed port is an
> >> invalid port, I think that is wrong in this context.
> >
> > Why?
> 
> Closed port is 'invalid' for using it, because ethdev resources are freed. But it
> is not 'invalid' to detach it, why a port being closed should prevent freeing its
> device layer resources?

I didn't say that; I said that the device pointer should be taken while the port is valid.


> 
> >
> > You are going to look on ethdev portid structure, don't you think we should
> valid the port before using its structure?
> 
> Is your main concern "rte_eth_devices[port_id].device" can be dangling
> pointer?
> 
> 1) It is not.
> 2) The check you added to replace it is not correct check.
> 
I didn't say that.

It just may point to another device.
It is not correct to take information from an invalid structure.

Don't you agree that the structure is not valid when the port is not valid?

> >
> >>>
> >>> I agree this patch is not good and we need a fix but I think the bug
> >>> is
> >> conceptual.
> >>>
> >>> Testpmd tries to do detach by port_id which is derived by ethdev
> >>> port id
> >> while detach work with rte_device.
> >>>
> >>> For example:
> >>> you can see in the line above after +++: dev =
> >>> rte_eth_devices[port_id].device, Testpmd may access invalid  or
> >> reallocated ethdev structure to get the device name and may even
> >> detach unwanted rte_device.
> >>
> >> I thinks whichever function calling 'detach_port_device()' should
> >> check the port validity.
> >> 'detach_port_device()' doesn't know if port reallocated or not, it
> >> will free the given port_id, and when freeing done
> >> 'rte_eth_devices[port_id].device' will be NULL, this looks to me a valid
> check.
> >
> > Please validate me, check ethdev, I don't think so,
> 'rte_eth_devices[port_id].device still valid after detach.
> 
> This is a long stack trace, but what happens is:
> 
> rte_dev_remove
>   bus unpug
>     driver remove
>       rte_eth_dev_pci_release
>         eth_dev->device = NULL;

The last line doesn't happen here because rte_eth_dev_pci_release moves the port to UNUSED.
And it is bad that the application is trying to do it.

> 
> Please check the driver you are testing remove() ops
> (rte_pci_driver.remove()) does cleans the ethdev fields.
> 
> A little more detailed stack trace for my environment:
> #0  rte_eth_dev_pci_release (eth_dev=..) at  rte_ethdev_pci.h:143
> #1  rte_eth_dev_pci_generic_remove (pci_dev=.., dev_uninit=..) at
> rte_ethdev_pci.h:199
> #2  eth_i40e_pci_remove (pci_dev=..) at i40e_ethdev.c:710
> #3  rte_pci_detach_dev (dev=..) at pci_common.c:243
> #4  pci_unplug (dev=..) at pci_common.c:537
> #5  local_dev_remove (dev=..) at eal_common_dev.c:321
> #6  rte_dev_remove (dev=..) at eal_common_dev.c:402
> #7  detach_port_device (port_id=0) at testpmd.c:2663
> #8  cmd_operate_detach_port_parsed (parsed_result=.., cl=.., data=0x0) at
> cmdline.c:1501
> #9  cmdline_parse (cl=.., buf=.."port detach 0\n") at cmdline_parse.c:295
> #10 cmdline_valid_buffer (rdl=.., buf="port detach 0\n", size=15) at
> cmdline.c:31
> #11 rdline_char_in (rdl=.., c=10 '\n') at  cmdline_rdline.c:421
> #12 cmdline_in (cl=.., buf=.."\n", size=1) at cmdline.c:148
> #13 cmdline_interact (cl=..) at cmdline.c:227
> #14 prompt () at cmdline.c:19644
> #15 main (argc=3, argv=..) at testpmd.c:3617
> 
Not all the drivers are doing it.
I think it would be good if we did it in the ethdev release function.


> >
> >> The caller of the 'detach_port_device()' should ensure correct
> >> port_id passed to the function.
> >
> > What is correct port id, if the port was released , is it correct?
> 
> You are right, there is no good answer for it, I was thinking application state
> information can be used but no ethdev should able to provide this
> information, we need 'is_freed' kind of check for it, currently
> 'rte_eth_devices[port_id].device' is used for that purpose.

It is wrong to take the device from an invalid structure (I explained this at length above).
A better way is to save the rte_device at the start (before close) and call detach by rte_device once we are sure that all the ports of this rte_device are released (mlx4 can manage 2 ports on one rte_device, and so can any device that supports representors).

Let's do a correct fix.
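
A minimal sketch of that idea, as a stand-alone model: the device handle is
taken while the port is still valid, and detach works on the device and is
refused while any of its ports is still open. All `demo_*` names are
hypothetical and none of this is existing Testpmd or ethdev code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one device may back several ports. */
struct demo_device {
	unsigned int open_ports;	/* ports of this device not yet closed */
	bool probed;
};

struct demo_port {
	struct demo_device *device;
	bool open;
};

/* Close one port and hand back its device, read while the port is valid. */
struct demo_device *demo_close_port(struct demo_port *port)
{
	struct demo_device *dev;

	assert(port != NULL && port->open && port->device != NULL);
	dev = port->device;	/* saved before the port becomes invalid */
	port->open = false;
	dev->open_ports--;
	return dev;
}

/* Detach by device handle; refused while any port is still open. */
bool demo_detach_device(struct demo_device *dev)
{
	assert(dev != NULL);
	if (!dev->probed || dev->open_ports > 0)
		return false;
	dev->probed = false;
	return true;
}
```

With two ports on one device, closing the first port and calling
`demo_detach_device()` is refused; only after the second port is closed does
the detach succeed, and a repeated detach is rejected.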


> 
> >
> >>>
> >>> So, detach is broken with and without this patch.
> >>
> >> I can't see how it is broken without the check, how the problem you
> >> mentioned can be reproduced? Or is it a theoretical issue?
> >> But with this check hotplug support is %100 reproducible broken.
> >>
> >>>
> >>>
> >>> I think Testpmd should change the concept of rte_device mapping and
> >>> put
> >> attention to next:
> >>> 1. Don't detach by ethdev port ID.
> >>> 2. Multiple ethdev port IDs may related to the same rte_device.
> >>>
> >>> The Testpmd user should be sure that all the port IDs of the
> >>> rte_device are
> >> released before the detach call and Testpmd maybe need to validate it.
> >>> And like attach, detach should be triggered by PCI address \
> >>> rte_device
> >> name.
> >>>
> >>
> >> We need to know about port_id too to be able to stop/close it.
> >> And sure no objection to improve the hotplug support but it is broken
> >> now, lets fix it first.
> >>
> >>>
> >>>> Regards,
> >>>> ferruh
> >>>>
> >>>>
> >>>> [1]
> >>>> https://git.dpdk.org/dpdk/commit/?id=43d0e304980a1527bcac92dc679057b189e2545a
> >>>>
> >>>> [2]
> >>>> rmv_port_callback
> >>>>   stop_port(port_id);
> >>>>   close_port(port_id);
> >>>>   detach_port_device(port_id);
> >>>>
> >>>> [3]
> >>>> EAL: can not get port by device 0000:00:05.0!
> >>>> EAL: can not get port by device 0000:00:05.0!
> >>>> EAL: can not get port by device 0000:00:05.0!
> >>>> EAL: can not get port by device 0000:00:05.0!
> >>>> EAL: can not get port by device 0000:00:05.0!
> >>>> EAL: can not get port by device 0000:00:05.0!
> >>>> ...
> >


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-01-23 19:25             ` Matan Azrad
@ 2020-01-24 16:28               ` Ferruh Yigit
  2020-01-25 18:56                 ` Matan Azrad
  0 siblings, 1 reply; 23+ messages in thread
From: Ferruh Yigit @ 2020-01-24 16:28 UTC (permalink / raw)
  To: Matan Azrad, Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang

On 1/23/2020 7:25 PM, Matan Azrad wrote:
> Hi
> 
> From: Ferruh Yigit
>> On 1/23/2020 3:29 PM, Matan Azrad wrote:
>>>
>>> Hi
>>>
>>> From: Ferruh Yigit
>>>> On 1/23/2020 2:05 PM, Matan Azrad wrote:
>>>>> Hi
>>>>>
>>>>> From: Yigit, Ferruh
>>>>>> On 11/12/2019 8:47 AM, Matan Azrad wrote:
>>>>>>> The port was not validated before detaching.
>>>>>>>
>>>>>>> Ignore port detach operation when the port is not valid.
>>>>>>>
>>>>>>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device
>>>>>>> twice")
>>>>>>> Cc: thomas@monjalon.net
>>>>>>> Cc: stable@dpdk.org
>>>>>>>
>>>>>>> Signed-off-by: Matan Azrad <matan@mellanox.com>
>>>>>>> ---
>>>>>>>  app/test-pmd/testpmd.c | 3 +++
>>>>>>>  1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
>>>>>>> 4444346..370eefe 100644
>>>>>>> --- a/app/test-pmd/testpmd.c
>>>>>>> +++ b/app/test-pmd/testpmd.c
>>>>>>> @@ -2545,6 +2545,9 @@ struct extmem_param {
>>>>>>>
>>>>>>>  	printf("Removing a device...\n");
>>>>>>>
>>>>>>> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
>>>>>>> +		return;
>>>>>>> +
>>>>>>>  	dev = rte_eth_devices[port_id].device;
>>>>>>>  	if (dev == NULL) {
>>>>>>>  		printf("Device already removed\n");
>>>>>>>
>>>>>>
>>>>>> The patch is already in 19.11 [1] but it is breaking the testpmd
>>>>>> hotplug support.
>>>>>> Before 'detach_port_device()' called, the port has been stopped and
>>>>>> closed [2], which will make port fail from 'port_id_is_invalid()'
>>>>>> check and the device removal path never fully called.
>>>>>> The implication is, since device not detached, vfio request
>>>>>> interrupt keeps triggered continuously and re-starts the detach
>>>>>> path, but because of the half cleaned device it fails and app gets
>>>>>> stuck with a
>>>> continuous log [3].
>>>>>>
>>>>>> I wonder if the actual hotplug has been tested with this patch, the
>>>>>> commit log is not clear about the motivation and implication of the
>>>>>> patch, I am not clear why this check is added but I am sending a
>>>>>> patch soon to remove it back.
>>>>>
>>>>> The motivation of this patch was to prevent double detach on same
>>>>> port,
>>>> so the user cannot call detach of invalid port.
>>>>
>>>> What is the definition of the 'invalid port', if you mean device
>>>> already detached case, in the second call of the function "if (dev ==
>>>> NULL)" check should prevent it going forward.
>>>
>>> No, ethdev doesn't zero the device pointer when it release a port.
>>
>> As far as I can see it does, please see below.
> 
> The code below is problematic because:
> 
> 1. It is very bad that the application changing ethdev structure directly.

Where is the application changing the ethdev structure?
The application calls the 'rte_dev_remove()' API, which does the job.

> 2. The below code run over valid port only, not on invalid port(UNUSED state).
> 
> So, the device pointer will still be valid if the port is invalid.
> 
> All of this shows that this function try to detach only a valid port (probably mainly because it is called by Testpmd detach command).
> 
>>> So even if the port is in unused state already - means invalid, the device
>> pointer still may be valid and point to the last port that used the same id.
>>
>> If the port is closed, it is unused state, and ethdev layer resources freed but
>> as you said device related structures are still there, device pointer is still valid
>> and it is still in probed device list etc.. We need to able to detach the device
>> even after it is unused state.
> 
> Yes, but detach is for device, not for port.
> The device pointer must be taken only when the port is in valid state.
> Why?
> Because if the port is in UNUSED state it is free to be allocated again by ethdev layer for other device, then, the device pointer may point to other device.
> 
>> "stop -> close -> detach" is a normal order, we shouldn't prevent it, but your
>> check does prevent it.
> 
> Yes, this is good order, but the pointer of the device should be taken before close.
> My patch prevent accessing invalid structure.

The ethdev close() dev_ops frees the ethdev related resources; the rte_device is
still valid in that struct. And yes, your patch prevents accessing them and
prevents hotplug from removing the device.

> And yes, Testpmd detach stays broken after my patch and after this patch too.
> 
> 
>>
>> I am not very clear about your concern here, "point to the last port that used
>> the same id", can you please clarify?
> 
> Yes, when ethdev layer allocates a port ID for a new device, it tries to find UNUSED port.
> When found, the port will move to ATTACHED after the PMD finishes its probing function.
> 
> So, any UNUSED port may be allocated for other device and then, the device pointer points to other device.
> 
>>
>>>
>>>
>>>> But according the 'port_id_is_invalid()' API, a closed port is an
>>>> invalid port, I think that is wrong in this context.
>>>
>>> Why?
>>
>> Closed port is 'invalid' for using it, because ethdev resources are freed. But it
>> is not 'invalid' to detach it, why a port being closed should prevent freeing its
>> device layer resources?
> 
> I didn't said that, I said that the device pointer should be taken when the port is valid.
> 
> 
>>
>>>
>>> You are going to look on ethdev portid structure, don't you think we should
>> valid the port before using its structure?
>>
>> Is your main concern "rte_eth_devices[port_id].device" can be dangling
>> pointer?
>>
>> 1) It is not.
>> 2) The check you added to replace it is not correct check.
>>
> Didn't said that.
> 
> It just may point to other device.
> It is not correct to take information from invalid structure.
> 
> Don't you agree that the structure is not valid when the port is not valid?
> 
>>>
>>>>>
>>>>> I agree this patch is not good and we need a fix but I think the bug
>>>>> is
>>>> conceptual.
>>>>>
>>>>> Testpmd tries to do detach by port_id which is derived by ethdev
>>>>> port id
>>>> while detach work with rte_device.
>>>>>
>>>>> For example:
>>>>> you can see in the line above after +++: dev =
>>>>> rte_eth_devices[port_id].device, Testpmd may access invalid  or
>>>> reallocated ethdev structure to get the device name and may even
>>>> detach unwanted rte_device.
>>>>
>>>> I thinks whichever function calling 'detach_port_device()' should
>>>> check the port validity.
>>>> 'detach_port_device()' doesn't know if port reallocated or not, it
>>>> will free the given port_id, and when freeing done
>>>> 'rte_eth_devices[port_id].device' will be NULL, this looks to me a valid
>> check.
>>>
>>> Please validate me, check ethdev, I don't think so,
>> 'rte_eth_devices[port_id].device still valid after detach.
>>
>> This is a long stack trace, but what happens is:
>>
>> rte_dev_remove
>>   bus unpug
>>     driver remove
>>       rte_eth_dev_pci_release
>>         eth_dev->device = NULL;
> 
> The last line doesn't happen here because the rte_eth_dev_pci_release moves the port to UNUSED.
> And it is bad that application is trying to do it.
> 
>>
>> Please check the driver you are testing remove() ops
>> (rte_pci_driver.remove()) does cleans the ethdev fields.
>>
>> A little more detailed stack trace for my environment:
>> #0  rte_eth_dev_pci_release (eth_dev=..) at  rte_ethdev_pci.h:143
>> #1  rte_eth_dev_pci_generic_remove (pci_dev=.., dev_uninit=..) at
>> rte_ethdev_pci.h:199
>> #2  eth_i40e_pci_remove (pci_dev=..) at i40e_ethdev.c:710
>> #3  rte_pci_detach_dev (dev=..) at pci_common.c:243
>> #4  pci_unplug (dev=..) at pci_common.c:537
>> #5  local_dev_remove (dev=..) at eal_common_dev.c:321
>> #6  rte_dev_remove (dev=..) at eal_common_dev.c:402
>> #7  detach_port_device (port_id=0) at testpmd.c:2663
>> #8  cmd_operate_detach_port_parsed (parsed_result=.., cl=.., data=0x0) at
>> cmdline.c:1501
>> #9  cmdline_parse (cl=.., buf=.."port detach 0\n") at cmdline_parse.c:295
>> #10 cmdline_valid_buffer (rdl=.., buf="port detach 0\n", size=15) at
>> cmdline.c:31
>> #11 rdline_char_in (rdl=.., c=10 '\n') at  cmdline_rdline.c:421
>> #12 cmdline_in (cl=.., buf=.."\n", size=1) at cmdline.c:148
>> #13 cmdline_interact (cl=..) at cmdline.c:227
>> #14 prompt () at cmdline.c:19644
>> #15 main (argc=3, argv=..) at testpmd.c:3617
>>
> Not all the drivers are doing it.
> I think it is good if we will do it by ethdev release function.
> 
> 
>>>
>>>> The caller of the 'detach_port_device()' should ensure correct
>>>> port_id passed to the function.
>>>
>>> What is correct port id, if the port was released , is it correct?
>>
>> You are right, there is no good answer for it, I was thinking application state
>> information can be used but no ethdev should able to provide this
>> information, we need 'is_freed' kind of check for it, currently
>> 'rte_eth_devices[port_id].device' is used for that purpose.
> 
> It is wrong to take device from invalid structure. (I explained a lot above).
> Better way to save the rte_device in the start(before close) and call detach by rte_device when we sure that all the ports of this rte_device are released(mlx4 can manage 2 ports one rte_device, also any device supports representors).
> 
> Let's do correct fix.

Matan,

It has become so hard to follow this discussion. The check you added is
preventing device hotplug, so it breaks the feature, but you want to keep the
check to fix something which is still not clear to me.

To simplify things, can you please clarify what error you are getting with this
patch, and can you please give some details on how to reproduce it? Then I can
debug the issue you are having.


> 
> 
>>
>>>
>>>>>
>>>>> So, detach is broken with and without this patch.
>>>>
>>>> I can't see how it is broken without the check, how the problem you
>>>> mentioned can be reproduced? Or is it a theoretical issue?
>>>> But with this check hotplug support is %100 reproducible broken.
>>>>
>>>>>
>>>>>
>>>>> I think Testpmd should change the concept of rte_device mapping and
>>>>> put
>>>> attention to next:
>>>>> 1. Don't detach by ethdev port ID.
>>>>> 2. Multiple ethdev port IDs may related to the same rte_device.
>>>>>
>>>>> The Testpmd user should be sure that all the port IDs of the
>>>>> rte_device are
>>>> released before the detach call and Testpmd maybe need to validate it.
>>>>> And like attach, detach should be triggered by PCI address \
>>>>> rte_device
>>>> name.
>>>>>
>>>>
>>>> We need to know about port_id too to be able to stop/close it.
>>>> And sure no objection to improve the hotplug support but it is broken
>>>> now, lets fix it first.
>>>>

<....>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-01-24 16:28               ` Ferruh Yigit
@ 2020-01-25 18:56                 ` Matan Azrad
  2020-02-03 15:58                   ` Ferruh Yigit
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Azrad @ 2020-01-25 18:56 UTC (permalink / raw)
  To: Ferruh Yigit, Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang

Hi Ferruh

From: Ferruh Yigit
> On 1/23/2020 7:25 PM, Matan Azrad wrote:
> > Hi
> >
> > From: Ferruh Yigit
> >> On 1/23/2020 3:29 PM, Matan Azrad wrote:
> >>>
> >>> Hi
> >>>
> >>> From: Ferruh Yigit
> >>>> On 1/23/2020 2:05 PM, Matan Azrad wrote:
> >>>>> Hi
> >>>>>
> >>>>> From: Yigit, Ferruh
> >>>>>> On 11/12/2019 8:47 AM, Matan Azrad wrote:
> >>>>>>> The port was not validated before detaching.
> >>>>>>>
> >>>>>>> Ignore port detach operation when the port is not valid.
> >>>>>>>
> >>>>>>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device
> >>>>>>> twice")
> >>>>>>> Cc: thomas@monjalon.net
> >>>>>>> Cc: stable@dpdk.org
> >>>>>>>
> >>>>>>> Signed-off-by: Matan Azrad <matan@mellanox.com>
> >>>>>>> ---
> >>>>>>>  app/test-pmd/testpmd.c | 3 +++
> >>>>>>>  1 file changed, 3 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> >>>>>>> index 4444346..370eefe 100644
> >>>>>>> --- a/app/test-pmd/testpmd.c
> >>>>>>> +++ b/app/test-pmd/testpmd.c
> >>>>>>> @@ -2545,6 +2545,9 @@ struct extmem_param {
> >>>>>>>
> >>>>>>>  	printf("Removing a device...\n");
> >>>>>>>
> >>>>>>> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
> >>>>>>> +		return;
> >>>>>>> +
> >>>>>>>  	dev = rte_eth_devices[port_id].device;
> >>>>>>>  	if (dev == NULL) {
> >>>>>>>  		printf("Device already removed\n");
> >>>>>>>
> >>>>>>
> >>>>>> The patch is already in 19.11 [1] but it is breaking the testpmd
> >>>>>> hotplug support.
> >>>>>> Before 'detach_port_device()' called, the port has been stopped
> >>>>>> and closed [2], which will make port fail from 'port_id_is_invalid()'
> >>>>>> check and the device removal path never fully called.
> >>>>>> The implication is, since device not detached, vfio request
> >>>>>> interrupt keeps triggered continuously and re-starts the detach
> >>>>>> path, but because of the half cleaned device it fails and app
> >>>>>> gets stuck with a
> >>>> continuous log [3].
> >>>>>>
> >>>>>> I wonder if the actual hotplug has been tested with this patch,
> >>>>>> the commit log is not clear about the motivation and implication
> >>>>>> of the patch, I am not clear why this check is added but I am
> >>>>>> sending a patch soon to remove it back.
> >>>>>
> >>>>> The motivation of this patch was to prevent double detach on same
> >>>>> port,
> >>>> so the user cannot call detach of invalid port.
> >>>>
> >>>> What is the definition of the 'invalid port', if you mean device
> >>>> already detached case, in the second call of the function "if (dev
> >>>> == NULL)" check should prevent it going forward.
> >>>
> >>> No, ethdev doesn't zero the device pointer when it release a port.
> >>
> >> As far as I can see it does, please see below.
> >
> > The code below is problematic because:
> >
> > 1. It is very bad that the application changing ethdev structure directly.
> 
> Where the application is changing the ethdev structure?

See it in the function we are talking about:
rte_eth_devices[sibling].device = NULL;

The application shouldn't do it - it should be done only by the ethdev lib or by the PMDs.

Do you agree here?

> Application calls the 'rte_dev_remove()' API, which does the job.

Agree. This function frees (rte_free) the rte_device (which actually leaves the rte_eth_devices[sibling].device pointer dangling)
and releases its related resources, which makes the device detached.

> > 2. The below code run over valid port only, not on invalid port(UNUSED
> state).
> >
> > So, the device pointer will still be valid if the port is invalid.
> >
> > All of this shows that this function try to detach only a valid port (probably
> mainly because it is called by Testpmd detach command).
> >
> >>> So even if the port is in unused state already - means invalid, the
> >>> device
> >> pointer still may be valid and point to the last port that used the same id.
> >>
> >> If the port is closed, it is unused state, and ethdev layer resources
> >> freed but as you said device related structures are still there,
> >> device pointer is still valid and it is still in probed device list
> >> etc.. We need to able to detach the device even after it is unused state.
> >
> > Yes, but detach is for device, not for port.
> > The device pointer must be taken only when the port is in valid state.
> > Why?
> > Because if the port is in UNUSED state it is free to be allocated again by
> ethdev layer for other device, then, the device pointer may point to other
> device.
> >

Do you agree on the above statement I wrote?

> >> "stop -> close -> detach" is a normal order, we shouldn't prevent it,
> >> but your check does prevent it.
> >
> > Yes, this is good order, but the pointer of the device should be taken
> before close.
> > My patch prevent accessing invalid structure.
> 
> The ethdev close() dev_ops, frees ethdev related resources, the rte_device
> is still valid in that struct.

That’s exactly my concern.
I think you are wrong here; the rte_device may be invalid in that struct, especially after close():

When the port ID is closed and released, its ethdev structure moves to the UNUSED state.
When an ethdev structure is in the UNUSED state it may be attached again to another rte_device - see function rte_eth_dev_allocate.
Do you agree here?

In this case, when a new device is attached after close() and before detach_port_device(), we may remove the wrong rte_device and cause a lot of problems.

Do you understand that?
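
A minimal sketch of the reuse scenario described above, using a simplified stand-in for the port table rather than the real ethdev code. The names port_allocate, port_close, device_by_port and the struct layouts are hypothetical. The only behaviour assumed from this discussion is that a closed port moves to UNUSED, that its device pointer is left in place, and that an UNUSED slot may be handed to the next probed device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ethdev port table; not the DPDK definitions. */
enum port_state { PORT_UNUSED, PORT_ATTACHED };

struct device {
	const char *name;
};

struct port {
	enum port_state state;
	struct device *device;
};

#define MAX_PORTS 4
static struct port ports[MAX_PORTS]; /* zero-initialized: all slots UNUSED */

/* Give the first UNUSED slot to a newly probed device. */
static int port_allocate(struct device *dev)
{
	for (int id = 0; id < MAX_PORTS; id++) {
		if (ports[id].state == PORT_UNUSED) {
			ports[id].state = PORT_ATTACHED;
			ports[id].device = dev;
			return id;
		}
	}
	return -1; /* no free slot */
}

/* Close releases the slot but, in this model, leaves the device pointer set. */
static void port_close(int id)
{
	assert(id >= 0 && id < MAX_PORTS);
	ports[id].state = PORT_UNUSED;
}

/* Look the device up through the port ID, as a detach-by-port command would. */
static struct device *device_by_port(int id)
{
	assert(id >= 0 && id < MAX_PORTS);
	return ports[id].device;
}
```

With device A on port 0, calling port_close(0) and then port_allocate(&B) makes device_by_port(0) return B, so in this model a later detach by port ID 0 would act on B instead of A.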

One more problematic case is a user mistake in the Testpmd command line, which may cause a segfault in the good case and memory overwriting in the worst case (my patch case):

port stop all
port detach 0
port detach 0

Detaching the same port twice will cause a reference to a freed rte_device pointer.


All of that is because Testpmd takes information from an invalid ethdev structure.

My patch prevents it.



> And yes your patch prevents accessing them and
> prevents hotplug remove the device.
> 

Yes, my patch is not good: it solved some issues and caused a new one.

I agree that we need a new fix; my suggestion is:

1. In the Testpmd internal management for hotplug (rmv_port_callback):
	Call stop().
	Take the rte_device pointer (before port close).
	Call close().
	If there is no other valid port for the rte_device:
		call detach() with the saved rte_device pointer.
2. Replace the Testpmd command line "port detach" with "detach [rte device name]":
	Why?
	Detach by port is problematic:
	1. If the port is closed - Testpmd cannot get its rte_device from the related ethdev port structure.
	2. If the port is not closed - it is not safe to detach it.
	3. Attach is done by rte_device name, so detach should be done the same way.
Do you agree?


I hope you understand now. 
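
The hotplug sequence suggested in point 1 can be sketched as below. This is only an illustration over a simplified model: remove_port, device_has_attached_port and the struct layouts are hypothetical names, and the state change stands in for the real stop() and close() calls. The point it shows is the ordering: the device pointer is saved before the port is released, and the device is detached only when no other attached port refers to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the DPDK or Testpmd definitions. */
enum port_state { PORT_UNUSED, PORT_ATTACHED };

struct device {
	bool plugged;
};

struct port {
	enum port_state state;
	struct device *device;
};

#define MAX_PORTS 4
static struct port ports[MAX_PORTS];

/* True if any attached port still refers to this device. */
static bool device_has_attached_port(const struct device *dev)
{
	for (int id = 0; id < MAX_PORTS; id++) {
		if (ports[id].state == PORT_ATTACHED && ports[id].device == dev)
			return true;
	}
	return false;
}

/*
 * Release one port and detach its device if that was the last port.
 * Returns true only when the device itself was detached.
 */
static bool remove_port(int id)
{
	assert(id >= 0 && id < MAX_PORTS);
	if (ports[id].state != PORT_ATTACHED)
		return false; /* port not valid: nothing is read from it */

	struct device *dev = ports[id].device; /* saved before close */
	ports[id].state = PORT_UNUSED;         /* stands in for stop() + close() */

	if (device_has_attached_port(dev))
		return false; /* a sibling port still uses the device */

	dev->plugged = false; /* stands in for detach by rte_device */
	return true;
}
```

For a device with two ports, the first remove_port() call only releases the port and the second one detaches the device. A repeated call on an already released port returns without touching the stale pointer.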

> > And yes, Testpmd detach stays broken after my patch and after this patch
> too.
> >
> >
> >>
> >> I am not very clear about your concern here, "point to the last port
> >> that used the same id", can you please clarify?
> >
> > Yes, when ethdev layer allocates a port ID for a new device, it tries to find
> UNUSED port.
> > When found, the port will move to ATTACHED after the PMD finishes its
> probing function.
> >
> > So, any UNUSED port may be allocated for other device and then, the
> device pointer points to other device.
> >
> >>
> >>>
> >>>
> >>>> But according the 'port_id_is_invalid()' API, a closed port is an
> >>>> invalid port, I think that is wrong in this context.
> >>>
> >>> Why?
> >>
> >> Closed port is 'invalid' for using it, because ethdev resources are
> >> freed. But it is not 'invalid' to detach it, why a port being closed
> >> should prevent freeing its device layer resources?
> >
> > I didn't said that, I said that the device pointer should be taken when the
> port is valid.
> >
> >
> >>
> >>>
> >>> You are going to look on ethdev portid structure, don't you think we
> >>> should
> >> valid the port before using its structure?
> >>
> >> Is your main concern "rte_eth_devices[port_id].device" can be
> >> dangling pointer?
> >>
> >> 1) It is not.
> >> 2) The check you added to replace it is not correct check.
> >>
> > Didn't said that.
> >
> > It just may point to other device.
> > It is not correct to take information from invalid structure.
> >
> > Don't you agree that the structure is not valid when the port is not valid?
> >
> >>>
> >>>>>
> >>>>> I agree this patch is not good and we need a fix but I think the
> >>>>> bug is
> >>>> conceptual.
> >>>>>
> >>>>> Testpmd tries to do detach by port_id which is derived by ethdev
> >>>>> port id
> >>>> while detach work with rte_device.
> >>>>>
> >>>>> For example:
> >>>>> you can see in the line above after +++: dev =
> >>>>> rte_eth_devices[port_id].device, Testpmd may access invalid  or
> >>>> reallocated ethdev structure to get the device name and may even
> >>>> detach unwanted rte_device.
> >>>>
> >>>> I thinks whichever function calling 'detach_port_device()' should
> >>>> check the port validity.
> >>>> 'detach_port_device()' doesn't know if port reallocated or not, it
> >>>> will free the given port_id, and when freeing done
> >>>> 'rte_eth_devices[port_id].device' will be NULL, this looks to me a
> >>>> valid
> >> check.
> >>>
> >>> Please validate me, check ethdev, I don't think so,
> >> 'rte_eth_devices[port_id].device still valid after detach.
> >>
> >> This is a long stack trace, but what happens is:
> >>
> >> rte_dev_remove
> >>   bus unpug
> >>     driver remove
> >>       rte_eth_dev_pci_release
> >>         eth_dev->device = NULL;
> >
> > The last line doesn't happen here because the rte_eth_dev_pci_release
> moves the port to UNUSED.
> > And it is bad that application is trying to do it.
> >
> >>
> >> Please check the driver you are testing remove() ops
> >> (rte_pci_driver.remove()) does cleans the ethdev fields.
> >>
> >> A little more detailed stack trace for my environment:
> >> #0  rte_eth_dev_pci_release (eth_dev=..) at  rte_ethdev_pci.h:143
> >> #1  rte_eth_dev_pci_generic_remove (pci_dev=.., dev_uninit=..) at
> >> rte_ethdev_pci.h:199
> >> #2  eth_i40e_pci_remove (pci_dev=..) at i40e_ethdev.c:710
> >> #3  rte_pci_detach_dev (dev=..) at pci_common.c:243
> >> #4  pci_unplug (dev=..) at pci_common.c:537
> >> #5  local_dev_remove (dev=..) at eal_common_dev.c:321
> >> #6  rte_dev_remove (dev=..) at eal_common_dev.c:402
> >> #7  detach_port_device (port_id=0) at testpmd.c:2663
> >> #8  cmd_operate_detach_port_parsed (parsed_result=.., cl=..,
> >> data=0x0) at
> >> cmdline.c:1501
> >> #9  cmdline_parse (cl=.., buf=.."port detach 0\n") at
> >> cmdline_parse.c:295
> >> #10 cmdline_valid_buffer (rdl=.., buf="port detach 0\n", size=15) at
> >> cmdline.c:31
> >> #11 rdline_char_in (rdl=.., c=10 '\n') at  cmdline_rdline.c:421
> >> #12 cmdline_in (cl=.., buf=.."\n", size=1) at cmdline.c:148
> >> #13 cmdline_interact (cl=..) at cmdline.c:227
> >> #14 prompt () at cmdline.c:19644
> >> #15 main (argc=3, argv=..) at testpmd.c:3617
> >>
> > Not all the drivers are doing it.
> > I think it is good if we will do it by ethdev release function.
> >
> >
> >>>
> >>>> The caller of the 'detach_port_device()' should ensure correct
> >>>> port_id passed to the function.
> >>>
> >>> What is correct port id, if the port was released , is it correct?
> >>
> >> You are right, there is no good answer for it, I was thinking
> >> application state information can be used but no ethdev should able
> >> to provide this information, we need 'is_freed' kind of check for it,
> >> currently 'rte_eth_devices[port_id].device' is used for that purpose.
> >
> > It is wrong to take device from invalid structure. (I explained a lot above).
> > Better way to save the rte_device in the start(before close) and call detach
> by rte_device when we sure that all the ports of this rte_device are
> released(mlx4 can manage 2 ports one rte_device, also any device supports
> representors).
> >
> > Let's do correct fix.
> 
> Matan,
> 
> It become so hard to follow this discussion.The check you add is preventing
> device hotplug, so breaking the feature, but you want to keep the check to
> fix something which is still not clear to me.
> 
> To simplify things, can you please clarify what error are you getting with this
> patch, and can you please give some details how to reproduce it? So I can
> debug the issue you are having.

Added details above, hope everything is clear when you read this line 😊 

> 
> >
> >
> >>
> >>>
> >>>>>
> >>>>> So, detach is broken with and without this patch.
> >>>>
> >>>> I can't see how it is broken without the check, how the problem you
> >>>> mentioned can be reproduced? Or is it a theoretical issue?
> >>>> But with this check hotplug support is %100 reproducible broken.
> >>>>
> >>>>>
> >>>>>
> >>>>> I think Testpmd should change the concept of rte_device mapping
> >>>>> and put
> >>>> attention to next:
> >>>>> 1. Don't detach by ethdev port ID.
> >>>>> 2. Multiple ethdev port IDs may related to the same rte_device.
> >>>>>
> >>>>> The Testpmd user should be sure that all the port IDs of the
> >>>>> rte_device are
> >>>> released before the detach call and Testpmd maybe need to validate it.
> >>>>> And like attach, detach should be triggered by PCI address \
> >>>>> rte_device
> >>>> name.
> >>>>>
> >>>>
> >>>> We need to know about port_id too to be able to stop/close it.
> >>>> And sure no objection to improve the hotplug support but it is
> >>>> broken now, lets fix it first.
> >>>>
> 
> <....>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-01-25 18:56                 ` Matan Azrad
@ 2020-02-03 15:58                   ` Ferruh Yigit
  2020-02-03 17:10                     ` Matan Azrad
  0 siblings, 1 reply; 23+ messages in thread
From: Ferruh Yigit @ 2020-02-03 15:58 UTC (permalink / raw)
  To: Matan Azrad, Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang

On 1/25/2020 6:56 PM, Matan Azrad wrote:
> Hi Ferruh
> 
> From: Ferruh Yigit
>> On 1/23/2020 7:25 PM, Matan Azrad wrote:
>>> Hi
>>>
>>> From: Ferruh Yigit
>>>> On 1/23/2020 3:29 PM, Matan Azrad wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> From: Ferruh Yigit
>>>>>> On 1/23/2020 2:05 PM, Matan Azrad wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> From: Yigit, Ferruh
>>>>>>>> On 11/12/2019 8:47 AM, Matan Azrad wrote:
>>>>>>>>> The port was not validated before detaching.
>>>>>>>>>
>>>>>>>>> Ignore port detach operation when the port is not valid.
>>>>>>>>>
>>>>>>>>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device
>>>>>>>>> twice")
>>>>>>>>> Cc: thomas@monjalon.net
>>>>>>>>> Cc: stable@dpdk.org
>>>>>>>>>
>>>>>>>>> Signed-off-by: Matan Azrad <matan@mellanox.com>
>>>>>>>>> ---
>>>>>>>>>  app/test-pmd/testpmd.c | 3 +++
>>>>>>>>>  1 file changed, 3 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
>>>>>>>>> index 4444346..370eefe 100644
>>>>>>>>> --- a/app/test-pmd/testpmd.c
>>>>>>>>> +++ b/app/test-pmd/testpmd.c
>>>>>>>>> @@ -2545,6 +2545,9 @@ struct extmem_param {
>>>>>>>>>
>>>>>>>>>  	printf("Removing a device...\n");
>>>>>>>>>
>>>>>>>>> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
>>>>>>>>> +		return;
>>>>>>>>> +
>>>>>>>>>  	dev = rte_eth_devices[port_id].device;
>>>>>>>>>  	if (dev == NULL) {
>>>>>>>>>  		printf("Device already removed\n");
>>>>>>>>>
>>>>>>>>
>>>>>>>> The patch is already in 19.11 [1] but it is breaking the testpmd
>>>>>>>> hotplug support.
>>>>>>>> Before 'detach_port_device()' called, the port has been stopped
>>>>>>>> and closed [2], which will make port fail from 'port_id_is_invalid()'
>>>>>>>> check and the device removal path never fully called.
>>>>>>>> The implication is, since device not detached, vfio request
>>>>>>>> interrupt keeps triggered continuously and re-starts the detach
>>>>>>>> path, but because of the half cleaned device it fails and app
>>>>>>>> gets stuck with a
>>>>>> continuous log [3].
>>>>>>>>
>>>>>>>> I wonder if the actual hotplug has been tested with this patch,
>>>>>>>> the commit log is not clear about the motivation and implication
>>>>>>>> of the patch, I am not clear why this check is added but I am
>>>>>>>> sending a patch soon to remove it back.
>>>>>>>
>>>>>>> The motivation of this patch was to prevent double detach on same
>>>>>>> port,
>>>>>> so the user cannot call detach of invalid port.
>>>>>>
>>>>>> What is the definition of the 'invalid port', if you mean device
>>>>>> already detached case, in the second call of the function "if (dev
>>>>>> == NULL)" check should prevent it going forward.
>>>>>
>>>>> No, ethdev doesn't zero the device pointer when it release a port.
>>>>
>>>> As far as I can see it does, please see below.
>>>
>>> The code below is problematic because:
>>>
>>> 1. It is very bad that the application changing ethdev structure directly.
>>
>> Where the application is changing the ethdev structure?
> 
> See it in the function we talk on:
> rte_eth_devices[sibling].device = NULL;
> 
> The application shouldn't do it - it should be done only by ethdev lib or by the PMDs.
> 
> Are you agree here?

This is really no fun :(

It is not done by the application; I already provided the call trace. It is done
in the driver .remove() path.

> 
>> Application calls the 'rte_dev_remove()' API, which does the job.
> 
> Agree, This function is freeing(rte_free) the rte_device (actually makes the rte_eth_devices[sibling].device pointer dangled) 
> and releases its related resources what makes the device detached.

No, it doesn't. I provided the full call stack and showed where the value is set to NULL.

> 
>>> 2. The below code run over valid port only, not on invalid port(UNUSED
>> state).
>>>
>>> So, the device pointer will still be valid if the port is invalid.
>>>
>>> All of this shows that this function try to detach only a valid port (probably
>> mainly because it is called by Testpmd detach command).
>>>
>>>>> So even if the port is in unused state already - means invalid, the
>>>>> device
>>>> pointer still may be valid and point to the last port that used the same id.
>>>>
>>>> If the port is closed, it is unused state, and ethdev layer resources
>>>> freed but as you said device related structures are still there,
>>>> device pointer is still valid and it is still in probed device list
>>>> etc.. We need to able to detach the device even after it is unused state.
>>>
>>> Yes, but detach is for device, not for port.
>>> The device pointer must be taken only when the port is in valid state.
>>> Why?
>>> Because if the port is in UNUSED state it is free to be allocated again by
>> ethdev layer for other device, then, the device pointer may point to other
>> device.
>>>
> 
> Do you agree on the above statement I wrote?
> 
>>>> "stop -> close -> detach" is a normal order, we shouldn't prevent it,
>>>> but your check does prevent it.
>>>
>>> Yes, this is good order, but the pointer of the device should be taken
>> before close.
>>> My patch prevent accessing invalid structure.
>>
>> The ethdev close() dev_ops, frees ethdev related resources, the rte_device
>> is still valid in that struct.
> 
> That’s exactly my concern.
> I think you wrong here, the rte_device may be invalid in that struct, especially after close():
> 
> When the port ID is closed and released, its ethdev structure moves to UNUSED state.
> When an ethdev structure is in UNUSED state it may be attached again to another rte_device - see function rte_eth_dev_allocate.
> Are you agree here?
> 
> In this case, when a new device is attached after close() and before detach_port_device() we may remove wrong rte_device and cause a lot of problems.

The problem here is re-using the ethdev structure when it is closed but not
freed completely, which results in overwriting some of its fields. This is
another issue and can be fixed in the alloc path.

> 
> Do you understand that?
> 
> One more problematic case is a user mistake by the Testpmd command which may cause segfault in the good case and memory overriding in the worst case (my patch case):
> 
> port stop all
> port detach 0
> port detach 0
> 
> detach the same port twice will cause referencing of freed pointer of rte_device.
> 
> 
> All of that is because Testpmd takes ethdev structure information from invalid ethdev structure.
> 
> My patch prevents it.

For this case I am already getting the "Device already removed" message from
the 'detach_port_device()' function.

Your patch is doing two things:
- Hiding the fact that PMD .remove() is not setting the device pointer to null
- Breaking the hotplug functionality
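
The double-detach disagreement comes down to whether the remove path clears the device pointer. A rough sketch of that dependency, with hypothetical names (detach_port, remove_clears_pointer) and a reduced struct that is not the ethdev one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the DPDK definitions. */
struct device {
	int unused;
};

struct port {
	struct device *device;
};

/*
 * Returns 0 when a detach is attempted, -1 for "Device already removed".
 * remove_clears_pointer models whether the driver remove path resets
 * the port's device pointer to NULL.
 */
static int detach_port(struct port *p, bool remove_clears_pointer)
{
	assert(p != NULL);
	if (p->device == NULL)
		return -1; /* the guard only fires if the pointer was cleared */

	/* The device would be torn down here. */
	if (remove_clears_pointer)
		p->device = NULL;
	return 0;
}
```

When the pointer is cleared, a second detach_port() call on the same port returns -1. When it is not, the second call passes the guard and would act on a stale pointer.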

>  
> 
> 
>> And yes your patch prevents accessing them and
>> prevents hotplug remove the device.
>>
> 
> Yes, my patch is not good, solved issues and caused a new one.
> 
> Agree that we need a new fix, my suggestion here is:
> 
> 1. In the Testpmd internal management for hutplug (rmv_port_callback):
> 	Call stop()
> 	Take rte_device pointer( before port close).
> 	Call close().
> 	If no other valid port for the rte_device: 
> 		call detach() by the saved rte_device pointer.

I am not sure about pushing more to the application, like checking whether any
other port is using the device, etc.

As far as I understand, your concern is the case where multiple ethdevs are
using the same device. Why not handle this in the driver .remove() path: detect
whether the device still needs to be used and, if so, free only the ethdev
resources and return an error. This error will prevent the device resources
from being freed:

pci_unplug()
  ret = rte_pci_detach_dev(pdev);
  if (ret == 0)
    rte_pci_remove_device(pdev);
    rte_devargs_remove(dev->devargs);
    ...

This will cause the application to receive an error, but this is kind of true
because not all resources are freed, since they are shared.

When the last ethdev is detached, the driver can return success, causing all
device resources to be freed.
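
A small sketch of the idea above, under stated assumptions: driver_remove, bus_unplug, struct shared_device and its fields are hypothetical, and -EBUSY is only one possible choice of error code. It shows the remove path reporting an error while other ethdevs still share the device, so that the bus layer frees the device resources only on the last detach.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device shared by several ethdev ports. */
struct shared_device {
	int nb_ethdev;       /* ethdev ports still using the device */
	int resources_freed; /* set once the bus layer frees the device */
};

/* Driver remove: release one ethdev, report busy while others remain. */
static int driver_remove(struct shared_device *dev)
{
	assert(dev != NULL);
	if (dev->nb_ethdev > 0)
		dev->nb_ethdev--;
	return dev->nb_ethdev > 0 ? -EBUSY : 0;
}

/* Bus unplug: free device resources only when the driver reports success. */
static int bus_unplug(struct shared_device *dev)
{
	int ret = driver_remove(dev);

	if (ret == 0)
		dev->resources_freed = 1;
	return ret;
}
```

For a device with two ethdevs, the first bus_unplug() returns -EBUSY and leaves the device resources in place; the second returns 0 and frees them.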

> 2. Replace the Testpmd command line for "port detach" with "detach [rte device name]":
> 	Why? 
> 	Detach by port is problematic:
> 	1. If the port is closed - Testpmd cannot get its rte_device from the related ethdev port structure.
> 	2. If the port is not closed - It is not safe to detach it.
> 	3. Attach is done by rte_device name, detach should be in same way.

Testpmd can first close() and later detach().

If it is closed already, agreed that newly attached devices shouldn't be able to
use this struct until it is freed completely. But this is kind of an edge case,
because it requires a new device to be attached after the old one is closed but
before it is detached.

>  Are you agree?
> 
> 
> I hope you understand now. 
> 
>>> And yes, Testpmd detach stays broken after my patch and after this patch
>> too.
>>>
>>>
>>>>
>>>> I am not very clear about your concern here, "point to the last port
>>>> that used the same id", can you please clarify?
>>>
>>> Yes, when ethdev layer allocates a port ID for a new device, it tries to find
>> UNUSED port.
>>> When found, the port will move to ATTACHED after the PMD finishes its
>> probing function.
>>>
>>> So, any UNUSED port may be allocated for other device and then, the
>> device pointer points to other device.
>>>
>>>>
>>>>>
>>>>>
>>>>>> But according the 'port_id_is_invalid()' API, a closed port is an
>>>>>> invalid port, I think that is wrong in this context.
>>>>>
>>>>> Why?
>>>>
>>>> Closed port is 'invalid' for using it, because ethdev resources are
>>>> freed. But it is not 'invalid' to detach it, why a port being closed
>>>> should prevent freeing its device layer resources?
>>>
>>> I didn't said that, I said that the device pointer should be taken when the
>> port is valid.
>>>
>>>
>>>>
>>>>>
>>>>> You are going to look on ethdev portid structure, don't you think we
>>>>> should
>>>> valid the port before using its structure?
>>>>
>>>> Is your main concern "rte_eth_devices[port_id].device" can be
>>>> dangling pointer?
>>>>
>>>> 1) It is not.
>>>> 2) The check you added to replace it is not correct check.
>>>>
>>> Didn't said that.
>>>
>>> It just may point to other device.
>>> It is not correct to take information from invalid structure.
>>>
>>> Don't you agree that the structure is not valid when the port is not valid?
>>>
>>>>>
>>>>>>>
>>>>>>> I agree this patch is not good and we need a fix but I think the
>>>>>>> bug is
>>>>>> conceptual.
>>>>>>>
>>>>>>> Testpmd tries to do detach by port_id which is derived by ethdev
>>>>>>> port id
>>>>>> while detach work with rte_device.
>>>>>>>
>>>>>>> For example:
>>>>>>> you can see in the line above after +++: dev =
>>>>>>> rte_eth_devices[port_id].device, Testpmd may access invalid  or
>>>>>> reallocated ethdev structure to get the device name and may even
>>>>>> detach unwanted rte_device.
>>>>>>
>>>>>> I thinks whichever function calling 'detach_port_device()' should
>>>>>> check the port validity.
>>>>>> 'detach_port_device()' doesn't know if port reallocated or not, it
>>>>>> will free the given port_id, and when freeing done
>>>>>> 'rte_eth_devices[port_id].device' will be NULL, this looks to me a
>>>>>> valid
>>>> check.
>>>>>
>>>>> Please validate me, check ethdev, I don't think so,
>>>> 'rte_eth_devices[port_id].device still valid after detach.
>>>>
>>>> This is a long stack trace, but what happens is:
>>>>
>>>> rte_dev_remove
>>>>   bus unplug
>>>>     driver remove
>>>>       rte_eth_dev_pci_release
>>>>         eth_dev->device = NULL;
>>>
>>> The last line doesn't happen here because the rte_eth_dev_pci_release
>> moves the port to UNUSED.
>>> And it is bad that application is trying to do it.
>>>
>>>>
>>>> Please check the driver you are testing remove() ops
>>>> (rte_pci_driver.remove()) does cleans the ethdev fields.
>>>>
>>>> A little more detailed stack trace for my environment:
>>>> #0  rte_eth_dev_pci_release (eth_dev=..) at  rte_ethdev_pci.h:143
>>>> #1  rte_eth_dev_pci_generic_remove (pci_dev=.., dev_uninit=..) at
>>>> rte_ethdev_pci.h:199
>>>> #2  eth_i40e_pci_remove (pci_dev=..) at i40e_ethdev.c:710
>>>> #3  rte_pci_detach_dev (dev=..) at pci_common.c:243
>>>> #4  pci_unplug (dev=..) at pci_common.c:537
>>>> #5  local_dev_remove (dev=..) at eal_common_dev.c:321
>>>> #6  rte_dev_remove (dev=..) at eal_common_dev.c:402
>>>> #7  detach_port_device (port_id=0) at testpmd.c:2663
>>>> #8  cmd_operate_detach_port_parsed (parsed_result=.., cl=..,
>>>> data=0x0) at
>>>> cmdline.c:1501
>>>> #9  cmdline_parse (cl=.., buf=.."port detach 0\n") at
>>>> cmdline_parse.c:295
>>>> #10 cmdline_valid_buffer (rdl=.., buf="port detach 0\n", size=15) at
>>>> cmdline.c:31
>>>> #11 rdline_char_in (rdl=.., c=10 '\n') at  cmdline_rdline.c:421
>>>> #12 cmdline_in (cl=.., buf=.."\n", size=1) at cmdline.c:148
>>>> #13 cmdline_interact (cl=..) at cmdline.c:227
>>>> #14 prompt () at cmdline.c:19644
>>>> #15 main (argc=3, argv=..) at testpmd.c:3617
>>>>
>>> Not all the drivers are doing it.
>>> I think it is good if we will do it by ethdev release function.
>>>
>>>
>>>>>
>>>>>> The caller of the 'detach_port_device()' should ensure correct
>>>>>> port_id passed to the function.
>>>>>
>>>>> What is correct port id, if the port was released , is it correct?
>>>>
>>>> You are right, there is no good answer for it, I was thinking
>>>> application state information can be used but no ethdev should able
>>>> to provide this information, we need 'is_freed' kind of check for it,
>>>> currently 'rte_eth_devices[port_id].device' is used for that purpose.
>>>
>>> It is wrong to take device from invalid structure. (I explained a lot above).
>>> Better way to save the rte_device in the start(before close) and call detach
>> by rte_device when we sure that all the ports of this rte_device are
>> released(mlx4 can manage 2 ports one rte_device, also any device supports
>> representors).
>>>
>>> Let's do correct fix.
>>
>> Matan,
>>
>> It become so hard to follow this discussion.The check you add is preventing
>> device hotplug, so breaking the feature, but you want to keep the check to
>> fix something which is still not clear to me.
>>
>> To simplify things, can you please clarify what error are you getting with this
>> patch, and can you please give some details how to reproduce it? So I can
>> debug the issue you are having.
> 
> Added details above, hope everything is clear when you read this line 😊 

Overall, I believe all this fuss is about the PMD you are testing not clearing
the 'rte_eth_devices[port_id].device' pointer, which should be handled at the
driver level, but you are trying to fix it in testpmd, causing testpmd to fail.


> 
>>
>>>
>>>
>>>>
>>>>>
>>>>>>>
>>>>>>> So, detach is broken with and without this patch.
>>>>>>
>>>>>> I can't see how it is broken without the check, how the problem you
>>>>>> mentioned can be reproduced? Or is it a theoretical issue?
>>>>>> But with this check hotplug support is %100 reproducible broken.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think Testpmd should change the concept of rte_device mapping
>>>>>>> and put
>>>>>> attention to next:
>>>>>>> 1. Don't detach by ethdev port ID.
>>>>>>> 2. Multiple ethdev port IDs may related to the same rte_device.
>>>>>>>
>>>>>>> The Testpmd user should be sure that all the port IDs of the
>>>>>>> rte_device are
>>>>>> released before the detach call and Testpmd maybe need to validate it.
>>>>>>> And like attach, detach should be triggered by PCI address \
>>>>>>> rte_device
>>>>>> name.
>>>>>>>
>>>>>>
>>>>>> We need to know about port_id too to be able to stop/close it.
>>>>>> And sure no objection to improve the hotplug support but it is
>>>>>> broken now, lets fix it first.
>>>>>>
>>
>> <....>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-02-03 15:58                   ` Ferruh Yigit
@ 2020-02-03 17:10                     ` Matan Azrad
  2020-02-12 13:49                       ` Ferruh Yigit
  0 siblings, 1 reply; 23+ messages in thread
From: Matan Azrad @ 2020-02-03 17:10 UTC (permalink / raw)
  To: Ferruh Yigit, Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang


Hi

From: Ferruh Yigit
> On 1/25/2020 6:56 PM, Matan Azrad wrote:
> > Hi Ferruh
> >
> > From: Ferruh Yigit
> >> On 1/23/2020 7:25 PM, Matan Azrad wrote:
> >>> Hi
> >>>
> >>> From: Ferruh Yigit
> >>>> On 1/23/2020 3:29 PM, Matan Azrad wrote:
> >>>>>
> >>>>> Hi
> >>>>>
> >>>>> From: Ferruh Yigit
> >>>>>> On 1/23/2020 2:05 PM, Matan Azrad wrote:
> >>>>>>> Hi
> >>>>>>>
> >>>>>>> From: Yigit, Ferruh
> >>>>>>>> On 11/12/2019 8:47 AM, Matan Azrad wrote:
> >>>>>>>>> The port was not validated before detaching.
> >>>>>>>>>
> >>>>>>>>> Ignore port detach operation when the port is not valid.
> >>>>>>>>>
> >>>>>>>>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device
> >>>>>>>>> twice")
> >>>>>>>>> Cc: thomas@monjalon.net
> >>>>>>>>> Cc: stable@dpdk.org
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Matan Azrad <matan@mellanox.com>
> >>>>>>>>> ---
> >>>>>>>>>  app/test-pmd/testpmd.c | 3 +++
> >>>>>>>>>  1 file changed, 3 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> >>>>>>>>> index 4444346..370eefe 100644
> >>>>>>>>> --- a/app/test-pmd/testpmd.c
> >>>>>>>>> +++ b/app/test-pmd/testpmd.c
> >>>>>>>>> @@ -2545,6 +2545,9 @@ struct extmem_param {
> >>>>>>>>>
> >>>>>>>>>  	printf("Removing a device...\n");
> >>>>>>>>>
> >>>>>>>>> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
> >>>>>>>>> +		return;
> >>>>>>>>> +
> >>>>>>>>>  	dev = rte_eth_devices[port_id].device;
> >>>>>>>>>  	if (dev == NULL) {
> >>>>>>>>>  		printf("Device already removed\n");
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> The patch is already in 19.11 [1] but it is breaking the
> >>>>>>>> testpmd hotplug support.
> >>>>>>>> Before 'detach_port_device()' called, the port has been stopped
> >>>>>>>> and closed [2], which will make port fail from 'port_id_is_invalid()'
> >>>>>>>> check and the device removal path never fully called.
> >>>>>>>> The implication is, since device not detached, vfio request
> >>>>>>>> interrupt keeps triggered continuously and re-starts the detach
> >>>>>>>> path, but because of the half cleaned device it fails and app
> >>>>>>>> gets stuck with a
> >>>>>> continuous log [3].
> >>>>>>>>
> >>>>>>>> I wonder if the actual hotplug has been tested with this patch,
> >>>>>>>> the commit log is not clear about the motivation and
> >>>>>>>> implication of the patch, I am not clear why this check is
> >>>>>>>> added but I am sending a patch soon to remove it back.
> >>>>>>>
> >>>>>>> The motivation of this patch was to prevent double detach on
> >>>>>>> same port,
> >>>>>> so the user cannot call detach of invalid port.
> >>>>>>
> >>>>>> What is the definition of the 'invalid port', if you mean device
> >>>>>> already detached case, in the second call of the function "if
> >>>>>> (dev == NULL)" check should prevent it going forward.
> >>>>>
> >>>>> No, ethdev doesn't zero the device pointer when it release a port.
> >>>>
> >>>> As far as I can see it does, please see below.
> >>>
> >>> The code below is problematic because:
> >>>
> >>> 1. It is very bad that the application changing ethdev structure directly.
> >>
> >> Where the application is changing the ethdev structure?
> >
> > See it in the function we talk on:
> > rte_eth_devices[sibling].device = NULL;
> >
> > The application shouldn't do it - it should be done only by ethdev lib or by
> the PMDs.
> >
> > Are you agree here?
> 
> This is really no fun :(
> 
> It is not done by application, I already provided the call trace. This is done by
> the path of driver .remove().

Yes, probably, but it is also done by the testpmd application; I copied that line from testpmd.

Don't you see it?

> >
> >> Application calls the 'rte_dev_remove()' API, which does the job.
> >
> > Agree, This function is freeing(rte_free) the rte_device (actually
> > makes the rte_eth_devices[sibling].device pointer dangled) and releases
> its related resources what makes the device detached.
> 
> No it doesn't, I provided full call stack, and showed where the value set to
> NULL.

Look at the testpmd function again - it does it too.

> >
> >>> 2. The below code run over valid port only, not on invalid
> >>> port(UNUSED
> >> state).
> >>>
> >>> So, the device pointer will still be valid if the port is invalid.
> >>>
> >>> All of this shows that this function try to detach only a valid port
> >>> (probably
> >> mainly because it is called by Testpmd detach command).
> >>>
> >>>>> So even if the port is in unused state already - means invalid,
> >>>>> the device
> >>>> pointer still may be valid and point to the last port that used the same
> id.
> >>>>
> >>>> If the port is closed, it is unused state, and ethdev layer
> >>>> resources freed but as you said device related structures are still
> >>>> there, device pointer is still valid and it is still in probed
> >>>> device list etc.. We need to able to detach the device even after it is
> unused state.
> >>>
> >>> Yes, but detach is for device, not for port.
> >>> The device pointer must be taken only when the port is in valid state.
> >>> Why?
> >>> Because if the port is in UNUSED state it is free to be allocated
> >>> again by
> >> ethdev layer for other device, then, the device pointer may point to
> >> other device.
> >>>
> >
> > Do you agree on the above statement I wrote?
> >
> >>>> "stop -> close -> detach" is a normal order, we shouldn't prevent
> >>>> it, but your check does prevent it.
> >>>
> >>> Yes, this is good order, but the pointer of the device should be
> >>> taken
> >> before close.
> >>> My patch prevent accessing invalid structure.
> >>
> >> The ethdev close() dev_ops, frees ethdev related resources, the
> >> rte_device is still valid in that struct.
> >
> > That’s exactly my concern.
> > I think you wrong here, the rte_device may be invalid in that struct,
> especially after close():
> >
> > When the port ID is closed and released, its ethdev structure moves to
> UNUSED state.
> > When an ethdev structure is in UNUSED state it may be attached again to
> another rte_device - see function rte_eth_dev_allocate.
> > Are you agree here?
> >
> > In this case, when a new device is attached after close() and before
> detach_port_device() we may remove wrong rte_device and cause a lot of
> problems.
> 
> The problem here is re-using the ethdev structure when it is closed but not
> freed completely, resulting overwriting some fields of it. This is another issue
> and can be fixed in the alloc path.

Sorry, I don't agree with you here.
A port which is closed can be allocated again for another device - this is the basis of the hot-plug mechanism in DPDK.
Reading the rte_device from a port which was closed may lead to removing another, unrelated rte_device.

I agree that the PMD should clear the ethdev structure in remove; mlx5 doesn't do it and should be fixed. I don't know about other PMDs.
But this is not the issue I am talking about.

Testpmd shouldn't read the device pointer from a port which was closed - this is a race.
  
> >
> > Do you understand that?
> >
> > One more problematic case is a user mistake by the Testpmd command
> which may cause segfault in the good case and memory overriding in the
> worst case (my patch case):
> >
> > port stop all
> > port detach 0
> > port detach 0
> >
> > detach the same port twice will cause referencing of freed pointer of
> rte_device.
> >
> >
> > All of that is because Testpmd takes ethdev structure information from
> invalid ethdev structure.
> >
> > My patch prevents it.
> 
> For this case I am already getting "Device already removed" message from
> 'detach_port_device()' function.
> 
> Your patch is doing two things:
> - Hiding the fact that PMD .remove() is not setting the device pointer to null

The device pointer is also zeroed by testpmd - that is where the hiding is.

> - Breaking the hotplug functionality

To be precise - leaving it broken.

> 
> >
> >
> >
> >> And yes your patch prevents accessing them and prevents hotplug
> >> remove the device.
> >>
> >
> > Yes, my patch is not good, solved issues and caused a new one.
> >
> > Agree that we need a new fix, my suggestion here is:
> >
> > 1. In the Testpmd internal management for hotplug (rmv_port_callback):
> > 	Call stop()
> > 	Take rte_device pointer( before port close).
> > 	Call close().
> > 	If no other valid port for the rte_device:
> > 		call detach() by the saved rte_device pointer.
> 
> Not sure about pushing more to the application, like checking if any other
> port using a device etc..

And about taking the device pointer before close(), do you agree?

> As far as I understand your concern is when multiple ethdev are using same
> device, why not handle this in driver .remove() path, like detect if device still
> needs to be used and if so free only ethdev resources and return error, this
> error will prevent device resources to be freed:
> 
> pci_unplug()
>   ret = rte_pci_detach_dev(pdev);
>   if (ret == 0)
>     rte_pci_remove_device(pdev);
>     rte_devargs_remove(dev->devargs);
>     ...
> 
> This will cause the application receive an error but this is kind of true because
> all resources are not freed because they are shared.
> 
> When last ethdev detached, driver can send success causing all device
> resources to be freed.

This can be good for multi-port handling, but testpmd should handle this error and report it correctly.

> > 2. Replace the Testpmd command line for "port detach" with "detach [rte
> device name]":
> > 	Why?
> > 	Detach by port is problematic:
> > 	1. If the port is closed - Testpmd cannot get its rte_device from the
> related ethdev port structure.
> > 	2. If the port is not closed - It is not safe to detach it.
> > 	3. Attach is done by rte_device name, detach should be in same way.
> 
> Testpmd can first close() later detach().

Yes, close by port, detach by rte_device name (for example, the PCI name).
That’s what I said.


> If it is closed already, agreed that new attached devices shouldn't be able to
> this struct until it is freed completely. But this is kind of edge case, because it
> required new device to be attached after old one closed but before it is
> detached.
> 
> >  Are you agree?

This is a race, not an edge case.
What is "freed completely"?
IMO it is when the port is in the UNUSED state (after close\release).

Hotplug can be triggered internally, in parallel.
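
To make the race concrete, here is a small self-contained model. It is only a
sketch: the port table, the states and the sketch_* helpers are hypothetical
and much simpler than the real ethdev code. It shows a closed slot being handed
to another device, so the device has to be remembered before close().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the port table; not the real ethdev structures. */
enum sketch_state { SKETCH_UNUSED, SKETCH_ATTACHED };

struct sketch_port {
	enum sketch_state state;
	const char *device; /* backing device of the port, if any */
};

#define SKETCH_MAX_PORTS 2
static struct sketch_port sketch_ports[SKETCH_MAX_PORTS];

/* Attach: take the first UNUSED slot, as a port allocator would. */
static int
sketch_attach(const char *device)
{
	for (int i = 0; i < SKETCH_MAX_PORTS; i++) {
		if (sketch_ports[i].state == SKETCH_UNUSED) {
			sketch_ports[i].state = SKETCH_ATTACHED;
			sketch_ports[i].device = device;
			return i;
		}
	}
	return -1; /* no free slot */
}

/* Close: the slot becomes UNUSED; the stale device field is left behind. */
static void
sketch_close(int port_id)
{
	assert(port_id >= 0 && port_id < SKETCH_MAX_PORTS);
	sketch_ports[port_id].state = SKETCH_UNUSED;
}

/* Safe order: remember the device while the port is still valid. */
static const char *
sketch_close_and_get_device(int port_id)
{
	const char *saved;

	assert(port_id >= 0 && port_id < SKETCH_MAX_PORTS);
	saved = sketch_ports[port_id].device;
	sketch_close(port_id);
	return saved;
}
```

If a second device is attached after the close, it gets the same slot, and
reading the device from the port table at detach time returns the new device.
Only the pointer saved before close() still names the device to detach.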

> >
> > I hope you understand now.
> >
> >>> And yes, Testpmd detach stays broken after my patch and after this
> >>> patch
> >> too.
> >>>
> >>>
> >>>>
<snip>
> >> To simplify things, can you please clarify what error are you getting
> >> with this patch, and can you please give some details how to
> >> reproduce it? So I can debug the issue you are having.
> >
> > Added details above, hope everything is clear when you read this line
> > 😊
> 
> Overall I believe this all fuss is about the PMD you are testing not cleaning the
> 'rte_eth_devices[port_id].device' pointer which should be handled in driver
> level but you are trying to fix this in testpmd causing it fail.

Sorry, but no, it is all about the hotplug race.

Even if the PMD clears the device pointer, testpmd may still release the wrong rte_device.
<snip>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-stable] [dpdk-dev] [PATCH 2/2] app/testpmd: fix invalid port detaching
  2020-02-03 17:10                     ` Matan Azrad
@ 2020-02-12 13:49                       ` Ferruh Yigit
  0 siblings, 0 replies; 23+ messages in thread
From: Ferruh Yigit @ 2020-02-12 13:49 UTC (permalink / raw)
  To: Matan Azrad, Yigit, Ferruh, dev, Bernard Iremonger
  Cc: Gaetan Rivet, Thomas Monjalon, stable, David Marchand, Jeff Guo,
	Qi Zhang

On 2/3/2020 5:10 PM, Matan Azrad wrote:
> 
> Hi
> 
> From: Ferruh Yigit
>> On 1/25/2020 6:56 PM, Matan Azrad wrote:
>>> Hi Ferruh
>>>
>>> From: Ferruh Yigit
>>>> On 1/23/2020 7:25 PM, Matan Azrad wrote:
>>>>> Hi
>>>>>
>>>>> From: Ferruh Yigit
>>>>>> On 1/23/2020 3:29 PM, Matan Azrad wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> From: Ferruh Yigit
>>>>>>>> On 1/23/2020 2:05 PM, Matan Azrad wrote:
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> From: Yigit, Ferruh
>>>>>>>>>> On 11/12/2019 8:47 AM, Matan Azrad wrote:
>>>>>>>>>>> The port was not validated before detaching.
>>>>>>>>>>>
>>>>>>>>>>> Ignore port detach operation when the port is not valid.
>>>>>>>>>>>
>>>>>>>>>>> Fixes: f8e5baa2662d ("app/testpmd: check not detaching device
>>>>>>>>>>> twice")
>>>>>>>>>>> Cc: thomas@monjalon.net
>>>>>>>>>>> Cc: stable@dpdk.org
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Matan Azrad <matan@mellanox.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  app/test-pmd/testpmd.c | 3 +++
>>>>>>>>>>>  1 file changed, 3 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
>>>>>>>>>>> index 4444346..370eefe 100644
>>>>>>>>>>> --- a/app/test-pmd/testpmd.c
>>>>>>>>>>> +++ b/app/test-pmd/testpmd.c
>>>>>>>>>>> @@ -2545,6 +2545,9 @@ struct extmem_param {
>>>>>>>>>>>
>>>>>>>>>>>  	printf("Removing a device...\n");
>>>>>>>>>>>
>>>>>>>>>>> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
>>>>>>>>>>> +		return;
>>>>>>>>>>> +
>>>>>>>>>>>  	dev = rte_eth_devices[port_id].device;
>>>>>>>>>>>  	if (dev == NULL) {
>>>>>>>>>>>  		printf("Device already removed\n");
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The patch is already in 19.11 [1] but it is breaking the
>>>>>>>>>> testpmd hotplug support.
>>>>>>>>>> Before 'detach_port_device()' called, the port has been stopped
>>>>>>>>>> and closed [2], which will make port fail from 'port_id_is_invalid()'
>>>>>>>>>> check and the device removal path never fully called.
>>>>>>>>>> The implication is, since device not detached, vfio request
>>>>>>>>>> interrupt keeps triggered continuously and re-starts the detach
>>>>>>>>>> path, but because of the half cleaned device it fails and app
>>>>>>>>>> gets stuck with a
>>>>>>>> continuous log [3].
>>>>>>>>>>
>>>>>>>>>> I wonder if the actual hotplug has been tested with this patch,
>>>>>>>>>> the commit log is not clear about the motivation and
>>>>>>>>>> implication of the patch, I am not clear why this check is
>>>>>>>>>> added but I am sending a patch soon to remove it back.
>>>>>>>>>
>>>>>>>>> The motivation of this patch was to prevent double detach on
>>>>>>>>> same port,
>>>>>>>> so the user cannot call detach of invalid port.
>>>>>>>>
>>>>>>>> What is the definition of the 'invalid port', if you mean device
>>>>>>>> already detached case, in the second call of the function "if
>>>>>>>> (dev == NULL)" check should prevent it going forward.
>>>>>>>
>>>>>>> No, ethdev doesn't zero the device pointer when it release a port.
>>>>>>
>>>>>> As far as I can see it does, please see below.
>>>>>
>>>>> The code below is problematic because:
>>>>>
>>>>> 1. It is very bad that the application changing ethdev structure directly.
>>>>
>>>> Where the application is changing the ethdev structure?
>>>
>>> See it in the function we talk on:
>>> rte_eth_devices[sibling].device = NULL;
>>>
>>> The application shouldn't do it - it should be done only by ethdev lib or by
>> the PMDs.
>>>
>>> Are you agree here?
>>
>> This is really no fun :(
>>
>> It is not done by application, I already provided the call trace. This is done by
>> the path of driver .remove().
> 
> Yes, probably, but also by testpmd application, I copied it from testpmd application.
> 
> Don't you see it?
> 
>>>
>>>> Application calls the 'rte_dev_remove()' API, which does the job.
>>>
>>> Agree, This function is freeing(rte_free) the rte_device (actually
>>> makes the rte_eth_devices[sibling].device pointer dangled) and releases
>> its related resources what makes the device detached.
>>
>> No it doesn't, I provided full call stack, and showed where the value set to
>> NULL.
> 
> See again the testpmd function - it  does it too.
> 
>>>
>>>>> 2. The below code run over valid port only, not on invalid
>>>>> port(UNUSED
>>>> state).
>>>>>
>>>>> So, the device pointer will still be valid if the port is invalid.
>>>>>
>>>>> All of this shows that this function try to detach only a valid port
>>>>> (probably
>>>> mainly because it is called by Testpmd detach command).
>>>>>
>>>>>>> So even if the port is in unused state already - means invalid,
>>>>>>> the device
>>>>>> pointer still may be valid and point to the last port that used the same
>> id.
>>>>>>
>>>>>> If the port is closed, it is unused state, and ethdev layer
>>>>>> resources freed but as you said device related structures are still
>>>>>> there, device pointer is still valid and it is still in probed
>>>>>> device list etc.. We need to able to detach the device even after it is
>> unused state.
>>>>>
>>>>> Yes, but detach is for device, not for port.
>>>>> The device pointer must be taken only when the port is in valid state.
>>>>> Why?
>>>>> Because if the port is in UNUSED state it is free to be allocated
>>>>> again by
>>>> ethdev layer for other device, then, the device pointer may point to
>>>> other device.
>>>>>
>>>
>>> Do you agree on the above statement I wrote?
>>>
>>>>>> "stop -> close -> detach" is a normal order, we shouldn't prevent
>>>>>> it, but your check does prevent it.
>>>>>
>>>>> Yes, this is good order, but the pointer of the device should be
>>>>> taken
>>>> before close.
>>>>> My patch prevent accessing invalid structure.
>>>>
>>>> The ethdev close() dev_ops, frees ethdev related resources, the
>>>> rte_device is still valid in that struct.
>>>
>>> That’s exactly my concern.
>>> I think you wrong here, the rte_device may be invalid in that struct,
>> especially after close():
>>>
>>> When the port ID is closed and released, its ethdev structure moves to
>> UNUSED state.
>>> When an ethdev structure is in UNUSED state it may be attached again to
>> another rte_device - see function rte_eth_dev_allocate.
>>> Are you agree here?
>>>
>>> In this case, when a new device is attached after close() and before
>> detach_port_device() we may remove wrong rte_device and cause a lot of
>> problems.
>>
>> The problem here is re-using the ethdev structure when it is closed but not
>> freed completely, resulting overwriting some fields of it. This is another issue
>> and can be fixed in the alloc path.
> 
> Sorry, don't agree with you here.
> Port which is closed can be allocated again for other device - this is the basic for hot-plug mechanism in dpdk.
> Reading the rte_device from port which was closed may remove other rte_device which is not related.
> 
> Agree that the PMD should clear the ethdev structure in remove, mlx5 doesn't do it and should be fixed, I don't know about other PMDS.
> But this is not the issue I talk about.
> 
> Testpmd shouldn't read device pointer from port which was closed - this is race.
>   
>>>
>>> Do you understand that?
>>>
>>> One more problematic case is a user mistake by the Testpmd command
>> which may cause segfault in the good case and memory overriding in the
>> worst case (my patch case):
>>>
>>> port stop all
>>> port detach 0
>>> port detach 0
>>>
>>> detach the same port twice will cause referencing of freed pointer of
>> rte_device.
>>>
>>>
>>> All of that is because Testpmd takes ethdev structure information from
>> invalid ethdev structure.
>>>
>>> My patch prevents it.
>>
>> For this case I am already getting "Device already removed" message from
>> 'detach_port_device()' function.
>>
>> Your patch is doing two things:
>> - Hiding the fact that PMD .remove() is not setting the device pointer to null
> 
> The device pointer is zero also by testpmd - the hiding is here.
> 
>> - Breaking the hotplug functionality
> 
> To be precise - stay it broken.
> 
>>
>>>
>>>
>>>
>>>> And yes your patch prevents accessing them and prevents hotplug
>>>> remove the device.
>>>>
>>>
>>> Yes, my patch is not good, solved issues and caused a new one.
>>>
>>> Agree that we need a new fix, my suggestion here is:
>>>
>>> 1. In the Testpmd internal management for hotplug (rmv_port_callback):
>>> 	Call stop()
>>> 	Take rte_device pointer( before port close).
>>> 	Call close().
>>> 	If no other valid port for the rte_device:
>>> 		call detach() by the saved rte_device pointer.
>>
>> Not sure about pushing more to the application, like checking if any other
>> port using a device etc..
> 
> And for device pointer before close(), do you agree?
> 
>> As far as I understand your concern is when multiple ethdev are using same
>> device, why not handle this in driver .remove() path, like detect if device still
>> needs to be used and if so free only ethdev resources and return error, this
>> error will prevent device resources to be freed:
>>
>> pci_unplug()
>>   ret = rte_pci_detach_dev(pdev);
>>   if (ret == 0)
>>     rte_pci_remove_device(pdev);
>>     rte_devargs_remove(dev->devargs);
>>     ...
>>
>> This will cause the application receive an error but this is kind of true because
>> all resources are not freed because they are shared.
>>
>> When last ethdev detached, driver can send success causing all device
>> resources to be freed.
> 
> Can be good for multi-port handling, but testpmd should handle this error and report it correctly.
> 
>>> 2. Replace the Testpmd command line for "port detach" with "detach [rte
>> device name]":
>>> 	Why?
>>> 	Detach by port is problematic:
>>> 	1. If the port is closed - Testpmd cannot get its rte_device from the
>> related ethdev port structure.
>>> 	2. If the port is not closed - It is not safe to detach it.
>>> 	3. Attach is done by rte_device name, detach should be in same way.
>>
>> Testpmd can first close() later detach().
> 
> Yes, close by port, detach by rte_device name (for example pci name).
> That’s what I said.
> 
> 
>> If it is closed already, agreed that new attached devices shouldn't be able to
>> this struct until it is freed completely. But this is kind of edge case, because it
>> required new device to be attached after old one closed but before it is
>> detached.
>>
>>>  Are you agree?
> 
> This is race, no edge.
> What is "freed completely"?
> IMO it is when the port is in UNUSED state (after close\release).
> 
> Hotplug can be triggered internally in parallel.
> 
>>>
>>> I hope you understand now.
>>>
>>>>> And yes, Testpmd detach stays broken after my patch and after this
>>>>> patch
>>>> too.
>>>>>
>>>>>
>>>>>>
> <snip>
>>>> To simplify things, can you please clarify what error are you getting
>>>> with this patch, and can you please give some details how to
>>>> reproduce it? So I can debug the issue you are having.
>>>
>>> Added details above, hope everything is clear when you read this line
>>> 😊
>>
>> Overall I believe this all fuss is about the PMD you are testing not cleaning the
>> 'rte_eth_devices[port_id].device' pointer which should be handled in driver
>> level but you are trying to fix this in testpmd causing it fail.
> 
> Sorry, but no, It is all about hotplug race.
> 
> Even if the PMD clear the device pointer, the testpmd still may release wrong rte_device.

Yes it may, although that is less likely to occur: it requires a new device to
be hot-added between the close() and the detach of the other device.

Would you agree that there are two problems:

1) When testpmd closes a port, a newly attached port can re-use it, overwriting
some fields, so relying on the data structures of the closed port is not safe.

2) A PMD not clearing the ethdev->device pointer in .remove() may cause issues
on a double detach of a port.


For (1) I suggest fixing it in the attach path: don't re-use an eth_dev port id
unless it is completely freed. This may need a new state. Does that make sense?

For (2), PMDs that want hotplug support need to fix it.
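
A rough sketch of what (1) could look like. Everything here is hypothetical:
the SKETCH_PORT_CLOSED state and the sketch_* functions are illustration only
and are not existing ethdev API. The allocator skips a port that is closed but
whose device is not yet detached.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_port_state {
	SKETCH_PORT_UNUSED,   /* completely freed, may be allocated again */
	SKETCH_PORT_ATTACHED, /* in use */
	SKETCH_PORT_CLOSED,   /* hypothetical new state: closed, not yet detached */
};

#define SKETCH_NB_PORTS 4
static enum sketch_port_state sketch_state[SKETCH_NB_PORTS];

/* Attach path: only completely freed port ids are handed out. */
static int
sketch_port_allocate(void)
{
	for (int i = 0; i < SKETCH_NB_PORTS; i++) {
		if (sketch_state[i] == SKETCH_PORT_UNUSED) {
			sketch_state[i] = SKETCH_PORT_ATTACHED;
			return i;
		}
	}
	return -1; /* no free port id */
}

/* close(): ethdev resources are gone, but the id stays reserved. */
static void
sketch_port_close(int id)
{
	assert(id >= 0 && id < SKETCH_NB_PORTS);
	assert(sketch_state[id] == SKETCH_PORT_ATTACHED);
	sketch_state[id] = SKETCH_PORT_CLOSED;
}

/* detach(): the device is removed, so the id can be re-used from now on. */
static void
sketch_port_detach(int id)
{
	assert(id >= 0 && id < SKETCH_NB_PORTS);
	assert(sketch_state[id] == SKETCH_PORT_CLOSED);
	sketch_state[id] = SKETCH_PORT_UNUSED;
}
```

In this model a device attached between close() and detach gets a different
port id, so the fields of the closed port are not overwritten before the
detach completes.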




^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-02-12 13:49 UTC | newest]

Thread overview: 23+ messages
2019-11-12  8:47 [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear Matan Azrad
2019-11-12  8:47 ` [dpdk-stable] [PATCH 2/2] app/testpmd: fix invalid port detaching Matan Azrad
2019-11-12 11:20   ` Iremonger, Bernard
2019-11-20 22:52     ` David Marchand
2020-01-23 13:19   ` [dpdk-stable] [dpdk-dev] " Yigit, Ferruh
2020-01-23 14:05     ` Matan Azrad
2020-01-23 14:48       ` Ferruh Yigit
2020-01-23 15:29         ` Matan Azrad
2020-01-23 18:14           ` Ferruh Yigit
2020-01-23 19:25             ` Matan Azrad
2020-01-24 16:28               ` Ferruh Yigit
2020-01-25 18:56                 ` Matan Azrad
2020-02-03 15:58                   ` Ferruh Yigit
2020-02-03 17:10                     ` Matan Azrad
2020-02-12 13:49                       ` Ferruh Yigit
2019-11-19 22:40 ` [dpdk-stable] [PATCH 1/2] bus/pci: fix driver detach clear Thomas Monjalon
2019-11-20  9:02   ` Matan Azrad
2019-11-20  9:47 ` [dpdk-stable] [PATCH v2] " Matan Azrad
2019-11-20 13:03   ` David Marchand
2019-11-20 13:44     ` Matan Azrad
2019-11-20 13:51     ` Thomas Monjalon
2019-11-20 17:22       ` David Marchand
2019-11-20 22:52   ` David Marchand
