DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH] net/mlx4: port start failure after device removal
@ 2018-01-19 10:09 Moti Haimovsky
  2018-01-19 10:16 ` [dpdk-dev] [PATCH v2] net/mlx4: fix port start fail " Moti Haimovsky
  0 siblings, 1 reply; 17+ messages in thread
From: Moti Haimovsky @ 2018-01-19 10:09 UTC (permalink / raw)
  To: adrien.mazarguil; +Cc: dev, Moti Haimovsky

In failsafe device start can be called for ports/devices that
had been plugged out.
This patch fixes mlx4 port start failure when called by the failsafe
PMD for removed devices.

Fixes: a6e8b01c3c26 ("net/mlx4: compact interrupt functions")

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
 drivers/net/mlx4/mlx4.c      | 10 ++++++++--
 drivers/net/mlx4/mlx4.h      |  2 ++
 drivers/net/mlx4/mlx4_intr.c | 43 ++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index 61c5bf4..c67b221 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -108,7 +108,13 @@ struct mlx4_conf {
 		      " flow error type %d, cause %p, message: %s",
 		      -ret, strerror(-ret), error.type, error.cause,
 		      error.message ? error.message : "(unspecified)");
+		goto exit;
 	}
+	ret = mlx4_intr_install(priv);
+	if (ret)
+		ERROR("%p: interrupt handler installation failed",
+		     (void *)dev);
+exit:
 	return ret;
 }
 
@@ -141,7 +147,7 @@ struct mlx4_conf {
 		      (void *)dev, strerror(-ret));
 		goto err;
 	}
-	ret = mlx4_intr_install(priv);
+	ret = mlx4_intr_enable(priv);
 	if (ret) {
 		ERROR("%p: interrupt handler installation failed",
 		     (void *)dev);
@@ -187,7 +193,7 @@ struct mlx4_conf {
 	dev->rx_pkt_burst = mlx4_rx_burst_removed;
 	rte_wmb();
 	mlx4_flow_sync(priv, NULL);
-	mlx4_intr_uninstall(priv);
+	mlx4_intr_disable(priv);
 	mlx4_rss_deinit(priv);
 }
 
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 99dc335..ab4f396 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -176,6 +176,8 @@ int mlx4_flow_ctrl_set(struct rte_eth_dev *dev,
 
 int mlx4_intr_uninstall(struct priv *priv);
 int mlx4_intr_install(struct priv *priv);
+int mlx4_intr_enable(struct priv *priv);
+void mlx4_intr_disable(struct priv *priv);
 int mlx4_rx_intr_disable(struct rte_eth_dev *dev, uint16_t idx);
 int mlx4_rx_intr_enable(struct rte_eth_dev *dev, uint16_t idx);
 
diff --git a/drivers/net/mlx4/mlx4_intr.c b/drivers/net/mlx4/mlx4_intr.c
index 9becee4..eeab625 100644
--- a/drivers/net/mlx4/mlx4_intr.c
+++ b/drivers/net/mlx4/mlx4_intr.c
@@ -73,6 +73,8 @@
 {
 	struct rte_intr_handle *intr_handle = &priv->intr_handle;
 
+	if (intr_handle == NULL || intr_handle->intr_vec == NULL)
+		return;
 	rte_intr_free_epoll_fd(intr_handle);
 	free(intr_handle->intr_vec);
 	intr_handle->nb_efd = 0;
@@ -291,7 +293,7 @@
 	}
 	rte_eal_alarm_cancel((void (*)(void *))mlx4_link_status_alarm, priv);
 	priv->intr_alarm = 0;
-	mlx4_rx_intr_vec_disable(priv);
+	mlx4_intr_disable(priv);
 	rte_errno = err;
 	return 0;
 }
@@ -313,8 +315,6 @@
 	int rc;
 
 	mlx4_intr_uninstall(priv);
-	if (intr_conf->rxq && mlx4_rx_intr_vec_enable(priv) < 0)
-		goto error;
 	if (intr_conf->lsc | intr_conf->rmv) {
 		priv->intr_handle.fd = priv->ctx->async_fd;
 		rc = rte_intr_callback_register(&priv->intr_handle,
@@ -395,3 +395,40 @@
 	}
 	return -ret;
 }
+
+/**
+ * Enable datapath interrupts.
+ *
+ * @param priv
+ *   Pointer to private structure.
+ *
+ * @return
+ *   0 on success, negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx4_intr_enable(struct priv *priv)
+{
+	const struct rte_intr_conf *const intr_conf =
+		&priv->dev->data->dev_conf.intr_conf;
+
+	if (intr_conf->rxq && mlx4_rx_intr_vec_enable(priv) < 0)
+		goto error;
+	return 0;
+error:
+	return -rte_errno;
+}
+
+/**
+ * Disable datapath interrupts, keeping other interrupts intact.
+ *
+ * @param priv
+ *   Pointer to private structure.
+ */
+void
+mlx4_intr_disable(struct priv *priv)
+{
+	int err = rte_errno; /* Make sure rte_errno remains unchanged. */
+
+	mlx4_rx_intr_vec_disable(priv);
+	rte_errno = err;
+}
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [dpdk-dev] [PATCH v2] net/mlx4: fix port start fail after device removal
  2018-01-19 10:09 [dpdk-dev] [PATCH] net/mlx4: port start failure after device removal Moti Haimovsky
@ 2018-01-19 10:16 ` Moti Haimovsky
  2018-01-28 13:54   ` [dpdk-dev] [dpdk-stable] " Shahaf Shuler
  2018-01-29  8:34   ` [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop Moti Haimovsky
  0 siblings, 2 replies; 17+ messages in thread
From: Moti Haimovsky @ 2018-01-19 10:16 UTC (permalink / raw)
  To: adrien.mazarguil; +Cc: dev, Moti Haimovsky, stable

In failsafe device start can be called for ports/devices that
had been plugged out.
This patch fixes mlx4 port start failure when called by the failsafe
PMD for removed devices.

Fixes: a6e8b01c3c26 ("net/mlx4: compact interrupt functions")
Cc: stable@dpdk.org

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
V2:
Fixed commit message.
---
 drivers/net/mlx4/mlx4.c      | 10 ++++++++--
 drivers/net/mlx4/mlx4.h      |  2 ++
 drivers/net/mlx4/mlx4_intr.c | 43 ++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index 61c5bf4..c67b221 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -108,7 +108,13 @@ struct mlx4_conf {
 		      " flow error type %d, cause %p, message: %s",
 		      -ret, strerror(-ret), error.type, error.cause,
 		      error.message ? error.message : "(unspecified)");
+		goto exit;
 	}
+	ret = mlx4_intr_install(priv);
+	if (ret)
+		ERROR("%p: interrupt handler installation failed",
+		     (void *)dev);
+exit:
 	return ret;
 }
 
@@ -141,7 +147,7 @@ struct mlx4_conf {
 		      (void *)dev, strerror(-ret));
 		goto err;
 	}
-	ret = mlx4_intr_install(priv);
+	ret = mlx4_intr_enable(priv);
 	if (ret) {
 		ERROR("%p: interrupt handler installation failed",
 		     (void *)dev);
@@ -187,7 +193,7 @@ struct mlx4_conf {
 	dev->rx_pkt_burst = mlx4_rx_burst_removed;
 	rte_wmb();
 	mlx4_flow_sync(priv, NULL);
-	mlx4_intr_uninstall(priv);
+	mlx4_intr_disable(priv);
 	mlx4_rss_deinit(priv);
 }
 
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 99dc335..ab4f396 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -176,6 +176,8 @@ int mlx4_flow_ctrl_set(struct rte_eth_dev *dev,
 
 int mlx4_intr_uninstall(struct priv *priv);
 int mlx4_intr_install(struct priv *priv);
+int mlx4_intr_enable(struct priv *priv);
+void mlx4_intr_disable(struct priv *priv);
 int mlx4_rx_intr_disable(struct rte_eth_dev *dev, uint16_t idx);
 int mlx4_rx_intr_enable(struct rte_eth_dev *dev, uint16_t idx);
 
diff --git a/drivers/net/mlx4/mlx4_intr.c b/drivers/net/mlx4/mlx4_intr.c
index 9becee4..eeab625 100644
--- a/drivers/net/mlx4/mlx4_intr.c
+++ b/drivers/net/mlx4/mlx4_intr.c
@@ -73,6 +73,8 @@
 {
 	struct rte_intr_handle *intr_handle = &priv->intr_handle;
 
+	if (intr_handle == NULL || intr_handle->intr_vec == NULL)
+		return;
 	rte_intr_free_epoll_fd(intr_handle);
 	free(intr_handle->intr_vec);
 	intr_handle->nb_efd = 0;
@@ -291,7 +293,7 @@
 	}
 	rte_eal_alarm_cancel((void (*)(void *))mlx4_link_status_alarm, priv);
 	priv->intr_alarm = 0;
-	mlx4_rx_intr_vec_disable(priv);
+	mlx4_intr_disable(priv);
 	rte_errno = err;
 	return 0;
 }
@@ -313,8 +315,6 @@
 	int rc;
 
 	mlx4_intr_uninstall(priv);
-	if (intr_conf->rxq && mlx4_rx_intr_vec_enable(priv) < 0)
-		goto error;
 	if (intr_conf->lsc | intr_conf->rmv) {
 		priv->intr_handle.fd = priv->ctx->async_fd;
 		rc = rte_intr_callback_register(&priv->intr_handle,
@@ -395,3 +395,40 @@
 	}
 	return -ret;
 }
+
+/**
+ * Enable datapath interrupts.
+ *
+ * @param priv
+ *   Pointer to private structure.
+ *
+ * @return
+ *   0 on success, negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx4_intr_enable(struct priv *priv)
+{
+	const struct rte_intr_conf *const intr_conf =
+		&priv->dev->data->dev_conf.intr_conf;
+
+	if (intr_conf->rxq && mlx4_rx_intr_vec_enable(priv) < 0)
+		goto error;
+	return 0;
+error:
+	return -rte_errno;
+}
+
+/**
+ * Disable datapath interrupts, keeping other interrupts intact.
+ *
+ * @param priv
+ *   Pointer to private structure.
+ */
+void
+mlx4_intr_disable(struct priv *priv)
+{
+	int err = rte_errno; /* Make sure rte_errno remains unchanged. */
+
+	mlx4_rx_intr_vec_disable(priv);
+	rte_errno = err;
+}
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [dpdk-stable] [PATCH v2] net/mlx4: fix port start fail after device removal
  2018-01-19 10:16 ` [dpdk-dev] [PATCH v2] net/mlx4: fix port start fail " Moti Haimovsky
@ 2018-01-28 13:54   ` Shahaf Shuler
  2018-01-29  8:34   ` [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop Moti Haimovsky
  1 sibling, 0 replies; 17+ messages in thread
From: Shahaf Shuler @ 2018-01-28 13:54 UTC (permalink / raw)
  To: Mordechay Haimovsky, Adrien Mazarguil; +Cc: dev, Mordechay Haimovsky, stable

Hi Moti,


Friday, January 19, 2018 12:17 PM, Moti Haimovsky
> In failsafe device start can be called for ports/devices that had been plugged
> out.
> This patch fixes mlx4 port start failure when called by the failsafe PMD for
> removed devices.

I think the commit log not fully describes the issue you try to fix.
If I understand correctly, the issue is that the interrupt handler is canceled when the port is stopped, causing link removal events not to arrive to the failsafe PMD. 
To fix that, you precede the installation of the interrupt handler to device configuration, and toggle only the rxq interrupt on start/stop.

Is that correct? 
More below. 

> 
> Fixes: a6e8b01c3c26 ("net/mlx4: compact interrupt functions")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Moti Haimovsky <motih@mellanox.com>
> ---
> V2:
> Fixed commit message.
> ---
>  drivers/net/mlx4/mlx4.c      | 10 ++++++++--
>  drivers/net/mlx4/mlx4.h      |  2 ++
>  drivers/net/mlx4/mlx4_intr.c | 43
> ++++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 50 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c index
> 61c5bf4..c67b221 100644
> --- a/drivers/net/mlx4/mlx4.c
> +++ b/drivers/net/mlx4/mlx4.c
> @@ -108,7 +108,13 @@ struct mlx4_conf {
>  		      " flow error type %d, cause %p, message: %s",
>  		      -ret, strerror(-ret), error.type, error.cause,
>  		      error.message ? error.message : "(unspecified)");
> +		goto exit;

This fix is related to the patch? 

>  	}
> +	ret = mlx4_intr_install(priv);
> +	if (ret)
> +		ERROR("%p: interrupt handler installation failed",
> +		     (void *)dev);

Indentation is wrong.

> +exit:
>  	return ret;
>  }
> 
> @@ -141,7 +147,7 @@ struct mlx4_conf {
>  		      (void *)dev, strerror(-ret));
>  		goto err;
>  	}
> -	ret = mlx4_intr_install(priv);
> +	ret = mlx4_intr_enable(priv);

Maybe mlx4_rxq_intr_enable? 

>  	if (ret) {
>  		ERROR("%p: interrupt handler installation failed",
>  		     (void *)dev);
> @@ -187,7 +193,7 @@ struct mlx4_conf {
>  	dev->rx_pkt_burst = mlx4_rx_burst_removed;
>  	rte_wmb();
>  	mlx4_flow_sync(priv, NULL);
> -	mlx4_intr_uninstall(priv);
> +	mlx4_intr_disable(priv);

Maybe mlx4_rxq_intr_disable?

>  	mlx4_rss_deinit(priv);
>  }
> 
> diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index
> 99dc335..ab4f396 100644
> --- a/drivers/net/mlx4/mlx4.h
> +++ b/drivers/net/mlx4/mlx4.h
> @@ -176,6 +176,8 @@ int mlx4_flow_ctrl_set(struct rte_eth_dev *dev,
> 
>  int mlx4_intr_uninstall(struct priv *priv);  int mlx4_intr_install(struct priv
> *priv);
> +int mlx4_intr_enable(struct priv *priv); void mlx4_intr_disable(struct
> +priv *priv);
>  int mlx4_rx_intr_disable(struct rte_eth_dev *dev, uint16_t idx);  int
> mlx4_rx_intr_enable(struct rte_eth_dev *dev, uint16_t idx);
> 
> diff --git a/drivers/net/mlx4/mlx4_intr.c b/drivers/net/mlx4/mlx4_intr.c
> index 9becee4..eeab625 100644
> --- a/drivers/net/mlx4/mlx4_intr.c
> +++ b/drivers/net/mlx4/mlx4_intr.c
> @@ -73,6 +73,8 @@
>  {
>  	struct rte_intr_handle *intr_handle = &priv->intr_handle;
> 
> +	if (intr_handle == NULL || intr_handle->intr_vec == NULL)
> +		return;

Is the above code related to this patch? 

>  	rte_intr_free_epoll_fd(intr_handle);
>  	free(intr_handle->intr_vec);
>  	intr_handle->nb_efd = 0;
> @@ -291,7 +293,7 @@
>  	}
>  	rte_eal_alarm_cancel((void (*)(void *))mlx4_link_status_alarm,
> priv);
>  	priv->intr_alarm = 0;
> -	mlx4_rx_intr_vec_disable(priv);
> +	mlx4_intr_disable(priv);
>  	rte_errno = err;
>  	return 0;
>  }
> @@ -313,8 +315,6 @@
>  	int rc;
> 
>  	mlx4_intr_uninstall(priv);
> -	if (intr_conf->rxq && mlx4_rx_intr_vec_enable(priv) < 0)
> -		goto error;
>  	if (intr_conf->lsc | intr_conf->rmv) {
>  		priv->intr_handle.fd = priv->ctx->async_fd;
>  		rc = rte_intr_callback_register(&priv->intr_handle,
> @@ -395,3 +395,40 @@
>  	}
>  	return -ret;
>  }
> +
> +/**
> + * Enable datapath interrupts.
> + *
> + * @param priv
> + *   Pointer to private structure.
> + *
> + * @return
> + *   0 on success, negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx4_intr_enable(struct priv *priv)
> +{
> +	const struct rte_intr_conf *const intr_conf =
> +		&priv->dev->data->dev_conf.intr_conf;
> +
> +	if (intr_conf->rxq && mlx4_rx_intr_vec_enable(priv) < 0)
> +		goto error;
> +	return 0;
> +error:
> +	return -rte_errno;
> +}
> +
> +/**
> + * Disable datapath interrupts, keeping other interrupts intact.
> + *
> + * @param priv
> + *   Pointer to private structure.
> + */
> +void
> +mlx4_intr_disable(struct priv *priv)
> +{
> +	int err = rte_errno; /* Make sure rte_errno remains unchanged. */
> +
> +	mlx4_rx_intr_vec_disable(priv);
> +	rte_errno = err;
> +}
> --
> 1.8.3.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-19 10:16 ` [dpdk-dev] [PATCH v2] net/mlx4: fix port start fail " Moti Haimovsky
  2018-01-28 13:54   ` [dpdk-dev] [dpdk-stable] " Shahaf Shuler
@ 2018-01-29  8:34   ` Moti Haimovsky
  2018-01-29 10:54     ` Shahaf Shuler
  1 sibling, 1 reply; 17+ messages in thread
From: Moti Haimovsky @ 2018-01-29  8:34 UTC (permalink / raw)
  To: shahafs, adrien.mazarguil; +Cc: dev, Moti Haimovsky, stable

In failsafe device start can be called for ports/devices that
had been plugged out.
The mlx4 PMD detects device removal by listening to the device RMV
events, when the mlx4 port is being stopped, the PMD no longer
listens to these events causing the PMD to stop detecting device
removals.
This patch fixes this issue by moving installation of the interrupt
handler to device configuration, and toggle only the Rx-queue
interrupts on start/stop.

Fixes: a6e8b01c3c26 ("net/mlx4: compact interrupt functions")
Cc: stable@dpdk.org

Signed-off-by: Moti Haimovsky <motih@mellanox.com>
---
V3:
Modifications according to review inputs from Shahaf Shuler
See: 1516357009-15463-1-git-send-email-motih@mellanox.com

V2:
Fixed commit message.
---

 drivers/net/mlx4/mlx4.c      | 10 ++++++++--
 drivers/net/mlx4/mlx4.h      |  2 ++
 drivers/net/mlx4/mlx4_intr.c | 41 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/drivers/net/mlx4/mlx4.c b/drivers/net/mlx4/mlx4.c
index 952dae0..680eca7 100644
--- a/drivers/net/mlx4/mlx4.c
+++ b/drivers/net/mlx4/mlx4.c
@@ -108,7 +108,13 @@ struct mlx4_conf {
 		      " flow error type %d, cause %p, message: %s",
 		      -ret, strerror(-ret), error.type, error.cause,
 		      error.message ? error.message : "(unspecified)");
+		goto exit;
 	}
+	ret = mlx4_intr_install(priv);
+	if (ret)
+		ERROR("%p: interrupt handler installation failed",
+		      (void *)dev);
+exit:
 	return ret;
 }
 
@@ -141,7 +147,7 @@ struct mlx4_conf {
 		      (void *)dev, strerror(-ret));
 		goto err;
 	}
-	ret = mlx4_intr_install(priv);
+	ret = mlx4_rxq_intr_enable(priv);
 	if (ret) {
 		ERROR("%p: interrupt handler installation failed",
 		     (void *)dev);
@@ -187,7 +193,7 @@ struct mlx4_conf {
 	dev->rx_pkt_burst = mlx4_rx_burst_removed;
 	rte_wmb();
 	mlx4_flow_sync(priv, NULL);
-	mlx4_intr_uninstall(priv);
+	mlx4_rxq_intr_disable(priv);
 	mlx4_rss_deinit(priv);
 }
 
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 30a544f..d65879f 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -177,6 +177,8 @@ int mlx4_flow_ctrl_set(struct rte_eth_dev *dev,
 
 int mlx4_intr_uninstall(struct priv *priv);
 int mlx4_intr_install(struct priv *priv);
+int mlx4_rxq_intr_enable(struct priv *priv);
+void mlx4_rxq_intr_disable(struct priv *priv);
 int mlx4_rx_intr_disable(struct rte_eth_dev *dev, uint16_t idx);
 int mlx4_rx_intr_enable(struct rte_eth_dev *dev, uint16_t idx);
 
diff --git a/drivers/net/mlx4/mlx4_intr.c b/drivers/net/mlx4/mlx4_intr.c
index 2ff7218..e522949 100644
--- a/drivers/net/mlx4/mlx4_intr.c
+++ b/drivers/net/mlx4/mlx4_intr.c
@@ -291,7 +291,7 @@
 	}
 	rte_eal_alarm_cancel((void (*)(void *))mlx4_link_status_alarm, priv);
 	priv->intr_alarm = 0;
-	mlx4_rx_intr_vec_disable(priv);
+	mlx4_rxq_intr_disable(priv);
 	rte_errno = err;
 	return 0;
 }
@@ -313,8 +313,6 @@
 	int rc;
 
 	mlx4_intr_uninstall(priv);
-	if (intr_conf->rxq && mlx4_rx_intr_vec_enable(priv) < 0)
-		goto error;
 	if (intr_conf->lsc | intr_conf->rmv) {
 		priv->intr_handle.fd = priv->ctx->async_fd;
 		rc = rte_intr_callback_register(&priv->intr_handle,
@@ -395,3 +393,40 @@
 	}
 	return -ret;
 }
+
+/**
+ * Enable datapath interrupts.
+ *
+ * @param priv
+ *   Pointer to private structure.
+ *
+ * @return
+ *   0 on success, negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx4_rxq_intr_enable(struct priv *priv)
+{
+	const struct rte_intr_conf *const intr_conf =
+		&priv->dev->data->dev_conf.intr_conf;
+
+	if (intr_conf->rxq && mlx4_rx_intr_vec_enable(priv) < 0)
+		goto error;
+	return 0;
+error:
+	return -rte_errno;
+}
+
+/**
+ * Disable datapath interrupts, keeping other interrupts intact.
+ *
+ * @param priv
+ *   Pointer to private structure.
+ */
+void
+mlx4_rxq_intr_disable(struct priv *priv)
+{
+	int err = rte_errno; /* Make sure rte_errno remains unchanged. */
+
+	mlx4_rx_intr_vec_disable(priv);
+	rte_errno = err;
+}
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-29  8:34   ` [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop Moti Haimovsky
@ 2018-01-29 10:54     ` Shahaf Shuler
  2018-01-30  9:21       ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
  2018-01-30  9:39       ` [dpdk-dev] " Adrien Mazarguil
  0 siblings, 2 replies; 17+ messages in thread
From: Shahaf Shuler @ 2018-01-29 10:54 UTC (permalink / raw)
  To: Mordechay Haimovsky, Adrien Mazarguil; +Cc: dev, stable

Mordechay Haimovsky, Monday, January 29, 2018 10:35 AM:
> In failsafe device start can be called for ports/devices that had been plugged
> out.
> The mlx4 PMD detects device removal by listening to the device RMV events,
> when the mlx4 port is being stopped, the PMD no longer listens to these
> events causing the PMD to stop detecting device removals.
> This patch fixes this issue by moving installation of the interrupt handler to
> device configuration, and toggle only the Rx-queue interrupts on start/stop.
> 
> Fixes: a6e8b01c3c26 ("net/mlx4: compact interrupt functions")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Moti Haimovsky <motih@mellanox.com>
> ---
> V3:
> Modifications according to review inputs from Shahaf Shuler
> See: 1516357009-15463-1-git-send-email-motih@mellanox.com
> 
> V2:
> Fixed commit message.
> ---
> 
>  drivers/net/mlx4/mlx4.c      | 10 ++++++++--
>  drivers/net/mlx4/mlx4.h      |  2 ++
>  drivers/net/mlx4/mlx4_intr.c | 41
> ++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 48 insertions(+), 5 deletions(-)
> 

Acked-by: Shahaf Shuler <shahafs@mellanox.com> 

Adrien - let me know if you see issues with this patch, I want to include it on RC2.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [dpdk-stable] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-29 10:54     ` Shahaf Shuler
@ 2018-01-30  9:21       ` Thomas Monjalon
  2018-01-30  9:39       ` [dpdk-dev] " Adrien Mazarguil
  1 sibling, 0 replies; 17+ messages in thread
From: Thomas Monjalon @ 2018-01-30  9:21 UTC (permalink / raw)
  To: Mordechay Haimovsky; +Cc: stable, Shahaf Shuler, Adrien Mazarguil, dev

29/01/2018 11:54, Shahaf Shuler:
> Mordechay Haimovsky, Monday, January 29, 2018 10:35 AM:
> > In failsafe device start can be called for ports/devices that had been plugged
> > out.
> > The mlx4 PMD detects device removal by listening to the device RMV events,
> > when the mlx4 port is being stopped, the PMD no longer listens to these
> > events causing the PMD to stop detecting device removals.
> > This patch fixes this issue by moving installation of the interrupt handler to
> > device configuration, and toggle only the Rx-queue interrupts on start/stop.
> > 
> > Fixes: a6e8b01c3c26 ("net/mlx4: compact interrupt functions")
> > Cc: stable@dpdk.org
> > 
> > Signed-off-by: Moti Haimovsky <motih@mellanox.com>
> 
> Acked-by: Shahaf Shuler <shahafs@mellanox.com> 
> 
> Adrien - let me know if you see issues with this patch, I want to include it on RC2.

Applied, thanks

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-29 10:54     ` Shahaf Shuler
  2018-01-30  9:21       ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
@ 2018-01-30  9:39       ` Adrien Mazarguil
  2018-01-30 20:37         ` Shahaf Shuler
  1 sibling, 1 reply; 17+ messages in thread
From: Adrien Mazarguil @ 2018-01-30  9:39 UTC (permalink / raw)
  To: Shahaf Shuler; +Cc: Mordechay Haimovsky, dev, stable

Hi,

On Mon, Jan 29, 2018 at 10:54:16AM +0000, Shahaf Shuler wrote:
> Mordechay Haimovsky, Monday, January 29, 2018 10:35 AM:
> > In failsafe device start can be called for ports/devices that had been plugged
> > out.
> > The mlx4 PMD detects device removal by listening to the device RMV events,
> > when the mlx4 port is being stopped, the PMD no longer listens to these
> > events causing the PMD to stop detecting device removals.
> > This patch fixes this issue by moving installation of the interrupt handler to
> > device configuration, and toggle only the Rx-queue interrupts on start/stop.
> > 
> > Fixes: a6e8b01c3c26 ("net/mlx4: compact interrupt functions")
> > Cc: stable@dpdk.org
> > 
> > Signed-off-by: Moti Haimovsky <motih@mellanox.com>
> > ---
> > V3:
> > Modifications according to review inputs from Shahaf Shuler
> > See: 1516357009-15463-1-git-send-email-motih@mellanox.com
> > 
> > V2:
> > Fixed commit message.
> > ---
> > 
> >  drivers/net/mlx4/mlx4.c      | 10 ++++++++--
> >  drivers/net/mlx4/mlx4.h      |  2 ++
> >  drivers/net/mlx4/mlx4_intr.c | 41
> > ++++++++++++++++++++++++++++++++++++++---
> >  3 files changed, 48 insertions(+), 5 deletions(-)
> > 
> 
> Acked-by: Shahaf Shuler <shahafs@mellanox.com> 
> 
> Adrien - let me know if you see issues with this patch, I want to include it on RC2.

Unfortunately I didn't get a chance to review this patch before it was
applied. I'm not sure a stopped port is supposed to report events
(interrupts). Will applications expect them to occur at this point?

In my opinion it's not a fix, as in, it doesn't address an issue introduced
by the mentioned patch whose behavior was correct.

It's probably too late to change it now and it does address an issue seen
with a use case involving this PMD, however I think the fail-safe PMD could
as well poll using the recently-added rte_eth_dev_is_removed() when it's
aware the underlying port is stopped instead of expecting interrupts.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-30  9:39       ` [dpdk-dev] " Adrien Mazarguil
@ 2018-01-30 20:37         ` Shahaf Shuler
  2018-01-31  9:15           ` Adrien Mazarguil
  0 siblings, 1 reply; 17+ messages in thread
From: Shahaf Shuler @ 2018-01-30 20:37 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: Mordechay Haimovsky, dev, stable

Tuesday, January 30, 2018 11:40 AM, Adrien Mazarguil:
> Unfortunately I didn't get a chance to review this patch before it was applied.
> I'm not sure a stopped port is supposed to report events (interrupts). Will
> applications expect them to occur at this point?

Why not?

Stopped port is still counted as attached. The fact the application stopped the packet receive on it doesn't mean it should not receive a sync events (such as the remove event).
async events, by definition, are not related to traffic being flows through the port. 

> 
> In my opinion it's not a fix, as in, it doesn't address an issue introduced by the
> mentioned patch whose behavior was correct.
> 
> It's probably too late to change it now and it does address an issue seen with
> a use case involving this PMD, however I think the fail-safe PMD could as well
> poll using the recently-added rte_eth_dev_is_removed() when it's aware
> the underlying port is stopped instead of expecting interrupts.
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-30 20:37         ` Shahaf Shuler
@ 2018-01-31  9:15           ` Adrien Mazarguil
  2018-01-31  9:54             ` Gaëtan Rivet
  2018-01-31 10:08             ` Matan Azrad
  0 siblings, 2 replies; 17+ messages in thread
From: Adrien Mazarguil @ 2018-01-31  9:15 UTC (permalink / raw)
  To: Shahaf Shuler; +Cc: Mordechay Haimovsky, dev, stable

On Tue, Jan 30, 2018 at 08:37:06PM +0000, Shahaf Shuler wrote:
> Tuesday, January 30, 2018 11:40 AM, Adrien Mazarguil:
> > Unfortunately I didn't get a chance to review this patch before it was applied.
> > I'm not sure a stopped port is supposed to report events (interrupts). Will
> > applications expect them to occur at this point?
> 
> Why not?
> 
> Stopped port is still counted as attached. The fact the application stopped the packet receive on it doesn't mean it should not receive a sync events (such as the remove event).
> async events, by definition, are not related to traffic being flows through the port. 

My comment is based on my understanding of rte_eth_dev_stop(), which is a
device (or port) is completely stopped, in a suspended state and no
interrupts shall occur, as a means for applications to temporarily not be
bothered by them until restarted.

Think about it that way: applications do not want to get interrupts
immediately after the device is initialized, because they might not be ready
to process them at this point. An explicit call to rte_eth_dev_start() tells
the PMD when it's OK to do so. The converse is rte_eth_dev_stop().

Stopping traffic can already be achieved by not polling from the application
side, calling rte_eth_dev_[rt]x_queue_stop() and/or toggling RX/TX
interrupts through rte_eth_dev_[rt]x_intr_enable(). rte_eth_dev_stop()
provides lower-level device control.

Perhaps documentation is not clear, however that's how LSC seems implemented
in all PMDs; it gets disabled after rte_eth_dev_stop() and one should
explicitly use rte_eth_link_get() to retrieve link status afterward. I think
RMV should behave similarly with rte_eth_dev_is_removed(). Adapting
fail-safe should be easier than modifying all the remaining PMDs.

> > In my opinion it's not a fix, as in, it doesn't address an issue introduced by the
> > mentioned patch whose behavior was correct.
> > 
> > It's probably too late to change it now and it does address an issue seen with
> > a use case involving this PMD, however I think the fail-safe PMD could as well
> > poll using the recently-added rte_eth_dev_is_removed() when it's aware
> > the underlying port is stopped instead of expecting interrupts.
> > 
> > --
> > Adrien Mazarguil
> > 6WIND

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-31  9:15           ` Adrien Mazarguil
@ 2018-01-31  9:54             ` Gaëtan Rivet
  2018-01-31 10:08             ` Matan Azrad
  1 sibling, 0 replies; 17+ messages in thread
From: Gaëtan Rivet @ 2018-01-31  9:54 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: Shahaf Shuler, Mordechay Haimovsky, dev, stable

On Wed, Jan 31, 2018 at 10:15:13AM +0100, Adrien Mazarguil wrote:
> On Tue, Jan 30, 2018 at 08:37:06PM +0000, Shahaf Shuler wrote:
> > Tuesday, January 30, 2018 11:40 AM, Adrien Mazarguil:
> > > Unfortunately I didn't get a chance to review this patch before it was applied.
> > > I'm not sure a stopped port is supposed to report events (interrupts). Will
> > > applications expect them to occur at this point?
> > 
> > Why not?
> > 
> > Stopped port is still counted as attached. The fact the application stopped the packet receive on it doesn't mean it should not receive a sync events (such as the remove event).
> > async events, by definition, are not related to traffic being flows through the port. 
> 
> My comment is based on my understanding of rte_eth_dev_stop(), which is a
> device (or port) is completely stopped, in a suspended state and no
> interrupts shall occur, as a means for applications to temporarily not be
> bothered by them until restarted.
> 
> Think about it that way: applications do not want to get interrupts
> immediately after the device is initialized, because they might not be ready
> to process them at this point. An explicit call to rte_eth_dev_start() tells
> the PMD when it's OK to do so. The converse is rte_eth_dev_stop().
> 
> Stopping traffic can already be achieved by not polling from the application
> side, calling rte_eth_dev_[rt]x_queue_stop() and/or toggling RX/TX
> interrupts through rte_eth_dev_[rt]x_intr_enable(). rte_eth_dev_stop()
> provides lower-level device control.
> 
> Perhaps documentation is not clear, however that's how LSC seems implemented
> in all PMDs; it gets disabled after rte_eth_dev_stop() and one should
> explicitly use rte_eth_link_get() to retrieve link status afterward. I think
> RMV should behave similarly with rte_eth_dev_is_removed(). Adapting
> fail-safe should be easier than modifying all the remaining PMDs.
> 

I think fail-safe should follow the API, but for that the API should be
defined explicitly, instead of relying on common interpretation.

It makes sense for relatively harmless interrupts (LSC, RX) to be masked when stopped,
but I'm not sure this holds true for all possible interrupts (RMV), or
future ones.

If applications needs to ignore all interrupts at some point, it could
be interesting to extend the interrupt API to allow for enabling /
disabling them instead of relying on PMDs to register / unregister their
callbacks each time.

On that note, it could also be interesting for applications that won't
use interrupts, not to launch the interrupt thread during
rte_eal_init(). This is nonsense to force an additional thread to
operate without clear affinity. But I digress.

Interrupt API needs an overhaul, and its implementation as well.

> > > In my opinion it's not a fix, as in, it doesn't address an issue introduced by the
> > > mentioned patch whose behavior was correct.
> > > 
> > > It's probably too late to change it now and it does address an issue seen with
> > > a use case involving this PMD, however I think the fail-safe PMD could as well
> > > poll using the recently-added rte_eth_dev_is_removed() when it's aware
> > > the underlying port is stopped instead of expecting interrupts.
> > > 
> > > --
> > > Adrien Mazarguil
> > > 6WIND

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-31  9:15           ` Adrien Mazarguil
  2018-01-31  9:54             ` Gaëtan Rivet
@ 2018-01-31 10:08             ` Matan Azrad
  2018-01-31 10:43               ` Adrien Mazarguil
  1 sibling, 1 reply; 17+ messages in thread
From: Matan Azrad @ 2018-01-31 10:08 UTC (permalink / raw)
  To: Adrien Mazarguil, Shahaf Shuler; +Cc: Mordechay Haimovsky, dev, stable

Hi all

From: Adrien Mazarguil
> On Tue, Jan 30, 2018 at 08:37:06PM +0000, Shahaf Shuler wrote:
> > Tuesday, January 30, 2018 11:40 AM, Adrien Mazarguil:
> > > Unfortunately I didn't get a chance to review this patch before it was
> applied.
> > > I'm not sure a stopped port is supposed to report events
> > > (interrupts). Will applications expect them to occur at this point?
> >
> > Why not?
> >
> > Stopped port is still counted as attached. The fact the application stopped
> the packet receive on it doesn't mean it should not receive a sync events
> (such as the remove event).
> > async events, by definition, are not related to traffic being flows through
> the port.
> 
> My comment is based on my understanding of rte_eth_dev_stop(), which is
> a device (or port) is completely stopped, in a suspended state and no
> interrupts shall occur, as a means for applications to temporarily not be
> bothered by them until restarted.

Stopping traffic is not saying that the application is not interesting in the device,
I think that you mean to dev_close().

Any event may still be usable for application between dev_stop() to dev_start(), 
especially RMV or LCS can still be interested.

> Think about it that way: applications do not want to get interrupts
> immediately after the device is initialized, because they might not be ready
> to process them at this point. An explicit call to rte_eth_dev_start() tells the
> PMD when it's OK to do so. The converse is rte_eth_dev_stop().

So, they can delay the event registration to the time they interesting in the events.
And use event unregister when they are not interesting in it anymore.

> Stopping traffic can already be achieved by not polling from the application
> side, calling rte_eth_dev_[rt]x_queue_stop() and/or toggling RX/TX
> interrupts through rte_eth_dev_[rt]x_intr_enable(). rte_eth_dev_stop()
> provides lower-level device control.
 
I think it makes sense only for Rx interrupt which is traffic oriented(like stop and start). 

> Perhaps documentation is not clear, however that's how LSC seems
> implemented in all PMDs; it gets disabled after rte_eth_dev_stop() and one
> should explicitly use rte_eth_link_get() to retrieve link status afterward. I
> think RMV should behave similarly with rte_eth_dev_is_removed().
> Adapting fail-safe should be easier than modifying all the remaining PMDs.

Or maybe PMDs which do it make mistakes.

Matan.

> > > In my opinion it's not a fix, as in, it doesn't address an issue
> > > introduced by the mentioned patch whose behavior was correct.
> > >
> > > It's probably too late to change it now and it does address an issue
> > > seen with a use case involving this PMD, however I think the
> > > fail-safe PMD could as well poll using the recently-added
> > > rte_eth_dev_is_removed() when it's aware the underlying port is
> stopped instead of expecting interrupts.
> > >
> > > --
> > > Adrien Mazarguil
> > > 6WIND
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-31 10:08             ` Matan Azrad
@ 2018-01-31 10:43               ` Adrien Mazarguil
  2018-01-31 13:44                 ` Matan Azrad
  0 siblings, 1 reply; 17+ messages in thread
From: Adrien Mazarguil @ 2018-01-31 10:43 UTC (permalink / raw)
  To: Matan Azrad; +Cc: Shahaf Shuler, Mordechay Haimovsky, dev, stable

On Wed, Jan 31, 2018 at 10:08:06AM +0000, Matan Azrad wrote:
> Hi all
> 
> From: Adrien Mazarguil
> > On Tue, Jan 30, 2018 at 08:37:06PM +0000, Shahaf Shuler wrote:
> > > Tuesday, January 30, 2018 11:40 AM, Adrien Mazarguil:
> > > > Unfortunately I didn't get a chance to review this patch before it was
> > applied.
> > > > I'm not sure a stopped port is supposed to report events
> > > > (interrupts). Will applications expect them to occur at this point?
> > >
> > > Why not?
> > >
> > > Stopped port is still counted as attached. The fact the application stopped
> > the packet receive on it doesn't mean it should not receive a sync events
> > (such as the remove event).
> > > async events, by definition, are not related to traffic being flows through
> > the port.
> > 
> > My comment is based on my understanding of rte_eth_dev_stop(), which is
> > a device (or port) is completely stopped, in a suspended state and no
> > interrupts shall occur, as a means for applications to temporarily not be
> > bothered by them until restarted.
> 
> Stopping traffic is not saying that the application is not interesting in the device,
> I think that you mean to dev_close().

No, dev_close() releases resources and destroys configuration. Good luck
using dev_start() or any other devop after dev_close().

> Any event may still be usable for application between dev_stop() to dev_start(), 
> especially RMV or LCS can still be interested.

Possibly, but then, how come no PMD implements it that way? Neither did mlx4
before this patch got applied. At the very least it cannot be considered a
"fix".

> > Think about it that way: applications do not want to get interrupts
> > immediately after the device is initialized, because they might not be ready
> > to process them at this point. An explicit call to rte_eth_dev_start() tells the
> > PMD when it's OK to do so. The converse is rte_eth_dev_stop().
> 
> So, they can delay the event registration to the time they interesting in the events.
> And use event unregister when they are not interesting in it anymore.

Of course you can ask application maintainers to adapt to the new behavior,
or you know, leave things as they used to be.

Setting up RMV/LSC callbacks is not the only configuration an application
usually performs before calling dev_start(). Think about setting up flow
rules, MAC addresses, VLANs, and so on, this on multiple ports before
starting them up all at once. Previously it could be done in an unspecified
order, now they have to take special care for RMV/LSC.

Many devops are only safe when called while a device is stopped. It's even
documented in rte_ethdev.h.

> > Stopping traffic can already be achieved by not polling from the application
> > side, calling rte_eth_dev_[rt]x_queue_stop() and/or toggling RX/TX
> > interrupts through rte_eth_dev_[rt]x_intr_enable(). rte_eth_dev_stop()
> > provides lower-level device control.
>  
> I think it makes sense only for Rx interrupt which is traffic oriented(like stop and start). 

OK, looks like we disagree :)

> > Perhaps documentation is not clear, however that's how LSC seems
> > implemented in all PMDs; it gets disabled after rte_eth_dev_stop() and one
> > should explicitly use rte_eth_link_get() to retrieve link status afterward. I
> > think RMV should behave similarly with rte_eth_dev_is_removed().
> > Adapting fail-safe should be easier than modifying all the remaining PMDs.
> 
> Or maybe PMDs which do it make mistakes.

I'm not convinced mlx4 is the only PMD doing the right thing, we should ask
other maintainers as well as application writers' opinion on the matter. If
it's not a problem for RMV/LSC to occur while a device is stopped, then I'll
agree with the approach. It still won't make it a fix regardless.

> > > > In my opinion it's not a fix, as in, it doesn't address an issue
> > > > introduced by the mentioned patch whose behavior was correct.
> > > >
> > > > It's probably too late to change it now and it does address an issue
> > > > seen with a use case involving this PMD, however I think the
> > > > fail-safe PMD could as well poll using the recently-added
> > > > rte_eth_dev_is_removed() when it's aware the underlying port is
> > stopped instead of expecting interrupts.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-31 10:43               ` Adrien Mazarguil
@ 2018-01-31 13:44                 ` Matan Azrad
  2018-01-31 14:31                   ` Adrien Mazarguil
  0 siblings, 1 reply; 17+ messages in thread
From: Matan Azrad @ 2018-01-31 13:44 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: Shahaf Shuler, Mordechay Haimovsky, dev, stable

Hi Adrien

 From: Adrien Mazarguil, Sent: Wednesday, January 31, 2018 12:44 PM
> On Wed, Jan 31, 2018 at 10:08:06AM +0000, Matan Azrad wrote:
> > Hi all
> >
> > From: Adrien Mazarguil
> > > On Tue, Jan 30, 2018 at 08:37:06PM +0000, Shahaf Shuler wrote:
> > > > Tuesday, January 30, 2018 11:40 AM, Adrien Mazarguil:
> > > > > Unfortunately I didn't get a chance to review this patch before
> > > > > it was
> > > applied.
> > > > > I'm not sure a stopped port is supposed to report events
> > > > > (interrupts). Will applications expect them to occur at this point?
> > > >
> > > > Why not?
> > > >
> > > > Stopped port is still counted as attached. The fact the
> > > > application stopped
> > > the packet receive on it doesn't mean it should not receive a sync
> > > events (such as the remove event).
> > > > async events, by definition, are not related to traffic being
> > > > flows through
> > > the port.
> > >
> > > My comment is based on my understanding of rte_eth_dev_stop(),
> which
> > > is a device (or port) is completely stopped, in a suspended state
> > > and no interrupts shall occur, as a means for applications to
> > > temporarily not be bothered by them until restarted.
> >
> > Stopping traffic is not saying that the application is not interesting
> > in the device, I think that you mean to dev_close().
> 
> No, dev_close() releases resources and destroys configuration. Good luck
> using dev_start() or any other devop after dev_close().

I'm just saying here that when the user call dev_close() it means he is not interesting in the device any more,
This is not the case in dev_stop().
 

> > Any event may still be usable for application between dev_stop() to
> > dev_start(), especially RMV or LCS can still be interested.
> 
> Possibly, but then, how come no PMD implements it that way?

Again, maybe others PMDs make mistakes.

> Neither did mlx4 before this patch got applied. At the very least it cannot be considered a
> "fix".

It fixes something.

> > > Think about it that way: applications do not want to get interrupts
> > > immediately after the device is initialized, because they might not
> > > be ready to process them at this point. An explicit call to
> > > rte_eth_dev_start() tells the PMD when it's OK to do so. The converse is
> rte_eth_dev_stop().
> >
> > So, they can delay the event registration to the time they interesting in the
> events.
> > And use event unregister when they are not interesting in it anymore.
> 
> Of course you can ask application maintainers to adapt to the new behavior,
> or you know, leave things as they used to be.
> 

I don't know what any application does but for me it is a mistake to stop all event processes in dev_stop(),
Maybe for other application maintainers too.

> Setting up RMV/LSC callbacks is not the only configuration an application
> usually performs before calling dev_start(). Think about setting up flow rules,
> MAC addresses, VLANs, and so on, this on multiple ports before starting
> them up all at once. Previously it could be done in an unspecified order, now
> they have to take special care for RMV/LSC.

Or maybe there callbacks code are already safe for it.
Or they manages the unregister\register calls in the right places.
 
> Many devops are only safe when called while a device is stopped. It's even
> documented in rte_ethdev.h.
> 

And?

> > > Stopping traffic can already be achieved by not polling from the
> > > application side, calling rte_eth_dev_[rt]x_queue_stop() and/or
> > > toggling RX/TX interrupts through rte_eth_dev_[rt]x_intr_enable().
> > > rte_eth_dev_stop() provides lower-level device control.
> >
> > I think it makes sense only for Rx interrupt which is traffic oriented(like stop
> and start).
> 
> OK, looks like we disagree :)
> 
> > > Perhaps documentation is not clear, however that's how LSC seems
> > > implemented in all PMDs; it gets disabled after rte_eth_dev_stop()
> > > and one should explicitly use rte_eth_link_get() to retrieve link
> > > status afterward. I think RMV should behave similarly with
> rte_eth_dev_is_removed().
> > > Adapting fail-safe should be easier than modifying all the remaining
> PMDs.
> >
> > Or maybe PMDs which do it make mistakes.
> 
> I'm not convinced mlx4 is the only PMD doing the right thing, we should ask
> other maintainers as well as application writers' opinion on the matter. If it's
> not a problem for RMV/LSC to occur while a device is stopped, then I'll agree
> with the approach. It still won't make it a fix regardless.

Let's think about RMV event, How can the application\other dpdk entities to know if the device was removed when it was in stopped state?
Checking it synchronically (by rte_eth_dev_is_removed()) can miss the removal just a moment after the device came back again, errors will occur and no one will know why.

So, at least for RMV event, we need the notification also in stopped state.

> > > > > In my opinion it's not a fix, as in, it doesn't address an issue
> > > > > introduced by the mentioned patch whose behavior was correct.
> > > > >
> > > > > It's probably too late to change it now and it does address an
> > > > > issue seen with a use case involving this PMD, however I think
> > > > > the fail-safe PMD could as well poll using the recently-added
> > > > > rte_eth_dev_is_removed() when it's aware the underlying port is
> > > stopped instead of expecting interrupts.
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-31 13:44                 ` Matan Azrad
@ 2018-01-31 14:31                   ` Adrien Mazarguil
  2018-01-31 17:07                     ` Matan Azrad
  0 siblings, 1 reply; 17+ messages in thread
From: Adrien Mazarguil @ 2018-01-31 14:31 UTC (permalink / raw)
  To: Matan Azrad; +Cc: Shahaf Shuler, Mordechay Haimovsky, dev, stable

Hi Matan,

On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote:
> Hi Adrien
> 
>  From: Adrien Mazarguil, Sent: Wednesday, January 31, 2018 12:44 PM
> > On Wed, Jan 31, 2018 at 10:08:06AM +0000, Matan Azrad wrote:
> > > Hi all
> > >
> > > From: Adrien Mazarguil
> > > > On Tue, Jan 30, 2018 at 08:37:06PM +0000, Shahaf Shuler wrote:
> > > > > Tuesday, January 30, 2018 11:40 AM, Adrien Mazarguil:
> > > > > > Unfortunately I didn't get a chance to review this patch before
> > > > > > it was
> > > > applied.
> > > > > > I'm not sure a stopped port is supposed to report events
> > > > > > (interrupts). Will applications expect them to occur at this point?
> > > > >
> > > > > Why not?
> > > > >
> > > > > Stopped port is still counted as attached. The fact the
> > > > > application stopped
> > > > the packet receive on it doesn't mean it should not receive a sync
> > > > events (such as the remove event).
> > > > > async events, by definition, are not related to traffic being
> > > > > flows through
> > > > the port.
> > > >
> > > > My comment is based on my understanding of rte_eth_dev_stop(),
> > which
> > > > is a device (or port) is completely stopped, in a suspended state
> > > > and no interrupts shall occur, as a means for applications to
> > > > temporarily not be bothered by them until restarted.
> > >
> > > Stopping traffic is not saying that the application is not interesting
> > > in the device, I think that you mean to dev_close().
> > 
> > No, dev_close() releases resources and destroys configuration. Good luck
> > using dev_start() or any other devop after dev_close().
> 
> I'm just saying here that when the user call dev_close() it means he is not interesting in the device any more,
> This is not the case in dev_stop().
> 
> > > Any event may still be usable for application between dev_stop() to
> > > dev_start(), especially RMV or LCS can still be interested.
> > 
> > Possibly, but then, how come no PMD implements it that way?
> 
> Again, maybe others PMDs make mistakes.

It's basically the same as stating that for all these years, neither PMD nor
application maintainers understood the true meaning of this API.

Maybe you're right, maybe not. What's for sure is this patch unilaterally
decided to modify an accepted behavior. Perhaps it's not a big deal but we
must be cautious.

> > Neither did mlx4 before this patch got applied. At the very least it cannot be considered a
> > "fix".
> 
> It fixes something.

Technically every time a feature is added, its absence gets "fixed" :)

> > > > Think about it that way: applications do not want to get interrupts
> > > > immediately after the device is initialized, because they might not
> > > > be ready to process them at this point. An explicit call to
> > > > rte_eth_dev_start() tells the PMD when it's OK to do so. The converse is
> > rte_eth_dev_stop().
> > >
> > > So, they can delay the event registration to the time they interesting in the
> > events.
> > > And use event unregister when they are not interesting in it anymore.
> > 
> > Of course you can ask application maintainers to adapt to the new behavior,
> > or you know, leave things as they used to be.
> > 
> 
> I don't know what any application does but for me it is a mistake to stop all event processes in dev_stop(),
> Maybe for other application maintainers too.

Just like you, I don't know either what all the applications ever written
for DPDK expect out of dev_stop(). What's for sure is that currently,
LSC/RMV don't occcur afterward, the same way these events do not occur
before dev_start(). Any application possibly relying on this fact will
break. In such a situation, a conservative approach is better.

> > Setting up RMV/LSC callbacks is not the only configuration an application
> > usually performs before calling dev_start(). Think about setting up flow rules,
> > MAC addresses, VLANs, and so on, this on multiple ports before starting
> > them up all at once. Previously it could be done in an unspecified order, now
> > they have to take special care for RMV/LSC.
> 
> Or maybe there callbacks code are already safe for it.
> Or they manages the unregister\register calls in the right places.

That's my point, these "maybes" don't argue in favor of changing things.

> > Many devops are only safe when called while a device is stopped. It's even
> > documented in rte_ethdev.h.
> > 
> 
> And?

...And applications therefore often do all their configuration in an
unspecified order while a port is stopped as a measure of safety. No extra
care is taken for RMV/LSC. This uncertainty can be addressed by not
modifying the current behavior.

> > > > Stopping traffic can already be achieved by not polling from the
> > > > application side, calling rte_eth_dev_[rt]x_queue_stop() and/or
> > > > toggling RX/TX interrupts through rte_eth_dev_[rt]x_intr_enable().
> > > > rte_eth_dev_stop() provides lower-level device control.
> > >
> > > I think it makes sense only for Rx interrupt which is traffic oriented(like stop
> > and start).
> > 
> > OK, looks like we disagree :)
> > 
> > > > Perhaps documentation is not clear, however that's how LSC seems
> > > > implemented in all PMDs; it gets disabled after rte_eth_dev_stop()
> > > > and one should explicitly use rte_eth_link_get() to retrieve link
> > > > status afterward. I think RMV should behave similarly with
> > rte_eth_dev_is_removed().
> > > > Adapting fail-safe should be easier than modifying all the remaining
> > PMDs.
> > >
> > > Or maybe PMDs which do it make mistakes.
> > 
> > I'm not convinced mlx4 is the only PMD doing the right thing, we should ask
> > other maintainers as well as application writers' opinion on the matter. If it's
> > not a problem for RMV/LSC to occur while a device is stopped, then I'll agree
> > with the approach. It still won't make it a fix regardless.
> 
> Let's think about RMV event, How can the application\other dpdk entities to know if the device was removed when it was in stopped state?
> Checking it synchronically (by rte_eth_dev_is_removed()) can miss the removal just a moment after the device came back again, errors will occur and no one will know why.
> 
> So, at least for RMV event, we need the notification also in stopped state.

You sent the rte_eth_dev_is_removed() series. You're aware that PMDs
implementing this call benefit from an automatic is_removed() check on all
remaining devops whenever some error occur.

In short, an application will get notified simply by getting dev_start()
(or any other callback) return -EIO and not being able to use the device.

PMDs that do not implement is_removed() (or in addition to it) could also
artificially trigger a RMV event after dev_start() is called. As long as the
PMD remains quiet while the device is stopped, it's fine.

> > > > > > In my opinion it's not a fix, as in, it doesn't address an issue
> > > > > > introduced by the mentioned patch whose behavior was correct.
> > > > > >
> > > > > > It's probably too late to change it now and it does address an
> > > > > > issue seen with a use case involving this PMD, however I think
> > > > > > the fail-safe PMD could as well poll using the recently-added
> > > > > > rte_eth_dev_is_removed() when it's aware the underlying port is
> > > > stopped instead of expecting interrupts.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-31 14:31                   ` Adrien Mazarguil
@ 2018-01-31 17:07                     ` Matan Azrad
  2018-02-02 19:53                       ` Adrien Mazarguil
  0 siblings, 1 reply; 17+ messages in thread
From: Matan Azrad @ 2018-01-31 17:07 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: Shahaf Shuler, Mordechay Haimovsky, dev, stable

Hi Adrien

From: Adrien Mazarguil , Sent: Wednesday, January 31, 2018 4:32 PM
> Hi Matan,
> 
> On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote:
> > Hi Adrien
> >
> >  From: Adrien Mazarguil, Sent: Wednesday, January 31, 2018 12:44 PM
> > > On Wed, Jan 31, 2018 at 10:08:06AM +0000, Matan Azrad wrote:
> > > > Hi all
> > > >
> > > > From: Adrien Mazarguil
> > > > > On Tue, Jan 30, 2018 at 08:37:06PM +0000, Shahaf Shuler wrote:
> > > > > > Tuesday, January 30, 2018 11:40 AM, Adrien Mazarguil:
> > > > > > > Unfortunately I didn't get a chance to review this patch
> > > > > > > before it was
> > > > > applied.
> > > > > > > I'm not sure a stopped port is supposed to report events
> > > > > > > (interrupts). Will applications expect them to occur at this point?
> > > > > >
> > > > > > Why not?
> > > > > >
> > > > > > Stopped port is still counted as attached. The fact the
> > > > > > application stopped
> > > > > the packet receive on it doesn't mean it should not receive a
> > > > > sync events (such as the remove event).
> > > > > > async events, by definition, are not related to traffic being
> > > > > > flows through
> > > > > the port.
> > > > >
> > > > > My comment is based on my understanding of rte_eth_dev_stop(),
> > > which
> > > > > is a device (or port) is completely stopped, in a suspended
> > > > > state and no interrupts shall occur, as a means for applications
> > > > > to temporarily not be bothered by them until restarted.
> > > >
> > > > Stopping traffic is not saying that the application is not
> > > > interesting in the device, I think that you mean to dev_close().
> > >
> > > No, dev_close() releases resources and destroys configuration. Good
> > > luck using dev_start() or any other devop after dev_close().
> >
> > I'm just saying here that when the user call dev_close() it means he
> > is not interesting in the device any more, This is not the case in dev_stop().
> >
> > > > Any event may still be usable for application between dev_stop()
> > > > to dev_start(), especially RMV or LCS can still be interested.
> > >
> > > Possibly, but then, how come no PMD implements it that way?
> >
> > Again, maybe others PMDs make mistakes.
> 
> It's basically the same as stating that for all these years, neither PMD nor
> application maintainers understood the true meaning of this API.
> 
> Maybe you're right, maybe not. What's for sure is this patch unilaterally
> decided to modify an accepted behavior. Perhaps it's not a big deal but we
> must be cautious.

Agree.

I'm just saying my opinion here.

> > > Neither did mlx4 before this patch got applied. At the very least it
> > > cannot be considered a "fix".
> >
> > It fixes something.
> 
> Technically every time a feature is added, its absence gets "fixed" :)

:)
 
> > > > > Think about it that way: applications do not want to get
> > > > > interrupts immediately after the device is initialized, because
> > > > > they might not be ready to process them at this point. An
> > > > > explicit call to
> > > > > rte_eth_dev_start() tells the PMD when it's OK to do so. The
> > > > > converse is
> > > rte_eth_dev_stop().
> > > >
> > > > So, they can delay the event registration to the time they
> > > > interesting in the
> > > events.
> > > > And use event unregister when they are not interesting in it anymore.
> > >
> > > Of course you can ask application maintainers to adapt to the new
> > > behavior, or you know, leave things as they used to be.
> > >
> >
> > I don't know what any application does but for me it is a mistake to
> > stop all event processes in dev_stop(), Maybe for other application
> maintainers too.
> 
> Just like you, I don't know either what all the applications ever written for
> DPDK expect out of dev_stop(). What's for sure is that currently, LSC/RMV
> don't occcur afterward, the same way these events do not occur before
> dev_start().

Why not? RMV event can occur any time after probe.

> Any application possibly relying on this fact will break. In such a
> situation, a conservative approach is better.

If an application should fail to get event in stopped state it may fail in the previous code too:
The interrupt run from host thread so the next race may occur:
dev_start() : master thread.
Context switch.
RMV interrupt started to run callbacks: host thread.
Context switch.
dev_stop(): master thread.
Start reconfiguration: master thread. 
Context switch.
Callback running.

So, the only thing which can disable callback running after dev_stop() is callback unregistration before it.

> > > Setting up RMV/LSC callbacks is not the only configuration an
> > > application usually performs before calling dev_start(). Think about
> > > setting up flow rules, MAC addresses, VLANs, and so on, this on
> > > multiple ports before starting them up all at once. Previously it
> > > could be done in an unspecified order, now they have to take special care
> for RMV/LSC.
> >
> > Or maybe there callbacks code are already safe for it.
> > Or they manages the unregister\register calls in the right places.
> 
> That's my point, these "maybes" don't argue in favor of changing things.

What I'm saying is that callbacks should be safe or not registered in the right time.

> > > Many devops are only safe when called while a device is stopped.
> > > It's even documented in rte_ethdev.h.
> > >
> >
> > And?
> 
> ...And applications therefore often do all their configuration in an unspecified
> order while a port is stopped as a measure of safety. No extra care is taken
> for RMV/LSC. This uncertainty can be addressed by not modifying the current
> behavior.

Or they expect to get interrupt and the corner case will come later if we will not change it.

> > > > > Stopping traffic can already be achieved by not polling from the
> > > > > application side, calling rte_eth_dev_[rt]x_queue_stop() and/or
> > > > > toggling RX/TX interrupts through rte_eth_dev_[rt]x_intr_enable().
> > > > > rte_eth_dev_stop() provides lower-level device control.
> > > >
> > > > I think it makes sense only for Rx interrupt which is traffic
> > > > oriented(like stop
> > > and start).
> > >
> > > OK, looks like we disagree :)
> > >
> > > > > Perhaps documentation is not clear, however that's how LSC seems
> > > > > implemented in all PMDs; it gets disabled after
> > > > > rte_eth_dev_stop() and one should explicitly use
> > > > > rte_eth_link_get() to retrieve link status afterward. I think
> > > > > RMV should behave similarly with
> > > rte_eth_dev_is_removed().
> > > > > Adapting fail-safe should be easier than modifying all the
> > > > > remaining
> > > PMDs.
> > > >
> > > > Or maybe PMDs which do it make mistakes.
> > >
> > > I'm not convinced mlx4 is the only PMD doing the right thing, we
> > > should ask other maintainers as well as application writers' opinion
> > > on the matter. If it's not a problem for RMV/LSC to occur while a
> > > device is stopped, then I'll agree with the approach. It still won't make it a
> fix regardless.
> >
> > Let's think about RMV event, How can the application\other dpdk entities
> to know if the device was removed when it was in stopped state?
> > Checking it synchronically (by rte_eth_dev_is_removed()) can miss the
> removal just a moment after the device came back again, errors will occur
> and no one will know why.
> >
> > So, at least for RMV event, we need the notification also in stopped state.
> 
> You sent the rte_eth_dev_is_removed() series. You're aware that PMDs
> implementing this call benefit from an automatic is_removed() check on all
> remaining devops whenever some error occur.
> In short, an application will get notified simply by getting dev_start() (or any
> other callback) return -EIO and not being able to use the device.
 
Yes, but between dev_stop to dev_start may not be any ethdev API calling.

> PMDs that do not implement is_removed() (or in addition to it) could also
> artificially trigger a RMV event after dev_start() is called. As long as the PMD
> remains quiet while the device is stopped, it's fine.

How can the PMD do it after dev_start()? Initiate alarm in dev start function to do it later And entering into race again?

I think it is not worth the effort and this patch is doing the right thing(also some other PMDs) .

Thanks.
> > > > > > > In my opinion it's not a fix, as in, it doesn't address an
> > > > > > > issue introduced by the mentioned patch whose behavior was
> correct.
> > > > > > >
> > > > > > > It's probably too late to change it now and it does address
> > > > > > > an issue seen with a use case involving this PMD, however I
> > > > > > > think the fail-safe PMD could as well poll using the
> > > > > > > recently-added
> > > > > > > rte_eth_dev_is_removed() when it's aware the underlying port
> > > > > > > is
> > > > > stopped instead of expecting interrupts.
> 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-01-31 17:07                     ` Matan Azrad
@ 2018-02-02 19:53                       ` Adrien Mazarguil
  2018-02-03 19:42                         ` Matan Azrad
  0 siblings, 1 reply; 17+ messages in thread
From: Adrien Mazarguil @ 2018-02-02 19:53 UTC (permalink / raw)
  To: Matan Azrad; +Cc: Shahaf Shuler, Mordechay Haimovsky, dev, stable

Hi Matan,

On Wed, Jan 31, 2018 at 05:07:56PM +0000, Matan Azrad wrote:
> Hi Adrien
> 
> From: Adrien Mazarguil , Sent: Wednesday, January 31, 2018 4:32 PM
> > Hi Matan,
> > 
> > On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote:
> > > Hi Adrien
<snip>
> > > I don't know what any application does but for me it is a mistake to
> > > stop all event processes in dev_stop(), Maybe for other application
> > maintainers too.
> > 
> > Just like you, I don't know either what all the applications ever written for
> > DPDK expect out of dev_stop(). What's for sure is that currently, LSC/RMV
> > don't occcur afterward, the same way these events do not occur before
> > dev_start().
> 
> Why not? RMV event can occur any time after probe.

LSC as well (keep in mind this patch modifies the behavior for both
events). RMV events may also occur before application has a chance to
register a handler for it, in which case this approach fails to solve the
problem it's supposed to solve. Mitigate all you want, the application still
can't rely on that event only.

> > Any application possibly relying on this fact will break. In such a
> > situation, a conservative approach is better.
> 
> If an application should fail to get event in stopped state it may fail in the previous code too:
> The interrupt run from host thread so the next race may occur:
> dev_start() : master thread.
> Context switch.
> RMV interrupt started to run callbacks: host thread.
> Context switch.
> dev_stop(): master thread.
> Start reconfiguration: master thread. 
> Context switch.
> Callback running.
> 
> So, the only thing which can disable callback running after dev_stop() is callback unregistration before it.

After dev_stop() returns, new events cannot be triggered by the PMD which is
what matters. Obviously a callback that already started to run before that
will eventually have to complete. What's your point?

There's a race only if an application performs multiple simultaneous control
operations on the underlying device, but this has always been unsafe (not
only during RMV) because there are no locks, it's documented as such.

> > > > Setting up RMV/LSC callbacks is not the only configuration an
> > > > application usually performs before calling dev_start(). Think about
> > > > setting up flow rules, MAC addresses, VLANs, and so on, this on
> > > > multiple ports before starting them up all at once. Previously it
> > > > could be done in an unspecified order, now they have to take special care
> > for RMV/LSC.
> > >
> > > Or maybe there callbacks code are already safe for it.
> > > Or they manages the unregister\register calls in the right places.
> > 
> > That's my point, these "maybes" don't argue in favor of changing things.
> 
> What I'm saying is that callbacks should be safe or not registered in the right time.

I understand that, though it's not a valid counter argument :)

> > > > Many devops are only safe when called while a device is stopped.
> > > > It's even documented in rte_ethdev.h.
> > > >
> > >
> > > And?
> > 
> > ...And applications therefore often do all their configuration in an unspecified
> > order while a port is stopped as a measure of safety. No extra care is taken
> > for RMV/LSC. This uncertainty can be addressed by not modifying the current
> > behavior.
> 
> Or they expect to get interrupt and the corner case will come later if we will not change it.

Look, we're throwing opposite use cases at each other in order to make a
point, and I don't see an end to this since we're both stubborn. Let's thus
assume applications use a bit of both.

Now we're left with a problem, before this patch neither use cases were
broken. Now it's applied, mine is broken so let's agree something needs to
be done. Either all affected applications need to be updated, or we can
simply revert this and properly fix fail-safe instead.

<snip>
> > > So, at least for RMV event, we need the notification also in stopped state.
> > 
> > You sent the rte_eth_dev_is_removed() series. You're aware that PMDs
> > implementing this call benefit from an automatic is_removed() check on all
> > remaining devops whenever some error occur.
> > In short, an application will get notified simply by getting dev_start() (or any
> > other callback) return -EIO and not being able to use the device.
>  
> Yes, but between dev_stop to dev_start may not be any ethdev API calling.

So what, if an application is not using the device, why does it need to know
it's been removed? If it's that important, why can't it run its own periodic
rte_eth_dev_is_removed() probe?

> > PMDs that do not implement is_removed() (or in addition to it) could also
> > artificially trigger a RMV event after dev_start() is called. As long as the PMD
> > remains quiet while the device is stopped, it's fine.
> 
> How can the PMD do it after dev_start()? Initiate alarm in dev start function to do it later And entering into race again?

What race? All the PMD needs to to is trigger an event by registering one
with immediate effect, it won't make any difference to the application. If
it performs racy control operations from the handler, it's never been a
PMD's problem.

Anyway I just provided this idea as a counter argument, it's not really
needed; relying on rte_eth_dev_is_removed() is the safest approach in my
opinion.

> I think it is not worth the effort and this patch is doing the right thing(also some other PMDs) .

Well, it's probably too late to revert this patch for 18.02 since one would
also have to come up with the proper fix for fail-safe, however that doesn't
mean breaking things and expecting applications to deal with it because it's
never been properly documented is OK either.

I'm post-NACKing this patch, it will have to be reverted and a fix submitted
for the upcoming 18.02 stable branch. In the meantime, it's not a fix for
mlx4 and as such must not be backported.

-- 
Adrien Mazarguil
6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop
  2018-02-02 19:53                       ` Adrien Mazarguil
@ 2018-02-03 19:42                         ` Matan Azrad
  0 siblings, 0 replies; 17+ messages in thread
From: Matan Azrad @ 2018-02-03 19:42 UTC (permalink / raw)
  To: Adrien Mazarguil; +Cc: Shahaf Shuler, Mordechay Haimovsky, dev, stable

Hi Adrien

> From: Adrien Mazarguil , Sent: Friday, February 2, 2018 9:53 PM
> Hi Matan,
> 
> On Wed, Jan 31, 2018 at 05:07:56PM +0000, Matan Azrad wrote:
> > Hi Adrien
> >
> > From: Adrien Mazarguil , Sent: Wednesday, January 31, 2018 4:32 PM
> > > Hi Matan,
> > >
> > > On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote:
> > > > Hi Adrien
> <snip>
> > > > I don't know what any application does but for me it is a mistake
> > > > to stop all event processes in dev_stop(), Maybe for other
> > > > application
> > > maintainers too.
> > >
> > > Just like you, I don't know either what all the applications ever
> > > written for DPDK expect out of dev_stop(). What's for sure is that
> > > currently, LSC/RMV don't occcur afterward, the same way these events
> > > do not occur before dev_start().
> >
> > Why not? RMV event can occur any time after probe.
> 
> LSC as well (keep in mind this patch modifies the behavior for both events).
> RMV events may also occur before application has a chance to register a
> handler for it, in which case this approach fails to solve the problem it's
> supposed to solve. Mitigate all you want, the application still can't rely on that
> event only.

Again, the application should register to event when it want to be notified about it by callback and unregister to it when it doesn't want to.
So, if application still didn't has chance to register to it, its ok not to get notification about it.
Obviously, Application can check both RMV and LCS statuses after the first registration.
 

> > > Any application possibly relying on this fact will break. In such a
> > > situation, a conservative approach is better.
> >
> > If an application should fail to get event in stopped state it may fail in the
> previous code too:
> > The interrupt run from host thread so the next race may occur:
> > dev_start() : master thread.
> > Context switch.
> > RMV interrupt started to run callbacks: host thread.
> > Context switch.
> > dev_stop(): master thread.
> > Start reconfiguration: master thread.
> > Context switch.
> > Callback running.
> >
> > So, the only thing which can disable callback running after dev_stop() is
> callback unregistration before it.
> 
> After dev_stop() returns, new events cannot be triggered by the PMD which
> is what matters.

No, from application point of view it can happen even if the PMD will stop to trigger it from dev_stop() (as the old code did).
This is exactly what the above scenario says.
 
 >Obviously a callback that already started to run before that
> will eventually have to complete. What's your point?

It can even start after dev_stop() in spite of the PMD will stop to trigger the event, read the above scenario again.

So, my point is:
If application is not safe to get notification after dev_stop() it may fail in both old and new versions.
So every PMD which stops to trigger event from dev_stop() because of your reasons makes a mistake and fails for its purpose. 

 
> There's a race only if an application performs multiple simultaneous control
> operations on the underlying device, but this has always been unsafe (not
> only during RMV) because there are no locks, it's documented as such.

So, it is probably safe.

> > > > > Setting up RMV/LSC callbacks is not the only configuration an
> > > > > application usually performs before calling dev_start(). Think
> > > > > about setting up flow rules, MAC addresses, VLANs, and so on,
> > > > > this on multiple ports before starting them up all at once.
> > > > > Previously it could be done in an unspecified order, now they
> > > > > have to take special care
> > > for RMV/LSC.
> > > >
> > > > Or maybe there callbacks code are already safe for it.
> > > > Or they manages the unregister\register calls in the right places.
> > >
> > > That's my point, these "maybes" don't argue in favor of changing things.
> >
> > What I'm saying is that callbacks should be safe or not registered in the right
> time.
> 
> I understand that, though it's not a valid counter argument :)

With the above scenario it is :)

> > > > > Many devops are only safe when called while a device is stopped.
> > > > > It's even documented in rte_ethdev.h.
> > > > >
> > > >
> > > > And?
> > >
> > > ...And applications therefore often do all their configuration in an
> > > unspecified order while a port is stopped as a measure of safety. No
> > > extra care is taken for RMV/LSC. This uncertainty can be addressed
> > > by not modifying the current behavior.
> >
> > Or they expect to get interrupt and the corner case will come later if we will
> not change it.
> 
> Look, we're throwing opposite use cases at each other in order to make a
> point, and I don't see an end to this since we're both stubborn. Let's thus
> assume applications use a bit of both.
> 
> Now we're left with a problem, before this patch neither use cases were
> broken.

No, again, The above scenario shows it can be broken even before this patch.
 
> Now it's applied, mine is broken so let's agree something needs to
> be done. Either all affected applications need to be updated, or we can
> simply revert this and properly fix fail-safe instead.

I think no, applications which will fail because of this patch may fail before this patch.
So, probably applications should be safe for this patch.

> <snip>
> > > > So, at least for RMV event, we need the notification also in stopped
> state.
> > >
> > > You sent the rte_eth_dev_is_removed() series. You're aware that PMDs
> > > implementing this call benefit from an automatic is_removed() check
> > > on all remaining devops whenever some error occur.
> > > In short, an application will get notified simply by getting
> > > dev_start() (or any other callback) return -EIO and not being able to use
> the device.
> >
> > Yes, but between dev_stop to dev_start may not be any ethdev API calling.
> 
> So what, if an application is not using the device, why does it need to know
> it's been removed?

Data path will not check synchronously the removal status - this is must come by asynchronic  notification.

 >If it's that important, why can't it run its own periodic
> rte_eth_dev_is_removed() probe?

Because ETHDEV have event for it, why to do similar mechanism again?

> > > PMDs that do not implement is_removed() (or in addition to it) could
> > > also artificially trigger a RMV event after dev_start() is called.
> > > As long as the PMD remains quiet while the device is stopped, it's fine.
> >
> > How can the PMD do it after dev_start()? Initiate alarm in dev start function
> to do it later And entering into race again?
> 
> What race? All the PMD needs to to is trigger an event by registering one
> with immediate effect, it won't make any difference to the application. If it
> performs racy control operations from the handler, it's never been a PMD's
> problem.

See the race scenario above, it should be similar.

> Anyway I just provided this idea as a counter argument, it's not really
> needed; relying on rte_eth_dev_is_removed() is the safest approach in my
> opinion.

Not always.
We need them both to be really hotplug aware from all sides.

> > I think it is not worth the effort and this patch is doing the right thing(also
> some other PMDs) .
> 
> Well, it's probably too late to revert this patch for 18.02 since one would also
> have to come up with the proper fix for fail-safe, however that doesn't mean
> breaking things and expecting applications to deal with it because it's never
> been properly documented is OK either.

Nothing breaking for my opinion.

> I'm post-NACKing this patch, it will have to be reverted and a fix submitted
> for the upcoming 18.02 stable branch. In the meantime, it's not a fix for
> mlx4 and as such must not be backported.

I can't agree with you about it now.
But you know, You are the maintainer :) do whatever you want.

Anyway, I think we are all agree that all PMDs should do the same thing and documentation somewhere must be added.
My opinion, as you probably know, is to continue to trigger events even after dev_stop().

Thanks,
Matan. 
> --
> Adrien Mazarguil
> 6WIND

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-02-03 19:42 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-19 10:09 [dpdk-dev] [PATCH] net/mlx4: port start failure after device removal Moti Haimovsky
2018-01-19 10:16 ` [dpdk-dev] [PATCH v2] net/mlx4: fix port start fail " Moti Haimovsky
2018-01-28 13:54   ` [dpdk-dev] [dpdk-stable] " Shahaf Shuler
2018-01-29  8:34   ` [dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop Moti Haimovsky
2018-01-29 10:54     ` Shahaf Shuler
2018-01-30  9:21       ` [dpdk-dev] [dpdk-stable] " Thomas Monjalon
2018-01-30  9:39       ` [dpdk-dev] " Adrien Mazarguil
2018-01-30 20:37         ` Shahaf Shuler
2018-01-31  9:15           ` Adrien Mazarguil
2018-01-31  9:54             ` Gaëtan Rivet
2018-01-31 10:08             ` Matan Azrad
2018-01-31 10:43               ` Adrien Mazarguil
2018-01-31 13:44                 ` Matan Azrad
2018-01-31 14:31                   ` Adrien Mazarguil
2018-01-31 17:07                     ` Matan Azrad
2018-02-02 19:53                       ` Adrien Mazarguil
2018-02-03 19:42                         ` Matan Azrad

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).