patches for DPDK stable branches
 help / color / mirror / Atom feed
* [dpdk-stable] [PATCH 1/3] net/failsafe: fix removal handling lack
       [not found] <1509637324-13525-1-git-send-email-matan@mellanox.com>
@ 2017-11-02 15:42 ` Matan Azrad
  2017-11-06  8:19   ` Gaëtan Rivet
       [not found] ` <1513175370-16583-1-git-send-email-matan@mellanox.com>
  1 sibling, 1 reply; 14+ messages in thread
From: Matan Azrad @ 2017-11-02 15:42 UTC (permalink / raw)
  To: Adrien Mazarguil, Gaetan Rivet; +Cc: dev, stable

There is time between the physical removal of the device until
sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
applications still don't know about the removal and may call sub-device
control operation which should return an error.

In previous code this error is reported to the application contrary to
fail-safe principle that the app should not be aware of device removal.

Define a removal error that each sub-device PMD should return in case
of an error caused by removal event; The special error is -ENODEV.

Add an error check in each relevant control command error flow and
prevent an error report to application when its value is -ENODEV.

Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
Fixes: b737a1e ("net/failsafe: support flow API")
Cc: stable@dpdk.org

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 doc/guides/nics/fail_safe.rst                   |  7 +++++++
 doc/guides/prog_guide/env_abstraction_layer.rst |  3 +++
 drivers/net/failsafe/failsafe_flow.c            | 16 +++++++++------
 drivers/net/failsafe/failsafe_ops.c             | 27 ++++++++++++++++---------
 drivers/net/failsafe/failsafe_private.h         |  8 ++++++++
 5 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
index c4e3d2e..5023fc4 100644
--- a/doc/guides/nics/fail_safe.rst
+++ b/doc/guides/nics/fail_safe.rst
@@ -193,6 +193,13 @@ any time. The fail-safe PMD will register a callback for such event and react
 accordingly. It will try to safely stop, close and uninit the sub-device having
 emitted this event, allowing it to free its eventual resources.
 
+When fail-safe PMD gets -ENODEV error from control command sent to removable
+sub-devices, it assumes that the error reason is device removal. In this case
+fail-safe returns success value to application. The PMD controlling the
+sub-device is still responsible to emit a removal event (RMV) in addition to
+returning -ENODEV from control operations after the device has been physically
+removed. Only the reception of this event unregisters it on the fail-safe side.
+
 Fail-safe glossary
 ------------------
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 4775eb3..bd2fd87 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -213,6 +213,9 @@ device having emitted a Device Removal Event. In such case, calling
 callback. Care must be taken not to close the device from the interrupt handler
 context. It is necessary to reschedule such closing operation.
 
+Unsuccessful control operations (for those that return errors) may return
+-ENODEV after the device is physically unplugged.
+
 Blacklisting
 ~~~~~~~~~~~~
 
diff --git a/drivers/net/failsafe/failsafe_flow.c b/drivers/net/failsafe/failsafe_flow.c
index 153ceee..ce9b769 100644
--- a/drivers/net/failsafe/failsafe_flow.c
+++ b/drivers/net/failsafe/failsafe_flow.c
@@ -87,7 +87,7 @@
 		DEBUG("Calling rte_flow_validate on sub_device %d", i);
 		ret = rte_flow_validate(PORT_ID(sdev),
 				attr, patterns, actions, error);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_flow_validate failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -111,7 +111,8 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		flow->flows[i] = rte_flow_create(PORT_ID(sdev),
 				attr, patterns, actions, error);
-		if (flow->flows[i] == NULL) {
+		if (flow->flows[i] == NULL &&
+			!SUBDEV_REMOVED(sdev, -rte_errno)) {
 			ERROR("Failed to create flow on sub_device %d",
 				i);
 			goto err;
@@ -150,7 +151,7 @@
 			continue;
 		local_ret = rte_flow_destroy(PORT_ID(sdev),
 				flow->flows[i], error);
-		if (local_ret) {
+		if (local_ret && !SUBDEV_REMOVED(sdev, local_ret)) {
 			ERROR("Failed to destroy flow on sub_device %d: %d",
 					i, local_ret);
 			if (ret == 0)
@@ -175,7 +176,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_flow_flush on sub_device %d", i);
 		ret = rte_flow_flush(PORT_ID(sdev), error);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_flow_flush failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -199,8 +200,11 @@
 
 	sdev = TX_SUBDEV(dev);
 	if (sdev != NULL) {
-		return rte_flow_query(PORT_ID(sdev),
+		int ret = rte_flow_query(PORT_ID(sdev),
 				flow->flows[SUB_ID(sdev)], type, arg, error);
+
+		if (!SUBDEV_REMOVED(sdev, ret))
+			return ret;
 	}
 	WARN("No active sub_device to query about its flow");
 	return -1;
@@ -223,7 +227,7 @@
 			WARN("flow isolation mode of sub_device %d in incoherent state.",
 				i);
 		ret = rte_flow_isolate(PORT_ID(sdev), set, error);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_flow_isolate failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
diff --git a/drivers/net/failsafe/failsafe_ops.c b/drivers/net/failsafe/failsafe_ops.c
index f460551..cc7ab7f 100644
--- a/drivers/net/failsafe/failsafe_ops.c
+++ b/drivers/net/failsafe/failsafe_ops.c
@@ -314,7 +314,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_set_link_up on sub_device %d", i);
 		ret = rte_eth_dev_set_link_up(PORT_ID(sdev));
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_set_link_up failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -333,7 +333,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_set_link_down on sub_device %d", i);
 		ret = rte_eth_dev_set_link_down(PORT_ID(sdev));
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_set_link_down failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -418,7 +418,7 @@
 				rx_queue_id,
 				nb_rx_desc, socket_id,
 				rx_conf, mb_pool);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("RX queue setup failed for sub_device %d", i);
 			goto free_rxq;
 		}
@@ -484,7 +484,7 @@
 				tx_queue_id,
 				nb_tx_desc, socket_id,
 				tx_conf);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("TX queue setup failed for sub_device %d", i);
 			goto free_txq;
 		}
@@ -563,7 +563,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling link_update on sub_device %d", i);
 		ret = (SUBOPS(sdev, link_update))(ETH(sdev), wait_to_complete);
-		if (ret && ret != -1) {
+		if (ret && ret != -1  && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Link update failed for sub_device %d with error %d",
 			      i, ret);
 			return ret;
@@ -587,6 +587,7 @@
 fs_stats_get(struct rte_eth_dev *dev,
 	     struct rte_eth_stats *stats)
 {
+	struct rte_eth_stats backup;
 	struct sub_device *sdev;
 	uint8_t i;
 	int ret;
@@ -596,14 +597,20 @@
 		struct rte_eth_stats *snapshot = &sdev->stats_snapshot.stats;
 		uint64_t *timestamp = &sdev->stats_snapshot.timestamp;
 
+		rte_memcpy(&backup, snapshot, sizeof(backup));
 		ret = rte_eth_stats_get(PORT_ID(sdev), snapshot);
 		if (ret) {
+			if (SUBDEV_REMOVED(sdev, ret)) {
+				rte_memcpy(snapshot, &backup, sizeof(backup));
+				goto inc;
+			}
 			ERROR("Operation rte_eth_stats_get failed for sub_device %d with error %d",
 				  i, ret);
 			*timestamp = 0;
 			return ret;
 		}
 		*timestamp = rte_rdtsc();
+inc:
 		failsafe_stats_increment(stats, snapshot);
 	}
 	return 0;
@@ -716,7 +723,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_set_mtu on sub_device %d", i);
 		ret = rte_eth_dev_set_mtu(PORT_ID(sdev), mtu);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_set_mtu failed for sub_device %d with error %d",
 			      i, ret);
 			return ret;
@@ -735,7 +742,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_vlan_filter on sub_device %d", i);
 		ret = rte_eth_dev_vlan_filter(PORT_ID(sdev), vlan_id, on);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_vlan_filter failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -769,7 +776,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_flow_ctrl_set on sub_device %d", i);
 		ret = rte_eth_dev_flow_ctrl_set(PORT_ID(sdev), fc_conf);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_flow_ctrl_set failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -806,7 +813,7 @@
 	RTE_ASSERT(index < FAILSAFE_MAX_ETHADDR);
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		ret = rte_eth_dev_mac_addr_add(PORT_ID(sdev), mac_addr, vmdq);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_mac_addr_add failed for sub_device %"
 			      PRIu8 " with error %d", i, ret);
 			return ret;
@@ -848,7 +855,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_filter_ctrl on sub_device %d", i);
 		ret = rte_eth_dev_filter_ctrl(PORT_ID(sdev), type, op, arg);
-		if (ret) {
+		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_filter_ctrl failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index d81cc3c..ee81b70 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -262,6 +262,14 @@ int failsafe_eth_lsc_event_callback(uint16_t port_id,
 	(ETH(s)->dev_ops->ops)
 
 /**
+ * s: (struct sub_device *)
+ * e: (int) error
+ */
+#define SUBDEV_REMOVED(s, e) \
+	(s->remove || \
+	 (((e) == -ENODEV) && (ETH(s)->data->dev_flags & RTE_ETH_DEV_INTR_RMV)))
+
+/**
  * Atomic guard
  */
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH 1/3] net/failsafe: fix removal handling lack
  2017-11-02 15:42 ` [dpdk-stable] [PATCH 1/3] net/failsafe: fix removal handling lack Matan Azrad
@ 2017-11-06  8:19   ` Gaëtan Rivet
  0 siblings, 0 replies; 14+ messages in thread
From: Gaëtan Rivet @ 2017-11-06  8:19 UTC (permalink / raw)
  To: Matan Azrad; +Cc: Adrien Mazarguil, dev, stable

Hello Matan,

On Thu, Nov 02, 2017 at 03:42:02PM +0000, Matan Azrad wrote:
> There is time between the physical removal of the device until
> sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> applications still don't know about the removal and may call sub-device
> control operation which should return an error.
> 
> In previous code this error is reported to the application contrary to
> fail-safe principle that the app should not be aware of device removal.
> 
> Define a removal error that each sub-device PMD should return in case
> of an error caused by removal event; The special error is -ENODEV.
> 
> Add an error check in each relevant control command error flow and
> prevent an error report to application when its value is -ENODEV.
> 
> Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> Fixes: b737a1e ("net/failsafe: support flow API")
> Cc: stable@dpdk.org
> 

This is not a fix.

This would be useless backported in stable without the related
mlx4 and mlx5 changes. The related mlx4 and mlx5 patches are themselves
not marked as fixes and won't be backported.

> Signed-off-by: Matan Azrad <matan@mellanox.com>
> ---
>  doc/guides/nics/fail_safe.rst                   |  7 +++++++
>  doc/guides/prog_guide/env_abstraction_layer.rst |  3 +++
>  drivers/net/failsafe/failsafe_flow.c            | 16 +++++++++------
>  drivers/net/failsafe/failsafe_ops.c             | 27 ++++++++++++++++---------
>  drivers/net/failsafe/failsafe_private.h         |  8 ++++++++
>  5 files changed, 45 insertions(+), 16 deletions(-)
> 
> diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst
> index c4e3d2e..5023fc4 100644
> --- a/doc/guides/nics/fail_safe.rst
> +++ b/doc/guides/nics/fail_safe.rst
> @@ -193,6 +193,13 @@ any time. The fail-safe PMD will register a callback for such event and react
>  accordingly. It will try to safely stop, close and uninit the sub-device having
>  emitted this event, allowing it to free its eventual resources.
>  
> +When fail-safe PMD gets -ENODEV error from control command sent to removable
> +sub-devices, it assumes that the error reason is device removal. In this case
> +fail-safe returns success value to application. The PMD controlling the
> +sub-device is still responsible to emit a removal event (RMV) in addition to
> +returning -ENODEV from control operations after the device has been physically
> +removed. Only the reception of this event unregisters it on the fail-safe side.
> +
>  Fail-safe glossary
>  ------------------
>  
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
> index 4775eb3..bd2fd87 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -213,6 +213,9 @@ device having emitted a Device Removal Event. In such case, calling
>  callback. Care must be taken not to close the device from the interrupt handler
>  context. It is necessary to reschedule such closing operation.
>  
> +Unsuccessful control operations (for those that return errors) may return
> +-ENODEV after the device is physically unplugged.
> +

I think I should be neither ack-ing nor nack-ing this change.
Could you propose it on its own, so that people ignoring fail-safe
related matters could look into it as well?

>  Blacklisting
>  ~~~~~~~~~~~~
>  
> diff --git a/drivers/net/failsafe/failsafe_flow.c b/drivers/net/failsafe/failsafe_flow.c
> index 153ceee..ce9b769 100644
> --- a/drivers/net/failsafe/failsafe_flow.c
> +++ b/drivers/net/failsafe/failsafe_flow.c
> @@ -87,7 +87,7 @@
>  		DEBUG("Calling rte_flow_validate on sub_device %d", i);
>  		ret = rte_flow_validate(PORT_ID(sdev),
>  				attr, patterns, actions, error);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {

Here and for subsequent checks, there should be an explicit check
against zero instead of using unary !.

>  			ERROR("Operation rte_flow_validate failed for sub_device %d"
>  			      " with error %d", i, ret);
>  			return ret;
> @@ -111,7 +111,8 @@
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		flow->flows[i] = rte_flow_create(PORT_ID(sdev),
>  				attr, patterns, actions, error);
> -		if (flow->flows[i] == NULL) {
> +		if (flow->flows[i] == NULL &&
> +			!SUBDEV_REMOVED(sdev, -rte_errno)) {
>  			ERROR("Failed to create flow on sub_device %d",
>  				i);
>  			goto err;
> @@ -150,7 +151,7 @@
>  			continue;
>  		local_ret = rte_flow_destroy(PORT_ID(sdev),
>  				flow->flows[i], error);
> -		if (local_ret) {
> +		if (local_ret && !SUBDEV_REMOVED(sdev, local_ret)) {
>  			ERROR("Failed to destroy flow on sub_device %d: %d",
>  					i, local_ret);
>  			if (ret == 0)
> @@ -175,7 +176,7 @@
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		DEBUG("Calling rte_flow_flush on sub_device %d", i);
>  		ret = rte_flow_flush(PORT_ID(sdev), error);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Operation rte_flow_flush failed for sub_device %d"
>  			      " with error %d", i, ret);
>  			return ret;
> @@ -199,8 +200,11 @@
>  
>  	sdev = TX_SUBDEV(dev);
>  	if (sdev != NULL) {
> -		return rte_flow_query(PORT_ID(sdev),
> +		int ret = rte_flow_query(PORT_ID(sdev),
>  				flow->flows[SUB_ID(sdev)], type, arg, error);
> +
> +		if (!SUBDEV_REMOVED(sdev, ret))
> +			return ret;
>  	}
>  	WARN("No active sub_device to query about its flow");
>  	return -1;
> @@ -223,7 +227,7 @@
>  			WARN("flow isolation mode of sub_device %d in incoherent state.",
>  				i);
>  		ret = rte_flow_isolate(PORT_ID(sdev), set, error);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Operation rte_flow_isolate failed for sub_device %d"
>  			      " with error %d", i, ret);
>  			return ret;
> diff --git a/drivers/net/failsafe/failsafe_ops.c b/drivers/net/failsafe/failsafe_ops.c
> index f460551..cc7ab7f 100644
> --- a/drivers/net/failsafe/failsafe_ops.c
> +++ b/drivers/net/failsafe/failsafe_ops.c
> @@ -314,7 +314,7 @@
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		DEBUG("Calling rte_eth_dev_set_link_up on sub_device %d", i);
>  		ret = rte_eth_dev_set_link_up(PORT_ID(sdev));
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Operation rte_eth_dev_set_link_up failed for sub_device %d"
>  			      " with error %d", i, ret);
>  			return ret;
> @@ -333,7 +333,7 @@
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		DEBUG("Calling rte_eth_dev_set_link_down on sub_device %d", i);
>  		ret = rte_eth_dev_set_link_down(PORT_ID(sdev));
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Operation rte_eth_dev_set_link_down failed for sub_device %d"
>  			      " with error %d", i, ret);
>  			return ret;
> @@ -418,7 +418,7 @@
>  				rx_queue_id,
>  				nb_rx_desc, socket_id,
>  				rx_conf, mb_pool);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("RX queue setup failed for sub_device %d", i);
>  			goto free_rxq;
>  		}
> @@ -484,7 +484,7 @@
>  				tx_queue_id,
>  				nb_tx_desc, socket_id,
>  				tx_conf);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("TX queue setup failed for sub_device %d", i);
>  			goto free_txq;
>  		}
> @@ -563,7 +563,7 @@
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		DEBUG("Calling link_update on sub_device %d", i);
>  		ret = (SUBOPS(sdev, link_update))(ETH(sdev), wait_to_complete);
> -		if (ret && ret != -1) {
> +		if (ret && ret != -1  && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Link update failed for sub_device %d with error %d",
>  			      i, ret);
>  			return ret;
> @@ -587,6 +587,7 @@
>  fs_stats_get(struct rte_eth_dev *dev,
>  	     struct rte_eth_stats *stats)
>  {
> +	struct rte_eth_stats backup;
>  	struct sub_device *sdev;
>  	uint8_t i;
>  	int ret;
> @@ -596,14 +597,20 @@
>  		struct rte_eth_stats *snapshot = &sdev->stats_snapshot.stats;
>  		uint64_t *timestamp = &sdev->stats_snapshot.timestamp;
>  
> +		rte_memcpy(&backup, snapshot, sizeof(backup));
>  		ret = rte_eth_stats_get(PORT_ID(sdev), snapshot);
>  		if (ret) {
> +			if (SUBDEV_REMOVED(sdev, ret)) {
> +				rte_memcpy(snapshot, &backup, sizeof(backup));
> +				goto inc;
> +			}
>  			ERROR("Operation rte_eth_stats_get failed for sub_device %d with error %d",
>  				  i, ret);
>  			*timestamp = 0;
>  			return ret;
>  		}
>  		*timestamp = rte_rdtsc();
> +inc:
>  		failsafe_stats_increment(stats, snapshot);
>  	}
>  	return 0;
> @@ -716,7 +723,7 @@
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		DEBUG("Calling rte_eth_dev_set_mtu on sub_device %d", i);
>  		ret = rte_eth_dev_set_mtu(PORT_ID(sdev), mtu);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Operation rte_eth_dev_set_mtu failed for sub_device %d with error %d",
>  			      i, ret);
>  			return ret;
> @@ -735,7 +742,7 @@
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		DEBUG("Calling rte_eth_dev_vlan_filter on sub_device %d", i);
>  		ret = rte_eth_dev_vlan_filter(PORT_ID(sdev), vlan_id, on);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Operation rte_eth_dev_vlan_filter failed for sub_device %d"
>  			      " with error %d", i, ret);
>  			return ret;
> @@ -769,7 +776,7 @@
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		DEBUG("Calling rte_eth_dev_flow_ctrl_set on sub_device %d", i);
>  		ret = rte_eth_dev_flow_ctrl_set(PORT_ID(sdev), fc_conf);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Operation rte_eth_dev_flow_ctrl_set failed for sub_device %d"
>  			      " with error %d", i, ret);
>  			return ret;
> @@ -806,7 +813,7 @@
>  	RTE_ASSERT(index < FAILSAFE_MAX_ETHADDR);
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		ret = rte_eth_dev_mac_addr_add(PORT_ID(sdev), mac_addr, vmdq);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Operation rte_eth_dev_mac_addr_add failed for sub_device %"
>  			      PRIu8 " with error %d", i, ret);
>  			return ret;
> @@ -848,7 +855,7 @@
>  	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
>  		DEBUG("Calling rte_eth_dev_filter_ctrl on sub_device %d", i);
>  		ret = rte_eth_dev_filter_ctrl(PORT_ID(sdev), type, op, arg);
> -		if (ret) {
> +		if (ret && !SUBDEV_REMOVED(sdev, ret)) {
>  			ERROR("Operation rte_eth_dev_filter_ctrl failed for sub_device %d"
>  			      " with error %d", i, ret);
>  			return ret;
> diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
> index d81cc3c..ee81b70 100644
> --- a/drivers/net/failsafe/failsafe_private.h
> +++ b/drivers/net/failsafe/failsafe_private.h
> @@ -262,6 +262,14 @@ int failsafe_eth_lsc_event_callback(uint16_t port_id,
>  	(ETH(s)->dev_ops->ops)
>  
>  /**
> + * s: (struct sub_device *)
> + * e: (int) error
> + */
> +#define SUBDEV_REMOVED(s, e) \
> +	(s->remove || \
> +	 (((e) == -ENODEV) && (ETH(s)->data->dev_flags & RTE_ETH_DEV_INTR_RMV)))
> +
> +/**
>   * Atomic guard
>   */
>  
> -- 
> 1.8.3.1
> 

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
       [not found] ` <1513175370-16583-1-git-send-email-matan@mellanox.com>
@ 2017-12-13 14:29   ` Matan Azrad
  2017-12-13 15:16     ` Gaëtan Rivet
  0 siblings, 1 reply; 14+ messages in thread
From: Matan Azrad @ 2017-12-13 14:29 UTC (permalink / raw)
  To: Adrien Mazarguil, Thomas Monjalon, Gaetan Rivet; +Cc: dev, stable

There is time between the physical removal of the device until
sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
applications still don't know about the removal and may call sub-device
control operation which should return an error.

In previous code this error is reported to the application contrary to
fail-safe principle that the app should not be aware of device removal.

Add an removal check in each relevant control command error flow and
prevent an error report to application when the sub-device is removed.

Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
Fixes: b737a1e ("net/failsafe: support flow API")
Cc: stable@dpdk.org

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
 drivers/net/failsafe/failsafe_ops.c     | 34 ++++++++++++++++++++++-----------
 drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
 3 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/drivers/net/failsafe/failsafe_flow.c b/drivers/net/failsafe/failsafe_flow.c
index 153ceee..1e39d66 100644
--- a/drivers/net/failsafe/failsafe_flow.c
+++ b/drivers/net/failsafe/failsafe_flow.c
@@ -87,7 +87,7 @@
 		DEBUG("Calling rte_flow_validate on sub_device %d", i);
 		ret = rte_flow_validate(PORT_ID(sdev),
 				attr, patterns, actions, error);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_flow_validate failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -111,7 +111,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		flow->flows[i] = rte_flow_create(PORT_ID(sdev),
 				attr, patterns, actions, error);
-		if (flow->flows[i] == NULL) {
+		if (flow->flows[i] == NULL && fs_is_removed(sdev) == 0) {
 			ERROR("Failed to create flow on sub_device %d",
 				i);
 			goto err;
@@ -150,7 +150,7 @@
 			continue;
 		local_ret = rte_flow_destroy(PORT_ID(sdev),
 				flow->flows[i], error);
-		if (local_ret) {
+		if (local_ret && fs_is_removed(sdev) == 0) {
 			ERROR("Failed to destroy flow on sub_device %d: %d",
 					i, local_ret);
 			if (ret == 0)
@@ -175,7 +175,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_flow_flush on sub_device %d", i);
 		ret = rte_flow_flush(PORT_ID(sdev), error);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_flow_flush failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -199,8 +199,12 @@
 
 	sdev = TX_SUBDEV(dev);
 	if (sdev != NULL) {
-		return rte_flow_query(PORT_ID(sdev),
-				flow->flows[SUB_ID(sdev)], type, arg, error);
+		int ret = rte_flow_query(PORT_ID(sdev),
+					 flow->flows[SUB_ID(sdev)],
+					 type, arg, error);
+
+		if (ret && fs_is_removed(sdev) == 0)
+			return ret;
 	}
 	WARN("No active sub_device to query about its flow");
 	return -1;
@@ -223,7 +227,7 @@
 			WARN("flow isolation mode of sub_device %d in incoherent state.",
 				i);
 		ret = rte_flow_isolate(PORT_ID(sdev), set, error);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_flow_isolate failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
diff --git a/drivers/net/failsafe/failsafe_ops.c b/drivers/net/failsafe/failsafe_ops.c
index e16a590..6799b55 100644
--- a/drivers/net/failsafe/failsafe_ops.c
+++ b/drivers/net/failsafe/failsafe_ops.c
@@ -121,6 +121,8 @@
 					dev->data->nb_tx_queues,
 					&dev->data->dev_conf);
 		if (ret) {
+			if (fs_is_removed(sdev) != 0)
+				continue;
 			ERROR("Could not configure sub_device %d", i);
 			return ret;
 		}
@@ -163,8 +165,11 @@
 			continue;
 		DEBUG("Starting sub_device %d", i);
 		ret = rte_eth_dev_start(PORT_ID(sdev));
-		if (ret)
+		if (ret) {
+			if (fs_is_removed(sdev) != 0)
+				continue;
 			return ret;
+		}
 		sdev->state = DEV_STARTED;
 	}
 	if (PRIV(dev)->state < DEV_STARTED)
@@ -196,7 +201,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_set_link_up on sub_device %d", i);
 		ret = rte_eth_dev_set_link_up(PORT_ID(sdev));
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_eth_dev_set_link_up failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -215,7 +220,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_set_link_down on sub_device %d", i);
 		ret = rte_eth_dev_set_link_down(PORT_ID(sdev));
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_eth_dev_set_link_down failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -300,7 +305,7 @@
 				rx_queue_id,
 				nb_rx_desc, socket_id,
 				rx_conf, mb_pool);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("RX queue setup failed for sub_device %d", i);
 			goto free_rxq;
 		}
@@ -366,7 +371,7 @@
 				tx_queue_id,
 				nb_tx_desc, socket_id,
 				tx_conf);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("TX queue setup failed for sub_device %d", i);
 			goto free_txq;
 		}
@@ -445,7 +450,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling link_update on sub_device %d", i);
 		ret = (SUBOPS(sdev, link_update))(ETH(sdev), wait_to_complete);
-		if (ret && ret != -1) {
+		if (ret && ret != -1 && fs_is_removed(sdev) == 0) {
 			ERROR("Link update failed for sub_device %d with error %d",
 			      i, ret);
 			return ret;
@@ -469,6 +474,7 @@
 fs_stats_get(struct rte_eth_dev *dev,
 	     struct rte_eth_stats *stats)
 {
+	struct rte_eth_stats backup;
 	struct sub_device *sdev;
 	uint8_t i;
 	int ret;
@@ -478,14 +484,20 @@
 		struct rte_eth_stats *snapshot = &sdev->stats_snapshot.stats;
 		uint64_t *timestamp = &sdev->stats_snapshot.timestamp;
 
+		rte_memcpy(&backup, snapshot, sizeof(backup));
 		ret = rte_eth_stats_get(PORT_ID(sdev), snapshot);
 		if (ret) {
+			if (fs_is_removed(sdev) != 0) {
+				rte_memcpy(snapshot, &backup, sizeof(backup));
+				goto inc;
+			}
 			ERROR("Operation rte_eth_stats_get failed for sub_device %d with error %d",
 				  i, ret);
 			*timestamp = 0;
 			return ret;
 		}
 		*timestamp = rte_rdtsc();
+inc:
 		failsafe_stats_increment(stats, snapshot);
 	}
 	return 0;
@@ -598,7 +610,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_set_mtu on sub_device %d", i);
 		ret = rte_eth_dev_set_mtu(PORT_ID(sdev), mtu);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_eth_dev_set_mtu failed for sub_device %d with error %d",
 			      i, ret);
 			return ret;
@@ -617,7 +629,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_vlan_filter on sub_device %d", i);
 		ret = rte_eth_dev_vlan_filter(PORT_ID(sdev), vlan_id, on);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_eth_dev_vlan_filter failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -651,7 +663,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_flow_ctrl_set on sub_device %d", i);
 		ret = rte_eth_dev_flow_ctrl_set(PORT_ID(sdev), fc_conf);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_eth_dev_flow_ctrl_set failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -688,7 +700,7 @@
 	RTE_ASSERT(index < FAILSAFE_MAX_ETHADDR);
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		ret = rte_eth_dev_mac_addr_add(PORT_ID(sdev), mac_addr, vmdq);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_eth_dev_mac_addr_add failed for sub_device %"
 			      PRIu8 " with error %d", i, ret);
 			return ret;
@@ -730,7 +742,7 @@
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_filter_ctrl on sub_device %d", i);
 		ret = rte_eth_dev_filter_ctrl(PORT_ID(sdev), type, op, arg);
-		if (ret) {
+		if (ret && fs_is_removed(sdev) == 0) {
 			ERROR("Operation rte_eth_dev_filter_ctrl failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index d81cc3c..0539782 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -375,4 +375,14 @@ int failsafe_eth_lsc_event_callback(uint16_t port_id,
 	rte_wmb();
 }
 
+/*
+ * Check if sub device was removed.
+ */
+static inline int
+fs_is_removed(struct sub_device *sdev)
+{
+	if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev)) != 0)
+		return 1;
+	return 0;
+}
 #endif /* _RTE_ETH_FAILSAFE_PRIVATE_H_ */
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-13 14:29   ` [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling Matan Azrad
@ 2017-12-13 15:16     ` Gaëtan Rivet
  2017-12-13 15:48       ` Matan Azrad
  0 siblings, 1 reply; 14+ messages in thread
From: Gaëtan Rivet @ 2017-12-13 15:16 UTC (permalink / raw)
  To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable

Hi Matan,

On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote:
> There is time between the physical removal of the device until
> sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> applications still don't know about the removal and may call sub-device
> control operation which should return an error.
> 
> In previous code this error is reported to the application contrary to
> fail-safe principle that the app should not be aware of device removal.
> 
> Add an removal check in each relevant control command error flow and
> prevent an error report to application when the sub-device is removed.
> 
> Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> Fixes: b737a1e ("net/failsafe: support flow API")
> Cc: stable@dpdk.org
> 

This patch is not a fix.
It relies on an eth_dev API evolution. Without this evolution,
this patch is meaningless and would break compilation if backported in
stable branch.

Please remove those tags.

> Signed-off-by: Matan Azrad <matan@mellanox.com>
> ---
>  drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
>  drivers/net/failsafe/failsafe_ops.c     | 34 ++++++++++++++++++++++-----------
>  drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
>  3 files changed, 44 insertions(+), 18 deletions(-)

< ... >

> +/*
> + * Check if sub device was removed.
> + */
> +static inline int
> +fs_is_removed(struct sub_device *sdev)
> +{
> +	if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev)) != 0)
> +		return 1;
> +	return 0;
> +}

Have you considered adding this check within the subdev iterator itself?
I think it would prevent you from having to add it to each return value
checks.

It is still MT-unsafe anyway.

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-13 15:16     ` Gaëtan Rivet
@ 2017-12-13 15:48       ` Matan Azrad
  2017-12-13 16:09         ` Gaëtan Rivet
  0 siblings, 1 reply; 14+ messages in thread
From: Matan Azrad @ 2017-12-13 15:48 UTC (permalink / raw)
  To: Gaëtan Rivet; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable

Hi Gaetan
Thanks for the review.
Some comments..

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Wednesday, December 13, 2017 5:17 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> 
> Hi Matan,
> 
> On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote:
> > There is time between the physical removal of the device until
> > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > applications still don't know about the removal and may call
> > sub-device control operation which should return an error.
> >
> > In previous code this error is reported to the application contrary to
> > fail-safe principle that the app should not be aware of device removal.
> >
> > Add an removal check in each relevant control command error flow and
> > prevent an error report to application when the sub-device is removed.
> >
> > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > Fixes: b737a1e ("net/failsafe: support flow API")
> > Cc: stable@dpdk.org
> >
> 
> This patch is not a fix.
> It relies on an eth_dev API evolution. Without this evolution, this patch is
> meaningless and would break compilation if backported in stable branch.
> 

It is a fix because the bug is finally solved by this patch.
I agree that it cannot be backported itself, but maybe all the series should be backported.
Other idea:
Add new patch which documents the bug and backport it.
Remove it in this patch and remove cc stable from it.
What do you think?

> Please remove those tags.
> 
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > ---
> >  drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
> >  drivers/net/failsafe/failsafe_ops.c     | 34 ++++++++++++++++++++++-----
> ------
> >  drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
> >  3 files changed, 44 insertions(+), 18 deletions(-)
> 
> < ... >
> 
> > +/*
> > + * Check if sub device was removed.
> > + */
> > +static inline int
> > +fs_is_removed(struct sub_device *sdev) {
> > +	if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev))
> != 0)
> > +		return 1;
> > +	return 0;
> > +}
> 
> Have you considered adding this check within the subdev iterator itself?
> I think it would prevent you from having to add it to each return value
> checks.
> 
> It is still MT-unsafe anyway.
>

This fix doesn't come to solve the MT issue, It comes to solve the error report to application because of removal.
Adding the check in subdev iterator doesn't make sense for this issue.

Matan. 
> --
> Gaëtan Rivet
> 6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-13 15:48       ` Matan Azrad
@ 2017-12-13 16:09         ` Gaëtan Rivet
  2017-12-13 17:09           ` Thomas Monjalon
  2017-12-13 21:55           ` Gaëtan Rivet
  0 siblings, 2 replies; 14+ messages in thread
From: Gaëtan Rivet @ 2017-12-13 16:09 UTC (permalink / raw)
  To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable

On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> Hi Gaetan
> Thanks for the review.
> Some comments..
> 
> > -----Original Message-----
> > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > Sent: Wednesday, December 13, 2017 5:17 PM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> > 
> > Hi Matan,
> > 
> > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote:
> > > There is time between the physical removal of the device until
> > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > applications still don't know about the removal and may call
> > > sub-device control operation which should return an error.
> > >
> > > In previous code this error is reported to the application contrary to
> > > fail-safe principle that the app should not be aware of device removal.
> > >
> > > Add an removal check in each relevant control command error flow and
> > > prevent an error report to application when the sub-device is removed.
> > >
> > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > Cc: stable@dpdk.org
> > >
> > 
> > This patch is not a fix.
> > It relies on an eth_dev API evolution. Without this evolution, this patch is
> > meaningless and would break compilation if backported in stable branch.
> > 
> 
> It is a fix because the bug is finally solved by this patch.
> I agree that it cannot be backported itself, but maybe all the series should be backported.
> Other idea:
> Add new patch which documents the bug and backport it.
> Remove it in this patch and remove cc stable from it.
> What do you think?
> 

I think you could write a crude version that would not rely on the
ethdev evolution (checking sdev->remove only), which would be incomplete
but still better than nothing.
And why not in this patch document the issue.
Without any dependency outside failsafe, this could be backported.

Then complete the fix with the API evolution if the new devops is
accepted.

> > Please remove those tags.
> > 
> > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > ---
> > >  drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
> > >  drivers/net/failsafe/failsafe_ops.c     | 34 ++++++++++++++++++++++-----
> > ------
> > >  drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
> > >  3 files changed, 44 insertions(+), 18 deletions(-)
> > 
> > < ... >
> > 
> > > +/*
> > > + * Check if sub device was removed.
> > > + */
> > > +static inline int
> > > +fs_is_removed(struct sub_device *sdev) {
> > > +	if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev))
> > != 0)
> > > +		return 1;
> > > +	return 0;
> > > +}
> > 
> > Have you considered adding this check within the subdev iterator itself?
> > I think it would prevent you from having to add it to each return value
> > checks.
> > 
> > It is still MT-unsafe anyway.
> >
> 
> This fix doesn't come to solve the MT issue, It comes to solve the error report to application because of removal.
> Adding the check in subdev iterator doesn't make sense for this issue.
> 
> Matan. 

If you add this check in the iterator itself, you would skip removed
devices before attempting operating upon them, right?

Then it should probably help with your issue, unless you tested it and
verified that it didnt?

Something like this:

---8<---

diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index d81cc3ca6..62ddc0689 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
        subs = PRIV(dev)->subs;
        tail = PRIV(dev)->subs_tail;
        while (sid < tail) {
+               if (min_state > DEV_PROBED &&
+                   fs_is_removed(&sub[sid]))
+                       goto next;
                if (subs[sid].state >= min_state)
                        break;
+next:
                sid++;
        }
        *sid_out = sid;

--->8---

Only issue being that it is completely racy, but as this MT-unsafe property is
inescapable we might as well ignore it and go for KISS.

If that's enough, I would prefer instead of having this additional check
added to all rte_eth operations.

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-13 16:09         ` Gaëtan Rivet
@ 2017-12-13 17:09           ` Thomas Monjalon
  2017-12-14 10:40             ` Matan Azrad
  2017-12-13 21:55           ` Gaëtan Rivet
  1 sibling, 1 reply; 14+ messages in thread
From: Thomas Monjalon @ 2017-12-13 17:09 UTC (permalink / raw)
  To: Gaëtan Rivet, Matan Azrad; +Cc: Adrien Mazarguil, dev, stable

13/12/2017 17:09, Gaëtan Rivet:
> On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > > Cc: stable@dpdk.org
> > > >
> > > 
> > > This patch is not a fix.
> > > It relies on an eth_dev API evolution. Without this evolution, this patch is
> > > meaningless and would break compilation if backported in stable branch.
> > > 
> > 
> > It is a fix because the bug is finally solved by this patch.
> > I agree that it cannot be backported itself, but maybe all the series should be backported.
> > Other idea:
> > Add new patch which documents the bug and backport it.
> > Remove it in this patch and remove cc stable from it.
> > What do you think?
> > 
> 
> I think you could write a crude version that would not rely on the
> ethdev evolution (checking sdev->remove only), which would be incomplete
> but still better than nothing.
> And why not in this patch document the issue.
> Without any dependency outside failsafe, this could be backported.
> 
> Then complete the fix with the API evolution if the new devops is
> accepted.

I think it is not worth the effort.
It is a limitation in earlier releases and will be properly fixed
with the new op.
Please just remove the Cc:stable@dpdk.org.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-13 16:09         ` Gaëtan Rivet
  2017-12-13 17:09           ` Thomas Monjalon
@ 2017-12-13 21:55           ` Gaëtan Rivet
  2017-12-14 10:40             ` Matan Azrad
  1 sibling, 1 reply; 14+ messages in thread
From: Gaëtan Rivet @ 2017-12-13 21:55 UTC (permalink / raw)
  To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable

Hi again Matan,

On Wed, Dec 13, 2017 at 05:09:16PM +0100, Gaëtan Rivet wrote:
> On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> > Hi Gaetan
> > Thanks for the review.
> > Some comments..
> > 
> > > -----Original Message-----
> > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > Sent: Wednesday, December 13, 2017 5:17 PM
> > > To: Matan Azrad <matan@mellanox.com>
> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> > > 
> > > Hi Matan,
> > > 
> > > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote:
> > > > There is time between the physical removal of the device until
> > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > > applications still don't know about the removal and may call
> > > > sub-device control operation which should return an error.
> > > >
> > > > In previous code this error is reported to the application contrary to
> > > > fail-safe principle that the app should not be aware of device removal.
> > > >
> > > > Add an removal check in each relevant control command error flow and
> > > > prevent an error report to application when the sub-device is removed.
> > > >
> > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > > Cc: stable@dpdk.org
> > > >
> > > 
> > > This patch is not a fix.
> > > It relies on an eth_dev API evolution. Without this evolution, this patch is
> > > meaningless and would break compilation if backported in stable branch.
> > > 
> > 
> > It is a fix because the bug is finally solved by this patch.
> > I agree that it cannot be backported itself, but maybe all the series should be backported.
> > Other idea:
> > Add new patch which documents the bug and backport it.
> > Remove it in this patch and remove cc stable from it.
> > What do you think?
> > 
> 
> I think you could write a crude version that would not rely on the
> ethdev evolution (checking sdev->remove only), which would be incomplete
> but still better than nothing.
> And why not in this patch document the issue.
> Without any dependency outside failsafe, this could be backported.
> 
> Then complete the fix with the API evolution if the new devops is
> accepted.
> 
> > > Please remove those tags.
> > > 
> > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > ---
> > > >  drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
> > > >  drivers/net/failsafe/failsafe_ops.c     | 34 ++++++++++++++++++++++-----
> > > ------
> > > >  drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
> > > >  3 files changed, 44 insertions(+), 18 deletions(-)
> > > 
> > > < ... >
> > > 
> > > > +/*
> > > > + * Check if sub device was removed.
> > > > + */
> > > > +static inline int
> > > > +fs_is_removed(struct sub_device *sdev) {
> > > > +	if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev))
> > > != 0)
> > > > +		return 1;
> > > > +	return 0;
> > > > +}
> > > 
> > > Have you considered adding this check within the subdev iterator itself?
> > > I think it would prevent you from having to add it to each return value
> > > checks.
> > > 
> > > It is still MT-unsafe anyway.
> > >
> > 
> > This fix doesn't come to solve the MT issue, It comes to solve the error report to application because of removal.
> > Adding the check in subdev iterator doesn't make sense for this issue.
> > 
> > Matan. 
> 
> If you add this check in the iterator itself, you would skip removed
> devices before attempting operating upon them, right?
> 
> Then it should probably help with your issue, unless you tested it and
> verified that it didnt?
> 
> Something like this:
> 
> ---8<---
> 
> diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
> index d81cc3ca6..62ddc0689 100644
> --- a/drivers/net/failsafe/failsafe_private.h
> +++ b/drivers/net/failsafe/failsafe_private.h
> @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
>         subs = PRIV(dev)->subs;
>         tail = PRIV(dev)->subs_tail;
>         while (sid < tail) {
> +               if (min_state > DEV_PROBED &&
> +                   fs_is_removed(&sub[sid]))
> +                       goto next;
>                 if (subs[sid].state >= min_state)
>                         break;
> +next:
>                 sid++;
>         }
>         *sid_out = sid;
> 
> --->8---
> 
> Only issue being that it is completely racy, but as this MT-unsafe property is
> inescapable we might as well ignore it and go for KISS.
> 
> If that's enough, I would prefer instead of having this additional check
> added to all rte_eth operations.
> 

Ok, actually you were right here to do it this way. The "is_removed"
check needs to happen after the operation attempt to effectively
mitigate the possible race. Checking before attempting the call will be
much less effective.

That being said, would it be cleaner to have eth_dev ops return -ENODEV
directly, and check against it within fail-safe?

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-13 21:55           ` Gaëtan Rivet
@ 2017-12-14 10:40             ` Matan Azrad
  2017-12-14 10:48               ` Gaëtan Rivet
  0 siblings, 1 reply; 14+ messages in thread
From: Matan Azrad @ 2017-12-14 10:40 UTC (permalink / raw)
  To: Gaëtan Rivet; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable

Hi Gaetan

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Wednesday, December 13, 2017 11:56 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> 
> Hi again Matan,
> 
> On Wed, Dec 13, 2017 at 05:09:16PM +0100, Gaëtan Rivet wrote:
> > On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> > > Hi Gaetan
> > > Thanks for the review.
> > > Some comments..
> > >
> > > > -----Original Message-----
> > > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > > Sent: Wednesday, December 13, 2017 5:17 PM
> > > > To: Matan Azrad <matan@mellanox.com>
> > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas
> Monjalon
> > > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device
> > > > handling
> > > >
> > > > Hi Matan,
> > > >
> > > > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote:
> > > > > There is time between the physical removal of the device until
> > > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > > > applications still don't know about the removal and may call
> > > > > sub-device control operation which should return an error.
> > > > >
> > > > > In previous code this error is reported to the application
> > > > > contrary to fail-safe principle that the app should not be aware of
> device removal.
> > > > >
> > > > > Add an removal check in each relevant control command error flow
> > > > > and prevent an error report to application when the sub-device is
> removed.
> > > > >
> > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > > > Cc: stable@dpdk.org
> > > > >
> > > >
> > > > This patch is not a fix.
> > > > It relies on an eth_dev API evolution. Without this evolution,
> > > > this patch is meaningless and would break compilation if backported in
> stable branch.
> > > >
> > >
> > > It is a fix because the bug is finally solved by this patch.
> > > I agree that it cannot be backported itself, but maybe all the series should
> be backported.
> > > Other idea:
> > > Add new patch which documents the bug and backport it.
> > > Remove it in this patch and remove cc stable from it.
> > > What do you think?
> > >
> >
> > I think you could write a crude version that would not rely on the
> > ethdev evolution (checking sdev->remove only), which would be
> > incomplete but still better than nothing.
> > And why not in this patch document the issue.
> > Without any dependency outside failsafe, this could be backported.
> >
> > Then complete the fix with the API evolution if the new devops is
> > accepted.
> >
> > > > Please remove those tags.
> > > >
> > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > > ---
> > > > >  drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
> > > > >  drivers/net/failsafe/failsafe_ops.c     | 34
> ++++++++++++++++++++++-----
> > > > ------
> > > > >  drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
> > > > >  3 files changed, 44 insertions(+), 18 deletions(-)
> > > >
> > > > < ... >
> > > >
> > > > > +/*
> > > > > + * Check if sub device was removed.
> > > > > + */
> > > > > +static inline int
> > > > > +fs_is_removed(struct sub_device *sdev) {
> > > > > +	if (sdev->remove == 1 ||
> rte_eth_dev_is_removed(PORT_ID(sdev))
> > > > != 0)
> > > > > +		return 1;
> > > > > +	return 0;
> > > > > +}
> > > >
> > > > Have you considered adding this check within the subdev iterator itself?
> > > > I think it would prevent you from having to add it to each return
> > > > value checks.
> > > >
> > > > It is still MT-unsafe anyway.
> > > >
> > >
> > > This fix doesn't come to solve the MT issue, It comes to solve the error
> report to application because of removal.
> > > Adding the check in subdev iterator doesn't make sense for this issue.
> > >
> > > Matan.
> >
> > If you add this check in the iterator itself, you would skip removed
> > devices before attempting operating upon them, right?
> >
> > Then it should probably help with your issue, unless you tested it and
> > verified that it didnt?
> >
> > Something like this:
> >
> > ---8<---
> >
> > diff --git a/drivers/net/failsafe/failsafe_private.h
> > b/drivers/net/failsafe/failsafe_private.h
> > index d81cc3ca6..62ddc0689 100644
> > --- a/drivers/net/failsafe/failsafe_private.h
> > +++ b/drivers/net/failsafe/failsafe_private.h
> > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
> >         subs = PRIV(dev)->subs;
> >         tail = PRIV(dev)->subs_tail;
> >         while (sid < tail) {
> > +               if (min_state > DEV_PROBED &&
> > +                   fs_is_removed(&sub[sid]))
> > +                       goto next;
> >                 if (subs[sid].state >= min_state)
> >                         break;
> > +next:
> >                 sid++;
> >         }
> >         *sid_out = sid;
> >
> > --->8---
> >
> > Only issue being that it is completely racy, but as this MT-unsafe
> > property is inescapable we might as well ignore it and go for KISS.
> >
> > If that's enough, I would prefer instead of having this additional
> > check added to all rte_eth operations.
> >
> 
> Ok, actually you were right here to do it this way. The "is_removed"
> check needs to happen after the operation attempt to effectively mitigate
> the possible race. Checking before attempting the call will be much less
> effective.
> 
> That being said, would it be cleaner to have eth_dev ops return -ENODEV
> directly, and check against it within fail-safe?
> 

I think that according to "is_removed" semantic we must return a Boolean value (Each value different from '0' means that the device is removed) like other functions in c library (for example isspace()).

Thanks.
 

> --
> Gaëtan Rivet
> 6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-13 17:09           ` Thomas Monjalon
@ 2017-12-14 10:40             ` Matan Azrad
  0 siblings, 0 replies; 14+ messages in thread
From: Matan Azrad @ 2017-12-14 10:40 UTC (permalink / raw)
  To: Thomas Monjalon, Gaëtan Rivet; +Cc: Adrien Mazarguil, dev, stable



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, December 13, 2017 7:09 PM
> To: Gaëtan Rivet <gaetan.rivet@6wind.com>; Matan Azrad
> <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org;
> stable@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> 
> 13/12/2017 17:09, Gaëtan Rivet:
> > On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > > > Cc: stable@dpdk.org
> > > > >
> > > >
> > > > This patch is not a fix.
> > > > It relies on an eth_dev API evolution. Without this evolution,
> > > > this patch is meaningless and would break compilation if backported in
> stable branch.
> > > >
> > >
> > > It is a fix because the bug is finally solved by this patch.
> > > I agree that it cannot be backported itself, but maybe all the series should
> be backported.
> > > Other idea:
> > > Add new patch which documents the bug and backport it.
> > > Remove it in this patch and remove cc stable from it.
> > > What do you think?
> > >
> >
> > I think you could write a crude version that would not rely on the
> > ethdev evolution (checking sdev->remove only), which would be
> > incomplete but still better than nothing.
> > And why not in this patch document the issue.
> > Without any dependency outside failsafe, this could be backported.
> >
> > Then complete the fix with the API evolution if the new devops is
> > accepted.
> 
> I think it is not worth the effort.
> It is a limitation in earlier releases and will be properly fixed with the new op.
> Please just remove the Cc:stable@dpdk.org.

OK.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-14 10:40             ` Matan Azrad
@ 2017-12-14 10:48               ` Gaëtan Rivet
  2017-12-14 13:07                 ` Matan Azrad
  0 siblings, 1 reply; 14+ messages in thread
From: Gaëtan Rivet @ 2017-12-14 10:48 UTC (permalink / raw)
  To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable

On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote:
> Hi Gaetan
> 

<snip>

> > >
> > > If you add this check in the iterator itself, you would skip removed
> > > devices before attempting operating upon them, right?
> > >
> > > Then it should probably help with your issue, unless you tested it and
> > > verified that it didnt?
> > >
> > > Something like this:
> > >
> > > ---8<---
> > >
> > > diff --git a/drivers/net/failsafe/failsafe_private.h
> > > b/drivers/net/failsafe/failsafe_private.h
> > > index d81cc3ca6..62ddc0689 100644
> > > --- a/drivers/net/failsafe/failsafe_private.h
> > > +++ b/drivers/net/failsafe/failsafe_private.h
> > > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
> > >         subs = PRIV(dev)->subs;
> > >         tail = PRIV(dev)->subs_tail;
> > >         while (sid < tail) {
> > > +               if (min_state > DEV_PROBED &&
> > > +                   fs_is_removed(&sub[sid]))
> > > +                       goto next;
> > >                 if (subs[sid].state >= min_state)
> > >                         break;
> > > +next:
> > >                 sid++;
> > >         }
> > >         *sid_out = sid;
> > >
> > > --->8---
> > >
> > > Only issue being that it is completely racy, but as this MT-unsafe
> > > property is inescapable we might as well ignore it and go for KISS.
> > >
> > > If that's enough, I would prefer instead of having this additional
> > > check added to all rte_eth operations.
> > >
> > 
> > Ok, actually you were right here to do it this way. The "is_removed"
> > check needs to happen after the operation attempt to effectively mitigate
> > the possible race. Checking before attempting the call will be much less
> > effective.
> > 
> > That being said, would it be cleaner to have eth_dev ops return -ENODEV
> > directly, and check against it within fail-safe?
> > 
> 
> I think that according to "is_removed" semantic we must return a Boolean value (Each value different from '0' means that the device is removed) like other functions in c library (for example isspace()).
> 

Sure, I wasn't discussing the interface proposed by rte_eth_dev_is_removed().

What I meant was to ask whether checking rte_eth_dev_is_removed() would
be more interesting in the ethdev layer, making the eth_dev_ops return
-ENODEV regardless of the previous error if this check is supported by
the driver and signal that the port is removed.

I think this information could be interesting to other systems, not just
fail-safe.

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-14 10:48               ` Gaëtan Rivet
@ 2017-12-14 13:07                 ` Matan Azrad
  2017-12-14 13:27                   ` Gaëtan Rivet
  0 siblings, 1 reply; 14+ messages in thread
From: Matan Azrad @ 2017-12-14 13:07 UTC (permalink / raw)
  To: Gaëtan Rivet; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable

Hi Gaetan

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Thursday, December 14, 2017 12:49 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> 
> On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote:
> > Hi Gaetan
> >
> 
> <snip>
> 
> > > >
> > > > If you add this check in the iterator itself, you would skip
> > > > removed devices before attempting operating upon them, right?
> > > >
> > > > Then it should probably help with your issue, unless you tested it
> > > > and verified that it didnt?
> > > >
> > > > Something like this:
> > > >
> > > > ---8<---
> > > >
> > > > diff --git a/drivers/net/failsafe/failsafe_private.h
> > > > b/drivers/net/failsafe/failsafe_private.h
> > > > index d81cc3ca6..62ddc0689 100644
> > > > --- a/drivers/net/failsafe/failsafe_private.h
> > > > +++ b/drivers/net/failsafe/failsafe_private.h
> > > > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
> > > >         subs = PRIV(dev)->subs;
> > > >         tail = PRIV(dev)->subs_tail;
> > > >         while (sid < tail) {
> > > > +               if (min_state > DEV_PROBED &&
> > > > +                   fs_is_removed(&sub[sid]))
> > > > +                       goto next;
> > > >                 if (subs[sid].state >= min_state)
> > > >                         break;
> > > > +next:
> > > >                 sid++;
> > > >         }
> > > >         *sid_out = sid;
> > > >
> > > > --->8---
> > > >
> > > > Only issue being that it is completely racy, but as this MT-unsafe
> > > > property is inescapable we might as well ignore it and go for KISS.
> > > >
> > > > If that's enough, I would prefer instead of having this additional
> > > > check added to all rte_eth operations.
> > > >
> > >
> > > Ok, actually you were right here to do it this way. The "is_removed"
> > > check needs to happen after the operation attempt to effectively
> > > mitigate the possible race. Checking before attempting the call will
> > > be much less effective.
> > >
> > > That being said, would it be cleaner to have eth_dev ops return
> > > -ENODEV directly, and check against it within fail-safe?
> > >
> >
> > I think that according to "is_removed" semantic we must return a Boolean
> value (Each value different from '0' means that the device is removed) like
> other functions in c library (for example isspace()).
> >
> 
> Sure, I wasn't discussing the interface proposed by
> rte_eth_dev_is_removed().
> 
> What I meant was to ask whether checking rte_eth_dev_is_removed()
> would be more interesting in the ethdev layer, making the eth_dev_ops
> return -ENODEV regardless of the previous error if this check is supported by
> the driver and signal that the port is removed.
> 
> I think this information could be interesting to other systems, not just fail-
> safe.
> 

Ok. Got you now.
Interesting approach - plan:
	1. update fs_link_update to use rte_eth* functions.
	2. maybe -EIO is preferred because -ENODEV is used for no port error?
	3. update all relevant rte_eth* to use "is_removed" in error flows(1 patch for flow APIs and 1 for the others).
	4. Change fs checks in error flows to check rte_eth* return values.
	5. Remove CC stable from commit massage.

What do you think?

> --
> Gaëtan Rivet
> 6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-14 13:07                 ` Matan Azrad
@ 2017-12-14 13:27                   ` Gaëtan Rivet
  2017-12-14 14:43                     ` Matan Azrad
  0 siblings, 1 reply; 14+ messages in thread
From: Gaëtan Rivet @ 2017-12-14 13:27 UTC (permalink / raw)
  To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable

On Thu, Dec 14, 2017 at 01:07:31PM +0000, Matan Azrad wrote:
> Hi Gaetan
> 
> > -----Original Message-----
> > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > Sent: Thursday, December 14, 2017 12:49 PM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> > 
> > On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote:
> > > Hi Gaetan
> > >
> > 

<snip>

> > > > Ok, actually you were right here to do it this way. The "is_removed"
> > > > check needs to happen after the operation attempt to effectively
> > > > mitigate the possible race. Checking before attempting the call will
> > > > be much less effective.
> > > >
> > > > That being said, would it be cleaner to have eth_dev ops return
> > > > -ENODEV directly, and check against it within fail-safe?
> > > >
> > >
> > > I think that according to "is_removed" semantic we must return a Boolean
> > value (Each value different from '0' means that the device is removed) like
> > other functions in c library (for example isspace()).
> > >
> > 
> > Sure, I wasn't discussing the interface proposed by
> > rte_eth_dev_is_removed().
> > 
> > What I meant was to ask whether checking rte_eth_dev_is_removed()
> > would be more interesting in the ethdev layer, making the eth_dev_ops
> > return -ENODEV regardless of the previous error if this check is supported by
> > the driver and signal that the port is removed.
> > 
> > I think this information could be interesting to other systems, not just fail-
> > safe.
> > 
> 
> Ok. Got you now.
> Interesting approach - plan:
> 	1. update fs_link_update to use rte_eth* functions.

I'm surprised it doesn't already.
Either the rte_eth* function was introduced after the failsafe, or be
wary of potential issues. I don't see a problem right now though.

> 	2. maybe -EIO is preferred because -ENODEV is used for no port error?

Good point, didn't think about it.
Prepare yourself maybe to some arguments about the most relevant error
code. -EIO seems fine to me, but maybe use a wrapper for all this.

Something like:

---8<---

static int
eth_error(pid, int original_ret)
{
    int ret;

    if (original_ret == 0)
        return original_ret;
    ret = rte_eth_is_removed(pid);
    if (ret == 0 || ret == -ENOTSUP)
        return original_ret;
    return -EIO;
}

int
rte_eth_ops_xyz(pid)
{
        int ret;
        ret = eth_dev(pid).ops_xyz();
        return eth_error(pid, ret);
}

--->8---

This way you would be able to change it easily and the logic would be
insulated.

> 	3. update all relevant rte_eth* to use "is_removed" in error flows(1 patch for flow APIs and 1 for the others).
> 	4. Change fs checks in error flows to check rte_eth* return values.
> 	5. Remove CC stable from commit massage.
> 
> What do you think?
> 

Agreed otherwise.

Thanks,

> > --
> > Gaëtan Rivet
> > 6WIND

-- 
Gaëtan Rivet
6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling
  2017-12-14 13:27                   ` Gaëtan Rivet
@ 2017-12-14 14:43                     ` Matan Azrad
  0 siblings, 0 replies; 14+ messages in thread
From: Matan Azrad @ 2017-12-14 14:43 UTC (permalink / raw)
  To: Gaëtan Rivet; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable

Hi

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Thursday, December 14, 2017 3:27 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
> 
> On Thu, Dec 14, 2017 at 01:07:31PM +0000, Matan Azrad wrote:
> > Hi Gaetan
> >
> > > -----Original Message-----
> > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > Sent: Thursday, December 14, 2017 12:49 PM
> > > To: Matan Azrad <matan@mellanox.com>
> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org
> > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device
> > > handling
> > >
> > > On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote:
> > > > Hi Gaetan
> > > >
> > >
> 
> <snip>
> 
> > > > > Ok, actually you were right here to do it this way. The "is_removed"
> > > > > check needs to happen after the operation attempt to effectively
> > > > > mitigate the possible race. Checking before attempting the call
> > > > > will be much less effective.
> > > > >
> > > > > That being said, would it be cleaner to have eth_dev ops return
> > > > > -ENODEV directly, and check against it within fail-safe?
> > > > >
> > > >
> > > > I think that according to "is_removed" semantic we must return a
> > > > Boolean
> > > value (Each value different from '0' means that the device is
> > > removed) like other functions in c library (for example isspace()).
> > > >
> > >
> > > Sure, I wasn't discussing the interface proposed by
> > > rte_eth_dev_is_removed().
> > >
> > > What I meant was to ask whether checking rte_eth_dev_is_removed()
> > > would be more interesting in the ethdev layer, making the
> > > eth_dev_ops return -ENODEV regardless of the previous error if this
> > > check is supported by the driver and signal that the port is removed.
> > >
> > > I think this information could be interesting to other systems, not
> > > just fail- safe.
> > >
> >
> > Ok. Got you now.
> > Interesting approach - plan:
> > 	1. update fs_link_update to use rte_eth* functions.
> 
> I'm surprised it doesn't already.
> Either the rte_eth* function was introduced after the failsafe, or be wary of
> potential issues. I don't see a problem right now though.
> 
> > 	2. maybe -EIO is preferred because -ENODEV is used for no port
> error?
> 
> Good point, didn't think about it.
> Prepare yourself maybe to some arguments about the most relevant error
> code. -EIO seems fine to me, but maybe use a wrapper for all this.
> 
> Something like:
> 
> ---8<---
> 
> static int
> eth_error(pid, int original_ret)
> {
>     int ret;
> 
>     if (original_ret == 0)
>         return original_ret;
>     ret = rte_eth_is_removed(pid);
>     if (ret == 0 || ret == -ENOTSUP)
>         return original_ret;
>     return -EIO;
> }
> 
> int
> rte_eth_ops_xyz(pid)
> {
>         int ret;
>         ret = eth_dev(pid).ops_xyz();
>         return eth_error(pid, ret);
> }
> 
> --->8---
> 
> This way you would be able to change it easily and the logic would be
> insulated.
> 

Nice.

> > 	3. update all relevant rte_eth* to use "is_removed" in error flows(1
> patch for flow APIs and 1 for the others).
> > 	4. Change fs checks in error flows to check rte_eth* return values.
> > 	5. Remove CC stable from commit massage.
> >
> > What do you think?
> >
> 
> Agreed otherwise.
> 

Will create V3, thanks!

> Thanks,
> 
> > > --
> > > Gaëtan Rivet
> > > 6WIND
> 
> --
> Gaëtan Rivet
> 6WIND

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-12-14 14:43 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1509637324-13525-1-git-send-email-matan@mellanox.com>
2017-11-02 15:42 ` [dpdk-stable] [PATCH 1/3] net/failsafe: fix removal handling lack Matan Azrad
2017-11-06  8:19   ` Gaëtan Rivet
     [not found] ` <1513175370-16583-1-git-send-email-matan@mellanox.com>
2017-12-13 14:29   ` [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling Matan Azrad
2017-12-13 15:16     ` Gaëtan Rivet
2017-12-13 15:48       ` Matan Azrad
2017-12-13 16:09         ` Gaëtan Rivet
2017-12-13 17:09           ` Thomas Monjalon
2017-12-14 10:40             ` Matan Azrad
2017-12-13 21:55           ` Gaëtan Rivet
2017-12-14 10:40             ` Matan Azrad
2017-12-14 10:48               ` Gaëtan Rivet
2017-12-14 13:07                 ` Matan Azrad
2017-12-14 13:27                   ` Gaëtan Rivet
2017-12-14 14:43                     ` Matan Azrad

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).