* [dpdk-stable] [PATCH 1/3] net/failsafe: fix removal handling lack [not found] <1509637324-13525-1-git-send-email-matan@mellanox.com> @ 2017-11-02 15:42 ` Matan Azrad 2017-11-06 8:19 ` Gaëtan Rivet [not found] ` <1513175370-16583-1-git-send-email-matan@mellanox.com> 1 sibling, 1 reply; 14+ messages in thread From: Matan Azrad @ 2017-11-02 15:42 UTC (permalink / raw) To: Adrien Mazarguil, Gaetan Rivet; +Cc: dev, stable There is time between the physical removal of the device until sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and applications still don't know about the removal and may call sub-device control operation which should return an error. In previous code this error is reported to the application contrary to fail-safe principle that the app should not be aware of device removal. Define a removal error that each sub-device PMD should return in case of an error caused by removal event; The special error is -ENODEV. Add an error check in each relevant control command error flow and prevent an error report to application when its value is -ENODEV. Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") Fixes: b737a1e ("net/failsafe: support flow API") Cc: stable@dpdk.org Signed-off-by: Matan Azrad <matan@mellanox.com> --- doc/guides/nics/fail_safe.rst | 7 +++++++ doc/guides/prog_guide/env_abstraction_layer.rst | 3 +++ drivers/net/failsafe/failsafe_flow.c | 16 +++++++++------ drivers/net/failsafe/failsafe_ops.c | 27 ++++++++++++++++--------- drivers/net/failsafe/failsafe_private.h | 8 ++++++++ 5 files changed, 45 insertions(+), 16 deletions(-) diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst index c4e3d2e..5023fc4 100644 --- a/doc/guides/nics/fail_safe.rst +++ b/doc/guides/nics/fail_safe.rst @@ -193,6 +193,13 @@ any time. The fail-safe PMD will register a callback for such event and react accordingly. It will try to safely stop, close and uninit the sub-device having emitted this event, allowing it to free its eventual resources. +When fail-safe PMD gets -ENODEV error from control command sent to removable +sub-devices, it assumes that the error reason is device removal. In this case +fail-safe returns success value to application. The PMD controlling the +sub-device is still responsible to emit a removal event (RMV) in addition to +returning -ENODEV from control operations after the device has been physically +removed. Only the reception of this event unregisters it on the fail-safe side. + Fail-safe glossary ------------------ diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst index 4775eb3..bd2fd87 100644 --- a/doc/guides/prog_guide/env_abstraction_layer.rst +++ b/doc/guides/prog_guide/env_abstraction_layer.rst @@ -213,6 +213,9 @@ device having emitted a Device Removal Event. In such case, calling callback. Care must be taken not to close the device from the interrupt handler context. It is necessary to reschedule such closing operation. +Unsuccessful control operations (for those that return errors) may return +-ENODEV after the device is physically unplugged. + Blacklisting ~~~~~~~~~~~~ diff --git a/drivers/net/failsafe/failsafe_flow.c b/drivers/net/failsafe/failsafe_flow.c index 153ceee..ce9b769 100644 --- a/drivers/net/failsafe/failsafe_flow.c +++ b/drivers/net/failsafe/failsafe_flow.c @@ -87,7 +87,7 @@ DEBUG("Calling rte_flow_validate on sub_device %d", i); ret = rte_flow_validate(PORT_ID(sdev), attr, patterns, actions, error); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_flow_validate failed for sub_device %d" " with error %d", i, ret); return ret; @@ -111,7 +111,8 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { flow->flows[i] = rte_flow_create(PORT_ID(sdev), attr, patterns, actions, error); - if (flow->flows[i] == NULL) { + if (flow->flows[i] == NULL && + !SUBDEV_REMOVED(sdev, -rte_errno)) { ERROR("Failed to create flow on sub_device %d", i); goto err; @@ -150,7 +151,7 @@ continue; local_ret = rte_flow_destroy(PORT_ID(sdev), flow->flows[i], error); - if (local_ret) { + if (local_ret && !SUBDEV_REMOVED(sdev, local_ret)) { ERROR("Failed to destroy flow on sub_device %d: %d", i, local_ret); if (ret == 0) @@ -175,7 +176,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_flow_flush on sub_device %d", i); ret = rte_flow_flush(PORT_ID(sdev), error); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_flow_flush failed for sub_device %d" " with error %d", i, ret); return ret; @@ -199,8 +200,11 @@ sdev = TX_SUBDEV(dev); if (sdev != NULL) { - return rte_flow_query(PORT_ID(sdev), + int ret = rte_flow_query(PORT_ID(sdev), flow->flows[SUB_ID(sdev)], type, arg, error); + + if (!SUBDEV_REMOVED(sdev, ret)) + return ret; } WARN("No active sub_device to query about its flow"); return -1; @@ -223,7 +227,7 @@ WARN("flow isolation mode of sub_device %d in incoherent state.", i); ret = rte_flow_isolate(PORT_ID(sdev), set, error); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_flow_isolate failed for sub_device %d" " with error %d", i, ret); return ret; diff --git a/drivers/net/failsafe/failsafe_ops.c b/drivers/net/failsafe/failsafe_ops.c index f460551..cc7ab7f 100644 --- a/drivers/net/failsafe/failsafe_ops.c +++ b/drivers/net/failsafe/failsafe_ops.c @@ -314,7 +314,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_set_link_up on sub_device %d", i); ret = rte_eth_dev_set_link_up(PORT_ID(sdev)); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_eth_dev_set_link_up failed for sub_device %d" " with error %d", i, ret); return ret; @@ -333,7 +333,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_set_link_down on sub_device %d", i); ret = rte_eth_dev_set_link_down(PORT_ID(sdev)); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_eth_dev_set_link_down failed for sub_device %d" " with error %d", i, ret); return ret; @@ -418,7 +418,7 @@ rx_queue_id, nb_rx_desc, socket_id, rx_conf, mb_pool); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("RX queue setup failed for sub_device %d", i); goto free_rxq; } @@ -484,7 +484,7 @@ tx_queue_id, nb_tx_desc, socket_id, tx_conf); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("TX queue setup failed for sub_device %d", i); goto free_txq; } @@ -563,7 +563,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling link_update on sub_device %d", i); ret = (SUBOPS(sdev, link_update))(ETH(sdev), wait_to_complete); - if (ret && ret != -1) { + if (ret && ret != -1 && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Link update failed for sub_device %d with error %d", i, ret); return ret; @@ -587,6 +587,7 @@ fs_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats) { + struct rte_eth_stats backup; struct sub_device *sdev; uint8_t i; int ret; @@ -596,14 +597,20 @@ struct rte_eth_stats *snapshot = &sdev->stats_snapshot.stats; uint64_t *timestamp = &sdev->stats_snapshot.timestamp; + rte_memcpy(&backup, snapshot, sizeof(backup)); ret = rte_eth_stats_get(PORT_ID(sdev), snapshot); if (ret) { + if (SUBDEV_REMOVED(sdev, ret)) { + rte_memcpy(snapshot, &backup, sizeof(backup)); + goto inc; + } ERROR("Operation rte_eth_stats_get failed for sub_device %d with error %d", i, ret); *timestamp = 0; return ret; } *timestamp = rte_rdtsc(); +inc: failsafe_stats_increment(stats, snapshot); } return 0; @@ -716,7 +723,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_set_mtu on sub_device %d", i); ret = rte_eth_dev_set_mtu(PORT_ID(sdev), mtu); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_eth_dev_set_mtu failed for sub_device %d with error %d", i, ret); return ret; @@ -735,7 +742,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_vlan_filter on sub_device %d", i); ret = rte_eth_dev_vlan_filter(PORT_ID(sdev), vlan_id, on); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_eth_dev_vlan_filter failed for sub_device %d" " with error %d", i, ret); return ret; @@ -769,7 +776,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_flow_ctrl_set on sub_device %d", i); ret = rte_eth_dev_flow_ctrl_set(PORT_ID(sdev), fc_conf); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_eth_dev_flow_ctrl_set failed for sub_device %d" " with error %d", i, ret); return ret; @@ -806,7 +813,7 @@ RTE_ASSERT(index < FAILSAFE_MAX_ETHADDR); FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { ret = rte_eth_dev_mac_addr_add(PORT_ID(sdev), mac_addr, vmdq); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_eth_dev_mac_addr_add failed for sub_device %" PRIu8 " with error %d", i, ret); return ret; @@ -848,7 +855,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_filter_ctrl on sub_device %d", i); ret = rte_eth_dev_filter_ctrl(PORT_ID(sdev), type, op, arg); - if (ret) { + if (ret && !SUBDEV_REMOVED(sdev, ret)) { ERROR("Operation rte_eth_dev_filter_ctrl failed for sub_device %d" " with error %d", i, ret); return ret; diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h index d81cc3c..ee81b70 100644 --- a/drivers/net/failsafe/failsafe_private.h +++ b/drivers/net/failsafe/failsafe_private.h @@ -262,6 +262,14 @@ int failsafe_eth_lsc_event_callback(uint16_t port_id, (ETH(s)->dev_ops->ops) /** + * s: (struct sub_device *) + * e: (int) error + */ +#define SUBDEV_REMOVED(s, e) \ + (s->remove || \ + (((e) == -ENODEV) && (ETH(s)->data->dev_flags & RTE_ETH_DEV_INTR_RMV))) + +/** * Atomic guard */ -- 1.8.3.1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH 1/3] net/failsafe: fix removal handling lack 2017-11-02 15:42 ` [dpdk-stable] [PATCH 1/3] net/failsafe: fix removal handling lack Matan Azrad @ 2017-11-06 8:19 ` Gaëtan Rivet 0 siblings, 0 replies; 14+ messages in thread From: Gaëtan Rivet @ 2017-11-06 8:19 UTC (permalink / raw) To: Matan Azrad; +Cc: Adrien Mazarguil, dev, stable Hello Matan, On Thu, Nov 02, 2017 at 03:42:02PM +0000, Matan Azrad wrote: > There is time between the physical removal of the device until > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and > applications still don't know about the removal and may call sub-device > control operation which should return an error. > > In previous code this error is reported to the application contrary to > fail-safe principle that the app should not be aware of device removal. > > Define a removal error that each sub-device PMD should return in case > of an error caused by removal event; The special error is -ENODEV. > > Add an error check in each relevant control command error flow and > prevent an error report to application when its value is -ENODEV. > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") > Fixes: b737a1e ("net/failsafe: support flow API") > Cc: stable@dpdk.org > This is not a fix. This would be useless backported in stable without the related mlx4 and mlx5 changes. The related mlx4 and mlx5 patches are themselves not marked as fixes and won't be backported. > Signed-off-by: Matan Azrad <matan@mellanox.com> > --- > doc/guides/nics/fail_safe.rst | 7 +++++++ > doc/guides/prog_guide/env_abstraction_layer.rst | 3 +++ > drivers/net/failsafe/failsafe_flow.c | 16 +++++++++------ > drivers/net/failsafe/failsafe_ops.c | 27 ++++++++++++++++--------- > drivers/net/failsafe/failsafe_private.h | 8 ++++++++ > 5 files changed, 45 insertions(+), 16 deletions(-) > > diff --git a/doc/guides/nics/fail_safe.rst b/doc/guides/nics/fail_safe.rst > index c4e3d2e..5023fc4 100644 > --- a/doc/guides/nics/fail_safe.rst > +++ b/doc/guides/nics/fail_safe.rst > @@ -193,6 +193,13 @@ any time. The fail-safe PMD will register a callback for such event and react > accordingly. It will try to safely stop, close and uninit the sub-device having > emitted this event, allowing it to free its eventual resources. > > +When fail-safe PMD gets -ENODEV error from control command sent to removable > +sub-devices, it assumes that the error reason is device removal. In this case > +fail-safe returns success value to application. The PMD controlling the > +sub-device is still responsible to emit a removal event (RMV) in addition to > +returning -ENODEV from control operations after the device has been physically > +removed. Only the reception of this event unregisters it on the fail-safe side. > + > Fail-safe glossary > ------------------ > > diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst > index 4775eb3..bd2fd87 100644 > --- a/doc/guides/prog_guide/env_abstraction_layer.rst > +++ b/doc/guides/prog_guide/env_abstraction_layer.rst > @@ -213,6 +213,9 @@ device having emitted a Device Removal Event. In such case, calling > callback. Care must be taken not to close the device from the interrupt handler > context. It is necessary to reschedule such closing operation. > > +Unsuccessful control operations (for those that return errors) may return > +-ENODEV after the device is physically unplugged. > + I think I should be neither ack-ing nor nack-ing this change. Could you propose it on its own, so that people ignoring fail-safe related matters could look into it as well? > Blacklisting > ~~~~~~~~~~~~ > > diff --git a/drivers/net/failsafe/failsafe_flow.c b/drivers/net/failsafe/failsafe_flow.c > index 153ceee..ce9b769 100644 > --- a/drivers/net/failsafe/failsafe_flow.c > +++ b/drivers/net/failsafe/failsafe_flow.c > @@ -87,7 +87,7 @@ > DEBUG("Calling rte_flow_validate on sub_device %d", i); > ret = rte_flow_validate(PORT_ID(sdev), > attr, patterns, actions, error); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { Here and for subsequent checks, there should be an explicit check against zero instead of using unary !. > ERROR("Operation rte_flow_validate failed for sub_device %d" > " with error %d", i, ret); > return ret; > @@ -111,7 +111,8 @@ > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > flow->flows[i] = rte_flow_create(PORT_ID(sdev), > attr, patterns, actions, error); > - if (flow->flows[i] == NULL) { > + if (flow->flows[i] == NULL && > + !SUBDEV_REMOVED(sdev, -rte_errno)) { > ERROR("Failed to create flow on sub_device %d", > i); > goto err; > @@ -150,7 +151,7 @@ > continue; > local_ret = rte_flow_destroy(PORT_ID(sdev), > flow->flows[i], error); > - if (local_ret) { > + if (local_ret && !SUBDEV_REMOVED(sdev, local_ret)) { > ERROR("Failed to destroy flow on sub_device %d: %d", > i, local_ret); > if (ret == 0) > @@ -175,7 +176,7 @@ > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > DEBUG("Calling rte_flow_flush on sub_device %d", i); > ret = rte_flow_flush(PORT_ID(sdev), error); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Operation rte_flow_flush failed for sub_device %d" > " with error %d", i, ret); > return ret; > @@ -199,8 +200,11 @@ > > sdev = TX_SUBDEV(dev); > if (sdev != NULL) { > - return rte_flow_query(PORT_ID(sdev), > + int ret = rte_flow_query(PORT_ID(sdev), > flow->flows[SUB_ID(sdev)], type, arg, error); > + > + if (!SUBDEV_REMOVED(sdev, ret)) > + return ret; > } > WARN("No active sub_device to query about its flow"); > return -1; > @@ -223,7 +227,7 @@ > WARN("flow isolation mode of sub_device %d in incoherent state.", > i); > ret = rte_flow_isolate(PORT_ID(sdev), set, error); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Operation rte_flow_isolate failed for sub_device %d" > " with error %d", i, ret); > return ret; > diff --git a/drivers/net/failsafe/failsafe_ops.c b/drivers/net/failsafe/failsafe_ops.c > index f460551..cc7ab7f 100644 > --- a/drivers/net/failsafe/failsafe_ops.c > +++ b/drivers/net/failsafe/failsafe_ops.c > @@ -314,7 +314,7 @@ > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > DEBUG("Calling rte_eth_dev_set_link_up on sub_device %d", i); > ret = rte_eth_dev_set_link_up(PORT_ID(sdev)); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Operation rte_eth_dev_set_link_up failed for sub_device %d" > " with error %d", i, ret); > return ret; > @@ -333,7 +333,7 @@ > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > DEBUG("Calling rte_eth_dev_set_link_down on sub_device %d", i); > ret = rte_eth_dev_set_link_down(PORT_ID(sdev)); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Operation rte_eth_dev_set_link_down failed for sub_device %d" > " with error %d", i, ret); > return ret; > @@ -418,7 +418,7 @@ > rx_queue_id, > nb_rx_desc, socket_id, > rx_conf, mb_pool); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("RX queue setup failed for sub_device %d", i); > goto free_rxq; > } > @@ -484,7 +484,7 @@ > tx_queue_id, > nb_tx_desc, socket_id, > tx_conf); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("TX queue setup failed for sub_device %d", i); > goto free_txq; > } > @@ -563,7 +563,7 @@ > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > DEBUG("Calling link_update on sub_device %d", i); > ret = (SUBOPS(sdev, link_update))(ETH(sdev), wait_to_complete); > - if (ret && ret != -1) { > + if (ret && ret != -1 && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Link update failed for sub_device %d with error %d", > i, ret); > return ret; > @@ -587,6 +587,7 @@ > fs_stats_get(struct rte_eth_dev *dev, > struct rte_eth_stats *stats) > { > + struct rte_eth_stats backup; > struct sub_device *sdev; > uint8_t i; > int ret; > @@ -596,14 +597,20 @@ > struct rte_eth_stats *snapshot = &sdev->stats_snapshot.stats; > uint64_t *timestamp = &sdev->stats_snapshot.timestamp; > > + rte_memcpy(&backup, snapshot, sizeof(backup)); > ret = rte_eth_stats_get(PORT_ID(sdev), snapshot); > if (ret) { > + if (SUBDEV_REMOVED(sdev, ret)) { > + rte_memcpy(snapshot, &backup, sizeof(backup)); > + goto inc; > + } > ERROR("Operation rte_eth_stats_get failed for sub_device %d with error %d", > i, ret); > *timestamp = 0; > return ret; > } > *timestamp = rte_rdtsc(); > +inc: > failsafe_stats_increment(stats, snapshot); > } > return 0; > @@ -716,7 +723,7 @@ > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > DEBUG("Calling rte_eth_dev_set_mtu on sub_device %d", i); > ret = rte_eth_dev_set_mtu(PORT_ID(sdev), mtu); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Operation rte_eth_dev_set_mtu failed for sub_device %d with error %d", > i, ret); > return ret; > @@ -735,7 +742,7 @@ > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > DEBUG("Calling rte_eth_dev_vlan_filter on sub_device %d", i); > ret = rte_eth_dev_vlan_filter(PORT_ID(sdev), vlan_id, on); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Operation rte_eth_dev_vlan_filter failed for sub_device %d" > " with error %d", i, ret); > return ret; > @@ -769,7 +776,7 @@ > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > DEBUG("Calling rte_eth_dev_flow_ctrl_set on sub_device %d", i); > ret = rte_eth_dev_flow_ctrl_set(PORT_ID(sdev), fc_conf); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Operation rte_eth_dev_flow_ctrl_set failed for sub_device %d" > " with error %d", i, ret); > return ret; > @@ -806,7 +813,7 @@ > RTE_ASSERT(index < FAILSAFE_MAX_ETHADDR); > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > ret = rte_eth_dev_mac_addr_add(PORT_ID(sdev), mac_addr, vmdq); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Operation rte_eth_dev_mac_addr_add failed for sub_device %" > PRIu8 " with error %d", i, ret); > return ret; > @@ -848,7 +855,7 @@ > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { > DEBUG("Calling rte_eth_dev_filter_ctrl on sub_device %d", i); > ret = rte_eth_dev_filter_ctrl(PORT_ID(sdev), type, op, arg); > - if (ret) { > + if (ret && !SUBDEV_REMOVED(sdev, ret)) { > ERROR("Operation rte_eth_dev_filter_ctrl failed for sub_device %d" > " with error %d", i, ret); > return ret; > diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h > index d81cc3c..ee81b70 100644 > --- a/drivers/net/failsafe/failsafe_private.h > +++ b/drivers/net/failsafe/failsafe_private.h > @@ -262,6 +262,14 @@ int failsafe_eth_lsc_event_callback(uint16_t port_id, > (ETH(s)->dev_ops->ops) > > /** > + * s: (struct sub_device *) > + * e: (int) error > + */ > +#define SUBDEV_REMOVED(s, e) \ > + (s->remove || \ > + (((e) == -ENODEV) && (ETH(s)->data->dev_flags & RTE_ETH_DEV_INTR_RMV))) > + > +/** > * Atomic guard > */ > > -- > 1.8.3.1 > -- Gaëtan Rivet 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <1513175370-16583-1-git-send-email-matan@mellanox.com>]
* [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling [not found] ` <1513175370-16583-1-git-send-email-matan@mellanox.com> @ 2017-12-13 14:29 ` Matan Azrad 2017-12-13 15:16 ` Gaëtan Rivet 0 siblings, 1 reply; 14+ messages in thread From: Matan Azrad @ 2017-12-13 14:29 UTC (permalink / raw) To: Adrien Mazarguil, Thomas Monjalon, Gaetan Rivet; +Cc: dev, stable There is time between the physical removal of the device until sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and applications still don't know about the removal and may call sub-device control operation which should return an error. In previous code this error is reported to the application contrary to fail-safe principle that the app should not be aware of device removal. Add an removal check in each relevant control command error flow and prevent an error report to application when the sub-device is removed. Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") Fixes: b737a1e ("net/failsafe: support flow API") Cc: stable@dpdk.org Signed-off-by: Matan Azrad <matan@mellanox.com> --- drivers/net/failsafe/failsafe_flow.c | 18 ++++++++++------- drivers/net/failsafe/failsafe_ops.c | 34 ++++++++++++++++++++++----------- drivers/net/failsafe/failsafe_private.h | 10 ++++++++++ 3 files changed, 44 insertions(+), 18 deletions(-) diff --git a/drivers/net/failsafe/failsafe_flow.c b/drivers/net/failsafe/failsafe_flow.c index 153ceee..1e39d66 100644 --- a/drivers/net/failsafe/failsafe_flow.c +++ b/drivers/net/failsafe/failsafe_flow.c @@ -87,7 +87,7 @@ DEBUG("Calling rte_flow_validate on sub_device %d", i); ret = rte_flow_validate(PORT_ID(sdev), attr, patterns, actions, error); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_flow_validate failed for sub_device %d" " with error %d", i, ret); return ret; @@ -111,7 +111,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { flow->flows[i] = rte_flow_create(PORT_ID(sdev), attr, patterns, actions, error); - if (flow->flows[i] == NULL) { + if (flow->flows[i] == NULL && fs_is_removed(sdev) == 0) { ERROR("Failed to create flow on sub_device %d", i); goto err; @@ -150,7 +150,7 @@ continue; local_ret = rte_flow_destroy(PORT_ID(sdev), flow->flows[i], error); - if (local_ret) { + if (local_ret && fs_is_removed(sdev) == 0) { ERROR("Failed to destroy flow on sub_device %d: %d", i, local_ret); if (ret == 0) @@ -175,7 +175,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_flow_flush on sub_device %d", i); ret = rte_flow_flush(PORT_ID(sdev), error); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_flow_flush failed for sub_device %d" " with error %d", i, ret); return ret; @@ -199,8 +199,12 @@ sdev = TX_SUBDEV(dev); if (sdev != NULL) { - return rte_flow_query(PORT_ID(sdev), - flow->flows[SUB_ID(sdev)], type, arg, error); + int ret = rte_flow_query(PORT_ID(sdev), + flow->flows[SUB_ID(sdev)], + type, arg, error); + + if (ret && fs_is_removed(sdev) == 0) + return ret; } WARN("No active sub_device to query about its flow"); return -1; @@ -223,7 +227,7 @@ WARN("flow isolation mode of sub_device %d in incoherent state.", i); ret = rte_flow_isolate(PORT_ID(sdev), set, error); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_flow_isolate failed for sub_device %d" " with error %d", i, ret); return ret; diff --git a/drivers/net/failsafe/failsafe_ops.c b/drivers/net/failsafe/failsafe_ops.c index e16a590..6799b55 100644 --- a/drivers/net/failsafe/failsafe_ops.c +++ b/drivers/net/failsafe/failsafe_ops.c @@ -121,6 +121,8 @@ dev->data->nb_tx_queues, &dev->data->dev_conf); if (ret) { + if (fs_is_removed(sdev) != 0) + continue; ERROR("Could not configure sub_device %d", i); return ret; } @@ -163,8 +165,11 @@ continue; DEBUG("Starting sub_device %d", i); ret = rte_eth_dev_start(PORT_ID(sdev)); - if (ret) + if (ret) { + if (fs_is_removed(sdev) != 0) + continue; return ret; + } sdev->state = DEV_STARTED; } if (PRIV(dev)->state < DEV_STARTED) @@ -196,7 +201,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_set_link_up on sub_device %d", i); ret = rte_eth_dev_set_link_up(PORT_ID(sdev)); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_eth_dev_set_link_up failed for sub_device %d" " with error %d", i, ret); return ret; @@ -215,7 +220,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_set_link_down on sub_device %d", i); ret = rte_eth_dev_set_link_down(PORT_ID(sdev)); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_eth_dev_set_link_down failed for sub_device %d" " with error %d", i, ret); return ret; @@ -300,7 +305,7 @@ rx_queue_id, nb_rx_desc, socket_id, rx_conf, mb_pool); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("RX queue setup failed for sub_device %d", i); goto free_rxq; } @@ -366,7 +371,7 @@ tx_queue_id, nb_tx_desc, socket_id, tx_conf); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("TX queue setup failed for sub_device %d", i); goto free_txq; } @@ -445,7 +450,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling link_update on sub_device %d", i); ret = (SUBOPS(sdev, link_update))(ETH(sdev), wait_to_complete); - if (ret && ret != -1) { + if (ret && ret != -1 && fs_is_removed(sdev) == 0) { ERROR("Link update failed for sub_device %d with error %d", i, ret); return ret; @@ -469,6 +474,7 @@ fs_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats) { + struct rte_eth_stats backup; struct sub_device *sdev; uint8_t i; int ret; @@ -478,14 +484,20 @@ struct rte_eth_stats *snapshot = &sdev->stats_snapshot.stats; uint64_t *timestamp = &sdev->stats_snapshot.timestamp; + rte_memcpy(&backup, snapshot, sizeof(backup)); ret = rte_eth_stats_get(PORT_ID(sdev), snapshot); if (ret) { + if (fs_is_removed(sdev) != 0) { + rte_memcpy(snapshot, &backup, sizeof(backup)); + goto inc; + } ERROR("Operation rte_eth_stats_get failed for sub_device %d with error %d", i, ret); *timestamp = 0; return ret; } *timestamp = rte_rdtsc(); +inc: failsafe_stats_increment(stats, snapshot); } return 0; @@ -598,7 +610,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_set_mtu on sub_device %d", i); ret = rte_eth_dev_set_mtu(PORT_ID(sdev), mtu); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_eth_dev_set_mtu failed for sub_device %d with error %d", i, ret); return ret; @@ -617,7 +629,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_vlan_filter on sub_device %d", i); ret = rte_eth_dev_vlan_filter(PORT_ID(sdev), vlan_id, on); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_eth_dev_vlan_filter failed for sub_device %d" " with error %d", i, ret); return ret; @@ -651,7 +663,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_flow_ctrl_set on sub_device %d", i); ret = rte_eth_dev_flow_ctrl_set(PORT_ID(sdev), fc_conf); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_eth_dev_flow_ctrl_set failed for sub_device %d" " with error %d", i, ret); return ret; @@ -688,7 +700,7 @@ RTE_ASSERT(index < FAILSAFE_MAX_ETHADDR); FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { ret = rte_eth_dev_mac_addr_add(PORT_ID(sdev), mac_addr, vmdq); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_eth_dev_mac_addr_add failed for sub_device %" PRIu8 " with error %d", i, ret); return ret; @@ -730,7 +742,7 @@ FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) { DEBUG("Calling rte_eth_dev_filter_ctrl on sub_device %d", i); ret = rte_eth_dev_filter_ctrl(PORT_ID(sdev), type, op, arg); - if (ret) { + if (ret && fs_is_removed(sdev) == 0) { ERROR("Operation rte_eth_dev_filter_ctrl failed for sub_device %d" " with error %d", i, ret); return ret; diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h index d81cc3c..0539782 100644 --- a/drivers/net/failsafe/failsafe_private.h +++ b/drivers/net/failsafe/failsafe_private.h @@ -375,4 +375,14 @@ int failsafe_eth_lsc_event_callback(uint16_t port_id, rte_wmb(); } +/* + * Check if sub device was removed. + */ +static inline int +fs_is_removed(struct sub_device *sdev) +{ + if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev)) != 0) + return 1; + return 0; +} #endif /* _RTE_ETH_FAILSAFE_PRIVATE_H_ */ -- 1.8.3.1 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-13 14:29 ` [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling Matan Azrad @ 2017-12-13 15:16 ` Gaëtan Rivet 2017-12-13 15:48 ` Matan Azrad 0 siblings, 1 reply; 14+ messages in thread From: Gaëtan Rivet @ 2017-12-13 15:16 UTC (permalink / raw) To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable Hi Matan, On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote: > There is time between the physical removal of the device until > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and > applications still don't know about the removal and may call sub-device > control operation which should return an error. > > In previous code this error is reported to the application contrary to > fail-safe principle that the app should not be aware of device removal. > > Add an removal check in each relevant control command error flow and > prevent an error report to application when the sub-device is removed. > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") > Fixes: b737a1e ("net/failsafe: support flow API") > Cc: stable@dpdk.org > This patch is not a fix. It relies on an eth_dev API evolution. Without this evolution, this patch is meaningless and would break compilation if backported in stable branch. Please remove those tags. > Signed-off-by: Matan Azrad <matan@mellanox.com> > --- > drivers/net/failsafe/failsafe_flow.c | 18 ++++++++++------- > drivers/net/failsafe/failsafe_ops.c | 34 ++++++++++++++++++++++----------- > drivers/net/failsafe/failsafe_private.h | 10 ++++++++++ > 3 files changed, 44 insertions(+), 18 deletions(-) < ... > > +/* > + * Check if sub device was removed. > + */ > +static inline int > +fs_is_removed(struct sub_device *sdev) > +{ > + if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev)) != 0) > + return 1; > + return 0; > +} Have you considered adding this check within the subdev iterator itself? I think it would prevent you from having to add it to each return value checks. It is still MT-unsafe anyway. -- Gaëtan Rivet 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-13 15:16 ` Gaëtan Rivet @ 2017-12-13 15:48 ` Matan Azrad 2017-12-13 16:09 ` Gaëtan Rivet 0 siblings, 1 reply; 14+ messages in thread From: Matan Azrad @ 2017-12-13 15:48 UTC (permalink / raw) To: Gaëtan Rivet; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable Hi Gaetan Thanks for the review. Some comments.. > -----Original Message----- > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > Sent: Wednesday, December 13, 2017 5:17 PM > To: Matan Azrad <matan@mellanox.com> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling > > Hi Matan, > > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote: > > There is time between the physical removal of the device until > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and > > applications still don't know about the removal and may call > > sub-device control operation which should return an error. > > > > In previous code this error is reported to the application contrary to > > fail-safe principle that the app should not be aware of device removal. > > > > Add an removal check in each relevant control command error flow and > > prevent an error report to application when the sub-device is removed. > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") > > Fixes: b737a1e ("net/failsafe: support flow API") > > Cc: stable@dpdk.org > > > > This patch is not a fix. > It relies on an eth_dev API evolution. Without this evolution, this patch is > meaningless and would break compilation if backported in stable branch. > It is a fix because the bug is finally solved by this patch. I agree that it cannot be backported itself, but maybe all the series should be backported. Other idea: Add new patch which documents the bug and backport it. Remove it in this patch and remove cc stable from it. What do you think? > Please remove those tags. > > > Signed-off-by: Matan Azrad <matan@mellanox.com> > > --- > > drivers/net/failsafe/failsafe_flow.c | 18 ++++++++++------- > > drivers/net/failsafe/failsafe_ops.c | 34 ++++++++++++++++++++++----- > ------ > > drivers/net/failsafe/failsafe_private.h | 10 ++++++++++ > > 3 files changed, 44 insertions(+), 18 deletions(-) > > < ... > > > > +/* > > + * Check if sub device was removed. > > + */ > > +static inline int > > +fs_is_removed(struct sub_device *sdev) { > > + if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev)) > != 0) > > + return 1; > > + return 0; > > +} > > Have you considered adding this check within the subdev iterator itself? > I think it would prevent you from having to add it to each return value > checks. > > It is still MT-unsafe anyway. > This fix doesn't come to solve the MT issue, It comes to solve the error report to application because of removal. Adding the check in subdev iterator doesn't make sense for this issue. Matan. > -- > Gaëtan Rivet > 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-13 15:48 ` Matan Azrad @ 2017-12-13 16:09 ` Gaëtan Rivet 2017-12-13 17:09 ` Thomas Monjalon 2017-12-13 21:55 ` Gaëtan Rivet 0 siblings, 2 replies; 14+ messages in thread From: Gaëtan Rivet @ 2017-12-13 16:09 UTC (permalink / raw) To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote: > Hi Gaetan > Thanks for the review. > Some comments.. > > > -----Original Message----- > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > > Sent: Wednesday, December 13, 2017 5:17 PM > > To: Matan Azrad <matan@mellanox.com> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling > > > > Hi Matan, > > > > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote: > > > There is time between the physical removal of the device until > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and > > > applications still don't know about the removal and may call > > > sub-device control operation which should return an error. > > > > > > In previous code this error is reported to the application contrary to > > > fail-safe principle that the app should not be aware of device removal. > > > > > > Add an removal check in each relevant control command error flow and > > > prevent an error report to application when the sub-device is removed. > > > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") > > > Fixes: b737a1e ("net/failsafe: support flow API") > > > Cc: stable@dpdk.org > > > > > > > This patch is not a fix. > > It relies on an eth_dev API evolution. Without this evolution, this patch is > > meaningless and would break compilation if backported in stable branch. > > > > It is a fix because the bug is finally solved by this patch. > I agree that it cannot be backported itself, but maybe all the series should be backported. > Other idea: > Add new patch which documents the bug and backport it. > Remove it in this patch and remove cc stable from it. > What do you think? > I think you could write a crude version that would not rely on the ethdev evolution (checking sdev->remove only), which would be incomplete but still better than nothing. And why not in this patch document the issue. Without any dependency outside failsafe, this could be backported. Then complete the fix with the API evolution if the new devops is accepted. > > Please remove those tags. > > > > > Signed-off-by: Matan Azrad <matan@mellanox.com> > > > --- > > > drivers/net/failsafe/failsafe_flow.c | 18 ++++++++++------- > > > drivers/net/failsafe/failsafe_ops.c | 34 ++++++++++++++++++++++----- > > ------ > > > drivers/net/failsafe/failsafe_private.h | 10 ++++++++++ > > > 3 files changed, 44 insertions(+), 18 deletions(-) > > > > < ... > > > > > > +/* > > > + * Check if sub device was removed. > > > + */ > > > +static inline int > > > +fs_is_removed(struct sub_device *sdev) { > > > + if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev)) > > != 0) > > > + return 1; > > > + return 0; > > > +} > > > > Have you considered adding this check within the subdev iterator itself? > > I think it would prevent you from having to add it to each return value > > checks. > > > > It is still MT-unsafe anyway. > > > > This fix doesn't come to solve the MT issue, It comes to solve the error report to application because of removal. > Adding the check in subdev iterator doesn't make sense for this issue. > > Matan. If you add this check in the iterator itself, you would skip removed devices before attempting operating upon them, right? Then it should probably help with your issue, unless you tested it and verified that it didnt? Something like this: ---8<--- diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h index d81cc3ca6..62ddc0689 100644 --- a/drivers/net/failsafe/failsafe_private.h +++ b/drivers/net/failsafe/failsafe_private.h @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev, subs = PRIV(dev)->subs; tail = PRIV(dev)->subs_tail; while (sid < tail) { + if (min_state > DEV_PROBED && + fs_is_removed(&sub[sid])) + goto next; if (subs[sid].state >= min_state) break; +next: sid++; } *sid_out = sid; --->8--- Only issue being that it is completely racy, but as this MT-unsafe property is inescapable we might as well ignore it and go for KISS. If that's enough, I would prefer instead of having this additional check added to all rte_eth operations. -- Gaëtan Rivet 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-13 16:09 ` Gaëtan Rivet @ 2017-12-13 17:09 ` Thomas Monjalon 2017-12-14 10:40 ` Matan Azrad 2017-12-13 21:55 ` Gaëtan Rivet 1 sibling, 1 reply; 14+ messages in thread From: Thomas Monjalon @ 2017-12-13 17:09 UTC (permalink / raw) To: Gaëtan Rivet, Matan Azrad; +Cc: Adrien Mazarguil, dev, stable 13/12/2017 17:09, Gaëtan Rivet: > On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote: > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") > > > > Fixes: b737a1e ("net/failsafe: support flow API") > > > > Cc: stable@dpdk.org > > > > > > > > > > This patch is not a fix. > > > It relies on an eth_dev API evolution. Without this evolution, this patch is > > > meaningless and would break compilation if backported in stable branch. > > > > > > > It is a fix because the bug is finally solved by this patch. > > I agree that it cannot be backported itself, but maybe all the series should be backported. > > Other idea: > > Add new patch which documents the bug and backport it. > > Remove it in this patch and remove cc stable from it. > > What do you think? > > > > I think you could write a crude version that would not rely on the > ethdev evolution (checking sdev->remove only), which would be incomplete > but still better than nothing. > And why not in this patch document the issue. > Without any dependency outside failsafe, this could be backported. > > Then complete the fix with the API evolution if the new devops is > accepted. I think it is not worth the effort. It is a limitation in earlier releases and will be properly fixed with the new op. Please just remove the Cc:stable@dpdk.org. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-13 17:09 ` Thomas Monjalon @ 2017-12-14 10:40 ` Matan Azrad 0 siblings, 0 replies; 14+ messages in thread From: Matan Azrad @ 2017-12-14 10:40 UTC (permalink / raw) To: Thomas Monjalon, Gaëtan Rivet; +Cc: Adrien Mazarguil, dev, stable > -----Original Message----- > From: Thomas Monjalon [mailto:thomas@monjalon.net] > Sent: Wednesday, December 13, 2017 7:09 PM > To: Gaëtan Rivet <gaetan.rivet@6wind.com>; Matan Azrad > <matan@mellanox.com> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org; > stable@dpdk.org > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling > > 13/12/2017 17:09, Gaëtan Rivet: > > On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote: > > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") > > > > > Fixes: b737a1e ("net/failsafe: support flow API") > > > > > Cc: stable@dpdk.org > > > > > > > > > > > > > This patch is not a fix. > > > > It relies on an eth_dev API evolution. Without this evolution, > > > > this patch is meaningless and would break compilation if backported in > stable branch. > > > > > > > > > > It is a fix because the bug is finally solved by this patch. > > > I agree that it cannot be backported itself, but maybe all the series should > be backported. > > > Other idea: > > > Add new patch which documents the bug and backport it. > > > Remove it in this patch and remove cc stable from it. > > > What do you think? > > > > > > > I think you could write a crude version that would not rely on the > > ethdev evolution (checking sdev->remove only), which would be > > incomplete but still better than nothing. > > And why not in this patch document the issue. > > Without any dependency outside failsafe, this could be backported. > > > > Then complete the fix with the API evolution if the new devops is > > accepted. > > I think it is not worth the effort. > It is a limitation in earlier releases and will be properly fixed with the new op. > Please just remove the Cc:stable@dpdk.org. OK. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-13 16:09 ` Gaëtan Rivet 2017-12-13 17:09 ` Thomas Monjalon @ 2017-12-13 21:55 ` Gaëtan Rivet 2017-12-14 10:40 ` Matan Azrad 1 sibling, 1 reply; 14+ messages in thread From: Gaëtan Rivet @ 2017-12-13 21:55 UTC (permalink / raw) To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable Hi again Matan, On Wed, Dec 13, 2017 at 05:09:16PM +0100, Gaëtan Rivet wrote: > On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote: > > Hi Gaetan > > Thanks for the review. > > Some comments.. > > > > > -----Original Message----- > > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > > > Sent: Wednesday, December 13, 2017 5:17 PM > > > To: Matan Azrad <matan@mellanox.com> > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon > > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org > > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling > > > > > > Hi Matan, > > > > > > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote: > > > > There is time between the physical removal of the device until > > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and > > > > applications still don't know about the removal and may call > > > > sub-device control operation which should return an error. > > > > > > > > In previous code this error is reported to the application contrary to > > > > fail-safe principle that the app should not be aware of device removal. > > > > > > > > Add an removal check in each relevant control command error flow and > > > > prevent an error report to application when the sub-device is removed. > > > > > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") > > > > Fixes: b737a1e ("net/failsafe: support flow API") > > > > Cc: stable@dpdk.org > > > > > > > > > > This patch is not a fix. > > > It relies on an eth_dev API evolution. Without this evolution, this patch is > > > meaningless and would break compilation if backported in stable branch. > > > > > > > It is a fix because the bug is finally solved by this patch. > > I agree that it cannot be backported itself, but maybe all the series should be backported. > > Other idea: > > Add new patch which documents the bug and backport it. > > Remove it in this patch and remove cc stable from it. > > What do you think? > > > > I think you could write a crude version that would not rely on the > ethdev evolution (checking sdev->remove only), which would be incomplete > but still better than nothing. > And why not in this patch document the issue. > Without any dependency outside failsafe, this could be backported. > > Then complete the fix with the API evolution if the new devops is > accepted. > > > > Please remove those tags. > > > > > > > Signed-off-by: Matan Azrad <matan@mellanox.com> > > > > --- > > > > drivers/net/failsafe/failsafe_flow.c | 18 ++++++++++------- > > > > drivers/net/failsafe/failsafe_ops.c | 34 ++++++++++++++++++++++----- > > > ------ > > > > drivers/net/failsafe/failsafe_private.h | 10 ++++++++++ > > > > 3 files changed, 44 insertions(+), 18 deletions(-) > > > > > > < ... > > > > > > > > +/* > > > > + * Check if sub device was removed. > > > > + */ > > > > +static inline int > > > > +fs_is_removed(struct sub_device *sdev) { > > > > + if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev)) > > > != 0) > > > > + return 1; > > > > + return 0; > > > > +} > > > > > > Have you considered adding this check within the subdev iterator itself? > > > I think it would prevent you from having to add it to each return value > > > checks. > > > > > > It is still MT-unsafe anyway. > > > > > > > This fix doesn't come to solve the MT issue, It comes to solve the error report to application because of removal. > > Adding the check in subdev iterator doesn't make sense for this issue. > > > > Matan. > > If you add this check in the iterator itself, you would skip removed > devices before attempting operating upon them, right? > > Then it should probably help with your issue, unless you tested it and > verified that it didnt? > > Something like this: > > ---8<--- > > diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h > index d81cc3ca6..62ddc0689 100644 > --- a/drivers/net/failsafe/failsafe_private.h > +++ b/drivers/net/failsafe/failsafe_private.h > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev, > subs = PRIV(dev)->subs; > tail = PRIV(dev)->subs_tail; > while (sid < tail) { > + if (min_state > DEV_PROBED && > + fs_is_removed(&sub[sid])) > + goto next; > if (subs[sid].state >= min_state) > break; > +next: > sid++; > } > *sid_out = sid; > > --->8--- > > Only issue being that it is completely racy, but as this MT-unsafe property is > inescapable we might as well ignore it and go for KISS. > > If that's enough, I would prefer instead of having this additional check > added to all rte_eth operations. > Ok, actually you were right here to do it this way. The "is_removed" check needs to happen after the operation attempt to effectively mitigate the possible race. Checking before attempting the call will be much less effective. That being said, would it be cleaner to have eth_dev ops return -ENODEV directly, and check against it within fail-safe? -- Gaëtan Rivet 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-13 21:55 ` Gaëtan Rivet @ 2017-12-14 10:40 ` Matan Azrad 2017-12-14 10:48 ` Gaëtan Rivet 0 siblings, 1 reply; 14+ messages in thread From: Matan Azrad @ 2017-12-14 10:40 UTC (permalink / raw) To: Gaëtan Rivet; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable Hi Gaetan > -----Original Message----- > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > Sent: Wednesday, December 13, 2017 11:56 PM > To: Matan Azrad <matan@mellanox.com> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling > > Hi again Matan, > > On Wed, Dec 13, 2017 at 05:09:16PM +0100, Gaëtan Rivet wrote: > > On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote: > > > Hi Gaetan > > > Thanks for the review. > > > Some comments.. > > > > > > > -----Original Message----- > > > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > > > > Sent: Wednesday, December 13, 2017 5:17 PM > > > > To: Matan Azrad <matan@mellanox.com> > > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas > Monjalon > > > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org > > > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device > > > > handling > > > > > > > > Hi Matan, > > > > > > > > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote: > > > > > There is time between the physical removal of the device until > > > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and > > > > > applications still don't know about the removal and may call > > > > > sub-device control operation which should return an error. > > > > > > > > > > In previous code this error is reported to the application > > > > > contrary to fail-safe principle that the app should not be aware of > device removal. > > > > > > > > > > Add an removal check in each relevant control command error flow > > > > > and prevent an error report to application when the sub-device is > removed. > > > > > > > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD") > > > > > Fixes: b737a1e ("net/failsafe: support flow API") > > > > > Cc: stable@dpdk.org > > > > > > > > > > > > > This patch is not a fix. > > > > It relies on an eth_dev API evolution. Without this evolution, > > > > this patch is meaningless and would break compilation if backported in > stable branch. > > > > > > > > > > It is a fix because the bug is finally solved by this patch. > > > I agree that it cannot be backported itself, but maybe all the series should > be backported. > > > Other idea: > > > Add new patch which documents the bug and backport it. > > > Remove it in this patch and remove cc stable from it. > > > What do you think? > > > > > > > I think you could write a crude version that would not rely on the > > ethdev evolution (checking sdev->remove only), which would be > > incomplete but still better than nothing. > > And why not in this patch document the issue. > > Without any dependency outside failsafe, this could be backported. > > > > Then complete the fix with the API evolution if the new devops is > > accepted. > > > > > > Please remove those tags. > > > > > > > > > Signed-off-by: Matan Azrad <matan@mellanox.com> > > > > > --- > > > > > drivers/net/failsafe/failsafe_flow.c | 18 ++++++++++------- > > > > > drivers/net/failsafe/failsafe_ops.c | 34 > ++++++++++++++++++++++----- > > > > ------ > > > > > drivers/net/failsafe/failsafe_private.h | 10 ++++++++++ > > > > > 3 files changed, 44 insertions(+), 18 deletions(-) > > > > > > > > < ... > > > > > > > > > > +/* > > > > > + * Check if sub device was removed. > > > > > + */ > > > > > +static inline int > > > > > +fs_is_removed(struct sub_device *sdev) { > > > > > + if (sdev->remove == 1 || > rte_eth_dev_is_removed(PORT_ID(sdev)) > > > > != 0) > > > > > + return 1; > > > > > + return 0; > > > > > +} > > > > > > > > Have you considered adding this check within the subdev iterator itself? > > > > I think it would prevent you from having to add it to each return > > > > value checks. > > > > > > > > It is still MT-unsafe anyway. > > > > > > > > > > This fix doesn't come to solve the MT issue, It comes to solve the error > report to application because of removal. > > > Adding the check in subdev iterator doesn't make sense for this issue. > > > > > > Matan. > > > > If you add this check in the iterator itself, you would skip removed > > devices before attempting operating upon them, right? > > > > Then it should probably help with your issue, unless you tested it and > > verified that it didnt? > > > > Something like this: > > > > ---8<--- > > > > diff --git a/drivers/net/failsafe/failsafe_private.h > > b/drivers/net/failsafe/failsafe_private.h > > index d81cc3ca6..62ddc0689 100644 > > --- a/drivers/net/failsafe/failsafe_private.h > > +++ b/drivers/net/failsafe/failsafe_private.h > > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev, > > subs = PRIV(dev)->subs; > > tail = PRIV(dev)->subs_tail; > > while (sid < tail) { > > + if (min_state > DEV_PROBED && > > + fs_is_removed(&sub[sid])) > > + goto next; > > if (subs[sid].state >= min_state) > > break; > > +next: > > sid++; > > } > > *sid_out = sid; > > > > --->8--- > > > > Only issue being that it is completely racy, but as this MT-unsafe > > property is inescapable we might as well ignore it and go for KISS. > > > > If that's enough, I would prefer instead of having this additional > > check added to all rte_eth operations. > > > > Ok, actually you were right here to do it this way. The "is_removed" > check needs to happen after the operation attempt to effectively mitigate > the possible race. Checking before attempting the call will be much less > effective. > > That being said, would it be cleaner to have eth_dev ops return -ENODEV > directly, and check against it within fail-safe? > I think that according to "is_removed" semantic we must return a Boolean value (Each value different from '0' means that the device is removed) like other functions in c library (for example isspace()). Thanks. > -- > Gaëtan Rivet > 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-14 10:40 ` Matan Azrad @ 2017-12-14 10:48 ` Gaëtan Rivet 2017-12-14 13:07 ` Matan Azrad 0 siblings, 1 reply; 14+ messages in thread From: Gaëtan Rivet @ 2017-12-14 10:48 UTC (permalink / raw) To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote: > Hi Gaetan > <snip> > > > > > > If you add this check in the iterator itself, you would skip removed > > > devices before attempting operating upon them, right? > > > > > > Then it should probably help with your issue, unless you tested it and > > > verified that it didnt? > > > > > > Something like this: > > > > > > ---8<--- > > > > > > diff --git a/drivers/net/failsafe/failsafe_private.h > > > b/drivers/net/failsafe/failsafe_private.h > > > index d81cc3ca6..62ddc0689 100644 > > > --- a/drivers/net/failsafe/failsafe_private.h > > > +++ b/drivers/net/failsafe/failsafe_private.h > > > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev, > > > subs = PRIV(dev)->subs; > > > tail = PRIV(dev)->subs_tail; > > > while (sid < tail) { > > > + if (min_state > DEV_PROBED && > > > + fs_is_removed(&sub[sid])) > > > + goto next; > > > if (subs[sid].state >= min_state) > > > break; > > > +next: > > > sid++; > > > } > > > *sid_out = sid; > > > > > > --->8--- > > > > > > Only issue being that it is completely racy, but as this MT-unsafe > > > property is inescapable we might as well ignore it and go for KISS. > > > > > > If that's enough, I would prefer instead of having this additional > > > check added to all rte_eth operations. > > > > > > > Ok, actually you were right here to do it this way. The "is_removed" > > check needs to happen after the operation attempt to effectively mitigate > > the possible race. Checking before attempting the call will be much less > > effective. > > > > That being said, would it be cleaner to have eth_dev ops return -ENODEV > > directly, and check against it within fail-safe? > > > > I think that according to "is_removed" semantic we must return a Boolean value (Each value different from '0' means that the device is removed) like other functions in c library (for example isspace()). > Sure, I wasn't discussing the interface proposed by rte_eth_dev_is_removed(). What I meant was to ask whether checking rte_eth_dev_is_removed() would be more interesting in the ethdev layer, making the eth_dev_ops return -ENODEV regardless of the previous error if this check is supported by the driver and signal that the port is removed. I think this information could be interesting to other systems, not just fail-safe. -- Gaëtan Rivet 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-14 10:48 ` Gaëtan Rivet @ 2017-12-14 13:07 ` Matan Azrad 2017-12-14 13:27 ` Gaëtan Rivet 0 siblings, 1 reply; 14+ messages in thread From: Matan Azrad @ 2017-12-14 13:07 UTC (permalink / raw) To: Gaëtan Rivet; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable Hi Gaetan > -----Original Message----- > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > Sent: Thursday, December 14, 2017 12:49 PM > To: Matan Azrad <matan@mellanox.com> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling > > On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote: > > Hi Gaetan > > > > <snip> > > > > > > > > > If you add this check in the iterator itself, you would skip > > > > removed devices before attempting operating upon them, right? > > > > > > > > Then it should probably help with your issue, unless you tested it > > > > and verified that it didnt? > > > > > > > > Something like this: > > > > > > > > ---8<--- > > > > > > > > diff --git a/drivers/net/failsafe/failsafe_private.h > > > > b/drivers/net/failsafe/failsafe_private.h > > > > index d81cc3ca6..62ddc0689 100644 > > > > --- a/drivers/net/failsafe/failsafe_private.h > > > > +++ b/drivers/net/failsafe/failsafe_private.h > > > > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev, > > > > subs = PRIV(dev)->subs; > > > > tail = PRIV(dev)->subs_tail; > > > > while (sid < tail) { > > > > + if (min_state > DEV_PROBED && > > > > + fs_is_removed(&sub[sid])) > > > > + goto next; > > > > if (subs[sid].state >= min_state) > > > > break; > > > > +next: > > > > sid++; > > > > } > > > > *sid_out = sid; > > > > > > > > --->8--- > > > > > > > > Only issue being that it is completely racy, but as this MT-unsafe > > > > property is inescapable we might as well ignore it and go for KISS. > > > > > > > > If that's enough, I would prefer instead of having this additional > > > > check added to all rte_eth operations. > > > > > > > > > > Ok, actually you were right here to do it this way. The "is_removed" > > > check needs to happen after the operation attempt to effectively > > > mitigate the possible race. Checking before attempting the call will > > > be much less effective. > > > > > > That being said, would it be cleaner to have eth_dev ops return > > > -ENODEV directly, and check against it within fail-safe? > > > > > > > I think that according to "is_removed" semantic we must return a Boolean > value (Each value different from '0' means that the device is removed) like > other functions in c library (for example isspace()). > > > > Sure, I wasn't discussing the interface proposed by > rte_eth_dev_is_removed(). > > What I meant was to ask whether checking rte_eth_dev_is_removed() > would be more interesting in the ethdev layer, making the eth_dev_ops > return -ENODEV regardless of the previous error if this check is supported by > the driver and signal that the port is removed. > > I think this information could be interesting to other systems, not just fail- > safe. > Ok. Got you now. Interesting approach - plan: 1. update fs_link_update to use rte_eth* functions. 2. maybe -EIO is preferred because -ENODEV is used for no port error? 3. update all relevant rte_eth* to use "is_removed" in error flows(1 patch for flow APIs and 1 for the others). 4. Change fs checks in error flows to check rte_eth* return values. 5. Remove CC stable from commit massage. What do you think? > -- > Gaëtan Rivet > 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-14 13:07 ` Matan Azrad @ 2017-12-14 13:27 ` Gaëtan Rivet 2017-12-14 14:43 ` Matan Azrad 0 siblings, 1 reply; 14+ messages in thread From: Gaëtan Rivet @ 2017-12-14 13:27 UTC (permalink / raw) To: Matan Azrad; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable On Thu, Dec 14, 2017 at 01:07:31PM +0000, Matan Azrad wrote: > Hi Gaetan > > > -----Original Message----- > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > > Sent: Thursday, December 14, 2017 12:49 PM > > To: Matan Azrad <matan@mellanox.com> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling > > > > On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote: > > > Hi Gaetan > > > > > <snip> > > > > Ok, actually you were right here to do it this way. The "is_removed" > > > > check needs to happen after the operation attempt to effectively > > > > mitigate the possible race. Checking before attempting the call will > > > > be much less effective. > > > > > > > > That being said, would it be cleaner to have eth_dev ops return > > > > -ENODEV directly, and check against it within fail-safe? > > > > > > > > > > I think that according to "is_removed" semantic we must return a Boolean > > value (Each value different from '0' means that the device is removed) like > > other functions in c library (for example isspace()). > > > > > > > Sure, I wasn't discussing the interface proposed by > > rte_eth_dev_is_removed(). > > > > What I meant was to ask whether checking rte_eth_dev_is_removed() > > would be more interesting in the ethdev layer, making the eth_dev_ops > > return -ENODEV regardless of the previous error if this check is supported by > > the driver and signal that the port is removed. > > > > I think this information could be interesting to other systems, not just fail- > > safe. > > > > Ok. Got you now. > Interesting approach - plan: > 1. update fs_link_update to use rte_eth* functions. I'm surprised it doesn't already. Either the rte_eth* function was introduced after the failsafe, or be wary of potential issues. I don't see a problem right now though. > 2. maybe -EIO is preferred because -ENODEV is used for no port error? Good point, didn't think about it. Prepare yourself maybe to some arguments about the most relevant error code. -EIO seems fine to me, but maybe use a wrapper for all this. Something like: ---8<--- static int eth_error(pid, int original_ret) { int ret; if (original_ret == 0) return original_ret; ret = rte_eth_is_removed(pid); if (ret == 0 || ret == -ENOTSUP) return original_ret; return -EIO; } int rte_eth_ops_xyz(pid) { int ret; ret = eth_dev(pid).ops_xyz(); return eth_error(pid, ret); } --->8--- This way you would be able to change it easily and the logic would be insulated. > 3. update all relevant rte_eth* to use "is_removed" in error flows(1 patch for flow APIs and 1 for the others). > 4. Change fs checks in error flows to check rte_eth* return values. > 5. Remove CC stable from commit massage. > > What do you think? > Agreed otherwise. Thanks, > > -- > > Gaëtan Rivet > > 6WIND -- Gaëtan Rivet 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling 2017-12-14 13:27 ` Gaëtan Rivet @ 2017-12-14 14:43 ` Matan Azrad 0 siblings, 0 replies; 14+ messages in thread From: Matan Azrad @ 2017-12-14 14:43 UTC (permalink / raw) To: Gaëtan Rivet; +Cc: Adrien Mazarguil, Thomas Monjalon, dev, stable Hi > -----Original Message----- > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > Sent: Thursday, December 14, 2017 3:27 PM > To: Matan Azrad <matan@mellanox.com> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling > > On Thu, Dec 14, 2017 at 01:07:31PM +0000, Matan Azrad wrote: > > Hi Gaetan > > > > > -----Original Message----- > > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com] > > > Sent: Thursday, December 14, 2017 12:49 PM > > > To: Matan Azrad <matan@mellanox.com> > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon > > > <thomas@monjalon.net>; dev@dpdk.org; stable@dpdk.org > > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device > > > handling > > > > > > On Thu, Dec 14, 2017 at 10:40:22AM +0000, Matan Azrad wrote: > > > > Hi Gaetan > > > > > > > > > <snip> > > > > > > Ok, actually you were right here to do it this way. The "is_removed" > > > > > check needs to happen after the operation attempt to effectively > > > > > mitigate the possible race. Checking before attempting the call > > > > > will be much less effective. > > > > > > > > > > That being said, would it be cleaner to have eth_dev ops return > > > > > -ENODEV directly, and check against it within fail-safe? > > > > > > > > > > > > > I think that according to "is_removed" semantic we must return a > > > > Boolean > > > value (Each value different from '0' means that the device is > > > removed) like other functions in c library (for example isspace()). > > > > > > > > > > Sure, I wasn't discussing the interface proposed by > > > rte_eth_dev_is_removed(). > > > > > > What I meant was to ask whether checking rte_eth_dev_is_removed() > > > would be more interesting in the ethdev layer, making the > > > eth_dev_ops return -ENODEV regardless of the previous error if this > > > check is supported by the driver and signal that the port is removed. > > > > > > I think this information could be interesting to other systems, not > > > just fail- safe. > > > > > > > Ok. Got you now. > > Interesting approach - plan: > > 1. update fs_link_update to use rte_eth* functions. > > I'm surprised it doesn't already. > Either the rte_eth* function was introduced after the failsafe, or be wary of > potential issues. I don't see a problem right now though. > > > 2. maybe -EIO is preferred because -ENODEV is used for no port > error? > > Good point, didn't think about it. > Prepare yourself maybe to some arguments about the most relevant error > code. -EIO seems fine to me, but maybe use a wrapper for all this. > > Something like: > > ---8<--- > > static int > eth_error(pid, int original_ret) > { > int ret; > > if (original_ret == 0) > return original_ret; > ret = rte_eth_is_removed(pid); > if (ret == 0 || ret == -ENOTSUP) > return original_ret; > return -EIO; > } > > int > rte_eth_ops_xyz(pid) > { > int ret; > ret = eth_dev(pid).ops_xyz(); > return eth_error(pid, ret); > } > > --->8--- > > This way you would be able to change it easily and the logic would be > insulated. > Nice. > > 3. update all relevant rte_eth* to use "is_removed" in error flows(1 > patch for flow APIs and 1 for the others). > > 4. Change fs checks in error flows to check rte_eth* return values. > > 5. Remove CC stable from commit massage. > > > > What do you think? > > > > Agreed otherwise. > Will create V3, thanks! > Thanks, > > > > -- > > > Gaëtan Rivet > > > 6WIND > > -- > Gaëtan Rivet > 6WIND ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2017-12-14 14:43 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <1509637324-13525-1-git-send-email-matan@mellanox.com> 2017-11-02 15:42 ` [dpdk-stable] [PATCH 1/3] net/failsafe: fix removal handling lack Matan Azrad 2017-11-06 8:19 ` Gaëtan Rivet [not found] ` <1513175370-16583-1-git-send-email-matan@mellanox.com> 2017-12-13 14:29 ` [dpdk-stable] [PATCH v2 4/4] net/failsafe: fix removed device handling Matan Azrad 2017-12-13 15:16 ` Gaëtan Rivet 2017-12-13 15:48 ` Matan Azrad 2017-12-13 16:09 ` Gaëtan Rivet 2017-12-13 17:09 ` Thomas Monjalon 2017-12-14 10:40 ` Matan Azrad 2017-12-13 21:55 ` Gaëtan Rivet 2017-12-14 10:40 ` Matan Azrad 2017-12-14 10:48 ` Gaëtan Rivet 2017-12-14 13:07 ` Matan Azrad 2017-12-14 13:27 ` Gaëtan Rivet 2017-12-14 14:43 ` Matan Azrad
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).