DPDK patches and discussions
* [dpdk-dev] [RFC] mlx5: fix error unwind in device start
@ 2018-08-01 21:59 Stephen Hemminger
  2018-08-13  7:52 ` Shahaf Shuler
  0 siblings, 1 reply; 4+ messages in thread
From: Stephen Hemminger @ 2018-08-01 21:59 UTC (permalink / raw)
  To: shahafs, yskoh; +Cc: dev, Stephen Hemminger, Stephen Hemminger

The error handling in start of the mlx5 driver is buggy.
For example, if setting up the flows fails, the device driver
will then get stuck in mlx5_flow_rxq_flags_clear waiting
for something that will never happen.

The problem is that the code jumps to a common error label
and does unwind for portions of the driver which have not
been set up.

This suggested patch breaks it into different labels with
each failure path only unwinding what was done.

Also, the ethdev driver should not be manipulating the
dev_started flag directly. That is handled by the common
ethdev layer.

The patch works for the success case, but further testing
is needed to actually exercise all the error paths.
This is left as an exercise for the maintainers.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/mlx5/mlx5_trigger.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index e2a9bb703261..79a7b233986a 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -171,42 +171,42 @@ mlx5_dev_start(struct rte_eth_dev *dev)
 	if (ret) {
 		DRV_LOG(ERR, "port %u Rx queue allocation failed: %s",
 			dev->data->port_id, strerror(rte_errno));
-		mlx5_txq_stop(dev);
-		return -rte_errno;
+		goto error_txq_stop;
 	}
-	dev->data->dev_started = 1;
+
 	ret = mlx5_rx_intr_vec_enable(dev);
 	if (ret) {
 		DRV_LOG(ERR, "port %u Rx interrupt vector creation failed",
 			dev->data->port_id);
-		goto error;
+		goto error_rxq_stop;
 	}
 	mlx5_xstats_init(dev);
 	ret = mlx5_traffic_enable(dev);
 	if (ret) {
 		DRV_LOG(DEBUG, "port %u failed to set defaults flows",
 			dev->data->port_id);
-		goto error;
+		goto error_intr_vec_disable;
 	}
 	ret = mlx5_flow_start(dev, &priv->flows);
 	if (ret) {
 		DRV_LOG(DEBUG, "port %u failed to set flows",
 			dev->data->port_id);
-		goto error;
+		goto error_traffic_disable;
 	}
+
 	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
 	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
 	mlx5_dev_interrupt_handler_install(dev);
 	return 0;
-error:
-	ret = rte_errno; /* Save rte_errno before cleanup. */
-	/* Rollback. */
-	dev->data->dev_started = 0;
-	mlx5_flow_stop(dev, &priv->flows);
+
+error_traffic_disable:
 	mlx5_traffic_disable(dev);
-	mlx5_txq_stop(dev);
+error_intr_vec_disable:
+	mlx5_rx_intr_vec_disable(dev);
+error_rxq_stop:
 	mlx5_rxq_stop(dev);
-	rte_errno = ret; /* Restore rte_errno. */
+error_txq_stop:
+	mlx5_txq_stop(dev);
 	return -rte_errno;
 }
 
-- 
2.18.0
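
The staged-unwind idea in the patch above can be illustrated outside the driver. The following is a minimal sketch, not mlx5 code: the stage names, `demo_start()`, and the fault-injection parameter are hypothetical, and plain `errno` stands in for `rte_errno`. It keeps the save-before-cleanup step from the code the patch removes, on the assumption (not established in this thread) that cleanup functions may overwrite the error value.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical setup stages standing in for the driver's real ones. */
enum stage { TXQ, RXQ, INTR_VEC, TRAFFIC, FLOWS, STAGE_COUNT };

static int fail_at;              /* stage forced to fail, -1 for none */
static int is_up[STAGE_COUNT];   /* stages that are currently set up */
static int bad_teardowns;        /* teardowns of stages never set up */

static int stage_start(enum stage s)
{
	assert(s < STAGE_COUNT);
	if ((int)s == fail_at) {
		errno = EIO;     /* failure is reported through errno */
		return -1;
	}
	is_up[s] = 1;
	return 0;
}

static void stage_stop(enum stage s)
{
	if (!is_up[s])
		bad_teardowns++;
	is_up[s] = 0;
	errno = 0;               /* assumption: cleanup may clobber errno */
}

/* Start all stages; on failure, unwind only the completed ones. */
int demo_start(int fail_stage)
{
	int err;
	int i;

	fail_at = fail_stage;
	bad_teardowns = 0;
	for (i = 0; i < STAGE_COUNT; i++)
		is_up[i] = 0;

	if (stage_start(TXQ))
		return -errno;   /* nothing to unwind yet */
	if (stage_start(RXQ)) {
		err = errno;     /* save before any cleanup runs */
		goto error_txq_stop;
	}
	if (stage_start(INTR_VEC)) {
		err = errno;
		goto error_rxq_stop;
	}
	if (stage_start(TRAFFIC)) {
		err = errno;
		goto error_intr_vec_disable;
	}
	if (stage_start(FLOWS)) {
		err = errno;
		goto error_traffic_disable;
	}
	return 0;

	/* Labels fall through, so each entry point undoes only earlier stages. */
error_traffic_disable:
	stage_stop(TRAFFIC);
error_intr_vec_disable:
	stage_stop(INTR_VEC);
error_rxq_stop:
	stage_stop(RXQ);
error_txq_stop:
	stage_stop(TXQ);
	return -err;
}

/* Number of stages still set up after the last demo_start() call. */
int demo_stages_up(void)
{
	int i, n = 0;

	for (i = 0; i < STAGE_COUNT; i++)
		n += is_up[i];
	return n;
}

/* Number of teardowns that hit a stage which was never set up. */
int demo_bad_teardowns(void)
{
	return bad_teardowns;
}
```

Calling `demo_start()` with each stage in turn as the failing one is one way to exercise every error path: the return value should stay `-EIO` and `demo_bad_teardowns()` should stay at zero.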


* Re: [dpdk-dev] [RFC] mlx5: fix error unwind in device start
  2018-08-01 21:59 [dpdk-dev] [RFC] mlx5: fix error unwind in device start Stephen Hemminger
@ 2018-08-13  7:52 ` Shahaf Shuler
  2018-08-13 15:20   ` Stephen Hemminger
  0 siblings, 1 reply; 4+ messages in thread
From: Shahaf Shuler @ 2018-08-13  7:52 UTC (permalink / raw)
  To: Stephen Hemminger, Yongseok Koh; +Cc: dev, Stephen Hemminger

Hi Stephen,

Thursday, August 2, 2018 1:00 AM, Stephen Hemminger:
> Subject: [RFC] mlx5: fix error unwind in device start
> 
> The error handling in start of the mlx5 driver is buggy.
> For example, if setting up the flows fails, the device driver will then get stuck
> in mlx5_flow_rxq_flags_clear waiting for something that will never happen.

Looking at the code, I cannot see why mlx5_flow_rxq_flags_clear would get stuck, nor what it would be waiting for.
The function has a few finite loops, which do not depend on anything that happened earlier in the device start.

Moreover, I tried to force either mlx5_traffic_enable or mlx5_flow_start to fail; the result was that the port failed to start, but nothing got stuck.

Can you provide more details about the issue you saw there?

> 
> The problem is that the code jumps to a common error label and does
> unwind for portions of the driver which have not been set up.
> 
> This suggested patch breaks it into different labels with each failure path only
> unwinding what was done.
> 
> Also, the ethdev driver should not be manipulating the dev_started flag
> directly. That is handled by the common ethdev layer.
> 

I agree that this part of the code could be better written, but my first question is whether there is an actual bug that this change would fix.



* Re: [dpdk-dev] [RFC] mlx5: fix error unwind in device start
  2018-08-13  7:52 ` Shahaf Shuler
@ 2018-08-13 15:20   ` Stephen Hemminger
  2018-08-14  7:35     ` Shahaf Shuler
  0 siblings, 1 reply; 4+ messages in thread
From: Stephen Hemminger @ 2018-08-13 15:20 UTC (permalink / raw)
  To: Shahaf Shuler; +Cc: Yongseok Koh, dev, Stephen Hemminger

On Mon, 13 Aug 2018 07:52:47 +0000
Shahaf Shuler <shahafs@mellanox.com> wrote:

> Hi Stephen,
> 
> Thursday, August 2, 2018 1:00 AM, Stephen Hemminger:
> > Subject: [RFC] mlx5: fix error unwind in device start
> > 
> > The error handling in start of the mlx5 driver is buggy.
> > For example, if setting up the flows fails, the device driver will then get stuck
> > in mlx5_flow_rxq_flags_clear waiting for something that will never happen.
> 
> Looking at the code, I cannot see why mlx5_flow_rxq_flags_clear would get stuck, nor what it would be waiting for.
> The function has a few finite loops, which do not depend on anything that happened earlier in the device start.
> 
> Moreover, I tried to force either mlx5_traffic_enable or mlx5_flow_start to fail; the result was that the port failed to start, but nothing got stuck.
> 
> Can you provide more details about the issue you saw there?
> 
> > 
> > The problem is that the code jumps to a common error label and does
> > unwind for portions of the driver which have not been set up.
> > 
> > This suggested patch breaks it into different labels with each failure path only
> > unwinding what was done.
> > 
> > Also, the ethdev driver should not be manipulating the dev_started flag
> > directly. That is handled by the common ethdev layer.
> > 
> 
> I agree that this part of the code could be better written, but my first question is whether there is an actual bug that this change would fix.

The issue showed up in an early version of netvsc VF support, which forgot
to call dev_configure on the mlx5 device. In that case mlx5 would get confused and get stuck.


* Re: [dpdk-dev] [RFC] mlx5: fix error unwind in device start
  2018-08-13 15:20   ` Stephen Hemminger
@ 2018-08-14  7:35     ` Shahaf Shuler
  0 siblings, 0 replies; 4+ messages in thread
From: Shahaf Shuler @ 2018-08-14  7:35 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Yongseok Koh, dev, Stephen Hemminger

Monday, August 13, 2018 6:21 PM, Stephen Hemminger:
> Subject: Re: [RFC] mlx5: fix error unwind in device start
> 
> The issue showed up in an early version of netvsc VF support, which forgot
> to call dev_configure on the mlx5 device. In that case mlx5 would get
> confused and get stuck.

I see. Missing the configuration stage is quite critical regardless. I am surprised the ethdev layer didn't block the device start;
maybe that is the right fix for this specific case until the mlx5 device start logic is reworked.
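
As an illustration of the check suggested here, a generic layer could refuse to start a device that was never configured, and could own the started flag itself so that drivers never touch it. This is a minimal sketch under stated assumptions: `struct demo_dev`, its fields, and the `demo_*` functions are hypothetical and do not reproduce the actual ethdev structures or the behaviour of `rte_eth_dev_start`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified device model; not the real ethdev layout. */
struct demo_dev {
	int configured;   /* set by the generic configure call */
	int started;      /* owned by the generic layer, not the driver */
	int (*drv_start)(struct demo_dev *dev);   /* driver callback */
};

/* Generic configure: records that the configuration stage ran. */
int demo_dev_configure(struct demo_dev *dev)
{
	assert(dev != NULL);
	dev->configured = 1;
	return 0;
}

/* Generic start: blocks unconfigured devices before calling the driver. */
int demo_dev_start(struct demo_dev *dev)
{
	int ret;

	assert(dev != NULL && dev->drv_start != NULL);
	if (!dev->configured)
		return -EINVAL;   /* driver callback is never reached */
	if (dev->started)
		return 0;         /* already running, nothing to do */

	ret = dev->drv_start(dev);
	if (ret == 0)
		dev->started = 1; /* set only after the driver succeeded */
	return ret;
}

/* Example driver callbacks: one succeeds, one fails. */
int demo_drv_start_ok(struct demo_dev *dev)
{
	(void)dev;
	return 0;
}

int demo_drv_start_fail(struct demo_dev *dev)
{
	(void)dev;
	return -EIO;
}
```

With this split, a driver start routine that fails has no flag to roll back, and a caller that skips the configuration stage gets an error instead of reaching driver code in an unexpected state.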

