DPDK patches and discussions
From: Stephen Hemminger <stephen@networkplumber.org>
To: Shahaf Shuler <shahafs@mellanox.com>
Cc: Yongseok Koh <yskoh@mellanox.com>, "dev@dpdk.org" <dev@dpdk.org>,
	Stephen Hemminger <sthemmin@microsoft.com>
Subject: Re: [dpdk-dev] [RFC] mlx5: fix error unwind in device start
Date: Mon, 13 Aug 2018 08:20:49 -0700
Message-ID: <20180813082049.1758d647@xeon-e3>
In-Reply-To: <DB7PR05MB44267D3ABFE7544237AB621EC3390@DB7PR05MB4426.eurprd05.prod.outlook.com>

On Mon, 13 Aug 2018 07:52:47 +0000
Shahaf Shuler <shahafs@mellanox.com> wrote:

> Hi Stephen,
> 
> Thursday, August 2, 2018 1:00 AM, Stephen Hemminger:
> > Subject: [RFC] mlx5: fix error unwind in device start
> > 
> > The error handling in start of the mlx5 driver is buggy.
> > For example, if setting up the flows fails the device driver will then get stuck
> > in mlx5_flow_rxq_flags_clear waiting for something that will never happen.  
> 
> Looking at the code, I cannot understand why mlx5_flow_rxq_flags_clear gets stuck, nor what it waits for.
> The function has a few finite loops which do not depend on anything that happened before it at device start.
> 
> Moreover, I tried to force either mlx5_traffic_enable or mlx5_flow_start to fail; the result was that the port failed to start, but nothing got stuck.
> 
> Can you provide more details about the issue you saw there?  
> 
> > 
> > The problem is that the code jumps to a common error label and does
> > unwind for portions of the driver which have not been set up.
> > 
> > This suggested patch breaks it into different labels with each failure path only
> > unwinding what was done.
> > 
> > Also, the ethdev driver should not be manipulating the dev_started flag
> > directly. That is handled by the common ethdev layer.
> >   
> 
> I agree that this part of the code could be better written, but my question first is whether there is an actual bug that this change would solve.
> 
> > The patch works for the success case, but further testing is needed to
> > actually exercise all the error paths.
> > This is left as an exercise for the maintainers.
> > 
> > Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> > ---
> >  drivers/net/mlx5/mlx5_trigger.c | 26 +++++++++++++-------------
> >  1 file changed, 13 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
> > index e2a9bb703261..79a7b233986a 100644
> > --- a/drivers/net/mlx5/mlx5_trigger.c
> > +++ b/drivers/net/mlx5/mlx5_trigger.c
> > @@ -171,42 +171,42 @@ mlx5_dev_start(struct rte_eth_dev *dev)
> >  	if (ret) {
> >  		DRV_LOG(ERR, "port %u Rx queue allocation failed: %s",
> >  			dev->data->port_id, strerror(rte_errno));
> > -		mlx5_txq_stop(dev);
> > -		return -rte_errno;
> > +		goto error_txq_stop;
> >  	}
> > -	dev->data->dev_started = 1;
> > +
> >  	ret = mlx5_rx_intr_vec_enable(dev);
> >  	if (ret) {
> >  		DRV_LOG(ERR, "port %u Rx interrupt vector creation failed",
> >  			dev->data->port_id);
> > -		goto error;
> > +		goto error_rxq_stop;
> >  	}
> >  	mlx5_xstats_init(dev);
> >  	ret = mlx5_traffic_enable(dev);
> >  	if (ret) {
> >  		DRV_LOG(DEBUG, "port %u failed to set defaults flows",
> >  			dev->data->port_id);
> > -		goto error;
> > +		goto error_intr_vec_disable;
> >  	}
> >  	ret = mlx5_flow_start(dev, &priv->flows);
> >  	if (ret) {
> >  		DRV_LOG(DEBUG, "port %u failed to set flows",
> >  			dev->data->port_id);
> > -		goto error;
> > +		goto error_traffic_disable;
> >  	}
> > +
> >  	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
> >  	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
> >  	mlx5_dev_interrupt_handler_install(dev);
> >  	return 0;
> > -error:
> > -	ret = rte_errno; /* Save rte_errno before cleanup. */
> > -	/* Rollback. */
> > -	dev->data->dev_started = 0;
> > -	mlx5_flow_stop(dev, &priv->flows);
> > +
> > +error_traffic_disable:
> >  	mlx5_traffic_disable(dev);
> > -	mlx5_txq_stop(dev);
> > +error_intr_vec_disable:
> > +	mlx5_rx_intr_vec_disable(dev);
> > +error_rxq_stop:
> >  	mlx5_rxq_stop(dev);
> > -	rte_errno = ret; /* Restore rte_errno. */
> > +error_txq_stop:
> > +	mlx5_txq_stop(dev);
> >  	return -rte_errno;
> >  }
> > 
> > --
> > 2.18.0  
> 

The issue showed up with an early version of the netvsc VF support, which forgot
to call dev_configure on the mlx5 device. In that case mlx5 would get confused and get stuck.
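
For reference, the staged unwind that the quoted RFC proposes can be sketched
in isolation. This is a minimal, hypothetical model and not mlx5 code:
stage_start(), stage_stop(), the stage names and the fail_mask test knob are
all made up for illustration. Each failure jumps to a label that undoes only
the stages that already succeeded, and the error code is kept in a local
variable so that cleanup cannot overwrite it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stages, loosely following the order used in the patch. */
enum stage {
	TXQ     = 1 << 0,
	RXQ     = 1 << 1,
	INTR    = 1 << 2,
	TRAFFIC = 1 << 3,
	FLOWS   = 1 << 4,
};

static unsigned int started;   /* bitmask of stages currently set up */
static unsigned int fail_mask; /* stages forced to fail (test knob) */

/* Set up one stage; returns 0 on success or a negative errno value. */
static int stage_start(enum stage s)
{
	if (fail_mask & (unsigned int)s)
		return -EIO;
	started |= (unsigned int)s;
	return 0;
}

/* Tear down one stage; undoing a stage that never started is a bug. */
static void stage_stop(enum stage s)
{
	assert(started & (unsigned int)s);
	started &= ~(unsigned int)s;
}

/* Start all stages in order; on failure undo only what was done. */
static int demo_dev_start(void)
{
	int ret;

	ret = stage_start(TXQ);
	if (ret)
		return ret;
	ret = stage_start(RXQ);
	if (ret)
		goto error_txq_stop;
	ret = stage_start(INTR);
	if (ret)
		goto error_rxq_stop;
	ret = stage_start(TRAFFIC);
	if (ret)
		goto error_intr_disable;
	ret = stage_start(FLOWS);
	if (ret)
		goto error_traffic_disable;
	return 0;

	/* Labels are in reverse setup order and fall through downwards. */
error_traffic_disable:
	stage_stop(TRAFFIC);
error_intr_disable:
	stage_stop(INTR);
error_rxq_stop:
	stage_stop(RXQ);
error_txq_stop:
	stage_stop(TXQ);
	return ret; /* saved before cleanup ran */
}
```

Setting fail_mask to a single stage (for example TRAFFIC) before calling
demo_dev_start() exercises one error path; afterwards started should be 0,
and the assert in stage_stop() would trip if a label undid a stage that was
never set up.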
