Date: Mon, 13 Aug 2018 08:20:49 -0700
From: Stephen Hemminger
To: Shahaf Shuler
Cc: Yongseok Koh, "dev@dpdk.org", Stephen Hemminger
Message-ID: <20180813082049.1758d647@xeon-e3>
References: <20180801215952.25326-1-stephen@networkplumber.org>
Subject: Re: [dpdk-dev] [RFC] mlx5: fix error unwind in device start

On Mon, 13 Aug 2018 07:52:47 +0000
Shahaf Shuler wrote:

> Hi Stephen,
>
> Thursday, August 2, 2018 1:00 AM, Stephen Hemminger:
> > Subject: [RFC] mlx5: fix error unwind in device start
> >
> > The error handling in start of the mlx5 driver is buggy.
> > For example, if setting up the flows fails the device driver will then get stuck
> > in mlx5_flow_rxq_flags_clear waiting for something that will never happen.
>
> Looking at the code, I cannot understand why mlx5_flow_rxq_flags_clear gets stuck, nor what it waits for.
> The function has a few finite loops which do not depend on anything that happened before it at device start.
>
> Moreover, I tried to force either mlx5_traffic_enable or mlx5_flow_start to fail; the result was that the port failed to start, but it did not get stuck.
>
> Can you provide more details about the issue you saw there?
>
> >
> > The problem is that the code jumps to a common error label and does
> > unwind for portions of the driver which have not been set up.
> >
> > This suggested patch breaks it into different labels with each failure path only
> > unwinding what was done.
> >
> > Also, the ethdev driver should not be manipulating the dev_started flag
> > directly. That is handled by the common ethdev layer.
> >
>
> I agree that maybe this part of the code can be better written, but my question from before is whether there is an actual bug that this change solves?
>
> > The patch works for the success case, but further testing is needed to
> > actually exercise all the error paths.
> > This is left as an exercise for the maintainers.
> >
> > Signed-off-by: Stephen Hemminger
> > ---
> >  drivers/net/mlx5/mlx5_trigger.c | 26 +++++++++++++-------------
> >  1 file changed, 13 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
> > index e2a9bb703261..79a7b233986a 100644
> > --- a/drivers/net/mlx5/mlx5_trigger.c
> > +++ b/drivers/net/mlx5/mlx5_trigger.c
> > @@ -171,42 +171,42 @@ mlx5_dev_start(struct rte_eth_dev *dev)
> >  	if (ret) {
> >  		DRV_LOG(ERR, "port %u Rx queue allocation failed: %s",
> >  			dev->data->port_id, strerror(rte_errno));
> > -		mlx5_txq_stop(dev);
> > -		return -rte_errno;
> > +		goto error_txq_stop;
> >  	}
> > -	dev->data->dev_started = 1;
> > +
> >  	ret = mlx5_rx_intr_vec_enable(dev);
> >  	if (ret) {
> >  		DRV_LOG(ERR, "port %u Rx interrupt vector creation failed",
> >  			dev->data->port_id);
> > -		goto error;
> > +		goto error_rxq_stop;
> >  	}
> >  	mlx5_xstats_init(dev);
> >  	ret = mlx5_traffic_enable(dev);
> >  	if (ret) {
> >  		DRV_LOG(DEBUG, "port %u failed to set defaults flows",
> >  			dev->data->port_id);
> > -		goto error;
> > +		goto error_intr_vec_disable;
> >  	}
> >  	ret = mlx5_flow_start(dev, &priv->flows);
> >  	if (ret) {
> >  		DRV_LOG(DEBUG, "port %u failed to set flows",
> >  			dev->data->port_id);
> > -		goto error;
> > +		goto error_traffic_disable;
> >  	}
> > +
> >  	dev->tx_pkt_burst = mlx5_select_tx_function(dev);
> >  	dev->rx_pkt_burst = mlx5_select_rx_function(dev);
> >  	mlx5_dev_interrupt_handler_install(dev);
> >  	return 0;
> > -error:
> > -	ret = rte_errno; /* Save rte_errno before cleanup. */
> > -	/* Rollback. */
> > -	dev->data->dev_started = 0;
> > -	mlx5_flow_stop(dev, &priv->flows);
> > +
> > +error_traffic_disable:
> >  	mlx5_traffic_disable(dev);
> > -	mlx5_txq_stop(dev);
> > +error_intr_vec_disable:
> > +	mlx5_rx_intr_vec_disable(dev);
> > +error_rxq_stop:
> >  	mlx5_rxq_stop(dev);
> > -	rte_errno = ret; /* Restore rte_errno. */
> > +error_txq_stop:
> > +	mlx5_txq_stop(dev);
> >  	return -rte_errno;
> >  }
> >
> > --
> > 2.18.0
>

The issue showed up in an early version of the netvsc VF support, which forgot to call dev_configure on the mlx5 device. In that case mlx5 would get confused and get stuck.
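
For anyone following the thread, here is a rough standalone sketch of the staged-unwind idea behind the RFC: one label per completed stage, so a failure only tears down what was actually set up. It is an illustration only and not mlx5 code. The stage names, struct demo_dev, the fail_at knob and every function below are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stages, loosely mirroring the order in mlx5_dev_start(). */
enum stage {
	STAGE_TXQ, STAGE_RXQ, STAGE_INTR, STAGE_TRAFFIC, STAGE_FLOW,
	STAGE_COUNT
};

struct demo_dev {
	bool up[STAGE_COUNT];	/* which stages are currently set up */
	int fail_at;		/* stage forced to fail, or -1 for none */
};

static int stage_start(struct demo_dev *dev, enum stage s)
{
	if (dev->fail_at == (int)s)
		return -1;	/* simulated setup failure */
	dev->up[s] = true;
	return 0;
}

static void stage_stop(struct demo_dev *dev, enum stage s)
{
	/* Tearing down a stage that never started is the bug to avoid. */
	assert(dev->up[s]);
	dev->up[s] = false;
}

/* Number of stages still set up; 0 means a clean unwind. */
int demo_stages_up(const struct demo_dev *dev)
{
	int n = 0;

	for (int s = 0; s < STAGE_COUNT; s++)
		n += dev->up[s] ? 1 : 0;
	return n;
}

int demo_dev_start(struct demo_dev *dev)
{
	int ret;

	ret = stage_start(dev, STAGE_TXQ);
	if (ret)
		return ret;	/* nothing to unwind yet */
	ret = stage_start(dev, STAGE_RXQ);
	if (ret)
		goto error_txq_stop;
	ret = stage_start(dev, STAGE_INTR);
	if (ret)
		goto error_rxq_stop;
	ret = stage_start(dev, STAGE_TRAFFIC);
	if (ret)
		goto error_intr_disable;
	ret = stage_start(dev, STAGE_FLOW);
	if (ret)
		goto error_traffic_disable;
	return 0;

	/* Labels fall through in reverse setup order. */
error_traffic_disable:
	stage_stop(dev, STAGE_TRAFFIC);
error_intr_disable:
	stage_stop(dev, STAGE_INTR);
error_rxq_stop:
	stage_stop(dev, STAGE_RXQ);
error_txq_stop:
	stage_stop(dev, STAGE_TXQ);
	return ret;
}
```

Setting fail_at to any stage and calling demo_dev_start() should return -1 with demo_stages_up() back at 0. With a single shared error label, the assert in stage_stop() would fire for every stage that had not been reached yet.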