From: Igor Ryzhov <iryzhov@nfware.com>
Date: Tue, 28 Jul 2020 11:56:26 +0300
Message-ID: <CAF+s_FxF_s3ETF2Ewt-_nPDEXyg0zBroNyL-Rz0W8mX883bsZQ@mail.gmail.com>
In-Reply-To: <20200727105255.74981391@hermes.lan>
References: <20191222175551.17684-1-stephen@networkplumber.org> <ffa192d1-99d7-a636-c1bf-7f64dfde91b4@intel.com> <3101970.h16uAIiOU7@xps> <20200505171454.00274f10@hermes.lan> <da21a634-2a1a-7371-4233-97ee80da8248@intel.com> <20200727105255.74981391@hermes.lan>
To: Stephen Hemminger <stephen@networkplumber.org>
Cc: Ferruh Yigit <ferruh.yigit@intel.com>, Thomas Monjalon <thomas@monjalon.net>, dev <dev@dpdk.org>, dpdk stable <stable@dpdk.org>
Subject: Re: [dpdk-dev] [PATCH] kni: fix kernel deadlock when using mlx devices
List-Id: DPDK patches and discussions <dev.dpdk.org>

Hi,

I didn't see this patch previously, but we came up with the same idea
internally and also hit a hang during application shutdown.
We didn't dig deep, but it occurred in the kni_release function.

Igor

On Mon, Jul 27, 2020 at 8:53 PM Stephen Hemminger <stephen@networkplumber.org> wrote:

> On Mon, 27 Jul 2020 18:33:08 +0100
> Ferruh Yigit <ferruh.yigit@intel.com> wrote:
>
> > On 5/6/2020 1:14 AM, Stephen Hemminger wrote:
> > > On Wed, 18 Mar 2020 16:17:57 +0100
> > > Thomas Monjalon <thomas@monjalon.net> wrote:
> > >
> > >> 17/01/2020 17:43, Ferruh Yigit:
> > >>> On 12/22/2019 5:55 PM, Stephen Hemminger wrote:
> > >>>> This fixes a deadlock when using KNI with bifurcated drivers.
> > >>>> Bringing kni device up always times out when using Mellanox
> > >>>> devices.
> > >>>>
> > >>>> The kernel KNI driver sends message to userspace to complete
> > >>>> the request. For the case of bifurcated driver, this may involve
> > >>>> an additional request to kernel to change state. This request
> > >>>> would deadlock because KNI was holding the RTNL mutex.
> > >>>>
> > >>>> This was a bad design which goes back to the original code.
> > >>>> A workaround is for KNI driver to drop RTNL while waiting.
> > >>>> To prevent the device from disappearing while the operation
> > >>>> is in progress, it needs to hold reference to network device
> > >>>> while waiting.
> > >>>>
> > >>>> As an added benefit, an useless error check can also be removed.
> > >>>>
> > >>>> Fixes: 3fc5ca2f6352 ("kni: initial import")
> > >>>> Cc: stable@dpdk.org
> > >>>> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > >>>> ---
> > >>>
> > >>> This patch cause a hang on my server, not sure what exactly was the problem but
> > >>> kernel log was continuously printing "Cannot send to req_q". Will dig more.
> > >>
> > >> Ferruh, did you have a chance to check what is hanging?
> > >> Stephen, is there any news on your side?
> > >>
> > >>
> > >
> > > It did not hang when I tested it.
> > > The bug report is still open
> > >
> >
> > Sorry for the delay, since I am working remotely I was worried about loosing the
> > connection to my server, finally I did create a virtual environment to test again.
> >
> > I confirm the hang observed %100 when two different process updates the kni
> > interface, like two different process sets the mtu. Without this patch this
> > works fine.
> >
> > I understand the motivation of the patch, but with change there is a possibility
> > to hang the server, which we can't allow, need to find another way. Can updating
> > mlx interface wait KNI interface operation to complete?
>
> Still KNI driver is broken. Calling userspace with RTNL held is fundamentally
> broken design. If KNI were to be incorporated in upstream kernel, then the netdev
> developer would see this.
>
> What ever solution you think is best.
> I will continue to recommend against anyone using KNI.
>
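
For anyone following the thread, the deadlock in the original patch
description can be modelled in userspace. The following is only a rough
sketch and not KNI code: a plain Python lock stands in for the RTNL mutex,
a thread stands in for the userspace application behind a bifurcated
driver, and the names rtnl, userspace_handler and kni_request are
hypothetical, chosen for illustration. It shows the request timing out
while the lock is held across the wait, and completing once the lock is
dropped for the duration of the wait.

```python
import threading

# Stand-in for the kernel RTNL mutex (assumption: a plain lock is enough
# to model the ordering problem described in the patch).
rtnl = threading.Lock()


def userspace_handler(done, timeout):
    """Model of the userspace side of a bifurcated driver.

    Completing the request needs one more call back into the kernel,
    and that call must take the same lock.
    """
    if rtnl.acquire(timeout=timeout):
        try:
            done.set()  # request completed
        finally:
            rtnl.release()


def kni_request(drop_lock_while_waiting, timeout=0.2):
    """Return True if the request completed, False if it timed out."""
    done = threading.Event()
    worker = threading.Thread(target=userspace_handler, args=(done, timeout))
    with rtnl:  # the request is issued with the lock already held
        worker.start()
        if drop_lock_while_waiting:
            # Workaround from the patch description: drop the lock while
            # waiting. The real patch also holds a reference on the
            # network device so it cannot disappear; not modelled here.
            rtnl.release()
            try:
                completed = done.wait(timeout)
            finally:
                rtnl.acquire()
        else:
            # Original behaviour: wait with the lock held, so the
            # handler can never make progress.
            completed = done.wait(timeout)
    worker.join()
    return completed


if __name__ == "__main__":
    print("lock held while waiting:   ", kni_request(False))
    print("lock dropped while waiting:", kni_request(True))
```

With the lock held the call returns False (the timeout case from the patch
description); with the lock dropped it returns True. The sketch does not
model the hang Ferruh reports when two processes update the interface
concurrently, nor the one in kni_release mentioned above.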