From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 95308A054F; Tue, 16 Mar 2021 19:35:46 +0100 (CET) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 16A7740A4B; Tue, 16 Mar 2021 19:35:46 +0100 (CET) Received: from mail-io1-f42.google.com (mail-io1-f42.google.com [209.85.166.42]) by mails.dpdk.org (Postfix) with ESMTP id D93B6406A2 for ; Tue, 16 Mar 2021 19:35:43 +0100 (CET) Received: by mail-io1-f42.google.com with SMTP id y20so19951744iot.4 for ; Tue, 16 Mar 2021 11:35:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=OdiyP/FWwTttuFRo2p9OYEUPyvrLEwCY2qL+OHup39Q=; b=tpWjzjQP6mIYPF3quBpcMlFLIkKyvOLnG3Nat25IiW24Q2psGvswKFYRX8bubrnj41 aMp0kOJLSVW4UBXbC32Mz2q2oF/Jd9gqf3vjULBfZAGwDRKlwcLpUkv6TZjbf5Fdtt3s yhfQWdJfOYAkudSw+eBH+OzcAG0RNQzv39x2QpoC22P+Aj5U1LmndK1vbFCSn9pwzfaf GB5TnVOKVGQ3lL0LNkiYmJQq3mQQpn40y1b+6Y9SPlzTVM3hvH0JsxxJ+MFArDW0zgG6 /OFRfpn1Ryx0EIURvxrzry27BZAd5C6NdJGe5NoRCAqtGrgx6sE1tbNrpjXRNI2HSOly JsNg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=OdiyP/FWwTttuFRo2p9OYEUPyvrLEwCY2qL+OHup39Q=; b=rndU6GrqrlHxt3us6gvdW5Ad4OED+QZfK35JTTaPlr3rofLsUWSODpbTTkAl9MaQNK ryh1j5Ybe21zhIqmJdaRjeLgEEcB0fMWqaFSIa0NVn9OB87eUNN4EXUitLj/RK/bdkwa z7NJouRqTITd7OUadtTpAP4My3KYeFJCtUy3wRIYJSgbI9AO5muIyPL+8EJX/71SBSkQ v1hbi3JLpreGX1l0AI5Z3em4evPSDzumMSEsVoz6fmivJkPA9k0dM0Ftbr2/ym+KGWu1 xCqQHFGj3Nb4CpI6heG/2fDdGCyfafb9qfE8OyjOv/KjhvMzMls5MGRgQAKi0RhCaxZY yc1g== X-Gm-Message-State: AOAM532diYaulCILWTtATgW3bGaLgCfxlYBScoFzUOvUXRmwPXYYu7ye RINaLQCN6/QvMgX5dgUHnf34BNRNdFQDPW3DcjM= X-Google-Smtp-Source: ABdhPJwEnmRJrvoKKZZRfAk0mSZei1ikh2wbfbnUBpORR9N3pXda/hr9i2tIu/jOhfU+jzs0gOBT4lvYd5XCTnByYwM= X-Received: by 2002:a05:6602:2348:: with SMTP id r8mr4477167iot.77.1615919743297; Tue, 16 Mar 2021 11:35:43 -0700 (PDT) MIME-Version: 1.0 References: <20201126144613.4986-1-eladv6@gmail.com> <20210225143239.14220-1-eladv6@gmail.com> <20210225143239.14220-2-eladv6@gmail.com> In-Reply-To: From: Elad Nachman Date: Tue, 16 Mar 2021 20:35:32 +0200 Message-ID: To: Ferruh Yigit Cc: Igor Ryzhov , Stephen Hemminger , dev@dpdk.org, Dan Gora Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.29 Subject: Re: [dpdk-dev] [PATCH 2/2] kni: fix rtnl deadlocks and race conditions v4 X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Hi, Owing to my current development schedule and obligations, I see no opportunity to make this set of changes in the near future. Sorry, Elad. =D7=91=D7=AA=D7=90=D7=A8=D7=99=D7=9A =D7=99=D7=95=D7=9D =D7=91=D7=B3, 15 = =D7=91=D7=9E=D7=A8=D7=A5 2021, 19:17, =D7=9E=D7=90=D7=AA Ferruh Yigit =E2= =80=8F< ferruh.yigit@intel.com>: > On 2/25/2021 2:32 PM, Elad Nachman wrote: > > This part of the series includes my fixes for the issues reported > > by Ferruh and Igor (and Igor comments for v3 of the patch) > > on top of part 1 of the patch series: > > > > A. KNI sync lock is being locked while rtnl is held. > > If two threads are calling kni_net_process_request() , > > then the first one will take the sync lock, release rtnl lock then slee= p. > > The second thread will try to lock sync lock while holding rtnl. > > The first thread will wake, and try to lock rtnl, resulting in a > deadlock. > > The remedy is to release rtnl before locking the KNI sync lock. > > Since in between nothing is accessing Linux network-wise, > > no rtnl locking is needed. > > > > B. There is a race condition in __dev_close_many() processing the > > close_list while the application terminates. > > It looks like if two vEth devices are terminating, > > and one releases the rtnl lock, the other takes it, > > updating the close_list in an unstable state, > > causing the close_list to become a circular linked list, > > hence list_for_each_entry() will endlessly loop inside > > __dev_close_many() . > > Since the description for the original patch indicate the > > original motivation was bringing the device up, > > I have changed kni_net_process_request() to hold the rtnl mutex > > in case of bringing the device down since this is the path called > > from __dev_close_many() , causing the corruption of the close_list. > > In order to prevent deadlock in Mellanox device in this case, the > > code has been modified not to wait for user-space while holding > > the rtnl lock. > > Instead, after the request has been sent, all locks are relinquished > > and the function exits immediately with return code of zero (success). > > > > To summarize: > > request !=3D interface down : unlock rtnl, send request to user-space, > > wait for response, send the response error code to caller in user-space= . > > > > request =3D=3D interface down: send request to user-space, return immed= iately > > with error code of 0 (success) to user-space. > > > > Signed-off-by: Elad Nachman > > > > > > --- > > v4: > > * for if down case, send asynchronously with rtnl locked and without > > wait, returning immediately to avoid both kernel race conditions > > and deadlock in user-space > > v3: > > * Include original patch and new patch as a series of patch, added a > > comment to the new patch > > v2: > > * rebuild the patch as increment from patch 64106 > > * fix comment and blank lines > > --- > > kernel/linux/kni/kni_net.c | 41 +++++++++++++++++++++++++++-----= - > > lib/librte_kni/rte_kni.c | 7 ++++-- > > lib/librte_kni/rte_kni_common.h | 1 + > > 3 files changed, 40 insertions(+), 9 deletions(-) > > > > diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c > > index f0b6e9a8d..ba991802b 100644 > > --- a/kernel/linux/kni/kni_net.c > > +++ b/kernel/linux/kni/kni_net.c > > @@ -110,12 +110,34 @@ kni_net_process_request(struct net_device *dev, > struct rte_kni_request *req) > > void *resp_va; > > uint32_t num; > > int ret_val; > > + int req_is_dev_stop =3D 0; > > + > > + /* For configuring the interface to down, > > + * rtnl must be held all the way to prevent race condition > > + * inside __dev_close_many() between two netdev instances of KNI > > + */ > > + if (req->req_id =3D=3D RTE_KNI_REQ_CFG_NETWORK_IF && > > + req->if_up =3D=3D 0) > > + req_is_dev_stop =3D 1; > > Having this request type checks in the 'kni_net_process_request()' > function > looks like hack. > Since adding a new field into the "struct rte_kni_request", that can be a > more > generic 'asnyc' field, and the requested function, like > 'kni_net_release()' can > set it to support async requests. > > And can you please separate the function to add a more generic async > request > support on its patch, which should do: > - add new 'asnyc' field to "struct rte_kni_request" > - in 'kni_net_process_request()', if 'req->async' set, do not wait for > response > - in library, 'lib/librte_kni/rte_kni.c', in 'rte_kni_handle_request()' > function, if the request is async don't put the response > (These are already done in this patch) > > Overall it can be three patch set: > 1) Function parameter change > 2) Add generic async request support (with documentation update) > 3) rtnl unlock and make 'kni_net_release()' request async (actual fix) > (We can discuss more if to make 'kni_net_release()' async with a kernel > parameter or not) > > What do you think, does it make sense? > >