From: Ferruh Yigit
To: Elad Nachman, Igor Ryzhov
Cc: stable@dpdk.org, Stephen Hemminger, Dan Gora, dev@dpdk.org
Subject: Re: [dpdk-dev] [dpdk-stable] [PATCH v5 3/3] kni: fix kernel deadlock when using mlx devices
Date: Fri, 9 Apr 2021 15:56:19 +0100
In-Reply-To: <20210329143655.521750-3-ferruh.yigit@intel.com>
References: <20201126144613.4986-1-eladv6@gmail.com> <20210329143655.521750-1-ferruh.yigit@intel.com> <20210329143655.521750-3-ferruh.yigit@intel.com>

On 3/29/2021 3:36 PM, Ferruh Yigit wrote:
> KNI runs the userspace callback with the rtnl lock held. This does not
> work with devices that need to interact with the kernel interface in
> the callback, such as Mellanox devices.
>
> The solution is to release the rtnl lock before calling the userspace
> callback, but this requires two considerations:
>
> 1. The rtnl lock needs to be released before 'kni->sync_lock' is taken,
>    otherwise it causes a deadlock with multiple KNI devices. See A.
>    below for the details of the deadlock condition.
>
> 2. Releasing the rtnl lock for the interface down event causes a
>    regression and a deadlock, so the rtnl lock can't be released for
>    the interface down event. See B. below for the details.
>
> As a solution, the interface down event is handled asynchronously, and
> for all other events the rtnl lock is released before processing the
> callback.
>
> A. KNI sync lock is being locked while rtnl is held.
> If two threads are calling kni_net_process_request(),
> then the first one will take the sync lock, release the rtnl lock, then
> sleep.
> The second thread will try to lock the sync lock while holding rtnl.
> The first thread will wake and try to lock rtnl, resulting in a
> deadlock. The remedy is to release rtnl before locking the KNI sync
> lock.
> Since in between nothing is accessing Linux network-wise, no rtnl
> locking is needed.
>
> B. There is a race condition in __dev_close_many() processing the
> close_list while the application terminates.
> It looks like if two KNI interfaces are terminating,
> and one releases the rtnl lock, the other takes it,
> updating the close_list in an unstable state,
> causing the close_list to become a circular linked list,
> hence list_for_each_entry() will endlessly loop inside
> __dev_close_many().
>
> To summarize:
> request != interface down: unlock rtnl, send request to user-space,
> wait for response, send the response error code to caller in user-space.
>
> request == interface down: send request to user-space, return immediately
> with error code of 0 (success) to user-space.
>
> Fixes: 3fc5ca2f6352 ("kni: initial import")
> Cc: stable@dpdk.org
>
> Signed-off-by: Elad Nachman
> ---
> Cc: Stephen Hemminger
> Cc: Igor Ryzhov
> Cc: Dan Gora
>

Hi Elad, Igor,

Can you please review/test this set when you have time?

Thanks,
ferruh
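
For anyone reviewing, the behaviour in the "To summarize" part of the commit message can be modelled in userspace roughly as below. This is only a sketch under stated assumptions, not the patch itself: the two pthread mutexes stand in for the rtnl lock and 'kni->sync_lock', and the names req_kind, probe_cb, process_request and pending_down are hypothetical and do not exist in the KNI sources. It also assumes the interface down path still takes the sync lock briefly to post the request, which the commit message does not state.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical request kinds; not the real rte_kni request IDs. */
enum req_kind { REQ_IF_UP, REQ_IF_DOWN, REQ_CHANGE_MTU };

/* Userspace mutexes standing in for the rtnl lock and kni->sync_lock. */
static pthread_mutex_t rtnl = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sync_lock = PTHREAD_MUTEX_INITIALIZER;

static int pending_down;          /* down requests posted but not awaited */
static int rtnl_free_in_cb = -1;  /* set by probe_cb(): 1 if rtnl was free */

/* Models the userspace callback; the return value is its error code. */
typedef int (*user_cb)(enum req_kind kind);

/* Sample callback: records whether rtnl is free while it runs. */
static int probe_cb(enum req_kind kind)
{
	(void)kind;
	if (pthread_mutex_trylock(&rtnl) == 0) {
		rtnl_free_in_cb = 1;
		pthread_mutex_unlock(&rtnl);
	} else {
		rtnl_free_in_cb = 0;
	}
	return -22; /* pretend the application rejected the request */
}

/* Caller holds rtnl on entry and holds it again on return. */
static int process_request(enum req_kind kind, user_cb cb)
{
	bool async = (kind == REQ_IF_DOWN);
	int ret;

	if (!async)
		pthread_mutex_unlock(&rtnl);  /* drop rtnl before sync_lock */

	pthread_mutex_lock(&sync_lock);
	if (async) {
		pending_down++;   /* post the request, do not wait */
		ret = 0;          /* report success immediately */
	} else {
		ret = cb(kind);   /* wait for the response, pass its code on */
	}
	pthread_mutex_unlock(&sync_lock);

	if (!async)
		pthread_mutex_lock(&rtnl);    /* restore the caller's lock state */

	return ret;
}
```

With rtnl held by the caller, process_request(REQ_CHANGE_MTU, probe_cb) returns the callback's error code and probe_cb() observes rtnl as free, while process_request(REQ_IF_DOWN, probe_cb) returns 0 without invoking the callback and never releases rtnl.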