From: Elad Nachman
To: ferruh.yigit@intel.com
Cc: iryzhov@nfware.com, stephen@networkplumber.org, dev@dpdk.org, eladv6@gmail.com
Date: Tue, 23 Feb 2021 14:05:12 +0200
Message-Id: <20210223120512.29216-1-eladv6@gmail.com>
In-Reply-To: <20201126144613.4986-1-eladv6@gmail.com>
References: <20201126144613.4986-1-eladv6@gmail.com>
List-Id: DPDK patches and discussions
Subject: [dpdk-dev] [PATCH V2] kni: fix rtnl deadlocks and race conditions v2

This version 2 of the patch builds on Stephen Hemminger's patch 64106
from Dec 2019 and fixes the issues reported by Ferruh and Igor:

A. The KNI sync lock is taken while rtnl is held.
If two threads call kni_net_process_request(), the first one takes
the sync lock, releases the rtnl lock and then sleeps.
The second thread then tries to take the sync lock while holding rtnl.
When the first thread wakes, it tries to take rtnl while still holding
the sync lock, resulting in a deadlock.
The remedy is to release rtnl before taking the KNI sync lock.
Since nothing accesses the Linux network stack in between, no rtnl
locking is needed there.

B. There is a race condition in __dev_close_many() processing the
close_list while the application terminates.
It looks like if two vEth devices are terminating and one releases the
rtnl lock, the other takes it and updates the close_list while it is in
an unstable state. The close_list then becomes a circular linked list,
so list_for_each_entry() loops endlessly inside __dev_close_many().
Since the description of the original patch indicates that its
motivation was bringing the device up, I have changed
kni_net_process_request() to keep holding the rtnl mutex when bringing
the device down, since this is the path called from
__dev_close_many(), which is where the close_list gets corrupted.

Depends-on: patch-64106 ("kni: fix kernel deadlock when using mlx devices")
Signed-off-by: Elad Nachman
---
V2:
* rebuild the patch as an increment on patch 64106
* fix comment and blank lines
---
 kernel/linux/kni/kni_net.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
index f0b6e9a8d..b41360220 100644
--- a/kernel/linux/kni/kni_net.c
+++ b/kernel/linux/kni/kni_net.c
@@ -110,9 +110,22 @@ kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
 	void *resp_va;
 	uint32_t num;
 	int ret_val;
+	int req_is_dev_stop = 0;
+
+	if (req->req_id == RTE_KNI_REQ_CFG_NETWORK_IF &&
+			req->if_up == 0)
+		req_is_dev_stop = 1;
 
 	ASSERT_RTNL();
 
+	/* Since we need to wait and RTNL mutex is held
+	 * drop the mutex and hold reference to keep device
+	 */
+	if (!req_is_dev_stop) {
+		dev_hold(dev);
+		rtnl_unlock();
+	}
+
 	mutex_lock(&kni->sync_lock);
 
 	/* Construct data */
@@ -124,16 +137,8 @@ kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
 		goto fail;
 	}
 
-	/* Since we need to wait and RTNL mutex is held
-	 * drop the mutex and hold refernce to keep device
-	 */
-	dev_hold(dev);
-	rtnl_unlock();
-
 	ret_val = wait_event_interruptible_timeout(kni->wq,
 			kni_fifo_count(kni->resp_q), 3 * HZ);
-	rtnl_lock();
-	dev_put(dev);
 
 	if (signal_pending(current) || ret_val <= 0) {
 		ret = -ETIME;
@@ -152,6 +157,10 @@ kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
 
 fail:
 	mutex_unlock(&kni->sync_lock);
+	if (!req_is_dev_stop) {
+		rtnl_lock();
+		dev_put(dev);
+	}
 	return ret;
 }
 
-- 
2.17.1
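
For reference, the lock ordering from item A can be modelled in user
space. The sketch below is only an illustration and not kernel code: it
replaces the RTNL mutex and kni->sync_lock with plain flags and records
in which order they are taken. All names in it (order_tracker,
take_rtnl(), request_old_order(), request_new_order(), can_deadlock())
are hypothetical and do not exist in the KNI module. A deadlock between
two threads needs both orders to occur somewhere (rtnl then sync, and
sync then rtnl), which is what can_deadlock() reports.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock-order tracker; plain flags, not real mutexes (illustrative only). */
struct order_tracker {
	bool rtnl_held;       /* models the RTNL mutex */
	bool sync_held;       /* models kni->sync_lock */
	bool rtnl_then_sync;  /* sync lock was taken while RTNL was held */
	bool sync_then_rtnl;  /* RTNL was taken while the sync lock was held */
};

void take_rtnl(struct order_tracker *t)
{
	assert(!t->rtnl_held);
	if (t->sync_held)
		t->sync_then_rtnl = true;
	t->rtnl_held = true;
}

void drop_rtnl(struct order_tracker *t)
{
	assert(t->rtnl_held);
	t->rtnl_held = false;
}

void take_sync(struct order_tracker *t)
{
	assert(!t->sync_held);
	if (t->rtnl_held)
		t->rtnl_then_sync = true;
	t->sync_held = true;
}

void drop_sync(struct order_tracker *t)
{
	assert(t->sync_held);
	t->sync_held = false;
}

/* Ordering with patch 64106 alone, as described in item A. */
void request_old_order(struct order_tracker *t)
{
	take_rtnl(t);   /* the caller enters with RTNL held */
	take_sync(t);   /* sync lock taken under RTNL */
	drop_rtnl(t);   /* RTNL dropped only around the wait */
	/* ... wait for the response here ... */
	take_rtnl(t);   /* RTNL re-taken while the sync lock is still held */
	drop_sync(t);
	drop_rtnl(t);   /* the caller releases RTNL later */
}

/* Ordering after this patch; is_dev_stop mirrors req_is_dev_stop. */
void request_new_order(struct order_tracker *t, bool is_dev_stop)
{
	take_rtnl(t);   /* the caller enters with RTNL held */
	if (!is_dev_stop)
		drop_rtnl(t);   /* RTNL released before the sync lock */
	take_sync(t);
	/* ... wait for the response here ... */
	drop_sync(t);
	if (!is_dev_stop)
		take_rtnl(t);   /* RTNL re-taken only after the sync lock is free */
	drop_rtnl(t);   /* the caller releases RTNL later */
}

/* Both acquisition orders were observed, so two threads could deadlock. */
bool can_deadlock(const struct order_tracker *t)
{
	return t->rtnl_then_sync && t->sync_then_rtnl;
}
```

Running request_old_order() on a zeroed tracker records both orders, so
can_deadlock() returns true. Running request_new_order() for both the
device-up and the device-stop case records only "rtnl then sync" (from
the stop path), so can_deadlock() returns false. The model says nothing
about the close_list race in item B.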