From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jay Rolette
To: Matthew Hall
Cc: Dev
Subject: Re: [dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
Date: Tue, 17 Feb 2015 09:57:38 -0600
In-Reply-To: <20150217010024.GB30617@mhcomputing.net>
References: <20150217010024.GB30617@mhcomputing.net>
Content-Type: text/plain; charset=UTF-8
List-Id: patches and discussions about DPDK

On Mon, Feb 16, 2015 at 7:00 PM, Matthew Hall wrote:

> On Mon, Feb 16, 2015 at 10:33:52AM -0600, Jay Rolette wrote:
> > In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
> > netif_rx(). The source for netif_receive_skb() point out that it should
> > only be called from soft-irq context, which isn't the case for KNI.
>
> For the uninitiated among us, what was the practical effect of the coding
> error? Waiting forever for a lock which will never be available in IRQ
> context, or causing unintended re-entrancy, or what?
>

Sadly, I'm not really one of the enlightened ones when it comes to the Linux
kernel. VxWorks? Sure. Linux kernel? Learning as required.

I didn't chase it down to the specific mechanism in this case. Unusual for
me, but this time I took the expedient route: I found a likely explanation,
plus Yao's fix on that same code with his explanation of a deadlock, and went
with it. It'll be a few more days before we've had enough run time on it to
absolutely confirm (not an easy bug to repro).

If I get hand-wavy about it, my assumption is that the requirement that
netif_receive_skb() be called in soft-irq context means it doesn't expect to
be pre-empted or re-entered. When you call netif_rx() instead, it puts the
skb on the backlog and it gets processed from there. Part of that code
disables interrupts during part of the processing. I'm not sure what else is
coming in and actually deadlocking things.

Honestly, I don't understand enough details of how everything works in the
Linux network stack yet. I've done tons of work on the network path of
stack-less systems and a bit of work in device drivers, but I have only
scratched the surface of the internals of the Linux network stack.
The meat of my product avoids that like the plague because it is too slow.

Sorry, lots of words but not much light being shed this time...

Jay
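
To make the hand-wavy part above a little more concrete, here is a rough user-space sketch of the distinction as described: one path runs the stack immediately in the caller's context, the other only queues the packet on a backlog that is drained later from a separate context. This is a toy model under stated assumptions, not kernel code. Every name in it (toy_pkt, toy_receive_now, toy_enqueue_backlog, toy_drain_backlog) is hypothetical, and it does not attempt to reproduce the locking or interrupt handling that causes the actual lockup.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model only: none of these names are kernel APIs. */

enum toy_ctx { CTX_CALLER, CTX_BACKLOG };

struct toy_pkt {
    int id;
    enum toy_ctx handled_in;   /* which context ran the "stack" for this packet */
    int handled;               /* 0 until toy_stack_input() has run */
};

#define TOY_BACKLOG_MAX 64

static struct toy_pkt *toy_backlog[TOY_BACKLOG_MAX];
static size_t toy_head, toy_tail;   /* head: next to drain, tail: next free slot */

/* Stand-in for the protocol stack's input routine. */
static void toy_stack_input(struct toy_pkt *p, enum toy_ctx ctx)
{
    assert(p != NULL);
    p->handled_in = ctx;
    p->handled = 1;
}

/* Rough analogue of netif_receive_skb(): the stack runs right now,
 * in whatever context the caller happens to be in. */
static void toy_receive_now(struct toy_pkt *p)
{
    toy_stack_input(p, CTX_CALLER);
}

/* Rough analogue of netif_rx(): only queue the packet.
 * Returns 0 on success, -1 if the backlog is full. */
static int toy_enqueue_backlog(struct toy_pkt *p)
{
    if (toy_tail - toy_head == TOY_BACKLOG_MAX)
        return -1;
    toy_backlog[toy_tail % TOY_BACKLOG_MAX] = p;
    toy_tail++;
    return 0;
}

/* Rough analogue of the later backlog processing: drain everything
 * queued so far and return how many packets were handled. */
static size_t toy_drain_backlog(void)
{
    size_t n = 0;

    while (toy_head != toy_tail) {
        toy_stack_input(toy_backlog[toy_head % TOY_BACKLOG_MAX], CTX_BACKLOG);
        toy_head++;
        n++;
    }
    return n;
}
```

In this model, a packet passed to toy_receive_now() is handled before the call returns, while a packet passed to toy_enqueue_backlog() stays untouched until something calls toy_drain_backlog(). The point is only that the queued path decouples the caller's context from the context in which the stack runs.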