From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rolette@infiniteio.com>
Received: from mail-yk0-f176.google.com (mail-yk0-f176.google.com
 [209.85.160.176]) by dpdk.org (Postfix) with ESMTP id 6124BB64B
 for <dev@dpdk.org>; Tue, 17 Feb 2015 16:57:39 +0100 (CET)
Received: by mail-yk0-f176.google.com with SMTP id 142so16854739ykq.7
 for <dev@dpdk.org>; Tue, 17 Feb 2015 07:57:38 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:mime-version:in-reply-to:references:date
 :message-id:subject:from:to:cc:content-type;
 bh=zXxZ8qDeQctDN8YeF1RPvqP2BWsb+3rR0zo4oNsOyG0=;
 b=B0xxMwO+45zpQ17hvhZBHzjJj9LJP5TtySJqrWzSTQygjQyf3rrt+EeYIr469YNhV9
 XP8XjpFoBk3k21UfAIs4pgrMFhzEQy0I3HvJVDdG7vxxwwl1KWFqf1kk/uWJiWCq9Upr
 8p2vC44FLuQOflq50cdCLMh9yFmtTQYCeCuqywoXYCV6LFLAMnWru+N6pY1oNNJtT8Kn
 syvTCghvgXen/YRocqj67lrRbxLaIsXvPDjfmZfW7UHHPI5SSatw+kDRxZ/k1mnbeVx0
 dH88WfAAz8Rfv5vRjox8QtcoBZxQcOgqSFqpkSG9+7s+NxWuxJ+/PYUp7WVg/TaJKJAW
 W5zQ==
X-Gm-Message-State: ALoCoQkw7BNwugRulX7oXxD2XDZuMb/gnOTqQ6v7rWZJm+1QWt9B/8i9Yfb3Q42e+3qdeX4MSufO
MIME-Version: 1.0
X-Received: by 10.236.63.6 with SMTP id z6mr410492yhc.65.1424188658847; Tue,
 17 Feb 2015 07:57:38 -0800 (PST)
Received: by 10.170.205.212 with HTTP; Tue, 17 Feb 2015 07:57:38 -0800 (PST)
In-Reply-To: <20150217010024.GB30617@mhcomputing.net>
References: <CADNuJVpqB+kydPe1QXcLV21GRF-TMR9dkXdYysk7XTdBZryhcQ@mail.gmail.com>
 <CADNuJVpKGyFOKNA1JCVbg72SwPbM0+9HSWAHwAiG=G2AXFKJ-w@mail.gmail.com>
 <20150217010024.GB30617@mhcomputing.net>
Date: Tue, 17 Feb 2015 09:57:38 -0600
Message-ID: <CADNuJVr0X4fDn6=gqFvKHKvm2=Jyo77754wm4+WO9ihjCLc_Ag@mail.gmail.com>
From: Jay Rolette <rolette@infiniteio.com>
To: Matthew Hall <mhall@mhcomputing.net>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.15
Cc: Dev <dev@dpdk.org>
Subject: Re: [dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s!
	[kni_single:1782]
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Tue, 17 Feb 2015 15:57:39 -0000

On Mon, Feb 16, 2015 at 7:00 PM, Matthew Hall <mhall@mhcomputing.net> wrote:

> On Mon, Feb 16, 2015 at 10:33:52AM -0600, Jay Rolette wrote:
> > In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
> > netif_rx(). The source for netif_receive_skb() point out that it should
> > only be called from soft-irq context, which isn't the case for KNI.
>
> For the uninitiated among us, what was the practical effect of the coding
> error? Waiting forever for a lock which will never be available in IRQ
> context, or causing unintended re-entrancy, or what?
>

Sadly, I'm not really one of the enlightened ones when it comes to the
Linux kernel. VxWorks? Sure. Linux kernel? Learning as required.

I didn't chase it down to the specific mechanism in this case. Unusual for
me, but this time I took the expedient route of finding a likely
explanation plus Yao's fix on that same code with his explanation of a
deadlock and went with it. It'll be a few more days before we've had enough
run time on it to absolutely confirm (not an easy bug to repro).

If I get hand-wavy about it, my assumption is that the requirement that
netif_receive_skb() be called in soft-irq context means it doesn't expect
to be pre-empted or re-entered. When you call netif_rx() instead, it puts
the skb on the backlog and it gets processed from there. Part of that code
disables interrupts during part of the processing. Not sure what else is
coming in and actually deadlocking things.
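
To make the difference between the two calls concrete, here is a toy
userspace model. It is not kernel code and not the KNI patch: every toy_*
name, the fixed-size backlog, and the in_softirq flag are made up for
illustration, and it only mimics the "process now" versus "queue and
process later" behaviour described above. It does not reproduce the
lockup itself.

```c
/* Userspace toy model, NOT kernel code. All names are hypothetical. */
#include <stdio.h>

#define BACKLOG_MAX 8

struct toy_skb { int id; };

static struct toy_skb *backlog[BACKLOG_MAX];
static int backlog_len;   /* packets waiting on the toy backlog */
static int in_softirq;    /* 1 while the toy "softirq" is running */
static int delivered;     /* packets handed to the toy stack */
static int bad_context;   /* deliveries made outside "softirq" context */

/* Stands in for netif_receive_skb(): handles the packet immediately and
 * assumes the caller is already in softirq context. */
static void toy_receive_skb(struct toy_skb *skb)
{
    if (!in_softirq)
        bad_context++;    /* the situation the thread is worried about */
    delivered++;
    (void)skb;
}

/* Stands in for netif_rx(): only queues the packet for later.
 * Returns 0 on success, -1 if the toy backlog is full. */
static int toy_netif_rx(struct toy_skb *skb)
{
    if (backlog_len == BACKLOG_MAX)
        return -1;
    backlog[backlog_len++] = skb;
    return 0;
}

/* Stands in for the softirq that later drains the backlog. */
static void toy_run_softirq(void)
{
    in_softirq = 1;
    for (int i = 0; i < backlog_len; i++)
        toy_receive_skb(backlog[i]);
    backlog_len = 0;
    in_softirq = 0;
}
```

In this model, calling toy_receive_skb() directly from ordinary code bumps
bad_context, while going through toy_netif_rx() and then toy_run_softirq()
delivers the same packet from the context the receive function expects.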

Honestly, I don't understand enough details of how everything works in the
Linux network stack yet. I've done tons of work on the network path of
stack-less systems, a bit of work in device drivers, but have only
scratched the surface of the internals of the Linux network stack. The
meat of my product avoids that like the plague because it is too slow.

Sorry, lots of words but not much light being shed this time...
Jay