From: Matt Laswell
To: "Sanford, Robert"
Cc: Jay Rolette, DPDK
Date: Thu, 7 Apr 2016 08:42:53 -0500
Subject: Re: [dpdk-dev] Kernel panic in KNI
List-Id: patches and discussions about DPDK

Hey Robert,

Thanks for the insight. I work with Jay on the code he's asking about; we
only have one mbuf pool that we use for all packets. Mostly, this is for
the reasons that you describe, as well as for the sake of simplicity. As it
happens, the stack trace we're seeing makes it look as though either the
mbuf's data pointer is screwed up, or the VA translation done on it is. I
suspect that we're getting to a failure mode similar to the one you
experienced, though perhaps for different reasons.

Thanks,
Matt

On Wed, Apr 6, 2016 at 5:30 PM, Sanford, Robert wrote:

> Hi Jay,
>
> I won't try to interpret your kernel stack trace. But, I'll tell you about
> a KNI-related problem that we once experienced, and the symptom was a
> kernel hang.
>
> The problem was that we were passing mbufs allocated out of one mempool,
> to a KNI context that we had set up with a different mempool (on a
> different CPU socket). The KNI kernel driver, converts the user-space mbuf
> virtual address (VA) to a kernel VA by adding the difference between the
> user and kernel VAs of the mempool used to create the KNI context. So, if
> an mbuf comes from a different mempool, the calculated address will
> probably be VERY BAD.
>
> Could this be your problem?
>
> --
> Robert
>
>
> On 4/6/16 4:16 PM, "Jay Rolette" wrote:
>
> >I had a system lockup hard a couple of days ago and all we were able to
> >get was a photo of the LCD monitor with most of the kernel panic on it.
> >No way to scroll back the buffer and nothing in the logs after we
> >rebooted. Not
> >surprising with a kernel panic due to an exception during interrupt
> >processing.
> >We have a serial console attached in case we are able to get it to
> >happen again, but it's not easy to reproduce (hours of runtime for this
> >instance).
> >
> >Ran the photo through OCR software to get a text version of the dump, so
> >possible I missed some fixups in this:
> >
> >[39178.433262] RDX: 00000000000000ba RSI: ffff881fd2f350ee RDI: a12520669126180a
> >[39178.464020] RBP: ffff880433966970 R08: a12520669126180a R09: ffff881fd2f35000
> >[39178.495091] R10: 000000000000ffff R11: ffff881fd2f88000 R12: ffff883fd1a75ee8
> >[39178.526594] R13: 00000000000000ba R14: 00007fdad5a66780 R15: ffff883715ab6780
> >[39178.559011] FS: 00007ffff7fea740(0000) GS:ffff881fffc00000(0000) knlGS:0000000000000000
> >[39178.592005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >[39178.623931] CR2: 00007ffff7ea2000 CR3: 0000001fd156f000 CR4: 00000000001407f0
> >[39178.656187] Stack:
> >[39178.689025] ffffffffc067c7ef 00000000000000ba 00000000000000ba ffff881fd2f88000
> >[39178.722682] 0000000000004000 ffff8B3fd0bbd09c ffff883fd1a75ee8 ffff8804339bb9c8
> >[39178.756525] ffffffff81658456 ffff881fcd2ec40c ffffffffc0680700 ffff880436bad800
> >[39178.790577] Call Trace:
> >[39178.824420] [] ? kni_net_tx+0xef/0x1a0 [rte_kni]
> >[39178.859190] [] dev_hard_start_xmit+0x316/0x5c0
> >[39178.893426] [] sch_direct_xmit+0xee/0x1c0
> >[39178.927435] [] __dev_queue_xmit+0x200/0x4d0
> >[39178.961684] [] dev_queue_xmit+0x10/0x20
> >[39178.996194] [] neigh_connected_output+0x67/0x100
> >[39179.031098] [] ip_finish_output+0x1d8/0x850
> >[39179.066709] [] ip_output+0x58/0x90
> >[39179.101551] [] ip_local_out_sk+0x30/0x40
> >[39179.136823] [] ip_queue_xmit+0x13f/0x3d0
> >[39179.171742] [] tcp_transmit_skb+0x47c/0x900
> >[39179.206854] [] tcp_write_xmit+0x110/0xcb0
> >[39179.242335] [] __tcp_push_pending_frames+0x2e/0xc0
> >[39179.277632] [] tcp_push+0xec/0x120
> >[39179.311768] [] tcp_sendmsg+0xb9/0xce0
> >[39179.346934] [] ? tcp_recvmsg+0x6e2/0xba0
> >[39179.385586] [] inet_sendmsg+0x64/0x60
> >[39179.424228] [] ? apparmor_socket_sendmsg+0x21/0x30
> >[39179.458658] [] sock_sendmsg+0x86/0xc0
> >[39179.493220] [] ? __inet_stream_connect+0xa5/0x320
> >[39179.528033] [] ? __fdget+0x13/0x20
> >[39179.561214] [] SYSC_sendto+0x121/0x1c0
> >[39179.594665] [] ? aa_sk_perm.isra.4+0x6d/0x150
> >[39179.626893] [] ? read_tsc+0x9/0x20
> >[39179.658654] [] ? ktime_get_ts+0x48/0xe0
> >[39179.689944] [] SyS_sendto+0xe/0x10
> >[39179.719575] [] system_call_fastpath+0x1a/0x1f
> >[39179.748760] Code: 43 58 48 Zb 43 50 88 43 4e 5b 5d c3 66 Of if 84 00 00
> >00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90 90 48 89 f8 48 89 d1
> >a4 c3 03 83 eZ 07 f3 48 .15 89 di f3 a4 c3 20 4c 8b % 4c 86
> >[39179.808690] RIP [] memcpy+0x6/0x110
> >[39179.837238] RSP
> >[39179.933755] ---[ end trace 2971562f425e2cf8 ]---
> >[39179.964856] Kernel panic - not syncing: Fatal exception in interrupt
> >[39179.992896] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> >[39180.024617] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
> >
> >It blew up when kni_net_tx() called memcpy() to copy data from the skb to
> >an mbuf.
> >
> >Disclosure: I'm not a Linux device driver guy. I dip into the kernel as
> >needed. Plenty of experience doing RTOS and bare metal development, but
> >not a Linux kernel expert.
> >
> >What context does kni_net_tx() run in? On the receive path, my
> >understanding is that KNI always runs in process context on a kthread.
> >I've been assuming that the transmit path was also in process context
> >(albeit on the app's process), so the "Fatal exception in interrupt" is
> >throwing me.
> >
> >Does kni_net_tx() ever run in interrupt (or soft-interrupt) context?
> >
> >Thanks,
> >Jay
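
For reference, the address translation Robert describes in the quoted
message can be sketched in plain C. This is a simplified user-space
illustration under stated assumptions, not the rte_kni source: the names
struct pool_map, pool_contains() and user_to_kernel() are hypothetical, and
any addresses fed into it are made up. It only shows why adding one
mempool's (kernel VA - user VA) offset to an mbuf from a different mempool
yields a pointer into the wrong memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one mempool: the same memory region as seen at a
 * user-space VA and at a kernel VA. Not the real rte_kni structures. */
struct pool_map {
    uintptr_t user_base;   /* user-space VA of the pool's first byte */
    uintptr_t kernel_base; /* kernel VA of that same byte */
    size_t    len;         /* pool size in bytes */
};

/* True if user_va lies inside the pool the context was created with. */
static bool pool_contains(const struct pool_map *p, uintptr_t user_va)
{
    assert(p != NULL);
    return user_va >= p->user_base && user_va - p->user_base < p->len;
}

/* Translate by adding the pool's (kernel - user) offset. Unsigned
 * arithmetic wraps, so this also works when kernel_base < user_base.
 * The result is only meaningful if pool_contains(p, user_va) is true. */
static uintptr_t user_to_kernel(const struct pool_map *p, uintptr_t user_va)
{
    assert(p != NULL);
    return user_va + (p->kernel_base - p->user_base);
}
```

In a sketch like this, calling pool_contains() before user_to_kernel()
would turn a cross-mempool mbuf into a detectable error instead of a wild
pointer handed to memcpy(). Whether such a check is practical in the real
driver's fast path is a separate question.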