DPDK patches and discussions
* [dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Jay Rolette @ 2015-02-11  1:33 UTC (permalink / raw)
  To: Dev

Environment:
  * DPDK 1.6.0r2
  * Ubuntu 14.04 LTS
  * kernel: 3.13.0-38-generic

When we exercise KNI heavily (transferring files across it, both sending
and receiving), I start to see quite a few of these kernel lockups:

kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]

Frequently I can't do much other than get a screenshot of the error message
coming across the console session once we get into this state, so debugging
what is happening is "interesting"...

I've seen this on multiple hardware platforms (so not box specific) as well
as virtual machines.

Are there any known issues with KNI that would cause kernel lockups in DPDK
1.6? Really hoping someone that knows KNI well can point me in the right
direction.

KNI in the 1.8 tree is significantly different, so back-porting it didn't
look straightforward, although I do see a few changes that might be
relevant.

Any suggestions, pointers or other general help for tracking this down?

Thanks!
Jay


* Re: [dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Alejandro Lucero @ 2015-02-11 16:25 UTC (permalink / raw)
  To: dev

Hi Jay,

I saw these errors when I worked in the HPC sector. They usually come with
a kernel dump for each core in the machine, so after some peering at the
kernel code you can work out how the soft lockup triggers. When I did
that, it was always an issue with the memory.

So on those occasions when you can still work on the machine after the
problem, look at the kernel messages. I will be glad to look at them.
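
When sifting through saved kernel messages, it can help to pull the CPU,
stall time, task name and PID out of each soft-lockup line so that repeats
can be compared. The following is only a rough sketch of that idea, based
on the single message quoted in this thread; the struct and function names
are made up for illustration and are not part of DPDK or the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields recovered from one soft-lockup line (illustrative struct). */
struct soft_lockup {
    int cpu;        /* CPU the watchdog reported as stuck */
    int seconds;    /* reported stall time in seconds */
    char comm[64];  /* task name, e.g. "kni_single" */
    int pid;        /* PID of that task */
};

/*
 * Parse a line such as
 *   kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
 * Returns 0 and fills *out on success, -1 if the line does not match.
 * On failure *out may be partially written.
 */
static int parse_soft_lockup(const char *line, struct soft_lockup *out)
{
    const char *p, *open, *close, *colon;
    size_t len;

    assert(line != NULL && out != NULL);

    p = strstr(line, "BUG: soft lockup - CPU#");
    if (p == NULL ||
        sscanf(p, "BUG: soft lockup - CPU#%d stuck for %ds!",
               &out->cpu, &out->seconds) != 2)
        return -1;

    /* The task is reported as "[name:pid]" at the end of the message. */
    open = strchr(p, '[');
    close = strrchr(p, ']');
    if (open == NULL || close == NULL || close < open)
        return -1;

    /* Use the last ':' inside the brackets, since task names such as
     * "kworker/0:1" can contain a colon themselves. */
    colon = close;
    while (colon > open && *colon != ':')
        colon--;
    if (colon == open)
        return -1;

    len = (size_t)(colon - open - 1);
    if (len == 0 || len >= sizeof(out->comm))
        return -1;
    memcpy(out->comm, open + 1, len);
    out->comm[len] = '\0';

    if (sscanf(colon + 1, "%d", &out->pid) != 1)
        return -1;
    return 0;
}
```

Feeding it each line of a saved log and counting hits per task name is
usually enough to tell whether the same thread is the one getting stuck
every time.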




* Re: [dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Jay Rolette @ 2015-02-16 13:32 UTC (permalink / raw)
  To: Alejandro Lucero; +Cc: dev

Thanks Alejandro.

I'll look into the kernel dump if there is one. The system is extremely
brittle once this happens. Usually I can't do much other than power-cycle
the box. Anything requiring sudo just locks the terminal up, so little to
look at besides the messages on the console.

Matthew Hall also suggested a few things for me to look into, so I'm
following up on that as well.

Jay


* Re: [dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Jay Rolette @ 2015-02-16 16:33 UTC (permalink / raw)
  To: Dev

Found the problem. No patch to submit since it's already fixed in later
versions of DPDK, but thought I'd follow up with the details since I'm sure
we aren't the only ones trying to use bleeding-edge versions of DPDK...

In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
netif_rx(). The source for netif_receive_skb() points out that it should
only be called from soft-irq context, which isn't the case for KNI.

As is typical, it's a simple fix once you track it down.

Yao-Po Wang's fix:  commit 41a6ebded53982107c1adfc0652d6cc1375a7db9.
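
To make the difference concrete: going by the description above, the
change amounts to calling netif_rx(skb) in kni_net_rx_normal() where the
code previously called netif_receive_skb(skb). The sketch below is a
user-space toy model of that idea, not kernel or DPDK code. Every name in
it (model_receive_skb, model_rx, model_run_softirq, BACKLOG_MAX) is
hypothetical, and the "violation" counter only stands in for the broken
precondition; the real kernel does not count anything like this.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy execution contexts; the KNI kernel thread runs in process context. */
enum model_ctx { CTX_PROCESS, CTX_SOFTIRQ };

#define BACKLOG_MAX 4   /* small stand-in for the kernel's backlog limit */

static enum model_ctx current_ctx = CTX_PROCESS;
static int backlog[BACKLOG_MAX];
static int backlog_len;
static int delivered;           /* packets handed to the "stack" */
static int context_violations;  /* inline deliveries made outside softirq */

/* Models netif_receive_skb(): delivers the packet inline and assumes the
 * caller is already in softirq context. */
static void model_receive_skb(int pkt)
{
    (void)pkt;
    if (current_ctx != CTX_SOFTIRQ)
        context_violations++;   /* precondition broken by the caller */
    delivered++;
}

/* Models netif_rx(): callable from any context because it only queues
 * the packet. Returns false if the backlog is full and the packet is
 * dropped. */
static bool model_rx(int pkt)
{
    if (backlog_len == BACKLOG_MAX)
        return false;
    backlog[backlog_len++] = pkt;
    return true;
}

/* Models the softirq that later drains the backlog, delivering each
 * packet from the context the inline path expects. */
static void model_run_softirq(void)
{
    enum model_ctx saved = current_ctx;

    assert(backlog_len <= BACKLOG_MAX);
    current_ctx = CTX_SOFTIRQ;
    for (int i = 0; i < backlog_len; i++)
        model_receive_skb(backlog[i]);
    backlog_len = 0;
    current_ctx = saved;
}
```

In this model, calling model_receive_skb() straight from process context
bumps context_violations, which mirrors the old KNI behaviour. Calling
model_rx() followed later by model_run_softirq() delivers the same packet
with no violation, at the cost of a bounded queue that can drop packets
when it is full.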

Cheers,
Jay


* Re: [dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Matthew Hall @ 2015-02-17  1:00 UTC (permalink / raw)
  To: Jay Rolette; +Cc: Dev

On Mon, Feb 16, 2015 at 10:33:52AM -0600, Jay Rolette wrote:
> In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
> netif_rx(). The source for netif_receive_skb() points out that it should
> only be called from soft-irq context, which isn't the case for KNI.

For the uninitiated among us, what was the practical effect of the coding 
error? Waiting forever for a lock which will never be available in IRQ 
context, or causing unintended re-entrancy, or what?

Thanks,
Matthew.


* Re: [dpdk-dev] kernel: BUG: soft lockup - CPU#1 stuck for 22s! [kni_single:1782]
From: Jay Rolette @ 2015-02-17 15:57 UTC (permalink / raw)
  To: Matthew Hall; +Cc: Dev

On Mon, Feb 16, 2015 at 7:00 PM, Matthew Hall <mhall@mhcomputing.net> wrote:

> On Mon, Feb 16, 2015 at 10:33:52AM -0600, Jay Rolette wrote:
> > In kni_net_rx_normal(), it was calling netif_receive_skb() instead of
> > netif_rx(). The source for netif_receive_skb() points out that it should
> > only be called from soft-irq context, which isn't the case for KNI.
>
> For the uninitiated among us, what was the practical effect of the coding
> error? Waiting forever for a lock which will never be available in IRQ
> context, or causing unintended re-entrancy, or what?
>

Sadly, I'm not really one of the enlightened ones when it comes to the
Linux kernel. VxWorks? sure. Linux kernel? learning as required.

I didn't chase it down to the specific mechanism in this case. Unusually
for me, I took the expedient route: I found a likely explanation, plus
Yao's fix to that same code with his description of a deadlock, and went
with it. It'll be a few more days before we have enough run time on it to
confirm the fix (this is not an easy bug to reproduce).

If I get hand-wavy about it, my assumption is that the requirement that
netif_receive_skb() be called in soft-irq context means it doesn't expect
to be preempted or re-entered. When you call netif_rx() instead, it puts
the skb on the backlog and it gets processed from there. Part of that code
disables interrupts during part of the processing. I'm not sure what else
is coming in and actually deadlocking things.

Honestly, I don't understand enough details of how everything works in the
Linux network stack yet. I've done tons of work on the network path of
stack-less systems, a bit of work in device drivers, but have only touched
the surface of the internals of Linux network stack. The meat of my product
avoids that like the plague because it is too slow.

Sorry, lots of words but not much light being shed this time...
Jay
