DPDK usage discussions
* [dpdk-users] running multiple independent dpdk applications randomly locks up machines
From: Zhongming Qu @ 2016-08-19 20:32 UTC
  To: users

Hi,


As stated in the subject, running multiple dpdk applications (only one
process per application) randomly locks up machines. Thanks in advance for
any help.

It is difficult to know exactly which information is useful for
debugging, so I am listing as much as possible in the hope that it rings
a bell somewhere.

System Configuration:
- Motherboard: Supermicro X10SRi-F (BIOS upgraded to the latest version as
of July 2016)
- Intel Xeon E5-2667 v3 (Haswell), no NUMA
- 64GB DRAM
- Ubuntu 14.04 kernel 3.13.0-49-generic
- DPDK 16.04
- 1024 x 2M hugepages are reserved
- 82599ES NIC (2 x 10G) at pci_addr 02:00.0 and 02:00.1. Both ports use the
ixgbe_uio kernel driver and the ixgbe PMD.


Use Scenario of DPDK Application:
- Two single-process dpdk applications, A and B, need to run simultaneously.
- We have verified that A and B have no race conditions or memory issues
of their own, that is, apart from dpdk.
- Each application uses 512 x 2M hugepages (half of the total reserved
amount).
- Each application binds to one port via `--pci-whitelist <pci_addr>`.
- Each application is started with `-m 1024` (1024 MB, matching the
512 x 2M hugepages above) and `--file-prefix <some_unique_id_per_pci_addr>`,
as instructed by section 19.2.3 of the Programmer's Guide
(http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html).
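
The memory budget and per-application arguments described above can be
sanity-checked with a small script. This is only a sketch: the helper
names and the scheme for deriving the `--file-prefix` value from the PCI
address are assumptions made for illustration; the flags themselves are
the ones listed above.

```python
# Sketch only: the prefix naming scheme and helper names are hypothetical.
HUGEPAGE_MB = 2        # 2M hugepages, as in the setup above
RESERVED_PAGES = 1024  # hugepages reserved on the machine


def eal_args(pci_addr, mem_mb):
    """Build the EAL arguments for one single-process application."""
    # Derive a unique --file-prefix from the PCI address (assumed scheme).
    prefix = "dpdk_" + pci_addr.replace(":", "_").replace(".", "_")
    return ["-m", str(mem_mb),
            "--file-prefix", prefix,
            "--pci-whitelist", pci_addr]


def fits_in_pool(mem_mb_per_app):
    """True if the summed -m values do not exceed the reserved hugepages."""
    return sum(mem_mb_per_app) <= RESERVED_PAGES * HUGEPAGE_MB


args_a = eal_args("02:00.0", 1024)
args_b = eal_args("02:00.1", 1024)
print(" ".join(args_a))
print(" ".join(args_b))
print(fits_in_pool([1024, 1024]))
```

With two applications at `-m 1024` each, the total is exactly the
2048 MB reserved, so there is no headroom left for a third consumer.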


Description of Problem:
- Repeatedly starting and killing A and B every 30 seconds has a chance of
locking up the machine.
- No kernel log in /var/log/syslog, no dmesg output, nothing persistent is
available for debugging after the frozen machine is rebooted.
- It looks like a kernel panic: some panic info is dumped to the serial
console (not useful...) and the CapsLock and NumLock keys on a physically
connected keyboard do not respond.
- So far, no particular sequence of starting and killing A and B has been
found to reliably lead to a lockup. The best way to reproduce it is to keep
trying until the machine locks up.


A Few Things Tried:
- By logging to stderr and to files, we found that the lockup can happen
during rte_eal_hugepage_init(), or later, after the program has been
killed.
- We verified that rte_config.mem_config->memseg is properly initialized.
That is, the total amount of memory reserved in the memseg is 512 x 2M
hugepages.
- Zeroing all hugepages when the hugefile is created and mapped, or
immediately after the memsegs are initialized (at the second call of
map_all_hugepages() in rte_eal_hugepage_init()), does not fix the problem.
- By default, hugefiles in /mnt/huge are not cleaned up when the
applications are killed. Cleaning them up did not solve the problem
either.



Thanks very much for any input!


Zhongming


* Re: [dpdk-users] running multiple independent dpdk applications randomly locks up machines
From: Stephen Hemminger @ 2016-08-19 21:03 UTC
  To: Zhongming Qu; +Cc: users


Obviously, two applications can't share the same queue.
Also, you need to give each application a different core mask, at least if
you are using poll mode like the DPDK examples.

You might be better off having one primary DPDK process and two secondary processes.
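
One way to check the core-mask advice before launching is to build each
application's mask from a core list and confirm the two do not overlap.
This is a minimal sketch; the core assignments (1-2 for A, 3-4 for B) are
hypothetical, and the resulting hex value is what would be passed to the
EAL `-c` option.

```python
def core_mask(cores):
    """Return an integer bitmask with one bit set per core id."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def masks_overlap(mask_a, mask_b):
    """True if any core appears in both masks."""
    return (mask_a & mask_b) != 0


# Hypothetical assignment: application A on cores 1-2, B on cores 3-4.
mask_a = core_mask([1, 2])
mask_b = core_mask([3, 4])
print(hex(mask_a), hex(mask_b), masks_overlap(mask_a, mask_b))
# prints: 0x6 0x18 False
```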


* Re: [dpdk-users] running multiple independent dpdk applications randomly locks up machines
From: Zhongming Qu @ 2016-08-20  1:19 UTC
  To: Stephen Hemminger; +Cc: users

Thanks!

I did use a hard coded queue_id of 0 when initializing the rx/tx queues,
i.e., rte_eth_rx/tx_queue_setup(). So that is a problem to solve. Will fix
that and try again.

When A and B run at the same time, this lockup problem can be explained by
the conflicting queue usage. But the lockup happens even in the use case
where only one dpdk process is running. That is, A and B take turns to run
but do not run at the same time.

Thanks for pointing out an alternative approach. That sounds really
promising. A concern came up when that idea was talked over: What would
happen if the primary process dies? Would all the secondary processes
eventually go awry at some point? Would `--proc-type auto` solve this
problem?






* Re: [dpdk-users] running multiple independent dpdk applications randomly locks up machines
From: Stephen Hemminger @ 2016-08-20  1:30 UTC
  To: Zhongming Qu; +Cc: users


I haven't actually used the primary/secondary model, but the recommendation
is that the primary process does nothing (or is a watchdog), so it would
be pretty much impossible for it to crash unless killed by a malicious entity.

All the packet logic would be in the secondary.


* Re: [dpdk-users] running multiple independent dpdk applications randomly locks up machines
From: Zhongming Qu @ 2016-08-26 17:55 UTC
  To: Stephen Hemminger; +Cc: users

Hi,


Just an update.

Thanks for all the input. I feel obliged to post the latest findings
here so that this thread may be useful to other people.

As it turned out, the rx/tx queue conflict was not the real problem. Here
is why:
Our use model is to run two different *primary* dpdk processes, each of
which binds to a different port. Both ports are on the same 82599ES NIC.
They are separate ports that have independent rx/tx queues (in the sense of
BARs and the BAR0-based registers).

The actual problem was that our application never called
rte_eth_dev_stop() to properly shut down the device. Simply making
sure that rte_eth_dev_stop() is called solved our problem.

From the standpoint of a user of the dpdk library, the problem is solved.
BUT it is not yet understood how exactly failing to call
rte_eth_dev_stop() could have caused machine lockups. Could someone shed
light on this question by
  a) simply confirming that I am not the only person seeing this problem,
  b) explaining how, at a very low level, race conditions, memory
corruption, or anything else could cause a kernel panic, or
  c) providing pointers to potentially relevant information?



Thanks a lot!
Zhongming

