From: Zhongming Qu
Date: Fri, 19 Aug 2016 18:19:21 -0700
To: Stephen Hemminger
Cc: users@dpdk.org
Subject: Re: [dpdk-users] running multiple independent dpdk applications randomly locks up machines
In-Reply-To: <20160819140350.70fbc49e@xeon-e3>
References: <20160819140350.70fbc49e@xeon-e3>

Thanks!

I did use a hard-coded queue_id of 0 when initializing the rx/tx queues,
i.e., in rte_eth_rx_queue_setup() and rte_eth_tx_queue_setup(). So that
is a problem to solve. Will fix that and try again.

When A and B run at the same time, this lockup problem can be explained
by the conflicting queue usage. But the lockup also happens when only
one DPDK process is running at a time. That is, A and B take turns to
run but never run at the same time.

Thanks for pointing out an alternative approach. That sounds really
promising. A concern came up when that idea was talked over: What would
happen if the primary process dies? Would all the secondary processes
eventually go awry at some point? Would `--proc-type auto` solve this
problem?

On Fri, Aug 19, 2016 at 2:03 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:

> On Fri, 19 Aug 2016 13:32:06 -0700
> Zhongming Qu wrote:
>
> > Hi,
> >
> > As stated in the subject, running multiple dpdk applications (only
> > one process per application) randomly locks up machines. Thanks in
> > advance for any help.
> >
> > It is difficult to provide the exact set of information useful for
> > debugging. Just listing as much info as possible in the hope of
> > ringing a bell somewhere.
> >
> > System Configuration:
> > - Motherboard: Supermicro X10SRi-F (BIOS upgraded to the latest
> >   version as of July 2016)
> > - Intel Xeon E5-2667 v3 (Haswell), no NUMA
> > - 64GB DRAM
> > - Ubuntu 14.04 kernel 3.13.0-49-generic
> > - DPDK 16.04
> > - 1024 x 2M hugepages are reserved
> > - 82599ES NIC (2 x 10G) at pci_addr 02:00.0 and 02:00.1. Both ports
> >   use the ixgbe_uio kernel driver and the ixgbe PMD.
> >
> > Use Scenario of DPDK Application:
> > - Two single-process dpdk applications, A and B, need to run
> >   simultaneously.
> > - It is made sure that A and B do not have any race conditions or
> >   memory issues, that is, apart from dpdk.
> > - Each application uses 512 x 2M hugepages (half of the total
> >   reserved amount).
> > - Each application binds to one port via `--pci-whitelist `.
> > - Use `-m 1024` and `--file-prefix `, as instructed by 19.2.3 in the
> >   Programmer's Guide
> >   (http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html).
> >
> > Description of Problem:
> > - Starting and killing A and B repeatedly every 30 seconds has a
> >   chance of locking up the machine.
> > - Nothing persistent (no /var/log/syslog entries, no dmesg output)
> >   is available for debugging after a reboot of the frozen machine.
> > - Looks like a kernel panic, as it dumps some panic info to the
> >   serial console (not useful...) and the CapsLock and NumLock keys
> >   on a physically connected keyboard do not respond.
> > - No particular sequence of starting and killing A and B has so far
> >   been found to reliably lead to a lockup. The best way to reproduce
> >   the lockup is a keep-trying-until-lockup approach.
> >
> > A Few Things Tried:
> > - By dumping logs to stderr and to files, it was found that the
> >   lockup can happen during rte_eal_hugepage_init(), or after it,
> >   after the program is killed.
> > - It is made sure that rte_config.mem_config->memseg is properly
> >   initialized. That is, the total amount of memory reserved in the
> >   memseg is 512 x 2M hugepages.
> > - Zeroing all hugepages when the hugefile is created and mapped, or
> >   immediately after memsegs are initialized (as the second call of
> >   map_all_hugepages() in rte_eal_hugepage_init()) does not fix the
> >   problem.
> > - By default, hugefiles in /mnt/huge are not cleaned up when the
> >   applications are killed. Cleaning them up did not solve the
> >   problem either.
> >
> > Thanks very much for any input!
> >
> > Zhongming
>
> Obviously, two applications can't share the same queue.
> Also, you need to give each application a different core mask, at
> least if you are using poll mode like the DPDK examples.
>
> You might be better off having one primary DPDK process and two
> secondary processes.
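
For reference, the per-application separation discussed in this thread (distinct core masks, plus the memory limit, file prefix and port each application already gets) can be sketched as a small launcher helper. This is only an illustration under stated assumptions: the core numbers, the file prefixes `app_a` and `app_b`, the helper names, and the mapping of 02:00.0 to A and 02:00.1 to B are all hypothetical and not taken from the original setup. Only the option names (`-c`, `-m`, `--file-prefix`, `--pci-whitelist`) and the 512 x 2M figure come from the discussion above.

```python
# Sketch only: builds EAL argument lists for two independent DPDK processes.
# Core ids, file prefixes and helper names are illustrative assumptions.

HUGEPAGE_MB = 2        # 2M hugepages, as in the setup described above
PAGES_PER_APP = 512    # each application uses half of the 1024 reserved pages


def core_mask(cores):
    """Return the hex core mask (value for the EAL -c option) for a list of core ids."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return hex(mask)


def eal_args(cores, file_prefix, pci_addr):
    """Assemble the EAL options for one application as an argument list."""
    mem_mb = PAGES_PER_APP * HUGEPAGE_MB  # 512 x 2 MB = 1024 MB, i.e. -m 1024
    return [
        "-c", core_mask(cores),
        "-m", str(mem_mb),
        "--file-prefix", file_prefix,
        "--pci-whitelist", pci_addr,
    ]


def check_disjoint(cores_a, cores_b):
    """Raise ValueError if both applications would poll on the same core."""
    shared = set(cores_a) & set(cores_b)
    if shared:
        raise ValueError("cores used by both applications: %s" % sorted(shared))


# Hypothetical assignment: A on cores 1-2 and port 02:00.0, B on cores 3-4 and port 02:00.1.
cores_a, cores_b = [1, 2], [3, 4]
check_disjoint(cores_a, cores_b)
args_a = eal_args(cores_a, "app_a", "02:00.0")
args_b = eal_args(cores_b, "app_b", "02:00.1")
print(" ".join(args_a))
print(" ".join(args_b))
```

Building both argument lists in one place makes it easy to check up front that the core sets do not overlap before either process is started. It does not address the queue_id issue, which has to be fixed in the application code itself.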