From: Zhongming Qu
Date: Fri, 19 Aug 2016 13:32:06 -0700
To: users@dpdk.org
Subject: [dpdk-users] running multiple independent dpdk applications randomly locks up machines

Hi,

As stated in the subject, running multiple dpdk applications (only one process per application) randomly locks up machines. Thanks in advance for any help.

It is difficult to know exactly which information is useful for debugging, so I am listing as much as possible in the hope that it rings a bell somewhere.

System Configuration:
- Motherboard: Supermicro X10SRi-F (BIOS upgraded to the latest version as of July 2016)
- CPU: Intel Xeon E5-2667 v3 (Haswell), no NUMA
- 64GB DRAM
- Ubuntu 14.04, kernel 3.13.0-49-generic
- DPDK 16.04
- 1024 x 2M hugepages are reserved
- 82599ES NIC (2 x 10G) at pci_addr 02:00.0 and 02:00.1. Both ports use the ixgbe_uio kernel driver and the ixgbe PMD.

Use Scenario of DPDK Application:
- Two single-process dpdk applications, A and B, need to run simultaneously.
- A and B have been checked to be free of race conditions and memory issues, apart from anything inside dpdk itself.
- Each application uses 512 x 2M hugepages (half of the total reserved amount).
- Each application binds to one port via `--pci-whitelist `.
- Each application is started with `-m 1024` and `--file-prefix `, as instructed by 19.2.3 in the Programmer's Guide ( http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html).

Description of Problem:
- Repeatedly starting and killing A and B every 30 seconds has a chance of locking up the machine.
- Nothing persistent (no /var/log/syslog, no dmesg) is available for debugging after a reboot of the frozen machine.
- It looks like a kernel panic: some panic info is dumped to the serial console (not useful...) and the CapsLock and NumLock keys on a physically connected keyboard do not respond.
- No particular sequence of starting and killing A and B has so far been found to reliably lead to a lockup. The best way to reproduce it is a keep-trying-until-lockup approach.

A Few Things Tried:
- By dumping logs to stderr and to files, it was found that the lockup can happen during rte_eal_hugepage_init(), or after it, once the program has been killed.
- rte_config.mem_config->memseg was verified to be properly initialized. That is, the total amount of memory reserved in the memseg is 512 x 2M hugepages.
- Zeroing all hugepages when the hugefile is created and mapped, or immediately after memsegs are initialized (at the second call of map_all_hugepages() in rte_eal_hugepage_init()), does not fix the problem.
- By default, hugefiles in /mnt/huge are not cleaned up when the applications are killed. Cleaning them up did not solve the problem either.

Thanks very much for any input!

Zhongming
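
For reference, the keep-trying-until-lockup approach could be scripted roughly as below. This is only a minimal sketch, not the harness actually used: the helper names (parse_hugepage_counters, log_durably, run_cycle) are hypothetical, killing with SIGKILL is an assumption, and the command lines for A and B are left to the caller. Each step is written to a log file and fsync'ed, so that the last recorded step has a better chance of surviving a hard lockup than syslog did. Whether fsync is enough on a given filesystem is also an assumption.

```python
import os
import subprocess
import time


def parse_hugepage_counters(meminfo_text):
    """Return the HugePages_* and Hugepagesize fields of /proc/meminfo as ints."""
    counters = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_") or key == "Hugepagesize":
            counters[key] = int(rest.split()[0])
    return counters


def log_durably(log_file, message):
    """Append one timestamped line and push it to disk before continuing."""
    log_file.write(f"{time.time():.3f} {message}\n")
    log_file.flush()
    os.fsync(log_file.fileno())


def run_cycle(commands, log_file, run_seconds=30.0, read_meminfo=None):
    """Start every command, wait run_seconds, then kill them all.

    commands is a list of argv lists (one per application, e.g. A and B).
    read_meminfo is an optional callable returning the text of /proc/meminfo;
    when given, the hugepage counters are logged after the kill.
    """
    procs = []
    for cmd in commands:
        log_durably(log_file, f"starting {cmd}")
        procs.append(subprocess.Popen(cmd))
    time.sleep(run_seconds)
    for proc in procs:
        log_durably(log_file, f"killing pid {proc.pid}")
        proc.kill()  # SIGKILL on POSIX; the signal used is an assumption
        proc.wait()
    if read_meminfo is not None:
        counters = parse_hugepage_counters(read_meminfo())
        log_durably(log_file, f"hugepages {counters}")
```

A caller would open a log file in append mode and call run_cycle() in a loop with the two application command lines, passing `lambda: open("/proc/meminfo").read()` as read_meminfo. After a lockup and reboot, the last line of the log shows which step was reached.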