From: Stefan Baranoff
To: Paul Barrette
Cc: dev@dpdk.org
Date: Mon, 23 Jun 2014 17:43:21 -0400
Subject: Re: [dpdk-dev] Random mbuf corruption
In-Reply-To: <53A43E5E.3030809@windriver.com>
References: <53A43E5E.3030809@windriver.com>

Paul,

Thanks for the advice; we ran memtest as well as the Dell complete system
diagnostic and neither found an issue. The plot thickens, though!
Our admins messed up our kickstart labels, and what I *thought* was CentOS
6.4 was actually RHEL 6.4. The problem seems to follow the CentOS 6.4
installations. The current success/failure breakdown is:

1 server - Westmere - RHEL 6.4 -- works
1 server - Sandy Bridge - RHEL 6.4 -- works
2 servers - Sandy Bridge - CentOS 6.4 -- fails

Given that the hardware otherwise seems stable and checks out, I'm trying to
determine whether this is:

a) a bug in our software
b) a kernel/hugetlbfs bug
c) a DPDK 1.6.0r2 bug

I have seen similar symptoms when rte_eal_init is called too late in a
process (for example, calling 'free' on memory that was allocated with
'malloc' before 'rte_eal_init' was called fails with a segfault in libc),
which seems odd to me. In this case, though, we call rte_eal_init as the
first thing we do in main().

Thanks,
Stefan

On Fri, Jun 20, 2014 at 9:59 AM, Paul Barrette wrote:

>
> On 06/20/2014 07:20 AM, Stefan Baranoff wrote:
>
>> All,
>>
>> We are seeing 'random' memory corruption in mbufs coming from the ixgbe UIO
>> driver and I am looking for some pointers on debugging it. Our software was
>> running flawlessly for weeks at a time on our old Westmere systems (CentOS
>> 6.4) but since moving to a new Sandy Bridge v2 server (also CentOS 6.4) it
>> runs for 1-2 minutes and then at least one mbuf is overwritten with
>> arbitrary data (pointers/lengths/RSS value/num segs/etc. are all
>> ridiculous). Both servers are using the 82599EB chipset (x520) and the DPDK
>> version (1.6.0r2) is identical. We recently also tested on a third server
>> running RHEL 6.4 with the same hardware as the failing Sandy Bridge based
>> system and it is fine (days of runtime no failures).
>>
>> Running all of this in GDB with 'record' enabled and setting a watchpoint
>> on the address which contains the corrupted data and executing a
>> 'reverse-continue' never hits the watchpoint [GDB newbie here -- assuming
>> 'watch *(uint64_t*)0x7FB.....' should work]. My first thought was memory
>> corruption but the BIOS memcheck on the ECC RAM shows no issues.
>>
>> Also looking at mbuf->pkt.data, as an example, the corrupt value was the
>> same 6/12 trials but I could not find that value elsewhere in the processes
>> memory. This doesn't seem "random" and points to a software bug but I
>> cannot for the life of me get GDB to tell me where the program is when that
>> memory is written to. Incidentally trying this with the PCAP driver and
>> --no-huge to run valgrind shows no memory access errors/uninitialized
>> values/etc.
>>
>> Thoughts? Pointers? Ways to rule in/out hardware other than going 1 by 1
>> removing each of the 24 DIMMs?
>>
>> Thanks so much in advance!
>> Stefan
>>
> Run memtest to rule out bad ram.
>
> Pb
>
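
Since the watchpoint approach described above never fires, one software-only way to narrow down when an mbuf goes bad might be to fingerprint the suspect fields right after receive and re-check them at each stage boundary. The following is only a minimal sketch under stated assumptions: `struct fake_mbuf` is a hypothetical stand-in, not the real `struct rte_mbuf` layout, and the helper names (`fake_mbuf_seal`, `fake_mbuf_intact`) are invented for illustration. The fingerprint uses the FNV-1a hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields mentioned in the thread.
 * This is NOT the real struct rte_mbuf layout. */
struct fake_mbuf {
    void    *data;      /* packet data pointer */
    uint16_t data_len;  /* bytes in this segment */
    uint8_t  nb_segs;   /* number of segments */
    uint32_t rss_hash;  /* RSS value */
    uint32_t guard;     /* fingerprint of the fields above */
};

/* Fold n bytes starting at p into a running 32-bit FNV-1a hash. */
static uint32_t fnv1a(uint32_t h, const void *p, size_t n)
{
    const unsigned char *b = p;
    for (size_t i = 0; i < n; i++) {
        h ^= b[i];
        h *= 16777619u;          /* FNV prime */
    }
    return h;
}

/* Hash each field separately so struct padding never enters the result. */
static uint32_t fake_mbuf_fingerprint(const struct fake_mbuf *m)
{
    assert(m != NULL);
    uint32_t h = 2166136261u;    /* FNV offset basis */
    h = fnv1a(h, &m->data, sizeof m->data);
    h = fnv1a(h, &m->data_len, sizeof m->data_len);
    h = fnv1a(h, &m->nb_segs, sizeof m->nb_segs);
    h = fnv1a(h, &m->rss_hash, sizeof m->rss_hash);
    return h;
}

/* Record the fingerprint at a point where the fields are known to be good. */
static void fake_mbuf_seal(struct fake_mbuf *m)
{
    m->guard = fake_mbuf_fingerprint(m);
}

/* Return 1 if the fields still match the sealed fingerprint, 0 otherwise. */
static int fake_mbuf_intact(const struct fake_mbuf *m)
{
    return m->guard == fake_mbuf_fingerprint(m);
}
```

Calling `fake_mbuf_seal` once after receive and `fake_mbuf_intact` at the entry and exit of each processing stage would bracket the interval in which a field changed. It does not identify the writer, and any stage that legitimately modifies a sealed field would need to re-seal afterwards.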