From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp.tuxdriver.com (charlotte.tuxdriver.com [70.61.120.58])
 by dpdk.org (Postfix) with ESMTP id 27F925907
 for ; Tue, 24 Jun 2014 12:48:57 +0200 (CEST)
Received: from hmsreliant.think-freely.org
 ([2001:470:8:a08:7aac:c0ff:fec2:933b] helo=localhost)
 by smtp.tuxdriver.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.63)
 (envelope-from ) id 1WzOHc-0007ri-7F; Tue, 24 Jun 2014 06:49:10 -0400
Date: Tue, 24 Jun 2014 06:48:59 -0400
From: Neil Horman
To: Stefan Baranoff
Message-ID: <20140624104859.GA19229@hmsreliant.think-freely.org>
References: <53A43E5E.3030809@windriver.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Score: -2.9 (--)
X-Spam-Status: No
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] Random mbuf corruption
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK
X-List-Received-Date: Tue, 24 Jun 2014 10:48:57 -0000

On Mon, Jun 23, 2014 at 05:43:21PM -0400, Stefan Baranoff wrote:
> Paul,
>
> Thanks for the advice; we ran memtest as well as the Dell complete system
> diagnostic, and neither found an issue. The plot thickens, though!
>
> Our admins mixed up our kickstart labels, and what I *thought* was CentOS
> 6.4 was actually RHEL 6.4; the problem seems to follow the CentOS 6.4
> installations. The current configuration of success/failure is:
> 1 server  - Westmere     - RHEL 6.4   -- works
> 1 server  - Sandy Bridge - RHEL 6.4   -- works
> 2 servers - Sandy Bridge - CentOS 6.4 -- fails
>
There were several memory corruptors fixed between RHEL 6.3 and RHEL 6.5.
It's possible that CentOS didn't pick up one of those patches, especially if
the fix went out in a z-stream. Is there an older version of RHEL that
reproduces the problem for you?
If so, I can provide a list of bugs/fixes that may be related, which you
could cross-check against CentOS for inclusion.

> Given that the hardware seems otherwise stable/checks out, I'm trying to
> figure out how to determine whether this is:
> a) a bug in our software
> b) a kernel/hugetlbfs bug
> c) a DPDK 1.6.0r2 bug
>
> I have seen similar issues when rte_eal_init is called too late in a
> process (for example, calling 'free' on memory that was allocated with
> 'malloc' before 'rte_eal_init' ran results in a segfault in libc), which
> seems odd to me, but in this case we call rte_eal_init as the first thing
> we do in main().
>
Sounds like it might be time to add some poisoning options to the mbuf
allocator.
Neil

>
> Thanks,
> Stefan
>
>
> On Fri, Jun 20, 2014 at 9:59 AM, Paul Barrette
> wrote:
>
> >
> > On 06/20/2014 07:20 AM, Stefan Baranoff wrote:
> >
> >> All,
> >>
> >> We are seeing 'random' memory corruption in mbufs coming from the ixgbe
> >> UIO driver, and I am looking for some pointers on debugging it. Our
> >> software ran flawlessly for weeks at a time on our old Westmere systems
> >> (CentOS 6.4), but since moving to a new Sandy Bridge v2 server (also
> >> CentOS 6.4) it runs for 1-2 minutes and then at least one mbuf is
> >> overwritten with arbitrary data (pointers/lengths/RSS value/num
> >> segs/etc. are all ridiculous). Both servers use the 82599EB chipset
> >> (x520), and the DPDK version (1.6.0r2) is identical. We recently also
> >> tested on a third server running RHEL 6.4 with the same hardware as the
> >> failing Sandy Bridge based system, and it is fine (days of runtime, no
> >> failures).
> >>
> >> Running all of this in GDB with 'record' enabled, setting a watchpoint
> >> on the address that contains the corrupted data, and executing a
> >> 'reverse-continue' never hits the watchpoint [GDB newbie here --
> >> assuming 'watch *(uint64_t*)0x7FB.....' should work].
> >> My first thought was memory corruption, but the BIOS memcheck on the
> >> ECC RAM shows no issues.
> >>
> >> Also, looking at mbuf->pkt.data as an example, the corrupt value was
> >> the same in 6 of 12 trials, but I could not find that value elsewhere
> >> in the process's memory. This doesn't seem "random" and points to a
> >> software bug, but I cannot for the life of me get GDB to tell me where
> >> the program is when that memory is written to. Incidentally, trying
> >> this with the PCAP driver and --no-huge so it can run under valgrind
> >> shows no memory access errors/uninitialized values/etc.
> >>
> >> Thoughts? Pointers? Ways to rule in/out hardware other than going 1 by
> >> 1 removing each of the 24 DIMMs?
> >>
> >> Thanks so much in advance!
> >> Stefan
> >>
> > Run memtest to rule out bad RAM.
> >
> > Pb
>