From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mail.windriver.com (mail.windriver.com [147.11.1.11])
    by dpdk.org (Postfix) with ESMTP id 903835939
    for ; Fri, 20 Jun 2014 15:59:43 +0200 (CEST)
Received: from ALA-HCA.corp.ad.wrs.com (ala-hca.corp.ad.wrs.com [147.11.189.40])
    by mail.windriver.com (8.14.5/8.14.5) with ESMTP id s5KDxx9B026574
    (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=FAIL);
    Fri, 20 Jun 2014 06:59:59 -0700 (PDT)
Received: from [128.224.146.11] (128.224.146.11)
    by ALA-HCA.corp.ad.wrs.com (147.11.189.50) with Microsoft SMTP Server
    id 14.3.169.1; Fri, 20 Jun 2014 06:59:59 -0700
Message-ID: <53A43E5E.3030809@windriver.com>
Date: Fri, 20 Jun 2014 09:59:58 -0400
From: Paul Barrette
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: Stefan Baranoff
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [dpdk-dev] Random mbuf corruption
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK
X-List-Received-Date: Fri, 20 Jun 2014 13:59:44 -0000

On 06/20/2014 07:20 AM, Stefan Baranoff wrote:
> All,
>
> We are seeing 'random' memory corruption in mbufs coming from the ixgbe UIO
> driver and I am looking for some pointers on debugging it. Our software was
> running flawlessly for weeks at a time on our old Westmere systems (CentOS
> 6.4) but since moving to a new Sandy Bridge v2 server (also CentOS 6.4) it
> runs for 1-2 minutes and then at least one mbuf is overwritten with
> arbitrary data (pointers/lengths/RSS value/num segs/etc. are all
> ridiculous). Both servers are using the 82599EB chipset (x520) and the DPDK
> version (1.6.0r2) is identical. We recently also tested on a third server
> running RHEL 6.4 with the same hardware as the failing Sandy Bridge based
> system and it is fine (days of runtime, no failures).
>
> Running all of this in GDB with 'record' enabled, setting a watchpoint
> on the address which contains the corrupted data, and executing a
> 'reverse-continue' never hits the watchpoint [GDB newbie here -- assuming
> 'watch *(uint64_t*)0x7FB.....' should work]. My first thought was memory
> corruption but the BIOS memcheck on the ECC RAM shows no issues.
>
> Also, looking at mbuf->pkt.data as an example, the corrupt value was the
> same in 6 of 12 trials, but I could not find that value elsewhere in the
> process's memory. This doesn't seem "random" and points to a software bug,
> but I cannot for the life of me get GDB to tell me where the program is
> when that memory is written to. Incidentally, trying this with the PCAP
> driver and --no-huge to run valgrind shows no memory access
> errors/uninitialized values/etc.
>
> Thoughts? Pointers? Ways to rule in/out hardware other than going 1 by 1
> removing each of the 24 DIMMs?
>
> Thanks so much in advance!
> Stefan

Run memtest to rule out bad RAM.

Pb
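
Alongside a RAM test, one way to narrow down when the corruption happens is to sanity-check each mbuf at a few points in the pipeline (for example right after receive and again before free) and stop at the first inconsistent one. A GDB watchpoint only traps stores made by the CPU, so it would not fire if the bad write came from the NIC via DMA; an explicit check at least brackets the time window. The sketch below is only an illustration under stated assumptions: `struct mbuf_view` and `mbuf_view_sane` are hypothetical names, and the struct is a stand-in for the fields mentioned in the thread, not the real `struct rte_mbuf` layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the mbuf fields mentioned in the thread.
 * This is NOT the real struct rte_mbuf layout; the fields would need to
 * be mapped onto the actual structure of the DPDK version in use. */
struct mbuf_view {
    uint8_t  *buf_addr;  /* start of the buffer backing this mbuf */
    uint16_t  buf_len;   /* size of that buffer in bytes */
    uint8_t  *data;      /* start of packet data, expected inside the buffer */
    uint16_t  data_len;  /* bytes of packet data in this segment */
    uint32_t  pkt_len;   /* total packet length across all segments */
    uint8_t   nb_segs;   /* number of segments in the chain */
};

/* Return true when the fields are mutually consistent, false otherwise. */
static bool mbuf_view_sane(const struct mbuf_view *m)
{
    if (m == NULL || m->buf_addr == NULL || m->data == NULL)
        return false;

    uintptr_t start = (uintptr_t)m->buf_addr;
    uintptr_t end   = start + m->buf_len;
    uintptr_t data  = (uintptr_t)m->data;

    /* The data pointer must lie inside the backing buffer. */
    if (data < start || data > end)
        return false;

    /* The segment's data must not run past the end of the buffer. */
    if (m->data_len > end - data)
        return false;

    /* A valid packet has at least one segment. */
    if (m->nb_segs == 0)
        return false;

    /* The total length can never be smaller than this segment's length,
     * and for a single-segment packet the two must match. */
    if (m->pkt_len < m->data_len)
        return false;
    if (m->nb_segs == 1 && m->pkt_len != m->data_len)
        return false;

    return true;
}
```

Calling such a check on every received mbuf and aborting on the first failure leaves a core dump close to the moment the bad values appear, which can then be compared across trials.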