From: Stefan Baranoff
To: dev@dpdk.org
Date: Fri, 20 Jun 2014 07:20:52 -0400
Subject: [dpdk-dev] Random mbuf corruption

All,

We are seeing 'random' memory corruption in mbufs coming from the ixgbe UIO driver, and I am looking for some pointers on debugging it.
Our software ran flawlessly for weeks at a time on our old Westmere systems (CentOS 6.4), but since moving to a new Sandy Bridge v2 server (also CentOS 6.4) it runs for 1-2 minutes and then at least one mbuf is overwritten with arbitrary data (pointers/lengths/RSS value/num segs/etc. are all ridiculous). Both servers use the 82599EB chipset (X520), and the DPDK version (1.6.0r2) is identical. We recently also tested on a third server running RHEL 6.4 with the same hardware as the failing Sandy Bridge system, and it is fine (days of runtime, no failures).

Running all of this in GDB with 'record' enabled, setting a watchpoint on the address that contains the corrupted data, and executing a 'reverse-continue' never hits the watchpoint [GDB newbie here -- assuming 'watch *(uint64_t*)0x7FB.....' should work].

My first thought was memory corruption, but the BIOS memcheck on the ECC RAM shows no issues. Also, looking at mbuf->pkt.data as an example, the corrupt value was the same in 6 of 12 trials, yet I could not find that value elsewhere in the process's memory. This doesn't seem "random" and points to a software bug, but I cannot for the life of me get GDB to tell me where the program is when that memory is written to. Incidentally, running under Valgrind with the PCAP driver and --no-huge shows no memory access errors, uninitialized values, etc.

Thoughts? Pointers? Ways to rule in/out hardware other than going 1 by 1 removing each of the 24 DIMMs?

Thanks so much in advance!
Stefan