From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp.tuxdriver.com (charlotte.tuxdriver.com [70.61.120.58])
 by dpdk.org (Postfix) with ESMTP id 27F925907
 for ; Tue, 24 Jun 2014 12:48:57 +0200 (CEST)
Received: from hmsreliant.think-freely.org
 ([2001:470:8:a08:7aac:c0ff:fec2:933b] helo=localhost)
 by smtp.tuxdriver.com with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.63)
 (envelope-from ) id 1WzOHc-0007ri-7F; Tue, 24 Jun 2014 06:49:10 -0400
Date: Tue, 24 Jun 2014 06:48:59 -0400
From: Neil Horman
To: Stefan Baranoff
Message-ID: <20140624104859.GA19229@hmsreliant.think-freely.org>
References: <53A43E5E.3030809@windriver.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Score: -2.9 (--)
X-Spam-Status: No
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] Random mbuf corruption
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK
X-List-Received-Date: Tue, 24 Jun 2014 10:48:57 -0000

On Mon, Jun 23, 2014 at 05:43:21PM -0400, Stefan Baranoff wrote:
> Paul,
>
> Thanks for the advice; we ran memtest as well as the Dell complete system
> diagnostic, and neither found an issue. The plot thickens, though!
>
> Our admins mixed up our kickstart labels, and what I *thought* was CentOS
> 6.4 was actually RHEL 6.4; the problem seems to follow the CentOS 6.4
> installations. The current configuration of success/failure is:
> 1 server  - Westmere     - RHEL 6.4   -- works
> 1 server  - Sandy Bridge - RHEL 6.4   -- works
> 2 servers - Sandy Bridge - CentOS 6.4 -- fails
>
There were several memory corruptors fixed between RHEL 6.3 and RHEL 6.5.
It's possible that CentOS didn't pick up one of those patches, especially if
the fix went out in a z-stream. Is there an older version of RHEL that
reproduces the problem for you?
If so, I can provide a list of bugs/fixes that may be related, which you
could cross-check against CentOS for inclusion.

> Given that the hardware seems otherwise stable/checks out, I'm trying to
> figure out how to determine whether this is:
> a) a bug in our software
> b) a kernel/hugetlbfs bug
> c) a DPDK 1.6.0r2 bug
>
> I have seen similar issues when rte_eal_init is called too late in a
> process (for example, calling 'free' on memory that was allocated with
> 'malloc' before 'rte_eal_init' ran results in a segfault in libc), which
> seems odd to me, but in this case we call rte_eal_init as the first thing
> we do in main().
>
Sounds like it might be time to add some poisoning options to the mbuf
allocator.
Neil

>
> Thanks,
> Stefan
>
>
> On Fri, Jun 20, 2014 at 9:59 AM, Paul Barrette
> wrote:
>
> >
> > On 06/20/2014 07:20 AM, Stefan Baranoff wrote:
> >
> >> All,
> >>
> >> We are seeing 'random' memory corruption in mbufs coming from the ixgbe
> >> UIO driver, and I am looking for some pointers on debugging it. Our
> >> software ran flawlessly for weeks at a time on our old Westmere systems
> >> (CentOS 6.4), but since moving to a new Sandy Bridge v2 server (also
> >> CentOS 6.4) it runs for 1-2 minutes and then at least one mbuf is
> >> overwritten with arbitrary data (pointers/lengths/RSS value/num
> >> segs/etc. are all ridiculous). Both servers use the 82599EB chipset
> >> (x520), and the DPDK version (1.6.0r2) is identical. We recently also
> >> tested on a third server running RHEL 6.4 with the same hardware as the
> >> failing Sandy Bridge based system, and it is fine (days of runtime, no
> >> failures).
> >>
> >> Running all of this in GDB with 'record' enabled, setting a watchpoint
> >> on the address that contains the corrupted data, and executing a
> >> 'reverse-continue' never hits the watchpoint [GDB newbie here --
> >> assuming 'watch *(uint64_t*)0x7FB.....' should work].
> >> My first thought was memory corruption, but the BIOS memcheck on the
> >> ECC RAM shows no issues.
> >>
> >> Also, looking at mbuf->pkt.data as an example, the corrupt value was
> >> the same in 6 of 12 trials, but I could not find that value elsewhere
> >> in the process's memory. This doesn't seem "random" and points to a
> >> software bug, but I cannot for the life of me get GDB to tell me where
> >> the program is when that memory is written to. Incidentally, trying
> >> this with the PCAP driver and --no-huge so it can run under valgrind
> >> shows no memory access errors/uninitialized values/etc.
> >>
> >> Thoughts? Pointers? Ways to rule in/out hardware other than going 1 by
> >> 1 removing each of the 24 DIMMs?
> >>
> >> Thanks so much in advance!
> >> Stefan
> >>
> > Run memtest to rule out bad RAM.
> >
> > Pb
>