From: Stefan Baranoff <sbaranoff@gmail.com>
To: Paul Barrette <paul.barrette@windriver.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] Random mbuf corruption
Date: Mon, 23 Jun 2014 17:43:21 -0400
Message-ID: <CAHzKxpYaUhR5ti2EDZfj7jeu8pWxhnmWM+e2D20k01NHa_u85w@mail.gmail.com>
In-Reply-To: <53A43E5E.3030809@windriver.com>
Paul,
Thanks for the advice; we ran memtest as well as the Dell complete system
diagnostic and neither found an issue. The plot thickens, though!
Our admins mixed up our kickstart labels: what I *thought* was CentOS 6.4
was actually RHEL 6.4, and the problem seems to follow the CentOS 6.4
installations. The current success/failure breakdown is:
1 server - Westmere - RHEL 6.4 -- works
1 server - Sandy Bridge - RHEL 6.4 -- works
2 servers - Sandy Bridge - CentOS 6.4 -- fails
Given that the hardware otherwise checks out, I'm trying to figure out how
to determine whether this is:
a) a bug in our software
b) a kernel/hugetlbfs bug
c) a DPDK 1.6.0r2 bug
I have seen similar problems when rte_eal_init is called too late in a
process (for example, calling 'free' on memory that was allocated with
'malloc' before 'rte_eal_init' fails or segfaults in libc), which seems odd
to me. In this case, though, we call rte_eal_init as the first thing in
main().
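One way to start separating (a) from (b)/(c) might be a cheap consistency
check on the mbuf fields that end up looking ridiculous. The following is
only a minimal sketch: `struct mbuf_view` and `mbuf_view_check` are
hypothetical names, and the struct is a stand-in holding copies of a few
fields, not the real rte_mbuf layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the handful of mbuf fields that looked wrong;
 * this is NOT the real struct rte_mbuf layout. */
struct mbuf_view {
    const uint8_t *buf_addr;  /* start of the buffer backing the mbuf */
    uint16_t       buf_len;   /* size of that buffer in bytes */
    const uint8_t *data;      /* start of packet data (mbuf->pkt.data) */
    uint16_t       data_len;  /* bytes of data in this segment */
    uint32_t       pkt_len;   /* total length across all segments */
    uint8_t        nb_segs;   /* number of segments in the chain */
};

/* Returns 0 if the fields are mutually consistent, otherwise a non-zero
 * code identifying the first check that failed. */
static int mbuf_view_check(const struct mbuf_view *m)
{
    assert(m != NULL);  /* a NULL view is a caller bug, not corruption */

    if (m->buf_addr == NULL)
        return 1;

    /* Compare as integers: a corrupted pointer may point anywhere, and
     * relational comparison of unrelated pointers is undefined in C. */
    uintptr_t start = (uintptr_t)m->buf_addr;
    uintptr_t end   = start + m->buf_len;
    uintptr_t data  = (uintptr_t)m->data;

    if (data < start || data > end)
        return 2;  /* data pointer is outside the backing buffer */
    if (end - data < m->data_len)
        return 3;  /* segment data would run past the end of the buffer */
    if (m->nb_segs == 0)
        return 4;  /* a valid packet has at least one segment */
    if (m->pkt_len < m->data_len)
        return 5;  /* total length cannot be less than this segment */
    if (m->nb_segs == 1 && m->pkt_len != m->data_len)
        return 6;  /* single-segment packet: the lengths must match */
    return 0;
}
```

Running a check like this on every mbuf right after rte_eth_rx_burst()
returns, and again just before the mbuf is freed, should show whether the
fields are already bad on receive or go bad while our code holds the mbuf.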
Thanks,
Stefan
On Fri, Jun 20, 2014 at 9:59 AM, Paul Barrette <paul.barrette@windriver.com>
wrote:
>
> On 06/20/2014 07:20 AM, Stefan Baranoff wrote:
>
>> All,
>>
>> We are seeing 'random' memory corruption in mbufs coming from the ixgbe
>> UIO driver and I am looking for some pointers on debugging it. Our
>> software was running flawlessly for weeks at a time on our old Westmere
>> systems (CentOS 6.4) but since moving to a new Sandy Bridge v2 server
>> (also CentOS 6.4) it runs for 1-2 minutes and then at least one mbuf is
>> overwritten with arbitrary data (pointers/lengths/RSS value/num segs/etc.
>> are all ridiculous). Both servers are using the 82599EB chipset (x520)
>> and the DPDK version (1.6.0r2) is identical. We recently also tested on a
>> third server running RHEL 6.4 with the same hardware as the failing Sandy
>> Bridge based system and it is fine (days of runtime, no failures).
>>
>> Running all of this in GDB with 'record' enabled and setting a watchpoint
>> on the address which contains the corrupted data and executing a
>> 'reverse-continue' never hits the watchpoint [GDB newbie here -- assuming
>> 'watch *(uint64_t*)0x7FB.....' should work]. My first thought was memory
>> corruption but the BIOS memcheck on the ECC RAM shows no issues.
>>
>> Also looking at mbuf->pkt.data, as an example, the corrupt value was the
>> same in 6 of 12 trials but I could not find that value elsewhere in the
>> process's memory. This doesn't seem "random" and points to a software bug
>> but I cannot for the life of me get GDB to tell me where the program is
>> when that memory is written to. Incidentally trying this with the PCAP
>> driver and --no-huge to run valgrind shows no memory access
>> errors/uninitialized values/etc.
>>
>> Thoughts? Pointers? Ways to rule in/out hardware other than going 1 by 1
>> removing each of the 24 DIMMs?
>>
>> Thanks so much in advance!
>> Stefan
>>
> Run memtest to rule out bad ram.
>
> Pb
>