* [dpdk-dev] Random mbuf corruption
From: Stefan Baranoff @ 2014-06-20 11:20 UTC
  To: dev

All,

We are seeing 'random' memory corruption in mbufs coming from the ixgbe
UIO driver, and I am looking for some pointers on debugging it. Our
software ran flawlessly for weeks at a time on our old Westmere systems
(CentOS 6.4), but since moving to a new Sandy Bridge v2 server (also
CentOS 6.4) it runs for 1-2 minutes and then at least one mbuf is
overwritten with arbitrary data (pointers/lengths/RSS value/num
segs/etc. are all ridiculous). Both servers use the 82599EB chipset
(x520), and the DPDK version (1.6.0r2) is identical. We recently also
tested on a third server running RHEL 6.4 with the same hardware as the
failing Sandy Bridge system, and it is fine (days of runtime, no
failures).

Running all of this in GDB with 'record' enabled, setting a watchpoint
on the address containing the corrupted data, and executing a
'reverse-continue' never hits the watchpoint [GDB newbie here --
assuming 'watch *(uint64_t*)0x7FB.....' should work]. My first thought
was memory corruption, but the BIOS memcheck on the ECC RAM shows no
issues.

Also, looking at mbuf->pkt.data as an example, the corrupt value was
the same in 6 of 12 trials, but I could not find that value elsewhere
in the process's memory. This doesn't seem "random" and points to a
software bug, but I cannot for the life of me get GDB to tell me where
the program is when that memory is written. Incidentally, trying this
with the PCAP driver and --no-huge in order to run valgrind shows no
memory access errors, uninitialized values, etc.

Thoughts? Pointers? Ways to rule hardware in or out, other than
removing each of the 24 DIMMs one by one?

Thanks so much in advance!
Stefan
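For reference, the watchpoint flow described above looks roughly like
this (a sketch; the address is a placeholder for the corrupted
location, and the EAL arguments are omitted):

    (gdb) run ...
    ^C
    (gdb) record
    (gdb) continue
    ^C
    (gdb) watch *(uint64_t *)0x7fb000001040
    Hardware watchpoint 1: *(uint64_t *)0x7fb000001040
    (gdb) reverse-continue

One caveat worth knowing: hardware watchpoints use the CPU's debug
registers and therefore only catch CPU stores, and GDB's 'record' log
only contains instructions the traced program itself executed. A write
that arrives via NIC DMA touches the same memory without either
mechanism noticing, so a watchpoint that never fires is consistent with
the device (or another process mapping the same hugepages) writing the
buffer rather than the traced program.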
* Re: [dpdk-dev] Random mbuf corruption
From: Paul Barrette @ 2014-06-20 13:59 UTC
  To: Stefan Baranoff, dev

On 06/20/2014 07:20 AM, Stefan Baranoff wrote:
> We are seeing 'random' memory corruption in mbufs coming from the
> ixgbe UIO driver, and I am looking for some pointers on debugging it.
> [snip]
>
> Thoughts? Pointers? Ways to rule hardware in or out, other than
> removing each of the 24 DIMMs one by one?

Run memtest to rule out bad ram.

Pb
* Re: [dpdk-dev] Random mbuf corruption
From: Stefan Baranoff @ 2014-06-23 21:43 UTC
  To: Paul Barrette
  Cc: dev

Paul,

Thanks for the advice; we ran memtest as well as the Dell complete
system diagnostic, and neither found an issue. The plot thickens,
though!

Our admins messed up our kickstart labels, and what I *thought* was
CentOS 6.4 was actually RHEL 6.4; the problem seems to follow the
CentOS 6.4 installations. The current configuration of success/failure
is:

  1 server  - Westmere     - RHEL 6.4   -- works
  1 server  - Sandy Bridge - RHEL 6.4   -- works
  2 servers - Sandy Bridge - CentOS 6.4 -- fails

Given that the hardware seems otherwise stable and checks out, I'm
trying to figure out how to determine whether this is:
  a) a bug in our software
  b) a kernel/hugetlbfs bug
  c) a DPDK 1.6.0r2 bug

I have seen similar issues when calling rte_eal_init too late in a
process (for example, calling 'free' on memory that was allocated with
'malloc' before 'rte_eal_init' was called segfaults in libc), which
seems odd to me, but in this case we call rte_eal_init as the first
thing we do in main() (a minimal sketch of that ordering follows
below).

Thanks,
Stefan

On Fri, Jun 20, 2014 at 9:59 AM, Paul Barrette
<paul.barrette@windriver.com> wrote:
> On 06/20/2014 07:20 AM, Stefan Baranoff wrote:
>> We are seeing 'random' memory corruption in mbufs coming from the
>> ixgbe UIO driver, and I am looking for some pointers on debugging it.
>> [snip]
>
> Run memtest to rule out bad ram.
>
> Pb
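A minimal sketch of that initialization ordering (illustrative only;
the error handling and everything after the EAL call are placeholders,
not the poster's actual application):

    #include <stdio.h>
    #include <rte_eal.h>

    int main(int argc, char **argv)
    {
        /* Bring up the EAL before any other allocation or library
         * call; on success rte_eal_init() returns the number of EAL
         * arguments it consumed. */
        int ret = rte_eal_init(argc, argv);
        if (ret < 0) {
            fprintf(stderr, "rte_eal_init failed\n");
            return 1;
        }
        argc -= ret;
        argv += ret;

        /* Only now: mempool creation, port setup, lcore launch... */
        return 0;
    }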
* Re: [dpdk-dev] Random mbuf corruption
From: Gray, Mark D @ 2014-06-24 8:05 UTC
  To: Stefan Baranoff, Barrette, Paul (Wind River)
  Cc: dev

> Given that the hardware seems otherwise stable and checks out, I'm
> trying to figure out how to determine whether this is:
> a) a bug in our software
> b) a kernel/hugetlbfs bug
> c) a DPDK 1.6.0r2 bug
> [snip]

I have seen the following issues cause mbuf corruption of this type:

1. Calling rte_pktmbuf_free() on an mbuf and then still using a
   reference to that mbuf (see the sketch after this message).
2. Using rte_pktmbuf_free() and rte_pktmbuf_alloc() in a pthread (i.e.
   not a "dpdk" thread). This corrupted the per-lcore mbuf cache.

Not pleasant to debug, especially if you are sharing the mempool
between primary and secondary processes. I have no tips for debugging
other than careful code review of everywhere an mbuf is freed or
allocated.

Mark
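A contrived sketch of the first failure mode above (not code from the
thread; the field names follow the 1.6-era mbuf layout):

    #include <stdint.h>
    #include <rte_mbuf.h>

    /* BUG: use-after-free on an mbuf. */
    static void buggy_path(struct rte_mempool *mp)
    {
        struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
        if (m == NULL)
            return;

        rte_pktmbuf_free(m);   /* mbuf returns to the pool here... */

        /* ...and may immediately be allocated and filled by another
         * lcore. This read now races with that writer, and any write
         * through 'm' corrupts the other lcore's packet -- which then
         * surfaces as "random" corruption somewhere far away. */
        uint32_t rss = m->pkt.hash.rss;
        (void)rss;
    }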
* Re: [dpdk-dev] Random mbuf corruption
From: Neil Horman @ 2014-06-24 10:48 UTC
  To: Stefan Baranoff
  Cc: dev

On Mon, Jun 23, 2014 at 05:43:21PM -0400, Stefan Baranoff wrote:
> Thanks for the advice; we ran memtest as well as the Dell complete
> system diagnostic, and neither found an issue. The plot thickens,
> though!
>
> Our admins messed up our kickstart labels, and what I *thought* was
> CentOS 6.4 was actually RHEL 6.4; the problem seems to follow the
> CentOS 6.4 installations. The current configuration of success/failure
> is:
>
>   1 server  - Westmere     - RHEL 6.4   -- works
>   1 server  - Sandy Bridge - RHEL 6.4   -- works
>   2 servers - Sandy Bridge - CentOS 6.4 -- fails

There were several memory corruptors fixed between RHEL 6.3 and RHEL
6.5. It's possible that CentOS didn't get one of those patches if it
went out in a zstream or something. Is there an older version of RHEL
that recreates the problem for you? If so, I can provide a list of
bugs/fixes that may be related, which you could cross-check against
CentOS for inclusion.

> Given that the hardware seems otherwise stable and checks out, I'm
> trying to figure out how to determine whether this is:
> a) a bug in our software
> b) a kernel/hugetlbfs bug
> c) a DPDK 1.6.0r2 bug
> [snip]

Sounds like it might be time to add in some poisoning options to the
mbuf allocator.

Neil
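As an illustration of the idea, application-side poisoning is easy to
sketch (the wrapper name and the 0xA5 pattern are arbitrary choices,
not a DPDK API, and this is only safe if mbufs are never cloned, i.e.
the reference count is always 1):

    #include <string.h>
    #include <rte_mbuf.h>

    #define MBUF_POISON 0xA5

    /* Hypothetical free wrapper: scribble over the whole data buffer
     * before returning the mbuf, so any stale reader sees an obvious
     * pattern instead of plausible-looking garbage. */
    static void pktmbuf_free_poisoned(struct rte_mbuf *m)
    {
        if (m != NULL) {
            memset(m->buf_addr, MBUF_POISON, m->buf_len);
            rte_pktmbuf_free(m);
        }
    }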
* Re: [dpdk-dev] Random mbuf corruption
From: Olivier MATZ @ 2014-06-24 11:01 UTC
  To: Neil Horman, Stefan Baranoff
  Cc: dev

Hi,

On 06/24/2014 12:48 PM, Neil Horman wrote:
> Sounds like it might be time to add in some poisoning options to the
> mbuf allocator.

Such options already exist: CONFIG_RTE_LIBRTE_MBUF_DEBUG and
RTE_LIBRTE_MEMPOOL_DEBUG enable some runtime checks, at least for
double frees and buffer overflows.

Olivier
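In the 1.6-era build system, enabling those checks means flipping the
options in the build-time configuration and recompiling -- a sketch,
assuming the stock config layout of that release:

    # config/defconfig_x86_64-default-linuxapp-gcc (or build/.config)
    CONFIG_RTE_LIBRTE_MBUF_DEBUG=y
    CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=y

followed by a clean rebuild of DPDK and the application, since the
options change mempool element layouts and the inline alloc/free
paths.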
* Re: [dpdk-dev] Random mbuf corruption
From: Stefan Baranoff @ 2014-06-25 1:31 UTC
  To: dev

Thanks, all!

We have alloc/free counters in place, but I'll enable those debug
options. I'll try older releases of RHEL and get back to you. I've
also eradicated pthreads in this code, but I have the mempool cache
size set to 0 just in case (a sketch of such a pool follows below). I
should mention that we hold onto mbufs for a while (up to 30s) and
have many of them available (~8 GB worth at ~2 KB each).

The entire app is fairly short (2k-3k lines), so it's not too hard to
review. We'll take another look for buffer overruns.

Thanks,
Stefan

Sent from my smart phone; people don't make typos, Swype does!

On Jun 24, 2014 7:02 AM, "Olivier MATZ" <olivier.matz@6wind.com> wrote:
> On 06/24/2014 12:48 PM, Neil Horman wrote:
>> Sounds like it might be time to add in some poisoning options to the
>> mbuf allocator.
>
> Such options already exist: CONFIG_RTE_LIBRTE_MBUF_DEBUG and
> RTE_LIBRTE_MEMPOOL_DEBUG enable some runtime checks, at least for
> double frees and buffer overflows.
>
> Olivier
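For reference, a cache-less pool of roughly the size described might be
created like this (a sketch against the 1.6-era API; the pool name and
counts are illustrative):

    #include <rte_mempool.h>
    #include <rte_mbuf.h>
    #include <rte_lcore.h>

    #define NB_MBUF   4194304U  /* ~8 GB at ~2 KB per mbuf */
    #define MBUF_SIZE (2048 + sizeof(struct rte_mbuf) + \
                       RTE_PKTMBUF_HEADROOM)

    static struct rte_mempool *create_big_pool(void)
    {
        /* cache_size = 0: every alloc/free hits the shared ring
         * directly, bypassing the per-lcore caches entirely. */
        return rte_mempool_create("mbuf_pool", NB_MBUF, MBUF_SIZE,
                                  0 /* cache_size */,
                                  sizeof(struct rte_pktmbuf_pool_private),
                                  rte_pktmbuf_pool_init, NULL,
                                  rte_pktmbuf_init, NULL,
                                  rte_socket_id(), 0);
    }

With cache_size set to 0, every alloc and free goes straight to the
pool's multi-producer/multi-consumer ring rather than through a
per-lcore cache indexed by rte_lcore_id(), which is exactly the cache
Mark described being corrupted by alloc/free from a non-DPDK pthread.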