DPDK patches and discussions
* [dpdk-dev] URGENT please help. Issue on ixgbe_tx_free_bufs version 2.0.0
@ 2015-11-10  4:35 Ariel Rodriguez
  2015-11-10 10:54 ` Bruce Richardson
  0 siblings, 1 reply; 5+ messages in thread
From: Ariel Rodriguez @ 2015-11-10  4:35 UTC (permalink / raw)
  To: dev

Dear dpdk experts.

I'm having a recurrent segmentation fault in the
function ixgbe_tx_free_bufs (ixgbe_rxtx.c:150) (built with -g3 -O0).

Digging through the core dump I found this:

txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);

txq->tx_next_dd = 31
txq->tx_rs_thresh = 32

Obviously txep points to the first element of the sw_ring, but

*(txep).mbuf == INVALID MBUF ADDRESS

The same applies to

*(txep+1).mbuf; *(txep+2).mbuf; *(txep+3).mbuf

From *(txep+4).mbuf to *(txep+31).mbuf the entries seem to be valid, because I am
able to dereference the mbufs.
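
To make the indexing concrete, here is a tiny self-contained sketch (with
simplified placeholder types, not the real driver structures) of what that
line computes given the values above:

#include <stdint.h>
#include <stdio.h>

struct tx_entry { void *mbuf; };      /* stand-in for struct ixgbe_tx_entry */

struct tx_queue {                     /* stand-in for struct ixgbe_tx_queue */
        struct tx_entry sw_ring[1024];
        uint16_t tx_next_dd;          /* 31 in my core dump */
        uint16_t tx_rs_thresh;        /* 32 in my core dump */
};

int main(void)
{
        struct tx_queue txq = { .tx_next_dd = 31, .tx_rs_thresh = 32 };

        /* Same arithmetic as ixgbe_tx_free_bufs(): the batch of completed
         * buffers starts at index tx_next_dd - (tx_rs_thresh - 1). */
        struct tx_entry *txep =
                &txq.sw_ring[txq.tx_next_dd - (txq.tx_rs_thresh - 1)];

        /* With the values above this prints 0, i.e. the first sw_ring entry;
         * the driver then walks tx_rs_thresh entries forward and calls
         * rte_pktmbuf_free_seg() on each ->mbuf, which is where I crash. */
        printf("first entry freed: index %td\n", txep - txq.sw_ring);
        return 0;
}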


Note:

I disabled CONFIG_RTE_IXGBE_INC_VECTOR because I got similar behaviour with it
enabled; I thought the problem would disappear after disabling that feature.


The program always runs fine for 4 or 5 hours and then crashes ... always
on the same line.

This is the backtrace of the program:

#0  0x0000000000677a64 in rte_atomic16_read (v=0x47dc14c18b14) at
/opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/generic/rte_atomic.h:151
#1  0x0000000000677c1d in rte_mbuf_refcnt_read (m=0x47dc14c18b00) at
/opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:411
#2  0x000000000067a13c in __rte_pktmbuf_prefree_seg (m=0x47dc14c18b00) at
/opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:778
#3  rte_pktmbuf_free_seg (m=0x47dc14c18b00) at
/opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:810
#4  ixgbe_tx_free_bufs (txq=0x7ffb40ae52c0) at
/opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:150
#5  tx_xmit_pkts (tx_queue=0x7ffb40ae52c0, tx_pkts=0x64534770 <app+290608>,
nb_pkts=32) at /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:256
#6  0x000000000067c6f3 in ixgbe_xmit_pkts_simple (tx_queue=0x7ffb40ae52c0,
tx_pkts=0x64534570 <app+290096>, nb_pkts=80) at
/opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:343
#7  0x00000000004ec93d in rte_eth_tx_burst (port_id=1 '\001', queue_id=0,
tx_pkts=0x64534570 <app+290096>, nb_pkts=144) at
/opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2572


The TX queue from frame 5 (print *txq):

{tx_ring = 0x7ffd9d3f1880, tx_ring_phys_addr = 79947569280, sw_ring =
0x7ffb40ae1280, tdt_reg_addr = 0x7fff0002a018, nb_tx_desc = 1024, tx_tail =
1008, tx_free_thresh = 32, tx_rs_thresh = 32, nb_tx_used = 0,
  last_desc_cleaned = 1023, nb_tx_free = 15, tx_next_dd = 31, tx_next_rs =
1023, queue_id = 0, reg_idx = 0, port_id = 1 '\001', pthresh = 32 ' ',
hthresh = 0 '\000', wthresh = 0 '\000', txq_flags = 3841, ctx_curr = 0,
ctx_cache = {{
      flags = 0, tx_offload = {data = 0, {l2_len = 0, l3_len = 0, l4_len =
0, tso_segsz = 0, vlan_tci = 0}}, tx_offload_mask = {data = 0, {l2_len = 0,
l3_len = 0, l4_len = 0, tso_segsz = 0, vlan_tci = 0}}}, {flags = 0,
tx_offload = {
        data = 0, {l2_len = 0, l3_len = 0, l4_len = 0, tso_segsz = 0,
vlan_tci = 0}}, tx_offload_mask = {data = 0, {l2_len = 0, l3_len = 0,
l4_len = 0, tso_segsz = 0, vlan_tci = 0}}}}, ops = 0x7616d0 <def_txq_ops>,
  tx_deferred_start = 0 '\000'}



Please help me !!!

Regards.

Ariel Rodriguez.


* Re: [dpdk-dev] URGENT please help. Issue on ixgbe_tx_free_bufs version 2.0.0
  2015-11-10  4:35 [dpdk-dev] URGENT please help. Issue on ixgbe_tx_free_bufs version 2.0.0 Ariel Rodriguez
@ 2015-11-10 10:54 ` Bruce Richardson
  2015-11-10 11:50   ` Ariel Rodriguez
  0 siblings, 1 reply; 5+ messages in thread
From: Bruce Richardson @ 2015-11-10 10:54 UTC (permalink / raw)
  To: Ariel Rodriguez; +Cc: dev

On Tue, Nov 10, 2015 at 01:35:21AM -0300, Ariel Rodriguez wrote:
> Dear dpdk experts.
> 
> I'm having a recurrent segmentation fault in the
> function ixgbe_tx_free_bufs (ixgbe_rxtx.c:150) (built with -g3 -O0).
> 
> Digging through the core dump I found this:
> 
> txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
> 
> txq->tx_next_dd = 31
> txq->tx_rs_thresh = 32
> 
> Obviously txep points to the first element of the sw_ring, but
> 
> *(txep).mbuf == INVALID MBUF ADDRESS
> 
> The same applies to
> 
> *(txep+1).mbuf; *(txep+2).mbuf; *(txep+3).mbuf
> 
> From *(txep+4).mbuf to *(txep+31).mbuf the entries seem to be valid, because I am
> able to dereference the mbufs.
> 
> 
> Note:
> 
> I disabled CONFIG_RTE_IXGBE_INC_VECTOR because I got similar behaviour with it
> enabled; I thought the problem would disappear after disabling that feature.
> 
> 
> The program always runs fine for 4 or 5 hours and then crashes ... always
> on the same line.
> 
> This is the backtrace of the program:
> 
> #0  0x0000000000677a64 in rte_atomic16_read (v=0x47dc14c18b14) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/generic/rte_atomic.h:151
> #1  0x0000000000677c1d in rte_mbuf_refcnt_read (m=0x47dc14c18b00) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:411
> #2  0x000000000067a13c in __rte_pktmbuf_prefree_seg (m=0x47dc14c18b00) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:778
> #3  rte_pktmbuf_free_seg (m=0x47dc14c18b00) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:810
> #4  ixgbe_tx_free_bufs (txq=0x7ffb40ae52c0) at
> /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:150
> #5  tx_xmit_pkts (tx_queue=0x7ffb40ae52c0, tx_pkts=0x64534770 <app+290608>,
> nb_pkts=32) at /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:256
> #6  0x000000000067c6f3 in ixgbe_xmit_pkts_simple (tx_queue=0x7ffb40ae52c0,
> tx_pkts=0x64534570 <app+290096>, nb_pkts=80) at
> /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:343
> #7  0x00000000004ec93d in rte_eth_tx_burst (port_id=1 '\001', queue_id=0,
> tx_pkts=0x64534570 <app+290096>, nb_pkts=144) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2572
>
Hi,

I'd like a bit more information to help debug your problem:
* What application are you running when you see this crash? If it's an app of your
own making, can you reproduce the crash using one of the standard DPDK apps or
example apps, e.g. testpmd, l2fwd, etc.?

* Can you also try to verify if the crash occurs with the latest DPDK code available
in git from dpdk.org?

Regards,
/Bruce


* Re: [dpdk-dev] URGENT please help. Issue on ixgbe_tx_free_bufs version 2.0.0
  2015-11-10 10:54 ` Bruce Richardson
@ 2015-11-10 11:50   ` Ariel Rodriguez
  2015-11-15 22:58     ` Ariel Rodriguez
  0 siblings, 1 reply; 5+ messages in thread
From: Ariel Rodriguez @ 2015-11-10 11:50 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Thank you very much for your rapid response.

1) The I/O part is the same as in the load balancer example; the worker part is
different. The TX part also uses the QoS scheduler framework. I will try to run the
example and see what happens.

2) Yes, I can. I will do that too.

The NIC is an 82599ES 10-Gigabit SFI/SFP+ receiving tapped traffic (it is a hardware
bypass device from the Silicom vendor).

I developed a similar app without the TX part. It just receives a copy of the
traffic (around 6 Gbps and 400000 concurrent flows) and then frees the mbufs.
It works like a charm.

This issue is strange ... If I disable the QoS scheduler code and make the TX
code drop all packets instead of calling rte_eth_tx_burst (which is like disabling
the TX core), the issue shows up in rte_eth_rx_burst returning corrupted
mbufs (RX core).
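
For clarity, the "drop instead of transmit" variant I mean is roughly this (a
hypothetical sketch, not our exact code; tx_drop_burst is a made-up name):

#include <rte_mbuf.h>

/* TX core with transmission disabled: instead of handing the burst to
 * rte_eth_tx_burst(), every mbuf is freed straight back to its mempool. */
static void
tx_drop_burst(struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        uint16_t i;

        for (i = 0; i < nb_pkts; i++)
                rte_pktmbuf_free(pkts[i]);
}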

Could the NIC be behaving abnormally?

I will try the two things you mentioned.

Regards.

Ariel Horacio Rodriguez
On Tue, Nov 10, 2015 at 01:35:21AM -0300, Ariel Rodriguez wrote:
> Dear dpdk experts.
>
> I'm having a recurrent segmentation fault in the
> function ixgbe_tx_free_bufs (ixgbe_rxtx.c:150) (built with -g3 -O0).
>
> Digging through the core dump I found this:
>
> txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
>
> txq->tx_next_dd = 31
> txq->tx_rs_thresh = 32
>
> Obviously txep points to the first element of the sw_ring, but
>
> *(txep).mbuf == INVALID MBUF ADDRESS
>
> The same applies to
>
> *(txep+1).mbuf; *(txep+2).mbuf; *(txep+3).mbuf
>
> From *(txep+4).mbuf to *(txep+31).mbuf the entries seem to be valid, because I am
> able to dereference the mbufs.
>
>
> Note:
>
> I disabled CONFIG_RTE_IXGBE_INC_VECTOR because I got similar behaviour with it
> enabled; I thought the problem would disappear after disabling that feature.
>
>
> The program always runs fine for 4 or 5 hours and then crashes ... always
> on the same line.
>
> This is the backtrace of the program:
>
> #0  0x0000000000677a64 in rte_atomic16_read (v=0x47dc14c18b14) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/generic/rte_atomic.h:151
> #1  0x0000000000677c1d in rte_mbuf_refcnt_read (m=0x47dc14c18b00) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:411
> #2  0x000000000067a13c in __rte_pktmbuf_prefree_seg (m=0x47dc14c18b00) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:778
> #3  rte_pktmbuf_free_seg (m=0x47dc14c18b00) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:810
> #4  ixgbe_tx_free_bufs (txq=0x7ffb40ae52c0) at
> /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:150
> #5  tx_xmit_pkts (tx_queue=0x7ffb40ae52c0, tx_pkts=0x64534770 <app+290608>,
> nb_pkts=32) at /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:256
> #6  0x000000000067c6f3 in ixgbe_xmit_pkts_simple (tx_queue=0x7ffb40ae52c0,
> tx_pkts=0x64534570 <app+290096>, nb_pkts=80) at
> /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:343
> #7  0x00000000004ec93d in rte_eth_tx_burst (port_id=1 '\001', queue_id=0,
> tx_pkts=0x64534570 <app+290096>, nb_pkts=144) at
> /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2572
>
Hi,

I'd like a bit more information to help debug your problem:
* What application are you running when you see this crash? If it's an app of your
own making, can you reproduce the crash using one of the standard DPDK apps or
example apps, e.g. testpmd, l2fwd, etc.?

* Can you also try to verify if the crash occurs with the latest DPDK code available
in git from dpdk.org?

Regards,
/Bruce


* Re: [dpdk-dev] URGENT please help. Issue on ixgbe_tx_free_bufs version 2.0.0
  2015-11-10 11:50   ` Ariel Rodriguez
@ 2015-11-15 22:58     ` Ariel Rodriguez
  2015-11-17 10:33       ` Bruce Richardson
  0 siblings, 1 reply; 5+ messages in thread
From: Ariel Rodriguez @ 2015-11-15 22:58 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Hi Bruce, I'm going to list the results of the tests.

I will start with the second hint you proposed:

2) I upgraded our custom DPDK application to the latest DPDK code (2.1.0)
and the issue is still there.

1) I tested the load balancer app with the latest DPDK code (2.1.0) on the
82599ES 10-Gigabit SFI/SFP+ NIC with tapped traffic, and the results are:

   a) It worked fine after 6 hours of running. (For timing reasons I can't wait
longer, but the issue always happened within 5 hours of running, so I suppose
we are fine in this test.)

   b) I made a change to the load balancer code so that the worker code behaves
like our DPDK application. The change just gives the worker code enough load (in
terms of core cycles) that the RX core drops several packets because the ring
between the workers and the RX core is full. (Our application drops several
packets because the worker code is not fast enough.) A rough sketch of the kind
of change I made is shown below.

       In the last test, the segmentation fault arose, on exactly the same
line that I previously reported.
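
The change in (b) was along these lines (a rough, hypothetical sketch of the
idea, not the actual diff; the cycle count is arbitrary):

#include <rte_cycles.h>
#include <rte_mbuf.h>

/* Extra per-packet work in the load balancer worker loop, just to burn core
 * cycles so the ring between the RX core and the workers fills up and the RX
 * core starts dropping packets. */
static void
worker_burn_cycles(struct rte_mbuf *pkt)
{
        uint64_t start = rte_rdtsc();

        (void)pkt;                    /* the real app inspects the packet here */
        while (rte_rdtsc() - start < 2000)
                ;                     /* spin for roughly 2000 TSC cycles */
}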

Debugging and reading the code in ixgbe_rxtx.c, I see some weird things:

  - The core dump of the issue is always around line 260 of the ixgbe_rxtx.c
code.
  - Looking at the function "ixgbe_tx_free_bufs" at line 132, I understand there
is a test for the RS-bit write-back mechanism: IXGBE_ADVTXD_STAT_DD is set, and
then the code takes the ixgbe_tx_entry pointer from the sw_ring in the TX queue
(variable name txep).

  - The txep->mbuf entry is totally corrupted because it holds an invalid memory
address; I compared that address with the mbuf mempool and it is not even close
to being valid. But the address of the ixgbe_tx_entry itself is valid and within
the range of the zmalloc'd software-ring structure constructed at initialization.

  - The txep pointer is the first one in the sw_ring, because txq->tx_next_dd is
31 and txq->tx_rs_thresh is 32:
    txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);

  - txq->tx_rs_thresh is 32. I use the default values, just passing NULL in the
corresponding *_queue_setup functions.

  - The weirdest thing is that the next entry in the software ring (the next
ixgbe_tx_entry) is valid and has a valid mbuf memory address.

I don't know how to continue, because I'm trying to find out where I could be
corrupting the mbuf associated with the ixgbe_tx_entry. I debugged and tested
every part of the worker core code, looking for a bad mbuf or mbuf corruption
before it is enqueued on the TX ring. The TX core and the RX core are the same
as in the load balancer app (this applies to our application), and there is no
issue there. If there were corruption of the mbuf in the worker code, the
segmentation fault should show up before the TX queue ring enqueue. (I test
several fields of the mbuf before enqueuing it: the ->port field, ->data_len
... etc.)
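
The check I run before enqueuing on the TX ring looks roughly like this (a
sketch; check_mbuf_before_tx is a made-up name for illustration):

#include <rte_mbuf.h>

/* Returns 0 if the mbuf looks sane, -1 otherwise. */
static int
check_mbuf_before_tx(struct rte_mbuf *m, uint8_t expected_port)
{
        if (m == NULL)
                return -1;
        if (rte_mbuf_refcnt_read(m) == 0)
                return -1;                    /* already freed? */
        if (m->port != expected_port)
                return -1;
        if (m->data_len == 0 || m->data_len > m->pkt_len)
                return -1;

        /* Optional: rte_mbuf_sanity_check() panics with a reason if the
         * mbuf is malformed. */
        rte_mbuf_sanity_check(m, 1);
        return 0;
}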

In the second test of the load balancer app, I could not see a relationship
between the packets dropped in the RX core and the mbuf corruption in the
ixgbe_tx_entry.


Waiting for some advice...

Regards

Ariel Horacio Rodriguez.

On Tue, Nov 10, 2015 at 8:50 AM, Ariel Rodriguez <arodriguez@callistech.com>
wrote:

> Thank you very much for your rapid response.
>
> 1) The I/O part is the same as in the load balancer example; the worker part is
> different. The TX part also uses the QoS scheduler framework. I will try to run
> the example and see what happens.
>
> 2) Yes, I can. I will do that too.
>
> The NIC is an 82599ES 10-Gigabit SFI/SFP+ receiving tapped traffic (it is a
> hardware bypass device from the Silicom vendor).
>
> I developed a similar app without the TX part. It just receives a copy of the
> traffic (around 6 Gbps and 400000 concurrent flows) and then frees the mbufs.
> It works like a charm.
>
> This issue is strange ... If I disable the QoS scheduler code and make the TX
> code drop all packets instead of calling rte_eth_tx_burst (which is like
> disabling the TX core), the issue shows up in rte_eth_rx_burst returning
> corrupted mbufs (RX core).
>
> Could the NIC be behaving abnormally?
>
> I will try the two things you mentioned.
>
> Regards.
>
> Ariel Horacio Rodriguez
> On Tue, Nov 10, 2015 at 01:35:21AM -0300, Ariel Rodriguez wrote:
> > Dear dpdk experts.
> >
> > I'm having a recurrent segmentation fault in the
> > function ixgbe_tx_free_bufs (ixgbe_rxtx.c:150) (built with -g3 -O0).
> >
> > Digging through the core dump I found this:
> >
> > txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
> >
> > txq->tx_next_dd = 31
> > txq->tx_rs_thresh = 32
> >
> > Obviously txep points to the first element of the sw_ring, but
> >
> > *(txep).mbuf == INVALID MBUF ADDRESS
> >
> > The same applies to
> >
> > *(txep+1).mbuf; *(txep+2).mbuf; *(txep+3).mbuf
> >
> > From *(txep+4).mbuf to *(txep+31).mbuf the entries seem to be valid, because
> > I am able to dereference the mbufs.
> >
> >
> > Note:
> >
> > I disabled CONFIG_RTE_IXGBE_INC_VECTOR because I got similar behaviour with
> > it enabled; I thought the problem would disappear after disabling that feature.
> >
> >
> > The program always runs fine for 4 or 5 hours and then crashes ... always
> > on the same line.
> >
> > This is the backtrace of the program:
> >
> > #0  0x0000000000677a64 in rte_atomic16_read (v=0x47dc14c18b14) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/generic/rte_atomic.h:151
> > #1  0x0000000000677c1d in rte_mbuf_refcnt_read (m=0x47dc14c18b00) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:411
> > #2  0x000000000067a13c in __rte_pktmbuf_prefree_seg (m=0x47dc14c18b00) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:778
> > #3  rte_pktmbuf_free_seg (m=0x47dc14c18b00) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:810
> > #4  ixgbe_tx_free_bufs (txq=0x7ffb40ae52c0) at
> > /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:150
> > #5  tx_xmit_pkts (tx_queue=0x7ffb40ae52c0, tx_pkts=0x64534770 <app+290608>,
> > nb_pkts=32) at /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:256
> > #6  0x000000000067c6f3 in ixgbe_xmit_pkts_simple (tx_queue=0x7ffb40ae52c0,
> > tx_pkts=0x64534570 <app+290096>, nb_pkts=80) at
> > /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:343
> > #7  0x00000000004ec93d in rte_eth_tx_burst (port_id=1 '\001', queue_id=0,
> > tx_pkts=0x64534570 <app+290096>, nb_pkts=144) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2572
> >
> Hi,
>
> I'd like a bit more information to help debug your problem:
> * What application are you running when you see this crash? If it's an app of
> your own making, can you reproduce the crash using one of the standard DPDK
> apps or example apps, e.g. testpmd, l2fwd, etc.?
>
> * Can you also try to verify if the crash occurs with the latest DPDK code
> available in git from dpdk.org?
>
> Regards,
> /Bruce
>


* Re: [dpdk-dev] URGENT please help. Issue on ixgbe_tx_free_bufs version 2.0.0
  2015-11-15 22:58     ` Ariel Rodriguez
@ 2015-11-17 10:33       ` Bruce Richardson
  0 siblings, 0 replies; 5+ messages in thread
From: Bruce Richardson @ 2015-11-17 10:33 UTC (permalink / raw)
  To: Ariel Rodriguez; +Cc: dev

On Sun, Nov 15, 2015 at 07:58:27PM -0300, Ariel Rodriguez wrote:
> Hi Bruce, I'm going to list the results of the tests.
> 
> I will start with the second hint you proposed:
> 
> 2) I upgraded our custom DPDK application to the latest DPDK code (2.1.0)
> and the issue is still there.
> 
> 1) I tested the load balancer app with the latest DPDK code (2.1.0) on the
> 82599ES 10-Gigabit SFI/SFP+ NIC with tapped traffic, and the results are:
> 
>    a) It worked fine after 6 hours of running. (For timing reasons I can't wait
> longer, but the issue always happened within 5 hours of running, so I suppose
> we are fine in this test.)
> 
>    b) I made a change to the load balancer code so that the worker code behaves
> like our DPDK application. The change just gives the worker code enough load (in
> terms of core cycles) that the RX core drops several packets because the ring
> between the workers and the RX core is full. (Our application drops several
> packets because the worker code is not fast enough.)
> 
>        In the last test, the segmentation fault arose, on exactly the same
> line that I previously reported.
> 
What is the workload you are putting on the worker core? Can you provide a
diff for the load balancer app that reproduces this issue, since from your
description the problem may be in the extra code that was added.

/Bruce


end of thread, other threads:[~2015-11-17 10:33 UTC | newest]

Thread overview: 5+ messages
2015-11-10  4:35 [dpdk-dev] URGENT please help. Issue on ixgbe_tx_free_bufs version 2.0.0 Ariel Rodriguez
2015-11-10 10:54 ` Bruce Richardson
2015-11-10 11:50   ` Ariel Rodriguez
2015-11-15 22:58     ` Ariel Rodriguez
2015-11-17 10:33       ` Bruce Richardson
