From: Ariel Rodriguez
To: Bruce Richardson
Cc: "dev@dpdk.org"
Date: Sun, 15 Nov 2015 19:58:27 -0300
Subject: Re: [dpdk-dev] URGENT please help. Issue on ixgbe_tx_free_bufs version 2.0.0
References: <20151110105417.GD29836@bricha3-MOBL3>

Hi Bruce,

Here are the results of the tests. I will start with the second hint you
proposed:

2) I upgraded our custom DPDK application to the latest DPDK code (2.1.0)
and the issue is still there.

1) I tested the load balancer app with the latest DPDK code (2.1.0) on the
82599ES 10-Gigabit SFI/SFP+ NIC with tapped traffic. The results are:

a) It worked fine for 6 hours. (For timing reasons I can't wait longer, but
the issue always happened before 5 hours of running, so I assume we are
fine in this test.)

b) I changed the load balancer code so that its workers behave like the
workers in our DPDK application. The change just gives the worker code
enough load (in terms of core frequency) that the rx core drops packets
because the ring between the workers and the rx core is full. (Our
application drops packets because the workers are not fast enough.)

In this last test the segmentation fault arose, at the same line that I
previously reported.

Debugging and reading the code in ixgbe_rxtx.c, I see some weird things.

- The core dump of the issue is always around line 260 in ixgbe_rxtx.c.
- Looking at the function "ixgbe_tx_free_bufs" at line 132, I understand
there is a test of the RS bit write-back mechanism. The
IXGBE_ADVTXD_STAT_DD bit is set, and then the code casts the entry from the
sw_ring in the tx queue to ixgbe_tx_entry (variable name txep).
- The txep->mbuf entry is totally corrupted because it holds an invalid
memory address. I compared that address with the mbuf mempool and it is not
even close to being valid. The address of the ixgbe_tx_entry itself is
valid, though, and lies within the zmalloc'ed software ring structure
built at initialization.
- The txep pointer is the first one in the sw_ring, because
txq->tx_next_dd is 31 and txq->tx_rs_thresh is 32:

  txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);

- txq->tx_rs_thresh is 32. I use the default values, just passing NULL to
the corresponding *_queue_setup functions.
- The weirdest thing is that the next entry in the software ring (the next
ixgbe_tx_entry) is valid and has a valid mbuf memory address.

I don't know how to continue, because I'm trying to find out where I could
be corrupting the mbuf associated with the ixgbe_tx_entry. I debugged and
tested every part of the worker core code, looking for a bad mbuf or an
mbuf corruption before the enqueue on the tx ring. The tx core and the rx
core are the same as the ones in the load balancer app (this applies to
our application), so there is no issue there. If the mbuf were corrupted in
the worker code, the segmentation fault would have to happen before the tx
ring enqueue. (I check several fields of the mbuf before enqueuing it:
->port, ->data_len, etc.)

In the second test with the load balancer app I could not see a
relationship between the packet drops in the rx core and the mbuf
corruption in the ixgbe_tx_entry.

Waiting for some advice...

Regards,

Ariel Horacio Rodriguez.

On Tue, Nov 10, 2015 at 8:50 AM, Ariel Rodriguez wrote:

> Thank you very much for your rapid response.
>
> 1) The IO part is the same as the load balancer. The worker part is
> different. The tx part also uses the qos scheduler framework. I will try
> to run the example and see what happens.
>
> 2) Yes, I can. I will do that too.
>
> The NIC is an 82599ES 10-Gigabit SFI/SFP+ with tapped traffic (it is a
> hardware bypass device from the vendor Silicom).
>
> I developed a similar app without the tx part. It just receives a copy of
> the traffic (around 6 Gbps and 400000 concurrent flows) and then frees the
> mbufs. It works like a charm.
>
> This issue is strange... If I disable the qos scheduler code and the tx
> code, dropping all packets instead of calling rte_eth_tx_burst (which is
> like disabling the tx core), the issue happens in rte_eth_rx_burst, which
> returns a corrupted mbuf (rx core).
>
> Could the NIC be behaving abnormally?
>
> I will try the 2 things you suggested.
>
> Regards.
>
> Ariel Horacio Rodriguez
> On Tue, Nov 10, 2015 at 01:35:21AM -0300, Ariel Rodriguez wrote:
> > Dear dpdk experts.
> >
> > I'm having a recurrent segmentation fault in the
> > function ixgbe_tx_free_bufs (ixgbe_rxtx.c:150) (I enabled -g3 -O0).
> >
> > Looking through the core dump I found this:
> >
> > txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
> >
> > txq->tx_next_dd = 31
> > txq->tx_rs_thresh = 32
> >
> > Obviously txep points to the first element, but
> >
> > *(txep).mbuf == INVALID MBUF ADDRESS
> >
> > The same applies to
> >
> > *(txep+1).mbuf ; *(txep+2).mbuf ; *(txep+3).mbuf
> >
> > The entries from *(txep+4).mbuf to *(txep+31).mbuf seem to be valid,
> > because I'm able to dereference the mbufs.
> >
> >
> > Note:
> >
> > I disabled CONFIG_RTE_IXGBE_INC_VECTOR because I was getting similar
> > behavior, and I thought the problem would disappear with that feature
> > disabled.
> >
> >
> > The program always runs well for 4 or 5 hours and then crashes, always
> > at the same line.
> >
> > this is the backtrace of the program:
> >
> > #0 0x0000000000677a64 in rte_atomic16_read (v=0x47dc14c18b14) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/generic/rte_atomic.h:151
> > #1 0x0000000000677c1d in rte_mbuf_refcnt_read (m=0x47dc14c18b00) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:411
> > #2 0x000000000067a13c in __rte_pktmbuf_prefree_seg (m=0x47dc14c18b00) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:778
> > #3 rte_pktmbuf_free_seg (m=0x47dc14c18b00) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:810
> > #4 ixgbe_tx_free_bufs (txq=0x7ffb40ae52c0) at
> > /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:150
> > #5 tx_xmit_pkts (tx_queue=0x7ffb40ae52c0, tx_pkts=0x64534770,
> > nb_pkts=32) at /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:256
> > #6 0x000000000067c6f3 in ixgbe_xmit_pkts_simple (tx_queue=0x7ffb40ae52c0,
> > tx_pkts=0x64534570, nb_pkts=80) at
> > /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:343
> > #7 0x00000000004ec93d in rte_eth_tx_burst (port_id=1 '\001', queue_id=0,
> > tx_pkts=0x64534570, nb_pkts=144) at
> > /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2572
> >
> Hi,
>
> I'd like a bit more information to help debug your problem:
> * what application are you running when you see this crash? If it's an app
> of your own making, can you reproduce the crash using one of the standard
> DPDK apps, or example apps, e.g. testpmd, l2fwd, etc.
>
> * Can you also try to verify if the crash occurs with the latest DPDK code
> available in git from dpdk.org?
>
> Regards,
> /Bruce
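
P.S. The index arithmetic and the pointer check described in this thread
can be sketched in plain C as below. This is only an illustration, not the
driver's code: the struct and function names (mbuf_stub, tx_entry_stub,
txq_stub, first_index_to_free, mbuf_in_pool, count_bad_entries) are
hypothetical stand-ins, and only sw_ring, tx_next_dd and tx_rs_thresh come
from the thread. The pool check also assumes one contiguous array with a
fixed element stride, which is a simplification and may not match how a
real mempool is laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: only the fields discussed in the thread. */
struct mbuf_stub { uint16_t refcnt; };

struct tx_entry_stub {
    struct mbuf_stub *mbuf;        /* pointer that was found corrupted */
};

struct txq_stub {
    struct tx_entry_stub *sw_ring; /* software ring of tx entries */
    uint16_t tx_next_dd;           /* index of the next descriptor to check */
    uint16_t tx_rs_thresh;         /* number of entries freed per batch */
};

/* Index of the first sw_ring entry in the batch about to be freed.
 * With tx_next_dd = 31 and tx_rs_thresh = 32 this is 31 - (32 - 1) = 0. */
static uint16_t first_index_to_free(const struct txq_stub *txq)
{
    assert(txq->tx_rs_thresh > 0);
    assert(txq->tx_next_dd >= txq->tx_rs_thresh - 1);
    return (uint16_t)(txq->tx_next_dd - (txq->tx_rs_thresh - 1));
}

/* Returns 1 if m points at the start of an element of the pool.
 * ASSUMPTION: the pool is one contiguous array of n_elts elements,
 * each elt_size bytes long. */
static int mbuf_in_pool(const void *m, const void *pool_base,
                        size_t elt_size, size_t n_elts)
{
    uintptr_t p = (uintptr_t)m;
    uintptr_t lo = (uintptr_t)pool_base;
    uintptr_t hi = lo + elt_size * n_elts;

    if (p < lo || p >= hi)
        return 0;
    return ((p - lo) % elt_size) == 0;
}

/* Counts the entries of the current batch whose mbuf pointer fails the
 * pool check, without dereferencing any of the pointers. */
static unsigned count_bad_entries(const struct txq_stub *txq,
                                  const void *pool_base,
                                  size_t elt_size, size_t n_elts)
{
    uint16_t first = first_index_to_free(txq);
    unsigned bad = 0;

    for (uint16_t i = 0; i < txq->tx_rs_thresh; i++) {
        if (!mbuf_in_pool(txq->sw_ring[first + i].mbuf,
                          pool_base, elt_size, n_elts))
            bad++;
    }
    return bad;
}
```

A check like count_bad_entries could be called on the batch right before
the entries are freed, so that a corrupted pointer is reported instead of
being dereferenced. It only shows that an entry is bad at that moment; it
does not show who overwrote it.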