From: Alex Kiselev
Date: Mon, 10 Jul 2017 15:37:58 +0300
To: users
Subject: [dpdk-users] bonding driver LACP mode issues

Hello.

I've managed to gather more information about my problem, and it looks
like I have pinpointed its source when my LACP bond port stops
forwarding packets.

At first I thought the cause of the problem was the LACP protocol. But
turning on RTE_LIBRTE_BOND_DEBUG_8023AD showed that both switch ports
(21, 22) and my app's bond ports (0, 1) are perfectly synchronized.

On the switch:

xx # sho lacp lag 21

Lag   Actor    Actor   Partner            Partner  Partner  Agg    Actor
      Sys-Pri  Key     MAC                Sys-Pri  Key      Count  MAC
--------------------------------------------------------------------------------
21    0        0x03fd  00:e0:ed:7b:ce:08  65535    0x0021   2      00:04:96:83:6d:2f

Port list:

Member  Port      Rx       Sel       Mux           Actor     Partner
Port    Priority  State    Logic     State         Flags     Port
--------------------------------------------------------------------------------
21      0         Current  Selected  Collect-Dist  A-GSCD--  1
22      0         Current  Selected  Collect-Dist  A-GSCD--  2
================================================================================
Actor Flags: A-Activity, T-Timeout, G-Aggregation, S-Synchronization
             C-Collecting, D-Distributing, F-Defaulted, E-Expired

In my app's log:

Jul 10 16:38:31 xxx the_router.lag[22009]: PMD: 250434656 [Port 0: tx_machine] sending LACP frame
Jul 10 16:38:31 xxx the_router.lag[22009]: PMD: LACP: { subtype= 01 ver_num=01
  actor={ tlv=01, len=14 pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0100 state={ ACT AGG SYNC COL DIST } }
  partner={ tlv=02, len=14 pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FD03 state={ ACT AGG SYNC COL DIST } }
  collector={info=03, length=10, max_delay=0000 , type_term=00, terminator_length = 00}

Jul 10 16:38:33 bizin the_router.lag[22009]: PMD: 250436556 [Port 0: rx_machine] LACP -> CURRENT
Jul 10 16:38:33 bizin the_router.lag[22009]: PMD: LACP: { subtype= 01 ver_num=01
  actor={ tlv=01, len=14 pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FD03 state={ ACT AGG SYNC COL DIST } }
  partner={ tlv=02, len=14 pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0100 state={ ACT AGG SYNC COL DIST } }
  collector={info=03, length=10, max_delay=0000 , type_term=00, terminator_length = 00}

Jul 10 16:40:24 bizin the_router.lag[22009]: PMD: 250547261 [Port 1: tx_machine] sending LACP frame
Jul 10 16:40:24 bizin the_router.lag[22009]: PMD: LACP: { subtype= 01 ver_num=01
  actor={ tlv=01, len=14 pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0200 state={ ACT AGG SYNC COL DIST } }
  partner={ tlv=02, len=14 pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FE03 state={ ACT AGG SYNC COL DIST } }
  collector={info=03, length=10, max_delay=0000 , type_term=00, terminator_length = 00}

Jul 10 16:40:28 bizin the_router.lag[22009]: PMD: 250551162 [Port 1: rx_machine] LACP -> CURRENT
Jul 10 16:40:28 bizin the_router.lag[22009]: PMD: LACP: { subtype= 01 ver_num=01
  actor={ tlv=01, len=14 pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FE03 state={ ACT AGG SYNC COL DIST } }
  partner={ tlv=02, len=14 pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0200 state={ ACT AGG SYNC COL DIST } }
  collector={info=03, length=10, max_delay=0000 , type_term=00, terminator_length = 00}

Then I started looking at TX errors. My test is to send ICMP echo
requests and expect my app to send the replies back. In some cases all
reply packets are dropped because rte_eth_tx_burst reports that none of
the packets were sent; in the rest of the cases I receive all ICMP
replies with zero packet loss. rte_eth_stats_get also reports that no
packets are transmitted on slave ports 0 and 1 when I am not receiving
echo replies.

So it looks like one bonding slave port fails to send packets while the
other slave port has no problem sending. At the same time, both bonding
ports have no problem sending LACPDU packets. I am not sure whether both
slave ports receive packets normally, as the switch sends all test ICMP
streams from the same port.

Also, rte_eth_bond_slaves_get and rte_eth_bond_active_slaves_get report
that the bond port has 2 slaves. That is correct: the bond port was
created with 2 slaves.

xxx ~ # rcli sh port bond stat 3
bond port 3:
  slaves: 0, 1
  active slaves: 0, 1

Looking at the source code of the bonding driver has not turned up
anything so far.

So the question is: why does the bonding driver stop sending packets
after some time of normal operation (last time the app had been working
for 4 days)? Is there anything else I can do to troubleshoot this
situation?

I would appreciate any help. Thank you in advance.

-- 
Alex Kiselev
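
P.S. A note on comparing the switch output with the debug log: the 16-bit fields in the log appear byte-swapped relative to the switch (key=FD03 in the log versus actor key 0x03fd on the switch, key=2100 versus partner key 0x0021). Below is a minimal sketch of that comparison in plain C. It assumes the log prints these fields in network byte order without converting them; that is only a reading of the output above, not something confirmed in the driver source. The helper names swap16 and lacp_flags_str are made up for illustration and are not part of DPDK. The flag bit positions follow the standard LACP state octet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a DPDK API): swap the two bytes of a 16-bit
 * value, e.g. turn the logged key FD03 into the switch's 0x03fd. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Hypothetical helper: render an LACP state octet in the switch's
 * "A-GSCD--" notation. Bit 0 first: Activity, Timeout, Aggregation,
 * Synchronization, Collecting, Distributing, Defaulted, Expired. */
static void lacp_flags_str(uint8_t state, char out[9])
{
    static const char letters[8] = { 'A', 'T', 'G', 'S', 'C', 'D', 'F', 'E' };

    assert(out != NULL);
    for (int i = 0; i < 8; i++)
        out[i] = (state & (1u << i)) ? letters[i] : '-';
    out[8] = '\0';
}
```

With these, swap16(0xFD03) gives 0x03fd and swap16(0x2100) gives 0x0021, and a state octet of 0x3D (ACT AGG SYNC COL DIST) renders as A-GSCD--, the same flags the switch shows for ports 21 and 22.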