DPDK usage discussions
* [dpdk-users] bonding driver LACP mode issues
@ 2017-07-13 23:28 Alex Kiselev
From: Alex Kiselev @ 2017-07-13 23:28 UTC
  To: users

Update #2

When tx queue 0 stops sending packets, the following messages appear in the syslog:

Jul 14 03:48:56 bizin the_router.lag[22550]: PMD: i40e_xmit_cleanup():
TX descriptor  351 is not done (port=1 queue=0)
Jul 14 03:48:56 bizin the_router.lag[22550]: PMD: i40e_xmit_cleanup():
TX descriptor  351 is not done (port=1 queue=0)
Jul 14 03:48:56 bizin the_router.lag[22550]: PMD: i40e_xmit_cleanup():
TX descriptor  351 is not done (port=1 queue=0)

Is it some kind of hardware or firmware problem?
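
In case it helps to narrow this down, here is a minimal probe I plan to
run when the stall happens, assuming the i40e PMD in this DPDK version
implements rte_eth_tx_descriptor_status() (a sketch only, not tested on
the broken queue yet):

#include <stdio.h>
#include <rte_ethdev.h>

/* Probe the next few TX descriptors of the stalled queue. If every
 * slot reports FULL and never flips to DONE, the NIC has stopped
 * writing back completions (the DD bit), which would fit the
 * i40e_xmit_cleanup() message above. */
static void
dump_txq_desc_status(uint8_t port_id, uint16_t queue_id)
{
	uint16_t off;

	for (off = 0; off < 8; off++) {
		int st = rte_eth_tx_descriptor_status(port_id, queue_id, off);

		printf("port %u txq %u desc +%u: %s\n",
		       (unsigned)port_id, (unsigned)queue_id, (unsigned)off,
		       st == RTE_ETH_TX_DESC_DONE ? "DONE" :
		       st == RTE_ETH_TX_DESC_FULL ? "FULL" :
		       st == RTE_ETH_TX_DESC_UNAVAIL ? "UNAVAIL" : "error");
	}
}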

--
Alex Kiselev


* [dpdk-users] bonding driver LACP mode issues
@ 2017-07-13 22:33 Alex Kiselev
From: Alex Kiselev @ 2017-07-13 22:33 UTC
  To: users

New info:

I've added an RTE_LOG call right after the rte_eth_tx_burst() call in
bond_ethdev_tx_burst_8023ad() in rte_eth_bond_pmd.c:

bond_ethdev_tx_burst_8023ad(...)
{
...
		num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
				slave_bufs[i], slave_nb_pkts[i]);

		RTE_LOG(DEBUG, PMD,
			"rte_eth_tx_burst: port_id %hhu, queue_id %hu, "
			"num pkts %hu, was sent %hu, slow pkts %hhu\n",
			slaves[i], bd_tx_q->queue_id, slave_nb_pkts[i],
			num_tx_slave, slave_slow_nb_pkts[i]);
...
}

When my app stopped forwarding packets, I set the log level to debug
and got some new debug data. It turned out that the problem is tx
queue #0: it is the only tx queue that stops sending packets, and it
does so on both slave ports 0 and 1.

tail -f /run/the_router | grep 'was sent 0' | grep 'port_id 0'

Jul 14 02:51:03 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 0, queue_id 0, num pkts 1, was sent 0, slow pkts 0
Jul 14 02:51:03 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 0, queue_id 0, num pkts 1, was sent 0, slow pkts 0
Jul 14 02:51:03 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 0, queue_id 0, num pkts 1, was sent 0, slow pkts 0

tail -f /run/the_router | grep 'was sent 0' | grep 'port_id 1'
Jul 14 02:44:45 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 1, queue_id 0, num pkts 1, was sent 0, slow pkts 0
Jul 14 02:44:46 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 1, queue_id 0, num pkts 31, was sent 0, slow pkts 0
Jul 14 02:44:46 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 1, queue_id 0, num pkts 1, was sent 0, slow pkts 0


The other tx queues are fine. Every packet that goes into a tx queue
other than #0 is sent to the wire.

tail -f /run/the_router | grep -v 'was sent 0' | grep 'was sent'
Jul 14 03:04:06 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 1, queue_id 3, num pkts 4, was sent 4, slow pkts 0
Jul 14 03:04:06 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 1, queue_id 2, num pkts 1, was sent 1, slow pkts 0
Jul 14 03:04:06 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 1, queue_id 3, num pkts 1, was sent 1, slow pkts 0
Jul 14 03:04:06 bizin the_router.lag[22550]: PMD: rte_eth_tx_burst:
port_id 1, queue_id 3, num pkts 1, was sent 1, slow pkts 0
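
To cross-check the queue #0 stall without my added log line, I can also
poll the per-queue counters from rte_eth_stats_get(); a sketch, assuming
the PMD fills q_opackets for the first RTE_ETHDEV_QUEUE_STAT_CNTRS
queues:

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Print per-queue TX counters for one slave port. A q_opackets value
 * frozen on queue 0 while the other queues keep growing matches what
 * the debug log above shows. */
static void
dump_txq_counters(uint8_t port_id, uint16_t nb_txq)
{
	struct rte_eth_stats stats;
	uint16_t q;

	if (rte_eth_stats_get(port_id, &stats) != 0)
		return;

	for (q = 0; q < nb_txq && q < RTE_ETHDEV_QUEUE_STAT_CNTRS; q++)
		printf("port %u txq %u: opackets %" PRIu64 "\n",
		       (unsigned)port_id, (unsigned)q, stats.q_opackets[q]);
	printf("port %u: oerrors %" PRIu64 "\n",
	       (unsigned)port_id, stats.oerrors);
}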

I still have no clue what could cause such behavior, and I am running
out of ideas for how to debug it further.

Please, anybody, help! I would love to hear any ideas.

--
Alex Kiselev


* [dpdk-users] bonding driver LACP mode issues
@ 2017-07-10 12:37 Alex Kiselev
From: Alex Kiselev @ 2017-07-10 12:37 UTC
  To: users

Hello.

I've managed to gather more information about my problem, and it looks
like I have pinpointed the source of the failure when my LACP bond port
stops forwarding packets.

At first, I was thinking that the cause of the problem was the LACP
protocol. But turning on RTE_LIBRTE_BOND_DEBUG_8023AD showed that both
the switch ports (21, 22) and my app's bond slave ports (0, 1) are
perfectly synchronized:


on the switch:
xx # sho lacp lag 21

Lag   Actor    Actor  Partner           Partner  Partner Agg   Actor
      Sys-Pri  Key    MAC               Sys-Pri  Key     Count MAC
--------------------------------------------------------------------------------
21          0  0x03fd 00:e0:ed:7b:ce:08   65535  0x0021      2 00:04:96:83:6d:2f

Port list:

Member     Port      Rx           Sel          Mux            Actor     Partner
Port       Priority  State        Logic        State          Flags     Port
--------------------------------------------------------------------------------
21         0         Current      Selected     Collect-Dist   A-GSCD--  1
22         0         Current      Selected     Collect-Dist   A-GSCD--  2
================================================================================
Actor Flags: A-Activity, T-Timeout, G-Aggregation, S-Synchronization
             C-Collecting, D-Distributing, F-Defaulted, E-Expired



Jul 10 16:38:31 xxx the_router.lag[22009]: PMD: 250434656 [Port 0:
tx_machine] sending LACP frame
Jul 10 16:38:31 xxx the_router.lag[22009]: PMD: LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0100
       state={ ACT AGG SYNC COL DIST }
  }
  partner={ tlv=02, len=14
    pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FD03
       state={ ACT AGG SYNC COL DIST }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}
Jul 10 16:38:33 bizin the_router.lag[22009]: PMD: 250436556 [Port 0:
rx_machine] LACP -> CURRENT
Jul 10 16:38:33 bizin the_router.lag[22009]: PMD: LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FD03
       state={ ACT AGG SYNC COL DIST }
  }
  partner={ tlv=02, len=14
    pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0100
       state={ ACT AGG SYNC COL DIST }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}



Jul 10 16:40:24 bizin the_router.lag[22009]: PMD: 250547261 [Port 1:
tx_machine] sending LACP frame
Jul 10 16:40:24 bizin the_router.lag[22009]: PMD: LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0200
       state={ ACT AGG SYNC COL DIST }
  }
  partner={ tlv=02, len=14
    pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FE03
       state={ ACT AGG SYNC COL DIST }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}
Jul 10 16:40:28 bizin the_router.lag[22009]: PMD: 250551162 [Port 1:
rx_machine] LACP -> CURRENT
Jul 10 16:40:28 bizin the_router.lag[22009]: PMD: LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FE03
       state={ ACT AGG SYNC COL DIST }
  }
  partner={ tlv=02, len=14
    pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0200
       state={ ACT AGG SYNC COL DIST }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}

Then I started looking at tx send errors and noticed the following (I
send ICMP echo request packets and expect my app to send replies back):
in some cases all reply packets are dropped because rte_eth_tx_burst
indicates that none of the packets were sent, while in the rest of the
cases I receive all ICMP replies with zero packet loss.

rte_eth_stats_get also reports that no packets are transmitted on
slave ports 0 and 1 when I am not receiving echo replies.

So it looks like one bonding slave port fails to send packets while
the other slave port has no problem sending.

At the same time, both slave ports have no problem sending LACPDU packets.

I am not sure whether both slave ports receive packets normally, since
the switch sends all test ICMP streams from the same port.
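
If it turns out to matter, one way I could check the rx side of each
slave directly is to watch the slave counters (a sketch along the same
lines as the tx checks):

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Compare rx counters of the slave ports; if one of them stops
 * growing while the switch keeps hashing traffic across the lag,
 * rx on that slave would be suspect too. */
static void
dump_slave_rx(const uint8_t *slaves, uint8_t n_slaves)
{
	uint8_t i;

	for (i = 0; i < n_slaves; i++) {
		struct rte_eth_stats st;

		if (rte_eth_stats_get(slaves[i], &st) == 0)
			printf("slave %u: ipackets %" PRIu64
			       ", imissed %" PRIu64 "\n",
			       (unsigned)slaves[i], st.ipackets, st.imissed);
	}
}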

Also, rte_eth_bond_slaves_get and rte_eth_bond_active_slaves_get
report that the bond port has 2 slaves, which is correct: the bond
port was created with 2 slaves.

xxx ~ # rcli sh port bond stat 3
bond port 3:
slaves: 0, 1
active slaves: 0, 1
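
For reference, that stat command does roughly the following (a
simplified sketch of the query; the function names in my app differ,
and the uint8_t port ids match this DPDK version):

#include <stdio.h>
#include <rte_eth_bond.h>

/* Print configured and active slaves of a bonded port, the same data
 * the "rcli sh port bond stat" output above is built from. */
static void
show_bond_slaves(uint8_t bond_port)
{
	uint8_t slaves[RTE_MAX_ETHPORTS];
	int n, i;

	n = rte_eth_bond_slaves_get(bond_port, slaves, RTE_MAX_ETHPORTS);
	printf("bond port %u:\nslaves:", (unsigned)bond_port);
	for (i = 0; i < n; i++)
		printf(" %u", (unsigned)slaves[i]);

	n = rte_eth_bond_active_slaves_get(bond_port, slaves,
					   RTE_MAX_ETHPORTS);
	printf("\nactive slaves:");
	for (i = 0; i < n; i++)
		printf(" %u", (unsigned)slaves[i]);
	printf("\n");
}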

Looking at the source code of the bonding driver has so far brought me nothing.

So the question is: why, after some time of normal operation (the last
time, the app had been running for 4 days), does the bonding driver
stop sending packets?
Is there anything else I can do to troubleshoot this situation?
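
One experiment I can still think of is to bounce the stuck tx queue
once the stall is detected, assuming the PMD implements the queue
start/stop ops (a sketch, and a diagnostic rather than a fix):

#include <rte_ethdev.h>

/* Stop and restart the stuck tx queue. If transmission resumes
 * afterwards, that points at a wedged queue/descriptor ring rather
 * than at the bonding logic. */
static int
bounce_tx_queue(uint8_t port_id, uint16_t queue_id)
{
	int ret;

	ret = rte_eth_dev_tx_queue_stop(port_id, queue_id);
	if (ret != 0)
		return ret;
	return rte_eth_dev_tx_queue_start(port_id, queue_id);
}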

I would appreciate any help.
Thank you in advance.

-- 
Alex Kiselev
