From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <klarose@sandvine.com>
Received: from mail1.sandvine.com (Mail1.sandvine.com [64.7.137.134])
 by dpdk.org (Postfix) with ESMTP id C4D1B2A5D
 for <dev@dpdk.org>; Thu, 23 Nov 2017 17:04:14 +0100 (CET)
Received: from WTL-EXCHP-1.sandvine.com ([fe80::ac6b:cc1e:f2ff:93aa]) by
 wtl-exchp-2.sandvine.com ([::1]) with mapi id 14.03.0319.002; Thu, 23 Nov
 2017 11:04:13 -0500
From: Kyle Larose <klarose@sandvine.com>
To: "dev@dpdk.org" <dev@dpdk.org>
CC: Declan Doherty <declan.doherty@intel.com>
Thread-Topic: rte_eth_bond: Problem with link failure and 8023AD
Thread-Index: AdNkc02nPlVLapNtRaOLYwxkHidOWw==
Date: Thu, 23 Nov 2017 16:04:13 +0000
Message-ID: <D76BBBCF97F57144BB5FCF08007244A7EDB5A10F@wtl-exchp-1.sandvine.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [192.168.200.51]
x-c2processedorg: b2f06e69-072f-40ee-90c5-80a34e700794
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.15
Subject: [dpdk-dev] rte_eth_bond: Problem with link failure and 8023AD
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Thu, 23 Nov 2017 16:04:15 -0000

Hello,

I've been testing my LAG, implemented with the DPDK eth_bond PMD. As
part of my fault tolerance testing, I want to ensure that if a link is
flapping up and down continuously, the impact to service is minimal. My
finding is that the LAG is rendered inoperable if a certain link is
flapping. Details below.

Setup:

-    4x10G X710 links in an 802.3ad LAG connected to a switch.

-    Under normal operation, the LAG is steady, traffic is balanced, etc.

Problem:
If I take down the link on the switch corresponding to the "aggregator"
link in the DPDK LAG, then bring it back up, every link in the LAG goes
from distributing to not distributing and back to distributing. This
causes unnecessary loss of service.

A single link failure, regardless of whether or not it's the aggregator
link, should not change the state of the other links. Consider what
would happen if there were a hardware fault on that link, or its signal
were bad: it's possible for it to be stuck flapping up and down. This
would lead to complete loss of service on the LAG, despite there being
three stable links remaining.

Analysis:
- The switch shows the system ID changing when the link flaps: it goes
  from 00:00:00:00:00:00 to the aggregator's MAC. This is not good. Why
  is it happening? By default we seem to be using the "AGG_BANDWIDTH"
  selection algorithm, which is broken: it takes a slave index and uses
  it as the index into the 8023ad ports array, which is indexed by DPDK
  port number. It should first translate the slave index into a DPDK
  port number using the slaves[] array.
- Aside from the above, the default is supposed to be AGG_STABLE,
  according to bond_mode_8023ad_conf_get_default. However,
  bond_mode_8023ad_conf_assign does not actually copy out the selection
  algorithm, so it just uses 0, which happens to be AGG_BANDWIDTH.
- I fixed the above, but still faced two more issues:
  1) The system ID changes when the aggregator changes, which can lead
     to the problem.
  2) When the link fails, it is "deactivated" in the LAG via
     bond_mode_8023ad_deactivate_slave. There is a block in there
     dedicated to the case where the aggregator is disabled. In that
     case, it explicitly unselects each slave sharing that aggregator.
     This causes them to fall back to the DETACHED state in the mux
     machine, i.e. they are no longer aggregating at all until the
     state machine runs through the LACP exchange with the partner
     again.
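
To make the index mix-up in the first bullet concrete, here is a minimal
standalone sketch. It does not use the real bonding PMD structures: every
toy_ name, the table size, and the "highest bandwidth wins" rule are
assumptions made up for illustration. The only point is the difference
between indexing a per-port table by slave index and by the port id
looked up through slaves[].

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PORTS 8

/* Hypothetical stand-in for per-port 802.3ad state. */
struct toy_port {
    uint64_t bandwidth;
};

/* Hypothetical stand-in for the bonded device's slave list. */
struct toy_bond {
    uint16_t slaves[TOY_MAX_PORTS]; /* slave index -> DPDK port id */
    uint16_t slave_count;
};

/* Per-port table indexed by DPDK port id, NOT by slave index. */
static struct toy_port toy_ports[TOY_MAX_PORTS];

/* Buggy variant: uses the slave index directly as a port id. */
static uint16_t toy_select_buggy(const struct toy_bond *bond)
{
    uint16_t best = 0;

    for (uint16_t i = 1; i < bond->slave_count; i++)
        if (toy_ports[i].bandwidth > toy_ports[best].bandwidth)
            best = i;
    return best; /* a slave index returned as if it were a port id */
}

/* Fixed variant: translates slave index -> port id via slaves[]. */
static uint16_t toy_select_fixed(const struct toy_bond *bond)
{
    assert(bond->slave_count > 0);

    uint16_t best = bond->slaves[0];

    for (uint16_t i = 1; i < bond->slave_count; i++) {
        uint16_t port_id = bond->slaves[i];

        if (toy_ports[port_id].bandwidth > toy_ports[best].bandwidth)
            best = port_id;
    }
    return best;
}
```

With slaves on ports 4 to 7, the buggy variant only ever inspects table
entries 0 to 3, so it can return a port that is not in the LAG at all,
while the fixed variant picks among the real slave ports.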

Possible fix:
1) Change bond_mode_8023ad_conf_assign to actually copy out the
   selection algorithm.
2) Ensure that all members of a LAG have the same system ID (i.e. choose
   the LAG's MAC address).
3) Do not detach the other members when the aggregator's link state goes
   down.
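
As a rough illustration of item 1, the sketch below models a "get
default" / "assign" pair where the assign step forgets one field. The
structs, the toy_ names, and the timer value are all hypothetical and
are not the real bonding PMD definitions; the only thing taken from the
analysis above is that the selection value 0 corresponds to
AGG_BANDWIDTH and that the intended default is AGG_STABLE.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the selection modes; 0 is the bandwidth mode. */
enum toy_agg_selection {
    TOY_AGG_BANDWIDTH = 0,
    TOY_AGG_COUNT,
    TOY_AGG_STABLE
};

/* Hypothetical user-facing configuration. */
struct toy_conf {
    uint32_t fast_periodic_ms;
    enum toy_agg_selection agg_selection;
};

/* Hypothetical internal mode-4 state that the config is copied into. */
struct toy_mode4 {
    uint32_t fast_periodic_ms;
    enum toy_agg_selection agg_selection;
};

static void toy_conf_get_default(struct toy_conf *conf)
{
    conf->fast_periodic_ms = 900; /* placeholder value */
    conf->agg_selection = TOY_AGG_STABLE;
}

/* Buggy variant: the selection algorithm is never copied. */
static void toy_conf_assign_buggy(struct toy_mode4 *mode4,
                                  const struct toy_conf *conf)
{
    mode4->fast_periodic_ms = conf->fast_periodic_ms;
}

/* Fixed variant: every field of the configuration is copied. */
static void toy_conf_assign_fixed(struct toy_mode4 *mode4,
                                  const struct toy_conf *conf)
{
    mode4->fast_periodic_ms = conf->fast_periodic_ms;
    mode4->agg_selection = conf->agg_selection;
}
```

Starting from zeroed internal state, the buggy variant leaves
agg_selection at 0 (the bandwidth mode) even though the default
configuration asked for the stable mode.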

Note:

1)  We should fix AGG_BANDWIDTH and AGG_COUNT separately.

2)  I can't see any reason why the system ID should be equal to the MAC
    of the aggregator. It's intended to represent the system to which
    the LAG belongs, not the aggregator itself. The aggregator is
    represented by the operational key. So, it should be fine to use the
    LAG's MAC address, which is fixed at init, as the system ID for all
    possible aggregators.

3)  I think not detaching is the correct approach. There is nothing in
    my reading of 802.1Q or the 802.1AX LACP specification that implies
    we should do this. There is a blurb about changes in parameters
    which lead to a change in aggregator forcing the unselected
    transition, but I don't think that needs to apply here. I'm fairly
    certain they're talking about changing the operational key, etc.
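
To show what item 3 would mean for the remaining links, here is a toy
model of the two deactivation behaviours. It is only a sketch: the
structs and toy_ functions are hypothetical, "selected" is a single flag
standing in for the selection state that feeds the mux machine, and
nothing here says what should happen to the aggregator role itself after
the failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SLAVE_COUNT 4

/* Hypothetical per-slave state. */
struct toy_slave {
    uint16_t aggregator_idx; /* slave index of the aggregator in use */
    bool selected;           /* false models UNSELECTED (mux detaches) */
};

struct toy_lag {
    struct toy_slave slaves[TOY_SLAVE_COUNT];
};

/* All slaves start selected and share one aggregator. */
static void toy_lag_init(struct toy_lag *lag, uint16_t aggregator_idx)
{
    for (uint16_t i = 0; i < TOY_SLAVE_COUNT; i++) {
        lag->slaves[i].aggregator_idx = aggregator_idx;
        lag->slaves[i].selected = true;
    }
}

/* Behaviour described in the analysis: when the failed slave is the
 * aggregator, every slave sharing it is unselected as well. */
static void toy_deactivate_current(struct toy_lag *lag, uint16_t failed_idx)
{
    assert(failed_idx < TOY_SLAVE_COUNT);

    for (uint16_t i = 0; i < TOY_SLAVE_COUNT; i++)
        if (lag->slaves[i].aggregator_idx == failed_idx)
            lag->slaves[i].selected = false;
    lag->slaves[failed_idx].selected = false;
}

/* Proposed behaviour: only the failed slave is unselected. */
static void toy_deactivate_proposed(struct toy_lag *lag, uint16_t failed_idx)
{
    assert(failed_idx < TOY_SLAVE_COUNT);
    lag->slaves[failed_idx].selected = false;
}

/* Number of slaves still selected, i.e. still able to carry traffic. */
static unsigned toy_selected_count(const struct toy_lag *lag)
{
    unsigned count = 0;

    for (uint16_t i = 0; i < TOY_SLAVE_COUNT; i++)
        if (lag->slaves[i].selected)
            count++;
    return count;
}
```

In this model, failing the aggregator leaves no selected slaves under
the current behaviour and three under the proposed one; failing any
other slave leaves three in both cases.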


How does everyone feel about this? Am I crazy in requiring this
functionality? What about the proposed fix? Does it sound reasonable, or
am I going to break the state machine somehow?

Thanks,

Kyle