Hi there,
I'd like to ask for advice on a weird issue that I'm facing while trying
to run XDP on top of a bonding device (802.3ad), and also on the
physical interfaces behind the bond.
I have a DPDK application that runs on top of XDP sockets, using the
DPDK AF_XDP driver.
It was originally a pure DPDK application, but it was recently migrated
to run on top of XDP sockets because we need to split the traffic
entering the machine between the DPDK application and other
"standard Linux" applications running on the same machine.
The application works fine when running on top of a single interface,
but it has problems when it runs on top of a bonding interface. It needs
to run with multiple XDP sockets, where each socket (or group of XDP
sockets) is handled in a separate thread. However, the bonding device is
reported with a single queue, so the application can't open more than
one XDP socket on it. So I've tried binding the XDP sockets to the
queues of the physical interfaces instead. For example:
- 3 interfaces, each set to have 8 queues.
- I've created 3 virtual af_xdp devices, each with 8 queues, i.e. 24 XDP
sockets in total, each bound to a separate queue (this functionality is
provided by DPDK itself).
- I've run the application on 2 threads, where the first thread handled
the first 12 queues (XDP sockets) and the second thread handled the next
12 queues, i.e. the first thread worked with all 8 queues from af_xdp
device 0 and the first 4 queues from af_xdp device 1. The second thread
worked with the remaining 4 queues from af_xdp device 1 and all 8 queues
from af_xdp device 2. I've also tried another distribution scheme (see
below). The threads just call the receive/transmit functions provided by
DPDK for their assigned queues.
- The problem is that with this scheme the network device on the other
side reports: "The member of the LACP mode Eth-Trunk interface received
an abnormal LACPDU, which may be caused by optical fiber misconnection".
This error is always reported for the last device/interface in the bond,
and the bonding/LACP doesn't work.
- Another thing is that if I run the DPDK application on a single
thread, so that sending/receiving on all queues is handled by that one
thread, the bonding seems to work correctly and the above error is not
reported.
- I've checked the code multiple times and I'm sure that each thread is
accessing its own group of queues/sockets.
- I've tried 2 different schemes of accessing the queues, but each one
led to the same issue. Written as (device_idx - queue_idx), these are
the two access orders I've tried:

Thread 1    Thread 2
(0 - 0)     (1 - 4)
(0 - 1)     (1 - 5)
...         (1 - 6)
...         (1 - 7)
(0 - 7)     (2 - 0)
(1 - 0)     (2 - 1)
(1 - 1)     ...
(1 - 2)     ...
(1 - 3)     (2 - 7)

Thread 1    Thread 2
(0 - 0)     (0 - 4)
(1 - 0)     (1 - 4)
(2 - 0)     (2 - 4)
(0 - 1)     (0 - 5)
(1 - 1)     (1 - 5)
(2 - 1)     (2 - 5)
...         ...
(0 - 3)     (0 - 7)
(1 - 3)     (1 - 7)
(2 - 3)     (2 - 7)
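
To make the two access orders above unambiguous, here is a small sketch
that generates them. It is plain Python, not DPDK code; the function
names are made up for illustration, and it assumes the queue counts
divide evenly among the threads (as they do with 3 devices, 8 queues
each and 2 threads).

```python
from typing import List, Tuple

# A queue is identified as (device_idx, queue_idx), as in the tables above.
Queue = Tuple[int, int]


def contiguous_split(num_devices: int, queues_per_device: int,
                     num_threads: int) -> List[List[Queue]]:
    """First scheme: list all queues device by device, then cut the
    list into equal consecutive chunks, one chunk per thread.
    Assumes the total queue count is divisible by num_threads."""
    all_queues = [(d, q)
                  for d in range(num_devices)
                  for q in range(queues_per_device)]
    per_thread = len(all_queues) // num_threads
    return [all_queues[t * per_thread:(t + 1) * per_thread]
            for t in range(num_threads)]


def interleaved_split(num_devices: int, queues_per_device: int,
                      num_threads: int) -> List[List[Queue]]:
    """Second scheme: each thread owns a range of queue indexes and
    visits that range on every device, cycling through the devices.
    Assumes queues_per_device is divisible by num_threads."""
    per_thread = queues_per_device // num_threads
    return [[(d, q)
             for q in range(t * per_thread, (t + 1) * per_thread)
             for d in range(num_devices)]
            for t in range(num_threads)]


def is_disjoint(assignment: List[List[Queue]]) -> bool:
    """True if no queue is assigned to more than one thread."""
    flat = [q for queues in assignment for q in queues]
    return len(flat) == len(set(flat))


if __name__ == "__main__":
    for name, split in (("contiguous", contiguous_split),
                        ("interleaved", interleaved_split)):
        assignment = split(3, 8, 2)
        print(name, "disjoint:", is_disjoint(assignment))
        for t, queues in enumerate(assignment, start=1):
            print("  thread", t, queues)
```

In both schemes every (device, queue) pair belongs to exactly one
thread, which matches the check I described above.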