Hello,
I have several issues to report concerning the qede PMD, as well as
potential solutions for them. Most of them have to do with configuring
the MTU.
========== Abort on mtu change ===========
First, qede_assign_rxtx_handlers() seems to behave incorrectly since an API
change in the rte_ethdev library. The commit linked below changed the way
packet receive and transmit functions are handled:
https://git.dpdk.org/dpdk/commit/?id=c87d435a4d79739c0cec2ed280b94b41cb908af7
Originally, the Rx/Tx handlers were located in the rte_eth_dev struct. This
is no longer the case: polling of incoming packets is now done through
functions registered in rte_eth_fp_ops instead. The rte_ethdev change is
supposed to be transparent for individual PMDs, but the polling functions
in rte_eth_dev are only synchronized with the ones in rte_eth_fp_ops at
device start. This leads to an issue when trying to configure the MTU while
there is ongoing traffic:
-> Trying to change the MTU triggers a port restart.
-> qede_assign_rxtx_handlers() assigns dummy polling functions to
dev->rx_pkt_burst and dev->tx_pkt_burst while the port is down.
-> However, rte_eth_rx_burst() polls through
rte_eth_fp_ops[port_id].rx_pkt_burst, which still points to
qede_recv_pkts_regular() (see the simplified dispatch below).
-> The application keeps polling packets in the receive function and
triggers an assert(rx_mb != NULL), which causes an abort.
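For reference, this is roughly what the fast path looks like since that
commit (a simplified paraphrase of rte_eth_rx_burst() from
lib/ethdev/rte_ethdev.h, written from memory with the sanity checks and
callbacks omitted, so field names are approximate); note that
dev->rx_pkt_burst is never consulted here:

static inline uint16_t
rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
                 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
{
        /* Fast-path ops are read from the flat rte_eth_fp_ops[] array... */
        struct rte_eth_fp_ops *p = &rte_eth_fp_ops[port_id];
        void *qd = p->rxq.data[queue_id];

        /* ...so whatever the PMD writes into dev->rx_pkt_burst after start
         * has no effect until rte_eth_fp_ops[] itself is updated. */
        return p->rx_pkt_burst(qd, rx_pkts, nb_pkts);
}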
Since rte_eth_fp_ops is reset in rte_eth_dev_stop(), it may be better to
call this function instead of qede_dev_stop(). However, the dummy functions
defined in lib/ethdev/ethdev_private.c log an error and dump the stack when
called, so they might not be intended to be used this way.
The way I fixed this issue in our applications is by forcing a complete
stop of the port before configuring the MTU (see the sketch below). I have
no DPDK patch to suggest for this.
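For illustration, here is a minimal sketch of that workaround, using only
the public ethdev API (set_mtu_safely() is a hypothetical helper name; the
application must also make sure no lcore is still polling the port):

#include <rte_ethdev.h>

static int
set_mtu_safely(uint16_t port_id, uint16_t new_mtu)
{
        int ret;

        /* Full stop: this also resets rte_eth_fp_ops[port_id] to the
         * dummy callbacks, so stale Rx/Tx handlers cannot be called. */
        ret = rte_eth_dev_stop(port_id);
        if (ret != 0)
                return ret;

        ret = rte_eth_dev_set_mtu(port_id, new_mtu);
        if (ret != 0)
                return ret;

        return rte_eth_dev_start(port_id);
}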
-------------- Reproduction ------------------
1) Start testpmd:
dpdk-testpmd --log-level=pmd.net.qede.driver:7 -a 0000:17:00.0 -a
0000:17:00.1 -- -i --rxq=2 --txq=2 --coremask=0x0c
--total-num-mbufs=250000
2) Start packet forwarding:
start
io packet forwarding - ports=2 - cores=2 - streams=4 - NUMA support enabled, MP allocation mode: native
Logical Core 2 (socket 0) forwards packets on 2 streams:
RX P=0/Q=0 (socket 0) -> TX P=1/Q=0 (socket 0) peer=02:00:00:00:00:01
RX P=1/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
Logical Core 3 (socket 1) forwards packets on 2 streams:
RX P=0/Q=1 (socket 0) -> TX P=1/Q=1 (socket 0) peer=02:00:00:00:00:01
RX P=1/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
io packet forwarding packets/burst=32
nb forwarding cores=2 - nb forwarding ports=2
port 0: RX queue number: 2 Tx queue number: 2
Rx offloads=0x80000 Tx offloads=0x0
RX queue: 0
RX desc=0 - RX free threshold=0
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x80000
TX queue: 0
TX desc=0 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x8000 - TX RS bit threshold=0
port 1: RX queue number: 2 Tx queue number: 2
Rx offloads=0x80000 Tx offloads=0x0
RX queue: 0
RX desc=0 - RX free threshold=0
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x80000
TX queue: 0
TX desc=0 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x8000 - TX RS bit threshold=0
3) Send a continuous stream of packets and change the MTU while they are being forwarded:
p = Ether()/IP(src='10.100.0.1', dst='10.200.0.1')/UDP(sport=11111, dport=22222)/Raw(load='A'*100)
sendp(p, iface='ntfp1', count=5000000, inter=0.001)
port config mtu 0 1500
[qede_dev_set_link_state:1842(17:00.0:dpdk-port-0)]setting link state 0
[qede_link_update:1483(17:00.0:dpdk-port-0)]Link - Speed 0 Mode 1 AN 1 Status 0
[ecore_int_attentions:1239(17:00.0:dpdk-port-0-0)]MFW indication via attention
[qede_link_update:1483(17:00.0:dpdk-port-0)]Link - Speed 0 Mode 1 AN 1 Status 0
[qede_activate_vport:536(17:00.0:dpdk-port-0)]vport is deactivated
[qede_tx_queue_reset:509(17:00.0:dpdk-port-0)]Reset TX queue 0
[qede_tx_queue_stop:991(17:00.0:dpdk-port-0)]TX queue 0 stopped
[qede_tx_queue_reset:509(17:00.0:dpdk-port-0)]Reset TX queue 1
[qede_tx_queue_stop:991(17:00.0:dpdk-port-0)]TX queue 1 stopped
[qede_rx_queue_reset:293(17:00.0:dpdk-port-0)]Reset RX queue 0
[qede_rx_queue_stop:371(17:00.0:dpdk-port-0)]RX queue 0 stopped
[qede_rx_queue_reset:293(17:00.0:dpdk-port-0)]Reset RX queue 1
[qede_rx_queue_stop:371(17:00.0:dpdk-port-0)]RX queue 1 stopped
dpdk-testpmd:
../../../source/dpdk-24.11/drivers/net/qede/qede_rxtx.c:1600:
qede_recv_pkts_regular: Assertion `rx_mb != NULL' failed.
Aborted (core dumped) <=======
As you can see, the application is aborted.
========= Bad Rx buffer size for mbufs ===========
Another issue I had when trying to set the MTU was that I hit failed sanity
checks on mbufs when sending large packets, causing aborts. These checks
are only done when compiling in debug mode with RTE_LIBRTE_MBUF_DEBUG set,
so they may be easy to miss. Even without this debugging config, there are
problems when transmitting those packets. The problem comes from the
calculation of the Rx buffer size, which was reworked in this patch:
https://git.dpdk.org/dpdk/commit/drivers/net/qede?id=318d7da3122bac04772418c5eda9f50fcd175d18
-> First, the max Rx buffer size is configured in two different places in
the code, and the calculation is not consistent between them. In
qede_rx_queue_setup(), RTE_ETHER_CRC_LEN is added to max_rx_pktlen, but not
in qede_set_mtu(). The commit above mentions that the HW does not include
the CRC in received frames passed to the host, meaning the CRC should not
be added. Also, QEDE_ETH_OVERHEAD is added twice: once when frame_size is
initialized with QEDE_MAX_ETHER_HDR_LEN, and a second time in
qede_calc_rx_buf_size().
-> Furthermore, the commit mentions applying a flooring on rx_buf_size to
align its value. This causes an issue when receiving large packets nearing
the MTU limit, as the Rx buffer will not be large enough for some MTU
values. What I observed when debugging the PMD is that the NIC would use Rx
scatter to compensate for the insufficient buffer size, although the PMD
was not configured to handle this. I saw mbufs with m->nb_segs = 2 and
m->next = NULL being received, which is unsurprising given that the wrong
receive function was used while Rx scatter was not enabled:
qede_recv_pkts_regular() instead of qede_recv_pkts(). I would suggest
restoring the use of QEDE_CEIL_TO_CACHE_LINE_SIZE and force-enabling Rx
scatter in case the resulting value exceeds the mbuf size. This would
ensure the buffer is large enough to receive MTU-sized packets without
fragmentation.
I will submit patches to improve the calculation of the Rx buffer size; a rough sketch of the idea follows.
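To make the suggestion concrete, here is a sketch of the calculation I have
in mind. It is not the exact patch: QEDE_CEIL_TO_CACHE_LINE_SIZE and
QEDE_ETH_OVERHEAD handling follow the existing driver macros, the function
name is hypothetical, and the field accesses are illustrative:

static uint16_t
calc_rx_buf_size_sketch(struct rte_eth_dev *dev, uint16_t mbuf_data_sz,
                        uint16_t max_frame_size)
{
        uint64_t *offloads = &dev->data->dev_conf.rxmode.offloads;
        uint16_t rx_buf_size;

        if ((*offloads & RTE_ETH_RX_OFFLOAD_SCATTER) ||
            max_frame_size > mbuf_data_sz) {
                /* Multi-segment Rx: per-BD buffer is simply the mbuf size. */
                return mbuf_data_sz;
        }

        /* Single-segment Rx: round up (not down) so an MTU-sized frame
         * always fits in one buffer. */
        rx_buf_size = QEDE_CEIL_TO_CACHE_LINE_SIZE(max_frame_size);
        if (rx_buf_size > mbuf_data_sz) {
                /* The rounded-up size does not fit in one mbuf: force-enable
                 * Rx scatter instead of letting the HW split silently. */
                *offloads |= RTE_ETH_RX_OFFLOAD_SCATTER;
                rx_buf_size = mbuf_data_sz;
        }
        return rx_buf_size;
}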
-------------- Reproduction ------------------
Sadly, I did not manage to reproduce this issue with testpmd, because Rx
scatter gets forcefully enabled by qede at start. This happens because the
MTU is set to the maximum possible value by default.
With a sufficient log level, this is visible at start:
Configuring Port 0 (socket 0)
[qede_check_fdir_support:149(17:00.0:dpdk-port-0)]flowdir is disabled
[qed_sb_init:500(17:00.0:dpdk-port-0)]hwfn [0] <--[init]-- SB 0000 [0x0000 upper]
[qede_alloc_fp_resc:656(17:00.0:dpdk-port-0)]sb_info idx 0x1 initialized
[qed_sb_init:500(17:00.0:dpdk-port-0)]hwfn [0] <--[init]-- SB 0001 [0x0001 upper]
[qede_alloc_fp_resc:656(17:00.0:dpdk-port-0)]sb_info idx 0x2 initialized
[qede_start_vport:498(17:00.0:dpdk-port-0)]VPORT started with MTU = 2154
[qede_vlan_stripping:906(17:00.0:dpdk-port-0)]VLAN stripping disabled
[qede_vlan_filter_set:970(17:00.0:dpdk-port-0)]No VLAN filters configured yet
[qede_vlan_offload_set:1035(17:00.0:dpdk-port-0)]VLAN offload mask 3
[qede_dev_configure:1329(17:00.0:dpdk-port-0)]Device configured with RSS=2 TSS=2
[qede_alloc_tx_queue_mem:446(17:00.0:dpdk-port-0)]txq 0 num_desc 512 tx_free_thresh 480 socket 0
[qede_alloc_tx_queue_mem:446(17:00.0:dpdk-port-0)]txq 1 num_desc 512 tx_free_thresh 480 socket 0
[qede_rx_queue_setup:247(17:00.0:dpdk-port-0)]Forcing scatter-gather mode <=========================
[qede_alloc_rx_queue_mem:155(17:00.0:dpdk-port-0)]mtu 2154 mbufsz 2048 bd_max_bytes 2048 scatter_mode 1
[qede_rx_queue_setup:283(17:00.0:dpdk-port-0)]rxq 0 num_desc 512 rx_buf_size=2048 socket 0
[qede_alloc_rx_queue_mem:155(17:00.0:dpdk-port-0)]mtu 2154 mbufsz 2048 bd_max_bytes 2048 scatter_mode 1
[qede_rx_queue_setup:283(17:00.0:dpdk-port-0)]rxq 1 num_desc 512 rx_buf_size=2048 socket 0
[qede_rx_queue_start:778(17:00.0:dpdk-port-0)]rxq 0 igu_sb_id 0x1
[qede_rx_queue_start:805(17:00.0:dpdk-port-0)]RX queue 0 started
[qede_rx_queue_start:778(17:00.0:dpdk-port-0)]rxq 1 igu_sb_id 0x2
[qede_rx_queue_start:805(17:00.0:dpdk-port-0)]RX queue 1 started
[qede_tx_queue_start:837(17:00.0:dpdk-port-0)]txq 0 igu_sb_id 0x1
[qede_tx_queue_start:870(17:00.0:dpdk-port-0)]TX queue 0 started
[qede_tx_queue_start:837(17:00.0:dpdk-port-0)]txq 1 igu_sb_id 0x2
[qede_tx_queue_start:870(17:00.0:dpdk-port-0)]TX queue 1 started
[qede_config_rss:1059(17:00.0:dpdk-port-0)]Applying driver default key
[qede_rss_hash_update:2121(17:00.0:dpdk-port-0)]RSS hf = 0x104 len = 40 key = 0x7ffc5980db90
[qede_rss_hash_update:2126(17:00.0:dpdk-port-0)]Enabling rss
[qede_rss_hash_update:2140(17:00.0:dpdk-port-0)]Applying user supplied hash key
[qede_rss_hash_update:2187(17:00.0:dpdk-port-0)]Storing RSS key
[qede_activate_vport:536(17:00.0:dpdk-port-0)]vport is activated
[qede_dev_set_link_state:1842(17:00.0:dpdk-port-0)]setting link state 1
[qede_link_update:1483(17:00.0:dpdk-port-0)]Link - Speed 0 Mode 1 AN 1 Status 0
[qede_link_update:1483(17:00.0:dpdk-port-0)]Link - Speed 0 Mode 1 AN 1 Status 0
[qede_assign_rxtx_handlers:338(17:00.0:dpdk-port-0)]Assigning qede_recv_pkts
[qede_assign_rxtx_handlers:354(17:00.0:dpdk-port-0)]Assigning qede_xmit_pkts_regular
[qede_dev_start:1155(17:00.0:dpdk-port-0)]Device started
[qede_ucast_filter:682(17:00.0:dpdk-port-0)]Unicast MAC is not found
Port 0: F4:E9:D4:7A:A1:E6
========= Bad sanity check on port stop ===========
There is another sanity check which breaks the program on port stop when
releasing mbufs: if large packets have been sent, it is possible to see
mbufs with m->pkt_len = 0 but a non-zero m->data_len. This is probably
because mbufs are not reset when allocated in qede_alloc_rx_buffer() and
their length fields are only set at the end of qede_recv_pkts(). Calling
rte_pktmbuf_reset() before freeing the mbufs is enough to avoid this issue.
I will submit a patch doing this (a sketch follows).
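As an illustration, the fix amounts to something like the loop below in the
Rx queue release path (sw_rx_ring and nb_rx_desc are the qede_rx_queue
field names as I remember them from qede_rxtx.h; the exact location may
differ in the final patch):

for (i = 0; i < rxq->nb_rx_desc; i++) {
        struct rte_mbuf *mbuf = rxq->sw_rx_ring[i];

        if (mbuf != NULL) {
                /* Clear the stale data_len left by the Rx path so that
                 * rte_mbuf_sanity_check() does not panic in debug builds. */
                rte_pktmbuf_reset(mbuf);
                rte_pktmbuf_free(mbuf);
                rxq->sw_rx_ring[i] = NULL;
        }
}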
-------------- Reproduction ------------------
1) Start testpmd:
dpdk-testpmd --log-level=pmd.net.qede.driver:7 -a 0000:17:00.0 -a
0000:17:00.1 -- -i --rxq=2 --txq=2 --coremask=0x0c
--total-num-mbufs=250000 --max-pkt-len 2172
show port info 0
********************* Infos for port 0 *********************
MAC address: F4:E9:D4:7A:A1:E6
Device name: 0000:17:00.0
Driver name: net_qede
Firmware-version: 8.40.33.0 MFW: 8.24.21.0
Devargs:
Connect to socket: 0
memory allocation on the socket: 0
Link status: up
Link speed: 10 Gbps
Link duplex: full-duplex
Autoneg status: On
MTU: 2154
Promiscuous mode: enabled
Allmulticast mode: disabled
Maximum number of MAC addresses: 256
Maximum number of MAC addresses of hash filtering: 0
VLAN offload:
strip off, filter off, extend off, qinq strip off
Hash key size in bytes: 40
Redirection table size: 128
Supported RSS offload flow types:
ipv4 ipv4-tcp ipv4-udp ipv6 ipv6-tcp ipv6-udp vxlan
geneve
Minimum size of RX buffer: 1024
Maximum configurable length of RX packet: 9672
Maximum configurable size of LRO aggregated packet: 0
Current number of RX queues: 2
Max possible RX queues: 32
Max possible number of RXDs per queue: 32768
Min possible number of RXDs per queue: 128
RXDs number alignment: 128
Current number of TX queues: 2
Max possible TX queues: 32
Max possible number of TXDs per queue: 32768
Min possible number of TXDs per queue: 256
TXDs number alignment: 256
Max segment number per packet: 255
Max segment number per MTU/TSO: 18
Device capabilities: 0x0( )
Device error handling mode: none
Device private info:
none
2) Start packet forwarding:
start
io packet forwarding - ports=2 - cores=2 - streams=4 - NUMA support enabled, MP allocation mode: native
Logical Core 2 (socket 0) forwards packets on 2 streams:
RX P=0/Q=0 (socket 0) -> TX P=1/Q=0 (socket 0) peer=02:00:00:00:00:01
RX P=1/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
Logical Core 3 (socket 1) forwards packets on 2 streams:
RX P=0/Q=1 (socket 0) -> TX P=1/Q=1 (socket 0) peer=02:00:00:00:00:01
RX P=1/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
io packet forwarding packets/burst=32
nb forwarding cores=2 - nb forwarding ports=2
port 0: RX queue number: 2 Tx queue number: 2
Rx offloads=0x80000 Tx offloads=0x0
RX queue: 0
RX desc=0 - RX free threshold=0
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x80000
TX queue: 0
TX desc=0 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x8000 - TX RS bit threshold=0
port 1: RX queue number: 2 Tx queue number: 2
Rx offloads=0x80000 Tx offloads=0x0
RX queue: 0
RX desc=0 - RX free threshold=0
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x80000
TX queue: 0
TX desc=0 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x8000 - TX RS bit threshold=0
3) Send a continuous stream of large packets, nearing the MTU limit (payload size 2020):
p = Ether()/IP(src='10.100.0.1', dst='10.200.0.1')/UDP(sport=11111, dport=22222)/Raw(load='A'*2020)
sendp(p, iface='ntfp1', count=5000, inter=0.001)
*Stop sending packets*
4) Stop packet forwarding:
stop
Telling cores to stop...
Waiting for lcores to finish...
------- Forward Stats for RX Port= 0/Queue= 1 -> TX Port= 1/Queue= 1 -------
RX-packets: 1251 TX-packets: 1251 TX-dropped: 0
---------------------- Forward statistics for port 0 ----------------------
RX-packets: 1251 RX-dropped: 0 RX-total: 1251
TX-packets: 0 TX-dropped: 0 TX-total: 0
----------------------------------------------------------------------------
---------------------- Forward statistics for port 1 ----------------------
RX-packets: 0 RX-dropped: 0 RX-total: 0
TX-packets: 1251 TX-dropped: 0 TX-total: 1251
----------------------------------------------------------------------------
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 1251 RX-dropped: 0 RX-total: 1251
TX-packets: 1251 TX-dropped: 0 TX-total: 1251
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Done.
5) Stop port:
port stop 0
Stopping ports...
[qede_dev_set_link_state:1842(17:00.0:dpdk-port-0)]setting link state 0
[qede_link_update:1483(17:00.0:dpdk-port-0)]Link - Speed 0 Mode 1 AN 1 Status 0
[ecore_int_attentions:1239(17:00.0:dpdk-port-0-0)]MFW indication via attention
[qede_link_update:1483(17:00.0:dpdk-port-0)]Link - Speed 0 Mode 1 AN 1 Status 0
[qede_activate_vport:536(17:00.0:dpdk-port-0)]vport is deactivated
[qede_tx_queue_reset:509(17:00.0:dpdk-port-0)]Reset TX queue 0
[qede_tx_queue_stop:991(17:00.0:dpdk-port-0)]TX queue 0 stopped
[qede_tx_queue_reset:509(17:00.0:dpdk-port-0)]Reset TX queue 1
[qede_tx_queue_stop:991(17:00.0:dpdk-port-0)]TX queue 1 stopped
[qede_rx_queue_reset:293(17:00.0:dpdk-port-0)]Reset RX queue 0
[qede_rx_queue_stop:371(17:00.0:dpdk-port-0)]RX queue 0 stopped
EAL: PANIC in rte_mbuf_sanity_check():
bad pkt_len
0: dpdk-testpmd (rte_dump_stack+0x42) [643a372c5ac2]
1: dpdk-testpmd (__rte_panic+0xcc) [643a36c81963]
2: dpdk-testpmd (643a36ab0000+0x1d0b99) [643a36c80b99]
3: dpdk-testpmd (643a36ab0000+0x11b66be) [643a37c666be]
4: dpdk-testpmd (qede_stop_queues+0x162) [643a37c67ea2]
5: dpdk-testpmd (643a36ab0000+0x3c33fa) [643a36e733fa]
6: dpdk-testpmd (rte_eth_dev_stop+0x86) [643a3725a4c6]
7: dpdk-testpmd (643a36ab0000+0x4a2b28) [643a36f52b28]
8: dpdk-testpmd (643a36ab0000+0x799696) [643a37249696]
9: dpdk-testpmd (643a36ab0000+0x798604) [643a37248604]
10: dpdk-testpmd (rdline_char_in+0x38b) [643a3724bcdb]
11: dpdk-testpmd (cmdline_in+0x71) [643a372486d1]
12: dpdk-testpmd (cmdline_interact+0x40) [643a37248800]
13: dpdk-testpmd (prompt+0x2f) [643a36f01cef]
14: dpdk-testpmd (main+0x7e8) [643a36eddeb8]
15: /lib/x86_64-linux-gnu/libc.so.6 (718852800000+0x29d90) [718852829d90]
16: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0x80) [718852829e40]
17: dpdk-testpmd (_start+0x25) [643a36ef7145]
Aborted (core dumped) <=======
========= Wrong offload flags set for tunnel packets ===========
Finally, there is still another problem with the offload flags for Rx
outer/inner IP and L4 checksums. The flags raised outside the tunnel part
of qede_recv_pkts() and qede_recv_pkts_regular() should be the OUTER flags;
instead, they are the same flags as those used for the inner part. This can
lead to a problem where both the GOOD and BAD L4 flags end up set among the
inner offloads. I will submit a patch for this (an illustration follows).
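To illustrate what I mean (using the generic mbuf flag names, not the
driver's exact code; the tunnel/checksum status variables are
placeholders), a tunnelled packet should report the two header levels
separately:

/* Outer headers: report with the OUTER variants of the flags. */
if (tunnel) {
        if (outer_ip_csum_bad)
                ol_flags |= RTE_MBUF_F_RX_OUTER_IP_CKSUM_BAD;
        ol_flags |= outer_l4_csum_ok ? RTE_MBUF_F_RX_OUTER_L4_CKSUM_GOOD :
                                       RTE_MBUF_F_RX_OUTER_L4_CKSUM_BAD;
}

/* Inner (or only) headers: the plain RTE_MBUF_F_RX_* flags. Reusing these
 * for the outer headers is what leads to GOOD and BAD being ORed together
 * on the same mbuf. */
ol_flags |= ip_csum_ok ? RTE_MBUF_F_RX_IP_CKSUM_GOOD :
                         RTE_MBUF_F_RX_IP_CKSUM_BAD;
ol_flags |= l4_csum_ok ? RTE_MBUF_F_RX_L4_CKSUM_GOOD :
                         RTE_MBUF_F_RX_L4_CKSUM_BAD;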