Hi

During a stress test on a ConnectX-6 Dx (CX6) NIC using DPDK, our process exits abnormally. We traced the problem to what appears to be a bug in the DPDK mlx5 driver. Please check whether the latest firmware and driver fix this core dump.

 

By default, DPDK enables the vectorized Rx/Tx path (rxtx_vect) and CQE compression, and the receive ring buffer has 1024 entries. Under load, the service process receives SIGSEGV (signal 11) and exits.

Call stack information:

    #2  0x0000000000e72437 in signal_captured_function (signo=11, si=0x7f6310f46eb0, ucontext=0x7f6310f46d80) at ../v1/handle_signal.c:499

    #3  <signal handler called>

    #4  _mm_storeu_si128 (__B=..., __P=<optimized out>) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/include/emmintrin.h:720

    #5  rxq_cq_decompress_v (elts=0x20217ff394e8, cq=0x20217f8538c0, rxq=0x20217ff36e00) at ../drivers/net/mlx5/mlx5_rxtx_vec_sse.h:159

    #6  rxq_burst_v (no_cq=<synthetic pointer>, err=<synthetic pointer>, pkts_n=9, pkts=0x2004e278c9d8, rxq=0x20217ff36e00) at ../drivers/net/mlx5/mlx5_rxtx_vec.c:349

    #7  mlx5_rx_burst_vec (dpdk_rxq=0x20217ff36e00, pkts=0x2004e278c9d8, pkts_n=128) at ../drivers/net/mlx5/mlx5_rxtx_vec.c:393

    #8  0x0000000001086448 in rte_eth_rx_burst (nb_pkts=128, rx_pkts=0x2004e278c9d8, queue_id=7, port_id=<optimized out>) at ../include/dpdk/rte_ethdev.h:5339

 

Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]

Version:

    [root@localhost ~]# ofed_info -s

    MLNX_OFED_LINUX-23.04-0.5.3.3:

 

    [root@localhost ~]# ethtool -i eth6|grep fir

    firmware-version: 22.37.1014 (MT_0000000359)

DPDK version: 21.11

 

 

Source around ../drivers/net/mlx5/mlx5_rxtx_vec_sse.h:159 (frame #5):

157:          /* B.1 store rearm data to mbuf. */

158:          _mm_storeu_si128((__m128i *)&elts[pos + 2]->rearm_data, rearm);

159:          _mm_storeu_si128((__m128i *)&elts[pos + 3]->rearm_data, rearm);

 

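Root cause: When the driver processes a compressed CQE with 9 mini CQEs, it accesses (*rxq->elts)[1021] through (*rxq->elts)[1028]. Only indexes [0, 1027] are reserved when the receive queue is initialized (presumably the 1024 ring entries plus 4 padding entries), so index 1028 is out of bounds. That slot holds a NULL pointer, and the store to elts[pos + 3]->rearm_data at line 159 dereferences it. As a result, the process crashes with a core dump. The gdb output below shows elts[0] to elts[7] as seen from frame #5, where elts starts at ring offset 1021: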

(gdb) p elts[0]

$149 = (struct rte_mbuf *) 0x2006945a8000  //first round

(gdb) p elts[1]

$150 = (struct rte_mbuf *) 0x2006945aa1c0

(gdb) p elts[2]

$151 = (struct rte_mbuf *) 0x2006945ac380

(gdb) p elts[3]

$152 = (struct rte_mbuf *) 0x20217ff36f80

(gdb) p elts[4]

$153 = (struct rte_mbuf *) 0x20217ff36f80  //Second round

(gdb) p elts[5]

$154 = (struct rte_mbuf *) 0x20217ff36f80

(gdb) p elts[6]

$155 = (struct rte_mbuf *) 0x20217ff36f80 

(gdb) p elts[7]

$156 = (struct rte_mbuf *) 0x0     //coredump

(gdb) p elts - (*rxq->elts)

$157 = 1021
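
The index arithmetic behind the gdb output above can be sketched in a few lines of C. This is only an illustrative model, not the driver code: the constants RING_SIZE, PAD_SLOTS and DESCS_PER_LOOP and the helper first_oob_index() are hypothetical names chosen for this example, and the model assumes that the decompression loop always touches 4 consecutive elts entries per round.

```c
#include <stdio.h>

/* Assumed values based on this report: a 1024-entry ring whose elts array
 * has valid indexes [0, 1027], and a loop that touches 4 entries per round.
 * All names here are hypothetical and do not come from the mlx5 driver. */
enum { RING_SIZE = 1024, PAD_SLOTS = 4, DESCS_PER_LOOP = 4 };

/* Return the first elts index at or beyond the reserved range that the
 * modeled loop would touch, or -1 if every access stays in bounds.
 * start  : ring offset of elts[0] as seen by the decompress function
 * mcqe_n : number of mini CQEs to process */
static int first_oob_index(int start, int mcqe_n)
{
    const int limit = RING_SIZE + PAD_SLOTS; /* 1028: one past index 1027 */

    for (int pos = 0; pos < mcqe_n; pos += DESCS_PER_LOOP) {
        for (int lane = 0; lane < DESCS_PER_LOOP; lane++) {
            int idx = start + pos + lane;
            if (idx >= limit)
                return idx;
        }
    }
    return -1;
}
```

Under these assumptions, first_oob_index(1021, 9) returns 1028. That is the fourth entry of the second round (elts[7] relative to offset 1021), which matches the NULL pointer in the gdb output and the faulting store at line 159.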