I get 125 Mpps from a single port using 12 lcores:
numactl -N 1 -m 1 /opt/dpdk-21.11/build/app/dpdk-testpmd -l 64-127 -n 4  -a 0000:c1:00.0 -- --stats-period 1 --nb-cores=12 --rxq=12 --txq=12 --rxd=512

With 63 lcores I get 35 Mpps:
numactl -N 1 -m 1 /opt/dpdk-21.11/build/app/dpdk-testpmd -l 64-127 -n 4  -a 0000:c1:00.0 -- --stats-period 1 --nb-cores=63 --rxq=63 --txq=63 --rxd=512

I'm using this guide as a reference: https://fast.dpdk.org/doc/perf/DPDK_20_11_Mellanox_NIC_performance_report.pdf
This report gives examples of how to get the best performance, but all of them use at most 12 lcores.
125 Mpps with 12 lcores is nearly the maximum I can get from a single 100GbE port (148 Mpps is the theoretical maximum for 64-byte packets). I just want to understand why I get good performance with 12 lcores but bad performance with 63 lcores.
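For reference, the 148 Mpps figure is just line-rate arithmetic: each 64-byte frame occupies 84 bytes on the wire once the 8-byte preamble/SFD and 12-byte inter-frame gap are counted. A quick sketch of the calculation (plain C, nothing DPDK-specific):

#include <stdio.h>

/* Theoretical packet rate for 64-byte frames on a 100GbE link:
 * 64B frame + 8B preamble/SFD + 12B inter-frame gap = 84B per packet. */
int main(void)
{
    const double link_bps = 100e9;         /* 100GbE line rate */
    const double wire_bytes = 64 + 8 + 12; /* bytes per packet on the wire */

    printf("%.1f Mpps\n", link_bps / (wire_bytes * 8.0) / 1e6); /* ~148.8 */
    return 0;
}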

On Fri, 18 Feb 2022 at 16:30, Asaf Penso <asafp@nvidia.com> wrote:
Hello Dmitry,

Could you please paste the testpmd command for each experiment?

Also, have you looked into the dpdk.org performance report to see how to tune for best results?

Regards,
Asaf Penso

From: Дмитрий Степанов <stepanov.dmit@gmail.com>
Sent: Friday, February 18, 2022 9:32:59 AM
To: users@dpdk.org <users@dpdk.org>
Subject: Mellanox performance degradation with more than 12 lcores
 
Hi folks!

I'm using a Mellanox ConnectX-6 Dx EN adapter card (100GbE; dual-port QSFP56; PCIe 4.0/3.0 x16) with DPDK 21.11 on a server with an AMD EPYC 7702 64-core processor (a NUMA system with 2 sockets). Hyperthreading is turned off.
I'm testing the maximum receive throughput I can get from a single port using the testpmd utility shipped with DPDK. My generator produces random UDP packets with zero payload length, i.e. minimum-size 64-byte frames on the wire.

I get the maximum performance using 8-12 lcores (overall 120-125 Mpps on the receive path of a single port):

numactl -N 1 -m 1 /opt/dpdk-21.11/build/app/dpdk-testpmd -l 64-127 -n 4  -a 0000:c1:00.0 -- --stats-period 1 --nb-cores=12 --rxq=12 --txq=12 --rxd=512

With more than 12 lcores, overall receive performance drops: with 16-32 lcores I get 100-110 Mpps, at 33 lcores there is a sharp fall to 84 Mpps, and with 63 lcores I get only 35 Mpps.

Are there any limits on the total number of receive queues (and hence lcores) that can serve a single port on this NIC?
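In case it helps with diagnosis, here is a minimal sketch of how the advertised queue limits can be read from the PMD via rte_eth_dev_info_get() (assuming the NIC is the only probed device, so it comes up as port 0; the binary name is made up):

#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

/* Print the RX/TX queue limits the PMD advertises for port 0.
 * Run with the same EAL arguments as testpmd, e.g.:
 *   ./check-queues -l 64-65 -n 4 -a 0000:c1:00.0 */
int main(int argc, char **argv)
{
    struct rte_eth_dev_info dev_info;
    uint16_t port_id = 0;

    if (rte_eal_init(argc, argv) < 0)
        return 1;
    if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
        return 1;
    printf("port %u: max_rx_queues=%u, max_tx_queues=%u\n",
           port_id, dev_info.max_rx_queues, dev_info.max_tx_queues);
    return 0;
}

(These are the software limits the driver advertises, so this only answers part of the question; whatever limits throughput at 33+ queues may well sit below that number.)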

Thanks,
Dmitriy Stepanov