Thanks for the clarification!
I was able to get 148 Mpps with 12 lcores after some BIOS tuning.
Looks like, due to these HW limitations, I will have to use a ring buffer as you suggested to support more than 32 lcores!

Fri, 18 Feb 2022 at 16:40, Dmitry Kozlyuk <dkozlyuk@nvidia.com>:
Hi,

> With more than 12 lcores, overall receive performance drops.
> With 16-32 lcores I get 100-110 Mpps,

It is more about the number of queues than the number of cores:
12 queues is the threshold at which Multi-Packet Receive Queue (MPRQ)
is automatically enabled in the mlx5 PMD.
Try increasing --rxd and check out the mprq_en device argument.
Please see the mlx5 PMD user guide for details about MPRQ.
You should be able to get the full 148 Mpps with your HW.
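For instance, a testpmd run along these lines enables MPRQ explicitly
(the PCI address, core list, queue counts, and descriptor count below
are placeholders for your setup):

    dpdk-testpmd -l 0-12 -n 4 -a 0000:03:00.0,mprq_en=1 -- \
        --rxq=12 --txq=12 --rxd=4096 --nb-cores=12 --forward-mode=rxonly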

> and I get a significant performance drop with 33 lcores - 84 Mpps.
> With 63 lcores I get only 35 Mpps of overall receive performance.
>
> Are there any limitations on the total number of receive queues (total
> lcores) that can handle a single port on a given NIC?

This is a hardware limitation.
The limit on the number of queues you can create is very high (16M),
but performance scales perfectly only up to 32 queues
at high packet rates (as opposed to bit rates).
Using more queues can even degrade performance, just as you observe.
One way to overcome this (not specific to mlx5)
is to use a ring buffer for incoming packets,
from which any number of processing cores can take packets.
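A minimal sketch of that pattern with the rte_ring API is below
(the ring name and size, the single port, and the per-lcore queue
assignment are illustrative; EAL/port initialization, launching the
loops with rte_eal_remote_launch(), and error handling are omitted):

    #include <stdint.h>
    #include <rte_ring.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_lcore.h>

    #define BURST 32

    /* Shared ring between the RX lcores and the worker lcores. */
    static struct rte_ring *pkt_ring;

    static void
    setup_ring(void)
    {
        /* flags = 0: multi-producer/multi-consumer; size must be a power of 2. */
        pkt_ring = rte_ring_create("rx_to_workers", 1 << 16,
                                   rte_socket_id(), 0);
    }

    /* A few RX lcores: each drains one NIC RX queue into the ring. */
    static int
    rx_loop(void *arg)
    {
        uint16_t port = 0, queue = (uint16_t)(uintptr_t)arg;
        struct rte_mbuf *bufs[BURST];

        for (;;) {
            uint16_t n = rte_eth_rx_burst(port, queue, bufs, BURST);
            unsigned int sent = rte_ring_enqueue_burst(pkt_ring,
                                                       (void **)bufs, n, NULL);
            while (sent < n)            /* drop what did not fit */
                rte_pktmbuf_free(bufs[sent++]);
        }
        return 0;
    }

    /* Any number of worker lcores: take packets from the ring. */
    static int
    worker_loop(void *arg)
    {
        struct rte_mbuf *bufs[BURST];
        (void)arg;

        for (;;) {
            unsigned int n = rte_ring_dequeue_burst(pkt_ring,
                                                    (void **)bufs, BURST, NULL);
            for (unsigned int i = 0; i < n; i++) {
                /* ... actual packet processing goes here ... */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }

With a multi-producer/multi-consumer ring the number of worker lcores
is decoupled from the number of RX queues, so RX can stay at or below
the 32-queue sweet spot while processing scales further.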