DPDK usage discussions
* Mellanox performance degradation with more than 12 lcores
@ 2022-02-18  7:32 Дмитрий Степанов
  2022-02-18 13:30 ` Asaf Penso
  2022-02-18 13:39 ` Dmitry Kozlyuk
  0 siblings, 2 replies; 5+ messages in thread
From: Дмитрий Степанов @ 2022-02-18  7:32 UTC (permalink / raw)
  To: users


Hi folks!

I'm using a Mellanox ConnectX-6 Dx EN adapter card (100GbE; dual-port QSFP56;
PCIe 4.0/3.0 x16) with DPDK 21.11 on a server with an AMD EPYC 7702 64-core
processor (a NUMA system with 2 sockets). Hyperthreading is turned off.
I'm testing the maximum receive throughput I can get from a single port
using the testpmd utility shipped with DPDK. My generator produces random UDP
packets with zero payload length.

I get the maximum performance using 8-12 lcores (120-125 Mpps overall on the
receive path of a single port):

numactl -N 1 -m 1 /opt/dpdk-21.11/build/app/dpdk-testpmd -l 64-127 -n 4  -a
0000:c1:00.0 -- --stats-period 1 --nb-cores=12 --rxq=12 --txq=12 --rxd=512

With more than 12 lcores, overall receive performance decreases. With 16-32
lcores I get 100-110 Mpps, there is a significant drop at 33 lcores
(84 Mpps), and with 63 lcores I get only 35 Mpps overall.

Are there any limits on the total number of receive queues (and thus
lcores) that can serve a single port on this NIC?

Thanks,
Dmitriy Stepanov



* Re: Mellanox performance degradation with more than 12 lcores
  2022-02-18  7:32 Mellanox performance degradation with more than 12 lcores Дмитрий Степанов
@ 2022-02-18 13:30 ` Asaf Penso
  2022-02-18 13:49   ` Дмитрий Степанов
  2022-02-18 13:39 ` Dmitry Kozlyuk
  1 sibling, 1 reply; 5+ messages in thread
From: Asaf Penso @ 2022-02-18 13:30 UTC (permalink / raw)
  To: Дмитрий Степанов, users


Hello Dmitry,

Could you please paste the testpmd command for each experiment?

Also, have you looked into the dpdk.org performance report to see how to tune for the best results?

Regards,
Asaf Penso
________________________________
From: Дмитрий Степанов <stepanov.dmit@gmail.com>
Sent: Friday, February 18, 2022 9:32:59 AM
To: users@dpdk.org <users@dpdk.org>
Subject: Mellanox performance degradation with more than 12 lcores

Hi folks!

I'm using a Mellanox ConnectX-6 Dx EN adapter card (100GbE; dual-port QSFP56; PCIe 4.0/3.0 x16) with DPDK 21.11 on a server with an AMD EPYC 7702 64-core processor (a NUMA system with 2 sockets). Hyperthreading is turned off.
I'm testing the maximum receive throughput I can get from a single port using the testpmd utility shipped with DPDK. My generator produces random UDP packets with zero payload length.

I get the maximum performance using 8-12 lcores (120-125 Mpps overall on the receive path of a single port):

numactl -N 1 -m 1 /opt/dpdk-21.11/build/app/dpdk-testpmd -l 64-127 -n 4  -a 0000:c1:00.0 -- --stats-period 1 --nb-cores=12 --rxq=12 --txq=12 --rxd=512

With more than 12 lcores, overall receive performance decreases. With 16-32 lcores I get 100-110 Mpps, there is a significant drop at 33 lcores (84 Mpps), and with 63 lcores I get only 35 Mpps overall.

Are there any limits on the total number of receive queues (and thus lcores) that can serve a single port on this NIC?

Thanks,
Dmitriy Stepanov



* RE: Mellanox performance degradation with more than 12 lcores
  2022-02-18  7:32 Mellanox performance degradation with more than 12 lcores Дмитрий Степанов
  2022-02-18 13:30 ` Asaf Penso
@ 2022-02-18 13:39 ` Dmitry Kozlyuk
  2022-02-18 16:14   ` Дмитрий Степанов
  1 sibling, 1 reply; 5+ messages in thread
From: Dmitry Kozlyuk @ 2022-02-18 13:39 UTC (permalink / raw)
  To: Дмитрий Степанов
  Cc: users

Hi,

> With more than 12 lcores, overall receive performance decreases.
> With 16-32 lcores I get 100-110 Mpps,

It is more about the number of queues than the number of cores:
12 queues is the threshold at which Multi-Packet Receive Queue (MPRQ)
is automatically enabled in the mlx5 PMD.
Try increasing --rxd and check out the mprq_en device argument.
Please see the mlx5 PMD user guide for details about MPRQ.
You should be able to get the full 148 Mpps with your HW.
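For illustration, one possible invocation based on the command from your
first message (treat the devargs and the --rxd value as an example only;
the mlx5 guide documents the exact MPRQ parameters for your DPDK version):

numactl -N 1 -m 1 /opt/dpdk-21.11/build/app/dpdk-testpmd -l 64-127 -n 4 \
  -a 0000:c1:00.0,mprq_en=1 \
  -- --stats-period 1 --nb-cores=12 --rxq=12 --txq=12 --rxd=2048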

> there is a significant drop at 33 lcores (84 Mpps),
> and with 63 lcores I get only 35 Mpps overall.
> 
> Are there any limits on the total number of receive queues (and thus
> lcores) that can serve a single port on this NIC?

This is a hardware limitation.
The limit on the number of queues you can create is very high (16M),
but performance scales perfectly only up to 32 queues
at high packet rates (as opposed to bit rates).
Using more queues can even degrade it, just as you observe.
One way to overcome this (not specific to mlx5)
is to use a ring buffer for incoming packets,
from which any number of processing cores can take packets.
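
For illustration, a rough sketch of that pattern with DPDK's rte_ring
(not a complete program: the port number, ring size, and burst size are
arbitrary, and real code needs ring-creation error handling and a proper
lcore launch):

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define BURST 32

static struct rte_ring *pkt_ring;   /* shared RX-to-workers ring */

static void
init_ring(void)
{
        /* Multi-producer/multi-consumer ring; size must be a power of two. */
        pkt_ring = rte_ring_create("rx_to_workers", 1 << 16,
                                   rte_socket_id(), 0);
}

/* A small number of RX lcores: each drains one NIC RX queue
 * and pushes the received mbufs into the shared ring. */
static int
rx_lcore(void *arg)
{
        uint16_t queue_id = *(uint16_t *)arg;
        struct rte_mbuf *bufs[BURST];

        for (;;) {
                uint16_t nb = rte_eth_rx_burst(0 /* port */, queue_id,
                                               bufs, BURST);
                if (nb == 0)
                        continue;
                unsigned int sent = rte_ring_enqueue_burst(pkt_ring,
                                (void **)bufs, nb, NULL);
                while (sent < nb)               /* drop what did not fit */
                        rte_pktmbuf_free(bufs[sent++]);
        }
        return 0;
}

/* Any number of worker lcores: pull packets from the ring and process them. */
static int
worker_lcore(void *arg)
{
        struct rte_mbuf *bufs[BURST];

        (void)arg;
        for (;;) {
                unsigned int nb = rte_ring_dequeue_burst(pkt_ring,
                                (void **)bufs, BURST, NULL);
                for (unsigned int i = 0; i < nb; i++) {
                        /* ... application processing ... */
                        rte_pktmbuf_free(bufs[i]);
                }
        }
        return 0;
}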


* Re: Mellanox performance degradation with more than 12 lcores
  2022-02-18 13:30 ` Asaf Penso
@ 2022-02-18 13:49   ` Дмитрий Степанов
  0 siblings, 0 replies; 5+ messages in thread
From: Дмитрий Степанов @ 2022-02-18 13:49 UTC (permalink / raw)
  To: Asaf Penso; +Cc: users


I get 125 Mpps from a single port using 12 lcores:
numactl -N 1 -m 1 /opt/dpdk-21.11/build/app/dpdk-testpmd -l 64-127 -n 4  -a
0000:c1:00.0 -- --stats-period 1 --nb-cores=12 --rxq=12 --txq=12 --rxd=512

With 63 lcores I get 35 Mpps:
numactl -N 1 -m 1 /opt/dpdk-21.11/build/app/dpdk-testpmd -l 64-127 -n 4  -a
0000:c1:00.0 -- --stats-period 1 --nb-cores=63 --rxq=63 --txq=63 --rxd=512

I'm using this guide as a reference:
https://fast.dpdk.org/doc/perf/DPDK_20_11_Mellanox_NIC_performance_report.pdf
The report gives examples of how to get the best performance, but all of
them use at most 12 lcores.
125 Mpps with 12 lcores is nearly the maximum I can get from a single 100GbE
port (148 Mpps is the theoretical maximum for 64-byte packets). I just want
to understand why I get good performance with 12 lcores and poor performance
with 63 lcores.
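(For reference, the 148 Mpps figure follows from the per-packet overhead on
the wire: each 64-byte frame is accompanied by 8 bytes of preamble/SFD and a
12-byte inter-frame gap, so 100 Gbit/s / ((64 + 20) bytes * 8 bits/byte)
≈ 148.8 Mpps.)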

On Fri, Feb 18, 2022 at 16:30, Asaf Penso <asafp@nvidia.com> wrote:

> Hello Dmitry,
>
> Could you please paste the testpmd command for each experiment?
>
> Also, have you looked into the dpdk.org performance report to see how to
> tune for the best results?
>
> Regards,
> Asaf Penso
> ------------------------------
> *From:* Дмитрий Степанов <stepanov.dmit@gmail.com>
> *Sent:* Friday, February 18, 2022 9:32:59 AM
> *To:* users@dpdk.org <users@dpdk.org>
> *Subject:* Mellanox performance degradation with more than 12 lcores
>
> Hi folks!
>
> I'm using a Mellanox ConnectX-6 Dx EN adapter card (100GbE; dual-port
> QSFP56; PCIe 4.0/3.0 x16) with DPDK 21.11 on a server with an AMD EPYC 7702
> 64-core processor (a NUMA system with 2 sockets). Hyperthreading is turned
> off.
> I'm testing the maximum receive throughput I can get from a single port
> using the testpmd utility shipped with DPDK. My generator produces random
> UDP packets with zero payload length.
>
> I get the maximum performance using 8-12 lcores (120-125 Mpps overall on
> the receive path of a single port):
>
> numactl -N 1 -m 1 /opt/dpdk-21.11/build/app/dpdk-testpmd -l 64-127 -n 4
>  -a 0000:c1:00.0 -- --stats-period 1 --nb-cores=12 --rxq=12 --txq=12
> --rxd=512
>
> With more than 12 lcores, overall receive performance decreases. With 16-32
> lcores I get 100-110 Mpps, there is a significant drop at 33 lcores
> (84 Mpps), and with 63 lcores I get only 35 Mpps overall.
>
> Are there any limits on the total number of receive queues (and thus
> lcores) that can serve a single port on this NIC?
>
> Thanks,
> Dmitriy Stepanov
>



* Re: Mellanox performance degradation with more than 12 lcores
  2022-02-18 13:39 ` Dmitry Kozlyuk
@ 2022-02-18 16:14   ` Дмитрий Степанов
  0 siblings, 0 replies; 5+ messages in thread
From: Дмитрий Степанов @ 2022-02-18 16:14 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: users


Thanks for the clarification!
I was able to get 148 Mpps with 12 lcores after some BIOS tuning.
It looks like, because of these HW limitations, I'll have to use a ring
buffer as you suggested to support more than 32 lcores!

On Fri, Feb 18, 2022 at 16:40, Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:

> Hi,
>
> > With more than 12 lcores, overall receive performance decreases.
> > With 16-32 lcores I get 100-110 Mpps,
>
> It is more about the number of queues than the number of cores:
> 12 queues is the threshold at which Multi-Packet Receive Queue (MPRQ)
> is automatically enabled in the mlx5 PMD.
> Try increasing --rxd and check out the mprq_en device argument.
> Please see the mlx5 PMD user guide for details about MPRQ.
> You should be able to get the full 148 Mpps with your HW.
>
> > there is a significant drop at 33 lcores (84 Mpps),
> > and with 63 lcores I get only 35 Mpps overall.
> >
> > Are there any limits on the total number of receive queues (and thus
> > lcores) that can serve a single port on this NIC?
>
> This is a hardware limitation.
> The limit on the number of queues you can create is very high (16M),
> but performance scales perfectly only up to 32 queues
> at high packet rates (as opposed to bit rates).
> Using more queues can even degrade it, just as you observe.
> One way to overcome this (not specific to mlx5)
> is to use a ring buffer for incoming packets,
> from which any number of processing cores can take packets.
>



Thread overview: 5+ messages
-- links below jump to the message on this page --
2022-02-18  7:32 Mellanox performance degradation with more than 12 lcores Дмитрий Степанов
2022-02-18 13:30 ` Asaf Penso
2022-02-18 13:49   ` Дмитрий Степанов
2022-02-18 13:39 ` Dmitry Kozlyuk
2022-02-18 16:14   ` Дмитрий Степанов
