DPDK usage discussions
* Synchronized CPU stalls with 8 queues on Mellanox MLX5 NIC
@ 2024-01-09 14:47 Lukáš Šišmiš
  2024-02-20  7:41 ` Maayan Kashani
  0 siblings, 1 reply; 3+ messages in thread
From: Lukáš Šišmiš @ 2024-01-09 14:47 UTC (permalink / raw)
  To: users

[-- Attachment #1: Type: text/plain, Size: 3162 bytes --]

Hi everyone,


For the past few weeks I have been trying to debug why independent 
application workers show the same access patterns to a Mellanox NIC.
The application I am debugging is Suricata, and the profiling tool I am 
using is primarily Intel VTune.

I am using 8 cores for packet processing, and each core has an independent 
processing queue. All application cores are on the same NUMA node. 
Importantly, this only happens on the Mellanox/NVIDIA NIC (currently MT2892 
Family - mlx5) and NOT on the X710. Suricata is compiled with DPDK (2 
versions tested, replicated on both - master 1dcf69b211 
(https://github.com/OISF/suricata/) and a version with interrupt support 
(commit c822f66b - 
https://github.com/lukashino/suricata/commits/feat-power-saving-v4/)).
For packet generation I use the Trex packet generator on a separate 
server in ASTF mode with the command "start -f astf/http_simple.py -m 
6000". The traffic exchanged between the two Trex interfaces is mirrored 
on a switch to the Suricata interface, which yields roughly 4.6 Gbps of 
traffic. The traffic is a simple HTTP GET request, but the flows 
alternate on each iteration by incrementing an IP address, so RSS 
distributes the traffic evenly across all cores. The problem occurs both 
at 500 Mbps and at 20 Gbps transmit speed.
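
For reference, a minimal sketch of how such a per-worker RX-queue setup 
with RSS typically looks in DPDK (this is not Suricata's actual code; 
port_id, the queue count, descriptor count and mempool are placeholders):

    /* Assumed sketch: configure a port with N RX queues and RSS so each
     * worker core polls its own queue independently. */
    #include <rte_ethdev.h>

    static int setup_rss_port(uint16_t port_id, uint16_t nb_rxq,
                              struct rte_mempool *mp)
    {
        struct rte_eth_conf conf = {
            .rxmode = { .mq_mode = RTE_ETH_MQ_RX_RSS },
            .rx_adv_conf.rss_conf = {
                .rss_key = NULL,                        /* PMD default key */
                .rss_hf  = RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP,
            },
        };
        int ret = rte_eth_dev_configure(port_id, nb_rxq,
                                        0 /* no TX queues */, &conf);
        if (ret != 0)
            return ret;
        for (uint16_t q = 0; q < nb_rxq; q++) {
            ret = rte_eth_rx_queue_setup(port_id, q, 1024 /* descriptors */,
                                         rte_eth_dev_socket_id(port_id),
                                         NULL /* default queue conf */, mp);
            if (ret != 0)
                return ret;
        }
        return rte_eth_dev_start(port_id);
    }

Each worker then calls rte_eth_rx_burst() on its own queue, so apart from 
the shared device the queues should in principle be independent.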

This is a flame graph from one of the runs. I wonder why the CPUs show 
almost synchronized no-CPU/some-CPU activity in the graph below. The 
worker cores are denoted "W#0..." and fall into 2 groups that alternate. 
The CPU stalls are especially visible in regions of high CPU activity, 
but they are also present with low CPU activity. The high/low level of 
CPU activity is not relevant here, as I am only interested in the 
pattern of CPU stalls. It suggests some shared resource, but even with a 
shared resource the cores would not pause synchronously - they would be 
blocked at random times.
I am debugging the application with interrupts enabled, but the same 
pattern occurs when poll mode is enabled. When polling mode is active, I 
filtered out mlx5 module activity from the VTune result and was still 
able to see CPU pauses ranging from 0.5 to 1 second across all cores.
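
For context, the interrupt-enabled branch presumably follows the standard 
DPDK RX-interrupt pattern, roughly like the sketch below (assumed, not the 
actual branch code; the port/queue IDs and the 100 ms timeout are 
placeholders, and .intr_conf.rxq = 1 must be set at configure time):

    /* Sketch of the usual DPDK RX-interrupt flow: arm the queue interrupt,
     * sleep until traffic arrives (or a timeout), then return to polling. */
    #include <rte_ethdev.h>
    #include <rte_interrupts.h>

    static void wait_for_rx(uint16_t port_id, uint16_t queue_id)
    {
        struct rte_epoll_event ev;

        /* Normally registered once per queue at startup. */
        rte_eth_dev_rx_intr_ctl_q(port_id, queue_id, RTE_EPOLL_PER_THREAD,
                                  RTE_INTR_EVENT_ADD, NULL);

        rte_eth_dev_rx_intr_enable(port_id, queue_id);     /* arm interrupt */
        rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, 100); /* block <= 100 ms */
        rte_eth_dev_rx_intr_disable(port_id, queue_id);    /* back to polling */
    }

Either way, the stall pattern shows up, so the interrupt path alone does 
not explain it.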

DPDK 8 cores, MLX5 NIC



I tried to profile Suricata in different scenarios, and this pattern of 
complete CPU stalls doesn't happen elsewhere.

e.g.

AF_PACKET, 8 cores, MLX5 NIC - the CPU activity is similar across cores, 
but the cores never pause:


DPDK 4 cores, MLX5 NIC



DPDK 9 cores, MLX5 NIC


DPDK 8 cores, X710 NIC, no CPU stalls on worker cores


Testpmd, MLX5, 8 cores - I tried to filter out the majority of RX NIC 
functions and the CPUs still seem to be continuously active. (It was 
running in rxonly forwarding mode, with 8 queues and 8 cores.) Though I 
am a bit skeptical about the CPU activity, as testpmd only 
receives/discards the traffic.
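
The testpmd run above would correspond to an invocation along these lines 
(reconstructed for illustration; the core list and PCI address are 
placeholders, not the exact command used):

    dpdk-testpmd -l 0-8 -a 0000:xx:00.0 -- --forward-mode=rxonly --rxq=8 --txq=8 --nb-cores=8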


It seems like the issue is connected with the MLX5 NIC and DPDK, as it 
works well with AF_PACKET and with a lower/higher number of threads.

Does anybody have an idea why the CPU stalls occur in combination with 8 
cores, or possibly what else I could do to mitigate or better evaluate 
the problem?

Thanks in advance.


Hopefully you will also receive the images.

Lukas

[-- Attachment #2.1: Type: text/html, Size: 4571 bytes --]

[-- Attachment #2.2: Kgu38Rg0ITYmQcez.png --]
[-- Type: image/png, Size: 265351 bytes --]

[-- Attachment #2.3: 3R04SpUEU6jZCWyq.png --]
[-- Type: image/png, Size: 88901 bytes --]

[-- Attachment #2.4: t1QqWEMKhiwKPkdI.png --]
[-- Type: image/png, Size: 62672 bytes --]

[-- Attachment #2.5: 9xI8VW6mO6mBSC1H.png --]
[-- Type: image/png, Size: 132732 bytes --]

[-- Attachment #2.6: NFlT3M0baqCeLjow.png --]
[-- Type: image/png, Size: 91397 bytes --]

[-- Attachment #2.7: q9c8ofGh06WvPfKj.png --]
[-- Type: image/png, Size: 337500 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: Synchronized CPU stalls with 8 queues on Mellanox MLX5 NIC
  2024-01-09 14:47 Synchronized CPU stalls with 8 queues on Mellanox MLX5 NIC Lukáš Šišmiš
@ 2024-02-20  7:41 ` Maayan Kashani
  0 siblings, 0 replies; 3+ messages in thread
From: Maayan Kashani @ 2024-02-20  7:41 UTC (permalink / raw)
  To: Lukáš Šišmiš, users


[-- Attachment #1.1: Type: text/plain, Size: 3747 bytes --]

Thanks Lukas,
We will look into it and reply soon.

Regards,
Maayan Kashani


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Synchronized CPU stalls with 8 queues on Mellanox MLX5 NIC
@ 2024-01-10 13:29 Lukáš Šišmiš
  0 siblings, 0 replies; 3+ messages in thread
From: Lukáš Šišmiš @ 2024-01-10 13:29 UTC (permalink / raw)
  To: users

Hi everyone,


For the past few weeks I have been trying to debug why independent 
application workers show the same access patterns to a Mellanox NIC.
The application I am debugging is Suricata, and the profiling tool I am 
using is primarily Intel VTune.

I am using 8 cores for packet processing, and each core has an independent 
processing queue. All application cores are on the same NUMA node. 
Importantly, this only happens on the Mellanox/NVIDIA NIC (currently MT2892 
Family - mlx5) and NOT on the X710. Suricata is compiled with DPDK (2 
versions tested, replicated on both - master 1dcf69b211 
(https://github.com/OISF/suricata/) and a version with interrupt support 
(commit c822f66b - 
https://github.com/lukashino/suricata/commits/feat-power-saving-v4/)). 
I've tried various numbers of descriptors, but the problem remained the same.
For packet generation I use the Trex packet generator on a separate 
server in ASTF mode with the command "start -f astf/http_simple.py -m 
6000". The traffic exchanged between the two Trex interfaces is mirrored 
on a switch to the Suricata interface, which yields roughly 4.6 Gbps of 
traffic. The traffic is a simple HTTP GET request, but the flows 
alternate on each iteration by incrementing an IP address, so RSS 
distributes the traffic evenly across all cores. The problem occurs both 
at 500 Mbps and at 20 Gbps transmit speed.

This is a flame graph from one of the runs. I wonder why the CPUs show 
almost synchronized no-CPU/some-CPU activity in the graph below. The 
worker cores are denoted "W#0..." and fall into 2 groups that alternate. 
The CPU stalls are especially visible in regions of high CPU activity, 
but they are also present with low CPU activity. The high/low level of 
CPU activity is not relevant here, as I am only interested in the 
pattern of CPU stalls. It suggests some shared resource, but even with a 
shared resource the cores would not pause synchronously - they would be 
blocked at random times.
I am debugging the application with interrupts enabled, but the same 
pattern occurs when poll mode is enabled. When polling mode is active, I 
filtered out mlx5 module activity from the VTune result and was still 
able to see CPU pauses ranging from 0.5 to 1 second across all cores.

DPDK 8 cores, MLX5 NIC

https://imgur.com/a/TrZ9vIy


I tried to profile Suricata in different scenarios, and this pattern of 
complete CPU stalls doesn't happen elsewhere.

e.g.

AF_PACKET, 8 cores, MLX5 NIC - the CPU activity is similar across cores, 
but the cores never pause:

https://imgur.com/a/HIhDVyQ


DPDK 4 cores, MLX5 NIC

https://imgur.com/a/G0JVOXa


DPDK 9 cores, MLX5 NIC

https://imgur.com/a/IdHCruj


DPDK 8 cores, X710 NIC, no CPU stalls on worker cores

https://imgur.com/a/94KLCjE


Testpmd, MLX5, 8 cores - I tried to filter out the majority of RX NIC 
functions and the CPUs still seem to be continuously active. (It was 
running in rxonly forwarding mode, with 8 queues and 8 cores.) Though I 
am a bit skeptical about the CPU activity, as testpmd only 
receives/discards the traffic.

https://imgur.com/a/UwHZzAr


It seems like the issue is connected with the MLX5 NIC and DPDK, as it 
works well with AF_PACKET and with a lower/higher number of threads.

Does anybody have an idea why the CPU stalls occur in combination with 8 
cores, or possibly what else I could do to mitigate or better evaluate 
the problem?

Thanks in advance.

Lukas


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2024-02-29  8:50 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-09 14:47 Synchronized CPU stalls with 8 queues on Mellanox MLX5 NIC Lukáš Šišmiš
2024-02-20  7:41 ` Maayan Kashani
2024-01-10 13:29 Lukáš Šišmiš
