DPDK usage discussions
From: Dariusz Sosnowski <dsosnowski@nvidia.com>
To: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Cc: "users@dpdk.org" <users@dpdk.org>
Subject: RE: [net/mlx5] Performance drop with HWS compared to SWS
Date: Wed, 19 Jun 2024 19:15:30 +0000	[thread overview]
Message-ID: <PH0PR12MB880055029212CFDAEB0728C2A4CF2@PH0PR12MB8800.namprd12.prod.outlook.com> (raw)
In-Reply-To: <20240613231448.63f1dbbd@sovereign>

Hi,

Thank you for running all the tests and for all the data. Really appreciated.

> -----Original Message-----
> From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
> Sent: Thursday, June 13, 2024 22:15
> To: Dariusz Sosnowski <dsosnowski@nvidia.com>
> Cc: users@dpdk.org
> Subject: Re: [net/mlx5] Performance drop with HWS compared to SWS
> 
> Hi Dariusz,
> 
> Thank you for looking into the issue, please find full details below.
> 
> Summary:
> 
> Case       SWS (Mpps)  HWS (Mpps)
> --------   ----------  ----------
> baseline       148          -
> jump_rss        37        148
> jump_miss      148        107
> jump_drop      148        107
> 
> From "baseline" vs "jump_rss", the problem is not in jump.
> From "jump_miss" vs "jump_drop", the problem is not only in miss.
> This is a lab so I can try anything else you need for diagnostic.
> 
> Disabling flow control only fixes the number of packets received by PHY, but not
> the number of packets processed by steering.
> 
> > - Could you share mlnx_perf stats for SWS case as well?
> 
>       rx_vport_unicast_packets: 151,716,299
>         rx_vport_unicast_bytes: 9,709,843,136 Bps    = 77,678.74 Mbps
>                 rx_packets_phy: 151,716,517
>                   rx_bytes_phy: 9,709,856,896 Bps    = 77,678.85 Mbps
>                rx_64_bytes_phy: 151,716,867 Bps      = 1,213.73 Mbps
>                 rx_prio0_bytes: 9,710,051,648 Bps    = 77,680.41 Mbps
>               rx_prio0_packets: 151,719,564
> 
> > - If group 1 had a flow rule with empty match and RSS action, is the performance difference the same?
> >   (This would help to understand if the problem is with miss behavior or with jump between group 0 and group 1).
> 
> Case "baseline"
> ===============
> No flow rules, just to make sure the host can poll the NIC fast enough.
> Result: 148 Mpps
> 
> /root/build/app/dpdk-testpmd -l 0-31,64-95 -a 21:00.0,dv_flow_en=1,mprq_en=1,rx_vec_en=1 --in-memory -- \
>         -i --rxq=32 --txq=32 --forward-mode=rxonly --nb-cores=32
> 
> mlnx_perf -i enp33s0f0np0 -t 1
> 
>       rx_vport_unicast_packets: 151,622,123
>         rx_vport_unicast_bytes: 9,703,815,872 Bps    = 77,630.52 Mbps
>                 rx_packets_phy: 151,621,983
>                   rx_bytes_phy: 9,703,807,872 Bps    = 77,630.46 Mbps
>                rx_64_bytes_phy: 151,621,026 Bps      = 1,212.96 Mbps
>                 rx_prio0_bytes: 9,703,716,480 Bps    = 77,629.73 Mbps
>               rx_prio0_packets: 151,620,576
> 
> Attached: "neohost-cx6dx-baseline-sws.txt".
> 
> Case "jump_rss", SWS
> ====================
> Jump to group 1, then RSS.
> Result: 37 Mpps (?!)
> This "37 Mpps" seems to be caused by PCIe bottleneck, which MPRQ is supposed
> to overcome.
> Is MPRQ limited only to default RSS in SWS mode?
> 
> /root/build/app/dpdk-testpmd -l 0-31,64-95 -a 21:00.0,dv_flow_en=1,mprq_en=1,rx_vec_en=1 --in-memory -- \
>         -i --rxq=32 --txq=32 --forward-mode=rxonly --nb-cores=32
> 
> flow create 0 ingress group 0 pattern end actions jump group 1 / end
> flow create 0 ingress group 1 pattern end actions rss queues 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 end / end
> start
> 
> mlnx_perf -i enp33s0f0np0 -t 1:
> 
>       rx_vport_unicast_packets: 38,155,359
>         rx_vport_unicast_bytes: 2,441,942,976 Bps    = 19,535.54 Mbps
>                 tx_packets_phy: 7,586
>                 rx_packets_phy: 151,531,694
>                   tx_bytes_phy: 485,568 Bps          = 3.88 Mbps
>                   rx_bytes_phy: 9,698,029,248 Bps    = 77,584.23 Mbps
>             tx_mac_control_phy: 7,587
>              tx_pause_ctrl_phy: 7,587
>                rx_discards_phy: 113,376,265
>                rx_64_bytes_phy: 151,531,748 Bps      = 1,212.25 Mbps
>     rx_buffer_passed_thres_phy: 203
>                 rx_prio0_bytes: 9,698,066,560 Bps    = 77,584.53 Mbps
>               rx_prio0_packets: 38,155,328
>              rx_prio0_discards: 113,376,963
>                tx_global_pause: 7,587
>       tx_global_pause_duration: 1,018,266
> 
> Attached: "neohost-cx6dx-jump_rss-sws.txt".

How are you generating the traffic? Are both IP addresses and TCP ports changing?

"jump_rss" case degradation seems to be caused by RSS configuration.
It appears that packets are not distributed across all queues.
With these flow commands in SWS all packets should go to queue 0 only.
Could you please check if that's the case on your side?
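
One possible way to check the per-queue distribution directly in testpmd (just a sketch; in rxonly mode with one forwarding core per queue, each forwarding stream corresponds to a single Rx queue):

testpmd> show fwd stats all

If only the stream polling queue 0 shows a non-zero RX-packets count, that would confirm it.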

This can be alleviated by specifying RSS hash types on the RSS action:

flow create 0 ingress group 0 pattern end actions jump group 1 / end
flow create 0 ingress group 1 pattern end actions rss queues <queues> end types ip tcp end / end

Could you please try that on your side?

With the HWS flow engine, if the RSS action does not have hash types specified, the implementation defaults to hashing on IP addresses.
If IP addresses are variable in your test traffic, that would explain the difference.
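
In other words, under HWS an RSS action with no hash types behaves roughly like one written with "types ip end"; using the SWS syntax from above purely as an illustration (not a command to run):

flow create 0 ingress group 1 pattern end actions rss queues <queues> end types ip end / end

whereas adding "types ip tcp end", as suggested above, hashes on both the IP addresses and the TCP ports.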

> Case "jump_rss", HWS
> ====================
> Result: 148 Mpps
> 
> /root/build/app/dpdk-testpmd -l 0-31,64-95 -a 21:00.0,dv_flow_en=2,mprq_en=1,rx_vec_en=1 --in-memory -- \
>         -i --rxq=32 --txq=32 --forward-mode=rxonly --nb-cores=32
> 
> port stop 0
> flow configure 0 queues_number 1 queues_size 128 counters_number 16
> port start 0
> flow pattern_template 0 create pattern_template_id 1 ingress template end
> flow actions_template 0 create ingress actions_template_id 1 template jump group 1 / end mask jump group 0xFFFFFFFF / end
> flow template_table 0 create ingress group 0 table_id 1 pattern_template 1 actions_template 1 rules_number 1
> flow queue 0 create 0 template_table 1 pattern_template 0 actions_template 0 postpone false pattern end actions jump group 1 / end
> flow pull 0 queue 0
> flow actions_template 0 create ingress actions_template_id 2 template rss / end mask rss / end
> flow template_table 0 create ingress group 1 table_id 2 pattern_template 1 actions_template 2 rules_number 1
> flow queue 0 create 0 template_table 2 pattern_template 0 actions_template 0 postpone false pattern end actions rss queues 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 end / end
> flow pull 0 queue 0
> start
> 
> mlnx_perf -i enp33s0f0np0 -t 1:
> 
>       rx_vport_unicast_packets: 151,514,131
>         rx_vport_unicast_bytes: 9,696,904,384 Bps    = 77,575.23 Mbps
>                 rx_packets_phy: 151,514,275
>                   rx_bytes_phy: 9,696,913,600 Bps    = 77,575.30 Mbps
>                rx_64_bytes_phy: 151,514,122 Bps      = 1,212.11 Mbps
>                 rx_prio0_bytes: 9,696,814,528 Bps    = 77,574.51 Mbps
>               rx_prio0_packets: 151,512,717
> 
> Attached: "neohost-cx6dx-jump_rss-hws.txt".
> 
> > - Would you be able to do the test with miss in empty group 1, with Ethernet Flow Control disabled?
> 
> $ ethtool -A enp33s0f0np0 rx off tx off
> 
> $ ethtool -a enp33s0f0np0
> Pause parameters for enp33s0f0np0:
> Autonegotiate:  off
> RX:             off
> TX:             off
> 
> testpmd> show port 0 flow_ctrl
> 
> ********************* Flow control infos for port 0 *********************
> FC mode:
>    Rx pause: off
>    Tx pause: off
> Autoneg: off
> Pause time: 0x0
> High waterline: 0x0
> Low waterline: 0x0
> Send XON: off
> Forward MAC control frames: off
> 
> 
> Case "jump_miss", SWS
> =====================
> Result: 148 Mpps
> 
> /root/build/app/dpdk-testpmd -l 0-31,64-95 -a 21:00.0,dv_flow_en=1,mprq_en=1,rx_vec_en=1 --in-memory -- \
>         -i --rxq=32 --txq=32 --forward-mode=rxonly --nb-cores=32
> 
> flow create 0 ingress group 0 pattern end actions jump group 1 / end
> start
> 
> mlnx_perf -i enp33s0f0np0
> 
>       rx_vport_unicast_packets: 151,526,489
>         rx_vport_unicast_bytes: 9,697,695,296 Bps    = 77,581.56 Mbps
>                 rx_packets_phy: 151,526,193
>                   rx_bytes_phy: 9,697,676,672 Bps    = 77,581.41 Mbps
>                rx_64_bytes_phy: 151,525,423 Bps      = 1,212.20 Mbps
>                 rx_prio0_bytes: 9,697,488,256 Bps    = 77,579.90 Mbps
>               rx_prio0_packets: 151,523,240
> 
> Attached: "neohost-cx6dx-jump_miss-sws.txt".
> 
> 
> Case "jump_miss", HWS
> =====================
> Result: 107 Mpps
> Neohost shows RX Packet Rate = 148 Mpps, but RX Steering Packets = 107 Mpps.
> 
> /root/build/app/dpdk-testpmd -l 0-31,64-95 -a 21:00.0,dv_flow_en=2,mprq_en=1,rx_vec_en=1 --in-memory -- \
>         -i --rxq=32 --txq=32 --forward-mode=rxonly --nb-cores=32
> 
> port stop 0
> flow configure 0 queues_number 1 queues_size 128 counters_number 16
> port start 0
> flow pattern_template 0 create pattern_template_id 1 ingress template end
> flow actions_template 0 create ingress actions_template_id 1 template jump group 1 / end mask jump group 0xFFFFFFFF / end
> flow template_table 0 create ingress group 0 table_id 1 pattern_template 1 actions_template 1 rules_number 1
> flow queue 0 create 0 template_table 1 pattern_template 0 actions_template 0 postpone false pattern end actions jump group 1 / end
> flow pull 0 queue 0
> 
> mlnx_perf -i enp33s0f0np0
> 
>        rx_steer_missed_packets: 109,463,466
>       rx_vport_unicast_packets: 109,463,450
>         rx_vport_unicast_bytes: 7,005,660,800 Bps    = 56,045.28 Mbps
>                 rx_packets_phy: 151,518,062
>                   rx_bytes_phy: 9,697,155,840 Bps    = 77,577.24 Mbps
>                rx_64_bytes_phy: 151,516,201 Bps      = 1,212.12 Mbps
>                 rx_prio0_bytes: 9,697,137,280 Bps    = 77,577.9 Mbps
>               rx_prio0_packets: 151,517,782
>           rx_prio0_buf_discard: 42,055,156
> 
> Attached: "neohost-cx6dx-jump_miss-hws.txt".

As you can see, HWS provides the "rx_steer_missed_packets" counter, which is not available with SWS.
It counts packets that did not hit any flow rule and in the end had to be dropped.
Enabling this counter requires additional HW flows that handle packets which did not hit any rule,
and these flows have a side effect: at very high packet rates they create enough backpressure
to cause Rx buffer overflow on CX6 Dx.

After some internal discussions, I learned that this is more or less expected,
because such a high number of missed packets is already an indication of a problem:
NIC resources are wasted on packets for which there is no specified destination.
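
For reference, the same counters can also be sampled directly with ethtool (mlnx_perf derives its per-second rates from the same statistics); a minimal check, assuming the same interface name as in your tests:

$ ethtool -S enp33s0f0np0 | grep -E 'rx_steer_missed_packets|rx_prio0_buf_discard'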

> Case "jump_drop", SWS
> =====================
> Result: 148 Mpps
> Match all in group 0, jump to group 1; match all in group 1, drop.
> 
> /root/build/app/dpdk-testpmd -l 0-31,64-95 -a 21:00.0,dv_flow_en=1,mprq_en=1,rx_vec_en=1 --in-memory -- \
>         -i --rxq=32 --txq=32 --forward-mode=rxonly --nb-cores=32
> 
> flow create 0 ingress group 0 pattern end actions jump group 1 / end
> flow create 0 ingress group 1 pattern end actions drop / end
> 
> mlnx_perf -i enp33s0f0np0
> 
>       rx_vport_unicast_packets: 151,705,269
>         rx_vport_unicast_bytes: 9,709,137,216 Bps    = 77,673.9 Mbps
>                 rx_packets_phy: 151,701,498
>                   rx_bytes_phy: 9,708,896,128 Bps    = 77,671.16 Mbps
>                rx_64_bytes_phy: 151,693,532 Bps      = 1,213.54 Mbps
>                 rx_prio0_bytes: 9,707,005,888 Bps    = 77,656.4 Mbps
>               rx_prio0_packets: 151,671,959
> 
> Attached: "neohost-cx6dx-jump_drop-sws.txt".
> 
> 
> Case "jump_drop", HWS
> =====================
> Result: 107 Mpps
> Match all in group 0, jump to group 1; match all in group 1, drop.
> I've also run this test with a counter attached to the dropping table, and it
> showed that indeed only 107 Mpps hit the rule.
> 
> /root/build/app/dpdk-testpmd -l 0-31,64-95 -a 21:00.0,dv_flow_en=2,mprq_en=1,rx_vec_en=1 --in-memory -- \
>         -i --rxq=32 --txq=32 --forward-mode=rxonly --nb-cores=32
> 
> port stop 0
> flow configure 0 queues_number 1 queues_size 128 counters_number 16
> port start 0
> flow pattern_template 0 create pattern_template_id 1 ingress template end
> flow actions_template 0 create ingress actions_template_id 1 template jump group 1 / end mask jump group 0xFFFFFFFF / end
> flow template_table 0 create ingress group 0 table_id 1 pattern_template 1 actions_template 1 rules_number 1
> flow queue 0 create 0 template_table 1 pattern_template 0 actions_template 0 postpone false pattern end actions jump group 1 / end
> flow pull 0 queue 0
> flow actions_template 0 create ingress actions_template_id 2 template drop / end mask drop / end
> flow template_table 0 create ingress group 1 table_id 2 pattern_template 1 actions_template 2 rules_number 1
> flow queue 0 create 0 template_table 2 pattern_template 0 actions_template 0 postpone false pattern end actions drop / end
> flow pull 0 queue 0
> 
> mlnx_perf -i enp33s0f0np0
> 
>       rx_vport_unicast_packets: 109,500,637
>         rx_vport_unicast_bytes: 7,008,040,768 Bps    = 56,064.32 Mbps
>                 rx_packets_phy: 151,568,915
>                   rx_bytes_phy: 9,700,410,560 Bps    = 77,603.28 Mbps
>                rx_64_bytes_phy: 151,569,146 Bps      = 1,212.55 Mbps
>                 rx_prio0_bytes: 9,699,889,216 Bps    = 77,599.11 Mbps
>               rx_prio0_packets: 151,560,756
>           rx_prio0_buf_discard: 42,065,705
> 
> Attached: "neohost-cx6dx-jump_drop-hws.txt".

We're still looking into the "jump_drop" case.

By the way, may I ask what your target use case for HWS is?

Best regards,
Dariusz Sosnowski
