Hello,

I ran experiments where I sent packets to the hairpin queues and the CPU queue at the same time. During testing, I found that when the CPU queue is sufficiently overloaded, the hairpin queues also begin to drop packets.

Example 1: Sending 10 Gbps to hairpin queues.
Resulting throughput is 10 Gbps - expected result.

Example 2: Sending 20 Gbps to the CPU queue.
Resulting throughput is 11 Gbps (9 Gbps dropped) - expected result.

Example 3: Sending 10 Gbps to hairpin queues and 20 Gbps to the CPU queue.
Resulting throughput is 21 Gbps: 10 Gbps from hairpin (zero packet drop) + 11 Gbps from CPU - expected result.

Example 4: Sending 10 Gbps to hairpin queues and 50 Gbps to the CPU queue.
Resulting throughput is 16 Gbps: 5 Gbps from hairpin (50%+ packet drop) + 11 Gbps from CPU - unexpected result.

Experiment setup:

sudo mlxconfig -y -d 0000:c4:00.0 set MEMIC_SIZE_LIMIT=0 HAIRPIN_DATA_BUFFER_LOCK=1
sudo mlxfwreset -y -d 0000:c4:00.0 reset

sudo dpdk-testpmd -l 0-1 -n 4 -a 0000:c4:00.0,hp_buf_log_sz=13 -- --rxq=1 --txq=1 --hairpinq=12 --hairpin-mode=0x1110 -i
    flow create 0 ingress pattern eth src is 00:10:94:00:00:02 / end actions queue index 0 / end
    flow create 0 ingress pattern eth src is 00:10:94:00:00:03 / end actions rss queues 1 2 3 4 5 6 7 8 end / end

So I cannot achieve my goal: that traffic on the hairpin queues is not dropped even when the CPU queue is overloaded.

Any idea how to achieve this in example 4? What is the cause - packet buffers/memory in the device that are shared between the hairpin and CPU queues? Any guidance or suggestions would be greatly appreciated. For reference, a sketch of how I understand the hairpin queue setup at the API level is at the end of this mail.

Mário

On 27/06/2024 13:42, Mário Kuka wrote:
> Hi Dmitry,
>
> Thank you for your helpful reply.
>> Try enabling "Explicit Tx rule" mode if possible.
>> I was able to achieve 137 Mpps @ 64B with the following command:
>>
>> dpdk-testpmd -a 21:00.0 -a c1:00.0 --in-memory -- \
>> -i --rxq=1 --txq=1 --hairpinq=8 --hairpin-mode=0x10
>
> Based on this I was able to achieve 142 Mpps (96.08 Gbps) @ 64B with the following command:
>
> sudo dpdk-testpmd -l 0-1 -n 4 -a 0000:c4:00.0,hp_buf_log_sz=13 \
> --in-memory -- --rxq=1 --txq=1 --hairpinq=12 --hairpin-mode=0x10 -i
>
> flow create 0 ingress pattern eth src is 00:10:94:00:00:02 / end actions rss queues 1 2 3 4 5 6 7 8 9 10 11 12 end / end
>
> Almost full speed :).
> Any other value of "hp_buf_log_sz" or more queues does not give better results, but instead makes them worse.
>
>> RxQ pinned in device memory requires firmware configuration [1]:
>>
>> mlxconfig -y -d $pci_addr set MEMIC_SIZE_LIMIT=0 HAIRPIN_DATA_BUFFER_LOCK=1
>> mlxfwreset -y -d $pci_addr reset
>>
>> [1]: https://doc.dpdk.org/guides/platform/mlx5.html?highlight=hairpin_data_buffer_lock
>>
>> However, pinned RxQ didn't improve anything for me.
>
> I tried it, but it didn't improve anything for me either.
>
> Mário
>
> On 25/06/2024 02:22, Dmitry Kozlyuk wrote:
>> Hi Mário,
>>
>> 2024-06-19 08:45 (UTC+0200), Mário Kuka:
>>> Hello,
>>>
>>> I want to use hairpin queues to forward high priority traffic (such as
>>> LACP).
>>> My goal is to ensure that this traffic is not dropped in case the
>>> software pipeline is overwhelmed.
>>> But during testing with dpdk-testpmd I can't achieve full throughput for
>>> hairpin queues.
>> For maintainers: I'd like to express interest in this use case too.
>>
>>> The best result I have been able to achieve for 64B packets is 83 Gbps
>>> in this configuration:
>>> $ sudo dpdk-testpmd -l 0-1 -n 4 -a 0000:17:00.0,hp_buf_log_sz=19 --
>>> --rxq=1 --txq=1 --rxd=4096 --txd=4096 --hairpinq=2
>>> testpmd> flow create 0 ingress pattern eth src is 00:10:94:00:00:03 /
>>> end actions rss queues 1 2 end / end
>> Try enabling "Explicit Tx rule" mode if possible.
>> I was able to achieve 137 Mpps @ 64B with the following command:
>>
>> dpdk-testpmd -a 21:00.0 -a c1:00.0 --in-memory -- \
>> -i --rxq=1 --txq=1 --hairpinq=8 --hairpin-mode=0x10
>>
>> You might get even better speed, because my flow rules were more complicated
>> (RTE Flow based "router on-a-stick"):
>>
>> flow create 0 ingress group 1 pattern eth / vlan vid is 721 / end actions of_set_vlan_vid vlan_vid 722 / rss queues 1 2 3 4 5 6 7 8 end / end
>> flow create 1 ingress group 1 pattern eth / vlan vid is 721 / end actions of_set_vlan_vid vlan_vid 722 / rss queues 1 2 3 4 5 6 7 8 end / end
>> flow create 0 ingress group 1 pattern eth / vlan vid is 722 / end actions of_set_vlan_vid vlan_vid 721 / rss queues 1 2 3 4 5 6 7 8 end / end
>> flow create 1 ingress group 1 pattern eth / vlan vid is 722 / end actions of_set_vlan_vid vlan_vid 721 / rss queues 1 2 3 4 5 6 7 8 end / end
>> flow create 0 ingress group 0 pattern end actions jump group 1 / end
>> flow create 1 ingress group 0 pattern end actions jump group 1 / end
>>
>>> For packets in the range 68-80B I measured even lower throughput.
>>> Full throughput I measured only from packets larger than 112B.
>>>
>>> For only one queue, I didn't get more than 55 Gbps:
>>> $ sudo dpdk-testpmd -l 0-1 -n 4 -a 0000:17:00.0,hp_buf_log_sz=19 --
>>> --rxq=1 --txq=1 --rxd=4096 --txd=4096 --hairpinq=1 -i
>>> testpmd> flow create 0 ingress pattern eth src is 00:10:94:00:00:03 /
>>> end actions queue index 1 / end
>>>
>>> I tried to use locked device memory for TX and RX queues, but it seems
>>> that this is not supported:
>>> "--hairpin-mode=0x011000" (bit 16 - hairpin TX queues will use locked
>>> device memory, bit 12 - hairpin RX queues will use locked device memory)
>> RxQ pinned in device memory requires firmware configuration [1]:
>>
>> mlxconfig -y -d $pci_addr set MEMIC_SIZE_LIMIT=0 HAIRPIN_DATA_BUFFER_LOCK=1
>> mlxfwreset -y -d $pci_addr reset
>>
>> [1]: https://doc.dpdk.org/guides/platform/mlx5.html?highlight=hairpin_data_buffer_lock
>>
>> However, pinned RxQ didn't improve anything for me.
>>
>> TxQ pinned in device memory is not supported by net/mlx5.
>> TxQ pinned to DPDK memory made performance awful (predictably).
>>
>>> I was expecting that achieving full throughput with hairpin queues would
>>> not be a problem.
>>> Is my expectation too optimistic?
>>>
>>> What other parameters besides 'hp_buf_log_sz' can I use to achieve full
>>> throughput?
>> In my experiments, the default "hp_buf_log_sz" of 16 is optimal.
>> The most influential parameter appears to be the number of hairpin queues.
>>
>>> I tried combining the following parameters: mprq_en=, rxqs_min_mprq=,
>>> mprq_log_stride_num=, txq_inline_mpw=, rxq_pkt_pad_en=,
>>> but with no positive impact on throughput.
>
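The sketch mentioned above: this is how I understand the locked-device-memory request maps onto the rte_ethdev hairpin API (assuming DPDK 22.11 or newer, where struct rte_eth_hairpin_conf has the use_locked_device_memory and force_memory flags). It is only a sketch, not my actual code; the helper name, port id, queue id and descriptor count are placeholders.

#include <rte_ethdev.h>

/*
 * Sketch only: set up one hairpin RX/TX queue pair that loops traffic
 * back out of the same port, with the RX buffer placed in locked device
 * memory (the part that needs HAIRPIN_DATA_BUFFER_LOCK=1 in firmware).
 * Call after rte_eth_dev_configure() and before rte_eth_dev_start(),
 * like a normal queue setup.
 */
static int
setup_hairpin_pair(uint16_t port_id, uint16_t queue_id, uint16_t nb_desc)
{
	struct rte_eth_hairpin_conf rx_conf = {
		.peer_count = 1,
		.tx_explicit = 1,               /* "Explicit Tx rule" mode */
		.use_locked_device_memory = 1,  /* RX buffer in device memory */
		.force_memory = 1,              /* fail instead of falling back */
		.peers[0] = { .port = port_id, .queue = queue_id },
	};
	struct rte_eth_hairpin_conf tx_conf = {
		.peer_count = 1,
		.tx_explicit = 1,
		/* no locked device memory here: not supported by net/mlx5 */
		.peers[0] = { .port = port_id, .queue = queue_id },
	};
	int ret;

	ret = rte_eth_rx_hairpin_queue_setup(port_id, queue_id, nb_desc, &rx_conf);
	if (ret != 0)
		return ret;
	return rte_eth_tx_hairpin_queue_setup(port_id, queue_id, nb_desc, &tx_conf);
}

As far as I can tell, the memory and explicit-Tx-rule bits of testpmd's --hairpin-mode are just these conf flags; with tx_explicit set, the application is responsible for inserting the Tx-side flow rules itself.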