* DPDK Ring Q
@ 2025-05-29 5:20 Lombardo, Ed
2025-05-29 9:41 ` Bruce Richardson
0 siblings, 1 reply; 2+ messages in thread
From: Lombardo, Ed @ 2025-05-29 5:20 UTC (permalink / raw)
To: dev
Hi,
I have an issue with DPDK 24.11.1 and a 2-port 100G Intel NIC (E810-C) on a dual-socket server with 22-core CPUs.
There is a dedicated CPU core that gets packets from DPDK using rte_eth_rx_burst() and enqueues the mbufs into a worker ring Q. This thread does nothing else. The NIC is dropping packets at 8.5 Gbps per port.
Studying the perf report, I was interested in common_ring_mc_dequeue(). Perf shows common_ring_mc_dequeue() at 92.86% Self and 92.86% Children.
Further down in the perf output I see rte_ring_enqueue_bulk() and rte_ring_enqueue_bulk_elem(); these are at 0.00% Self and 0.05% Children.
Perf also shows rte_ring_sp_enqueue_bulk_elem (inlined), which is what I wanted to see (single producer), representing the enqueue of the mbuf pointers onto the worker ring Q.
Is it possible to change common_ring_mc_dequeue() to common_ring_sc_dequeue()? Can it be set to one consumer on a single queue 0?
I believe this is preventing my setup from reaching my goal of 90 Gbps or higher.
I made sure the E810-C firmware is up to date; NIC FW version: 4.80 0x80020543 1.3805.0.
Perf report shows:
- 99.65% input_thread
   - 99.35% rte_eth_rx_burst (inlined)
      - ice_recv_scattered_pkts
           92.83% common_ring_mc_dequeue
Any thoughts or suggestions?
Thanks,
Ed
* Re: DPDK Ring Q
2025-05-29 5:20 DPDK Ring Q Lombardo, Ed
@ 2025-05-29 9:41 ` Bruce Richardson
0 siblings, 0 replies; 2+ messages in thread
From: Bruce Richardson @ 2025-05-29 9:41 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: dev
On Thu, May 29, 2025 at 05:20:25AM +0000, Lombardo, Ed wrote:
> Hi,
>
> I have an issue with DPDK 24.11.1 and 2 port 100G Intel NIC
> (E810-C) on 22 core CPU dual socket server.
>
>
> There is a dedicated CPU core to get the packets from DPDK using
> rte_eth_rx_burst() and enqueue the mbufs into a worker ring Q.
> This thread does nothing else. The NIC is dropping packets at 8.5
> Gbps per port.
>
>
> Studying the perf report, I was interested in the
> common_ring_mc_dequeue(). Perf tool shows common_ring_mc_dequeue()
> 92.86% Self and 92.86% Children.
>
>
> I see further with perf tool rte_ring_enqueue_bulk() and
> rte_ring_enqueue_bulk_elem(). These are at 0.00% Self and 0.05%
> Children.
>
> Perf tool shows rte_ring_sp_enqueue_bulk_elem (inlined) which is
> what I wanted to see (Single producer) representing the enqueue of
> the mbufs pointers to the worker ring Q.
>
>
> Is it possible to change the common_ring_mc_dequeue() to
> common_ring_sc_dequeue()? Can it be set to one consumer on single
> Queue 0.
>
>
> I believe this is limiting DPDK from reaching 90 Gbps or higher in
> my setup, which is my goal.
>
>
> I made sure the E810-C firmware was up to date, NIC FW Version:
> 4.80 0x80020543 1.3805.0
>
>
> Perf report shows:
>
> - 99.65% input_thread
>
>
> - 99.35% rte_eth_rx_burst (inlined)
>
>
> - ice_recv_scattered_pkts
>
>
> 92.83% common_ring_mc_dequeue
>
>
> Any thoughts or suggestions?
>
Since this is presumably from the thread that is doing the enqueuing to
the ring, that means that the common_ring_mc_dequeue is from the memory
pool implementation rather than your application ring. A certain amount
of ring dequeue time would be expected due to the buffers getting
allocated on one core and (presumably) freed or transmitted on a
different one. However, the amount of time spent in that dequeue seems
excessive.
Some suggestions:
* I'd suggest checking what mempool cache size your application is using,
and increasing it. The underlying ring implementation is only used once
there are no buffers in the mempool cache, so a larger cache should lead
to fewer (but larger) dequeues from the mempool. A cache size of 512
would be what I'd suggest trying.
* Since you are moving buffers away from the allocation core, I'd
suggest switching the underlying mempool implementation from the default
ring one to the stack mempool, to avoid cycling through the whole
mempool memory space. Although the stack mempool uses locks, it gives
better buffer recycling across cores.
* It is possible to switch the ring implementation to use an
"sc_dequeue" function, but I would view that as risky, since it would
mean that no core other than the Rx core can ever allocate a buffer
from the pool (unless you start adding locks, at which point you might
as well keep the "mc_dequeue" implementation). Therefore, I'd take the
two points above as better alternatives.
* Final thing to check - ensure you are not running out of buffers in
the pool. If you run out, then each dequeue will have the refill
function again checking the mempool ring to get more buffers, rather
than having a store of buffers locally in the per-core cache. Try
over-provisioning your memory pool a bit and see if it helps.
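To make the cache-size point concrete, here is a toy model in plain C (not DPDK code; in DPDK the knob is the cache_size argument to rte_pktmbuf_pool_create()): allocations are served from a per-lcore cache, and the shared ring is only touched with a bulk dequeue when the cache runs dry, so the number of ring operations drops roughly in proportion to the cache size.

```c
#include <assert.h>

/* Toy model of a per-lcore mempool cache in front of a shared ring.
 * Counts how many shared-ring bulk dequeues are needed to serve a
 * given number of buffer allocations with a given cache size. */
static unsigned long ring_dequeues(unsigned long allocs, unsigned cache_size)
{
    unsigned cached = 0;          /* buffers currently in the local cache */
    unsigned long dequeues = 0;   /* shared-ring operations performed */

    for (unsigned long i = 0; i < allocs; i++) {
        if (cached == 0) {
            dequeues++;           /* one bulk dequeue refills the cache */
            cached = cache_size;
        }
        cached--;                 /* hand one buffer to the application */
    }
    return dequeues;
}
```

For 1,000,000 allocations, a cache of 32 costs 31,250 ring dequeues, while a cache of 512 costs 1,954 - the same number of buffers moved, but far fewer trips to the contended shared structure.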
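On the ring-versus-stack point, another toy model (again not DPDK code; the DPDK switch is passing "stack" as the ops name to rte_pktmbuf_pool_create_by_ops()) shows the recycling difference: FIFO (ring-like) recycling cycles an alloc/free workload through every buffer in the pool, while LIFO (stack-like) recycling keeps re-using the most recently freed, and therefore cache-hot, buffers.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL 1024  /* toy pool of 1024 buffer indices */

/* Repeatedly allocate one buffer and free it back, using either
 * LIFO (stack-like) or FIFO (ring-like) recycling, and count how
 * many distinct buffers the workload ends up touching. */
static unsigned distinct_touched(bool lifo, unsigned iters)
{
    unsigned buf[POOL];
    for (unsigned i = 0; i < POOL; i++)
        buf[i] = i;

    unsigned head = 0, tail = 0, count = POOL;  /* FIFO queue / LIFO top */
    bool touched[POOL] = {false};
    unsigned ntouched = 0;

    for (unsigned it = 0; it < iters; it++) {
        unsigned b;
        if (lifo) {
            b = buf[--count];                   /* pop most recently freed */
        } else {
            b = buf[head];                      /* take oldest freed */
            head = (head + 1) % POOL;
            count--;
        }
        if (!touched[b]) {
            touched[b] = true;
            ntouched++;
        }
        if (lifo) {
            buf[count++] = b;                   /* push back on top */
        } else {
            buf[tail] = b;                      /* append to queue */
            tail = (tail + 1) % POOL;
            count++;
        }
    }
    return ntouched;
}
```

Over 10,000 iterations the LIFO variant touches a single buffer while the FIFO variant walks all 1,024 - which is why the stack mempool can pay for its locks with better cache behaviour when buffers move between cores.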
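On the last point, a rough sizing sketch may help (the figures below are illustrative assumptions, not taken from this thread): the pool must cover every place a buffer can sit at once - NIC descriptor rings, per-lcore caches, and mbufs queued on the worker ring - plus headroom. At runtime, rte_mempool_avail_count() can be used to watch how close the pool gets to empty.

```c
#include <assert.h>

/* Back-of-the-envelope mempool sizing: sum everything that can hold
 * a buffer simultaneously, then add ~50% headroom. All parameters
 * are hypothetical examples. */
static unsigned pool_size(unsigned ports, unsigned rx_desc,
                          unsigned tx_desc, unsigned lcores,
                          unsigned cache, unsigned worker_ring)
{
    unsigned in_flight = ports * (rx_desc + tx_desc)  /* descriptor rings */
                       + lcores * cache               /* per-lcore caches */
                       + worker_ring;                 /* queued to workers */
    return in_flight + in_flight / 2;                 /* ~50% headroom */
}
```

For example, 2 ports with 4096-entry Rx and Tx rings, 22 lcores with 512-buffer caches, and a 16384-entry worker ring gives 44,032 buffers potentially in flight, suggesting a pool of around 66,000 mbufs rather than a bare-minimum allocation.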
Regards,
/Bruce