DPDK patches and discussions
* DPDK Ring Q
From: Lombardo, Ed @ 2025-05-29  5:20 UTC (permalink / raw)
  To: dev


Hi,
I have an issue with DPDK 24.11.1 and a 2-port 100G Intel NIC (E810-C) on a dual-socket server with 22-core CPUs.

A dedicated CPU core gets packets from DPDK using rte_eth_rx_burst() and enqueues the mbufs onto a worker ring; this thread does nothing else.  The NIC is dropping packets at 8.5 Gbps per port.
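
For reference, the input thread is essentially the loop below (a
simplified sketch, not the actual code; names, burst size and the
drop-on-full policy are illustrative):

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define BURST_SIZE 32

/* Simplified sketch of the input thread. */
static void
input_thread_loop(uint16_t port_id, struct rte_ring *worker_ring)
{
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
                /* Receive a burst from queue 0 of the port. */
                uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs,
                                BURST_SIZE);
                if (nb_rx == 0)
                        continue;

                /* Single-producer bulk enqueue onto the worker ring
                 * (all-or-nothing). */
                if (rte_ring_sp_enqueue_bulk(worker_ring, (void **)bufs,
                                nb_rx, NULL) == 0) {
                        /* Ring full: drop the burst (illustrative
                         * policy only). */
                        for (uint16_t i = 0; i < nb_rx; i++)
                                rte_pktmbuf_free(bufs[i]);
                }
        }
}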

Studying the perf report, common_ring_mc_dequeue() caught my attention: perf shows it at 92.86% Self and 92.86% Children.

Further down, perf shows rte_ring_enqueue_bulk() and rte_ring_enqueue_bulk_elem() at 0.00% Self and 0.05% Children.
It also shows rte_ring_sp_enqueue_bulk_elem (inlined), which is what I wanted to see (single producer), representing the enqueue of the mbuf pointers onto the worker ring.

Is it possible to change common_ring_mc_dequeue() to common_ring_sc_dequeue()?  Can it be set to a single consumer, since I only use queue 0?

I believe this is what prevents DPDK from reaching 90 Gbps or higher in my setup, which is my goal.

I made sure the E810-C firmware was up to date; NIC FW version: 4.80 0x80020543 1.3805.0.

Perf report shows:
   - 99.65% input_thread
      - 99.35% rte_eth_rx_burst (inlined)
         - ice_recv_scattered_pkts
              92.83% common_ring_mc_dequeue

Any thoughts or suggestions?

Thanks,
Ed



* Re: DPDK Ring Q
From: Bruce Richardson @ 2025-05-29  9:41 UTC (permalink / raw)
  To: Lombardo, Ed; +Cc: dev

On Thu, May 29, 2025 at 05:20:25AM +0000, Lombardo, Ed wrote:
>    Hi,
> 
>    I have an issue with DPDK 24.11.1 and a 2-port 100G Intel NIC
>    (E810-C) on a dual-socket server with 22-core CPUs.
> 
>    A dedicated CPU core gets packets from DPDK using
>    rte_eth_rx_burst() and enqueues the mbufs onto a worker ring; this
>    thread does nothing else.  The NIC is dropping packets at 8.5 Gbps
>    per port.
> 
>    Studying the perf report, common_ring_mc_dequeue() caught my
>    attention: perf shows it at 92.86% Self and 92.86% Children.
> 
>    Further down, perf shows rte_ring_enqueue_bulk() and
>    rte_ring_enqueue_bulk_elem() at 0.00% Self and 0.05% Children.
>    It also shows rte_ring_sp_enqueue_bulk_elem (inlined), which is
>    what I wanted to see (single producer), representing the enqueue
>    of the mbuf pointers onto the worker ring.
> 
>    Is it possible to change common_ring_mc_dequeue() to
>    common_ring_sc_dequeue()?  Can it be set to a single consumer,
>    since I only use queue 0?
> 
>    I believe this is what prevents DPDK from reaching 90 Gbps or
>    higher in my setup, which is my goal.
> 
>    I made sure the E810-C firmware was up to date; NIC FW version:
>    4.80 0x80020543 1.3805.0.
> 
>    Perf report shows:
>       - 99.65% input_thread
>          - 99.35% rte_eth_rx_burst (inlined)
>             - ice_recv_scattered_pkts
>                  92.83% common_ring_mc_dequeue
> 
>    Any thoughts or suggestions?
> 
Since this is presumably from the thread that is doing the enqueuing to
the ring, that means that the common_ring_mc_dequeue is from the memory
pool implementation rather than your application ring. A certain amount
of ring dequeue time would be expected due to the buffers getting
allocated on one core and (presumably) freed or transmitted on a
different one. However, the amount of time spent in that dequeue seems
excessive. 

Some suggestions:
* I'd suggest checking what mempool cache size your application is
  using, and increasing it. The underlying ring implementation is only
  used once there are no buffers in the mempool cache, so a larger
  cache should lead to fewer (but larger) dequeues from the mempool. A
  cache size of 512 is what I would suggest trying (first sketch below
  this list).

* Since you are moving buffers away from the allocation core, I'd
  suggest switching the underlying mempool implementation from the
  default ring-based one to the stack mempool, to avoid cycling through
  the whole mempool memory space. Although that implementation uses
  locks, it gives better buffer recycling across cores (second sketch
  below).

* It is possible to switch the ring implementation to use an
  "sc_dequeue" function, but I would view that as risky, since it would
  mean that no core other than the Rx core can ever allocate a buffer
  from the pool (unless you start adding locks, at which point you
  might as well keep the "mc_dequeue" implementation). Therefore, I'd
  take the two points above as better alternatives (third sketch below,
  for completeness).

* Final thing to check - ensure you are not running out of buffers in
  the pool. If you run out, each dequeue will have the refill function
  going back to the mempool ring for more buffers, rather than having a
  store of buffers locally in the per-core cache. Try over-provisioning
  your memory pool a bit and see if it helps (last sketch below).
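
As rough illustrations of the points above (all pool names, element
counts and thresholds in the sketches are placeholders, not values from
your application). First, pool creation with a 512-entry per-lcore
cache:

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Sketch only: RX mbuf pool with a 512-entry per-lcore cache. */
static struct rte_mempool *
create_rx_pool(void)
{
        return rte_pktmbuf_pool_create("rx_pool",
                        262143,                    /* number of mbufs */
                        512,                       /* per-lcore cache size */
                        0,                         /* app private area size */
                        RTE_MBUF_DEFAULT_BUF_SIZE, /* data room per mbuf */
                        rte_socket_id());
}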
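
Second, selecting the stack mempool driver at pool creation time
(assuming the stack mempool driver is built in):

/* Sketch only: same pool, backed by the "stack" mempool driver
 * instead of the default ring-based one. */
static struct rte_mempool *
create_rx_pool_stack(void)
{
        return rte_pktmbuf_pool_create_by_ops("rx_pool",
                        262143, 512, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
                        rte_socket_id(), "stack");
}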
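
Third, for completeness only: if I recall correctly, the ring mempool
driver also registers a "ring_mp_sc" ops name (multi-producer free,
single-consumer allocation), with the caveat above that only the Rx
core may then allocate from the pool:

/* Risky per the caveat above: only the Rx core may allocate. */
static struct rte_mempool *
create_rx_pool_sc(void)
{
        return rte_pktmbuf_pool_create_by_ops("rx_pool",
                        262143, 512, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
                        rte_socket_id(), "ring_mp_sc");
}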
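
Finally, a quick runtime check for pool exhaustion (the 1024 threshold
is an arbitrary example):

#include <stdio.h>
#include <rte_mempool.h>

/* Sketch: warn when the mbuf pool is close to running dry. */
static void
check_pool_headroom(const struct rte_mempool *mp)
{
        unsigned int avail = rte_mempool_avail_count(mp);

        if (avail < 1024)
                printf("mbuf pool low: %u free, %u in use\n",
                                avail, rte_mempool_in_use_count(mp));
}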

Regards,
/Bruce

