DPDK usage discussions
* dpdk Tx falling short
@ 2025-07-03 20:14 Lombardo, Ed
  2025-07-03 21:49 ` Stephen Hemminger
  2025-07-04 11:44 ` Ivan Malov
  0 siblings, 2 replies; 5+ messages in thread
From: Lombardo, Ed @ 2025-07-03 20:14 UTC (permalink / raw)
  To: users


Hi,
I have run out of ideas and thought I would reach out to the dpdk community.

I have a Sapphire Rapids dual-CPU server and one E810 (also tried an X710); both are 4x10G NICs.  When our application pipeline's final stage enqueues mbufs into the tx ring, I expect rte_ring_dequeue_burst() to pull the mbufs from the tx ring and rte_eth_tx_burst() to transmit them at line rate.  What I see is that with one interface receiving 64-byte UDP-in-IPv4 traffic, receive and transmit run at line rate (i.e. packets in one port and out another port of the NIC at 14.9 Mpps).
When I turn on a second receive port, both transmit ports of the NIC show Tx performance dropping to 5 Mpps.  The Tx ring fills faster than the Tx thread can dequeue and transmit mbufs.

Packets arrive on ports 1 and 3 in my test setup.  The NIC is on NUMA node 1.  Hugepage memory (6 GB, 1 GB page size) is on NUMA node 1.  The mbuf size is 9 KB.

Rx Port 1 -> Tx Port 2
Rx Port 3 -> Tx Port 4
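
To confirm the NIC really sits on node 1, sysfs can be checked (the PCI address below is a placeholder for the actual device):

cat /sys/bus/pci/devices/0000:98:00.1/numa_node   # prints the device's NUMA node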

I monitor the available mbufs; the mempool configuration is:
*** DPDK Mempool Configuration ***
Number Sockets      : 1
Memory/Socket GB    : 6
Hugepage Size MB    : 1024
Overhead/socket MB  : 512
Usable mem/socket MB: 5629
mbuf size Bytes     : 9216
nb mbufs per socket : 640455
total nb mbufs      : 640455
hugepages/socket GB : 6
mempool cache size  : 512

*** DPDK EAL args ***
EAL lcore arg       : -l 36   <<< NUMA Node 1
EAL socket-mem arg  : --socket-mem=0,6144

This configuration has 16 rings, all the same size (16384 * 8 = 131072 entries), and one mempool.

The Tx rings are created single-producer, single-consumer (SP/SC).
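
For reference, a minimal sketch of how such a ring is created (the name and error handling are illustrative):

#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ring.h>

static struct rte_ring *
create_tx_ring(const char *name, int socket_id)
{
        /* single-producer / single-consumer, 16384 * 8 = 131072 slots */
        struct rte_ring *r = rte_ring_create(name, 16384 * 8, socket_id,
                                             RING_F_SP_ENQ | RING_F_SC_DEQ);
        if (r == NULL)
                rte_exit(EXIT_FAILURE, "cannot create ring %s\n", name);
        return r;
}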

There is one Tx thread per NIC port, whose only task is to dequeue mbufs from the tx ring and call rte_eth_tx_burst() to transmit them.  The dequeue burst size is 512, and the tx burst size is equal to or less than 512.  rte_eth_tx_burst() never returns less than the burst size given.

Each Tx thread is pinned to a dedicated CPU core, and that core's hyper-thread sibling is left unused.
We use cpushielding to keep non-critical threads off the CPUs reserved for the Tx threads.  htop shows the Tx threads are the only threads using the carved-out CPUs.

In the Tx thread, rte_ring_dequeue_burst() fetches a burst of up to 512 mbufs.
I added debug counters to track how many dequeues from the tx ring return exactly 512 mbufs and how many return fewer.  The dequeue always returns 512, never less.
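
For concreteness, a minimal sketch of the loop as described (tx_ring, port_id, queue_id and the counter names are illustrative, and the free of un-sent mbufs stands in for whatever policy the real code uses):

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define TX_BURST 512

static uint64_t deq_full;   /* dequeues that returned exactly TX_BURST */
static uint64_t deq_short;  /* dequeues that returned fewer */

static void
tx_loop(struct rte_ring *tx_ring, uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *pkts[TX_BURST];

        for (;;) {
                unsigned int n = rte_ring_dequeue_burst(tx_ring,
                                (void **)pkts, TX_BURST, NULL);
                if (n == 0)
                        continue;

                if (n == TX_BURST)
                        deq_full++;     /* the only branch hit in my tests */
                else
                        deq_short++;

                uint16_t sent = rte_eth_tx_burst(port_id, queue_id, pkts, n);

                /* mbufs the PMD did not accept must be retried or freed */
                while (sent < n)
                        rte_pktmbuf_free(pkts[sent++]);
        }
}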


Note: if I skip the rte_eth_tx_burst() call in the Tx threads and just dequeue the mbufs and bulk-free them, the tx ring does not fill up, i.e., the thread frees mbufs faster than they arrive on the tx ring.
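
With the same illustrative names as in the sketch above, that experiment is just:

        n = rte_ring_dequeue_burst(tx_ring, (void **)pkts, TX_BURST, NULL);
        rte_pktmbuf_free_bulk(pkts, n);   /* bulk-free instead of transmitting */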

So I suspect rte_eth_tx_burst() is the bottleneck to investigate, which gets into the inner workings of DPDK and the Intel NIC architecture.



Any help to resolve my issue is greatly appreciated.

Thanks,
Ed





* Re: dpdk Tx falling short
  2025-07-03 20:14 dpdk Tx falling short Lombardo, Ed
@ 2025-07-03 21:49 ` Stephen Hemminger
  2025-07-04  5:58   ` Rajesh Kumar
  2025-07-04 11:44 ` Ivan Malov
  1 sibling, 1 reply; 5+ messages in thread
From: Stephen Hemminger @ 2025-07-03 21:49 UTC (permalink / raw)
  To: Lombardo, Ed; +Cc: users

On Thu, 3 Jul 2025 20:14:59 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:

> Hi,
> I have run out of ideas and thought I would reach out to the dpdk community.
> 
> [...]

Do profiling, and look at the number of cache misses.
I suspect using an additional ring is causing lots of cache misses.
Remember, going to memory is really slow on modern processors.
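
For example, pinned to the Tx core from your EAL args (core 36; the events below are the generic perf aliases and vary by CPU):

perf stat -C 36 -e cycles,instructions,cache-references,cache-misses -- sleep 10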


* Re: dpdk Tx falling short
  2025-07-03 21:49 ` Stephen Hemminger
@ 2025-07-04  5:58   ` Rajesh Kumar
  0 siblings, 0 replies; 5+ messages in thread
From: Rajesh Kumar @ 2025-07-04  5:58 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Lombardo, Ed, users


Hi Ed,

Did you run dpdk-testpmd with multiple queues, and did you hit line
rate?  Sapphire Rapids is a powerful processor; we were able to hit 200 Gbps
with 14 cores on a Mellanox CX6 NIC.

How many cores are you using?  What descriptor size and how many queues?
Try playing with those, e.g.:

dpdk-testpmd -l 0-36 -a <pci of nic> -- -i -a --nb-cores=35 --txq=14 --rxq=14 --rxd=4096

Also try reducing the mbuf size to 2K (from the current 9K) and enabling
jumbo frame support.
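
Something like this, as a sketch (the pool name is illustrative; RTE_MBUF_DEFAULT_BUF_SIZE is 2 KB of data room plus headroom, so the ports would also need scatter Rx and multi-segment Tx enabled for jumbo frames to span chained mbufs):

#include <rte_mbuf.h>

static struct rte_mempool *
create_2k_pool(void)
{
        /* 2 KB data room; jumbo frames become chained multi-seg mbufs */
        return rte_pktmbuf_pool_create("mbuf_pool_2k",      /* illustrative */
                                       640455,              /* nb mbufs */
                                       512,                 /* lcore cache */
                                       0,                   /* priv size */
                                       RTE_MBUF_DEFAULT_BUF_SIZE,
                                       1);                  /* NUMA node 1 */
}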

Try running "perf top" to see what is taking the most time.  Also try
cache-aligning your data structures:

struct sample_struct {
      uint32_t  a;
      uint64_t  b;
...
} __rte_cache_aligned;

Thanks,
-Rajesh


On Fri, Jul 4, 2025 at 3:27 AM Stephen Hemminger <stephen@networkplumber.org>
wrote:

> On Thu, 3 Jul 2025 20:14:59 +0000
> "Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
>
> > Hi,
> > I have run out of ideas and thought I would reach out to the dpdk community.
> >
> > [...]
>
> Do profiling, and look at the number of cache misses.
> I suspect using an additional ring is causing lots of cache misses.
> Remember, going to memory is really slow on modern processors.
>


* Re: dpdk Tx falling short
  2025-07-03 20:14 dpdk Tx falling short Lombardo, Ed
  2025-07-03 21:49 ` Stephen Hemminger
@ 2025-07-04 11:44 ` Ivan Malov
  2025-07-04 14:49   ` Stephen Hemminger
  1 sibling, 1 reply; 5+ messages in thread
From: Ivan Malov @ 2025-07-04 11:44 UTC (permalink / raw)
  To: Lombardo, Ed; +Cc: users


Hi Ed,

You say there is only one mempool. Why?
Have you tried using dedicated mempools, one per port pair (1,2) and (3,4)?
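
A rough sketch of what I mean (the names and sizing are illustrative, splitting the existing mbuf budget in half):

#include <stdio.h>
#include <rte_mbuf.h>

static struct rte_mempool *pair_pool[2];

static void
create_pair_pools(void)
{
        /* one pool per Rx->Tx port pair instead of one shared pool */
        for (int i = 0; i < 2; i++) {
                char name[RTE_MEMPOOL_NAMESIZE];

                snprintf(name, sizeof(name), "pool_pair_%d", i);
                pair_pool[i] = rte_pktmbuf_pool_create(name,
                                640455 / 2,   /* half the mbufs each */
                                512, 0,
                                9216,         /* 9 KB data room, as today */
                                1);           /* NUMA node 1 */
        }
}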

Thank you.

On Thu, 3 Jul 2025, Lombardo, Ed wrote:

>
> Hi,
>
> I have run out of ideas and thought I would reach out to the dpdk community.
>
> [...]
>


* Re: dpdk Tx falling short
  2025-07-04 11:44 ` Ivan Malov
@ 2025-07-04 14:49   ` Stephen Hemminger
  0 siblings, 0 replies; 5+ messages in thread
From: Stephen Hemminger @ 2025-07-04 14:49 UTC (permalink / raw)
  To: Ivan Malov; +Cc: Lombardo, Ed, users

On Fri, 4 Jul 2025 15:44:50 +0400 (+04)
Ivan Malov <ivan.malov@arknetworks.am> wrote:

> Hi Ed,
> 
> You say there is only one mempool. Why?
> Have you tried using dedicated mempools, one per port pair (1,2) and (3,4)?
> 
> Thank you.
> 
> On Thu, 3 Jul 2025, Lombardo, Ed wrote:

More mempools can make cache behavior worse, not better.

