* dpdk Tx falling short
@ 2025-07-03 20:14 Lombardo, Ed
2025-07-03 21:49 ` Stephen Hemminger
2025-07-04 11:44 ` Ivan Malov
0 siblings, 2 replies; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-03 20:14 UTC (permalink / raw)
To: users
Hi,
I have run out of ideas and thought I would reach out to the dpdk community.
I have a Sapphire Rapids dual-CPU server and one E810 (also tried X710); both are 4x10G NICs. When our application pipeline's final stage enqueues mbufs into the tx ring, I expect rte_ring_dequeue_burst() to pull the mbufs from the tx ring and rte_eth_tx_burst() to transmit them at line rate. What I see is that with one interface receiving 64-byte UDP in IPv4, receive and transmit run at line rate (i.e. packets in one port and out another port of the NIC at 14.9 Mpps).
When I turn on another receive port, both transmit ports of the NIC show Tx performance dropping to 5 Mpps. The Tx ring fills faster than the Tx thread can dequeue and transmit mbufs.
Packets arrive on ports 1 and 3 in my test setup. NIC is on NUMA Node 1. Hugepage memory (6GB, 1GB page size) is on NUMA Node 1. The mbuf size is 9KB.
Rx Port 1 -> Tx Port 2
Rx Port 3 -> Tx Port 4
I monitor the mbufs available and they are:
*** DPDK Mempool Configuration ***
Number Sockets : 1
Memory/Socket GB : 6
Hugepage Size MB : 1024
Overhead/socket MB : 512
Usable mem/socket MB: 5629
mbuf size Bytes : 9216
nb mbufs per socket : 640455
total nb mbufs : 640455
hugepages/socket GB : 6
mempool cache size : 512
*** DPDK EAL args ***
EAL lcore arg : -l 36 <<< NUMA Node 1
EAL socket-mem arg : --socket-mem=0,6144
The number of rings in this configuration is 16, all of the same size (16384 * 8), and there is one mempool.
The Tx rings are created as SP and SC.
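A sketch of the creation call (ring name and entry count are illustrative):

tx_ring = rte_ring_create("tx_ring_port2", 16384, rte_socket_id(),
                          RING_F_SP_ENQ | RING_F_SC_DEQ);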
There is one Tx thread per NIC port, whose only task is to dequeue mbufs from the tx ring and call rte_eth_tx_burst() to transmit them. The dequeue burst size is 512, and the tx burst is equal to or less than 512. rte_eth_tx_burst() never returns less than the burst size given.
Each Tx thread is on a dedicated CPU core and its sibling is unused.
We use cpushielding to keep noncritical threads from using these CPUs for Tx threads. HTOP shows the Tx threads are the only threads using the carved-out CPUs.
In the Tx thread, rte_ring_dequeue_burst() is used to get a burst of up to 512 mbufs.
I added debug counters to track how many tx ring dequeues with rte_ring_dequeue_burst() return the full 512 mbufs and how many return fewer. The dequeue from the tx ring always returns 512, never less.
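For reference, a minimal sketch of the Tx thread loop as described (names are illustrative, and the handling of a short rte_eth_tx_burst() return is a defensive assumption, not our observed behavior):

#include <rte_ethdev.h>
#include <rte_ring.h>

#define TX_BURST 512

static void tx_drain_loop(struct rte_ring *tx_ring, uint16_t port_id,
                          uint16_t queue_id)
{
    struct rte_mbuf *pkts[TX_BURST];

    for (;;) {
        /* Pull up to 512 mbuf pointers from the SP/SC tx ring. */
        unsigned int n = rte_ring_dequeue_burst(tx_ring, (void **)pkts,
                                                TX_BURST, NULL);
        unsigned int sent = 0;

        /* rte_eth_tx_burst() may accept fewer packets than offered;
         * retry the remainder so no mbufs are leaked. */
        while (sent < n)
            sent += rte_eth_tx_burst(port_id, queue_id, pkts + sent,
                                     n - sent);
    }
}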
Note: if I skip the rte_eth_tx_burst() in the Tx threads and just dequeue the mbufs and bulk-free them, I do not see the tx ring fill up, i.e., the thread can free the mbufs faster than they arrive on the tx ring.
So I suspect that rte_eth_tx_burst() is the bottleneck to investigate, which involves the inner workings of DPDK and the Intel NIC architecture.
Any help to resolve my issue is greatly appreciated.
Thanks,
Ed
* Re: dpdk Tx falling short
2025-07-03 20:14 dpdk Tx falling short Lombardo, Ed
@ 2025-07-03 21:49 ` Stephen Hemminger
2025-07-04 5:58 ` Rajesh Kumar
2025-07-04 11:44 ` Ivan Malov
1 sibling, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2025-07-03 21:49 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: users
On Thu, 3 Jul 2025 20:14:59 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
> Hi,
> I have run out of ideas and thought I would reach out to the dpdk community.
> [full original message snipped]
Do profiling, and look at the number of cache misses.
I suspect using an additional ring is causing lots of cache misses.
Remember going to memory is really slow on modern processors.
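For example, to count misses on the Tx core (the core number here is illustrative):

perf stat -C 36 -e cycles,instructions,cache-references,cache-misses -- sleep 10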
* Re: dpdk Tx falling short
2025-07-03 21:49 ` Stephen Hemminger
@ 2025-07-04 5:58 ` Rajesh Kumar
0 siblings, 0 replies; 28+ messages in thread
From: Rajesh Kumar @ 2025-07-04 5:58 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Lombardo, Ed, users
Hi Ed,
Did you run dpdk-testpmd with multiple queues, and did you hit line rate? Sapphire Rapids is a powerful processor; we were able to hit 200 Gbps with 14 cores with a Mellanox CX6 NIC.
How many cores are you using? What are the descriptor count and the number of queues? Try playing with those:
dpdk-testpmd -l 0-36 -a <pci of nic> -- -i -a --nb-cores=35 --txq=14 --rxq=14 --rxd=4096
Also try reducing the mbuf size to 2K (from the current 9K) and enable jumbo frame support.
Try running "perf top" to see what is taking the most time. Also try to cache-align your data structures:
struct sample_struct {
        uint32_t a;
        uint64_t b;
        ...
} __rte_cache_aligned;
Thanks,
-Rajesh
On Fri, Jul 4, 2025 at 3:27 AM Stephen Hemminger <stephen@networkplumber.org>
wrote:
> On Thu, 3 Jul 2025 20:14:59 +0000
> "Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
> [original message snipped]
>
> Do profiling, and look at the number of cache misses.
> I suspect using an additional ring is causing lots of cache misses.
> Remember going to memory is really slow on modern processors.
* Re: dpdk Tx falling short
2025-07-03 20:14 dpdk Tx falling short Lombardo, Ed
2025-07-03 21:49 ` Stephen Hemminger
@ 2025-07-04 11:44 ` Ivan Malov
2025-07-04 14:49 ` Stephen Hemminger
2025-07-05 17:33 ` Lombardo, Ed
1 sibling, 2 replies; 28+ messages in thread
From: Ivan Malov @ 2025-07-04 11:44 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: users
Hi Ed,
You say there is only one mempool. Why?
Have you tried using dedicated mempools, one per port pair (0,2), (3,4)?
Thank you.
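For instance (pool names, counts and the socket are purely illustrative):

struct rte_mempool *mp_a = rte_pktmbuf_pool_create("mbufs_pair_a",
        320000, 512 /* per-lcore cache */, 0,
        9216 + RTE_PKTMBUF_HEADROOM, 1 /* NIC NUMA node */);
struct rte_mempool *mp_b = rte_pktmbuf_pool_create("mbufs_pair_b",
        320000, 512, 0, 9216 + RTE_PKTMBUF_HEADROOM, 1);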
On Thu, 3 Jul 2025, Lombardo, Ed wrote:
> [original message quoted in full; snipped]
* Re: dpdk Tx falling short
2025-07-04 11:44 ` Ivan Malov
@ 2025-07-04 14:49 ` Stephen Hemminger
2025-07-05 17:36 ` Lombardo, Ed
2025-07-05 17:33 ` Lombardo, Ed
1 sibling, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2025-07-04 14:49 UTC (permalink / raw)
To: Ivan Malov; +Cc: Lombardo, Ed, users
On Fri, 4 Jul 2025 15:44:50 +0400 (+04)
Ivan Malov <ivan.malov@arknetworks.am> wrote:
> Hi Ed,
>
> You say there is only one mempool. Why?
> Have you tried using dedicated mempools, one per each port pair (0,2), (3,4)?
>
> Thank you.
>
> On Thu, 3 Jul 2025, Lombardo, Ed wrote:
More mempools can make cache behavior worse, not better.
* RE: dpdk Tx falling short
2025-07-04 11:44 ` Ivan Malov
2025-07-04 14:49 ` Stephen Hemminger
@ 2025-07-05 17:33 ` Lombardo, Ed
1 sibling, 0 replies; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-05 17:33 UTC (permalink / raw)
To: Ivan Malov; +Cc: users
Hi Ivan,
If I create dedicated mempools per port pair, how will this benefit the situation?
Thanks,
Ed
-----Original Message-----
From: Ivan Malov <ivan.malov@arknetworks.am>
Sent: Friday, July 4, 2025 7:45 AM
To: Lombardo, Ed <Ed.Lombardo@netscout.com>
Cc: users <users@dpdk.org>
Subject: Re: dpdk Tx falling short
Hi Ed,
You say there is only one mempool. Why?
Have you tried using dedicated mempools, one per each port pair (0,2), (3,4)?
Thank you.
On Thu, 3 Jul 2025, Lombardo, Ed wrote:
> [original message quoted in full; snipped]
* RE: dpdk Tx falling short
2025-07-04 14:49 ` Stephen Hemminger
@ 2025-07-05 17:36 ` Lombardo, Ed
2025-07-05 19:02 ` Stephen Hemminger
2025-07-05 19:08 ` Stephen Hemminger
0 siblings, 2 replies; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-05 17:36 UTC (permalink / raw)
To: Stephen Hemminger, Ivan Malov; +Cc: users
Hi Stephen,
I saw your response about more mempools and cache behavior.
I have a goal to support 2x100G next, and if I can't get 10G with DPDK then something is seriously wrong.
Should I build the dpdk static libraries with LTO?
Thanks,
Ed
-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org>
Sent: Friday, July 4, 2025 10:50 AM
To: Ivan Malov <ivan.malov@arknetworks.am>
Cc: Lombardo, Ed <Ed.Lombardo@netscout.com>; users <users@dpdk.org>
Subject: Re: dpdk Tx falling short
On Fri, 4 Jul 2025 15:44:50 +0400 (+04)
Ivan Malov <ivan.malov@arknetworks.am> wrote:
> Hi Ed,
>
> You say there is only one mempool. Why?
> Have you tried using dedicated mempools, one per each port pair (0,2), (3,4)?
>
> Thank you.
>
> On Thu, 3 Jul 2025, Lombardo, Ed wrote:
More mempools can make cache behavior worse not better.
* Re: dpdk Tx falling short
2025-07-05 17:36 ` Lombardo, Ed
@ 2025-07-05 19:02 ` Stephen Hemminger
2025-07-05 19:08 ` Stephen Hemminger
1 sibling, 0 replies; 28+ messages in thread
From: Stephen Hemminger @ 2025-07-05 19:02 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Ivan Malov, users
On Sat, 5 Jul 2025 17:36:08 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
> Hi Stephen,
> I saw your response to more mempools and cache behavior.
>
> I have a goal to support 2x100G next, and if I can't get 10G with DPDK then something is seriously wrong.
>
> Should I build the dpdk static libraries with LTO?
>
> Thanks,
> Ed
>
> [earlier quoted messages snipped]
Your best bet is to use tools like perf to profile cache misses.
* Re: dpdk Tx falling short
2025-07-05 17:36 ` Lombardo, Ed
2025-07-05 19:02 ` Stephen Hemminger
@ 2025-07-05 19:08 ` Stephen Hemminger
2025-07-06 0:03 ` Lombardo, Ed
1 sibling, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2025-07-05 19:08 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Ivan Malov, users
On Sat, 5 Jul 2025 17:36:08 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
> Hi Stephen,
> I saw your response to more mempools and cache behavior.
>
> I have a goal to support 2x100G next, and if I can't get 10G with DPDK then something is seriously wrong.
>
> Should I build the dpdk static libraries with LTO?
>
> Thanks,
> Ed
Are you doing anything in the fast path that is an obvious cache miss?
At 10 Gbit/sec and a wire size of 84 bytes (a 64-byte frame plus 8 bytes of preamble and 12 bytes of inter-frame gap), a packet arrives every 67.2 ns.
CPUs haven't gotten that much faster; on a 3 GHz CPU that is about 201 cycles.
A single cache miss is 32 ns, so two cache misses and the per-packet budget is gone.
Obvious cache misses.
- passing packets to worker with ring
- using spinlocks (cost 16ns)
- fetching TSC
- syscalls?
Also, never ever use floating point.
Kernel related and older but worth looking at:
https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
* RE: dpdk Tx falling short
2025-07-05 19:08 ` Stephen Hemminger
@ 2025-07-06 0:03 ` Lombardo, Ed
2025-07-06 16:02 ` Stephen Hemminger
0 siblings, 1 reply; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-06 0:03 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Ivan Malov, users
Hi Stephen,
Here are comments on the list of obvious causes of cache misses you mentioned.
Obvious cache misses.
- passing packets to worker with ring - we use lots of rings to pass mbuf pointers. If I skip the rte_eth_tx_burst() and just free the mbufs in bulk, the tx ring does not fill up.
- using spinlocks (cost 16ns) - The driver does not use spinlocks, other than what dpdk uses.
- fetching TSC - We don't do this, we let Rx offload timestamp the packets.
- syscalls? - No syscalls are done in our driver fast path.
You mention "passing packets to worker with ring", do you mean using rings to pass mbuf pointers causes cache misses and should be avoided?
Thanks,
Ed
-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org>
Sent: Saturday, July 5, 2025 3:09 PM

[quoted message snipped]
* Re: dpdk Tx falling short
2025-07-06 0:03 ` Lombardo, Ed
@ 2025-07-06 16:02 ` Stephen Hemminger
2025-07-06 17:44 ` Lombardo, Ed
0 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2025-07-06 16:02 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Ivan Malov, users
On Sun, 6 Jul 2025 00:03:16 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
> Hi Stephen,
> Here are comments on the list of obvious causes of cache misses you mentioned.
>
> Obvious cache misses.
> - passing packets to worker with ring - we use lots of rings to pass mbuf pointers. If I skip the rte_eth_tx_burst() and just free the mbufs in bulk, the tx ring does not fill up.
> - using spinlocks (cost 16ns) - The driver does not use spinlocks, other than what dpdk uses.
> - fetching TSC - We don't do this, we let Rx offload timestamp the packets.
> - syscalls? - No syscalls are done in our driver fast path.
>
> You mention "passing packets to worker with ring", do you mean using rings to pass mbuf pointers causes cache misses and should be avoided?
Rings do cause data to be modified by one core and examined by another so they are a cache miss.
* RE: dpdk Tx falling short
2025-07-06 16:02 ` Stephen Hemminger
@ 2025-07-06 17:44 ` Lombardo, Ed
2025-07-07 3:30 ` Stephen Hemminger
2025-07-07 16:27 ` Lombardo, Ed
0 siblings, 2 replies; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-06 17:44 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Ivan Malov, users
Hi Stephen,
If using dpdk rings comes with this penalty, then what should I use? Is there an alternative to rings? We do not want to use shared memory and do buffer copies.
Thanks,
Ed
-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org>
Sent: Sunday, July 6, 2025 12:03 PM
To: Lombardo, Ed <Ed.Lombardo@netscout.com>
Cc: Ivan Malov <ivan.malov@arknetworks.am>; users <users@dpdk.org>
Subject: Re: dpdk Tx falling short
External Email: This message originated outside of NETSCOUT. Do not click links or open attachments unless you recognize the sender and know the content is safe.
On Sun, 6 Jul 2025 00:03:16 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
> [snip]
> You mention "passing packets to worker with ring", do you mean using rings to pass mbuf pointers causes cache misses and should be avoided?
Rings do cause data to be modified by one core and examined by another so they are a cache miss.
* Re: dpdk Tx falling short
2025-07-06 17:44 ` Lombardo, Ed
@ 2025-07-07 3:30 ` Stephen Hemminger
2025-07-07 16:27 ` Lombardo, Ed
1 sibling, 0 replies; 28+ messages in thread
From: Stephen Hemminger @ 2025-07-07 3:30 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Ivan Malov, users
On Sun, 6 Jul 2025 17:44:49 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
> Hi Stephen,
> If using dpdk rings comes with this penalty, then what should I use? Is there an alternative to rings? We do not want to use shared memory and do buffer copies.
The most efficient way is to use single-core run-to-completion and packet spreading with RSS and flows.
See the examples and Fd.io VPP for a more complex version.
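Schematically, each lcore owns an Rx queue and a Tx queue and never touches a software ring (a sketch; process_burst() is a placeholder for your pipeline work):

struct rte_mbuf *pkts[32];

for (;;) {
    uint16_t n = rte_eth_rx_burst(rx_port, qid, pkts, 32);
    if (n == 0)
        continue;
    process_burst(pkts, n);            /* all stages, in place */
    uint16_t sent = rte_eth_tx_burst(tx_port, qid, pkts, n);
    if (sent < n)                      /* drop what the NIC didn't take */
        rte_pktmbuf_free_bulk(pkts + sent, n - sent);
}

RSS spreads flows across queues, so several lcores can run this same loop in parallel.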
* RE: dpdk Tx falling short
2025-07-06 17:44 ` Lombardo, Ed
2025-07-07 3:30 ` Stephen Hemminger
@ 2025-07-07 16:27 ` Lombardo, Ed
2025-07-07 21:00 ` Lombardo, Ed
2025-07-07 21:49 ` Ivan Malov
1 sibling, 2 replies; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-07 16:27 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Ivan Malov, users
Hi Stephen,
I ran a perf diff on two perf records, and it reveals the real problem with the tx thread transmitting packets.
The comparison is traffic received on ifn3 and transmitted on ifn4 versus traffic received on ifn3 and ifn5 and transmitted on ifn4 and ifn6.
When transmitting on one port the performance is good; however, when transmitting on two ports the performance across both drops dramatically.
There is a 55.29% increase in CPU time spent in common_ring_mp_enqueue and 54.18% less time in i40e_xmit_pkts (was E810, now trying X710).
common_ring_mp_enqueue is multi-producer; does the enqueue of mbuf pointers passed into rte_eth_tx_burst() have to be multi-producer?
Is there a way to change dpdk to use single-producer?
# Event 'cycles'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ................. ......................................
#
36.37% +55.29% test [.] common_ring_mp_enqueue
62.36% -54.18% test [.] i40e_xmit_pkts
1.10% -0.94% test [.] dpdk_tx_thread
0.01% -0.01% [kernel.kallsyms] [k] native_sched_clock
+0.00% [kernel.kallsyms] [k] fill_pmd
+0.00% [kernel.kallsyms] [k] perf_sample_event_took
0.00% +0.00% [kernel.kallsyms] [k] __flush_smp_call_function_queue
0.02% [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0
0.02% [kernel.kallsyms] [k] native_irq_return_iret
0.02% [kernel.kallsyms] [k] native_tss_update_io_bitmap
0.01% [kernel.kallsyms] [k] ktime_get
0.01% [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
0.01% [kernel.kallsyms] [k] __update_blocked_fair
0.01% [kernel.kallsyms] [k] perf_adjust_freq_unthr_events
Thanks,
Ed
-----Original Message-----
From: Lombardo, Ed
Sent: Sunday, July 6, 2025 1:45 PM

[earlier messages snipped]
* RE: dpdk Tx falling short
2025-07-07 16:27 ` Lombardo, Ed
@ 2025-07-07 21:00 ` Lombardo, Ed
2025-07-07 21:49 ` Ivan Malov
1 sibling, 0 replies; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-07 21:00 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Ivan Malov, users
Hi Stephen,
I created debug dpdk libs and built our application.
I ensured all 16 rings in the driver are created SP / SC, and at runtime the ring flags have a value of 3 (RING_F_SP_ENQ | RING_F_SC_DEQ), confirming they are.
The perf output continues to show the symbols common_ring_mp_enqueue() and rte_atomic32_cmpset(). My understanding is that SP / SC rings don't utilize:
1. atomic operations
2. spin-waits
3. rte_wait_until_equal_32()
Thus, they should have lower overhead: no cache-line bouncing, no memory stalls, and no spinning from (3).
Perf report snippet:
+ 57.25% DPDK_TX_1 test [.] common_ring_mp_enqueue
+ 25.51% DPDK_TX_1 test [.] rte_atomic32_cmpset
+ 9.13% DPDK_TX_1 test [.] i40e_xmit_pkts
+ 6.50% DPDK_TX_1 test [.] rte_pause
0.21% DPDK_TX_1 test [.] rte_mempool_ops_enqueue_bulk.isra.0
0.20% DPDK_TX_1 test [.] dpdk_tx_thread
Any suggestions are very much appreciated.
Thanks,
Ed
-----Original Message-----
From: Lombardo, Ed
Sent: Monday, July 7, 2025 12:27 PM

[earlier messages snipped]
* RE: dpdk Tx falling short
2025-07-07 16:27 ` Lombardo, Ed
2025-07-07 21:00 ` Lombardo, Ed
@ 2025-07-07 21:49 ` Ivan Malov
2025-07-07 23:04 ` Stephen Hemminger
1 sibling, 1 reply; 28+ messages in thread
From: Ivan Malov @ 2025-07-07 21:49 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Stephen Hemminger, users
Hi Ed,
On Mon, 7 Jul 2025, Lombardo, Ed wrote:
> Hi Stephen,
> I ran a perf diff on two perf records, and it reveals the real problem with the tx thread transmitting packets.
>
> The comparison is traffic received on ifn3 and transmitted on ifn4 versus traffic received on ifn3 and ifn5 and transmitted on ifn4 and ifn6.
> When transmitting on one port the performance is good; however, when transmitting on two ports the performance across both drops dramatically.
>
> There is a 55.29% increase in CPU time spent in common_ring_mp_enqueue and 54.18% less time in i40e_xmit_pkts (was E810, now trying X710).
> common_ring_mp_enqueue is multi-producer; does the enqueue of mbuf pointers passed into rte_eth_tx_burst() have to be multi-producer?
I may be wrong, but rte_eth_tx_burst(), as part of what is known as "reap"
process, should check for "done" Tx descriptors resulting from previous
invocations and free (enqueue) the associated mbufs into respective mempools.
In your case, you say you only have a single mempool shared between the port
pairs, which, as I understand, are served by concurrent threads, so it might be
logical to use a multi-producer mempool in this case. Or am I missing something?
The pktmbuf API for mempool allocation is a wrapper around generic API and it
might request multi-producer multi-consumer by default (see [1], 'flags').
According to your original mempool monitor printout, the per-lcore cache size is
512. On the premise that separate lcores serve the two port pairs, and taking
into account the burst size, it should be OK, yet you may want to play with the
per-lcore cache size argument when creating the pool. Does it change anything?
Regarding separate mempools, -- I saw Stephen's response about those making CPU
cache behaviour worse and not better. Makes sense and I won't argue. And yet,
why not just try and make sure this indeed holds in this particular case? Also,
since you're seeking single-producer behaviour, having separate per-port-pair
mempools might allow to create such (again, see 'flags' at [1]), provided that
API [1] is used for mempool creation. Please correct me in case I'm mistaken.
Also, PMDs can support "fast free" Tx offload. Please see [2] to check whether
the application asks for this offload flag or not. It may be worth enabling.
[1] https://doc.dpdk.org/api-25.03/rte__mempool_8h.html#a0b64d611bc140a4d2a0c94911580efd5
[2] https://doc.dpdk.org/api-25.03/rte__ethdev_8h.html#a43f198c6b59d965130d56fd8f40ceac1
Thank you.
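To illustrate both points (a sketch, untested; the "ring_sp_sc" mempool ops are only safe when exactly one thread allocates and exactly one thread frees per pool, and dev_info here is assumed to come from rte_eth_dev_info_get()):

struct rte_mempool *mp = rte_pktmbuf_pool_create_by_ops("mbufs_pair_a",
        320000, 512, 0, 9216 + RTE_PKTMBUF_HEADROOM,
        1 /* socket */, "ring_sp_sc");

/* Fast free requires all mbufs on the queue to come from one
 * mempool and have refcnt == 1. */
struct rte_eth_txconf txconf = dev_info.default_txconf;
txconf.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
rte_eth_tx_queue_setup(port_id, 0 /* queue */, 1024, 1 /* socket */, &txconf);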
> [remainder of quoted message snipped]
* Re: dpdk Tx falling short
2025-07-07 21:49 ` Ivan Malov
@ 2025-07-07 23:04 ` Stephen Hemminger
2025-07-08 4:10 ` Lombardo, Ed
0 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2025-07-07 23:04 UTC (permalink / raw)
To: Ivan Malov; +Cc: Lombardo, Ed, users
On Tue, 8 Jul 2025 01:49:44 +0400 (+04)
Ivan Malov <ivan.malov@arknetworks.am> wrote:
> [quoted message snipped]
How many packets is your application seeing per burst? Ideally it should be getting chunks,
not a single packet at a time. And then the driver can use deferred free to put back bursts.
If you have a multi-stage pipeline it helps to pass a burst to each stage rather than
looping over the burst in the outer loop. Imagine getting a burst of 16 packets. If you
pass an array down the pipeline, there is one call per burst. If you process packets
one at a time, it can mean 16 calls, and if the pipeline exceeds the instruction cache
it can mean 16 cache misses.
The point is bursting is a big win in data and instruction cache.
If you really want to tune investigate prefetching like VPP does.
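Schematically (the stage names are made up):

uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST);
stage_decode(pkts, n);       /* one call handles the whole burst */
stage_classify(pkts, n);
stage_transmit(pkts, n);

rather than running the whole pipeline once per packet.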
* RE: dpdk Tx falling short
2025-07-07 23:04 ` Stephen Hemminger
@ 2025-07-08 4:10 ` Lombardo, Ed
2025-07-08 13:47 ` Stephen Hemminger
0 siblings, 1 reply; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-08 4:10 UTC (permalink / raw)
To: Stephen Hemminger, Ivan Malov; +Cc: users
Hi Stephen,
I ensured that every pipeline stage that enqueues or dequeues mbufs uses the burst version; perf showed the repercussions of doing single-mbuf dequeues and enqueues.
For the receive stage rte_eth_rx_burst() is used, and for the Tx stage we use rte_eth_tx_burst(). The burst size used in the tx_thread for ring dequeue is 512 mbufs.
Thanks,
Ed
-----Original Message-----
From: Stephen Hemminger <stephen@networkplumber.org>
Sent: Monday, July 7, 2025 7:04 PM
To: Ivan Malov <ivan.malov@arknetworks.am>
Cc: Lombardo, Ed <Ed.Lombardo@netscout.com>; users <users@dpdk.org>
Subject: Re: dpdk Tx falling short
External Email: This message originated outside of NETSCOUT. Do not click links or open attachments unless you recognize the sender and know the content is safe.
On Tue, 8 Jul 2025 01:49:44 +0400 (+04)
Ivan Malov <ivan.malov@arknetworks.am> wrote:
> Hi Ed,
>
> On Mon, 7 Jul 2025, Lombardo, Ed wrote:
>
> > Hi Stephen,
> > I ran a perf diff on two perf records and reveals the real problem with the tx thread in transmitting packets.
> >
> > The comparison is traffic received on ifn3 and transmit ifn4 to traffic received on ifn3, ifn5 and transmit on ifn4, ifn6.
> > When transmit packets on one port the performance is better, however when transmit on two ports the performance across the two drops dramatically.
> >
> > There is increase of 55.29% of the CPU spent in common_ring_mp_enqueue and 54.18% less time in i40e_xmit_pkts (was E810 tried x710).
> > The common_ring_mp_enqueue is multi-producer, is the enqueue of mbuf pointers passed in to rte_eth_tx_burst() have to be multi-producer?
>
> I may be wrong, but rte_eth_tx_burst(), as part of what is known as "reap"
> process, should check for "done" Tx descriptors resulting from
> previous invocations and free (enqueue) the associated mbufs into respective mempools.
> In your case, you say you only have a single mempool shared between
> the port pairs, which, as I understand, are served by concurrent
> threads, so it might be logical to use a multi-producer mempool in this case. Or am I missing something?
>
> The pktmbuf API for mempool allocation is a wrapper around generic API
> and it might request multi-producer multi-consumer by default (see [1], 'flags').
> According to your original mempool monitor printout, the per-lcore
> cache size is 512. On the premise that separate lcores serve the two
> port pairs, and taking into account the burst size, it should be OK,
> yet you may want to play with the per-lcore cache size argument when creating the pool. Does it change anything?
>
> Regarding separate mempools, -- I saw Stephen's response about those
> making CPU cache behaviour worse and not better. Makes sense and I
> won't argue. And yet, why not just try an make sure this indeed holds
> in this particular case? Also, since you're seeking single-producer
> behaviour, having separate per-port-pair mempools might allow to
> create such (again, see 'flags' at [1]), provided that API [1] is used for mempool creation. Please correct me in case I'm mistaken.
>
> Also, PMDs can support "fast free" Tx offload. Please see [2] to check
> whether the application asks for this offload flag or not. It may be worth enabling.
>
> [1]
> https://urldefense.com/v3/__https://doc.dpdk.org/api-25.03/rte__mempoo
> l_8h.html*a0b64d611bc140a4d2a0c94911580efd5__;Iw!!Nzg7nt7_!EvEznHI_mP3
> GsiSVrhbQfDE2va8UxZ5-8okSD-Cq_gTm9nP0Q34d6XPWYoQhUGqoJifjjk4Na1a8j5EZH
> SqWzqXGztg$ [2]
> https://urldefense.com/v3/__https://doc.dpdk.org/api-25.03/rte__ethdev
> _8h.html*a43f198c6b59d965130d56fd8f40ceac1__;Iw!!Nzg7nt7_!EvEznHI_mP3G
> siSVrhbQfDE2va8UxZ5-8okSD-Cq_gTm9nP0Q34d6XPWYoQhUGqoJifjjk4Na1a8j5EZHS
> qWypWjs8A$
>
> Thank you.
>
> >
> > Is there a way to change dpdk to use single-producer?
> >
> > # Event 'cycles'
> > #
> > # Baseline Delta Abs Shared Object Symbol
> > # ........ ......... ................. ......................................
> > #
> > 36.37% +55.29% test [.] common_ring_mp_enqueue
> > 62.36% -54.18% test [.] i40e_xmit_pkts
> > 1.10% -0.94% test [.] dpdk_tx_thread
> > 0.01% -0.01% [kernel.kallsyms] [k] native_sched_clock
> > +0.00% [kernel.kallsyms] [k] fill_pmd
> > +0.00% [kernel.kallsyms] [k] perf_sample_event_took
> > 0.00% +0.00% [kernel.kallsyms] [k] __flush_smp_call_function_queue
> > 0.02% [kernel.kallsyms] [k] __intel_pmu_enable_all.constprop.0
> > 0.02% [kernel.kallsyms] [k] native_irq_return_iret
> > 0.02% [kernel.kallsyms] [k] native_tss_update_io_bitmap
> > 0.01% [kernel.kallsyms] [k] ktime_get
> > 0.01% [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
> > 0.01% [kernel.kallsyms] [k] __update_blocked_fair
> > 0.01% [kernel.kallsyms] [k] perf_adjust_freq_unthr_events
> >
> > Thanks,
> > Ed
> >
How many packets is your application seeing per-burst? Ideally it should be getting chunks, not a single packet at a time. And then the driver can use deferred free to put back bursts.
If you have a multi-stage pipeline it helps if you pass a burst to each stage rather than looping over the burst in the outer loop. Imagine getting a burst of 16 packets. If you pass an array down the pipeline, then there is one call per burst. If you process packets one at a time, it can mean 16 calls, and if the pipeline exceeds the instruction cache it can mean 16 cache misses.
The point is that bursting is a big win for both the data and instruction caches.
If you really want to tune, investigate prefetching like VPP does.
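Roughly, the per-burst pattern looks like this (a minimal sketch; stage_parse and stage_classify are hypothetical placeholders for your pipeline stages, not DPDK APIs):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

/* Hypothetical pipeline stages: each takes the whole burst at once. */
static void stage_parse(struct rte_mbuf **pkts, uint16_t n)    { (void)pkts; (void)n; }
static void stage_classify(struct rte_mbuf **pkts, uint16_t n) { (void)pkts; (void)n; }

static void
pipeline_iteration(uint16_t rx_port, uint16_t tx_port)
{
	struct rte_mbuf *pkts[BURST];
	uint16_t n = rte_eth_rx_burst(rx_port, 0, pkts, BURST);

	if (n == 0)
		return;

	/* One call per stage per burst instead of one per packet, so each
	 * stage's code stays hot in the instruction cache. */
	stage_parse(pkts, n);
	stage_classify(pkts, n);

	uint16_t sent = rte_eth_tx_burst(tx_port, 0, pkts, n);
	/* Drop whatever the NIC could not accept this iteration. */
	if (sent < n)
		rte_pktmbuf_free_bulk(&pkts[sent], n - sent);
}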
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dpdk Tx falling short
2025-07-08 4:10 ` Lombardo, Ed
@ 2025-07-08 13:47 ` Stephen Hemminger
2025-07-08 14:03 ` Lombardo, Ed
0 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2025-07-08 13:47 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Ivan Malov, users
On Tue, 8 Jul 2025 04:10:05 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
> Hi Stephen,
> I ensured that every pipeline stage that enqueues or dequeues mbufs uses the burst version; perf showed the repercussions of doing single-mbuf dequeues and enqueues.
> For the receive stage rte_eth_rx_burst() is used and for the Tx stage we use rte_eth_tx_burst(). The burst size used in tx_thread for the dequeue burst is 512 mbufs.
You might try buffering like rte_eth_tx_buffer does.
You would need to add an additional mechanism to ensure that the buffer
gets flushed when you detect an idle period.
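A minimal sketch of that approach (queue 0, the buffer size, and the flush-on-idle policy are assumptions, not your code):

#include <stdbool.h>
#include <rte_ethdev.h>
#include <rte_malloc.h>

#define TX_BUF_SZ 512

static struct rte_eth_dev_tx_buffer *txb;

static int
tx_buffer_setup(int socket)
{
	txb = rte_zmalloc_socket("txb", RTE_ETH_TX_BUFFER_SIZE(TX_BUF_SZ),
				 0, socket);
	if (txb == NULL)
		return -1;
	return rte_eth_tx_buffer_init(txb, TX_BUF_SZ);
}

static void
tx_one(uint16_t port, struct rte_mbuf *m, bool idle)
{
	/* Buffers locally; transmits automatically once TX_BUF_SZ mbufs
	 * have accumulated. */
	rte_eth_tx_buffer(port, 0, txb, m);

	/* The additional mechanism: push out whatever is buffered when an
	 * idle period is detected, so packets never sit indefinitely. */
	if (idle)
		rte_eth_tx_buffer_flush(port, 0, txb);
}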
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: dpdk Tx falling short
2025-07-08 13:47 ` Stephen Hemminger
@ 2025-07-08 14:03 ` Lombardo, Ed
2025-07-08 14:18 ` Ivan Malov
0 siblings, 1 reply; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-08 14:03 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Ivan Malov, users
Hi Stephen,
When I replace rte_eth_tx_burst() with a bulk mbuf free I do not see the tx ring fill up. I think this is valuable information. Also, perf analysis of the tx thread shows common_ring_mp_enqueue and rte_atomic32_cmpset, which I did not expect to see since I created all the Tx rings as SP and SC (and the worker and ack rings as well, essentially all 16 rings).
Perf report snippet:
+ 57.25% DPDK_TX_1 test [.] common_ring_mp_enqueue
+ 25.51% DPDK_TX_1 test [.] rte_atomic32_cmpset
+ 9.13% DPDK_TX_1 test [.] i40e_xmit_pkts
+ 6.50% DPDK_TX_1 test [.] rte_pause
0.21% DPDK_TX_1 test [.] rte_mempool_ops_enqueue_bulk.isra.0
0.20% DPDK_TX_1 test [.] dpdk_tx_thread
The traffic load is a constant 10 Gbps of 84-byte packets with no idle periods. The burst size of 512 is the desired burst of mbufs; however, the tx thread will transmit whatever it can get from the Tx ring.
I think resolving why the perf analysis shows the ring as MP when it was created as SP / SC should resolve this issue.
Thanks,
ed
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: dpdk Tx falling short
2025-07-08 14:03 ` Lombardo, Ed
@ 2025-07-08 14:18 ` Ivan Malov
2025-07-08 14:29 ` Lombardo, Ed
0 siblings, 1 reply; 28+ messages in thread
From: Ivan Malov @ 2025-07-08 14:18 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Stephen Hemminger, users
Hi Ed,
On Tue, 8 Jul 2025, Lombardo, Ed wrote:
> Hi Stephen,
> When I replace rte_eth_tx_burst() with a bulk mbuf free I do not see the tx ring fill up. I think this is valuable information. Also, perf analysis of the tx thread shows common_ring_mp_enqueue and rte_atomic32_cmpset, which I did not expect to see since I created all the Tx rings as SP and SC (and the worker and ack rings as well, essentially all 16 rings).
>
> Perf report snippet:
> + 57.25% DPDK_TX_1 test [.] common_ring_mp_enqueue
> + 25.51% DPDK_TX_1 test [.] rte_atomic32_cmpset
> + 9.13% DPDK_TX_1 test [.] i40e_xmit_pkts
> + 6.50% DPDK_TX_1 test [.] rte_pause
> 0.21% DPDK_TX_1 test [.] rte_mempool_ops_enqueue_bulk.isra.0
> 0.20% DPDK_TX_1 test [.] dpdk_tx_thread
>
> The traffic load is a constant 10 Gbps of 84-byte packets with no idle periods. The burst size of 512 is the desired burst of mbufs; however, the tx thread will transmit whatever it can get from the Tx ring.
>
> I think resolving why the perf analysis shows the ring as MP when it was created as SP / SC should resolve this issue.
The 'common_ring_mp_enqueue' is the enqueue method of the mempool variant 'ring',
that is, one based on an RTE ring internally. When you say that the ring has been
created as SP / SC you seemingly refer to the regular RTE ring created by your
application logic, not the internal ring of the mempool. Am I missing something?
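In other words (a contrived sketch; the names and sizes come from your printout, not your actual code), there are two distinct ring objects in play:

#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

static void
create_ring_and_pool(void)
{
	/* 1) The application's own ring: the SP / SC flags apply here. */
	struct rte_ring *tx_ring = rte_ring_create("tx_ring_1",
			16384 * 8, rte_socket_id(),
			RING_F_SP_ENQ | RING_F_SC_DEQ);

	/* 2) The mbuf pool: this builds a *second*, independent ring inside
	 * the mempool, MP / MC by default; freeing mbufs after Tx enqueues
	 * to it, which is where common_ring_mp_enqueue shows up. */
	struct rte_mempool *mp = rte_pktmbuf_pool_create("mbuf_pool",
			640455, 512, 0, 9216, rte_socket_id());

	(void)tx_ring;
	(void)mp;
}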
Thank you.
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: dpdk Tx falling short
2025-07-08 14:18 ` Ivan Malov
@ 2025-07-08 14:29 ` Lombardo, Ed
2025-07-08 14:49 ` Ivan Malov
0 siblings, 1 reply; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-08 14:29 UTC (permalink / raw)
To: Ivan Malov; +Cc: Stephen Hemminger, users
Hi Ivan,
Yes, only the user-space-created rings.
Can you add more to your thoughts?
Ed
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: dpdk Tx falling short
2025-07-08 14:29 ` Lombardo, Ed
@ 2025-07-08 14:49 ` Ivan Malov
2025-07-08 16:31 ` Lombardo, Ed
0 siblings, 1 reply; 28+ messages in thread
From: Ivan Malov @ 2025-07-08 14:49 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Stephen Hemminger, users
On Tue, 8 Jul 2025, Lombardo, Ed wrote:
> Hi Ivan,
> Yes, only the user-space-created rings.
> Can you add more to your thoughts?
I was seeking to address the probable confusion here. If the application creates
an SP / SC ring for its own pipeline logic using API [1] and then invokes another
API [2] to create a common "mbuf mempool" to be used with the Rx and Tx queues of
the network ports, then the observed appearance of "common_ring_mp_enqueue" is
likely attributed to the fact that API [2] creates a ring-based mempool
internally, and in MP / MC mode by default. And the latter ring is not the same
as the one created by the application logic. These are two independent rings.
BTW, does your application set the RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE offload flag
when configuring Tx port/queue offloads on the network ports?
Thank you.
[1] https://doc.dpdk.org/api-25.03/rte__ring_8h.html#a155cb48ef311eddae9b2e34808338b17
[2] https://doc.dpdk.org/api-25.03/rte__mbuf_8h.html#a8f4abb0d54753d2fde515f35c1ba402a
[3] https://doc.dpdk.org/api-25.03/rte__mempool_8h.html#a0b64d611bc140a4d2a0c94911580efd5
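For reference, a hypothetical per-port-pair pool created via API [3] with a single-producer / single-consumer internal ring could look like the sketch below (name, element count and the 9216 / 512 sizes are placeholders taken from your printout). It is only safe if exactly one thread frees (puts) mbufs to the pool and one thread allocates (gets) from it; please verify that holds for your pipeline first:

#include <rte_mempool.h>
#include <rte_mbuf.h>

static struct rte_mempool *
make_spsc_mbuf_pool(const char *name, unsigned int n, int socket_id)
{
	return rte_mempool_create(name, n,
			sizeof(struct rte_mbuf) + 9216,	/* elt: header + data room */
			512,				/* per-lcore cache */
			sizeof(struct rte_pktmbuf_pool_private),
			rte_pktmbuf_pool_init, NULL,	/* derives room from elt size */
			rte_pktmbuf_init, NULL,		/* per-mbuf initialiser */
			socket_id,
			RTE_MEMPOOL_F_SP_PUT | RTE_MEMPOOL_F_SC_GET);
}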
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: dpdk Tx falling short
2025-07-08 14:49 ` Ivan Malov
@ 2025-07-08 16:31 ` Lombardo, Ed
2025-07-08 16:53 ` Ivan Malov
0 siblings, 1 reply; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-08 16:31 UTC (permalink / raw)
To: Ivan Malov; +Cc: Stephen Hemminger, users
Hi Ivan,
Thanks, this clears up my confusion. Using API [2] to create one mempool for the network Rx and Tx queues means its internal ring is MP/MC. The CPU cycles spent in common_ring_mp_enqueue increase as more ports are transmitting. So transmitting has the Rx and Tx queues fighting for access to the mbuf mempool because there is only one mempool?
This is why you suggested creating two mempools, one for each pair of ports.
If I go this route, what precautions do I need to take?
I will try the RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE offload flag first.
Thanks,
Ed
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: dpdk Tx falling short
2025-07-08 16:31 ` Lombardo, Ed
@ 2025-07-08 16:53 ` Ivan Malov
2025-07-09 1:09 ` Lombardo, Ed
0 siblings, 1 reply; 28+ messages in thread
From: Ivan Malov @ 2025-07-08 16:53 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Stephen Hemminger, users
Hi Ed,
On Tue, 8 Jul 2025, Lombardo, Ed wrote:
> Hi Ivan,
> Thanks, this clears up my confusion. Using API [2] to create one mempool for the network Rx and Tx queues means its internal ring is MP/MC. The CPU cycles spent in common_ring_mp_enqueue increase as more ports are transmitting. So transmitting has the Rx and Tx queues fighting for access to the mbuf mempool because there is only one mempool?
Not really. Mempools in DPDK in general (and, in particular, as shown in your
monitor printout) have a per-lcore object cache, which, if I'm not mistaken, is
there to avoid such contention when accessing the pool. And, since only a single
pool is used in your case, the use of MP/MC seems logical, as does the use of the
per-lcore object cache. But it's not obvious whether this is optimal in your case.
> This is why you suggested creating two mempools, one for each pair of ports.
It could be a low-hanging fruit to do a quick check with two separate mempools,
probably also MP/MC (allocated via the same API [2]), to know whether it affects
performance or not. Again, as Stephen noted, this may even worsen CPU cache
performance, but maybe it still pays to do a quick check after all.
> If I go this route, what precautions do I need to take?
>
> I will try the RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE offload flag first.
This is somewhat unrelated to pools and rings, yet it should enable the PMD's
internal Tx handling to accumulate bulks of mbufs to be freed upon transmission
via bulk operations which, akin to Tx and Rx bursts, may also improve CPU cache
utilisation and overall performance. The only prerequisite is that all mbufs
passed to a given Tx queue have to come from the same mempool. Hopefully this
holds for you, if the logic does not intermix packets from 2 pools into the same
Tx queue.
Maybe Stephen's suggestion to use the Tx buffer API is also worth a shot.
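A minimal sketch of requesting the offload at configure time (port/queue counts here are placeholders; the PMD must report the capability for the flag to be accepted):

#include <string.h>
#include <rte_ethdev.h>

static int
configure_with_fast_free(uint16_t port_id)
{
	struct rte_eth_dev_info dev_info;
	struct rte_eth_conf conf;
	int ret;

	memset(&conf, 0, sizeof(conf));

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	/* Request the offload only if the PMD actually supports it. Mbufs
	 * sent to a queue must then all come from one mempool and have
	 * refcnt == 1 (no clones). */
	if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
		conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;

	return rte_eth_dev_configure(port_id, 1 /* nb_rxq */, 1 /* nb_txq */,
				     &conf);
}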
Thank you.
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: dpdk Tx falling short
2025-07-08 16:53 ` Ivan Malov
@ 2025-07-09 1:09 ` Lombardo, Ed
2025-07-09 21:58 ` Lombardo, Ed
0 siblings, 1 reply; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-09 1:09 UTC (permalink / raw)
To: Ivan Malov; +Cc: Stephen Hemminger, users
Hi Ivan,
I added two mempools, one per port pair, as you suggested.
Tx performance improved, and turning on one port pair no longer affects the second port pair. The tx ring no longer fills up but drains to near empty.
Improved: Tx - 1.5 Mpps to 8.3 Mpps, 1.5 Mpps to 11.2 Mpps
I need to do the perf analysis again but wanted to provide you the results.
I still need to improve the Tx performance, but this is a much-needed breakthrough (with your help).
Thanks,
Ed
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: dpdk Tx falling short
2025-07-09 1:09 ` Lombardo, Ed
@ 2025-07-09 21:58 ` Lombardo, Ed
2025-07-10 6:45 ` Ivan Malov
0 siblings, 1 reply; 28+ messages in thread
From: Lombardo, Ed @ 2025-07-09 21:58 UTC (permalink / raw)
To: Ivan Malov; +Cc: Stephen Hemminger, users
Hi Ivan,
Do you see any benefit to creating two mempools, one per NUMA node, versus both on the same NUMA node as the NIC?
If I try creating hugepage memory on both NUMA nodes and associated mempools, do I need to have a DPDK lcore on each NUMA node, or can I get by with one lcore (strictly for housekeeping)? We use POSIX threads in our application, which DPDK knows nothing about.
Thanks,
Ed
^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: dpdk Tx falling short
2025-07-09 21:58 ` Lombardo, Ed
@ 2025-07-10 6:45 ` Ivan Malov
0 siblings, 0 replies; 28+ messages in thread
From: Ivan Malov @ 2025-07-10 6:45 UTC (permalink / raw)
To: Lombardo, Ed; +Cc: Stephen Hemminger, users
Hi Ed,
If the two NICs sit on different NUMA nodes, then yes, apparently, it should be
practical to allocate resources and run worker lcores in accordance with that.
For example, one can use API [1] on an initialised DPDK port to get its NUMA
socket ID. Then this value can be used for mempool creation and queue setup.
API [2] can be used to scan the available lcores and find the one that sits on
a matching NUMA node, which should then be used to launch the Rx/Tx worker with
API [3]. If both ports in a port pair sit on the same NUMA node, and if the
traffic flows of separate pairs are independent of each other and not
intermixed, then this may theoretically be a practical setup. Worth trying.
Ideally, the business logic should also run in the context of worker lcores.
The use of just one "housekeeping" lcore may not give the best performance.
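Putting [1]-[3] together, a rough sketch (tx_worker, the pool name and the sizing are placeholders, and error handling for SOCKET_ID_ANY is omitted):

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_launch.h>
#include <rte_mbuf.h>

/* Hypothetical per-pair worker loop. */
static int
tx_worker(void *arg)
{
	(void)arg;
	return 0;
}

static int
launch_pair_on_port_numa(uint16_t port_id)
{
	int sock = rte_eth_dev_socket_id(port_id);	/* API [1] */
	unsigned int lcore;

	/* Allocate the pair's mempool on the port's NUMA node. */
	struct rte_mempool *mp = rte_pktmbuf_pool_create("pool_pair",
			262144, 512, 0, 9216, sock);
	if (mp == NULL)
		return -1;

	/* Find a worker lcore on the same node (API [2]) and launch the
	 * worker there (API [3]). */
	RTE_LCORE_FOREACH_WORKER(lcore) {
		if (rte_lcore_to_socket_id(lcore) == (unsigned int)sock)
			return rte_eal_remote_launch(tx_worker, mp, lcore);
	}
	return -1;	/* no worker lcore on the port's NUMA node */
}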
I apologise in case I've got something wrong.
Thank you.
[1] https://doc.dpdk.org/api-25.03/rte__ethdev_8h.html#ad032e25f712e6ffeb0c19eab1ec1fd2e
[2] https://doc.dpdk.org/api-25.03/rte__lcore_8h.html#a023b4909f52c3cdf0351d71d2b5032bc
[3] https://doc.dpdk.org/api-25.03/rte__launch_8h.html#a2bf98eda211728b3dc69aa7694758c6d
On Wed, 9 Jul 2025, Lombardo, Ed wrote:
> Hi Ivan,
> Do you see any benefit to creating two mempools, one per NUMA node, versus both on the same NUMA node as the NIC?
>
> If I try creating hugepage memory on both NUMA nodes and associated mempools, do I need to have a DPDK lcore on each NUMA node, or can I get by with one lcore (strictly for housekeeping)? We use POSIX threads in our application, which DPDK knows nothing about.
>
> Thanks,
> Ed
^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2025-07-10 6:45 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-03 20:14 dpdk Tx falling short Lombardo, Ed
2025-07-03 21:49 ` Stephen Hemminger
2025-07-04 5:58 ` Rajesh Kumar
2025-07-04 11:44 ` Ivan Malov
2025-07-04 14:49 ` Stephen Hemminger
2025-07-05 17:36 ` Lombardo, Ed
2025-07-05 19:02 ` Stephen Hemminger
2025-07-05 19:08 ` Stephen Hemminger
2025-07-06 0:03 ` Lombardo, Ed
2025-07-06 16:02 ` Stephen Hemminger
2025-07-06 17:44 ` Lombardo, Ed
2025-07-07 3:30 ` Stephen Hemminger
2025-07-07 16:27 ` Lombardo, Ed
2025-07-07 21:00 ` Lombardo, Ed
2025-07-07 21:49 ` Ivan Malov
2025-07-07 23:04 ` Stephen Hemminger
2025-07-08 4:10 ` Lombardo, Ed
2025-07-08 13:47 ` Stephen Hemminger
2025-07-08 14:03 ` Lombardo, Ed
2025-07-08 14:18 ` Ivan Malov
2025-07-08 14:29 ` Lombardo, Ed
2025-07-08 14:49 ` Ivan Malov
2025-07-08 16:31 ` Lombardo, Ed
2025-07-08 16:53 ` Ivan Malov
2025-07-09 1:09 ` Lombardo, Ed
2025-07-09 21:58 ` Lombardo, Ed
2025-07-10 6:45 ` Ivan Malov
2025-07-05 17:33 ` Lombardo, Ed
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).