From: Rajesh Kumar
Date: Fri, 4 Jul 2025 11:28:05 +0530
Subject: Re: dpdk Tx falling short
To: Stephen Hemminger <stephen@networkplumber.org>
Cc: "Lombardo, Ed" <Ed.Lombardo@netscout.com>, users <users@dpdk.org>

Hi Ed,

Did you run dpdk-testpmd with multiple queues, and did you hit line rate?
Sapphire Rapids is a powerful processor; we were able to hit 200 Gbps with
14 cores and a Mellanox CX6 NIC.

How many cores are you using? What is the descriptor ring size, and how
many queues? Try playing with those:

dpdk-testpmd -l 0-36 -a <pci of nic> -- -i -a --nb-cores=35 --txq=14 --rxq=14 --rxd=4096
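If you configure the queues in your own application rather than testpmd,
the same knobs are the queue and descriptor counts passed at device setup.
A rough sketch, assuming one port on NUMA socket 1 (the port id, queue
counts, and descriptor count are illustrative, not taken from your setup):

    #include <rte_ethdev.h>

    /* Illustrative values -- tune per NIC and workload. */
    #define NB_RXQ    14
    #define NB_TXQ    14
    #define NB_DESC   4096
    #define SOCKET_ID 1     /* NUMA node of the NIC */

    static int setup_port(uint16_t port_id, struct rte_mempool *pool)
    {
        struct rte_eth_conf conf = { 0 };
        int ret = rte_eth_dev_configure(port_id, NB_RXQ, NB_TXQ, &conf);
        if (ret < 0)
            return ret;

        /* One queue per lcore; larger descriptor rings absorb bursts. */
        for (uint16_t q = 0; q < NB_TXQ; q++) {
            ret = rte_eth_tx_queue_setup(port_id, q, NB_DESC, SOCKET_ID,
                                         NULL);
            if (ret < 0)
                return ret;
        }
        for (uint16_t q = 0; q < NB_RXQ; q++) {
            ret = rte_eth_rx_queue_setup(port_id, q, NB_DESC, SOCKET_ID,
                                         NULL, pool);
            if (ret < 0)
                return ret;
        }
        return rte_eth_dev_start(port_id);
    }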
Also try reducing the mbuf size to 2K (from the current 9K) and enabling
jumbo-frame support, so large frames are carried in chained multi-segment
mbufs instead of one 9K buffer.
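A rough sketch of such a pool on NUMA socket 1 (the pool name is made up,
and the mbuf count is simply carried over from your current config):

    #include <rte_mbuf.h>

    static struct rte_mempool *create_2k_pool(void)
    {
        /* RTE_MBUF_DEFAULT_BUF_SIZE = 2048 bytes of data room plus
         * headroom; jumbo frames then span several chained mbufs. */
        return rte_pktmbuf_pool_create("mbuf_pool_s1", /* illustrative name */
                                       640455,  /* nb mbufs, as in your config */
                                       512,     /* per-lcore cache size */
                                       0,       /* private area size */
                                       RTE_MBUF_DEFAULT_BUF_SIZE,
                                       1);      /* NUMA socket 1 */
    }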
Try running "perf top" and see what is taking the most time. Also try to
cache-align your data structures:

struct sample_struct {
    uint32_t a;
    uint64_t b;
    ...
} __rte_cache_aligned;

Thanks,
-Rajesh

On Fri, Jul 4, 2025 at 3:27 AM Stephen Hemminger <stephen@networkplumber.org> wrote:

> On Thu, 3 Jul 2025 20:14:59 +0000
> "Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
>
> > Hi,
> > I have run out of ideas and thought I would reach out to the dpdk
> > community.
> >
> > I have a Sapphire Rapids dual-CPU server and one E810 (also tried
> > X710); both are 4x10G NICs.  When our application pipeline's final
> > stage enqueues mbufs into the tx ring, I expect rte_ring_dequeue_burst()
> > to pull the mbufs from the tx ring and rte_eth_tx_burst() to transmit
> > them at line rate.  What I see is that when one interface is receiving
> > 64-byte UDP in IPv4, receive and transmit run at line rate (i.e.
> > packets in one port and out another port of the NIC at 14.9 Mpps).
> > When I turn on another receive port, Tx performance on both transmit
> > ports of the NIC drops to 5 Mpps.  The tx ring fills faster than the
> > Tx thread can dequeue and transmit mbufs.
> >
> > Packets arrive on ports 1 and 3 in my test setup.  The NIC is on NUMA
> > node 1.  Hugepage memory (6 GB, 1 GB page size) is on NUMA node 1.
> > The mbuf size is 9 KB.
> >
> > Rx Port 1 -> Tx Port 2
> > Rx Port 3 -> Tx Port 4
> >
> > I monitor the mbufs available and they are:
> > *** DPDK Mempool Configuration ***
> > Number Sockets      : 1
> > Memory/Socket GB    : 6
> > Hugepage Size MB    : 1024
> > Overhead/socket MB  : 512
> > Usable mem/socket MB: 5629
> > mbuf size Bytes     : 9216
> > nb mbufs per socket : 640455
> > total nb mbufs      : 640455
> > hugepages/socket GB : 6
> > mempool cache size  : 512
> >
> > *** DPDK EAL args ***
> > EAL lcore arg       : -l 36   <<< NUMA node 1
> > EAL socket-mem arg  : --socket-mem=0,6144
> >
> > The number of rings in this configuration is 16, all the same size
> > (16384 * 8), and there is one mempool.
> >
> > The Tx rings are created as SP and SC.
> >
> > There is one Tx thread per NIC port, whose only task is to dequeue
> > mbufs from the tx ring and call rte_eth_tx_burst() to transmit them.
> > The dequeue burst size is 512 and the tx burst is equal to or less
> > than 512.  rte_eth_tx_burst() never returns less than the burst size
> > given.
> >
> > Each Tx thread is on a dedicated CPU core and its sibling is unused.
> > We use CPU shielding to keep noncritical threads off the CPUs reserved
> > for Tx threads.  htop shows the Tx threads are the only threads using
> > the carved-out CPUs.
> >
> > The Tx thread uses rte_ring_dequeue_burst() to get a burst of up to
> > 512 mbufs.
> > I added debug counters to track how many dequeues from the tx ring
> > return exactly 512 mbufs and how many return fewer.  The dequeue of
> > the tx ring always returns 512, never less.
> >
> >
> > Note: if I skip the rte_eth_tx_burst() in the Tx threads and just
> > dequeue and bulk-free the mbufs from the tx ring, I do not see the tx
> > ring fill up, i.e. the thread frees the mbufs faster than they arrive
> > on the tx ring.
> >
> > So, I suspect that rte_eth_tx_burst() is the bottleneck to
> > investigate, which involves the inner workings of DPDK and the Intel
> > NIC architecture.
> >
> > Any help to resolve my issue is greatly appreciated.
> >
> > Thanks,
> > Ed
>
> Do profiling, and look at the number of cache misses.
> I suspect using an additional ring is causing lots of cache misses.
> Remember, going to memory is really slow on modern processors.
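For reference, a minimal sketch of the Tx drain loop described in the
thread, i.e. the hot path the profiling above would focus on (the ring,
port, and queue ids are illustrative, not from Ed's application):

    #include <rte_ring.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 512

    /* Drain one tx ring to one NIC Tx queue. */
    static void tx_drain_loop(struct rte_ring *txr, uint16_t port,
                              uint16_t queue)
    {
        struct rte_mbuf *pkts[BURST];

        for (;;) {
            unsigned int n = rte_ring_dequeue_burst(txr, (void **)pkts,
                                                    BURST, NULL);
            if (n == 0)
                continue;

            uint16_t sent = rte_eth_tx_burst(port, queue, pkts, n);
            /* Drop (or retry) anything the NIC did not accept,
             * so mbufs are never leaked. */
            if (sent < n)
                rte_pktmbuf_free_bulk(&pkts[sent], n - sent);
        }
    }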