Date: Mon, 7 Jul 2025 16:04:09 -0700
From: Stephen Hemminger
To: Ivan Malov
Cc: "Lombardo, Ed", users
Subject: Re: dpdk Tx falling short
Message-ID: <20250707160409.75fbc2f1@hermes.local>
In-Reply-To: <9ae56e38-0d29-4c7c-0bc2-f92912146da2@arknetworks.am>
References: <20250704074957.5848175a@hermes.local>
 <20250705120834.78849e56@hermes.local>
 <20250706090232.635bd36e@hermes.local>
 <9ae56e38-0d29-4c7c-0bc2-f92912146da2@arknetworks.am>

On Tue, 8 Jul 2025 01:49:44 +0400 (+04)
Ivan Malov wrote:

> Hi Ed,
>
> On Mon, 7 Jul 2025, Lombardo, Ed wrote:
>
> > Hi Stephen,
> > I ran a perf diff on two perf records and it reveals the real problem with
> > the tx thread in transmitting packets.
> >
> > The comparison is traffic received on ifn3 and transmitted on ifn4 versus
> > traffic received on ifn3, ifn5 and transmitted on ifn4, ifn6.
> > When transmitting packets on one port the performance is better; however,
> > when transmitting on two ports the performance across the two drops
> > dramatically.
> >
> > There is an increase of 55.29% of the CPU time spent in
> > common_ring_mp_enqueue and 54.18% less time in i40e_xmit_pkts (was E810,
> > tried x710).
> > common_ring_mp_enqueue is multi-producer; does the enqueue of mbuf
> > pointers passed in to rte_eth_tx_burst() have to be multi-producer?
>
> I may be wrong, but rte_eth_tx_burst(), as part of what is known as the
> "reap" process, should check for "done" Tx descriptors resulting from
> previous invocations and free (enqueue) the associated mbufs into the
> respective mempools. In your case, you say you only have a single mempool
> shared between the port pairs, which, as I understand, are served by
> concurrent threads, so it might be logical to use a multi-producer mempool
> in this case. Or am I missing something?
>
> The pktmbuf API for mempool allocation is a wrapper around the generic API,
> and it might request multi-producer multi-consumer by default (see [1],
> 'flags'). According to your original mempool monitor printout, the per-lcore
> cache size is 512. On the premise that separate lcores serve the two port
> pairs, and taking into account the burst size, it should be OK, yet you may
> want to play with the per-lcore cache size argument when creating the pool.
> Does it change anything?
>
> Regarding separate mempools -- I saw Stephen's response about those making
> CPU cache behaviour worse and not better. Makes sense and I won't argue.
> And yet, why not just try and make sure this indeed holds in this
> particular case? Also, since you're seeking single-producer behaviour,
> having separate per-port-pair mempools might allow creating such (again,
> see 'flags' at [1]), provided that API [1] is used for mempool creation.
> Please correct me in case I'm mistaken.
>
> Also, PMDs can support the "fast free" Tx offload. Please see [2] to check
> whether the application asks for this offload flag or not. It may be worth
> enabling.
>
> [1] https://doc.dpdk.org/api-25.03/rte__mempool_8h.html#a0b64d611bc140a4d2a0c94911580efd5
> [2] https://doc.dpdk.org/api-25.03/rte__ethdev_8h.html#a43f198c6b59d965130d56fd8f40ceac1
>
> Thank you.
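As a concrete illustration of the 'flags' point above: rte_pktmbuf_pool_create()
does not expose mempool flags, but the generic rte_mempool_create() does. A
minimal, untested sketch of a per-port-pair pool that selects the
single-producer/single-consumer ring ops follows; the pool name, mbuf count
and cache size are illustrative, not taken from the application.

    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    /* Sketch only: safe solely if exactly one lcore ever gets mbufs from
     * this pool (Rx refill) and exactly one lcore ever puts them back
     * (the Tx reap done inside rte_eth_tx_burst()). With these flags the
     * pool is backed by the "ring_sp_sc" ops, so the enqueue hot spot
     * becomes common_ring_sp_enqueue instead of common_ring_mp_enqueue. */
    static struct rte_mempool *
    create_sp_sc_pktmbuf_pool(const char *name, unsigned int n_mbufs)
    {
            return rte_mempool_create(name, n_mbufs,
                    sizeof(struct rte_mbuf) + RTE_MBUF_DEFAULT_BUF_SIZE,
                    512,                          /* per-lcore cache size */
                    sizeof(struct rte_pktmbuf_pool_private),
                    rte_pktmbuf_pool_init, NULL,  /* data room taken from elt_size */
                    rte_pktmbuf_init, NULL,
                    rte_socket_id(),
                    RTE_MEMPOOL_F_SP_PUT | RTE_MEMPOOL_F_SC_GET);
    }

rte_pktmbuf_pool_create_by_ops() with the "ring_sp_sc" ops name is another
route to the same behaviour. If any other lcore can allocate from or free
into the same pool, the flags have to stay multi-producer/multi-consumer.
The "fast free" offload is likewise only a request made at configuration
time, and it assumes all mbufs freed on a Tx queue come from one mempool and
are not reference-counted; a sketch, with port_id standing in for the port
being configured:

    /* needs <rte_ethdev.h> */
    struct rte_eth_dev_info dev_info;
    struct rte_eth_conf port_conf = { 0 };   /* plus the existing rx/tx settings */

    if (rte_eth_dev_info_get(port_id, &dev_info) == 0 &&
        (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE))
            port_conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;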
> > Is there a way to change dpdk to use single-producer?
> >
> > # Event 'cycles'
> > #
> > # Baseline  Delta Abs  Shared Object      Symbol
> > # ........  .........  .................  ......................................
> > #
> >     36.37%    +55.29%  test               [.] common_ring_mp_enqueue
> >     62.36%    -54.18%  test               [.] i40e_xmit_pkts
> >      1.10%     -0.94%  test               [.] dpdk_tx_thread
> >      0.01%     -0.01%  [kernel.kallsyms]  [k] native_sched_clock
> >                +0.00%  [kernel.kallsyms]  [k] fill_pmd
> >                +0.00%  [kernel.kallsyms]  [k] perf_sample_event_took
> >      0.00%     +0.00%  [kernel.kallsyms]  [k] __flush_smp_call_function_queue
> >      0.02%             [kernel.kallsyms]  [k] __intel_pmu_enable_all.constprop.0
> >      0.02%             [kernel.kallsyms]  [k] native_irq_return_iret
> >      0.02%             [kernel.kallsyms]  [k] native_tss_update_io_bitmap
> >      0.01%             [kernel.kallsyms]  [k] ktime_get
> >      0.01%             [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
> >      0.01%             [kernel.kallsyms]  [k] __update_blocked_fair
> >      0.01%             [kernel.kallsyms]  [k] perf_adjust_freq_unthr_events
> >
> > Thanks,
> > Ed
> >
> > -----Original Message-----
> > From: Lombardo, Ed
> > Sent: Sunday, July 6, 2025 1:45 PM
> > To: Stephen Hemminger
> > Cc: Ivan Malov; users
> > Subject: RE: dpdk Tx falling short
> >
> > Hi Stephen,
> > If using dpdk rings comes with this penalty, then what should I use? Is
> > there an alternative to rings? We do not want to use shared memory and do
> > buffer copies.
> >
> > Thanks,
> > Ed
> >
> > -----Original Message-----
> > From: Stephen Hemminger
> > Sent: Sunday, July 6, 2025 12:03 PM
> > To: Lombardo, Ed
> > Cc: Ivan Malov; users
> > Subject: Re: dpdk Tx falling short
> >
> > External Email: This message originated outside of NETSCOUT. Do not click
> > links or open attachments unless you recognize the sender and know the
> > content is safe.
> >
> > On Sun, 6 Jul 2025 00:03:16 +0000
> > "Lombardo, Ed" wrote:
> >
> >> Hi Stephen,
> >> Here are comments on the list of obvious causes of cache misses you
> >> mentioned.
> >>
> >> Obvious cache misses.
> >> - passing packets to worker with ring - we use lots of rings to pass mbuf
> >>   pointers. If I skip the rte_eth_tx_burst() and just free the mbufs in
> >>   bulk, the tx ring does not fill up.
> >> - using spinlocks (cost 16ns) - The driver does not use spinlocks, other
> >>   than what dpdk uses.
> >> - fetching TSC - We don't do this; we let Rx offload timestamp packets.
> >> - syscalls? - No syscalls are done in our driver fast path.
> >>
> >> You mention "passing packets to worker with ring"; do you mean using
> >> rings to pass mbuf pointers causes cache misses and should be avoided?
> >
> > Rings do cause data to be modified by one core and examined by another, so
> > they are a cache miss.

How many packets is your application seeing per burst? Ideally it should be
getting chunks, not a single packet at a time. And then the driver can use
deferred free to put back bursts. If you have a multi-stage pipeline, it
helps if you pass a burst to each stage rather than looping over the burst in
the outer loop. Imagine getting a burst of 16 packets. If you pass an array
down the pipeline, then there is one call per burst. If you process packets
one at a time, it can mean 16 calls, and if the pipeline exceeds the
instruction cache it can mean 16 cache misses. The point is that bursting is
a big win in data and instruction cache. If you really want to tune,
investigate prefetching like VPP does.
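To make the burst point concrete, a Tx stage that is handed whole bursts
might look roughly like the sketch below. The ring, port and queue arguments
are placeholders, and dropping what the NIC refuses is just one possible
policy.

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_ring.h>

    #define BURST 32

    /* Sketch of a Tx stage that processes one burst per call: one ring
     * dequeue, one rte_eth_tx_burst(), one bulk free, instead of pushing
     * packets through the pipeline one at a time. */
    static void
    tx_stage(struct rte_ring *in_ring, uint16_t port_id, uint16_t queue_id)
    {
            struct rte_mbuf *pkts[BURST];
            unsigned int n;
            uint16_t sent;

            n = rte_ring_dequeue_burst(in_ring, (void **)pkts, BURST, NULL);
            if (n == 0)
                    return;

            sent = rte_eth_tx_burst(port_id, queue_id, pkts, n);
            if (sent < n)
                    rte_pktmbuf_free_bulk(&pkts[sent], n - sent);

The intermediate stages would take the same shape: one call per array of
mbufs handed down the pipeline, not one call per packet.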