From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <users-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id B60C746B2B
	for <public@inbox.dpdk.org>; Tue,  8 Jul 2025 18:53:20 +0200 (CEST)
Received: from mails.dpdk.org (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id 5317640292;
	Tue,  8 Jul 2025 18:53:20 +0200 (CEST)
Received: from agw.arknetworks.am (agw.arknetworks.am [79.141.165.80])
 by mails.dpdk.org (Postfix) with ESMTP id 571E64025E
 for <users@dpdk.org>; Tue,  8 Jul 2025 18:53:19 +0200 (CEST)
Received: from debian (unknown [78.109.72.5])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by agw.arknetworks.am (Postfix) with ESMTPSA id 919B5E067E;
 Tue,  8 Jul 2025 20:53:18 +0400 (+04)
DKIM-Filter: OpenDKIM Filter v2.11.0 agw.arknetworks.am 919B5E067E
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=arknetworks.am;
 s=default; t=1751993598;
 bh=R7Lo4UB7g2Vtsfmamfnn9k1bTe12lVlZfeN5minxd9s=;
 h=Date:From:To:cc:Subject:In-Reply-To:References:From;
 b=VlqnnYtX+fhGRz8Yur1HmZPeC35zxJE4OiSB2AKUZ65gSDSRY8FjfDui3Sqxi7dNR
 o0toKjYov5hRRqm9Dbswjblazbel/FbNXs6goNsD2E1/PvqVJoVTn8MB+nB7DyRoY1
 QCv72UnnfAgLTtVYa6Ct046OQt/lNu5GvaW3MJO+/y4ZatnGauSho7MWxYzM5NXRRQ
 mjokdxh8TwNCneOPhFH2M1CPoqgbJCa3TshPmDWcikqSeyH8OV3AT4h994JLoylJCV
 NfwknJ2jTNenUidKiakpZ5GfJuuz6YHK/cWZLO2IpZsBkn7X82AGFl2tcWwlp1kfKM
 fWrjOo8/8cO2A==
Date: Tue, 8 Jul 2025 20:53:17 +0400 (+04)
From: Ivan Malov <ivan.malov@arknetworks.am>
To: "Lombardo, Ed" <Ed.Lombardo@netscout.com>
cc: Stephen Hemminger <stephen@networkplumber.org>, users <users@dpdk.org>
Subject: RE: dpdk Tx falling short
In-Reply-To: <CH3PR01MB8470A19065C3F7780CF31E638F4EA@CH3PR01MB8470.prod.exchangelabs.com>
Message-ID: <baa42cf0-46d5-bc55-093c-596c5edcfd45@arknetworks.am>
References: <CH3PR01MB8470E2030EECEB410B356F878F43A@CH3PR01MB8470.prod.exchangelabs.com>
 <20250705120834.78849e56@hermes.local>
 <CH3PR01MB8470537887794F10831E73778F4CA@CH3PR01MB8470.prod.exchangelabs.com>
 <20250706090232.635bd36e@hermes.local>
 <CH3PR01MB8470EAA7AB3F584F1A8BCB038F4CA@CH3PR01MB8470.prod.exchangelabs.com>
 <CH3PR01MB8470BDA5F4CEE56E096692758F4FA@CH3PR01MB8470.prod.exchangelabs.com>
 <9ae56e38-0d29-4c7c-0bc2-f92912146da2@arknetworks.am>
 <20250707160409.75fbc2f1@hermes.local>
 <CH3PR01MB84705D671EDC074FD3B1996D8F4EA@CH3PR01MB8470.prod.exchangelabs.com>
 <20250708064707.583df905@hermes.local>
 <CH3PR01MB8470A9F9766C7C50EFF18FB78F4EA@CH3PR01MB8470.prod.exchangelabs.com>
 <4b43a1ce-2dc6-5d46-12e0-b26d13a60633@arknetworks.am> 
 <CH3PR01MB8470A4E2F5D9FDB9AFCB804E8F4EA@CH3PR01MB8470.prod.exchangelabs.com>
 <1b7533d3-a3de-b5e9-8838-2d6608f2c8e5@arknetworks.am>
 <CH3PR01MB8470A19065C3F7780CF31E638F4EA@CH3PR01MB8470.prod.exchangelabs.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
X-BeenThere: users@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK usage discussions <users.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/users>,
 <mailto:users-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/users/>
List-Post: <mailto:users@dpdk.org>
List-Help: <mailto:users-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/users>,
 <mailto:users-request@dpdk.org?subject=subscribe>
Errors-To: users-bounces@dpdk.org

Hi Ed,

On Tue, 8 Jul 2025, Lombardo, Ed wrote:

> Hi Ivan,
> Thanks, this clears up my confusion.  Since API [2] creates one mempool for the network Rx and Tx queues, it must be MP/MC.  The CPU cycles spent in common_ring_mp_enqueue increase as more ports are transmitting.  Does the transmit operation make the Rx and Tx queues fight for access to the mbuf mempool because there is only one mempool?

Not really. Mempools in DPDK in general (and, in particular, as shown in your
monitor printout) have a per-lcore object cache, which, if I'm not mistaken,
exists to avoid such contention when accessing the pool. And, since only a
single pool is used in your case, the use of MP/MC seems logical, as does the
use of the per-lcore object cache. But it's not obvious whether this is optimal
in your case.
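
To illustrate the idea, here is a toy model in plain C. It is not DPDK code:
all names and sizes are made up, and the real mempool cache logic differs in
detail. It only shows how a private per-core cache turns many single-object
operations into a few bulk accesses to the shared (contended) pool.

```c
#include <stddef.h>

#define POOL_SIZE   64   /* objects held by the shared pool (toy value) */
#define CACHE_SIZE   8   /* capacity of one per-core cache (toy value) */

/* Shared pool: in a real MP/MC pool every access here is contended. */
struct toy_pool {
    int objs[POOL_SIZE];
    size_t count;
    unsigned shared_ops;   /* how many times the shared pool was touched */
};

/* Private cache owned by exactly one core, so it needs no atomics. */
struct toy_cache {
    int objs[CACHE_SIZE];
    size_t count;
};

static void toy_pool_init(struct toy_pool *p)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        p->objs[i] = (int)i;
    p->count = POOL_SIZE;
    p->shared_ops = 0;
}

/* Get one object: serve from the cache, refill in bulk when it is empty. */
static int toy_get(struct toy_pool *p, struct toy_cache *c, int *obj)
{
    if (c->count == 0) {
        size_t n = p->count < CACHE_SIZE ? p->count : CACHE_SIZE;
        if (n == 0)
            return -1;               /* pool exhausted */
        p->shared_ops++;             /* one bulk dequeue from the pool */
        while (n--)
            c->objs[c->count++] = p->objs[--p->count];
    }
    *obj = c->objs[--c->count];
    return 0;
}

/* Put one object back: flush the cache in bulk only when it is full. */
static void toy_put(struct toy_pool *p, struct toy_cache *c, int obj)
{
    if (c->count == CACHE_SIZE) {
        p->shared_ops++;             /* one bulk enqueue to the pool */
        while (c->count > 0)
            p->objs[p->count++] = c->objs[--c->count];
    }
    c->objs[c->count++] = obj;
}
```

In this model, getting and then returning 32 objects touches the shared pool
only a handful of times instead of 64. If the cache is too small for the burst
sizes in use, the shared pool gets hit far more often.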

> This is why you suggested creating two mempools, one for each pair of ports.

It could be low-hanging fruit to do a quick check with two separate mempools,
probably even MP/MC as well (allocated via the same API [2]), to see whether it
affects performance or not. Again, as Stephen noted, this may even worsen CPU
cache performance, but maybe it still pays to do a quick check after all.

> If I go this route what are the precautions I need to take?
>
> I will try RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE offload flag first.

This is somewhat unrelated to pools and rings, yet it should enable the PMD's
internal Tx handling to accumulate mbufs and free them after transmission via
bulk operations that, akin to Tx and Rx bursts, may also improve CPU cache
utilisation and overall performance. The main prerequisite is that all mbufs
passed to a given Tx queue come from the same mempool (and, per the ethdev API
documentation, have a reference count of 1). Hopefully this holds for you,
provided the logic does not intermix packets from 2 pools in the same Tx queue.
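
If in doubt, a debug-time sanity check along the following lines could confirm
the prerequisite before enabling the flag. This is a plain C sketch with a
stand-in structure and a hypothetical helper name, not DPDK code. In a real
application one would compare the 'pool' field of struct rte_mbuf and read the
counter with rte_mbuf_refcnt_read() instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct rte_mbuf: only the fields this check needs. */
struct toy_mbuf {
    const void *pool;   /* owning mempool (rte_mbuf has a 'pool' field too) */
    unsigned refcnt;    /* reference counter */
};

/*
 * Hypothetical helper: return true if every mbuf in the burst comes from
 * 'expected_pool' and is not shared (reference count of exactly 1).
 */
static bool burst_ok_for_fast_free(struct toy_mbuf *const *pkts, size_t n,
                                   const void *expected_pool)
{
    for (size_t i = 0; i < n; i++) {
        if (pkts[i]->pool != expected_pool || pkts[i]->refcnt != 1)
            return false;
    }
    return true;
}
```

The flag itself goes into the Tx offloads of the port or queue configuration,
and it is worth checking first that the PMD reports it in its Tx offload
capabilities.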

Maybe Stephen's suggestion to use the Tx buffer API is also worth a shot.
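
As a rough picture of what such buffering does, here is a toy model in plain C.
The names and the buffer size are made up; in DPDK the corresponding calls are
rte_eth_tx_buffer() and rte_eth_tx_buffer_flush(). Packets are accumulated and
sent only once a full burst is available, and an explicit flush on idle makes
sure that a partial burst does not sit in the buffer forever.

```c
#include <stddef.h>

#define TXBUF_SIZE 4   /* toy value; real bursts are typically larger */

struct toy_txbuf {
    int pkts[TXBUF_SIZE];
    size_t length;          /* packets currently buffered */
    unsigned bursts_sent;   /* number of burst calls issued */
    size_t pkts_sent;       /* total packets handed to the "NIC" */
};

/* Stand-in for a Tx burst call: pretend the NIC accepts everything. */
static size_t toy_txbuf_flush(struct toy_txbuf *b)
{
    size_t n = b->length;

    if (n == 0)
        return 0;
    b->bursts_sent++;
    b->pkts_sent += n;
    b->length = 0;
    return n;
}

/* Queue one packet; send only when a full burst has accumulated. */
static size_t toy_txbuf_add(struct toy_txbuf *b, int pkt)
{
    b->pkts[b->length++] = pkt;
    if (b->length < TXBUF_SIZE)
        return 0;
    return toy_txbuf_flush(b);
}
```

The Tx thread would call the flush routine whenever its input ring has been
empty for some time, which is the "idle period" mechanism Stephen mentioned.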

Thank you.

>
> Thanks,
> Ed
>
> -----Original Message-----
> From: Ivan Malov <ivan.malov@arknetworks.am>
> Sent: Tuesday, July 8, 2025 10:49 AM
> To: Lombardo, Ed <Ed.Lombardo@netscout.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>; users <users@dpdk.org>
> Subject: RE: dpdk Tx falling short
>
> External Email: This message originated outside of NETSCOUT. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
> On Tue, 8 Jul 2025, Lombardo, Ed wrote:
>
>> Hi Ivan,
>> Yes, only the user space created rings.
>> Can you add more to your thoughts?
>
> I was seeking to address the probable confusion here. If the application creates an SC / MP ring for its own pipeline logic using API [1] and then invokes another API [2] to create a common "mbuf mempool" to be used with the Rx and Tx queues of the network ports, then the observed appearance of "common_ring_mp_enqueue" is likely attributable to the fact that API [2] creates a ring-based mempool internally, in MP / MC mode by default. The latter ring is not the same as the one created by the application logic. These are two independent rings.
>
> BTW, does your application set the RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE offload flag when configuring Tx port/queue offloads on the network ports?
>
> Thank you.
>
> [1] https://doc.dpdk.org/api-25.03/rte__ring_8h.html#a155cb48ef311eddae9b2e34808338b17
> [2] https://doc.dpdk.org/api-25.03/rte__mbuf_8h.html#a8f4abb0d54753d2fde515f35c1ba402a
> [3] https://doc.dpdk.org/api-25.03/rte__mempool_8h.html#a0b64d611bc140a4d2a0c94911580efd5
>
>>
>> Ed
>>
>> -----Original Message-----
>> From: Ivan Malov <ivan.malov@arknetworks.am>
>> Sent: Tuesday, July 8, 2025 10:19 AM
>> To: Lombardo, Ed <Ed.Lombardo@netscout.com>
>> Cc: Stephen Hemminger <stephen@networkplumber.org>; users <users@dpdk.org>
>> Subject: RE: dpdk Tx falling short
>>
>> Hi Ed,
>>
>> On Tue, 8 Jul 2025, Lombardo, Ed wrote:
>>
>>> Hi Stephen,
>>> When I replace rte_eth_tx_burst() with mbuf free bulk I do not see the Tx ring fill up.  I think this is valuable information.  Also, perf analysis of the Tx thread shows common_ring_mp_enqueue and rte_atomic32_cmpset, which I did not expect to see, since I created all the Tx rings as SP and SC (and the workers and ack rings as well, essentially all the 16 rings).
>>>
>>> Perf report snippet:
>>> +   57.25%  DPDK_TX_1  test  [.] common_ring_mp_enqueue
>>> +   25.51%  DPDK_TX_1  test  [.] rte_atomic32_cmpset
>>> +    9.13%  DPDK_TX_1  test  [.] i40e_xmit_pkts
>>> +    6.50%  DPDK_TX_1  test  [.] rte_pause
>>>      0.21%  DPDK_TX_1  test  [.] rte_mempool_ops_enqueue_bulk.isra.0
>>>      0.20%  DPDK_TX_1  test  [.] dpdk_tx_thread
>>>
>>> The traffic load is a constant 10 Gbps of 84-byte packets with no idles.  The desired burst size is 512 mbufs, however the Tx thread will transmit whatever it can get from the Tx ring.
>>>
>>> I think working out why the perf analysis shows the ring as MP when it has been created as SP / SC should resolve this issue.
>>
>> The 'common_ring_mp_enqueue' is the enqueue method of the mempool variant 'ring', that is, the one based on RTE Ring internally. When you say that the ring has been created as SP / SC, you seemingly refer to the regular RTE ring created by your application logic, not the internal ring of the mempool. Am I missing something?
>>
>> Thank you.
>>
>>>
>>> Thanks,
>>> ed
>>>
>>> -----Original Message-----
>>> From: Stephen Hemminger <stephen@networkplumber.org>
>>> Sent: Tuesday, July 8, 2025 9:47 AM
>>> To: Lombardo, Ed <Ed.Lombardo@netscout.com>
>>> Cc: Ivan Malov <ivan.malov@arknetworks.am>; users <users@dpdk.org>
>>> Subject: Re: dpdk Tx falling short
>>>
>>> On Tue, 8 Jul 2025 04:10:05 +0000
>>> "Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
>>>
>>>> Hi Stephen,
>>>> I ensured that every pipeline stage that enqueues or dequeues mbufs uses the burst version; perf showed the repercussions of doing single-mbuf dequeue and enqueue.
>>>> For the receive stage rte_eth_rx_burst() is used and for the Tx stage we use rte_eth_tx_burst().  The burst size used in tx_thread for dequeue burst is 512 mbufs.
>>>
>>> You might try buffering like rte_eth_tx_buffer does.
>>> You would need to add an additional mechanism to ensure that the buffer gets flushed when you detect an idle period.
>>>
>>
>