From: Konstantin Ananyev
To: Honnappa Nagarahalli, Feifei Wang, Yuying Zhang, Beilei Xing, Ruifeng Wang
Cc: dev@dpdk.org, nd
Date: Wed, 13 Jul 2022 19:48:50 +0100
Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
References: <20220613055136.1949784-1-feifei.wang2@arm.com>
 <1e082bfe-9b52-86f0-e7fa-279ef8feaf1a@yandex.ru>

11/07/2022 07:19, Honnappa Nagarahalli wrote:
>
>>
>>
>>> Refer to "i40e_tx_free_bufs_avx512"; this patch puts the mempool cache
>>> out of the API to free buffers directly. There are two changes compared
>>> with the previous version:
>>> 1. change txep from "i40e_entry" to "i40e_vec_entry"
>>> 2. put the cache out of the "mempool_bulk" API to copy buffers into it
>>> directly
>>>
>>> Performance test with the l3fwd neon path:
>>> with this patch
>>> n1sdp: no performance change
>>> ampere-altra: +4.0%
>>>
>>
>>
>> Thanks for the RFC, appreciate your effort.
>> So, as I understand it - bypassing mempool put/get itself gives about
>> 7-10% speedup for RX/TX on ARM platforms, correct?
> It is 7% to 14% on N1SDP, 14% to 17% on Ampere machines.
> The current RFC needs to be converted to a patch (albeit with zero-copy
> mempool APIs) as it applies to a more generic use case.
>
>>
>> About the direct-rearm RX approach you propose:
>> After another thought, it is probably possible to re-arrange it in a way
>> that would help avoid the related negatives.
> Thanks for the idea.
>
>> The basic idea is as follows:
>>
>> 1. Make the RXQ sw_ring visible and accessible by 'attached' TX queues.
>>    Also make the sw_ring de-coupled from the RXQ itself, i.e.:
>>    when the RXQ is stopped or even destroyed, the related sw_ring may
>>    still exist (probably a ref-counter or RCU would be sufficient here).
>>    All that means we need a common layout/API for rxq_sw_ring, and PMDs
>>    that would like to support direct rearming will have to use/obey it.
> This would mean we will have an additional for loop to fill the
> descriptors on the RX side.

Yes, RX will go through the available entries in the sw_ring and fill the
actual HW descriptors.

> Maybe we could keep the queue when the RXQ is stopped, and free it only
> when the RXQ is destroyed. Then we don't have to allocate one more if the
> RXQ is started back again? This means the mbufs will be in the sw_ring if
> the RXQ is stopped but not destroyed (same as what you have mentioned
> below).

My thought here was - yes, the sw_ring can stay at queue stop (probably
even at destroy) to avoid aborting related TXQ operations. The actual HW
descriptor ring, though, will be de-initialized/destroyed as usual.
That's why I think we need to decouple the sw_ring from the queue and its
HW ring.

>
>> 2. Make RXQ sw_ring 'direct' rearming driven by the TXQ itself, i.e.:
>>    at txq_free_bufs() try to store released mbufs inside the attached
>>    sw_ring directly. If there is no attached sw_ring, or not enough
>>    free space in it - continue with mempool_put() as usual.
>>    Note that actual arming of the HW RXDs still remains the responsibility
> What problems do you see if we arm the HW RXDs (assuming that we do not
> have to support this across multiple devices)?

Same as in the vice-versa scenario:
- The RXQ might already be stopped while the TXQ is still running.
- Also, it doesn't sound right from a design point of view to allow TX
  code to manipulate the RX HW queue directly, and vice versa.

>
>> of the RX code-path:
>> rxq_rearm(rxq) {
>>     ...
>>     - check whether there are N already filled entries inside rxq_sw_ring;
>>       if not, populate them from the mempool (usual mempool_get()).
>>     - arm the related RXDs and mark these sw_ring entries as managed by HW.
>>     ...
>> }
>>
>>
>> So rxq_sw_ring will serve two purposes:
>> - track mbufs that are managed by HW (that's what it does now)
>> - private (per RXQ) mbuf cache
>>
>> Now, if the TXQ is stopped while the RXQ is running - no extra
>> synchronization is required; the RXQ would just use mempool_get() to
>> rearm its sw_ring itself.
> There would be some synchronization required which tells the data plane
> threads not to access the TXQ before the TXQ is stopped. Other than this,
> no extra synchronization is required.
>
> For the current patch, we could use a similar approach, i.e. when the TXQ
> is stopped, it does not free the mbufs from the sw_ring; we just let the
> RXQ consume all the mbufs from the TX side sw_ring till it is empty.
>
>>
>> If the RXQ is stopped while the TXQ is still running - the TXQ can still
>> continue to populate the related sw_ring till it gets full.
>> Then it will continue with mempool_put() as usual.
>> Of course, it means that a user who wants to use this feature should
>> probably account for some extra mbufs for such a case, or maybe
>> rxq_sw_ring can have an enable/disable flag to mitigate such a situation.
>>
>> As another benefit here - such an approach makes it possible to use
>> several TXQs (even from different devices) to rearm the same RXQ.
> Being able to use several TXQs is a practical use case. But I am not sure
> if different devices in the same server is a practical scenario that we
> need to address.
>
>>
>> Have to say that I am still not sure a 10% RX/TX improvement is worth
>> bypassing the mempool completely and introducing all this extra
>> complexity in the RX/TX path.
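To make points 1/2 above a bit more concrete, here is a minimal sketch of
both halves of that scheme. Everything in it is an assumption made for
illustration - struct rxq_sw_ring, its fields and both helper names are not
existing DPDK or i40e code - and it assumes a single producer (one attached
TXQ), a single consumer (the RXQ), a power-of-two ring size and mbufs that
are already "clean" (refcnt == 1, all from the RX queue's pool):

/*
 * Illustrative sketch only, not existing DPDK or i40e code.
 */
#include <stdint.h>

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical shared sw_ring handle, decoupled from the RX queue. */
struct rxq_sw_ring {
    struct rte_mbuf **entries; /* ring of mbuf pointers, 'size' slots */
    uint16_t size;             /* ring size, assumed a power of two */
    uint16_t head;             /* next slot a TXQ may fill */
    uint16_t tail;             /* next slot the RXQ will arm into HW */
};

/* TX side: try to recycle freed mbufs straight into the attached sw_ring;
 * if there is no attached ring, or not enough room in it, fall back to the
 * usual bulk put into the mempool. */
static inline void
txq_free_bufs_direct(struct rte_mbuf **txep, uint16_t n,
                     struct rxq_sw_ring *sw, struct rte_mempool *mp)
{
    uint16_t used, slot, i;

    if (sw != NULL) {
        used = (uint16_t)(sw->head - sw->tail);
        if ((uint16_t)(sw->size - used) >= n) {
            for (i = 0; i < n; i++) {
                slot = (sw->head + i) & (sw->size - 1);
                sw->entries[slot] = txep[i];
            }
            sw->head = (uint16_t)(sw->head + n);
            return;
        }
    }
    rte_mempool_put_bulk(mp, (void **)txep, n);
}

/* RX side (called from the rearm path): take up to n mbufs, preferring the
 * sw_ring and topping up from the mempool when it runs dry.
 * Returns the number of mbufs actually obtained. */
static inline int
rxq_take_for_rearm(struct rxq_sw_ring *sw, struct rte_mempool *mp,
                   struct rte_mbuf **rxep, uint16_t n)
{
    uint16_t avail, take, i;

    avail = (uint16_t)(sw->head - sw->tail);
    take = avail < n ? avail : n;
    for (i = 0; i < take; i++)
        rxep[i] = sw->entries[(sw->tail + i) & (sw->size - 1)];
    sw->tail = (uint16_t)(sw->tail + take);

    if (take < n &&
        rte_mempool_get_bulk(mp, (void **)&rxep[take], n - take) != 0)
        return take; /* partial fill; the caller arms only what it got */
    return n;
}

The point of the two fallbacks is that neither side ever has to wait for the
other: if the ring is full the TXQ just does the normal mempool_put(), and
if it is empty the RXQ just does the normal mempool_get().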
> It is more than that; clarified above.
>
>> But if we still decide to go ahead with direct rearming, this
>> re-arrangement, I think, should help to keep things clear and avoid
>> introducing new limitations in existing functionality.
>>
>> WDYT?
> I had another idea. This requires the application to change, which might
> be ok given that it is a new feature/optimization.
>
> We could expose a new API to the application which allows the RX side to
> take buffers from the TX side. The application would call this additional
> API just before calling the eth_rx_burst API.
>
> The advantages I see with this approach are:
> 1) The static mapping of the ports is not required. The mapping is
> dynamic. This allows the application to make decisions based on current
> conditions in the data plane.
> 2) The code is simple, because it does not have to check for many
> conditions. The application can make better decisions as it knows the
> scenario well. Not all applications have to take the hit of the
> conditionals.
> 3) The existing synchronization used by the applications (to stop
> RXQ/TXQ) is sufficient. No new synchronization required.

That sounds like a very good idea to me.
I think it will allow us to avoid all the shortcomings that are present in
the original approach.
Again, if we like, we can even allow the user to feed the RXQ from some
other sources (dropped packets).

Just to confirm that we are talking about the same thing:

1. rte_ethdev will have an API for the user to get the sw_ring handle from
the RXQ, something like:
struct rxq_sw_ring *rte_eth_rxq_get_sw_ring(uint16_t port_id, uint16_t rxq_id);
We can probably even keep 'struct rxq_sw_ring' internal and use it as an
opaque handle.

2. rte_ethdev will have an API that will ask the TXQ to free mbufs and fill
the given rxq_sw_ring:
int rte_eth_txq_fill_sw_ring(uint16_t port_id, uint16_t txq_id,
                             struct rxq_sw_ring *sw_ring);

So the app code will look something like:

rxq_sw_ring = rxq_get_sw_ring(rx_port, rxq);
...
while (...) {
    n = eth_tx_burst(tx_port, txq, pkts, N);
    ...
    eth_txq_fill_sw_ring(tx_port, txq, rxq_sw_ring);
    n = eth_rx_burst(rx_port, rxq, pkts, N);
    ...
}

(A fully typed version of this loop is sketched at the end of this message.)

Did I get your idea right?

>
> The disadvantage is that it is another function pointer call which will
> reduce some performance.
>
>>
>> Konstantin
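For reference, the application loop sketched above could be written out
with full types roughly as follows. The two rte_eth_*_sw_ring prototypes
are only the hypothetical API discussed in this thread and do not exist in
rte_ethdev today; fwd_loop() and BURST are illustrative names, while the
rx/tx burst and mbuf calls are the standard DPDK ones:

#include <stdint.h>

#include <rte_ethdev.h>
#include <rte_mbuf.h>

struct rxq_sw_ring; /* opaque handle owned by the RX queue's PMD */
struct rxq_sw_ring *rte_eth_rxq_get_sw_ring(uint16_t port_id, uint16_t rxq_id);
int rte_eth_txq_fill_sw_ring(uint16_t port_id, uint16_t txq_id,
                             struct rxq_sw_ring *sw_ring);

#define BURST 32

static void
fwd_loop(uint16_t rx_port, uint16_t rxq, uint16_t tx_port, uint16_t txq)
{
    struct rte_mbuf *pkts[BURST];
    struct rxq_sw_ring *sw;
    uint16_t nb_rx, nb_tx, i;

    /* Fetch the RXQ sw_ring handle once, outside the hot loop; which TXQ
     * feeds it can change from iteration to iteration. */
    sw = rte_eth_rxq_get_sw_ring(rx_port, rxq);

    for (;;) {
        nb_rx = rte_eth_rx_burst(rx_port, rxq, pkts, BURST);
        /* ... packet processing would go here ... */
        nb_tx = rte_eth_tx_burst(tx_port, txq, pkts, nb_rx);
        for (i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(pkts[i]);

        /* Just before the next rx_burst, ask the TXQ to recycle the mbufs
         * it has finished with into the RXQ sw_ring; if it has nothing to
         * give, the RX path simply falls back to the mempool. */
        rte_eth_txq_fill_sw_ring(tx_port, txq, sw);
    }
}

Because the fill call names both the TX queue and the target sw_ring
explicitly, the RXQ-to-TXQ pairing stays a run-time decision of the
application, which is the dynamic-mapping advantage listed above.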