From: Konstantin Ananyev
To: Feifei Wang
Cc: nd@arm.com, dev@dpdk.org, ruifeng.wang@arm.com, honnappa.nagarahalli@arm.com
Subject: Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
Date: Tue, 24 May 2022 02:25:49 +0100
Message-ID: <5320c9dd-8f53-155e-7900-ff02bfa11b4d@yandex.ru>
In-Reply-To: <20220516061012.618787-1-feifei.wang2@arm.com>
References: <20220420081650.2043183-1-feifei.wang2@arm.com>
 <20220516061012.618787-1-feifei.wang2@arm.com>

16/05/2022 07:10, Feifei Wang writes:
>
>>> Currently, the transmit side frees the buffers into the lcore cache
>>> and the receive side allocates buffers from the lcore cache. The
>>> transmit side typically frees 32 buffers, resulting in 32*8=256B of
>>> stores to the lcore cache. The receive side allocates 32 buffers and
>>> stores them in the receive side software ring, resulting in
>>> 32*8=256B of stores and 256B of loads from the lcore cache.
>>>
>>> This patch proposes a mechanism to avoid freeing to/allocating from
>>> the lcore cache, i.e. the receive side will free the buffers from
>>> the transmit side directly into its software ring. This will avoid
>>> the 256B of loads and stores introduced by the lcore cache. It also
>>> frees up the cache lines used by the lcore cache.
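
To make sure I read the proposal right, here is a rough sketch of what
such an Rx re-arm fast path could look like (all structure and function
names below are made up for illustration, this is not the actual patch
code):

/* Hypothetical sketch of the proposed Rx re-arm path: recycle mbufs
 * straight from the paired Tx queue SW ring instead of allocating
 * them from the mempool. All names here are made up; wrap-around
 * handling is omitted for brevity. */
static uint16_t
rx_direct_rearm(struct rxq *rxq, struct txq *txq, uint16_t n)
{
	uint16_t i;

	/* not enough completed Tx mbufs - the caller falls back to
	 * the usual mempool allocation path */
	if (txq_completed_count(txq) < n)
		return 0;

	for (i = 0; i != n; i++) {
		struct rte_mbuf *mb = txq_take_completed_mbuf(txq);

		rxq->sw_ring[rxq->rearm_idx + i] = mb;
		/* re-arm the HW descriptor with the recycled buffer */
		rxq->rx_ring[rxq->rearm_idx + i].dma_addr =
			rte_mbuf_data_iova_default(mb);
	}
	rxq->rearm_idx += n;
	return n;
}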
>>>
>>> However, this solution poses several constraints:
>>>
>>> 1) The receive queue needs to know which transmit queue it should
>>> take the buffers from. The application logic decides which transmit
>>> port to use to send out the packets. In many use cases the NIC might
>>> have a single port ([1], [2], [3]), in which case a given transmit
>>> queue is always mapped to a single receive queue (1:1 Rx queue:
>>> Tx queue). This is easy to configure.
>>>
>>> If the NIC has 2 ports (there are several references), then we will
>>> have a 1:2 (Rx queue: Tx queue) mapping, which is still easy to
>>> configure. However, if this is generalized to 'N' ports, the
>>> configuration can be long. Moreover, the PMD would have to scan a
>>> list of transmit queues to pull the buffers from.
>
>> Just to re-iterate some generic concerns about this proposal:
>>   - We effectively link Rx and Tx queues - when this feature is
>>     enabled, the user can't stop a Tx queue without stopping the
>>     linked Rx queue first. Right now the user is free to start/stop
>>     any queue at will. If this feature allows linking queues from
>>     different ports, then even the ports become dependent, and the
>>     user will have to take extra care when managing them.
>
> [Feifei] When direct rearm is enabled, there are two paths for the
> thread to choose from. If there are enough freed Tx buffers, Rx can
> take buffers from Tx. Otherwise, Rx will take buffers from the mempool
> as usual. Thus, users do not need to pay much attention to managing
> ports.

What I am talking about: right now different ports, or different queues
of the same port, can be treated as independent entities: in general the
user is free to start/stop (and in some cases even reconfigure) one
entity without needing to stop another. I.e. the user can stop and
re-configure a Tx queue while continuing to receive packets on an Rx
queue. With direct re-arm enabled, I don't think that would be possible
any more: before stopping/reconfiguring a Tx queue, the user would have
to make sure that the corresponding Rx queue is not being used by the
datapath.
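
For example, with a 1:1 Rx:Tx mapping, a Tx queue reconfiguration would
presumably have to follow an ordering like the one below (a sketch only,
with made-up port/queue ids and setup parameters):

/* The Rx queue that recycles this Tx queue's mbufs has to be
 * quiesced before the Tx queue itself can be touched. */
rte_eth_dev_rx_queue_stop(rx_port, rx_queue);  /* stop the consumer first */
rte_eth_dev_tx_queue_stop(tx_port, tx_queue);  /* now safe to stop Tx */

/* ... reconfigure the Tx queue ... */
rte_eth_tx_queue_setup(tx_port, tx_queue, nb_txd, socket_id, &txconf);

rte_eth_dev_tx_queue_start(tx_port, tx_queue);
rte_eth_dev_rx_queue_start(rx_port, rx_queue);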
>> - very limited usage scenario - it will have a positive effect only
>>   when we have a fixed forwarding mapping: all (or nearly all) packets
>>   from the Rx queue are forwarded into the same Tx queue.
>
> [Feifei] Although the usage scenario is limited, it still has a wide
> range of applications, such as NICs with a single port.

Yes, there are NICs with one port, but there is no guarantee there won't
be several such NICs within the system.

> Furthermore, I think this is a tradeoff between performance and
> flexibility. Our goal is to achieve the best performance, and that
> means we have to give up some flexibility decisively. For example,
> 'FAST_FREE mode' removes most of the buffer checks (refcnt > 1,
> external buffer, chained buffer), chooses the shortest path, and thus
> achieves a significant performance improvement.

>> I wonder, did you have a chance to consider a mempool-cache ZC API,
>> similar to the one we have for the ring?
>> It would allow us on the Tx free path to avoid copying mbufs to a
>> temporary array on the stack.
>> Instead we could put them straight from the Tx SW ring into the
>> mempool cache. That should save an extra store/load per mbuf and
>> might help to achieve some performance gain without bypassing the
>> mempool.
>> It probably wouldn't be as fast as what you are proposing, but it
>> might be fast enough to consider as an alternative.
>> Again, it would be a generic solution, so we could avoid all these
>> implications and limitations.

> [Feifei] I think this is a good try. However, the most important
> question is whether we can bypass the mempool decisively to pursue
> significant performance gains.

I understand the intention, but I personally think this is a wrong and
dangerous attitude. We have the mempool abstraction in place for a very
good reason. So we need to try to improve mempool performance (and its
API, if necessary) in the first place, not to avoid it and break our own
rules and recommendations.

> For ZC, there may be a problem with it in i40e. The reason we put Tx
> buffers into a temporary array is that i40e_tx_entry includes both a
> buffer pointer and an index. Thus we cannot put a Tx SW ring entry
> into the mempool directly; we need to extract the mbuf pointer first.
> So even with ZC, we still can't avoid using a temporary stack to
> extract the Tx buffer pointers.

When talking about a ZC API for the mempool cache, I meant something
like:

void **
mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
	uint32_t *nb_elem, uint32_t flags);

void
mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
	uint32_t nb_elem);

i.e. _start_ will return to the user a pointer inside the mp-cache where
the free elements can be put, plus the max number of slots that can be
safely filled. _finish_ will update mc->len. As an example:

/* expect to free N mbufs */
uint32_t n = N;
void **p = mempool_cache_put_zc_start(mc, &n, ...);

/* free up to n elems */
for (i = 0; i != n; i++) {

	/* get next free mbuf from somewhere */
	mb = extract_and_prefree_mbuf(...);

	/* no more free mbufs for now */
	if (mb == NULL)
		break;

	p[i] = mb;
}

/* finalize ZC put, with _i_ freed elems */
mempool_cache_put_zc_finish(mc, i);

That way, I think we can overcome the issue with i40e_tx_entry that you
mentioned above. Plus it might be useful in other similar places.

Another alternative, obviously, is to split i40e_tx_entry into two
structs (one for the mbuf, the second for its metadata) and have a
separate array for each of them. Though with that approach we need to
make sure no perf drops are introduced, and probably more code changes
will be required.
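
For reference, a rough sketch of what that split could look like (the
current i40e_tx_entry layout is quoted from memory, and the new names
are made up, so treat this as an illustration only):

/* current layout: mbuf pointer and metadata interleaved in one array */
struct i40e_tx_entry {
	struct rte_mbuf *mbuf;
	uint16_t next_id;
	uint16_t last_id;
};

/* split layout: mbuf pointers get their own parallel array, so a
 * contiguous chunk of it can be handed to a ZC mempool put with
 * no extraction loop */
struct i40e_tx_entry_meta {
	uint16_t next_id;
	uint16_t last_id;
};

struct i40e_tx_queue {
	/* ... */
	struct rte_mbuf **sw_ring_mbuf;          /* mbuf pointers only */
	struct i40e_tx_entry_meta *sw_ring_meta; /* per-desc metadata */
	/* ... */
};

/* then the Tx free path (FAST_FREE case) becomes a single copy
 * straight into the mempool cache: */
n = nb_to_free;
p = mempool_cache_put_zc_start(mc, &n, ...);
memcpy(p, &txq->sw_ring_mbuf[first], n * sizeof(struct rte_mbuf *));
mempool_cache_put_zc_finish(mc, n);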