From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
To: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>,
	Feifei Wang <Feifei.Wang2@arm.com>,
	Yuying Zhang <Yuying.Zhang@intel.com>,
	Beilei Xing <beilei.xing@intel.com>,
	Ruifeng Wang <Ruifeng.Wang@arm.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, nd <nd@arm.com>, nd <nd@arm.com>
Subject: RE: [RFC PATCH v1] net/i40e: put mempool cache out of API
Date: Fri, 15 Jul 2022 21:52:10 +0000	[thread overview]
Message-ID: <DBAPR08MB5814D4B4BE1248B55692108A988B9@DBAPR08MB5814.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <ffb4265f-7c22-cfee-b812-0009618d6ad6@yandex.ru>

<snip>
> >>
> >>
> >>> Refer to "i40e_tx_free_bufs_avx512"; this patch puts the mempool cache
> >>> out of the API to free buffers directly. There are two changes compared
> >>> with the previous version:
> >>> 1. change txep from "i40e_entry" to "i40e_vec_entry"
> >>> 2. put cache out of "mempool_bulk" API to copy buffers into it
> >>> directly
> >>>
> >>> Performance Test with l3fwd neon path:
> >>> 		with this patch
> >>> n1sdp:		no performance change
> >>> ampere-altra:	+4.0%
> >>>
> >>
> >>
> >> Thanks for RFC, appreciate your effort.
> >> So, as I understand - bypassing mempool put/get itself gives about
> >> 7-10% speedup for RX/TX on ARM platforms, correct?
> > It is 7% to 14% on N1SDP, 14% to 17% on Ampere machines. The current
> > RFC needs to be converted to a patch (albeit with zero-copy mempool APIs)
> > as it applies to a more generic use case.
> >
> >>
> >> About the direct-rearm RX approach you propose:
> >> After further thought, it is probably possible to re-arrange it in a
> >> way that would help avoid the related negatives.
> > Thanks for the idea.
> >
> >> The basic idea as follows:
> >>
> >> 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> >>      Also make sw_ring de-coupled from RXQ itself, i.e:
> >>      when RXQ is stopped or even destroyed, related sw_ring may still
> >>      exist (probably ref-counter or RCU would be sufficient here).
> >>      All that means we need a common layout/API for rxq_sw_ring,
> >>      and PMDs that would like to support direct rearming will have to
> >>      use/obey it.
> > This would mean we will have an additional for loop to fill the descriptors
> > on the RX side.
> 
> Yes, RX will go through the available entries in the sw_ring and fill the actual HW descriptors.
> 
> > Maybe we could keep the queue when the RXQ is stopped and free it only
> > when the RXQ is destroyed, so we don't have to allocate a new one if the
> > RXQ is started again. This means the mbufs will stay in the sw_ring if the
> > RXQ is stopped but not destroyed (same as what you have mentioned below).
> 
> My thought here was: yes, the sw_ring can stay at queue stop (probably even
> at destroy) to avoid aborting the related TXQ operation.
Understood

> Though the actual HW descriptor ring will be de-initialized/destroyed as usual.
I do not understand what you mean by this.
I looked at the function 'i40e_dev_tx_queue_stop'. I see that it disables the queue by calling 'i40e_switch_tx_queue'. The rest of the functionality frees the mbufs in the sw_ring and resets the descriptors (and the associated indices) in host memory. I guess this is what you are referring to?

> That's why I think we need to decouple sw_ring from queue and it's HW ring.
I have not looked at other PMDs, but based on what I see in the i40e PMD, this is a synchronization problem between the control plane and the data plane. As long as the data plane is not accessing the TX queue while tx_queue_stop is called, we should be fine to access the descriptors as well. The synchronization requirement already applies to the tx_queue_stop API, and we can add the direct-rearm requirement under the same synchronization, i.e. when the data plane signals that it is not accessing the TX port, it has also stopped using the direct-rearm feature.

BTW, the rte_eth_dev_tx_queue_stop API does not call out the synchronization requirement and does not mention the need for the data plane to stop accessing the queue. I think it should be documented.

> 
> >
> >>
> >> 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> >>      at txq_free_bufs() try to store released mbufs inside attached
> >>      sw_ring directly. If there is no attached sw_ring, or not enough
> >>      free space in it - continue with mempool_put() as usual.
> >>      Note that actual arming of HW RXDs still remains responsibility
> > What problems do you see if we arm the HW RXDs (assuming that we do not
> have to support this across multiple devices)?
> 
> Same as in the vice-versa scenario:
> - The RXQ might already be stopped while the TXQ is still running.
This is again a control-plane/data-plane synchronization problem, i.e. if the RXQ is to be stopped (which requires synchronization), the data plane should stop using direct-rearm.

> - Also, it doesn't sound right from a design point of view to allow
>    TX code to manipulate the RX HW queue directly, and vice-versa.
Ok, there is no hard technical issue that stops us from doing this, correct?

> 
> >
> >>      of RX code-path:
> >>      rxq_rearm(rxq) {
> >>        ...
> >>        - check whether there are N already filled entries inside rxq_sw_ring.
> >>          if not, populate them from mempool (usual mempool_get()).
> >>        - arm related RXDs and mark these sw_ring entries as managed by HW.
> >>        ...
> >>      }
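To make the rearm flow above more concrete, a rough C sketch follows. All structure layouts and names here (rxq_sw_ring, nb_filled, the plain uint64_t descriptor ring) are hypothetical simplifications for illustration, not existing i40e code:

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical, simplified sw_ring layout -- not the real i40e structure. */
struct rxq_sw_ring {
	struct rte_mbuf **entries; /* mbuf pointers, power-of-two sized ring */
	uint16_t head;             /* next entry to arm for HW */
	uint16_t nb_filled;        /* entries already filled by the TX side */
	uint16_t mask;             /* ring size - 1 */
};

/*
 * Rearm 'n' RX descriptors. The HW descriptor ring is shown as plain
 * packet-address slots for brevity. Assumes 'n' divides the ring size,
 * so a bulk fill starting at 'head' never wraps (the real rearm
 * thresholds satisfy this).
 */
static inline void
rxq_rearm(struct rxq_sw_ring *sw, volatile uint64_t *hw_ring,
	  struct rte_mempool *mp, uint16_t n)
{
	uint16_t i, idx;

	if (sw->nb_filled < n) {
		/* Not enough entries pre-filled by the TX side: top up
		 * from the mempool (the usual path). */
		idx = (sw->head + sw->nb_filled) & sw->mask;
		if (rte_mempool_get_bulk(mp, (void **)&sw->entries[idx],
					 n - sw->nb_filled) != 0)
			return; /* no mbufs available right now */
		sw->nb_filled = n;
	}

	/* Arm the HW descriptors and hand these sw_ring entries to HW. */
	for (i = 0; i < n; i++) {
		idx = (sw->head + i) & sw->mask;
		hw_ring[idx] = rte_mbuf_data_iova_default(sw->entries[idx]);
	}
	sw->head = (sw->head + n) & sw->mask;
	sw->nb_filled -= n;
}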
> >>
> >>
> >> So rxq_sw_ring will serve two purposes:
> >> - track mbufs that are managed by HW (that what it does now)
> >> - private (per RXQ) mbuf cache
> >>
> >> Now, if TXQ is stopped while RXQ is running - no extra
> >> synchronization is required, RXQ would just use
> >> mempool_get() to rearm its sw_ring itself.
> > There would be some synchronization required which tells the data plane
> threads not to access the TXQ before the TXQ is stopped. Other than this, no
> extra synchronization is required.
> >
> > For the current patch, we could use a similar approach, i.e. when the TXQ is
> > stopped, it does not free the mbufs from the sw_ring; we just let the RXQ
> > consume all the mbufs from the TX side sw_ring till it is empty.
> >
> >>
> >> If RXQ is stopped while TXQ is still running - TXQ can still continue
> >> to populate related sw_ring till it gets full.
> >> Then it will continue with mempool_put() as usual.
> >> Of course it means that a user who wants to use this feature should
> >> probably account for some extra mbufs for such a case, or maybe
> >> rxq_sw_ring can have an enable/disable flag to mitigate such a situation.
> >>
> >> As another benefit, such an approach makes it possible to use several
> >> TXQs (even from different devices) to rearm the same RXQ.
> > Being able to use several TXQs is a practical use case. But I am not sure if
> > different devices in the same server are a practical scenario that we need
> > to address.
> >
> >>
> >> Have to say, that I am still not sure that 10% RX/TX improvement is
> >> worth bypassing mempool completely and introducing all this extra
> >> complexity in RX/TX path.
> > It is more than that; clarified above.
> >
> >> But, if we'll still decide to go ahead with direct-rearming, this
> >> re-arrangement, I think, should help to keep things clear and avoid
> >> introducing new limitations in existing functionality.
> >>
> >> WDYT?
> > I had another idea. This requires the application to change, which might be
> > OK given that it is a new feature/optimization.
> >
> > We could expose a new API to the application which allows RX side to take
> buffers from TX side. The application would call this additional API just before
> calling the eth_rx_burst API.
> >
> > The advantages I see with this approach is:
> > 1) The static mapping of the ports is not required. The mapping is dynamic.
> This allows for the application to make decisions based on current conditions in
> the data plane.
> > 2) Code is simple, because it does not have to check for many conditions. The
> application can make better decisions as it knows the scenario well. Not all
> applications have to take the hit of conditionals.
> > 3) The existing synchronization used by the applications (to stop RXQ/TXQ) is
> sufficient. No new synchronization required.
> 
> 
> That sounds like a very good idea to me.
> I think it will allow us to avoid all the shortcomings that are present in the
> original approach.
> Again, if we like, we can even allow the user to feed the RXQ from some other
> sources (e.g. dropped packets).
Good point

> Just to confirm that we are talking about the same thing:
> 
> 1. rte_ethdev will have an API for the user to get sw_ring handle from the RXQ,
> something like:
> struct rxq_sw_ring *
> rte_eth_rxq_get_sw_ring(uint16_t port_id, uint16_t rxq_id);
> 
> We probably can even keep 'struct rxq_sw_ring' as an internal one and use it
> as a handle.
Do we need to expose the SW ring to the application level? IMO, it would be good for the application to refer to port ID and queue ID.
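For illustration, a port/queue-ID based variant could have a prototype along these lines (hypothetical; not an existing rte_ethdev API):

/*
 * Hypothetical prototype -- not an existing rte_ethdev API.
 * Move mbufs freed on (tx_port, tx_queue) into the sw_ring of
 * (rx_port, rx_queue); returns the number of entries filled,
 * or a negative errno on invalid port/queue IDs.
 */
int rte_eth_rxq_fill(uint16_t rx_port, uint16_t rx_queue,
		     uint16_t tx_port, uint16_t tx_queue);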

> 
> 2. rte_ethdev will have an API that will ask TXQ to free mbufs and fill given
> rxq_sw_ring:
> int
> rte_eth_txq_fill_sw_ring(uint16_t port_id, uint16_t txq_id,
> 	struct rxq_sw_ring *sw_ring);
It would be good to fill only the descriptors (but not update the head and tail pointers), which keeps the code light for better performance.
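To illustrate the intended split (descriptors filled by the recycle call, the tail update deferred to the RX path), a minimal sketch with made-up helper names; the register write is simplified to a plain volatile store:

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_io.h>

/* Called from the hypothetical fill API: write descriptors only,
 * no tail register update here. */
static void
rxq_fill_descs(volatile uint64_t *hw_ring, struct rte_mbuf **mbufs,
	       uint16_t start, uint16_t mask, uint16_t n)
{
	for (uint16_t i = 0; i < n; i++)
		hw_ring[(start + i) & mask] =
			rte_mbuf_data_iova_default(mbufs[i]);
}

/* Called later from the RX burst/rearm path: one tail bump once enough
 * descriptors have been armed. */
static void
rxq_update_tail(volatile uint32_t *qrx_tail, uint16_t new_tail)
{
	rte_io_wmb(); /* make descriptor writes visible before the tail */
	*qrx_tail = new_tail;
}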

> 
> 
> So the app code will look something like:
> 
> rxq_sw_ring = rxq_get_sw_ring(rx_port, rxq);
> ...
> while (...) {
> 
> 	n = eth_tx_burst(tx_port, txq, pkts, N);
> 	...
> 	eth_txq_fill_sw_ring(tx_port, txq, rxq_sw_ring);
> 	n = eth_rx_burst(rx_port, rxq, pkts, N);
> 	...
> }
> 
> 
> Did I get your idea right?
Overall your understanding is correct, but the idea I had differs slightly (mainly keeping performance in mind and sticking to the experiments we have done).
1) eth_rxq_fill(rx_port, rxq, tx_port, txq);
This will take the buffers from the TXQ and fill the RXQ sw_ring as well as the hardware descriptors.

2) The loop would be as follows:
while (...) {
 
	eth_rxq_fill(rx_port, rxq, tx_port, txq);
	n = eth_rx_burst(rx_port, rxq, pkts, N);
 	...
	n = eth_tx_burst(tx_port, txq, pkts, N);
}
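
Putting this together, an application loop built around such a hypothetical eth_rxq_fill() could look roughly as follows (sketch only; rte_eth_rxq_fill() does not exist today, everything else is standard rte_ethdev API):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Hypothetical API from the discussion above -- not part of rte_ethdev. */
int rte_eth_rxq_fill(uint16_t rx_port, uint16_t rx_queue,
		     uint16_t tx_port, uint16_t tx_queue);

static void
fwd_loop(uint16_t rx_port, uint16_t rxq, uint16_t tx_port, uint16_t txq)
{
	struct rte_mbuf *pkts[BURST_SIZE];

	for (;;) {
		/* Recycle mbufs freed on the TX queue straight into the
		 * RX queue before receiving. */
		rte_eth_rxq_fill(rx_port, rxq, tx_port, txq);

		uint16_t nb_rx = rte_eth_rx_burst(rx_port, rxq, pkts,
						  BURST_SIZE);
		if (nb_rx == 0)
			continue;

		/* ... packet processing ... */

		uint16_t nb_tx = rte_eth_tx_burst(tx_port, txq, pkts, nb_rx);

		/* Free anything the TX queue could not accept. */
		for (uint16_t i = nb_tx; i < nb_rx; i++)
			rte_pktmbuf_free(pkts[i]);
	}
}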


> 
> >
> > The disadvantage is that it is another function pointer call, which will
> > reduce performance somewhat.
> >
> >>
> >> Konstantin

