From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: "Morten Brørup" <mb@smartsharesystems.com>,
"Stephen Hemminger" <stephen@networkplumber.org>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: RE: RFC - Tap io_uring PMD
Date: Wed, 6 Nov 2024 10:30:16 +0000 [thread overview]
Message-ID: <da1588e2e66244f298751dca3712368b@huawei.com> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F885@smartserver.smartshare.dk>
> > > > > > Probably the hardest part of using io_uring is figuring out
> > > > > > how to collect completions. The simplest way would be to
> > > > > > handle all completions, rx and tx, in the rx_burst function.
> > > > >
> > > > > Please don't mix RX and TX, unless explicitly requested by the
> > > > > application through the recently introduced "mbuf recycle"
> > > > > feature.
> > > >
> > > > The issue is that Rx and Tx share a single fd, and the io_uring
> > > > completion queue is per fd.
> > > > The io_uring implementation came from the storage side, so it was
> > > > initially aimed at fixing the broken Linux AIO support.
> > > >
> > > > Some other devices only have a single interrupt or a ring shared
> > > > between rx and tx, so this is not unique: virtio, netvsc, and
> > > > some NICs.
> > > >
> > > > The problem is that if Tx completes descriptors, there needs to
> > > > be locking to prevent the Rx thread and the Tx thread from
> > > > overlapping. And a spin lock is a performance buzz kill.
> > >
> > > Brainstorming a bit here...
> > > What if the new TAP io_uring PMD is designed to use two io_urings
> > > per port, one for RX and another one for TX on the same TAP
> > > interface?
> > > This requires that a TAP interface can be referenced via two file
> > > descriptors (one fd for the RX io_uring and another fd for the TX
> > > io_uring), e.g. by using dup() to create the additional file
> > > descriptor. I don't know if this is possible, or whether it works
> > > with io_uring.
> >
> > There are a couple of problems with multiple fds:
> > - Multiple fds pointing to the same internal tap queue are not going
> >   to get completed separately.
> > - When multi-process is supported, the limit of 253 fds in Unix
> >   domain socket IPC comes into play.
> > - Tap does not support a tx-only fd for queues. If an fd is a tap
> >   queue, receive fan-out will go to it.
> >
> > If DPDK were more flexible, harvesting of completions could be done
> > in another thread, but that is not general enough to work
> > transparently with all applications. The existing TAP device plays
> > with SIGIO, but signals are slower.
>
> I have now read up a bit about io_uring, so here are some thoughts and ideas...
>
> To avoid locking, there should be only one writer of io_uring Submission Queue Entries (SQEs), and only one reader of io_uring
> Completion Queue Entries (CQEs), per TAP interface.
>
> From what I understand, the TAP io_uring PMD only supports one RX queue per port and one TX queue per port (i.e. per TAP
> interface). We can take advantage of this:
>
> We can use rte_tx() as the Submission Queue writer and rte_rx() as the Completion Queue reader.
>
> The PMD must have two internal rte_rings for respectively RX refill and TX completion events.
>
> rte_rx() does the following:
> Read the Completion Queue;
> If RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an RX Refill SQE and enqueue it in the RX Refill rte_ring;
> If TX CQE, enqueue it in the TX Completion rte_ring;
> Repeat until nb_pkts RX CQEs have been received, or no more CQEs are available. (This complies with the rte_rx() API, which says
> that fewer than nb_pkts packets are returned only if no more packets are available for receiving.)
>
> rte_tx() does the following:
> Pass the data from the TX MBUFs to io_uring TX SQEs, using the TX CQEs in the TX Completion rte_ring, and write them to the io_uring
> Submission Queue.
> Dequeue any RX Refill SQEs from the RX Refill rte_ring and write them to the io_uring Submission Queue.
>
> This means that the application must call both rte_rx() and rte_tx(); but it would be allowed to call rte_tx() with zero MBUFs.
>
> The internal rte_rings are Single-Producer, Single-Consumer, and large enough to hold all TX+RX descriptors.
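To make this first design concrete, here is a minimal plain-C simulation of the control flow described above. None of it is DPDK or liburing API: cqe_t, spsc_ring, rxq_burst() and txq_burst() are hypothetical stand-ins for the io_uring CQ/SQ and the PMD's two internal SPSC rte_rings.

```c
#include <assert.h>

/* Hypothetical stand-ins: cqe_t models an io_uring CQE, spsc_ring models
 * both the io_uring queues and the PMD's internal SPSC rte_rings. */
enum { CQ_RX = 0, CQ_TX = 1 };
typedef struct { int kind; int idx; } cqe_t;

#define RING_SZ 64u
typedef struct { cqe_t e[RING_SZ]; unsigned head, tail; } spsc_ring;

int ring_enq(spsc_ring *r, cqe_t v)
{
    if (r->tail - r->head == RING_SZ)
        return -1;
    r->e[r->tail++ % RING_SZ] = v;
    return 0;
}

int ring_deq(spsc_ring *r, cqe_t *v)
{
    if (r->tail == r->head)
        return -1;
    *v = r->e[r->head++ % RING_SZ];
    return 0;
}

/* rte_rx(): the sole CQ reader. RX CQEs become refill requests parked on
 * the refill ring; TX CQEs are parked on the completion ring for rte_tx(). */
int rxq_burst(spsc_ring *cq, spsc_ring *refill, spsc_ring *txdone, int nb_pkts)
{
    cqe_t c;
    int rx = 0;

    while (rx < nb_pkts && ring_deq(cq, &c) == 0) {
        if (c.kind == CQ_RX) {
            ring_enq(refill, c);   /* becomes an RX refill SQE later */
            rx++;
        } else {
            ring_enq(txdone, c);   /* recycled by rte_tx() */
        }
    }
    return rx;
}

/* rte_tx(): the sole SQ writer. It recycles parked TX completions and
 * flushes the RX refill requests queued by rte_rx(). */
int txq_burst(spsc_ring *refill, spsc_ring *txdone, spsc_ring *sq, int nb_pkts)
{
    cqe_t c;
    int tx = 0;

    while (tx < nb_pkts && ring_deq(txdone, &c) == 0) {
        ring_enq(sq, c);           /* reuse completion slot for a new TX SQE */
        tx++;
    }
    while (ring_deq(refill, &c) == 0)
        ring_enq(sq, c);           /* flush RX refill SQEs */
    return tx;
}
```

Because rxq_burst() is the only CQ consumer and txq_burst() the only SQ producer, the two internal rings can stay single-producer/single-consumer and no lock is needed, which is the point of the design.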
>
>
> Alternatively, we can let rte_rx() do all the work and use an rte_ring in the opposite direction...
>
> The PMD must have two internal rte_rings, one for TX MBUFs and one for TX CQEs. (The latter can be a stack, or any other type of
> container.)
>
> rte_tx() only does the following:
> Enqueue the TX MBUFs to the TX MBUF rte_ring.
>
> rte_rx() does the following:
> Dequeue any TX MBUFs from the TX MBUF rte_ring, convert them to TX SQEs, using the TX CQEs in the TX Completion rte_ring, and
> write them to the io_uring Submission Queue.
> Read the Completion Queue;
> If TX CQE, enqueue it in the TX Completion rte_ring;
> If RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an RX Refill SQE and write it to the io_uring Submission Queue;
> Repeat until nb_pkts RX CQEs have been received, or no more CQEs are available. (This complies with the rte_rx() API, which says
> that fewer than nb_pkts packets are returned only if no more packets are available for receiving.)
>
> With the second design, the PMD can support multiple TX queues by using a Multi-Producer rte_ring for the TX MBUFs.
> But it postpones all transmits until rte_rx() is called, so I don't really like it.
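The second design can be sketched the same way. Again this is a plain-C simulation with hypothetical names (ev_t, fifo, tap_tx_burst(), tap_rx_burst()), not DPDK API: rte_tx() only parks mbufs, and rte_rx() does all the io_uring work.

```c
#include <assert.h>

/* Hypothetical stand-ins; ev_t models either a TX mbuf or a CQE. */
enum { EV_RX = 0, EV_TX = 1 };
typedef struct { int kind; int idx; } ev_t;

#define QSZ 64u
typedef struct { ev_t e[QSZ]; unsigned head, tail; } fifo;

int fifo_put(fifo *q, ev_t v)
{
    if (q->tail - q->head == QSZ)
        return -1;
    q->e[q->tail++ % QSZ] = v;
    return 0;
}

int fifo_get(fifo *q, ev_t *v)
{
    if (q->tail == q->head)
        return -1;
    *v = q->e[q->head++ % QSZ];
    return 0;
}

/* rte_tx(): only parks the TX mbufs. Making this ring multi-producer is
 * what would allow multiple TX queues. */
int tap_tx_burst(fifo *txmbuf, const ev_t *pkts, int n)
{
    int i;

    for (i = 0; i < n; i++)
        if (fifo_put(txmbuf, pkts[i]) != 0)
            break;
    return i;
}

/* rte_rx(): flushes parked TX mbufs into the SQ, then drains the shared
 * CQ, refilling RX slots inline and parking TX completions for reuse. */
int tap_rx_burst(fifo *cq, fifo *txmbuf, fifo *txdone, fifo *sq, int nb_pkts)
{
    ev_t c;
    int rx = 0;

    while (fifo_get(txmbuf, &c) == 0)
        fifo_put(sq, c);           /* postponed TX submissions */
    while (rx < nb_pkts && fifo_get(cq, &c) == 0) {
        if (c.kind == EV_RX) {
            fifo_put(sq, c);       /* inline RX refill SQE */
            rx++;
        } else {
            fifo_put(txdone, c);   /* TX CQE recycled on a later call */
        }
    }
    return rx;
}
```

The sketch makes the drawback visible: nothing reaches the submission queue until tap_rx_burst() runs.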
>
> Of the two designs, the first feels more natural to me.
> And if some application absolutely needs multiple TX queues, it can implement a Multi-Producer, Single-Consumer rte_ring as an
> intermediate step in front of the PMD's single TX queue.
And why can't we simply have two io_urings: one for RX ops and a second for TX ops?
Thread overview: 14+ messages
2024-10-30 21:56 Stephen Hemminger
2024-10-31 10:27 ` Morten Brørup
2024-11-01 0:34 ` Stephen Hemminger
2024-11-02 22:28 ` Morten Brørup
2024-11-05 18:58 ` Stephen Hemminger
2024-11-05 23:22 ` Morten Brørup
2024-11-05 23:25 ` Stephen Hemminger
2024-11-05 23:54 ` Morten Brørup
2024-11-06 0:52 ` Igor Gutorov
2024-11-07 16:30 ` Stephen Hemminger
2024-11-06 10:30 ` Konstantin Ananyev [this message]
2024-11-06 0:46 ` Varghese, Vipin
2024-11-06 7:46 ` Maxime Coquelin
2024-11-07 21:51 ` Morten Brørup