DPDK patches and discussions
From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Stephen Hemminger" <stephen@networkplumber.org>
Cc: <dev@dpdk.org>
Subject: RE: RFC - Tap io_uring PMD
Date: Sat, 2 Nov 2024 23:28:49 +0100	[thread overview]
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F86A@smartserver.smartshare.dk> (raw)
In-Reply-To: <20241031173450.26cdb54c@hermes.local>

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, 1 November 2024 01.35
> 
> On Thu, 31 Oct 2024 11:27:25 +0100
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Wednesday, 30 October 2024 22.57
> > >
> > > The current tap device is slow, both due to architectural choices
> > > and the overhead of Linux system calls.
> >
> > Yes; but isn't it only being used for (low volume) management traffic?
> > Is the TAP PMD performance an issue for anyone? What is their use case?
> 
> 
> In embedded systems, if you want to use DPDK for the dataplane, you
> still need a control plane path to the kernel, and most of the hardware
> used does not support a bifurcated driver. Either that or have two NICs.

Yes, our application does that (not using a bifurcated driver); it can be configured for in-band management or a dedicated management port.

> 
> >
> > Or is the key issue that the TAP PMD makes system calls in the fast
> > path, so you are looking to implement a new TAP PMD that doesn't make
> > any system calls in the fast path?
> 
> Even control path performance matters. Think of a router with lots of
> BGP connections, or doing updates.

BGP is an excellent example where performance matters.
(Our applications currently don't support BGP, so I'm used to a relatively low volume of management traffic, and didn't think about BGP.)

> 
> >
> > > I am exploring how to fix that, but some of the choices require
> > > tradeoffs, which leads to some open questions:
> > >
> > > 1. DPDK tap also supports tunnel (TUN) mode, where there is no
> > >    Ethernet header, only L3. Does anyone actually use this? It is
> > >    different from what every other PMD expects.
> >
> > If used for high volume (data plane) traffic, I would assume standard
> > PMD behavior (i.e. incl. Ethernet headers) would suffice.

The traffic to/from the TAP port is likely to be exchanged with a physical port, so the packets will have an Ethernet header at this point.

> >
> > >
> > > 2. The fastest way to use the kernel TAP device would be to use
> > >    io_uring. But this was added in the 5.1 kernel (2019). Rather
> > >    than having a conditional or dual mode in the DPDK tap device,
> > >    perhaps there should just be a new PMD tap_uring?
> >
> > If the features differ significantly, I'm in favor of a new PMD.
> > And it would be an opportunity to get rid of useless cruft, which I
> > think you are already asking about here. :-)
> 
> Yes, and the TAP device was written to support a niche use case (all
> the flow stuff).
> Also, the TAP device has lots of extra code; at some point, doing
> bit-by-bit cleanup gets annoying.
> 
> >
> > Furthermore, a "clean sheet" implementation - adding all the
> > experience accumulated since the old TAP PMD - could serve as a
> > showcase of "best practices" for software PMDs.
> >
> > >
> > > 3. The current TAP device provides hooks for several rte_flow types
> > >    by playing games with the kernel qdisc. Does anyone really use
> > >    this? I propose just not doing this in the new tap_uring.
> > >
> > > 4. What other features of the TAP device beyond basic send/receive
> > >    make sense? It looks like the new device could support better
> > >    statistics.
> >
> > IMHO, statistics about missed packets are relevant. If the ingress
> > (kernel->DPDK) queue is full, and the kernel has to drop packets, this
> > drop counter should be exposed to the application through the PMD.
> 
> It may require some kernel side additions to extract that, but not out
> of scope.
> 
> >
> > I don't know if the existing TAP PMD supports it, but associating a
> > port/queue with a "network namespace" or VRF in the kernel could also
> > be relevant.
> 
> All network devices can be put in a network namespace; VRF in Linux is
> separate from netns: it has to do with which routing table is
> associated with the net device.
> 
> >
> > >
> > > 5. What about Rx interrupt support?
> >
> > RX interrupt support seems closely related to power management.
> > It could be used to reduce jitter/latency (and burstiness) when
> > someone on the network communicates with an in-band management
> > interface.
> 
> Not sure if io_uring has a wakeup mechanism, but probably epoll() is
> possible.

Yes, it seems to have one:
https://unixism.net/loti/tutorial/register_eventfd.html

> 
> >
> > >
> > > Probably the hardest part of using io_uring is figuring out how to
> > > collect completions. The simplest way would be to handle all
> > > completions, rx and tx, in the rx_burst function.
> >
> > Please don't mix RX and TX, unless explicitly requested by the
> > application through the recently introduced "mbuf recycle" feature.
> 
> The issue is that Rx and Tx share a single fd, and the io_uring
> completion queue is per fd. The implementation of io_uring came from
> the storage side, so initially it was for fixing the broken Linux AIO
> support.
> 
> Some other devices only have a single interrupt or ring shared between
> rx and tx, so this is not unique: virtio, netvsc, and some NICs.
> 
> The problem is that if Tx completes descriptors, then there needs to be
> locking to prevent the Rx thread and Tx thread overlapping. And a spin
> lock is a performance buzz kill.

Brainstorming a bit here...
What if the new TAP io_uring PMD is designed to use two io_urings per port, one for RX and another one for TX on the same TAP interface?
This requires that a TAP interface can be referenced via two file descriptors (one fd for the RX io_uring and another for the TX io_uring), e.g. by using dup() to create the additional file descriptor. I don't know whether this is possible, or whether it works with io_uring.



Thread overview: 14+ messages
2024-10-30 21:56 Stephen Hemminger
2024-10-31 10:27 ` Morten Brørup
2024-11-01  0:34   ` Stephen Hemminger
2024-11-02 22:28     ` Morten Brørup [this message]
2024-11-05 18:58       ` Stephen Hemminger
2024-11-05 23:22         ` Morten Brørup
2024-11-05 23:25           ` Stephen Hemminger
2024-11-05 23:54             ` Morten Brørup
2024-11-06  0:52               ` Igor Gutorov
2024-11-07 16:30                 ` Stephen Hemminger
2024-11-06 10:30           ` Konstantin Ananyev
2024-11-06  0:46 ` Varghese, Vipin
2024-11-06  7:46 ` Maxime Coquelin
2024-11-07 21:51   ` Morten Brørup
