DPDK patches and discussions
* RFC - Tap io_uring PMD
@ 2024-10-30 21:56 Stephen Hemminger
  2024-10-31 10:27 ` Morten Brørup
  0 siblings, 1 reply; 3+ messages in thread
From: Stephen Hemminger @ 2024-10-30 21:56 UTC (permalink / raw)
  To: dev

The current tap device is slow both due to architectural choices and the
overhead of Linux system calls. I am exploring how to fix that, but some of
the choices require tradeoffs, which leads to some open questions:

1. DPDK tap also supports tunnel (TUN) mode, where there is no Ethernet header,
   only L3. Does anyone actually use this? It is different from what every other
   PMD expects.

2. The fastest way to use the kernel TAP device would be io_uring.
   But this was added in the 5.1 kernel (2019). Rather than having a conditional
   or dual mode in the DPDK tap device, perhaps there should just be a new PMD, tap_uring?

3. The current TAP device provides hooks for several rte_flow types by playing
   games with the kernel qdisc. Does anyone really use this? I propose just not
   doing this in the new tap_uring.

4. What other features of the TAP device beyond basic send/receive make sense?
   It looks like a new device could support better statistics.

5. What about Rx interrupt support?

Probably the hardest part of using io_uring is figuring out how to collect
completions. The simplest way would be to handle all completions, both Rx and
Tx, in the rx_burst function.
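
A minimal sketch of that approach, assuming liburing and purely illustrative
names (tap_uring_queue, TAP_CQE_TX_BIT, etc. are not an actual implementation),
could look like:

#include <stdint.h>
#include <liburing.h>
#include <rte_mbuf.h>

/* Illustrative per-queue state: one io_uring shared by Rx and Tx on the tap fd. */
struct tap_uring_queue {
	struct io_uring ring;
	struct rte_mempool *mp;
	int tap_fd;
};

/* Completion tag kept in the low bit of user_data: 0 = Rx read, 1 = Tx write. */
#define TAP_CQE_TX_BIT 0x1UL

static uint16_t
tap_uring_rx_burst(void *queue, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	struct tap_uring_queue *q = queue;
	struct io_uring_cqe *cqe;
	unsigned int head, done = 0;
	uint16_t nb_rx = 0;

	/* Drain every completion, Rx and Tx alike, from the single CQ. */
	io_uring_for_each_cqe(&q->ring, head, cqe) {
		uintptr_t data = (uintptr_t)io_uring_cqe_get_data(cqe);
		struct rte_mbuf *m = (struct rte_mbuf *)(data & ~TAP_CQE_TX_BIT);

		if (data & TAP_CQE_TX_BIT) {
			/* A Tx write finished: the mbuf can now be freed. */
			rte_pktmbuf_free(m);
		} else if (cqe->res > 0 && nb_rx < nb_pkts) {
			/* An Rx read finished: hand the filled mbuf to the app. */
			m->data_len = (uint16_t)cqe->res;
			m->pkt_len = (uint32_t)cqe->res;
			pkts[nb_rx++] = m;
		} else {
			/* Read error or no room in this burst: recycle the mbuf. */
			rte_pktmbuf_free(m);
		}
		done++;
	}
	io_uring_cq_advance(&q->ring, done);

	/* Refill: submit one new read SQE per consumed Rx buffer (omitted). */
	return nb_rx;
}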


* RE: RFC - Tap io_uring PMD
  2024-10-30 21:56 RFC - Tap io_uring PMD Stephen Hemminger
@ 2024-10-31 10:27 ` Morten Brørup
  2024-11-01  0:34   ` Stephen Hemminger
  0 siblings, 1 reply; 3+ messages in thread
From: Morten Brørup @ 2024-10-31 10:27 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 30 October 2024 22.57
> 
> The current tap device is slow both due to architectural choices and the
> overhead of Linux system calls.

Yes; but isn't it only being used for (low volume) management traffic?
Is the TAP PMD performance an issue for anyone? What is their use case?

Or is the key issue that the TAP PMD makes system calls in the fast path, so you are looking to implement a new TAP PMD that doesn't make any system calls in the fast path?

> I am exploring how to fix that, but some of the choices require
> tradeoffs, which leads to some open questions:
> 
> 1. DPDK tap also supports tunnel (TUN) mode, where there is no Ethernet
>    header, only L3. Does anyone actually use this? It is different from
>    what every other PMD expects.

If used for high volume (data plane) traffic, I would assume standard PMD behavior (i.e. incl. Ethernet headers) would suffice.

> 
> 2. The fastest way to use the kernel TAP device would be io_uring.
>    But this was added in the 5.1 kernel (2019). Rather than having a
>    conditional or dual mode in the DPDK tap device, perhaps there should
>    just be a new PMD, tap_uring?

If the features differ significantly, I'm in favor of a new PMD.
And it would be an opportunity to get rid of useless cruft, which I think you are already asking about here. :-)

Furthermore, a "clean sheet" implementation - applying all the experience accumulated since the old TAP PMD - could serve as a showcase of "best practices" for software PMDs.

> 
> 3. The current TAP device provides hooks for several rte_flow types by
>    playing games with the kernel qdisc. Does anyone really use this? I
>    propose just not doing this in the new tap_uring.
> 
> 4. What other features of the TAP device beyond basic send/receive make
>    sense? It looks like a new device could support better statistics.

IMHO, statistics about missed packets are relevant. If the ingress (kernel->DPDK) queue is full, and the kernel has to drop packets, this drop counter should be exposed to the application through the PMD.
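
A rough sketch of how a PMD could surface that, assuming the kernel's existing
per-device counters are sufficient (packets the kernel drops toward DPDK should
show up as the tap device's tx_dropped from the kernel's point of view) and
using purely illustrative names:

#include <stdio.h>
#include <limits.h>
#include <inttypes.h>
#include <rte_ethdev.h>

/* Illustrative helper: read one counter from /sys/class/net/<ifname>/statistics/. */
static uint64_t
tap_sysfs_stat(const char *ifname, const char *counter)
{
	char path[PATH_MAX];
	uint64_t value = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/net/%s/statistics/%s",
		 ifname, counter);
	f = fopen(path, "r");
	if (f != NULL) {
		if (fscanf(f, "%" SCNu64, &value) != 1)
			value = 0;
		fclose(f);
	}
	return value;
}

/* Hypothetical stats_get callback: report the kernel-side drops as imissed
 * toward the application (assumes the tap interface name matches the port name). */
static int
tap_uring_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
{
	const char *ifname = dev->data->name;

	stats->imissed = tap_sysfs_stat(ifname, "tx_dropped");
	/* ipackets/opackets/bytes would come from the PMD's own counters. */
	return 0;
}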

I don't know if the existing TAP PMD supports it, but associating a port/queue with a "network namespace" or VRF in the kernel could also be relevant.

> 
> 5. What about Rx interrupt support?

RX interrupt support seems closely related to power management.
It could be used to reduce jitter/latency (and burstiness) when someone on the network communicates with an in-band management interface.

> 
> Probably the hardest part of using io_uring is figuring out how to
> collect completions. The simplest way would be to handle all completions,
> both Rx and Tx, in the rx_burst function.

Please don't mix RX and TX, unless explicitly requested by the application through the recently introduced "mbuf recycle" feature.

<side tracking>
Currently, rte_rx() does two jobs:
* Deliver packets received from the HW to the application.
* Replenish RX descriptors.

Similarly, rte_tx() does two jobs:
* Deliver packets to be transmitted from the application to the HW.
* Release completed TX descriptors.

It would complicate things, but these two associated jobs could be split into separate functions, rx_pre_rx() for RX replenishment and tx_post_tx() for TX completion.
This would also give latency sensitive applications more control over when to do what.
And it could introduce a TX completion interrupt.
</side tracking>
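
As a rough illustration of that split (the function names are entirely
hypothetical, nothing like them exists in ethdev today, and process_pkts() is
just a stand-in for the application):

#include <rte_ethdev.h>

/* Hypothetical prototypes following the names suggested above; rx_burst and
 * tx_burst would then only move packets, while these two do the housekeeping
 * halves of the job. */
uint16_t rte_eth_rx_pre_rx(uint16_t port_id, uint16_t queue_id,
			   uint16_t nb_desc);   /* replenish Rx descriptors */
uint16_t rte_eth_tx_post_tx(uint16_t port_id, uint16_t queue_id,
			    uint16_t nb_desc);  /* release completed Tx descriptors */

#define BURST 32

extern uint16_t process_pkts(uint16_t port, uint16_t queue,
			     struct rte_mbuf **pkts, uint16_t nb);

/* A latency-sensitive poll loop could then push the housekeeping into the
 * idle gaps between bursts. */
void
lcore_poll_loop(uint16_t port, uint16_t queue)
{
	struct rte_mbuf *pkts[BURST];

	for (;;) {
		uint16_t nb = rte_eth_rx_burst(port, queue, pkts, BURST);

		if (nb > 0) {
			process_pkts(port, queue, pkts, nb);
		} else {
			/* Idle: replenish Rx and reap Tx now, off the hot path. */
			rte_eth_rx_pre_rx(port, queue, BURST);
			rte_eth_tx_post_tx(port, queue, BURST);
		}
	}
}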

Why does this PMD need to handle TX completions differently than other PMDs?



* Re: RFC - Tap io_uring PMD
  2024-10-31 10:27 ` Morten Brørup
@ 2024-11-01  0:34   ` Stephen Hemminger
  0 siblings, 0 replies; 3+ messages in thread
From: Stephen Hemminger @ 2024-11-01  0:34 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

On Thu, 31 Oct 2024 11:27:25 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Wednesday, 30 October 2024 22.57
> > 
> > The current tap device is slow both due to architectural choices and the
> > overhead of Linux system calls.
> 
> Yes; but isn't it only being used for (low volume) management traffic?
> Is the TAP PMD performance an issue for anyone? What is their use case?


In embedded systems, if you want to use DPDK for the dataplane, you still need
to have a control plane path to the kernel. And most of the hardware used does
not support a bifurcated driver. Either that, or you have two NICs.

> 
> Or is the key issue that the TAP PMD makes system calls in the fast path, so you are looking to implement a new TAP PMD that doesn't make any system calls in the fast path?

Even the control path performance matters. Think of a router with lots of BGP
connections, or one doing routing updates.

> 
> > I am exploring how to fix that, but some of the choices require
> > tradeoffs, which leads to some open questions:
> > 
> > 1. DPDK tap also supports tunnel (TUN) mode, where there is no Ethernet
> >    header, only L3. Does anyone actually use this? It is different from
> >    what every other PMD expects.
> 
> If used for high volume (data plane) traffic, I would assume standard PMD behavior (i.e. incl. Ethernet headers) would suffice.
> 
> > 
> > 2. The fastest way to use the kernel TAP device would be io_uring.
> >    But this was added in the 5.1 kernel (2019). Rather than having a
> >    conditional or dual mode in the DPDK tap device, perhaps there should
> >    just be a new PMD, tap_uring?
> 
> If the features differ significantly, I'm in favor of a new PMD.
> And it would be an opportunity to get rid of useless cruft, which I think you are already asking about here. :-)

Yes, and the TAP device was written to support a niche use case (all the flow stuff).
Also, the TAP device has lots of extra code; at some point, doing bit-by-bit cleanup gets annoying.

> 
> Furthermore, a "clean sheet" implementation - applying all the experience accumulated since the old TAP PMD - could serve as a showcase of "best practices" for software PMDs.
> 
> > 
> > 3. The current TAP device provides hooks for several rte_flow types by
> >    playing games with the kernel qdisc. Does anyone really use this? I
> >    propose just not doing this in the new tap_uring.
> > 
> > 4. What other features of the TAP device beyond basic send/receive make
> >    sense? It looks like a new device could support better statistics.
> 
> IMHO, statistics about missed packets are relevant. If the ingress (kernel->DPDK) queue is full, and the kernel has to drop packets, this drop counter should be exposed to the application through the PMD.

It may require some kernel-side additions to extract that, but it is not out of scope.

> 
> I don't know if the existing TAP PMD supports it, but associating a port/queue with a "network namespace" or VRF in the kernel could also be relevant.

All network devices can be put in a network namespace; VRF in Linux is separate
from netns, it has to do with which routing table is associated with the net device.

> 
> > 
> > 5. What about Rx interrupt support?  
> 
> RX interrupt support seems closely related to power management.
> It could be used to reduce jitter/latency (and burstiness) when someone on the network communicates with an in-band management interface.

Not sure if io_uring has a wakeup mechanism, but epoll() is probably possible.
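
One option might be liburing's io_uring_register_eventfd(): the kernel then
signals an eventfd for each completion, and that fd can be put into an epoll
set (or handed to the ethdev Rx-interrupt machinery). A minimal sketch, with
illustrative names:

#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/epoll.h>
#include <liburing.h>

static int
tap_uring_setup_wakeup(struct io_uring *ring, int epfd)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int efd;

	efd = eventfd(0, EFD_NONBLOCK);
	if (efd < 0)
		return -1;

	/* Kernel signals this eventfd whenever a CQE is posted to the ring. */
	if (io_uring_register_eventfd(ring, efd) < 0) {
		close(efd);
		return -1;
	}

	ev.data.fd = efd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) < 0) {
		close(efd);
		return -1;
	}
	return efd;
}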

> 
> > 
> > Probably the hardest part of using io_uring is figuring out how to
> > collect completions. The simplest way would be to handle all
> > completions, both Rx and Tx, in the rx_burst function.
> 
> Please don't mix RX and TX, unless explicitly requested by the application through the recently introduced "mbuf recycle" feature.

The issue is that Rx and Tx share a single fd, and the io_uring completion queue
is per fd. The io_uring implementation came from the storage side, so initially
it was aimed at fixing the broken Linux AIO support.

Some other devices only have a single interrupt or a ring shared between Rx and
Tx, so this is not unique: virtio, netvsc, and some NICs.

The problem is that if Tx completes descriptors, then there needs to be locking
to prevent the Rx thread and Tx thread from overlapping. And a spinlock is a
performance buzz kill.




