DPDK patches and discussions
* RFC - Tap io_uring PMD
@ 2024-10-30 21:56 Stephen Hemminger
  2024-10-31 10:27 ` Morten Brørup
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Stephen Hemminger @ 2024-10-30 21:56 UTC (permalink / raw)
  To: dev

The current tap device is slow, both due to architectural choices and the
overhead of Linux system calls. I am exploring how to fix that, but some
of the choices require tradeoffs, which leads to some open questions:

1. DPDK tap also supports tunnel (TUN) mode, where there is no Ethernet header,
   only L3. Does anyone actually use this? It is different from what every other
   PMD expects.

2. The fastest way to use the kernel TAP device would be io_uring.
   But that was added in the 5.1 kernel (2019). Rather than having a conditional or
   dual mode in the DPDK tap device, perhaps there should just be a new PMD, tap_uring?

3. The current TAP device provides hooks for several rte_flow types by playing
   games with the kernel qdisc. Does anyone really use this? I propose not doing
   this in the new tap_uring.

4. What other features of the TAP device beyond basic send/receive make sense?
   It looks like a new device could support better statistics.

5. What about Rx interrupt support?

Probably the hardest part of using io_uring is figuring out how to collect
completions. The simplest way would be to handle all completions, Rx and Tx,
in the rx_burst function.
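
Roughly, it could look like this (a minimal sketch using liburing; the tap_queue
layout and the tap_complete_rx()/tap_complete_tx() helpers are made-up names,
not existing code):

#include <stdint.h>
#include <liburing.h>
#include <rte_mbuf.h>

#define TAP_IOU_RX 1    /* user_data low bits: completed read */
#define TAP_IOU_TX 2    /* user_data low bits: completed write */

struct tap_queue {
    struct io_uring ring;
    /* mbuf pool, tap fd, buffer slots, ... */
};

/* illustrative helpers: hand a finished read to the application (and queue a
 * refill read SQE), or release the mbuf of a finished write */
struct rte_mbuf *tap_complete_rx(struct tap_queue *q, struct io_uring_cqe *cqe, uintptr_t slot);
void tap_complete_tx(struct tap_queue *q, struct io_uring_cqe *cqe, uintptr_t slot);

static uint16_t
tap_uring_rx_burst(struct tap_queue *q, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    struct io_uring_cqe *cqe;
    uint16_t nb_rx = 0;
    unsigned int done = 0;

    /* drain the single completion queue; Rx and Tx completions both land here */
    while (nb_rx < nb_pkts && io_uring_peek_cqe(&q->ring, &cqe) == 0) {
        uintptr_t tag = (uintptr_t)io_uring_cqe_get_data(cqe);

        if ((tag & 3) == TAP_IOU_RX)
            pkts[nb_rx++] = tap_complete_rx(q, cqe, tag & ~(uintptr_t)3);
        else
            tap_complete_tx(q, cqe, tag & ~(uintptr_t)3);

        io_uring_cqe_seen(&q->ring, cqe);
        done++;
    }
    if (done > 0)
        io_uring_submit(&q->ring);    /* push any refill read SQEs queued above */
    return nb_rx;
}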

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: RFC - Tap io_uring PMD
  2024-10-30 21:56 RFC - Tap io_uring PMD Stephen Hemminger
@ 2024-10-31 10:27 ` Morten Brørup
  2024-11-01  0:34   ` Stephen Hemminger
  2024-11-06  0:46 ` Varghese, Vipin
  2024-11-06  7:46 ` Maxime Coquelin
  2 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2024-10-31 10:27 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 30 October 2024 22.57
> 
> The current tap device is slow both due to architectural choices and
> the
> overhead of Linux system calls.

Yes; but isn't it only being used for (low volume) management traffic?
Is the TAP PMD performance an issue for anyone? What is their use case?

Or is the key issue that the TAP PMD makes system calls in the fast path, so you are looking to implement a new TAP PMD that doesn't make any system calls in the fast path?

> I am exploring a how to fix that but
> some
> of the choices require some tradeoffs. Which leads to some open
> questions:
> 
> 1. DPDK tap also support tunnel (TUN) mode where there is no Ethernet
> header
>    only L3. Does anyone actually use this? It is different than what
> every other
>    PMD expects.

If used for high volume (data plane) traffic, I would assume standard PMD behavior (i.e. incl. Ethernet headers) would suffice.

> 
> 2. The fastest way to use kernel TAP device would be to use io_uring.
>    But this was added in 5.1 kernel (2019). Rather than having
> conditional or
>    dual mode in DPDK tap device, perhaps there should just be a new PMD
> tap_uring?

If the features differ significantly, I'm in favor of a new PMD.
And it would be an opportunity to get rid of useless cruft, which I think you are already asking about here. :-)

Furthermore, a "clean sheet" implementation - adding all the experience accumulated since the old TAP PMD - could serve as showcase for "best practices" for software PMDs.

> 
> 3. Current TAP device provides hooks for several rte_flow types by
> playing
>    games with kernel qdisc. Does anyone really use this? Propose just
> not doing
>    this in new tap_uring.
> 
> 4. What other features of TAP device beyond basic send/receive make
> sense?
>    It looks like new device could support better statistics.

IMHO, statistics about missed packets are relevant. If the ingress (kernel->DPDK) queue is full, and the kernel has to drop packets, this drop counter should be exposed to the application through the PMD.

I don't know if the existing TAP PMD supports it, but associating a port/queue with a "network namespace" or VRF in the kernel could also be relevant.

> 
> 5. What about Rx interrupt support?

RX interrupt support seems closely related to power management.
It could be used to reduce jitter/latency (and burstiness) when someone on the network communicates with an in-band management interface.

> 
> Probably the hardest part of using io_uring is figuring out how to
> collect
> completions. The simplest way would be to handle all completions rx and
> tx
> in the rx_burst function.

Please don't mix RX and TX, unless explicitly requested by the application through the recently introduced "mbuf recycle" feature.

<side tracking>
Currently, rte_rx() does two jobs:
* Deliver packets received from the HW to the application.
* Replenish RX descriptors.

Similarly, rte_tx() does two jobs:
* Deliver packets to be transmitted from the application to the HW.
* Release completed TX descriptors.

It would complicate things, but these two associated jobs could be separated into separate functions, rx_pre_rx() for RX replenishment and tx_post_tx() for TX completion.
This would also give latency sensitive applications more control over when to do what.
And it could introduce a TX completion interrupt.
</side tracking>
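
For illustration, hypothetical prototypes for such a split could look like this
(the names follow the ones above; none of this exists in DPDK today):

#include <stdint.h>

/* RX replenishment: refill up to nb RX descriptors of (port, queue) */
uint16_t rx_pre_rx(uint16_t port_id, uint16_t queue_id, uint16_t nb);

/* TX completion: release up to nb completed TX descriptors of (port, queue) */
uint16_t tx_post_tx(uint16_t port_id, uint16_t queue_id, uint16_t nb);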

Why does this PMD need to handle TX completions differently than other PMDs?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC - Tap io_uring PMD
  2024-10-31 10:27 ` Morten Brørup
@ 2024-11-01  0:34   ` Stephen Hemminger
  2024-11-02 22:28     ` Morten Brørup
  0 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2024-11-01  0:34 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

On Thu, 31 Oct 2024 11:27:25 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Wednesday, 30 October 2024 22.57
> > 
> > The current tap device is slow both due to architectural choices and
> > the
> > overhead of Linux system calls.  
> 
> Yes; but isn't it only being used for (low volume) management traffic?
> Is the TAP PMD performance an issue for anyone? What is their use case?


In embedded systems, if you want to use DPDK for the dataplane, you still need
a control plane path to the kernel, and most of the hardware used
does not support a bifurcated driver. Either that or you need two NICs.

> 
> Or is the key issue that the TAP PMD makes system calls in the fast path, so you are looking to implement a new TAP PMD that doesn't make any system calls in the fast path?

Even control path performance matters. Think of a router with lots of BGP
connections, or one doing updates.

> 
> > I am exploring a how to fix that but
> > some
> > of the choices require some tradeoffs. Which leads to some open
> > questions:
> > 
> > 1. DPDK tap also support tunnel (TUN) mode where there is no Ethernet
> > header
> >    only L3. Does anyone actually use this? It is different than what
> > every other
> >    PMD expects.  
> 
> If used for high volume (data plane) traffic, I would assume standard PMD behavior (i.e. incl. Ethernet headers) would suffice.
> 
> > 
> > 2. The fastest way to use kernel TAP device would be to use io_uring.
> >    But this was added in 5.1 kernel (2019). Rather than having
> > conditional or
> >    dual mode in DPDK tap device, perhaps there should just be a new PMD
> > tap_uring?  
> 
> If the features differ significantly, I'm in favor of a new PMD.
> And it would be an opportunity to get rid of useless cruft, which I think you are already asking about here. :-)

Yes, and the TAP device was written to support a niche use case (all the flow stuff).
Also, the TAP device has lots of extra code; at some point, doing bit-by-bit cleanup gets annoying.

> 
> Furthermore, a "clean sheet" implementation - adding all the experience accumulated since the old TAP PMD - could serve as showcase for "best practices" for software PMDs.
> 
> > 
> > 3. Current TAP device provides hooks for several rte_flow types by
> > playing
> >    games with kernel qdisc. Does anyone really use this? Propose just
> > not doing
> >    this in new tap_uring.
> > 
> > 4. What other features of TAP device beyond basic send/receive make
> > sense?
> >    It looks like new device could support better statistics.  
> 
> IMHO, statistics about missed packets are relevant. If the ingress (kernel->DPDK) queue is full, and the kernel has to drop packets, this drop counter should be exposed to the application through the PMD.

It may require some kernel-side additions to extract that, but it is not out of scope.
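
As a stopgap sketch (assuming the kernel accounts queue-full drops on the tap
netdev's standard counters, e.g. tx_dropped since "tx" from the kernel's point
of view is toward userspace; that assumption needs checking per kernel), the
sysfs statistics could be read like this:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static uint64_t
tap_sysfs_stat(const char *ifname, const char *counter)
{
    char path[128];
    uint64_t value = 0;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/class/net/%s/statistics/%s", ifname, counter);
    f = fopen(path, "r");
    if (f == NULL)
        return 0;
    if (fscanf(f, "%" SCNu64, &value) != 1)
        value = 0;
    fclose(f);
    return value;
}

/* e.g. rte_eth_stats.imissed could be fed from tap_sysfs_stat(name, "tx_dropped") */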

> 
> I don't know if the existing TAP PMD supports it, but associating a port/queue with a "network namespace" or VRF in the kernel could also be relevant.

All network devices can be put in a network namespace; VRF in Linux is separate from netns: it has to do with
which routing table is associated with the net device.

> 
> > 
> > 5. What about Rx interrupt support?  
> 
> RX interrupt support seems closely related to power management.
> It could be used to reduce jitter/latency (and burstiness) when someone on the network communicates with an in-band management interface.

Not sure if io_uring has a wakeup mechanism, but epoll() is probably possible.

> 
> > 
> > Probably the hardest part of using io_uring is figuring out how to
> > collect
> > completions. The simplest way would be to handle all completions rx and
> > tx
> > in the rx_burst function.  
> 
> Please don't mix RX and TX, unless explicitly requested by the application through the recently introduced "mbuf recycle" feature.

The issue is that Rx and Tx share a single fd, and io_uring completion is per fd.
The io_uring implementation came from the storage side, so initially it was for fixing
the broken Linux AIO support.

Some other devices also have only a single interrupt or a ring shared between Rx and Tx,
so this is not unique: virtio, netvsc, and some NICs.

The problem is that if Tx completes descriptors, then there needs to be locking
to prevent the Rx and Tx threads from overlapping. And a spin lock is a performance buzz kill.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: RFC - Tap io_uring PMD
  2024-11-01  0:34   ` Stephen Hemminger
@ 2024-11-02 22:28     ` Morten Brørup
  2024-11-05 18:58       ` Stephen Hemminger
  0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2024-11-02 22:28 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, 1 November 2024 01.35
> 
> On Thu, 31 Oct 2024 11:27:25 +0100
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Wednesday, 30 October 2024 22.57
> > >
> > > The current tap device is slow both due to architectural choices
> and
> > > the
> > > overhead of Linux system calls.
> >
> > Yes; but isn't it only being used for (low volume) management
> traffic?
> > Is the TAP PMD performance an issue for anyone? What is their use
> case?
> 
> 
> In embedded systems, if you want to use DPDK for dataplane, you still
> need
> to have a control plane path to the kernel. And most of the hardware
> used
> does not support a bifurcated driver. Either that or have two NIC's.

Yes, our application does that (not using a bifurcated driver); it can be configured for in-band management or a dedicated management port.

> 
> >
> > Or is the key issue that the TAP PMD makes system calls in the fast
> path, so you are looking to implement a new TAP PMD that doesn't make
> any system calls in the fast path?
> 
> Even the control path performance matters. Think of a router with lots
> BGP
> connections, or doing updates.

BGP is an excellent example where performance matters.
(Our applications currently don't support BGP, so I'm used to a relatively low volume of management traffic, and didn't think about BGP.)

> 
> >
> > > I am exploring a how to fix that but
> > > some
> > > of the choices require some tradeoffs. Which leads to some open
> > > questions:
> > >
> > > 1. DPDK tap also support tunnel (TUN) mode where there is no
> Ethernet
> > > header
> > >    only L3. Does anyone actually use this? It is different than
> what
> > > every other
> > >    PMD expects.
> >
> > If used for high volume (data plane) traffic, I would assume standard
> PMD behavior (i.e. incl. Ethernet headers) would suffice.

The traffic to/from the TAP port is likely to be exchanged with a physical port, so the packets will have an Ethernet header at this point.

> >
> > >
> > > 2. The fastest way to use kernel TAP device would be to use
> io_uring.
> > >    But this was added in 5.1 kernel (2019). Rather than having
> > > conditional or
> > >    dual mode in DPDK tap device, perhaps there should just be a new
> PMD
> > > tap_uring?
> >
> > If the features differ significantly, I'm in favor of a new PMD.
> > And it would be an opportunity to get rid of useless cruft, which I
> think you are already asking about here. :-)
> 
> Yes, and the TAP device was written to support a niche use case (all
> the flow stuff).
> Also TAP device has lots of extra code, at some point doing bit-by-bit
> cleanup gets annoying.
> 
> >
> > Furthermore, a "clean sheet" implementation - adding all the
> experience accumulated since the old TAP PMD - could serve as showcase
> for "best practices" for software PMDs.
> >
> > >
> > > 3. Current TAP device provides hooks for several rte_flow types by
> > > playing
> > >    games with kernel qdisc. Does anyone really use this? Propose
> just
> > > not doing
> > >    this in new tap_uring.
> > >
> > > 4. What other features of TAP device beyond basic send/receive make
> > > sense?
> > >    It looks like new device could support better statistics.
> >
> > IMHO, statistics about missed packets are relevant. If the ingress
> (kernel->DPDK) queue is full, and the kernel has to drop packets, this
> drop counter should be exposed to the application through the PMD.
> 
> It may require some kernel side additions to extract that, but not out
> of scope.
> 
> >
> > I don't know if the existing TAP PMD supports it, but associating a
> port/queue with a "network namespace" or VRF in the kernel could also
> be relevant.
> 
> All network devices can be put in network namespace; VRF in Linux is
> separate from netns it has to do with
> which routing table is associated with the net device.
> 
> >
> > >
> > > 5. What about Rx interrupt support?
> >
> > RX interrupt support seems closely related to power management.
> > It could be used to reduce jitter/latency (and burstiness) when
> someone on the network communicates with an in-band management
> interface.
> 
> Not sure if ioring has wakeup mechanism, but probably epoll() is
> possible.

Yes, it seems to be:
https://unixism.net/loti/tutorial/register_eventfd.html
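
A small sketch of how that could be wired up with a registered eventfd
(assuming liburing; error handling trimmed):

#include <unistd.h>
#include <sys/eventfd.h>
#include <liburing.h>

static int
tap_uring_intr_fd(struct io_uring *ring)
{
    int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);

    if (efd < 0)
        return -1;
    if (io_uring_register_eventfd(ring, efd) < 0) {
        close(efd);
        return -1;
    }
    /* the caller can now wait on efd via epoll (e.g. through the PMD's
     * Rx interrupt handle); a read() of the 8-byte counter clears it */
    return efd;
}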

> 
> >
> > >
> > > Probably the hardest part of using io_uring is figuring out how to
> > > collect
> > > completions. The simplest way would be to handle all completions rx
> and
> > > tx
> > > in the rx_burst function.
> >
> > Please don't mix RX and TX, unless explicitly requested by the
> application through the recently introduced "mbuf recycle" feature.
> 
> The issue is Rx and Tx share a single fd and ioring for completion is
> per fd.
> The implementation for ioring came from the storage side so initially
> it was for fixing
> the broken Linux AIO support.
> 
> Some other devices only have single interrupt or ring shared with rx/tx
> so not unique.
> Virtio, netvsc, and some NIC's.
> 
> The problem is that if Tx completes descriptors then there needs to be
> locking
> to prevent Rx thread and Tx thread overlapping. And a spin lock is a
> performance buzz kill.

Brainstorming a bit here...
What if the new TAP io_uring PMD is designed to use two io_urings per port, one for RX and another one for TX on the same TAP interface?
This requires that a TAP interface can be referenced via two file descriptors (one fd for the RX io_uring and another fd for the TX io_uring), e.g. by using dup() to create the additional file descriptor. I don't know if this is possible, or whether it works with io_uring.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC - Tap io_uring PMD
  2024-11-02 22:28     ` Morten Brørup
@ 2024-11-05 18:58       ` Stephen Hemminger
  2024-11-05 23:22         ` Morten Brørup
  0 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2024-11-05 18:58 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

On Sat, 2 Nov 2024 23:28:49 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:

> > > >
> > > > Probably the hardest part of using io_uring is figuring out how to
> > > > collect
> > > > completions. The simplest way would be to handle all completions rx  
> > and  
> > > > tx
> > > > in the rx_burst function.  
> > >
> > > Please don't mix RX and TX, unless explicitly requested by the  
> > application through the recently introduced "mbuf recycle" feature.
> > 
> > The issue is Rx and Tx share a single fd and ioring for completion is
> > per fd.
> > The implementation for ioring came from the storage side so initially
> > it was for fixing
> > the broken Linux AIO support.
> > 
> > Some other devices only have single interrupt or ring shared with rx/tx
> > so not unique.
> > Virtio, netvsc, and some NIC's.
> > 
> > The problem is that if Tx completes descriptors then there needs to be
> > locking
> > to prevent Rx thread and Tx thread overlapping. And a spin lock is a
> > performance buzz kill.  
> 
> Brainstorming a bit here...
> What if the new TAP io_uring PMD is designed to use two io_urings per port, one for RX and another one for TX on the same TAP interface?
> This requires that a TAP interface can be referenced via two file descriptors (one fd for the RX io_uring and another fd for the TX io_uring), e.g. by using dup() to create the additional file descriptor. I don't know if this is possible, and if it works with io_uring.

There are a couple of problems with multiple fds:
  - multiple fds pointing to the same internal tap queue are not going to get completed separately.
  - when multi-process is supported, the limit of 253 fds in Unix domain socket IPC comes into play.
  - tap does not support a Tx-only fd for queues. If an fd is a tap queue, receive fanout will go to it.

If DPDK were more flexible, harvesting of completions could be done via another thread, but that is not
general enough to work transparently with all applications. The existing TAP device plays with SIGIO,
but signals are slower.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: RFC - Tap io_uring PMD
  2024-11-05 18:58       ` Stephen Hemminger
@ 2024-11-05 23:22         ` Morten Brørup
  2024-11-05 23:25           ` Stephen Hemminger
  2024-11-06 10:30           ` Konstantin Ananyev
  0 siblings, 2 replies; 15+ messages in thread
From: Morten Brørup @ 2024-11-05 23:22 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, 5 November 2024 19.59
> 
> On Sat, 2 Nov 2024 23:28:49 +0100
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > > >
> > > > > Probably the hardest part of using io_uring is figuring out how
> to
> > > > > collect
> > > > > completions. The simplest way would be to handle all
> completions rx
> > > and
> > > > > tx
> > > > > in the rx_burst function.
> > > >
> > > > Please don't mix RX and TX, unless explicitly requested by the
> > > application through the recently introduced "mbuf recycle" feature.
> > >
> > > The issue is Rx and Tx share a single fd and ioring for completion
> is
> > > per fd.
> > > The implementation for ioring came from the storage side so
> initially
> > > it was for fixing
> > > the broken Linux AIO support.
> > >
> > > Some other devices only have single interrupt or ring shared with
> rx/tx
> > > so not unique.
> > > Virtio, netvsc, and some NIC's.
> > >
> > > The problem is that if Tx completes descriptors then there needs to
> be
> > > locking
> > > to prevent Rx thread and Tx thread overlapping. And a spin lock is
> a
> > > performance buzz kill.
> >
> > Brainstorming a bit here...
> > What if the new TAP io_uring PMD is designed to use two io_urings per
> port, one for RX and another one for TX on the same TAP interface?
> > This requires that a TAP interface can be referenced via two file
> descriptors (one fd for the RX io_uring and another fd for the TX
> io_uring), e.g. by using dup() to create the additional file
> descriptor. I don't know if this is possible, and if it works with
> io_uring.
> 
> There a couple of problems with multiple fd's.
>   - multiple fds pointing to same internal tap queue are not going to
> get completed separately.
>   - when multi-proc is supported, limit of 253 fd's in Unix domain IPC
> comes into play
>   - tap does not support tx only fd for queues. If fd is queue of tap,
> receive fan out will go to it.
> 
> If DPDK was more flexible, harvesting of completion could be done via
> another thread but that is not general enough
> to work transparently with all applications.  Existing TAP device plays
> with SIGIO, but signals are slower.

I have now read up a bit about io_uring, so here are some thoughts and ideas...

To avoid locking, there should only be one writer of io_uring Submission Queue Entries (SQEs), and only one reader of io_uring Completion Queue Entries (CQEs) per TAP interface.

From what I understand, the TAP io_uring PMD only supports one RX queue per port and one TX queue per port (i.e. per TAP interface). We can take advantage of this:

We can use rte_tx() as the Submission Queue writer and rte_rx() as the Completion Queue reader.

The PMD must have two internal rte_rings, for RX refill and TX completion events respectively.

rte_rx() does the following:
- Read the Completion Queue;
- if it is an RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an RX Refill SQE and enqueue it in the RX Refill rte_ring;
- if it is a TX CQE, enqueue it in the TX Completion rte_ring;
- repeat until nb_pkts RX CQEs have been received, or no more CQEs are available. (This complies with the rte_rx() API, which says that fewer than nb_pkts are only returned if no more packets are available for receiving.)

rte_tx() does the following:
- Pass the data from the TX MBUFs to io_uring TX SQEs, using the TX CQEs in the TX Completion rte_ring, and write them to the io_uring Submission Queue;
- dequeue any RX Refill SQEs from the RX Refill rte_ring and write them to the io_uring Submission Queue.

This means that the application must call both rte_rx() and rte_tx(); but it would be allowed to call rte_tx() with zero MBUFs.

The internal rte_rings are Single-Producer, Single-Consumer, and large enough to hold all TX+RX descriptors.


Alternatively, we can let rte_rx() do all the work and use an rte_ring in the opposite direction...

The PMD must have two internal rte_rings, one for TX MBUFs and one for TX CQEs. (The latter can be a stack, or any other type of container.)

rte_tx() only does the following:
Enqueue the TX MBUFs to the TX MBUF rte_ring.

rte_rx() does the following:
- Dequeue any TX MBUFs from the TX MBUF rte_ring, convert them to TX SQEs, using the TX CQEs in the TX Completion rte_ring, and write them to the io_uring Submission Queue;
- read the Completion Queue;
- if it is a TX CQE, enqueue it in the TX Completion rte_ring;
- if it is an RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an RX Refill SQE and write it to the io_uring Submission Queue;
- repeat until nb_pkts RX CQEs have been received, or no more CQEs are available. (This complies with the rte_rx() API, which says that fewer than nb_pkts are only returned if no more packets are available for receiving.)

With the second design, the PMD can support multiple TX queues by using a Multi-Producer rte_ring for the TX MBUFs.
But it postpones all transmits until rte_rx() is called, so I don't really like it.

Of the two designs, the first feels more natural to me.
And if some application absolutely needs multiple TX queues, it can implement a Multi-Producer, Single-Consumer rte_ring as an intermediate step in front of the PMD's single TX queue.
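
A rough sketch of the first design (structure only; the tap_ctx_*() and
tap_prep_*() helpers are assumed, and error handling is omitted):

#include <stdint.h>
#include <liburing.h>
#include <rte_ring.h>
#include <rte_mbuf.h>

struct tap_uring_port {
    struct io_uring ring;
    struct rte_ring *rx_refill;   /* RX slots waiting for rte_tx() to resubmit */
    struct rte_ring *tx_compl;    /* TX slots free to carry new packets */
};

/* assumed helpers: classify a completion context, extract its mbuf,
 * and queue read/write SQEs for a slot */
int tap_ctx_is_rx(void *ctx);
struct rte_mbuf *tap_ctx_mbuf(void *ctx);
void tap_prep_read(struct tap_uring_port *p, void *ctx);
void tap_prep_write(struct tap_uring_port *p, void *ctx, struct rte_mbuf *m);

static uint16_t
tap_rx(struct tap_uring_port *p, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    struct io_uring_cqe *cqe;
    uint16_t nb_rx = 0;

    /* single CQ reader: dispatch completions, hand housekeeping to rte_tx() */
    while (nb_rx < nb_pkts && io_uring_peek_cqe(&p->ring, &cqe) == 0) {
        void *ctx = io_uring_cqe_get_data(cqe);

        if (tap_ctx_is_rx(ctx)) {
            pkts[nb_rx++] = tap_ctx_mbuf(ctx);
            rte_ring_enqueue(p->rx_refill, ctx); /* refill request */
        } else {
            rte_ring_enqueue(p->tx_compl, ctx);  /* slot reusable */
        }
        io_uring_cqe_seen(&p->ring, cqe);
    }
    return nb_rx;
}

static uint16_t
tap_tx(struct tap_uring_port *p, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t nb_tx;
    void *ctx;

    /* single SQ writer: TX submissions reuse slots recycled via tx_compl */
    for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
        if (rte_ring_dequeue(p->tx_compl, &ctx) != 0)
            break;
        tap_prep_write(p, ctx, pkts[nb_tx]);
    }
    /* submit RX refills handed over by tap_rx() */
    while (rte_ring_dequeue(p->rx_refill, &ctx) == 0)
        tap_prep_read(p, ctx);

    io_uring_submit(&p->ring);
    return nb_tx;
}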


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC - Tap io_uring PMD
  2024-11-05 23:22         ` Morten Brørup
@ 2024-11-05 23:25           ` Stephen Hemminger
  2024-11-05 23:54             ` Morten Brørup
  2024-11-06 10:30           ` Konstantin Ananyev
  1 sibling, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2024-11-05 23:25 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

On Wed, 6 Nov 2024 00:22:19 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:

> From what I understand, the TAP io_uring PMD only supports one RX queue per port and one TX queue per port (i.e. per TAP interface). We can take advantage of this:


No, the kernel tap supports multi-queue, and we need to use that.
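
For reference, a sketch of how a multi-queue tap queue gets attached: each
open of /dev/net/tun with IFF_MULTI_QUEUE on the same interface name adds one
queue and yields one fd per queue (error handling trimmed):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

static int
tap_open_queue(const char *ifname)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR | O_NONBLOCK);

    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    return fd;    /* call once per queue to get one fd per queue */
}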

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: RFC - Tap io_uring PMD
  2024-11-05 23:25           ` Stephen Hemminger
@ 2024-11-05 23:54             ` Morten Brørup
  2024-11-06  0:52               ` Igor Gutorov
  0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2024-11-05 23:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 6 November 2024 00.26
> 
> On Wed, 6 Nov 2024 00:22:19 +0100
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > From what I understand, the TAP io_uring PMD only supports one RX
> queue per port and one TX queue per port (i.e. per TAP interface). We
> can take advantage of this:
> 
> 
> No kernel tap support multi queue and we need to use that.

Maybe I got it wrong then... I thought you said fanout (of kernel->TAP packets) would affect all fds associated with the TAP interface.
How can the application then use multiple queues?

Another thing is:
For which purposes do applications need multi-queue support for TAP, considering the interface is for management traffic only?
If a single queue suffices, it might make the PMD simpler.
On the other hand, it is more important to make things simple for the application than to make a simple PMD! :-)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: RFC - Tap io_uring PMD
  2024-10-30 21:56 RFC - Tap io_uring PMD Stephen Hemminger
  2024-10-31 10:27 ` Morten Brørup
@ 2024-11-06  0:46 ` Varghese, Vipin
  2024-11-06  7:46 ` Maxime Coquelin
  2 siblings, 0 replies; 15+ messages in thread
From: Varghese, Vipin @ 2024-11-06  0:46 UTC (permalink / raw)
  To: Stephen Hemminger, dev

[Public]

Snipped
>
>
> The current tap device is slow both due to architectural choices and the overhead of
> Linux system calls. I am exploring a how to fix that but some of the choices require
> some tradeoffs. Which leads to some open questions:
>
> 1. DPDK tap also support tunnel (TUN) mode where there is no Ethernet header
>    only L3. Does anyone actually use this? It is different than what every other
>    PMD expects.
Hi Stephen, the TUN interface was added in 2017 to support a couple of use cases in telco (IPsec tunneling), based on actual use cases from a user-space stack. But I am not sure if anyone is using it now.

Follow-up question: aren't the TUN Rx/Tx functions separate from TAP's?

Note: I am open to getting this removed or separated if it is not used much.

>
> 2. The fastest way to use kernel TAP device would be to use io_uring.
>    But this was added in 5.1 kernel (2019). Rather than having conditional or
>    dual mode in DPDK tap device, perhaps there should just be a new PMD
> tap_uring?
>
> 3. Current TAP device provides hooks for several rte_flow types by playing
>    games with kernel qdisc. Does anyone really use this? Propose just not doing
>    this in new tap_uring.
>
> 4. What other features of TAP device beyond basic send/receive make sense?
>    It looks like new device could support better statistics.
>
> 5. What about Rx interrupt support?
>
> Probably the hardest part of using io_uring is figuring out how to collect completions.
> The simplest way would be to handle all completions rx and tx in the rx_burst
> function.

For questions 2 to 5 above, I do like the idea of exploring better alternatives.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC - Tap io_uring PMD
  2024-11-05 23:54             ` Morten Brørup
@ 2024-11-06  0:52               ` Igor Gutorov
  2024-11-07 16:30                 ` Stephen Hemminger
  0 siblings, 1 reply; 15+ messages in thread
From: Igor Gutorov @ 2024-11-06  0:52 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

On Wed, Nov 6, 2024 at 2:54 AM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Wednesday, 6 November 2024 00.26
> >
> > On Wed, 6 Nov 2024 00:22:19 +0100
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> > > From what I understand, the TAP io_uring PMD only supports one RX
> > queue per port and one TX queue per port (i.e. per TAP interface). We
> > can take advantage of this:
> >
> >
> > No kernel tap support multi queue and we need to use that.
>
> Maybe I got it wrong then... I thought you said fanout (of kernel->TAP packets) would affect all fds associated with the TAP interface.
> How can the application then use multiple queues?
>
> Another thing is:
> For which purposes do applications need multi queue support for TAP, considering the interface is for management traffic only?

I've previously used net_pcap as well as net_af_packet PMDs for
debugging/testing and even benchmarking purposes. I'd set up a
software interface, then feed test traffic via `tcpreplay`.
Some of the limitations I've encountered are:

- net_af_packet does not report received/missed packet counters
- net_pcap does not have multi queue support (AFAIK)

As an example, non-symmetric RSS configurations may cause the two directions
of a TCP connection to be steered to different queues. If your application is
a worker-per-queue application with no shared state, you might want
these packets steered to the same queue instead. Without
multi-queue support, you can't easily test scenarios like that.

Though, as you've said, if TAP is for management only, perhaps I was
trying to use the wrong tool for the job.
In the end, I ended up getting a real two-port NIC (feeding traffic
from one port to the other) because software PMDs are not similar
enough to the actual hardware.

--
Igor

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC - Tap io_uring PMD
  2024-10-30 21:56 RFC - Tap io_uring PMD Stephen Hemminger
  2024-10-31 10:27 ` Morten Brørup
  2024-11-06  0:46 ` Varghese, Vipin
@ 2024-11-06  7:46 ` Maxime Coquelin
  2024-11-07 21:51   ` Morten Brørup
  2024-11-12  5:21   ` Stephen Hemminger
  2 siblings, 2 replies; 15+ messages in thread
From: Maxime Coquelin @ 2024-11-06  7:46 UTC (permalink / raw)
  To: Stephen Hemminger, dev

Hi Stephen,

On 10/30/24 22:56, Stephen Hemminger wrote:
> The current tap device is slow both due to architectural choices and the
> overhead of Linux system calls. I am exploring a how to fix that but some
> of the choices require some tradeoffs. Which leads to some open questions:
> 
> 1. DPDK tap also support tunnel (TUN) mode where there is no Ethernet header
>     only L3. Does anyone actually use this? It is different than what every other
>     PMD expects.
> 
> 2. The fastest way to use kernel TAP device would be to use io_uring.
>     But this was added in 5.1 kernel (2019). Rather than having conditional or
>     dual mode in DPDK tap device, perhaps there should just be a new PMD tap_uring?
> 
> 3. Current TAP device provides hooks for several rte_flow types by playing
>     games with kernel qdisc. Does anyone really use this? Propose just not doing
>     this in new tap_uring.
> 
> 4. What other features of TAP device beyond basic send/receive make sense?
>     It looks like new device could support better statistics.
> 
> 5. What about Rx interrupt support?
> 
> Probably the hardest part of using io_uring is figuring out how to collect
> completions. The simplest way would be to handle all completions rx and tx
> in the rx_burst function.
> 

Why not just use Virtio-user PMD with Vhost-kernel backend [0]?
Are there any missing features that io_uring can address?

Regards,
Maxime

[0]: http://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: RFC - Tap io_uring PMD
  2024-11-05 23:22         ` Morten Brørup
  2024-11-05 23:25           ` Stephen Hemminger
@ 2024-11-06 10:30           ` Konstantin Ananyev
  1 sibling, 0 replies; 15+ messages in thread
From: Konstantin Ananyev @ 2024-11-06 10:30 UTC (permalink / raw)
  To: Morten Brørup, Stephen Hemminger; +Cc: dev



> > > > > > Probably the hardest part of using io_uring is figuring out how
> > to
> > > > > > collect
> > > > > > completions. The simplest way would be to handle all
> > completions rx
> > > > and
> > > > > > tx
> > > > > > in the rx_burst function.
> > > > >
> > > > > Please don't mix RX and TX, unless explicitly requested by the
> > > > application through the recently introduced "mbuf recycle" feature.
> > > >
> > > > The issue is Rx and Tx share a single fd and ioring for completion
> > is
> > > > per fd.
> > > > The implementation for ioring came from the storage side so
> > initially
> > > > it was for fixing
> > > > the broken Linux AIO support.
> > > >
> > > > Some other devices only have single interrupt or ring shared with
> > rx/tx
> > > > so not unique.
> > > > Virtio, netvsc, and some NIC's.
> > > >
> > > > The problem is that if Tx completes descriptors then there needs to
> > be
> > > > locking
> > > > to prevent Rx thread and Tx thread overlapping. And a spin lock is
> > a
> > > > performance buzz kill.
> > >
> > > Brainstorming a bit here...
> > > What if the new TAP io_uring PMD is designed to use two io_urings per
> > port, one for RX and another one for TX on the same TAP interface?
> > > This requires that a TAP interface can be referenced via two file
> > descriptors (one fd for the RX io_uring and another fd for the TX
> > io_uring), e.g. by using dup() to create the additional file
> > descriptor. I don't know if this is possible, and if it works with
> > io_uring.
> >
> > There a couple of problems with multiple fd's.
> >   - multiple fds pointing to same internal tap queue are not going to
> > get completed separately.
> >   - when multi-proc is supported, limit of 253 fd's in Unix domain IPC
> > comes into play
> >   - tap does not support tx only fd for queues. If fd is queue of tap,
> > receive fan out will go to it.
> >
> > If DPDK was more flexible, harvesting of completion could be done via
> > another thread but that is not general enough
> > to work transparently with all applications.  Existing TAP device plays
> > with SIGIO, but signals are slower.
> 
> I have now read up a bit about io_uring, so here are some thoughts and ideas...
> 
> To avoid locking, there should only be one writer of io_uring Submission Queue Events (SQE), and only one reader of io_uring
> Completion Queue Events (CQE) per TAP interface.
> 
> From what I understand, the TAP io_uring PMD only supports one RX queue per port and one TX queue per port (i.e. per TAP
> interface). We can take advantage of this:
> 
> We can use rte_tx() as the Submission Queue writer and rte_rx() as the Completion Queue reader.
> 
> The PMD must have two internal rte_rings for respectively RX refill and TX completion events.
> 
> rte_rx() does the following:
> Read the Completion Queue;
> If RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an RX Refill SQE and enqueue it in the RX Refill rte_ring;
> If TX CQE, enqueue it in the TX Completion rte_ring;
> Repeat until nb_pkts RX CQEs have been received, or no more CQE's are available. (This complies with the rte_rx() API, which says
> that less than nb_pkts is only returned if no more packets are available for receiving.)
> 
> rte_tx() does the following:
> Pass the data from the TX MBUFs to io_uring TX SQEs, using the TX CQEs in the TX Completion rte_ring, and write them to the io_uring
> Submission Queue.
> Dequeue any RX Refill SQEs from the RX Refill rte_ring and write them to the io_uring Submission Queue.
> 
> This means that the application must call both rte_rx() and rte_tx(); but it would be allowed to call rte_tx() with zero MBUFs.
> 
> The internal rte_rings are Single-Producer, Single-Consumer, and large enough to hold all TX+RX descriptors.
> 
> 
> Alternatively, we can let rte_rx() do all the work and use an rte_ring in the opposite direction...
> 
> The PMD must have two internal rte_rings, one for TX MBUFs and one for TX CQEs. (The latter can be a stack, or any other type of
> container.)
> 
> rte_tx() only does the following:
> Enqueue the TX MBUFs to the TX MBUF rte_ring.
> 
> rte_rx() does the following:
> Dequeue any TX MBUFs from the TX MBUF rte_ring, convert them to TX SQEs, using the TX CQEs in the TX Completion rte_ring, and
> write them to the io_uring Submission Queue.
> Read the Completion Queue;
> If TX CQE, enqueue it in the TX Completion rte_ring;
> If RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an RX Refill SQE and write it to the io_uring Submission Queue;
> Repeat until nb_pkts RX CQEs have been received, or no more CQE's are available. (This complies with the rte_rx() API, which says
> that less than nb_pkts is only returned if no more packets are available for receiving.)
> 
> With the second design, the PMD can support multiple TX queues by using a Multi-Producer rte_ring for the TX MBUFs.
> But it postpones all transmits until rte_rx() is called, so I don't really like it.
> 
> Of the two designs, the first feels more natural to me.
> And if some application absolutely needs multiple TX queues, it can implement a Multi-Producer, Single-Consumer rte_ring as an
> intermediate step in front of the PMD's single TX queue.

And why can't we simply have two io_urings: one for RX ops and a second for TX ops?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC - Tap io_uring PMD
  2024-11-06  0:52               ` Igor Gutorov
@ 2024-11-07 16:30                 ` Stephen Hemminger
  0 siblings, 0 replies; 15+ messages in thread
From: Stephen Hemminger @ 2024-11-07 16:30 UTC (permalink / raw)
  To: Igor Gutorov; +Cc: Morten Brørup, dev

On Wed, 6 Nov 2024 03:52:51 +0300
Igor Gutorov <igootorov@gmail.com> wrote:

> On Wed, Nov 6, 2024 at 2:54 AM Morten Brørup <mb@smartsharesystems.com> wrote:
> >  
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Wednesday, 6 November 2024 00.26
> > >
> > > On Wed, 6 Nov 2024 00:22:19 +0100
> > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > >  
> > > > From what I understand, the TAP io_uring PMD only supports one RX  
> > > queue per port and one TX queue per port (i.e. per TAP interface). We
> > > can take advantage of this:
> > >
> > >
> > > No kernel tap support multi queue and we need to use that.  
> >
> > Maybe I got it wrong then... I thought you said fanout (of kernel->TAP packets) would affect all fds associated with the TAP interface.
> > How can the application then use multiple queues?
> >
> > Another thing is:
> > For which purposes do applications need multi queue support for TAP, considering the interface is for management traffic only?  
> 
> I've previously used net_pcap as well as net_af_packet PMDs for
> debugging/testing and even benchmarking purposes. I'd set up a
> software interface, then feed test traffic via `tcpreplay`.
> Some of the limitations I've encountered are:
> 
> - net_af_packet does not report received/missed packet counters
> - net_pcap does not have multi queue support (AFAIK)
> 
> As an example, non symmetric RSS configurations may cause 2-way TCP
> packets to be steered to different queues. If your application is a
> worker-per-queue application with no shared state, you might want to
> have these packets to be steered to the same queue instead. Without
> multi queue, you can't easily test against scenarios like that.
> 
> Though, as you've said, if TAP is for management only, perhaps I was
> trying to use the wrong tool for the wrong job.
> In the end, I ended up getting a real two port NIC (feeding traffic
> from one port to the other) because software PMDs are not similar
> enough to the actual hardware.

DPDK interaction with kernel devices is complicated; there are overlapping
drivers:
  - TAP - uses the Linux tap device
  - PCAP - uses libpcap, which ends up using af_packet
  - AF_PACKET - uses af_packet
  - AF_XDP - uses XDP
  - Virtio user - relies on vhost in the kernel

There are lots of differences in performance, control path usage, and other
details between these.

The proposal is to do a simpler, faster version of the TAP device.
I will also do some comparisons and details (maybe a presentation or document)
on all the others later.

So far, I have a basic driver that does setup and basic management operations;
I am still working on the details of the ring handling.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: RFC - Tap io_uring PMD
  2024-11-06  7:46 ` Maxime Coquelin
@ 2024-11-07 21:51   ` Morten Brørup
  2024-11-12  5:21   ` Stephen Hemminger
  1 sibling, 0 replies; 15+ messages in thread
From: Morten Brørup @ 2024-11-07 21:51 UTC (permalink / raw)
  To: Maxime Coquelin, Stephen Hemminger, dev

> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Wednesday, 6 November 2024 08.47
> 
> Hi Stephen,
> 
> On 10/30/24 22:56, Stephen Hemminger wrote:
> > The current tap device is slow both due to architectural choices and
> the
> > overhead of Linux system calls. I am exploring a how to fix that but
> some
> > of the choices require some tradeoffs. Which leads to some open
> questions:

[...]

> 
> Why not just use Virtio-user PMD with Vhost-kernel backend [0]?
> Are there any missing features that io_uring can address?
> 
> Regards,
> Maxime
> 
> [0]:
> http://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html

Thanks for the pointer, Maxime.

After taking a look at it, I think this is exactly what we were looking for!


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: RFC - Tap io_uring PMD
  2024-11-06  7:46 ` Maxime Coquelin
  2024-11-07 21:51   ` Morten Brørup
@ 2024-11-12  5:21   ` Stephen Hemminger
  1 sibling, 0 replies; 15+ messages in thread
From: Stephen Hemminger @ 2024-11-12  5:21 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev

On Wed, 6 Nov 2024 08:46:55 +0100
Maxime Coquelin <maxime.coquelin@redhat.com> wrote:

> Hi Stephen,
> 
> On 10/30/24 22:56, Stephen Hemminger wrote:
> > The current tap device is slow both due to architectural choices and the
> > overhead of Linux system calls. I am exploring a how to fix that but some
> > of the choices require some tradeoffs. Which leads to some open questions:
> > 
> > 1. DPDK tap also support tunnel (TUN) mode where there is no Ethernet header
> >     only L3. Does anyone actually use this? It is different than what every other
> >     PMD expects.
> > 
> > 2. The fastest way to use kernel TAP device would be to use io_uring.
> >     But this was added in 5.1 kernel (2019). Rather than having conditional or
> >     dual mode in DPDK tap device, perhaps there should just be a new PMD tap_uring?
> > 
> > 3. Current TAP device provides hooks for several rte_flow types by playing
> >     games with kernel qdisc. Does anyone really use this? Propose just not doing
> >     this in new tap_uring.
> > 
> > 4. What other features of TAP device beyond basic send/receive make sense?
> >     It looks like new device could support better statistics.
> > 
> > 5. What about Rx interrupt support?
> > 
> > Probably the hardest part of using io_uring is figuring out how to collect
> > completions. The simplest way would be to handle all completions rx and tx
> > in the rx_burst function.
> >   
> 
> Why not just use Virtio-user PMD with Vhost-kernel backend [0]?
> Are there any missing features that io_uring can address?
> 
> Regards,
> Maxime
> 
> [0]: http://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html
> 

Yes, I looked at that, but:
  - virtio-user ends up with a busy kernel thread, which is not acceptable
    in an SoC environment where all resources are locked down. In the SoC I was working
    on, DPDK was limited to 4 isolated polling CPUs and 1 sleeping main thread.
    The rest of the CPU resources were hard constrained by cgroups. The virtio-user
    kernel thread was a problem.

  - the virtio-user device is not persistent. If DPDK is being used as a dataplane, it needs
    to be able to restart quickly without disturbing applications and routing in the kernel
    while the tap device is unavailable. I.e. having the device present but in no-carrier
    state is better than having to teach applications about hotplug or playing around
    with multiple addresses on the loopback device (which is what Cisco routers do).

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-11-12  5:21 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-30 21:56 RFC - Tap io_uring PMD Stephen Hemminger
2024-10-31 10:27 ` Morten Brørup
2024-11-01  0:34   ` Stephen Hemminger
2024-11-02 22:28     ` Morten Brørup
2024-11-05 18:58       ` Stephen Hemminger
2024-11-05 23:22         ` Morten Brørup
2024-11-05 23:25           ` Stephen Hemminger
2024-11-05 23:54             ` Morten Brørup
2024-11-06  0:52               ` Igor Gutorov
2024-11-07 16:30                 ` Stephen Hemminger
2024-11-06 10:30           ` Konstantin Ananyev
2024-11-06  0:46 ` Varghese, Vipin
2024-11-06  7:46 ` Maxime Coquelin
2024-11-07 21:51   ` Morten Brørup
2024-11-12  5:21   ` Stephen Hemminger
