DPDK patches and discussions
* AF_XDP performance
@ 2023-05-24 12:32 Alireza Sanaee
  2023-05-24 16:36 ` Bruce Richardson
  0 siblings, 1 reply; 3+ messages in thread
From: Alireza Sanaee @ 2023-05-24 12:32 UTC (permalink / raw)
  To: dev

Hi everyone,

I was looking at this deck of slides 
https://www.dpdk.org/wp-content/uploads/sites/35/2020/11/XDP_ZC_PMD-1.pdf

I tried to reproduce the results with the testpmd application. I am
working with a BlueField-2 NIC, and on the RX drop experiment I could
sustain ~10 Mpps with testpmd over AF_XDP and about 20 Mpps without
AF_XDP. I was wondering why AF_XDP is so much lower than the PCIe
scenario, given that both cases are zero-copy. Is it because of the
frame size?

Thanks,
Ali



* Re: AF_XDP performance
  2023-05-24 12:32 AF_XDP performance Alireza Sanaee
@ 2023-05-24 16:36 ` Bruce Richardson
  2023-05-24 18:29   ` Stephen Hemminger
  0 siblings, 1 reply; 3+ messages in thread
From: Bruce Richardson @ 2023-05-24 16:36 UTC (permalink / raw)
  To: Alireza Sanaee; +Cc: dev

On Wed, May 24, 2023 at 01:32:17PM +0100, Alireza Sanaee wrote:
> Hi everyone,
> 
> I was looking at this deck of slides
> https://www.dpdk.org/wp-content/uploads/sites/35/2020/11/XDP_ZC_PMD-1.pdf
> 
> I tried to reproduce the results with the testpmd application. I am
> working with a BlueField-2 NIC, and on the RX drop experiment I could
> sustain ~10 Mpps with testpmd over AF_XDP and about 20 Mpps without
> AF_XDP. I was wondering why AF_XDP is so much lower than the PCIe
> scenario, given that both cases are zero-copy. Is it because of the
> frame size?
> 
While I can't claim to explain all the differences, in short I believe the
AF_XDP version is just doing more work. With a native DPDK driver, the
driver takes the packet descriptors directly from the NIC RX ring and uses
the metadata to construct a packet mbuf, which is returned to the
application.
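
For contrast, the native receive step boils down to something like the
hypothetical sketch below; the descriptor and queue layouts are invented
for illustration, since every NIC defines its own, but the shape of the
work is one ring read plus an mbuf metadata fill per packet.

/*
 * Hypothetical sketch of a native PMD receive step. struct nic_rx_desc
 * and struct rx_queue are made up for illustration; real drivers use the
 * hardware-specific layouts. Descriptor re-arming is omitted.
 */
#include <stdint.h>
#include <rte_mbuf.h>

struct nic_rx_desc {            /* hypothetical hardware descriptor */
        uint64_t buf_addr;      /* DMA address of the posted mbuf data */
        uint16_t pkt_len;
        uint16_t status;        /* bit 0 set once the NIC wrote the packet */
};

struct rx_queue {               /* hypothetical per-queue software state */
        struct nic_rx_desc *ring;
        struct rte_mbuf **sw_ring;   /* mbufs already posted to the NIC */
        uint16_t next_to_clean;
        uint16_t ring_size;
};

static uint16_t
native_rx_sketch(struct rx_queue *q, struct rte_mbuf **bufs, uint16_t nb_pkts)
{
        uint16_t nb_rx = 0;

        while (nb_rx < nb_pkts) {
                uint16_t idx = q->next_to_clean;
                volatile struct nic_rx_desc *d = &q->ring[idx];

                if (!(d->status & 0x1))       /* nothing new from the NIC */
                        break;

                /* The mbuf was posted to the NIC earlier, so the driver only
                 * fills in metadata from the descriptor and hands it back. */
                struct rte_mbuf *m = q->sw_ring[idx];
                rte_pktmbuf_data_len(m) = d->pkt_len;
                rte_pktmbuf_pkt_len(m) = d->pkt_len;
                bufs[nb_rx++] = m;

                q->next_to_clean = (uint16_t)((idx + 1) % q->ring_size);
        }
        return nb_rx;
}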

With AF_XDP, however, the NIC descriptor ring is not directly accessible by
the app. Therefore the processing is (AFAIK):
* kernel reads NIC descriptor ring and processes descriptor
* kernel calls BPF program for the received packets to determine what
  action to take, e.g. forward to socket
* kernel writes an AF_XDP descriptor to the AF_XDP socket RX ring
* application reads the AF_XDP ring entry written by the kernel and then
  creates a DPDK mbuf to return to the application.

There are also other considerations, such as the cache locality of the
descriptors, that could affect things, but I would expect the extra
descriptor-processing work outlined above to explain most of the
difference.
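
To make the extra indirection concrete, here is a rough sketch of the
per-burst receive step on the AF_XDP side, written against the generic
xsk ring helpers from libxdp (older libbpf ships them as <bpf/xsk.h>)
rather than the actual af_xdp PMD code. Umem/fill-queue handling and
error paths are omitted, and the copy at the end stands in for the
zero-copy mbuf attach that the real PMD does:

#include <xdp/xsk.h>
#include <rte_mbuf.h>
#include <rte_memcpy.h>
#include <rte_mempool.h>

static uint16_t
af_xdp_rx_sketch(struct xsk_ring_cons *rx, void *umem_area,
                 struct rte_mempool *mb_pool, struct rte_mbuf **bufs,
                 uint16_t nb_pkts)
{
        uint32_t idx_rx = 0, nb_rx = 0;

        /* Descriptors the kernel wrote into the AF_XDP socket RX ring
         * after running the XDP program on the NIC descriptors. */
        uint32_t rcvd = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);

        for (uint32_t i = 0; i < rcvd; i++) {
                const struct xdp_desc *desc =
                        xsk_ring_cons__rx_desc(rx, idx_rx + i);
                void *pkt = xsk_umem__get_data(umem_area, desc->addr);

                /* Build a DPDK mbuf to hand back to the application.
                 * The zero-copy PMD attaches the umem frame as an external
                 * buffer instead; a copy is shown here only for brevity. */
                struct rte_mbuf *m = rte_pktmbuf_alloc(mb_pool);
                if (m == NULL)
                        break;  /* frame dropped; refill handling omitted */
                rte_memcpy(rte_pktmbuf_mtod(m, void *), pkt, desc->len);
                rte_pktmbuf_data_len(m) = desc->len;
                rte_pktmbuf_pkt_len(m) = desc->len;
                bufs[nb_rx++] = m;
        }

        xsk_ring_cons__release(rx, rcvd);
        return (uint16_t)nb_rx;
}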

Regards,
/Bruce


* Re: AF_XDP performance
  2023-05-24 16:36 ` Bruce Richardson
@ 2023-05-24 18:29   ` Stephen Hemminger
  0 siblings, 0 replies; 3+ messages in thread
From: Stephen Hemminger @ 2023-05-24 18:29 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: Alireza Sanaee, dev

On Wed, 24 May 2023 17:36:32 +0100
Bruce Richardson <bruce.richardson@intel.com> wrote:

> On Wed, May 24, 2023 at 01:32:17PM +0100, Alireza Sanaee wrote:
> > Hi everyone,
> > 
> > I was looking at this deck of slides
> > https://www.dpdk.org/wp-content/uploads/sites/35/2020/11/XDP_ZC_PMD-1.pdf
> > 
> > I tried to reproduce the results with the testpmd application. I am
> > working with a BlueField-2 NIC, and on the RX drop experiment I could
> > sustain ~10 Mpps with testpmd over AF_XDP and about 20 Mpps without
> > AF_XDP. I was wondering why AF_XDP is so much lower than the PCIe
> > scenario, given that both cases are zero-copy. Is it because of the
> > frame size?
> >   
> While I can't claim to explain all the differences, in short I believe the
> AF_XDP version is just doing more work. With a native DPDK driver, the
> driver takes the packet descriptors directly from the NIC RX ring and uses
> the metadata to construct a packet mbuf, which is returned to the
> application.
> 
> With AF_XDP, however, the NIC descriptor ring is not directly accessible by
> the app. Therefore the processing is (AFAIK):
> * kernel reads NIC descriptor ring and processes descriptor
> * kernel calls BPF program for the received packets to determine what
>   action to take, e.g. forward to socket
> * kernel writes an AF_XDP descriptor to the AF_XDP socket RX ring
> * application reads the AF_XDP ring entry written by the kernel and then
>   creates a DPDK mbuf to return to the application.
> 
> There are also other considerations, such as the cache locality of the
> descriptors, that could affect things, but I would expect the extra
> descriptor-processing work outlined above to explain most of the
> difference.
> 
> Regards,
> /Bruce

There is also a context switch from the kernel polling thread to the DPDK
polling thread to consider, plus the overhead of running the BPF program.
The context switches mean that both the instruction and data caches are
likely to see lots of misses. Remember that on modern processors the
limiting factor is usually memory performance from caching, not the number
of instructions.
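
To give a sense of what the per-packet BPF step involves, a minimal XDP
program of the kind used to steer frames to AF_XDP sockets looks roughly
like this; the map name and size are illustrative, not taken from the
DPDK af_xdp PMD:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
        /* Redirect to the AF_XDP socket bound to this RX queue, or fall
         * back to the normal kernel stack (XDP_PASS) if none is registered. */
        return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";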


