DPDK patches and discussions
From: Stephen Hemminger <stephen@networkplumber.org>
To: "Mattias Rönnblom" <hofors@lysator.liu.se>
Cc: "Morten Brørup" <mb@smartsharesystems.com>,
	"Maxime Coquelin" <maxime.coquelin@redhat.com>,
	"Mattias Rönnblom" <mattias.ronnblom@ericsson.com>,
	dev@dpdk.org, "Abdullah Sevincer" <abdullah.sevincer@intel.com>,
	"Pavan Nikhilesh" <pbhagavatula@marvell.com>,
	"David Hunt" <david.hunt@intel.com>,
	"Vladimir Medvedkin" <vladimir.medvedkin@intel.com>,
	"Bruce Richardson" <bruce.richardson@intel.com>
Subject: Re: [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy
Date: Thu, 27 Jun 2024 08:10:19 -0700	[thread overview]
Message-ID: <20240627081019.43b2779a@hermes.local> (raw)
In-Reply-To: <Zn1Hrq+rH7UJ6hBx@isengard>

On Thu, 27 Jun 2024 13:06:22 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On Wed, Jun 26, 2024 at 10:16:06PM +0200, Morten Brørup wrote:
> > > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > > Sent: Wednesday, 26 June 2024 20.48
> > > 
> > > On Wed, Jun 26, 2024 at 05:24:04PM +0200, Maxime Coquelin wrote:  
> > > >
> > > >
> > > > On 6/26/24 16:58, Stephen Hemminger wrote:  
> > > > > On Wed, 26 Jun 2024 10:37:31 +0200
> > > > > Maxime Coquelin <maxime.coquelin@redhat.com> wrote:
> > > > >  
> > > > > > On 6/25/24 21:27, Mattias Rönnblom wrote:  
> > > > > > > On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:  
> > > > > > > > Hi Mattias,
> > > > > > > >
> > > > > > > > On 6/20/24 19:57, Mattias Rönnblom wrote:  
> > > > > > > > > This patch set makes DPDK library, driver, and application
> > > > > > > > > code use the compiler/libc memcpy() by default when functions
> > > > > > > > > in <rte_memcpy.h> are invoked.
> > > > > > > > >
> > > > > > > > > The various custom DPDK rte_memcpy() implementations may be
> > > > > > > > > retained by means of a build-time option.
> > > > > > > > >
> > > > > > > > > This patch set only makes a difference on x86, PPC and ARM.
> > > > > > > > > Loongarch and RISCV already used compiler/libc memcpy().
> > > > > > > >
> > > > > > > > It indeed makes a difference on x86!
> > > > > > > >
> > > > > > > > Just tested latest main with and without your series on
> > > > > > > > Intel(R) Xeon(R) Gold 6438N.
> > > > > > > >
> > > > > > > > The test is a simple IO loop between a Vhost PMD and a
> > > > > > > > Virtio-user PMD:
> > > > > > > > # dpdk-testpmd -l 4-6 --file-prefix=virtio1 --no-pci --vdev
> > > > > > > > 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1'
> > > > > > > > --single-file-segments -- -i
> > > > > > > > testpmd> start
> > > > > > > >
> > > > > > > > # dpdk-testpmd -l 8-10 --file-prefix=vhost1 --no-pci --vdev
> > > > > > > > 'net_vhost0,iface=vhost-net,client=1' --single-file-segments -- -i
> > > > > > > > testpmd> start tx_first 32
> > > > > > > >
> > > > > > > > Latest main: 14.5Mpps
> > > > > > > > Latest main + this series: 10Mpps  
> > > > > > >
> > > > > > > I ran the above benchmark on my Raptor Lake desktop (locked to
> > > > > > > 3.2 GHz). GCC 12.3.0.
> > > > > > >
> > > > > > > Core use_cc_memcpy Mpps
> > > > > > > E    false         9.5
> > > > > > > E    true          9.7
> > > > > > > P    false         16.4
> > > > > > > P    true          13.5
> > > > > > >
> > > > > > > On the P-cores, there's a significant performance regression,
> > > > > > > although not as bad as the one you see on your Sapphire Rapids
> > > > > > > Xeon. On the E-cores, there's actually a slight performance gain.
> > > > > > >
> > > > > > > The virtio PMD does not directly invoke rte_memcpy() or anything
> > > > > > > else from <rte_memcpy.h>, but rather uses memcpy(), so I'm not
> > > > > > > sure I understand what's going on here. Does the virtio driver
> > > > > > > delegate some performance-critical task to some module that in
> > > > > > > turn uses rte_memcpy()?
> > > > > >
> > > > > > This is because Vhost is the bottleneck here, not the Virtio
> > > > > > driver. Indeed, the virtqueue memory belongs to the Virtio driver
> > > > > > and the descriptor buffers are Virtio's mbufs, so not many
> > > > > > memcpys are done there.
> > > > > >
> > > > > > Vhost, however, is a heavy memcpy user, as all the descriptor
> > > > > > buffers are copied to/from its mbufs.
> > > > >
> > > > > It would be good to know the size (if small, it is inlining that
> > > > > matters, or maybe alignment matters), and to have test results for
> > > > > multiple compiler versions.
> > > > > Ideally, feed the results back and update GCC and Clang.
> > > >
> > > > I was testing with GCC 11 on RHEL-9:
> > > > gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
> > > >
> > > > I was using the default one, 64B packets.
> > > >
> > > > I don't have time to perform these tests, but if you are willing to do
> > > > it I'll be happy to review the results.
> > > >  
> > > > > DPDK doesn't need to be in the optimized C library space.
> > > >
> > > > Certainly, but we already have an optimized version currently, so
> > > > there is not much to do now on our side. When C library
> > > > implementations are on par, we should definitely use them by
> > > > default.
> > > >  
> > > 
> > > I think it's not so much about optimized versus non-optimized at this
> > > point. It's just that cc/libc memcpy sometimes performs better than
> > > RTE memcpy, and sometimes doesn't.
> > > 
> > > For virtio, a single memory copy in
> > > lib/vhost/virtio_net.c:do_data_copy_enqueue()
> > > is responsible for >95% of the performance regression introduced by
> > > the cc memcpy patch for small packets on Intel P-cores.
> > > 
> > > I'm not so sure this performance regression will go away in newer
> > > compilers. PGO would certainly help, but PGO is a hassle.
> > > 
> > > One way to fix this issue would be to introduce a custom,
> > > memcpy()-based packet copying routine. I tried the below patch, with
> > > the following results:
> > > 
> > > Raptor Lake @ 3.2 GHz
> > > GCC 12
> > > 
> > > 64 bytes packets
> > > Core  Mode              Mpps
> > > ----------------------------
> > > E     RTE memcpy        9.5
> > > E     cc memcpy         9.7
> > > E     cc memcpy+pktcpy  9.0
> > > 
> > > P     RTE memcpy        16.4
> > > P     cc memcpy         13.5
> > > P     cc memcpy+pktcpy  16.2
> > > 
> > > 1500 bytes
> > > Core  Mode              Mpps
> > > ----------------------------
> > > P    RTE memcpy         5.8
> > > P    cc memcpy          5.9
> > > P    cc memcpy+pktcpy   5.9
> > > 
> > > As you can see, most of the regression is eliminated, at the cost of
> > > worse E-core performance. I didn't look at the generated code, but one
> > > could suspect heavy use of wide SIMD is to blame, which E-cores don't
> > > necessarily benefit from.
> > > 
> > > The below prototype assumes the source and destination buffers are
> > > 16-byte aligned. Does that always hold?  
> > 
> > Perhaps always for this specific function; I don't know.
> > Not generally *always*, but I guess in many cases packet copies would have 64-byte aligned pointers, but not sizes.
> > Unless explicitly stated by the developer, it is unsafe to make assumptions about alignment.
> >  
> 
> I meant always (for every packet) in DPDK virtio net.
> 
> > A future rte_memcpy() function might take flags with explicit alignment information for optimized copying, as was part of my non-temporal memcpy(). (The development on this is still on hold.)
> >  
> 
> There is already a mechanism to express alignment in GCC-compatible
> compilers. See the below patch.
> 
> > > 
> > > I'm sure one could further improve performance using context-specific
> > > information, such as packets always being >= 64 bytes. One could also
> > > consider having special cases, maybe for 64 bytes and MTU-sized
> > > packets. Such are always a hassle when you try to characterize
> > > performance though.  
> > 
> > Absolutely!
> > 
> > This got me thinking:
> > These tests are run with 64 byte packets only.
> > Perhaps branch prediction pollutes the results, by optimizing branches in the copy routine for all packets being 64 byte.
> > 
> > You really should be testing with IMIX or random packet sizes.
> >  
> 
> Sure, and it should also be a real app, not testpmd, on top of virtio.
> 
> > In my experience, most internet packets are large (> 1024 byte, but not 1514 byte due to QUIC's conservative max packet size), closely followed by 64 byte (excl. any VLAN tags) packets; only the minority of packets are medium size.
> >   
> > > 
> > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> > > index 370402d849..7b595a6622 100644
> > > --- a/lib/vhost/virtio_net.c
> > > +++ b/lib/vhost/virtio_net.c
> > > @@ -231,6 +231,26 @@ vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t
> > >  	return nr_copies;
> > >  }
> > > 
> > > +static inline void
> > > +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
> > > +{
> > > +	void *dst = __builtin_assume_aligned(in_dst, 16);
> > > +	const void *src = __builtin_assume_aligned(in_src, 16);
> > > +
> > > +	if (len <= 256) {
> > > +		size_t left;
> > > +
> > > +		for (left = len; left >= 32; left -= 32) {
> > > +			memcpy(dst, src, 32);
> > > +			dst = RTE_PTR_ADD(dst, 32);
> > > +			src = RTE_PTR_ADD(src, 32);
> > > +		}
> > > +
> > > +		memcpy(dst, src, left);
> > > +	} else
> > > +		memcpy(dst, src, len);
> > > +}
> > > +

Is alignment an optimization or a requirement? I can think of cases in tunneling
where the packet was received into an mbuf (aligned), but then the application
prepends a header, making the packet unaligned before sending.


Thread overview: 93+ messages
2024-05-27 11:11 [RFC] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-05-28  7:43 ` [RFC v2] " Mattias Rönnblom
2024-05-28  8:19   ` Mattias Rönnblom
2024-05-28  8:27     ` Bruce Richardson
2024-05-28  8:59       ` Mattias Rönnblom
2024-05-28  9:07         ` Morten Brørup
2024-05-28 16:17           ` Mattias Rönnblom
2024-05-28 14:59     ` Stephen Hemminger
2024-05-28 15:09       ` Bruce Richardson
2024-05-31  5:19         ` Mattias Rönnblom
2024-05-31 16:50           ` Stephen Hemminger
2024-06-02 11:33             ` Mattias Rönnblom
2024-05-28 16:03       ` Mattias Rönnblom
2024-05-29 21:55         ` Stephen Hemminger
2024-05-28  8:20   ` Bruce Richardson
2024-06-02 12:39   ` [RFC v3 0/5] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 1/5] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-05  6:49       ` [PATCH 0/5] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-05  6:49         ` [PATCH 1/5] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-05  6:49         ` [PATCH 2/5] net/octeon_ep: properly include vector API header file Mattias Rönnblom
2024-06-05  6:49         ` [PATCH 3/5] distributor: " Mattias Rönnblom
2024-06-10 14:27           ` Tyler Retzlaff
2024-06-05  6:49         ` [PATCH 4/5] fib: " Mattias Rönnblom
2024-06-10 14:28           ` Tyler Retzlaff
2024-06-05  6:49         ` [PATCH 5/5] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-20  7:24         ` [PATCH v2 0/6] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-20  7:24           ` [PATCH v2 1/6] net/fm10k: add missing intrinsic include Mattias Rönnblom
2024-06-20  9:02             ` Bruce Richardson
2024-06-20  9:28             ` Bruce Richardson
2024-06-20 11:40               ` Mattias Rönnblom
2024-06-20 11:59                 ` Bruce Richardson
2024-06-20 11:50             ` [PATCH v3 0/6] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 1/6] net/fm10k: add missing vector API header include Mattias Rönnblom
2024-06-20 12:34                 ` Bruce Richardson
2024-06-20 17:57                 ` [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 01/13] net/i40e: add missing vector API header include Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 02/13] net/iavf: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 03/13] net/ice: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 04/13] net/ixgbe: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 05/13] net/ngbe: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 06/13] net/txgbe: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 07/13] net/virtio: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 08/13] net/fm10k: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 09/13] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 10/13] net/octeon_ep: add missing vector API header include Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 11/13] distributor: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 12/13] fib: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 13/13] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-21 15:19                     ` Stephen Hemminger
2024-06-24 10:05                     ` Thomas Monjalon
2024-06-24 17:56                       ` Mattias Rönnblom
2024-06-25 13:06                       ` Mattias Rönnblom
2024-06-25 13:34                         ` Thomas Monjalon
2024-06-20 18:53                   ` [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy Morten Brørup
2024-06-21  6:56                   ` Mattias Rönnblom
2024-06-21  7:04                     ` David Marchand
2024-06-21  7:35                       ` Mattias Rönnblom
2024-06-21  7:41                         ` David Marchand
2024-06-25 15:29                   ` Maxime Coquelin
2024-06-25 15:44                     ` Stephen Hemminger
2024-06-25 19:27                     ` Mattias Rönnblom
2024-06-26  8:37                       ` Maxime Coquelin
2024-06-26 14:58                         ` Stephen Hemminger
2024-06-26 15:24                           ` Maxime Coquelin
2024-06-26 18:47                             ` Mattias Rönnblom
2024-06-26 20:16                               ` Morten Brørup
2024-06-27 11:06                                 ` Mattias Rönnblom
2024-06-27 15:10                                   ` Stephen Hemminger [this message]
2024-06-27 15:23                                     ` Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 2/6] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 3/6] net/octeon_ep: add missing vector API header include Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 4/6] distributor: " Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 5/6] fib: " Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 6/6] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-20  7:24           ` [PATCH v2 2/6] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-20  9:03             ` Bruce Richardson
2024-06-20  7:24           ` [PATCH v2 3/6] net/octeon_ep: properly include vector API header file Mattias Rönnblom
2024-06-20 14:43             ` Stephen Hemminger
2024-06-20  7:24           ` [PATCH v2 4/6] distributor: " Mattias Rönnblom
2024-06-20  9:13             ` Bruce Richardson
2024-06-20  7:24           ` [PATCH v2 5/6] fib: " Mattias Rönnblom
2024-06-20  9:14             ` Bruce Richardson
2024-06-20 14:43               ` Stephen Hemminger
2024-06-20  7:24           ` [PATCH v2 6/6] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 2/5] net/octeon_ep: properly include vector API header file Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 3/5] distributor: " Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 4/5] fib: " Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 5/5] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-02 20:58       ` Morten Brørup
2024-06-03 17:04         ` Mattias Rönnblom
2024-06-03 17:08           ` Stephen Hemminger
2024-05-29 21:56 ` [RFC] " Stephen Hemminger
2024-06-02 11:30   ` Mattias Rönnblom
