DPDK patches and discussions
From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Mattias Rönnblom" <hofors@lysator.liu.se>,
	"Maxime Coquelin" <maxime.coquelin@redhat.com>
Cc: "Stephen Hemminger" <stephen@networkplumber.org>,
	"Mattias Rönnblom" <mattias.ronnblom@ericsson.com>,
	dev@dpdk.org, "Abdullah Sevincer" <abdullah.sevincer@intel.com>,
	"Pavan Nikhilesh" <pbhagavatula@marvell.com>,
	"David Hunt" <david.hunt@intel.com>,
	"Vladimir Medvedkin" <vladimir.medvedkin@intel.com>,
	"Bruce Richardson" <bruce.richardson@intel.com>
Subject: RE: [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy
Date: Wed, 26 Jun 2024 22:16:06 +0200	[thread overview]
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F55F@smartserver.smartshare.dk> (raw)
In-Reply-To: <ZnxiXLbI7LlykMKG@isengard>

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 26 June 2024 20.48
> 
> On Wed, Jun 26, 2024 at 05:24:04PM +0200, Maxime Coquelin wrote:
> >
> >
> > On 6/26/24 16:58, Stephen Hemminger wrote:
> > > On Wed, 26 Jun 2024 10:37:31 +0200
> > > Maxime Coquelin <maxime.coquelin@redhat.com> wrote:
> > >
> > > > On 6/25/24 21:27, Mattias Rönnblom wrote:
> > > > > On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:
> > > > > > Hi Mattias,
> > > > > >
> > > > > > On 6/20/24 19:57, Mattias Rönnblom wrote:
> > > > > > > This patch set makes DPDK library, driver, and application
> > > > > > > code use the compiler/libc memcpy() by default when
> > > > > > > functions in <rte_memcpy.h> are invoked.
> > > > > > >
> > > > > > > The various custom DPDK rte_memcpy() implementations may be
> > > > > > > retained by means of a build-time option.
> > > > > > >
> > > > > > > This patch set only makes a difference on x86, PPC and ARM.
> > > > > > > LoongArch and RISC-V already used compiler/libc memcpy().
> > > > > >
> > > > > > It indeed makes a difference on x86!
> > > > > >
> > > > > > Just tested latest main with and without your series on
> > > > > > Intel(R) Xeon(R) Gold 6438N.
> > > > > >
> > > > > > The test is a simple IO loop between a Vhost PMD and a
> > > > > > Virtio-user PMD:
> > > > > > # dpdk-testpmd -l 4-6 --file-prefix=virtio1 --no-pci --vdev \
> > > > > >   'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1' \
> > > > > >   --single-file-segments -- -i
> > > > > > testpmd> start
> > > > > >
> > > > > > # dpdk-testpmd -l 8-10 --file-prefix=vhost1 --no-pci --vdev \
> > > > > >   'net_vhost0,iface=vhost-net,client=1' --single-file-segments -- -i
> > > > > > testpmd> start tx_first 32
> > > > > >
> > > > > > Latest main: 14.5Mpps
> > > > > > Latest main + this series: 10Mpps
> > > > >
> > > > > I ran the above benchmark on my Raptor Lake desktop (locked to
> > > > > 3.2 GHz). GCC 12.3.0.
> > > > >
> > > > > Core use_cc_memcpy Mpps
> > > > > E    false         9.5
> > > > > E    true          9.7
> > > > > P    false         16.4
> > > > > P    true          13.5
> > > > >
> > > > > On the P-cores, there's a significant performance regression,
> > > > > although not as bad as the one you see on your Sapphire Rapids
> > > > > Xeon. On the E-cores, there's actually a slight performance gain.
> > > > >
> > > > > The virtio PMD does not directly invoke rte_memcpy() or
> > > > > anything else from <rte_memcpy.h>, but rather uses memcpy(), so
> > > > > I'm not sure I understand what's going on here. Does the virtio
> > > > > driver delegate some performance-critical task to some module
> > > > > that in turn uses rte_memcpy()?
> > > >
> > > > This is because Vhost is the bottleneck here, not the Virtio
> > > > driver. Indeed, the virtqueue memory belongs to the Virtio driver
> > > > and the descriptor buffers are Virtio's mbufs, so not many
> > > > memcpy's are done there.
> > > >
> > > > Vhost, however, is a heavy memcpy user, as all the descriptor
> > > > buffers are copied to/from its mbufs.
> > >
> > > Would be good to know the size (if small, it is inlining that
> > > matters, or maybe alignment matters), and have test results for
> > > multiple compiler versions. Ideally, feed results back and update
> > > GCC and Clang.
> >
> > I was testing with GCC 11 on RHEL-9:
> > gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
> >
> > I was using the default one, 64B packets.
> >
> > I don't have time to perform these tests, but if you are willing to do
> > it I'll be happy to review the results.
> >
> > > DPDK doesn't need to be in the business of optimizing the C library.
> >
> > Certainly, but we already have an optimized version currently, so not
> > much to do now on our side. When C library implementations are on
> > par, we should definitely use them by default.
> >
> 
> I think it's not so much about optimized versus non-optimized at this
> point. It's just that cc/libc memcpy sometimes performs better than
> RTE memcpy, and sometimes doesn't.
> 
> For virtio, a single memory copy in
> lib/vhost/virtio_net.c:do_data_copy_enqueue()
> is responsible for >95% of the performance regression introduced by
> the cc memcpy patch for small packets on Intel P-cores.
> 
> I'm not so sure this performance regression will go away in newer
> compilers. PGO would certainly help, but PGO is a hassle.
> 
> One way to fix this issue would be to introduce a custom,
> memcpy()-based packet copying routine. I tried the below patch, with
> the following results:
> 
> Raptor Lake @ 3,2 GHz
> GCC 12
> 
> 64 bytes packets
> Core  Mode              Mpps
> ----------------------------
> E     RTE memcpy        9.5
> E     cc memcpy         9.7
> E     cc memcpy+pktcpy  9.0
> 
> P     RTE memcpy        16.4
> P     cc memcpy         13.5
> P     cc memcpy+pktcpy  16.2
> 
> 1500 bytes
> Core  Mode              Mpps
> ----------------------------
> P    RTE memcpy         5.8
> P    cc memcpy          5.9
> P    cc memcpy+pktcpy   5.9
> 
> As you can see, most of the regression is eliminated, at the cost of
> worse E-core performance. I didn't look at the generated code, but one
> could suspect heavy use of wide SIMD is to blame, which E-cores don't
> necessarily benefit from.
> 
> The below prototype assumes the source and destination buffers are
> 16-byte aligned. Does that always hold?

Perhaps always for this specific function; I don't know.
Not generally *always*, but I guess in many cases packet copies would have 64-byte aligned pointers, but not sizes.
Unless explicitly stated by the developer, it is unsafe to make assumptions about alignment.
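As a purely illustrative sketch (not DPDK code; all names are my own), one way to avoid baking the assumption in is to verify alignment at run time before promising it to the compiler, falling back to a plain memcpy() otherwise:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative, not DPDK API: check the pointer's low bits. */
static inline int
is_aligned16(const void *p)
{
	return ((uintptr_t)p & (uintptr_t)15) == 0;
}

static inline void
checked_pktcpy(void *dst, const void *src, size_t len)
{
	if (is_aligned16(dst) && is_aligned16(src)) {
		/* Alignment verified; now it is safe to tell the compiler. */
		void *d = __builtin_assume_aligned(dst, 16);
		const void *s = __builtin_assume_aligned(src, 16);

		memcpy(d, s, len);
	} else {
		memcpy(dst, src, len);
	}
}
```

The check costs a couple of cycles per copy, so whether it pays off depends on how often the fast path is actually taken.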

A future rte_memcpy() function might take flags with explicit alignment information for optimized copying, as was part of my non-temporal memcpy(). (The development on this is still on hold.)
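To make the idea concrete, here is a hypothetical sketch of what such a flags-taking variant could look like. The function name memcpy_ex and the flag macros are my invention for illustration, not an actual or proposed DPDK API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits carrying the caller's alignment knowledge. */
#define MEMCPY_EX_F_DST_ALIGN16	(1u << 0)
#define MEMCPY_EX_F_SRC_ALIGN16	(1u << 1)

/* Hypothetical sketch: the caller states alignment explicitly, so the
 * copy routine never has to guess or assume. */
static inline void
memcpy_ex(void *dst, const void *src, size_t len, unsigned int flags)
{
	if ((flags & MEMCPY_EX_F_DST_ALIGN16) &&
	    (flags & MEMCPY_EX_F_SRC_ALIGN16)) {
		void *d = __builtin_assume_aligned(dst, 16);
		const void *s = __builtin_assume_aligned(src, 16);

		/* The compiler may now emit aligned wide loads/stores. */
		memcpy(d, s, len);
	} else {
		memcpy(dst, src, len);
	}
}
```

Since the flags are compile-time constants at most call sites, the branch folds away after inlining and each caller gets the variant matching what it can guarantee.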

> 
> I'm sure one could further improve performance using context-specific
> information, such as packets always being >= 64 bytes. One could also
> consider having special cases, maybe for 64 bytes and MTU-sized
> packets. Such are always a hassle when you try to characterize
> performance though.

Absolutely!

This got me thinking:
These tests are run with 64-byte packets only.
Perhaps branch prediction pollutes the results by optimizing branches in the copy routine for all packets being 64 bytes.

You really should be testing with IMIX or random packet sizes.

In my experience, most internet packets are large (> 1024 bytes, but not 1514 bytes, due to QUIC's conservative max packet size), closely followed by 64-byte (excl. any VLAN tags) packets; only a minority of packets are medium-sized.
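For such a benchmark, packet sizes could be drawn from a simple IMIX-like distribution; the classic 7:4:1 mix of 64/594/1518-byte packets is just one commonly cited example, and the function name below is illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative sketch: pick packet sizes from the classic "simple IMIX"
 * distribution, 7 x 64 B, 4 x 594 B and 1 x 1518 B per 12 packets. */
static size_t
imix_pkt_size(void)
{
	static const size_t sizes[12] = {
		64, 64, 64, 64, 64, 64, 64,
		594, 594, 594, 594,
		1518
	};

	return sizes[rand() % 12];
}
```

Randomizing the sequence (rather than cycling through it) also keeps the branch predictor from learning a fixed size pattern, which is exactly the pollution effect suspected above.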

> 
> diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> index 370402d849..7b595a6622 100644
> --- a/lib/vhost/virtio_net.c
> +++ b/lib/vhost/virtio_net.c
> @@ -231,6 +231,26 @@ vhost_async_dma_check_completed(struct virtio_net
> *dev, int16_t dma_id, uint16_t
>  	return nr_copies;
>  }
> 
> +static inline void
> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
> +{
> +	void *dst = __builtin_assume_aligned(in_dst, 16);
> +	const void *src = __builtin_assume_aligned(in_src, 16);
> +
> +	if (len <= 256) {
> +		size_t left;
> +
> +		for (left = len; left >= 32; left -= 32) {
> +			memcpy(dst, src, 32);
> +			dst = RTE_PTR_ADD(dst, 32);
> +			src = RTE_PTR_ADD(src, 32);
> +		}
> +
> +		memcpy(dst, src, left);
> +	} else
> +		memcpy(dst, src, len);
> +}
> +
>  static inline void
>  do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue
> *vq)
>  	__rte_shared_locks_required(&vq->iotlb_lock)
> @@ -240,7 +260,7 @@ do_data_copy_enqueue(struct virtio_net *dev, struct
> vhost_virtqueue *vq)
>  	int i;
> 
>  	for (i = 0; i < count; i++) {
> -		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
> +		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
>  		vhost_log_cache_write_iova(dev, vq, elem[i].log_addr,
>  					   elem[i].len);
>  		PRINT_PACKET(dev, (uintptr_t)elem[i].dst, elem[i].len, 0);
> @@ -257,7 +277,7 @@ do_data_copy_dequeue(struct vhost_virtqueue *vq)
>  	int i;
> 
>  	for (i = 0; i < count; i++)
> -		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
> +		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
> 
>  	vq->batch_copy_nb_elems = 0;
>  }
> 
> 
> 
> > Maxime
> >

