From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 26 Jun 2024 20:47:56 +0200
From: Mattias Rönnblom
To: Maxime Coquelin
Cc: Stephen Hemminger, Mattias Rönnblom, dev@dpdk.org, Morten Brørup, Abdullah Sevincer, Pavan Nikhilesh, David Hunt, Vladimir Medvedkin, Bruce Richardson
Subject: Re: [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy
References: <20240620115027.420304-2-mattias.ronnblom@ericsson.com> <20240620175731.420639-1-mattias.ronnblom@ericsson.com> <3eebd7f7-9ba2-424c-80d1-6efa8945641d@redhat.com> <20240626075841.5e63e7c0@hermes.local>
List-Id: DPDK patches and discussions
On Wed, Jun 26, 2024 at 05:24:04PM +0200, Maxime Coquelin wrote:
>
>
> On 6/26/24 16:58, Stephen Hemminger wrote:
> > On Wed, 26 Jun 2024 10:37:31 +0200
> > Maxime Coquelin wrote:
> >
> > > On 6/25/24 21:27, Mattias Rönnblom wrote:
> > > > On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:
> > > > > Hi Mattias,
> > > > >
> > > > > On 6/20/24 19:57, Mattias Rönnblom wrote:
> > > > > > This patch set makes DPDK library, driver, and application code use the
> > > > > > compiler/libc memcpy() by default when functions in <rte_memcpy.h> are
> > > > > > invoked.
> > > > > >
> > > > > > The various custom DPDK rte_memcpy() implementations may be retained
> > > > > > by means of a build-time option.
> > > > > >
> > > > > > This patch set only makes a difference on x86, PPC and ARM. Loongarch
> > > > > > and RISCV already used compiler/libc memcpy().
> > > > >
> > > > > It indeed makes a difference on x86!
> > > > >
> > > > > Just tested latest main with and without your series on
> > > > > Intel(R) Xeon(R) Gold 6438N.
> > > > >
> > > > > The test is a simple IO loop between a Vhost PMD and a Virtio-user PMD:
> > > > > # dpdk-testpmd -l 4-6 --file-prefix=virtio1 --no-pci --vdev 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1'
> > > > > --single-file-segments -- -i
> > > > > testpmd> start
> > > > >
> > > > > # dpdk-testpmd -l 8-10 --file-prefix=vhost1 --no-pci --vdev
> > > > > 'net_vhost0,iface=vhost-net,client=1' --single-file-segments -- -i
> > > > > testpmd> start tx_first 32
> > > > >
> > > > > Latest main: 14.5Mpps
> > > > > Latest main + this series: 10Mpps
> > > >
> > > > I ran the above benchmark on my Raptor Lake desktop (locked to 3.2
> > > > GHz). GCC 12.3.0.
> > > >
> > > > Core  use_cc_memcpy  Mpps
> > > > ---------------------------
> > > > E     false           9.5
> > > > E     true            9.7
> > > > P     false          16.4
> > > > P     true           13.5
> > > >
> > > > On the P-cores, there's a significant performance regression, although
> > > > not as bad as the one you see on your Sapphire Rapids Xeon. On the
> > > > E-cores, there's actually a slight performance gain.
> > > >
> > > > The virtio PMD does not directly invoke rte_memcpy() or anything else
> > > > from <rte_memcpy.h>, but rather uses memcpy(), so I'm not sure I
> > > > understand what's going on here. Does the virtio driver delegate some
> > > > performance-critical task to some module that in turn uses
> > > > rte_memcpy()?
> > >
> > > This is because Vhost is the bottleneck here, not the Virtio driver.
> > > Indeed, the virtqueue memory belongs to the Virtio driver and the
> > > descriptor buffers are Virtio's mbufs, so not many memcpy's are done
> > > there.
> > >
> > > Vhost, however, is a heavy memcpy user, as all the descriptor buffers
> > > are copied to/from its mbufs.
> >
> > Would be good to know the size (if small, it is inlining that matters, or
> > maybe alignment matters), and have test results for multiple compiler versions.
> > Ideally, feed results back and update GCC and Clang.
>
> I was testing with GCC 11 on RHEL-9:
> gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
>
> I was using the default one, 64B packets.
>
> I don't have time to perform these tests, but if you are willing to do
> it I'll be happy to review the results.
>
> > DPDK doesn't need to be in the optimized C library space.
>
> Certainly, but we already have an optimized version, so not
> much to do now on our side. When C library implementations are on
> par, we should definitely use them by default.
>

I think it's not so much about optimized versus non-optimized at this
point. It's just that cc/libc memcpy sometimes performs better than RTE
memcpy, and sometimes doesn't.
For virtio, a single memory copy in
lib/vhost/virtio_net.c:do_data_copy_enqueue() is responsible for >95% of
the performance regression introduced by the cc memcpy patch for small
packets on Intel P-cores.

I'm not so sure this performance regression will go away in newer
compilers. PGO would certainly help, but PGO is a hassle.

One way to fix this issue would be to introduce a custom,
memcpy()-based packet copying routine. I tried the below patch, with
the following results:

Raptor Lake @ 3.2 GHz
GCC 12

64-byte packets

Core  Mode              Mpps
----------------------------
E     RTE memcpy         9.5
E     cc memcpy          9.7
E     cc memcpy+pktcpy   9.0
P     RTE memcpy        16.4
P     cc memcpy         13.5
P     cc memcpy+pktcpy  16.2

1500-byte packets

Core  Mode              Mpps
----------------------------
P     RTE memcpy         5.8
P     cc memcpy          5.9
P     cc memcpy+pktcpy   5.9

As you can see, most of the regression is eliminated, at the cost of
somewhat worse E-core performance.

I didn't look at the generated code, but one could suspect heavy use of
wide SIMD is to blame, which the E-cores don't necessarily benefit from.

The below prototype assumes the source and destination buffers are
16-byte aligned. Does that always hold?

I'm sure one could further improve performance using context-specific
information, such as packets always being >= 64 bytes. One could also
consider having special cases, maybe for 64-byte and MTU-sized packets.
Such special cases are always a hassle when you try to characterize
performance, though.
diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
index 370402d849..7b595a6622 100644
--- a/lib/vhost/virtio_net.c
+++ b/lib/vhost/virtio_net.c
@@ -231,6 +231,26 @@ vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t
 	return nr_copies;
 }
 
+static inline void
+pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
+{
+	void *dst = __builtin_assume_aligned(in_dst, 16);
+	const void *src = __builtin_assume_aligned(in_src, 16);
+
+	if (len <= 256) {
+		size_t left;
+
+		for (left = len; left >= 32; left -= 32) {
+			memcpy(dst, src, 32);
+			dst = RTE_PTR_ADD(dst, 32);
+			src = RTE_PTR_ADD(src, 32);
+		}
+
+		memcpy(dst, src, left);
+	} else
+		memcpy(dst, src, len);
+}
+
 static inline void
 do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
 	__rte_shared_locks_required(&vq->iotlb_lock)
@@ -240,7 +260,7 @@ do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
 	int i;
 
 	for (i = 0; i < count; i++) {
-		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
+		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
 		vhost_log_cache_write_iova(dev, vq, elem[i].log_addr,
 					   elem[i].len);
 		PRINT_PACKET(dev, (uintptr_t)elem[i].dst, elem[i].len, 0);
@@ -257,7 +277,7 @@ do_data_copy_dequeue(struct vhost_virtqueue *vq)
 	int i;
 
 	for (i = 0; i < count; i++)
-		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
+		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
 
 	vq->batch_copy_nb_elems = 0;
 }

> Maxime
>