Subject: RE: [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy
Date: Wed, 26 Jun 2024 22:16:06 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F55F@smartserver.smartshare.dk>
References: <20240620115027.420304-2-mattias.ronnblom@ericsson.com> <20240620175731.420639-1-mattias.ronnblom@ericsson.com> <3eebd7f7-9ba2-424c-80d1-6efa8945641d@redhat.com> <20240626075841.5e63e7c0@hermes.local>
From: Morten Brørup
To: Mattias Rönnblom, Maxime Coquelin
Cc: Stephen Hemminger, Mattias Rönnblom, Abdullah Sevincer, Pavan Nikhilesh, David Hunt, Vladimir Medvedkin, Bruce Richardson
List-Id: DPDK patches and discussions (dev@dpdk.org)
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 26 June 2024 20.48
> 
> On Wed, Jun 26, 2024 at 05:24:04PM +0200, Maxime Coquelin wrote:
> >
> >
> > On 6/26/24 16:58, Stephen Hemminger wrote:
> > > On Wed, 26 Jun 2024 10:37:31 +0200
> > > Maxime Coquelin wrote:
> > >
> > > > On 6/25/24 21:27, Mattias Rönnblom wrote:
> > > > > On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:
> > > > > > Hi Mattias,
> > > > > >
> > > > > > On 6/20/24 19:57, Mattias Rönnblom wrote:
> > > > > > > This patch set makes DPDK library, driver, and application code use the
> > > > > > > compiler/libc memcpy() by default when functions in <rte_memcpy.h> are
> > > > > > > invoked.
> > > > > > >
> > > > > > > The various custom DPDK rte_memcpy() implementations may be retained
> > > > > > > by means of a build-time option.
> > > > > > >
> > > > > > > This patch set only makes a difference on x86, PPC and ARM. Loongarch
> > > > > > > and RISCV already used compiler/libc memcpy().
> > > > > >
> > > > > > It indeed makes a difference on x86!
> > > > > >
> > > > > > Just tested latest main with and without your series on
> > > > > > Intel(R) Xeon(R) Gold 6438N.
> > > > > >
> > > > > > The test is a simple IO loop between a Vhost PMD and a Virtio-user PMD:
> > > > > > # dpdk-testpmd -l 4-6 --file-prefix=virtio1 --no-pci --vdev
> > > > > > 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1'
> > > > > > --single-file-segments -- -i
> > > > > > testpmd> start
> > > > > >
> > > > > > # dpdk-testpmd -l 8-10 --file-prefix=vhost1 --no-pci --vdev
> > > > > > 'net_vhost0,iface=vhost-net,client=1' --single-file-segments -- -i
> > > > > > testpmd> start tx_first 32
> > > > > >
> > > > > > Latest main: 14.5 Mpps
> > > > > > Latest main + this series: 10 Mpps
> > > > >
> > > > > I ran the above benchmark on my Raptor Lake desktop (locked to 3.2
> > > > > GHz). GCC 12.3.0.
> > > > >
> > > > > Core use_cc_memcpy Mpps
> > > > > E    false          9.5
> > > > > E    true           9.7
> > > > > P    false         16.4
> > > > > P    true          13.5
> > > > >
> > > > > On the P-cores, there's a significant performance regression, although
> > > > > not as bad as the one you see on your Sapphire Rapids Xeon. On the
> > > > > E-cores, there's actually a slight performance gain.
> > > > >
> > > > > The virtio PMD does not directly invoke rte_memcpy() or anything else
> > > > > from <rte_memcpy.h>, but rather uses memcpy(), so I'm not sure I
> > > > > understand what's going on here. Does the virtio driver delegate some
> > > > > performance-critical task to some module that in turn uses
> > > > > rte_memcpy()?
> > > >
> > > > This is because Vhost is the bottleneck here, not the Virtio driver.
> > > > Indeed, the virtqueue memory belongs to the Virtio driver and the
> > > > descriptor buffers are Virtio's mbufs, so not many memcpy's are done
> > > > there.
> > > >
> > > > Vhost, however, is a heavy memcpy user, as all the descriptor buffers
> > > > are copied to/from its mbufs.
> > >
> > > Would be good to know the size (if small, it is inlining that matters, or
> > > maybe alignment matters), and have test results for multiple compiler versions.
> > > Ideally, feed results back and update GCC and Clang.
> >
> > I was testing with GCC 11 on RHEL-9:
> > gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
> >
> > I was using the default size, 64 B packets.
> >
> > I don't have time to perform these tests, but if you are willing to do
> > it I'll be happy to review the results.
> >
> > > DPDK doesn't need to be in the optimized C library space.
> >
> > Certainly, but we already have an optimized version currently, so not
> > much to do now on our side. When C library implementations are on
> > par, we should definitely use them by default.
> >
> 
> I think it's not so much about optimized versus non-optimized at this
> point. It's just that cc/libc memcpy sometimes performs better than
> RTE memcpy, and sometimes doesn't.
> 
> For virtio, a single memory copy in
> lib/vhost/virtio_net.c:do_data_copy_enqueue()
> is responsible for >95% of the performance regression introduced by
> the cc memcpy patch for small packets on Intel P-cores.
> 
> I'm not so sure this performance regression will go away in newer
> compilers. PGO would certainly help, but PGO is a hassle.
> 
> One way to fix this issue would be to introduce a custom,
> memcpy()-based packet copying routine. I tried the below patch, with
> the following results:
> 
> Raptor Lake @ 3.2 GHz
> GCC 12
> 
> 64 byte packets
> Core Mode              Mpps
> ----------------------------
> E    RTE memcpy         9.5
> E    cc memcpy          9.7
> E    cc memcpy+pktcpy   9.0
> 
> P    RTE memcpy        16.4
> P    cc memcpy         13.5
> P    cc memcpy+pktcpy  16.2
> 
> 1500 byte packets
> Core Mode              Mpps
> ----------------------------
> P    RTE memcpy         5.8
> P    cc memcpy          5.9
> P    cc memcpy+pktcpy   5.9
> 
> As you can see, most of the regression is eliminated, at the cost of
> worse E-core performance.
> I didn't look at the generated code, but one
> could suspect heavy use of wide SIMD is to blame, which E-cores don't
> necessarily benefit from.
> 
> The below prototype assumes the source and destination buffers are
> 16-byte aligned. Does that always hold?

Perhaps always for this specific function; I don't know.

Not generally *always*, but I guess in many cases packet copies would have 64-byte aligned pointers, but not sizes.

Unless explicitly stated by the developer, it is unsafe to make assumptions about alignment.

A future rte_memcpy() function might take flags with explicit alignment information for optimized copying, as was part of my non-temporal memcpy(). (The development on this is still on hold.)

> 
> I'm sure one could further improve performance using context-specific
> information, such as packets always being >= 64 bytes. One could also
> consider having special cases, maybe for 64 byte and MTU-sized
> packets. Such special cases are always a hassle when you try to
> characterize performance, though.

Absolutely!

This got me thinking: These tests were run with 64 byte packets only. Perhaps branch prediction pollutes the results, by optimizing branches in the copy routine for all packets being 64 bytes. You really should be testing with IMIX or random packet sizes.

In my experience, most internet packets are large (> 1024 bytes, but not 1514 bytes, due to QUIC's conservative max packet size), closely followed by 64 byte (excl. any VLAN tags) packets; only a minority of packets are medium-sized.
> 
> diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> index 370402d849..7b595a6622 100644
> --- a/lib/vhost/virtio_net.c
> +++ b/lib/vhost/virtio_net.c
> @@ -231,6 +231,26 @@ vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t
>  	return nr_copies;
>  }
> 
> +static inline void
> +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
> +{
> +	void *dst = __builtin_assume_aligned(in_dst, 16);
> +	const void *src = __builtin_assume_aligned(in_src, 16);
> +
> +	if (len <= 256) {
> +		size_t left;
> +
> +		for (left = len; left >= 32; left -= 32) {
> +			memcpy(dst, src, 32);
> +			dst = RTE_PTR_ADD(dst, 32);
> +			src = RTE_PTR_ADD(src, 32);
> +		}
> +
> +		memcpy(dst, src, left);
> +	} else
> +		memcpy(dst, src, len);
> +}
> +
>  static inline void
>  do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
>  	__rte_shared_locks_required(&vq->iotlb_lock)
> @@ -240,7 +260,7 @@ do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
>  	int i;
> 
>  	for (i = 0; i < count; i++) {
> -		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
> +		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
>  		vhost_log_cache_write_iova(dev, vq, elem[i].log_addr,
>  				elem[i].len);
>  		PRINT_PACKET(dev, (uintptr_t)elem[i].dst, elem[i].len, 0);
> @@ -257,7 +277,7 @@ do_data_copy_dequeue(struct vhost_virtqueue *vq)
>  	int i;
> 
>  	for (i = 0; i < count; i++)
> -		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
> +		pktcpy(elem[i].dst, elem[i].src, elem[i].len);
> 
> 	vq->batch_copy_nb_elems = 0;
> }
> 
> 
> 
> > Maxime
> >