From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 27 Jun 2024 08:10:19 -0700
From: Stephen Hemminger
To: Mattias Rönnblom
Cc: Morten Brørup, Maxime Coquelin, Mattias Rönnblom, dev@dpdk.org,
 Abdullah Sevincer, Pavan Nikhilesh, David Hunt, Vladimir Medvedkin,
 Bruce Richardson
Subject: Re: [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy
Message-ID: <20240627081019.43b2779a@hermes.local>
References: <20240620115027.420304-2-mattias.ronnblom@ericsson.com>
 <20240620175731.420639-1-mattias.ronnblom@ericsson.com>
 <3eebd7f7-9ba2-424c-80d1-6efa8945641d@redhat.com>
 <20240626075841.5e63e7c0@hermes.local>
 <98CBD80474FA8B44BF855DF32C47DC35E9F55F@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

On Thu, 27 Jun 2024 13:06:22 +0200
Mattias Rönnblom wrote:

> On Wed, Jun 26, 2024 at 10:16:06PM +0200, Morten Brørup wrote:
> > > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > > Sent: Wednesday, 26 June 2024 20.48
> > >
> > > On Wed, Jun 26, 2024 at 05:24:04PM +0200, Maxime Coquelin wrote:
> > > >
> > > > On 6/26/24 16:58, Stephen Hemminger wrote:
> > > > > On Wed, 26 Jun 2024 10:37:31 +0200
> > > > > Maxime Coquelin wrote:
> > > > >
> > > > > > On 6/25/24 21:27, Mattias Rönnblom wrote:
> > > > > > > On Tue, Jun 25, 2024 at 05:29:35PM +0200, Maxime Coquelin wrote:
> > > > > > > > Hi Mattias,
> > > > > > > >
> > > > > > > > On 6/20/24 19:57, Mattias Rönnblom wrote:
> > > > > > > > > This patch set makes DPDK library, driver, and application
> > > > > > > > > code use the compiler/libc memcpy() by default when functions
> > > > > > > > > in <rte_memcpy.h> are invoked.
> > > > > > > > >
> > > > > > > > > The various custom DPDK rte_memcpy() implementations may be
> > > > > > > > > retained by means of a build-time option.
> > > > > > > > >
> > > > > > > > > This patch set only makes a difference on x86, PPC and ARM.
> > > > > > > > > Loongarch and RISCV already used compiler/libc memcpy().
> > > > > > > >
> > > > > > > > It indeed makes a difference on x86!
> > > > > > > >
> > > > > > > > Just tested latest main with and without your series on
> > > > > > > > Intel(R) Xeon(R) Gold 6438N.
> > > > > > > >
> > > > > > > > The test is a simple IO loop between a Vhost PMD and a
> > > > > > > > Virtio-user PMD:
> > > > > > > > # dpdk-testpmd -l 4-6 --file-prefix=virtio1 --no-pci --vdev
> > > > > > > > 'net_virtio_user0,mac=00:01:02:03:04:05,path=./vhost-net,server=1,mrg_rxbuf=1,in_order=1'
> > > > > > > > --single-file-segments -- -i
> > > > > > > > testpmd> start
> > > > > > > >
> > > > > > > > # dpdk-testpmd -l 8-10 --file-prefix=vhost1 --no-pci --vdev
> > > > > > > > 'net_vhost0,iface=vhost-net,client=1' --single-file-segments -- -i
> > > > > > > > testpmd> start tx_first 32
> > > > > > > >
> > > > > > > > Latest main: 14.5Mpps
> > > > > > > > Latest main + this series: 10Mpps
> > > > > > >
> > > > > > > I ran the above benchmark on my Raptor Lake desktop (locked to
> > > > > > > 3,2 GHz). GCC 12.3.0.
> > > > > > >
> > > > > > > Core use_cc_memcpy Mpps
> > > > > > > E    false          9.5
> > > > > > > E    true           9.7
> > > > > > > P    false         16.4
> > > > > > > P    true          13.5
> > > > > > >
> > > > > > > On the P-cores, there's a significant performance regression,
> > > > > > > although not as bad as the one you see on your Sapphire Rapids
> > > > > > > Xeon. On the E-cores, there's actually a slight performance gain.
> > > > > > >
> > > > > > > The virtio PMD does not directly invoke rte_memcpy() or anything
> > > > > > > else from <rte_memcpy.h>, but rather uses memcpy(), so I'm not
> > > > > > > sure I understand what's going on here. Does the virtio driver
> > > > > > > delegate some performance-critical task to some module that in
> > > > > > > turn uses rte_memcpy()?
> > > > > >
> > > > > > This is because Vhost is the bottleneck here, not the Virtio driver.
> > > > > > Indeed, the virtqueue memory belongs to the Virtio driver and the
> > > > > > descriptor buffers are Virtio's mbufs, so not many memcpy's are
> > > > > > done there.
> > > > > >
> > > > > > Vhost, however, is a heavy memcpy user, as all the descriptor
> > > > > > buffers are copied to/from its mbufs.
> > > > >
> > > > > Would be good to know the size (if small, it is inlining that
> > > > > matters, or maybe alignment matters), and have test results for
> > > > > multiple compiler versions.
> > > > > Ideally, feed results back and update GCC and Clang.
> > > >
> > > > I was testing with GCC 11 on RHEL-9:
> > > > gcc (GCC) 11.4.1 20231218 (Red Hat 11.4.1-3)
> > > >
> > > > I was using the default size, 64B packets.
> > > >
> > > > I don't have time to perform these tests, but if you are willing to
> > > > do it I'll be happy to review the results.
> > > >
> > > > > DPDK doesn't need to be in the optimized C library space.
> > > >
> > > > Certainly, but we already have an optimized version currently, so not
> > > > much to do now on our side. When C library implementations are on
> > > > par, we should definitely use them by default.
> > > >
> > >
> > > I think it's not so much about optimized versus non-optimized at this
> > > point. It's just that cc/libc memcpy sometimes performs better than
> > > RTE memcpy, and sometimes doesn't.
> > >
> > > For virtio, a single memory copy in
> > > lib/vhost/virtio_net.c:do_data_copy_enqueue()
> > > is responsible for >95% of the performance regression introduced by
> > > the cc memcpy patch for small packets on Intel P-cores.
> > >
> > > I'm not so sure this performance regression will go away in newer
> > > compilers. PGO would certainly help, but PGO is a hassle.
> > >
> > > One way to fix this issue would be to introduce a custom,
> > > memcpy()-based packet copying routine.
> > > I tried the below patch, with the following results:
> > >
> > > Raptor Lake @ 3,2 GHz
> > > GCC 12
> > >
> > > 64 bytes packets
> > > Core Mode             Mpps
> > > ----------------------------
> > > E    RTE memcpy        9.5
> > > E    cc memcpy         9.7
> > > E    cc memcpy+pktcpy  9.0
> > >
> > > P    RTE memcpy       16.4
> > > P    cc memcpy        13.5
> > > P    cc memcpy+pktcpy 16.2
> > >
> > > 1500 bytes
> > > Core Mode             Mpps
> > > ----------------------------
> > > P    RTE memcpy        5.8
> > > P    cc memcpy         5.9
> > > P    cc memcpy+pktcpy  5.9
> > >
> > > As you can see, most of the regression is eliminated, at the cost of
> > > worse E-core performance. I didn't look at the generated code, but one
> > > could suspect heavy use of wide SIMD is to blame, which E-cores don't
> > > necessarily benefit from.
> > >
> > > The below prototype assumes the source and destination buffers are
> > > 16-byte aligned. Does that always hold?
> >
> > Perhaps always for this specific function; I don't know.
> > Not generally *always*, but I guess in many cases packet copies would
> > have 64-byte aligned pointers, but not sizes.
> > Unless explicitly stated by the developer, it is unsafe to make
> > assumptions about alignment.
> >
>
> I meant always (for every packet) in DPDK virtio net.
>
> > A future rte_memcpy() function might take flags with explicit alignment
> > information for optimized copying, as was part of my non-temporal
> > memcpy(). (The development on this is still on hold.)
> >
>
> There is already a mechanism to express alignment in GCC-compatible
> compilers. See the below patch.
>
> > >
> > > I'm sure one could further improve performance using context-specific
> > > information, such as packets always being >= 64 bytes. One could also
> > > consider having special cases, maybe for 64 bytes and MTU-sized
> > > packets. Such are always a hassle when you try to characterize
> > > performance though.
> >
> > Absolutely!
> >
> > This got me thinking:
> > These tests are run with 64 byte packets only.
> > Perhaps branch prediction pollutes the results, by optimizing branches
> > in the copy routine for all packets being 64 byte.
> >
> > You really should be testing with IMIX or random packet sizes.
> >
>
> Sure, and it should also be a real app, not testpmd, on top of virtio.
>
> > In my experience, most internet packets are large (> 1024 byte, but not
> > 1514 byte due to QUIC's conservative max packet size), closely followed
> > by 64 byte (excl. any VLAN tags) packets; only a minority of packets
> > are medium size.
> >
> > >
> > > diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
> > > index 370402d849..7b595a6622 100644
> > > --- a/lib/vhost/virtio_net.c
> > > +++ b/lib/vhost/virtio_net.c
> > > @@ -231,6 +231,26 @@ vhost_async_dma_check_completed(struct virtio_net *dev, int16_t dma_id, uint16_t
> > >  	return nr_copies;
> > >  }
> > >
> > > +static inline void
> > > +pktcpy(void *restrict in_dst, const void *restrict in_src, size_t len)
> > > +{
> > > +	void *dst = __builtin_assume_aligned(in_dst, 16);
> > > +	const void *src = __builtin_assume_aligned(in_src, 16);
> > > +
> > > +	if (len <= 256) {
> > > +		size_t left;
> > > +
> > > +		for (left = len; left >= 32; left -= 32) {
> > > +			memcpy(dst, src, 32);
> > > +			dst = RTE_PTR_ADD(dst, 32);
> > > +			src = RTE_PTR_ADD(src, 32);
> > > +		}
> > > +
> > > +		memcpy(dst, src, left);
> > > +	} else
> > > +		memcpy(dst, src, len);
> > > +}
> > > +

Is alignment an optimization or a requirement? I can think of cases in
tunneling where the packet was received into an mbuf (aligned) but then
the application prepends a header, making the packet unaligned before
sending.