From: Morten Brørup
To: Bruce Richardson
Subject: RE: [PATCH v2] eal/x86: improve rte_memcpy const size 16 performance
Date: Thu, 4 Apr 2024 13:19:54 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F35C@smartserver.smartshare.dk>
References: <20240302234812.9137-1-mb@smartsharesystems.com> <20240303094621.16404-1-mb@smartsharesystems.com>
List-Id: DPDK patches and discussions

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 4 April 2024 12.07
>
> On Sun, Mar 03, 2024 at 10:46:21AM +0100, Morten Brørup wrote:
> > When the rte_memcpy() size is 16, the same 16 bytes are copied twice.
> > In the case where the size is known to be 16 at build time, omit the
> > duplicate copy.
> >
> > Reduced the amount of effectively copy-pasted code by using #ifdef
> > inside functions instead of outside functions.
> >
> > Suggested-by: Stephen Hemminger
> > Signed-off-by: Morten Brørup
>
> Changes in general look good to me. Comments inline below.
>
> /Bruce
>
> > ---
> > v2:
> > * For GCC, version 11 is required for proper AVX handling;
> >   if older GCC version, treat AVX as SSE.
> >   Clang does not have this issue.
> >   Note: Original code always treated AVX as SSE, regardless of compiler.
> > * Do not add copyright. (Stephen Hemminger)
> > ---
> >  lib/eal/x86/include/rte_memcpy.h | 231 ++++++++-----------------------
> >  1 file changed, 56 insertions(+), 175 deletions(-)
> >
> > diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> > index 72a92290e0..d1df841f5e 100644
> > --- a/lib/eal/x86/include/rte_memcpy.h
> > +++ b/lib/eal/x86/include/rte_memcpy.h
> > @@ -91,14 +91,6 @@ rte_mov15_or_less(void *dst, const void *src, size_t n)
> >  	return ret;
> >  }
> >
> > -#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> > -
> > -#define ALIGNMENT_MASK 0x3F
> > -
> > -/**
> > - * AVX512 implementation below
> > - */
> > -
> >  /**
> >   * Copy 16 bytes from one location to another,
> >   * locations should not overlap.
> > @@ -119,10 +111,16 @@ rte_mov16(uint8_t *dst, const uint8_t *src)
> >  static __rte_always_inline void
> >  rte_mov32(uint8_t *dst, const uint8_t *src)
> >  {
> > +#if (defined __AVX512F__ && defined RTE_MEMCPY_AVX512) || defined __AVX2__ || \
> > +		(defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 110000)))
>
> I think we can drop the AVX512 checks here, since I'm not aware of any
> system where we'd have AVX512 but not AVX2 available, so just checking for
> AVX2 support should be sufficient.

RTE_MEMCPY_AVX512 must be manually defined at build time to enable AVX512:
https://elixir.bootlin.com/dpdk/latest/source/lib/eal/include/generic/rte_memcpy.h#L98

Without it, the AVX2 version will be used, regardless of whether the CPU has AVX512.

Also, there are some binutils bugs that might disable compilation for AVX512:
https://elixir.bootlin.com/dpdk/latest/source/config/x86/meson.build#L4
https://elixir.bootlin.com/dpdk/latest/source/config/x86/meson.build#L17

>
> On the final compiler-based check, I don't strongly object to it, but I
> just wonder as to its real value. AVX2 was first introduced by Intel over 10
> years ago, and (from what I find on Wikipedia) it's been in AMD CPUs since
> ~2015. While we did have CPUs still being produced without AVX2 since that
> time, they generally didn't have AVX1 either, only having SSE instructions.
> Therefore the number of systems which require this additional check is
> likely very small at this stage.
> That said, I'm ok to either keep or omit it at your choice.

I kept it for consistency, and to support older compilers still officially supported by DPDK.

I don't feel qualified to change support for CPU features; I'll leave that to the CPU vendors.
Also, I have no clue what has been produced by Intel and AMD. :-)

> If you do keep
> it, how about putting the check once at the top of the file and using a
> single short define instead for the multiple places it's used, e.g.
>
> #if (defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 110000)))
> #define RTE_MEMCPY_AVX2
> #endif

Much of the code reorganization in this patch was done with the intention of improving readability.
And I don't think this suggestion improves readability, especially considering that RTE_MEMCPY_AVX512 is something manually defined.

However, I get your point; if the conditional was very long or very complex, I might agree to a "shadow" definition to keep it short.
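To illustrate what such a "shadow" define could look like - purely a sketch of Bruce's suggestion, and the RTE_MEMCPY_AVX2_OR_BETTER name is made up here, it is not an existing DPDK macro:

/* Once, near the top of rte_memcpy.h: */
#if (defined __AVX512F__ && defined RTE_MEMCPY_AVX512) || defined __AVX2__ || \
		(defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 110000)))
#define RTE_MEMCPY_AVX2_OR_BETTER 1
#endif

/* ...which would shorten the per-function conditionals, e.g.: */
static __rte_always_inline void
rte_mov32(uint8_t *dst, const uint8_t *src)
{
#ifdef RTE_MEMCPY_AVX2_OR_BETTER
	__m256i ymm0;

	ymm0 = _mm256_loadu_si256((const __m256i *)src);
	_mm256_storeu_si256((__m256i *)dst, ymm0);
#else /* SSE implementation */
	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
#endif
}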
>
>
> >  	__m256i ymm0;
> >
> >  	ymm0 = _mm256_loadu_si256((const __m256i *)src);
> >  	_mm256_storeu_si256((__m256i *)dst, ymm0);
> > +#else /* SSE implementation */
> > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > +#endif
> >  }
> >
> >  /**
> > @@ -132,10 +130,15 @@ rte_mov32(uint8_t *dst, const uint8_t *src)
> >  static __rte_always_inline void
> >  rte_mov64(uint8_t *dst, const uint8_t *src)
> >  {
> > +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> >  	__m512i zmm0;
> >
> >  	zmm0 = _mm512_loadu_si512((const void *)src);
> >  	_mm512_storeu_si512((void *)dst, zmm0);
> > +#else /* AVX2, AVX & SSE implementation */
> > +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > +#endif
> >  }
> >
> >  /**
> > @@ -156,12 +159,18 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
> >  static __rte_always_inline void
> >  rte_mov256(uint8_t *dst, const uint8_t *src)
> >  {
> > -	rte_mov64(dst + 0 * 64, src + 0 * 64);
> > -	rte_mov64(dst + 1 * 64, src + 1 * 64);
> > -	rte_mov64(dst + 2 * 64, src + 2 * 64);
> > -	rte_mov64(dst + 3 * 64, src + 3 * 64);
> > +	rte_mov128(dst + 0 * 128, src + 0 * 128);
> > +	rte_mov128(dst + 1 * 128, src + 1 * 128);
> >  }
> >
> > +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> > +
> > +/**
> > + * AVX512 implementation below
> > + */
> > +
> > +#define ALIGNMENT_MASK 0x3F
> > +
> >  /**
> >   * Copy 128-byte blocks from one location to another,
> >   * locations should not overlap.
> > @@ -231,12 +240,22 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
> >  	/**
> >  	 * Fast way when copy size doesn't exceed 512 bytes
> >  	 */
> > +	if (__builtin_constant_p(n) && n == 32) {
> > +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > +		return ret;
> > +	}
>
> There's an outstanding patchset from Stephen to replace all use of
> rte_memcpy with a constant parameter with an actual call to regular memcpy.
> On a wider scale, should we not look to do something similar in this file,
> and have calls to rte_memcpy with a constant parameter always turn into a
> call to regular memcpy? We used to have such a macro in older DPDK, e.g.
> from DPDK 1.8:
>
> http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/include/arch/x86/rte_memcpy.h?h=v1.8.0#n171
>
> This would eliminate the need to put constant_p checks all through the
> code.

The old macro in DPDK 1.8 was removed with the description "Remove slow glibc call for constant copies":
https://git.dpdk.org/dpdk/commit/lib/librte_eal/common/include/arch/x86/rte_memcpy.h?id=9144d6bcdefd5096a9f3f89a3ce433a54ed84475

Stephen believes that the memcpy() built-ins provided by compilers are faster than rte_memcpy() for constant sizes. I'm not convinced.

Such a change should be backed up by performance tests, preferably with all supported compilers - especially since the old compilers that come with some of the supported distros might not be as good as we would hope.
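For reference, the kind of constant-size dispatch being discussed would look roughly like the sketch below. This is only an illustration of the concept - the wrapper name is hypothetical, and it is neither the old DPDK 1.8 macro nor Stephen's patchset - and it is exactly the kind of change that would need the performance testing mentioned above:

/* Illustrative sketch only: route compile-time constant sizes to the
 * compiler's builtin memcpy() and runtime-variable sizes to the
 * hand-written copy routine. rte_memcpy_generic() is the existing
 * function in lib/eal/x86/include/rte_memcpy.h; the wrapper name is
 * made up for this example. */
#include <string.h>

static __rte_always_inline void *
rte_memcpy_const_sketch(void *dst, const void *src, size_t n)
{
	if (__builtin_constant_p(n))
		return memcpy(dst, src, n); /* the compiler may expand this inline */
	return rte_memcpy_generic(dst, src, n);
}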