From: Morten Brørup
To: Bruce Richardson
Subject: RE: [PATCH v2] eal/x86: improve rte_memcpy const size 16 performance
Date: Thu, 4 Apr 2024 13:19:54 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F35C@smartserver.smartshare.dk>
References: <20240302234812.9137-1-mb@smartsharesystems.com> <20240303094621.16404-1-mb@smartsharesystems.com>
List-Id: DPDK patches and discussions

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 4 April 2024 12.07
>
> On Sun, Mar 03, 2024 at 10:46:21AM +0100, Morten Brørup wrote:
> > When the rte_memcpy() size is 16, the same 16 bytes are copied twice.
> > In the case where the size is known to be 16 at build time, omit the
> > duplicate copy.
> >
> > Reduced the amount of effectively copy-pasted code by using #ifdef
> > inside functions instead of outside functions.
> >
> > Suggested-by: Stephen Hemminger
> > Signed-off-by: Morten Brørup
>
> Changes in general look good to me. Comments inline below.
>
> /Bruce
>
> > ---
> > v2:
> > * For GCC, version 11 is required for proper AVX handling;
> >   if older GCC version, treat AVX as SSE.
> >   Clang does not have this issue.
> >   Note: Original code always treated AVX as SSE, regardless of compiler.
> > * Do not add copyright. (Stephen Hemminger)
> > ---
> >  lib/eal/x86/include/rte_memcpy.h | 231 ++++++++-----------------------
> >  1 file changed, 56 insertions(+), 175 deletions(-)
> >
> > diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> > index 72a92290e0..d1df841f5e 100644
> > --- a/lib/eal/x86/include/rte_memcpy.h
> > +++ b/lib/eal/x86/include/rte_memcpy.h
> > @@ -91,14 +91,6 @@ rte_mov15_or_less(void *dst, const void *src, size_t n)
> >  	return ret;
> >  }
> >
> > -#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> > -
> > -#define ALIGNMENT_MASK 0x3F
> > -
> > -/**
> > - * AVX512 implementation below
> > - */
> > -
> >  /**
> >   * Copy 16 bytes from one location to another,
> >   * locations should not overlap.
> > @@ -119,10 +111,16 @@ rte_mov16(uint8_t *dst, const uint8_t *src)
> >  static __rte_always_inline void
> >  rte_mov32(uint8_t *dst, const uint8_t *src)
> >  {
> > +#if (defined __AVX512F__ && defined RTE_MEMCPY_AVX512) || defined __AVX2__ || \
> > +		(defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 110000)))
>
> I think we can drop the AVX512 checks here, since I'm not aware of any
> system where we'd have AVX512 but not AVX2 available, so just checking for
> AVX2 support should be sufficient.

RTE_MEMCPY_AVX512 must be manually defined at build time to enable AVX512:
https://elixir.bootlin.com/dpdk/latest/source/lib/eal/include/generic/rte_memcpy.h#L98

Without it, the AVX2 version will be used, regardless of whether the CPU has AVX512.

Also, there are some binutils bugs that might disable compilation for AVX512:
https://elixir.bootlin.com/dpdk/latest/source/config/x86/meson.build#L4
https://elixir.bootlin.com/dpdk/latest/source/config/x86/meson.build#L17

>
> On the final compiler-based check, I don't strongly object to it, but I
> just wonder as to its real value. AVX2 was first introduced by Intel over 10
> years ago, and (from what I find on Wikipedia) it's been in AMD CPUs since
> ~2015. While we did have CPUs still being produced without AVX2 since that
> time, they generally didn't have AVX1 either, only having SSE instructions.
> Therefore the number of systems which require this additional check is
> likely very small at this stage.
> That said, I'm ok to either keep or omit it at your choice.

I kept it for consistency, and to support older compilers still officially supported by DPDK.

I don't feel qualified to change support for CPU features; I'll leave that to the CPU vendors.
Also, I have no clue what has been produced by Intel and AMD. :-)

> If you do keep
> it, how about putting the check once at the top of the file and using a
> single short define instead for the multiple places it's used, e.g.
>
> #if (defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 110000)))
> #define RTE_MEMCPY_AVX2
> #endif

Much of the code reorganization in this patch was done with the intention of improving readability.
And I don't think this suggestion improves readability, especially considering that RTE_MEMCPY_AVX512 is something manually defined.

However, I get your point; if the conditional was very long or very complex, I might agree to a "shadow" definition to keep it short.
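To illustrate what such a "shadow" define could look like - purely a sketch of Bruce's suggestion, and the RTE_MEMCPY_AVX2_OR_BETTER name is made up here, it is not an existing DPDK macro:

/* Once, near the top of rte_memcpy.h: */
#if (defined __AVX512F__ && defined RTE_MEMCPY_AVX512) || defined __AVX2__ || \
		(defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION < 110000)))
#define RTE_MEMCPY_AVX2_OR_BETTER 1
#endif

/* ...which would shorten the per-function conditionals, e.g.: */
static __rte_always_inline void
rte_mov32(uint8_t *dst, const uint8_t *src)
{
#ifdef RTE_MEMCPY_AVX2_OR_BETTER
	__m256i ymm0;

	ymm0 = _mm256_loadu_si256((const __m256i *)src);
	_mm256_storeu_si256((__m256i *)dst, ymm0);
#else /* SSE implementation */
	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
#endif
}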
>
>
> >  	__m256i ymm0;
> >
> >  	ymm0 = _mm256_loadu_si256((const __m256i *)src);
> >  	_mm256_storeu_si256((__m256i *)dst, ymm0);
> > +#else /* SSE implementation */
> > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > +#endif
> >  }
> >
> >  /**
> > @@ -132,10 +130,15 @@ rte_mov32(uint8_t *dst, const uint8_t *src)
> >  static __rte_always_inline void
> >  rte_mov64(uint8_t *dst, const uint8_t *src)
> >  {
> > +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> >  	__m512i zmm0;
> >
> >  	zmm0 = _mm512_loadu_si512((const void *)src);
> >  	_mm512_storeu_si512((void *)dst, zmm0);
> > +#else /* AVX2, AVX & SSE implementation */
> > +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > +#endif
> >  }
> >
> >  /**
> > @@ -156,12 +159,18 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
> >  static __rte_always_inline void
> >  rte_mov256(uint8_t *dst, const uint8_t *src)
> >  {
> > -	rte_mov64(dst + 0 * 64, src + 0 * 64);
> > -	rte_mov64(dst + 1 * 64, src + 1 * 64);
> > -	rte_mov64(dst + 2 * 64, src + 2 * 64);
> > -	rte_mov64(dst + 3 * 64, src + 3 * 64);
> > +	rte_mov128(dst + 0 * 128, src + 0 * 128);
> > +	rte_mov128(dst + 1 * 128, src + 1 * 128);
> >  }
> >
> > +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> > +
> > +/**
> > + * AVX512 implementation below
> > + */
> > +
> > +#define ALIGNMENT_MASK 0x3F
> > +
> >  /**
> >   * Copy 128-byte blocks from one location to another,
> >   * locations should not overlap.
> > @@ -231,12 +240,22 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
> >  	/**
> >  	 * Fast way when copy size doesn't exceed 512 bytes
> >  	 */
> > +	if (__builtin_constant_p(n) && n == 32) {
> > +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > +		return ret;
> > +	}
>
> There's an outstanding patchset from Stephen to replace all use of
> rte_memcpy with a constant parameter with an actual call to regular memcpy.
> On a wider scale, should we not look to do something similar in this file,
> and have calls to rte_memcpy with a constant parameter always turn into a
> call to regular memcpy? We used to have such a macro in older DPDK, e.g.
> from DPDK 1.8:
>
> http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/include/arch/x86/rte_memcpy.h?h=v1.8.0#n171
>
> This would eliminate the need to put constant_p checks all through the
> code.

The old macro in DPDK 1.8 was removed with the description "Remove slow glibc call for constant copies":
https://git.dpdk.org/dpdk/commit/lib/librte_eal/common/include/arch/x86/rte_memcpy.h?id=9144d6bcdefd5096a9f3f89a3ce433a54ed84475

Stephen believes that the memcpy() built-ins provided by compilers are faster than rte_memcpy() for constant sizes. I'm not convinced.

Such a change should be backed up by performance tests, preferably with all supported compilers - especially since the old compilers that come with some of the supported distros might not be as good as we would hope.
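For reference, the kind of constant-size dispatch being discussed would look roughly like the sketch below. This is only an illustration of the concept - the wrapper name is hypothetical, and it is neither the old DPDK 1.8 macro nor Stephen's patchset - and it is exactly the kind of change that would need the performance testing mentioned above:

/* Illustrative sketch only: route compile-time constant sizes to the
 * compiler's builtin memcpy() and runtime-variable sizes to the
 * hand-written copy routine. rte_memcpy_generic() is the existing
 * function in lib/eal/x86/include/rte_memcpy.h; the wrapper name is
 * made up for this example. */
#include <string.h>

static __rte_always_inline void *
rte_memcpy_const_sketch(void *dst, const void *src, size_t n)
{
	if (__builtin_constant_p(n))
		return memcpy(dst, src, n); /* the compiler may expand this inline */
	return rte_memcpy_generic(dst, src, n);
}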