From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id CBA5243DFA; Thu, 4 Apr 2024 17:37:57 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 9D67A4064A; Thu, 4 Apr 2024 17:37:57 +0200 (CEST) Received: from dkmailrelay1.smartsharesystems.com (smartserver.smartsharesystems.com [77.243.40.215]) by mails.dpdk.org (Postfix) with ESMTP id F3E52402E8 for ; Thu, 4 Apr 2024 17:37:55 +0200 (CEST) Received: from smartserver.smartsharesystems.com (smartserver.smartsharesys.local [192.168.4.10]) by dkmailrelay1.smartsharesystems.com (Postfix) with ESMTP id C65E12274F; Thu, 4 Apr 2024 17:37:55 +0200 (CEST) Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Subject: RE: [PATCH v2] eal/x86: improve rte_memcpy const size 16 performance Date: Thu, 4 Apr 2024 17:37:53 +0200 X-MimeOLE: Produced By Microsoft Exchange V6.5 Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F360@smartserver.smartshare.dk> In-Reply-To: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: [PATCH v2] eal/x86: improve rte_memcpy const size 16 performance Thread-Index: AdqGlBrkHeV/3CLWSWqpy3Amh9aryAAB6BpQ References: <20240302234812.9137-1-mb@smartsharesystems.com> <20240303094621.16404-1-mb@smartsharesystems.com> <98CBD80474FA8B44BF855DF32C47DC35E9F35C@smartserver.smartshare.dk> From: =?iso-8859-1?Q?Morten_Br=F8rup?= To: "Bruce Richardson" Cc: , , , X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org > From: Bruce Richardson [mailto:bruce.richardson@intel.com] > Sent: Thursday, 4 April 2024 15.29 >=20 > On Thu, Apr 04, 2024 at 01:19:54PM +0200, Morten Br=F8rup wrote: > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com] > > > Sent: Thursday, 4 April 2024 12.07 > > > > > > On Sun, Mar 03, 2024 at 10:46:21AM +0100, Morten Br=F8rup wrote: > > > > When the rte_memcpy() size is 16, the same 16 bytes are copied = twice. > > > > In the case where the size is known to be 16 at build tine, omit = the > > > > duplicate copy. > > > > > > > > Reduced the amount of effectively copy-pasted code by using = #ifdef > > > > inside functions instead of outside functions. > > > > > > > > Suggested-by: Stephen Hemminger > > > > Signed-off-by: Morten Br=F8rup > > > > > > Changes in general look good to me. Comments inline below. > > > > > > /Bruce > > > > > > > --- > > > > v2: > > > > * For GCC, version 11 is required for proper AVX handling; > > > > if older GCC version, treat AVX as SSE. > > > > Clang does not have this issue. > > > > Note: Original code always treated AVX as SSE, regardless of = compiler. > > > > * Do not add copyright. (Stephen Hemminger) > > > > --- > > > > lib/eal/x86/include/rte_memcpy.h | 231 = ++++++++----------------------- > > > > 1 file changed, 56 insertions(+), 175 deletions(-) > > > > > > > > diff --git a/lib/eal/x86/include/rte_memcpy.h > > > b/lib/eal/x86/include/rte_memcpy.h > > > > index 72a92290e0..d1df841f5e 100644 > > > > --- a/lib/eal/x86/include/rte_memcpy.h > > > > +++ b/lib/eal/x86/include/rte_memcpy.h > > > > @@ -91,14 +91,6 @@ rte_mov15_or_less(void *dst, const void *src, = size_t > n) > > > > return ret; > > > > } > > > > > > > > -#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512 > > > > - > > > > -#define ALIGNMENT_MASK 0x3F > > > > - > > > > -/** > > > > - * AVX512 implementation below > > > > - */ > > > > - > > > > /** > > > > * Copy 16 bytes from one location to another, > > > > * locations should not overlap. > > > > @@ -119,10 +111,16 @@ rte_mov16(uint8_t *dst, const uint8_t = *src) > > > > static __rte_always_inline void > > > > rte_mov32(uint8_t *dst, const uint8_t *src) > > > > { > > > > +#if (defined __AVX512F__ && defined RTE_MEMCPY_AVX512) || = defined > __AVX2__ > > > || \ > > > > + (defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && > (GCC_VERSION > > > < 110000))) > > > > > > I think we can drop the AVX512 checks here, since I'm not aware of = any > > > system where we'd have AVX512 but not AVX2 available, so just = checking for > > > AVX2 support should be sufficient. > > > > RTE_MEMCPY_AVX512 must be manually defined at build time to enable = AVX512: > > > = https://elixir.bootlin.com/dpdk/latest/source/lib/eal/include/generic/rte= _memc > py.h#L98 > > > > Without it, the AVX2 version will be used, regardless if the CPU has = AVX512. > > > > Also, there are some binutils bugs that might disable compilation = for > AVX512: > > = https://elixir.bootlin.com/dpdk/latest/source/config/x86/meson.build#L4 > > = https://elixir.bootlin.com/dpdk/latest/source/config/x86/meson.build#L17 > > >=20 > Yes, I realise that, but the guard here is for an AVX2 block only, so = there > is no point in checking for AVX512 - it's AVX512 or AVX2. Aha! Now I get your point: Checking for AVX2 suffices for AVX2 code. I didn't think of that when combining the copy-pasted code into one code = block. Well spotted! Thank you. >=20 > > > > > > On the final compiler-based check, I don't strongly object to it, = but I > > > just wonder as to its real value. AVX2 was first introduced by = Intel over > 10 > > > years ago, and (from what I find in wikipedia), it's been in AMD = CPUs > since > > > ~2015. While we did have CPUs still being produced without AVX2 = since that > > > time, they generally didn't have AVX1 either, only having SSE > instructions. > > > Therefore the number of systems which require this additional = check is > > > likely very small at this stage. > > > That said, I'm ok to either keep or omit it at your choice. > > > > I kept it for consistency, and to support older compilers still = officially > supported by DPDK. > > > > I don't feel qualified to change support for CPU features; I'll = leave that > to the CPU vendors. > > Also, I have no clue what has been produced by Intel and AMD. :-) > > > > > If you do keep > > > it, how about putting the check once at the top of the file and = using a > > > single short define instead for the multiple places it's used e.g. > > > > > > #if (defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && = (GCC_VERSION < > > > 110000))) > > > #define RTE_MEMCPY_AVX2 > > > #endif > > > > Much of the code reorganization in this patch was done with the = intention to > improve readability. > > > > And I don't think this suggestion improves readability; especially > considering that RTE_MEMCPY_AVX512 is something manually defined. > > > > However, I get your point; and if the conditional was very long or = very > complex, I might agree to a "shadow" definition to keep it short. > > >=20 > I just find it long enough that duplication of it seems painful. :-) = I'd > rather we check once at the top if we can use an AVX copy vs SSE, = rather > than duplicate the compiler version checks multiple times. OK. And I suppose the same principle as above applies: AVX2 implies AVX, so checking for AVX suffices. I suppose your suggested name RTE_MEMCPY_AVX2 was a typo, and will = define it as RTE_MEMCPY_AVX. >=20 >=20 > > > > > > > > > > __m256i ymm0; > > > > > > > > ymm0 =3D _mm256_loadu_si256((const __m256i *)src); > > > > _mm256_storeu_si256((__m256i *)dst, ymm0); > > > > +#else /* SSE implementation */ > > > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * > 16); > > > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * > 16); > > > > +#endif > > > > } > > > > > > > > /** > > > > @@ -132,10 +130,15 @@ rte_mov32(uint8_t *dst, const uint8_t = *src) > > > > static __rte_always_inline void > > > > rte_mov64(uint8_t *dst, const uint8_t *src) > > > > { > > > > +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512 > > > > __m512i zmm0; > > > > > > > > zmm0 =3D _mm512_loadu_si512((const void *)src); > > > > _mm512_storeu_si512((void *)dst, zmm0); > > > > +#else /* AVX2, AVX & SSE implementation */ > > > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * > 32); > > > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * > 32); > > > > +#endif > > > > } > > > > > > > > /** > > > > @@ -156,12 +159,18 @@ rte_mov128(uint8_t *dst, const uint8_t = *src) > > > > static __rte_always_inline void > > > > rte_mov256(uint8_t *dst, const uint8_t *src) > > > > { > > > > - rte_mov64(dst + 0 * 64, src + 0 * 64); > > > > - rte_mov64(dst + 1 * 64, src + 1 * 64); > > > > - rte_mov64(dst + 2 * 64, src + 2 * 64); > > > > - rte_mov64(dst + 3 * 64, src + 3 * 64); > > > > + rte_mov128(dst + 0 * 128, src + 0 * 128); > > > > + rte_mov128(dst + 1 * 128, src + 1 * 128); > > > > } > > > > > > > > +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512 > > > > + > > > > +/** > > > > + * AVX512 implementation below > > > > + */ > > > > + > > > > +#define ALIGNMENT_MASK 0x3F > > > > + > > > > /** > > > > * Copy 128-byte blocks from one location to another, > > > > * locations should not overlap. > > > > @@ -231,12 +240,22 @@ rte_memcpy_generic(void *dst, const void = *src, > size_t > > > n) > > > > /** > > > > * Fast way when copy size doesn't exceed 512 bytes > > > > */ > > > > + if (__builtin_constant_p(n) && n =3D=3D 32) { > > > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > > > + return ret; > > > > + } > > > > > > There's an outstanding patchset from Stephen to replace all use of > > > rte_memcpy with a constant parameter with an actual call to = regular > memcpy. > > > On a wider scale should we not look to do something similar in = this file, > > > have calls to rte_memcpy with constant parameter always turn into = a call > to > > > regular memcpy? We used to have such a macro in older DPDK e.g. > > > from DPDK 1.8 > > > > > > > = http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/include/arch/x86/rte_= memcp > > > y.h?h=3Dv1.8.0#n171 > > > > > > This would elminiate the need to put in constant_p checks all = through the > > > code. > > > > The old macro in DPDK 1.8 was removed with the description "Remove = slow > glibc call for constant copies": > > > = https://git.dpdk.org/dpdk/commit/lib/librte_eal/common/include/arch/x86/r= te_me > mcpy.h?id=3D9144d6bcdefd5096a9f3f89a3ce433a54ed84475 > > > > Stephen believes that the memcpy() built-ins provided by compilers = are > faster than rte_memcpy() for constant size. > > I'm not convinced. > > Such a change should be backed up by performance tests, preferably = for all > supported compilers - especially the old compilers that come with some = of the > supported distros might not be as good as we would hope. > > >=20 > I would tend to agree with Stephen that whereever possible we should = use > the built-in memcpy calls. Hence my suggestion of re-introducing the = macro. I agree in principle, but strongly prefer data to back up such changes = in the fast path. > I'm not sure why it previously was seen as slower, it may be that the > compiler-expanded memcpy calls are not done beyond a certain size. > However, since we lack data, I'm ok with taking the changes in your = patch > as-is. >=20 > With the above-flagged superfluous AVX512 check on AVX2 code removed: >=20 > Acked-by: Bruce Richardson Thanks. I'll provide a v3 patch.