From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Bruce Richardson" <bruce.richardson@intel.com>
Cc: <konstantin.v.ananyev@yandex.ru>, <stephen@networkplumber.org>,
<mattias.ronnblom@ericsson.com>, <dev@dpdk.org>
Subject: RE: [PATCH v2] eal/x86: improve rte_memcpy const size 16 performance
Date: Thu, 4 Apr 2024 17:37:53 +0200 [thread overview]
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35E9F360@smartserver.smartshare.dk> (raw)
In-Reply-To: <Zg6rJN5EAhSA0kJZ@bricha3-mobl1.ger.corp.intel.com>
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 4 April 2024 15.29
>
> On Thu, Apr 04, 2024 at 01:19:54PM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Thursday, 4 April 2024 12.07
> > >
> > > On Sun, Mar 03, 2024 at 10:46:21AM +0100, Morten Brørup wrote:
> > > > When the rte_memcpy() size is 16, the same 16 bytes are copied twice.
> > > > In the case where the size is known to be 16 at build tine, omit the
> > > > duplicate copy.
> > > >
> > > > Reduced the amount of effectively copy-pasted code by using #ifdef
> > > > inside functions instead of outside functions.
> > > >
> > > > Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
> > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >
> > > Changes in general look good to me. Comments inline below.
> > >
> > > /Bruce
> > >
> > > > ---
> > > > v2:
> > > > * For GCC, version 11 is required for proper AVX handling;
> > > > if older GCC version, treat AVX as SSE.
> > > > Clang does not have this issue.
> > > > Note: Original code always treated AVX as SSE, regardless of compiler.
> > > > * Do not add copyright. (Stephen Hemminger)
> > > > ---
> > > > lib/eal/x86/include/rte_memcpy.h | 231 ++++++++-----------------------
> > > > 1 file changed, 56 insertions(+), 175 deletions(-)
> > > >
> > > > diff --git a/lib/eal/x86/include/rte_memcpy.h
> > > b/lib/eal/x86/include/rte_memcpy.h
> > > > index 72a92290e0..d1df841f5e 100644
> > > > --- a/lib/eal/x86/include/rte_memcpy.h
> > > > +++ b/lib/eal/x86/include/rte_memcpy.h
> > > > @@ -91,14 +91,6 @@ rte_mov15_or_less(void *dst, const void *src, size_t
> n)
> > > > return ret;
> > > > }
> > > >
> > > > -#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> > > > -
> > > > -#define ALIGNMENT_MASK 0x3F
> > > > -
> > > > -/**
> > > > - * AVX512 implementation below
> > > > - */
> > > > -
> > > > /**
> > > > * Copy 16 bytes from one location to another,
> > > > * locations should not overlap.
> > > > @@ -119,10 +111,16 @@ rte_mov16(uint8_t *dst, const uint8_t *src)
> > > > static __rte_always_inline void
> > > > rte_mov32(uint8_t *dst, const uint8_t *src)
> > > > {
> > > > +#if (defined __AVX512F__ && defined RTE_MEMCPY_AVX512) || defined
> __AVX2__
> > > || \
> > > > + (defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) &&
> (GCC_VERSION
> > > < 110000)))
> > >
> > > I think we can drop the AVX512 checks here, since I'm not aware of any
> > > system where we'd have AVX512 but not AVX2 available, so just checking for
> > > AVX2 support should be sufficient.
> >
> > RTE_MEMCPY_AVX512 must be manually defined at build time to enable AVX512:
> >
> https://elixir.bootlin.com/dpdk/latest/source/lib/eal/include/generic/rte_memc
> py.h#L98
> >
> > Without it, the AVX2 version will be used, regardless if the CPU has AVX512.
> >
> > Also, there are some binutils bugs that might disable compilation for
> AVX512:
> > https://elixir.bootlin.com/dpdk/latest/source/config/x86/meson.build#L4
> > https://elixir.bootlin.com/dpdk/latest/source/config/x86/meson.build#L17
> >
>
> Yes, I realise that, but the guard here is for an AVX2 block only, so there
> is no point in checking for AVX512 - it's AVX512 or AVX2.
Aha! Now I get your point:
Checking for AVX2 suffices for AVX2 code.
I didn't think of that when combining the copy-pasted code into one code block.
Well spotted! Thank you.
>
> > >
> > > On the final compiler-based check, I don't strongly object to it, but I
> > > just wonder as to its real value. AVX2 was first introduced by Intel over
> 10
> > > years ago, and (from what I find in wikipedia), it's been in AMD CPUs
> since
> > > ~2015. While we did have CPUs still being produced without AVX2 since that
> > > time, they generally didn't have AVX1 either, only having SSE
> instructions.
> > > Therefore the number of systems which require this additional check is
> > > likely very small at this stage.
> > > That said, I'm ok to either keep or omit it at your choice.
> >
> > I kept it for consistency, and to support older compilers still officially
> supported by DPDK.
> >
> > I don't feel qualified to change support for CPU features; I'll leave that
> to the CPU vendors.
> > Also, I have no clue what has been produced by Intel and AMD. :-)
> >
> > > If you do keep
> > > it, how about putting the check once at the top of the file and using a
> > > single short define instead for the multiple places it's used e.g.
> > >
> > > #if (defined __AVX__ && !(defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION <
> > > 110000)))
> > > #define RTE_MEMCPY_AVX2
> > > #endif
> >
> > Much of the code reorganization in this patch was done with the intention to
> improve readability.
> >
> > And I don't think this suggestion improves readability; especially
> considering that RTE_MEMCPY_AVX512 is something manually defined.
> >
> > However, I get your point; and if the conditional was very long or very
> complex, I might agree to a "shadow" definition to keep it short.
> >
>
> I just find it long enough that duplication of it seems painful. :-) I'd
> rather we check once at the top if we can use an AVX copy vs SSE, rather
> than duplicate the compiler version checks multiple times.
OK. And I suppose the same principle as above applies:
AVX2 implies AVX, so checking for AVX suffices.
I suppose your suggested name RTE_MEMCPY_AVX2 was a typo, and will define it as RTE_MEMCPY_AVX.
>
>
> > >
> > >
> > > > __m256i ymm0;
> > > >
> > > > ymm0 = _mm256_loadu_si256((const __m256i *)src);
> > > > _mm256_storeu_si256((__m256i *)dst, ymm0);
> > > > +#else /* SSE implementation */
> > > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 *
> 16);
> > > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 *
> 16);
> > > > +#endif
> > > > }
> > > >
> > > > /**
> > > > @@ -132,10 +130,15 @@ rte_mov32(uint8_t *dst, const uint8_t *src)
> > > > static __rte_always_inline void
> > > > rte_mov64(uint8_t *dst, const uint8_t *src)
> > > > {
> > > > +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> > > > __m512i zmm0;
> > > >
> > > > zmm0 = _mm512_loadu_si512((const void *)src);
> > > > _mm512_storeu_si512((void *)dst, zmm0);
> > > > +#else /* AVX2, AVX & SSE implementation */
> > > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 *
> 32);
> > > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 *
> 32);
> > > > +#endif
> > > > }
> > > >
> > > > /**
> > > > @@ -156,12 +159,18 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
> > > > static __rte_always_inline void
> > > > rte_mov256(uint8_t *dst, const uint8_t *src)
> > > > {
> > > > - rte_mov64(dst + 0 * 64, src + 0 * 64);
> > > > - rte_mov64(dst + 1 * 64, src + 1 * 64);
> > > > - rte_mov64(dst + 2 * 64, src + 2 * 64);
> > > > - rte_mov64(dst + 3 * 64, src + 3 * 64);
> > > > + rte_mov128(dst + 0 * 128, src + 0 * 128);
> > > > + rte_mov128(dst + 1 * 128, src + 1 * 128);
> > > > }
> > > >
> > > > +#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> > > > +
> > > > +/**
> > > > + * AVX512 implementation below
> > > > + */
> > > > +
> > > > +#define ALIGNMENT_MASK 0x3F
> > > > +
> > > > /**
> > > > * Copy 128-byte blocks from one location to another,
> > > > * locations should not overlap.
> > > > @@ -231,12 +240,22 @@ rte_memcpy_generic(void *dst, const void *src,
> size_t
> > > n)
> > > > /**
> > > > * Fast way when copy size doesn't exceed 512 bytes
> > > > */
> > > > + if (__builtin_constant_p(n) && n == 32) {
> > > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > > + return ret;
> > > > + }
> > >
> > > There's an outstanding patchset from Stephen to replace all use of
> > > rte_memcpy with a constant parameter with an actual call to regular
> memcpy.
> > > On a wider scale should we not look to do something similar in this file,
> > > have calls to rte_memcpy with constant parameter always turn into a call
> to
> > > regular memcpy? We used to have such a macro in older DPDK e.g.
> > > from DPDK 1.8
> > >
> > >
> http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/include/arch/x86/rte_memcp
> > > y.h?h=v1.8.0#n171
> > >
> > > This would elminiate the need to put in constant_p checks all through the
> > > code.
> >
> > The old macro in DPDK 1.8 was removed with the description "Remove slow
> glibc call for constant copies":
> >
> https://git.dpdk.org/dpdk/commit/lib/librte_eal/common/include/arch/x86/rte_me
> mcpy.h?id=9144d6bcdefd5096a9f3f89a3ce433a54ed84475
> >
> > Stephen believes that the memcpy() built-ins provided by compilers are
> faster than rte_memcpy() for constant size.
> > I'm not convinced.
> > Such a change should be backed up by performance tests, preferably for all
> supported compilers - especially the old compilers that come with some of the
> supported distros might not be as good as we would hope.
> >
>
> I would tend to agree with Stephen that whereever possible we should use
> the built-in memcpy calls. Hence my suggestion of re-introducing the macro.
I agree in principle, but strongly prefer data to back up such changes in the fast path.
> I'm not sure why it previously was seen as slower, it may be that the
> compiler-expanded memcpy calls are not done beyond a certain size.
> However, since we lack data, I'm ok with taking the changes in your patch
> as-is.
>
> With the above-flagged superfluous AVX512 check on AVX2 code removed:
>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
Thanks. I'll provide a v3 patch.
next prev parent reply other threads:[~2024-04-04 15:37 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-02 23:48 [PATCH] " Morten Brørup
2024-03-03 0:38 ` Morten Brørup
2024-03-03 5:40 ` Stephen Hemminger
2024-03-03 5:47 ` Stephen Hemminger
2024-03-03 5:58 ` Stephen Hemminger
2024-03-03 5:58 ` Stephen Hemminger
2024-03-03 10:07 ` Morten Brørup
2024-03-03 5:41 ` Stephen Hemminger
2024-03-03 9:46 ` [PATCH v2] " Morten Brørup
2024-04-04 9:18 ` Morten Brørup
2024-04-04 10:07 ` Bruce Richardson
2024-04-04 11:19 ` Morten Brørup
2024-04-04 13:29 ` Bruce Richardson
2024-04-04 15:37 ` Morten Brørup [this message]
2024-04-04 15:55 ` Stephen Hemminger
2024-04-04 16:10 ` Morten Brørup
2024-04-04 16:55 ` Bruce Richardson
2024-03-03 16:05 ` [PATCH] " Stephen Hemminger
2024-04-05 12:46 ` [PATCH v3] " Morten Brørup
2024-04-05 13:17 ` Bruce Richardson
2024-04-05 13:48 ` [PATCH v4] " Morten Brørup
2024-05-27 13:15 ` Morten Brørup
2024-05-27 13:16 ` [PATCH v5] " Morten Brørup
2024-05-27 14:13 ` Morten Brørup
2024-05-28 6:18 ` Morten Brørup
2024-05-28 6:22 ` [PATCH v6] " Morten Brørup
2024-05-28 7:05 ` [PATCH v7] " Morten Brørup
2024-05-30 15:41 ` [PATCH v8] " Morten Brørup
2024-06-10 9:05 ` Morten Brørup
2024-06-10 13:40 ` Konstantin Ananyev
2024-06-10 13:59 ` Morten Brørup
2024-07-09 9:24 ` David Marchand
2024-07-09 11:42 ` David Marchand
2024-07-09 12:43 ` Morten Brørup
2024-07-09 12:47 ` David Marchand
2024-07-09 12:54 ` Morten Brørup
2024-07-09 15:26 ` Patrick Robb
2024-07-09 13:27 ` [PATCH v9] " Morten Brørup
2024-07-09 15:42 ` David Marchand
2024-07-10 8:03 ` David Marchand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=98CBD80474FA8B44BF855DF32C47DC35E9F360@smartserver.smartshare.dk \
--to=mb@smartsharesystems.com \
--cc=bruce.richardson@intel.com \
--cc=dev@dpdk.org \
--cc=konstantin.v.ananyev@yandex.ru \
--cc=mattias.ronnblom@ericsson.com \
--cc=stephen@networkplumber.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).