patches for DPDK stable branches
 help / color / mirror / Atom feed
From: Luc Pelletier <lucp.at.work@gmail.com>
To: bruce.richardson@intel.com, konstantin.ananyev@intel.com
Cc: dev <dev@dpdk.org>, Xiaoyun Li <xiaoyun.li@intel.com>,
	dpdk stable <stable@dpdk.org>
Subject: Re: [PATCH v3] eal: fix unaligned loads/stores in rte_memcpy_generic
Date: Sun, 16 Jan 2022 09:33:19 -0500	[thread overview]
Message-ID: <CAFeRdtAbLvbfua7csPFZwmWxb_7SqLxm_fdb58bnqrkNJjgC7w@mail.gmail.com> (raw)
In-Reply-To: <20220116141304.474374-1-lucp.at.work@gmail.com>

As a side note, and to follow up on Stephen's indication that this is
'performance critical code', I think it might be worthwhile to
revisit/revalidate the current implementation of rte_memcpy. There's a
good thread here that mentions rte_memcpy, and its performance on at
least one platform/architecture combination is far from being the
best:

https://github.com/microsoft/mimalloc/issues/201

It seems like enhanced rep movsb could be faster on more recent CPUs,
but that's currently not being used in the current implementation of
rte_memcpy.

I understand some of this may not be directly related to this patch,
but whoever looks at this patch might want to provide their thoughts
on whether updating rte_memcpy would be worthwhile? I suspect looking
at all current public implementations of memcpy (libc, microsoft,
compilers builtin implementations, etc.) might help in making
improvements.


Le dim. 16 janv. 2022 à 09:15, Luc Pelletier <lucp.at.work@gmail.com> a écrit :
>
> Calls to rte_memcpy_generic could result in unaligned loads/stores for
> 1 < n < 16. This is undefined behavior according to the C standard,
> and it gets flagged by the clang undefined behavior sanitizer.
>
> rte_memcpy_generic is called with unaligned src and dst addresses.
> When 1 < n < 16, the code would cast both src and dst to a qword,
> dword or word pointer, without verifying the alignment of src/dst. The
> code was changed to use inline assembly for the load/store operations.
> Unaligned load/store operations are permitted in x86/x64 assembly.
>
> Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
> Cc: Xiaoyun Li <xiaoyun.li@intel.com>
> Cc: stable@dpdk.org
>
> Signed-off-by: Luc Pelletier <lucp.at.work@gmail.com>
> ---
>
> Please note that I didn't write the entire function in inline assembly.
> The reason why I kept the bitwise ands as C code is so the optimizer can
> remove the branches when n is known at compile-time.
>
>  lib/eal/x86/include/rte_memcpy.h | 134 +++++++++++++++++--------------
>  1 file changed, 72 insertions(+), 62 deletions(-)
>
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index 1b6c6e585f..b99c1b2ca5 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -45,6 +45,75 @@ extern "C" {
>  static __rte_always_inline void *
>  rte_memcpy(void *dst, const void *src, size_t n);
>
> +#if defined(__i386__)
> +       #define RTE_ACCUMULATOR_REGISTER_NAME "eax"
> +#elif defined(__x86_64__)
> +       #define RTE_ACCUMULATOR_REGISTER_NAME "rax"
> +#endif
> +
> +/**
> + * Copy bytes from one location to another,
> + * locations should not overlap.
> + * Use with unaligned src/dst, and n <= 15.
> + */
> +static __rte_always_inline void *
> +rte_mov15_or_less_unaligned(void *dst, const void *src, size_t n)
> +{
> +       void *ret = dst;
> +       if (n & 8) {
> +               asm (
> +#if defined(__i386__)
> +               "movl (%[src]), %%eax\n"
> +               "movl %%eax, (%[dst])\n"
> +               "add $4, %[src]\n"
> +               "add $4, %[dst]\n"
> +               "movl (%[src]), %%eax\n"
> +               "movl %%eax, (%[dst])\n"
> +               "add $4, %[src]\n"
> +               "add $4, %[dst]\n"
> +#elif defined(__x86_64__)
> +               "movq (%[src]), %%rax\n"
> +               "movq %%rax, (%[dst])\n"
> +               "add $8, %[src]\n"
> +               "add $8, %[dst]\n"
> +#else
> +               #error Unsupported architecture
> +#endif
> +               : [dst] "+r" (dst), [src] "+r" (src)
> +               :
> +               : RTE_ACCUMULATOR_REGISTER_NAME, "memory");
> +       }
> +       if (n & 4) {
> +               asm (
> +               "movl (%[src]), %%eax\n"
> +               "movl %%eax, (%[dst])\n"
> +               "add $4, %[src]\n"
> +               "add $4, %[dst]\n"
> +               : [dst] "+r" (dst), [src] "+r" (src)
> +               :
> +               : RTE_ACCUMULATOR_REGISTER_NAME, "memory");
> +       }
> +       if (n & 2) {
> +               asm (
> +               "movw (%[src]), %%ax\n"
> +               "movw %%ax, (%[dst])\n"
> +               "add $2, %[src]\n"
> +               "add $2, %[dst]\n"
> +               : [dst] "+r" (dst), [src] "+r" (src)
> +               :
> +               : RTE_ACCUMULATOR_REGISTER_NAME, "memory");
> +       }
> +       if (n & 1) {
> +               asm (
> +               "movb (%[src]), %%al\n"
> +               "movb %%al, (%[dst])\n"
> +               : [dst] "+r" (dst), [src] "+r" (src)
> +               :
> +               : RTE_ACCUMULATOR_REGISTER_NAME, "memory");
> +       }
> +       return ret;
> +}
> +
>  #if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
>
>  #define ALIGNMENT_MASK 0x3F
> @@ -171,8 +240,6 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
>  static __rte_always_inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
> -       uintptr_t dstu = (uintptr_t)dst;
> -       uintptr_t srcu = (uintptr_t)src;
>         void *ret = dst;
>         size_t dstofss;
>         size_t bits;
> @@ -181,24 +248,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>          * Copy less than 16 bytes
>          */
>         if (n < 16) {
> -               if (n & 0x01) {
> -                       *(uint8_t *)dstu = *(const uint8_t *)srcu;
> -                       srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -                       dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -               }
> -               if (n & 0x02) {
> -                       *(uint16_t *)dstu = *(const uint16_t *)srcu;
> -                       srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -                       dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -               }
> -               if (n & 0x04) {
> -                       *(uint32_t *)dstu = *(const uint32_t *)srcu;
> -                       srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -                       dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -               }
> -               if (n & 0x08)
> -                       *(uint64_t *)dstu = *(const uint64_t *)srcu;
> -               return ret;
> +               return rte_mov15_or_less_unaligned(dst, src, n);
>         }
>
>         /**
> @@ -379,8 +429,6 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
>  static __rte_always_inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
> -       uintptr_t dstu = (uintptr_t)dst;
> -       uintptr_t srcu = (uintptr_t)src;
>         void *ret = dst;
>         size_t dstofss;
>         size_t bits;
> @@ -389,25 +437,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>          * Copy less than 16 bytes
>          */
>         if (n < 16) {
> -               if (n & 0x01) {
> -                       *(uint8_t *)dstu = *(const uint8_t *)srcu;
> -                       srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -                       dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -               }
> -               if (n & 0x02) {
> -                       *(uint16_t *)dstu = *(const uint16_t *)srcu;
> -                       srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -                       dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -               }
> -               if (n & 0x04) {
> -                       *(uint32_t *)dstu = *(const uint32_t *)srcu;
> -                       srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -                       dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -               }
> -               if (n & 0x08) {
> -                       *(uint64_t *)dstu = *(const uint64_t *)srcu;
> -               }
> -               return ret;
> +               return rte_mov15_or_less_unaligned(dst, src, n);
>         }
>
>         /**
> @@ -672,8 +702,6 @@ static __rte_always_inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
>         __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> -       uintptr_t dstu = (uintptr_t)dst;
> -       uintptr_t srcu = (uintptr_t)src;
>         void *ret = dst;
>         size_t dstofss;
>         size_t srcofs;
> @@ -682,25 +710,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>          * Copy less than 16 bytes
>          */
>         if (n < 16) {
> -               if (n & 0x01) {
> -                       *(uint8_t *)dstu = *(const uint8_t *)srcu;
> -                       srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -                       dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -               }
> -               if (n & 0x02) {
> -                       *(uint16_t *)dstu = *(const uint16_t *)srcu;
> -                       srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -                       dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -               }
> -               if (n & 0x04) {
> -                       *(uint32_t *)dstu = *(const uint32_t *)srcu;
> -                       srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -                       dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -               }
> -               if (n & 0x08) {
> -                       *(uint64_t *)dstu = *(const uint64_t *)srcu;
> -               }
> -               return ret;
> +               return rte_mov15_or_less_unaligned(dst, src, n);
>         }
>
>         /**
> --
> 2.25.1
>

  reply	other threads:[~2022-01-16 14:33 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-15 19:41 [PATCH] " Luc Pelletier
2022-01-15 21:39 ` [PATCH v2] " Luc Pelletier
2022-01-15 22:13   ` Stephen Hemminger
2022-01-16 14:09     ` Luc Pelletier
2022-01-16 16:32       ` Stephen Hemminger
2022-01-16 14:13 ` [PATCH v3] " Luc Pelletier
2022-01-16 14:33   ` Luc Pelletier [this message]
2022-01-16 16:34     ` Stephen Hemminger
2022-01-16 17:59       ` Morten Brørup
2022-01-16 20:33 ` [PATCH v4] " Luc Pelletier
2022-01-17 15:37 ` [PATCH v5] " Luc Pelletier
2022-02-04 16:42   ` Luc Pelletier
2022-02-04 17:16   ` Ananyev, Konstantin
2022-02-08 16:53     ` Thomas Monjalon
2022-02-09 15:05     ` Luc Pelletier
2022-02-10 14:04       ` Ananyev, Konstantin
2022-02-10 16:56         ` Luc Pelletier
2022-02-11 15:51           ` Ananyev, Konstantin
2022-02-13 22:31             ` Luc Pelletier
2022-02-14 13:41               ` Ananyev, Konstantin
2022-02-25 15:51 ` [PATCH v6] eal: fix rte_memcpy strict aliasing/alignment bugs Luc Pelletier
2022-02-25 16:38 ` [PATCH v7] " Luc Pelletier
2022-03-10 14:55   ` Ananyev, Konstantin
2022-04-07 15:24     ` David Marchand
2022-04-07 15:32       ` David Marchand
2022-04-07 15:40         ` David Marchand
2022-05-13 19:15           ` Luc Pelletier
2022-05-19 16:41             ` David Marchand
2022-04-08 13:47       ` Luc Pelletier
2022-05-19 16:47       ` David Marchand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAFeRdtAbLvbfua7csPFZwmWxb_7SqLxm_fdb58bnqrkNJjgC7w@mail.gmail.com \
    --to=lucp.at.work@gmail.com \
    --cc=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=konstantin.ananyev@intel.com \
    --cc=stable@dpdk.org \
    --cc=xiaoyun.li@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).