From mboxrd@z Thu Jan 1 00:00:00 1970
References: <20220115194102.444140-1-lucp.at.work@gmail.com>
 <20220116141304.474374-1-lucp.at.work@gmail.com>
In-Reply-To: <20220116141304.474374-1-lucp.at.work@gmail.com>
From: Luc Pelletier
Date: Sun, 16 Jan 2022 09:33:19 -0500
Subject: Re: [PATCH v3] eal: fix unaligned loads/stores in rte_memcpy_generic
To: bruce.richardson@intel.com, konstantin.ananyev@intel.com
Cc: dev, Xiaoyun Li, dpdk stable
List-Id: DPDK patches and discussions

As a side note, and to follow up on Stephen's remark that this is
'performance critical code': I think it would be worthwhile to revisit
and revalidate the current implementation of rte_memcpy. There's a good
thread that mentions rte_memcpy; its performance on at least one
platform/architecture combination is far from the best:

https://github.com/microsoft/mimalloc/issues/201

It seems that enhanced rep movsb could be faster on more recent CPUs,
but rte_memcpy does not currently use it.

I understand some of this may not be directly related to this patch, but
whoever reviews it might want to share their thoughts on whether updating
rte_memcpy would be worthwhile. I suspect that looking at the current
public implementations of memcpy (libc, Microsoft, compiler builtins,
etc.) would help in making improvements.

On Sun, Jan 16, 2022 at 09:15, Luc Pelletier wrote:
>
> Calls to rte_memcpy_generic could result in unaligned loads/stores for
> 1 < n < 16.
> This is undefined behavior according to the C standard,
> and it gets flagged by the clang undefined behavior sanitizer.
>
> rte_memcpy_generic is called with unaligned src and dst addresses.
> When 1 < n < 16, the code would cast both src and dst to a qword,
> dword or word pointer, without verifying the alignment of src/dst. The
> code was changed to use inline assembly for the load/store operations.
> Unaligned load/store operations are permitted in x86/x64 assembly.
>
> Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
> Cc: Xiaoyun Li
> Cc: stable@dpdk.org
>
> Signed-off-by: Luc Pelletier
> ---
>
> Please note that I didn't write the entire function in inline assembly.
> The reason why I kept the bitwise ands as C code is so the optimizer can
> remove the branches when n is known at compile-time.
>
>  lib/eal/x86/include/rte_memcpy.h | 134 +++++++++++++++++--------------
>  1 file changed, 72 insertions(+), 62 deletions(-)
>
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index 1b6c6e585f..b99c1b2ca5 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -45,6 +45,75 @@ extern "C" {
>  static __rte_always_inline void *
>  rte_memcpy(void *dst, const void *src, size_t n);
>
> +#if defined(__i386__)
> +	#define RTE_ACCUMULATOR_REGISTER_NAME "eax"
> +#elif defined(__x86_64__)
> +	#define RTE_ACCUMULATOR_REGISTER_NAME "rax"
> +#endif
> +
> +/**
> + * Copy bytes from one location to another,
> + * locations should not overlap.
> + * Use with unaligned src/dst, and n <= 15.
> + */
> +static __rte_always_inline void *
> +rte_mov15_or_less_unaligned(void *dst, const void *src, size_t n)
> +{
> +	void *ret = dst;
> +	if (n & 8) {
> +		asm (
> +#if defined(__i386__)
> +			"movl (%[src]), %%eax\n"
> +			"movl %%eax, (%[dst])\n"
> +			"add $4, %[src]\n"
> +			"add $4, %[dst]\n"
> +			"movl (%[src]), %%eax\n"
> +			"movl %%eax, (%[dst])\n"
> +			"add $4, %[src]\n"
> +			"add $4, %[dst]\n"
> +#elif defined(__x86_64__)
> +			"movq (%[src]), %%rax\n"
> +			"movq %%rax, (%[dst])\n"
> +			"add $8, %[src]\n"
> +			"add $8, %[dst]\n"
> +#else
> +			#error Unsupported architecture
> +#endif
> +			: [dst] "+r" (dst), [src] "+r" (src)
> +			:
> +			: RTE_ACCUMULATOR_REGISTER_NAME, "memory");
> +	}
> +	if (n & 4) {
> +		asm (
> +			"movl (%[src]), %%eax\n"
> +			"movl %%eax, (%[dst])\n"
> +			"add $4, %[src]\n"
> +			"add $4, %[dst]\n"
> +			: [dst] "+r" (dst), [src] "+r" (src)
> +			:
> +			: RTE_ACCUMULATOR_REGISTER_NAME, "memory");
> +	}
> +	if (n & 2) {
> +		asm (
> +			"movw (%[src]), %%ax\n"
> +			"movw %%ax, (%[dst])\n"
> +			"add $2, %[src]\n"
> +			"add $2, %[dst]\n"
> +			: [dst] "+r" (dst), [src] "+r" (src)
> +			:
> +			: RTE_ACCUMULATOR_REGISTER_NAME, "memory");
> +	}
> +	if (n & 1) {
> +		asm (
> +			"movb (%[src]), %%al\n"
> +			"movb %%al, (%[dst])\n"
> +			: [dst] "+r" (dst), [src] "+r" (src)
> +			:
> +			: RTE_ACCUMULATOR_REGISTER_NAME, "memory");
> +	}
> +	return ret;
> +}
> +
>  #if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
>
>  #define ALIGNMENT_MASK 0x3F
> @@ -171,8 +240,6 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
>  static __rte_always_inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
> -	uintptr_t dstu = (uintptr_t)dst;
> -	uintptr_t srcu = (uintptr_t)src;
>  	void *ret = dst;
>  	size_t dstofss;
>  	size_t bits;
> @@ -181,24 +248,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>  	 * Copy less than 16 bytes
>  	 */
>  	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> -			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> -			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> -			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -		}
> -		if (n & 0x08)
> -			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> -		return ret;
> +		return rte_mov15_or_less_unaligned(dst, src, n);
>  	}
>
>  	/**
> @@ -379,8 +429,6 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
>  static __rte_always_inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
> -	uintptr_t dstu = (uintptr_t)dst;
> -	uintptr_t srcu = (uintptr_t)src;
>  	void *ret = dst;
>  	size_t dstofss;
>  	size_t bits;
> @@ -389,25 +437,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>  	 * Copy less than 16 bytes
>  	 */
>  	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> -			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> -			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> -			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -		}
> -		if (n & 0x08) {
> -			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> -		}
> -		return ret;
> +		return rte_mov15_or_less_unaligned(dst, src, n);
>  	}
>
>  	/**
> @@ -672,8 +702,6 @@ static __rte_always_inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
>  	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> -	uintptr_t dstu = (uintptr_t)dst;
> -	uintptr_t srcu = (uintptr_t)src;
>  	void *ret = dst;
>  	size_t dstofss;
>  	size_t srcofs;
> @@ -682,25 +710,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>  	 * Copy less than 16 bytes
>  	 */
>  	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> -			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> -			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> -			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -		}
> -		if (n & 0x08) {
> -			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> -		}
> -		return ret;
> +		return rte_mov15_or_less_unaligned(dst, src, n);
>  	}
>
>  	/**
> --
> 2.25.1
>
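
For comparison with the inline-assembly approach in the quoted patch,
here is a minimal sketch of one portable way to avoid the unaligned
pointer casts. It is not part of the patch and has not been benchmarked.
Each chunk is copied through a local temporary with a fixed-size memcpy,
which compilers typically lower to plain mov instructions on x86. The
name copy15_or_less_sketch is hypothetical; the bit tests on n mirror
the patch so the branches can still fold when n is a compile-time
constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only, not DPDK API): copy n <= 15
 * bytes between possibly unaligned, non-overlapping buffers. Data moves
 * through local temporaries via fixed-size memcpy calls, so no
 * misaligned uint16_t/uint32_t/uint64_t pointer is ever dereferenced.
 */
static inline void *
copy15_or_less_sketch(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(n < 16);

	if (n & 8) {
		uint64_t t;
		memcpy(&t, s, sizeof(t));
		memcpy(d, &t, sizeof(t));
		s += sizeof(t);
		d += sizeof(t);
	}
	if (n & 4) {
		uint32_t t;
		memcpy(&t, s, sizeof(t));
		memcpy(d, &t, sizeof(t));
		s += sizeof(t);
		d += sizeof(t);
	}
	if (n & 2) {
		uint16_t t;
		memcpy(&t, s, sizeof(t));
		memcpy(d, &t, sizeof(t));
		s += sizeof(t);
		d += sizeof(t);
	}
	if (n & 1)
		*d = *s; /* single bytes have no alignment requirement */

	return dst;
}
```

Whether this generates code as good as the inline assembly on the
compilers DPDK supports would need to be checked; it is only meant to
show that the undefined behavior can be avoided without leaving C.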