Subject: [RFC v3] non-temporal memcpy
From: Morten Brørup
Date: Fri, 19 Aug 2022 15:58:06 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D8728A@smartserver.smartshare.dk>
Cc: Bruce Richardson, Konstantin Ananyev, Honnappa Nagarahalli,
    Stephen Hemminger, Mattias Rönnblom

This RFC proposes a set of functions optimized for non-temporal memory copy.

At this stage, I am asking for acceptance of the concept and API.
Feedback on the x86 implementation is also welcome.

Applications sometimes copy data to another memory location, which is only
used much later. In this case, it is inefficient to pollute the data cache
with the copied data.

An example use case (originating from a real-life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of the functions is to achieve a performance gain by not
polluting the cache when copying data.

Although the throughput may be improved by further optimization, I do not
consider throughput optimization relevant initially.

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions have 16 byte alignment requirements [1].
ARM non-temporal load instructions are available with 4 byte alignment
requirements [2].
Both platforms offer non-temporal store instructions with 4 byte alignment
requirements.

In addition to the general function without any alignment requirements, I
have also implemented functions for respectively 16 and 4 byte aligned
access, for performance purposes.

NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
style and use TAB.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-

V2:
- Only copy from non-temporal source to non-temporal destination.
  I.e. remove the two variants with only source and/or destination being
  non-temporal.
- Do not require alignment. Instead, offer additional 4 and 16 byte aligned
  functions for performance purposes.
- Implemented two of the functions for x86.
- Remove memset function.
V3:
- Only one generic function is exposed in the API:
  rte_memcpy_ex(dst, src, len, flags), which should be called with the flags
  constant at build time.
- Requests for non-temporal source/destination memory access are now flags.
- Alignment hints are now flags.
- The functions for various alignments are not part of the declaration.
  They are implementation specific.
- Implemented the generic and unaligned functions for x86.
- Added note about rte_wmb() to non-temporal store.
- Variants for normal load and non-temporal store, as well as variants for
  non-temporal load and normal store, are not implemented. They can be added
  later.
- Improved the workaround for _mm_stream_load_si128() not taking a const
  pointer as parameter.
- Extensive use of __builtin_constant_p(flags) to help the compiler optimize
  the code.

Signed-off-by: Morten Brørup
---

/*****************************************************************/
/* Declaration. Goes into: /lib/eal/include/generic/rte_memcpy.h */
/*****************************************************************/

/*
 * Advanced/Non-Temporal Memory Operations Flags.
 */

/** Length alignment hint mask. */
#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
/** Hint: Length is 2 byte aligned. */
#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
/** Hint: Length is 4 byte aligned. */
#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
/** Hint: Length is 8 byte aligned. */
#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
/** Hint: Length is 16 byte aligned. */
#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
/** Hint: Length is 32 byte aligned. */
#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
/** Hint: Length is 64 byte aligned. */
#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
/** Hint: Length is 128 byte aligned. */
#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)

/** Prefer non-temporal access to source memory area.
 *
 * On ARM architecture:
 * Remember to call rte_???() before a sequence of copy operations.
 */
#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
/** Source address alignment hint mask. */
#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
/** Hint: Source address is 2 byte aligned. */
#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
/** Hint: Source address is 4 byte aligned. */
#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
/** Hint: Source address is 8 byte aligned. */
#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
/** Hint: Source address is 16 byte aligned. */
#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
/** Hint: Source address is 32 byte aligned. */
#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
/** Hint: Source address is 64 byte aligned. */
#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
/** Hint: Source address is 128 byte aligned. */
#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)

/** Prefer non-temporal access to destination memory area.
 *
 * On x86 architecture:
 * Remember to call rte_wmb() after a sequence of copy operations.
 * On ARM architecture:
 * Remember to call rte_???() after a sequence of copy operations.
 */
#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
/** Destination address alignment hint mask. */
#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
/** Hint: Destination address is 2 byte aligned. */
#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
/** Hint: Destination address is 4 byte aligned. */
#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
/** Hint: Destination address is 8 byte aligned. */
#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
/** Hint: Destination address is 16 byte aligned. */
#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
/** Hint: Destination address is 32 byte aligned. */
#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
/** Hint: Destination address is 64 byte aligned. */
#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
/** Hint: Destination address is 128 byte aligned. */
#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Advanced/non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the destination memory area.
 * @param src
 *   Pointer to the source memory area.
 * @param len
 *   Number of bytes to copy.
 * @param flags
 *   Hints for memory access.
 *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)A flags.
 *   Should be constant at build time.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags);
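To illustrate the intended usage (the capture buffer example mentioned above),
here is a minimal sketch. It is not part of the proposed patch; the names
capture_burst, capture_mem and CAPTURE_FLAGS are made up for the example,
capture_mem is assumed to be 16 byte aligned, and the usual mbuf headers are
assumed to be included. Note the rte_wmb() after the sequence of non-temporal
copies, as required on x86:

/* Example (not part of the patch): copy a burst of filtered packets into a
 * capture buffer for offline analysis, without polluting the data cache.
 * The flags are constant at build time, so the dispatch in rte_memcpy_ex()
 * is resolved by the compiler.
 * capture_mem must be 16 byte aligned to justify RTE_MEMOPS_F_DST16A.
 */
#define CAPTURE_FLAGS (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_DST16A)

static void
capture_burst(void *capture_mem, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t i;

    for (i = 0; i < nb_pkts; i++) {
        rte_memcpy_ex(capture_mem, rte_pktmbuf_mtod(pkts[i], const void *),
                pkts[i]->data_len, CAPTURE_FLAGS);
        /* Keep the destination 16 byte aligned for the next packet. */
        capture_mem = RTE_PTR_ADD(capture_mem,
                RTE_ALIGN_CEIL(pkts[i]->data_len, 16));
    }
    /* Flush the CPU's write combining buffers after the non-temporal stores. */
    rte_wmb();
}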
/****************************************************************/
/* Implementation. Goes into: /lib/eal/x86/include/rte_memcpy.h */
/****************************************************************/

/* Assumptions about register sizes. */
_Static_assert(sizeof(int32_t) == 4, "Wrong size of int32_t.");
_Static_assert(sizeof(__m128i) == 16, "Wrong size of __m128i.");

/**
 * @internal
 * Workaround for _mm_stream_load_si128() missing const in the parameter.
 */
__rte_internal
static __rte_always_inline
__m128i _mm_stream_load_si128_const(const __m128i * const mem_addr)
{
#if defined(RTE_TOOLCHAIN_GCC)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
#endif
    return _mm_stream_load_si128(mem_addr);
#if defined(RTE_TOOLCHAIN_GCC)
#pragma GCC diagnostic pop
#endif
}

/**
 * @internal
 * 16 byte aligned non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    register __m128i xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(src, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));

    /* Copy large portion of data in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
        xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
        xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
        xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        src = RTE_PTR_ADD(src, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data.
     * Omitted if length is known to be 64 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A))) {
        while (len != 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_stream_si128(dst, xmm0);
            src = RTE_PTR_ADD(src, 16);
            dst = RTE_PTR_ADD(dst, 16);
            len -= 16;
        }
    }
}

/**
 * @internal
 * 4 byte aligned non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 4 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   Must be 4 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    int32_t buffer[16 / 4] __rte_aligned(16);
    /** How many bytes is source offset from 16 byte alignment (floor rounding). */
    const size_t offset = ((uintptr_t)src & (16 - 1));
    register __m128i xmm0;

    RTE_ASSERT(rte_is_aligned(dst, 4));
    RTE_ASSERT(rte_is_aligned(src, 4));
    RTE_ASSERT(rte_is_aligned(len, 4));

    if (unlikely(len == 0))
        return;

    /* Copy first, not 16 byte aligned, part of source data.
     * Omitted if source is known to be 16 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) &&
            offset != 0) {
        const size_t first = 16 - offset;

        /** Adjust source pointer to achieve 16 byte alignment (floor rounding). */
        src = RTE_PTR_SUB(src, offset);
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        switch (first) {
        case 3 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
            if (unlikely(len == 1 * 4))
                return;
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
            if (unlikely(len == 2 * 4))
                return;
            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
            break;
        case 2 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
            if (unlikely(len == 1 * 4))
                return;
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
            break;
        case 1 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
            break;
        }
        src = RTE_PTR_ADD(src, first);
        dst = RTE_PTR_ADD(dst, first);
        len -= first;
    }

    /* Copy middle, 16 byte aligned, part of source data. */
    while (len >= 16) {
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
        src = RTE_PTR_ADD(src, 16);
        dst = RTE_PTR_ADD(dst, 4 * 4);
        len -= 16;
    }

    /* Copy last, not 16 byte aligned, part of source data.
     * Omitted if both the source address and the length are known to be
     * 16 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
            ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
            len != 0) {
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        switch (len) {
        case 1 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
            break;
        case 2 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
            break;
        case 3 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
            break;
        }
    }
}
#ifndef RTE_MEMCPY_NT_BUFSIZE

#include <rte_mbuf_core.h>
/* #include */

/** Bounce buffer size for non-temporal memcpy.
 *
 * The actual buffer will be slightly larger, due to added padding.
 * The default is chosen to be able to handle a non-segmented packet.
 */
#define RTE_MEMCPY_NT_BUFSIZE   RTE_MBUF_DEFAULT_DATAROOM

#endif /* RTE_MEMCPY_NT_BUFSIZE */

/**
 * @internal
 * Non-temporal memory copy to 16 byte aligned destination and length
 * from unaligned source via bounce buffer.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_buf16dla(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags __rte_unused)
{
    /** Aligned bounce buffer with preceding and trailing padding. */
    unsigned char buffer[16 + RTE_MEMCPY_NT_BUFSIZE + 16] __rte_aligned(16);
    void * buf;
    register __m128i xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));
    RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);

    if (unlikely(len == 0))
        return;

    /* Step 1:
     * Copy data from the source to the bounce buffer's aligned data area,
     * using aligned non-temporal load from the source,
     * and unaligned store in the bounce buffer.
     *
     * If the source is unaligned, the extra bytes preceding the data will be copied
     * to the padding area preceding the bounce buffer's aligned data area.
     * Similarly, if the source data ends at an unaligned address, the additional bytes
     * trailing the data will be copied to the padding area trailing the bounce buffer's
     * aligned data area.
     */
    {
        /** How many bytes is source offset from 16 byte alignment (floor rounding). */
        const size_t offset = ((uintptr_t)src & (16 - 1));
        /** Number of bytes to copy from source, incl. any extra preceding bytes. */
        size_t srclen = len + offset;

        /* Adjust source pointer for extra preceding bytes. */
        src = RTE_PTR_SUB(src, offset);
        /* Bounce buffer pointer, adjusted for extra preceding bytes. */
        buf = RTE_PTR_ADD(buffer, 16 - offset);

        /* Copy large portion of data from source to bounce buffer. */
        while (srclen >= 4 * 16) {
            xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
            xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
            xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
            xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
            _mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
            src = RTE_PTR_ADD(src, 4 * 16);
            buf = RTE_PTR_ADD(buf, 4 * 16);
            srclen -= 4 * 16;
        }

        /* Copy remaining data from source to bounce buffer. */
        while ((ssize_t)srclen > 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_storeu_si128(buf, xmm0);
            src = RTE_PTR_ADD(src, 16);
            buf = RTE_PTR_ADD(buf, 16);
            srclen -= 16;
        }
    }

    /* Step 2:
     * Copy from the aligned bounce buffer to the aligned destination.
     */

    /* Reset bounce buffer pointer; point to the aligned data area. */
    buf = RTE_PTR_ADD(buffer, 16);
    /* Copy large portion of data from bounce buffer to destination in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_load_si128(RTE_PTR_ADD(buf, 0 * 16));
        xmm1 = _mm_load_si128(RTE_PTR_ADD(buf, 1 * 16));
        xmm2 = _mm_load_si128(RTE_PTR_ADD(buf, 2 * 16));
        xmm3 = _mm_load_si128(RTE_PTR_ADD(buf, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        buf = RTE_PTR_ADD(buf, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data from bounce buffer to destination. */
    while (len != 0) {
        xmm0 = _mm_load_si128(buf);
        _mm_stream_si128(dst, xmm0);
        buf = RTE_PTR_ADD(buf, 16);
        dst = RTE_PTR_ADD(dst, 16);
        len -= 16;
    }
}

/**
 * @internal
 * Non-temporal memory copy via bounce buffer.
 *
 * @note
 * If the destination and/or length is unaligned, the first and/or last copied
 * bytes will be stored in the destination memory area using temporal access.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_buf(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags __rte_unused)
{
    /** Aligned bounce buffer with preceding and trailing padding. */
    unsigned char buffer[16 + RTE_MEMCPY_NT_BUFSIZE + 16] __rte_aligned(16);
    void * buf;
    register __m128i xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);

    if (unlikely(len == 0))
        return;

    /* Step 1:
     * Copy data from the source to the bounce buffer's aligned data area,
     * using aligned non-temporal load from the source,
     * and unaligned store in the bounce buffer.
     *
     * If the source is unaligned, the additional bytes preceding the data will be copied
     * to the padding area preceding the bounce buffer's aligned data area.
     * Similarly, if the source data ends at an unaligned address, the additional bytes
     * trailing the data will be copied to the padding area trailing the bounce buffer's
     * aligned data area.
     */
    {
        /** How many bytes is source offset from 16 byte alignment (floor rounding). */
        const size_t offset = ((uintptr_t)src & (16 - 1));
        /** Number of bytes to copy from source, incl. any extra preceding bytes. */
        size_t srclen = len + offset;

        /* Adjust source pointer for extra preceding bytes. */
        src = RTE_PTR_SUB(src, offset);
        /* Bounce buffer pointer, adjusted for extra preceding bytes. */
        buf = RTE_PTR_ADD(buffer, 16 - offset);

        /* Copy large portion of data from source to bounce buffer. */
        while (srclen >= 4 * 16) {
            xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
            xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
            xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
            xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
            _mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
            src = RTE_PTR_ADD(src, 4 * 16);
            buf = RTE_PTR_ADD(buf, 4 * 16);
            srclen -= 4 * 16;
        }
        /* Copy remaining data from source to bounce buffer. */
        while ((ssize_t)srclen > 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_storeu_si128(buf, xmm0);
            src = RTE_PTR_ADD(src, 16);
            buf = RTE_PTR_ADD(buf, 16);
            srclen -= 16;
        }
    }

    /* Step 2:
     * Copy from the aligned bounce buffer to the destination.
     */

    /* Reset bounce buffer pointer; point to the aligned data area. */
    buf = RTE_PTR_ADD(buffer, 16);

    if (unlikely(!rte_is_aligned(dst, 16))) {
        /* Destination is not 16 byte aligned. */
        if (unlikely(!rte_is_aligned(dst, 4))) {
            /* Destination is not 4 byte aligned. */
            /** How many bytes are missing to reach 16 byte alignment. */
            const size_t n = RTE_PTR_DIFF(RTE_PTR_ALIGN_CEIL(dst, 16), dst);

            if (unlikely(len <= n))
                goto copy_trailing_bytes;

            /* Copy from bounce buffer until destination pointer is 16 byte aligned. */
            memcpy(dst, buf, n);
            buf = RTE_PTR_ADD(buf, n);
            dst = RTE_PTR_ADD(dst, n);
            len -= n;
        } else {
            /* Destination is 4 byte aligned. */
            /* Copy from bounce buffer until destination pointer is 16 byte aligned. */
            while (!rte_is_aligned(dst, 16)) {
                register int32_t r;

                if (unlikely(len < 4))
                    goto copy_trailing_bytes;

                r = *(int32_t *)buf;
                _mm_stream_si32(dst, r);
                buf = RTE_PTR_ADD(buf, 4);
                dst = RTE_PTR_ADD(dst, 4);
                len -= 4;
            }
        }
    }

    /* Destination is 16 byte aligned. */
    /* Copy large portion of data from bounce buffer to destination in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_loadu_si128(RTE_PTR_ADD(buf, 0 * 16));
        xmm1 = _mm_loadu_si128(RTE_PTR_ADD(buf, 1 * 16));
        xmm2 = _mm_loadu_si128(RTE_PTR_ADD(buf, 2 * 16));
        xmm3 = _mm_loadu_si128(RTE_PTR_ADD(buf, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        buf = RTE_PTR_ADD(buf, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data from bounce buffer to destination. */
    while (len >= 4) {
        int32_t r;

        memcpy(&r, buf, 4);
        _mm_stream_si32(dst, r);
        buf = RTE_PTR_ADD(buf, 4);
        dst = RTE_PTR_ADD(dst, 4);
        len -= 4;
    }

copy_trailing_bytes:
    if (unlikely(len != 0)) {
        /* Copy trailing bytes. */
        memcpy(dst, buf, len);
    }
}

/**
 * @internal
 * Non-temporal memory copy to 16 byte aligned destination and length.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16dla(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));

    while (len > RTE_MEMCPY_NT_BUFSIZE) {
        rte_memcpy_nt_buf16dla(dst, src, RTE_MEMCPY_NT_BUFSIZE, flags);
        dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
        src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
        len -= RTE_MEMCPY_NT_BUFSIZE;
    }

    rte_memcpy_nt_buf16dla(dst, src, len, flags);
}

/**
 * @internal
 * Non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @note
 * If the destination and/or length is unaligned, some copied bytes will be
 * stored in the destination memory area using temporal access.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 * @param src
 *   Pointer to the non-temporal source memory area.
 * @param len
 *   Number of bytes to copy.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_fallback(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    while (len > RTE_MEMCPY_NT_BUFSIZE) {
        rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE, flags);
        dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
        src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
        len -= RTE_MEMCPY_NT_BUFSIZE;
    }

    rte_memcpy_nt_buf(dst, src, len, flags);
}

/* Implementation. Refer to function declaration for documentation. */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    if (flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) {
        if (__builtin_constant_p(flags) ?
                ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A &&
                (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) :
                !(((uintptr_t)dst | len) & (16 - 1))) {
            if (__builtin_constant_p(flags) ?
                    (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A :
                    !((uintptr_t)src & (16 - 1)))
                rte_memcpy_nt16a(dst, src, len, flags);
            else
                rte_memcpy_nt16dla(dst, src, len, flags);
        } else if (__builtin_constant_p(flags) ? (
                (flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A &&
                (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A &&
                (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) :
                !(((uintptr_t)dst | (uintptr_t)src | len) & (4 - 1)))
            rte_memcpy_nt4a(dst, src, len, flags);
        else
            rte_memcpy_nt_fallback(dst, src, len, flags);
    } else
        rte_memcpy(dst, src, len);
}
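A note on the dispatch above: it relies on flags being a build-time constant,
so that __builtin_constant_p(flags) is true and the unused branches are
eliminated. As a sketch only (the wrapper name copy_block_nt is made up for
this example), an application copying 16 byte aligned blocks could pin the
flags in a small inline wrapper and have the call collapse to
rte_memcpy_nt16a():

/* Example (not part of the patch): with all 16 byte alignment hints given as
 * build-time constants, the branches in rte_memcpy_ex() fold away and the
 * call below compiles directly to rte_memcpy_nt16a().
 * The caller must guarantee that dst, src and len really are 16 byte aligned,
 * and call rte_wmb() after a sequence of such copies.
 */
static inline void
copy_block_nt(void *dst, const void *src, size_t len)
{
    rte_memcpy_ex(dst, src, len,
            RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT |
            RTE_MEMOPS_F_LEN16A | RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_DST16A);
}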