Subject: [RFC v3] non-temporal memcpy
From: Morten Brørup
Date: Fri, 19 Aug 2022 15:58:06 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D8728A@smartserver.smartshare.dk>
Cc: Bruce Richardson, Konstantin Ananyev, Honnappa Nagarahalli,
    Stephen Hemminger, Mattias Rönnblom

This RFC proposes a set of functions optimized for non-temporal memory copy.

At this stage, I am asking for acceptance of the concept and API.
Feedback on the x86 implementation is also welcome.

Applications sometimes copy data to another memory location, which is only
used much later. In this case, it is inefficient to pollute the data cache
with the copied data.

An example use case (originating from a real-life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of the functions is to achieve a performance gain by not
polluting the cache when copying data.

Although the throughput may be improved by further optimization, I do not
consider throughput optimization relevant initially.

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions have 16 byte alignment requirements [1].
ARM non-temporal load instructions are available with 4 byte alignment
requirements [2].
Both platforms offer non-temporal store instructions with 4 byte alignment
requirements.

In addition to the general function without any alignment requirements, I
have also implemented functions for respectively 16 and 4 byte aligned
access, for performance purposes.

NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
style and use TAB.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-

V2:
- Only copy from non-temporal source to non-temporal destination.
  I.e. remove the two variants with only source and/or destination being
  non-temporal.
- Do not require alignment. Instead, offer additional 4 and 16 byte aligned
  functions for performance purposes.
- Implemented two of the functions for x86.
- Remove memset function.
V3:
- Only one generic function is exposed in the API:
  rte_memcpy_ex(dst, src, len, flags), which should be called with the flags
  constant at build time.
- Requests for non-temporal source/destination memory access are now flags.
- Alignment hints are now flags.
- The functions for various alignments are not part of the declaration.
  They are implementation specific.
- Implemented the generic and unaligned functions for x86.
- Added note about rte_wmb() to non-temporal store.
- Variants for normal load and non-temporal store, as well as variants for
  non-temporal load and normal store, are not implemented. They can be added
  later.
- Improved the workaround for _mm_stream_load_si128() not taking a const
  pointer as parameter.
- Extensive use of __builtin_constant_p(flags) to help the compiler optimize
  the code.

Signed-off-by: Morten Brørup
---

/*****************************************************************/
/* Declaration. Goes into: /lib/eal/include/generic/rte_memcpy.h */
/*****************************************************************/

/*
 * Advanced/Non-Temporal Memory Operations Flags.
 */

/** Length alignment hint mask. */
#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
/** Hint: Length is 2 byte aligned. */
#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
/** Hint: Length is 4 byte aligned. */
#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
/** Hint: Length is 8 byte aligned. */
#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
/** Hint: Length is 16 byte aligned. */
#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
/** Hint: Length is 32 byte aligned. */
#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
/** Hint: Length is 64 byte aligned. */
#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
/** Hint: Length is 128 byte aligned. */
#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)

/** Prefer non-temporal access to source memory area.
 *
 * On ARM architecture:
 * Remember to call rte_???() before a sequence of copy operations.
 */
#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
/** Source address alignment hint mask. */
#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
/** Hint: Source address is 2 byte aligned. */
#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
/** Hint: Source address is 4 byte aligned. */
#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
/** Hint: Source address is 8 byte aligned. */
#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
/** Hint: Source address is 16 byte aligned. */
#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
/** Hint: Source address is 32 byte aligned. */
#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
/** Hint: Source address is 64 byte aligned. */
#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
/** Hint: Source address is 128 byte aligned. */
#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)

/** Prefer non-temporal access to destination memory area.
 *
 * On x86 architecture:
 * Remember to call rte_wmb() after a sequence of copy operations.
 * On ARM architecture:
 * Remember to call rte_???() after a sequence of copy operations.
 */
#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
/** Destination address alignment hint mask. */
#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
/** Hint: Destination address is 2 byte aligned. */
#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
/** Hint: Destination address is 4 byte aligned. */
#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
/** Hint: Destination address is 8 byte aligned. */
#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
/** Hint: Destination address is 16 byte aligned. */
#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
/** Hint: Destination address is 32 byte aligned. */
#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
/** Hint: Destination address is 64 byte aligned. */
#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
/** Hint: Destination address is 128 byte aligned. */
#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Advanced/non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the destination memory area.
 * @param src
 *   Pointer to the source memory area.
 * @param len
 *   Number of bytes to copy.
 * @param flags
 *   Hints for memory access.
 *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)A flags.
 *   Should be constant at build time.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags);
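To illustrate the intended usage (the capture buffer example mentioned above),
here is a minimal sketch. It is not part of the proposed patch; the names
capture_burst, capture_mem and CAPTURE_FLAGS are made up for the example,
capture_mem is assumed to be 16 byte aligned, and the usual mbuf headers are
assumed to be included. Note the rte_wmb() after the sequence of non-temporal
copies, as required on x86:

/* Example (not part of the patch): copy a burst of filtered packets into a
 * capture buffer for offline analysis, without polluting the data cache.
 * The flags are constant at build time, so the dispatch in rte_memcpy_ex()
 * is resolved by the compiler.
 * capture_mem must be 16 byte aligned to justify RTE_MEMOPS_F_DST16A.
 */
#define CAPTURE_FLAGS (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_DST16A)

static void
capture_burst(void *capture_mem, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t i;

    for (i = 0; i < nb_pkts; i++) {
        rte_memcpy_ex(capture_mem, rte_pktmbuf_mtod(pkts[i], const void *),
                pkts[i]->data_len, CAPTURE_FLAGS);
        /* Keep the destination 16 byte aligned for the next packet. */
        capture_mem = RTE_PTR_ADD(capture_mem,
                RTE_ALIGN_CEIL(pkts[i]->data_len, 16));
    }
    /* Flush the CPU's write combining buffers after the non-temporal stores. */
    rte_wmb();
}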
/****************************************************************/
/* Implementation. Goes into: /lib/eal/x86/include/rte_memcpy.h */
/****************************************************************/

/* Assumptions about register sizes. */
_Static_assert(sizeof(int32_t) == 4, "Wrong size of int32_t.");
_Static_assert(sizeof(__m128i) == 16, "Wrong size of __m128i.");

/**
 * @internal
 * Workaround for _mm_stream_load_si128() missing const in the parameter.
 */
__rte_internal
static __rte_always_inline
__m128i _mm_stream_load_si128_const(const __m128i * const mem_addr)
{
#if defined(RTE_TOOLCHAIN_GCC)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
#endif
    return _mm_stream_load_si128(mem_addr);
#if defined(RTE_TOOLCHAIN_GCC)
#pragma GCC diagnostic pop
#endif
}

/**
 * @internal
 * 16 byte aligned non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    register __m128i xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(src, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));

    /* Copy large portion of data in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
        xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
        xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
        xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        src = RTE_PTR_ADD(src, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data.
     * Omitted if length is known to be 64 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A))) {
        while (len != 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_stream_si128(dst, xmm0);
            src = RTE_PTR_ADD(src, 16);
            dst = RTE_PTR_ADD(dst, 16);
            len -= 16;
        }
    }
}

/**
 * @internal
 * 4 byte aligned non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 4 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   Must be 4 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    int32_t buffer[16 / 4] __rte_aligned(16);
    /** How many bytes is source offset from 16 byte alignment (floor rounding). */
    const size_t offset = ((uintptr_t)src & (16 - 1));
    register __m128i xmm0;

    RTE_ASSERT(rte_is_aligned(dst, 4));
    RTE_ASSERT(rte_is_aligned(src, 4));
    RTE_ASSERT(rte_is_aligned(len, 4));

    if (unlikely(len == 0))
        return;

    /* Copy first, not 16 byte aligned, part of source data.
     * Omitted if source is known to be 16 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) &&
            offset != 0) {
        const size_t first = 16 - offset;

        /** Adjust source pointer to achieve 16 byte alignment (floor rounding). */
        src = RTE_PTR_SUB(src, offset);
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        switch (first) {
        case 3 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
            if (unlikely(len == 1 * 4))
                return;
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
            if (unlikely(len == 2 * 4))
                return;
            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
            break;
        case 2 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
            if (unlikely(len == 1 * 4))
                return;
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
            break;
        case 1 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
            break;
        }
        src = RTE_PTR_ADD(src, first);
        dst = RTE_PTR_ADD(dst, first);
        len -= first;
    }

    /* Copy middle, 16 byte aligned, part of source data. */
    while (len >= 16) {
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
        src = RTE_PTR_ADD(src, 16);
        dst = RTE_PTR_ADD(dst, 4 * 4);
        len -= 16;
    }

    /* Copy last, not 16 byte aligned, part of source data.
     * Omitted if both the source address and the length are known to be
     * 16 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
            ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
            len != 0) {
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        switch (len) {
        case 1 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
            break;
        case 2 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
            break;
        case 3 * 4:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
            break;
        }
    }
}
#ifndef RTE_MEMCPY_NT_BUFSIZE

#include <rte_mbuf_core.h>
/* #include */

/** Bounce buffer size for non-temporal memcpy.
 *
 * The actual buffer will be slightly larger, due to added padding.
 * The default is chosen to be able to handle a non-segmented packet.
 */
#define RTE_MEMCPY_NT_BUFSIZE   RTE_MBUF_DEFAULT_DATAROOM

#endif /* RTE_MEMCPY_NT_BUFSIZE */

/**
 * @internal
 * Non-temporal memory copy to 16 byte aligned destination and length
 * from unaligned source via bounce buffer.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_buf16dla(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags __rte_unused)
{
    /** Aligned bounce buffer with preceding and trailing padding. */
    unsigned char buffer[16 + RTE_MEMCPY_NT_BUFSIZE + 16] __rte_aligned(16);
    void * buf;
    register __m128i xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));
    RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);

    if (unlikely(len == 0))
        return;

    /* Step 1:
     * Copy data from the source to the bounce buffer's aligned data area,
     * using aligned non-temporal load from the source,
     * and unaligned store in the bounce buffer.
     *
     * If the source is unaligned, the extra bytes preceding the data will be copied
     * to the padding area preceding the bounce buffer's aligned data area.
     * Similarly, if the source data ends at an unaligned address, the additional bytes
     * trailing the data will be copied to the padding area trailing the bounce buffer's
     * aligned data area.
     */
    {
        /** How many bytes is source offset from 16 byte alignment (floor rounding). */
        const size_t offset = ((uintptr_t)src & (16 - 1));
        /** Number of bytes to copy from source, incl. any extra preceding bytes. */
        size_t srclen = len + offset;

        /* Adjust source pointer for extra preceding bytes. */
        src = RTE_PTR_SUB(src, offset);
        /* Bounce buffer pointer, adjusted for extra preceding bytes. */
        buf = RTE_PTR_ADD(buffer, 16 - offset);

        /* Copy large portion of data from source to bounce buffer. */
        while (srclen >= 4 * 16) {
            xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
            xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
            xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
            xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
            _mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
            src = RTE_PTR_ADD(src, 4 * 16);
            buf = RTE_PTR_ADD(buf, 4 * 16);
            srclen -= 4 * 16;
        }

        /* Copy remaining data from source to bounce buffer. */
        while ((ssize_t)srclen > 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_storeu_si128(buf, xmm0);
            src = RTE_PTR_ADD(src, 16);
            buf = RTE_PTR_ADD(buf, 16);
            srclen -= 16;
        }
    }

    /* Step 2:
     * Copy from the aligned bounce buffer to the aligned destination.
     */

    /* Reset bounce buffer pointer; point to the aligned data area. */
    buf = RTE_PTR_ADD(buffer, 16);
    /* Copy large portion of data from bounce buffer to destination in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_load_si128(RTE_PTR_ADD(buf, 0 * 16));
        xmm1 = _mm_load_si128(RTE_PTR_ADD(buf, 1 * 16));
        xmm2 = _mm_load_si128(RTE_PTR_ADD(buf, 2 * 16));
        xmm3 = _mm_load_si128(RTE_PTR_ADD(buf, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        buf = RTE_PTR_ADD(buf, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data from bounce buffer to destination. */
    while (len != 0) {
        xmm0 = _mm_load_si128(buf);
        _mm_stream_si128(dst, xmm0);
        buf = RTE_PTR_ADD(buf, 16);
        dst = RTE_PTR_ADD(dst, 16);
        len -= 16;
    }
}

/**
 * @internal
 * Non-temporal memory copy via bounce buffer.
 *
 * @note
 * If the destination and/or length is unaligned, the first and/or last copied
 * bytes will be stored in the destination memory area using temporal access.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_buf(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags __rte_unused)
{
    /** Aligned bounce buffer with preceding and trailing padding. */
    unsigned char buffer[16 + RTE_MEMCPY_NT_BUFSIZE + 16] __rte_aligned(16);
    void * buf;
    register __m128i xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);

    if (unlikely(len == 0))
        return;

    /* Step 1:
     * Copy data from the source to the bounce buffer's aligned data area,
     * using aligned non-temporal load from the source,
     * and unaligned store in the bounce buffer.
     *
     * If the source is unaligned, the additional bytes preceding the data will be copied
     * to the padding area preceding the bounce buffer's aligned data area.
     * Similarly, if the source data ends at an unaligned address, the additional bytes
     * trailing the data will be copied to the padding area trailing the bounce buffer's
     * aligned data area.
     */
    {
        /** How many bytes is source offset from 16 byte alignment (floor rounding). */
        const size_t offset = ((uintptr_t)src & (16 - 1));
        /** Number of bytes to copy from source, incl. any extra preceding bytes. */
        size_t srclen = len + offset;

        /* Adjust source pointer for extra preceding bytes. */
        src = RTE_PTR_SUB(src, offset);
        /* Bounce buffer pointer, adjusted for extra preceding bytes. */
        buf = RTE_PTR_ADD(buffer, 16 - offset);

        /* Copy large portion of data from source to bounce buffer. */
        while (srclen >= 4 * 16) {
            xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
            xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
            xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
            xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
            _mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
            src = RTE_PTR_ADD(src, 4 * 16);
            buf = RTE_PTR_ADD(buf, 4 * 16);
            srclen -= 4 * 16;
        }
        /* Copy remaining data from source to bounce buffer. */
        while ((ssize_t)srclen > 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_storeu_si128(buf, xmm0);
            src = RTE_PTR_ADD(src, 16);
            buf = RTE_PTR_ADD(buf, 16);
            srclen -= 16;
        }
    }

    /* Step 2:
     * Copy from the aligned bounce buffer to the destination.
     */

    /* Reset bounce buffer pointer; point to the aligned data area. */
    buf = RTE_PTR_ADD(buffer, 16);

    if (unlikely(!rte_is_aligned(dst, 16))) {
        /* Destination is not 16 byte aligned. */
        if (unlikely(!rte_is_aligned(dst, 4))) {
            /* Destination is not 4 byte aligned. */
            /** How many bytes are missing to reach 16 byte alignment. */
            const size_t n = RTE_PTR_DIFF(RTE_PTR_ALIGN_CEIL(dst, 16), dst);

            if (unlikely(len <= n))
                goto copy_trailing_bytes;

            /* Copy from bounce buffer until destination pointer is 16 byte aligned. */
            memcpy(dst, buf, n);
            buf = RTE_PTR_ADD(buf, n);
            dst = RTE_PTR_ADD(dst, n);
            len -= n;
        } else {
            /* Destination is 4 byte aligned. */
            /* Copy from bounce buffer until destination pointer is 16 byte aligned. */
            while (!rte_is_aligned(dst, 16)) {
                register int32_t r;

                if (unlikely(len < 4))
                    goto copy_trailing_bytes;

                r = *(int32_t *)buf;
                _mm_stream_si32(dst, r);
                buf = RTE_PTR_ADD(buf, 4);
                dst = RTE_PTR_ADD(dst, 4);
                len -= 4;
            }
        }
    }

    /* Destination is 16 byte aligned. */
    /* Copy large portion of data from bounce buffer to destination in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_loadu_si128(RTE_PTR_ADD(buf, 0 * 16));
        xmm1 = _mm_loadu_si128(RTE_PTR_ADD(buf, 1 * 16));
        xmm2 = _mm_loadu_si128(RTE_PTR_ADD(buf, 2 * 16));
        xmm3 = _mm_loadu_si128(RTE_PTR_ADD(buf, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        buf = RTE_PTR_ADD(buf, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data from bounce buffer to destination. */
    while (len >= 4) {
        int32_t r;

        memcpy(&r, buf, 4);
        _mm_stream_si32(dst, r);
        buf = RTE_PTR_ADD(buf, 4);
        dst = RTE_PTR_ADD(dst, 4);
        len -= 4;
    }

copy_trailing_bytes:
    if (unlikely(len != 0)) {
        /* Copy trailing bytes. */
        memcpy(dst, buf, len);
    }
}

/**
 * @internal
 * Non-temporal memory copy to 16 byte aligned destination and length.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16dla(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));

    while (len > RTE_MEMCPY_NT_BUFSIZE) {
        rte_memcpy_nt_buf16dla(dst, src, RTE_MEMCPY_NT_BUFSIZE, flags);
        dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
        src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
        len -= RTE_MEMCPY_NT_BUFSIZE;
    }

    rte_memcpy_nt_buf16dla(dst, src, len, flags);
}

/**
 * @internal
 * Non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @note
 * If the destination and/or length is unaligned, some copied bytes will be
 * stored in the destination memory area using temporal access.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 * @param src
 *   Pointer to the non-temporal source memory area.
 * @param len
 *   Number of bytes to copy.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_fallback(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    while (len > RTE_MEMCPY_NT_BUFSIZE) {
        rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE, flags);
        dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
        src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
        len -= RTE_MEMCPY_NT_BUFSIZE;
    }

    rte_memcpy_nt_buf(dst, src, len, flags);
}

/* Implementation. Refer to function declaration for documentation. */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    if (flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) {
        if (__builtin_constant_p(flags) ?
                ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A &&
                (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) :
                !(((uintptr_t)dst | len) & (16 - 1))) {
            if (__builtin_constant_p(flags) ?
                    (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A :
                    !((uintptr_t)src & (16 - 1)))
                rte_memcpy_nt16a(dst, src, len, flags);
            else
                rte_memcpy_nt16dla(dst, src, len, flags);
        } else if (__builtin_constant_p(flags) ? (
                (flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A &&
                (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A &&
                (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) :
                !(((uintptr_t)dst | (uintptr_t)src | len) & (4 - 1)))
            rte_memcpy_nt4a(dst, src, len, flags);
        else
            rte_memcpy_nt_fallback(dst, src, len, flags);
    } else
        rte_memcpy(dst, src, len);
}
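A note on the dispatch above: it relies on flags being a build-time constant,
so that __builtin_constant_p(flags) is true and the unused branches are
eliminated. As a sketch only (the wrapper name copy_block_nt is made up for
this example), an application copying 16 byte aligned blocks could pin the
flags in a small inline wrapper and have the call collapse to
rte_memcpy_nt16a():

/* Example (not part of the patch): with all 16 byte alignment hints given as
 * build-time constants, the branches in rte_memcpy_ex() fold away and the
 * call below compiles directly to rte_memcpy_nt16a().
 * The caller must guarantee that dst, src and len really are 16 byte aligned,
 * and call rte_wmb() after a sequence of such copies.
 */
static inline void
copy_block_nt(void *dst, const void *src, size_t len)
{
    rte_memcpy_ex(dst, src, len,
            RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT |
            RTE_MEMOPS_F_LEN16A | RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_DST16A);
}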