DPDK patches and discussions
* [RFC v3] non-temporal memcpy
@ 2022-08-19 13:58 Morten Brørup
  2022-10-06 20:34 ` [PATCH] eal: " Morten Brørup
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Morten Brørup @ 2022-08-19 13:58 UTC (permalink / raw)
  To: dev
  Cc: Bruce Richardson, Konstantin Ananyev, Honnappa Nagarahalli,
	Stephen Hemminger, Mattias Rönnblom

This RFC proposes a set of functions optimized for non-temporal memory copy.

At this stage, I am asking for acceptance of the concept and API.
Feedback on the x86 implementation is also welcome.

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of the functions is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput may be improved by further optimization, I do not
consider throughput optimization relevant at this initial stage.

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions have 16 byte alignment requirements [1].
ARM non-temporal load instructions are available with 4 byte alignment
requirements [2].
Both platforms offer non-temporal store instructions with 4 byte alignment
requirements.

In addition to the general function without any alignment requirements, I
have also implemented functions for 16 and 4 byte aligned access,
respectively, for performance purposes.

NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
style and use TAB.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-

V2:
- Only copy from non-temporal source to non-temporal destination.
  I.e. remove the two variants with only source and/or destination being
  non-temporal.
- Do not require alignment.
  Instead, offer additional 4 and 16 byte aligned functions for performance
  purposes.
- Implemented two of the functions for x86.
- Remove memset function.

V3:
- Only one generic function is exposed in the API:
  rte_memcpy_ex(dst, src, len, flags),
  which should be called with the flags constant at build time.
- Requests for non-temporal source/destination memory access are now flags.
- Alignment hints are now flags.
- The functions for various alignments are not part of the declaration.
  They are implementation specific.
- Implemented the generic and unaligned functions for x86.
- Added note about rte_wmb() to non-temporal store.
- Variants for normal load and non-temporal store, as well as
  variants for non-temporal load and normal store are not implemented.
  They can be added later.
- Improved the workaround for _mm_stream_load_si128() not taking a const
  pointer as parameter.
- Extensive use of __builtin_constant_p(flags) to help the compiler optimize
  the code.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---

/*****************************************************************/
/* Declaration. Goes into: /lib/eal/include/generic/rte_memcpy.h */
/*****************************************************************/

/*
 * Advanced/Non-Temporal Memory Operations Flags.
 */

/** Length alignment hint mask. */
#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
/** Hint: Length is 2 byte aligned. */
#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
/** Hint: Length is 4 byte aligned. */
#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)      
/** Hint: Length is 8 byte aligned. */
#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)      
/** Hint: Length is 16 byte aligned. */
#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)     
/** Hint: Length is 32 byte aligned. */
#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)     
/** Hint: Length is 64 byte aligned. */
#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)     
/** Hint: Length is 128 byte aligned. */
#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)    

/** Prefer non-temporal access to source memory area.
 *
 * On ARM architecture:
 * Remember to call rte_???() before a sequence of copy operations.
 */
#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)      
/** Source address alignment hint mask. */
#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)   
/** Hint: Source address is 2 byte aligned. */
#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)      
/** Hint: Source address is 4 byte aligned. */
#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)      
/** Hint: Source address is 8 byte aligned. */
#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)      
/** Hint: Source address is 16 byte aligned. */
#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)     
/** Hint: Source address is 32 byte aligned. */
#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)     
/** Hint: Source address is 64 byte aligned. */
#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)     
/** Hint: Source address is 128 byte aligned. */
#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)    

/** Prefer non-temporal access to destination memory area.
 *
 * On x86 architecture:
 * Remember to call rte_wmb() after a sequence of copy operations.
 * On ARM architecture:
 * Remember to call rte_???() after a sequence of copy operations.
 */
#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)     
/** Destination address alignment hint mask. */
#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)  
/** Hint: Destination address is 2 byte aligned. */
#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)     
/** Hint: Destination address is 4 byte aligned. */
#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)     
/** Hint: Destination address is 8 byte aligned. */
#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)     
/** Hint: Destination address is 16 byte aligned. */
#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)    
/** Hint: Destination address is 32 byte aligned. */
#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)    
/** Hint: Destination address is 64 byte aligned. */
#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)    
/** Hint: Destination address is 128 byte aligned. */
#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)   


/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Advanced/non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the destination memory area.
 * @param src
 *   Pointer to the source memory area.
 * @param len
 *   Number of bytes to copy.
 * @param flags
 *   Hints for memory access.
 *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)<n>A flags.
 *   Should be constant at build time.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags);
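
For illustration, a minimal usage sketch of the proposed API (the function and
variable names are examples only; the capture buffer is assumed to be 16 byte
aligned with fixed 2048 byte slots, rte_wmb() comes from rte_atomic.h and
RTE_PTR_ADD from rte_common.h):

    static inline void
    capture_burst(void *cap_buf, const void * const pkt[], const size_t len[],
            unsigned int n)
    {
        unsigned int i;

        for (i = 0; i < n; i++) {
            /* Flags are constant at build time; the destination alignment flag
             * is only a hint, valid because cap_buf and the slot size are
             * 16 byte aligned.
             */
            rte_memcpy_ex(RTE_PTR_ADD(cap_buf, i * 2048), pkt[i], len[i],
                    RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_DST16A);
        }
        /* Non-temporal destination: call rte_wmb() after the copy sequence. */
        rte_wmb();
    }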


/****************************************************************/
/* Implementation. Goes into: /lib/eal/x86/include/rte_memcpy.h */
/****************************************************************/

/* Assumptions about register sizes. */
_Static_assert(sizeof(int32_t) == 4, "Wrong size of int32_t.");
_Static_assert(sizeof(__m128i) == 16, "Wrong size of __m128i.");

/**
 * @internal
 * Workaround for _mm_stream_load_si128() missing const in the parameter.
 */
__rte_internal
static __rte_always_inline
__m128i _mm_stream_load_si128_const(const __m128i * const mem_addr)
{
#if defined(RTE_TOOLCHAIN_GCC)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
#endif
    return _mm_stream_load_si128(mem_addr);
#if defined(RTE_TOOLCHAIN_GCC)
#pragma GCC diagnostic pop
#endif
}

/**
 * @internal
 * 16 byte aligned non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    register __m128i    xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(src, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));

    /* Copy large portion of data in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
        xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
        xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
        xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        src = RTE_PTR_ADD(src, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data.
     * Omitted if length is known to be 64 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A))) {
        while (len != 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_stream_si128(dst, xmm0);
            src = RTE_PTR_ADD(src, 16);
            dst = RTE_PTR_ADD(dst, 16);
            len -= 16;
        }
    }
}

/**
 * @internal
 * 4 byte aligned non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 4 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   Must be 4 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    int32_t             buffer[16 / 4] __rte_aligned(16);
    /** How many bytes is source offset from 16 byte alignment (floor rounding). */
    const size_t        offset = ((uintptr_t)src & (16 - 1));
    register __m128i    xmm0;

    RTE_ASSERT(rte_is_aligned(dst, 4));
    RTE_ASSERT(rte_is_aligned(src, 4));
    RTE_ASSERT(rte_is_aligned(len, 4));

    if (unlikely(len == 0)) return;

    /* Copy first, not 16 byte aligned, part of source data.
     * Omitted if source is known to be 16 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) &&
            offset != 0) {
        const size_t    first = 16 - offset;

        /** Adjust source pointer to achieve 16 byte alignment (floor rounding). */
        src = RTE_PTR_SUB(src, offset);
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        switch (first) {
            case 3 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
                if (unlikely(len == 1 * 4)) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
                if (unlikely(len == 2 * 4)) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
                break;
            case 2 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
                if (unlikely(len == 1 * 4)) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
                break;
            case 1 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
                break;
        }
        src = RTE_PTR_ADD(src, first);
        dst = RTE_PTR_ADD(dst, first);
        len -= first;
    }

    /* Copy middle, 16 byte aligned, part of source data. */
    while (len >= 16) {
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
        src = RTE_PTR_ADD(src, 16);
        dst = RTE_PTR_ADD(dst, 4 * 4);
        len -= 16;
    }

    /* Copy last, not 16 byte aligned, part of source data.
     * Omitted if source is known to be 16 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) &&
            len != 0) {
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        switch (len) {
            case 1 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
                break;
            case 2 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
                break;
            case 3 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
                _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
                break;
        }
    }
}

#ifndef RTE_MEMCPY_NT_BUFSIZE

#include <rte_mbuf.h>   /* #include <rte_mbuf_core.h> */

/** Bounce buffer size for non-temporal memcpy.
 *
 * The actual buffer will be slightly larger, due to added padding.
 * The default is chosen to be able to handle a non-segmented packet.
 */
#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM

#endif  /* RTE_MEMCPY_NT_BUFSIZE */

/**
 * @internal
 * Non-temporal memory copy to 16 byte aligned destination and length
 * from unaligned source via bounce buffer.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_buf16dla(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags __rte_unused)
{
    /** Aligned bounce buffer with preceding and trailing padding. */
    unsigned char       buffer[16 + RTE_MEMCPY_NT_BUFSIZE + 16] __rte_aligned(16);
    void *              buf;
    register __m128i    xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));
    RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);

    if (unlikely(len == 0)) return;

    /* Step 1:
     * Copy data from the source to the bounce buffer's aligned data area,
     * using aligned non-temporal load from the source,
     * and unaligned store in the bounce buffer.
     *
     * If the source is unaligned, the extra bytes preceding the data will be copied
     * to the padding area preceding the bounce buffer's aligned data area.
     * Similarly, if the source data ends at an unaligned address, the additional bytes
     * trailing the data will be copied to the padding area trailing the bounce buffer's
     * aligned data area.
     */
    {
        /** How many bytes is source offset from 16 byte alignment (floor rounding). */
        const size_t        offset = ((uintptr_t)src & (16 - 1));
        /** Number of bytes to copy from source, incl. any extra preceding bytes. */
        size_t              srclen = len + offset;

        /* Adjust source pointer for extra preceding bytes. */
        src = RTE_PTR_SUB(src, offset);
        /* Bounce buffer pointer, adjusted for extra preceding bytes. */
        buf = RTE_PTR_ADD(buffer, 16 - offset);

        /* Copy large portion of data from source to bounce buffer. */
        while (srclen >= 4 * 16) {
            xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
            xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
            xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
            xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
            _mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
            src = RTE_PTR_ADD(src, 4 * 16);
            buf = RTE_PTR_ADD(buf, 4 * 16);
            srclen -= 4 * 16;
        }

        /* Copy remaining data from source to bounce buffer. */
        while ((ssize_t)srclen > 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_storeu_si128(buf, xmm0);
            src = RTE_PTR_ADD(src, 16);
            buf = RTE_PTR_ADD(buf, 16);
            srclen -= 16;
        }
    }

    /* Step 2:
     * Copy from the aligned bounce buffer to the aligned destination.
     */

    /* Reset bounce buffer pointer; point to the aligned data area. */
    buf = RTE_PTR_ADD(buffer, 16);

    /* Copy large portion of data from bounce buffer to destination in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_load_si128(RTE_PTR_ADD(buf, 0 * 16));
        xmm1 = _mm_load_si128(RTE_PTR_ADD(buf, 1 * 16));
        xmm2 = _mm_load_si128(RTE_PTR_ADD(buf, 2 * 16));
        xmm3 = _mm_load_si128(RTE_PTR_ADD(buf, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        buf = RTE_PTR_ADD(buf, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data from bounce buffer to destination. */
    while (len != 0) {
        xmm0 = _mm_load_si128(buf);
        _mm_stream_si128(dst, xmm0);
        buf = RTE_PTR_ADD(buf, 16);
        dst = RTE_PTR_ADD(dst, 16);
        len -= 16;
    }
}

/**
 * @internal
 * Non-temporal memory copy via bounce buffer.
 *
 * @note
 * If the destination and/or length is unaligned, the first and/or last copied
 * bytes will be stored in the destination memory area using temporal access.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_buf(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags __rte_unused)
{
    /** Aligned bounce buffer with preceding and trailing padding. */
    unsigned char       buffer[16 + RTE_MEMCPY_NT_BUFSIZE + 16] __rte_aligned(16);
    void *              buf;
    register __m128i    xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);

    if (unlikely(len == 0)) return;

    /* Step 1:
     * Copy data from the source to the bounce buffer's aligned data area,
     * using aligned non-temporal load from the source,
     * and unaligned store in the bounce buffer.
     *
     * If the source is unaligned, the additional bytes preceding the data will be copied
     * to the padding area preceding the bounce buffer's aligned data area.
     * Similarly, if the source data ends at an unaligned address, the additional bytes
     * trailing the data will be copied to the padding area trailing the bounce buffer's
     * aligned data area.
     */
    {
        /** How many bytes is source offset from 16 byte alignment (floor rounding). */
        const size_t        offset = ((uintptr_t)src & (16 - 1));
        /** Number of bytes to copy from source, incl. any extra preceding bytes. */
        size_t              srclen = len + offset;

        /* Adjust source pointer for extra preceding bytes. */
        src = RTE_PTR_SUB(src, offset);
        /* Bounce buffer pointer, adjusted for extra preceding bytes. */
        buf = RTE_PTR_ADD(buffer, 16 - offset);

        /* Copy large portion of data from source to bounce buffer. */
        while (srclen >= 4 * 16) {
            xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
            xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
            xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
            xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
            _mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
            src = RTE_PTR_ADD(src, 4 * 16);
            buf = RTE_PTR_ADD(buf, 4 * 16);
            srclen -= 4 * 16;
        }

        /* Copy remaining data from source to bounce buffer. */
        while ((ssize_t)srclen > 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_storeu_si128(buf, xmm0);
            src = RTE_PTR_ADD(src, 16);
            buf = RTE_PTR_ADD(buf, 16);
            srclen -= 16;
        }
    }

    /* Step 2:
     * Copy from the aligned bounce buffer to the destination.
     */

    /* Reset bounce buffer pointer; point to the aligned data area. */
    buf = RTE_PTR_ADD(buffer, 16);

    if (unlikely(!rte_is_aligned(dst, 16))) {
        /* Destination is not 16 byte aligned. */

        if (unlikely(!rte_is_aligned(dst, 4))) {
            /* Destination is not 4 byte aligned. */
            /** How many bytes are missing to reach 16 byte alignment. */
            const size_t n = RTE_PTR_DIFF(RTE_PTR_ALIGN_CEIL(dst, 16), dst);

            if (unlikely(len <= n))
                goto copy_trailing_bytes;

            /* Copy from bounce buffer until destination pointer is 16 byte aligned. */
            memcpy(dst, buf, n);
            buf = RTE_PTR_ADD(buf, n);
            dst = RTE_PTR_ADD(dst, n);
            len -= n;
        } else {
            /* Destination is 4 byte aligned. */

            /* Copy from bounce buffer until destination pointer is 16 byte aligned. */
            while (!rte_is_aligned(dst, 16)) {
                register int32_t    r;

                if (unlikely(len < 4))
                    goto copy_trailing_bytes;

                r = *(int32_t *)buf;
                _mm_stream_si32(dst, r);
                buf = RTE_PTR_ADD(buf, 4);
                dst = RTE_PTR_ADD(dst, 4);
                len -= 4;
            }
        }
    }

    /* Destination is 16 byte aligned. */

    /* Copy large portion of data from bounce buffer to destination in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_loadu_si128(RTE_PTR_ADD(buf, 0 * 16));
        xmm1 = _mm_loadu_si128(RTE_PTR_ADD(buf, 1 * 16));
        xmm2 = _mm_loadu_si128(RTE_PTR_ADD(buf, 2 * 16));
        xmm3 = _mm_loadu_si128(RTE_PTR_ADD(buf, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        buf = RTE_PTR_ADD(buf, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data from bounce buffer to destination. */
    while (len >= 4) {
        int32_t r;

        memcpy(&r, buf, 4);
        _mm_stream_si32(dst, r);
        buf = RTE_PTR_ADD(buf, 4);
        dst = RTE_PTR_ADD(dst, 4);
        len -= 4;
    }

copy_trailing_bytes:
    if (unlikely(len != 0)) {
        /* Copy trailing bytes. */
        memcpy(dst, buf, len);
    }
}

/**
 * @internal
 * Non-temporal memory copy to 16 byte aligned destination and length.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16dla(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));

    while (len > RTE_MEMCPY_NT_BUFSIZE) {
        rte_memcpy_nt_buf16dla(dst, src, RTE_MEMCPY_NT_BUFSIZE, flags);
        dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
        src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
        len -= RTE_MEMCPY_NT_BUFSIZE;
    }
    rte_memcpy_nt_buf16dla(dst, src, len, flags);
}

/**
 * @internal
 * Non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @note
 * If the destination and/or length is unaligned, some copied bytes will be
 * stored in the destination memory area using temporal access.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 * @param src
 *   Pointer to the non-temporal source memory area.
 * @param len
 *   Number of bytes to copy.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_fallback(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    while (len > RTE_MEMCPY_NT_BUFSIZE) {
        rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE, flags);
        dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
        src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
        len -= RTE_MEMCPY_NT_BUFSIZE;
    }
    rte_memcpy_nt_buf(dst, src, len, flags);
}

/* Implementation. Refer to function declaration for documentation. */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    if (flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) {
        if (__builtin_constant_p(flags) ?
                ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A &&
                (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) :
                !(((uintptr_t)dst | len) & (16 - 1))) {
            if (__builtin_constant_p(flags) ?
                    (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A :
                    !((uintptr_t)src & (16 - 1)))
                rte_memcpy_nt16a(dst, src, len, flags);
            else
                rte_memcpy_nt16dla(dst, src, len, flags);
        }
        else if (__builtin_constant_p(flags) ? (
                (flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A &&
                (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A &&
                (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) :
                !(((uintptr_t)dst | (uintptr_t)src | len) & (4 - 1)))
            rte_memcpy_nt4a(dst, src, len, flags);
        else
            rte_memcpy_nt_fallback(dst, src, len, flags);
    } else
        rte_memcpy(dst, src, len);
}
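
For illustration, a call with build-time constant flags, e.g.

    rte_memcpy_ex(dst, src, len,
            RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT |
            RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_LEN16A);

should let the compiler reduce the dispatching above to a direct call to
rte_memcpy_nt16a(). A non-temporal call with constant flags but without
alignment hints is reduced to rte_memcpy_nt_fallback(), and with non-constant
flags the run-time alignment checks select the variant.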



* [PATCH] eal: non-temporal memcpy
  2022-08-19 13:58 [RFC v3] non-temporal memcpy Morten Brørup
@ 2022-10-06 20:34 ` Morten Brørup
  2022-10-10  7:35   ` Morten Brørup
  2022-10-07 10:19 ` [PATCH v2] " Morten Brørup
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Morten Brørup @ 2022-10-06 20:34 UTC (permalink / raw)
  To: hofors, bruce.richardson, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, kda, drc, dev, Morten Brørup

This patch provides a function for memory copy using non-temporal store,
load or both, controlled by flags passed to the function.

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of the function is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput can be improved by further optimization, I do not
have time to do it now.

The functional tests and performance tests for memory copy have been
expanded to include non-temporal copying.

A non-temporal version of the mbuf library's function to create a full
copy of a given packet mbuf is provided.

The packet capture and packet dump libraries have been updated to use
non-temporal memory copy of the packets.
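
For illustration, the kind of change this involves in the capture path is
replacing a plain copy of the packet data with a non-temporal one (a sketch
with made-up variable names, not the actual pdump/pcapng diff):

    rte_memcpy_ex(dst_data, rte_pktmbuf_mtod(seg, const void *),
            RTE_MIN((size_t)rte_pktmbuf_data_len(seg), dst_room),
            RTE_MEMOPS_F_DST_NT);

followed by rte_wmb() once the burst of copies is done, as noted for
RTE_MEMOPS_F_DST_NT on x86.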

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions must be 16 byte aligned [1], and
non-temporal store instructions must be 4, 8 or 16 byte aligned [2].

ARM non-temporal load and store instructions seem to require 4 byte
alignment [3].

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
index.html#text=_mm_stream_load
[2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
index.html#text=_mm_stream_si
[3] https://developer.arm.com/documentation/100076/0100/
A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
LDNP--SIMD-and-FP-

This patch is a major rewrite from the RFC v3, so no version log is
provided.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_memcpy.c               |   69 +-
 app/test/test_memcpy_perf.c          |   92 +-
 lib/eal/include/generic/rte_memcpy.h |  115 +++
 lib/eal/x86/include/rte_memcpy.h     | 1153 ++++++++++++++++++++++++++
 lib/mbuf/rte_mbuf.c                  |   77 ++
 lib/mbuf/rte_mbuf.h                  |   32 +
 lib/mbuf/version.map                 |    1 +
 lib/pcapng/rte_pcapng.c              |    3 +-
 lib/pdump/rte_pdump.c                |    6 +-
 9 files changed, 1506 insertions(+), 42 deletions(-)

diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
index 1ab86f4967..bb094297e1 100644
--- a/app/test/test_memcpy.c
+++ b/app/test/test_memcpy.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -36,6 +37,19 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 /* Data is aligned on this many bytes (power of 2) */
 #define ALIGNMENT_UNIT          32
 
+const uint64_t nt_mode_flags[4] = {
+	0,
+	RTE_MEMOPS_F_SRC_NT,
+	RTE_MEMOPS_F_DST_NT,
+	RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT
+};
+const char * const nt_mode_str[4] = { 
+	"none",
+	"src",
+	"dst",
+	"src+dst"
+};
+
 
 /*
  * Create two buffers, and initialise one with random values. These are copied
@@ -44,12 +58,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
  * changed.
  */
 static int
-test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
+test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size, unsigned int nt_mode)
 {
 	unsigned int i;
 	uint8_t dest[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	uint8_t src[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	void * ret;
+	const uint64_t flags = nt_mode_flags[nt_mode];
 
 	/* Setup buffers */
 	for (i = 0; i < SMALL_BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
@@ -58,18 +73,23 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	}
 
 	/* Do the copy */
-	ret = rte_memcpy(dest + off_dst, src + off_src, size);
-	if (ret != (dest + off_dst)) {
-		printf("rte_memcpy() returned %p, not %p\n",
-		       ret, dest + off_dst);
+	if (nt_mode) {
+		rte_memcpy_ex(dest + off_dst, src + off_src, size, flags);
+	} else {
+		ret = rte_memcpy(dest + off_dst, src + off_src, size);
+		if (ret != (dest + off_dst)) {
+			printf("rte_memcpy() returned %p, not %p\n",
+			       ret, dest + off_dst);
+		}
 	}
 
 	/* Check nothing before offset is affected */
 	for (i = 0; i < off_dst; i++) {
 		if (dest[i] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[modified before start of dst].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -77,9 +97,11 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check everything was copied */
 	for (i = 0; i < size; i++) {
 		if (dest[i + off_dst] != src[i + off_src]) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
-			       "[didn't copy byte %u].\n",
-			       (unsigned)size, off_src, off_dst, i);
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
+			       "[didn't copy byte %u: 0x%02x!=0x%02x].\n",
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode], i,
+			       dest[i + off_dst], src[i + off_src]);
 			return -1;
 		}
 	}
@@ -87,9 +109,10 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check nothing after copy was affected */
 	for (i = size; i < SMALL_BUFFER_SIZE; i++) {
 		if (dest[i + off_dst] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[copied too many].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -102,16 +125,22 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 static int
 func_test(void)
 {
-	unsigned int off_src, off_dst, i;
+	unsigned int off_src, off_dst, i, nt_mode;
 	int ret;
 
-	for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
-		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
-			for (i = 0; i < RTE_DIM(buf_sizes); i++) {
-				ret = test_single_memcpy(off_src, off_dst,
-				                         buf_sizes[i]);
-				if (ret != 0)
-					return -1;
+	for (nt_mode = 0; nt_mode < 4; nt_mode++) {
+		for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
+			for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
+				for (i = 0; i < RTE_DIM(buf_sizes); i++) {
+					printf("TEST: rte_memcpy%s(offsets=%u,%u size=%zu nt=%s)\n",
+					       nt_mode ? "_ex" : "",
+					       off_src, off_dst, buf_sizes[i],
+					       nt_mode_str[nt_mode]);
+					ret = test_single_memcpy(off_src, off_dst,
+					                         buf_sizes[i], nt_mode);
+					if (ret != 0)
+						return -1;
+				}
 			}
 		}
 	}
diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 3727c160e6..7eb498a2bc 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -15,6 +16,7 @@
 #include <rte_malloc.h>
 
 #include <rte_memcpy.h>
+#include <rte_atomic.h>
 
 #include "test.h"
 
@@ -27,8 +29,8 @@
 /* List of buffer sizes to test */
 #if TEST_VALUE_RANGE == 0
 static size_t buf_sizes[] = {
-	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
-	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
+	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 40, 48, 60, 63, 64, 65, 80, 92, 124,
+	127, 128, 129, 140, 152, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
 	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
 	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
 };
@@ -60,6 +62,10 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define ALIGNMENT_UNIT          16
 #endif
 
+/* Non-temporal memcpy source and destination address alignment */
+#define ALIGNED_FLAGS ((ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) | \
+        (ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT))
+
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
  * access where random addresses within the buffer are used for each
@@ -172,15 +178,20 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
 do {                                                                        \
     unsigned int iter, t;                                                   \
     size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
-    uint64_t start_time, total_time = 0;                                    \
-    uint64_t total_time2 = 0;                                               \
+    uint64_t start_time;                                                    \
+    uint64_t total_time_rte = 0, total_time_std = 0;                        \
+    uint64_t total_time_ntd = 0, total_time_nts = 0, total_time_nt = 0;     \
+    const uint64_t flags = ((dst_uoffset == 0) ?                            \
+            (ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT) : 0) |              \
+            ((src_uoffset == 0) ?                                           \
+            (ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) : 0);               \
     for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
         fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
                          src_addrs, is_src_cached, src_uoffset);            \
         start_time = rte_rdtsc();                                           \
         for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
             rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
-        total_time += rte_rdtsc() - start_time;                             \
+        total_time_rte += rte_rdtsc() - start_time;                         \
     }                                                                       \
     for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
         fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
@@ -188,11 +199,49 @@ do {                                                                        \
         start_time = rte_rdtsc();                                           \
         for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
             memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
-        total_time2 += rte_rdtsc() - start_time;                            \
+        total_time_std += rte_rdtsc() - start_time;                         \
     }                                                                       \
-    printf("%3.0f -", (double)total_time  / TEST_ITERATIONS);                 \
-    printf("%3.0f",   (double)total_time2 / TEST_ITERATIONS);                 \
-    printf("(%6.2f%%) ", ((double)total_time - total_time2)*100/total_time2); \
+    if (!(is_dst_cached && is_src_cached)) {                                    \
+        for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
+            fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
+                             src_addrs, is_src_cached, src_uoffset);            \
+            start_time = rte_rdtsc();                                           \
+            for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
+                rte_memcpy_ex(dst+dst_addrs[t], src+src_addrs[t], size,         \
+                        flags | RTE_MEMOPS_F_DST_NT);                           \
+            total_time_ntd += rte_rdtsc() - start_time;                         \
+        }                                                                       \
+        for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
+            fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
+                             src_addrs, is_src_cached, src_uoffset);            \
+            start_time = rte_rdtsc();                                           \
+            for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
+                rte_memcpy_ex(dst+dst_addrs[t], src+src_addrs[t], size,         \
+                        flags | RTE_MEMOPS_F_SRC_NT);                           \
+            total_time_nts += rte_rdtsc() - start_time;                         \
+        }                                                                       \
+        for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
+            fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
+                             src_addrs, is_src_cached, src_uoffset);            \
+            start_time = rte_rdtsc();                                           \
+            for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
+                rte_memcpy_ex(dst+dst_addrs[t], src+src_addrs[t], size,         \
+                        flags | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT);     \
+            total_time_nt += rte_rdtsc() - start_time;                          \
+        }                                                                       \
+    }                                                                           \
+    printf(" %4.0f-", (double)total_time_rte / TEST_ITERATIONS);                                \
+    printf("%4.0f",   (double)total_time_std / TEST_ITERATIONS);                                \
+    printf("(%+4.0f%%)", ((double)total_time_rte - total_time_std)*100/total_time_std);         \
+    if (!(is_dst_cached && is_src_cached)) {                                                    \
+        printf(" %4.0f", (double)total_time_ntd / TEST_ITERATIONS);                             \
+        printf(" %4.0f", (double)total_time_nts / TEST_ITERATIONS);                             \
+        printf(" %4.0f", (double)total_time_nt / TEST_ITERATIONS);                              \
+        if (total_time_nt / total_time_std > 9)                                                 \
+            printf("(*%4.1f)", (double)total_time_nt/total_time_std);                           \
+        else                                                                                    \
+            printf("(%+4.0f%%)", ((double)total_time_nt - total_time_std)*100/total_time_std);  \
+    }                                                                                           \
 } while (0)
 
 /* Run aligned memcpy tests for each cached/uncached permutation */
@@ -224,9 +273,11 @@ do {                                                                     \
 /* Run memcpy tests for constant length */
 #define ALL_PERF_TEST_FOR_CONSTANT                                      \
 do {                                                                    \
-    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
+    TEST_CONSTANT(4U); TEST_CONSTANT(6U); TEST_CONSTANT(8U);            \
+    TEST_CONSTANT(16U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);        \
     TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
     TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
+    TEST_CONSTANT(2048U);                                               \
 } while (0)
 
 /* Run all memcpy tests for aligned constant cases */
@@ -290,13 +341,14 @@ perf_test(void)
 	/* See function comment */
 	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
 
-	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
-		   "======= ================= ================= ================= =================\n"
-		   "   Size   Cache to cache     Cache to mem      Mem to cache        Mem to mem\n"
-		   "(bytes)          (ticks)          (ticks)           (ticks)           (ticks)\n"
-		   "------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n** rte_memcpy(RTE)/memcpy(STD)/rte_memcpy_ex(NTD/NTS/NT) - memcpy perf. tests (C = compile-time constant) **\n"
+		   "======= ================ ====================================== ====================================== ======================================\n"
+		   "   Size  Cache to cache               Cache to mem                           Mem to cache                            Mem to mem\n"
+		   "(bytes)         (ticks)                    (ticks)                                (ticks)                               (ticks)\n"
+		   "         RTE- STD(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)\n"
+		   "------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 
-	printf("\n================================= %2dB aligned =================================",
+	printf("\n================================================================ %2dB aligned ===============================================================",
 		ALIGNMENT_UNIT);
 	/* Do aligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
@@ -304,28 +356,28 @@ perf_test(void)
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do aligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_aligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n================================== Unaligned ==================================");
+	printf("\n================================================================= Unaligned =================================================================");
 	/* Do unaligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_variable_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do unaligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n======= ================= ================= ================= =================\n\n");
+	printf("\n======= ================ ====================================== ====================================== ======================================\n\n");
 
 	printf("Test Execution Time (seconds):\n");
 	printf("Aligned variable copy size   = %8.3f\n", time_aligned);
diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
index e7f0f8eaa9..f20816e346 100644
--- a/lib/eal/include/generic/rte_memcpy.h
+++ b/lib/eal/include/generic/rte_memcpy.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_H_
@@ -113,4 +114,118 @@ rte_memcpy(void *dst, const void *src, size_t n);
 
 #endif /* __DOXYGEN__ */
 
+/*
+ * Advanced/Non-Temporal Memory Operations Flags.
+ */
+
+/** Length alignment hint mask. */
+#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
+/** Length alignment hint shift. */
+#define RTE_MEMOPS_F_LENA_SHIFT 0
+/** Hint: Length is 2 byte aligned. */
+#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
+/** Hint: Length is 4 byte aligned. */
+#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
+/** Hint: Length is 8 byte aligned. */
+#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
+/** Hint: Length is 16 byte aligned. */
+#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
+/** Hint: Length is 32 byte aligned. */
+#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
+/** Hint: Length is 64 byte aligned. */
+#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
+/** Hint: Length is 128 byte aligned. */
+#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)
+
+/** Prefer non-temporal access to source memory area.
+ */
+#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
+/** Source address alignment hint mask. */
+#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
+/** Source address alignment hint shift. */
+#define RTE_MEMOPS_F_SRCA_SHIFT 8
+/** Hint: Source address is 2 byte aligned. */
+#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
+/** Hint: Source address is 4 byte aligned. */
+#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
+/** Hint: Source address is 8 byte aligned. */
+#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
+/** Hint: Source address is 16 byte aligned. */
+#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
+/** Hint: Source address is 32 byte aligned. */
+#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
+/** Hint: Source address is 64 byte aligned. */
+#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
+/** Hint: Source address is 128 byte aligned. */
+#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)
+
+/** Prefer non-temporal access to destination memory area.
+ *
+ * On x86 architecture:
+ * Remember to call rte_wmb() after a sequence of copy operations.
+ */
+#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
+/** Destination address alignment hint mask. */
+#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
+/** Destination address alignment hint shift. */
+#define RTE_MEMOPS_F_DSTA_SHIFT 16
+/** Hint: Destination address is 2 byte aligned. */
+#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
+/** Hint: Destination address is 4 byte aligned. */
+#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
+/** Hint: Destination address is 8 byte aligned. */
+#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
+/** Hint: Destination address is 16 byte aligned. */
+#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
+/** Hint: Destination address is 32 byte aligned. */
+#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
+/** Hint: Destination address is 64 byte aligned. */
+#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
+/** Hint: Destination address is 128 byte aligned. */
+#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Advanced/non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)<n>A flags.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags);
+
+#ifndef RTE_MEMCPY_EX_ARCH_DEFINED
+
+/* Fallback implementation, if no arch-specific implementation is provided. */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags __rte_unused)
+{
+	memcpy(dst, src, len);
+}
+
+#endif /* RTE_MEMCPY_EX_ARCH_DEFINED */
+
 #endif /* _RTE_MEMCPY_H_ */
diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index d4d7a5cfc8..8286e83d1e 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_X86_64_H_
@@ -17,6 +18,10 @@
 #include <rte_vect.h>
 #include <rte_common.h>
 #include <rte_config.h>
+#include <rte_debug.h>
+
+#define RTE_MEMCPY_EX_ARCH_DEFINED
+#include "generic/rte_memcpy.h"
 
 #ifdef __cplusplus
 extern "C" {
@@ -868,6 +873,1154 @@ rte_memcpy(void *dst, const void *src, size_t n)
 		return rte_memcpy_generic(dst, src, n);
 }
 
+/*
+ * Advanced/Non-Temporal Memory Operations.
+ */
+
+/**
+ * @internal
+ * Workaround for _mm_stream_load_si128() missing const in the parameter.
+ */
+__rte_internal
+static __rte_always_inline
+__m128i _mm_stream_load_si128_const(const __m128i * const mem_addr)
+{
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
+#endif
+	return _mm_stream_load_si128(mem_addr);
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic pop
+#endif
+}
+
+/**
+ * @internal
+ * Memory copy from non-temporal source area.
+ *
+ * @note
+ * Performance is optimal when source pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|SRC)<n>A flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be set.
+ *   The RTE_MEMOPS_F_DST_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DST<n>A flags are ignored.
+ *   Must be constant at build time.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nts(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_SRC_NT);
+
+	if (unlikely(len == 0)) return;
+
+	/* If source is not 16 byte aligned, then copy first part of data via bounce buffer,
+	 * to achieve 16 byte alignment of source pointer.
+	 * This invalidates the source, destination and length alignment flags, and
+	 * potentially makes the destination pointer unaligned.
+	 *
+	 * Omitted if source is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) {
+		/* Source is not known to be 16 byte aligned, but might be. */
+		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+		const size_t    offset = (uintptr_t)src & 15;
+
+		if (offset) {
+			/* Source is not 16 byte aligned. */
+			char            buffer[16] __rte_aligned(16);
+			/** How many bytes is source away from 16 byte alignment (ceiling rounding). */
+			const size_t    first = 16 - offset;
+
+			xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+			_mm_store_si128((void *)buffer, xmm0);
+
+			/* Test for short length.
+			 *
+			 * Omitted if length is known to be >= 16.
+			 */
+			if (!(__builtin_constant_p(len) && len >= 16) &&
+					unlikely(len <= first)) {
+				/* Short length. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+				return;
+			}
+
+			/* Copy until source pointer is 16 byte aligned. */
+			rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), first);
+			src = RTE_PTR_ADD(src, first);
+			dst = RTE_PTR_ADD(dst, first);
+			len -= first;
+		}
+	}
+
+	/* Source pointer is now 16 byte aligned. */
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, if any, via bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid) and length is known to be 16 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 15)) {
+		char    buffer[16] __rte_aligned(16);
+
+		xmm3 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)buffer, xmm3);
+		rte_mov15_or_less(dst, buffer, len & 15);
+	}
+}
+
+/**
+ * @internal
+ * Memory copy to non-temporal destination area.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ * @note
+ * Performance is optimal when destination pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|DST)<n>A flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DST_NT flag must be set.
+ *   The RTE_MEMOPS_F_SRC<n>A flags are ignored.
+ *   Must be constant at build time.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ntd(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_DST_NT);
+
+	if (unlikely(len == 0)) return;
+
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ||
+			len >= 16) {
+		/* Length >= 16 and/or destination is known to be 16 byte aligned. */
+		register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+		/* If destination is not 16 byte aligned, then copy first part of data,
+		 * to achieve 16 byte alignment of destination pointer.
+		 * This invalidates the source, destination and length alignment flags, and
+		 * potentially makes the source pointer unaligned.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)) {
+			/* Destination is not known to be 16 byte aligned, but might be. */
+			/** How many bytes is destination offset from 16 byte alignment (floor rounding). */
+			const size_t    offset = (uintptr_t)dst & 15;
+
+			if (offset) {
+				/* Destination is not 16 byte aligned. */
+				/** How many bytes is destination away from 16 byte alignment (ceiling rounding). */
+				const size_t    first = 16 - offset;
+
+				if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+						(offset & 3) == 0) {
+					/* Destination is (known to be) 4 byte aligned. */
+					int32_t r0, r1, r2;
+
+					/* Copy until destination pointer is 16 byte aligned. */
+					if (first & 8) {
+						memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+						memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+						src = RTE_PTR_ADD(src, 8);
+						dst = RTE_PTR_ADD(dst, 8);
+						len -= 8;
+					}
+					if (first & 4) {
+						memcpy(&r2, src, 4);
+						_mm_stream_si32(dst, r2);
+						src = RTE_PTR_ADD(src, 4);
+						dst = RTE_PTR_ADD(dst, 4);
+						len -= 4;
+					}
+				} else {
+					/* Destination is not 4 byte aligned. */
+					/* Copy until destination pointer is 16 byte aligned. */
+					rte_mov15_or_less(dst, src, first);
+					src = RTE_PTR_ADD(src, first);
+					dst = RTE_PTR_ADD(dst, first);
+					len -= first;
+				}
+			}
+		}
+
+		/* Destination pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(dst, 16));
+
+		/* Copy large portion of data in chunks of 64 byte. */
+		while (len >= 64) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			xmm2 = _mm_loadu_si128(RTE_PTR_ADD(src, 2 * 16));
+			xmm3 = _mm_loadu_si128(RTE_PTR_ADD(src, 3 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+			src = RTE_PTR_ADD(src, 64);
+			dst = RTE_PTR_ADD(dst, 64);
+			len -= 64;
+		}
+
+		/* Copy following 32 and 16 byte portions of data.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned (so the alignment
+		 * flags are still valid)
+		 * and length is known to be respectively 64 or 32 byte aligned.
+		 */
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+				(len & 32)) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			src = RTE_PTR_ADD(src, 32);
+			dst = RTE_PTR_ADD(dst, 32);
+		}
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+				(len & 16)) {
+			xmm2 = _mm_loadu_si128(src);
+			_mm_stream_si128(dst, xmm2);
+			src = RTE_PTR_ADD(src, 16);
+			dst = RTE_PTR_ADD(dst, 16);
+		}
+	} else {
+		/* Length <= 15, and
+		 * destination is not known to be 16 byte aligned (but might be).
+		 */
+		/* If destination is not 4 byte aligned, then
+		 * use normal copy and return.
+		 *
+		 * Omitted if destination is known to be 4 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) &&
+				!rte_is_aligned(dst, 4)) {
+			/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+			rte_mov15_or_less(dst, src, len);
+			return;
+		}
+		/* Destination is (known to be) 4 byte aligned. Proceed. */
+	}
+
+	/* Destination pointer is now 4 byte (or 16 byte) aligned. */
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+
+	/* Copy following 8 and 4 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 16 or 8 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 8)) {
+		int32_t r0, r1;
+
+		memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+		memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+		src = RTE_PTR_ADD(src, 8);
+		dst = RTE_PTR_ADD(dst, 8);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A)) &&
+			(len & 4)) {
+		int32_t r2;
+
+		memcpy(&r2, src, 4);
+		_mm_stream_si32(dst, r2);
+		src = RTE_PTR_ADD(src, 4);
+		dst = RTE_PTR_ADD(dst, 4);
+	}
+
+	/* Copy remaining 2 and 1 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 4 or 2 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A)) &&
+			(len & 2)) {
+		int16_t r3;
+
+		memcpy(&r3, src, 2);
+		*(int16_t *)dst = r3;
+		src = RTE_PTR_ADD(src, 2);
+		dst = RTE_PTR_ADD(dst, 2);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A)) &&
+			(len & 1))
+		*(char *)dst = *(const char *)src;
+}
+
+/**
+ * @internal
+ * Non-temporal memory copy of 15 byte or less
+ * from 16 byte aligned source via bounce buffer.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Only the 4 least significant bits of this parameter are used;
+ *   they hold the number of remaining bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_15_or_less_s16a(void * __rte_restrict dst,
+		const void * __rte_restrict src, size_t len, const uint64_t flags)
+{
+	int32_t             buffer[4] __rte_aligned(16);
+	register __m128i    xmm0;
+
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if ((len & 15) == 0) return;
+
+	/* Non-temporal load into bounce buffer. */
+	xmm0 = _mm_stream_load_si128_const(src);
+	_mm_store_si128((void *)buffer, xmm0);
+
+	/* Store from bounce buffer. */
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+			rte_is_aligned(dst, 4)) {
+		/* Destination is (known to be) 4 byte aligned. */
+		src = (const void *)buffer;
+		if (len & 8) {
+			if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) {
+				/* Destination is known to be 8 byte aligned. */
+				_mm_stream_si64(dst, *(const int64_t *)src);
+			} else {
+				_mm_stream_si32(RTE_PTR_ADD(dst, 0), buffer[0]);
+				_mm_stream_si32(RTE_PTR_ADD(dst, 4), buffer[1]);
+			}
+			src = RTE_PTR_ADD(src, 8);
+			dst = RTE_PTR_ADD(dst, 8);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A) &&
+				(len & 4)) {
+			_mm_stream_si32(dst, *(const int32_t *)src);
+			src = RTE_PTR_ADD(src, 4);
+			dst = RTE_PTR_ADD(dst, 4);
+		}
+
+		/* Non-temporal store is unavailable for the remaining 3 byte or less. */
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) &&
+				(len & 2)) {
+			*(int16_t *)dst = *(const int16_t *)src;
+			src = RTE_PTR_ADD(src, 2);
+			dst = RTE_PTR_ADD(dst, 2);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A) &&
+				(len & 1)) {
+			*(char *)dst = *(const char *)src;
+		}
+	} else {
+		/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+		rte_mov15_or_less(dst, (const void *)buffer, len & 15);
+	}
+}
+
+/**
+ * @internal
+ * 16 byte aligned addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 16 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d16s16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 16));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0)) return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_stream_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ?
+				flags : RTE_MEMOPS_F_DST16A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+
+/**
+ * @internal
+ * 8/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 8 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d8s16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int64_t             buffer[8] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 8));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0)) return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 2], xmm3);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 4 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 5 * 8), buffer[5]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 6 * 8), buffer[6]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 7 * 8), buffer[7]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[5]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) ?
+				flags : RTE_MEMOPS_F_DST8A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+
+/**
+ * @internal
+ * 4/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 4 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int32_t             buffer[16] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0)) return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 4], xmm3);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  0 * 4), buffer[ 0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  1 * 4), buffer[ 1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  2 * 4), buffer[ 2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  3 * 4), buffer[ 3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  4 * 4), buffer[ 4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  5 * 4), buffer[ 5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  6 * 4), buffer[ 6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  7 * 4), buffer[ 7]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  8 * 4), buffer[ 8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  9 * 4), buffer[ 9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 10 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 11 * 4), buffer[11]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 12 * 4), buffer[12]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 13 * 4), buffer[13]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 14 * 4), buffer[14]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 15 * 4), buffer[15]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 4 * 4), buffer[4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 5 * 4), buffer[5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 6 * 4), buffer[6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 7 * 4), buffer[7]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[ 8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[ 9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[11]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ?
+				flags : RTE_MEMOPS_F_DST4A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+
+/**
+ * @internal
+ * 4 byte aligned addresses (non-temporal) memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the (non-temporal) destination memory area.
+ *   Must be 4 byte aligned if using non-temporal store.
+ * @param src
+ *   Pointer to the (non-temporal) source memory area.
+ *   Must be 4 byte aligned if using non-temporal load.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+	const size_t    offset = (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A ?
+			0 : (uintptr_t)src & 15;
+
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 4));
+
+	if (unlikely(len == 0)) return;
+
+	if (offset == 0) {
+		/* Source is 16 byte aligned. */
+		/* Copy everything, using upgraded source alignment flags. */
+		rte_memcpy_nt_d4s16a(dst, src, len,
+				(flags & ~RTE_MEMOPS_F_SRCA_MASK) | RTE_MEMOPS_F_SRC16A);
+	} else {
+		/* Source is not 16 byte aligned, so make it 16 byte aligned. */
+		int32_t             buffer[4] __rte_aligned(16);
+		const size_t        first = 16 - offset;
+		register __m128i    xmm0;
+
+		/* First, copy first part of data in chunks of 4 byte,
+		 * to achieve 16 byte alignment of source.
+		 * This invalidates the source, destination and length alignment flags, and
+		 * may leave the destination pointer either 16 byte aligned or unaligned.
+		 */
+
+		/** Copy from 16 byte aligned source pointer (floor rounding). */
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+		_mm_store_si128((void *)buffer, xmm0);
+
+		if (unlikely(len + offset <= 16)) {
+			/* Short length. */
+			if (((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ||
+					(len & 3) == 0) {
+				/* Length is 4 byte aligned. */
+				switch (len) {
+					case 1 * 4:
+						/* Offset can be 1 * 4, 2 * 4 or 3 * 4. */
+						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[offset / 4]);
+						break;
+					case 2 * 4:
+						/* Offset can be 1 * 4 or 2 * 4. */
+						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[offset / 4]);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[offset / 4 + 1]);
+						break;
+					case 3 * 4:
+						/* Offset can only be 1 * 4. */
+						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+						break;
+				}
+			} else {
+				/* Length is not 4 byte aligned. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+			}
+			return;
+		}
+
+		switch (first) {
+			case 1 * 4:
+				_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
+				break;
+			case 2 * 4:
+				_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
+				_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
+				break;
+			case 3 * 4:
+				_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+				_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+				_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+				break;
+		}
+
+		src = RTE_PTR_ADD(src, first);
+		dst = RTE_PTR_ADD(dst, first);
+		len -= first;
+
+		/* Source pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(src, 16));
+
+		/* Then, copy the rest, using corrected alignment flags. */
+		if (rte_is_aligned(dst, 16))
+			rte_memcpy_nt_d16s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+		else if (rte_is_aligned(dst, 8))
+			rte_memcpy_nt_d8s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST8A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+		else
+			rte_memcpy_nt_d4s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST4A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+	}
+}
+
+#ifndef RTE_MEMCPY_NT_BUFSIZE
+
+#include <lib/mbuf/rte_mbuf_core.h>
+
+/** Bounce buffer size for non-temporal memcpy.
+ *
+ * Must be 2^N and >= 128.
+ * The actual buffer will be slightly larger, due to added padding.
+ * The default is chosen to be able to handle a non-segmented packet.
+ */
+#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM
+
+#endif  /* RTE_MEMCPY_NT_BUFSIZE */
+
+/**
+ * @internal
+ * Non-temporal memory copy via bounce buffer.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_buf(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** Cache line aligned bounce buffer with preceding and trailing padding.
+	 *
+	 * The preceding padding is one cache line, so the data area itself
+	 * is cache line aligned.
+	 * The trailing padding is 16 bytes, leaving room for the trailing bytes
+	 * of a 16 byte store operation.
+	 */
+	char                buffer[RTE_CACHE_LINE_SIZE + RTE_MEMCPY_NT_BUFSIZE +  16]
+			__rte_cache_aligned;
+	/** Pointer to bounce buffer's aligned data area. */
+	char * const        buf0 = &buffer[RTE_CACHE_LINE_SIZE];
+	void *              buf;
+	/** Number of bytes to copy from source, incl. any extra preceding bytes. */
+	size_t              srclen;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);
+
+	if (unlikely(len == 0)) return;
+
+	/* Step 1:
+	 * Copy data from the source to the bounce buffer's aligned data area,
+	 * using aligned non-temporal load from the source,
+	 * and unaligned store in the bounce buffer.
+	 *
+	 * If the source is unaligned, the additional bytes preceding the data will be copied
+	 * to the padding area preceding the bounce buffer's aligned data area.
+	 * Similarly, if the source data ends at an unaligned address, the additional bytes
+	 * trailing the data will be copied to the padding area trailing the bounce buffer's
+	 * aligned data area.
+	 */
+
+	/* Adjust for extra preceding bytes,
+	 * unless source is known to be 16 byte aligned. */
+	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) {
+		buf = buf0;
+		srclen = len;
+	} else {
+		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+		const size_t offset = (uintptr_t)src & 15;
+
+		buf = RTE_PTR_SUB(buf0, offset);
+		src = RTE_PTR_SUB(src, offset);
+		srclen = len + offset;
+	}
+
+	/* Copy large portion of data from source to bounce buffer in chunks of 64 byte. */
+	while (srclen >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		buf = RTE_PTR_ADD(buf, 64);
+		srclen -= 64;
+	}
+
+	/* Copy remaining 32 and 16 byte portions of data from source to bounce buffer,
+	 * incl. any trailing bytes.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the length alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(srclen & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(srclen & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(buf, xmm2);
+	}
+
+	/* Step 2:
+	 * Copy from the aligned bounce buffer to the non-temporal destination.
+	 */
+	rte_memcpy_ntd(dst, buf0, len,
+			(flags & ~(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRCA_MASK)) |
+			(RTE_CACHE_LINE_SIZE << RTE_MEMOPS_F_SRCA_SHIFT));
+}
+
+/**
+ * @internal
+ * Non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @note
+ * If the destination and/or length is unaligned, some copied bytes will be
+ * stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_generic(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+
+	while (len > RTE_MEMCPY_NT_BUFSIZE) {
+		rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE,
+				(flags & ~RTE_MEMOPS_F_LENA_MASK) | RTE_MEMOPS_F_LEN128A);
+		dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
+		src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
+		len -= RTE_MEMCPY_NT_BUFSIZE;
+	}
+	rte_memcpy_nt_buf(dst, src, len, flags);
+}
+
+/* Implementation. Refer to function declaration for documentation. */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	if ((flags & (RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) ==
+			(RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) {
+		/* Copy between non-temporal source and destination. */
+		if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d16s16a(dst, src, len, flags);
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d8s16a(dst, src, len, flags);
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d4s16a(dst, src, len, flags);
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
+			rte_memcpy_nt_d4s4a(dst, src, len, flags);
+		else if (len <= RTE_MEMCPY_NT_BUFSIZE)
+			rte_memcpy_nt_buf(dst, src, len, flags);
+		else
+			rte_memcpy_nt_generic(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_SRC_NT) {
+		/* Copy from non-temporal source. */
+		rte_memcpy_nts(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_DST_NT) {
+		/* Copy to non-temporal destination. */
+		rte_memcpy_ntd(dst, src, len, flags);
+	} else
+		rte_memcpy(dst, src, len);
+}
+
 #undef ALIGNMENT_MASK
 
 #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index a2307cebe6..aa96fb4cc8 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -660,6 +660,83 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	return mc;
 }
 
+/* Create a deep copy of mbuf, using non-temporal memory access */
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		 uint32_t off, uint32_t len, const uint64_t flags)
+{
+	const struct rte_mbuf *seg = m;
+	struct rte_mbuf *mc, *m_last, **prev;
+
+	/* garbage in check */
+	__rte_mbuf_sanity_check(m, 1);
+
+	/* check for request to copy at offset past end of mbuf */
+	if (unlikely(off >= m->pkt_len))
+		return NULL;
+
+	mc = rte_pktmbuf_alloc(mp);
+	if (unlikely(mc == NULL))
+		return NULL;
+
+	/* truncate requested length to available data */
+	if (len > m->pkt_len - off)
+		len = m->pkt_len - off;
+
+	__rte_pktmbuf_copy_hdr(mc, m);
+
+	/* copied mbuf is not indirect or external */
+	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
+
+	prev = &mc->next;
+	m_last = mc;
+	while (len > 0) {
+		uint32_t copy_len;
+
+		/* skip leading mbuf segments */
+		while (off >= seg->data_len) {
+			off -= seg->data_len;
+			seg = seg->next;
+		}
+
+		/* current buffer is full, chain a new one */
+		if (rte_pktmbuf_tailroom(m_last) == 0) {
+			m_last = rte_pktmbuf_alloc(mp);
+			if (unlikely(m_last == NULL)) {
+				rte_pktmbuf_free(mc);
+				return NULL;
+			}
+			++mc->nb_segs;
+			*prev = m_last;
+			prev = &m_last->next;
+		}
+
+		/*
+		 * copy the min of data in input segment (seg)
+		 * vs space available in output (m_last)
+		 */
+		copy_len = RTE_MIN(seg->data_len - off, len);
+		if (copy_len > rte_pktmbuf_tailroom(m_last))
+			copy_len = rte_pktmbuf_tailroom(m_last);
+
+		/* append from seg to m_last */
+		rte_memcpy_ex(rte_pktmbuf_mtod_offset(m_last, char *,
+						   m_last->data_len),
+			   rte_pktmbuf_mtod_offset(seg, char *, off),
+			   copy_len, flags);
+
+		/* update offsets and lengths */
+		m_last->data_len += copy_len;
+		mc->pkt_len += copy_len;
+		off += copy_len;
+		len -= copy_len;
+	}
+
+	/* garbage out check */
+	__rte_mbuf_sanity_check(mc, 1);
+	return mc;
+}
+
 /* dump a mbuf on console */
 void
 rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index b6e23d98ce..030df396a3 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1443,6 +1443,38 @@ struct rte_mbuf *
 rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t offset, uint32_t length);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Create a full copy of a given packet mbuf,
+ * using non-temporal memory access as specified by flags.
+ *
+ * Copies all the data from a given packet mbuf to a newly allocated
+ * set of mbufs. The private data are not copied.
+ *
+ * @param m
+ *   The packet mbuf to be copied.
+ * @param mp
+ *   The mempool from which the "clone" mbufs are allocated.
+ * @param offset
+ *   The number of bytes to skip before copying.
+ *   If the mbuf does not have that many bytes, it is an error
+ *   and NULL is returned.
+ * @param length
+ *   The upper limit on bytes to copy.  Passing UINT32_MAX
+ *   means all data (after offset).
+ * @param flags
+ *   Non-temporal memory access hints for rte_memcpy_ex.
+ * @return
+ *   - The pointer to the new "clone" mbuf on success.
+ *   - NULL if allocation fails.
+ */
+__rte_experimental
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		    uint32_t offset, uint32_t length, const uint64_t flags);
+
 /**
  * Adds given value to the refcnt of all packet mbuf segments.
  *
diff --git a/lib/mbuf/version.map b/lib/mbuf/version.map
index ed486ed14e..b583364ad4 100644
--- a/lib/mbuf/version.map
+++ b/lib/mbuf/version.map
@@ -47,5 +47,6 @@ EXPERIMENTAL {
 	global:
 
 	rte_pktmbuf_pool_create_extbuf;
+	rte_pktmbuf_copy_ex;
 
 };
diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
index af2b814251..ae871c4865 100644
--- a/lib/pcapng/rte_pcapng.c
+++ b/lib/pcapng/rte_pcapng.c
@@ -466,7 +466,8 @@ rte_pcapng_copy(uint16_t port_id, uint32_t queue,
 	orig_len = rte_pktmbuf_pkt_len(md);
 
 	/* Take snapshot of the data */
-	mc = rte_pktmbuf_copy(md, mp, 0, length);
+	mc = rte_pktmbuf_copy_ex(md, mp, 0, length,
+				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 	if (unlikely(mc == NULL))
 		return NULL;
 
diff --git a/lib/pdump/rte_pdump.c b/lib/pdump/rte_pdump.c
index 98dcbc037b..6e61c75407 100644
--- a/lib/pdump/rte_pdump.c
+++ b/lib/pdump/rte_pdump.c
@@ -124,7 +124,8 @@ pdump_copy(uint16_t port_id, uint16_t queue,
 					    pkts[i], mp, cbs->snaplen,
 					    ts, direction);
 		else
-			p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
+			p = rte_pktmbuf_copy_ex(pkts[i], mp, 0, cbs->snaplen,
+						RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 
 		if (unlikely(p == NULL))
 			__atomic_fetch_add(&stats->nombuf, 1, __ATOMIC_RELAXED);
@@ -134,6 +135,9 @@ pdump_copy(uint16_t port_id, uint16_t queue,
 
 	__atomic_fetch_add(&stats->accepted, d_pkts, __ATOMIC_RELAXED);
 
+	/* Flush the non-temporal stores of the packet copies. */
+	rte_wmb();
+
 	ring_enq = rte_ring_enqueue_burst(ring, (void *)dup_bufs, d_pkts, NULL);
 	if (unlikely(ring_enq < d_pkts)) {
 		unsigned int drops = d_pkts - ring_enq;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2] eal: non-temporal memcpy
  2022-08-19 13:58 [RFC v3] non-temporal memcpy Morten Brørup
  2022-10-06 20:34 ` [PATCH] eal: " Morten Brørup
@ 2022-10-07 10:19 ` Morten Brørup
  2022-10-09 15:35 ` [PATCH v3] " Morten Brørup
  2022-10-10  6:46 ` [PATCH v4] " Morten Brørup
  3 siblings, 0 replies; 17+ messages in thread
From: Morten Brørup @ 2022-10-07 10:19 UTC (permalink / raw)
  To: hofors, bruce.richardson, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, kda, drc, dev, Morten Brørup

This patch provides a function for memory copy using non-temporal store,
load or both, controlled by flags passed to the function.
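
As a minimal usage sketch (dst, src and len are placeholder variables, not
part of this patch; the flags must be constant at build time):

    /* Copy data that is only read much later, bypassing the cache on
     * both the load and the store side.
     */
    rte_memcpy_ex(dst, src, len, RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);

    /* Flush the non-temporal stores before handing the copy over to
     * another lcore, as done in the pdump change below.
     */
    rte_wmb();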

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of the function is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput can be improved by further optimization, I do not
have time to do it now.

The functional tests and performance tests for memory copy have been
expanded to include non-temporal copying.

A non-temporal version of the mbuf library's function to create a full
copy of a given packet mbuf is provided.
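
For example (a sketch only; m and mp stand for an existing packet mbuf and
a mempool, and are not part of this patch):

    /* Deep copy of the whole packet, using non-temporal loads and stores. */
    struct rte_mbuf *mc = rte_pktmbuf_copy_ex(m, mp, 0, UINT32_MAX,
            RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);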

The packet capture and packet dump libraries have been updated to use
non-temporal memory copy of the packets.

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions must be 16 byte aligned [1], and
non-temporal store instructions must be 4, 8 or 16 byte aligned [2].

ARM non-temporal load and store instructions seem to require 4 byte
alignment [3].

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_si
[3] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-
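
When the alignment of the source and/or destination is known at build time,
the corresponding hint flags can be passed along with the non-temporal
flags; e.g. for a 16 byte aligned source and a 4 byte aligned destination
(a sketch only, not taken from the patch):

    rte_memcpy_ex(dst, src, len,
            RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT |
            RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_DST4A);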

This patch is a major rewrite from the RFC v3, so no version log comparing
to the RFC is provided.

v2
* The last 16 byte block of data, incl. any trailing bytes, was not
  copied from the source memory area in rte_memcpy_nt_buf().
* Fix many coding style issues.
* Add some missing header files.
* Fix build time warning for non-x86 architectures by using a different
  method to mark the flags parameter unused.
* CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
  so omit it when using CLANG.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_memcpy.c               |   69 +-
 app/test/test_memcpy_perf.c          |  187 ++--
 lib/eal/include/generic/rte_memcpy.h |  119 +++
 lib/eal/x86/include/rte_memcpy.h     | 1203 ++++++++++++++++++++++++++
 lib/mbuf/rte_mbuf.c                  |   77 ++
 lib/mbuf/rte_mbuf.h                  |   32 +
 lib/mbuf/version.map                 |    1 +
 lib/pcapng/rte_pcapng.c              |    3 +-
 lib/pdump/rte_pdump.c                |    6 +-
 9 files changed, 1606 insertions(+), 91 deletions(-)

diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
index 1ab86f4967..e3adb6d9df 100644
--- a/app/test/test_memcpy.c
+++ b/app/test/test_memcpy.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -36,6 +37,19 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 /* Data is aligned on this many bytes (power of 2) */
 #define ALIGNMENT_UNIT          32
 
+const uint64_t nt_mode_flags[4] = {
+	0,
+	RTE_MEMOPS_F_SRC_NT,
+	RTE_MEMOPS_F_DST_NT,
+	RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT
+};
+const char * const nt_mode_str[4] = {
+	"none",
+	"src",
+	"dst",
+	"src+dst"
+};
+
 
 /*
  * Create two buffers, and initialise one with random values. These are copied
@@ -44,12 +58,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
  * changed.
  */
 static int
-test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
+test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size, unsigned int nt_mode)
 {
 	unsigned int i;
 	uint8_t dest[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	uint8_t src[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	void * ret;
+	const uint64_t flags = nt_mode_flags[nt_mode];
 
 	/* Setup buffers */
 	for (i = 0; i < SMALL_BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
@@ -58,18 +73,23 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	}
 
 	/* Do the copy */
-	ret = rte_memcpy(dest + off_dst, src + off_src, size);
-	if (ret != (dest + off_dst)) {
-		printf("rte_memcpy() returned %p, not %p\n",
-		       ret, dest + off_dst);
+	if (nt_mode) {
+		rte_memcpy_ex(dest + off_dst, src + off_src, size, flags);
+	} else {
+		ret = rte_memcpy(dest + off_dst, src + off_src, size);
+		if (ret != (dest + off_dst)) {
+			printf("rte_memcpy() returned %p, not %p\n",
+			       ret, dest + off_dst);
+		}
 	}
 
 	/* Check nothing before offset is affected */
 	for (i = 0; i < off_dst; i++) {
 		if (dest[i] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[modified before start of dst].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -77,9 +97,11 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check everything was copied */
 	for (i = 0; i < size; i++) {
 		if (dest[i + off_dst] != src[i + off_src]) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
-			       "[didn't copy byte %u].\n",
-			       (unsigned)size, off_src, off_dst, i);
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
+			       "[didn't copy byte %u: 0x%02x!=0x%02x].\n",
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode], i,
+			       dest[i + off_dst], src[i + off_src]);
 			return -1;
 		}
 	}
@@ -87,9 +109,10 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check nothing after copy was affected */
 	for (i = size; i < SMALL_BUFFER_SIZE; i++) {
 		if (dest[i + off_dst] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[copied too many].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -102,16 +125,22 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 static int
 func_test(void)
 {
-	unsigned int off_src, off_dst, i;
+	unsigned int off_src, off_dst, i, nt_mode;
 	int ret;
 
-	for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
-		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
-			for (i = 0; i < RTE_DIM(buf_sizes); i++) {
-				ret = test_single_memcpy(off_src, off_dst,
-				                         buf_sizes[i]);
-				if (ret != 0)
-					return -1;
+	for (nt_mode = 0; nt_mode < 4; nt_mode++) {
+		for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
+			for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
+				for (i = 0; i < RTE_DIM(buf_sizes); i++) {
+					printf("TEST: rte_memcpy%s(offsets=%u,%u size=%zu nt=%s)\n",
+					       nt_mode ? "_ex" : "",
+					       off_src, off_dst, buf_sizes[i],
+					       nt_mode_str[nt_mode]);
+					ret = test_single_memcpy(off_src, off_dst,
+								 buf_sizes[i], nt_mode);
+					if (ret != 0)
+						return -1;
+				}
 			}
 		}
 	}
diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 3727c160e6..6bb52cba88 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -15,6 +16,7 @@
 #include <rte_malloc.h>
 
 #include <rte_memcpy.h>
+#include <rte_atomic.h>
 
 #include "test.h"
 
@@ -27,9 +29,9 @@
 /* List of buffer sizes to test */
 #if TEST_VALUE_RANGE == 0
 static size_t buf_sizes[] = {
-	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
-	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
-	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
+	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 40, 48, 60, 63, 64, 65, 80, 92, 124,
+	127, 128, 129, 140, 152, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447,
+	448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
 	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
 };
 /* MUST be as large as largest packet size above */
@@ -72,7 +74,7 @@ static uint8_t *small_buf_read, *small_buf_write;
 static int
 init_buffers(void)
 {
-	unsigned i;
+	unsigned int i;
 
 	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (large_buf_read == NULL)
@@ -151,7 +153,7 @@ static void
 do_uncached_write(uint8_t *dst, int is_dst_cached,
 				  const uint8_t *src, int is_src_cached, size_t size)
 {
-	unsigned i, j;
+	unsigned int i, j;
 	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
 
 	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
@@ -167,66 +169,112 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
  * Run a single memcpy performance test. This is a macro to ensure that if
  * the "size" parameter is a constant it won't be converted to a variable.
  */
-#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
-                         src, is_src_cached, src_uoffset, size)             \
-do {                                                                        \
-    unsigned int iter, t;                                                   \
-    size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
-    uint64_t start_time, total_time = 0;                                    \
-    uint64_t total_time2 = 0;                                               \
-    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
-        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
-                         src_addrs, is_src_cached, src_uoffset);            \
-        start_time = rte_rdtsc();                                           \
-        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
-            rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
-        total_time += rte_rdtsc() - start_time;                             \
-    }                                                                       \
-    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
-        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
-                         src_addrs, is_src_cached, src_uoffset);            \
-        start_time = rte_rdtsc();                                           \
-        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
-            memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
-        total_time2 += rte_rdtsc() - start_time;                            \
-    }                                                                       \
-    printf("%3.0f -", (double)total_time  / TEST_ITERATIONS);                 \
-    printf("%3.0f",   (double)total_time2 / TEST_ITERATIONS);                 \
-    printf("(%6.2f%%) ", ((double)total_time - total_time2)*100/total_time2); \
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,					  \
+			 src, is_src_cached, src_uoffset, size)					  \
+do {												  \
+	unsigned int iter, t;									  \
+	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];				  \
+	uint64_t start_time;									  \
+	uint64_t total_time_rte = 0, total_time_std = 0;					  \
+	uint64_t total_time_ntd = 0, total_time_nts = 0, total_time_nt = 0;			  \
+	const uint64_t flags = ((dst_uoffset == 0) ?						  \
+				(ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT) : 0) |		  \
+			       ((src_uoffset == 0) ?						  \
+				(ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) : 0);		  \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
+				 src_addrs, is_src_cached, src_uoffset);			  \
+		start_time = rte_rdtsc();							  \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
+			rte_memcpy(dst + dst_addrs[t], src + src_addrs[t], size);		  \
+		total_time_rte += rte_rdtsc() - start_time;					  \
+	}											  \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
+				 src_addrs, is_src_cached, src_uoffset);			  \
+		start_time = rte_rdtsc();							  \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
+			memcpy(dst + dst_addrs[t], src + src_addrs[t], size);			  \
+		total_time_std += rte_rdtsc() - start_time;					  \
+	}											  \
+	if (!(is_dst_cached && is_src_cached)) {						  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_DST_NT);			  \
+			total_time_ntd += rte_rdtsc() - start_time;				  \
+		}										  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_SRC_NT);			  \
+			total_time_nts += rte_rdtsc() - start_time;				  \
+		}										  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT); \
+			total_time_nt += rte_rdtsc() - start_time;				  \
+		}										  \
+	}											  \
+	printf(" %4.0f-", (double)total_time_rte / TEST_ITERATIONS);				  \
+	printf("%4.0f",   (double)total_time_std / TEST_ITERATIONS);				  \
+	printf("(%+4.0f%%)", ((double)total_time_rte - total_time_std) * 100 / total_time_std);   \
+	if (!(is_dst_cached && is_src_cached)) {						  \
+		printf(" %4.0f", (double)total_time_ntd / TEST_ITERATIONS);			  \
+		printf(" %4.0f", (double)total_time_nts / TEST_ITERATIONS);			  \
+		printf(" %4.0f", (double)total_time_nt / TEST_ITERATIONS);			  \
+		if (total_time_nt / total_time_std > 9)						  \
+			printf("(*%4.1f)", (double)total_time_nt / total_time_std);		  \
+		else										  \
+			printf("(%+4.0f%%)",							  \
+			       ((double)total_time_nt - total_time_std) * 100 / total_time_std);  \
+	}											  \
 } while (0)
 
 /* Run aligned memcpy tests for each cached/uncached permutation */
-#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
-do {                                                                     \
-    if (__builtin_constant_p(n))                                         \
-        printf("\nC%6u", (unsigned)n);                                   \
-    else                                                                 \
-        printf("\n%7u", (unsigned)n);                                    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
+#define ALL_PERF_TESTS_FOR_SIZE(n)						\
+do {										\
+	if (__builtin_constant_p(n))						\
+		printf("\nC%6u", (unsigned int)n);				\
+	else									\
+		printf("\n%7u", (unsigned int)n);				\
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);	\
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);	\
 } while (0)
 
 /* Run unaligned memcpy tests for each cached/uncached permutation */
-#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
-do {                                                                     \
-    if (__builtin_constant_p(n))                                         \
-        printf("\nC%6u", (unsigned)n);                                   \
-    else                                                                 \
-        printf("\n%7u", (unsigned)n);                                    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)					\
+do {										\
+	if (__builtin_constant_p(n))						\
+		printf("\nC%6u", (unsigned int)n);				\
+	else									\
+		printf("\n%7u", (unsigned int)n);				\
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);	\
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);	\
 } while (0)
 
 /* Run memcpy tests for constant length */
-#define ALL_PERF_TEST_FOR_CONSTANT                                      \
-do {                                                                    \
-    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
-    TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
-    TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
+#define ALL_PERF_TEST_FOR_CONSTANT						\
+do {										\
+	TEST_CONSTANT(4U); TEST_CONSTANT(6U); TEST_CONSTANT(8U);		\
+	TEST_CONSTANT(16U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);		\
+	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);		\
+	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);	\
+	TEST_CONSTANT(2048U);							\
 } while (0)
 
 /* Run all memcpy tests for aligned constant cases */
@@ -251,7 +299,7 @@ perf_test_constant_unaligned(void)
 static inline void
 perf_test_variable_aligned(void)
 {
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
 		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
 	}
@@ -261,7 +309,7 @@ perf_test_variable_aligned(void)
 static inline void
 perf_test_variable_unaligned(void)
 {
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
 		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
 	}
@@ -282,7 +330,7 @@ perf_test(void)
 
 #if TEST_VALUE_RANGE != 0
 	/* Set up buf_sizes array, if required */
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < TEST_VALUE_RANGE; i++)
 		buf_sizes[i] = i;
 #endif
@@ -290,13 +338,14 @@ perf_test(void)
 	/* See function comment */
 	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
 
-	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
-		   "======= ================= ================= ================= =================\n"
-		   "   Size   Cache to cache     Cache to mem      Mem to cache        Mem to mem\n"
-		   "(bytes)          (ticks)          (ticks)           (ticks)           (ticks)\n"
-		   "------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n** rte_memcpy(RTE)/memcpy(STD)/rte_memcpy_ex(NTD/NTS/NT) - memcpy perf. tests (C = compile-time constant) **\n"
+		   "======= ================ ====================================== ====================================== ======================================\n"
+		   "   Size  Cache to cache               Cache to mem                           Mem to cache                            Mem to mem\n"
+		   "(bytes)         (ticks)                    (ticks)                                (ticks)                               (ticks)\n"
+		   "         RTE- STD(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)\n"
+		   "------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 
-	printf("\n================================= %2dB aligned =================================",
+	printf("\n================================================================ %2dB aligned ===============================================================",
 		ALIGNMENT_UNIT);
 	/* Do aligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
@@ -304,28 +353,28 @@ perf_test(void)
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do aligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_aligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n================================== Unaligned ==================================");
+	printf("\n================================================================= Unaligned =================================================================");
 	/* Do unaligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_variable_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do unaligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n======= ================= ================= ================= =================\n\n");
+	printf("\n======= ================ ====================================== ====================================== ======================================\n\n");
 
 	printf("Test Execution Time (seconds):\n");
 	printf("Aligned variable copy size   = %8.3f\n", time_aligned);
diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
index e7f0f8eaa9..5963fda992 100644
--- a/lib/eal/include/generic/rte_memcpy.h
+++ b/lib/eal/include/generic/rte_memcpy.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_H_
@@ -11,6 +12,9 @@
  * Functions for vectorised implementation of memcpy().
  */
 
+#include <rte_common.h>
+#include <rte_compat.h>
+
 /**
  * Copy 16 bytes from one location to another using optimised
  * instructions. The locations should not overlap.
@@ -113,4 +117,119 @@ rte_memcpy(void *dst, const void *src, size_t n);
 
 #endif /* __DOXYGEN__ */
 
+/*
+ * Advanced/Non-Temporal Memory Operations Flags.
+ */
+
+/** Length alignment hint mask. */
+#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
+/** Length alignment hint shift. */
+#define RTE_MEMOPS_F_LENA_SHIFT 0
+/** Hint: Length is 2 byte aligned. */
+#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
+/** Hint: Length is 4 byte aligned. */
+#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
+/** Hint: Length is 8 byte aligned. */
+#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
+/** Hint: Length is 16 byte aligned. */
+#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
+/** Hint: Length is 32 byte aligned. */
+#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
+/** Hint: Length is 64 byte aligned. */
+#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
+/** Hint: Length is 128 byte aligned. */
+#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)
+
+/** Prefer non-temporal access to source memory area.
+ */
+#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
+/** Source address alignment hint mask. */
+#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
+/** Source address alignment hint shift. */
+#define RTE_MEMOPS_F_SRCA_SHIFT 8
+/** Hint: Source address is 2 byte aligned. */
+#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
+/** Hint: Source address is 4 byte aligned. */
+#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
+/** Hint: Source address is 8 byte aligned. */
+#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
+/** Hint: Source address is 16 byte aligned. */
+#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
+/** Hint: Source address is 32 byte aligned. */
+#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
+/** Hint: Source address is 64 byte aligned. */
+#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
+/** Hint: Source address is 128 byte aligned. */
+#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)
+
+/** Prefer non-temporal access to destination memory area.
+ *
+ * On x86 architecture:
+ * Remember to call rte_wmb() after a sequence of copy operations.
+ */
+#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
+/** Destination address alignment hint mask. */
+#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
+/** Destination address alignment hint shift. */
+#define RTE_MEMOPS_F_DSTA_SHIFT 16
+/** Hint: Destination address is 2 byte aligned. */
+#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
+/** Hint: Destination address is 4 byte aligned. */
+#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
+/** Hint: Destination address is 8 byte aligned. */
+#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
+/** Hint: Destination address is 16 byte aligned. */
+#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
+/** Hint: Destination address is 32 byte aligned. */
+#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
+/** Hint: Destination address is 64 byte aligned. */
+#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
+/** Hint: Destination address is 128 byte aligned. */
+#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)
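+
+/* Note: The alignment hint macros encode the alignment value itself in the respective
+ * field, e.g. (UINT64_C(16) << RTE_MEMOPS_F_DSTA_SHIFT) equals RTE_MEMOPS_F_DST16A,
+ * so a hint can also be composed from a compile-time constant alignment, as the
+ * updated perf test does with (ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT).
+ */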
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Advanced/non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)<n>A flags.
+ *   Must be constant at build time.
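+ *
+ *   Example (illustrative only; "dst_buf", "src_buf" and "len" are hypothetical):
+ * @code
+ *   // Copy data that will not be read again soon, bypassing the cache.
+ *   rte_memcpy_ex(dst_buf, src_buf, len,
+ *           RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
+ *   // On x86, issue a store barrier after a sequence of non-temporal copies.
+ *   rte_wmb();
+ * @endcode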
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags);
+
+#ifndef RTE_MEMCPY_EX_ARCH_DEFINED
+
+/* Fallback implementation, if no arch-specific implementation is provided. */
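+/* It ignores the flags (all hints) and performs an ordinary temporal copy via memcpy(). */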
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	RTE_SET_USED(flags);
+	memcpy(dst, src, len);
+}
+
+#endif /* RTE_MEMCPY_EX_ARCH_DEFINED */
+
 #endif /* _RTE_MEMCPY_H_ */
diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index d4d7a5cfc8..8ef1260895 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_X86_64_H_
@@ -17,6 +18,10 @@
 #include <rte_vect.h>
 #include <rte_common.h>
 #include <rte_config.h>
+#include <rte_debug.h>
+
+#define RTE_MEMCPY_EX_ARCH_DEFINED
+#include "generic/rte_memcpy.h"
 
 #ifdef __cplusplus
 extern "C" {
@@ -868,6 +873,1204 @@ rte_memcpy(void *dst, const void *src, size_t n)
 		return rte_memcpy_generic(dst, src, n);
 }
 
+/*
+ * Advanced/Non-Temporal Memory Operations.
+ */
+
+/**
+ * @internal
+ * Workaround for _mm_stream_load_si128() missing const in the parameter.
+ */
+__rte_internal
+static __rte_always_inline
+__m128i _mm_stream_load_si128_const(const __m128i * const mem_addr)
+{
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
+#endif
+	return _mm_stream_load_si128(mem_addr);
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic pop
+#endif
+}
+
+/**
+ * @internal
+ * Memory copy from non-temporal source area.
+ *
+ * @note
+ * Performance is optimal when source pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|SRC)<n>A flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be set.
+ *   The RTE_MEMOPS_F_DST_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DST<n>A flags are ignored.
+ *   Must be constant at build time.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nts(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_SRC_NT);
+
+	if (unlikely(len == 0))
+		return;
+
+	/* If source is not 16 byte aligned, then copy first part of data via bounce buffer,
+	 * to achieve 16 byte alignment of source pointer.
+	 * This invalidates the source, destination and length alignment flags, and
+	 * potentially makes the destination pointer unaligned.
+	 *
+	 * Omitted if source is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) {
+		/* Source is not known to be 16 byte aligned, but might be. */
+		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+		const size_t    offset = (uintptr_t)src & 15;
+
+		if (offset) {
+			/* Source is not 16 byte aligned. */
+			char            buffer[16] __rte_aligned(16);
+			/** How many bytes is source away from 16 byte alignment
+			 * (ceiling rounding).
+			 */
+			const size_t    first = 16 - offset;
+
+			xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+			_mm_store_si128((void *)buffer, xmm0);
+
+			/* Test for short length.
+			 *
+			 * Omitted if length is known to be >= 16.
+			 */
+			if (!(__builtin_constant_p(len) && len >= 16) &&
+					unlikely(len <= first)) {
+				/* Short length. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+				return;
+			}
+
+			/* Copy until source pointer is 16 byte aligned. */
+			rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), first);
+			src = RTE_PTR_ADD(src, first);
+			dst = RTE_PTR_ADD(dst, first);
+			len -= first;
+		}
+	}
+
+	/* Source pointer is now 16 byte aligned. */
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, if any, via bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid) and length is known to be 16 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 15)) {
+		char    buffer[16] __rte_aligned(16);
+
+		xmm3 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)buffer, xmm3);
+		rte_mov15_or_less(dst, buffer, len & 15);
+	}
+}
+
+/**
+ * @internal
+ * Memory copy to non-temporal destination area.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ * @note
+ * Performance is optimal when destination pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|DST)<n>A flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DST_NT flag must be set.
+ *   The RTE_MEMOPS_F_SRC<n>A flags are ignored.
+ *   Must be constant at build time.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ntd(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_DST_NT);
+
+	if (unlikely(len == 0))
+		return;
+
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ||
+			len >= 16) {
+		/* Length >= 16 and/or destination is known to be 16 byte aligned. */
+		register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+		/* If destination is not 16 byte aligned, then copy first part of data,
+		 * to achieve 16 byte alignment of destination pointer.
+		 * This invalidates the source, destination and length alignment flags, and
+		 * potentially makes the source pointer unaligned.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)) {
+			/* Destination is not known to be 16 byte aligned, but might be. */
+			/** How many bytes is destination offset from 16 byte alignment
+			 * (floor rounding).
+			 */
+			const size_t    offset = (uintptr_t)dst & 15;
+
+			if (offset) {
+				/* Destination is not 16 byte aligned. */
+				/** How many bytes is destination away from 16 byte alignment
+				 * (ceiling rounding).
+				 */
+				const size_t    first = 16 - offset;
+
+				if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+						(offset & 3) == 0) {
+					/* Destination is (known to be) 4 byte aligned. */
+					int32_t r0, r1, r2;
+
+					/* Copy until destination pointer is 16 byte aligned. */
+					if (first & 8) {
+						memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+						memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+						src = RTE_PTR_ADD(src, 8);
+						dst = RTE_PTR_ADD(dst, 8);
+						len -= 8;
+					}
+					if (first & 4) {
+						memcpy(&r2, src, 4);
+						_mm_stream_si32(dst, r2);
+						src = RTE_PTR_ADD(src, 4);
+						dst = RTE_PTR_ADD(dst, 4);
+						len -= 4;
+					}
+				} else {
+					/* Destination is not 4 byte aligned. */
+					/* Copy until destination pointer is 16 byte aligned. */
+					rte_mov15_or_less(dst, src, first);
+					src = RTE_PTR_ADD(src, first);
+					dst = RTE_PTR_ADD(dst, first);
+					len -= first;
+				}
+			}
+		}
+
+		/* Destination pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(dst, 16));
+
+		/* Copy large portion of data in chunks of 64 byte. */
+		while (len >= 64) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			xmm2 = _mm_loadu_si128(RTE_PTR_ADD(src, 2 * 16));
+			xmm3 = _mm_loadu_si128(RTE_PTR_ADD(src, 3 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+			src = RTE_PTR_ADD(src, 64);
+			dst = RTE_PTR_ADD(dst, 64);
+			len -= 64;
+		}
+
+		/* Copy following 32 and 16 byte portions of data.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned (so the alignment
+		 * flags are still valid)
+		 * and length is known to be respectively 64 or 32 byte aligned.
+		 */
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+				(len & 32)) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			src = RTE_PTR_ADD(src, 32);
+			dst = RTE_PTR_ADD(dst, 32);
+		}
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+				(len & 16)) {
+			xmm2 = _mm_loadu_si128(src);
+			_mm_stream_si128(dst, xmm2);
+			src = RTE_PTR_ADD(src, 16);
+			dst = RTE_PTR_ADD(dst, 16);
+		}
+	} else {
+		/* Length <= 15, and
+		 * destination is not known to be 16 byte aligned (but might be).
+		 */
+		/* If destination is not 4 byte aligned, then
+		 * use normal copy and return.
+		 *
+		 * Omitted if destination is known to be 4 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) &&
+				!rte_is_aligned(dst, 4)) {
+			/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+			rte_mov15_or_less(dst, src, len);
+			return;
+		}
+		/* Destination is (known to be) 4 byte aligned. Proceed. */
+	}
+
+	/* Destination pointer is now 4 byte (or 16 byte) aligned. */
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+
+	/* Copy following 8 and 4 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 16 or 8 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 8)) {
+		int32_t r0, r1;
+
+		memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+		memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+		src = RTE_PTR_ADD(src, 8);
+		dst = RTE_PTR_ADD(dst, 8);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A)) &&
+			(len & 4)) {
+		int32_t r2;
+
+		memcpy(&r2, src, 4);
+		_mm_stream_si32(dst, r2);
+		src = RTE_PTR_ADD(src, 4);
+		dst = RTE_PTR_ADD(dst, 4);
+	}
+
+	/* Copy remaining 2 and 1 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 4 and 2 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A)) &&
+			(len & 2)) {
+		int16_t r3;
+
+		memcpy(&r3, src, 2);
+		*(int16_t *)dst = r3;
+		src = RTE_PTR_ADD(src, 2);
+		dst = RTE_PTR_ADD(dst, 2);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A)) &&
+			(len & 1))
+		*(char *)dst = *(const char *)src;
+}
+
+/**
+ * @internal
+ * Non-temporal memory copy of 15 byte or less
+ * from 16 byte aligned source via bounce buffer.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of remaining bytes to copy.
+ *   Only the 4 least significant bits of this parameter are used (i.e. len modulo 16).
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_15_or_less_s16a(void *__rte_restrict dst,
+		const void *__rte_restrict src, size_t len, const uint64_t flags)
+{
+	int32_t             buffer[4] __rte_aligned(16);
+	register __m128i    xmm0;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if ((len & 15) == 0)
+		return;
+
+	/* Non-temporal load into bounce buffer. */
+	xmm0 = _mm_stream_load_si128_const(src);
+	_mm_store_si128((void *)buffer, xmm0);
+
+	/* Store from bounce buffer. */
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+			rte_is_aligned(dst, 4)) {
+		/* Destination is (known to be) 4 byte aligned. */
+		src = (const void *)buffer;
+		if (len & 8) {
+			if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) {
+				/* Destination is known to be 8 byte aligned. */
+				_mm_stream_si64(dst, *(const int64_t *)src);
+			} else {
+				_mm_stream_si32(RTE_PTR_ADD(dst, 0), buffer[0]);
+				_mm_stream_si32(RTE_PTR_ADD(dst, 4), buffer[1]);
+			}
+			src = RTE_PTR_ADD(src, 8);
+			dst = RTE_PTR_ADD(dst, 8);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A) &&
+				(len & 4)) {
+			_mm_stream_si32(dst, *(const int32_t *)src);
+			src = RTE_PTR_ADD(src, 4);
+			dst = RTE_PTR_ADD(dst, 4);
+		}
+
+		/* Non-temporal store is unavailable for the remaining 3 byte or less. */
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) &&
+				(len & 2)) {
+			*(int16_t *)dst = *(const int16_t *)src;
+			src = RTE_PTR_ADD(src, 2);
+			dst = RTE_PTR_ADD(dst, 2);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A) &&
+				(len & 1)) {
+			*(char *)dst = *(const char *)src;
+		}
+	} else {
+		/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+		rte_mov15_or_less(dst, (const void *)buffer, len & 15);
+	}
+}
+
+/**
+ * @internal
+ * 16 byte aligned addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 16 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d16s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 16));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_stream_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ?
+				(flags & RTE_MEMOPS_F_DSTA_MASK) : RTE_MEMOPS_F_DST16A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				(flags & RTE_MEMOPS_F_SRCA_MASK) : RTE_MEMOPS_F_SRC16A));
+}
+
+/**
+ * @internal
+ * 8/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 8 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d8s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int64_t             buffer[8] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 8));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 2], xmm3);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 4 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 5 * 8), buffer[5]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 6 * 8), buffer[6]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 7 * 8), buffer[7]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[5]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) ?
+				(flags & RTE_MEMOPS_F_DSTA_MASK) : RTE_MEMOPS_F_DST8A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				(flags & RTE_MEMOPS_F_SRCA_MASK) : RTE_MEMOPS_F_SRC16A));
+}
+
+/**
+ * @internal
+ * 4/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 4 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int32_t             buffer[16] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 4], xmm3);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  0 * 4), buffer[0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  1 * 4), buffer[1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  2 * 4), buffer[2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  3 * 4), buffer[3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  4 * 4), buffer[4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  5 * 4), buffer[5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  6 * 4), buffer[6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  7 * 4), buffer[7]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  8 * 4), buffer[8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  9 * 4), buffer[9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 10 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 11 * 4), buffer[11]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 12 * 4), buffer[12]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 13 * 4), buffer[13]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 14 * 4), buffer[14]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 15 * 4), buffer[15]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 4 * 4), buffer[4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 5 * 4), buffer[5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 6 * 4), buffer[6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 7 * 4), buffer[7]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[11]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ?
+				(flags & RTE_MEMOPS_F_DSTA_MASK) : RTE_MEMOPS_F_DST4A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				(flags & RTE_MEMOPS_F_SRCA_MASK) : RTE_MEMOPS_F_SRC16A));
+}
+
+/**
+ * @internal
+ * 4 byte aligned addresses (non-temporal) memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the (non-temporal) destination memory area.
+ *   Must be 4 byte aligned if using non-temporal store.
+ * @param src
+ *   Pointer to the (non-temporal) source memory area.
+ *   Must be 4 byte aligned if using non-temporal load.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s4a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+	const size_t    offset = (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A ?
+			0 : (uintptr_t)src & 15;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 4));
+
+	if (unlikely(len == 0))
+		return;
+
+	if (offset == 0) {
+		/* Source is 16 byte aligned. */
+		/* Copy everything, using upgraded source alignment flags. */
+		rte_memcpy_nt_d4s16a(dst, src, len,
+				(flags & ~RTE_MEMOPS_F_SRCA_MASK) | RTE_MEMOPS_F_SRC16A);
+	} else {
+		/* Source is not 16 byte aligned, so make it 16 byte aligned. */
+		int32_t             buffer[4] __rte_aligned(16);
+		const size_t        first = 16 - offset;
+		register __m128i    xmm0;
+
+		/* First, copy first part of data in chunks of 4 byte,
+		 * to achieve 16 byte alignment of source.
+		 * This invalidates the source, destination and length alignment flags, and
+		 * may change whether the destination pointer is 16 byte aligned.
+		 */
+
+		/** Copy from 16 byte aligned source pointer (floor rounding). */
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+		_mm_store_si128((void *)buffer, xmm0);
+
+		if (unlikely(len + offset <= 16)) {
+			/* Short length. */
+			if (((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ||
+					(len & 3) == 0) {
+				/* Length is 4 byte aligned. */
+				switch (len) {
+				case 1 * 4:
+					/* Offset can be 1 * 4, 2 * 4 or 3 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
+							buffer[offset / 4]);
+					break;
+				case 2 * 4:
+					/* Offset can be 1 * 4 or 2 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
+							buffer[offset / 4]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4),
+							buffer[offset / 4 + 1]);
+					break;
+				case 3 * 4:
+					/* Offset can only be 1 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+					break;
+				}
+			} else {
+				/* Length is not 4 byte aligned. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+			}
+			return;
+		}
+
+		switch (first) {
+		case 1 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
+			break;
+		case 2 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
+			break;
+		case 3 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+			break;
+		}
+
+		src = RTE_PTR_ADD(src, first);
+		dst = RTE_PTR_ADD(dst, first);
+		len -= first;
+
+		/* Source pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(src, 16));
+
+		/* Then, copy the rest, using corrected alignment flags. */
+		if (rte_is_aligned(dst, 16))
+			rte_memcpy_nt_d16s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+		else if (rte_is_aligned(dst, 8))
+			rte_memcpy_nt_d8s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST8A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+		else
+			rte_memcpy_nt_d4s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST4A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+	}
+}
+
+#ifndef RTE_MEMCPY_NT_BUFSIZE
+
+#include <lib/mbuf/rte_mbuf_core.h>
+
+/** Bounce buffer size for non-temporal memcpy.
+ *
+ * Must be 2^N and >= 128.
+ * The actual buffer will be slightly larger, due to added padding.
+ * The default is chosen to be able to handle a non-segmented packet.
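+ * It can be overridden by defining RTE_MEMCPY_NT_BUFSIZE before this file is included.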
+ */
+#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM
+
+#endif  /* RTE_MEMCPY_NT_BUFSIZE */
+
+/**
+ * @internal
+ * Non-temporal memory copy via bounce buffer.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_buf(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** Cache line aligned bounce buffer with preceding and trailing padding.
+	 *
+	 * The preceding padding is one cache line, so the data area itself
+	 * is cache line aligned.
+	 * The trailing padding is 16 bytes, leaving room for the trailing bytes
+	 * of a 16 byte store operation.
+	 */
+	char			buffer[RTE_CACHE_LINE_SIZE + RTE_MEMCPY_NT_BUFSIZE +  16]
+				__rte_cache_aligned;
+	/** Pointer to bounce buffer's aligned data area. */
+	char		* const buf0 = &buffer[RTE_CACHE_LINE_SIZE];
+	void		       *buf;
+	/** Number of bytes to copy from source, incl. any extra preceding bytes. */
+	size_t			srclen;
+	register __m128i	xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Step 1:
+	 * Copy data from the source to the bounce buffer's aligned data area,
+	 * using aligned non-temporal load from the source,
+	 * and unaligned store in the bounce buffer.
+	 *
+	 * If the source is unaligned, the additional bytes preceding the data will be copied
+	 * to the padding area preceding the bounce buffer's aligned data area.
+	 * Similarly, if the source data ends at an unaligned address, the additional bytes
+	 * trailing the data will be copied to the padding area trailing the bounce buffer's
+	 * aligned data area.
+	 */
+
+	/* Adjust for extra preceding bytes, unless source is known to be 16 byte aligned. */
+	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) {
+		buf = buf0;
+		srclen = len;
+	} else {
+		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+		const size_t offset = (uintptr_t)src & 15;
+
+		buf = RTE_PTR_SUB(buf0, offset);
+		src = RTE_PTR_SUB(src, offset);
+		srclen = len + offset;
+	}
+
+	/* Copy large portion of data from source to bounce buffer in chunks of 64 byte. */
+	while (srclen >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		buf = RTE_PTR_ADD(buf, 64);
+		srclen -= 64;
+	}
+
+	/* Copy remaining 32 and 16 byte portions of data from source to bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the length alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(srclen & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		buf = RTE_PTR_ADD(buf, 32);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(srclen & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(buf, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		buf = RTE_PTR_ADD(buf, 16);
+	}
+	/* Copy any trailing bytes of data from source to bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the length alignment
+	 * flags are still valid)
+	 * and length is known to be 16 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(srclen & 15)) {
+		xmm3 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(buf, xmm3);
+	}
+
+	/* Step 2:
+	 * Copy from the aligned bounce buffer to the non-temporal destination.
+	 */
+	rte_memcpy_ntd(dst, buf0, len,
+			(flags & ~(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRCA_MASK)) |
+			(RTE_CACHE_LINE_SIZE << RTE_MEMOPS_F_SRCA_SHIFT));
+}
+
+/**
+ * @internal
+ * Non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @note
+ * If the destination and/or length is unaligned, some copied bytes will be
+ * stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_internal
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_generic(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+
+	while (len > RTE_MEMCPY_NT_BUFSIZE) {
+		rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE,
+				(flags & ~RTE_MEMOPS_F_LENA_MASK) | RTE_MEMOPS_F_LEN128A);
+		dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
+		src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
+		len -= RTE_MEMCPY_NT_BUFSIZE;
+	}
+	rte_memcpy_nt_buf(dst, src, len, flags);
+}
+
+/* Implementation. Refer to function declaration for documentation. */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	if ((flags & (RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) ==
+			(RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) {
+		/* Copy between non-temporal source and destination. */
+		if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d16s16a(dst, src, len, flags);
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d8s16a(dst, src, len, flags);
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d4s16a(dst, src, len, flags);
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
+			rte_memcpy_nt_d4s4a(dst, src, len, flags);
+		else if (len <= RTE_MEMCPY_NT_BUFSIZE)
+			rte_memcpy_nt_buf(dst, src, len, flags);
+		else
+			rte_memcpy_nt_generic(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_SRC_NT) {
+		/* Copy from non-temporal source. */
+		rte_memcpy_nts(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_DST_NT) {
+		/* Copy to non-temporal destination. */
+		rte_memcpy_ntd(dst, src, len, flags);
+	} else
+		rte_memcpy(dst, src, len);
+}
+
 #undef ALIGNMENT_MASK
 
 #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index a2307cebe6..aa96fb4cc8 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -660,6 +660,83 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	return mc;
 }
 
+/* Create a deep copy of mbuf, using non-temporal memory access */
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		 uint32_t off, uint32_t len, const uint64_t flags)
+{
+	const struct rte_mbuf *seg = m;
+	struct rte_mbuf *mc, *m_last, **prev;
+
+	/* garbage in check */
+	__rte_mbuf_sanity_check(m, 1);
+
+	/* check for request to copy at offset past end of mbuf */
+	if (unlikely(off >= m->pkt_len))
+		return NULL;
+
+	mc = rte_pktmbuf_alloc(mp);
+	if (unlikely(mc == NULL))
+		return NULL;
+
+	/* truncate requested length to available data */
+	if (len > m->pkt_len - off)
+		len = m->pkt_len - off;
+
+	__rte_pktmbuf_copy_hdr(mc, m);
+
+	/* copied mbuf is not indirect or external */
+	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
+
+	prev = &mc->next;
+	m_last = mc;
+	while (len > 0) {
+		uint32_t copy_len;
+
+		/* skip leading mbuf segments */
+		while (off >= seg->data_len) {
+			off -= seg->data_len;
+			seg = seg->next;
+		}
+
+		/* current buffer is full, chain a new one */
+		if (rte_pktmbuf_tailroom(m_last) == 0) {
+			m_last = rte_pktmbuf_alloc(mp);
+			if (unlikely(m_last == NULL)) {
+				rte_pktmbuf_free(mc);
+				return NULL;
+			}
+			++mc->nb_segs;
+			*prev = m_last;
+			prev = &m_last->next;
+		}
+
+		/*
+		 * copy the min of data in input segment (seg)
+		 * vs space available in output (m_last)
+		 */
+		copy_len = RTE_MIN(seg->data_len - off, len);
+		if (copy_len > rte_pktmbuf_tailroom(m_last))
+			copy_len = rte_pktmbuf_tailroom(m_last);
+
+		/* append from seg to m_last */
+		rte_memcpy_ex(rte_pktmbuf_mtod_offset(m_last, char *,
+						   m_last->data_len),
+			   rte_pktmbuf_mtod_offset(seg, char *, off),
+			   copy_len, flags);
+
+		/* update offsets and lengths */
+		m_last->data_len += copy_len;
+		mc->pkt_len += copy_len;
+		off += copy_len;
+		len -= copy_len;
+	}
+
+	/* garbage out check */
+	__rte_mbuf_sanity_check(mc, 1);
+	return mc;
+}
+
 /* dump a mbuf on console */
 void
 rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index b6e23d98ce..030df396a3 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1443,6 +1443,38 @@ struct rte_mbuf *
 rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t offset, uint32_t length);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Create a full copy of a given packet mbuf,
+ * using non-temporal memory access as specified by flags.
+ *
+ * Copies all the data from a given packet mbuf to a newly allocated
+ * set of mbufs. The private data are not copied.
+ *
+ * @param m
+ *   The packet mbuf to be copied.
+ * @param mp
+ *   The mempool from which the "clone" mbufs are allocated.
+ * @param offset
+ *   The number of bytes to skip before copying.
+ *   If the mbuf does not have that many bytes, it is an error
+ *   and NULL is returned.
+ * @param length
+ *   The upper limit on bytes to copy.  Passing UINT32_MAX
+ *   means all data (after offset).
+ * @param flags
+ *   Non-temporal memory access hints for rte_memcpy_ex.
+ * @return
+ *   - The pointer to the new "clone" mbuf on success.
+ *   - NULL if allocation fails.
+ */
+__rte_experimental
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		    uint32_t offset, uint32_t length, const uint64_t flags);
+
 /**
  * Adds given value to the refcnt of all packet mbuf segments.
  *
diff --git a/lib/mbuf/version.map b/lib/mbuf/version.map
index ed486ed14e..b583364ad4 100644
--- a/lib/mbuf/version.map
+++ b/lib/mbuf/version.map
@@ -47,5 +47,6 @@ EXPERIMENTAL {
 	global:
 
 	rte_pktmbuf_pool_create_extbuf;
+	rte_pktmbuf_copy_ex;
 
 };
diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
index af2b814251..ae871c4865 100644
--- a/lib/pcapng/rte_pcapng.c
+++ b/lib/pcapng/rte_pcapng.c
@@ -466,7 +466,8 @@ rte_pcapng_copy(uint16_t port_id, uint32_t queue,
 	orig_len = rte_pktmbuf_pkt_len(md);
 
 	/* Take snapshot of the data */
-	mc = rte_pktmbuf_copy(md, mp, 0, length);
+	mc = rte_pktmbuf_copy_ex(md, mp, 0, length,
+				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 	if (unlikely(mc == NULL))
 		return NULL;
 
diff --git a/lib/pdump/rte_pdump.c b/lib/pdump/rte_pdump.c
index 98dcbc037b..6e61c75407 100644
--- a/lib/pdump/rte_pdump.c
+++ b/lib/pdump/rte_pdump.c
@@ -124,7 +124,8 @@ pdump_copy(uint16_t port_id, uint16_t queue,
 					    pkts[i], mp, cbs->snaplen,
 					    ts, direction);
 		else
-			p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
+			p = rte_pktmbuf_copy_ex(pkts[i], mp, 0, cbs->snaplen,
+						RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 
 		if (unlikely(p == NULL))
 			__atomic_fetch_add(&stats->nombuf, 1, __ATOMIC_RELAXED);
@@ -134,6 +135,9 @@ pdump_copy(uint16_t port_id, uint16_t queue,
 
 	__atomic_fetch_add(&stats->accepted, d_pkts, __ATOMIC_RELAXED);
 
+	/* Flush the non-temporal stores of the packet copies. */
+	rte_wmb();
+
 	ring_enq = rte_ring_enqueue_burst(ring, (void *)dup_bufs, d_pkts, NULL);
 	if (unlikely(ring_enq < d_pkts)) {
 		unsigned int drops = d_pkts - ring_enq;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3] eal: non-temporal memcpy
  2022-08-19 13:58 [RFC v3] non-temporal memcpy Morten Brørup
  2022-10-06 20:34 ` [PATCH] eal: " Morten Brørup
  2022-10-07 10:19 ` [PATCH v2] " Morten Brørup
@ 2022-10-09 15:35 ` Morten Brørup
  2022-10-10  6:46 ` [PATCH v4] " Morten Brørup
  3 siblings, 0 replies; 17+ messages in thread
From: Morten Brørup @ 2022-10-09 15:35 UTC (permalink / raw)
  To: hofors, bruce.richardson, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, kda, drc, dev, Morten Brørup

This patch provides a function for memory copy using non-temporal store,
load or both, controlled by flags passed to the function.

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of the function is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput can be improved by further optimization, I do not
have time to do it now.

The functional tests and performance tests for memory copy have been
expanded to include non-temporal copying.

A non-temporal version of the mbuf library's function to create a full
copy of a given packet mbuf is provided.

The packet capture and packet dump libraries have been updated to use
non-temporal memory copy of the packets.
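
As an example, a capture path could take a non-temporal snapshot of a packet
like this. This is only a sketch; capture_one, capture_pool and snaplen are
hypothetical names:

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_memcpy.h> /* RTE_MEMOPS_F_* flags */
#include <rte_atomic.h> /* rte_wmb() */

static struct rte_mbuf *
capture_one(const struct rte_mbuf *pkt, struct rte_mempool *capture_pool,
		uint32_t snaplen)
{
	struct rte_mbuf *copy;

	/* Deep copy the first snaplen bytes using non-temporal load and store. */
	copy = rte_pktmbuf_copy_ex(pkt, capture_pool, 0, snaplen,
			RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);

	/* Flush the non-temporal stores before handing the copy to another lcore. */
	rte_wmb();

	return copy; /* NULL if allocation failed */
}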

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions must be 16 byte aligned [1], and
non-temporal store instructions must be 4, 8 or 16 byte aligned [2].

ARM non-temporal load and store instructions seem to require 4 byte
alignment [3].

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_si
[3] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-
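
To illustrate how the alignment hint flags relate to these requirements, here
is a sketch of a copy where the caller knows that both pointers and the length
are 16 byte aligned. The function name copy_aligned_block is hypothetical:

#include <stddef.h>
#include <rte_memcpy.h>

/* The alignment hints allow the implementation to skip the unaligned code
 * paths at build time; the flags must be build-time constants.
 */
static void
copy_aligned_block(void *dst, const void *src, size_t len)
{
	rte_memcpy_ex(dst, src, len,
			RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT |
			RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_DST16A |
			RTE_MEMOPS_F_LEN16A);
	/* As with other non-temporal stores on x86, follow a sequence of such
	 * copies with rte_wmb().
	 */
}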

This patch is a major rewrite from the RFC v3, so no version log comparing
to the RFC is provided.

v3
* _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
  use it on 64-bit x86 architecture.
* CLANG warns that _mm_stream_load_si128_const() and
  rte_memcpy_nt_15_or_less_s16a() are not public,
  so remove __rte_internal from them. It also affects the documentation
  for the functions, so the fix can't be limited to CLANG.
* Use __rte_experimental instead of __rte_internal.
* Replace <n> with nnn in the function documentation, so it doesn't look
  like an HTML tag.
* Slightly modify the workaround for _mm_stream_load_si128() missing const
  in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
  #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
  #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
* Fixed one coding style issue missed in v2.

v2
* The last 16 byte block of data, incl. any trailing bytes, was not
  copied from the source memory area in rte_memcpy_nt_buf().
* Fix many coding style issues.
* Add some missing header files.
* Fix build time warning for non-x86 architectures by using a different
  method to mark the flags parameter unused.
* CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
  so omit it when using CLANG.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_memcpy.c               |   65 +-
 app/test/test_memcpy_perf.c          |  187 ++--
 lib/eal/include/generic/rte_memcpy.h |  119 +++
 lib/eal/x86/include/rte_memcpy.h     | 1233 ++++++++++++++++++++++++++
 lib/mbuf/rte_mbuf.c                  |   77 ++
 lib/mbuf/rte_mbuf.h                  |   32 +
 lib/mbuf/version.map                 |    1 +
 lib/pcapng/rte_pcapng.c              |    3 +-
 lib/pdump/rte_pdump.c                |    6 +-
 9 files changed, 1632 insertions(+), 91 deletions(-)

diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
index 1ab86f4967..12410ce413 100644
--- a/app/test/test_memcpy.c
+++ b/app/test/test_memcpy.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -36,6 +37,19 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 /* Data is aligned on this many bytes (power of 2) */
 #define ALIGNMENT_UNIT          32
 
+static const uint64_t nt_mode_flags[4] = {
+	0,
+	RTE_MEMOPS_F_SRC_NT,
+	RTE_MEMOPS_F_DST_NT,
+	RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT
+};
+static const char * const nt_mode_str[4] = {
+	"none",
+	"src",
+	"dst",
+	"src+dst"
+};
+
 
 /*
  * Create two buffers, and initialise one with random values. These are copied
@@ -44,12 +58,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
  * changed.
  */
 static int
-test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
+test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size, unsigned int nt_mode)
 {
 	unsigned int i;
 	uint8_t dest[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	uint8_t src[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	void * ret;
+	const uint64_t flags = nt_mode_flags[nt_mode];
 
 	/* Setup buffers */
 	for (i = 0; i < SMALL_BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
@@ -58,18 +73,23 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	}
 
 	/* Do the copy */
-	ret = rte_memcpy(dest + off_dst, src + off_src, size);
-	if (ret != (dest + off_dst)) {
-		printf("rte_memcpy() returned %p, not %p\n",
-		       ret, dest + off_dst);
+	if (nt_mode) {
+		rte_memcpy_ex(dest + off_dst, src + off_src, size, flags);
+	} else {
+		ret = rte_memcpy(dest + off_dst, src + off_src, size);
+		if (ret != (dest + off_dst)) {
+			printf("rte_memcpy() returned %p, not %p\n",
+			       ret, dest + off_dst);
+		}
 	}
 
 	/* Check nothing before offset is affected */
 	for (i = 0; i < off_dst; i++) {
 		if (dest[i] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[modified before start of dst].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -77,9 +97,11 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check everything was copied */
 	for (i = 0; i < size; i++) {
 		if (dest[i + off_dst] != src[i + off_src]) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
-			       "[didn't copy byte %u].\n",
-			       (unsigned)size, off_src, off_dst, i);
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
+			       "[didn't copy byte %u: 0x%02x!=0x%02x].\n",
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode], i,
+			       dest[i + off_dst], src[i + off_src]);
 			return -1;
 		}
 	}
@@ -87,9 +109,10 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check nothing after copy was affected */
 	for (i = size; i < SMALL_BUFFER_SIZE; i++) {
 		if (dest[i + off_dst] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[copied too many].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -102,16 +125,18 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 static int
 func_test(void)
 {
-	unsigned int off_src, off_dst, i;
+	unsigned int off_src, off_dst, i, nt_mode;
 	int ret;
 
-	for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
-		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
-			for (i = 0; i < RTE_DIM(buf_sizes); i++) {
-				ret = test_single_memcpy(off_src, off_dst,
-				                         buf_sizes[i]);
-				if (ret != 0)
-					return -1;
+	for (nt_mode = 0; nt_mode < 4; nt_mode++) {
+		for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
+			for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
+				for (i = 0; i < RTE_DIM(buf_sizes); i++) {
+					ret = test_single_memcpy(off_src, off_dst,
+								 buf_sizes[i], nt_mode);
+					if (ret != 0)
+						return -1;
+				}
 			}
 		}
 	}
diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 3727c160e6..6bb52cba88 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -15,6 +16,7 @@
 #include <rte_malloc.h>
 
 #include <rte_memcpy.h>
+#include <rte_atomic.h>
 
 #include "test.h"
 
@@ -27,9 +29,9 @@
 /* List of buffer sizes to test */
 #if TEST_VALUE_RANGE == 0
 static size_t buf_sizes[] = {
-	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
-	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
-	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
+	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 40, 48, 60, 63, 64, 65, 80, 92, 124,
+	127, 128, 129, 140, 152, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447,
+	448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
 	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
 };
 /* MUST be as large as largest packet size above */
@@ -72,7 +74,7 @@ static uint8_t *small_buf_read, *small_buf_write;
 static int
 init_buffers(void)
 {
-	unsigned i;
+	unsigned int i;
 
 	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (large_buf_read == NULL)
@@ -151,7 +153,7 @@ static void
 do_uncached_write(uint8_t *dst, int is_dst_cached,
 				  const uint8_t *src, int is_src_cached, size_t size)
 {
-	unsigned i, j;
+	unsigned int i, j;
 	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
 
 	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
@@ -167,66 +169,112 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
  * Run a single memcpy performance test. This is a macro to ensure that if
  * the "size" parameter is a constant it won't be converted to a variable.
  */
-#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
-                         src, is_src_cached, src_uoffset, size)             \
-do {                                                                        \
-    unsigned int iter, t;                                                   \
-    size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
-    uint64_t start_time, total_time = 0;                                    \
-    uint64_t total_time2 = 0;                                               \
-    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
-        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
-                         src_addrs, is_src_cached, src_uoffset);            \
-        start_time = rte_rdtsc();                                           \
-        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
-            rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
-        total_time += rte_rdtsc() - start_time;                             \
-    }                                                                       \
-    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
-        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
-                         src_addrs, is_src_cached, src_uoffset);            \
-        start_time = rte_rdtsc();                                           \
-        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
-            memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
-        total_time2 += rte_rdtsc() - start_time;                            \
-    }                                                                       \
-    printf("%3.0f -", (double)total_time  / TEST_ITERATIONS);                 \
-    printf("%3.0f",   (double)total_time2 / TEST_ITERATIONS);                 \
-    printf("(%6.2f%%) ", ((double)total_time - total_time2)*100/total_time2); \
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,					  \
+			 src, is_src_cached, src_uoffset, size)					  \
+do {												  \
+	unsigned int iter, t;									  \
+	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];				  \
+	uint64_t start_time;									  \
+	uint64_t total_time_rte = 0, total_time_std = 0;					  \
+	uint64_t total_time_ntd = 0, total_time_nts = 0, total_time_nt = 0;			  \
+	const uint64_t flags = ((dst_uoffset == 0) ?						  \
+				(ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT) : 0) |		  \
+			       ((src_uoffset == 0) ?						  \
+				(ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) : 0);		  \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
+				 src_addrs, is_src_cached, src_uoffset);			  \
+		start_time = rte_rdtsc();							  \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
+			rte_memcpy(dst + dst_addrs[t], src + src_addrs[t], size);		  \
+		total_time_rte += rte_rdtsc() - start_time;					  \
+	}											  \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
+				 src_addrs, is_src_cached, src_uoffset);			  \
+		start_time = rte_rdtsc();							  \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
+			memcpy(dst + dst_addrs[t], src + src_addrs[t], size);			  \
+		total_time_std += rte_rdtsc() - start_time;					  \
+	}											  \
+	if (!(is_dst_cached && is_src_cached)) {						  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_DST_NT);			  \
+			total_time_ntd += rte_rdtsc() - start_time;				  \
+		}										  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_SRC_NT);			  \
+			total_time_nts += rte_rdtsc() - start_time;				  \
+		}										  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT); \
+			total_time_nt += rte_rdtsc() - start_time;				  \
+		}										  \
+	}											  \
+	printf(" %4.0f-", (double)total_time_rte / TEST_ITERATIONS);				  \
+	printf("%4.0f",   (double)total_time_std / TEST_ITERATIONS);				  \
+	printf("(%+4.0f%%)", ((double)total_time_rte - total_time_std) * 100 / total_time_std);   \
+	if (!(is_dst_cached && is_src_cached)) {						  \
+		printf(" %4.0f", (double)total_time_ntd / TEST_ITERATIONS);			  \
+		printf(" %4.0f", (double)total_time_nts / TEST_ITERATIONS);			  \
+		printf(" %4.0f", (double)total_time_nt / TEST_ITERATIONS);			  \
+		if (total_time_nt / total_time_std > 9)						  \
+			printf("(*%4.1f)", (double)total_time_nt / total_time_std);		  \
+		else										  \
+			printf("(%+4.0f%%)",							  \
+			       ((double)total_time_nt - total_time_std) * 100 / total_time_std);  \
+	}											  \
 } while (0)
 
 /* Run aligned memcpy tests for each cached/uncached permutation */
-#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
-do {                                                                     \
-    if (__builtin_constant_p(n))                                         \
-        printf("\nC%6u", (unsigned)n);                                   \
-    else                                                                 \
-        printf("\n%7u", (unsigned)n);                                    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
+#define ALL_PERF_TESTS_FOR_SIZE(n)						\
+do {										\
+	if (__builtin_constant_p(n))						\
+		printf("\nC%6u", (unsigned int)n);				\
+	else									\
+		printf("\n%7u", (unsigned int)n);				\
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);	\
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);	\
 } while (0)
 
 /* Run unaligned memcpy tests for each cached/uncached permutation */
-#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
-do {                                                                     \
-    if (__builtin_constant_p(n))                                         \
-        printf("\nC%6u", (unsigned)n);                                   \
-    else                                                                 \
-        printf("\n%7u", (unsigned)n);                                    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)					\
+do {										\
+	if (__builtin_constant_p(n))						\
+		printf("\nC%6u", (unsigned int)n);				\
+	else									\
+		printf("\n%7u", (unsigned int)n);				\
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);	\
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);	\
 } while (0)
 
 /* Run memcpy tests for constant length */
-#define ALL_PERF_TEST_FOR_CONSTANT                                      \
-do {                                                                    \
-    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
-    TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
-    TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
+#define ALL_PERF_TEST_FOR_CONSTANT						\
+do {										\
+	TEST_CONSTANT(4U); TEST_CONSTANT(6U); TEST_CONSTANT(8U);		\
+	TEST_CONSTANT(16U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);		\
+	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);		\
+	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);	\
+	TEST_CONSTANT(2048U);							\
 } while (0)
 
 /* Run all memcpy tests for aligned constant cases */
@@ -251,7 +299,7 @@ perf_test_constant_unaligned(void)
 static inline void
 perf_test_variable_aligned(void)
 {
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
 		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
 	}
@@ -261,7 +309,7 @@ perf_test_variable_aligned(void)
 static inline void
 perf_test_variable_unaligned(void)
 {
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
 		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
 	}
@@ -282,7 +330,7 @@ perf_test(void)
 
 #if TEST_VALUE_RANGE != 0
 	/* Set up buf_sizes array, if required */
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < TEST_VALUE_RANGE; i++)
 		buf_sizes[i] = i;
 #endif
@@ -290,13 +338,14 @@ perf_test(void)
 	/* See function comment */
 	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
 
-	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
-		   "======= ================= ================= ================= =================\n"
-		   "   Size   Cache to cache     Cache to mem      Mem to cache        Mem to mem\n"
-		   "(bytes)          (ticks)          (ticks)           (ticks)           (ticks)\n"
-		   "------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n** rte_memcpy(RTE)/memcpy(STD)/rte_memcpy_ex(NTD/NTS/NT) - memcpy perf. tests (C = compile-time constant) **\n"
+		   "======= ================ ====================================== ====================================== ======================================\n"
+		   "   Size  Cache to cache               Cache to mem                           Mem to cache                            Mem to mem\n"
+		   "(bytes)         (ticks)                    (ticks)                                (ticks)                               (ticks)\n"
+		   "         RTE- STD(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)\n"
+		   "------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 
-	printf("\n================================= %2dB aligned =================================",
+	printf("\n================================================================ %2dB aligned ===============================================================",
 		ALIGNMENT_UNIT);
 	/* Do aligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
@@ -304,28 +353,28 @@ perf_test(void)
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do aligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_aligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n================================== Unaligned ==================================");
+	printf("\n================================================================= Unaligned =================================================================");
 	/* Do unaligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_variable_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do unaligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n======= ================= ================= ================= =================\n\n");
+	printf("\n======= ================ ====================================== ====================================== ======================================\n\n");
 
 	printf("Test Execution Time (seconds):\n");
 	printf("Aligned variable copy size   = %8.3f\n", time_aligned);
diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
index e7f0f8eaa9..d141acbd3c 100644
--- a/lib/eal/include/generic/rte_memcpy.h
+++ b/lib/eal/include/generic/rte_memcpy.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_H_
@@ -11,6 +12,9 @@
  * Functions for vectorised implementation of memcpy().
  */
 
+#include <rte_common.h>
+#include <rte_compat.h>
+
 /**
  * Copy 16 bytes from one location to another using optimised
  * instructions. The locations should not overlap.
@@ -113,4 +117,119 @@ rte_memcpy(void *dst, const void *src, size_t n);
 
 #endif /* __DOXYGEN__ */
 
+/*
+ * Advanced/Non-Temporal Memory Operations Flags.
+ */
+
+/** Length alignment hint mask. */
+#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
+/** Length alignment hint shift. */
+#define RTE_MEMOPS_F_LENA_SHIFT 0
+/** Hint: Length is 2 byte aligned. */
+#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
+/** Hint: Length is 4 byte aligned. */
+#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
+/** Hint: Length is 8 byte aligned. */
+#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
+/** Hint: Length is 16 byte aligned. */
+#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
+/** Hint: Length is 32 byte aligned. */
+#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
+/** Hint: Length is 64 byte aligned. */
+#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
+/** Hint: Length is 128 byte aligned. */
+#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)
+
+/** Prefer non-temporal access to source memory area.
+ */
+#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
+/** Source address alignment hint mask. */
+#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
+/** Source address alignment hint shift. */
+#define RTE_MEMOPS_F_SRCA_SHIFT 8
+/** Hint: Source address is 2 byte aligned. */
+#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
+/** Hint: Source address is 4 byte aligned. */
+#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
+/** Hint: Source address is 8 byte aligned. */
+#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
+/** Hint: Source address is 16 byte aligned. */
+#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
+/** Hint: Source address is 32 byte aligned. */
+#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
+/** Hint: Source address is 64 byte aligned. */
+#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
+/** Hint: Source address is 128 byte aligned. */
+#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)
+
+/** Prefer non-temporal access to destination memory area.
+ *
+ * On x86 architecture:
+ * Remember to call rte_wmb() after a sequence of copy operations.
+ */
+#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
+/** Destination address alignment hint mask. */
+#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
+/** Destination address alignment hint shift. */
+#define RTE_MEMOPS_F_DSTA_SHIFT 16
+/** Hint: Destination address is 2 byte aligned. */
+#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
+/** Hint: Destination address is 4 byte aligned. */
+#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
+/** Hint: Destination address is 8 byte aligned. */
+#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
+/** Hint: Destination address is 16 byte aligned. */
+#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
+/** Hint: Destination address is 32 byte aligned. */
+#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
+/** Hint: Destination address is 64 byte aligned. */
+#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
+/** Hint: Destination address is 128 byte aligned. */
+#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Advanced/non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)nnnA flags.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags);
+
+#ifndef RTE_MEMCPY_EX_ARCH_DEFINED
+
+/* Fallback implementation, if no arch-specific implementation is provided. */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	RTE_SET_USED(flags);
+	memcpy(dst, src, len);
+}
+
+#endif /* RTE_MEMCPY_EX_ARCH_DEFINED */
+
 #endif /* _RTE_MEMCPY_H_ */
diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index d4d7a5cfc8..ef6a24eeac 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_X86_64_H_
@@ -17,6 +18,10 @@
 #include <rte_vect.h>
 #include <rte_common.h>
 #include <rte_config.h>
+#include <rte_debug.h>
+
+#define RTE_MEMCPY_EX_ARCH_DEFINED
+#include "generic/rte_memcpy.h"
 
 #ifdef __cplusplus
 extern "C" {
@@ -868,6 +873,1234 @@ rte_memcpy(void *dst, const void *src, size_t n)
 		return rte_memcpy_generic(dst, src, n);
 }
 
+/*
+ * Advanced/Non-Temporal Memory Operations.
+ */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Workaround for _mm_stream_load_si128() missing const in the parameter.
+ */
+__rte_experimental
+static __rte_always_inline
+__m128i _mm_stream_load_si128_const(const __m128i *const mem_addr)
+{
+	/* GCC 4.8.5 (in RHEL7) doesn't support the #pragma to ignore "-Wdiscarded-qualifiers".
+	 * So we explicitly type cast mem_addr and use the #pragma to ignore "-Wcast-qual".
+	 */
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#endif
+	return _mm_stream_load_si128((__m128i *)mem_addr);
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic pop
+#endif
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Memory copy from non-temporal source area.
+ *
+ * @note
+ * Performance is optimal when source pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|SRC)nnnA flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be set.
+ *   The RTE_MEMOPS_F_DST_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DSTnnnA flags are ignored.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nts(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_SRC_NT);
+
+	if (unlikely(len == 0))
+		return;
+
+	/* If source is not 16 byte aligned, then copy first part of data via bounce buffer,
+	 * to achieve 16 byte alignment of source pointer.
+	 * This invalidates the source, destination and length alignment flags, and
+	 * potentially makes the destination pointer unaligned.
+	 *
+	 * Omitted if source is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) {
+		/* Source is not known to be 16 byte aligned, but might be. */
+		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+		const size_t    offset = (uintptr_t)src & 15;
+
+		if (offset) {
+			/* Source is not 16 byte aligned. */
+			char            buffer[16] __rte_aligned(16);
+			/** How many bytes is source away from 16 byte alignment
+			 * (ceiling rounding).
+			 */
+			const size_t    first = 16 - offset;
+
+			xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+			_mm_store_si128((void *)buffer, xmm0);
+
+			/* Test for short length.
+			 *
+			 * Omitted if length is known to be >= 16.
+			 */
+			if (!(__builtin_constant_p(len) && len >= 16) &&
+					unlikely(len <= first)) {
+				/* Short length. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+				return;
+			}
+
+			/* Copy until source pointer is 16 byte aligned. */
+			rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), first);
+			src = RTE_PTR_ADD(src, first);
+			dst = RTE_PTR_ADD(dst, first);
+			len -= first;
+		}
+	}
+
+	/* Source pointer is now 16 byte aligned. */
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, if any, via bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid) and length is known to be 16 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 15)) {
+		char    buffer[16] __rte_aligned(16);
+
+		xmm3 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)buffer, xmm3);
+		rte_mov15_or_less(dst, buffer, len & 15);
+	}
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Memory copy to non-temporal destination area.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ * @note
+ * Performance is optimal when destination pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|DST)nnnA flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DST_NT flag must be set.
+ *   The RTE_MEMOPS_F_SRCnnnA flags are ignored.
+ *   Must be constant at build time.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ntd(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_DST_NT);
+
+	if (unlikely(len == 0))
+		return;
+
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ||
+			len >= 16) {
+		/* Length >= 16 and/or destination is known to be 16 byte aligned. */
+		register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+		/* If destination is not 16 byte aligned, then copy first part of data,
+		 * to achieve 16 byte alignment of destination pointer.
+		 * This invalidates the source, destination and length alignment flags, and
+		 * potentially makes the source pointer unaligned.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)) {
+			/* Destination is not known to be 16 byte aligned, but might be. */
+			/** How many bytes is destination offset from 16 byte alignment
+			 * (floor rounding).
+			 */
+			const size_t    offset = (uintptr_t)dst & 15;
+
+			if (offset) {
+				/* Destination is not 16 byte aligned. */
+				/** How many bytes is destination away from 16 byte alignment
+				 * (ceiling rounding).
+				 */
+				const size_t    first = 16 - offset;
+
+				if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+						(offset & 3) == 0) {
+					/* Destination is (known to be) 4 byte aligned. */
+					int32_t r0, r1, r2;
+
+					/* Copy until destination pointer is 16 byte aligned. */
+					if (first & 8) {
+						memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+						memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+						src = RTE_PTR_ADD(src, 8);
+						dst = RTE_PTR_ADD(dst, 8);
+						len -= 8;
+					}
+					if (first & 4) {
+						memcpy(&r2, src, 4);
+						_mm_stream_si32(dst, r2);
+						src = RTE_PTR_ADD(src, 4);
+						dst = RTE_PTR_ADD(dst, 4);
+						len -= 4;
+					}
+				} else {
+					/* Destination is not 4 byte aligned. */
+					/* Copy until destination pointer is 16 byte aligned. */
+					rte_mov15_or_less(dst, src, first);
+					src = RTE_PTR_ADD(src, first);
+					dst = RTE_PTR_ADD(dst, first);
+					len -= first;
+				}
+			}
+		}
+
+		/* Destination pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(dst, 16));
+
+		/* Copy large portion of data in chunks of 64 byte. */
+		while (len >= 64) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			xmm2 = _mm_loadu_si128(RTE_PTR_ADD(src, 2 * 16));
+			xmm3 = _mm_loadu_si128(RTE_PTR_ADD(src, 3 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+			src = RTE_PTR_ADD(src, 64);
+			dst = RTE_PTR_ADD(dst, 64);
+			len -= 64;
+		}
+
+		/* Copy following 32 and 16 byte portions of data.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned (so the alignment
+		 * flags are still valid)
+		 * and length is known to be respectively 64 or 32 byte aligned.
+		 */
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+				(len & 32)) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			src = RTE_PTR_ADD(src, 32);
+			dst = RTE_PTR_ADD(dst, 32);
+		}
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+				(len & 16)) {
+			xmm2 = _mm_loadu_si128(src);
+			_mm_stream_si128(dst, xmm2);
+			src = RTE_PTR_ADD(src, 16);
+			dst = RTE_PTR_ADD(dst, 16);
+		}
+	} else {
+		/* Length <= 15, and
+		 * destination is not known to be 16 byte aligned (but might be).
+		 */
+		/* If destination is not 4 byte aligned, then
+		 * use normal copy and return.
+		 *
+		 * Omitted if destination is known to be 4 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) &&
+				!rte_is_aligned(dst, 4)) {
+			/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+			rte_mov15_or_less(dst, src, len);
+			return;
+		}
+		/* Destination is (known to be) 4 byte aligned. Proceed. */
+	}
+
+	/* Destination pointer is now 4 byte (or 16 byte) aligned. */
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+
+	/* Copy following 8 and 4 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 16 or 8 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 8)) {
+		int32_t r0, r1;
+
+		memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+		memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+		src = RTE_PTR_ADD(src, 8);
+		dst = RTE_PTR_ADD(dst, 8);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A)) &&
+			(len & 4)) {
+		int32_t r2;
+
+		memcpy(&r2, src, 4);
+		_mm_stream_si32(dst, r2);
+		src = RTE_PTR_ADD(src, 4);
+		dst = RTE_PTR_ADD(dst, 4);
+	}
+
+	/* Copy remaining 2 and 1 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 4 or 2 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A)) &&
+			(len & 2)) {
+		int16_t r3;
+
+		memcpy(&r3, src, 2);
+		*(int16_t *)dst = r3;
+		src = RTE_PTR_ADD(src, 2);
+		dst = RTE_PTR_ADD(dst, 2);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A)) &&
+			(len & 1))
+		*(char *)dst = *(const char *)src;
+}
+
+/**
+ * Non-temporal memory copy of 15 byte or less
+ * from 16 byte aligned source via bounce buffer.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Only the 4 least significant bits of this parameter are used;
+ *   they hold the number of remaining bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_15_or_less_s16a(void *__rte_restrict dst,
+		const void *__rte_restrict src, size_t len, const uint64_t flags)
+{
+	int32_t             buffer[4] __rte_aligned(16);
+	register __m128i    xmm0;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if ((len & 15) == 0)
+		return;
+
+	/* Non-temporal load into bounce buffer. */
+	xmm0 = _mm_stream_load_si128_const(src);
+	_mm_store_si128((void *)buffer, xmm0);
+
+	/* Store from bounce buffer. */
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+			rte_is_aligned(dst, 4)) {
+		/* Destination is (known to be) 4 byte aligned. */
+		src = (const void *)buffer;
+		if (len & 8) {
+#ifdef RTE_ARCH_X86_64
+			if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) {
+				/* Destination is known to be 8 byte aligned. */
+				_mm_stream_si64(dst, *(const int64_t *)src);
+			} else {
+#endif /* RTE_ARCH_X86_64 */
+				_mm_stream_si32(RTE_PTR_ADD(dst, 0), buffer[0]);
+				_mm_stream_si32(RTE_PTR_ADD(dst, 4), buffer[1]);
+#ifdef RTE_ARCH_X86_64
+			}
+#endif /* RTE_ARCH_X86_64 */
+			src = RTE_PTR_ADD(src, 8);
+			dst = RTE_PTR_ADD(dst, 8);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A) &&
+				(len & 4)) {
+			_mm_stream_si32(dst, *(const int32_t *)src);
+			src = RTE_PTR_ADD(src, 4);
+			dst = RTE_PTR_ADD(dst, 4);
+		}
+
+		/* Non-temporal store is unavailable for the remaining 3 byte or less. */
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) &&
+				(len & 2)) {
+			*(int16_t *)dst = *(const int16_t *)src;
+			src = RTE_PTR_ADD(src, 2);
+			dst = RTE_PTR_ADD(dst, 2);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A) &&
+				(len & 1)) {
+			*(char *)dst = *(const char *)src;
+		}
+	} else {
+		/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+		rte_mov15_or_less(dst, (const void *)buffer, len & 15);
+	}
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 16 byte aligned addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 16 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d16s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 16));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_stream_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ?
+				flags : RTE_MEMOPS_F_DST16A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+
+#ifdef RTE_ARCH_X86_64
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 8/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 8 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d8s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int64_t             buffer[8] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 8));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 2], xmm3);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 4 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 5 * 8), buffer[5]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 6 * 8), buffer[6]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 7 * 8), buffer[7]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[5]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) ?
+				flags : RTE_MEMOPS_F_DST8A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+#endif /* RTE_ARCH_X86_64 */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 4/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 4 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int32_t             buffer[16] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 4], xmm3);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  0 * 4), buffer[0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  1 * 4), buffer[1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  2 * 4), buffer[2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  3 * 4), buffer[3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  4 * 4), buffer[4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  5 * 4), buffer[5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  6 * 4), buffer[6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  7 * 4), buffer[7]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  8 * 4), buffer[8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  9 * 4), buffer[9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 10 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 11 * 4), buffer[11]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 12 * 4), buffer[12]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 13 * 4), buffer[13]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 14 * 4), buffer[14]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 15 * 4), buffer[15]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 4 * 4), buffer[4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 5 * 4), buffer[5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 6 * 4), buffer[6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 7 * 4), buffer[7]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[11]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ?
+				flags : RTE_MEMOPS_F_DST4A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 4 byte aligned addresses (non-temporal) memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the (non-temporal) destination memory area.
+ *   Must be 4 byte aligned if using non-temporal store.
+ * @param src
+ *   Pointer to the (non-temporal) source memory area.
+ *   Must be 4 byte aligned if using non-temporal load.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s4a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** How many bytes the source is offset from 16 byte alignment (floor rounding). */
+	const size_t    offset = (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A ?
+			0 : (uintptr_t)src & 15;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 4));
+
+	if (unlikely(len == 0))
+		return;
+
+	if (offset == 0) {
+		/* Source is 16 byte aligned. */
+		/* Copy everything, using upgraded source alignment flags. */
+		rte_memcpy_nt_d4s16a(dst, src, len,
+				(flags & ~RTE_MEMOPS_F_SRCA_MASK) | RTE_MEMOPS_F_SRC16A);
+	} else {
+		/* Source is not 16 byte aligned, so make it 16 byte aligned. */
+		int32_t             buffer[4] __rte_aligned(16);
+		const size_t        first = 16 - offset;
+		register __m128i    xmm0;
+
+		/* First, copy first part of data in chunks of 4 byte,
+		 * to achieve 16 byte alignment of source.
+		 * This invalidates the source, destination and length alignment flags, and
+		 * potentially makes the destination pointer 16 byte unaligned/aligned.
+		 */
+
+		/** Copy from 16 byte aligned source pointer (floor rounding). */
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+		_mm_store_si128((void *)buffer, xmm0);
+
+		if (unlikely(len + offset <= 16)) {
+			/* Short length. */
+			if (((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ||
+					(len & 3) == 0) {
+				/* Length is 4 byte aligned. */
+				switch (len) {
+				case 1 * 4:
+					/* Offset can be 1 * 4, 2 * 4 or 3 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
+							buffer[offset / 4]);
+					break;
+				case 2 * 4:
+					/* Offset can be 1 * 4 or 2 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
+							buffer[offset / 4]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4),
+							buffer[offset / 4 + 1]);
+					break;
+				case 3 * 4:
+					/* Offset can only be 1 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+					break;
+				}
+			} else {
+				/* Length is not 4 byte aligned. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+			}
+			return;
+		}
+
+		switch (first) {
+		case 1 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
+			break;
+		case 2 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
+			break;
+		case 3 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+			break;
+		}
+
+		src = RTE_PTR_ADD(src, first);
+		dst = RTE_PTR_ADD(dst, first);
+		len -= first;
+
+		/* Source pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(src, 16));
+
+		/* Then, copy the rest, using corrected alignment flags. */
+		if (rte_is_aligned(dst, 16))
+			rte_memcpy_nt_d16s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+#ifdef RTE_ARCH_X86_64
+		else if (rte_is_aligned(dst, 8))
+			rte_memcpy_nt_d8s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST8A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+#endif /* RTE_ARCH_X86_64 */
+		else
+			rte_memcpy_nt_d4s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST4A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+	}
+}
+
+#ifndef RTE_MEMCPY_NT_BUFSIZE
+
+#include <lib/mbuf/rte_mbuf_core.h>
+
+/** Bounce buffer size for non-temporal memcpy.
+ *
+ * Must be 2^N and >= 128.
+ * The actual buffer will be slightly larger, due to added padding.
+ * The default is chosen to be able to handle a non-segmented packet.
+ */
+#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM
+
+#endif  /* RTE_MEMCPY_NT_BUFSIZE */
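+
+/* Illustrative note, not part of the API contract: because of the #ifndef
+ * guard above, an application may override the bounce buffer size by defining
+ * RTE_MEMCPY_NT_BUFSIZE before this header is included, e.g. via a compiler
+ * flag. The value below is only an example; it must be 2^N and >= 128:
+ *
+ *   #define RTE_MEMCPY_NT_BUFSIZE 4096
+ */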
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Non-temporal memory copy via bounce buffer.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_buf(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** Cache line aligned bounce buffer with preceding and trailing padding.
+	 *
+	 * The preceding padding is one cache line, so the data area itself
+	 * is cache line aligned.
+	 * The trailing padding is 16 bytes, leaving room for the trailing bytes
+	 * of a 16 byte store operation.
+	 */
+	char			buffer[RTE_CACHE_LINE_SIZE + RTE_MEMCPY_NT_BUFSIZE +  16]
+				__rte_cache_aligned;
+	/** Pointer to bounce buffer's aligned data area. */
+	char		* const buf0 = &buffer[RTE_CACHE_LINE_SIZE];
+	void		       *buf;
+	/** Number of bytes to copy from source, incl. any extra preceding bytes. */
+	size_t			srclen;
+	register __m128i	xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Step 1:
+	 * Copy data from the source to the bounce buffer's aligned data area,
+	 * using aligned non-temporal load from the source,
+	 * and unaligned store in the bounce buffer.
+	 *
+	 * If the source is unaligned, the additional bytes preceding the data will be copied
+	 * to the padding area preceding the bounce buffer's aligned data area.
+	 * Similarly, if the source data ends at an unaligned address, the additional bytes
+	 * trailing the data will be copied to the padding area trailing the bounce buffer's
+	 * aligned data area.
+	 */
+
+	/* Adjust for extra preceding bytes, unless source is known to be 16 byte aligned. */
+	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) {
+		buf = buf0;
+		srclen = len;
+	} else {
+		/** How many bytes the source is offset from 16 byte alignment (floor rounding). */
+		const size_t offset = (uintptr_t)src & 15;
+
+		buf = RTE_PTR_SUB(buf0, offset);
+		src = RTE_PTR_SUB(src, offset);
+		srclen = len + offset;
+	}
+
+	/* Copy large portion of data from source to bounce buffer in chunks of 64 byte. */
+	while (srclen >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		buf = RTE_PTR_ADD(buf, 64);
+		srclen -= 64;
+	}
+
+	/* Copy remaining 32 and 16 byte portions of data from source to bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the length alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(srclen & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		buf = RTE_PTR_ADD(buf, 32);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(srclen & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(buf, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		buf = RTE_PTR_ADD(buf, 16);
+	}
+	/* Copy any trailing bytes of data from source to bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the length alignment
+	 * flags are still valid)
+	 * and length is known to be 16 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(srclen & 15)) {
+		xmm3 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(buf, xmm3);
+	}
+
+	/* Step 2:
+	 * Copy from the aligned bounce buffer to the non-temporal destination.
+	 */
+	rte_memcpy_ntd(dst, buf0, len,
+			(flags & ~(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRCA_MASK)) |
+			(RTE_CACHE_LINE_SIZE << RTE_MEMOPS_F_SRCA_SHIFT));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @note
+ * If the destination and/or length is unaligned, some copied bytes will be
+ * stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_generic(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+
+	while (len > RTE_MEMCPY_NT_BUFSIZE) {
+		rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE,
+				(flags & ~RTE_MEMOPS_F_LENA_MASK) | RTE_MEMOPS_F_LEN128A);
+		dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
+		src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
+		len -= RTE_MEMCPY_NT_BUFSIZE;
+	}
+	rte_memcpy_nt_buf(dst, src, len, flags);
+}
+
+/* Implementation. Refer to function declaration for documentation. */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	if ((flags & (RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) ==
+			(RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) {
+		/* Copy between non-temporal source and destination. */
+		if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d16s16a(dst, src, len, flags);
+#ifdef RTE_ARCH_X86_64
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d8s16a(dst, src, len, flags);
+#endif /* RTE_ARCH_X86_64 */
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d4s16a(dst, src, len, flags);
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
+			rte_memcpy_nt_d4s4a(dst, src, len, flags);
+		else if (len <= RTE_MEMCPY_NT_BUFSIZE)
+			rte_memcpy_nt_buf(dst, src, len, flags);
+		else
+			rte_memcpy_nt_generic(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_SRC_NT) {
+		/* Copy from non-temporal source. */
+		rte_memcpy_nts(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_DST_NT) {
+		/* Copy to non-temporal destination. */
+		rte_memcpy_ntd(dst, src, len, flags);
+	} else
+		rte_memcpy(dst, src, len);
+}
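+
+/*
+ * Usage sketch (illustrative only; capture_buf, pkt and pkt_len are
+ * hypothetical application variables):
+ *
+ *   rte_memcpy_ex(capture_buf, pkt, pkt_len,
+ *           RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
+ *
+ * The flags must be constant at build time. After a burst of non-temporal
+ * copies, a write memory barrier such as rte_wmb() may be needed before the
+ * copied data is handed over to another lcore (see the pdump change below).
+ */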
+
 #undef ALIGNMENT_MASK
 
 #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index a2307cebe6..aa96fb4cc8 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -660,6 +660,83 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	return mc;
 }
 
+/* Create a deep copy of mbuf, using non-temporal memory access */
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		 uint32_t off, uint32_t len, const uint64_t flags)
+{
+	const struct rte_mbuf *seg = m;
+	struct rte_mbuf *mc, *m_last, **prev;
+
+	/* garbage in check */
+	__rte_mbuf_sanity_check(m, 1);
+
+	/* check for request to copy at offset past end of mbuf */
+	if (unlikely(off >= m->pkt_len))
+		return NULL;
+
+	mc = rte_pktmbuf_alloc(mp);
+	if (unlikely(mc == NULL))
+		return NULL;
+
+	/* truncate requested length to available data */
+	if (len > m->pkt_len - off)
+		len = m->pkt_len - off;
+
+	__rte_pktmbuf_copy_hdr(mc, m);
+
+	/* copied mbuf is not indirect or external */
+	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
+
+	prev = &mc->next;
+	m_last = mc;
+	while (len > 0) {
+		uint32_t copy_len;
+
+		/* skip leading mbuf segments */
+		while (off >= seg->data_len) {
+			off -= seg->data_len;
+			seg = seg->next;
+		}
+
+		/* current buffer is full, chain a new one */
+		if (rte_pktmbuf_tailroom(m_last) == 0) {
+			m_last = rte_pktmbuf_alloc(mp);
+			if (unlikely(m_last == NULL)) {
+				rte_pktmbuf_free(mc);
+				return NULL;
+			}
+			++mc->nb_segs;
+			*prev = m_last;
+			prev = &m_last->next;
+		}
+
+		/*
+		 * copy the min of data in input segment (seg)
+		 * vs space available in output (m_last)
+		 */
+		copy_len = RTE_MIN(seg->data_len - off, len);
+		if (copy_len > rte_pktmbuf_tailroom(m_last))
+			copy_len = rte_pktmbuf_tailroom(m_last);
+
+		/* append from seg to m_last */
+		rte_memcpy_ex(rte_pktmbuf_mtod_offset(m_last, char *,
+						   m_last->data_len),
+			   rte_pktmbuf_mtod_offset(seg, char *, off),
+			   copy_len, flags);
+
+		/* update offsets and lengths */
+		m_last->data_len += copy_len;
+		mc->pkt_len += copy_len;
+		off += copy_len;
+		len -= copy_len;
+	}
+
+	/* garbage out check */
+	__rte_mbuf_sanity_check(mc, 1);
+	return mc;
+}
+
 /* dump a mbuf on console */
 void
 rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index b6e23d98ce..030df396a3 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1443,6 +1443,38 @@ struct rte_mbuf *
 rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t offset, uint32_t length);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Create a full copy of a given packet mbuf,
+ * using non-temporal memory access as specified by flags.
+ *
+ * Copies all the data from a given packet mbuf to a newly allocated
+ * set of mbufs. The private data is not copied.
+ *
+ * @param m
+ *   The packet mbuf to be copied.
+ * @param mp
+ *   The mempool from which the "clone" mbufs are allocated.
+ * @param offset
+ *   The number of bytes to skip before copying.
+ *   If the mbuf does not have that many bytes, it is an error
+ *   and NULL is returned.
+ * @param length
+ *   The upper limit on bytes to copy.  Passing UINT32_MAX
+ *   means all data (after offset).
+ * @param flags
+ *   Non-temporal memory access hints for rte_memcpy_ex.
+ * @return
+ *   - The pointer to the new "clone" mbuf on success.
+ *   - NULL if allocation fails.
+ */
+__rte_experimental
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		    uint32_t offset, uint32_t length, const uint64_t flags);
+
 /**
  * Adds given value to the refcnt of all packet mbuf segments.
  *
diff --git a/lib/mbuf/version.map b/lib/mbuf/version.map
index ed486ed14e..b583364ad4 100644
--- a/lib/mbuf/version.map
+++ b/lib/mbuf/version.map
@@ -47,5 +47,6 @@ EXPERIMENTAL {
 	global:
 
 	rte_pktmbuf_pool_create_extbuf;
+	rte_pktmbuf_copy_ex;
 
 };
diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
index af2b814251..ae871c4865 100644
--- a/lib/pcapng/rte_pcapng.c
+++ b/lib/pcapng/rte_pcapng.c
@@ -466,7 +466,8 @@ rte_pcapng_copy(uint16_t port_id, uint32_t queue,
 	orig_len = rte_pktmbuf_pkt_len(md);
 
 	/* Take snapshot of the data */
-	mc = rte_pktmbuf_copy(md, mp, 0, length);
+	mc = rte_pktmbuf_copy_ex(md, mp, 0, length,
+				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 	if (unlikely(mc == NULL))
 		return NULL;
 
diff --git a/lib/pdump/rte_pdump.c b/lib/pdump/rte_pdump.c
index 98dcbc037b..6e61c75407 100644
--- a/lib/pdump/rte_pdump.c
+++ b/lib/pdump/rte_pdump.c
@@ -124,7 +124,8 @@ pdump_copy(uint16_t port_id, uint16_t queue,
 					    pkts[i], mp, cbs->snaplen,
 					    ts, direction);
 		else
-			p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
+			p = rte_pktmbuf_copy_ex(pkts[i], mp, 0, cbs->snaplen,
+						RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 
 		if (unlikely(p == NULL))
 			__atomic_fetch_add(&stats->nombuf, 1, __ATOMIC_RELAXED);
@@ -134,6 +135,9 @@ pdump_copy(uint16_t port_id, uint16_t queue,
 
 	__atomic_fetch_add(&stats->accepted, d_pkts, __ATOMIC_RELAXED);
 
+	/* Flush non-temporal stores regarding the packet copies. */
+	rte_wmb();
+
 	ring_enq = rte_ring_enqueue_burst(ring, (void *)dup_bufs, d_pkts, NULL);
 	if (unlikely(ring_enq < d_pkts)) {
 		unsigned int drops = d_pkts - ring_enq;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4] eal: non-temporal memcpy
  2022-08-19 13:58 [RFC v3] non-temporal memcpy Morten Brørup
                   ` (2 preceding siblings ...)
  2022-10-09 15:35 ` [PATCH v3] " Morten Brørup
@ 2022-10-10  6:46 ` Morten Brørup
  2022-10-16 14:27   ` Mattias Rönnblom
                     ` (2 more replies)
  3 siblings, 3 replies; 17+ messages in thread
From: Morten Brørup @ 2022-10-10  6:46 UTC (permalink / raw)
  To: hofors, bruce.richardson, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, kda, drc, dev, Morten Brørup

This patch provides a function for memory copy using non-temporal store,
load or both, controlled by flags passed to the function.
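
A minimal usage sketch (dst, src and len are placeholders for the caller's
own buffers and length; the flags must be constant at build time):

    rte_memcpy_ex(dst, src, len,
                  RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);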

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of the function is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput can be improved by further optimization, I do not
have time to do it now.

The functional tests and performance tests for memory copy have been
expanded to include non-temporal copying.

A non-temporal version of the mbuf library's function to create a full
copy of a given packet mbuf is provided.
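
For example (a sketch only; m and mp are the caller's mbuf and mempool, and
UINT32_MAX requests all data after the offset, per the new documentation):

    struct rte_mbuf *mc = rte_pktmbuf_copy_ex(m, mp, 0, UINT32_MAX,
                          RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);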

The packet capture and packet dump libraries have been updated to use
non-temporal memory copy of the packets.

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions require 16 byte aligned source addresses [1],
and non-temporal store instructions require 4, 8 or 16 byte aligned destination
addresses [2].

ARM non-temporal load and store instructions seem to require 4 byte
alignment [3].

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
index.html#text=_mm_stream_load
[2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
index.html#text=_mm_stream_si
[3] https://developer.arm.com/documentation/100076/0100/
A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
LDNP--SIMD-and-FP-
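
If the application can guarantee the alignment of the addresses, this can be
hinted to the copy function via the flags, e.g. (illustrative only; the 16 byte
alignment guarantees are the caller's responsibility):

    rte_memcpy_ex(dst, src, len,
                  RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT |
                  RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_DST16A);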

This patch is a major rewrite from the RFC v3, so no version log comparing
to the RFC is provided.

v4
* Also ignore the warning for clang in the workaround for
  _mm_stream_load_si128() missing const in the parameter.
* Add missing C linkage specifier in rte_memcpy.h.

v3
* _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
  use it on 64-bit x86 architecture.
* CLANG warns that _mm_stream_load_si128_const() and
  rte_memcpy_nt_15_or_less_s16a() are not public,
  so remove __rte_internal from them. It also affects the documentation
  for the functions, so the fix can't be limited to CLANG.
* Use __rte_experimental instead of __rte_internal.
* Replace <n> with nnn in the function documentation, so it doesn't look
  like an HTML tag.
* Slightly modify the workaround for _mm_stream_load_si128() missing const
  in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
  #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
  #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
* Fixed one coding style issue missed in v2.

v2
* The last 16 byte block of data, incl. any trailing bytes, were not
  copied from the source memory area in rte_memcpy_nt_buf().
* Fix many coding style issues.
* Add some missing header files.
* Fix build time warning for non-x86 architectures by using a different
  method to mark the flags parameter unused.
* CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
  so omit it when using CLANG.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_memcpy.c               |   65 +-
 app/test/test_memcpy_perf.c          |  187 ++--
 lib/eal/include/generic/rte_memcpy.h |  127 +++
 lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
 lib/mbuf/rte_mbuf.c                  |   77 ++
 lib/mbuf/rte_mbuf.h                  |   32 +
 lib/mbuf/version.map                 |    1 +
 lib/pcapng/rte_pcapng.c              |    3 +-
 lib/pdump/rte_pdump.c                |    6 +-
 9 files changed, 1645 insertions(+), 91 deletions(-)

diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
index 1ab86f4967..12410ce413 100644
--- a/app/test/test_memcpy.c
+++ b/app/test/test_memcpy.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -36,6 +37,19 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 /* Data is aligned on this many bytes (power of 2) */
 #define ALIGNMENT_UNIT          32
 
+const uint64_t nt_mode_flags[4] = {
+	0,
+	RTE_MEMOPS_F_SRC_NT,
+	RTE_MEMOPS_F_DST_NT,
+	RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT
+};
+const char * const nt_mode_str[4] = {
+	"none",
+	"src",
+	"dst",
+	"src+dst"
+};
+
 
 /*
  * Create two buffers, and initialise one with random values. These are copied
@@ -44,12 +58,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
  * changed.
  */
 static int
-test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
+test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size, unsigned int nt_mode)
 {
 	unsigned int i;
 	uint8_t dest[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	uint8_t src[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
 	void * ret;
+	const uint64_t flags = nt_mode_flags[nt_mode];
 
 	/* Setup buffers */
 	for (i = 0; i < SMALL_BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
@@ -58,18 +73,23 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	}
 
 	/* Do the copy */
-	ret = rte_memcpy(dest + off_dst, src + off_src, size);
-	if (ret != (dest + off_dst)) {
-		printf("rte_memcpy() returned %p, not %p\n",
-		       ret, dest + off_dst);
+	if (nt_mode) {
+		rte_memcpy_ex(dest + off_dst, src + off_src, size, flags);
+	} else {
+		ret = rte_memcpy(dest + off_dst, src + off_src, size);
+		if (ret != (dest + off_dst)) {
+			printf("rte_memcpy() returned %p, not %p\n",
+			       ret, dest + off_dst);
+		}
 	}
 
 	/* Check nothing before offset is affected */
 	for (i = 0; i < off_dst; i++) {
 		if (dest[i] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[modified before start of dst].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -77,9 +97,11 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check everything was copied */
 	for (i = 0; i < size; i++) {
 		if (dest[i + off_dst] != src[i + off_src]) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
-			       "[didn't copy byte %u].\n",
-			       (unsigned)size, off_src, off_dst, i);
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
+			       "[didn't copy byte %u: 0x%02x!=0x%02x].\n",
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode], i,
+			       dest[i + off_dst], src[i + off_src]);
 			return -1;
 		}
 	}
@@ -87,9 +109,10 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 	/* Check nothing after copy was affected */
 	for (i = size; i < SMALL_BUFFER_SIZE; i++) {
 		if (dest[i + off_dst] != 0) {
-			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
+			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
 			       "[copied too many].\n",
-			       (unsigned)size, off_src, off_dst);
+			       nt_mode ? "_ex" : "",
+			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
 			return -1;
 		}
 	}
@@ -102,16 +125,18 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
 static int
 func_test(void)
 {
-	unsigned int off_src, off_dst, i;
+	unsigned int off_src, off_dst, i, nt_mode;
 	int ret;
 
-	for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
-		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
-			for (i = 0; i < RTE_DIM(buf_sizes); i++) {
-				ret = test_single_memcpy(off_src, off_dst,
-				                         buf_sizes[i]);
-				if (ret != 0)
-					return -1;
+	for (nt_mode = 0; nt_mode < 4; nt_mode++) {
+		for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
+			for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
+				for (i = 0; i < RTE_DIM(buf_sizes); i++) {
+					ret = test_single_memcpy(off_src, off_dst,
+								 buf_sizes[i], nt_mode);
+					if (ret != 0)
+						return -1;
+				}
 			}
 		}
 	}
diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 3727c160e6..6bb52cba88 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #include <stdint.h>
@@ -15,6 +16,7 @@
 #include <rte_malloc.h>
 
 #include <rte_memcpy.h>
+#include <rte_atomic.h>
 
 #include "test.h"
 
@@ -27,9 +29,9 @@
 /* List of buffer sizes to test */
 #if TEST_VALUE_RANGE == 0
 static size_t buf_sizes[] = {
-	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
-	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
-	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
+	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 40, 48, 60, 63, 64, 65, 80, 92, 124,
+	127, 128, 129, 140, 152, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447,
+	448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
 	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
 };
 /* MUST be as large as largest packet size above */
@@ -72,7 +74,7 @@ static uint8_t *small_buf_read, *small_buf_write;
 static int
 init_buffers(void)
 {
-	unsigned i;
+	unsigned int i;
 
 	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (large_buf_read == NULL)
@@ -151,7 +153,7 @@ static void
 do_uncached_write(uint8_t *dst, int is_dst_cached,
 				  const uint8_t *src, int is_src_cached, size_t size)
 {
-	unsigned i, j;
+	unsigned int i, j;
 	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
 
 	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
@@ -167,66 +169,112 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
  * Run a single memcpy performance test. This is a macro to ensure that if
  * the "size" parameter is a constant it won't be converted to a variable.
  */
-#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
-                         src, is_src_cached, src_uoffset, size)             \
-do {                                                                        \
-    unsigned int iter, t;                                                   \
-    size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
-    uint64_t start_time, total_time = 0;                                    \
-    uint64_t total_time2 = 0;                                               \
-    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
-        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
-                         src_addrs, is_src_cached, src_uoffset);            \
-        start_time = rte_rdtsc();                                           \
-        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
-            rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
-        total_time += rte_rdtsc() - start_time;                             \
-    }                                                                       \
-    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
-        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
-                         src_addrs, is_src_cached, src_uoffset);            \
-        start_time = rte_rdtsc();                                           \
-        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
-            memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
-        total_time2 += rte_rdtsc() - start_time;                            \
-    }                                                                       \
-    printf("%3.0f -", (double)total_time  / TEST_ITERATIONS);                 \
-    printf("%3.0f",   (double)total_time2 / TEST_ITERATIONS);                 \
-    printf("(%6.2f%%) ", ((double)total_time - total_time2)*100/total_time2); \
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,					  \
+			 src, is_src_cached, src_uoffset, size)					  \
+do {												  \
+	unsigned int iter, t;									  \
+	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];				  \
+	uint64_t start_time;									  \
+	uint64_t total_time_rte = 0, total_time_std = 0;					  \
+	uint64_t total_time_ntd = 0, total_time_nts = 0, total_time_nt = 0;			  \
+	const uint64_t flags = ((dst_uoffset == 0) ?						  \
+				(ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT) : 0) |		  \
+			       ((src_uoffset == 0) ?						  \
+				(ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) : 0);		  \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
+				 src_addrs, is_src_cached, src_uoffset);			  \
+		start_time = rte_rdtsc();							  \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
+			rte_memcpy(dst + dst_addrs[t], src + src_addrs[t], size);		  \
+		total_time_rte += rte_rdtsc() - start_time;					  \
+	}											  \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
+				 src_addrs, is_src_cached, src_uoffset);			  \
+		start_time = rte_rdtsc();							  \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
+			memcpy(dst + dst_addrs[t], src + src_addrs[t], size);			  \
+		total_time_std += rte_rdtsc() - start_time;					  \
+	}											  \
+	if (!(is_dst_cached && is_src_cached)) {						  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_DST_NT);			  \
+			total_time_ntd += rte_rdtsc() - start_time;				  \
+		}										  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_SRC_NT);			  \
+			total_time_nts += rte_rdtsc() - start_time;				  \
+		}										  \
+		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
+			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
+					 src_addrs, is_src_cached, src_uoffset);		  \
+			start_time = rte_rdtsc();						  \
+			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
+				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
+					      flags | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT); \
+			total_time_nt += rte_rdtsc() - start_time;				  \
+		}										  \
+	}											  \
+	printf(" %4.0f-", (double)total_time_rte / TEST_ITERATIONS);				  \
+	printf("%4.0f",   (double)total_time_std / TEST_ITERATIONS);				  \
+	printf("(%+4.0f%%)", ((double)total_time_rte - total_time_std) * 100 / total_time_std);   \
+	if (!(is_dst_cached && is_src_cached)) {						  \
+		printf(" %4.0f", (double)total_time_ntd / TEST_ITERATIONS);			  \
+		printf(" %4.0f", (double)total_time_nts / TEST_ITERATIONS);			  \
+		printf(" %4.0f", (double)total_time_nt / TEST_ITERATIONS);			  \
+		if (total_time_nt / total_time_std > 9)						  \
+			printf("(*%4.1f)", (double)total_time_nt / total_time_std);		  \
+		else										  \
+			printf("(%+4.0f%%)",							  \
+			       ((double)total_time_nt - total_time_std) * 100 / total_time_std);  \
+	}											  \
 } while (0)
 
 /* Run aligned memcpy tests for each cached/uncached permutation */
-#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
-do {                                                                     \
-    if (__builtin_constant_p(n))                                         \
-        printf("\nC%6u", (unsigned)n);                                   \
-    else                                                                 \
-        printf("\n%7u", (unsigned)n);                                    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
+#define ALL_PERF_TESTS_FOR_SIZE(n)						\
+do {										\
+	if (__builtin_constant_p(n))						\
+		printf("\nC%6u", (unsigned int)n);				\
+	else									\
+		printf("\n%7u", (unsigned int)n);				\
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);	\
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);	\
 } while (0)
 
 /* Run unaligned memcpy tests for each cached/uncached permutation */
-#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
-do {                                                                     \
-    if (__builtin_constant_p(n))                                         \
-        printf("\nC%6u", (unsigned)n);                                   \
-    else                                                                 \
-        printf("\n%7u", (unsigned)n);                                    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
-    SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
-    SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)					\
+do {										\
+	if (__builtin_constant_p(n))						\
+		printf("\nC%6u", (unsigned int)n);				\
+	else									\
+		printf("\n%7u", (unsigned int)n);				\
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);	\
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);	\
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);	\
 } while (0)
 
 /* Run memcpy tests for constant length */
-#define ALL_PERF_TEST_FOR_CONSTANT                                      \
-do {                                                                    \
-    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
-    TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
-    TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
+#define ALL_PERF_TEST_FOR_CONSTANT						\
+do {										\
+	TEST_CONSTANT(4U); TEST_CONSTANT(6U); TEST_CONSTANT(8U);		\
+	TEST_CONSTANT(16U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);		\
+	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);		\
+	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);	\
+	TEST_CONSTANT(2048U);							\
 } while (0)
 
 /* Run all memcpy tests for aligned constant cases */
@@ -251,7 +299,7 @@ perf_test_constant_unaligned(void)
 static inline void
 perf_test_variable_aligned(void)
 {
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
 		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
 	}
@@ -261,7 +309,7 @@ perf_test_variable_aligned(void)
 static inline void
 perf_test_variable_unaligned(void)
 {
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
 		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
 	}
@@ -282,7 +330,7 @@ perf_test(void)
 
 #if TEST_VALUE_RANGE != 0
 	/* Set up buf_sizes array, if required */
-	unsigned i;
+	unsigned int i;
 	for (i = 0; i < TEST_VALUE_RANGE; i++)
 		buf_sizes[i] = i;
 #endif
@@ -290,13 +338,14 @@ perf_test(void)
 	/* See function comment */
 	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
 
-	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
-		   "======= ================= ================= ================= =================\n"
-		   "   Size   Cache to cache     Cache to mem      Mem to cache        Mem to mem\n"
-		   "(bytes)          (ticks)          (ticks)           (ticks)           (ticks)\n"
-		   "------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n** rte_memcpy(RTE)/memcpy(STD)/rte_memcpy_ex(NTD/NTS/NT) - memcpy perf. tests (C = compile-time constant) **\n"
+		   "======= ================ ====================================== ====================================== ======================================\n"
+		   "   Size  Cache to cache               Cache to mem                           Mem to cache                            Mem to mem\n"
+		   "(bytes)         (ticks)                    (ticks)                                (ticks)                               (ticks)\n"
+		   "         RTE- STD(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)\n"
+		   "------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 
-	printf("\n================================= %2dB aligned =================================",
+	printf("\n================================================================ %2dB aligned ===============================================================",
 		ALIGNMENT_UNIT);
 	/* Do aligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
@@ -304,28 +353,28 @@ perf_test(void)
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do aligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_aligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_aligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n================================== Unaligned ==================================");
+	printf("\n================================================================= Unaligned =================================================================");
 	/* Do unaligned tests where size is a variable */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_variable_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n------- ----------------- ----------------- ----------------- -----------------");
+	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
 	/* Do unaligned tests where size is a compile-time constant */
 	timespec_get(&tv_begin, TIME_UTC);
 	perf_test_constant_unaligned();
 	timespec_get(&tv_end, TIME_UTC);
 	time_unaligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
 		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
-	printf("\n======= ================= ================= ================= =================\n\n");
+	printf("\n======= ================ ====================================== ====================================== ======================================\n\n");
 
 	printf("Test Execution Time (seconds):\n");
 	printf("Aligned variable copy size   = %8.3f\n", time_aligned);
diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
index e7f0f8eaa9..b087f09c35 100644
--- a/lib/eal/include/generic/rte_memcpy.h
+++ b/lib/eal/include/generic/rte_memcpy.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_H_
@@ -11,6 +12,13 @@
  * Functions for vectorised implementation of memcpy().
  */
 
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
 /**
  * Copy 16 bytes from one location to another using optimised
  * instructions. The locations should not overlap.
@@ -113,4 +121,123 @@ rte_memcpy(void *dst, const void *src, size_t n);
 
 #endif /* __DOXYGEN__ */
 
+/*
+ * Advanced/Non-Temporal Memory Operations Flags.
+ */
+
+/** Length alignment hint mask. */
+#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
+/** Length alignment hint shift. */
+#define RTE_MEMOPS_F_LENA_SHIFT 0
+/** Hint: Length is 2 byte aligned. */
+#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
+/** Hint: Length is 4 byte aligned. */
+#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
+/** Hint: Length is 8 byte aligned. */
+#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
+/** Hint: Length is 16 byte aligned. */
+#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
+/** Hint: Length is 32 byte aligned. */
+#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
+/** Hint: Length is 64 byte aligned. */
+#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
+/** Hint: Length is 128 byte aligned. */
+#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)
+
+/** Prefer non-temporal access to source memory area.
+ */
+#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
+/** Source address alignment hint mask. */
+#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
+/** Source address alignment hint shift. */
+#define RTE_MEMOPS_F_SRCA_SHIFT 8
+/** Hint: Source address is 2 byte aligned. */
+#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
+/** Hint: Source address is 4 byte aligned. */
+#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
+/** Hint: Source address is 8 byte aligned. */
+#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
+/** Hint: Source address is 16 byte aligned. */
+#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
+/** Hint: Source address is 32 byte aligned. */
+#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
+/** Hint: Source address is 64 byte aligned. */
+#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
+/** Hint: Source address is 128 byte aligned. */
+#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)
+
+/** Prefer non-temporal access to destination memory area.
+ *
+ * On x86 architecture:
+ * Remember to call rte_wmb() after a sequence of copy operations.
+ */
+#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
+/** Destination address alignment hint mask. */
+#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
+/** Destination address alignment hint shift. */
+#define RTE_MEMOPS_F_DSTA_SHIFT 16
+/** Hint: Destination address is 2 byte aligned. */
+#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
+/** Hint: Destination address is 4 byte aligned. */
+#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
+/** Hint: Destination address is 8 byte aligned. */
+#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
+/** Hint: Destination address is 16 byte aligned. */
+#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
+/** Hint: Destination address is 32 byte aligned. */
+#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
+/** Hint: Destination address is 64 byte aligned. */
+#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
+/** Hint: Destination address is 128 byte aligned. */
+#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)
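+
+/* Illustrative example (the variable names are arbitrary): an alignment hint
+ * stores the alignment value itself in its byte-wide field, so a hint can be
+ * composed and decoded like this:
+ *
+ *   const uint64_t flags = RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_DST16A;
+ *   const size_t dst_align =
+ *           (flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT;
+ *   // dst_align == 16; a value of zero means no alignment hint was given.
+ */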
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Advanced/non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)nnnA flags.
+ *   Must be constant at build time.
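+ *
+ * A minimal usage sketch (dst_buf, src_data and data_len are hypothetical
+ * application variables; the destination is assumed to be 16 byte aligned):
+ *
+ * @code
+ * // Copy data into a buffer that will not be read again soon,
+ * // without polluting the data cache with the copied data.
+ * rte_memcpy_ex(dst_buf, src_data, data_len,
+ *               RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_DST16A);
+ * // On x86, order the non-temporal stores before the data is used elsewhere.
+ * rte_wmb();
+ * @endcode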
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags);
+
+#ifndef RTE_MEMCPY_EX_ARCH_DEFINED
+
+/* Fallback implementation, if no arch-specific implementation is provided. */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	RTE_SET_USED(flags);
+	memcpy(dst, src, len);
+}
+
+#endif /* RTE_MEMCPY_EX_ARCH_DEFINED */
+
+#ifdef __cplusplus
+}
+#endif
+
 #endif /* _RTE_MEMCPY_H_ */
diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index d4d7a5cfc8..31d0faf7a8 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2022 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_X86_64_H_
@@ -17,6 +18,10 @@
 #include <rte_vect.h>
 #include <rte_common.h>
 #include <rte_config.h>
+#include <rte_debug.h>
+
+#define RTE_MEMCPY_EX_ARCH_DEFINED
+#include "generic/rte_memcpy.h"
 
 #ifdef __cplusplus
 extern "C" {
@@ -868,6 +873,1239 @@ rte_memcpy(void *dst, const void *src, size_t n)
 		return rte_memcpy_generic(dst, src, n);
 }
 
+/*
+ * Advanced/Non-Temporal Memory Operations.
+ */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Workaround for _mm_stream_load_si128() missing const in the parameter.
+ */
+__rte_experimental
+static __rte_always_inline
+__m128i _mm_stream_load_si128_const(const __m128i *const mem_addr)
+{
+	/* GCC 4.8.5 (in RHEL7) doesn't support the #pragma to ignore "-Wdiscarded-qualifiers".
+	 * So we explicitly type cast mem_addr and use the #pragma to ignore "-Wcast-qual".
+	 */
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wcast-qual"
+#elif defined(RTE_TOOLCHAIN_CLANG)
+#pragma clang diagnostic push
+#pragma clang diagnostic ignored "-Wcast-qual"
+#endif
+	return _mm_stream_load_si128((__m128i *)mem_addr);
+#if defined(RTE_TOOLCHAIN_GCC)
+#pragma GCC diagnostic pop
+#elif defined(RTE_TOOLCHAIN_CLANG)
+#pragma clang diagnostic pop
+#endif
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Memory copy from non-temporal source area.
+ *
+ * @note
+ * Performance is optimal when source pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|SRC)nnnA flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be set.
+ *   The RTE_MEMOPS_F_DST_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DSTnnnA flags are ignored.
+ *   Must be constant at build time.
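+ *
+ * A minimal usage sketch (work_buf, nt_buf and n are hypothetical application
+ * variables; the source is assumed to be 16 byte aligned):
+ *
+ * @code
+ * // Read data from a memory area that should not enter the cache,
+ * // into a cached working buffer.
+ * rte_memcpy_nts(work_buf, nt_buf, n,
+ *                RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRC16A);
+ * @endcode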
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nts(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_SRC_NT);
+
+	if (unlikely(len == 0))
+		return;
+
+	/* If source is not 16 byte aligned, then copy first part of data via bounce buffer,
+	 * to achieve 16 byte alignment of source pointer.
+	 * This invalidates the source, destination and length alignment flags, and
+	 * potentially makes the destination pointer unaligned.
+	 *
+	 * Omitted if source is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) {
+		/* Source is not known to be 16 byte aligned, but might be. */
+		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+		const size_t    offset = (uintptr_t)src & 15;
+
+		if (offset) {
+			/* Source is not 16 byte aligned. */
+			char            buffer[16] __rte_aligned(16);
+			/** How many bytes is source away from 16 byte alignment
+			 * (ceiling rounding).
+			 */
+			const size_t    first = 16 - offset;
+
+			xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+			_mm_store_si128((void *)buffer, xmm0);
+
+			/* Test for short length.
+			 *
+			 * Omitted if length is known to be >= 16.
+			 */
+			if (!(__builtin_constant_p(len) && len >= 16) &&
+					unlikely(len <= first)) {
+				/* Short length. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+				return;
+			}
+
+			/* Copy until source pointer is 16 byte aligned. */
+			rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), first);
+			src = RTE_PTR_ADD(src, first);
+			dst = RTE_PTR_ADD(dst, first);
+			len -= first;
+		}
+	}
+
+	/* Source pointer is now 16 byte aligned. */
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, if any, via bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the alignment
+	 * flags are still valid) and length is known to be 16 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 15)) {
+		char    buffer[16] __rte_aligned(16);
+
+		xmm3 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)buffer, xmm3);
+		rte_mov15_or_less(dst, buffer, len & 15);
+	}
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Memory copy to non-temporal destination area.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ * @note
+ * Performance is optimal when destination pointer is 16 byte aligned.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ *   Any of the RTE_MEMOPS_F_(LEN|DST)nnnA flags.
+ *   The RTE_MEMOPS_F_SRC_NT flag must be clear.
+ *   The RTE_MEMOPS_F_DST_NT flag must be set.
+ *   The RTE_MEMOPS_F_SRCnnnA flags are ignored.
+ *   Must be constant at build time.
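+ *
+ * A minimal usage sketch (nb_bufs, dst_bufs, src_bufs and lens are hypothetical
+ * application variables); note the single rte_wmb() after the whole sequence
+ * of copies:
+ *
+ * @code
+ * for (i = 0; i < nb_bufs; i++)
+ *         rte_memcpy_ntd(dst_bufs[i], src_bufs[i], lens[i], RTE_MEMOPS_F_DST_NT);
+ * rte_wmb();
+ * @endcode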
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ntd(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_DST_NT);
+
+	if (unlikely(len == 0))
+		return;
+
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ||
+			len >= 16) {
+		/* Length >= 16 and/or destination is known to be 16 byte aligned. */
+		register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+		/* If destination is not 16 byte aligned, then copy first part of data,
+		 * to achieve 16 byte alignment of destination pointer.
+		 * This invalidates the source, destination and length alignment flags, and
+		 * potentially makes the source pointer unaligned.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)) {
+			/* Destination is not known to be 16 byte aligned, but might be. */
+			/** How many bytes is destination offset from 16 byte alignment
+			 * (floor rounding).
+			 */
+			const size_t    offset = (uintptr_t)dst & 15;
+
+			if (offset) {
+				/* Destination is not 16 byte aligned. */
+				/** How many bytes is destination away from 16 byte alignment
+				 * (ceiling rounding).
+				 */
+				const size_t    first = 16 - offset;
+
+				if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+						(offset & 3) == 0) {
+					/* Destination is (known to be) 4 byte aligned. */
+					int32_t r0, r1, r2;
+
+					/* Copy until destination pointer is 16 byte aligned. */
+					if (first & 8) {
+						memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+						memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+						src = RTE_PTR_ADD(src, 8);
+						dst = RTE_PTR_ADD(dst, 8);
+						len -= 8;
+					}
+					if (first & 4) {
+						memcpy(&r2, src, 4);
+						_mm_stream_si32(dst, r2);
+						src = RTE_PTR_ADD(src, 4);
+						dst = RTE_PTR_ADD(dst, 4);
+						len -= 4;
+					}
+				} else {
+					/* Destination is not 4 byte aligned. */
+					/* Copy until destination pointer is 16 byte aligned. */
+					rte_mov15_or_less(dst, src, first);
+					src = RTE_PTR_ADD(src, first);
+					dst = RTE_PTR_ADD(dst, first);
+					len -= first;
+				}
+			}
+		}
+
+		/* Destination pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(dst, 16));
+
+		/* Copy large portion of data in chunks of 64 byte. */
+		while (len >= 64) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			xmm2 = _mm_loadu_si128(RTE_PTR_ADD(src, 2 * 16));
+			xmm3 = _mm_loadu_si128(RTE_PTR_ADD(src, 3 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+			src = RTE_PTR_ADD(src, 64);
+			dst = RTE_PTR_ADD(dst, 64);
+			len -= 64;
+		}
+
+		/* Copy following 32 and 16 byte portions of data.
+		 *
+		 * Omitted if destination is known to be 16 byte aligned (so the alignment
+		 * flags are still valid)
+		 * and length is known to be respectively 64 or 32 byte aligned.
+		 */
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+				(len & 32)) {
+			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
+			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
+			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+			src = RTE_PTR_ADD(src, 32);
+			dst = RTE_PTR_ADD(dst, 32);
+		}
+		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+				(len & 16)) {
+			xmm2 = _mm_loadu_si128(src);
+			_mm_stream_si128(dst, xmm2);
+			src = RTE_PTR_ADD(src, 16);
+			dst = RTE_PTR_ADD(dst, 16);
+		}
+	} else {
+		/* Length <= 15, and
+		 * destination is not known to be 16 byte aligned (but might be).
+		 */
+		/* If destination is not 4 byte aligned, then
+		 * use normal copy and return.
+		 *
+		 * Omitted if destination is known to be 4 byte aligned.
+		 */
+		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) &&
+				!rte_is_aligned(dst, 4)) {
+			/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+			rte_mov15_or_less(dst, src, len);
+			return;
+		}
+		/* Destination is (known to be) 4 byte aligned. Proceed. */
+	}
+
+	/* Destination pointer is now 4 byte (or 16 byte) aligned. */
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+
+	/* Copy following 8 and 4 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 16 or 8 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(len & 8)) {
+		int32_t r0, r1;
+
+		memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
+		memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
+		src = RTE_PTR_ADD(src, 8);
+		dst = RTE_PTR_ADD(dst, 8);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A)) &&
+			(len & 4)) {
+		int32_t r2;
+
+		memcpy(&r2, src, 4);
+		_mm_stream_si32(dst, r2);
+		src = RTE_PTR_ADD(src, 4);
+		dst = RTE_PTR_ADD(dst, 4);
+	}
+
+	/* Copy remaining 2 and 1 byte portions of data.
+	 *
+	 * Omitted if destination is known to be 16 byte aligned (so the alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 4 or 2 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A)) &&
+			(len & 2)) {
+		int16_t r3;
+
+		memcpy(&r3, src, 2);
+		*(int16_t *)dst = r3;
+		src = RTE_PTR_ADD(src, 2);
+		dst = RTE_PTR_ADD(dst, 2);
+	}
+	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A)) &&
+			(len & 1))
+		*(char *)dst = *(const char *)src;
+}
+
+/**
+ * Non-temporal memory copy of 15 byte or less
+ * from a 16 byte aligned source via bounce buffer.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Only the 4 least significant bits of this parameter are used;
+ *   they hold the number of remaining bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_15_or_less_s16a(void *__rte_restrict dst,
+		const void *__rte_restrict src, size_t len, const uint64_t flags)
+{
+	int32_t             buffer[4] __rte_aligned(16);
+	register __m128i    xmm0;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if ((len & 15) == 0)
+		return;
+
+	/* Non-temporal load into bounce buffer. */
+	xmm0 = _mm_stream_load_si128_const(src);
+	_mm_store_si128((void *)buffer, xmm0);
+
+	/* Store from bounce buffer. */
+	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
+			rte_is_aligned(dst, 4)) {
+		/* Destination is (known to be) 4 byte aligned. */
+		src = (const void *)buffer;
+		if (len & 8) {
+#ifdef RTE_ARCH_X86_64
+			if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) {
+				/* Destination is known to be 8 byte aligned. */
+				_mm_stream_si64(dst, *(const int64_t *)src);
+			} else {
+#endif /* RTE_ARCH_X86_64 */
+				_mm_stream_si32(RTE_PTR_ADD(dst, 0), buffer[0]);
+				_mm_stream_si32(RTE_PTR_ADD(dst, 4), buffer[1]);
+#ifdef RTE_ARCH_X86_64
+			}
+#endif /* RTE_ARCH_X86_64 */
+			src = RTE_PTR_ADD(src, 8);
+			dst = RTE_PTR_ADD(dst, 8);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A) &&
+				(len & 4)) {
+			_mm_stream_si32(dst, *(const int32_t *)src);
+			src = RTE_PTR_ADD(src, 4);
+			dst = RTE_PTR_ADD(dst, 4);
+		}
+
+		/* Non-temporal store is unavailable for the remaining 3 byte or less. */
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) &&
+				(len & 2)) {
+			*(int16_t *)dst = *(const int16_t *)src;
+			src = RTE_PTR_ADD(src, 2);
+			dst = RTE_PTR_ADD(dst, 2);
+		}
+		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A) &&
+				(len & 1)) {
+			*(char *)dst = *(const char *)src;
+		}
+	} else {
+		/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
+		rte_mov15_or_less(dst, (const void *)buffer, len & 15);
+	}
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 16 byte aligned addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 16 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d16s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 16));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
+		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_stream_si128(dst, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ?
+				flags : RTE_MEMOPS_F_DST16A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+
+#ifdef RTE_ARCH_X86_64
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 8/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 8 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d8s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int64_t             buffer[8] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 8));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 2], xmm3);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 4 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 5 * 8), buffer[5]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 6 * 8), buffer[6]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 7 * 8), buffer[7]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[4]);
+		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[5]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) ?
+				flags : RTE_MEMOPS_F_DST8A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+#endif /* RTE_ARCH_X86_64 */
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 4/16 byte aligned destination/source addresses non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ *   Must be 4 byte aligned.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ *   Must be 16 byte aligned.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	int32_t             buffer[16] __rte_cache_aligned /* at least __rte_aligned(16) */;
+	register __m128i    xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 16));
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Copy large portion of data in chunks of 64 byte. */
+	while (len >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_store_si128((void *)&buffer[3 * 4], xmm3);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  0 * 4), buffer[0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  1 * 4), buffer[1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  2 * 4), buffer[2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  3 * 4), buffer[3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  4 * 4), buffer[4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  5 * 4), buffer[5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  6 * 4), buffer[6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  7 * 4), buffer[7]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  8 * 4), buffer[8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst,  9 * 4), buffer[9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 10 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 11 * 4), buffer[11]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 12 * 4), buffer[12]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 13 * 4), buffer[13]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 14 * 4), buffer[14]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 15 * 4), buffer[15]);
+		src = RTE_PTR_ADD(src, 64);
+		dst = RTE_PTR_ADD(dst, 64);
+		len -= 64;
+	}
+
+	/* Copy following 32 and 16 byte portions of data.
+	 *
+	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
+			(len & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
+		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 4 * 4), buffer[4]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 5 * 4), buffer[5]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 6 * 4), buffer[6]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 7 * 4), buffer[7]);
+		src = RTE_PTR_ADD(src, 32);
+		dst = RTE_PTR_ADD(dst, 32);
+	}
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
+			(len & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[8]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[9]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[10]);
+		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[11]);
+		src = RTE_PTR_ADD(src, 16);
+		dst = RTE_PTR_ADD(dst, 16);
+	}
+
+	/* Copy remaining data, 15 byte or less, via bounce buffer.
+	 *
+	 * Omitted if length is known to be 16 byte aligned.
+	 */
+	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
+		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
+				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
+				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ?
+				flags : RTE_MEMOPS_F_DST4A) |
+				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
+				flags : RTE_MEMOPS_F_SRC16A));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * 4 byte aligned addresses (non-temporal) memory copy.
+ * The memory areas must not overlap.
+ *
+ * @param dst
+ *   Pointer to the (non-temporal) destination memory area.
+ *   Must be 4 byte aligned if using non-temporal store.
+ * @param src
+ *   Pointer to the (non-temporal) source memory area.
+ *   Must be 4 byte aligned if using non-temporal load.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_d4s4a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+	const size_t    offset = (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A ?
+			0 : (uintptr_t)src & 15;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(rte_is_aligned(dst, 4));
+	RTE_ASSERT(rte_is_aligned(src, 4));
+
+	if (unlikely(len == 0))
+		return;
+
+	if (offset == 0) {
+		/* Source is 16 byte aligned. */
+		/* Copy everything, using upgraded source alignment flags. */
+		rte_memcpy_nt_d4s16a(dst, src, len,
+				(flags & ~RTE_MEMOPS_F_SRCA_MASK) | RTE_MEMOPS_F_SRC16A);
+	} else {
+		/* Source is not 16 byte aligned, so make it 16 byte aligned. */
+		int32_t             buffer[4] __rte_aligned(16);
+		const size_t        first = 16 - offset;
+		register __m128i    xmm0;
+
+		/* First, copy the first part of the data in chunks of 4 byte,
+		 * to achieve 16 byte alignment of the source.
+		 * This invalidates the source, destination and length alignment flags,
+		 * and may change whether the destination pointer is 16 byte aligned.
+		 */
+
+		/** Copy from 16 byte aligned source pointer (floor rounding). */
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
+		_mm_store_si128((void *)buffer, xmm0);
+
+		if (unlikely(len + offset <= 16)) {
+			/* Short length. */
+			if (((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ||
+					(len & 3) == 0) {
+				/* Length is 4 byte aligned. */
+				switch (len) {
+				case 1 * 4:
+					/* Offset can be 1 * 4, 2 * 4 or 3 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
+							buffer[offset / 4]);
+					break;
+				case 2 * 4:
+					/* Offset can be 1 * 4 or 2 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
+							buffer[offset / 4]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4),
+							buffer[offset / 4 + 1]);
+					break;
+				case 3 * 4:
+					/* Offset can only be 1 * 4. */
+					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+					_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+					break;
+				}
+			} else {
+				/* Length is not 4 byte aligned. */
+				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
+			}
+			return;
+		}
+
+		switch (first) {
+		case 1 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
+			break;
+		case 2 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
+			break;
+		case 3 * 4:
+			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
+			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
+			break;
+		}
+
+		src = RTE_PTR_ADD(src, first);
+		dst = RTE_PTR_ADD(dst, first);
+		len -= first;
+
+		/* Source pointer is now 16 byte aligned. */
+		RTE_ASSERT(rte_is_aligned(src, 16));
+
+		/* Then, copy the rest, using corrected alignment flags. */
+		if (rte_is_aligned(dst, 16))
+			rte_memcpy_nt_d16s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+#ifdef RTE_ARCH_X86_64
+		else if (rte_is_aligned(dst, 8))
+			rte_memcpy_nt_d8s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST8A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+#endif /* RTE_ARCH_X86_64 */
+		else
+			rte_memcpy_nt_d4s16a(dst, src, len, (flags &
+					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
+					RTE_MEMOPS_F_LENA_MASK)) |
+					RTE_MEMOPS_F_DST4A | RTE_MEMOPS_F_SRC16A |
+					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
+					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
+	}
+}
+
+#ifndef RTE_MEMCPY_NT_BUFSIZE
+
+#include <rte_mbuf_core.h>
+
+/** Bounce buffer size for non-temporal memcpy.
+ *
+ * Must be 2^N and >= 128.
+ * The actual buffer will be slightly larger, due to added padding.
+ * The default is chosen to be able to handle a non-segmented packet.
+ */
+#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM
+
+#endif  /* RTE_MEMCPY_NT_BUFSIZE */
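+
+/* For example (illustrative only), an application copying larger objects could
+ * override the default before including this header, as long as the value is
+ * a power of two and >= 128:
+ *
+ *   #define RTE_MEMCPY_NT_BUFSIZE 4096
+ *   #include <rte_memcpy.h>
+ */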
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Non-temporal memory copy via bounce buffer.
+ *
+ * @note
+ * If the destination and/or length is unaligned, the first and/or last copied
+ * bytes will be stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_buf(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+	/** Cache line aligned bounce buffer with preceding and trailing padding.
+	 *
+	 * The preceding padding is one cache line, so the data area itself
+	 * is cache line aligned.
+	 * The trailing padding is 16 bytes, leaving room for the trailing bytes
+	 * of a 16 byte store operation.
+	 */
+	char			buffer[RTE_CACHE_LINE_SIZE + RTE_MEMCPY_NT_BUFSIZE + 16]
+				__rte_cache_aligned;
+	/** Pointer to bounce buffer's aligned data area. */
+	char		* const buf0 = &buffer[RTE_CACHE_LINE_SIZE];
+	void		       *buf;
+	/** Number of bytes to copy from source, incl. any extra preceding bytes. */
+	size_t			srclen;
+	register __m128i	xmm0, xmm1, xmm2, xmm3;
+
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
+			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
+	RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);
+
+	if (unlikely(len == 0))
+		return;
+
+	/* Step 1:
+	 * Copy data from the source to the bounce buffer's aligned data area,
+	 * using aligned non-temporal load from the source,
+	 * and unaligned store in the bounce buffer.
+	 *
+	 * If the source is unaligned, the additional bytes preceding the data will be copied
+	 * to the padding area preceding the bounce buffer's aligned data area.
+	 * Similarly, if the source data ends at an unaligned address, the additional bytes
+	 * trailing the data will be copied to the padding area trailing the bounce buffer's
+	 * aligned data area.
+	 */
+
+	/* Adjust for extra preceding bytes, unless source is known to be 16 byte aligned. */
+	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) {
+		buf = buf0;
+		srclen = len;
+	} else {
+		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
+		const size_t offset = (uintptr_t)src & 15;
+
+		buf = RTE_PTR_SUB(buf0, offset);
+		src = RTE_PTR_SUB(src, offset);
+		srclen = len + offset;
+	}
+
+	/* Copy large portion of data from source to bounce buffer in chunks of 64 byte. */
+	while (srclen >= 64) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
+		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
+		src = RTE_PTR_ADD(src, 64);
+		buf = RTE_PTR_ADD(buf, 64);
+		srclen -= 64;
+	}
+
+	/* Copy remaining 32 and 16 byte portions of data from source to bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the length alignment
+	 * flags are still valid)
+	 * and length is known to be respectively 64 or 32 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
+			(srclen & 32)) {
+		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
+		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
+		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
+		src = RTE_PTR_ADD(src, 32);
+		buf = RTE_PTR_ADD(buf, 32);
+	}
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
+			(srclen & 16)) {
+		xmm2 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(buf, xmm2);
+		src = RTE_PTR_ADD(src, 16);
+		buf = RTE_PTR_ADD(buf, 16);
+	}
+	/* Copy any trailing bytes of data from source to bounce buffer.
+	 *
+	 * Omitted if source is known to be 16 byte aligned (so the length alignment
+	 * flags are still valid)
+	 * and length is known to be 16 byte aligned.
+	 */
+	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
+			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
+			(srclen & 15)) {
+		xmm3 = _mm_stream_load_si128_const(src);
+		_mm_storeu_si128(buf, xmm3);
+	}
+
+	/* Step 2:
+	 * Copy from the aligned bounce buffer to the non-temporal destination.
+	 */
+	rte_memcpy_ntd(dst, buf0, len,
+			(flags & ~(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRCA_MASK)) |
+			(RTE_CACHE_LINE_SIZE << RTE_MEMOPS_F_SRCA_SHIFT));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Non-temporal memory copy.
+ * The memory areas must not overlap.
+ *
+ * @note
+ * If the destination and/or length is unaligned, some copied bytes will be
+ * stored in the destination memory area using temporal access.
+ *
+ * @param dst
+ *   Pointer to the non-temporal destination memory area.
+ * @param src
+ *   Pointer to the non-temporal source memory area.
+ * @param len
+ *   Number of bytes to copy.
+ * @param flags
+ *   Hints for memory access.
+ */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_nt_generic(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+
+	while (len > RTE_MEMCPY_NT_BUFSIZE) {
+		rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE,
+				(flags & ~RTE_MEMOPS_F_LENA_MASK) | RTE_MEMOPS_F_LEN128A);
+		dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
+		src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
+		len -= RTE_MEMCPY_NT_BUFSIZE;
+	}
+	rte_memcpy_nt_buf(dst, src, len, flags);
+}
+
+/* Implementation. Refer to function declaration for documentation. */
+__rte_experimental
+static __rte_always_inline
+__attribute__((__nonnull__(1, 2)))
+#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
+__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
+#endif
+void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
+		const uint64_t flags)
+{
+#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
+	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
+#endif /* !RTE_TOOLCHAIN_CLANG */
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
+			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
+			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
+	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
+			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
+
+	if ((flags & (RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) ==
+			(RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) {
+		/* Copy between non-temporal source and destination. */
+		if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d16s16a(dst, src, len, flags);
+#ifdef RTE_ARCH_X86_64
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d8s16a(dst, src, len, flags);
+#endif /* RTE_ARCH_X86_64 */
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
+			rte_memcpy_nt_d4s16a(dst, src, len, flags);
+		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
+				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
+			rte_memcpy_nt_d4s4a(dst, src, len, flags);
+		else if (len <= RTE_MEMCPY_NT_BUFSIZE)
+			rte_memcpy_nt_buf(dst, src, len, flags);
+		else
+			rte_memcpy_nt_generic(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_SRC_NT) {
+		/* Copy from non-temporal source. */
+		rte_memcpy_nts(dst, src, len, flags);
+	} else if (flags & RTE_MEMOPS_F_DST_NT) {
+		/* Copy to non-temporal destination. */
+		rte_memcpy_ntd(dst, src, len, flags);
+	} else
+		rte_memcpy(dst, src, len);
+}
+
 #undef ALIGNMENT_MASK
 
 #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
index a2307cebe6..aa96fb4cc8 100644
--- a/lib/mbuf/rte_mbuf.c
+++ b/lib/mbuf/rte_mbuf.c
@@ -660,6 +660,83 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 	return mc;
 }
 
+/* Create a deep copy of mbuf, using non-temporal memory access */
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		 uint32_t off, uint32_t len, const uint64_t flags)
+{
+	const struct rte_mbuf *seg = m;
+	struct rte_mbuf *mc, *m_last, **prev;
+
+	/* garbage in check */
+	__rte_mbuf_sanity_check(m, 1);
+
+	/* check for request to copy at offset past end of mbuf */
+	if (unlikely(off >= m->pkt_len))
+		return NULL;
+
+	mc = rte_pktmbuf_alloc(mp);
+	if (unlikely(mc == NULL))
+		return NULL;
+
+	/* truncate requested length to available data */
+	if (len > m->pkt_len - off)
+		len = m->pkt_len - off;
+
+	__rte_pktmbuf_copy_hdr(mc, m);
+
+	/* copied mbuf is not indirect or external */
+	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
+
+	prev = &mc->next;
+	m_last = mc;
+	while (len > 0) {
+		uint32_t copy_len;
+
+		/* skip leading mbuf segments */
+		while (off >= seg->data_len) {
+			off -= seg->data_len;
+			seg = seg->next;
+		}
+
+		/* current buffer is full, chain a new one */
+		if (rte_pktmbuf_tailroom(m_last) == 0) {
+			m_last = rte_pktmbuf_alloc(mp);
+			if (unlikely(m_last == NULL)) {
+				rte_pktmbuf_free(mc);
+				return NULL;
+			}
+			++mc->nb_segs;
+			*prev = m_last;
+			prev = &m_last->next;
+		}
+
+		/*
+		 * copy the min of data in input segment (seg)
+		 * vs space available in output (m_last)
+		 */
+		copy_len = RTE_MIN(seg->data_len - off, len);
+		if (copy_len > rte_pktmbuf_tailroom(m_last))
+			copy_len = rte_pktmbuf_tailroom(m_last);
+
+		/* append from seg to m_last */
+		rte_memcpy_ex(rte_pktmbuf_mtod_offset(m_last, char *,
+						   m_last->data_len),
+			   rte_pktmbuf_mtod_offset(seg, char *, off),
+			   copy_len, flags);
+
+		/* update offsets and lengths */
+		m_last->data_len += copy_len;
+		mc->pkt_len += copy_len;
+		off += copy_len;
+		len -= copy_len;
+	}
+
+	/* garbage out check */
+	__rte_mbuf_sanity_check(mc, 1);
+	return mc;
+}
+
 /* dump a mbuf on console */
 void
 rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index b6e23d98ce..030df396a3 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1443,6 +1443,38 @@ struct rte_mbuf *
 rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
 		 uint32_t offset, uint32_t length);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Create a full copy of a given packet mbuf,
+ * using non-temporal memory access as specified by flags.
+ *
+ * Copies all the data from a given packet mbuf to a newly allocated
+ * set of mbufs. The private data is not copied.
+ *
+ * @param m
+ *   The packet mbuf to be copied.
+ * @param mp
+ *   The mempool from which the "clone" mbufs are allocated.
+ * @param offset
+ *   The number of bytes to skip before copying.
+ *   If the mbuf does not have that many bytes, it is an error
+ *   and NULL is returned.
+ * @param length
+ *   The upper limit on bytes to copy.  Passing UINT32_MAX
+ *   means all data (after offset).
+ * @param flags
+ *   Non-temporal memory access hints for rte_memcpy_ex.
+ * @return
+ *   - The pointer to the new "clone" mbuf on success.
+ *   - NULL if allocation fails.
+ */
+__rte_experimental
+struct rte_mbuf *
+rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
+		    uint32_t offset, uint32_t length, const uint64_t flags);
+
 /**
  * Adds given value to the refcnt of all packet mbuf segments.
  *
diff --git a/lib/mbuf/version.map b/lib/mbuf/version.map
index ed486ed14e..b583364ad4 100644
--- a/lib/mbuf/version.map
+++ b/lib/mbuf/version.map
@@ -47,5 +47,6 @@ EXPERIMENTAL {
 	global:
 
 	rte_pktmbuf_pool_create_extbuf;
+	rte_pktmbuf_copy_ex;
 
 };
diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
index af2b814251..ae871c4865 100644
--- a/lib/pcapng/rte_pcapng.c
+++ b/lib/pcapng/rte_pcapng.c
@@ -466,7 +466,8 @@ rte_pcapng_copy(uint16_t port_id, uint32_t queue,
 	orig_len = rte_pktmbuf_pkt_len(md);
 
 	/* Take snapshot of the data */
-	mc = rte_pktmbuf_copy(md, mp, 0, length);
+	mc = rte_pktmbuf_copy_ex(md, mp, 0, length,
+				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 	if (unlikely(mc == NULL))
 		return NULL;
 
diff --git a/lib/pdump/rte_pdump.c b/lib/pdump/rte_pdump.c
index 98dcbc037b..6e61c75407 100644
--- a/lib/pdump/rte_pdump.c
+++ b/lib/pdump/rte_pdump.c
@@ -124,7 +124,8 @@ pdump_copy(uint16_t port_id, uint16_t queue,
 					    pkts[i], mp, cbs->snaplen,
 					    ts, direction);
 		else
-			p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
+			p = rte_pktmbuf_copy_ex(pkts[i], mp, 0, cbs->snaplen,
+						RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
 
 		if (unlikely(p == NULL))
 			__atomic_fetch_add(&stats->nombuf, 1, __ATOMIC_RELAXED);
@@ -134,6 +135,9 @@ pdump_copy(uint16_t port_id, uint16_t queue,
 
 	__atomic_fetch_add(&stats->accepted, d_pkts, __ATOMIC_RELAXED);
 
+	/* Flush the non-temporal stores of the packet copies. */
+	rte_wmb();
+
 	ring_enq = rte_ring_enqueue_burst(ring, (void *)dup_bufs, d_pkts, NULL);
 	if (unlikely(ring_enq < d_pkts)) {
 		unsigned int drops = d_pkts - ring_enq;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH] eal: non-temporal memcpy
  2022-10-06 20:34 ` [PATCH] eal: " Morten Brørup
@ 2022-10-10  7:35   ` Morten Brørup
  2022-10-10  8:58     ` Mattias Rönnblom
  2022-10-11  9:25     ` Konstantin Ananyev
  0 siblings, 2 replies; 17+ messages in thread
From: Morten Brørup @ 2022-10-10  7:35 UTC (permalink / raw)
  To: hofors, konstantin.v.ananyev, Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, bruce.richardson, kda, drc, dev

Mattias, Konstantin, Honnappa, Stephen,

In my patch for non-temporal memcpy, I have been aiming for using as much non-temporal store as possible. E.g. copying 16 byte to a 16 byte aligned address will be done using non-temporal store instructions.

Now, I am seriously considering this alternative:

Only using non-temporal stores for complete cache lines, and using normal stores for partial cache lines.

I think it will make things simpler when an application mixes normal and non-temporal stores. E.g. an application writing metadata (a pcap header) followed by packet data.

The disadvantage is that copying a burst of 32 packets, will - in the worst case - pollute 64 cache lines (one at the start plus one at the end of the copied data), i.e. 4 KiB of data cache. If copying to a consecutive memory area, e.g. a packet capture buffer, it will pollute 33 cache lines (because the start of packet #2 is in the same cache line as the end of packet #1, etc.). 
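
To make the idea concrete, here is a rough, untested sketch of the hybrid copy I have in mind (x86 only, assuming 64 byte cache lines; the function name is illustrative and not part of the patch; needs <string.h>, <emmintrin.h>, <rte_common.h> and <rte_atomic.h>):

/* Untested sketch: NT stores for whole cache lines, normal stores for the rest. */
static void
copy_nt_whole_cachelines(void *dst, const void *src, size_t len)
{
	/* Head: normal stores up to the first cache line boundary of dst. */
	size_t head = RTE_MIN(len, (size_t)(-(uintptr_t)dst & 63));

	memcpy(dst, src, head);
	dst = RTE_PTR_ADD(dst, head);
	src = RTE_PTR_ADD(src, head);
	len -= head;

	/* Body: non-temporal stores of whole cache lines (dst is now 64 byte aligned). */
	while (len >= 64) {
		_mm_stream_si128(RTE_PTR_ADD(dst,  0), _mm_loadu_si128(RTE_PTR_ADD(src,  0)));
		_mm_stream_si128(RTE_PTR_ADD(dst, 16), _mm_loadu_si128(RTE_PTR_ADD(src, 16)));
		_mm_stream_si128(RTE_PTR_ADD(dst, 32), _mm_loadu_si128(RTE_PTR_ADD(src, 32)));
		_mm_stream_si128(RTE_PTR_ADD(dst, 48), _mm_loadu_si128(RTE_PTR_ADD(src, 48)));
		dst = RTE_PTR_ADD(dst, 64);
		src = RTE_PTR_ADD(src, 64);
		len -= 64;
	}

	/* Tail: normal stores for the trailing partial cache line. */
	memcpy(dst, src, len);

	rte_wmb(); /* or leave this to the caller, after a burst of copies */
}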

What do you think?


PS: Non-temporal loads are easy to work with, so don't worry about that.


Med venlig hilsen / Kind regards,
-Morten Brørup

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] eal: non-temporal memcpy
  2022-10-10  7:35   ` Morten Brørup
@ 2022-10-10  8:58     ` Mattias Rönnblom
  2022-10-10  9:36       ` Morten Brørup
  2022-10-10  9:57       ` Bruce Richardson
  2022-10-11  9:25     ` Konstantin Ananyev
  1 sibling, 2 replies; 17+ messages in thread
From: Mattias Rönnblom @ 2022-10-10  8:58 UTC (permalink / raw)
  To: Morten Brørup, konstantin.v.ananyev, Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, bruce.richardson, kda, drc, dev

On 2022-10-10 09:35, Morten Brørup wrote:
> Mattias, Konstantin, Honnappa, Stephen,
> 
> In my patch for non-temporal memcpy, I have been aiming for using as much non-temporal store as possible. E.g. copying 16 byte to a 16 byte aligned address will be done using non-temporal store instructions.
> 
> Now, I am seriously considering this alternative:
> 
> Only using non-temporal stores for complete cache lines, and using normal stores for partial cache lines.
> 

This is how I've done it in the past, in DPDK applications. That was
both to simplify (and potentially optimize) the code somewhat, and
because I had my doubts there was any actual benefit from using
non-temporal stores for the beginning or the end of the memory block.

That latter reason, however, was pure conjecture. I think it would be
great if Intel, ARM, AMD, IBM etc. DPDK developers could dig into the
manuals or go find the appropriate CPU expert, to find out if that is true.

More specifically, my question is:

A) Consider a scenario where a core does a regular store against some 
cache line, and then pretty much immediately does a non-temporal store 
against a different address in the same cache line. How will this cache 
line be treated?

B) Consider the same scenario, but where no regular stores preceded (or 
followed) the non-temporal store, and the non-temporal stores performed 
did not cover the entirety of the cache line.

Scenario A) would be common at the beginning of the copy, in case
there's a header preceding the data, and writing that header
non-temporally might be cumbersome. Scenario B) would be common at the end
of the copy. Both assume copies of memory blocks which are not
cache-line aligned.

> I think it will make things simpler when an application mixes normal and non-temporal stores. E.g. an application writing metadata (a pcap header) followed by packet data.
> 

The application *could* use NT stores for the pcap header as well.

I haven't reviewed v3 of your patch, but in some earlier patch you did 
not use the movnti instruction to make smaller (< 16 bytes) stores.


> The disadvantage is that copying a burst of 32 packets, will - in the worst case - pollute 64 cache lines (one at the start plus one at the end of the copied data), i.e. 4 KiB of data cache. If copying to a consecutive memory area, e.g. a packet capture buffer, it will pollute 33 cache lines (because the start of packet #2 is in the same cache line as the end of packet #1, etc.).
> 
> What do you think?
> 

For large copies, which I'm guessing is what non-temporal stores are
usually used for, this is hair splitting. For DPDK applications, it
might well be at least somewhat relevant, because such an application
may make an enormous number of copies, each roughly the size of a packet.

If we had a rte_memcpy_ex() that only cared about copying whole cache
lines in an NT manner, the application could add a clflushopt (or the
equivalent) after the copy, flushing the beginning and end cache
lines of the destination buffer.
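
E.g. something along these lines (untested sketch; assumes CLFLUSHOPT is available and the code is built with support for it, and uses the rte_memcpy_ex() from this patch as a stand-in; needs <immintrin.h>):

/* Untested sketch: NT copy, then flush the (partially written) first and
 * last destination cache lines so they don't linger in the cache.
 * Assumes len > 0.
 */
static void
copy_and_flush_ends(void *dst, const void *src, size_t len)
{
	rte_memcpy_ex(dst, src, len, RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);

	rte_wmb(); /* order the NT stores before the flushes */
	_mm_clflushopt(dst);                       /* line holding the first byte */
	_mm_clflushopt(RTE_PTR_ADD(dst, len - 1)); /* line holding the last byte */
}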

> 
> PS: Non-temporal loads are easy to work with, so don't worry about that.
> 
> 
> Med venlig hilsen / Kind regards,
> -Morten Brørup

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH] eal: non-temporal memcpy
  2022-10-10  8:58     ` Mattias Rönnblom
@ 2022-10-10  9:36       ` Morten Brørup
  2022-10-10 11:58         ` Stanislaw Kardach
  2022-10-10  9:57       ` Bruce Richardson
  1 sibling, 1 reply; 17+ messages in thread
From: Morten Brørup @ 2022-10-10  9:36 UTC (permalink / raw)
  To: Mattias Rönnblom, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, bruce.richardson, kda, drc, dev

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 10 October 2022 10.59
> 
> On 2022-10-10 09:35, Morten Brørup wrote:
> > Mattias, Konstantin, Honnappa, Stephen,
> >
> > In my patch for non-temporal memcpy, I have been aiming for using as
> much non-temporal store as possible. E.g. copying 16 byte to a 16 byte
> aligned address will be done using non-temporal store instructions.
> >
> > Now, I am seriously considering this alternative:
> >
> > Only using non-temporal stores for complete cache lines, and using
> normal stores for partial cache lines.
> >
> 
> This is how I've done it in the past, in DPDK applications. That was
> both to simplify (and potentially optimize) the code somewhat, and
> because I had my doubt there was any actual benefits from using
> non-temporal stores for the beginning or the end of the memory block.
> 
> That latter reason however, was pure conjecture. I think it would be
> great if Intel, ARM, AMD, IBM etc. DPDK developers could dig in the
> manuals or go find the appropriate CPU expert, to find out if that is
> true.
> 
> More specifically, my question is:
> 
> A) Consider a scenario where a core does a regular store against some
> cache line, and then pretty much immediately does a non-temporal store
> against a different address in the same cache line. How will this cache
> line be treated?
> 
> B) Consider the same scenario, but where no regular stores preceded (or
> followed) the non-temporal store, and the non-temporal stores performed
> did not cover the entirety of the cache line.
> 
> Scenario A) would be common in the beginning of the copy, in case
> there's a header preceding the data, and writing that header
> non-temporally might be cumbersome. Scenario B) would common at the end
> of the copy. Both assuming copies of memory blocks which are not
> cache-line aligned.
> 

Yeah, I wish some CPU expert from Intel/AMD and ARM would provide these functions instead of me. ;-)

> > I think it will make things simpler when an application mixes normal
> and non-temporal stores. E.g. an application writing metadata (a pcap
> header) followed by packet data.
> >
> 
> The application *could* use NT stores for the pcap header as well.

Our application does this. It also ensures 16 byte alignment for the stores. So our NT memcpy function is relatively simple.

However, I didn't think the DPDK community would accept a contribution with a requirement that the destination must be 16 byte aligned and the length must be a multiple of 16 bytes. So the patch needs to consider all weird alignments, and thus grew an order of magnitude larger than the NT memcpy function we have in our application. Much more work than anticipated. :-(

> 
> I haven't reviewed v3 of your patch, but in some earlier patch you did
> not use the movnti instruction to make smaller (< 16 bytes) stores.

I also use _mm_stream_si32() and _mm_stream_si64() now.

> 
> 
> > The disadvantage is that copying a burst of 32 packets, will - in the
> worst case - pollute 64 cache lines (one at the start plus one at the
> end of the copied data), i.e. 4 KiB of data cache. If copying to a
> consecutive memory area, e.g. a packet capture buffer, it will pollute
> 33 cache lines (because the start of packet #2 is in the same cache
> line as the end of packet #1, etc.).
> >
> > What do you think?
> >
> 
> For large copies, which I'm guessing is what non-temporal stores are
> usually used for, this is hair splitting. For DPDK applications, it
> might well be at least somewhat relevant, because such an application
> may make an enormous amount of copies, each roughly the size of a
> packet.
> 
> If we had a rte_memcpy_ex() that only cared about copying whole cache
> line in a NT manner, the application could add a clflushopt (or the
> equivalent) after the copy, flushing the the beginning and end cache
> line of the destination buffer.

That is a good idea.

Furthermore, POWER and RISC-V don't have NT stores, but if they have a cache line flush instruction, NT destination memcpy could be implemented for those architectures too - i.e. storing cache line sized blocks and flushing them from the cache, and letting the application flush the cache lines at the ends, if useful for the application.

> 
> >
> > PS: Non-temporal loads are easy to work with, so don't worry about
> that.
> >
> >
> > Med venlig hilsen / Kind regards,
> > -Morten Brørup

Thank you, Mattias, for sharing your thoughts.

Now, let's wait and see if anyone else on the list has further input. :-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] eal: non-temporal memcpy
  2022-10-10  8:58     ` Mattias Rönnblom
  2022-10-10  9:36       ` Morten Brørup
@ 2022-10-10  9:57       ` Bruce Richardson
  1 sibling, 0 replies; 17+ messages in thread
From: Bruce Richardson @ 2022-10-10  9:57 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Morten Brørup, konstantin.v.ananyev, Honnappa.Nagarahalli,
	stephen, mattias.ronnblom, kda, drc, dev

On Mon, Oct 10, 2022 at 10:58:57AM +0200, Mattias Rönnblom wrote:
> On 2022-10-10 09:35, Morten Brørup wrote:
> > Mattias, Konstantin, Honnappa, Stephen,
> > 
> > In my patch for non-temporal memcpy, I have been aiming for using as much non-temporal store as possible. E.g. copying 16 byte to a 16 byte aligned address will be done using non-temporal store instructions.
> > 
> > Now, I am seriously considering this alternative:
> > 
> > Only using non-temporal stores for complete cache lines, and using normal stores for partial cache lines.
> > 
> 
> This is how I've done it in the past, in DPDK applications. That was both to
> simplify (and potentially optimize) the code somewhat, and because I had my
> doubt there was any actual benefits from using non-temporal stores for the
> beginning or the end of the memory block.
> 
> That latter reason however, was pure conjecture. I think it would be great
> if Intel, ARM, AMD, IBM etc. DPDK developers could dig in the manuals or go
> find the appropriate CPU expert, to find out if that is true.
> 
> More specifically, my question is:
> 
> A) Consider a scenario where a core does a regular store against some cache
> line, and then pretty much immediately does a non-temporal store against a
> different address in the same cache line. How will this cache line be
> treated?
> 
> B) Consider the same scenario, but where no regular stores preceded (or
> followed) the non-temporal store, and the non-temporal stores performed did
> not cover the entirety of the cache line.
> 
The best reference I am aware of for this for Intel CPUs is section
10.4.6.2 in Vol 1 of the Software Developers Manual[1].

The bit relevant to your scenarios above is:

"If a program specifies a non-temporal store with one of these instructions
and the memory type of the destination region is write back (WB), write
through (WT), or write combining (WC), the processor will do the following:
• If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted.
• The non-temporal data is written to memory with WC semantics"

Hope this helps a little.

Regards,
/Bruce

[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf#G11.44032

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] eal: non-temporal memcpy
  2022-10-10  9:36       ` Morten Brørup
@ 2022-10-10 11:58         ` Stanislaw Kardach
  0 siblings, 0 replies; 17+ messages in thread
From: Stanislaw Kardach @ 2022-10-10 11:58 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen, mattias.ronnblom,
	bruce.richardson, drc, dev

On Mon, Oct 10, 2022 at 11:36:11AM +0200, Morten Brørup wrote:
<snip>
> > For large copies, which I'm guessing is what non-temporal stores are
> > usually used for, this is hair splitting. For DPDK applications, it
> > might well be at least somewhat relevant, because such an application
> > may make an enormous amount of copies, each roughly the size of a
> > packet.
> > 
> > If we had a rte_memcpy_ex() that only cared about copying whole cache
> > line in a NT manner, the application could add a clflushopt (or the
> > equivalent) after the copy, flushing the the beginning and end cache
> > line of the destination buffer.
> 
> That is a good idea.
> 
> Furthermore, POWER and RISC-V don't have NT store, but if they have a cache line flush instruction, NT destination memcpy could be implemented for those architectures too - i.e. storing cache line sized blocks and flushing the cache, and letting the application flush the cache lines at the ends, if useful for the application.

On RISC-V all stores are from a register (scalar or vector) to a memory
location. So is the reasoning behind flushing the cache line to free it
up for other data?

Other than that there is a ratified RISC-V extension for cache
management operations (including flush) - Zicbom.
NT load/store hints are being worked on right now.

-- 
Best Regards,
Stanislaw Kardach

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH] eal: non-temporal memcpy
  2022-10-10  7:35   ` Morten Brørup
  2022-10-10  8:58     ` Mattias Rönnblom
@ 2022-10-11  9:25     ` Konstantin Ananyev
  1 sibling, 0 replies; 17+ messages in thread
From: Konstantin Ananyev @ 2022-10-11  9:25 UTC (permalink / raw)
  To: Morten Brørup, hofors, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, bruce.richardson, kda, drc, dev


Hi Morten,
 
> Mattias, Konstantin, Honnappa, Stephen,
> 
> In my patch for non-temporal memcpy, I have been aiming for using as much non-temporal store as possible. E.g. copying 16 byte to a
> 16 byte aligned address will be done using non-temporal store instructions.
> 
> Now, I am seriously considering this alternative:
> 
> Only using non-temporal stores for complete cache lines, and using normal stores for partial cache lines.
> 
> I think it will make things simpler when an application mixes normal and non-temporal stores. E.g. an application writing metadata (a
> pcap header) followed by packet data.

Sounds like a reasonable idea to me.

> 
> The disadvantage is that copying a burst of 32 packets, will - in the worst case - pollute 64 cache lines (one at the start plus one at the
> end of the copied data), i.e. 4 KiB of data cache. If copying to a consecutive memory area, e.g. a packet capture buffer, it will pollute 33
> cache lines (because the start of packet #2 is in the same cache line as the end of packet #1, etc.).
> 
> What do you think?

My guess is that for modern high-end x86 CPUs the difference would be negligible.
Though again, right now it is just my guess, and I don't have a clue what the impact (if any) will be on other platforms.
If we really want to avoid any doubts, then probably the best thing is to have some sort of micro-bench in our UT that would simulate
some memory(/cache) bound workload plus normal or NT copies.
As a very rough thought:
Allocate some big enough memory buffer (size=X) that for sure wouldn't fit into the CPU caches.
Then in a loop, for each iteration:
    - do N random normal reads/writes from/to that buffer to simulate some memory bound workload
      (so each iteration causes some (more or less) constant % of cache-misses).
    - invoke our memcpy_ex(size=Y) in question K(=32 as DPDK magic number?) times for different memory locations.
Measure the number of cycles it takes for some big number of iterations.
That would probably show us a difference (if any)
between memcpy() vs memcpy_ex(), or between different implementations of memcpy_ex(),
in terms of cache-line saving, etc.
It will probably also show from what size Y onward it is worth using NT instead of normal copies for such workloads.
By varying the X, N, Y, K parameters we can test different scenarios on different platforms.
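
Just to illustrate, a rough, untested sketch of such a loop (X, N, Y, K as above; names are placeholders; needs <rte_cycles.h>, <rte_random.h> and the rte_memcpy_ex() from this patch):

/* Assumes X is much larger than the LLC and Y < X / 2. */
static uint64_t
bench_nt_copy(uint8_t *buf, size_t X, unsigned int N, size_t Y, unsigned int K,
		unsigned int iterations)
{
	const uint64_t start = rte_rdtsc();
	unsigned int it, n, k;

	for (it = 0; it < iterations; it++) {
		/* Memory bound background workload: N random read-modify-writes. */
		for (n = 0; n < N; n++) {
			const size_t i = rte_rand() % X;
			buf[i]++;
		}
		/* K copies of size Y; lower half of buf as source, upper half as destination. */
		for (k = 0; k < K; k++) {
			const size_t s = rte_rand() % (X / 2 - Y);
			const size_t d = X / 2 + rte_rand() % (X / 2 - Y);

			/* For the baseline, replace this call with rte_memcpy()/memcpy(). */
			rte_memcpy_ex(&buf[d], &buf[s], Y,
					RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
		}
	}
	return rte_rdtsc() - start;
}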

> 
> PS: Non-temporal loads are easy to work with, so don't worry about that.
> 
> 
> Med venlig hilsen / Kind regards,
> -Morten Brørup

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] eal: non-temporal memcpy
  2022-10-10  6:46 ` [PATCH v4] " Morten Brørup
@ 2022-10-16 14:27   ` Mattias Rönnblom
  2022-10-16 19:55   ` Mattias Rönnblom
  2023-07-31 12:14   ` Thomas Monjalon
  2 siblings, 0 replies; 17+ messages in thread
From: Mattias Rönnblom @ 2022-10-16 14:27 UTC (permalink / raw)
  To: Morten Brørup, bruce.richardson, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, kda, drc, dev

On 2022-10-10 08:46, Morten Brørup wrote:
> This patch provides a function for memory copy using non-temporal store,
> load or both, controlled by flags passed to the function.
> 
> Applications sometimes copy data to another memory location, which is only
> used much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
> 
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
> 
> The purpose of the function is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput can be improved by further optimization, I do not
> have time to do it now.
> 

The above section is a little repetitive, and only indirectly explains 
what NT loads/stores are.

"This patch provides a new function rte_memcpy_ex() for copying data 
between non-overlapping memory regions. The primary aim of 
rte_memcpy_ex() is to provide a rte_memcpy() (and memcpy()) plug-in 
replacement, where the user may opt for loads and/or stores with 
non-temporal hints to be used while copying the data.

By using a non-temporal hint, the program informs the system that it
does not intend to access the data again any time soon.

This in turn allows the CPU to bypass the CPU caches, or by other means
avoid having this unlikely-to-be-used-soon data evict, or force the
future eviction of, more useful cache lines."

You should also say something about the memory ordering issue.

> The functional tests and performance tests for memory copy have been
> expanded to include non-temporal copying.
> 
> A non-temporal version of the mbuf library's function to create a full
> copy of a given packet mbuf is provided.
> 
> The packet capture and packet dump libraries have been updated to use
> non-temporal memory copy of the packets.
>  > Implementation notes:
> 
> Implementations for non-x86 architectures can be provided by anyone at a
> later time. I am not going to do it.
> 
> x86 non-temporal load instructions must be 16 byte aligned [1], and
> non-temporal store instructions must be 4, 8 or 16 byte aligned [2].
> 
> ARM non-temporal load and store instructions seem to require 4 byte
> alignment [3].
> 

Would this patch be better off as a series? And maybe leave some of this
information for a cover letter?

> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_load
> [2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_si
> [3] https://developer.arm.com/documentation/100076/0100/
> A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
> LDNP--SIMD-and-FP-
> 
> This patch is a major rewrite from the RFC v3, so no version log comparing
> to the RFC is provided.
> 
> v4
> * Also ignore the warning for clang int the workaround for
>    _mm_stream_load_si128() missing const in the parameter.
> * Add missing C linkage specifier in rte_memcpy.h.
> 
> v3
> * _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
>    use it on 64-bit x86 architecture.
> * CLANG warns that _mm_stream_load_si128_const() and
>    rte_memcpy_nt_15_or_less_s16a() are not public,
>    so remove __rte_internal from them. It also affects the documentation
>    for the functions, so the fix can't be limited to CLANG.
> * Use __rte_experimental instead of __rte_internal.
> * Replace <n> with nnn in function documentation; it doesn't look like
>    HTML.
> * Slightly modify the workaround for _mm_stream_load_si128() missing const
>    in the parameter; the ancient GCC 4.5.8 in RHEL7 doesn't understand
>    #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
>    #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
> * Fixed one coding style issue missed in v2.
> 
> v2
> * The last 16 byte block of data, incl. any trailing bytes, were not
>    copied from the source memory area in rte_memcpy_nt_buf().
> * Fix many coding style issues.
> * Add some missing header files.
> * Fix build time warning for non-x86 architectures by using a different
>    method to mark the flags parameter unused.
> * CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
>    so omit it when using CLANG.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>   app/test/test_memcpy.c               |   65 +-
>   app/test/test_memcpy_perf.c          |  187 ++--
>   lib/eal/include/generic/rte_memcpy.h |  127 +++
>   lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
>   lib/mbuf/rte_mbuf.c                  |   77 ++
>   lib/mbuf/rte_mbuf.h                  |   32 +
>   lib/mbuf/version.map                 |    1 +
>   lib/pcapng/rte_pcapng.c              |    3 +-
>   lib/pdump/rte_pdump.c                |    6 +-
>   9 files changed, 1645 insertions(+), 91 deletions(-)
> 
> diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
> index 1ab86f4967..12410ce413 100644
> --- a/app/test/test_memcpy.c
> +++ b/app/test/test_memcpy.c
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #include <stdint.h>
> @@ -36,6 +37,19 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
>   /* Data is aligned on this many bytes (power of 2) */
>   #define ALIGNMENT_UNIT          32
>   
> +const uint64_t nt_mode_flags[4] = {

Delete "4".

> +	0,
> +	RTE_MEMOPS_F_SRC_NT,
> +	RTE_MEMOPS_F_DST_NT,
> +	RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT
> +};
> +const char * const nt_mode_str[4] = {

Delete "4".

> +	"none",
> +	"src",
> +	"dst",
> +	"src+dst"
> +};
> +
>   
>   /*
>    * Create two buffers, and initialise one with random values. These are copied
> @@ -44,12 +58,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
>    * changed.
>    */
>   static int
> -test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
> +test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size, unsigned int nt_mode)
>   {
>   	unsigned int i;
>   	uint8_t dest[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
>   	uint8_t src[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
>   	void * ret;
> +	const uint64_t flags = nt_mode_flags[nt_mode];
>   
>   	/* Setup buffers */
>   	for (i = 0; i < SMALL_BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
> @@ -58,18 +73,23 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	}
>   
>   	/* Do the copy */
> -	ret = rte_memcpy(dest + off_dst, src + off_src, size);
> -	if (ret != (dest + off_dst)) {
> -		printf("rte_memcpy() returned %p, not %p\n",
> -		       ret, dest + off_dst);
> +	if (nt_mode) {
> +		rte_memcpy_ex(dest + off_dst, src + off_src, size, flags);
> +	} else {
> +		ret = rte_memcpy(dest + off_dst, src + off_src, size);
> +		if (ret != (dest + off_dst)) {
> +			printf("rte_memcpy() returned %p, not %p\n",
> +			       ret, dest + off_dst);
> +		}
>   	}
>   
>   	/* Check nothing before offset is affected */
>   	for (i = 0; i < off_dst; i++) {
>   		if (dest[i] != 0) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
>   			       "[modified before start of dst].\n",
> -			       (unsigned)size, off_src, off_dst);
> +			       nt_mode ? "_ex" : "",

Introduce nt_mode_name() helper, which returns a string.
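
I.e. something like this (untested sketch, guessing at the intended return value):

static const char *
nt_mode_name(unsigned int nt_mode)
{
	return nt_mode_str[nt_mode];
}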

> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
>   			return -1;
>   		}
>   	}
> @@ -77,9 +97,11 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	/* Check everything was copied */
>   	for (i = 0; i < size; i++) {
>   		if (dest[i + off_dst] != src[i + off_src]) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> -			       "[didn't copy byte %u].\n",
> -			       (unsigned)size, off_src, off_dst, i);
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
> +			       "[didn't copy byte %u: 0x%02x!=0x%02x].\n",
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode], i,
> +			       dest[i + off_dst], src[i + off_src]);
>   			return -1;
>   		}
>   	}
> @@ -87,9 +109,10 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	/* Check nothing after copy was affected */
>   	for (i = size; i < SMALL_BUFFER_SIZE; i++) {
>   		if (dest[i + off_dst] != 0) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
>   			       "[copied too many].\n",
> -			       (unsigned)size, off_src, off_dst);
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);

For the size_t argument, use the 'z' length modifier, instead of a cast.
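
E.g. pass the size_t argument directly:

	printf("... failed for %zu bytes ...", size);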

>   			return -1;
>   		}
>   	}
> @@ -102,16 +125,18 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   static int
>   func_test(void)
>   {
> -	unsigned int off_src, off_dst, i;
> +	unsigned int off_src, off_dst, i, nt_mode;
>   	int ret;
>   
> -	for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
> -		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
> -			for (i = 0; i < RTE_DIM(buf_sizes); i++) {
> -				ret = test_single_memcpy(off_src, off_dst,
> -				                         buf_sizes[i]);
> -				if (ret != 0)
> -					return -1;
> +	for (nt_mode = 0; nt_mode < 4; nt_mode++) {
> +		for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
> +			for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
> +				for (i = 0; i < RTE_DIM(buf_sizes); i++) {
> +					ret = test_single_memcpy(off_src, off_dst,
> +								 buf_sizes[i], nt_mode);
> +					if (ret != 0)
> +						return -1;
> +				}
>   			}
>   		}
>   	}
> diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
> index 3727c160e6..6bb52cba88 100644
> --- a/app/test/test_memcpy_perf.c
> +++ b/app/test/test_memcpy_perf.c
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #include <stdint.h>
> @@ -15,6 +16,7 @@
>   #include <rte_malloc.h>
>   
>   #include <rte_memcpy.h>
> +#include <rte_atomic.h>
>   
>   #include "test.h"
>   
> @@ -27,9 +29,9 @@
>   /* List of buffer sizes to test */
>   #if TEST_VALUE_RANGE == 0
>   static size_t buf_sizes[] = {
> -	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> -	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
> -	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
> +	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 40, 48, 60, 63, 64, 65, 80, 92, 124,
> +	127, 128, 129, 140, 152, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447,
> +	448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
>   	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
>   };
>   /* MUST be as large as largest packet size above */
> @@ -72,7 +74,7 @@ static uint8_t *small_buf_read, *small_buf_write;
>   static int
>   init_buffers(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   
>   	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
>   	if (large_buf_read == NULL)
> @@ -151,7 +153,7 @@ static void
>   do_uncached_write(uint8_t *dst, int is_dst_cached,
>   				  const uint8_t *src, int is_src_cached, size_t size)
>   {
> -	unsigned i, j;
> +	unsigned int i, j;
>   	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
>   
>   	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
> @@ -167,66 +169,112 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
>    * Run a single memcpy performance test. This is a macro to ensure that if
>    * the "size" parameter is a constant it won't be converted to a variable.
>    */
> -#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
> -                         src, is_src_cached, src_uoffset, size)             \
> -do {                                                                        \
> -    unsigned int iter, t;                                                   \
> -    size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
> -    uint64_t start_time, total_time = 0;                                    \
> -    uint64_t total_time2 = 0;                                               \
> -    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
> -        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
> -                         src_addrs, is_src_cached, src_uoffset);            \
> -        start_time = rte_rdtsc();                                           \
> -        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
> -            rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
> -        total_time += rte_rdtsc() - start_time;                             \
> -    }                                                                       \
> -    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
> -        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
> -                         src_addrs, is_src_cached, src_uoffset);            \
> -        start_time = rte_rdtsc();                                           \
> -        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
> -            memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
> -        total_time2 += rte_rdtsc() - start_time;                            \
> -    }                                                                       \
> -    printf("%3.0f -", (double)total_time  / TEST_ITERATIONS);                 \
> -    printf("%3.0f",   (double)total_time2 / TEST_ITERATIONS);                 \
> -    printf("(%6.2f%%) ", ((double)total_time - total_time2)*100/total_time2); \
> +#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,					  \
> +			 src, is_src_cached, src_uoffset, size)					  \
> +do {												  \
> +	unsigned int iter, t;									  \
> +	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];				  \
> +	uint64_t start_time;									  \
> +	uint64_t total_time_rte = 0, total_time_std = 0;					  \
> +	uint64_t total_time_ntd = 0, total_time_nts = 0, total_time_nt = 0;			  \
> +	const uint64_t flags = ((dst_uoffset == 0) ?						  \
> +				(ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT) : 0) |		  \
> +			       ((src_uoffset == 0) ?						  \
> +				(ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) : 0);		  \
> +	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
> +		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
> +				 src_addrs, is_src_cached, src_uoffset);			  \
> +		start_time = rte_rdtsc();							  \
> +		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
> +			rte_memcpy(dst + dst_addrs[t], src + src_addrs[t], size);		  \
> +		total_time_rte += rte_rdtsc() - start_time;					  \
> +	}											  \
> +	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
> +		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
> +				 src_addrs, is_src_cached, src_uoffset);			  \
> +		start_time = rte_rdtsc();							  \
> +		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
> +			memcpy(dst + dst_addrs[t], src + src_addrs[t], size);			  \
> +		total_time_std += rte_rdtsc() - start_time;					  \
> +	}											  \
> +	if (!(is_dst_cached && is_src_cached)) {						  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_DST_NT);			  \
> +			total_time_ntd += rte_rdtsc() - start_time;				  \
> +		}										  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_SRC_NT);			  \
> +			total_time_nts += rte_rdtsc() - start_time;				  \
> +		}										  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT); \
> +			total_time_nt += rte_rdtsc() - start_time;				  \
> +		}										  \
> +	}											  \
> +	printf(" %4.0f-", (double)total_time_rte / TEST_ITERATIONS);				  \
> +	printf("%4.0f",   (double)total_time_std / TEST_ITERATIONS);				  \
> +	printf("(%+4.0f%%)", ((double)total_time_rte - total_time_std) * 100 / total_time_std);   \
> +	if (!(is_dst_cached && is_src_cached)) {						  \
> +		printf(" %4.0f", (double)total_time_ntd / TEST_ITERATIONS);			  \
> +		printf(" %4.0f", (double)total_time_nts / TEST_ITERATIONS);			  \
> +		printf(" %4.0f", (double)total_time_nt / TEST_ITERATIONS);			  \
> +		if (total_time_nt / total_time_std > 9)						  \
> +			printf("(*%4.1f)", (double)total_time_nt / total_time_std);		  \
> +		else										  \
> +			printf("(%+4.0f%%)",							  \
> +			       ((double)total_time_nt - total_time_std) * 100 / total_time_std);  \
> +	}											  \
>   } while (0)
>   
>   /* Run aligned memcpy tests for each cached/uncached permutation */
> -#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
> -do {                                                                     \
> -    if (__builtin_constant_p(n))                                         \
> -        printf("\nC%6u", (unsigned)n);                                   \
> -    else                                                                 \
> -        printf("\n%7u", (unsigned)n);                                    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
> +#define ALL_PERF_TESTS_FOR_SIZE(n)						\
> +do {										\
> +	if (__builtin_constant_p(n))						\
> +		printf("\nC%6u", (unsigned int)n);				\
> +	else									\
> +		printf("\n%7u", (unsigned int)n);				\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);	\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);	\
>   } while (0)
>   
>   /* Run unaligned memcpy tests for each cached/uncached permutation */
> -#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
> -do {                                                                     \
> -    if (__builtin_constant_p(n))                                         \
> -        printf("\nC%6u", (unsigned)n);                                   \
> -    else                                                                 \
> -        printf("\n%7u", (unsigned)n);                                    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
> +#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)					\
> +do {										\
> +	if (__builtin_constant_p(n))						\
> +		printf("\nC%6u", (unsigned int)n);				\
> +	else									\
> +		printf("\n%7u", (unsigned int)n);				\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);	\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);	\
>   } while (0)
>   
>   /* Run memcpy tests for constant length */
> -#define ALL_PERF_TEST_FOR_CONSTANT                                      \
> -do {                                                                    \
> -    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
> -    TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
> -    TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
> +#define ALL_PERF_TEST_FOR_CONSTANT						\
> +do {										\
> +	TEST_CONSTANT(4U); TEST_CONSTANT(6U); TEST_CONSTANT(8U);		\
> +	TEST_CONSTANT(16U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);		\
> +	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);		\
> +	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);	\
> +	TEST_CONSTANT(2048U);							\
>   } while (0)
>   
>   /* Run all memcpy tests for aligned constant cases */
> @@ -251,7 +299,7 @@ perf_test_constant_unaligned(void)
>   static inline void
>   perf_test_variable_aligned(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
>   		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
>   	}
> @@ -261,7 +309,7 @@ perf_test_variable_aligned(void)
>   static inline void
>   perf_test_variable_unaligned(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
>   		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
>   	}
> @@ -282,7 +330,7 @@ perf_test(void)
>   
>   #if TEST_VALUE_RANGE != 0
>   	/* Set up buf_sizes array, if required */
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < TEST_VALUE_RANGE; i++)
>   		buf_sizes[i] = i;
>   #endif
> @@ -290,13 +338,14 @@ perf_test(void)
>   	/* See function comment */
>   	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
>   
> -	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
> -		   "======= ================= ================= ================= =================\n"
> -		   "   Size   Cache to cache     Cache to mem      Mem to cache        Mem to mem\n"
> -		   "(bytes)          (ticks)          (ticks)           (ticks)           (ticks)\n"
> -		   "------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n** rte_memcpy(RTE)/memcpy(STD)/rte_memcpy_ex(NTD/NTS/NT) - memcpy perf. tests (C = compile-time constant) **\n"
> +		   "======= ================ ====================================== ====================================== ======================================\n"
> +		   "   Size  Cache to cache               Cache to mem                           Mem to cache                            Mem to mem\n"
> +		   "(bytes)         (ticks)                    (ticks)                                (ticks)                               (ticks)\n"
> +		   "         RTE- STD(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)\n"
> +		   "------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   
> -	printf("\n================================= %2dB aligned =================================",
> +	printf("\n================================================================ %2dB aligned ===============================================================",
>   		ALIGNMENT_UNIT);
>   	/* Do aligned tests where size is a variable */
>   	timespec_get(&tv_begin, TIME_UTC);
> @@ -304,28 +353,28 @@ perf_test(void)
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_aligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   	/* Do aligned tests where size is a compile-time constant */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_constant_aligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_aligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n================================== Unaligned ==================================");
> +	printf("\n================================================================= Unaligned =================================================================");
>   	/* Do unaligned tests where size is a variable */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_variable_unaligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_unaligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   	/* Do unaligned tests where size is a compile-time constant */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_constant_unaligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_unaligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n======= ================= ================= ================= =================\n\n");
> +	printf("\n======= ================ ====================================== ====================================== ======================================\n\n");
>   
>   	printf("Test Execution Time (seconds):\n");
>   	printf("Aligned variable copy size   = %8.3f\n", time_aligned);
> diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
> index e7f0f8eaa9..b087f09c35 100644
> --- a/lib/eal/include/generic/rte_memcpy.h
> +++ b/lib/eal/include/generic/rte_memcpy.h
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #ifndef _RTE_MEMCPY_H_
> @@ -11,6 +12,13 @@
>    * Functions for vectorised implementation of memcpy().
>    */
>   
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
>   /**
>    * Copy 16 bytes from one location to another using optimised
>    * instructions. The locations should not overlap.
> @@ -113,4 +121,123 @@ rte_memcpy(void *dst, const void *src, size_t n);
>   
>   #endif /* __DOXYGEN__ */
>   
> +/*
> + * Advanced/Non-Temporal Memory Operations Flags.
> + */
> +
> +/** Length alignment hint mask. */
> +#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
> +/** Length alignment hint shift. */
> +#define RTE_MEMOPS_F_LENA_SHIFT 0
> +/** Hint: Length is 2 byte aligned. */
> +#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
> +/** Hint: Length is 4 byte aligned. */
> +#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
> +/** Hint: Length is 8 byte aligned. */
> +#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
> +/** Hint: Length is 16 byte aligned. */
> +#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
> +/** Hint: Length is 32 byte aligned. */
> +#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
> +/** Hint: Length is 64 byte aligned. */
> +#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
> +/** Hint: Length is 128 byte aligned. */
> +#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)
> +
> +/** Prefer non-temporal access to source memory area.
> + */
> +#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
> +/** Source address alignment hint mask. */
> +#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
> +/** Source address alignment hint shift. */
> +#define RTE_MEMOPS_F_SRCA_SHIFT 8
> +/** Hint: Source address is 2 byte aligned. */
> +#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
> +/** Hint: Source address is 4 byte aligned. */
> +#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
> +/** Hint: Source address is 8 byte aligned. */
> +#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
> +/** Hint: Source address is 16 byte aligned. */
> +#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
> +/** Hint: Source address is 32 byte aligned. */
> +#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
> +/** Hint: Source address is 64 byte aligned. */
> +#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
> +/** Hint: Source address is 128 byte aligned. */
> +#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)
> +
> +/** Prefer non-temporal access to destination memory area.
> + *
> + * On x86 architecture:
> + * Remember to call rte_wmb() after a sequence of copy operations.
> + */

NT memcpy should have memcpy() semantics by default, and there should be
a flag if you don't want an sfence after any NT stores, or an lfence before
any NT loads, on x86. That is, assuming the x86 memcpy_ex w/ NT hints
will always be using NT stores, as opposed to regular stores + clflushopt.
For the latter case, or in x86 cases where the NT store variants aren't
supported, the fencing isn't needed, even on x86.

I don't know what the "ignore ordering" flag should be called.

RTE_MEMOPS_F_NO_MB
RTE_MEMOPS_F_UNORDERED
RTE_MEMOPS_F_NO_WMB
RTE_MEMOPS_F_NO_RMB

For those that use this "ignore ordering" flag (or for anyone using the 
API this patch proposes), there will be a need to insert a barrier at 
some point, unless the application is completely serial. It should be 
possible to do this in a portable manner. No #ifdef x86.

One way to attack this is to have two new functions rte_nt_wmb() and
rte_nt_rmb() (or maybe rte_memcpy_nt_w|rmb()), which call sfence/lfence
(or whatever is needed on that architecture), to order the NT loads
and/or NT stores with loads/stores in the default memory consistency
model.
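
For illustration only, a minimal sketch of what such portable helpers
could look like (the names, and the assumption that sfence/lfence is
what x86 needs here, are mine, not part of this patch):

/* Hypothetical portable barriers for NT copy sequences. */
static inline void rte_memcpy_nt_wmb(void)
{
#ifdef RTE_ARCH_X86
	_mm_sfence(); /* drain pending NT stores before later stores */
#else
	rte_wmb();    /* assumed sufficient on other architectures */
#endif
}

static inline void rte_memcpy_nt_rmb(void)
{
#ifdef RTE_ARCH_X86
	_mm_lfence(); /* order earlier NT loads before later loads */
#else
	rte_rmb();
#endif
}

That way the #ifdef lives in EAL, and the application just calls the
barrier once after its sequence of NT copies.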

> +#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
> +/** Destination address alignment hint mask. */
> +#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
> +/** Destination address alignment hint shift. */
> +#define RTE_MEMOPS_F_DSTA_SHIFT 16
> +/** Hint: Destination address is 2 byte aligned. */
> +#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
> +/** Hint: Destination address is 4 byte aligned. */
> +#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
> +/** Hint: Destination address is 8 byte aligned. */
> +#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
> +/** Hint: Destination address is 16 byte aligned. */
> +#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
> +/** Hint: Destination address is 32 byte aligned. */
> +#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
> +/** Hint: Destination address is 64 byte aligned. */
> +#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
> +/** Hint: Destination address is 128 byte aligned. */
> +#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Advanced/non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the destination memory area.
> + * @param src
> + *   Pointer to the source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)nnnA flags.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags);
> +
> +#ifndef RTE_MEMCPY_EX_ARCH_DEFINED
> +
> +/* Fallback implementation, if no arch-specific implementation is provided. */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

I like the rte_memcpy_ex() name, in particular that it doesn't say 
anything about NT.

Is there a point in having flags declared const?

> +{
> +	RTE_SET_USED(flags);
> +	memcpy(dst, src, len);

Fall back to rte_memcpy().

> +}
> +
> +#endif /* RTE_MEMCPY_EX_ARCH_DEFINED */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
>   #endif /* _RTE_MEMCPY_H_ */
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index d4d7a5cfc8..31d0faf7a8 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #ifndef _RTE_MEMCPY_X86_64_H_
> @@ -17,6 +18,10 @@
>   #include <rte_vect.h>
>   #include <rte_common.h>
>   #include <rte_config.h>
> +#include <rte_debug.h>
> +
> +#define RTE_MEMCPY_EX_ARCH_DEFINED
> +#include "generic/rte_memcpy.h"
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -868,6 +873,1239 @@ rte_memcpy(void *dst, const void *src, size_t n)
>   		return rte_memcpy_generic(dst, src, n);
>   }
>   
> +/*
> + * Advanced/Non-Temporal Memory Operations.
> + */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Workaround for _mm_stream_load_si128() missing const in the parameter.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__m128i _mm_stream_load_si128_const(const __m128i *const mem_addr)

I'm not sure it's wise to use the _mm namespace for this wrapper. The
toolchains could eventually fix this issue, and their fix could land on
exactly the name and signature you've chosen here.

__rte_mm_stream_load_si128()?

> +{
> +	/* GCC 4.5.8 (in RHEL7) doesn't support the #pragma to ignore "-Wdiscarded-qualifiers".
> +	 * So we explicitly type cast mem_addr and use the #pragma to ignore "-Wcast-qual".
> +	 */
> +#if defined(RTE_TOOLCHAIN_GCC)
> +#pragma GCC diagnostic push
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +#elif defined(RTE_TOOLCHAIN_CLANG)
> +#pragma clang diagnostic push
> +#pragma clang diagnostic ignored "-Wcast-qual"
> +#endif
> +	return _mm_stream_load_si128((__m128i *)mem_addr);
> +#if defined(RTE_TOOLCHAIN_GCC)
> +#pragma GCC diagnostic pop
> +#elif defined(RTE_TOOLCHAIN_CLANG)
> +#pragma clang diagnostic pop
> +#endif
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Memory copy from non-temporal source area.
> + *
> + * @note
> + * Performance is optimal when source pointer is 16 byte aligned.
> + *
> + * @param dst
> + *   Pointer to the destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(LEN|SRC)nnnA flags.
> + *   The RTE_MEMOPS_F_SRC_NT flag must be set.
> + *   The RTE_MEMOPS_F_DST_NT flag must be clear.
> + *   The RTE_MEMOPS_F_DSTnnnA flags are ignored.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nts(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

Why not have rte_memcpy_ex() as the single addition to the public API?
You could still have __rte_-prefixed helpers, but they would not be
intended for direct use by applications. That would simplify things
from a documentation/user comprehension point of view, I think.

> +{
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;

Declare the xmm<N> variables in the scope where they are used (and only
those that are actually used).

Aren't you supposed to have a single space between the type and the
name in DPDK? I may be mistaken.

> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_SRC_NT);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* If source is not 16 byte aligned, then copy first part of data via bounce buffer,
> +	 * to achieve 16 byte alignment of source pointer.
> +	 * This invalidates the source, destination and length alignment flags, and
> +	 * potentially makes the destination pointer unaligned.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) {

I think it's worth giving this expression a name, especially since it's
repeatedly used.

const bool src_atleast_16a = (flags & RTE_MEMOPS_F_SRCA_MASK) >=
RTE_MEMOPS_F_SRC16A;

An alternative would be to have a macro RTE_MEMOPS_ATLEAST_SRC16A(flags).
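
E.g. (just a sketch, reusing the flag names from this patch):

#define RTE_MEMOPS_ATLEAST_SRC16A(flags) \
	(((flags) & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)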

> +		/* Source is not known to be 16 byte aligned, but might be. */
> +		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +		const size_t    offset = (uintptr_t)src & 15;
> +
> +		if (offset) {
offset > 0
> +			/* Source is not 16 byte aligned. */
> +			char            buffer[16] __rte_aligned(16);
> +			/** How many bytes is source away from 16 byte alignment
> +			 * (ceiling rounding).
> +			 */
> +			const size_t    first = 16 - offset;
> +
> +			xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
> +			_mm_store_si128((void *)buffer, xmm0);
> +
> +			/* Test for short length.
> +			 *
> +			 * Omitted if length is known to be >= 16.
> +			 */
> +			if (!(__builtin_constant_p(len) && len >= 16) &&

Why is __builtin_constant_p() used here?

> +					unlikely(len <= first)) {
> +				/* Short length. */
> +				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
> +				return;
> +			}
> +
> +			/* Copy until source pointer is 16 byte aligned. */
> +			rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), first);
> +			src = RTE_PTR_ADD(src, first);
> +			dst = RTE_PTR_ADD(dst, first);
> +			len -= first;
> +		}
> +	}
> +
> +	/* Source pointer is now 16 byte aligned. */
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);

Is this some attempt at manual register allocation, or why is "xmm2" 
used, and not "xmm0"?

> +		_mm_storeu_si128(dst, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, if any, via bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid) and length is known to be 16 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(len & 15)) {
> +		char    buffer[16] __rte_aligned(16);
> +
> +		xmm3 = _mm_stream_load_si128_const(src);

If this is indeed a register allocation trick, it should be mentioned in
a comment. Otherwise it's just confusing. If it is a trick, does it
actually have a positive effect? I wouldn't expect the compiler to take
"xmm3" so literally, and secondly, I'd expect register renaming in the
CPU to eliminate any false dependency.

> +		_mm_store_si128((void *)buffer, xmm3);
> +		rte_mov15_or_less(dst, buffer, len & 15);
> +	}
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Memory copy to non-temporal destination area.
> + *
> + * @note
> + * If the destination and/or length is unaligned, the first and/or last copied
> + * bytes will be stored in the destination memory area using temporal access.
> + * @note
> + * Performance is optimal when destination pointer is 16 byte aligned.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(LEN|DST)nnnA flags.
> + *   The RTE_MEMOPS_F_SRC_NT flag must be clear.
> + *   The RTE_MEMOPS_F_DST_NT flag must be set.
> + *   The RTE_MEMOPS_F_SRCnnnA flags are ignored.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ntd(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

This should also go into the __rte_memcpy namespace, rather than 
rte_memcpy*.

> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_DST_NT);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ||
> +			len >= 16) {

See my comments on the SRCA mask handling.

> +		/* Length >= 16 and/or destination is known to be 16 byte aligned. */
> +		register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +		/* If destination is not 16 byte aligned, then copy first part of data,
> +		 * to achieve 16 byte alignment of destination pointer.
> +		 * This invalidates the source, destination and length alignment flags, and
> +		 * potentially makes the source pointer unaligned.
> +		 *
> +		 * Omitted if destination is known to be 16 byte aligned.
> +		 */
> +		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)) {
> +			/* Destination is not known to be 16 byte aligned, but might be. */
> +			/** How many bytes is destination offset from 16 byte alignment
> +			 * (floor rounding).
> +			 */
> +			const size_t    offset = (uintptr_t)dst & 15;
> +
> +			if (offset) {
> +				/* Destination is not 16 byte aligned. */
> +				/** How many bytes is destination away from 16 byte alignment
> +				 * (ceiling rounding).
> +				 */
> +				const size_t    first = 16 - offset;
> +
> +				if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
> +						(offset & 3) == 0) {
> +					/* Destination is (known to be) 4 byte aligned. */
> +					int32_t r0, r1, r2;
> +
> +					/* Copy until destination pointer is 16 byte aligned. */
> +					if (first & 8) {
> +						memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
> +						memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
> +						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
> +						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
> +						src = RTE_PTR_ADD(src, 8);
> +						dst = RTE_PTR_ADD(dst, 8);
> +						len -= 8;
> +					}
> +					if (first & 4) {
> +						memcpy(&r2, src, 4);
> +						_mm_stream_si32(dst, r2);
> +						src = RTE_PTR_ADD(src, 4);
> +						dst = RTE_PTR_ADD(dst, 4);
> +						len -= 4;
> +					}
> +				} else {
> +					/* Destination is not 4 byte aligned. */
> +					/* Copy until destination pointer is 16 byte aligned. */
> +					rte_mov15_or_less(dst, src, first);
> +					src = RTE_PTR_ADD(src, first);
> +					dst = RTE_PTR_ADD(dst, first);
> +					len -= first;
> +				}
> +			}
> +		}
> +
> +		/* Destination pointer is now 16 byte aligned. */
> +		RTE_ASSERT(rte_is_aligned(dst, 16));
> +
> +		/* Copy large portion of data in chunks of 64 byte. */
> +		while (len >= 64) {
> +			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
> +			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
> +			xmm2 = _mm_loadu_si128(RTE_PTR_ADD(src, 2 * 16));
> +			xmm3 = _mm_loadu_si128(RTE_PTR_ADD(src, 3 * 16));
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +			src = RTE_PTR_ADD(src, 64);
> +			dst = RTE_PTR_ADD(dst, 64);
> +			len -= 64;
> +		}
> +
> +		/* Copy following 32 and 16 byte portions of data.
> +		 *
> +		 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +		 * flags are still valid)
> +		 * and length is known to be respectively 64 or 32 byte aligned.
> +		 */
> +		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +				(len & 32)) {
> +			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
> +			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +			src = RTE_PTR_ADD(src, 32);
> +			dst = RTE_PTR_ADD(dst, 32);
> +		}
> +		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +				(len & 16)) {
> +			xmm2 = _mm_loadu_si128(src);
> +			_mm_stream_si128(dst, xmm2);
> +			src = RTE_PTR_ADD(src, 16);
> +			dst = RTE_PTR_ADD(dst, 16);
> +		}
> +	} else {
> +		/* Length <= 15, and
> +		 * destination is not known to be 16 byte aligned (but might be).
> +		 */
> +		/* If destination is not 4 byte aligned, then
> +		 * use normal copy and return.
> +		 *
> +		 * Omitted if destination is known to be 4 byte aligned.
> +		 */
> +		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) &&
> +				!rte_is_aligned(dst, 4)) {
> +			/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
> +			rte_mov15_or_less(dst, src, len);
> +			return;
> +		}
> +		/* Destination is (known to be) 4 byte aligned. Proceed. */
> +	}
> +
> +	/* Destination pointer is now 4 byte (or 16 byte) aligned. */
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +
> +	/* Copy following 8 and 4 byte portions of data.
> +	 *
> +	 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 16 or 8 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(len & 8)) {
> +		int32_t r0, r1;
> +
> +		memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
> +		memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
> +		src = RTE_PTR_ADD(src, 8);
> +		dst = RTE_PTR_ADD(dst, 8);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A)) &&
> +			(len & 4)) {
> +		int32_t r2;
> +
> +		memcpy(&r2, src, 4);
> +		_mm_stream_si32(dst, r2);
> +		src = RTE_PTR_ADD(src, 4);
> +		dst = RTE_PTR_ADD(dst, 4);
> +	}
> +
> +	/* Copy remaining 2 and 1 byte portions of data.
> +	 *
> +	 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 4 and 2 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A)) &&
> +			(len & 2)) {
> +		int16_t r3;
> +
> +		memcpy(&r3, src, 2);
> +		*(int16_t *)dst = r3;

Writing to 'dst' both through an int16_t pointer and a void pointer 
could cause type-based aliasing issues.

There's no reason not to use memcpy() here.
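
E.g. (a sketch; any reasonable compiler will still emit a single 16-bit
load and store here):

		memcpy(dst, src, 2);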

> +		src = RTE_PTR_ADD(src, 2);
> +		dst = RTE_PTR_ADD(dst, 2);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A)) &&
> +			(len & 1))
> +		*(char *)dst = *(const char *)src;
> +}
> +
> +/**
> + * Non-temporal memory copy of 15 or less byte
> + * from 16 byte aligned source via bounce buffer.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Only the 4 least significant bits of this parameter are used;
> + *   they hold the number of remaining bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_15_or_less_s16a(void *__rte_restrict dst,
> +		const void *__rte_restrict src, size_t len, const uint64_t flags)
> +{
> +	int32_t             buffer[4] __rte_aligned(16);
> +	register __m128i    xmm0;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if ((len & 15) == 0)
> +		return;
> +
> +	/* Non-temporal load into bounce buffer. */
> +	xmm0 = _mm_stream_load_si128_const(src);
> +	_mm_store_si128((void *)buffer, xmm0);
> +
> +	/* Store from bounce buffer. */
> +	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
> +			rte_is_aligned(dst, 4)) {
> +		/* Destination is (known to be) 4 byte aligned. */
> +		src = (const void *)buffer;

Redundant cast.

> +		if (len & 8) {
> +#ifdef RTE_ARCH_X86_64
> +			if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) {
> +				/* Destination is known to be 8 byte aligned. */
> +				_mm_stream_si64(dst, *(const int64_t *)src);
> +			} else {
> +#endif /* RTE_ARCH_X86_64 */
> +				_mm_stream_si32(RTE_PTR_ADD(dst, 0), buffer[0]);
> +				_mm_stream_si32(RTE_PTR_ADD(dst, 4), buffer[1]);
> +#ifdef RTE_ARCH_X86_64
> +			}
> +#endif /* RTE_ARCH_X86_64 */
> +			src = RTE_PTR_ADD(src, 8);
> +			dst = RTE_PTR_ADD(dst, 8);
> +		}
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A) &&
> +				(len & 4)) {
> +			_mm_stream_si32(dst, *(const int32_t *)src);
> +			src = RTE_PTR_ADD(src, 4);
> +			dst = RTE_PTR_ADD(dst, 4);
> +		}
> +
> +		/* Non-temporal store is unavailable for the remaining 3 bytes or less. */
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) &&
> +				(len & 2)) {
> +			*(int16_t *)dst = *(const int16_t *)src;

Looks like another type-based aliasing issue.

> +			src = RTE_PTR_ADD(src, 2);
> +			dst = RTE_PTR_ADD(dst, 2);
> +		}
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A) &&
> +				(len & 1)) {
> +			*(char *)dst = *(const char *)src;
> +		}
> +	} else {
> +		/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
> +		rte_mov15_or_less(dst, (const void *)buffer, len & 15);

This cast is not needed.

> +	}
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 16 byte aligned addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 16 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d16s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

This function should not be public. That goes for all the other public
functions below as well.

That said, maybe there's a precedent working against this suggestion,
with all the various rte_memcpy() helpers being public already. I don't
know.

> +{
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;

Reduce the scope of these variable declarations.

> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 16));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_stream_si128(dst, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ?
> +				flags : RTE_MEMOPS_F_DST16A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +
> +#ifdef RTE_ARCH_X86_64
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 8/16 byte aligned destination/source addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 8 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d8s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	int64_t             buffer[8] __rte_cache_aligned /* at least __rte_aligned(16) */;
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 8));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
> +		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
> +		_mm_store_si128((void *)&buffer[3 * 2], xmm3);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 4 * 8), buffer[4]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 5 * 8), buffer[5]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 6 * 8), buffer[6]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 7 * 8), buffer[7]);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[4]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[5]);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) ?
> +				flags : RTE_MEMOPS_F_DST8A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +#endif /* RTE_ARCH_X86_64 */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 4/16 byte aligned destination/source addresses non-temporal memory copy.

/../ non-temporal source and destination /../

> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.

Delete "non-temporal" here and below. NT is not a property of a memory area.

> + *   Must be 4 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d4s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	int32_t             buffer[16] __rte_cache_aligned /* at least __rte_aligned(16) */;
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
> +		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
> +		_mm_store_si128((void *)&buffer[3 * 4], xmm3);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  0 * 4), buffer[0]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  1 * 4), buffer[1]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  2 * 4), buffer[2]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  3 * 4), buffer[3]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  4 * 4), buffer[4]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  5 * 4), buffer[5]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  6 * 4), buffer[6]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  7 * 4), buffer[7]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  8 * 4), buffer[8]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  9 * 4), buffer[9]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 10 * 4), buffer[10]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 11 * 4), buffer[11]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 12 * 4), buffer[12]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 13 * 4), buffer[13]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 14 * 4), buffer[14]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 15 * 4), buffer[15]);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 4 * 4), buffer[4]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 5 * 4), buffer[5]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 6 * 4), buffer[6]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 7 * 4), buffer[7]);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[8]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[9]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[10]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[11]);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ?
> +				flags : RTE_MEMOPS_F_DST4A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 4 byte aligned addresses (non-temporal) memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the (non-temporal) destination memory area.
> + *   Must be 4 byte aligned if using non-temporal store.
> + * @param src
> + *   Pointer to the (non-temporal) source memory area.
> + *   Must be 4 byte aligned if using non-temporal load.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d4s4a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)

If this isn't a NT memcpy, why is it named _nt_?

Why is it needed at all? Why not use rte_memcpy() in this case?

> +{
> +	/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +	const size_t    offset = (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A ?
> +			0 : (uintptr_t)src & 15;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +	RTE_ASSERT(rte_is_aligned(src, 4));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	if (offset == 0) {
> +		/* Source is 16 byte aligned. */
> +		/* Copy everything, using upgraded source alignment flags. */
> +		rte_memcpy_nt_d4s16a(dst, src, len,
> +				(flags & ~RTE_MEMOPS_F_SRCA_MASK) | RTE_MEMOPS_F_SRC16A);
> +	} else {
> +		/* Source is not 16 byte aligned, so make it 16 byte aligned. */
> +		int32_t             buffer[4] __rte_aligned(16);
> +		const size_t        first = 16 - offset;
> +		register __m128i    xmm0;
> +
> +		/* First, copy first part of data in chunks of 4 byte,
> +		 * to achieve 16 byte alignment of source.
> +		 * This invalidates the source, destination and length alignment flags, and
> +		 * potentially makes the destination pointer 16 byte unaligned/aligned.
> +		 */
> +
> +		/** Copy from 16 byte aligned source pointer (floor rounding). */
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
> +		_mm_store_si128((void *)buffer, xmm0);
> +
> +		if (unlikely(len + offset <= 16)) {
> +			/* Short length. */
> +			if (((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ||
> +					(len & 3) == 0) {
> +				/* Length is 4 byte aligned. */
> +				switch (len) {
> +				case 1 * 4:
> +					/* Offset can be 1 * 4, 2 * 4 or 3 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
> +							buffer[offset / 4]);
> +					break;
> +				case 2 * 4:
> +					/* Offset can be 1 * 4 or 2 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
> +							buffer[offset / 4]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4),
> +							buffer[offset / 4 + 1]);
> +					break;
> +				case 3 * 4:
> +					/* Offset can only be 1 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
> +					break;
> +				}
> +			} else {
> +				/* Length is not 4 byte aligned. */
> +				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
> +			}
> +			return;
> +		}
> +
> +		switch (first) {
> +		case 1 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
> +			break;
> +		case 2 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
> +			break;
> +		case 3 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
> +			break;
> +		}
> +
> +		src = RTE_PTR_ADD(src, first);
> +		dst = RTE_PTR_ADD(dst, first);
> +		len -= first;
> +
> +		/* Source pointer is now 16 byte aligned. */
> +		RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +		/* Then, copy the rest, using corrected alignment flags. */
> +		if (rte_is_aligned(dst, 16))
> +			rte_memcpy_nt_d16s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +#ifdef RTE_ARCH_X86_64
> +		else if (rte_is_aligned(dst, 8))
> +			rte_memcpy_nt_d8s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST8A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +#endif /* RTE_ARCH_X86_64 */
> +		else
> +			rte_memcpy_nt_d4s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST4A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +	}
> +}
> +
> +#ifndef RTE_MEMCPY_NT_BUFSIZE
> +
> +#include <lib/mbuf/rte_mbuf_core.h>
> +
> +/** Bounce buffer size for non-temporal memcpy.
> + *
> + * Must be 2^N and >= 128.
> + * The actual buffer will be slightly larger, due to added padding.
> + * The default is chosen to be able to handle a non-segmented packet.
> + */
> +#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM
> +
> +#endif  /* RTE_MEMCPY_NT_BUFSIZE */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Non-temporal memory copy via bounce buffer.
> + *
> + * @note
> + * If the destination and/or length is unaligned, the first and/or last copied
> + * bytes will be stored in the destination memory area using temporal access.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_buf(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	/** Cache line aligned bounce buffer with preceding and trailing padding.
> +	 *
> +	 * The preceding padding is one cache line, so the data area itself
> +	 * is cache line aligned.
> +	 * The trailing padding is 16 bytes, leaving room for the trailing bytes
> +	 * of a 16 byte store operation.
> +	 */
> +	char			buffer[RTE_CACHE_LINE_SIZE + RTE_MEMCPY_NT_BUFSIZE +  16]
> +				__rte_cache_aligned;
> +	/** Pointer to bounce buffer's aligned data area. */
> +	char		* const buf0 = &buffer[RTE_CACHE_LINE_SIZE];
> +	void		       *buf;
> +	/** Number of bytes to copy from source, incl. any extra preceding bytes. */
> +	size_t			srclen;
> +	register __m128i	xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Step 1:
> +	 * Copy data from the source to the bounce buffer's aligned data area,
> +	 * using aligned non-temporal load from the source,
> +	 * and unaligned store in the bounce buffer.
> +	 *
> +	 * If the source is unaligned, the additional bytes preceding the data will be copied
> +	 * to the padding area preceding the bounce buffer's aligned data area.
> +	 * Similarly, if the source data ends at an unaligned address, the additional bytes
> +	 * trailing the data will be copied to the padding area trailing the bounce buffer's
> +	 * aligned data area.
> +	 */
> +
> +	/* Adjust for extra preceding bytes, unless source is known to be 16 byte aligned. */
> +	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) {
> +		buf = buf0;
> +		srclen = len;
> +	} else {
> +		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +		const size_t offset = (uintptr_t)src & 15;
> +
> +		buf = RTE_PTR_SUB(buf0, offset);
> +		src = RTE_PTR_SUB(src, offset);
> +		srclen = len + offset;
> +	}
> +
> +	/* Copy large portion of data from source to bounce buffer in chunks of 64 byte. */
> +	while (srclen >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		buf = RTE_PTR_ADD(buf, 64);
> +		srclen -= 64;
> +	}
> +
> +	/* Copy remaining 32 and 16 byte portions of data from source to bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the length alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +			(srclen & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		buf = RTE_PTR_ADD(buf, 32);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +			(srclen & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(buf, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		buf = RTE_PTR_ADD(buf, 16);
> +	}
> +	/* Copy any trailing bytes of data from source to bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the length alignment
> +	 * flags are still valid)
> +	 * and length is known to be 16 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(srclen & 15)) {
> +		xmm3 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(buf, xmm3);
> +	}
> +
> +	/* Step 2:
> +	 * Copy from the aligned bounce buffer to the non-temporal destination.
> +	 */
> +	rte_memcpy_ntd(dst, buf0, len,
> +			(flags & ~(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRCA_MASK)) |
> +			(RTE_CACHE_LINE_SIZE << RTE_MEMOPS_F_SRCA_SHIFT));
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @note
> + * If the destination and/or length is unaligned, some copied bytes will be
> + * stored in the destination memory area using temporal access.

Is temporal access the proper term?

I would describe it as "stored in the destination memory area without
the use of non-temporal hints", or something like that.

> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_generic(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +
> +	while (len > RTE_MEMCPY_NT_BUFSIZE) {
> +		rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE,
> +				(flags & ~RTE_MEMOPS_F_LENA_MASK) | RTE_MEMOPS_F_LEN128A);
> +		dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
> +		src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
> +		len -= RTE_MEMCPY_NT_BUFSIZE;
> +	}
> +	rte_memcpy_nt_buf(dst, src, len, flags);
> +}
> +
> +/* Implementation. Refer to function declaration for documentation. */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	if ((flags & (RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) ==
> +			(RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) {
> +		/* Copy between non-temporal source and destination. */
> +		if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d16s16a(dst, src, len, flags);
> +#ifdef RTE_ARCH_X86_64
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d8s16a(dst, src, len, flags);
> +#endif /* RTE_ARCH_X86_64 */
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d4s16a(dst, src, len, flags);
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
> +			rte_memcpy_nt_d4s4a(dst, src, len, flags);
> +		else if (len <= RTE_MEMCPY_NT_BUFSIZE)
> +			rte_memcpy_nt_buf(dst, src, len, flags);
> +		else
> +			rte_memcpy_nt_generic(dst, src, len, flags);
> +	} else if (flags & RTE_MEMOPS_F_SRC_NT) {
> +		/* Copy from non-temporal source. */
> +		rte_memcpy_nts(dst, src, len, flags);
> +	} else if (flags & RTE_MEMOPS_F_DST_NT) {
> +		/* Copy to non-temporal destination. */
> +		rte_memcpy_ntd(dst, src, len, flags);
> +	} else
> +		rte_memcpy(dst, src, len);
> +}
> +
>   #undef ALIGNMENT_MASK
>   
>   #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
> index a2307cebe6..aa96fb4cc8 100644
> --- a/lib/mbuf/rte_mbuf.c
> +++ b/lib/mbuf/rte_mbuf.c
> @@ -660,6 +660,83 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
>   	return mc;
>   }
>   
> +/* Create a deep copy of mbuf, using non-temporal memory access */
> +struct rte_mbuf *
> +rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
> +		 uint32_t off, uint32_t len, const uint64_t flags)
> +{
> +	const struct rte_mbuf *seg = m;
> +	struct rte_mbuf *mc, *m_last, **prev;
> +
> +	/* garbage in check */
> +	__rte_mbuf_sanity_check(m, 1);
> +
> +	/* check for request to copy at offset past end of mbuf */
> +	if (unlikely(off >= m->pkt_len))
> +		return NULL;
> +
> +	mc = rte_pktmbuf_alloc(mp);
> +	if (unlikely(mc == NULL))
> +		return NULL;
> +
> +	/* truncate requested length to available data */
> +	if (len > m->pkt_len - off)
> +		len = m->pkt_len - off;
> +
> +	__rte_pktmbuf_copy_hdr(mc, m);
> +
> +	/* copied mbuf is not indirect or external */
> +	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
> +
> +	prev = &mc->next;
> +	m_last = mc;
> +	while (len > 0) {
> +		uint32_t copy_len;
> +
> +		/* skip leading mbuf segments */
> +		while (off >= seg->data_len) {
> +			off -= seg->data_len;
> +			seg = seg->next;
> +		}
> +
> +		/* current buffer is full, chain a new one */
> +		if (rte_pktmbuf_tailroom(m_last) == 0) {
> +			m_last = rte_pktmbuf_alloc(mp);
> +			if (unlikely(m_last == NULL)) {
> +				rte_pktmbuf_free(mc);
> +				return NULL;
> +			}
> +			++mc->nb_segs;
> +			*prev = m_last;
> +			prev = &m_last->next;
> +		}
> +
> +		/*
> +		 * copy the min of data in input segment (seg)
> +		 * vs space available in output (m_last)
> +		 */
> +		copy_len = RTE_MIN(seg->data_len - off, len);
> +		if (copy_len > rte_pktmbuf_tailroom(m_last))
> +			copy_len = rte_pktmbuf_tailroom(m_last);
> +
> +		/* append from seg to m_last */
> +		rte_memcpy_ex(rte_pktmbuf_mtod_offset(m_last, char *,
> +						   m_last->data_len),
> +			   rte_pktmbuf_mtod_offset(seg, char *, off),
> +			   copy_len, flags);
> +
> +		/* update offsets and lengths */
> +		m_last->data_len += copy_len;
> +		mc->pkt_len += copy_len;
> +		off += copy_len;
> +		len -= copy_len;
> +	}
> +
> +	/* garbage out check */
> +	__rte_mbuf_sanity_check(mc, 1);
> +	return mc;
> +}
> +

This looks like a cut-and-paste from rte_pktmbuf_copy(). Make a 
__rte_pktmbuf_copy_generic() which takes either a memcpy()-style function 
pointer plus flags, or just flags, as input, and have both the new 
copy_ex() and the old copy function delegate to it.
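
For the flags-only variant, a minimal sketch of what I mean (not from the 
patch, untested, and assuming that rte_memcpy_ex() with flags == 0 behaves 
like rte_memcpy(), as the x86 implementation suggests):

static struct rte_mbuf *
__rte_pktmbuf_copy_generic(const struct rte_mbuf *m, struct rte_mempool *mp,
		uint32_t off, uint32_t len, const uint64_t flags)
{
	/* The body is the copy loop already shown above in
	 * rte_pktmbuf_copy_ex(); flags is only used for the
	 * rte_memcpy_ex(..., flags) call.
	 */
}

struct rte_mbuf *
rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
		uint32_t off, uint32_t len)
{
	return __rte_pktmbuf_copy_generic(m, mp, off, len, 0);
}

struct rte_mbuf *
rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
		uint32_t off, uint32_t len, const uint64_t flags)
{
	return __rte_pktmbuf_copy_generic(m, mp, off, len, flags);
}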

>   /* dump a mbuf on console */
>   void
>   rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index b6e23d98ce..030df396a3 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -1443,6 +1443,38 @@ struct rte_mbuf *
>   rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
>   		 uint32_t offset, uint32_t length);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Create a full copy of a given packet mbuf,
> + * using non-temporal memory access as specified by flags.
> + *
> + * Copies all the data from a given packet mbuf to a newly allocated
> + * set of mbufs. The private data is not copied.
> + *
> + * @param m
> + *   The packet mbuf to be copied.
> + * @param mp
> + *   The mempool from which the "clone" mbufs are allocated.
> + * @param offset
> + *   The number of bytes to skip before copying.
> + *   If the mbuf does not have that many bytes, it is an error
> + *   and NULL is returned.
> + * @param length
> + *   The upper limit on bytes to copy.  Passing UINT32_MAX
> + *   means all data (after offset).
> + * @param flags
> + *   Non-temporal memory access hints for rte_memcpy_ex.
> + * @return
> + *   - The pointer to the new "clone" mbuf on success.
> + *   - NULL if allocation fails.
> + */
> +__rte_experimental
> +struct rte_mbuf *
> +rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
> +		    uint32_t offset, uint32_t length, const uint64_t flags);

The same question about why flags is const.

> +
>   /**
>    * Adds given value to the refcnt of all packet mbuf segments.
>    *
> diff --git a/lib/mbuf/version.map b/lib/mbuf/version.map
> index ed486ed14e..b583364ad4 100644
> --- a/lib/mbuf/version.map
> +++ b/lib/mbuf/version.map
> @@ -47,5 +47,6 @@ EXPERIMENTAL {
>   	global:
>   
>   	rte_pktmbuf_pool_create_extbuf;
> +	rte_pktmbuf_copy_ex;
>   
>   };
> diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
> index af2b814251..ae871c4865 100644
> --- a/lib/pcapng/rte_pcapng.c
> +++ b/lib/pcapng/rte_pcapng.c
> @@ -466,7 +466,8 @@ rte_pcapng_copy(uint16_t port_id, uint32_t queue,
>   	orig_len = rte_pktmbuf_pkt_len(md);
>   
>   	/* Take snapshot of the data */
> -	mc = rte_pktmbuf_copy(md, mp, 0, length);
> +	mc = rte_pktmbuf_copy_ex(md, mp, 0, length,
> +				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
>   	if (unlikely(mc == NULL))
>   		return NULL;
>   
> diff --git a/lib/pdump/rte_pdump.c b/lib/pdump/rte_pdump.c
> index 98dcbc037b..6e61c75407 100644
> --- a/lib/pdump/rte_pdump.c
> +++ b/lib/pdump/rte_pdump.c
> @@ -124,7 +124,8 @@ pdump_copy(uint16_t port_id, uint16_t queue,
>   					    pkts[i], mp, cbs->snaplen,
>   					    ts, direction);
>   		else
> -			p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
> +			p = rte_pktmbuf_copy_ex(pkts[i], mp, 0, cbs->snaplen,
> +						RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
>   
>   		if (unlikely(p == NULL))
>   			__atomic_fetch_add(&stats->nombuf, 1, __ATOMIC_RELAXED);
> @@ -134,6 +135,9 @@ pdump_copy(uint16_t port_id, uint16_t queue,
>   
>   	__atomic_fetch_add(&stats->accepted, d_pkts, __ATOMIC_RELAXED);
>   
> +	/* Flush non-temporal stores regarding the packet copies. */
> +	rte_wmb();
> +

This is an unnecessary barrier for many architectures.

>   	ring_enq = rte_ring_enqueue_burst(ring, (void *)dup_bufs, d_pkts, NULL);
>   	if (unlikely(ring_enq < d_pkts)) {
>   		unsigned int drops = d_pkts - ring_enq;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] eal: non-temporal memcpy
  2022-10-10  6:46 ` [PATCH v4] " Morten Brørup
  2022-10-16 14:27   ` Mattias Rönnblom
@ 2022-10-16 19:55   ` Mattias Rönnblom
  2023-07-31 12:14   ` Thomas Monjalon
  2 siblings, 0 replies; 17+ messages in thread
From: Mattias Rönnblom @ 2022-10-16 19:55 UTC (permalink / raw)
  To: Morten Brørup, bruce.richardson, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen
  Cc: mattias.ronnblom, kda, drc, dev

On 2022-10-10 08:46, Morten Brørup wrote:
> This patch provides a function for memory copy using non-temporal store,
> load or both, controlled by flags passed to the function.
> 
> Applications sometimes copy data to another memory location, which is only
> used much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
> 
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
> 
> The purpose of the function is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput can be improved by further optimization, I do not
> have time to do it now.
> 
> The functional tests and performance tests for memory copy have been
> expanded to include non-temporal copying.
> 
> A non-temporal version of the mbuf library's function to create a full
> copy of a given packet mbuf is provided.
> 
> The packet capture and packet dump libraries have been updated to use
> non-temporal memory copy of the packets.
> 
> Implementation notes:
> 
> Implementations for non-x86 architectures can be provided by anyone at a
> later time. I am not going to do it.
> 
> x86 non-temporal load instructions must be 16 byte aligned [1], and
> non-temporal store instructions must be 4, 8 or 16 byte aligned [2].
> 
> ARM non-temporal load and store instructions seem to require 4 byte
> alignment [3].
> 
> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_load
> [2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_si
> [3] https://developer.arm.com/documentation/100076/0100/
> A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
> LDNP--SIMD-and-FP-
> 
> This patch is a major rewrite from the RFC v3, so no version log comparing
> to the RFC is provided.
> 
> v4
> * Also ignore the warning for clang in the workaround for
>    _mm_stream_load_si128() missing const in the parameter.
> * Add missing C linkage specifier in rte_memcpy.h.
> 
> v3
> * _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
>    use it on 64-bit x86 architecture.
> * CLANG warns that _mm_stream_load_si128_const() and
>    rte_memcpy_nt_15_or_less_s16a() are not public,
>    so remove __rte_internal from them. It also affects the documentation
>    for the functions, so the fix can't be limited to CLANG.
> * Use __rte_experimental instead of __rte_internal.
> * Replace <n> with nnn in function documentation; it doesn't look like
>    HTML.
> * Slightly modify the workaround for _mm_stream_load_si128() missing const
>    in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
>    #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
>    #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
> * Fixed one coding style issue missed in v2.
> 
> v2
> * The last 16 byte block of data, incl. any trailing bytes, was not
>    copied from the source memory area in rte_memcpy_nt_buf().
> * Fix many coding style issues.
> * Add some missing header files.
> * Fix build time warning for non-x86 architectures by using a different
>    method to mark the flags parameter unused.
> * CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
>    so omit it when using CLANG.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>   app/test/test_memcpy.c               |   65 +-
>   app/test/test_memcpy_perf.c          |  187 ++--
>   lib/eal/include/generic/rte_memcpy.h |  127 +++
>   lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
>   lib/mbuf/rte_mbuf.c                  |   77 ++
>   lib/mbuf/rte_mbuf.h                  |   32 +
>   lib/mbuf/version.map                 |    1 +
>   lib/pcapng/rte_pcapng.c              |    3 +-
>   lib/pdump/rte_pdump.c                |    6 +-
>   9 files changed, 1645 insertions(+), 91 deletions(-)
> 
> diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
> index 1ab86f4967..12410ce413 100644
> --- a/app/test/test_memcpy.c
> +++ b/app/test/test_memcpy.c
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #include <stdint.h>
> @@ -36,6 +37,19 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
>   /* Data is aligned on this many bytes (power of 2) */
>   #define ALIGNMENT_UNIT          32
>   
> +const uint64_t nt_mode_flags[4] = {
> +	0,
> +	RTE_MEMOPS_F_SRC_NT,
> +	RTE_MEMOPS_F_DST_NT,
> +	RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT
> +};
> +const char * const nt_mode_str[4] = {
> +	"none",
> +	"src",
> +	"dst",
> +	"src+dst"
> +};
> +
>   
>   /*
>    * Create two buffers, and initialise one with random values. These are copied
> @@ -44,12 +58,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
>    * changed.
>    */
>   static int
> -test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
> +test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size, unsigned int nt_mode)
>   {
>   	unsigned int i;
>   	uint8_t dest[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
>   	uint8_t src[SMALL_BUFFER_SIZE + ALIGNMENT_UNIT];
>   	void * ret;
> +	const uint64_t flags = nt_mode_flags[nt_mode];
>   
>   	/* Setup buffers */
>   	for (i = 0; i < SMALL_BUFFER_SIZE + ALIGNMENT_UNIT; i++) {
> @@ -58,18 +73,23 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	}
>   
>   	/* Do the copy */
> -	ret = rte_memcpy(dest + off_dst, src + off_src, size);
> -	if (ret != (dest + off_dst)) {
> -		printf("rte_memcpy() returned %p, not %p\n",
> -		       ret, dest + off_dst);
> +	if (nt_mode) {
> +		rte_memcpy_ex(dest + off_dst, src + off_src, size, flags);
> +	} else {
> +		ret = rte_memcpy(dest + off_dst, src + off_src, size);
> +		if (ret != (dest + off_dst)) {
> +			printf("rte_memcpy() returned %p, not %p\n",
> +			       ret, dest + off_dst);
> +		}
>   	}
>   
>   	/* Check nothing before offset is affected */
>   	for (i = 0; i < off_dst; i++) {
>   		if (dest[i] != 0) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
>   			       "[modified before start of dst].\n",
> -			       (unsigned)size, off_src, off_dst);
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
>   			return -1;
>   		}
>   	}
> @@ -77,9 +97,11 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	/* Check everything was copied */
>   	for (i = 0; i < size; i++) {
>   		if (dest[i + off_dst] != src[i + off_src]) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> -			       "[didn't copy byte %u].\n",
> -			       (unsigned)size, off_src, off_dst, i);
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
> +			       "[didn't copy byte %u: 0x%02x!=0x%02x].\n",
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode], i,
> +			       dest[i + off_dst], src[i + off_src]);
>   			return -1;
>   		}
>   	}
> @@ -87,9 +109,10 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   	/* Check nothing after copy was affected */
>   	for (i = size; i < SMALL_BUFFER_SIZE; i++) {
>   		if (dest[i + off_dst] != 0) {
> -			printf("rte_memcpy() failed for %u bytes (offsets=%u,%u): "
> +			printf("rte_memcpy%s() failed for %u bytes (offsets=%u,%u nt=%s): "
>   			       "[copied too many].\n",
> -			       (unsigned)size, off_src, off_dst);
> +			       nt_mode ? "_ex" : "",
> +			       (unsigned int)size, off_src, off_dst, nt_mode_str[nt_mode]);
>   			return -1;
>   		}
>   	}
> @@ -102,16 +125,18 @@ test_single_memcpy(unsigned int off_src, unsigned int off_dst, size_t size)
>   static int
>   func_test(void)
>   {
> -	unsigned int off_src, off_dst, i;
> +	unsigned int off_src, off_dst, i, nt_mode;
>   	int ret;
>   
> -	for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
> -		for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
> -			for (i = 0; i < RTE_DIM(buf_sizes); i++) {
> -				ret = test_single_memcpy(off_src, off_dst,
> -				                         buf_sizes[i]);
> -				if (ret != 0)
> -					return -1;
> +	for (nt_mode = 0; nt_mode < 4; nt_mode++) {
> +		for (off_src = 0; off_src < ALIGNMENT_UNIT; off_src++) {
> +			for (off_dst = 0; off_dst < ALIGNMENT_UNIT; off_dst++) {
> +				for (i = 0; i < RTE_DIM(buf_sizes); i++) {
> +					ret = test_single_memcpy(off_src, off_dst,
> +								 buf_sizes[i], nt_mode);
> +					if (ret != 0)
> +						return -1;
> +				}
>   			}
>   		}
>   	}
> diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
> index 3727c160e6..6bb52cba88 100644
> --- a/app/test/test_memcpy_perf.c
> +++ b/app/test/test_memcpy_perf.c
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #include <stdint.h>
> @@ -15,6 +16,7 @@
>   #include <rte_malloc.h>
>   
>   #include <rte_memcpy.h>
> +#include <rte_atomic.h>
>   
>   #include "test.h"
>   
> @@ -27,9 +29,9 @@
>   /* List of buffer sizes to test */
>   #if TEST_VALUE_RANGE == 0
>   static size_t buf_sizes[] = {
> -	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
> -	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
> -	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
> +	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 40, 48, 60, 63, 64, 65, 80, 92, 124,
> +	127, 128, 129, 140, 152, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447,
> +	448, 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
>   	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
>   };
>   /* MUST be as large as largest packet size above */
> @@ -72,7 +74,7 @@ static uint8_t *small_buf_read, *small_buf_write;
>   static int
>   init_buffers(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   
>   	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
>   	if (large_buf_read == NULL)
> @@ -151,7 +153,7 @@ static void
>   do_uncached_write(uint8_t *dst, int is_dst_cached,
>   				  const uint8_t *src, int is_src_cached, size_t size)
>   {
> -	unsigned i, j;
> +	unsigned int i, j;
>   	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
>   
>   	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
> @@ -167,66 +169,112 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
>    * Run a single memcpy performance test. This is a macro to ensure that if
>    * the "size" parameter is a constant it won't be converted to a variable.
>    */
> -#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
> -                         src, is_src_cached, src_uoffset, size)             \
> -do {                                                                        \
> -    unsigned int iter, t;                                                   \
> -    size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
> -    uint64_t start_time, total_time = 0;                                    \
> -    uint64_t total_time2 = 0;                                               \
> -    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
> -        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
> -                         src_addrs, is_src_cached, src_uoffset);            \
> -        start_time = rte_rdtsc();                                           \
> -        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
> -            rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
> -        total_time += rte_rdtsc() - start_time;                             \
> -    }                                                                       \
> -    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
> -        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
> -                         src_addrs, is_src_cached, src_uoffset);            \
> -        start_time = rte_rdtsc();                                           \
> -        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
> -            memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
> -        total_time2 += rte_rdtsc() - start_time;                            \
> -    }                                                                       \
> -    printf("%3.0f -", (double)total_time  / TEST_ITERATIONS);                 \
> -    printf("%3.0f",   (double)total_time2 / TEST_ITERATIONS);                 \
> -    printf("(%6.2f%%) ", ((double)total_time - total_time2)*100/total_time2); \
> +#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,					  \
> +			 src, is_src_cached, src_uoffset, size)					  \
> +do {												  \
> +	unsigned int iter, t;									  \
> +	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];				  \
> +	uint64_t start_time;									  \
> +	uint64_t total_time_rte = 0, total_time_std = 0;					  \
> +	uint64_t total_time_ntd = 0, total_time_nts = 0, total_time_nt = 0;			  \
> +	const uint64_t flags = ((dst_uoffset == 0) ?						  \
> +				(ALIGNMENT_UNIT << RTE_MEMOPS_F_DSTA_SHIFT) : 0) |		  \
> +			       ((src_uoffset == 0) ?						  \
> +				(ALIGNMENT_UNIT << RTE_MEMOPS_F_SRCA_SHIFT) : 0);		  \
> +	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
> +		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
> +				 src_addrs, is_src_cached, src_uoffset);			  \
> +		start_time = rte_rdtsc();							  \
> +		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
> +			rte_memcpy(dst + dst_addrs[t], src + src_addrs[t], size);		  \
> +		total_time_rte += rte_rdtsc() - start_time;					  \
> +	}											  \
> +	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {			  \
> +		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,				  \
> +				 src_addrs, is_src_cached, src_uoffset);			  \
> +		start_time = rte_rdtsc();							  \
> +		for (t = 0; t < TEST_BATCH_SIZE; t++)						  \
> +			memcpy(dst + dst_addrs[t], src + src_addrs[t], size);			  \
> +		total_time_std += rte_rdtsc() - start_time;					  \
> +	}											  \
> +	if (!(is_dst_cached && is_src_cached)) {						  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_DST_NT);			  \
> +			total_time_ntd += rte_rdtsc() - start_time;				  \
> +		}										  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_SRC_NT);			  \
> +			total_time_nts += rte_rdtsc() - start_time;				  \
> +		}										  \
> +		for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {		  \
> +			fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,			  \
> +					 src_addrs, is_src_cached, src_uoffset);		  \
> +			start_time = rte_rdtsc();						  \
> +			for (t = 0; t < TEST_BATCH_SIZE; t++)					  \
> +				rte_memcpy_ex(dst + dst_addrs[t], src + src_addrs[t], size,       \
> +					      flags | RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT); \
> +			total_time_nt += rte_rdtsc() - start_time;				  \
> +		}										  \
> +	}											  \
> +	printf(" %4.0f-", (double)total_time_rte / TEST_ITERATIONS);				  \
> +	printf("%4.0f",   (double)total_time_std / TEST_ITERATIONS);				  \
> +	printf("(%+4.0f%%)", ((double)total_time_rte - total_time_std) * 100 / total_time_std);   \
> +	if (!(is_dst_cached && is_src_cached)) {						  \
> +		printf(" %4.0f", (double)total_time_ntd / TEST_ITERATIONS);			  \
> +		printf(" %4.0f", (double)total_time_nts / TEST_ITERATIONS);			  \
> +		printf(" %4.0f", (double)total_time_nt / TEST_ITERATIONS);			  \
> +		if (total_time_nt / total_time_std > 9)						  \
> +			printf("(*%4.1f)", (double)total_time_nt / total_time_std);		  \
> +		else										  \
> +			printf("(%+4.0f%%)",							  \
> +			       ((double)total_time_nt - total_time_std) * 100 / total_time_std);  \
> +	}											  \
>   } while (0)
>   
>   /* Run aligned memcpy tests for each cached/uncached permutation */
> -#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
> -do {                                                                     \
> -    if (__builtin_constant_p(n))                                         \
> -        printf("\nC%6u", (unsigned)n);                                   \
> -    else                                                                 \
> -        printf("\n%7u", (unsigned)n);                                    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
> +#define ALL_PERF_TESTS_FOR_SIZE(n)						\
> +do {										\
> +	if (__builtin_constant_p(n))						\
> +		printf("\nC%6u", (unsigned int)n);				\
> +	else									\
> +		printf("\n%7u", (unsigned int)n);				\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);	\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);	\
>   } while (0)
>   
>   /* Run unaligned memcpy tests for each cached/uncached permutation */
> -#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
> -do {                                                                     \
> -    if (__builtin_constant_p(n))                                         \
> -        printf("\nC%6u", (unsigned)n);                                   \
> -    else                                                                 \
> -        printf("\n%7u", (unsigned)n);                                    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
> -    SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
> -    SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
> +#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)					\
> +do {										\
> +	if (__builtin_constant_p(n))						\
> +		printf("\nC%6u", (unsigned int)n);				\
> +	else									\
> +		printf("\n%7u", (unsigned int)n);				\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);	\
> +	SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);	\
> +	SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);	\
>   } while (0)
>   
>   /* Run memcpy tests for constant length */
> -#define ALL_PERF_TEST_FOR_CONSTANT                                      \
> -do {                                                                    \
> -    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
> -    TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
> -    TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
> +#define ALL_PERF_TEST_FOR_CONSTANT						\
> +do {										\
> +	TEST_CONSTANT(4U); TEST_CONSTANT(6U); TEST_CONSTANT(8U);		\
> +	TEST_CONSTANT(16U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);		\
> +	TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);		\
> +	TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);	\
> +	TEST_CONSTANT(2048U);							\
>   } while (0)
>   
>   /* Run all memcpy tests for aligned constant cases */
> @@ -251,7 +299,7 @@ perf_test_constant_unaligned(void)
>   static inline void
>   perf_test_variable_aligned(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
>   		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
>   	}
> @@ -261,7 +309,7 @@ perf_test_variable_aligned(void)
>   static inline void
>   perf_test_variable_unaligned(void)
>   {
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < RTE_DIM(buf_sizes); i++) {
>   		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
>   	}
> @@ -282,7 +330,7 @@ perf_test(void)
>   
>   #if TEST_VALUE_RANGE != 0
>   	/* Set up buf_sizes array, if required */
> -	unsigned i;
> +	unsigned int i;
>   	for (i = 0; i < TEST_VALUE_RANGE; i++)
>   		buf_sizes[i] = i;
>   #endif
> @@ -290,13 +338,14 @@ perf_test(void)
>   	/* See function comment */
>   	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
>   
> -	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
> -		   "======= ================= ================= ================= =================\n"
> -		   "   Size   Cache to cache     Cache to mem      Mem to cache        Mem to mem\n"
> -		   "(bytes)          (ticks)          (ticks)           (ticks)           (ticks)\n"
> -		   "------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n** rte_memcpy(RTE)/memcpy(STD)/rte_memcpy_ex(NTD/NTS/NT) - memcpy perf. tests (C = compile-time constant) **\n"
> +		   "======= ================ ====================================== ====================================== ======================================\n"
> +		   "   Size  Cache to cache               Cache to mem                           Mem to cache                            Mem to mem\n"
> +		   "(bytes)         (ticks)                    (ticks)                                (ticks)                               (ticks)\n"
> +		   "         RTE- STD(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)  RTE- STD(diff%%)  NTD  NTS   NT(diff%%)\n"
> +		   "------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   
> -	printf("\n================================= %2dB aligned =================================",
> +	printf("\n================================================================ %2dB aligned ===============================================================",
>   		ALIGNMENT_UNIT);
>   	/* Do aligned tests where size is a variable */
>   	timespec_get(&tv_begin, TIME_UTC);
> @@ -304,28 +353,28 @@ perf_test(void)
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_aligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   	/* Do aligned tests where size is a compile-time constant */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_constant_aligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_aligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n================================== Unaligned ==================================");
> +	printf("\n================================================================= Unaligned =================================================================");
>   	/* Do unaligned tests where size is a variable */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_variable_unaligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_unaligned = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n------- ----------------- ----------------- ----------------- -----------------");
> +	printf("\n------- ---------------- -------------------------------------- -------------------------------------- --------------------------------------");
>   	/* Do unaligned tests where size is a compile-time constant */
>   	timespec_get(&tv_begin, TIME_UTC);
>   	perf_test_constant_unaligned();
>   	timespec_get(&tv_end, TIME_UTC);
>   	time_unaligned_const = (double)(tv_end.tv_sec - tv_begin.tv_sec)
>   		+ ((double)tv_end.tv_nsec - tv_begin.tv_nsec) / NS_PER_S;
> -	printf("\n======= ================= ================= ================= =================\n\n");
> +	printf("\n======= ================ ====================================== ====================================== ======================================\n\n");
>   
>   	printf("Test Execution Time (seconds):\n");
>   	printf("Aligned variable copy size   = %8.3f\n", time_aligned);
> diff --git a/lib/eal/include/generic/rte_memcpy.h b/lib/eal/include/generic/rte_memcpy.h
> index e7f0f8eaa9..b087f09c35 100644
> --- a/lib/eal/include/generic/rte_memcpy.h
> +++ b/lib/eal/include/generic/rte_memcpy.h
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #ifndef _RTE_MEMCPY_H_
> @@ -11,6 +12,13 @@
>    * Functions for vectorised implementation of memcpy().
>    */
>   
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
>   /**
>    * Copy 16 bytes from one location to another using optimised
>    * instructions. The locations should not overlap.
> @@ -113,4 +121,123 @@ rte_memcpy(void *dst, const void *src, size_t n);
>   
>   #endif /* __DOXYGEN__ */
>   
> +/*
> + * Advanced/Non-Temporal Memory Operations Flags.
> + */
> +
> +/** Length alignment hint mask. */
> +#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
> +/** Length alignment hint shift. */
> +#define RTE_MEMOPS_F_LENA_SHIFT 0
> +/** Hint: Length is 2 byte aligned. */
> +#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
> +/** Hint: Length is 4 byte aligned. */
> +#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)
> +/** Hint: Length is 8 byte aligned. */
> +#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)
> +/** Hint: Length is 16 byte aligned. */
> +#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)
> +/** Hint: Length is 32 byte aligned. */
> +#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)
> +/** Hint: Length is 64 byte aligned. */
> +#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)
> +/** Hint: Length is 128 byte aligned. */
> +#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)
> +
> +/** Prefer non-temporal access to source memory area.
> + */
> +#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)
> +/** Source address alignment hint mask. */
> +#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)
> +/** Source address alignment hint shift. */
> +#define RTE_MEMOPS_F_SRCA_SHIFT 8
> +/** Hint: Source address is 2 byte aligned. */
> +#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)
> +/** Hint: Source address is 4 byte aligned. */
> +#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)
> +/** Hint: Source address is 8 byte aligned. */
> +#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)
> +/** Hint: Source address is 16 byte aligned. */
> +#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)
> +/** Hint: Source address is 32 byte aligned. */
> +#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)
> +/** Hint: Source address is 64 byte aligned. */
> +#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)
> +/** Hint: Source address is 128 byte aligned. */
> +#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)
> +
> +/** Prefer non-temporal access to destination memory area.
> + *
> + * On x86 architecture:
> + * Remember to call rte_wmb() after a sequence of copy operations.
> + */
> +#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)
> +/** Destination address alignment hint mask. */
> +#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)
> +/** Destination address alignment hint shift. */
> +#define RTE_MEMOPS_F_DSTA_SHIFT 16
> +/** Hint: Destination address is 2 byte aligned. */
> +#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)
> +/** Hint: Destination address is 4 byte aligned. */
> +#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)
> +/** Hint: Destination address is 8 byte aligned. */
> +#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)
> +/** Hint: Destination address is 16 byte aligned. */
> +#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)
> +/** Hint: Destination address is 32 byte aligned. */
> +#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)
> +/** Hint: Destination address is 64 byte aligned. */
> +#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)
> +/** Hint: Destination address is 128 byte aligned. */
> +#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Advanced/non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the destination memory area.
> + * @param src
> + *   Pointer to the source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)nnnA flags.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags);
> +
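
For clarity, a typical call with both areas non-temporal and 16 byte 
aligned pointers would be something like this (my own example, not from 
the patch):

	rte_memcpy_ex(dst, src, len,
		      RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_DST16A |
		      RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRC16A);
	/* ... more copies ... */
	rte_wmb(); /* x86: flush the non-temporal stores when the batch is done */

with the flags expression being a compile-time constant, as required above.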
> +#ifndef RTE_MEMCPY_EX_ARCH_DEFINED
> +
> +/* Fallback implementation, if no arch-specific implementation is provided. */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	RTE_SET_USED(flags);
> +	memcpy(dst, src, len);
> +}
> +
> +#endif /* RTE_MEMCPY_EX_ARCH_DEFINED */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
>   #endif /* _RTE_MEMCPY_H_ */
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index d4d7a5cfc8..31d0faf7a8 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -1,5 +1,6 @@
>   /* SPDX-License-Identifier: BSD-3-Clause
>    * Copyright(c) 2010-2014 Intel Corporation
> + * Copyright(c) 2022 SmartShare Systems
>    */
>   
>   #ifndef _RTE_MEMCPY_X86_64_H_
> @@ -17,6 +18,10 @@
>   #include <rte_vect.h>
>   #include <rte_common.h>
>   #include <rte_config.h>
> +#include <rte_debug.h>
> +
> +#define RTE_MEMCPY_EX_ARCH_DEFINED
> +#include "generic/rte_memcpy.h"
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -868,6 +873,1239 @@ rte_memcpy(void *dst, const void *src, size_t n)
>   		return rte_memcpy_generic(dst, src, n);
>   }
>   
> +/*
> + * Advanced/Non-Temporal Memory Operations.
> + */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Workaround for _mm_stream_load_si128() missing const in the parameter.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__m128i _mm_stream_load_si128_const(const __m128i *const mem_addr)
> +{
> +	/* GCC 4.8.5 (in RHEL7) doesn't support the #pragma to ignore "-Wdiscarded-qualifiers".
> +	 * So we explicitly type cast mem_addr and use the #pragma to ignore "-Wcast-qual".
> +	 */
> +#if defined(RTE_TOOLCHAIN_GCC)
> +#pragma GCC diagnostic push
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +#elif defined(RTE_TOOLCHAIN_CLANG)
> +#pragma clang diagnostic push
> +#pragma clang diagnostic ignored "-Wcast-qual"
> +#endif
> +	return _mm_stream_load_si128((__m128i *)mem_addr);
> +#if defined(RTE_TOOLCHAIN_GCC)
> +#pragma GCC diagnostic pop
> +#elif defined(RTE_TOOLCHAIN_CLANG)
> +#pragma clang diagnostic pop
> +#endif
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Memory copy from non-temporal source area.
> + *
> + * @note
> + * Performance is optimal when source pointer is 16 byte aligned.
> + *
> + * @param dst
> + *   Pointer to the destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(LEN|SRC)nnnA flags.
> + *   The RTE_MEMOPS_F_SRC_NT flag must be set.
> + *   The RTE_MEMOPS_F_DST_NT flag must be clear.
> + *   The RTE_MEMOPS_F_DSTnnnA flags are ignored.
> + *   Must be constant at build time.

Why do the flags need to be build-time constants?

> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nts(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_SRC_NT);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* If source is not 16 byte aligned, then copy first part of data via bounce buffer,
> +	 * to achieve 16 byte alignment of source pointer.
> +	 * This invalidates the source, destination and length alignment flags, and
> +	 * potentially makes the destination pointer unaligned.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) {

An alternative to relying on compiler constant propagation to eliminate 
conditionals when various things are aligned would be to use GCC's 
__builtin_assume_aligned().

The basic pattern then would look something like:

const void *aligned_source;
void *aligned_dst;
size_t aligned_len;

if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
	aligned_source = __builtin_assume_aligned(source, 16);
else
	aligned_source = source;

then you would go on to do the same for dst, and len (w/ some uintptr_t 
casting required).

After this, the code may be written as if the pointers' alignment were not 
known, and the compiler would properly eliminate any sections that deal 
with unaligned cases, when the proper flags are set.
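
Spelled out for dst as well, the same pattern would be (just a sketch):

if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)
	aligned_dst = __builtin_assume_aligned(dst, 16);
else
	aligned_dst = dst;

and the rest of the copy would then operate on aligned_source/aligned_dst 
instead of source/dst. A corresponding hint would be needed for aligned_len.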

Another more radical change would be to just drop all the src, dst, and 
len flags altogether, and provide a __builtin_assume_aligned() wrapper 
instead, for the application to use, i.e.:

#define rte_assume_aligned(ptr, n) __builtin_assume_aligned(ptr, n)

With this API, the user code would look something like:

rte_memcpy_ex(rte_assume_aligned(my_dst, 16), my_src, len, 
RTE_MEMOPS_F_DST_NT);

...if it knew my_dst to have a particular alignment. The rte_memcpy_ex() 
implementation wouldn't assume any particular alignment of any input 
parameters.

> +		/* Source is not known to be 16 byte aligned, but might be. */
> +		/** How many bytes is source offset from 16 byte alignment (floor rounding). */
> +		const size_t    offset = (uintptr_t)src & 15;

I would argue "(uintptr_t)src % 16" is more readable, and it generates 
the same code.

Sorry for breaking up the review into two parts.

> +
> +		if (offset) {
> +			/* Source is not 16 byte aligned. */
> +			char            buffer[16] __rte_aligned(16);
> +			/** How many bytes is source away from 16 byte alignment
> +			 * (ceiling rounding).
> +			 */
> +			const size_t    first = 16 - offset;
> +
> +			xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
> +			_mm_store_si128((void *)buffer, xmm0);
> +
> +			/* Test for short length.
> +			 *
> +			 * Omitted if length is known to be >= 16.
> +			 */
> +			if (!(__builtin_constant_p(len) && len >= 16) &&
> +					unlikely(len <= first)) {
> +				/* Short length. */
> +				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
> +				return;
> +			}
> +
> +			/* Copy until source pointer is 16 byte aligned. */
> +			rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), first);
> +			src = RTE_PTR_ADD(src, first);
> +			dst = RTE_PTR_ADD(dst, first);
> +			len -= first;
> +		}
> +	}
> +
> +	/* Source pointer is now 16 byte aligned. */
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(dst, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, if any, via bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid) and length is known to be 16 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(len & 15)) {
> +		char    buffer[16] __rte_aligned(16);
> +
> +		xmm3 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)buffer, xmm3);
> +		rte_mov15_or_less(dst, buffer, len & 15);
> +	}
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Memory copy to non-temporal destination area.
> + *
> + * @note
> + * If the destination and/or length is unaligned, the first and/or last copied
> + * bytes will be stored in the destination memory area using temporal access.
> + * @note
> + * Performance is optimal when destination pointer is 16 byte aligned.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + *   Any of the RTE_MEMOPS_F_(LEN|DST)nnnA flags.
> + *   The RTE_MEMOPS_F_SRC_NT flag must be clear.
> + *   The RTE_MEMOPS_F_DST_NT flag must be set.
> + *   The RTE_MEMOPS_F_SRCnnnA flags are ignored.
> + *   Must be constant at build time.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ntd(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) == RTE_MEMOPS_F_DST_NT);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ||
> +			len >= 16) {
> +		/* Length >= 16 and/or destination is known to be 16 byte aligned. */
> +		register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +		/* If destination is not 16 byte aligned, then copy first part of data,
> +		 * to achieve 16 byte alignment of destination pointer.
> +		 * This invalidates the source, destination and length alignment flags, and
> +		 * potentially makes the source pointer unaligned.
> +		 *
> +		 * Omitted if destination is known to be 16 byte aligned.
> +		 */
> +		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)) {
> +			/* Destination is not known to be 16 byte aligned, but might be. */
> +			/** How many bytes is destination offset from 16 byte alignment
> +			 * (floor rounding).
> +			 */
> +			const size_t    offset = (uintptr_t)dst & 15;
> +
> +			if (offset) {
> +				/* Destination is not 16 byte aligned. */
> +				/** How many bytes is destination away from 16 byte alignment
> +				 * (ceiling rounding).
> +				 */
> +				const size_t    first = 16 - offset;
> +
> +				if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
> +						(offset & 3) == 0) {
> +					/* Destination is (known to be) 4 byte aligned. */
> +					int32_t r0, r1, r2;
> +
> +					/* Copy until destination pointer is 16 byte aligned. */
> +					if (first & 8) {
> +						memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
> +						memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
> +						_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
> +						_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
> +						src = RTE_PTR_ADD(src, 8);
> +						dst = RTE_PTR_ADD(dst, 8);
> +						len -= 8;
> +					}
> +					if (first & 4) {
> +						memcpy(&r2, src, 4);
> +						_mm_stream_si32(dst, r2);
> +						src = RTE_PTR_ADD(src, 4);
> +						dst = RTE_PTR_ADD(dst, 4);
> +						len -= 4;
> +					}
> +				} else {
> +					/* Destination is not 4 byte aligned. */
> +					/* Copy until destination pointer is 16 byte aligned. */
> +					rte_mov15_or_less(dst, src, first);
> +					src = RTE_PTR_ADD(src, first);
> +					dst = RTE_PTR_ADD(dst, first);
> +					len -= first;
> +				}
> +			}
> +		}
> +
> +		/* Destination pointer is now 16 byte aligned. */
> +		RTE_ASSERT(rte_is_aligned(dst, 16));
> +
> +		/* Copy large portion of data in chunks of 64 byte. */
> +		while (len >= 64) {
> +			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
> +			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
> +			xmm2 = _mm_loadu_si128(RTE_PTR_ADD(src, 2 * 16));
> +			xmm3 = _mm_loadu_si128(RTE_PTR_ADD(src, 3 * 16));
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +			src = RTE_PTR_ADD(src, 64);
> +			dst = RTE_PTR_ADD(dst, 64);
> +			len -= 64;
> +		}
> +
> +		/* Copy following 32 and 16 byte portions of data.
> +		 *
> +		 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +		 * flags are still valid)
> +		 * and length is known to be respectively 64 or 32 byte aligned.
> +		 */
> +		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +				(len & 32)) {
> +			xmm0 = _mm_loadu_si128(RTE_PTR_ADD(src, 0 * 16));
> +			xmm1 = _mm_loadu_si128(RTE_PTR_ADD(src, 1 * 16));
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +			_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +			src = RTE_PTR_ADD(src, 32);
> +			dst = RTE_PTR_ADD(dst, 32);
> +		}
> +		if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +				((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +				(len & 16)) {
> +			xmm2 = _mm_loadu_si128(src);
> +			_mm_stream_si128(dst, xmm2);
> +			src = RTE_PTR_ADD(src, 16);
> +			dst = RTE_PTR_ADD(dst, 16);
> +		}
> +	} else {
> +		/* Length <= 15, and
> +		 * destination is not known to be 16 byte aligned (but might be).
> +		 */
> +		/* If destination is not 4 byte aligned, then
> +		 * use normal copy and return.
> +		 *
> +		 * Omitted if destination is known to be 4 byte aligned.
> +		 */
> +		if (!((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) &&
> +				!rte_is_aligned(dst, 4)) {
> +			/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
> +			rte_mov15_or_less(dst, src, len);
> +			return;
> +		}
> +		/* Destination is (known to be) 4 byte aligned. Proceed. */
> +	}
> +
> +	/* Destination pointer is now 4 byte (or 16 byte) aligned. */
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +
> +	/* Copy following 8 and 4 byte portions of data.
> +	 *
> +	 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 16 or 8 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(len & 8)) {
> +		int32_t r0, r1;
> +
> +		memcpy(&r0, RTE_PTR_ADD(src, 0 * 4), 4);
> +		memcpy(&r1, RTE_PTR_ADD(src, 1 * 4), 4);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), r0);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), r1);
> +		src = RTE_PTR_ADD(src, 8);
> +		dst = RTE_PTR_ADD(dst, 8);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A)) &&
> +			(len & 4)) {
> +		int32_t r2;
> +
> +		memcpy(&r2, src, 4);
> +		_mm_stream_si32(dst, r2);
> +		src = RTE_PTR_ADD(src, 4);
> +		dst = RTE_PTR_ADD(dst, 4);
> +	}
> +
> +	/* Copy remaining 2 and 1 byte portions of data.
> +	 *
> +	 * Omitted if destination is known to be 16 byte aligned (so the alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 4 and 2 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A)) &&
> +			(len & 2)) {
> +		int16_t r3;
> +
> +		memcpy(&r3, src, 2);
> +		*(int16_t *)dst = r3;
> +		src = RTE_PTR_ADD(src, 2);
> +		dst = RTE_PTR_ADD(dst, 2);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A)) &&
> +			(len & 1))
> +		*(char *)dst = *(const char *)src;
> +}
> +
> +/**
> + * Non-temporal memory copy of 15 byte or less
> + * from 16 byte aligned source via bounce buffer.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Only the 4 least significant bits of this parameter are used;
> + *   they hold the number of remaining bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_15_or_less_s16a(void *__rte_restrict dst,
> +		const void *__rte_restrict src, size_t len, const uint64_t flags)
> +{
> +	int32_t             buffer[4] __rte_aligned(16);
> +	register __m128i    xmm0;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if ((len & 15) == 0)
> +		return;
> +
> +	/* Non-temporal load into bounce buffer. */
> +	xmm0 = _mm_stream_load_si128_const(src);
> +	_mm_store_si128((void *)buffer, xmm0);
> +
> +	/* Store from bounce buffer. */
> +	if (((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ||
> +			rte_is_aligned(dst, 4)) {
> +		/* Destination is (known to be) 4 byte aligned. */
> +		src = (const void *)buffer;
> +		if (len & 8) {
> +#ifdef RTE_ARCH_X86_64
> +			if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) {
> +				/* Destination is known to be 8 byte aligned. */
> +				_mm_stream_si64(dst, *(const int64_t *)src);
> +			} else {
> +#endif /* RTE_ARCH_X86_64 */
> +				_mm_stream_si32(RTE_PTR_ADD(dst, 0), buffer[0]);
> +				_mm_stream_si32(RTE_PTR_ADD(dst, 4), buffer[1]);
> +#ifdef RTE_ARCH_X86_64
> +			}
> +#endif /* RTE_ARCH_X86_64 */
> +			src = RTE_PTR_ADD(src, 8);
> +			dst = RTE_PTR_ADD(dst, 8);
> +		}
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN8A) &&
> +				(len & 4)) {
> +			_mm_stream_si32(dst, *(const int32_t *)src);
> +			src = RTE_PTR_ADD(src, 4);
> +			dst = RTE_PTR_ADD(dst, 4);
> +		}
> +
> +		/* Non-temporal store is unavailable for the remaining 3 byte or less. */
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) &&
> +				(len & 2)) {
> +			*(int16_t *)dst = *(const int16_t *)src;
> +			src = RTE_PTR_ADD(src, 2);
> +			dst = RTE_PTR_ADD(dst, 2);
> +		}
> +		if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN2A) &&
> +				(len & 1)) {
> +			*(char *)dst = *(const char *)src;
> +		}
> +	} else {
> +		/* Destination is not 4 byte aligned. Non-temporal store is unavailable. */
> +		rte_mov15_or_less(dst, (const void *)buffer, len & 15);
> +	}
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 16 byte aligned addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 16 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d16s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 16));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
> +		_mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_stream_si128(dst, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) ?
> +				flags : RTE_MEMOPS_F_DST16A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +
> +#ifdef RTE_ARCH_X86_64
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 8/16 byte aligned destination/source addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 8 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d8s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	int64_t             buffer[8] __rte_cache_aligned /* at least __rte_aligned(16) */;
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 8));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
> +		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
> +		_mm_store_si128((void *)&buffer[3 * 2], xmm3);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 4 * 8), buffer[4]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 5 * 8), buffer[5]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 6 * 8), buffer[6]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 7 * 8), buffer[7]);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 2], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 2], xmm1);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[0]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[1]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 2 * 8), buffer[2]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 3 * 8), buffer[3]);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)&buffer[2 * 2], xmm2);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 0 * 8), buffer[4]);
> +		_mm_stream_si64(RTE_PTR_ADD(dst, 1 * 8), buffer[5]);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A) ?
> +				flags : RTE_MEMOPS_F_DST8A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +#endif /* RTE_ARCH_X86_64 */
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 4/16 byte aligned destination/source addresses non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + *   Must be 4 byte aligned.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + *   Must be 16 byte aligned.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d4s16a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	int32_t             buffer[16] __rte_cache_aligned /* at least __rte_aligned(16) */;
> +	register __m128i    xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +	RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Copy large portion of data in chunks of 64 byte. */
> +	while (len >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
> +		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
> +		_mm_store_si128((void *)&buffer[3 * 4], xmm3);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  0 * 4), buffer[0]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  1 * 4), buffer[1]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  2 * 4), buffer[2]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  3 * 4), buffer[3]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  4 * 4), buffer[4]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  5 * 4), buffer[5]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  6 * 4), buffer[6]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  7 * 4), buffer[7]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  8 * 4), buffer[8]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst,  9 * 4), buffer[9]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 10 * 4), buffer[10]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 11 * 4), buffer[11]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 12 * 4), buffer[12]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 13 * 4), buffer[13]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 14 * 4), buffer[14]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 15 * 4), buffer[15]);
> +		src = RTE_PTR_ADD(src, 64);
> +		dst = RTE_PTR_ADD(dst, 64);
> +		len -= 64;
> +	}
> +
> +	/* Copy following 32 and 16 byte portions of data.
> +	 *
> +	 * Omitted if length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A) &&
> +			(len & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_store_si128((void *)&buffer[0 * 4], xmm0);
> +		_mm_store_si128((void *)&buffer[1 * 4], xmm1);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 4 * 4), buffer[4]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 5 * 4), buffer[5]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 6 * 4), buffer[6]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 7 * 4), buffer[7]);
> +		src = RTE_PTR_ADD(src, 32);
> +		dst = RTE_PTR_ADD(dst, 32);
> +	}
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A) &&
> +			(len & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_store_si128((void *)&buffer[2 * 4], xmm2);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[8]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[9]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[10]);
> +		_mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[11]);
> +		src = RTE_PTR_ADD(src, 16);
> +		dst = RTE_PTR_ADD(dst, 16);
> +	}
> +
> +	/* Copy remaining data, 15 byte or less, via bounce buffer.
> +	 *
> +	 * Omitted if length is known to be 16 byte aligned.
> +	 */
> +	if (!((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A))
> +		rte_memcpy_nt_15_or_less_s16a(dst, src, len,
> +				(flags & ~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK)) |
> +				(((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) ?
> +				flags : RTE_MEMOPS_F_DST4A) |
> +				(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) ?
> +				flags : RTE_MEMOPS_F_SRC16A));
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * 4 byte aligned addresses (non-temporal) memory copy.
> + * The memory areas must not overlap.
> + *
> + * @param dst
> + *   Pointer to the (non-temporal) destination memory area.
> + *   Must be 4 byte aligned if using non-temporal store.
> + * @param src
> + *   Pointer to the (non-temporal) source memory area.
> + *   Must be 4 byte aligned if using non-temporal load.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_d4s4a(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	/** How many bytes the source is offset from 16 byte alignment (floor rounding). */
> +	const size_t    offset = (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A ?
> +			0 : (uintptr_t)src & 15;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(rte_is_aligned(dst, 4));
> +	RTE_ASSERT(rte_is_aligned(src, 4));
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	if (offset == 0) {
> +		/* Source is 16 byte aligned. */
> +		/* Copy everything, using upgraded source alignment flags. */
> +		rte_memcpy_nt_d4s16a(dst, src, len,
> +				(flags & ~RTE_MEMOPS_F_SRCA_MASK) | RTE_MEMOPS_F_SRC16A);
> +	} else {
> +		/* Source is not 16 byte aligned, so make it 16 byte aligned. */
> +		int32_t             buffer[4] __rte_aligned(16);
> +		const size_t        first = 16 - offset;
> +		register __m128i    xmm0;
> +
> +		/* First, copy the first part of the data in chunks of 4 byte,
> +		 * to achieve 16 byte alignment of the source.
> +		 * This invalidates the source, destination and length alignment flags,
> +		 * and may change whether the destination pointer is 16 byte aligned.
> +		 */
> +
> +		/** Copy from 16 byte aligned source pointer (floor rounding). */
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_SUB(src, offset));
> +		_mm_store_si128((void *)buffer, xmm0);
> +
> +		if (unlikely(len + offset <= 16)) {
> +			/* Short length. */
> +			if (((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ||
> +					(len & 3) == 0) {
> +				/* Length is 4 byte aligned. */
> +				switch (len) {
> +				case 1 * 4:
> +					/* Offset can be 1 * 4, 2 * 4 or 3 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
> +							buffer[offset / 4]);
> +					break;
> +				case 2 * 4:
> +					/* Offset can be 1 * 4 or 2 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4),
> +							buffer[offset / 4]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4),
> +							buffer[offset / 4 + 1]);
> +					break;
> +				case 3 * 4:
> +					/* Offset can only be 1 * 4. */
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
> +					_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
> +					break;
> +				}
> +			} else {
> +				/* Length is not 4 byte aligned. */
> +				rte_mov15_or_less(dst, RTE_PTR_ADD(buffer, offset), len);
> +			}
> +			return;
> +		}
> +
> +		switch (first) {
> +		case 1 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
> +			break;
> +		case 2 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
> +			break;
> +		case 3 * 4:
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
> +			_mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
> +			break;
> +		}
> +
> +		src = RTE_PTR_ADD(src, first);
> +		dst = RTE_PTR_ADD(dst, first);
> +		len -= first;
> +
> +		/* Source pointer is now 16 byte aligned. */
> +		RTE_ASSERT(rte_is_aligned(src, 16));
> +
> +		/* Then, copy the rest, using corrected alignment flags. */
> +		if (rte_is_aligned(dst, 16))
> +			rte_memcpy_nt_d16s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +#ifdef RTE_ARCH_X86_64
> +		else if (rte_is_aligned(dst, 8))
> +			rte_memcpy_nt_d8s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST8A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +#endif /* RTE_ARCH_X86_64 */
> +		else
> +			rte_memcpy_nt_d4s16a(dst, src, len, (flags &
> +					~(RTE_MEMOPS_F_DSTA_MASK | RTE_MEMOPS_F_SRCA_MASK |
> +					RTE_MEMOPS_F_LENA_MASK)) |
> +					RTE_MEMOPS_F_DST4A | RTE_MEMOPS_F_SRC16A |
> +					(((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A) ?
> +					RTE_MEMOPS_F_LEN4A : (flags & RTE_MEMOPS_F_LEN2A)));
> +	}
> +}
> +
> +#ifndef RTE_MEMCPY_NT_BUFSIZE
> +
> +#include <lib/mbuf/rte_mbuf_core.h>
> +
> +/** Bounce buffer size for non-temporal memcpy.
> + *
> + * Must be 2^N and >= 128.
> + * The actual buffer will be slightly larger, due to added padding.
> + * The default is chosen to be able to handle a non-segmented packet.
> + */
> +#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM
> +
> +#endif  /* RTE_MEMCPY_NT_BUFSIZE */
> +
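Since the definition is guarded by #ifndef, an application can supply its own
bounce buffer size by defining RTE_MEMCPY_NT_BUFSIZE before the header is
first included. A minimal sketch (the value 4096 is only an illustration; any
power of two of at least 128 satisfies the requirement stated above):

#define RTE_MEMCPY_NT_BUFSIZE 4096	/* must be 2^N and >= 128 */
#include <rte_memcpy.h>
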
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Non-temporal memory copy via bounce buffer.
> + *
> + * @note
> + * If the destination and/or length is unaligned, the first and/or last copied
> + * bytes will be stored in the destination memory area using temporal access.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_buf(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +	/** Cache line aligned bounce buffer with preceding and trailing padding.
> +	 *
> +	 * The preceding padding is one cache line, so the data area itself
> +	 * is cache line aligned.
> +	 * The trailing padding is 16 bytes, leaving room for the trailing bytes
> +	 * of a 16 byte store operation.
> +	 */
> +	char			buffer[RTE_CACHE_LINE_SIZE + RTE_MEMCPY_NT_BUFSIZE +  16]
> +				__rte_cache_aligned;
> +	/** Pointer to bounce buffer's aligned data area. */
> +	char		* const buf0 = &buffer[RTE_CACHE_LINE_SIZE];
> +	void		       *buf;
> +	/** Number of bytes to copy from source, incl. any extra preceding bytes. */
> +	size_t			srclen;
> +	register __m128i	xmm0, xmm1, xmm2, xmm3;
> +
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	RTE_ASSERT((flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) ==
> +			(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT));
> +	RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);
> +
> +	if (unlikely(len == 0))
> +		return;
> +
> +	/* Step 1:
> +	 * Copy data from the source to the bounce buffer's aligned data area,
> +	 * using aligned non-temporal load from the source,
> +	 * and unaligned store in the bounce buffer.
> +	 *
> +	 * If the source is unaligned, the additional bytes preceding the data will be copied
> +	 * to the padding area preceding the bounce buffer's aligned data area.
> +	 * Similarly, if the source data ends at an unaligned address, the additional bytes
> +	 * trailing the data will be copied to the padding area trailing the bounce buffer's
> +	 * aligned data area.
> +	 */
> +
> +	/* Adjust for extra preceding bytes, unless source is known to be 16 byte aligned. */
> +	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) {
> +		buf = buf0;
> +		srclen = len;
> +	} else {
> +		/** How many bytes the source is offset from 16 byte alignment (floor rounding). */
> +		const size_t offset = (uintptr_t)src & 15;
> +
> +		buf = RTE_PTR_SUB(buf0, offset);
> +		src = RTE_PTR_SUB(src, offset);
> +		srclen = len + offset;
> +	}
> +
> +	/* Copy large portion of data from source to bounce buffer in chunks of 64 byte. */
> +	while (srclen >= 64) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
> +		xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
> +		src = RTE_PTR_ADD(src, 64);
> +		buf = RTE_PTR_ADD(buf, 64);
> +		srclen -= 64;
> +	}
> +
> +	/* Copy remaining 32 and 16 byte portions of data from source to bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the length alignment
> +	 * flags are still valid)
> +	 * and length is known to be respectively 64 or 32 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A)) &&
> +			(srclen & 32)) {
> +		xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
> +		xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
> +		_mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
> +		src = RTE_PTR_ADD(src, 32);
> +		buf = RTE_PTR_ADD(buf, 32);
> +	}
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN32A)) &&
> +			(srclen & 16)) {
> +		xmm2 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(buf, xmm2);
> +		src = RTE_PTR_ADD(src, 16);
> +		buf = RTE_PTR_ADD(buf, 16);
> +	}
> +	/* Copy any trailing bytes of data from source to bounce buffer.
> +	 *
> +	 * Omitted if source is known to be 16 byte aligned (so the length alignment
> +	 * flags are still valid)
> +	 * and length is known to be 16 byte aligned.
> +	 */
> +	if (!(((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A) &&
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)) &&
> +			(srclen & 15)) {
> +		xmm3 = _mm_stream_load_si128_const(src);
> +		_mm_storeu_si128(buf, xmm3);
> +	}
> +
> +	/* Step 2:
> +	 * Copy from the aligned bounce buffer to the non-temporal destination.
> +	 */
> +	rte_memcpy_ntd(dst, buf0, len,
> +			(flags & ~(RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_SRCA_MASK)) |
> +			(RTE_CACHE_LINE_SIZE << RTE_MEMOPS_F_SRCA_SHIFT));
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Non-temporal memory copy.
> + * The memory areas must not overlap.
> + *
> + * @note
> + * If the destination and/or length is unaligned, some copied bytes will be
> + * stored in the destination memory area using temporal access.
> + *
> + * @param dst
> + *   Pointer to the non-temporal destination memory area.
> + * @param src
> + *   Pointer to the non-temporal source memory area.
> + * @param len
> + *   Number of bytes to copy.
> + * @param flags
> + *   Hints for memory access.
> + */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_nt_generic(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +
> +	while (len > RTE_MEMCPY_NT_BUFSIZE) {
> +		rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE,
> +				(flags & ~RTE_MEMOPS_F_LENA_MASK) | RTE_MEMOPS_F_LEN128A);
> +		dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
> +		src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
> +		len -= RTE_MEMCPY_NT_BUFSIZE;
> +	}
> +	rte_memcpy_nt_buf(dst, src, len, flags);
> +}
> +
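To make the chunking concrete: taking RTE_MEMCPY_NT_BUFSIZE as 2048 for
illustration, a 5000 byte copy runs the loop twice (2 x 2048 bytes, each pass
tagged RTE_MEMOPS_F_LEN128A, which always holds because the buffer size is a
power of two of at least 128), and the final call copies the remaining 904
bytes with the caller's original length alignment flags.
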
> +/* Implementation. Refer to function declaration for documentation. */
> +__rte_experimental
> +static __rte_always_inline
> +__attribute__((__nonnull__(1, 2)))
> +#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> +__attribute__((__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> +#endif
> +void rte_memcpy_ex(void *__rte_restrict dst, const void *__rte_restrict src, size_t len,
> +		const uint64_t flags)
> +{
> +#ifndef RTE_TOOLCHAIN_CLANG /* Clang doesn't support using __builtin_constant_p() like this. */
> +	RTE_BUILD_BUG_ON(!__builtin_constant_p(flags));
> +#endif /* !RTE_TOOLCHAIN_CLANG */
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_DSTA_MASK) || rte_is_aligned(dst,
> +			(flags & RTE_MEMOPS_F_DSTA_MASK) >> RTE_MEMOPS_F_DSTA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_SRCA_MASK) || rte_is_aligned(src,
> +			(flags & RTE_MEMOPS_F_SRCA_MASK) >> RTE_MEMOPS_F_SRCA_SHIFT));
> +	RTE_ASSERT(!(flags & RTE_MEMOPS_F_LENA_MASK) || (len &
> +			((flags & RTE_MEMOPS_F_LENA_MASK) >> RTE_MEMOPS_F_LENA_SHIFT) - 1) == 0);
> +
> +	if ((flags & (RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) ==
> +			(RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT)) {
> +		/* Copy between non-temporal source and destination. */
> +		if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d16s16a(dst, src, len, flags);
> +#ifdef RTE_ARCH_X86_64
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST8A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d8s16a(dst, src, len, flags);
> +#endif /* RTE_ARCH_X86_64 */
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
> +			rte_memcpy_nt_d4s16a(dst, src, len, flags);
> +		else if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
> +				(flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
> +			rte_memcpy_nt_d4s4a(dst, src, len, flags);
> +		else if (len <= RTE_MEMCPY_NT_BUFSIZE)
> +			rte_memcpy_nt_buf(dst, src, len, flags);
> +		else
> +			rte_memcpy_nt_generic(dst, src, len, flags);
> +	} else if (flags & RTE_MEMOPS_F_SRC_NT) {
> +		/* Copy from non-temporal source. */
> +		rte_memcpy_nts(dst, src, len, flags);
> +	} else if (flags & RTE_MEMOPS_F_DST_NT) {
> +		/* Copy to non-temporal destination. */
> +		rte_memcpy_ntd(dst, src, len, flags);
> +	} else
> +		rte_memcpy(dst, src, len);
> +}
> +
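Pulling the pieces together, a caller-side sketch of the capture use case
(the buffer names and their 16 byte alignment are assumptions made for the
example, not taken from the patch; the flags must be a compile-time constant
so that the dispatch above resolves at build time):

/* Copy a filtered packet into a capture buffer without polluting the cache.
 * 'cap' and 'pkt' are assumed to be 16 byte aligned, 'len' is the copy size.
 */
rte_memcpy_ex(cap, pkt, len,
		RTE_MEMOPS_F_DST_NT | RTE_MEMOPS_F_SRC_NT |
		RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A);

/* Make the non-temporal stores globally observable before publishing 'cap',
 * e.g. before enqueuing a pointer to it on a ring for another lcore.
 */
rte_wmb();
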
>   #undef ALIGNMENT_MASK
>   
>   #if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
> diff --git a/lib/mbuf/rte_mbuf.c b/lib/mbuf/rte_mbuf.c
> index a2307cebe6..aa96fb4cc8 100644
> --- a/lib/mbuf/rte_mbuf.c
> +++ b/lib/mbuf/rte_mbuf.c
> @@ -660,6 +660,83 @@ rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
>   	return mc;
>   }
>   
> +/* Create a deep copy of mbuf, using non-temporal memory access */
> +struct rte_mbuf *
> +rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
> +		 uint32_t off, uint32_t len, const uint64_t flags)
> +{
> +	const struct rte_mbuf *seg = m;
> +	struct rte_mbuf *mc, *m_last, **prev;
> +
> +	/* garbage in check */
> +	__rte_mbuf_sanity_check(m, 1);
> +
> +	/* check for request to copy at offset past end of mbuf */
> +	if (unlikely(off >= m->pkt_len))
> +		return NULL;
> +
> +	mc = rte_pktmbuf_alloc(mp);
> +	if (unlikely(mc == NULL))
> +		return NULL;
> +
> +	/* truncate requested length to available data */
> +	if (len > m->pkt_len - off)
> +		len = m->pkt_len - off;
> +
> +	__rte_pktmbuf_copy_hdr(mc, m);
> +
> +	/* copied mbuf is not indirect or external */
> +	mc->ol_flags = m->ol_flags & ~(RTE_MBUF_F_INDIRECT|RTE_MBUF_F_EXTERNAL);
> +
> +	prev = &mc->next;
> +	m_last = mc;
> +	while (len > 0) {
> +		uint32_t copy_len;
> +
> +		/* skip leading mbuf segments */
> +		while (off >= seg->data_len) {
> +			off -= seg->data_len;
> +			seg = seg->next;
> +		}
> +
> +		/* current buffer is full, chain a new one */
> +		if (rte_pktmbuf_tailroom(m_last) == 0) {
> +			m_last = rte_pktmbuf_alloc(mp);
> +			if (unlikely(m_last == NULL)) {
> +				rte_pktmbuf_free(mc);
> +				return NULL;
> +			}
> +			++mc->nb_segs;
> +			*prev = m_last;
> +			prev = &m_last->next;
> +		}
> +
> +		/*
> +		 * copy the min of data in input segment (seg)
> +		 * vs space available in output (m_last)
> +		 */
> +		copy_len = RTE_MIN(seg->data_len - off, len);
> +		if (copy_len > rte_pktmbuf_tailroom(m_last))
> +			copy_len = rte_pktmbuf_tailroom(m_last);
> +
> +		/* append from seg to m_last */
> +		rte_memcpy_ex(rte_pktmbuf_mtod_offset(m_last, char *,
> +						   m_last->data_len),
> +			   rte_pktmbuf_mtod_offset(seg, char *, off),
> +			   copy_len, flags);
> +
> +		/* update offsets and lengths */
> +		m_last->data_len += copy_len;
> +		mc->pkt_len += copy_len;
> +		off += copy_len;
> +		len -= copy_len;
> +	}
> +
> +	/* garbage out check */
> +	__rte_mbuf_sanity_check(mc, 1);
> +	return mc;
> +}
> +
>   /* dump a mbuf on console */
>   void
>   rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index b6e23d98ce..030df396a3 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -1443,6 +1443,38 @@ struct rte_mbuf *
>   rte_pktmbuf_copy(const struct rte_mbuf *m, struct rte_mempool *mp,
>   		 uint32_t offset, uint32_t length);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Create a full copy of a given packet mbuf,
> + * using non-temporal memory access as specified by flags.
> + *
> + * Copies all the data from a given packet mbuf to a newly allocated
> + * set of mbufs. The private data are not copied.
> + *
> + * @param m
> + *   The packet mbuf to be copied.
> + * @param mp
> + *   The mempool from which the "clone" mbufs are allocated.
> + * @param offset
> + *   The number of bytes to skip before copying.
> + *   If the mbuf does not have that many bytes, it is an error
> + *   and NULL is returned.
> + * @param length
> + *   The upper limit on bytes to copy.  Passing UINT32_MAX
> + *   means all data (after offset).
> + * @param flags
> + *   Non-temporal memory access hints for rte_memcpy_ex.
> + * @return
> + *   - The pointer to the new "clone" mbuf on success.
> + *   - NULL if allocation fails.
> + */
> +__rte_experimental
> +struct rte_mbuf *
> +rte_pktmbuf_copy_ex(const struct rte_mbuf *m, struct rte_mempool *mp,
> +		    uint32_t offset, uint32_t length, const uint64_t flags);
> +
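A minimal usage sketch (the mempool name is a placeholder; UINT32_MAX
requests a full copy, as documented above):

struct rte_mbuf *copy;

copy = rte_pktmbuf_copy_ex(m, capture_pool, 0, UINT32_MAX,
		RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
if (copy == NULL) {
	/* Allocation failure, or the offset is past the end of the mbuf. */
}

As with any non-temporal store, an rte_wmb() is needed before the copy is
handed to another lcore; the pdump change below adds exactly that.
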
>   /**
>    * Adds given value to the refcnt of all packet mbuf segments.
>    *
> diff --git a/lib/mbuf/version.map b/lib/mbuf/version.map
> index ed486ed14e..b583364ad4 100644
> --- a/lib/mbuf/version.map
> +++ b/lib/mbuf/version.map
> @@ -47,5 +47,6 @@ EXPERIMENTAL {
>   	global:
>   
>   	rte_pktmbuf_pool_create_extbuf;
> +	rte_pktmbuf_copy_ex;
>   
>   };
> diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
> index af2b814251..ae871c4865 100644
> --- a/lib/pcapng/rte_pcapng.c
> +++ b/lib/pcapng/rte_pcapng.c
> @@ -466,7 +466,8 @@ rte_pcapng_copy(uint16_t port_id, uint32_t queue,
>   	orig_len = rte_pktmbuf_pkt_len(md);
>   
>   	/* Take snapshot of the data */
> -	mc = rte_pktmbuf_copy(md, mp, 0, length);
> +	mc = rte_pktmbuf_copy_ex(md, mp, 0, length,
> +				 RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
>   	if (unlikely(mc == NULL))
>   		return NULL;
>   
> diff --git a/lib/pdump/rte_pdump.c b/lib/pdump/rte_pdump.c
> index 98dcbc037b..6e61c75407 100644
> --- a/lib/pdump/rte_pdump.c
> +++ b/lib/pdump/rte_pdump.c
> @@ -124,7 +124,8 @@ pdump_copy(uint16_t port_id, uint16_t queue,
>   					    pkts[i], mp, cbs->snaplen,
>   					    ts, direction);
>   		else
> -			p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
> +			p = rte_pktmbuf_copy_ex(pkts[i], mp, 0, cbs->snaplen,
> +						RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
>   
>   		if (unlikely(p == NULL))
>   			__atomic_fetch_add(&stats->nombuf, 1, __ATOMIC_RELAXED);
> @@ -134,6 +135,9 @@ pdump_copy(uint16_t port_id, uint16_t queue,
>   
>   	__atomic_fetch_add(&stats->accepted, d_pkts, __ATOMIC_RELAXED);
>   
> +	/* Flush the non-temporal stores of the packet copies. */
> +	rte_wmb();
> +
>   	ring_enq = rte_ring_enqueue_burst(ring, (void *)dup_bufs, d_pkts, NULL);
>   	if (unlikely(ring_enq < d_pkts)) {
>   		unsigned int drops = d_pkts - ring_enq;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] eal: non-temporal memcpy
  2022-10-10  6:46 ` [PATCH v4] " Morten Brørup
  2022-10-16 14:27   ` Mattias Rönnblom
  2022-10-16 19:55   ` Mattias Rönnblom
@ 2023-07-31 12:14   ` Thomas Monjalon
  2023-07-31 12:25     ` Morten Brørup
  2 siblings, 1 reply; 17+ messages in thread
From: Thomas Monjalon @ 2023-07-31 12:14 UTC (permalink / raw)
  To: Morten Brørup
  Cc: hofors, bruce.richardson, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen, dev, mattias.ronnblom, kda, drc,
	dev, andrew.rybchenko, olivier.matz, anatoly.burakov,
	dmitry.kozliuk

Hello,

What's the status of this feature?


10/10/2022 08:46, Morten Brørup:
> This patch provides a function for memory copy using non-temporal store,
> load or both, controlled by flags passed to the function.
> 
> Applications sometimes copy data to another memory location, which is only
> used much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
> 
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
> 
> The purpose of the function is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput can be improved by further optimization, I do not
> have time to do it now.
> 
> The functional tests and performance tests for memory copy have been
> expanded to include non-temporal copying.
> 
> A non-temporal version of the mbuf library's function to create a full
> copy of a given packet mbuf is provided.
> 
> The packet capture and packet dump libraries have been updated to use
> non-temporal memory copy of the packets.
> 
> Implementation notes:
> 
> Implementations for non-x86 architectures can be provided by anyone at a
> later time. I am not going to do it.
> 
> x86 non-temporal load instructions must be 16 byte aligned [1], and
> non-temporal store instructions must be 4, 8 or 16 byte aligned [2].
> 
> ARM non-temporal load and store instructions seem to require 4 byte
> alignment [3].
> 
> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_load
> [2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_si
> [3] https://developer.arm.com/documentation/100076/0100/
> A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
> LDNP--SIMD-and-FP-
> 
> This patch is a major rewrite from the RFC v3, so no version log comparing
> to the RFC is provided.
> 
> v4
> > * Also ignore the warning for clang in the workaround for
>   _mm_stream_load_si128() missing const in the parameter.
> * Add missing C linkage specifier in rte_memcpy.h.
> 
> v3
> * _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
>   use it on 64-bit x86 architecture.
> * CLANG warns that _mm_stream_load_si128_const() and
>   rte_memcpy_nt_15_or_less_s16a() are not public,
>   so remove __rte_internal from them. It also affects the documentation
>   for the functions, so the fix can't be limited to CLANG.
> * Use __rte_experimental instead of __rte_internal.
> * Replace <n> with nnn in function documentation; it doesn't look like
>   HTML.
> * Slightly modify the workaround for _mm_stream_load_si128() missing const
>   in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
>   #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
>   #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
> * Fixed one coding style issue missed in v2.
> 
> v2
> * The last 16 byte block of data, incl. any trailing bytes, were not
>   copied from the source memory area in rte_memcpy_nt_buf().
> * Fix many coding style issues.
> * Add some missing header files.
> * Fix build time warning for non-x86 architectures by using a different
>   method to mark the flags parameter unused.
> * CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
>   so omit it when using CLANG.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  app/test/test_memcpy.c               |   65 +-
>  app/test/test_memcpy_perf.c          |  187 ++--
>  lib/eal/include/generic/rte_memcpy.h |  127 +++
>  lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
>  lib/mbuf/rte_mbuf.c                  |   77 ++
>  lib/mbuf/rte_mbuf.h                  |   32 +
>  lib/mbuf/version.map                 |    1 +
>  lib/pcapng/rte_pcapng.c              |    3 +-
>  lib/pdump/rte_pdump.c                |    6 +-
>  9 files changed, 1645 insertions(+), 91 deletions(-)





^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH v4] eal: non-temporal memcpy
  2023-07-31 12:14   ` Thomas Monjalon
@ 2023-07-31 12:25     ` Morten Brørup
  2023-08-04  5:49       ` Mattias Rönnblom
  0 siblings, 1 reply; 17+ messages in thread
From: Morten Brørup @ 2023-07-31 12:25 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: hofors, bruce.richardson, konstantin.v.ananyev,
	Honnappa.Nagarahalli, stephen, dev, mattias.ronnblom, kda, drc,
	dev, andrew.rybchenko, olivier.matz, anatoly.burakov,
	dmitry.kozliuk

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Monday, 31 July 2023 14.14
> 
> Hello,
> 
> What's the status of this feature?

I haven't given up on upstreaming this feature, but there doesn't seem to be much demand for it, so working on it has low priority.

> 
> 
> 10/10/2022 08:46, Morten Brørup:
> > This patch provides a function for memory copy using non-temporal store,
> > load or both, controlled by flags passed to the function.
> >
> > Applications sometimes copy data to another memory location, which is only
> > used much later.
> > In this case, it is inefficient to pollute the data cache with the copied
> > data.
> >
> > An example use case (originating from a real life application):
> > Copying filtered packets, or the first part of them, into a capture buffer
> > for offline analysis.
> >
> > The purpose of the function is to achieve a performance gain by not
> > polluting the cache when copying data.
> > Although the throughput can be improved by further optimization, I do not
> > have time to do it now.
> >
> > The functional tests and performance tests for memory copy have been
> > expanded to include non-temporal copying.
> >
> > A non-temporal version of the mbuf library's function to create a full
> > copy of a given packet mbuf is provided.
> >
> > The packet capture and packet dump libraries have been updated to use
> > non-temporal memory copy of the packets.
> >
> > Implementation notes:
> >
> > Implementations for non-x86 architectures can be provided by anyone at a
> > later time. I am not going to do it.
> >
> > x86 non-temporal load instructions must be 16 byte aligned [1], and
> > non-temporal store instructions must be 4, 8 or 16 byte aligned [2].
> >
> > ARM non-temporal load and store instructions seem to require 4 byte
> > alignment [3].
> >
> > [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> > index.html#text=_mm_stream_load
> > [2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> > index.html#text=_mm_stream_si
> > [3] https://developer.arm.com/documentation/100076/0100/
> > A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
> > LDNP--SIMD-and-FP-
> >
> > This patch is a major rewrite from the RFC v3, so no version log comparing
> > to the RFC is provided.
> >
> > v4
> > * Also ignore the warning for clang in the workaround for
> >   _mm_stream_load_si128() missing const in the parameter.
> > * Add missing C linkage specifier in rte_memcpy.h.
> >
> > v3
> > * _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
> >   use it on 64-bit x86 architecture.
> > * CLANG warns that _mm_stream_load_si128_const() and
> >   rte_memcpy_nt_15_or_less_s16a() are not public,
> >   so remove __rte_internal from them. It also affects the documentation
> >   for the functions, so the fix can't be limited to CLANG.
> > * Use __rte_experimental instead of __rte_internal.
> > * Replace <n> with nnn in function documentation; it doesn't look like
> >   HTML.
> > * Slightly modify the workaround for _mm_stream_load_si128() missing const
> >   in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
> >   #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
> >   #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
> > * Fixed one coding style issue missed in v2.
> >
> > v2
> > * The last 16 byte block of data, incl. any trailing bytes, were not
> >   copied from the source memory area in rte_memcpy_nt_buf().
> > * Fix many coding style issues.
> > * Add some missing header files.
> > * Fix build time warning for non-x86 architectures by using a different
> >   method to mark the flags parameter unused.
> > * CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
> >   so omit it when using CLANG.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  app/test/test_memcpy.c               |   65 +-
> >  app/test/test_memcpy_perf.c          |  187 ++--
> >  lib/eal/include/generic/rte_memcpy.h |  127 +++
> >  lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
> >  lib/mbuf/rte_mbuf.c                  |   77 ++
> >  lib/mbuf/rte_mbuf.h                  |   32 +
> >  lib/mbuf/version.map                 |    1 +
> >  lib/pcapng/rte_pcapng.c              |    3 +-
> >  lib/pdump/rte_pdump.c                |    6 +-
> >  9 files changed, 1645 insertions(+), 91 deletions(-)
> 
> 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4] eal: non-temporal memcpy
  2023-07-31 12:25     ` Morten Brørup
@ 2023-08-04  5:49       ` Mattias Rönnblom
  0 siblings, 0 replies; 17+ messages in thread
From: Mattias Rönnblom @ 2023-08-04  5:49 UTC (permalink / raw)
  To: Morten Brørup, Thomas Monjalon
  Cc: bruce.richardson, konstantin.v.ananyev, Honnappa.Nagarahalli,
	stephen, dev, mattias.ronnblom, kda, drc, andrew.rybchenko,
	olivier.matz, anatoly.burakov, dmitry.kozliuk

On 2023-07-31 14:25, Morten Brørup wrote:
>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>> Sent: Monday, 31 July 2023 14.14
>>
>> Hello,
>>
>> What's the status of this feature?
> 
> I haven't given up on upstreaming this feature, but there doesn't seem to be much demand for it, so working on it has low priority.
> 

This would definitely be a useful addition to the EAL, IMO.

It's also a case where it's difficult to provide a generic and portable 
solution with both good performance and reasonable semantics. The upside
is that you seem to have come pretty far already.

>>
>>
>> 10/10/2022 08:46, Morten Brørup:
>>> This patch provides a function for memory copy using non-temporal store,
>>> load or both, controlled by flags passed to the function.
>>>
>>> Applications sometimes copy data to another memory location, which is only
>>> used much later.
>>> In this case, it is inefficient to pollute the data cache with the copied
>>> data.
>>>
>>> An example use case (originating from a real life application):
>>> Copying filtered packets, or the first part of them, into a capture buffer
>>> for offline analysis.
>>>
>>> The purpose of the function is to achieve a performance gain by not
>>> polluting the cache when copying data.
>>> Although the throughput can be improved by further optimization, I do not
>>> have time to do it now.
>>>
>>> The functional tests and performance tests for memory copy have been
>>> expanded to include non-temporal copying.
>>>
>>> A non-temporal version of the mbuf library's function to create a full
>>> copy of a given packet mbuf is provided.
>>>
>>> The packet capture and packet dump libraries have been updated to use
>>> non-temporal memory copy of the packets.
>>>
>>> Implementation notes:
>>>
>>> Implementations for non-x86 architectures can be provided by anyone at a
>>> later time. I am not going to do it.
>>>
>>> x86 non-temporal load instructions must be 16 byte aligned [1], and
>>> non-temporal store instructions must be 4, 8 or 16 byte aligned [2].
>>>
>>> ARM non-temporal load and store instructions seem to require 4 byte
>>> alignment [3].
>>>
>>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
>>> index.html#text=_mm_stream_load
>>> [2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
>>> index.html#text=_mm_stream_si
>>> [3] https://developer.arm.com/documentation/100076/0100/
>>> A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
>>> LDNP--SIMD-and-FP-
>>>
>>> This patch is a major rewrite from the RFC v3, so no version log comparing
>>> to the RFC is provided.
>>>
>>> v4
>>> * Also ignore the warning for clang in the workaround for
>>>    _mm_stream_load_si128() missing const in the parameter.
>>> * Add missing C linkage specifier in rte_memcpy.h.
>>>
>>> v3
>>> * _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
>>>    use it on 64-bit x86 architecture.
>>> * CLANG warns that _mm_stream_load_si128_const() and
>>>    rte_memcpy_nt_15_or_less_s16a() are not public,
>>>    so remove __rte_internal from them. It also affects the documentation
>>>    for the functions, so the fix can't be limited to CLANG.
>>> * Use __rte_experimental instead of __rte_internal.
>>> * Replace <n> with nnn in function documentation; it doesn't look like
>>>    HTML.
>>> * Slightly modify the workaround for _mm_stream_load_si128() missing const
>>>    in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
>>>    #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
>>>    #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
>>> * Fixed one coding style issue missed in v2.
>>>
>>> v2
>>> * The last 16 byte block of data, incl. any trailing bytes, were not
>>>    copied from the source memory area in rte_memcpy_nt_buf().
>>> * Fix many coding style issues.
>>> * Add some missing header files.
>>> * Fix build time warning for non-x86 architectures by using a different
>>>    method to mark the flags parameter unused.
>>> * CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
>>>    so omit it when using CLANG.
>>>
>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>>> ---
>>>   app/test/test_memcpy.c               |   65 +-
>>>   app/test/test_memcpy_perf.c          |  187 ++--
>>>   lib/eal/include/generic/rte_memcpy.h |  127 +++
>>>   lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
>>>   lib/mbuf/rte_mbuf.c                  |   77 ++
>>>   lib/mbuf/rte_mbuf.h                  |   32 +
>>>   lib/mbuf/version.map                 |    1 +
>>>   lib/pcapng/rte_pcapng.c              |    3 +-
>>>   lib/pdump/rte_pdump.c                |    6 +-
>>>   9 files changed, 1645 insertions(+), 91 deletions(-)
>>
>>
>>
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [RFC v3] non-temporal memcpy
@ 2022-09-07  9:22 Morten Brørup
  0 siblings, 0 replies; 17+ messages in thread
From: Morten Brørup @ 2022-09-07  9:22 UTC (permalink / raw)
  To: dev
  Cc: Bruce Richardson, Konstantin Ananyev, Honnappa Nagarahalli,
	Stephen Hemminger, Mattias Rönnblom

> From: Morten Brørup
> Sent: Friday, 19 August 2022 15.58
> 
> This RFC proposes a set of functions optimized for non-temporal memory
> copy.
> 
> At this stage, I am asking for acceptance of the concept and API.
> Feedback on the x86 implementation is also welcome.

Potential reviewers: An updated version will follow. Please don't review this version.

-Morten


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-08-04  5:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-19 13:58 [RFC v3] non-temporal memcpy Morten Brørup
2022-10-06 20:34 ` [PATCH] eal: " Morten Brørup
2022-10-10  7:35   ` Morten Brørup
2022-10-10  8:58     ` Mattias Rönnblom
2022-10-10  9:36       ` Morten Brørup
2022-10-10 11:58         ` Stanislaw Kardach
2022-10-10  9:57       ` Bruce Richardson
2022-10-11  9:25     ` Konstantin Ananyev
2022-10-07 10:19 ` [PATCH v2] " Morten Brørup
2022-10-09 15:35 ` [PATCH v3] " Morten Brørup
2022-10-10  6:46 ` [PATCH v4] " Morten Brørup
2022-10-16 14:27   ` Mattias Rönnblom
2022-10-16 19:55   ` Mattias Rönnblom
2023-07-31 12:14   ` Thomas Monjalon
2023-07-31 12:25     ` Morten Brørup
2023-08-04  5:49       ` Mattias Rönnblom
2022-09-07  9:22 [RFC v3] " Morten Brørup
