DPDK patches and discussions
* RE: [RFC v3] non-temporal memcpy
@ 2022-09-07  9:22 Morten Brørup
  0 siblings, 0 replies; 2+ messages in thread
From: Morten Brørup @ 2022-09-07  9:22 UTC (permalink / raw)
  To: dev
  Cc: Bruce Richardson, Konstantin Ananyev, Honnappa Nagarahalli,
	Stephen Hemminger, Mattias Rönnblom

> From: Morten Brørup
> Sent: Friday, 19 August 2022 15.58
> 
> This RFC proposes a set of functions optimized for non-temporal memory
> copy.
> 
> At this stage, I am asking for acceptance of the concept and API.
> Feedback on the x86 implementation is also welcome.

Potential reviewers: An updated version will follow. Please don't review this version.

-Morten



* [RFC v3] non-temporal memcpy
@ 2022-08-19 13:58 Morten Brørup
  0 siblings, 0 replies; 2+ messages in thread
From: Morten Brørup @ 2022-08-19 13:58 UTC (permalink / raw)
  To: dev
  Cc: Bruce Richardson, Konstantin Ananyev, Honnappa Nagarahalli,
	Stephen Hemminger, Mattias Rönnblom

This RFC proposes a set of functions optimized for non-temporal memory copy.

At this stage, I am asking for acceptance of the concept and API.
Feedback on the x86 implementation is also welcome.

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.
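
As an illustration, such a capture path could use the rte_memcpy_ex() API
proposed below roughly as follows. This is a hypothetical sketch; the
capture_buf, capture_snap_len, pkts and nb_pkts names are only examples:

    /* Copy the first part of each filtered packet into a capture buffer,
     * without polluting the data cache with the copied data.
     */
    for (i = 0; i < nb_pkts; i++)
        rte_memcpy_ex(capture_buf[i],
                rte_pktmbuf_mtod(pkts[i], const void *),
                RTE_MIN((size_t)pkts[i]->data_len, capture_snap_len),
                RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
    rte_wmb(); /* flush the non-temporal stores after the sequence of copies */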

The purpose of the functions is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput may be improved by further optimization, I do not
consider such optimization relevant at this stage.

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions have 16 byte alignment requirements [1].
ARM non-temporal load instructions are available with 4 byte alignment
requirements [2].
Both platforms offer non-temporal store instructions with 4 byte alignment
requirements.
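
For illustration, a minimal sketch of the x86 intrinsics involved (not part
of the patch; assumes suitably aligned pointers):

    #include <stdint.h>
    #include <immintrin.h>

    /* Copy 16 bytes non-temporally on x86. The load address must be 16 byte
     * aligned; the 4 byte non-temporal store _mm_stream_si32() would accept
     * a 4 byte aligned address instead.
     */
    static inline void
    nt_copy16(void *dst, const void *src)
    {
        __m128i v = _mm_stream_load_si128((__m128i *)(uintptr_t)src);
        _mm_stream_si128((__m128i *)dst, v);
    }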

In addition to the general function without any alignment requirements, I
have also implemented functions for 16 and 4 byte aligned access,
respectively, for performance purposes.

NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
style and use TAB.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-

V2:
- Only copy from non-temporal source to non-temporal destination.
  I.e. remove the two variants with only source and/or destination being
  non-temporal.
- Do not require alignment.
  Instead, offer additional 4 and 16 byte aligned functions for performance
  purposes.
- Implemented two of the functions for x86.
- Remove memset function.

V3:
- Only one generic function is exposed in the API:
  rte_memcpy_ex(dst, src, len, flags),
  which should be called with the flags constant at build time.
- Requests for non-temporal source/destination memory access are now flags.
- Alignment hints are now flags.
- The functions for various alignments are not part of the declaration.
  They are implementation specific.
- Implemented the generic and unaligned functions for x86.
- Added note about rte_wmb() to non-temporal store.
- Variants for normal load and non-temporal store, as well as
  variants for non-temporal load and normal store are not implemented.
  They can be added later.
- Improved the workaround for _mm_stream_load_si128() not taking a const
  pointer as parameter.
- Extensive use of __builtin_constant_p(flags) to help the compiler optimize
  the code.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---

/*****************************************************************/
/* Declaration. Goes into: /lib/eal/include/generic/rte_memcpy.h */
/*****************************************************************/

/*
 * Advanced/Non-Temporal Memory Operations Flags.
 */

/** Length alignment hint mask. */
#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)
/** Hint: Length is 2 byte aligned. */
#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)
/** Hint: Length is 4 byte aligned. */
#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)      
/** Hint: Length is 8 byte aligned. */
#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)      
/** Hint: Length is 16 byte aligned. */
#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)     
/** Hint: Length is 32 byte aligned. */
#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)     
/** Hint: Length is 64 byte aligned. */
#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)     
/** Hint: Length is 128 byte aligned. */
#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)    

/** Prefer non-temporal access to source memory area.
 *
 * On ARM architecture:
 * Remember to call rte_???() before a sequence of copy operations.
 */
#define RTE_MEMOPS_F_SRC_NT     (UINT64_C(1) << 8)      
/** Source address alignment hint mask. */
#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 8)   
/** Hint: Source address is 2 byte aligned. */
#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 8)      
/** Hint: Source address is 4 byte aligned. */
#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 8)      
/** Hint: Source address is 8 byte aligned. */
#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 8)      
/** Hint: Source address is 16 byte aligned. */
#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 8)     
/** Hint: Source address is 32 byte aligned. */
#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 8)     
/** Hint: Source address is 64 byte aligned. */
#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 8)     
/** Hint: Source address is 128 byte aligned. */
#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 8)    

/** Prefer non-temporal access to destination memory area.
 *
 * On x86 architecture:
 * Remember to call rte_wmb() after a sequence of copy operations.
 * On ARM architecture:
 * Remember to call rte_???() after a sequence of copy operations.
 */
#define RTE_MEMOPS_F_DST_NT     (UINT64_C(1) << 16)     
/** Destination address alignment hint mask. */
#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 16)  
/** Hint: Destination address is 2 byte aligned. */
#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 16)     
/** Hint: Destination address is 4 byte aligned. */
#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 16)     
/** Hint: Destination address is 8 byte aligned. */
#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 16)     
/** Hint: Destination address is 16 byte aligned. */
#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 16)    
/** Hint: Destination address is 32 byte aligned. */
#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 16)    
/** Hint: Destination address is 64 byte aligned. */
#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 16)    
/** Hint: Destination address is 128 byte aligned. */
#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 16)   


/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Advanced/non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the destination memory area.
 * @param src
 *   Pointer to the source memory area.
 * @param len
 *   Number of bytes to copy.
 * @param flags
 *   Hints for memory access.
 *   Any of the RTE_MEMOPS_F_(SRC|DST)_NT, RTE_MEMOPS_F_(LEN|SRC|DST)<n>A flags.
 *   Should be constant at build time.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags);
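
For illustration (not part of the header file), a hypothetical call site
combining the flags defined above. The flags are constant at build time, so
the compiler can select a specialized copy routine:

    /* Non-temporal copy where the caller knows that the destination, the
     * source and the length are all 16 byte aligned.
     */
    rte_memcpy_ex(dst, src, len,
            RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT |
            RTE_MEMOPS_F_LEN16A | RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_DST16A);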


/****************************************************************/
/* Implementation. Goes into: /lib/eal/x86/include/rte_memcpy.h */
/****************************************************************/

/* Assumptions about register sizes. */
_Static_assert(sizeof(int32_t) == 4, "Wrong size of int32_t.");
_Static_assert(sizeof(__m128i) == 16, "Wrong size of __m128i.");

/**
 * @internal
 * Workaround for _mm_stream_load_si128() missing a const qualifier on its parameter.
 */
__rte_internal
static __rte_always_inline
__m128i _mm_stream_load_si128_const(const __m128i * const mem_addr)
{
#if defined(RTE_TOOLCHAIN_GCC)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
#endif
    return _mm_stream_load_si128(mem_addr);
#if defined(RTE_TOOLCHAIN_GCC)
#pragma GCC diagnostic pop
#endif
}
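
An alternative workaround (illustration only, not proposed here) is to cast
the const qualifier away explicitly, which avoids the pragmas and also
silences the warning on non-GCC toolchains:

    /* Hypothetical alternative: drop the const qualifier through uintptr_t. */
    static __rte_always_inline
    __m128i _mm_stream_load_si128_const(const __m128i * const mem_addr)
    {
        return _mm_stream_load_si128((__m128i *)(uintptr_t)mem_addr);
    }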

/**
 * @internal
 * 16 byte aligned non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    register __m128i    xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(src, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));

    /* Copy large portion of data in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
        xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
        xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
        xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        src = RTE_PTR_ADD(src, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data.
     * Omitted if length is known to be 64 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN64A))) {
        while (len != 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_stream_si128(dst, xmm0);
            src = RTE_PTR_ADD(src, 16);
            dst = RTE_PTR_ADD(dst, 16);
            len -= 16;
        }
    }
}

/**
 * @internal
 * 4 byte aligned non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 4 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   Must be 4 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    int32_t             buffer[16 / 4] __rte_aligned(16);
    /** How many bytes the source is offset from 16 byte alignment (floor rounding). */
    const size_t        offset = ((uintptr_t)src & (16 - 1));
    register __m128i    xmm0;

    RTE_ASSERT(rte_is_aligned(dst, 4));
    RTE_ASSERT(rte_is_aligned(src, 4));
    RTE_ASSERT(rte_is_aligned(len, 4));

    if (unlikely(len == 0)) return;

    /* Copy first, not 16 byte aligned, part of source data.
     * Omitted if source is known to be 16 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) &&
            offset != 0) {
        const size_t    first = 16 - offset;

        /** Adjust source pointer to achieve 16 byte alignment (floor rounding). */
        src = RTE_PTR_SUB(src, offset);
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        switch (first) {
            case 3 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[1]);
                if (unlikely(len == 1 * 4)) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[2]);
                if (unlikely(len == 2 * 4)) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[3]);
                break;
            case 2 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[2]);
                if (unlikely(len == 1 * 4)) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[3]);
                break;
            case 1 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[3]);
                break;
        }
        src = RTE_PTR_ADD(src, first);
        dst = RTE_PTR_ADD(dst, first);
        len -= first;
    }

    /* Copy middle, 16 byte aligned, part of source data. */
    while (len >= 16) {
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 3 * 4), buffer[3]);
        src = RTE_PTR_ADD(src, 16);
        dst = RTE_PTR_ADD(dst, 4 * 4);
        len -= 16;
    }

    /* Copy last, not 16 byte aligned, part of source data.
     * Omitted if source is known to be 16 byte aligned.
     */
    if (!(__builtin_constant_p(flags) &&
            ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)) &&
            len != 0) {
        xmm0 = _mm_stream_load_si128_const(src);
        _mm_store_si128((void *)buffer, xmm0);
        switch (len) {
            case 1 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
                break;
            case 2 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
                break;
            case 3 * 4:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * 4), buffer[0]);
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * 4), buffer[1]);
                _mm_stream_si32(RTE_PTR_ADD(dst, 2 * 4), buffer[2]);
                break;
        }
    }
}

#ifndef RTE_MEMCPY_NT_BUFSIZE

#include <rte_mbuf.h>   /* #include <rte_mbuf_core.h> */

/** Bounce buffer size for non-temporal memcpy.
 *
 * The actual buffer will be slightly larger, due to added padding.
 * The default is chosen to be able to handle a non-segmented packet.
 */
#define RTE_MEMCPY_NT_BUFSIZE RTE_MBUF_DEFAULT_DATAROOM

#endif  /* RTE_MEMCPY_NT_BUFSIZE */
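
For illustration (not part of the patch): because the definition is guarded
by #ifndef, an application can override the bounce buffer size at build time,
e.g. before the first inclusion of rte_memcpy.h. The value below is only an
example:

    /* Hypothetical application-specific override of the bounce buffer size. */
    #define RTE_MEMCPY_NT_BUFSIZE 4096
    #include <rte_memcpy.h>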

/**
 * @internal
 * Non-temporal memory copy to 16 byte aligned destination and length
 * from unaligned source via bounce buffer.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_buf16dla(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags __rte_unused)
{
    /** Aligned bounce buffer with preceding and trailing padding. */
    unsigned char       buffer[16 + RTE_MEMCPY_NT_BUFSIZE + 16] __rte_aligned(16);
    void *              buf;
    register __m128i    xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));
    RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);

    if (unlikely(len == 0)) return;

    /* Step 1:
     * Copy data from the source to the bounce buffer's aligned data area,
     * using aligned non-temporal load from the source,
     * and unaligned store in the bounce buffer.
     *
     * If the source is unaligned, the extra bytes preceding the data will be copied
     * to the padding area preceding the bounce buffer's aligned data area.
     * Similarly, if the source data ends at an unaligned address, the additional bytes
     * trailing the data will be copied to the padding area trailing the bounce buffer's
     * aligned data area.
     */
    {
        /** How many bytes the source is offset from 16 byte alignment (floor rounding). */
        const size_t        offset = ((uintptr_t)src & (16 - 1));
        /** Number of bytes to copy from source, incl. any extra preceding bytes. */
        size_t              srclen = len + offset;

        /* Adjust source pointer for extra preceding bytes. */
        src = RTE_PTR_SUB(src, offset);
        /* Bounce buffer pointer, adjusted for extra preceding bytes. */
        buf = RTE_PTR_ADD(buffer, 16 - offset);

        /* Copy large portion of data from source to bounce buffer. */
        while (srclen >= 4 * 16) {
            xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
            xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
            xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
            xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
            _mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
            src = RTE_PTR_ADD(src, 4 * 16);
            buf = RTE_PTR_ADD(buf, 4 * 16);
            srclen -= 4 * 16;
        }

        /* Copy remaining data from source to bounce buffer. */
        while ((ssize_t)srclen > 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_storeu_si128(buf, xmm0);
            src = RTE_PTR_ADD(src, 16);
            buf = RTE_PTR_ADD(buf, 16);
            srclen -= 16;
        }
    }

    /* Step 2:
     * Copy from the aligned bounce buffer to the aligned destination.
     */

    /* Reset bounce buffer pointer; point to the aligned data area. */
    buf = RTE_PTR_ADD(buffer, 16);

    /* Copy large portion of data from bounce buffer to destination in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_load_si128(RTE_PTR_ADD(buf, 0 * 16));
        xmm1 = _mm_load_si128(RTE_PTR_ADD(buf, 1 * 16));
        xmm2 = _mm_load_si128(RTE_PTR_ADD(buf, 2 * 16));
        xmm3 = _mm_load_si128(RTE_PTR_ADD(buf, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        buf = RTE_PTR_ADD(buf, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data from bounce buffer to destination. */
    while (len != 0) {
        xmm0 = _mm_load_si128(buf);
        _mm_stream_si128(dst, xmm0);
        buf = RTE_PTR_ADD(buf, 16);
        dst = RTE_PTR_ADD(dst, 16);
        len -= 16;
    }
}

/**
 * @internal
 * Non-temporal memory copy via bounce buffer.
 *
 * @note
 * If the destination and/or length is unaligned, the first and/or last copied
 * bytes will be stored in the destination memory area using temporal access.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be <= RTE_MEMCPY_NT_BUFSIZE.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_buf(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags __rte_unused)
{
    /** Aligned bounce buffer with preceding and trailing padding. */
    unsigned char       buffer[16 + RTE_MEMCPY_NT_BUFSIZE + 16] __rte_aligned(16);
    void *              buf;
    register __m128i    xmm0, xmm1, xmm2, xmm3;

    RTE_ASSERT(len <= RTE_MEMCPY_NT_BUFSIZE);

    if (unlikely(len == 0)) return;

    /* Step 1:
     * Copy data from the source to the bounce buffer's aligned data area,
     * using aligned non-temporal load from the source,
     * and unaligned store in the bounce buffer.
     *
     * If the source is unaligned, the additional bytes preceding the data will be copied
     * to the padding area preceding the bounce buffer's aligned data area.
     * Similarly, if the source data ends at an unaligned address, the additional bytes
     * trailing the data will be copied to the padding area trailing the bounce buffer's
     * aligned data area.
     */
    {
        /** How many bytes the source is offset from 16 byte alignment (floor rounding). */
        const size_t        offset = ((uintptr_t)src & (16 - 1));
        /** Number of bytes to copy from source, incl. any extra preceding bytes. */
        size_t              srclen = len + offset;

        /* Adjust source pointer for extra preceding bytes. */
        src = RTE_PTR_SUB(src, offset);
        /* Bounce buffer pointer, adjusted for extra preceding bytes. */
        buf = RTE_PTR_ADD(buffer, 16 - offset);

        /* Copy large portion of data from source to bounce buffer. */
        while (srclen >= 4 * 16) {
            xmm0 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 0 * 16));
            xmm1 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 1 * 16));
            xmm2 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 2 * 16));
            xmm3 = _mm_stream_load_si128_const(RTE_PTR_ADD(src, 3 * 16));
            _mm_storeu_si128(RTE_PTR_ADD(buf, 0 * 16), xmm0);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 1 * 16), xmm1);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 2 * 16), xmm2);
            _mm_storeu_si128(RTE_PTR_ADD(buf, 3 * 16), xmm3);
            src = RTE_PTR_ADD(src, 4 * 16);
            buf = RTE_PTR_ADD(buf, 4 * 16);
            srclen -= 4 * 16;
        }

        /* Copy remaining data from source to bounce buffer. */
        while ((ssize_t)srclen > 0) {
            xmm0 = _mm_stream_load_si128_const(src);
            _mm_storeu_si128(buf, xmm0);
            src = RTE_PTR_ADD(src, 16);
            buf = RTE_PTR_ADD(buf, 16);
            srclen -= 16;
        }
    }

    /* Step 2:
     * Copy from the aligned bounce buffer to the destination.
     */

    /* Reset bounce buffer pointer; point to the aligned data area. */
    buf = RTE_PTR_ADD(buffer, 16);

    if (unlikely(!rte_is_aligned(dst, 16))) {
        /* Destination is not 16 byte aligned. */

        if (unlikely(!rte_is_aligned(dst, 4))) {
            /* Destination is not 4 byte aligned. */
            /** How many bytes are missing to reach 16 byte alignment. */
            const size_t n = RTE_PTR_DIFF(RTE_PTR_ALIGN_CEIL(dst, 16), dst);

            if (unlikely(len <= n))
                goto copy_trailing_bytes;

            /* Copy from bounce buffer until destination pointer is 16 byte aligned. */
            memcpy(dst, buf, n);
            buf = RTE_PTR_ADD(buf, n);
            dst = RTE_PTR_ADD(dst, n);
            len -= n;
        } else {
            /* Destination is 4 byte aligned. */

            /* Copy from bounce buffer until destination pointer is 16 byte aligned. */
            while (!rte_is_aligned(dst, 16)) {
                register int32_t    r;

                if (unlikely(len < 4))
                    goto copy_trailing_bytes;

                r = *(int32_t *)buf;
                _mm_stream_si32(dst, r);
                buf = RTE_PTR_ADD(buf, 4);
                dst = RTE_PTR_ADD(dst, 4);
                len -= 4;
            }
        }
    }

    /* Destination is 16 byte aligned. */

    /* Copy large portion of data from bounce buffer to destination in chunks of 64 byte. */
    while (len >= 4 * 16) {
        xmm0 = _mm_loadu_si128(RTE_PTR_ADD(buf, 0 * 16));
        xmm1 = _mm_loadu_si128(RTE_PTR_ADD(buf, 1 * 16));
        xmm2 = _mm_loadu_si128(RTE_PTR_ADD(buf, 2 * 16));
        xmm3 = _mm_loadu_si128(RTE_PTR_ADD(buf, 3 * 16));
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * 16), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * 16), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * 16), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * 16), xmm3);
        buf = RTE_PTR_ADD(buf, 4 * 16);
        dst = RTE_PTR_ADD(dst, 4 * 16);
        len -= 4 * 16;
    }

    /* Copy remaining data from bounce buffer to destination. */
    while (len >= 4) {
        int32_t r;

        memcpy(&r, buf, 4);
        _mm_stream_si32(dst, r);
        buf = RTE_PTR_ADD(buf, 4);
        dst = RTE_PTR_ADD(dst, 4);
        len -= 4;
    }

copy_trailing_bytes:
    if (unlikely(len != 0)) {
        /* Copy trailing bytes. */
        memcpy(dst, buf, len);
    }
}

/**
 * @internal
 * Non-temporal memory copy to 16 byte aligned destination and length.
 * The memory areas must not overlap.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source memory area.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16dla(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    RTE_ASSERT(rte_is_aligned(dst, 16));
    RTE_ASSERT(rte_is_aligned(len, 16));

    while (len > RTE_MEMCPY_NT_BUFSIZE) {
        rte_memcpy_nt_buf16dla(dst, src, RTE_MEMCPY_NT_BUFSIZE, flags);
        dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
        src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
        len -= RTE_MEMCPY_NT_BUFSIZE;
    }
    rte_memcpy_nt_buf16dla(dst, src, len, flags);
}

/**
 * @internal
 * Non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @note
 * If the destination and/or length is unaligned, some copied bytes will be
 * stored in the destination memory area using temporal access.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 * @param src
 *   Pointer to the non-temporal source memory area.
 * @param len
 *   Number of bytes to copy.
 */
__rte_internal
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt_fallback(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    while (len > RTE_MEMCPY_NT_BUFSIZE) {
        rte_memcpy_nt_buf(dst, src, RTE_MEMCPY_NT_BUFSIZE, flags);
        dst = RTE_PTR_ADD(dst, RTE_MEMCPY_NT_BUFSIZE);
        src = RTE_PTR_ADD(src, RTE_MEMCPY_NT_BUFSIZE);
        len -= RTE_MEMCPY_NT_BUFSIZE;
    }
    rte_memcpy_nt_buf(dst, src, len, flags);
}

/* Implementation. Refer to function declaration for documentation. */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
    if (flags & (RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT)) {
        if (__builtin_constant_p(flags) ?
                ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A &&
                (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) :
                !(((uintptr_t)dst | len) & (16 - 1))) {
            if (__builtin_constant_p(flags) ?
                    (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A :
                    !((uintptr_t)src & (16 - 1)))
                rte_memcpy_nt16a(dst, src, len, flags);
            else
                rte_memcpy_nt16dla(dst, src, len, flags);
        }
        else if (__builtin_constant_p(flags) ? (
                (flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A &&
                (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A &&
                (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A) :
                !(((uintptr_t)dst | (uintptr_t)src | len) & (4 - 1)))
            rte_memcpy_nt4a(dst, src, len, flags);
        else
            rte_memcpy_nt_fallback(dst, src, len, flags);
    } else
        rte_memcpy(dst, src, len);
}
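
For illustration (not part of the patch), with build time constant flags the
dispatch above resolves as follows; d, s and n are hypothetical arguments:

    rte_memcpy_ex(d, s, n, RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT |
            RTE_MEMOPS_F_LEN16A | RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_DST16A);
    /* -> rte_memcpy_nt16a() */

    rte_memcpy_ex(d, s, n, RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);
    /* -> rte_memcpy_nt_fallback(), because no alignment hints are given */

    rte_memcpy_ex(d, s, n, 0);
    /* -> rte_memcpy() */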


