DPDK patches and discussions
* [RFC v2] non-temporal memcpy
@ 2022-07-19 15:26 Morten Brørup
From: Morten Brørup @ 2022-07-19 15:26 UTC
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

This RFC proposes a set of functions optimized for non-temporal memory copy.

At this stage, I am asking for feedback on the concept.

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.
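
The capture use case can be sketched as follows. This is only an illustration
under stated assumptions: `cap_ring`, `cap_slot`, `cap_store()`, `cap_copy()`,
`CAP_SNAPLEN` and `CAP_SLOTS` are hypothetical names that do not exist in DPDK,
and `cap_copy()` is a plain memcpy() stand-in marking the spot where a
non-temporal copy would be called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CAP_SNAPLEN 128u /* hypothetical: bytes kept per packet */
#define CAP_SLOTS   4u   /* hypothetical: ring size, a power of two */

struct cap_slot {
    uint32_t orig_len;           /* length of the original packet */
    uint32_t cap_len;            /* number of bytes stored in data[] */
    uint8_t  data[CAP_SNAPLEN];
};

struct cap_ring {
    uint32_t head;               /* number of packets stored so far */
    struct cap_slot slot[CAP_SLOTS];
};

/* Stand-in for the proposed copy; memcpy() does NOT bypass the cache. */
static inline void
cap_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Store the first CAP_SNAPLEN bytes of a packet; return the bytes stored. */
static inline uint32_t
cap_store(struct cap_ring *r, const void *pkt, uint32_t pkt_len)
{
    struct cap_slot *s;

    assert(r != NULL && pkt != NULL);

    /* Oldest slot is overwritten once the ring has wrapped. */
    s = &r->slot[r->head++ & (CAP_SLOTS - 1)];
    s->orig_len = pkt_len;
    s->cap_len = pkt_len < CAP_SNAPLEN ? pkt_len : CAP_SNAPLEN;
    cap_copy(s->data, pkt, s->cap_len);
    return s->cap_len;
}
```

The captured bytes are not read again until the offline analysis runs, which
is why caching them at copy time brings no benefit.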

The purpose of these functions is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput may be improved by further optimization, I do not
consider throughput optimization relevant initially.

The x86 non-temporal load instructions have 16 byte alignment
requirements [1], while ARM non-temporal load instructions are available with
4 byte alignment requirements [2].
Both platforms offer non-temporal store instructions with 4 byte alignment
requirements.

In addition to the primary function without any alignment requirements, we
also provide functions for 16 and 4 byte aligned access, respectively, for
performance purposes.
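
A caller that does not know the alignment up front could choose among the
three variants at run time. The sketch below is only an illustration:
`nt_variant` and `nt_pick_variant()` are hypothetical names that are not part
of this RFC, and the function checks nothing more than whether both pointers
and the length are divisible by 16 or by 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only; not part of the RFC. */
enum nt_variant {
    NT_VARIANT_ANY, /* rte_memcpy_nt(): no alignment requirements */
    NT_VARIANT_4A,  /* rte_memcpy_nt4a(): everything divisible by 4 */
    NT_VARIANT_16A  /* rte_memcpy_nt16a(): everything divisible by 16 */
};

static inline enum nt_variant
nt_pick_variant(const void *dst, const void *src, size_t len)
{
    uintptr_t bits;

    /* Mirrors the __nonnull__ attribute on the RFC prototypes. */
    assert(dst != NULL && src != NULL);

    /* OR-ing the values keeps every low bit that is set in any of them. */
    bits = (uintptr_t)dst | (uintptr_t)src | (uintptr_t)len;

    if ((bits & 15) == 0)
        return NT_VARIANT_16A;
    if ((bits & 3) == 0)
        return NT_VARIANT_4A;
    return NT_VARIANT_ANY;
}
```

When the alignment is known at build time, calling the matching aligned
function directly avoids this check.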

The function names resemble standard C library function names, but their
signatures are intentionally different. No need to drag legacy into it.

NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
style and use TAB.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-

V2:
- Only copy from non-temporal source to non-temporal destination.
  I.e. remove the two variants with only source and/or destination being
  non-temporal.
- Do not require alignment.
  Instead, offer additional 4 and 16 byte aligned functions for performance
  purposes.
- Implemented two of the functions for x86.
- Remove memset function.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data from non-temporal source to non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Should be 4 byte aligned, for optimal performance.
 * @param src
 *   Pointer to the non-temporal source data.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Should be divisible by 4, for optimal performance.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
/* Implementation T.B.D. */

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data in blocks of 16 bytes from aligned non-temporal source
 * to aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
{
    const void * const  end = RTE_PTR_ADD(src, len);

    RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
    RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
    RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));

    /* Copy large portion of data. */
    while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
        register __m128i    xmm0, xmm1, xmm2, xmm3;

/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 * sizeof(__m128i)));
        xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 * sizeof(__m128i)));
        xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 * sizeof(__m128i)));
        xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 * sizeof(__m128i)));
#pragma GCC diagnostic pop
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)), xmm3);
        src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
    }

    /* Copy remaining data. */
    while (src != end) {
        register __m128i    xmm;

/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm = _mm_stream_load_si128(src);
#pragma GCC diagnostic pop
        _mm_stream_si128(dst, xmm);
        src = RTE_PTR_ADD(src, sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, sizeof(__m128i));
    }
}
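
The block-wise streaming copy above can be tried outside DPDK with a
stand-alone sketch. It rests on several assumptions: `copy_stream16()` is a
hypothetical name, it uses an ordinary aligned load instead of
_mm_stream_load_si128() so that SSE2 alone is enough to build it, and it falls
back to memcpy() on targets without SSE2. It also ends with _mm_sfence(), which
the RFC code does not do; that is a choice made for this example, since x86
streaming stores are weakly ordered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; a stand-alone sketch, not the RFC implementation. */
static inline void
copy_stream16(void *dst, const void *src, size_t len)
{
    /* Same preconditions as rte_memcpy_nt16a(). */
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert((len & 15) == 0);

#if defined(__SSE2__)
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    size_t i;

    for (i = 0; i < len / sizeof(__m128i); i++) {
        /* Ordinary aligned load; the RFC uses _mm_stream_load_si128(). */
        const __m128i v = _mm_load_si128(&s[i]);

        /* Non-temporal store: hints that the data should bypass the cache. */
        _mm_stream_si128(&d[i], v);
    }
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#else
    /* Fallback for other targets: plain, cache-allocating copy. */
    memcpy(dst, src, len);
#endif
}
```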

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data in blocks of 4 bytes from aligned non-temporal source
 * to aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 4 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 4 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
{
    int32_t             buf[sizeof(__m128i) / sizeof(int32_t)] __rte_aligned(sizeof(__m128i));
    /** Address of source data, rounded down to achieve alignment. */
    const void *        srca = RTE_PTR_ALIGN_FLOOR(src, sizeof(__m128i));
    /** Address of end of source data, rounded down to achieve alignment. */
    const void * const  srcenda = RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
    const int           offset =  RTE_PTR_DIFF(src, srca) / sizeof(int32_t);
    register __m128i    xmm0;

    RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
    RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
    RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));

    if (unlikely(len == 0)) return;

    /* Copy first, non-__m128i aligned, part of source data. */
    if (offset) {
/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(srca);
        _mm_store_si128((void *)buf, xmm0);
#pragma GCC diagnostic pop
        switch (offset) {
            case 1:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[1]);
                if (unlikely(len == 1 * sizeof(int32_t))) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[2]);
                if (unlikely(len == 2 * sizeof(int32_t))) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[3]);
                break;
            case 2:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[2]);
                if (unlikely(len == 1 * sizeof(int32_t))) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[3]);
                break;
            case 3:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[3]);
                break;
        }
        /* Advance to the next __m128i aligned block of source data. */
        srca = RTE_PTR_ADD(srca, sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
    }

    /* Copy middle, __m128i aligned, part of source data. */
    while (srca != srcenda) {
/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(srca);
#pragma GCC diagnostic pop
        _mm_store_si128((void *)buf, xmm0);
        _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)), buf[3]);
        srca = RTE_PTR_ADD(srca, sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
    }

    /* Copy last, non-__m128i aligned, part of source data. */
    if (RTE_PTR_DIFF(RTE_PTR_ADD(src, len), srcenda) != 0) {
/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(srca);
        _mm_store_si128((void *)buf, xmm0);
#pragma GCC diagnostic pop
        /* Switch on the number of 4 byte words left after the last whole block. */
        switch (RTE_PTR_DIFF(RTE_PTR_ADD(src, len), srcenda) / sizeof(int32_t)) {
            case 1:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
                break;
            case 2:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
                break;
            case 3:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
                _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
                break;
        }
    }
}
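
The head, middle and tail parts that rte_memcpy_nt4a() is meant to handle can
be computed up front, which makes the pointer arithmetic easier to check. The
helper below is a hypothetical illustration, not part of the RFC:
`nt4a_split` and `nt4a_plan()` are made-up names, and the source address is
passed as an integer so that no memory is touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only; head and tail are in 4 byte words. */
struct nt4a_split {
    size_t head;   /* words taken from the first, partially used block */
    size_t blocks; /* whole 16 byte blocks copied by the middle loop */
    size_t tail;   /* words taken from the last, partially used block */
};

static inline struct nt4a_split
nt4a_plan(uintptr_t src, size_t len)
{
    struct nt4a_split s = { 0, 0, 0 };
    const size_t offset = (src & 15) / 4; /* words skipped in first block */
    size_t words = len / 4;

    /* Same preconditions as rte_memcpy_nt4a(). */
    assert((src & 3) == 0 && (len & 3) == 0);

    if (offset != 0) {
        /* At most 4 - offset words remain in the first block. */
        s.head = words < 4 - offset ? words : 4 - offset;
        words -= s.head;
    }
    s.blocks = words / 4;
    s.tail = words % 4;
    return s;
}
```

For example, a source address ending in 0x4 with len = 40 gives 3 head words,
1 whole block and 3 tail words, which add up to the 10 words to copy.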


