From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: "Morten Brørup" <mb@smartsharesystems.com>,
dev@dpdk.org, "Bruce Richardson" <bruce.richardson@intel.com>,
"Konstantin Ananyev" <konstantin.v.ananyev@yandex.ru>
Cc: Jan Viktorin <viktorin@rehivetech.com>,
Ruifeng Wang <ruifeng.wang@arm.com>,
David Christensen <drc@linux.vnet.ibm.com>,
Stanislaw Kardach <kda@semihalf.com>
Subject: Re: [RFC v2] non-temporal memcpy
Date: Sun, 7 Aug 2022 22:25:20 +0200
Message-ID: <9ac934d2-ad05-6ec9-3bb6-63986d68d5d3@lysator.liu.se>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D871D4@smartserver.smartshare.dk>
On 2022-07-19 17:26, Morten Brørup wrote:
> This RFC proposes a set of functions optimized for non-temporal memory copy.
>
> At this stage, I am asking for feedback on the concept.
>
> Applications sometimes copy data to another memory location, which is only
> used much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
>
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
>
> The purpose of these functions is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput may be improved by further optimization, I do not
> consider throughput optimization relevant initially.
>
> The x86 non-temporal load instructions have 16 byte alignment
> requirements [1], while ARM non-temporal load instructions are available with
> 4 byte alignment requirements [2].
> Both platforms offer non-temporal store instructions with 4 byte alignment
> requirements.
>
I don't think memcpy() functions should have alignment requirements.
That's not very practical, and it violates the principle of least surprise.
Use a normal memcpy() for the unaligned head and tail, and for the whole
copy when the size is small (at least on x86).
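A minimal sketch of that approach, under stated assumptions: the function
name memcpy_nt_sketch() and the NT_MIN_LEN threshold are hypothetical and
appear neither in the RFC nor in DPDK, the threshold value is untuned, and
only the stores are non-temporal (the loads are ordinary unaligned SSE2
loads). On builds without SSE2 it simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical, untuned threshold below which plain memcpy() is used. */
#define NT_MIN_LEN 64

/* Hypothetical name; no alignment requirements on dst, src or len. */
static void memcpy_nt_sketch(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head;

	/* Small copies: not worth the non-temporal setup cost. */
	if (len < NT_MIN_LEN) {
		memcpy(d, s, len);
		return;
	}

	/* Head: plain copy until the destination is 16 byte aligned. */
	head = (16 - ((uintptr_t)d & 15)) & 15;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	/* Body: unaligned loads, 16 byte aligned non-temporal stores. */
	while (len >= 16) {
		__m128i x = _mm_loadu_si128((const __m128i *)s);

		_mm_stream_si128((__m128i *)d, x);
		d += 16;
		s += 16;
		len -= 16;
	}
	/* Order the non-temporal stores before any later stores. */
	_mm_sfence();

	/* Tail: plain copy of the last 0..15 bytes. */
	memcpy(d, s, len);
#else
	memcpy(dst, src, len);
#endif
}
```

The caller sees plain memcpy() semantics for any pointer and any length;
the alignment handling stays an internal detail of the function.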
> In addition to the primary function without any alignment requirements, we
> also provide functions for respectively 16 and 4 byte aligned access for
> performance purposes.
>
> The function names resemble standard C library function names, but their
> signatures are intentionally different. No need to drag legacy into it.
>
> NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
> style and use TAB.
>
> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
> [2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-
>
> V2:
> - Only copy from non-temporal source to non-temporal destination.
>   I.e. remove the two variants with only source and/or destination being
>   non-temporal.
> - Do not require alignment.
>   Instead, offer additional 4 and 16 byte aligned functions for performance
>   purposes.
> - Implemented two of the functions for x86.
> - Remove memset function.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>
> /**
> * @warning
> * @b EXPERIMENTAL: this API may change without prior notice.
> *
> * Copy data from non-temporal source to non-temporal destination.
> *
> * @param dst
> * Pointer to the non-temporal destination of the data.
> * Should be 4 byte aligned, for optimal performance.
> * @param src
> * Pointer to the non-temporal source data.
> * No alignment requirements.
> * @param len
> * Number of bytes to copy.
> * Should be divisible by 4, for optimal performance.
> */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> /* Implementation T.B.D. */
>
> /**
> * @warning
> * @b EXPERIMENTAL: this API may change without prior notice.
> *
> * Copy data in blocks of 16 bytes from aligned non-temporal source
> * to aligned non-temporal destination.
> *
> * @param dst
> * Pointer to the non-temporal destination of the data.
> * Must be 16 byte aligned.
> * @param src
> * Pointer to the non-temporal source data.
> * Must be 16 byte aligned.
> * @param len
> * Number of bytes to copy.
> * Must be divisible by 16.
> */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> {
>     const void * const end = RTE_PTR_ADD(src, len);
>
>     RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
>     RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
>     RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
>
>     /* Copy large portion of data. */
>     while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
>         register __m128i xmm0, xmm1, xmm2, xmm3;
>
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 * sizeof(__m128i)));
>         xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 * sizeof(__m128i)));
>         xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 * sizeof(__m128i)));
>         xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 * sizeof(__m128i)));
> #pragma GCC diagnostic pop
>         _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)), xmm0);
>         _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)), xmm1);
>         _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)), xmm2);
>         _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)), xmm3);
>         src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
>         dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
>     }
>
>     /* Copy remaining data. */
>     while (src != end) {
>         register __m128i xmm;
>
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm = _mm_stream_load_si128(src);
> #pragma GCC diagnostic pop
>         _mm_stream_si128(dst, xmm);
>         src = RTE_PTR_ADD(src, sizeof(__m128i));
>         dst = RTE_PTR_ADD(dst, sizeof(__m128i));
>     }
> }
>
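A caller of a 16 byte aligned variant has to meet all three preconditions
itself: aligned destination, aligned source, and a length divisible by 16.
The sketch below shows one way a capture buffer could be set up to satisfy
them. It assumes plain C11 aligned_alloc() rather than any DPDK allocator,
and the helper names round_up_16(), is_aligned_16() and capture_buf_alloc()
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: round a length up to the next multiple of 16. */
static size_t round_up_16(size_t len)
{
	return (len + 15) & ~(size_t)15;
}

/* Hypothetical helper: check whether a pointer is 16 byte aligned. */
static int is_aligned_16(const void *p)
{
	return ((uintptr_t)p & 15) == 0;
}

/*
 * Hypothetical helper: allocate a zeroed buffer whose start address is
 * 16 byte aligned and whose size is padded to a multiple of 16.
 * Returns NULL on failure; release with free().
 */
static void *capture_buf_alloc(size_t len)
{
	size_t padded = round_up_16(len);
	void *p;

	assert(padded >= len); /* guard against size_t wrap-around */
	if (padded == 0)
		padded = 16;
	/* C11 aligned_alloc() wants a size that is a multiple of the alignment. */
	p = aligned_alloc(16, padded);
	if (p != NULL)
		memset(p, 0, padded);
	return p;
}
```

With buffers set up this way, the length passed to the aligned copy would
be round_up_16(len), so up to 15 padding bytes are copied along with the
payload.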
> /**
> * @warning
> * @b EXPERIMENTAL: this API may change without prior notice.
> *
> * Copy data in blocks of 4 bytes from aligned non-temporal source
> * to aligned non-temporal destination.
> *
> * @param dst
> * Pointer to the non-temporal destination of the data.
> * Must be 4 byte aligned.
> * @param src
> * Pointer to the non-temporal source data.
> * Must be 4 byte aligned.
> * @param len
> * Number of bytes to copy.
> * Must be divisible by 4.
> */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> {
>     int32_t buf[sizeof(__m128i) / sizeof(int32_t)] __rte_aligned(sizeof(__m128i));
>     /** Address of source data, rounded down to achieve alignment. */
>     const void * srca = RTE_PTR_ALIGN_FLOOR(src, sizeof(__m128i));
>     /** Address of end of source data, rounded down to achieve alignment. */
>     const void * const srcenda = RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
>     const int offset = RTE_PTR_DIFF(src, srca) / sizeof(int32_t);
>     register __m128i xmm0;
>
>     RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
>     RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
>     RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
>
>     if (unlikely(len == 0)) return;
>
>     /* Copy first, non-__m128i aligned, part of source data. */
>     if (offset) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm0 = _mm_stream_load_si128(srca);
>         _mm_store_si128((void *)buf, xmm0);
> #pragma GCC diagnostic pop
>         switch (offset) {
>         case 1:
>             _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[1]);
>             if (unlikely(len == 1 * sizeof(int32_t))) return;
>             _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[2]);
>             if (unlikely(len == 2 * sizeof(int32_t))) return;
>             _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[3]);
>             break;
>         case 2:
>             _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[2]);
>             if (unlikely(len == 1 * sizeof(int32_t))) return;
>             _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[3]);
>             break;
>         case 3:
>             _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[3]);
>             break;
>         }
>         srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
>         dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
>     }
>
>     /* Copy middle, __m128i aligned, part of source data. */
>     while (srca != srcenda) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm0 = _mm_stream_load_si128(srca);
> #pragma GCC diagnostic pop
>         _mm_store_si128((void *)buf, xmm0);
>         _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>         _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>         _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>         _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)), buf[3]);
>         srca = RTE_PTR_ADD(srca, sizeof(__m128i));
>         dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
>     }
>
>     /* Copy last, non-__m128i aligned, part of source data. */
>     if (RTE_PTR_DIFF(srca, src) != len) {
>         /** Number of 32 bit words left after the last whole __m128i block. */
>         const size_t remaining = (len - RTE_PTR_DIFF(srca, src)) / sizeof(int32_t);
>
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm0 = _mm_stream_load_si128(srca);
>         _mm_store_si128((void *)buf, xmm0);
> #pragma GCC diagnostic pop
>         /* The tail always starts at buf[0]; only its length (1 to 3 words) varies. */
>         _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>         if (remaining == 1) return;
>         _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>         if (remaining == 2) return;
>         _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>     }
> }
>
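The head / middle / tail split that the 4 byte aligned variant performs on
the source range can be checked with plain integer arithmetic. The sketch
below is only an illustration of that arithmetic, not part of the RFC: the
struct copy_split type and the split_words() function are hypothetical
names, and addresses are passed as integers so no memory is touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: how a 4 byte aligned range splits up. */
struct copy_split {
	size_t head_words;    /* 32 bit words before the first 16 byte boundary */
	size_t middle_blocks; /* whole 16 byte blocks */
	size_t tail_words;    /* 32 bit words after the last 16 byte boundary */
};

/* Hypothetical helper: src and len must both be multiples of 4. */
static struct copy_split split_words(uintptr_t src, size_t len)
{
	struct copy_split s = {0, 0, 0};
	const uintptr_t end = src + len;
	/* First 16 byte boundary at or after src. */
	const uintptr_t first = (src + 15) & ~(uintptr_t)15;
	/* Last 16 byte boundary at or before end. */
	const uintptr_t last = end & ~(uintptr_t)15;

	assert(src % 4 == 0 && len % 4 == 0);

	/* The whole range sits inside one block and touches no boundary. */
	if (first > last) {
		s.head_words = len / 4;
		return s;
	}
	s.head_words = (first - src) / 4;
	s.middle_blocks = (last - first) / 16;
	s.tail_words = (end - last) / 4;
	return s;
}
```

For example, a source at address 20 with len 36 gives 3 head words, 1 whole
block and 2 tail words. The tail length depends on where the range ends, not
on the offset of its start.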