DPDK patches and discussions
From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	dev@dpdk.org, "Bruce Richardson" <bruce.richardson@intel.com>
Cc: Jan Viktorin <viktorin@rehivetech.com>,
	Ruifeng Wang <ruifeng.wang@arm.com>,
	David Christensen <drc@linux.vnet.ibm.com>,
	Stanislaw Kardach <kda@semihalf.com>
Subject: Re: [RFC v2] non-temporal memcpy
Date: Fri, 29 Jul 2022 11:00:14 +0100	[thread overview]
Message-ID: <2c646d01-14d0-e5cb-2d7c-50c8456fc3e5@yandex.ru> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D871E6@smartserver.smartshare.dk>

24/07/2022 23:18, Morten Brørup writes:
>> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
>> Sent: Sunday, 24 July 2022 15.35
>>
>> 22/07/2022 11:44, Morten Brørup writes:
>>>> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
>>>> Sent: Friday, 22 July 2022 01.20
>>>>
>>>> Hi Morten,
>>>>
>>>>> This RFC proposes a set of functions optimized for non-temporal
>>>>> memory copy.
>>>>>
>>>>> At this stage, I am asking for feedback on the concept.
>>>>>
>>>>> Applications sometimes copy data to another memory location, which
>>>>> is only used much later.
>>>>> In this case, it is inefficient to pollute the data cache with the
>>>>> copied data.
>>>>>
>>>>> An example use case (originating from a real life application):
>>>>> Copying filtered packets, or the first part of them, into a capture
>>>>> buffer for offline analysis.
>>>>>
>>>>> The purpose of these functions is to achieve a performance gain by
>>>>> not polluting the cache when copying data.
>>>>> Although the throughput may be improved by further optimization, I
>>>>> do not consider throughput optimization relevant initially.
>>>>>
>>>>> The x86 non-temporal load instructions have 16 byte alignment
>>>>> requirements [1], while ARM non-temporal load instructions are
>>>>> available with 4 byte alignment requirements [2].
>>>>> Both platforms offer non-temporal store instructions with 4 byte
>>>>> alignment requirements.
>>>>>
>>>>> In addition to the primary function without any alignment
>>>>> requirements, we also provide functions for respectively 16 and
>>>>> 4 byte aligned access, for performance purposes.
>>>>>
>>>>> The function names resemble standard C library function names, but
>>>>> their signatures are intentionally different. No need to drag
>>>>> legacy into it.
>>>>>
>>>>> NB: Don't comment on spaces for indentation; a patch will follow
>>>>> DPDK coding style and use TAB.
>>>>
>>>>
>>>> I think there were discussions in the other direction - remove
>>>> rte_memcpy() completely and use memcpy() instead...
>>>
>>> Yes, the highly optimized rte_memcpy() implementation of memcpy() has
>>> become obsolete, now that modern compilers provide an efficient
>>> memcpy() implementation.
>>>
>>> It's an excellent reference, because we should learn from it, and
>>> avoid introducing similar mistakes with non-temporal memcpy.
>>>
>>>> But if we have a good use case for that, then I am positive in
>>>> principle.
>>>
>>> The standard C library doesn't offer non-temporal memcpy(), so we
>>> need to implement it ourselves.
>>>
>>>> Though I think we need a clear use-case within dpdk for it
>>>> to demonstrate performance gain.
>>>
>>> The performance gain is to avoid polluting the data cache. DPDK
>>> example applications, like l3fwd, are probably too primitive to
>>> measure any benefit in this regard.
>>>
>>>> Probably copying packets within the pdump lib, or examples/dma, or ...
>>>
>>> Good point - the new functions should be used somewhere within DPDK.
>>> For this purpose, I will look into modifying rte_pktmbuf_copy(),
>>> which is used by pdump_copy(), to use non-temporal copying of the
>>> packet data.
>>>
>>>> Another thought - do we really need a separate inline function for
>>>> each flavour?
>>>> Might be just one non-inline rte_memcpy_nt(dst, src, size, flags),
>>>> where flags could be a combination of NT_SRC and NT_DST, and keep
>>>> alignment detection/decisions to the particular implementation?
>>>
>>> Thank you for the feedback, Konstantin.
>>>
>>> My answer to this suggestion gets a little long-winded...
>>>
>>> Looking at the DPDK pcapng library, it copies a 4 byte aligned
>>> metadata structure sized 28 byte. So it can do with 4 byte aligned
>>> functions.
>>>
>>> Our application can capture packets starting at the IP header, which
>>> is offset by 14 byte (Ethernet header size) from the packet buffer,
>>> so it requires 2 byte alignment. And thus, requiring 4 byte alignment
>>> is not acceptable.
>>>
>>> Our application uses 16 byte alignment in the capture buffer area,
>>> and can benefit from 16 byte aligned functions. Furthermore, x86
>>> processors require 16 byte alignment for non-temporal load
>>> instructions, so I think a 16 byte aligned non-temporal memcpy
>>> function should be offered.
>>
>>
>> Yes, x86 needs 16B alignment for NT loads/stores.
>> But that's supposed to be an arch-specific limitation
>> that we probably want to hide, no?
> 
> Agree.
> 
>> Inside, the function can check the alignment of both src and dst
>> and decide whether it should use NT load/store instructions or
>> just do a normal copy.
> 
> Yes, I'm experimenting with the x86 inline function shown below. And hopefully, with some "extern inline" or other magic, I can hide the different implementations in the arch specific headers, and only expose the function declaration of rte_memcpy_nt() in the common header.
> 
> I'm currently working on the x86 implementation - when I'm satisfied with that, I'll look into how to hide the implementations in the arch specific header files, and only expose the common function declaration in the generic header file, which is also used for documentation. It works for rte_memcpy(), so I can probably find the way to do it there.
> 
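> I.e. roughly this split (just a sketch of the idea, not actual file
> contents):
> 
>      /* Generic header: common declaration and documentation only. */
>      static void rte_memcpy_nt(void * __rte_restrict dst,
>              const void * __rte_restrict src, size_t len, uint64_t flags);
> 
>      /* Arch-specific header: the actual inline implementation, e.g.
>       * the x86 version below. */
> 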
> /*
>   * Non-Temporal Memory Operations Flags.
>   */
> 
> #define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)   /**< Length alignment mask. */
> #define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)      /**< Length is 2 byte aligned. */
> #define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)      /**< Length is 4 byte aligned. */
> #define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)      /**< Length is 8 byte aligned. */
> #define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)     /**< Length is 16 byte aligned. */
> #define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)     /**< Length is 32 byte aligned. */
> #define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)     /**< Length is 64 byte aligned. */
> #define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)    /**< Length is 128 byte aligned. */
> 
> #define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 8)   /**< Destination address alignment mask. */
> #define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 8)      /**< Destination address is 2 byte aligned. */
> #define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 8)      /**< Destination address is 4 byte aligned. */
> #define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 8)      /**< Destination address is 8 byte aligned. */
> #define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 8)     /**< Destination address is 16 byte aligned. */
> #define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 8)     /**< Destination address is 32 byte aligned. */
> #define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 8)     /**< Destination address is 64 byte aligned. */
> #define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 8)    /**< Destination address is 128 byte aligned. */
> 
> #define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 16)  /**< Source address alignment mask. */
> #define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 16)     /**< Source address is 2 byte aligned. */
> #define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 16)     /**< Source address is 4 byte aligned. */
> #define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 16)     /**< Source address is 8 byte aligned. */
> #define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 16)    /**< Source address is 16 byte aligned. */
> #define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 16)    /**< Source address is 32 byte aligned. */
> #define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 16)    /**< Source address is 64 byte aligned. */
> #define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 16)   /**< Source address is 128 byte aligned. */
> 
> /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice.
>   *
>   * Non-temporal memory copy.
>   * The memory areas must not overlap.
>   *
>   * @note
>   * If the destination and/or length is unaligned, some copied bytes will be
>   * stored in the destination memory area using temporal access.
>   *
>   * @param dst
>   *   Pointer to the non-temporal destination memory area.
>   * @param src
>   *   Pointer to the non-temporal source memory area.
>   * @param len
>   *   Number of bytes to copy.
>   * @param flags
>   *   Hints for memory access.
>   *   Any of the RTE_MEMOPS_F_LENnA, RTE_MEMOPS_F_DSTnA, RTE_MEMOPS_F_SRCnA flags.
>   */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
>          const uint64_t flags)
> {
>      if (__builtin_constant_p(flags) ?
>              ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A &&
>              (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) :
>              !(((uintptr_t)dst | len) & (16 - 1))) {
>          if (__builtin_constant_p(flags) ?
>                  (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A :
>                  !((uintptr_t)src & (16 - 1)))
>              rte_memcpy_nt16a(dst, src, len/*, flags*/);
>          else
>              rte_memcpy_nt16dla(dst, src, len/*, flags*/);
>      }
>      else if (__builtin_constant_p(flags) ? (
>              (flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A &&
>              (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
>              (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A) :
>              !(((uintptr_t)dst | (uintptr_t)src | len) & (4 - 1))) {
>          rte_memcpy_nt4a(dst, src, len/*, flags*/);
>      }
>      else
>          rte_memcpy_nt_unaligned(dst, src, len/*, flags*/);
> }
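> 
> E.g. a hypothetical call site (illustration only), where constant
> flags let the compiler prune the branches at build time and inline
> the 16 byte aligned copy:
> 
>      /* All of dst, src and len are known to be 16 byte aligned. */
>      rte_memcpy_nt(dst, src, len,
>              RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A | RTE_MEMOPS_F_LEN16A);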


Do we really need to expose all these dozens of flags?
My thought, at least about the x86 implementation, was something
simpler, like:
void rte_memcpy_nt(void * __rte_restrict dst,
	const void * __rte_restrict src, size_t len,
	const uint64_t flags)
{
	if (flags == (SRC_NT | DST_NT) &&
			(((uintptr_t)dst | (uintptr_t)src) & 0xf) == 0) {
		_do_nt_src_nt_dst_nt(...);
	} else if (flags == DST_NT && ((uintptr_t)dst & 0xf) == 0) {
		_do_src_na_dst_nt(...);
	} else if (flags == SRC_NT && ((uintptr_t)src & 0xf) == 0) {
		_do_src_nt_dst_na(...);
	} else
		memcpy(dst, src, len);
}

> 
> 
>>
>>
>>> While working on these functions, I experimented with an
>>> rte_memcpy_nt() taking flags, which is also my personal preference,
>>> but haven't succeeded yet. Especially when copying a 16 byte aligned
>>> structure of only 16 byte, the overhead of the function call +
>>> comparing the flags + the copy loop overhead is significant, compared
>>> to inline code consisting of only one pair of "movntdqa (%rsi),%xmm0;
>>> movntdq %xmm0,(%rdi)" instructions.
>>>
>>> Remember that a non-inlined rte_memcpy_nt() will be called with very
>>> varying sizes, due to the typical mix of small and big packets, so
>>> branch prediction will not help.
>>>
>>> This RFC does not yet show the rte_memcpy_nt() function handling
>>> unaligned load/store, but it is more complex than the aligned
>>> functions. So I think the aligned variants are warranted - for
>>> performance reasons.
>>>
>>> Some of the need for exposing individual functions for different
>>> alignments stems from the compiler being unable to determine the
>>> alignment of the source and destination pointers at build time. So we
>>> need to help the compiler with this at build time, and thus the need
>>> for inlining the function. Whether we expose a bunch of small inline
>>> functions or one big inline function with flags seems to be a matter
>>> of taste.
>>>
>>> Thinking about it, you are probably right that exposing a single
>>> function with flags is better for documentation purposes and easier
>>> for other architectures to implement. But it still needs to be
>>> inline, for the reasons described above.
>>
>>
>> Ok, my initial thought was that the main use-case for it would be
>> copying of big chunks of data, but from your description it might not
>> be the case.
> 
> This is for quickly copying relatively small pieces of data synchronously without polluting the CPU's data cache, e.g. just before passing on a packet to an Ethernet PMD for transmission.
> 
> Big chunks of data should be copied asynchronously by DMA.
> 
>> Yes, for just a 16/32B copy, function call overhead might be way too
>> high...
>> As another alternative - would memcpy_nt_bulk() help somehow?
>> It could do the copying for several src/dst pairs at once, and
>> that might help to amortize the cost of the function call.
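>> Just to illustrate the idea, a hypothetical shape (all names made
>> up, not a worked-out proposal):
>>
>> struct rte_memcpy_nt_job {
>> 	void *dst;
>> 	const void *src;
>> 	size_t len;
>> };
>>
>> /* Copy 'n' independent src/dst pairs using non-temporal access. */
>> void rte_memcpy_nt_bulk(const struct rte_memcpy_nt_job *jobs,
>> 	unsigned int n);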
> 
> In many cases, memcpy_nt() will replace memcpy() inside loops, so it should be just as easy to use as memcpy(). E.g. look at rte_pktmbuf_copy()... Building a memcopy array to pass to memcpy_nt_bulk() from rte_pktmbuf_copy() would require a significant rewrite of rte_pktmbuf_copy(), compared to just replacing rte_memcpy() with rte_memcpy_nt(). And this is just one function using memcpy().

Actually, one question I have for such small data transfers
(16B per packet) - do you still see some noticeable performance
improvement in such a scenario?
Another question - who will do the 'sfence' after the copying?
Would it be inside memcpy_nt (seems quite costly), or would
it be another API function for that: memcpy_nt_flush() or so?
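
Just to sketch the flush-function option (hypothetical name; on x86 it
would boil down to the SSE _mm_sfence() intrinsic, and other archs
would supply their own store barrier):

/*
 * Make all non-temporal stores issued so far by this lcore globally
 * visible. Called once after a batch of copies, instead of paying for
 * a fence inside every rte_memcpy_nt() call.
 */
static inline void
rte_memcpy_nt_flush(void)
{
	_mm_sfence();
}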

>>
>>
>>>
>>>>
>>>>
>>>>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
>>>>> [2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-
>>>>>
>>>>> V2:
>>>>> - Only copy from non-temporal source to non-temporal destination.
>>>>>      I.e. remove the two variants with only source and/or
>>>>>      destination being non-temporal.
>>>>> - Do not require alignment.
>>>>>      Instead, offer additional 4 and 16 byte aligned functions for
>>>>>      performance purposes.
>>>>> - Implemented two of the functions for x86.
>>>>> - Remove memset function.
>>>>>
>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>>>>> ---
>>>>>
>>>>> /**
>>>>>     * @warning
>>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
>>>>>     *
>>>>>     * Copy data from non-temporal source to non-temporal destination.
>>>>>     *
>>>>>     * @param dst
>>>>>     *   Pointer to the non-temporal destination of the data.
>>>>>     *   Should be 4 byte aligned, for optimal performance.
>>>>>     * @param src
>>>>>     *   Pointer to the non-temporal source data.
>>>>>     *   No alignment requirements.
>>>>>     * @param len
>>>>>     *   Number of bytes to copy.
>>>>>     *   Should be divisible by 4, for optimal performance.
>>>>>     */
>>>>> __rte_experimental
>>>>> static __rte_always_inline
>>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
>>>>> void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
>>>>> /* Implementation T.B.D. */
>>>>>
>>>>> /**
>>>>>     * @warning
>>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
>>>>>     *
>>>>>     * Copy data in blocks of 16 byte from aligned non-temporal source
>>>>>     * to aligned non-temporal destination.
>>>>>     *
>>>>>     * @param dst
>>>>>     *   Pointer to the non-temporal destination of the data.
>>>>>     *   Must be 16 byte aligned.
>>>>>     * @param src
>>>>>     *   Pointer to the non-temporal source data.
>>>>>     *   Must be 16 byte aligned.
>>>>>     * @param len
>>>>>     *   Number of bytes to copy.
>>>>>     *   Must be divisible by 16.
>>>>>     */
>>>>> __rte_experimental
>>>>> static __rte_always_inline
>>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
>>>>> void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
>>>>> {
>>>>>        const void * const  end = RTE_PTR_ADD(src, len);
>>>>>
>>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
>>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
>>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
>>>>>
>>>>>        /* Copy large portion of data. */
>>>>>        while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
>>>>>            register __m128i    xmm0, xmm1, xmm2, xmm3;
>>>>>
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 * sizeof(__m128i)));
>>>>>            xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 * sizeof(__m128i)));
>>>>>            xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 * sizeof(__m128i)));
>>>>>            xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 * sizeof(__m128i)));
>>>>> #pragma GCC diagnostic pop
>>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)), xmm0);
>>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)), xmm1);
>>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)), xmm2);
>>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)), xmm3);
>>>>>            src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
>>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
>>>>>        }
>>>>>
>>>>>        /* Copy remaining data. */
>>>>>        while (src != end) {
>>>>>            register __m128i    xmm;
>>>>>
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm = _mm_stream_load_si128(src);
>>>>> #pragma GCC diagnostic pop
>>>>>            _mm_stream_si128(dst, xmm);
>>>>>            src = RTE_PTR_ADD(src, sizeof(__m128i));
>>>>>            dst = RTE_PTR_ADD(dst, sizeof(__m128i));
>>>>>        }
>>>>> }
>>>>>
>>>>> /**
>>>>>     * @warning
>>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
>>>>>     *
>>>>>     * Copy data in blocks of 4 byte from aligned non-temporal source
>>>>>     * to aligned non-temporal destination.
>>>>>     *
>>>>>     * @param dst
>>>>>     *   Pointer to the non-temporal destination of the data.
>>>>>     *   Must be 4 byte aligned.
>>>>>     * @param src
>>>>>     *   Pointer to the non-temporal source data.
>>>>>     *   Must be 4 byte aligned.
>>>>>     * @param len
>>>>>     *   Number of bytes to copy.
>>>>>     *   Must be divisible by 4.
>>>>>     */
>>>>> __rte_experimental
>>>>> static __rte_always_inline
>>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
>>>>> void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
>>>>> {
>>>>>        int32_t             buf[sizeof(__m128i) / sizeof(int32_t)] __rte_aligned(sizeof(__m128i));
>>>>>        /** Address of source data, rounded down to achieve alignment. */
>>>>>        const void *        srca = RTE_PTR_ALIGN_FLOOR(src, sizeof(__m128i));
>>>>>        /** Address of end of source data, rounded down to achieve alignment. */
>>>>>        const void * const  srcenda = RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
>>>>>        const int           offset =  RTE_PTR_DIFF(src, srca) / sizeof(int32_t);
>>>>>        register __m128i    xmm0;
>>>>>
>>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
>>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
>>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
>>>>>
>>>>>        if (unlikely(len == 0)) return;
>>>>>
>>>>>        /* Copy first, non-__m128i aligned, part of source data. */
>>>>>        if (offset) {
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm0 = _mm_stream_load_si128(srca);
>>>>>            _mm_store_si128((void *)buf, xmm0);
>>>>> #pragma GCC diagnostic pop
>>>>>            switch (offset) {
>>>>>                case 1:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[1]);
>>>>>                    if (unlikely(len == 1 * sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[2]);
>>>>>                    if (unlikely(len == 2 * sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[3]);
>>>>>                    break;
>>>>>                case 2:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[2]);
>>>>>                    if (unlikely(len == 1 * sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[3]);
>>>>>                    break;
>>>>>                case 3:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[3]);
>>>>>                    break;
>>>>>            }
>>>>>            srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
>>>>>            dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
>>>>>        }
>>>>>
>>>>>        /* Copy middle, __m128i aligned, part of source data. */
>>>>>        while (srca != srcenda) {
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm0 = _mm_stream_load_si128(srca);
>>>>> #pragma GCC diagnostic pop
>>>>>            _mm_store_si128((void *)buf, xmm0);
>>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)), buf[3]);
>>>>>            srca = RTE_PTR_ADD(srca, sizeof(__m128i));
>>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
>>>>>        }
>>>>>
>>>>>        /* Copy last, non-__m128i aligned, part of source data. */
>>>>>        if (RTE_PTR_DIFF(srca, src) != 4) {
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm0 = _mm_stream_load_si128(srca);
>>>>>            _mm_store_si128((void *)buf, xmm0);
>>>>> #pragma GCC diagnostic pop
>>>>>            switch (offset) {
>>>>>                case 1:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>>>>>                    break;
>>>>>                case 2:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1 * sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>>>>>                    break;
>>>>>                case 3:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1 * sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 2 * sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>>>>>                    break;
>>>>>            }
>>>>>        }
>>>>> }
>>>>>
>>>>
>>>
>>
> 



Thread overview: 57+ messages
2022-07-19 15:26 Morten Brørup
2022-07-19 18:00 ` David Christensen
2022-07-19 18:41   ` Morten Brørup
2022-07-19 18:51     ` Stanisław Kardach
2022-07-19 22:15       ` Morten Brørup
2022-07-21 23:19 ` Konstantin Ananyev
2022-07-22 10:44   ` Morten Brørup
2022-07-24 13:35     ` Konstantin Ananyev
2022-07-24 22:18       ` Morten Brørup
2022-07-29 10:00         ` Konstantin Ananyev [this message]
2022-07-29 10:46           ` Morten Brørup
2022-07-29 11:50             ` Konstantin Ananyev
2022-07-29 17:17               ` Morten Brørup
2022-07-29 22:00                 ` Konstantin Ananyev
2022-07-30  9:51                   ` Morten Brørup
2022-08-02  9:05                     ` Konstantin Ananyev
2022-07-29 12:13             ` Konstantin Ananyev
2022-07-29 16:05               ` Stephen Hemminger
2022-07-29 17:29                 ` Morten Brørup
2022-08-07 20:40                 ` Mattias Rönnblom
2022-08-09  9:24                   ` Morten Brørup
2022-08-09 11:53                     ` Mattias Rönnblom
2022-10-09 16:16                       ` Morten Brørup
2022-07-29 18:13               ` Morten Brørup
2022-07-29 19:49                 ` Konstantin Ananyev
2022-07-29 20:26                   ` Morten Brørup
2022-07-29 21:34                     ` Konstantin Ananyev
2022-08-07 20:20                     ` Mattias Rönnblom
2022-08-09  9:34                       ` Morten Brørup
2022-08-09 11:56                         ` Mattias Rönnblom
2022-08-10 21:05                     ` Honnappa Nagarahalli
2022-08-11 11:50                       ` Mattias Rönnblom
2022-08-11 16:26                         ` Honnappa Nagarahalli
2022-07-25  1:17       ` Honnappa Nagarahalli
2022-07-27 10:26         ` Morten Brørup
2022-07-27 17:37           ` Honnappa Nagarahalli
2022-07-27 18:49             ` Morten Brørup
2022-07-27 19:12               ` Stephen Hemminger
2022-07-28  9:00                 ` Morten Brørup
2022-07-27 19:52               ` Honnappa Nagarahalli
2022-07-27 22:02                 ` Stanisław Kardach
2022-07-28 10:51                   ` Morten Brørup
2022-07-29  9:21                     ` Konstantin Ananyev
2022-08-07 20:25 ` Mattias Rönnblom
2022-08-09  9:46   ` Morten Brørup
2022-08-09 12:05     ` Mattias Rönnblom
2022-08-09 15:00       ` Morten Brørup
2022-08-10 11:47         ` Mattias Rönnblom
2022-08-09 15:26     ` Stephen Hemminger
2022-08-09 17:24       ` Morten Brørup
2022-08-10 11:59         ` Mattias Rönnblom
2022-08-10 12:12           ` Morten Brørup
2022-08-10 11:55       ` Mattias Rönnblom
2022-08-10 12:18         ` Morten Brørup
2022-08-10 21:20           ` Honnappa Nagarahalli
2022-08-11 11:53             ` Mattias Rönnblom
2022-08-11 22:24               ` Honnappa Nagarahalli
