DPDK patches and discussions
* [RFC v2] non-temporal memcpy
@ 2022-07-19 15:26 Morten Brørup
  2022-07-19 18:00 ` David Christensen
                   ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Morten Brørup @ 2022-07-19 15:26 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

This RFC proposes a set of functions optimized for non-temporal memory copy.

At this stage, I am asking for feedback on the concept.

Applications sometimes copy data to another memory location, which is only used
much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of these functions is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput may be improved by further optimization, I do not
consider throughput optimization relevant initially.

The x86 non-temporal load instructions have 16 byte alignment
requirements [1], while ARM non-temporal load instructions are available with
4 byte alignment requirements [2].
Both platforms offer non-temporal store instructions with 4 byte alignment
requirements.

In addition to the primary function without any alignment requirements, we
also provide functions for respectively 16 and 4 byte aligned access for
performance purposes.
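
As a usage sketch (purely illustrative; the capture_buf pointer, snap_len and
the mbuf m are hypothetical and not part of this RFC):

    /* Copy the first snap_len bytes of a filtered packet into a capture
     * buffer for offline analysis, bypassing the data cache. */
    rte_memcpy_nt(capture_buf, rte_pktmbuf_mtod(m, const void *), snap_len);

    /* If the application can guarantee 16 byte alignment of both pointers
     * and a length divisible by 16, the aligned variant may be used instead. */
    rte_memcpy_nt16a(capture_buf, rte_pktmbuf_mtod(m, const void *), snap_len);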

The function names resemble standard C library function names, but their
signatures are intentionally different. No need to drag legacy into it.

NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
style and use TAB.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-

V2:
- Only copy from non-temporal source to non-temporal destination.
  I.e. remove the two variants with only source and/or destination being
  non-temporal.
- Do not require alignment.
  Instead, offer additional 4 and 16 byte aligned functions for performance
  purposes.
- Implemented two of the functions for x86.
- Remove memset function.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data from non-temporal source to non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Should be 4 byte aligned, for optimal performance.
 * @param src
 *   Pointer to the non-temporal source data.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Should be divisible by 4, for optimal performance.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
/* Implementation T.B.D. */
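
For discussion, a minimal sketch of what the implementation could do (an
assumption, not final code): dispatch to the aligned variants declared below
when alignment permits, and otherwise fall back to the ordinary, cache
polluting, memcpy().

    {
        /* Sketch only: pick the best supported variant at run time. */
        if (rte_is_aligned(dst, 16) && rte_is_aligned(src, 16) && len % 16 == 0)
            rte_memcpy_nt16a(dst, src, len);
        else if (rte_is_aligned(dst, 4) && rte_is_aligned(src, 4) && len % 4 == 0)
            rte_memcpy_nt4a(dst, src, len);
        else
            memcpy(dst, src, len); /* temporal fallback */
    }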

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data in blocks of 16 byte from aligned non-temporal source
 * to aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
{
    const void * const  end = RTE_PTR_ADD(src, len);

    RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
    RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
    RTE_ASSERT((len % sizeof(__m128i)) == 0);

    /* Copy large portion of data. */
    while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
        register __m128i    xmm0, xmm1, xmm2, xmm3;

/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 * sizeof(__m128i)));
        xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 * sizeof(__m128i)));
        xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 * sizeof(__m128i)));
        xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 * sizeof(__m128i)));
#pragma GCC diagnostic pop
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)), xmm3);
        src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
    }

    /* Copy remaining data. */
    while (src != end) {
        register __m128i    xmm;

/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm = _mm_stream_load_si128(src);
#pragma GCC diagnostic pop
        _mm_stream_si128(dst, xmm);
        src = RTE_PTR_ADD(src, sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, sizeof(__m128i));
    }
}

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data in blocks of 4 byte from aligned non-temporal source
 * to aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 4 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 4 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
{
    int32_t             buf[sizeof(__m128i) / sizeof(int32_t)] __rte_aligned(sizeof(__m128i));
    /** Address of source data, rounded down to achieve alignment. */
    const void *        srca = RTE_PTR_ALIGN_FLOOR(src, sizeof(__m128i));
    /** Address of end of source data, rounded down to achieve alignment. */
    const void * const  srcenda = RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
    const int           offset =  RTE_PTR_DIFF(src, srca) / sizeof(int32_t);
    register __m128i    xmm0;

    RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
    RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
    RTE_ASSERT((len % sizeof(int32_t)) == 0);

    if (unlikely(len == 0)) return;

    /* Copy first, non-__m128i aligned, part of source data. */
    if (offset) {
/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(srca);
        _mm_store_si128((void *)buf, xmm0);
#pragma GCC diagnostic pop
        switch (offset) {
            case 1:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[1]);
                if (unlikely(len == 1 * sizeof(int32_t))) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[2]);
                if (unlikely(len == 2 * sizeof(int32_t))) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[3]);
                break;
            case 2:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[2]);
                if (unlikely(len == 1 * sizeof(int32_t))) return;
                _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[3]);
                break;
            case 3:
                _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[3]);
                break;
        }
        srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
        dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
    }

    /* Copy middle, __m128i aligned, part of source data. */
    while (srca != srcenda) {
/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(srca);
#pragma GCC diagnostic pop
        _mm_store_si128((void *)buf, xmm0);
        _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)), buf[3]);
        srca = RTE_PTR_ADD(srca, sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
    }

    /* Copy last, non-__m128i aligned, part of source data. */
    if (srcenda != RTE_PTR_ADD(src, len)) {
        /** Number of remaining 4 byte words in the tail. */
        const int tail = RTE_PTR_DIFF(RTE_PTR_ADD(src, len), srcenda) / sizeof(int32_t);

/* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(srca);
        _mm_store_si128((void *)buf, xmm0);
#pragma GCC diagnostic pop
        /* Store only the remaining tail words. */
        _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
        if (tail >= 2)
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
        if (tail == 3)
            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
    }
}


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-19 15:26 [RFC v2] non-temporal memcpy Morten Brørup
@ 2022-07-19 18:00 ` David Christensen
  2022-07-19 18:41   ` Morten Brørup
  2022-07-21 23:19 ` Konstantin Ananyev
  2022-08-07 20:25 ` Mattias Rönnblom
  2 siblings, 1 reply; 57+ messages in thread
From: David Christensen @ 2022-07-19 18:00 UTC (permalink / raw)
  To: Morten Brørup, dev, Bruce Richardson, Konstantin Ananyev
  Cc: Jan Viktorin, Ruifeng Wang, Stanislaw Kardach



On 7/19/22 8:26 AM, Morten Brørup wrote:
> This RFC proposes a set of functions optimized for non-temporal memory copy.
> 
> At this stage, I am asking for feedback on the concept.
> 
> Applications sometimes copy data to another memory location, which is only used
> much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
> 
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
> 
> The purpose of these functions is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput may be improved by further optimization, I do not
> consider throughput optimization relevant initially.
> 
Assume that fallback to the standard temporal memcpy is an acceptable 
implementation when not supported by the architecture, yes?  My internal 
queries on the POWER side indicate that there's no support in P8/P9/P10 
ISA for such functionality.

Dave

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-19 18:00 ` David Christensen
@ 2022-07-19 18:41   ` Morten Brørup
  2022-07-19 18:51     ` Stanisław Kardach
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-07-19 18:41 UTC (permalink / raw)
  To: David Christensen, dev, Bruce Richardson, Konstantin Ananyev
  Cc: Jan Viktorin, Ruifeng Wang, Stanislaw Kardach

> From: David Christensen [mailto:drc@linux.vnet.ibm.com]
> Sent: Tuesday, 19 July 2022 20.01
> 
> On 7/19/22 8:26 AM, Morten Brørup wrote:
> > This RFC proposes a set of functions optimized for non-temporal
> memory copy.
> >
> > At this stage, I am asking for feedback on the concept.
> >
> > Applications sometimes copy data to another memory location, which is only
> used
> > much later.
> > In this case, it is inefficient to pollute the data cache with the
> copied
> > data.
> >
> > An example use case (originating from a real life application):
> > Copying filtered packets, or the first part of them, into a capture
> buffer
> > for offline analysis.
> >
> > The purpose of these functions is to achieve a performance gain by
> not
> > polluting the cache when copying data.
> > Although the throughput may be improved by further optimization, I do
> not
> > consider throughput optimization relevant initially.
> >
> Assume that fallback to the standard temporal memcpy is an acceptable
> implementation when not supported by the architecture, yes?

Yes, that is exactly what I envisioned.

Furthermore, stores unaligned to a degree not supported by the architecture will also use temporal memcpy - at least for the unaligned first and last part of the copy. The middle (aligned) part may use non-temporal copy.
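
A sketch of what I mean (an assumption for illustration, not actual patch
code; the helper name is hypothetical, and only the simple case where src and
dst share the same misalignment is shown):

static void
nt_copy_with_temporal_edges(void *dst, const void *src, size_t len)
{
    /* Unaligned head: ordinary (temporal) copy up to the first 16 byte
     * boundary of dst. */
    size_t head = RTE_PTR_DIFF(RTE_PTR_ALIGN_CEIL(dst, 16), dst);
    if (head > len)
        head = len;
    /* Largest 16 byte multiple remaining after the head. */
    size_t middle = (len - head) & ~(size_t)15;

    memcpy(dst, src, head);                                      /* temporal */
    rte_memcpy_nt16a(RTE_PTR_ADD(dst, head),
            RTE_PTR_ADD(src, head), middle);                 /* non-temporal */
    memcpy(RTE_PTR_ADD(dst, head + middle),
            RTE_PTR_ADD(src, head + middle),
            len - head - middle);                                /* temporal */
}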

> My internal
> queries on the POWER side indicate that there's no support in P8/P9/P10
> ISA for such functionality.
> 
> Dave

Thank you for quick feedback, Dave!


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-19 18:41   ` Morten Brørup
@ 2022-07-19 18:51     ` Stanisław Kardach
  2022-07-19 22:15       ` Morten Brørup
  0 siblings, 1 reply; 57+ messages in thread
From: Stanisław Kardach @ 2022-07-19 18:51 UTC (permalink / raw)
  To: Morten Brørup
  Cc: David Christensen, dev, Bruce Richardson, Konstantin Ananyev,
	Jan Viktorin, Ruifeng Wang

On Tue, Jul 19, 2022 at 8:41 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: David Christensen [mailto:drc@linux.vnet.ibm.com]
> > Assume that fallback to the standard temporal memcpy is an acceptable
> > implementation when not supported by the architecture, yes?
>
> Yes, that is exactly what I envisioned.
>
> Furthermore, stores unaligned to a degree not supported by the architecture, will also use temporal memcpy - at least for the unaligned first and last part of the copy. The middle (aligned) part may use non-temporal copy.
>
To clarify, would you envision implementation in the arch-specific
headers + generic fallback or a shared one (generic unaligned + call
to aligned arch-specific)? The first one seems leaner.
RISC-V will definitely use generic implementation as non-temporal
load/store hints are still not ratified.
-- 
Best Regards,
Stanisław Kardach

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-19 18:51     ` Stanisław Kardach
@ 2022-07-19 22:15       ` Morten Brørup
  0 siblings, 0 replies; 57+ messages in thread
From: Morten Brørup @ 2022-07-19 22:15 UTC (permalink / raw)
  To: Stanisław Kardach
  Cc: David Christensen, dev, Bruce Richardson, Konstantin Ananyev,
	Jan Viktorin, Ruifeng Wang

> From: Stanisław Kardach [mailto:kda@semihalf.com]
> Sent: Tuesday, 19 July 2022 20.51
> 
> On Tue, Jul 19, 2022 at 8:41 PM Morten Brørup
> <mb@smartsharesystems.com> wrote:
> >
> > > From: David Christensen [mailto:drc@linux.vnet.ibm.com]
> > > Assume that fallback to the standard temporal memcpy is an
> acceptable
> > > implementation when not supported by the architecture, yes?
> >
> > Yes, that is exactly what I envisioned.
> >
> > Furthermore, stores unaligned to a degree not supported by the
> architecture, will also use temporal memcpy - at least for the
> unaligned first and last part of the copy. The middle (aligned) part
> may use non-temporal copy.
> >
> To clarify, would you envision implementation in the arch-specific
> headers + generic fallback or a shared one (generic unaligned + call
> to aligned arch-specific)? First one seems more lean.

Good feedback, Stanisław.

I agree that the first one is preferable.

It is also better prepared for some future platform supporting unaligned non-temporal load/store, if that is ever going to appear. :-)

> RISC-V will definitely use generic implementation as non-temporal
> load/store hints are still not ratified.

Yeah... my brief research on the topic showed that it had been suggested on some RISC-V mailing list, so I suppose it will get in there one day.

Not all CPUs have the same advanced features; and with memcpy() as a trustworthy fallback, I didn't expect anyone to object to this RFC on the basis of lack of support. I am pleased that both you (RISC-V maintainer) and Dave (POWER maintainer) share this opinion, even though your platforms don't support it. Thank you, both!

> --
> Best Regards,
> Stanisław Kardach


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-19 15:26 [RFC v2] non-temporal memcpy Morten Brørup
  2022-07-19 18:00 ` David Christensen
@ 2022-07-21 23:19 ` Konstantin Ananyev
  2022-07-22 10:44   ` Morten Brørup
  2022-08-07 20:25 ` Mattias Rönnblom
  2 siblings, 1 reply; 57+ messages in thread
From: Konstantin Ananyev @ 2022-07-21 23:19 UTC (permalink / raw)
  To: Morten Brørup, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

Hi Morten,

> This RFC proposes a set of functions optimized for non-temporal memory copy.
> 
> At this stage, I am asking for feedback on the concept.
> 
> Applications sometimes copy data to another memory location, which is only used
> much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
> 
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
> 
> The purpose of these functions is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput may be improved by further optimization, I do not
> consider throughput optimization relevant initially.
> 
> The x86 non-temporal load instructions have 16 byte alignment
> requirements [1], while ARM non-temporal load instructions are available with
> 4 byte alignment requirements [2].
> Both platforms offer non-temporal store instructions with 4 byte alignment
> requirements.
> 
> In addition to the primary function without any alignment requirements, we
> also provide functions for respectively 16 and 4 byte aligned access for
> performance purposes.
> 
> The function names resemble standard C library function names, but their
> signatures are intentionally different. No need to drag legacy into it.
> 
> NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
> style and use TAB.


I think there were discussions in the other direction - remove rte_memcpy()
completely and use memcpy() instead...
But if we have a good use case for that, then I am positive in principle.
Though I think we need a clear use-case within dpdk for it
to demonstrate the performance gain.
Probably copying packets within the pdump lib, or examples/dma, or ...
Another thought - do we really need a separate inline function for each
flavour?
Might it be just one non-inline rte_memcpy_nt(dst, src, size, flags),
where flags could be a combination of NT_SRC and NT_DST, keeping the alignment
detection/decisions in the particular implementation?
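
E.g. as a rough sketch (the flag names are placeholders, not a proposed API):

/* Hypothetical flags - names are only for illustration. */
#define RTE_MEMCPY_NT_SRC (1u << 0) /* source data is non-temporal */
#define RTE_MEMCPY_NT_DST (1u << 1) /* destination is non-temporal */

void rte_memcpy_nt(void *dst, const void *src, size_t size, uint64_t flags);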


> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
> [2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-
> 
> V2:
> - Only copy from non-temporal source to non-temporal destination.
>    I.e. remove the two variants with only source and/or destination being
>    non-temporal.
> - Do not require alignment.
>    Instead, offer additional 4 and 16 byte aligned functions for performance
>    purposes.
> - Implemented two of the functions for x86.
> - Remove memset function.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> 
> /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice.
>   *
>   * Copy data from non-temporal source to non-temporal destination.
>   *
>   * @param dst
>   *   Pointer to the non-temporal destination of the data.
>   *   Should be 4 byte aligned, for optimal performance.
>   * @param src
>   *   Pointer to the non-temporal source data.
>   *   No alignment requirements.
>   * @param len
>   *   Number of bytes to copy.
>   *   Should be divisible by 4, for optimal performance.
>   */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> /* Implementation T.B.D. */
> 
> /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice.
>   *
>   * Copy data in blocks of 16 byte from aligned non-temporal source
>   * to aligned non-temporal destination.
>   *
>   * @param dst
>   *   Pointer to the non-temporal destination of the data.
>   *   Must be 16 byte aligned.
>   * @param src
>   *   Pointer to the non-temporal source data.
>   *   Must be 16 byte aligned.
>   * @param len
>   *   Number of bytes to copy.
>   *   Must be divisible by 16.
>   */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> {
>      const void * const  end = RTE_PTR_ADD(src, len);
> 
>      RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
>      RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
>      RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
> 
>      /* Copy large portion of data. */
>      while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
>          register __m128i    xmm0, xmm1, xmm2, xmm3;
> 
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 * sizeof(__m128i)));
>          xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 * sizeof(__m128i)));
>          xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 * sizeof(__m128i)));
>          xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 * sizeof(__m128i)));
> #pragma GCC diagnostic pop
>          _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)), xmm0);
>          _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)), xmm1);
>          _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)), xmm2);
>          _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)), xmm3);
>          src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
>          dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
>      }
> 
>      /* Copy remaining data. */
>      while (src != end) {
>          register __m128i    xmm;
> 
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm = _mm_stream_load_si128(src);
> #pragma GCC diagnostic pop
>          _mm_stream_si128(dst, xmm);
>          src = RTE_PTR_ADD(src, sizeof(__m128i));
>          dst = RTE_PTR_ADD(dst, sizeof(__m128i));
>      }
> }
> 
> /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice.
>   *
>   * Copy data in blocks of 4 byte from aligned non-temporal source
>   * to aligned non-temporal destination.
>   *
>   * @param dst
>   *   Pointer to the non-temporal destination of the data.
>   *   Must be 4 byte aligned.
>   * @param src
>   *   Pointer to the non-temporal source data.
>   *   Must be 4 byte aligned.
>   * @param len
>   *   Number of bytes to copy.
>   *   Must be divisible by 4.
>   */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> {
>      int32_t             buf[sizeof(__m128i) / sizeof(int32_t)] __rte_aligned(sizeof(__m128i));
>      /** Address of source data, rounded down to achieve alignment. */
>      const void *        srca = RTE_PTR_ALIGN_FLOOR(src, sizeof(__m128i));
>      /** Address of end of source data, rounded down to achieve alignment. */
>      const void * const  srcenda = RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
>      const int           offset =  RTE_PTR_DIFF(src, srca) / sizeof(int32_t);
>      register __m128i    xmm0;
> 
>      RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
>      RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
>      RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
> 
>      if (unlikely(len == 0)) return;
> 
>      /* Copy first, non-__m128i aligned, part of source data. */
>      if (offset) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm0 = _mm_stream_load_si128(srca);
>          _mm_store_si128((void *)buf, xmm0);
> #pragma GCC diagnostic pop
>          switch (offset) {
>              case 1:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[1]);
>                  if (unlikely(len == 1 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[2]);
>                  if (unlikely(len == 2 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[3]);
>                  break;
>              case 2:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[2]);
>                  if (unlikely(len == 1 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[3]);
>                  break;
>              case 3:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[3]);
>                  break;
>          }
>          srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
>          dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
>      }
> 
>      /* Copy middle, __m128i aligned, part of source data. */
>      while (srca != srcenda) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm0 = _mm_stream_load_si128(srca);
> #pragma GCC diagnostic pop
>          _mm_store_si128((void *)buf, xmm0);
>          _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>          _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>          _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>          _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)), buf[3]);
>          srca = RTE_PTR_ADD(srca, sizeof(__m128i));
>          dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
>      }
> 
>      /* Copy last, non-__m128i aligned, part of source data. */
>      if (RTE_PTR_DIFF(srca, src) != 4) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm0 = _mm_stream_load_si128(srca);
>          _mm_store_si128((void *)buf, xmm0);
> #pragma GCC diagnostic pop
>          switch (offset) {
>              case 1:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>                  break;
>              case 2:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>                  if (unlikely(RTE_PTR_DIFF(srca, src) == 1 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>                  break;
>              case 3:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>                  if (unlikely(RTE_PTR_DIFF(srca, src) == 1 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>                  if (unlikely(RTE_PTR_DIFF(srca, src) == 2 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>                  break;
>          }
>      }
> }
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-21 23:19 ` Konstantin Ananyev
@ 2022-07-22 10:44   ` Morten Brørup
  2022-07-24 13:35     ` Konstantin Ananyev
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-07-22 10:44 UTC (permalink / raw)
  To: Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> Sent: Friday, 22 July 2022 01.20
> 
> Hi Morten,
> 
> > This RFC proposes a set of functions optimized for non-temporal
> memory copy.
> >
> > At this stage, I am asking for feedback on the concept.
> >
> > Applications sometimes copy data to another memory location, which is only
> used
> > much later.
> > In this case, it is inefficient to pollute the data cache with the
> copied
> > data.
> >
> > An example use case (originating from a real life application):
> > Copying filtered packets, or the first part of them, into a capture
> buffer
> > for offline analysis.
> >
> > The purpose of these functions is to achieve a performance gain by
> not
> > polluting the cache when copying data.
> > Although the throughput may be improved by further optimization, I do
> not
> > consider throughput optimization relevant initially.
> >
> > The x86 non-temporal load instructions have 16 byte alignment
> > requirements [1], while ARM non-temporal load instructions are
> available with
> > 4 byte alignment requirements [2].
> > Both platforms offer non-temporal store instructions with 4 byte
> alignment
> > requirements.
> >
> > In addition to the primary function without any alignment
> requirements, we
> > also provide functions for respectively 16 and 4 byte aligned access
> for
> > performance purposes.
> >
> > The function names resemble standard C library function names, but
> their
> > signatures are intentionally different. No need to drag legacy into
> it.
> >
> > NB: Don't comment on spaces for indentation; a patch will follow DPDK
> coding
> > style and use TAB.
> 
> 
> I think there were discussions in other direction - remove rte_memcpy()
> completely and use memcpy() instead...

Yes, the highly optimized rte_memcpy() implementation of memcpy() has become obsolete, now that modern compilers provide an efficient memcpy() implementation.

It's an excellent reference, because we should learn from it, and avoid introducing similar mistakes with non-temporal memcpy.

> But if we have a good use case for that, then I am positive in
> principle.

The standard C library doesn't offer non-temporal memcpy(), so we need to implement it ourselves.

> Though I think we need a clear use-case within dpdk for it
> to demonstrate performance gain.

The performance gain is to avoid polluting the data cache. DPDK example applications, like l3fwd, are probably too primitive to measure any benefit in this regard.

> Probably copying packets within pdump lib, or examples/dma. or ...

Good point - the new functions should be used somewhere within DPDK. For this purpose, I will look into modifying rte_pktmbuf_copy(), which is used by pdump_copy(), to use non-temporal copying of the packet data.

> Another thought - do we really need a separate inline function for each
> flavour?
> Might be just one non-inline rte_memcpy_nt(dst, src, size, flags),
> where flags could be combination of NT_SRC, NT_DST, and keep alignment
> detection/decisions to particular implementation?

Thank you for the feedback, Konstantin.

My answer to this suggestion gets a little longwinded...

Looking at the DPDK pcapng library, it copies a 4 byte aligned metadata structure of 28 bytes, so it can make do with the 4 byte aligned functions.

Our application can capture packets starting at the IP header, which is offset by 14 bytes (Ethernet header size) from the packet buffer, so it requires 2 byte alignment. Thus, requiring 4 byte alignment is not acceptable.

Our application uses 16 byte alignment in the capture buffer area, and can benefit from 16 byte aligned functions. Furthermore, x86 processors require 16 byte alignment for non-temporal load instructions, so I think a 16 byte aligned non-temporal memcpy function should be offered.

While working on these functions, I experimented with an rte_memcpy_nt() taking flags, which is also my personal preference, but haven't succeeded yet. Especially when copying a 16 byte aligned structure of only 16 bytes, the overhead of the function call + comparing the flags + the copy loop overhead is significant, compared to inline code consisting of only one pair of "movntdqa (%rsi),%xmm0; movntdq %xmm0,(%rdi)" instructions.

Remember that a non-inlined rte_memcpy_nt() will be called with widely varying sizes, due to the typical mix of small and big packets, so branch prediction will not help.

This RFC does not yet show the rte_memcpy_nt() function handling unaligned load/store, but it is more complex than the aligned functions. So I think the aligned variants are warranted - for performance reasons.

Some of the need for exposing individual functions for different alignments stems from the compiler being unable to determine the alignment of the source and destination pointers at build time. So we need to help the compiler with this at build time, and thus the need for inlining the function. Whether we expose a bunch of small inline functions or one big inline function with flags seems to be a matter of taste.

Thinking about it, you are probably right that exposing a single function with flags is better for documentation purposes and easier for other architectures to implement. But it still needs to be inline, for the reasons described above.
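
To illustrate what I mean (a sketch under the above assumptions; the function
and flag names are placeholders): if the wrapper is always inlined and the
flags are compile-time constants, the compiler can discard the unused branches,
so the small aligned case collapses to the bare load/store pair mentioned above.

/* Placeholder flag: dst and src are 16 byte aligned, len divisible by 16. */
#define RTE_MEMCPY_NT_ALIGN16 (1u << 0)

static __rte_always_inline void
rte_memcpy_nt_ex(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, uint64_t flags)
{
    if (flags & RTE_MEMCPY_NT_ALIGN16)
        rte_memcpy_nt16a(dst, src, len);
    else
        rte_memcpy_nt(dst, src, len);
}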

> 
> 
> > [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
> guide/index.html#text=_mm_stream_load
> > [2] https://developer.arm.com/documentation/100076/0100/A64-
> Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-
> and-FP-
> >
> > V2:
> > - Only copy from non-temporal source to non-temporal destination.
> >    I.e. remove the two variants with only source and/or destination
> being
> >    non-temporal.
> > - Do not require alignment.
> >    Instead, offer additional 4 and 16 byte aligned functions for
> performance
> >    purposes.
> > - Implemented two of the functions for x86.
> > - Remove memset function.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >
> > /**
> >   * @warning
> >   * @b EXPERIMENTAL: this API may change without prior notice.
> >   *
> >   * Copy data from non-temporal source to non-temporal destination.
> >   *
> >   * @param dst
> >   *   Pointer to the non-temporal destination of the data.
> >   *   Should be 4 byte aligned, for optimal performance.
> >   * @param src
> >   *   Pointer to the non-temporal source data.
> >   *   No alignment requirements.
> >   * @param len
> >   *   Number of bytes to copy.
> >   *   Should be divisible by 4, for optimal performance.
> >   */
> > __rte_experimental
> > static __rte_always_inline
> > __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> __access__(read_only, 2, 3)))
> > void rte_memcpy_nt(void * __rte_restrict dst, const void *
> __rte_restrict src, size_t len)
> > /* Implementation T.B.D. */
> >
> > /**
> >   * @warning
> >   * @b EXPERIMENTAL: this API may change without prior notice.
> >   *
> >   * Copy data in blocks of 16 byte from aligned non-temporal source
> >   * to aligned non-temporal destination.
> >   *
> >   * @param dst
> >   *   Pointer to the non-temporal destination of the data.
> >   *   Must be 16 byte aligned.
> >   * @param src
> >   *   Pointer to the non-temporal source data.
> >   *   Must be 16 byte aligned.
> >   * @param len
> >   *   Number of bytes to copy.
> >   *   Must be divisible by 16.
> >   */
> > __rte_experimental
> > static __rte_always_inline
> > __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> __access__(read_only, 2, 3)))
> > void rte_memcpy_nt16a(void * __rte_restrict dst, const void *
> __rte_restrict src, size_t len)
> > {
> >      const void * const  end = RTE_PTR_ADD(src, len);
> >
> >      RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
> >      RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
> >      RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
> >
> >      /* Copy large portion of data. */
> >      while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
> >          register __m128i    xmm0, xmm1, xmm2, xmm3;
> >
> > /* Note: Workaround for _mm_stream_load_si128() not taking a const
> pointer as parameter. */
> > #pragma GCC diagnostic push
> > #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >          xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 *
> sizeof(__m128i)));
> >          xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 *
> sizeof(__m128i)));
> >          xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 *
> sizeof(__m128i)));
> >          xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 *
> sizeof(__m128i)));
> > #pragma GCC diagnostic pop
> >          _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)),
> xmm0);
> >          _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)),
> xmm1);
> >          _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)),
> xmm2);
> >          _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)),
> xmm3);
> >          src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
> >          dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
> >      }
> >
> >      /* Copy remaining data. */
> >      while (src != end) {
> >          register __m128i    xmm;
> >
> > /* Note: Workaround for _mm_stream_load_si128() not taking a const
> pointer as parameter. */
> > #pragma GCC diagnostic push
> > #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >          xmm = _mm_stream_load_si128(src);
> > #pragma GCC diagnostic pop
> >          _mm_stream_si128(dst, xmm);
> >          src = RTE_PTR_ADD(src, sizeof(__m128i));
> >          dst = RTE_PTR_ADD(dst, sizeof(__m128i));
> >      }
> > }
> >
> > /**
> >   * @warning
> >   * @b EXPERIMENTAL: this API may change without prior notice.
> >   *
> >   * Copy data in blocks of 4 byte from aligned non-temporal source
> >   * to aligned non-temporal destination.
> >   *
> >   * @param dst
> >   *   Pointer to the non-temporal destination of the data.
> >   *   Must be 4 byte aligned.
> >   * @param src
> >   *   Pointer to the non-temporal source data.
> >   *   Must be 4 byte aligned.
> >   * @param len
> >   *   Number of bytes to copy.
> >   *   Must be divisible by 4.
> >   */
> > __rte_experimental
> > static __rte_always_inline
> > __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> __access__(read_only, 2, 3)))
> > void rte_memcpy_nt4a(void * __rte_restrict dst, const void *
> __rte_restrict src, size_t len)
> > {
> >      int32_t             buf[sizeof(__m128i) / sizeof(int32_t)]
> __rte_aligned(sizeof(__m128i));
> >      /** Address of source data, rounded down to achieve alignment.
> */
> >      const void *        srca = RTE_PTR_ALIGN_FLOOR(src,
> sizeof(__m128i));
> >      /** Address of end of source data, rounded down to achieve
> alignment. */
> >      const void * const  srcenda =
> RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
> >      const int           offset =  RTE_PTR_DIFF(src, srca) /
> sizeof(int32_t);
> >      register __m128i    xmm0;
> >
> >      RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
> >      RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
> >      RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
> >
> >      if (unlikely(len == 0)) return;
> >
> >      /* Copy first, non-__m128i aligned, part of source data. */
> >      if (offset) {
> > /* Note: Workaround for _mm_stream_load_si128() not taking a const
> pointer as parameter. */
> > #pragma GCC diagnostic push
> > #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >          xmm0 = _mm_stream_load_si128(srca);
> >          _mm_store_si128((void *)buf, xmm0);
> > #pragma GCC diagnostic pop
> >          switch (offset) {
> >              case 1:
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> sizeof(int32_t)), buf[1]);
> >                  if (unlikely(len == 1 * sizeof(int32_t))) return;
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> sizeof(int32_t)), buf[2]);
> >                  if (unlikely(len == 2 * sizeof(int32_t))) return;
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> sizeof(int32_t)), buf[3]);
> >                  break;
> >              case 2:
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> sizeof(int32_t)), buf[2]);
> >                  if (unlikely(len == 1 * sizeof(int32_t))) return;
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> sizeof(int32_t)), buf[3]);
> >                  break;
> >              case 3:
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> sizeof(int32_t)), buf[3]);
> >                  break;
> >          }
> >          srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
> >          dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
> >      }
> >
> >      /* Copy middle, __m128i aligned, part of source data. */
> >      while (srca != srcenda) {
> > /* Note: Workaround for _mm_stream_load_si128() not taking a const
> pointer as parameter. */
> > #pragma GCC diagnostic push
> > #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >          xmm0 = _mm_stream_load_si128(srca);
> > #pragma GCC diagnostic pop
> >          _mm_store_si128((void *)buf, xmm0);
> >          _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)),
> buf[0]);
> >          _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)),
> buf[1]);
> >          _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)),
> buf[2]);
> >          _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)),
> buf[3]);
> >          srca = RTE_PTR_ADD(srca, sizeof(__m128i));
> >          dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
> >      }
> >
> >      /* Copy last, non-__m128i aligned, part of source data. */
> >      if (RTE_PTR_DIFF(srca, src) != 4) {
> > /* Note: Workaround for _mm_stream_load_si128() not taking a const
> pointer as parameter. */
> > #pragma GCC diagnostic push
> > #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >          xmm0 = _mm_stream_load_si128(srca);
> >          _mm_store_si128((void *)buf, xmm0);
> > #pragma GCC diagnostic pop
> >          switch (offset) {
> >              case 1:
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> sizeof(int32_t)), buf[0]);
> >                  break;
> >              case 2:
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> sizeof(int32_t)), buf[0]);
> >                  if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> sizeof(int32_t))) return;
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> sizeof(int32_t)), buf[1]);
> >                  break;
> >              case 3:
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> sizeof(int32_t)), buf[0]);
> >                  if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> sizeof(int32_t))) return;
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> sizeof(int32_t)), buf[1]);
> >                  if (unlikely(RTE_PTR_DIFF(srca, src) == 2 *
> sizeof(int32_t))) return;
> >                  _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> sizeof(int32_t)), buf[2]);
> >                  break;
> >          }
> >      }
> > }
> >
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-22 10:44   ` Morten Brørup
@ 2022-07-24 13:35     ` Konstantin Ananyev
  2022-07-24 22:18       ` Morten Brørup
  2022-07-25  1:17       ` Honnappa Nagarahalli
  0 siblings, 2 replies; 57+ messages in thread
From: Konstantin Ananyev @ 2022-07-24 13:35 UTC (permalink / raw)
  To: Morten Brørup, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

22/07/2022 11:44, Morten Brørup wrote:
>> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
>> Sent: Friday, 22 July 2022 01.20
>>
>> Hi Morten,
>>
>>> This RFC proposes a set of functions optimized for non-temporal
>> memory copy.
>>>
>>> At this stage, I am asking for feedback on the concept.
>>>
>>> Applications sometimes copy data to another memory location, which is only
>> used
>>> much later.
>>> In this case, it is inefficient to pollute the data cache with the
>> copied
>>> data.
>>>
>>> An example use case (originating from a real life application):
>>> Copying filtered packets, or the first part of them, into a capture
>> buffer
>>> for offline analysis.
>>>
>>> The purpose of these functions is to achieve a performance gain by
>> not
>>> polluting the cache when copying data.
>>> Although the throughput may be improved by further optimization, I do
>> not
>>> consider throughput optimization relevant initially.
>>>
>>> The x86 non-temporal load instructions have 16 byte alignment
>>> requirements [1], while ARM non-temporal load instructions are
>> available with
>>> 4 byte alignment requirements [2].
>>> Both platforms offer non-temporal store instructions with 4 byte
>> alignment
>>> requirements.
>>>
>>> In addition to the primary function without any alignment
>> requirements, we
>>> also provide functions for respectively 16 and 4 byte aligned access
>> for
>>> performance purposes.
>>>
>>> The function names resemble standard C library function names, but
>> their
>>> signatures are intentionally different. No need to drag legacy into
>> it.
>>>
>>> NB: Don't comment on spaces for indentation; a patch will follow DPDK
>> coding
>>> style and use TAB.
>>
>>
>> I think there were discussions in other direction - remove rte_memcpy()
>> completely and use memcpy() instead...
> 
> Yes, the highly optimized rte_memcpy() implementation of memcpy() has become obsolete, now that modern compilers provide an efficient memcpy() implementation.
> 
> It's an excellent reference, because we should learn from it, and avoid introducing similar mistakes with non-temporal memcpy.
> 
>> But if we have a good use case for that, then I am positive in
>> principle.
> 
> The standard C library doesn't offer non-temporal memcpy(), so we need to implement it ourselves.
> 
>> Though I think we need a clear use-case within dpdk for it
>> to demonstrate performance gain.
> 
> The performance gain is to avoid polluting the data cache. DPDK example applications, like l3fwd, are probably too primitive to measure any benefit in this regard.
> 
>> Probably copying packets within pdump lib, or examples/dma. or ...
> 
> Good point - the new functions should be used somewhere within DPDK. For this purpose, I will look into modifying rte_pktmbuf_copy(), which is used by pdump_copy(), to use non-temporal copying of the packet data.
> 
>> Another thought - do we really need a separate inline function for each
>> flavour?
>> Might be just one non-inline rte_memcpy_nt(dst, src, size, flags),
>> where flags could be combination of NT_SRC, NT_DST, and keep alignment
>> detection/decisions to particular implementation?
> 
> Thank you for the feedback, Konstantin.
> 
> My answer to this suggestion gets a little longwinded...
> 
> Looking at the DPDK pcapng library, it copies a 4 byte aligned metadata structure sized 28 byte. So it can do with 4 byte aligned functions.
> 
> Our application can capture packets starting at the IP header, which is offset by 14 byte (Ethernet header size) from the packet buffer, so it requires 2 byte alignment. And thus, requiring 4 byte alignment is not acceptable.
> 
> Our application uses 16 byte alignment in the capture buffer area, and can benefit from 16 byte aligned functions. Furthermore, x86 processors require 16 byte alignment for non-temporal load instructions, so I think a 16 byte aligned non-temporal memcpy function should be offered.


Yes, x86 needs 16B alignment for NT load/stores.
But that's supposed to be an arch-specific limitation,
which we probably want to hide, no?
Inside the function, we can check the alignment of both src and dst
and decide whether to use NT load/store instructions or just
do a normal copy.


> While working on these functions, I experimented with an rte_memcpy_nt() taking flags, which is also my personal preference, but haven't succeeded yet. Especially when copying a 16 byte aligned structure of only 16 byte, the overhead of the function call + comparing the flags + the copy loop overhead is significant, compared to inline code consisting of only one pair of "movntdqa (%rsi),%xmm0; movntdq %xmm0,(%rdi)" instructions.
> 
> Remember that a non-inlined rte_memcpy_nt() will be called with very varying size, due to the typical mix of small and big packets, so branch prediction will not help.
> 
> This RFC does not yet show the rte_memcpy_nt() function handling unaligned load/store, but it is more complex than the aligned functions. So I think the aligned variants are warranted - for performance reasons.
> 
> Some of the need for exposing individual functions for different alignment stems from the compiler being unable to determine the alignment of the source and destination pointers at build time. So we need to help the compiler with this at build time, and thus the need for inlining the function. If we expose a bunch of small inline functions or a big inline function with flags seems to be a matter of taste.
> 
> Thinking about it, you are probably right that exposing a single function with flags is better for documentation purposes and easier for other architectures to implement. But it still needs to be inline, for the reasons described above.


Ok, my initial thought was that the main use-case for it would be copying of
big chunks of data, but from your description it might not be the case.
Yes, for just a 16/32B copy, the function call overhead might be way too high...
As another alternative - would a memcpy_nt_bulk() help somehow?
It could do the copying for several src/dst pairs at once, and
that might help to amortize the cost of the function call.
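
For example, something along these lines (just a sketch; the names are not a
proposal):

/* Hypothetical bulk variant: one call copies several src/dst pairs,
 * amortizing the function call cost over the whole burst. */
struct rte_memcpy_nt_seg {
    void *dst;
    const void *src;
    size_t len;
};

void rte_memcpy_nt_bulk(const struct rte_memcpy_nt_seg *segs, unsigned int n_segs);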


> 
>>
>>
>>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
>> guide/index.html#text=_mm_stream_load
>>> [2] https://developer.arm.com/documentation/100076/0100/A64-
>> Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-
>> and-FP-
>>>
>>> V2:
>>> - Only copy from non-temporal source to non-temporal destination.
>>>     I.e. remove the two variants with only source and/or destination
>> being
>>>     non-temporal.
>>> - Do not require alignment.
>>>     Instead, offer additional 4 and 16 byte aligned functions for
>> performance
>>>     purposes.
>>> - Implemented two of the functions for x86.
>>> - Remove memset function.
>>>
>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>>> ---
>>>
>>> /**
>>>    * @warning
>>>    * @b EXPERIMENTAL: this API may change without prior notice.
>>>    *
>>>    * Copy data from non-temporal source to non-temporal destination.
>>>    *
>>>    * @param dst
>>>    *   Pointer to the non-temporal destination of the data.
>>>    *   Should be 4 byte aligned, for optimal performance.
>>>    * @param src
>>>    *   Pointer to the non-temporal source data.
>>>    *   No alignment requirements.
>>>    * @param len
>>>    *   Number of bytes to copy.
>>>    *   Should be divisible by 4, for optimal performance.
>>>    */
>>> __rte_experimental
>>> static __rte_always_inline
>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
>> __access__(read_only, 2, 3)))
>>> void rte_memcpy_nt(void * __rte_restrict dst, const void *
>> __rte_restrict src, size_t len)
>>> /* Implementation T.B.D. */
>>>
>>> /**
>>>    * @warning
>>>    * @b EXPERIMENTAL: this API may change without prior notice.
>>>    *
>>>    * Copy data in blocks of 16 byte from aligned non-temporal source
>>>    * to aligned non-temporal destination.
>>>    *
>>>    * @param dst
>>>    *   Pointer to the non-temporal destination of the data.
>>>    *   Must be 16 byte aligned.
>>>    * @param src
>>>    *   Pointer to the non-temporal source data.
>>>    *   Must be 16 byte aligned.
>>>    * @param len
>>>    *   Number of bytes to copy.
>>>    *   Must be divisible by 16.
>>>    */
>>> __rte_experimental
>>> static __rte_always_inline
>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
>> __access__(read_only, 2, 3)))
>>> void rte_memcpy_nt16a(void * __rte_restrict dst, const void *
>> __rte_restrict src, size_t len)
>>> {
>>>       const void * const  end = RTE_PTR_ADD(src, len);
>>>
>>>       RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
>>>       RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
>>>       RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
>>>
>>>       /* Copy large portion of data. */
>>>       while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
>>>           register __m128i    xmm0, xmm1, xmm2, xmm3;
>>>
>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>> pointer as parameter. */
>>> #pragma GCC diagnostic push
>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>           xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 *
>> sizeof(__m128i)));
>>>           xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 *
>> sizeof(__m128i)));
>>>           xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 *
>> sizeof(__m128i)));
>>>           xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 *
>> sizeof(__m128i)));
>>> #pragma GCC diagnostic pop
>>>           _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)),
>> xmm0);
>>>           _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)),
>> xmm1);
>>>           _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)),
>> xmm2);
>>>           _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)),
>> xmm3);
>>>           src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
>>>           dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
>>>       }
>>>
>>>       /* Copy remaining data. */
>>>       while (src != end) {
>>>           register __m128i    xmm;
>>>
>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>> pointer as parameter. */
>>> #pragma GCC diagnostic push
>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>           xmm = _mm_stream_load_si128(src);
>>> #pragma GCC diagnostic pop
>>>           _mm_stream_si128(dst, xmm);
>>>           src = RTE_PTR_ADD(src, sizeof(__m128i));
>>>           dst = RTE_PTR_ADD(dst, sizeof(__m128i));
>>>       }
>>> }
>>>
>>> /**
>>>    * @warning
>>>    * @b EXPERIMENTAL: this API may change without prior notice.
>>>    *
>>>    * Copy data in blocks of 4 byte from aligned non-temporal source
>>>    * to aligned non-temporal destination.
>>>    *
>>>    * @param dst
>>>    *   Pointer to the non-temporal destination of the data.
>>>    *   Must be 4 byte aligned.
>>>    * @param src
>>>    *   Pointer to the non-temporal source data.
>>>    *   Must be 4 byte aligned.
>>>    * @param len
>>>    *   Number of bytes to copy.
>>>    *   Must be divisible by 4.
>>>    */
>>> __rte_experimental
>>> static __rte_always_inline
>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
>> __access__(read_only, 2, 3)))
>>> void rte_memcpy_nt4a(void * __rte_restrict dst, const void *
>> __rte_restrict src, size_t len)
>>> {
>>>       int32_t             buf[sizeof(__m128i) / sizeof(int32_t)]
>> __rte_aligned(sizeof(__m128i));
>>>       /** Address of source data, rounded down to achieve alignment.
>> */
>>>       const void *        srca = RTE_PTR_ALIGN_FLOOR(src,
>> sizeof(__m128i));
>>>       /** Address of end of source data, rounded down to achieve
>> alignment. */
>>>       const void * const  srcenda =
>> RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
>>>       const int           offset =  RTE_PTR_DIFF(src, srca) /
>> sizeof(int32_t);
>>>       register __m128i    xmm0;
>>>
>>>       RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
>>>       RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
>>>       RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
>>>
>>>       if (unlikely(len == 0)) return;
>>>
>>>       /* Copy first, non-__m128i aligned, part of source data. */
>>>       if (offset) {
>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>> pointer as parameter. */
>>> #pragma GCC diagnostic push
>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>           xmm0 = _mm_stream_load_si128(srca);
>>>           _mm_store_si128((void *)buf, xmm0);
>>> #pragma GCC diagnostic pop
>>>           switch (offset) {
>>>               case 1:
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>> sizeof(int32_t)), buf[1]);
>>>                   if (unlikely(len == 1 * sizeof(int32_t))) return;
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
>> sizeof(int32_t)), buf[2]);
>>>                   if (unlikely(len == 2 * sizeof(int32_t))) return;
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
>> sizeof(int32_t)), buf[3]);
>>>                   break;
>>>               case 2:
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>> sizeof(int32_t)), buf[2]);
>>>                   if (unlikely(len == 1 * sizeof(int32_t))) return;
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
>> sizeof(int32_t)), buf[3]);
>>>                   break;
>>>               case 3:
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>> sizeof(int32_t)), buf[3]);
>>>                   break;
>>>           }
>>>           srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
>>>           dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
>>>       }
>>>
>>>       /* Copy middle, __m128i aligned, part of source data. */
>>>       while (srca != srcenda) {
>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>> pointer as parameter. */
>>> #pragma GCC diagnostic push
>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>           xmm0 = _mm_stream_load_si128(srca);
>>> #pragma GCC diagnostic pop
>>>           _mm_store_si128((void *)buf, xmm0);
>>>           _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)),
>> buf[0]);
>>>           _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)),
>> buf[1]);
>>>           _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)),
>> buf[2]);
>>>           _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)),
>> buf[3]);
>>>           srca = RTE_PTR_ADD(srca, sizeof(__m128i));
>>>           dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
>>>       }
>>>
>>>       /* Copy last, non-__m128i aligned, part of source data. */
>>>       if (RTE_PTR_DIFF(srca, src) != 4) {
>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>> pointer as parameter. */
>>> #pragma GCC diagnostic push
>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>           xmm0 = _mm_stream_load_si128(srca);
>>>           _mm_store_si128((void *)buf, xmm0);
>>> #pragma GCC diagnostic pop
>>>           switch (offset) {
>>>               case 1:
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>> sizeof(int32_t)), buf[0]);
>>>                   break;
>>>               case 2:
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>> sizeof(int32_t)), buf[0]);
>>>                   if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
>> sizeof(int32_t))) return;
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
>> sizeof(int32_t)), buf[1]);
>>>                   break;
>>>               case 3:
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>> sizeof(int32_t)), buf[0]);
>>>                   if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
>> sizeof(int32_t))) return;
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
>> sizeof(int32_t)), buf[1]);
>>>                   if (unlikely(RTE_PTR_DIFF(srca, src) == 2 *
>> sizeof(int32_t))) return;
>>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
>> sizeof(int32_t)), buf[2]);
>>>                   break;
>>>           }
>>>       }
>>> }
>>>
>>
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-24 13:35     ` Konstantin Ananyev
@ 2022-07-24 22:18       ` Morten Brørup
  2022-07-29 10:00         ` Konstantin Ananyev
  2022-07-25  1:17       ` Honnappa Nagarahalli
  1 sibling, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-07-24 22:18 UTC (permalink / raw)
  To: Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> Sent: Sunday, 24 July 2022 15.35
> 
> 22/07/2022 11:44, Morten Brørup пишет:
> >> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> >> Sent: Friday, 22 July 2022 01.20
> >>
> >> Hi Morten,
> >>
> >>> This RFC proposes a set of functions optimized for non-temporal
> >> memory copy.
> >>>
> >>> At this stage, I am asking for feedback on the concept.
> >>>
> >>> Applications sometimes data to another memory location, which is
> only
> >> used
> >>> much later.
> >>> In this case, it is inefficient to pollute the data cache with the
> >> copied
> >>> data.
> >>>
> >>> An example use case (originating from a real life application):
> >>> Copying filtered packets, or the first part of them, into a capture
> >> buffer
> >>> for offline analysis.
> >>>
> >>> The purpose of these functions is to achieve a performance gain by
> >> not
> >>> polluting the cache when copying data.
> >>> Although the throughput may be improved by further optimization, I
> do
> >> not
> >>> consider througput optimization relevant initially.
> >>>
> >>> The x86 non-temporal load instructions have 16 byte alignment
> >>> requirements [1], while ARM non-temporal load instructions are
> >> available with
> >>> 4 byte alignment requirements [2].
> >>> Both platforms offer non-temporal store instructions with 4 byte
> >> alignment
> >>> requirements.
> >>>
> >>> In addition to the primary function without any alignment
> >> requirements, we
> >>> also provide functions for respectivly 16 and 4 byte aligned access
> >> for
> >>> performance purposes.
> >>>
> >>> The function names resemble standard C library function names, but
> >> their
> >>> signatures are intentionally different. No need to drag legacy into
> >> it.
> >>>
> >>> NB: Don't comment on spaces for indentation; a patch will follow
> DPDK
> >> coding
> >>> style and use TAB.
> >>
> >>
> >> I think there were discussions in other direction - remove
> rte_memcpy()
> >> completely and use memcpy() instead...
> >
> > Yes, the highly optimized rte_memcpy() implementation of memcpy() has
> become obsolete, now that modern compilers provide an efficient
> memcpy() implementation.
> >
> > It's an excellent reference, because we should learn from it, and
> avoid introducing similar mistakes with non-temporal memcpy.
> >
> >> But if we have a good use case for that, then I am positive in
> >> principle.
> >
> > The standard C library doesn't offer non-temporal memcpy(), so we
> need to implement it ourselves.
> >
> >> Though I think we need a clear use-case within dpdk for it
> >> to demonstrate perfomance gain.
> >
> > The performance gain is to avoid polluting the data cache. DPDK
> example applications, like l3fwd, are probably too primitive to measure
> any benefit in this regard.
> >
> >> Probably copying packets within pdump lib, or examples/dma. or ...
> >
> > Good point - the new functions should be used somewhere within DPDK.
> For this purpose, I will look into modifying rte_pktmbuf_copy(), which
> is used by pdump_copy(), to use non-temporal copying of the packet
> data.
> >
> >> Another thought - do we really need a separate inline function for
> each
> >> flavour?
> >> Might be just one non-inline rte_memcpy_nt(dst, src, size, flags),
> >> where flags could be combination of NT_SRC, NT_DST, and keep
> alignment
> >> detection/decisions to particular implementation?
> >
> > Thank you for the feedback, Konstantin.
> >
> > My answer to this suggestion gets a little longwinded...
> >
> > Looking at the DPDK pcapng library, it copies a 4 byte aligned
> metadata structure sized 28 byte. So it can do with 4 byte aligned
> functions.
> >
> > Our application can capture packets starting at the IP header, which
> is offset by 14 byte (Ethernet header size) from the packet buffer, so
> it requires 2 byte alignment. And thus, requiring 4 byte alignment is
> not acceptable.
> >
> > Our application uses 16 byte alignment in the capture buffer area,
> and can benefit from 16 byte aligned functions. Furthermore, x86
> processors require 16 byte alignment for non-temporal load
> instructions, so I think a 16 byte aligned non-temporal memcpy function
> should be offered.
> 
> 
> Yes, x86 needs 16B alignment for NT load/stores
> But that's supposed to be arch specific limitation,
> that we probably want to hide, no?

Agree.

> Inside the function can check alignment of both src and dst
> and decide should it use NT load/store instructions or just
> do normal copy.

Yes, I'm experimenting with the x86 inline function shown below. And hopefully, with some "extern inline" or other magic, I can hide the different implementations in the arch specific headers, and only expose the function declaration of rte_memcpy_nt() in the common header.

I'm currently working on the x86 implementation - when I'm satisfied with that, I'll look into how to hide the implementations in the arch specific header files, and only expose the common function declaration in the generic header file also used for documentation. It works for rte_memcpy(), so I can probably find out how it is done there.

/*
 * Non-Temporal Memory Operations Flags.
 */

#define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)   /**< Length alignment mask. */
#define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)      /**< Length is 2 byte aligned. */
#define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)      /**< Length is 4 byte aligned. */
#define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)      /**< Length is 8 byte aligned. */
#define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)     /**< Length is 16 byte aligned. */
#define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)     /**< Length is 32 byte aligned. */
#define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)     /**< Length is 64 byte aligned. */
#define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)    /**< Length is 128 byte aligned. */

#define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 8)   /**< Destination address alignment mask. */
#define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 8)      /**< Destination address is 2 byte aligned. */
#define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 8)      /**< Destination address is 4 byte aligned. */
#define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 8)      /**< Destination address is 8 byte aligned. */
#define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 8)     /**< Destination address is 16 byte aligned. */
#define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 8)     /**< Destination address is 32 byte aligned. */
#define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 8)     /**< Destination address is 64 byte aligned. */
#define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 8)    /**< Destination address is 128 byte aligned. */

#define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 16)  /**< Source address alignment mask. */
#define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 16)     /**< Source address is 2 byte aligned. */
#define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 16)     /**< Source address is 4 byte aligned. */
#define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 16)     /**< Source address is 8 byte aligned. */
#define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 16)    /**< Source address is 16 byte aligned. */
#define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 16)    /**< Source address is 32 byte aligned. */
#define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 16)    /**< Source address is 64 byte aligned. */
#define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 16)   /**< Source address is 128 byte aligned. */

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Non-temporal memory copy.
 * The memory areas must not overlap.
 *
 * @note
 * If the destination and/or length is unaligned, some copied bytes will be
 * stored in the destination memory area using temporal access.
 *
 * @param dst
 *   Pointer to the non-temporal destination memory area.
 * @param src
 *   Pointer to the non-temporal source memory area.
 * @param len
 *   Number of bytes to copy.
 * @param flags
 *   Hints for memory access.
 *   Any of the RTE_MEMOPS_F_LENnA, RTE_MEMOPS_F_DSTnA, RTE_MEMOPS_F_SRCnA flags.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
        const uint64_t flags)
{
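    /* Dispatch on alignment: prefer the 16 byte aligned NT load/store path,
     * then the 4 byte aligned path, and finally the unaligned variant. With
     * compile-time constant flags, the compiler can resolve this dispatch at
     * build time and inline only the selected variant.
     */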
    if (__builtin_constant_p(flags) ?
            ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A &&
            (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) :
            !(((uintptr_t)dst | len) & (16 - 1))) {
        if (__builtin_constant_p(flags) ?
                (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A :
                !((uintptr_t)src & (16 - 1)))
            rte_memcpy_nt16a(dst, src, len/*, flags*/);
        else
            rte_memcpy_nt16dla(dst, src, len/*, flags*/);
    }
    else if (__builtin_constant_p(flags) ? (
            (flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A &&
            (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
            (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A) :
            !(((uintptr_t)dst | (uintptr_t)src | len) & (4 - 1))) {
        rte_memcpy_nt4a(dst, src, len/*, flags*/);
    }
    else
        rte_memcpy_nt_unaligned(dst, src, len/*, flags*/);
}
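
Below is a minimal usage sketch of the flags-based API above, showing how constant flags
are intended to remove the run-time alignment checks. The struct and function names are
made up for the example, the usual DPDK headers are assumed to be included, and the caller
is assumed to guarantee that len is a multiple of 16; only rte_memcpy_nt() and the
RTE_MEMOPS_F_* flags are from this proposal.

/* Hypothetical capture buffer entry, 16 byte aligned by declaration. */
struct capture_entry {
    uint8_t data[256];
} __rte_aligned(16);

static inline void
capture_copy(struct capture_entry *e, const void *src, size_t len)
{
    /*
     * The destination and length are 16 byte aligned here (caller's contract);
     * the source alignment is unknown. The constant flags let the compiler
     * drop the run-time checks in rte_memcpy_nt() and inline the 16 byte
     * aligned destination path.
     */
    rte_memcpy_nt(e->data, src, len,
            RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_LEN16A);
}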


> 
> 
> > While working on these funtions, I experimented with an
> rte_memcpy_nt() taking flags, which is also my personal preference, but
> haven't succeed yet. Especially when copying a 16 byte aligned
> structure of only 16 byte, the overhead of the function call +
> comparing the flags + the copy loop overhead is significant, compared
> to inline code consisting of only one pair of "movntdqa (%rsi),%xmm0;
> movntdq %xmm0,(%rdi)" instructions.
> >
> > Remember that a non-inlined rte_memcpy_nt() will be called with very
> varying size, due to the typical mix of small and big packets, so
> branch prediction will not help.
> >
> > This RFC does not yet show the rte_memcpy_nt() function handling
> unaligned load/store, but it is more complex than the aligned
> functions. So I think the aligned variants are warranted - for
> performance reasons.
> >
> > Some of the need for exposing individual functions for different
> alignment stems from the compiler being unable to determine the
> alignment of the source and destination pointers at build time. So we
> need to help the compiler with this at build time, and thus the need
> for inlining the function. If we expose a bunch of small inline
> functions or a big inline function with flags seems to be a matter of
> taste.
> >
> > Thinking about it, you are probably right that exposing a single
> function with flags is better for documentation purposes and easier for
> other architectures to implement. But it still needs to be inline, for
> the reasons described above.
> 
> 
> Ok, my initial thought was that main use-case for it would be copying
> of
> big chunks of data, but from your description it might not be the case.

This is for quickly copying relatively small pieces of data synchronously without polluting the CPU's data cache, e.g. just before passing on a packet to an Ethernet PMD for transmission.

Big chunks of data should be copied asynchronously by DMA.
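
For reference, a rough sketch of the asynchronous alternative using the dmadev API (error
handling, completion polling and vchan setup omitted; this is only an assumption about how
it could be wired up, not part of this RFC):

#include <rte_dmadev.h>
#include <rte_mbuf.h>

/* Enqueue a DMA copy of one packet's data into a capture buffer.
 * dev_id and vchan are assumed to be configured elsewhere. */
static int
capture_pkt_dma(int16_t dev_id, uint16_t vchan,
        const struct rte_mbuf *m, rte_iova_t dst_iova)
{
    return rte_dma_copy(dev_id, vchan, rte_mbuf_data_iova(m), dst_iova,
            rte_pktmbuf_data_len(m), RTE_DMA_OP_FLAG_SUBMIT);
}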

> Yes, for just 16/32B copy function call overhead might be way too
> high...
> As another alternative - would memcpy_nt_bulk() help somehow?
> It can do copying for the several src/dst pairs at once and
> that might help to amortize cost of function call.

In many cases, memcpy_nt() will replace memcpy() inside loops, so it should be just as easy to use as memcpy(). E.g. look at rte_pktmbuf_copy()... Building an array of src/dst pairs to pass to memcpy_nt_bulk() from rte_pktmbuf_copy() would require a significant rewrite of rte_pktmbuf_copy(), compared to just replacing rte_memcpy() with rte_memcpy_nt(); a sketch of such a replacement follows below. And this is just one function using memcpy().
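
Here is a minimal sketch of what such a drop-in replacement could look like in an
rte_pktmbuf_copy()-style segment walk. The loop is simplified and is not the actual
librte_mbuf code; the flags argument assumes the rte_memcpy_nt() prototype discussed
above, and no alignment hints are passed because none are known here.

/* Simplified segment-walk copy into a linear destination buffer. */
static void
copy_pkt_data_nt(void *dst, const struct rte_mbuf *m, uint32_t off, uint32_t len)
{
    while (len > 0 && m != NULL) {
        uint32_t copy_len = RTE_MIN(rte_pktmbuf_data_len(m) - off, len);

        /* Was: rte_memcpy(dst, rte_pktmbuf_mtod_offset(m, const void *, off), copy_len); */
        rte_memcpy_nt(dst, rte_pktmbuf_mtod_offset(m, const void *, off),
                copy_len, 0);

        dst = RTE_PTR_ADD(dst, copy_len);
        len -= copy_len;
        off = 0;
        m = m->next;
    }
}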

> 
> 
> >
> >>
> >>
> >>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
> >> guide/index.html#text=_mm_stream_load
> >>> [2] https://developer.arm.com/documentation/100076/0100/A64-
> >> Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--
> SIMD-
> >> and-FP-
> >>>
> >>> V2:
> >>> - Only copy from non-temporal source to non-temporal destination.
> >>>     I.e. remove the two variants with only source and/or
> destination
> >> being
> >>>     non-temporal.
> >>> - Do not require alignment.
> >>>     Instead, offer additional 4 and 16 byte aligned functions for
> >> performance
> >>>     purposes.
> >>> - Implemented two of the functions for x86.
> >>> - Remove memset function.
> >>>
> >>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >>> ---
> >>>
> >>> /**
> >>>    * @warning
> >>>    * @b EXPERIMENTAL: this API may change without prior notice.
> >>>    *
> >>>    * Copy data from non-temporal source to non-temporal
> destination.
> >>>    *
> >>>    * @param dst
> >>>    *   Pointer to the non-temporal destination of the data.
> >>>    *   Should be 4 byte aligned, for optimal performance.
> >>>    * @param src
> >>>    *   Pointer to the non-temporal source data.
> >>>    *   No alignment requirements.
> >>>    * @param len
> >>>    *   Number of bytes to copy.
> >>>    *   Should be be divisible by 4, for optimal performance.
> >>>    */
> >>> __rte_experimental
> >>> static __rte_always_inline
> >>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> >> __access__(read_only, 2, 3)))
> >>> void rte_memcpy_nt(void * __rte_restrict dst, const void *
> >> __rte_restrict src, size_t len)
> >>> /* Implementation T.B.D. */
> >>>
> >>> /**
> >>>    * @warning
> >>>    * @b EXPERIMENTAL: this API may change without prior notice.
> >>>    *
> >>>    * Copy data in blocks of 16 byte from aligned non-temporal
> source
> >>>    * to aligned non-temporal destination.
> >>>    *
> >>>    * @param dst
> >>>    *   Pointer to the non-temporal destination of the data.
> >>>    *   Must be 16 byte aligned.
> >>>    * @param src
> >>>    *   Pointer to the non-temporal source data.
> >>>    *   Must be 16 byte aligned.
> >>>    * @param len
> >>>    *   Number of bytes to copy.
> >>>    *   Must be divisible by 16.
> >>>    */
> >>> __rte_experimental
> >>> static __rte_always_inline
> >>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> >> __access__(read_only, 2, 3)))
> >>> void rte_memcpy_nt16a(void * __rte_restrict dst, const void *
> >> __rte_restrict src, size_t len)
> >>> {
> >>>       const void * const  end = RTE_PTR_ADD(src, len);
> >>>
> >>>       RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
> >>>       RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
> >>>       RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
> >>>
> >>>       /* Copy large portion of data. */
> >>>       while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
> >>>           register __m128i    xmm0, xmm1, xmm2, xmm3;
> >>>
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 *
> >> sizeof(__m128i)));
> >>>           xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 *
> >> sizeof(__m128i)));
> >>>           xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 *
> >> sizeof(__m128i)));
> >>>           xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 *
> >> sizeof(__m128i)));
> >>> #pragma GCC diagnostic pop
> >>>           _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)),
> >> xmm0);
> >>>           _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)),
> >> xmm1);
> >>>           _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)),
> >> xmm2);
> >>>           _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)),
> >> xmm3);
> >>>           src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
> >>>           dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
> >>>       }
> >>>
> >>>       /* Copy remaining data. */
> >>>       while (src != end) {
> >>>           register __m128i    xmm;
> >>>
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm = _mm_stream_load_si128(src);
> >>> #pragma GCC diagnostic pop
> >>>           _mm_stream_si128(dst, xmm);
> >>>           src = RTE_PTR_ADD(src, sizeof(__m128i));
> >>>           dst = RTE_PTR_ADD(dst, sizeof(__m128i));
> >>>       }
> >>> }
> >>>
> >>> /**
> >>>    * @warning
> >>>    * @b EXPERIMENTAL: this API may change without prior notice.
> >>>    *
> >>>    * Copy data in blocks of 4 byte from aligned non-temporal source
> >>>    * to aligned non-temporal destination.
> >>>    *
> >>>    * @param dst
> >>>    *   Pointer to the non-temporal destination of the data.
> >>>    *   Must be 4 byte aligned.
> >>>    * @param src
> >>>    *   Pointer to the non-temporal source data.
> >>>    *   Must be 4 byte aligned.
> >>>    * @param len
> >>>    *   Number of bytes to copy.
> >>>    *   Must be divisible by 4.
> >>>    */
> >>> __rte_experimental
> >>> static __rte_always_inline
> >>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> >> __access__(read_only, 2, 3)))
> >>> void rte_memcpy_nt4a(void * __rte_restrict dst, const void *
> >> __rte_restrict src, size_t len)
> >>> {
> >>>       int32_t             buf[sizeof(__m128i) / sizeof(int32_t)]
> >> __rte_aligned(sizeof(__m128i));
> >>>       /** Address of source data, rounded down to achieve
> alignment.
> >> */
> >>>       const void *        srca = RTE_PTR_ALIGN_FLOOR(src,
> >> sizeof(__m128i));
> >>>       /** Address of end of source data, rounded down to achieve
> >> alignment. */
> >>>       const void * const  srcenda =
> >> RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
> >>>       const int           offset =  RTE_PTR_DIFF(src, srca) /
> >> sizeof(int32_t);
> >>>       register __m128i    xmm0;
> >>>
> >>>       RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
> >>>       RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
> >>>       RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
> >>>
> >>>       if (unlikely(len == 0)) return;
> >>>
> >>>       /* Copy first, non-__m128i aligned, part of source data. */
> >>>       if (offset) {
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm0 = _mm_stream_load_si128(srca);
> >>>           _mm_store_si128((void *)buf, xmm0);
> >>> #pragma GCC diagnostic pop
> >>>           switch (offset) {
> >>>               case 1:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[1]);
> >>>                   if (unlikely(len == 1 * sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >> sizeof(int32_t)), buf[2]);
> >>>                   if (unlikely(len == 2 * sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> >> sizeof(int32_t)), buf[3]);
> >>>                   break;
> >>>               case 2:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[2]);
> >>>                   if (unlikely(len == 1 * sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >> sizeof(int32_t)), buf[3]);
> >>>                   break;
> >>>               case 3:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[3]);
> >>>                   break;
> >>>           }
> >>>           srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
> >>>           dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
> >>>       }
> >>>
> >>>       /* Copy middle, __m128i aligned, part of source data. */
> >>>       while (srca != srcenda) {
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm0 = _mm_stream_load_si128(srca);
> >>> #pragma GCC diagnostic pop
> >>>           _mm_store_si128((void *)buf, xmm0);
> >>>           _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)),
> >> buf[0]);
> >>>           _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)),
> >> buf[1]);
> >>>           _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)),
> >> buf[2]);
> >>>           _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)),
> >> buf[3]);
> >>>           srca = RTE_PTR_ADD(srca, sizeof(__m128i));
> >>>           dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
> >>>       }
> >>>
> >>>       /* Copy last, non-__m128i aligned, part of source data. */
> >>>       if (RTE_PTR_DIFF(srca, src) != 4) {
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm0 = _mm_stream_load_si128(srca);
> >>>           _mm_store_si128((void *)buf, xmm0);
> >>> #pragma GCC diagnostic pop
> >>>           switch (offset) {
> >>>               case 1:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[0]);
> >>>                   break;
> >>>               case 2:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[0]);
> >>>                   if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> >> sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >> sizeof(int32_t)), buf[1]);
> >>>                   break;
> >>>               case 3:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[0]);
> >>>                   if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> >> sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >> sizeof(int32_t)), buf[1]);
> >>>                   if (unlikely(RTE_PTR_DIFF(srca, src) == 2 *
> >> sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> >> sizeof(int32_t)), buf[2]);
> >>>                   break;
> >>>           }
> >>>       }
> >>> }
> >>>
> >>
> >
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-24 13:35     ` Konstantin Ananyev
  2022-07-24 22:18       ` Morten Brørup
@ 2022-07-25  1:17       ` Honnappa Nagarahalli
  2022-07-27 10:26         ` Morten Brørup
  1 sibling, 1 reply; 57+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-25  1:17 UTC (permalink / raw)
  To: Konstantin Ananyev, Morten Brørup, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach, nd, nd

<snip>
> 
> 22/07/2022 11:44, Morten Brørup пишет:
> >> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> >> Sent: Friday, 22 July 2022 01.20
> >>
> >> Hi Morten,
> >>
> >>> This RFC proposes a set of functions optimized for non-temporal
> >> memory copy.
> >>>
> >>> At this stage, I am asking for feedback on the concept.
> >>>
> >>> Applications sometimes data to another memory location, which is
> >>> only
> >> used
> >>> much later.
> >>> In this case, it is inefficient to pollute the data cache with the
> >> copied
> >>> data.
> >>>
> >>> An example use case (originating from a real life application):
> >>> Copying filtered packets, or the first part of them, into a capture
> >> buffer
> >>> for offline analysis.
> >>>
> >>> The purpose of these functions is to achieve a performance gain by
> >> not
> >>> polluting the cache when copying data.
> >>> Although the throughput may be improved by further optimization, I
> >>> do
> >> not
> >>> consider througput optimization relevant initially.
> >>>
> >>> The x86 non-temporal load instructions have 16 byte alignment
> >>> requirements [1], while ARM non-temporal load instructions are
> >> available with
> >>> 4 byte alignment requirements [2].
> >>> Both platforms offer non-temporal store instructions with 4 byte
> >> alignment
> >>> requirements.
> >>>
> >>> In addition to the primary function without any alignment
> >> requirements, we
> >>> also provide functions for respectivly 16 and 4 byte aligned access
> >> for
> >>> performance purposes.
> >>>
> >>> The function names resemble standard C library function names, but
> >> their
> >>> signatures are intentionally different. No need to drag legacy into
> >> it.
> >>>
> >>> NB: Don't comment on spaces for indentation; a patch will follow
> >>> DPDK
> >> coding
> >>> style and use TAB.
> >>
> >>
> >> I think there were discussions in other direction - remove
> >> rte_memcpy() completely and use memcpy() instead...
> >
> > Yes, the highly optimized rte_memcpy() implementation of memcpy() has
> become obsolete, now that modern compilers provide an efficient memcpy()
> implementation.
> >
> > It's an excellent reference, because we should learn from it, and avoid
> introducing similar mistakes with non-temporal memcpy.
> >
> >> But if we have a good use case for that, then I am positive in
> >> principle.
> >
> > The standard C library doesn't offer non-temporal memcpy(), so we need to
> implement it ourselves.
> >
> >> Though I think we need a clear use-case within dpdk for it to
> >> demonstrate perfomance gain.
> >
> > The performance gain is to avoid polluting the data cache. DPDK example
> applications, like l3fwd, are probably too primitive to measure any benefit in this
> regard.
> >
> >> Probably copying packets within pdump lib, or examples/dma. or ...
> >
> > Good point - the new functions should be used somewhere within DPDK. For
> this purpose, I will look into modifying rte_pktmbuf_copy(), which is used by
> pdump_copy(), to use non-temporal copying of the packet data.
> >
> >> Another thought - do we really need a separate inline function for
> >> each flavour?
> >> Might be just one non-inline rte_memcpy_nt(dst, src, size, flags),
> >> where flags could be combination of NT_SRC, NT_DST, and keep
> >> alignment detection/decisions to particular implementation?
> >
> > Thank you for the feedback, Konstantin.
> >
> > My answer to this suggestion gets a little longwinded...
> >
> > Looking at the DPDK pcapng library, it copies a 4 byte aligned metadata
> structure sized 28 byte. So it can do with 4 byte aligned functions.
> >
> > Our application can capture packets starting at the IP header, which is offset
> by 14 byte (Ethernet header size) from the packet buffer, so it requires 2 byte
> alignment. And thus, requiring 4 byte alignment is not acceptable.
> >
> > Our application uses 16 byte alignment in the capture buffer area, and can
> benefit from 16 byte aligned functions. Furthermore, x86 processors require 16
> byte alignment for non-temporal load instructions, so I think a 16 byte aligned
> non-temporal memcpy function should be offered.
> 
> 
> Yes, x86 needs 16B alignment for NT load/stores But that's supposed to be arch
> specific limitation, that we probably want to hide, no?
> Inside the function can check alignment of both src and dst and decide should it
> use NT load/store instructions or just do normal copy.
IMO, the normal copy should not be done by this API under any conditions. Why not let the application call memcpy/rte_memcpy when the NT copy is not applicable? It helps the programmer to understand and debug the issues much more easily.

> 
> 
> > While working on these funtions, I experimented with an rte_memcpy_nt()
> taking flags, which is also my personal preference, but haven't succeed yet.
> Especially when copying a 16 byte aligned structure of only 16 byte, the
> overhead of the function call + comparing the flags + the copy loop overhead is
> significant, compared to inline code consisting of only one pair of "movntdqa
> (%rsi),%xmm0; movntdq %xmm0,(%rdi)" instructions.
> >
> > Remember that a non-inlined rte_memcpy_nt() will be called with very
> varying size, due to the typical mix of small and big packets, so branch prediction
> will not help.
> >
> > This RFC does not yet show the rte_memcpy_nt() function handling unaligned
> load/store, but it is more complex than the aligned functions. So I think the
> aligned variants are warranted - for performance reasons.
> >
> > Some of the need for exposing individual functions for different alignment
> stems from the compiler being unable to determine the alignment of the source
> and destination pointers at build time. So we need to help the compiler with this
> at build time, and thus the need for inlining the function. If we expose a bunch
> of small inline functions or a big inline function with flags seems to be a matter
> of taste.
> >
> > Thinking about it, you are probably right that exposing a single function with
> flags is better for documentation purposes and easier for other architectures to
> implement. But it still needs to be inline, for the reasons described above.
> 
> 
> Ok, my initial thought was that main use-case for it would be copying of big
> chunks of data, but from your description it might not be the case.
> Yes, for just 16/32B copy function call overhead might be way too high...
> As another alternative - would memcpy_nt_bulk() help somehow?
> It can do copying for the several src/dst pairs at once and that might help to
> amortize cost of function call.
> 
> 
> >
> >>
> >>
> >>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
> >> guide/index.html#text=_mm_stream_load
> >>> [2] https://developer.arm.com/documentation/100076/0100/A64-
> >> Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-
> >> and-FP-
> >>>
> >>> V2:
> >>> - Only copy from non-temporal source to non-temporal destination.
> >>>     I.e. remove the two variants with only source and/or destination
> >> being
> >>>     non-temporal.
> >>> - Do not require alignment.
> >>>     Instead, offer additional 4 and 16 byte aligned functions for
> >> performance
> >>>     purposes.
> >>> - Implemented two of the functions for x86.
> >>> - Remove memset function.
> >>>
> >>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >>> ---
> >>>
> >>> /**
> >>>    * @warning
> >>>    * @b EXPERIMENTAL: this API may change without prior notice.
> >>>    *
> >>>    * Copy data from non-temporal source to non-temporal destination.
> >>>    *
> >>>    * @param dst
> >>>    *   Pointer to the non-temporal destination of the data.
> >>>    *   Should be 4 byte aligned, for optimal performance.
> >>>    * @param src
> >>>    *   Pointer to the non-temporal source data.
> >>>    *   No alignment requirements.
> >>>    * @param len
> >>>    *   Number of bytes to copy.
> >>>    *   Should be be divisible by 4, for optimal performance.
> >>>    */
> >>> __rte_experimental
> >>> static __rte_always_inline
> >>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> >> __access__(read_only, 2, 3)))
> >>> void rte_memcpy_nt(void * __rte_restrict dst, const void *
> >> __rte_restrict src, size_t len)
> >>> /* Implementation T.B.D. */
> >>>
> >>> /**
> >>>    * @warning
> >>>    * @b EXPERIMENTAL: this API may change without prior notice.
> >>>    *
> >>>    * Copy data in blocks of 16 byte from aligned non-temporal source
> >>>    * to aligned non-temporal destination.
> >>>    *
> >>>    * @param dst
> >>>    *   Pointer to the non-temporal destination of the data.
> >>>    *   Must be 16 byte aligned.
> >>>    * @param src
> >>>    *   Pointer to the non-temporal source data.
> >>>    *   Must be 16 byte aligned.
> >>>    * @param len
> >>>    *   Number of bytes to copy.
> >>>    *   Must be divisible by 16.
> >>>    */
> >>> __rte_experimental
> >>> static __rte_always_inline
> >>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> >> __access__(read_only, 2, 3)))
> >>> void rte_memcpy_nt16a(void * __rte_restrict dst, const void *
> >> __rte_restrict src, size_t len)
> >>> {
> >>>       const void * const  end = RTE_PTR_ADD(src, len);
> >>>
> >>>       RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
> >>>       RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
> >>>       RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
> >>>
> >>>       /* Copy large portion of data. */
> >>>       while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
> >>>           register __m128i    xmm0, xmm1, xmm2, xmm3;
> >>>
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 *
> >> sizeof(__m128i)));
> >>>           xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 *
> >> sizeof(__m128i)));
> >>>           xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 *
> >> sizeof(__m128i)));
> >>>           xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 *
> >> sizeof(__m128i)));
> >>> #pragma GCC diagnostic pop
> >>>           _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)),
> >> xmm0);
> >>>           _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)),
> >> xmm1);
> >>>           _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)),
> >> xmm2);
> >>>           _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)),
> >> xmm3);
> >>>           src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
> >>>           dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
> >>>       }
> >>>
> >>>       /* Copy remaining data. */
> >>>       while (src != end) {
> >>>           register __m128i    xmm;
> >>>
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm = _mm_stream_load_si128(src); #pragma GCC diagnostic
> >>> pop
> >>>           _mm_stream_si128(dst, xmm);
> >>>           src = RTE_PTR_ADD(src, sizeof(__m128i));
> >>>           dst = RTE_PTR_ADD(dst, sizeof(__m128i));
> >>>       }
> >>> }
> >>>
> >>> /**
> >>>    * @warning
> >>>    * @b EXPERIMENTAL: this API may change without prior notice.
> >>>    *
> >>>    * Copy data in blocks of 4 byte from aligned non-temporal source
> >>>    * to aligned non-temporal destination.
> >>>    *
> >>>    * @param dst
> >>>    *   Pointer to the non-temporal destination of the data.
> >>>    *   Must be 4 byte aligned.
> >>>    * @param src
> >>>    *   Pointer to the non-temporal source data.
> >>>    *   Must be 4 byte aligned.
> >>>    * @param len
> >>>    *   Number of bytes to copy.
> >>>    *   Must be divisible by 4.
> >>>    */
> >>> __rte_experimental
> >>> static __rte_always_inline
> >>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> >> __access__(read_only, 2, 3)))
> >>> void rte_memcpy_nt4a(void * __rte_restrict dst, const void *
> >> __rte_restrict src, size_t len)
> >>> {
> >>>       int32_t             buf[sizeof(__m128i) / sizeof(int32_t)]
> >> __rte_aligned(sizeof(__m128i));
> >>>       /** Address of source data, rounded down to achieve alignment.
> >> */
> >>>       const void *        srca = RTE_PTR_ALIGN_FLOOR(src,
> >> sizeof(__m128i));
> >>>       /** Address of end of source data, rounded down to achieve
> >> alignment. */
> >>>       const void * const  srcenda =
> >> RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
> >>>       const int           offset =  RTE_PTR_DIFF(src, srca) /
> >> sizeof(int32_t);
> >>>       register __m128i    xmm0;
> >>>
> >>>       RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
> >>>       RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
> >>>       RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
> >>>
> >>>       if (unlikely(len == 0)) return;
> >>>
> >>>       /* Copy first, non-__m128i aligned, part of source data. */
> >>>       if (offset) {
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm0 = _mm_stream_load_si128(srca);
> >>>           _mm_store_si128((void *)buf, xmm0); #pragma GCC diagnostic
> >>> pop
> >>>           switch (offset) {
> >>>               case 1:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[1]);
> >>>                   if (unlikely(len == 1 * sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >> sizeof(int32_t)), buf[2]);
> >>>                   if (unlikely(len == 2 * sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> >> sizeof(int32_t)), buf[3]);
> >>>                   break;
> >>>               case 2:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[2]);
> >>>                   if (unlikely(len == 1 * sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >> sizeof(int32_t)), buf[3]);
> >>>                   break;
> >>>               case 3:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[3]);
> >>>                   break;
> >>>           }
> >>>           srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
> >>>           dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
> >>>       }
> >>>
> >>>       /* Copy middle, __m128i aligned, part of source data. */
> >>>       while (srca != srcenda) {
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm0 = _mm_stream_load_si128(srca); #pragma GCC diagnostic
> >>> pop
> >>>           _mm_store_si128((void *)buf, xmm0);
> >>>           _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)),
> >> buf[0]);
> >>>           _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)),
> >> buf[1]);
> >>>           _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)),
> >> buf[2]);
> >>>           _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)),
> >> buf[3]);
> >>>           srca = RTE_PTR_ADD(srca, sizeof(__m128i));
> >>>           dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
> >>>       }
> >>>
> >>>       /* Copy last, non-__m128i aligned, part of source data. */
> >>>       if (RTE_PTR_DIFF(srca, src) != 4) {
> >>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
> >> pointer as parameter. */
> >>> #pragma GCC diagnostic push
> >>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>           xmm0 = _mm_stream_load_si128(srca);
> >>>           _mm_store_si128((void *)buf, xmm0); #pragma GCC diagnostic
> >>> pop
> >>>           switch (offset) {
> >>>               case 1:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[0]);
> >>>                   break;
> >>>               case 2:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[0]);
> >>>                   if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> >> sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >> sizeof(int32_t)), buf[1]);
> >>>                   break;
> >>>               case 3:
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >> sizeof(int32_t)), buf[0]);
> >>>                   if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> >> sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >> sizeof(int32_t)), buf[1]);
> >>>                   if (unlikely(RTE_PTR_DIFF(srca, src) == 2 *
> >> sizeof(int32_t))) return;
> >>>                   _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> >> sizeof(int32_t)), buf[2]);
> >>>                   break;
> >>>           }
> >>>       }
> >>> }
> >>>
> >>
> >


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-25  1:17       ` Honnappa Nagarahalli
@ 2022-07-27 10:26         ` Morten Brørup
  2022-07-27 17:37           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-07-27 10:26 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach, nd, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Monday, 25 July 2022 03.18
> 

[...]

> > Yes, x86 needs 16B alignment for NT load/stores But that's supposed
> to be arch
> > specific limitation, that we probably want to hide, no?

Correct. However, optional hints for optimization purposes will be available. And it is up to the architecture specific implementation to make the best use of these hints, or just ignore them.

> > Inside the function can check alignment of both src and dst and
> decide should it
> > use NT load/store instructions or just do normal copy.
> IMO, the normal copy should not be done by this API under any
> conditions. Why not let the application call memcpy/rte_memcpy when the
> NT copy is not applicable? It helps the programmer to understand and
> debug the issues much easier.

Yes, the programmer must choose between normal memcpy() and non-temporal rte_memcpy_nt(). I am offering new functions, not modifying memcpy() or rte_memcpy().

And rte_memcpy_nt() will silently fall back to normal memcpy() if non-temporal copying is unavailable, e.g. on POWER and RISC-V architectures, which don't have NT load/store instructions.
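
For architectures without NT load/store instructions, the generic fallback could be as
simple as the sketch below, placed in the generic header and overridden by the arch
specific headers. The guard macro name is hypothetical; this only illustrates the
intended structure and is not part of the RFC.

#ifndef RTE_MEMCPY_NT_ARCH_DEFINED /* hypothetical guard, set by arch specific headers */

#include <string.h>

/* Generic fallback: no NT instructions available, so just do a normal copy.
 * The flags are accepted for API compatibility, but ignored here. */
static __rte_always_inline void
rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, const uint64_t flags)
{
    (void)flags;
    memcpy(dst, src, len);
}

#endif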


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-27 10:26         ` Morten Brørup
@ 2022-07-27 17:37           ` Honnappa Nagarahalli
  2022-07-27 18:49             ` Morten Brørup
  0 siblings, 1 reply; 57+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-27 17:37 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach, nd, nd

<snip>

> 
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Monday, 25 July 2022 03.18
> >
> 
> [...]
> 
> > > Yes, x86 needs 16B alignment for NT load/stores But that's supposed
> > to be arch
> > > specific limitation, that we probably want to hide, no?
> 
> Correct. However, optional hints for optimization purposes will be available.
> And it is up to the architecture specific implementation to make the best use
> of these hints, or just ignore them.
> 
> > > Inside the function can check alignment of both src and dst and
> > decide should it
> > > use NT load/store instructions or just do normal copy.
> > IMO, the normal copy should not be done by this API under any
> > conditions. Why not let the application call memcpy/rte_memcpy when
> > the NT copy is not applicable? It helps the programmer to understand
> > and debug the issues much easier.
> 
> Yes, the programmer must choose between normal memcpy() and non-
> temporal rte_memcpy_nt(). I am offering new functions, not modifying
> memcpy() or rte_memcpy().
> 
> And rte_memcpy_nt() will silently fall back to normal memcpy() if non-
> temporal copying is unavailable, e.g. on POWER and RISC-V architectures,
> which don't have NT load/store instructions.
I am talking about a scenario where the application is being ported between architectures. Not everyone knows about the capabilities of the architecture. It is better to indicate upfront (e.g. by compilation failures) that a certain feature is not supported on the target architecture, rather than the user having to discover it through painful debugging.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-27 17:37           ` Honnappa Nagarahalli
@ 2022-07-27 18:49             ` Morten Brørup
  2022-07-27 19:12               ` Stephen Hemminger
  2022-07-27 19:52               ` Honnappa Nagarahalli
  0 siblings, 2 replies; 57+ messages in thread
From: Morten Brørup @ 2022-07-27 18:49 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach, nd, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 27 July 2022 19.38
> 

[...]

> >
> > > > Yes, x86 needs 16B alignment for NT load/stores But that's
> supposed
> > > to be arch
> > > > specific limitation, that we probably want to hide, no?
> >
> > Correct. However, optional hints for optimization purposes will be
> available.
> > And it is up to the architecture specific implementation to make the
> best use
> > of these hints, or just ignore them.
> >
> > > > Inside the function can check alignment of both src and dst and
> > > decide should it
> > > > use NT load/store instructions or just do normal copy.
> > > IMO, the normal copy should not be done by this API under any
> > > conditions. Why not let the application call memcpy/rte_memcpy when
> > > the NT copy is not applicable? It helps the programmer to
> understand
> > > and debug the issues much easier.
> >
> > Yes, the programmer must choose between normal memcpy() and non-
> > temporal rte_memcpy_nt(). I am offering new functions, not modifying
> > memcpy() or rte_memcpy().
> >
> > And rte_memcpy_nt() will silently fall back to normal memcpy() if
> non-
> > temporal copying is unavailable, e.g. on POWER and RISC-V
> architectures,
> > which don't have NT load/store instructions.
> I am talking about a scenario where the application is being ported
> between architectures. Not everyone knows about the capabilities of the
> architecture. It is better to indicate upfront (ex: compilation
> failures) that a certain feature is not supported on the target
> architecture rather than the user having to discover through painful
> debugging.

I'm considering rte_memcpy_nt() a performance-optimized variant of memcpy(), where the performance gain is less cache pollution. Thus, silent fallback to memcpy() should suffice.

Other architecture differences also affect DPDK performance; the inability to perform non-temporal load/store is just one more addition to the (undocumented) list.

Failing at build time if NT load/store is unavailable on the architecture would prevent the function from being used by other DPDK libraries, e.g. by the rte_pktmbuf_copy() function used by the pdump library.

I don't oppose your idea, I just don't have any idea how to reasonably implement it. So I'm trying to defend why it is not important.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-27 18:49             ` Morten Brørup
@ 2022-07-27 19:12               ` Stephen Hemminger
  2022-07-28  9:00                 ` Morten Brørup
  2022-07-27 19:52               ` Honnappa Nagarahalli
  1 sibling, 1 reply; 57+ messages in thread
From: Stephen Hemminger @ 2022-07-27 19:12 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Honnappa Nagarahalli, Konstantin Ananyev, dev, Bruce Richardson,
	Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach,
	nd

On Wed, 27 Jul 2022 20:49:59 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> I'm considering rte_memcpy_nt() a performance optimized variant of memcpy(), where the performance gain is less cache pollution. Thus, silent fallback to memcpy() should suffice.


Have you looked at the existing glibc code? Last time I checked, it was already using
non-temporal instructions on several architectures.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-27 18:49             ` Morten Brørup
  2022-07-27 19:12               ` Stephen Hemminger
@ 2022-07-27 19:52               ` Honnappa Nagarahalli
  2022-07-27 22:02                 ` Stanisław Kardach
  1 sibling, 1 reply; 57+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-27 19:52 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach, nd, nd

<snip>
> 
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Wednesday, 27 July 2022 19.38
> >
> 
> [...]
> 
> > >
> > > > > Yes, x86 needs 16B alignment for NT load/stores But that's
> > supposed
> > > > to be arch
> > > > > specific limitation, that we probably want to hide, no?
> > >
> > > Correct. However, optional hints for optimization purposes will be
> > available.
> > > And it is up to the architecture specific implementation to make the
> > best use
> > > of these hints, or just ignore them.
> > >
> > > > > Inside the function can check alignment of both src and dst and
> > > > decide should it
> > > > > use NT load/store instructions or just do normal copy.
> > > > IMO, the normal copy should not be done by this API under any
> > > > conditions. Why not let the application call memcpy/rte_memcpy
> > > > when the NT copy is not applicable? It helps the programmer to
> > understand
> > > > and debug the issues much easier.
> > >
> > > Yes, the programmer must choose between normal memcpy() and non-
> > > temporal rte_memcpy_nt(). I am offering new functions, not modifying
> > > memcpy() or rte_memcpy().
> > >
> > > And rte_memcpy_nt() will silently fall back to normal memcpy() if
> > non-
> > > temporal copying is unavailable, e.g. on POWER and RISC-V
> > architectures,
> > > which don't have NT load/store instructions.
> > I am talking about a scenario where the application is being ported
> > between architectures. Not everyone knows about the capabilities of
> > the architecture. It is better to indicate upfront (ex: compilation
> > failures) that a certain feature is not supported on the target
> > architecture rather than the user having to discover through painful
> > debugging.
> 
> I'm considering rte_memcpy_nt() a performance optimized variant of
> memcpy(), where the performance gain is less cache pollution. Thus, silent
> fallback to memcpy() should suffice.
> 
> Other architecture differences also affect DPDK performance; the inability to
> perform non-temporal load/store just one more to the (undocumented) list.
> 
> Failing at build time if NT load/store is unavailable by the architecture would
> prevent the function from being used by other DPDK libraries, e.g. by the
> rte_pktmbuf_copy() function used by the pdump library.
The other libraries in DPDK need to provide NT versions, as the libraries need to cater for non-NT use cases as well. I.e. we cannot hide an NT copy under the rte_pktmbuf_copy() API; we need to have rte_pktmbuf_copy_nt()

> 
> I don't oppose to your idea, I just don't have any idea how to reasonably
> implement it. So I'm trying to defend why it is not important.
I am suggesting that the applications could implement #ifdef depending on the architecture.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-27 19:52               ` Honnappa Nagarahalli
@ 2022-07-27 22:02                 ` Stanisław Kardach
  2022-07-28 10:51                   ` Morten Brørup
  0 siblings, 1 reply; 57+ messages in thread
From: Stanisław Kardach @ 2022-07-27 22:02 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson,
	Jan Viktorin, Ruifeng Wang, David Christensen, nd

On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli, <
Honnappa.Nagarahalli@arm.com> wrote:

> <snip>
> >
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Wednesday, 27 July 2022 19.38
> > >
> >
> > [...]
> >
> > > >
> > > > > > Yes, x86 needs 16B alignment for NT load/stores But that's
> > > supposed
> > > > > to be arch
> > > > > > specific limitation, that we probably want to hide, no?
> > > >
> > > > Correct. However, optional hints for optimization purposes will be
> > > available.
> > > > And it is up to the architecture specific implementation to make the
> > > best use
> > > > of these hints, or just ignore them.
> > > >
> > > > > > Inside the function can check alignment of both src and dst and
> > > > > decide should it
> > > > > > use NT load/store instructions or just do normal copy.
> > > > > IMO, the normal copy should not be done by this API under any
> > > > > conditions. Why not let the application call memcpy/rte_memcpy
> > > > > when the NT copy is not applicable? It helps the programmer to
> > > understand
> > > > > and debug the issues much easier.
> > > >
> > > > Yes, the programmer must choose between normal memcpy() and non-
> > > > temporal rte_memcpy_nt(). I am offering new functions, not modifying
> > > > memcpy() or rte_memcpy().
> > > >
> > > > And rte_memcpy_nt() will silently fall back to normal memcpy() if
> > > non-
> > > > temporal copying is unavailable, e.g. on POWER and RISC-V
> > > architectures,
> > > > which don't have NT load/store instructions.
> > > I am talking about a scenario where the application is being ported
> > > between architectures. Not everyone knows about the capabilities of
> > > the architecture. It is better to indicate upfront (ex: compilation
> > > failures) that a certain feature is not supported on the target
> > > architecture rather than the user having to discover through painful
> > > debugging.
> >
> > I'm considering rte_memcpy_nt() a performance optimized variant of
> > memcpy(), where the performance gain is less cache pollution. Thus,
> silent
> > fallback to memcpy() should suffice.
> >
> > Other architecture differences also affect DPDK performance; the
> inability to
> > perform non-temporal load/store just one more to the (undocumented) list.
> >
> > Failing at build time if NT load/store is unavailable by the
> architecture would
> > prevent the function from being used by other DPDK libraries, e.g. by the
> > rte_pktmbuf_copy() function used by the pdump library.
> The other libraries in DPDK need to provide NT versions as the libraries
> need to cater for not-NT use cases as well. i.e. we cannot hide a NT copy
> under rte_pktmbuf_copy() API, we need to have rte_pktmbuf_copy_nt()
>
> >
> > I don't oppose to your idea, I just don't have any idea how to reasonably
> > implement it. So I'm trying to defend why it is not important.

> I am suggesting that the applications could implement #ifdef depending on
> the architecture.
>
I assume that it would be a pre-processor flag defined (or not) on the DPDK
side, with the application doing #ifdef based on it?

Another way to achieve this would be to use the #warning directive (see [1])
inside DPDK when the generic fallback is taken.
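
A minimal sketch of that approach, assuming a hypothetical generic fallback header (the guard shown here is illustrative, not an existing DPDK symbol):

/* Generic fallback: this architecture has no NT load/store instructions. */
#if !defined(RTE_ARCH_X86) && !defined(RTE_ARCH_ARM64)
#warning "rte_memcpy_nt() falls back to plain memcpy() on this architecture"
#endif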

Also, isn't the argument about the memcpy_nt capability query a more general
one, i.e. how should an application query DPDK's capabilities at run time or
at build time?

[1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html

>

[-- Attachment #2: Type: text/html, Size: 5101 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-27 19:12               ` Stephen Hemminger
@ 2022-07-28  9:00                 ` Morten Brørup
  0 siblings, 0 replies; 57+ messages in thread
From: Morten Brørup @ 2022-07-28  9:00 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Honnappa Nagarahalli, Konstantin Ananyev, dev, Bruce Richardson,
	Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach,
	nd

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 27 July 2022 21.12
> 
> On Wed, 27 Jul 2022 20:49:59 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > I'm considering rte_memcpy_nt() a performance optimized variant of
> memcpy(), where the performance gain is less cache pollution. Thus,
> silent fallback to memcpy() should suffice.
> 
> 
> Have you looked at existing Glibc code? last time I checked it was
> already doing
> non-temporal instructions on several architectures.

Good idea!

I found the glibc implementation of memcpy() [1], and it only uses non-temporal stores, not non-temporal loads; and only for large lengths.

BTW, this also reveals that memcpy() sometimes behaves differently than rte_memcpy(), which never uses non-temporal store.

[1] https://elixir.bootlin.com/glibc/latest/source/sysdeps/x86_64/multiarch/memcpy-ssse3.S


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-27 22:02                 ` Stanisław Kardach
@ 2022-07-28 10:51                   ` Morten Brørup
  2022-07-29  9:21                     ` Konstantin Ananyev
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-07-28 10:51 UTC (permalink / raw)
  To: Stanisław Kardach, Honnappa Nagarahalli
  Cc: Konstantin Ananyev, dev, Bruce Richardson, Jan Viktorin,
	Ruifeng Wang, David Christensen, nd

From: Stanisław Kardach [mailto:kda@semihalf.com] 
Sent: Thursday, 28 July 2022 00.02
> On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli, <Honnappa.Nagarahalli@arm.com> wrote:
>
> > > > > > Yes, x86 needs 16B alignment for NT load/stores But that's
> > > supposed
> > > > > to be arch
> > > > > > specific limitation, that we probably want to hide, no?
> > > >
> > > > Correct. However, optional hints for optimization purposes will be
> > > available.
> > > > And it is up to the architecture specific implementation to make the
> > > best use
> > > > of these hints, or just ignore them.
> > > >
> > > > > > Inside the function can check alignment of both src and dst and
> > > > > decide should it
> > > > > > use NT load/store instructions or just do normal copy.
> > > > > IMO, the normal copy should not be done by this API under any
> > > > > conditions. Why not let the application call memcpy/rte_memcpy
> > > > > when the NT copy is not applicable? It helps the programmer to
> > > understand
> > > > > and debug the issues much easier.
> > > >
> > > > Yes, the programmer must choose between normal memcpy() and non-
> > > > temporal rte_memcpy_nt(). I am offering new functions, not modifying
> > > > memcpy() or rte_memcpy().
> > > >
> > > > And rte_memcpy_nt() will silently fall back to normal memcpy() if
> > > non-
> > > > temporal copying is unavailable, e.g. on POWER and RISC-V
> > > architectures,
> > > > which don't have NT load/store instructions.
> > > I am talking about a scenario where the application is being ported
> > > between architectures. Not everyone knows about the capabilities of
> > > the architecture. It is better to indicate upfront (ex: compilation
> > > failures) that a certain feature is not supported on the target
> > > architecture rather than the user having to discover through painful
> > > debugging.
> > 
> > I'm considering rte_memcpy_nt() a performance optimized variant of
> > memcpy(), where the performance gain is less cache pollution. Thus, silent
> > fallback to memcpy() should suffice.
> > 
> > Other architecture differences also affect DPDK performance; the inability to
> > perform non-temporal load/store just one more to the (undocumented) list.
> > 
> > Failing at build time if NT load/store is unavailable by the architecture would
> > prevent the function from being used by other DPDK libraries, e.g. by the
> > rte_pktmbuf_copy() function used by the pdump library.
> The other libraries in DPDK need to provide NT versions as the libraries need to cater for not-NT use cases as well. i.e. we cannot hide a NT copy under rte_pktmbuf_copy() API, we need to have rte_pktmbuf_copy_nt()

Yes, it was my intention to provide rte_pktmbuf_copy_nt() as a new function. Some uses of rte_pktmbuf_copy() may benefit from having the copied data in cache.

But there is a ripple effect:

It is also my intention to improve the pdump and pcapng libraries by using rte_pktmbuf_copy_nt() instead of rte_pktmbuf_copy(). These would normally benefit from not polluting the cache.

So the underlying rte_memcpy_nt() function needs a fallback if the architecture doesn't support non-temporal memory copy, once the pdump and pcapng libraries depend on it.

Alternatively, if rte_memcpy_nt() has no fallback to standard memcpy(), so that an application fails to build when the developer tries to use rte_memcpy_nt() on an unsupported architecture, we would have to modify e.g. pdump_copy() like this:

+ #ifdef RTE_CPUFLAG_xxx
  p = rte_pktmbuf_copy_nt(pkts[i], mp, 0, cbs->snaplen);
+ #else
  p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
+ #endif

Personally, I prefer the fallback inside rte_memcpy_nt(), rather than having to check for it everywhere.

The developer using the pdump library will not know whether the fallback is inside rte_memcpy_nt() or outside, using #ifdef; either way, it is hidden inside pdump_copy().
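
To illustrate the preferred approach, a hedged sketch of such a generic fallback (the arch-specific helper name is hypothetical; only the structure matters):

static __rte_always_inline
void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
{
#if defined(RTE_ARCH_X86)
	rte_memcpy_nt_x86(dst, src, len);	/* hypothetical arch-specific implementation */
#else
	memcpy(dst, src, len);			/* silent best-effort fallback */
#endif
}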

> 
> > 
> > I don't oppose to your idea, I just don't have any idea how to reasonably
> > implement it. So I'm trying to defend why it is not important.
> I am suggesting that the applications could implement #ifdef depending on the architecture.
> I assume that it would be a pre-processor flag defined (or not) on DPDK side and application doing #ifdef based on it?
> 
> Another way to achieve this would be to use #warning directive (see [1]) inside DPDK when the generic fallback is taken.
> 
> Also isn't the argument on memcpy_nt capability query not a more general one, that is how would/should application query DPDK's capabilities when run or compiled?

Good point! You just solved this part of the puzzle, Stanislaw:

The ability to perform non-temporal memory load/store is a CPU feature.

Applications that need to know if non-temporal memory access is available should check for the appropriate CPU feature flag, e.g. RTE_CPUFLAG_SSE4_1 on the x86 architecture. This works both at runtime and at compile time.
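
For example, the runtime side of such a check could look roughly like this, using the existing rte_cpu_get_flag_enabled() API (the function and variable names around it are only illustrative):

#include <rte_cpuflags.h>

static int use_nt_copy;

void capture_init(void)
{
	/* SSE4.1 provides the NT load instruction used by the x86 implementation. */
	use_nt_copy = rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1);
}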

> 
> [1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-28 10:51                   ` Morten Brørup
@ 2022-07-29  9:21                     ` Konstantin Ananyev
  0 siblings, 0 replies; 57+ messages in thread
From: Konstantin Ananyev @ 2022-07-29  9:21 UTC (permalink / raw)
  To: Morten Brørup, Stanisław Kardach, Honnappa Nagarahalli
  Cc: dev, Bruce Richardson, Jan Viktorin, Ruifeng Wang, David Christensen, nd

28/07/2022 11:51, Morten Brørup пишет:
> From: Stanisław Kardach [mailto:kda@semihalf.com]
> Sent: Thursday, 28 July 2022 00.02
>> On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli, <Honnappa.Nagarahalli@arm.com> wrote:
>>
>>>>>>> Yes, x86 needs 16B alignment for NT load/stores But that's
>>>> supposed
>>>>>> to be arch
>>>>>>> specific limitation, that we probably want to hide, no?
>>>>>
>>>>> Correct. However, optional hints for optimization purposes will be
>>>> available.
>>>>> And it is up to the architecture specific implementation to make the
>>>> best use
>>>>> of these hints, or just ignore them.
>>>>>
>>>>>>> Inside the function can check alignment of both src and dst and
>>>>>> decide should it
>>>>>>> use NT load/store instructions or just do normal copy.
>>>>>> IMO, the normal copy should not be done by this API under any
>>>>>> conditions. Why not let the application call memcpy/rte_memcpy
>>>>>> when the NT copy is not applicable? It helps the programmer to
>>>> understand
>>>>>> and debug the issues much easier.
>>>>>
>>>>> Yes, the programmer must choose between normal memcpy() and non-
>>>>> temporal rte_memcpy_nt(). I am offering new functions, not modifying
>>>>> memcpy() or rte_memcpy().
>>>>>
>>>>> And rte_memcpy_nt() will silently fall back to normal memcpy() if
>>>> non-
>>>>> temporal copying is unavailable, e.g. on POWER and RISC-V
>>>> architectures,
>>>>> which don't have NT load/store instructions.
>>>> I am talking about a scenario where the application is being ported
>>>> between architectures. Not everyone knows about the capabilities of
>>>> the architecture. It is better to indicate upfront (ex: compilation
>>>> failures) that a certain feature is not supported on the target
>>>> architecture rather than the user having to discover through painful
>>>> debugging.
>>>
>>> I'm considering rte_memcpy_nt() a performance optimized variant of
>>> memcpy(), where the performance gain is less cache pollution. Thus, silent
>>> fallback to memcpy() should suffice.
>>>
>>> Other architecture differences also affect DPDK performance; the inability to
>>> perform non-temporal load/store just one more to the (undocumented) list.
>>>
>>> Failing at build time if NT load/store is unavailable by the architecture would
>>> prevent the function from being used by other DPDK libraries, e.g. by the
>>> rte_pktmbuf_copy() function used by the pdump library.
>> The other libraries in DPDK need to provide NT versions as the libraries need to cater for not-NT use cases as well. i.e. we cannot hide a NT copy under rte_pktmbuf_copy() API, we need to have rte_pktmbuf_copy_nt()
> 
> Yes, it was my intention to provide rte_pktmbuf_copy_nt() as a new function. Some uses of rte_pktmbuf_copy() may benefit from having the copied data in cache.
> 
> But there is a ripple effect:
> 
> It is also my intention to improve the pdump and pcapng libraries by using rte_pktmbuf_copy_nt() instead of rte_pktmbuf_copy(). These would normally benefit from not polluting the cache.
> 
> So the underlying rte_memcpy_nt() function needs a fallback if the architecture doesn't support non-temporal memory copy, now that the pdump and pcapng libraries depend on it.
> 
> Alternatively, if rte_memcpy_nt() has no fallback to standard memcpy(), but an application fails to build if the application developer tries to use rte_memcpy_nt(), we would have to modify e.g. pdump_copy() like this:
> 
> + #ifdef RTE_CPUFLAG_xxx
>    p = rte_pktmbuf_copy_nt(pkts[i], mp, 0, cbs->snaplen);
> + #else
>    p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
> + #endif
> 
> Personally, I prefer the fallback inside rte_memcpy_nt(), rather than having to check for it everywhere.

+1 here.
If we are going to introduce rte_memcpy_nt(), I think it had better be a
'best effort' approach - if it can do NT, great; if not,
just fall back to a normal copy.

> 
> The developer using the pdump library will not know if the fallback is inside rte_memcpy_nt() or outside using #ifdef. It is still hidden inside pdump_copy().
> 
>>
>>>
>>> I don't oppose to your idea, I just don't have any idea how to reasonably
>>> implement it. So I'm trying to defend why it is not important.
>> I am suggesting that the applications could implement #ifdef depending on the architecture.
>> I assume that it would be a pre-processor flag defined (or not) on DPDK side and application doing #ifdef based on it?
>>
>> Another way to achieve this would be to use #warning directive (see [1]) inside DPDK when the generic fallback is taken.
>>
>> Also isn't the argument on memcpy_nt capability query not a more general one, that is how would/should application query DPDK's capabilities when run or compiled?
> 
> Good point! You just solved this part of the puzzle, Stanislaw:
> 
> The ability to perform non-temporal memory load/store is a CPU feature.
> 
> Applications that need to know if non-temporal memory access is available should check for the appropriate CPU feature flag, e.g. RTE_CPUFLAG_SSE4_1 on x86 architecture. This works both at runtime and at compile time.
> 
>>
>> [1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-24 22:18       ` Morten Brørup
@ 2022-07-29 10:00         ` Konstantin Ananyev
  2022-07-29 10:46           ` Morten Brørup
  0 siblings, 1 reply; 57+ messages in thread
From: Konstantin Ananyev @ 2022-07-29 10:00 UTC (permalink / raw)
  To: Morten Brørup, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

24/07/2022 23:18, Morten Brørup пишет:
>> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
>> Sent: Sunday, 24 July 2022 15.35
>>
>> 22/07/2022 11:44, Morten Brørup пишет:
>>>> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
>>>> Sent: Friday, 22 July 2022 01.20
>>>>
>>>> Hi Morten,
>>>>
>>>>> This RFC proposes a set of functions optimized for non-temporal
>>>> memory copy.
>>>>>
>>>>> At this stage, I am asking for feedback on the concept.
>>>>>
>>>>> Applications sometimes data to another memory location, which is
>> only
>>>> used
>>>>> much later.
>>>>> In this case, it is inefficient to pollute the data cache with the
>>>> copied
>>>>> data.
>>>>>
>>>>> An example use case (originating from a real life application):
>>>>> Copying filtered packets, or the first part of them, into a capture
>>>> buffer
>>>>> for offline analysis.
>>>>>
>>>>> The purpose of these functions is to achieve a performance gain by
>>>> not
>>>>> polluting the cache when copying data.
>>>>> Although the throughput may be improved by further optimization, I
>> do
>>>> not
>>>>> consider througput optimization relevant initially.
>>>>>
>>>>> The x86 non-temporal load instructions have 16 byte alignment
>>>>> requirements [1], while ARM non-temporal load instructions are
>>>> available with
>>>>> 4 byte alignment requirements [2].
>>>>> Both platforms offer non-temporal store instructions with 4 byte
>>>> alignment
>>>>> requirements.
>>>>>
>>>>> In addition to the primary function without any alignment
>>>> requirements, we
>>>>> also provide functions for respectivly 16 and 4 byte aligned access
>>>> for
>>>>> performance purposes.
>>>>>
>>>>> The function names resemble standard C library function names, but
>>>> their
>>>>> signatures are intentionally different. No need to drag legacy into
>>>> it.
>>>>>
>>>>> NB: Don't comment on spaces for indentation; a patch will follow
>> DPDK
>>>> coding
>>>>> style and use TAB.
>>>>
>>>>
>>>> I think there were discussions in other direction - remove
>> rte_memcpy()
>>>> completely and use memcpy() instead...
>>>
>>> Yes, the highly optimized rte_memcpy() implementation of memcpy() has
>> become obsolete, now that modern compilers provide an efficient
>> memcpy() implementation.
>>>
>>> It's an excellent reference, because we should learn from it, and
>> avoid introducing similar mistakes with non-temporal memcpy.
>>>
>>>> But if we have a good use case for that, then I am positive in
>>>> principle.
>>>
>>> The standard C library doesn't offer non-temporal memcpy(), so we
>> need to implement it ourselves.
>>>
>>>> Though I think we need a clear use-case within dpdk for it
>>>> to demonstrate perfomance gain.
>>>
>>> The performance gain is to avoid polluting the data cache. DPDK
>> example applications, like l3fwd, are probably too primitive to measure
>> any benefit in this regard.
>>>
>>>> Probably copying packets within pdump lib, or examples/dma. or ...
>>>
>>> Good point - the new functions should be used somewhere within DPDK.
>> For this purpose, I will look into modifying rte_pktmbuf_copy(), which
>> is used by pdump_copy(), to use non-temporal copying of the packet
>> data.
>>>
>>>> Another thought - do we really need a separate inline function for
>> each
>>>> flavour?
>>>> Might be just one non-inline rte_memcpy_nt(dst, src, size, flags),
>>>> where flags could be combination of NT_SRC, NT_DST, and keep
>> alignment
>>>> detection/decisions to particular implementation?
>>>
>>> Thank you for the feedback, Konstantin.
>>>
>>> My answer to this suggestion gets a little longwinded...
>>>
>>> Looking at the DPDK pcapng library, it copies a 4 byte aligned
>> metadata structure sized 28 byte. So it can do with 4 byte aligned
>> functions.
>>>
>>> Our application can capture packets starting at the IP header, which
>> is offset by 14 byte (Ethernet header size) from the packet buffer, so
>> it requires 2 byte alignment. And thus, requiring 4 byte alignment is
>> not acceptable.
>>>
>>> Our application uses 16 byte alignment in the capture buffer area,
>> and can benefit from 16 byte aligned functions. Furthermore, x86
>> processors require 16 byte alignment for non-temporal load
>> instructions, so I think a 16 byte aligned non-temporal memcpy function
>> should be offered.
>>
>>
>> Yes, x86 needs 16B alignment for NT load/stores
>> But that's supposed to be arch specific limitation,
>> that we probably want to hide, no?
> 
> Agree.
> 
>> Inside the function can check alignment of both src and dst
>> and decide should it use NT load/store instructions or just
>> do normal copy.
> 
> Yes, I'm experimenting with the x86 inline function shown below. And hopefully, with some "extern inline" or other magic, I can hide the different implementations in the arch specific headers, and only expose the function declaration of rte_memcpy_nt() in the common header.
> 
> I'm currently working on the x86 implementation - when I'm satisfied with that, I'll look into how to hide the implementations in the arch specific header files, and only expose the common function declaration in the generic header file also used for documentation. I works for rte_memcpy(), so I can probably find the way to do it there.
> 
> /*
>   * Non-Temporal Memory Operations Flags.
>   */
> 
> #define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)   /** Length alignment mask. */
> #define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)      /** Length is 2 byte aligned. */
> #define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)      /** Length is 4 byte aligned. */
> #define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)      /** Length is 8 byte aligned. */
> #define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)     /** Length is 16 byte aligned. */
> #define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)     /** Length is 32 byte aligned. */
> #define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)     /** Length is 64 byte aligned. */
> #define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)    /** Length is 128 byte aligned. */
> 
> #define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 8)   /** Destination address alignment mask. */
> #define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 8)      /** Destination address is 2 byte aligned. */
> #define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 8)      /** Destination address is 4 byte aligned. */
> #define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 8)      /** Destination address is 8 byte aligned. */
> #define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 8)     /** Destination address is 16 byte aligned. */
> #define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 8)     /** Destination address is 32 byte aligned. */
> #define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 8)     /** Destination address is 64 byte aligned. */
> #define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 8)    /** Destination address is 128 byte aligned. */
> 
> #define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 16)  /** Source address alignment mask. */
> #define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 16)     /** Source address is 2 byte aligned. */
> #define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 16)     /** Source address is 4 byte aligned. */
> #define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 16)     /** Source address is 8 byte aligned. */
> #define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 16)    /** Source address is 16 byte aligned. */
> #define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 16)    /** Source address is 32 byte aligned. */
> #define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 16)    /** Source address is 64 byte aligned. */
> #define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 16)   /** Source address is 128 byte aligned. */
> 
> /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice.
>   *
>   * Non-temporal memory copy.
>   * The memory areas must not overlap.
>   *
>   * @note
>   * If the destination and/or length is unaligned, some copied bytes will be
>   * stored in the destination memory area using temporal access.
>   *
>   * @param dst
>   *   Pointer to the non-temporal destination memory area.
>   * @param src
>   *   Pointer to the non-temporal source memory area.
>   * @param len
>   *   Number of bytes to copy.
>   * @param flags
>   *   Hints for memory access.
>   *   Any of the RTE_MEMOPS_F_LENnA, RTE_MEMOPS_F_DSTnA, RTE_MEMOPS_F_SRCnA flags.
>   */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len,
>          const uint64_t flags)
> {
>      if (__builtin_constant_p(flags) ?
>              ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A &&
>              (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A) :
>              !(((uintptr_t)dst | len) & (16 - 1))) {
>          if (__builtin_constant_p(flags) ?
>                  (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A :
>                  !((uintptr_t)src & (16 - 1)))
>              rte_memcpy_nt16a(dst, src, len/*, flags*/);
>          else
>              rte_memcpy_nt16dla(dst, src, len/*, flags*/);
>      }
>      else if (__builtin_constant_p(flags) ? (
>              (flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A &&
>              (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A &&
>              (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A) :
>              !(((uintptr_t)dst | (uintptr_t)src | len) & (4 - 1))) {
>          rte_memcpy_nt4a(dst, src, len/*, flags*/);
>      }
>      else
>          rte_memcpy_nt_unaligned(dst, src, len/*, flags*/);
> }


Do we really need to expose all these dozen flags?
My thought, at least for the x86 implementation, was about something
simpler, like:
void rte_memcpy_nt(void * __rte_restrict dst,
	const void * __rte_restrict src, size_t len,
	const uint64_t flags)
{

	if (flags == (SRC_NT | DST_NT) &&
			(((uintptr_t)dst | (uintptr_t)src) & 0xf) == 0) {
		_do_nt_src_nt_dst_nt(...);
	} else if (flags == DST_NT && ((uintptr_t)dst & 0xf) == 0) {
		_do_src_na_dst_nt(...);
	} else if (flags == SRC_NT && ((uintptr_t)src & 0xf) == 0) {
		_do_src_nt_dst_na(...);
	} else
		memcpy(dst, src, len);
}

> 
> 
>>
>>
>>> While working on these funtions, I experimented with an
>> rte_memcpy_nt() taking flags, which is also my personal preference, but
>> haven't succeed yet. Especially when copying a 16 byte aligned
>> structure of only 16 byte, the overhead of the function call +
>> comparing the flags + the copy loop overhead is significant, compared
>> to inline code consisting of only one pair of "movntdqa (%rsi),%xmm0;
>> movntdq %xmm0,(%rdi)" instructions.
>>>
>>> Remember that a non-inlined rte_memcpy_nt() will be called with very
>> varying size, due to the typical mix of small and big packets, so
>> branch prediction will not help.
>>>
>>> This RFC does not yet show the rte_memcpy_nt() function handling
>> unaligned load/store, but it is more complex than the aligned
>> functions. So I think the aligned variants are warranted - for
>> performance reasons.
>>>
>>> Some of the need for exposing individual functions for different
>> alignment stems from the compiler being unable to determine the
>> alignment of the source and destination pointers at build time. So we
>> need to help the compiler with this at build time, and thus the need
>> for inlining the function. If we expose a bunch of small inline
>> functions or a big inline function with flags seems to be a matter of
>> taste.
>>>
>>> Thinking about it, you are probably right that exposing a single
>> function with flags is better for documentation purposes and easier for
>> other architectures to implement. But it still needs to be inline, for
>> the reasons described above.
>>
>>
>> Ok, my initial thought was that main use-case for it would be copying
>> of
>> big chunks of data, but from your description it might not be the case.
> 
> This is for quickly copying relatively small pieces of data synchronously without polluting the CPUs data cache, e.g. just before passing on a packet to an Ethernet PMD for transmission.
> 
> Big chunks of data should be copied asynchronously by DMA.
> 
>> Yes, for just 16/32B copy function call overhead might be way too
>> high...
>> As another alternative - would memcpy_nt_bulk() help somehow?
>> It can do copying for the several src/dst pairs at once and
>> that might help to amortize cost of function call.
> 
> In many cases, memcpy_nt() will replace memcpy() inside loops, so it should be just as easy to use as memcpy(). E.g. look at rte_pktmbuf_copy()... Building a memcopy array to pass to memcpy_nt_bulk() from rte_pktmbuf_copy() would require a significant rewrite of rte_pktmbuf_copy(), compared to just replacing rte_memcpy() with rte_memcpy_nt(). And this is just one function using memcpy().

Actually, one question I have for such small data transfers
(16B per packet) - do you still see some noticeable performance
improvement in such a scenario?
Another question - who will do 'sfence' after the copying?
Would it be inside memcpy_nt (seems quite costly), or would
it be another API function for that: memcpy_nt_flush() or so?

>>
>>
>>>
>>>>
>>>>
>>>>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
>>>> guide/index.html#text=_mm_stream_load
>>>>> [2] https://developer.arm.com/documentation/100076/0100/A64-
>>>> Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--
>> SIMD-
>>>> and-FP-
>>>>>
>>>>> V2:
>>>>> - Only copy from non-temporal source to non-temporal destination.
>>>>>      I.e. remove the two variants with only source and/or
>> destination
>>>> being
>>>>>      non-temporal.
>>>>> - Do not require alignment.
>>>>>      Instead, offer additional 4 and 16 byte aligned functions for
>>>> performance
>>>>>      purposes.
>>>>> - Implemented two of the functions for x86.
>>>>> - Remove memset function.
>>>>>
>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>>>>> ---
>>>>>
>>>>> /**
>>>>>     * @warning
>>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
>>>>>     *
>>>>>     * Copy data from non-temporal source to non-temporal
>> destination.
>>>>>     *
>>>>>     * @param dst
>>>>>     *   Pointer to the non-temporal destination of the data.
>>>>>     *   Should be 4 byte aligned, for optimal performance.
>>>>>     * @param src
>>>>>     *   Pointer to the non-temporal source data.
>>>>>     *   No alignment requirements.
>>>>>     * @param len
>>>>>     *   Number of bytes to copy.
>>>>>     *   Should be be divisible by 4, for optimal performance.
>>>>>     */
>>>>> __rte_experimental
>>>>> static __rte_always_inline
>>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
>>>> __access__(read_only, 2, 3)))
>>>>> void rte_memcpy_nt(void * __rte_restrict dst, const void *
>>>> __rte_restrict src, size_t len)
>>>>> /* Implementation T.B.D. */
>>>>>
>>>>> /**
>>>>>     * @warning
>>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
>>>>>     *
>>>>>     * Copy data in blocks of 16 byte from aligned non-temporal
>> source
>>>>>     * to aligned non-temporal destination.
>>>>>     *
>>>>>     * @param dst
>>>>>     *   Pointer to the non-temporal destination of the data.
>>>>>     *   Must be 16 byte aligned.
>>>>>     * @param src
>>>>>     *   Pointer to the non-temporal source data.
>>>>>     *   Must be 16 byte aligned.
>>>>>     * @param len
>>>>>     *   Number of bytes to copy.
>>>>>     *   Must be divisible by 16.
>>>>>     */
>>>>> __rte_experimental
>>>>> static __rte_always_inline
>>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
>>>> __access__(read_only, 2, 3)))
>>>>> void rte_memcpy_nt16a(void * __rte_restrict dst, const void *
>>>> __rte_restrict src, size_t len)
>>>>> {
>>>>>        const void * const  end = RTE_PTR_ADD(src, len);
>>>>>
>>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
>>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
>>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
>>>>>
>>>>>        /* Copy large portion of data. */
>>>>>        while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
>>>>>            register __m128i    xmm0, xmm1, xmm2, xmm3;
>>>>>
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>>>> pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 *
>>>> sizeof(__m128i)));
>>>>>            xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 *
>>>> sizeof(__m128i)));
>>>>>            xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 *
>>>> sizeof(__m128i)));
>>>>>            xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 *
>>>> sizeof(__m128i)));
>>>>> #pragma GCC diagnostic pop
>>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)),
>>>> xmm0);
>>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)),
>>>> xmm1);
>>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)),
>>>> xmm2);
>>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)),
>>>> xmm3);
>>>>>            src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
>>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
>>>>>        }
>>>>>
>>>>>        /* Copy remaining data. */
>>>>>        while (src != end) {
>>>>>            register __m128i    xmm;
>>>>>
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>>>> pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm = _mm_stream_load_si128(src);
>>>>> #pragma GCC diagnostic pop
>>>>>            _mm_stream_si128(dst, xmm);
>>>>>            src = RTE_PTR_ADD(src, sizeof(__m128i));
>>>>>            dst = RTE_PTR_ADD(dst, sizeof(__m128i));
>>>>>        }
>>>>> }
>>>>>
>>>>> /**
>>>>>     * @warning
>>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
>>>>>     *
>>>>>     * Copy data in blocks of 4 byte from aligned non-temporal source
>>>>>     * to aligned non-temporal destination.
>>>>>     *
>>>>>     * @param dst
>>>>>     *   Pointer to the non-temporal destination of the data.
>>>>>     *   Must be 4 byte aligned.
>>>>>     * @param src
>>>>>     *   Pointer to the non-temporal source data.
>>>>>     *   Must be 4 byte aligned.
>>>>>     * @param len
>>>>>     *   Number of bytes to copy.
>>>>>     *   Must be divisible by 4.
>>>>>     */
>>>>> __rte_experimental
>>>>> static __rte_always_inline
>>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
>>>> __access__(read_only, 2, 3)))
>>>>> void rte_memcpy_nt4a(void * __rte_restrict dst, const void *
>>>> __rte_restrict src, size_t len)
>>>>> {
>>>>>        int32_t             buf[sizeof(__m128i) / sizeof(int32_t)]
>>>> __rte_aligned(sizeof(__m128i));
>>>>>        /** Address of source data, rounded down to achieve
>> alignment.
>>>> */
>>>>>        const void *        srca = RTE_PTR_ALIGN_FLOOR(src,
>>>> sizeof(__m128i));
>>>>>        /** Address of end of source data, rounded down to achieve
>>>> alignment. */
>>>>>        const void * const  srcenda =
>>>> RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
>>>>>        const int           offset =  RTE_PTR_DIFF(src, srca) /
>>>> sizeof(int32_t);
>>>>>        register __m128i    xmm0;
>>>>>
>>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
>>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
>>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
>>>>>
>>>>>        if (unlikely(len == 0)) return;
>>>>>
>>>>>        /* Copy first, non-__m128i aligned, part of source data. */
>>>>>        if (offset) {
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>>>> pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm0 = _mm_stream_load_si128(srca);
>>>>>            _mm_store_si128((void *)buf, xmm0);
>>>>> #pragma GCC diagnostic pop
>>>>>            switch (offset) {
>>>>>                case 1:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>>>> sizeof(int32_t)), buf[1]);
>>>>>                    if (unlikely(len == 1 * sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
>>>> sizeof(int32_t)), buf[2]);
>>>>>                    if (unlikely(len == 2 * sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
>>>> sizeof(int32_t)), buf[3]);
>>>>>                    break;
>>>>>                case 2:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>>>> sizeof(int32_t)), buf[2]);
>>>>>                    if (unlikely(len == 1 * sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
>>>> sizeof(int32_t)), buf[3]);
>>>>>                    break;
>>>>>                case 3:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>>>> sizeof(int32_t)), buf[3]);
>>>>>                    break;
>>>>>            }
>>>>>            srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
>>>>>            dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
>>>>>        }
>>>>>
>>>>>        /* Copy middle, __m128i aligned, part of source data. */
>>>>>        while (srca != srcenda) {
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>>>> pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm0 = _mm_stream_load_si128(srca);
>>>>> #pragma GCC diagnostic pop
>>>>>            _mm_store_si128((void *)buf, xmm0);
>>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)),
>>>> buf[0]);
>>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)),
>>>> buf[1]);
>>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)),
>>>> buf[2]);
>>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)),
>>>> buf[3]);
>>>>>            srca = RTE_PTR_ADD(srca, sizeof(__m128i));
>>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
>>>>>        }
>>>>>
>>>>>        /* Copy last, non-__m128i aligned, part of source data. */
>>>>>        if (RTE_PTR_DIFF(srca, src) != 4) {
>>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a const
>>>> pointer as parameter. */
>>>>> #pragma GCC diagnostic push
>>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>>>>>            xmm0 = _mm_stream_load_si128(srca);
>>>>>            _mm_store_si128((void *)buf, xmm0);
>>>>> #pragma GCC diagnostic pop
>>>>>            switch (offset) {
>>>>>                case 1:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>>>> sizeof(int32_t)), buf[0]);
>>>>>                    break;
>>>>>                case 2:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>>>> sizeof(int32_t)), buf[0]);
>>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
>>>> sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
>>>> sizeof(int32_t)), buf[1]);
>>>>>                    break;
>>>>>                case 3:
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
>>>> sizeof(int32_t)), buf[0]);
>>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
>>>> sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
>>>> sizeof(int32_t)), buf[1]);
>>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 2 *
>>>> sizeof(int32_t))) return;
>>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
>>>> sizeof(int32_t)), buf[2]);
>>>>>                    break;
>>>>>            }
>>>>>        }
>>>>> }
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 10:00         ` Konstantin Ananyev
@ 2022-07-29 10:46           ` Morten Brørup
  2022-07-29 11:50             ` Konstantin Ananyev
  2022-07-29 12:13             ` Konstantin Ananyev
  0 siblings, 2 replies; 57+ messages in thread
From: Morten Brørup @ 2022-07-29 10:46 UTC (permalink / raw)
  To: Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> Sent: Friday, 29 July 2022 12.00
> 
> 24/07/2022 23:18, Morten Brørup пишет:
> >> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> >> Sent: Sunday, 24 July 2022 15.35
> >>
> >> 22/07/2022 11:44, Morten Brørup пишет:
> >>>> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> >>>> Sent: Friday, 22 July 2022 01.20
> >>>>
> >>>> Hi Morten,
> >>>>
> >>>>> This RFC proposes a set of functions optimized for non-temporal
> >>>> memory copy.
> >>>>>
> >>>>> At this stage, I am asking for feedback on the concept.
> >>>>>
> >>>>> Applications sometimes data to another memory location, which is
> >> only
> >>>> used
> >>>>> much later.
> >>>>> In this case, it is inefficient to pollute the data cache with
> the
> >>>> copied
> >>>>> data.
> >>>>>
> >>>>> An example use case (originating from a real life application):
> >>>>> Copying filtered packets, or the first part of them, into a
> capture
> >>>> buffer
> >>>>> for offline analysis.
> >>>>>
> >>>>> The purpose of these functions is to achieve a performance gain
> by
> >>>> not
> >>>>> polluting the cache when copying data.
> >>>>> Although the throughput may be improved by further optimization,
> I
> >> do
> >>>> not
> >>>>> consider througput optimization relevant initially.
> >>>>>
> >>>>> The x86 non-temporal load instructions have 16 byte alignment
> >>>>> requirements [1], while ARM non-temporal load instructions are
> >>>> available with
> >>>>> 4 byte alignment requirements [2].
> >>>>> Both platforms offer non-temporal store instructions with 4 byte
> >>>> alignment
> >>>>> requirements.
> >>>>>
> >>>>> In addition to the primary function without any alignment
> >>>> requirements, we
> >>>>> also provide functions for respectivly 16 and 4 byte aligned
> access
> >>>> for
> >>>>> performance purposes.
> >>>>>
> >>>>> The function names resemble standard C library function names,
> but
> >>>> their
> >>>>> signatures are intentionally different. No need to drag legacy
> into
> >>>> it.
> >>>>>
> >>>>> NB: Don't comment on spaces for indentation; a patch will follow
> >> DPDK
> >>>> coding
> >>>>> style and use TAB.
> >>>>
> >>>>
> >>>> I think there were discussions in other direction - remove
> >> rte_memcpy()
> >>>> completely and use memcpy() instead...
> >>>
> >>> Yes, the highly optimized rte_memcpy() implementation of memcpy()
> has
> >> become obsolete, now that modern compilers provide an efficient
> >> memcpy() implementation.
> >>>
> >>> It's an excellent reference, because we should learn from it, and
> >> avoid introducing similar mistakes with non-temporal memcpy.
> >>>
> >>>> But if we have a good use case for that, then I am positive in
> >>>> principle.
> >>>
> >>> The standard C library doesn't offer non-temporal memcpy(), so we
> >> need to implement it ourselves.
> >>>
> >>>> Though I think we need a clear use-case within dpdk for it
> >>>> to demonstrate perfomance gain.
> >>>
> >>> The performance gain is to avoid polluting the data cache. DPDK
> >> example applications, like l3fwd, are probably too primitive to
> measure
> >> any benefit in this regard.
> >>>
> >>>> Probably copying packets within pdump lib, or examples/dma. or ...
> >>>
> >>> Good point - the new functions should be used somewhere within
> DPDK.
> >> For this purpose, I will look into modifying rte_pktmbuf_copy(),
> which
> >> is used by pdump_copy(), to use non-temporal copying of the packet
> >> data.
> >>>
> >>>> Another thought - do we really need a separate inline function for
> >> each
> >>>> flavour?
> >>>> Might be just one non-inline rte_memcpy_nt(dst, src, size, flags),
> >>>> where flags could be combination of NT_SRC, NT_DST, and keep
> >> alignment
> >>>> detection/decisions to particular implementation?
> >>>
> >>> Thank you for the feedback, Konstantin.
> >>>
> >>> My answer to this suggestion gets a little longwinded...
> >>>
> >>> Looking at the DPDK pcapng library, it copies a 4 byte aligned
> >> metadata structure sized 28 byte. So it can do with 4 byte aligned
> >> functions.
> >>>
> >>> Our application can capture packets starting at the IP header,
> which
> >> is offset by 14 byte (Ethernet header size) from the packet buffer,
> so
> >> it requires 2 byte alignment. And thus, requiring 4 byte alignment
> is
> >> not acceptable.
> >>>
> >>> Our application uses 16 byte alignment in the capture buffer area,
> >> and can benefit from 16 byte aligned functions. Furthermore, x86
> >> processors require 16 byte alignment for non-temporal load
> >> instructions, so I think a 16 byte aligned non-temporal memcpy
> function
> >> should be offered.
> >>
> >>
> >> Yes, x86 needs 16B alignment for NT load/stores
> >> But that's supposed to be arch specific limitation,
> >> that we probably want to hide, no?
> >
> > Agree.
> >
> >> Inside the function can check alignment of both src and dst
> >> and decide should it use NT load/store instructions or just
> >> do normal copy.
> >
> > Yes, I'm experimenting with the x86 inline function shown below. And
> hopefully, with some "extern inline" or other magic, I can hide the
> different implementations in the arch specific headers, and only expose
> the function declaration of rte_memcpy_nt() in the common header.
> >
> > I'm currently working on the x86 implementation - when I'm satisfied
> with that, I'll look into how to hide the implementations in the arch
> specific header files, and only expose the common function declaration
> in the generic header file also used for documentation. I works for
> rte_memcpy(), so I can probably find the way to do it there.
> >
> > /*
> >   * Non-Temporal Memory Operations Flags.
> >   */
> >
> > #define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)   /** Length
> alignment mask. */
> > #define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)      /** Length is
> 2 byte aligned. */
> > #define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)      /** Length is
> 4 byte aligned. */
> > #define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)      /** Length is
> 8 byte aligned. */
> > #define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)     /** Length is
> 16 byte aligned. */
> > #define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)     /** Length is
> 32 byte aligned. */
> > #define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)     /** Length is
> 64 byte aligned. */
> > #define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)    /** Length is
> 128 byte aligned. */
> >
> > #define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 8)   /**
> Destination address alignment mask. */
> > #define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 8)      /**
> Destination address is 2 byte aligned. */
> > #define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 8)      /**
> Destination address is 4 byte aligned. */
> > #define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 8)      /**
> Destination address is 8 byte aligned. */
> > #define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 8)     /**
> Destination address is 16 byte aligned. */
> > #define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 8)     /**
> Destination address is 32 byte aligned. */
> > #define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 8)     /**
> Destination address is 64 byte aligned. */
> > #define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 8)    /**
> Destination address is 128 byte aligned. */
> >
> > #define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 16)  /** Source
> address alignment mask. */
> > #define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 16)     /** Source
> address is 2 byte aligned. */
> > #define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 16)     /** Source
> address is 4 byte aligned. */
> > #define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 16)     /** Source
> address is 8 byte aligned. */
> > #define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 16)    /** Source
> address is 16 byte aligned. */
> > #define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 16)    /** Source
> address is 32 byte aligned. */
> > #define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 16)    /** Source
> address is 64 byte aligned. */
> > #define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 16)   /** Source
> address is 128 byte aligned. */
> >
> > /**
> >   * @warning
> >   * @b EXPERIMENTAL: this API may change without prior notice.
> >   *
> >   * Non-temporal memory copy.
> >   * The memory areas must not overlap.
> >   *
> >   * @note
> >   * If the destination and/or length is unaligned, some copied bytes
> will be
> >   * stored in the destination memory area using temporal access.
> >   *
> >   * @param dst
> >   *   Pointer to the non-temporal destination memory area.
> >   * @param src
> >   *   Pointer to the non-temporal source memory area.
> >   * @param len
> >   *   Number of bytes to copy.
> >   * @param flags
> >   *   Hints for memory access.
> >   *   Any of the RTE_MEMOPS_F_LENnA, RTE_MEMOPS_F_DSTnA,
> RTE_MEMOPS_F_SRCnA flags.
> >   */
> > __rte_experimental
> > static __rte_always_inline
> > __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> __access__(read_only, 2, 3)))
> > void rte_memcpy_nt(void * __rte_restrict dst, const void *
> __rte_restrict src, size_t len,
> >          const uint64_t flags)
> > {
> >      if (__builtin_constant_p(flags) ?
> >              ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A
> &&
> >              (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)
> :
> >              !(((uintptr_t)dst | len) & (16 - 1))) {
> >          if (__builtin_constant_p(flags) ?
> >                  (flags & RTE_MEMOPS_F_SRCA_MASK) >=
> RTE_MEMOPS_F_SRC16A :
> >                  !((uintptr_t)src & (16 - 1)))
> >              rte_memcpy_nt16a(dst, src, len/*, flags*/);
> >          else
> >              rte_memcpy_nt16dla(dst, src, len/*, flags*/);
> >      }
> >      else if (__builtin_constant_p(flags) ? (
> >              (flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A
> &&
> >              (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A
> &&
> >              (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
> :
> >              !(((uintptr_t)dst | (uintptr_t)src | len) & (4 - 1))) {
> >          rte_memcpy_nt4a(dst, src, len/*, flags*/);
> >      }
> >      else
> >          rte_memcpy_nt_unaligned(dst, src, len/*, flags*/);
> > }
> 
> 
> Do we really need to expose all these dozen flags?
> My thought at least about x86 implementaion was about something more
> simple like:
> void rte_memcpy_nt(void * __rte_restrict dst,
> 	const void * __rte_restrict src, size_t len,
> 	const uint64_t flags)
> {
> 
> 	if (flags == (SRC_NT | DST_NT) && ((dst | src) & 0xf) == 0) {
> 		_do_nt_src_nt_dst_nt(...);
> 	} else if (flags == DST_NT && (dst & 0xf) == 0) {
> 		_do_src_na_dst_nt(...);
> 	} else if (flags == SRC_NT && (src & 0xf) == 0) {
> 		_do_src_nt_dst_na(...);
> 	} else
> 		memcpy(dst, src, len);
> }

The combination of flags, inline and __builtin_constant_p() allows the compiler to produce zero-overhead code. Without it, the resulting code will contain a bunch of run-time bitmask comparisons and branches to determine the ultimate copy function. On x86 there are not only 16 byte, but also 4 byte alignment variants of non-temporal store. The beauty of it will be more obvious when the patch is ready.

And in my current working version (not the code provided here), the flags are hints, so using them will be optional. The function headers will look roughly like this:

static inline void rte_memcpy_nt_ex(
		void * dst, const void * src,
		size_t len, uint64_t flags);

static inline void rte_memcpy_nt(
		void * dst, const void * src,
		size_t len) 
{
	rte_memcpy_nt_ex(dst, src, len, 0);
}
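
A hypothetical call site, just to illustrate the intent: with compile-time constant flags, the inline function can collapse to the aligned copy variant (flag names as proposed above, nothing final):

/* The caller knows dst, src and the length are 16 byte aligned. */
rte_memcpy_nt_ex(dst, src, 16,
		RTE_MEMOPS_F_LEN16A | RTE_MEMOPS_F_DST16A | RTE_MEMOPS_F_SRC16A);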

I might add an _ex postfix variant of the mbuf packet non-temporal copy function too, but I'm not working on that function yet, so I don't yet know if it makes sense or not.

My concept for build time alignment hints can also be copied into an rte_memcpy_ex() function for improved performance. But I don't want my patch to expand too much outside its initial scope, so I will not modify rte_memcpy() with this patch.

Alternatively, I could provide rte_memcpy_ex(d,s,l,flags) instead of rte_memcpy_nt[_ex](), and use the flags to indicate non-temporal source and destination.

This is only a question about which API the community prefers. I will not change the implementation of rte_memcpy() with this patch - that is a separate job.

> 
> >
> >
> >>
> >>
> >>> While working on these funtions, I experimented with an
> >> rte_memcpy_nt() taking flags, which is also my personal preference,
> but
> >> haven't succeed yet. Especially when copying a 16 byte aligned
> >> structure of only 16 byte, the overhead of the function call +
> >> comparing the flags + the copy loop overhead is significant,
> compared
> >> to inline code consisting of only one pair of "movntdqa
> (%rsi),%xmm0;
> >> movntdq %xmm0,(%rdi)" instructions.
> >>>
> >>> Remember that a non-inlined rte_memcpy_nt() will be called with
> very
> >> varying size, due to the typical mix of small and big packets, so
> >> branch prediction will not help.
> >>>
> >>> This RFC does not yet show the rte_memcpy_nt() function handling
> >> unaligned load/store, but it is more complex than the aligned
> >> functions. So I think the aligned variants are warranted - for
> >> performance reasons.
> >>>
> >>> Some of the need for exposing individual functions for different
> >> alignment stems from the compiler being unable to determine the
> >> alignment of the source and destination pointers at build time. So
> we
> >> need to help the compiler with this at build time, and thus the need
> >> for inlining the function. If we expose a bunch of small inline
> >> functions or a big inline function with flags seems to be a matter
> of
> >> taste.
> >>>
> >>> Thinking about it, you are probably right that exposing a single
> >> function with flags is better for documentation purposes and easier
> for
> >> other architectures to implement. But it still needs to be inline,
> for
> >> the reasons described above.
> >>
> >>
> >> Ok, my initial thought was that main use-case for it would be
> copying
> >> of
> >> big chunks of data, but from your description it might not be the
> case.
> >
> > This is for quickly copying relatively small pieces of data
> synchronously without polluting the CPUs data cache, e.g. just before
> passing on a packet to an Ethernet PMD for transmission.
> >
> > Big chunks of data should be copied asynchronously by DMA.
> >
> >> Yes, for just 16/32B copy function call overhead might be way too
> >> high...
> >> As another alternative - would memcpy_nt_bulk() help somehow?
> >> It can do copying for the several src/dst pairs at once and
> >> that might help to amortize cost of function call.
> >
> > In many cases, memcpy_nt() will replace memcpy() inside loops, so it
> should be just as easy to use as memcpy(). E.g. look at
> rte_pktmbuf_copy()... Building a memcopy array to pass to
> memcpy_nt_bulk() from rte_pktmbuf_copy() would require a significant
> rewrite of rte_pktmbuf_copy(), compared to just replacing rte_memcpy()
> with rte_memcpy_nt(). And this is just one function using memcpy().
> 
> Actually, one question I have for such small data-transfer
> (16B per packet) - do you still see some noticable perfomance
> improvement for such scenario?

Copying 16 bytes from each packet in a burst of 32 packets would otherwise pollute 64 cache lines = 4 KB of cache. With a typical 64 KB L1 data cache, I think it makes a difference.

> Another question - who will do 'sfence' after the copying?
> Would it be inside memcpy_nt (seems quite costly), or would
> it be another API function for that: memcpy_nt_flush() or so?

Outside. Only the developer knows when it is required, so it wouldn't make any sense to add the cost inside memcpy_nt().

I don't think we should add a flush function; it would just be another name for an already existing function. Referring to the required operation in the memcpy_nt() function documentation should suffice.
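
To illustrate the intended division of responsibility (a sketch only; cap_buf, nb_pkts and snap_len are made-up names, and _mm_sfence() is the x86 intrinsic, shown because this discussion is about x86):

/* Copy a burst of packet headers without polluting the cache... */
for (i = 0; i < nb_pkts; i++)
	rte_memcpy_nt(cap_buf[i], rte_pktmbuf_mtod(pkts[i], void *), snap_len);
/* ...and order the non-temporal stores once, after the whole burst. */
_mm_sfence();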

> 
> >>
> >>
> >>>
> >>>>
> >>>>
> >>>>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
> >>>> guide/index.html#text=_mm_stream_load
> >>>>> [2] https://developer.arm.com/documentation/100076/0100/A64-
> >>>> Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--
> >> SIMD-
> >>>> and-FP-
> >>>>>
> >>>>> V2:
> >>>>> - Only copy from non-temporal source to non-temporal destination.
> >>>>>      I.e. remove the two variants with only source and/or
> >> destination
> >>>> being
> >>>>>      non-temporal.
> >>>>> - Do not require alignment.
> >>>>>      Instead, offer additional 4 and 16 byte aligned functions
> for
> >>>> performance
> >>>>>      purposes.
> >>>>> - Implemented two of the functions for x86.
> >>>>> - Remove memset function.
> >>>>>
> >>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >>>>> ---
> >>>>>
> >>>>> /**
> >>>>>     * @warning
> >>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
> >>>>>     *
> >>>>>     * Copy data from non-temporal source to non-temporal
> >> destination.
> >>>>>     *
> >>>>>     * @param dst
> >>>>>     *   Pointer to the non-temporal destination of the data.
> >>>>>     *   Should be 4 byte aligned, for optimal performance.
> >>>>>     * @param src
> >>>>>     *   Pointer to the non-temporal source data.
> >>>>>     *   No alignment requirements.
> >>>>>     * @param len
> >>>>>     *   Number of bytes to copy.
> >>>>>     *   Should be be divisible by 4, for optimal performance.
> >>>>>     */
> >>>>> __rte_experimental
> >>>>> static __rte_always_inline
> >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> >>>> __access__(read_only, 2, 3)))
> >>>>> void rte_memcpy_nt(void * __rte_restrict dst, const void *
> >>>> __rte_restrict src, size_t len)
> >>>>> /* Implementation T.B.D. */
> >>>>>
> >>>>> /**
> >>>>>     * @warning
> >>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
> >>>>>     *
> >>>>>     * Copy data in blocks of 16 byte from aligned non-temporal
> >> source
> >>>>>     * to aligned non-temporal destination.
> >>>>>     *
> >>>>>     * @param dst
> >>>>>     *   Pointer to the non-temporal destination of the data.
> >>>>>     *   Must be 16 byte aligned.
> >>>>>     * @param src
> >>>>>     *   Pointer to the non-temporal source data.
> >>>>>     *   Must be 16 byte aligned.
> >>>>>     * @param len
> >>>>>     *   Number of bytes to copy.
> >>>>>     *   Must be divisible by 16.
> >>>>>     */
> >>>>> __rte_experimental
> >>>>> static __rte_always_inline
> >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> >>>> __access__(read_only, 2, 3)))
> >>>>> void rte_memcpy_nt16a(void * __rte_restrict dst, const void *
> >>>> __rte_restrict src, size_t len)
> >>>>> {
> >>>>>        const void * const  end = RTE_PTR_ADD(src, len);
> >>>>>
> >>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
> >>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
> >>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
> >>>>>
> >>>>>        /* Copy large portion of data. */
> >>>>>        while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
> >>>>>            register __m128i    xmm0, xmm1, xmm2, xmm3;
> >>>>>
> >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> const
> >>>> pointer as parameter. */
> >>>>> #pragma GCC diagnostic push
> >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>>>            xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 *
> >>>> sizeof(__m128i)));
> >>>>>            xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 *
> >>>> sizeof(__m128i)));
> >>>>>            xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 *
> >>>> sizeof(__m128i)));
> >>>>>            xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 *
> >>>> sizeof(__m128i)));
> >>>>> #pragma GCC diagnostic pop
> >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 0 *
> sizeof(__m128i)),
> >>>> xmm0);
> >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 1 *
> sizeof(__m128i)),
> >>>> xmm1);
> >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 2 *
> sizeof(__m128i)),
> >>>> xmm2);
> >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 3 *
> sizeof(__m128i)),
> >>>> xmm3);
> >>>>>            src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
> >>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
> >>>>>        }
> >>>>>
> >>>>>        /* Copy remaining data. */
> >>>>>        while (src != end) {
> >>>>>            register __m128i    xmm;
> >>>>>
> >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> const
> >>>> pointer as parameter. */
> >>>>> #pragma GCC diagnostic push
> >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>>>            xmm = _mm_stream_load_si128(src);
> >>>>> #pragma GCC diagnostic pop
> >>>>>            _mm_stream_si128(dst, xmm);
> >>>>>            src = RTE_PTR_ADD(src, sizeof(__m128i));
> >>>>>            dst = RTE_PTR_ADD(dst, sizeof(__m128i));
> >>>>>        }
> >>>>> }
> >>>>>
> >>>>> /**
> >>>>>     * @warning
> >>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
> >>>>>     *
> >>>>>     * Copy data in blocks of 4 byte from aligned non-temporal
> source
> >>>>>     * to aligned non-temporal destination.
> >>>>>     *
> >>>>>     * @param dst
> >>>>>     *   Pointer to the non-temporal destination of the data.
> >>>>>     *   Must be 4 byte aligned.
> >>>>>     * @param src
> >>>>>     *   Pointer to the non-temporal source data.
> >>>>>     *   Must be 4 byte aligned.
> >>>>>     * @param len
> >>>>>     *   Number of bytes to copy.
> >>>>>     *   Must be divisible by 4.
> >>>>>     */
> >>>>> __rte_experimental
> >>>>> static __rte_always_inline
> >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> >>>> __access__(read_only, 2, 3)))
> >>>>> void rte_memcpy_nt4a(void * __rte_restrict dst, const void *
> >>>> __rte_restrict src, size_t len)
> >>>>> {
> >>>>>        int32_t             buf[sizeof(__m128i) / sizeof(int32_t)]
> >>>> __rte_aligned(sizeof(__m128i));
> >>>>>        /** Address of source data, rounded down to achieve
> >> alignment.
> >>>> */
> >>>>>        const void *        srca = RTE_PTR_ALIGN_FLOOR(src,
> >>>> sizeof(__m128i));
> >>>>>        /** Address of end of source data, rounded down to achieve
> >>>> alignment. */
> >>>>>        const void * const  srcenda =
> >>>> RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
> >>>>>        const int           offset =  RTE_PTR_DIFF(src, srca) /
> >>>> sizeof(int32_t);
> >>>>>        register __m128i    xmm0;
> >>>>>
> >>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
> >>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
> >>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
> >>>>>
> >>>>>        if (unlikely(len == 0)) return;
> >>>>>
> >>>>>        /* Copy first, non-__m128i aligned, part of source data.
> */
> >>>>>        if (offset) {
> >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> const
> >>>> pointer as parameter. */
> >>>>> #pragma GCC diagnostic push
> >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>>>            xmm0 = _mm_stream_load_si128(srca);
> >>>>>            _mm_store_si128((void *)buf, xmm0);
> >>>>> #pragma GCC diagnostic pop
> >>>>>            switch (offset) {
> >>>>>                case 1:
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >>>> sizeof(int32_t)), buf[1]);
> >>>>>                    if (unlikely(len == 1 * sizeof(int32_t)))
> return;
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >>>> sizeof(int32_t)), buf[2]);
> >>>>>                    if (unlikely(len == 2 * sizeof(int32_t)))
> return;
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> >>>> sizeof(int32_t)), buf[3]);
> >>>>>                    break;
> >>>>>                case 2:
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >>>> sizeof(int32_t)), buf[2]);
> >>>>>                    if (unlikely(len == 1 * sizeof(int32_t)))
> return;
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >>>> sizeof(int32_t)), buf[3]);
> >>>>>                    break;
> >>>>>                case 3:
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >>>> sizeof(int32_t)), buf[3]);
> >>>>>                    break;
> >>>>>            }
> >>>>>            srca = RTE_PTR_ADD(srca, (4 - offset) *
> sizeof(int32_t));
> >>>>>            dst = RTE_PTR_ADD(dst, (4 - offset) *
> sizeof(int32_t));
> >>>>>        }
> >>>>>
> >>>>>        /* Copy middle, __m128i aligned, part of source data. */
> >>>>>        while (srca != srcenda) {
> >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> const
> >>>> pointer as parameter. */
> >>>>> #pragma GCC diagnostic push
> >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>>>            xmm0 = _mm_stream_load_si128(srca);
> >>>>> #pragma GCC diagnostic pop
> >>>>>            _mm_store_si128((void *)buf, xmm0);
> >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)),
> >>>> buf[0]);
> >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)),
> >>>> buf[1]);
> >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)),
> >>>> buf[2]);
> >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)),
> >>>> buf[3]);
> >>>>>            srca = RTE_PTR_ADD(srca, sizeof(__m128i));
> >>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
> >>>>>        }
> >>>>>
> >>>>>        /* Copy last, non-__m128i aligned, part of source data. */
> >>>>>        if (RTE_PTR_DIFF(srca, src) != 4) {
> >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> const
> >>>> pointer as parameter. */
> >>>>> #pragma GCC diagnostic push
> >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> >>>>>            xmm0 = _mm_stream_load_si128(srca);
> >>>>>            _mm_store_si128((void *)buf, xmm0);
> >>>>> #pragma GCC diagnostic pop
> >>>>>            switch (offset) {
> >>>>>                case 1:
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >>>> sizeof(int32_t)), buf[0]);
> >>>>>                    break;
> >>>>>                case 2:
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >>>> sizeof(int32_t)), buf[0]);
> >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> >>>> sizeof(int32_t))) return;
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >>>> sizeof(int32_t)), buf[1]);
> >>>>>                    break;
> >>>>>                case 3:
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> >>>> sizeof(int32_t)), buf[0]);
> >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> >>>> sizeof(int32_t))) return;
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> >>>> sizeof(int32_t)), buf[1]);
> >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 2 *
> >>>> sizeof(int32_t))) return;
> >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> >>>> sizeof(int32_t)), buf[2]);
> >>>>>                    break;
> >>>>>            }
> >>>>>        }
> >>>>> }
> >>>>>
> >>>>
> >>>
> >>
> >
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 10:46           ` Morten Brørup
@ 2022-07-29 11:50             ` Konstantin Ananyev
  2022-07-29 17:17               ` Morten Brørup
  2022-07-29 12:13             ` Konstantin Ananyev
  1 sibling, 1 reply; 57+ messages in thread
From: Konstantin Ananyev @ 2022-07-29 11:50 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach


> > From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> > Sent: Friday, 29 July 2022 12.00
> >
> > 24/07/2022 23:18, Morten Brørup пишет:
> > >> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> > >> Sent: Sunday, 24 July 2022 15.35
> > >>
> > >> 22/07/2022 11:44, Morten Brørup пишет:
> > >>>> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> > >>>> Sent: Friday, 22 July 2022 01.20
> > >>>>
> > >>>> Hi Morten,
> > >>>>
> > >>>>> This RFC proposes a set of functions optimized for non-temporal
> > >>>> memory copy.
> > >>>>>
> > >>>>> At this stage, I am asking for feedback on the concept.
> > >>>>>
> > >>>>> Applications sometimes data to another memory location, which is
> > >> only
> > >>>> used
> > >>>>> much later.
> > >>>>> In this case, it is inefficient to pollute the data cache with
> > the
> > >>>> copied
> > >>>>> data.
> > >>>>>
> > >>>>> An example use case (originating from a real life application):
> > >>>>> Copying filtered packets, or the first part of them, into a
> > capture
> > >>>> buffer
> > >>>>> for offline analysis.
> > >>>>>
> > >>>>> The purpose of these functions is to achieve a performance gain
> > by
> > >>>> not
> > >>>>> polluting the cache when copying data.
> > >>>>> Although the throughput may be improved by further optimization,
> > I
> > >> do
> > >>>> not
> > >>>>> consider througput optimization relevant initially.
> > >>>>>
> > >>>>> The x86 non-temporal load instructions have 16 byte alignment
> > >>>>> requirements [1], while ARM non-temporal load instructions are
> > >>>> available with
> > >>>>> 4 byte alignment requirements [2].
> > >>>>> Both platforms offer non-temporal store instructions with 4 byte
> > >>>> alignment
> > >>>>> requirements.
> > >>>>>
> > >>>>> In addition to the primary function without any alignment
> > >>>> requirements, we
> > >>>>> also provide functions for respectivly 16 and 4 byte aligned
> > access
> > >>>> for
> > >>>>> performance purposes.
> > >>>>>
> > >>>>> The function names resemble standard C library function names,
> > but
> > >>>> their
> > >>>>> signatures are intentionally different. No need to drag legacy
> > into
> > >>>> it.
> > >>>>>
> > >>>>> NB: Don't comment on spaces for indentation; a patch will follow
> > >> DPDK
> > >>>> coding
> > >>>>> style and use TAB.
> > >>>>
> > >>>>
> > >>>> I think there were discussions in other direction - remove
> > >> rte_memcpy()
> > >>>> completely and use memcpy() instead...
> > >>>
> > >>> Yes, the highly optimized rte_memcpy() implementation of memcpy()
> > has
> > >> become obsolete, now that modern compilers provide an efficient
> > >> memcpy() implementation.
> > >>>
> > >>> It's an excellent reference, because we should learn from it, and
> > >> avoid introducing similar mistakes with non-temporal memcpy.
> > >>>
> > >>>> But if we have a good use case for that, then I am positive in
> > >>>> principle.
> > >>>
> > >>> The standard C library doesn't offer non-temporal memcpy(), so we
> > >> need to implement it ourselves.
> > >>>
> > >>>> Though I think we need a clear use-case within dpdk for it
> > >>>> to demonstrate perfomance gain.
> > >>>
> > >>> The performance gain is to avoid polluting the data cache. DPDK
> > >> example applications, like l3fwd, are probably too primitive to
> > measure
> > >> any benefit in this regard.
> > >>>
> > >>>> Probably copying packets within pdump lib, or examples/dma. or ...
> > >>>
> > >>> Good point - the new functions should be used somewhere within
> > DPDK.
> > >> For this purpose, I will look into modifying rte_pktmbuf_copy(),
> > which
> > >> is used by pdump_copy(), to use non-temporal copying of the packet
> > >> data.
> > >>>
> > >>>> Another thought - do we really need a separate inline function for
> > >> each
> > >>>> flavour?
> > >>>> Might be just one non-inline rte_memcpy_nt(dst, src, size, flags),
> > >>>> where flags could be combination of NT_SRC, NT_DST, and keep
> > >> alignment
> > >>>> detection/decisions to particular implementation?
> > >>>
> > >>> Thank you for the feedback, Konstantin.
> > >>>
> > >>> My answer to this suggestion gets a little longwinded...
> > >>>
> > >>> Looking at the DPDK pcapng library, it copies a 4 byte aligned
> > >> metadata structure sized 28 byte. So it can do with 4 byte aligned
> > >> functions.
> > >>>
> > >>> Our application can capture packets starting at the IP header,
> > which
> > >> is offset by 14 byte (Ethernet header size) from the packet buffer,
> > so
> > >> it requires 2 byte alignment. And thus, requiring 4 byte alignment
> > is
> > >> not acceptable.
> > >>>
> > >>> Our application uses 16 byte alignment in the capture buffer area,
> > >> and can benefit from 16 byte aligned functions. Furthermore, x86
> > >> processors require 16 byte alignment for non-temporal load
> > >> instructions, so I think a 16 byte aligned non-temporal memcpy
> > function
> > >> should be offered.
> > >>
> > >>
> > >> Yes, x86 needs 16B alignment for NT load/stores
> > >> But that's supposed to be arch specific limitation,
> > >> that we probably want to hide, no?
> > >
> > > Agree.
> > >
> > >> Inside the function can check alignment of both src and dst
> > >> and decide should it use NT load/store instructions or just
> > >> do normal copy.
> > >
> > > Yes, I'm experimenting with the x86 inline function shown below. And
> > hopefully, with some "extern inline" or other magic, I can hide the
> > different implementations in the arch specific headers, and only expose
> > the function declaration of rte_memcpy_nt() in the common header.
> > >
> > > I'm currently working on the x86 implementation - when I'm satisfied
> > with that, I'll look into how to hide the implementations in the arch
> > specific header files, and only expose the common function declaration
> > in the generic header file also used for documentation. I works for
> > rte_memcpy(), so I can probably find the way to do it there.
> > >
> > > /*
> > >   * Non-Temporal Memory Operations Flags.
> > >   */
> > >
> > > #define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)   /** Length
> > alignment mask. */
> > > #define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)      /** Length is
> > 2 byte aligned. */
> > > #define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)      /** Length is
> > 4 byte aligned. */
> > > #define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)      /** Length is
> > 8 byte aligned. */
> > > #define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)     /** Length is
> > 16 byte aligned. */
> > > #define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)     /** Length is
> > 32 byte aligned. */
> > > #define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)     /** Length is
> > 64 byte aligned. */
> > > #define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)    /** Length is
> > 128 byte aligned. */
> > >
> > > #define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 8)   /**
> > Destination address alignment mask. */
> > > #define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 8)      /**
> > Destination address is 2 byte aligned. */
> > > #define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 8)      /**
> > Destination address is 4 byte aligned. */
> > > #define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 8)      /**
> > Destination address is 8 byte aligned. */
> > > #define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 8)     /**
> > Destination address is 16 byte aligned. */
> > > #define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 8)     /**
> > Destination address is 32 byte aligned. */
> > > #define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 8)     /**
> > Destination address is 64 byte aligned. */
> > > #define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 8)    /**
> > Destination address is 128 byte aligned. */
> > >
> > > #define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 16)  /** Source
> > address alignment mask. */
> > > #define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 16)     /** Source
> > address is 2 byte aligned. */
> > > #define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 16)     /** Source
> > address is 4 byte aligned. */
> > > #define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 16)     /** Source
> > address is 8 byte aligned. */
> > > #define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 16)    /** Source
> > address is 16 byte aligned. */
> > > #define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 16)    /** Source
> > address is 32 byte aligned. */
> > > #define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 16)    /** Source
> > address is 64 byte aligned. */
> > > #define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 16)   /** Source
> > address is 128 byte aligned. */
> > >
> > > /**
> > >   * @warning
> > >   * @b EXPERIMENTAL: this API may change without prior notice.
> > >   *
> > >   * Non-temporal memory copy.
> > >   * The memory areas must not overlap.
> > >   *
> > >   * @note
> > >   * If the destination and/or length is unaligned, some copied bytes
> > will be
> > >   * stored in the destination memory area using temporal access.
> > >   *
> > >   * @param dst
> > >   *   Pointer to the non-temporal destination memory area.
> > >   * @param src
> > >   *   Pointer to the non-temporal source memory area.
> > >   * @param len
> > >   *   Number of bytes to copy.
> > >   * @param flags
> > >   *   Hints for memory access.
> > >   *   Any of the RTE_MEMOPS_F_LENnA, RTE_MEMOPS_F_DSTnA,
> > RTE_MEMOPS_F_SRCnA flags.
> > >   */
> > > __rte_experimental
> > > static __rte_always_inline
> > > __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> > __access__(read_only, 2, 3)))
> > > void rte_memcpy_nt(void * __rte_restrict dst, const void *
> > __rte_restrict src, size_t len,
> > >          const uint64_t flags)
> > > {
> > >      if (__builtin_constant_p(flags) ?
> > >              ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A
> > &&
> > >              (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)
> > :
> > >              !(((uintptr_t)dst | len) & (16 - 1))) {
> > >          if (__builtin_constant_p(flags) ?
> > >                  (flags & RTE_MEMOPS_F_SRCA_MASK) >=
> > RTE_MEMOPS_F_SRC16A :
> > >                  !((uintptr_t)src & (16 - 1)))
> > >              rte_memcpy_nt16a(dst, src, len/*, flags*/);
> > >          else
> > >              rte_memcpy_nt16dla(dst, src, len/*, flags*/);
> > >      }
> > >      else if (__builtin_constant_p(flags) ? (
> > >              (flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN4A
> > &&
> > >              (flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST4A
> > &&
> > >              (flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC4A)
> > :
> > >              !(((uintptr_t)dst | (uintptr_t)src | len) & (4 - 1))) {
> > >          rte_memcpy_nt4a(dst, src, len/*, flags*/);
> > >      }
> > >      else
> > >          rte_memcpy_nt_unaligned(dst, src, len/*, flags*/);
> > > }
> >
> >
> > Do we really need to expose all these dozen flags?
> > My thought at least about x86 implementaion was about something more
> > simple like:
> > void rte_memcpy_nt(void * __rte_restrict dst,
> > 	const void * __rte_restrict src, size_t len,
> > 	const uint64_t flags)
> > {
> >
> > 	if (flags == (SRC_NT | DST_NT) && ((dst | src) & 0xf) == 0) {
> > 		_do_nt_src_nt_dst_nt(...);
> > 	} else if (flags == DST_NT && (dst & 0xf) == 0) {
> > 		_do_src_na_dst_nt(...);
> > 	} else if (flags == SRC_NT && (src & 0xf) == 0) {
> > 		_do_src_nt_dst_na(...);
> > 	} else
> > 		memcpy(dst, src, len);
> > }
> 
> The combination of flags, inline and __builtin_constant_p() allows the compiler to produce zero-overhead code. Without it, the
> resulting code will contain a bunch of run-time bitmask comparisons and branches to determine the ultimate copy function.

I think it is unavoidable, unless your intention is for this function to trust the flags without checking the actual addresses.

 On x86
> there are not only 16 byte, but also 4 byte alignment variants of non-temporal store. The beauty of it will be more obvious when the
> patch is ready.

Ok, I will just wait for final version then :)
 
> 
> And in my current working version (not the code provided here), the flags are hints, so using them will be optional. The function
> headers will look roughly like this:
> 
> static inline void rte_memcpy_nt_ex(
> 		void * dst, const void * src,
> 		size_t len, uint64_t flags);
> 
> static inline void rte_memcpy_nt(
> 		void * dst, const void * src,
> 		size_t len)
> {
> 	rte_memcpy_nt_ex(dst, src, len, 0);
> }
> 
> I might add an _ex postfix variant of the mbuf packet non-temporal copy function too, but I'm not working on that function yet, so I
> don't yet know if it makes sense or not.
> 
> My concept for build time alignment hints can also be copied into an rte_memcpy_ex() function for improved performance. But I
> don't want my patch to expand too much outside its initial scope, so I will not modify rte_memcpy() with this patch.
> 
> Alternatively, I could provide rte_memcpy_ex(d,s,l,flags) instead of rte_memcpy_nt[_ex](), and use the flags to indicate non-
> temporal source and destination.
> 
> This is only a question about which API the community prefers. I will not change the implementation of rte_memcpy() with this patch -
> it's another job to do that.
> 
> >
> > >
> > >
> > >>
> > >>
> > >>> While working on these funtions, I experimented with an
> > >> rte_memcpy_nt() taking flags, which is also my personal preference,
> > but
> > >> haven't succeed yet. Especially when copying a 16 byte aligned
> > >> structure of only 16 byte, the overhead of the function call +
> > >> comparing the flags + the copy loop overhead is significant,
> > compared
> > >> to inline code consisting of only one pair of "movntdqa
> > (%rsi),%xmm0;
> > >> movntdq %xmm0,(%rdi)" instructions.
> > >>>
> > >>> Remember that a non-inlined rte_memcpy_nt() will be called with
> > very
> > >> varying size, due to the typical mix of small and big packets, so
> > >> branch prediction will not help.
> > >>>
> > >>> This RFC does not yet show the rte_memcpy_nt() function handling
> > >> unaligned load/store, but it is more complex than the aligned
> > >> functions. So I think the aligned variants are warranted - for
> > >> performance reasons.
> > >>>
> > >>> Some of the need for exposing individual functions for different
> > >> alignment stems from the compiler being unable to determine the
> > >> alignment of the source and destination pointers at build time. So
> > we
> > >> need to help the compiler with this at build time, and thus the need
> > >> for inlining the function. If we expose a bunch of small inline
> > >> functions or a big inline function with flags seems to be a matter
> > of
> > >> taste.
> > >>>
> > >>> Thinking about it, you are probably right that exposing a single
> > >> function with flags is better for documentation purposes and easier
> > for
> > >> other architectures to implement. But it still needs to be inline,
> > for
> > >> the reasons described above.
> > >>
> > >>
> > >> Ok, my initial thought was that main use-case for it would be
> > copying
> > >> of
> > >> big chunks of data, but from your description it might not be the
> > case.
> > >
> > > This is for quickly copying relatively small pieces of data
> > synchronously without polluting the CPUs data cache, e.g. just before
> > passing on a packet to an Ethernet PMD for transmission.
> > >
> > > Big chunks of data should be copied asynchronously by DMA.
> > >
> > >> Yes, for just 16/32B copy function call overhead might be way too
> > >> high...
> > >> As another alternative - would memcpy_nt_bulk() help somehow?
> > >> It can do copying for the several src/dst pairs at once and
> > >> that might help to amortize cost of function call.
> > >
> > > In many cases, memcpy_nt() will replace memcpy() inside loops, so it
> > should be just as easy to use as memcpy(). E.g. look at
> > rte_pktmbuf_copy()... Building a memcopy array to pass to
> > memcpy_nt_bulk() from rte_pktmbuf_copy() would require a significant
> > rewrite of rte_pktmbuf_copy(), compared to just replacing rte_memcpy()
> > with rte_memcpy_nt(). And this is just one function using memcpy().
> >
> > Actually, one question I have for such small data-transfer
> > (16B per packet) - do you still see some noticable perfomance
> > improvement for such scenario?
> 
> Copying 16 bytes from each packet in a burst of 32 packets would otherwise pollute 64 cache lines = 4 KB of cache. With a typical 64 KB L1
> cache, I think it makes a difference.

I understand the intention behind it; my question was - is it really measurable?
Something like: using pktmbuf_copy_nt(len=16) instead of pktmbuf_copy(len=16)
on workload X gives Y% throughput improvement?

> 
> > Another question - who will do 'sfence' after the copying?
> > Would it be inside memcpy_nt (seems quite costly), or would
> > it be another API function for that: memcpy_nt_flush() or so?
> 
> Outside. Only the developer knows when it is required, so it wouldn't make any sense to add the cost inside memcpy_nt().
> 
> I don't think we should add a flush function; it would just be another name for an already existing function. Referring to the required
> operation in the memcpy_nt() function documentation should suffice.
> 
> >
> > >>
> > >>
> > >>>
> > >>>>
> > >>>>
> > >>>>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
> > >>>> guide/index.html#text=_mm_stream_load
> > >>>>> [2] https://developer.arm.com/documentation/100076/0100/A64-
> > >>>> Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--
> > >> SIMD-
> > >>>> and-FP-
> > >>>>>
> > >>>>> V2:
> > >>>>> - Only copy from non-temporal source to non-temporal destination.
> > >>>>>      I.e. remove the two variants with only source and/or
> > >> destination
> > >>>> being
> > >>>>>      non-temporal.
> > >>>>> - Do not require alignment.
> > >>>>>      Instead, offer additional 4 and 16 byte aligned functions
> > for
> > >>>> performance
> > >>>>>      purposes.
> > >>>>> - Implemented two of the functions for x86.
> > >>>>> - Remove memset function.
> > >>>>>
> > >>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >>>>> ---
> > >>>>>
> > >>>>> /**
> > >>>>>     * @warning
> > >>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
> > >>>>>     *
> > >>>>>     * Copy data from non-temporal source to non-temporal
> > >> destination.
> > >>>>>     *
> > >>>>>     * @param dst
> > >>>>>     *   Pointer to the non-temporal destination of the data.
> > >>>>>     *   Should be 4 byte aligned, for optimal performance.
> > >>>>>     * @param src
> > >>>>>     *   Pointer to the non-temporal source data.
> > >>>>>     *   No alignment requirements.
> > >>>>>     * @param len
> > >>>>>     *   Number of bytes to copy.
> > >>>>>     *   Should be be divisible by 4, for optimal performance.
> > >>>>>     */
> > >>>>> __rte_experimental
> > >>>>> static __rte_always_inline
> > >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> > >>>> __access__(read_only, 2, 3)))
> > >>>>> void rte_memcpy_nt(void * __rte_restrict dst, const void *
> > >>>> __rte_restrict src, size_t len)
> > >>>>> /* Implementation T.B.D. */
> > >>>>>
> > >>>>> /**
> > >>>>>     * @warning
> > >>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
> > >>>>>     *
> > >>>>>     * Copy data in blocks of 16 byte from aligned non-temporal
> > >> source
> > >>>>>     * to aligned non-temporal destination.
> > >>>>>     *
> > >>>>>     * @param dst
> > >>>>>     *   Pointer to the non-temporal destination of the data.
> > >>>>>     *   Must be 16 byte aligned.
> > >>>>>     * @param src
> > >>>>>     *   Pointer to the non-temporal source data.
> > >>>>>     *   Must be 16 byte aligned.
> > >>>>>     * @param len
> > >>>>>     *   Number of bytes to copy.
> > >>>>>     *   Must be divisible by 16.
> > >>>>>     */
> > >>>>> __rte_experimental
> > >>>>> static __rte_always_inline
> > >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> > >>>> __access__(read_only, 2, 3)))
> > >>>>> void rte_memcpy_nt16a(void * __rte_restrict dst, const void *
> > >>>> __rte_restrict src, size_t len)
> > >>>>> {
> > >>>>>        const void * const  end = RTE_PTR_ADD(src, len);
> > >>>>>
> > >>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
> > >>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
> > >>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
> > >>>>>
> > >>>>>        /* Copy large portion of data. */
> > >>>>>        while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
> > >>>>>            register __m128i    xmm0, xmm1, xmm2, xmm3;
> > >>>>>
> > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > const
> > >>>> pointer as parameter. */
> > >>>>> #pragma GCC diagnostic push
> > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > >>>>>            xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 *
> > >>>> sizeof(__m128i)));
> > >>>>>            xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 *
> > >>>> sizeof(__m128i)));
> > >>>>>            xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 *
> > >>>> sizeof(__m128i)));
> > >>>>>            xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 *
> > >>>> sizeof(__m128i)));
> > >>>>> #pragma GCC diagnostic pop
> > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 0 *
> > sizeof(__m128i)),
> > >>>> xmm0);
> > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 1 *
> > sizeof(__m128i)),
> > >>>> xmm1);
> > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 2 *
> > sizeof(__m128i)),
> > >>>> xmm2);
> > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 3 *
> > sizeof(__m128i)),
> > >>>> xmm3);
> > >>>>>            src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
> > >>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
> > >>>>>        }
> > >>>>>
> > >>>>>        /* Copy remaining data. */
> > >>>>>        while (src != end) {
> > >>>>>            register __m128i    xmm;
> > >>>>>
> > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > const
> > >>>> pointer as parameter. */
> > >>>>> #pragma GCC diagnostic push
> > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > >>>>>            xmm = _mm_stream_load_si128(src);
> > >>>>> #pragma GCC diagnostic pop
> > >>>>>            _mm_stream_si128(dst, xmm);
> > >>>>>            src = RTE_PTR_ADD(src, sizeof(__m128i));
> > >>>>>            dst = RTE_PTR_ADD(dst, sizeof(__m128i));
> > >>>>>        }
> > >>>>> }
> > >>>>>
> > >>>>> /**
> > >>>>>     * @warning
> > >>>>>     * @b EXPERIMENTAL: this API may change without prior notice.
> > >>>>>     *
> > >>>>>     * Copy data in blocks of 4 byte from aligned non-temporal
> > source
> > >>>>>     * to aligned non-temporal destination.
> > >>>>>     *
> > >>>>>     * @param dst
> > >>>>>     *   Pointer to the non-temporal destination of the data.
> > >>>>>     *   Must be 4 byte aligned.
> > >>>>>     * @param src
> > >>>>>     *   Pointer to the non-temporal source data.
> > >>>>>     *   Must be 4 byte aligned.
> > >>>>>     * @param len
> > >>>>>     *   Number of bytes to copy.
> > >>>>>     *   Must be divisible by 4.
> > >>>>>     */
> > >>>>> __rte_experimental
> > >>>>> static __rte_always_inline
> > >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> > >>>> __access__(read_only, 2, 3)))
> > >>>>> void rte_memcpy_nt4a(void * __rte_restrict dst, const void *
> > >>>> __rte_restrict src, size_t len)
> > >>>>> {
> > >>>>>        int32_t             buf[sizeof(__m128i) / sizeof(int32_t)]
> > >>>> __rte_aligned(sizeof(__m128i));
> > >>>>>        /** Address of source data, rounded down to achieve
> > >> alignment.
> > >>>> */
> > >>>>>        const void *        srca = RTE_PTR_ALIGN_FLOOR(src,
> > >>>> sizeof(__m128i));
> > >>>>>        /** Address of end of source data, rounded down to achieve
> > >>>> alignment. */
> > >>>>>        const void * const  srcenda =
> > >>>> RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
> > >>>>>        const int           offset =  RTE_PTR_DIFF(src, srca) /
> > >>>> sizeof(int32_t);
> > >>>>>        register __m128i    xmm0;
> > >>>>>
> > >>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
> > >>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
> > >>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
> > >>>>>
> > >>>>>        if (unlikely(len == 0)) return;
> > >>>>>
> > >>>>>        /* Copy first, non-__m128i aligned, part of source data.
> > */
> > >>>>>        if (offset) {
> > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > const
> > >>>> pointer as parameter. */
> > >>>>> #pragma GCC diagnostic push
> > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > >>>>>            xmm0 = _mm_stream_load_si128(srca);
> > >>>>>            _mm_store_si128((void *)buf, xmm0);
> > >>>>> #pragma GCC diagnostic pop
> > >>>>>            switch (offset) {
> > >>>>>                case 1:
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > >>>> sizeof(int32_t)), buf[1]);
> > >>>>>                    if (unlikely(len == 1 * sizeof(int32_t)))
> > return;
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > >>>> sizeof(int32_t)), buf[2]);
> > >>>>>                    if (unlikely(len == 2 * sizeof(int32_t)))
> > return;
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> > >>>> sizeof(int32_t)), buf[3]);
> > >>>>>                    break;
> > >>>>>                case 2:
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > >>>> sizeof(int32_t)), buf[2]);
> > >>>>>                    if (unlikely(len == 1 * sizeof(int32_t)))
> > return;
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > >>>> sizeof(int32_t)), buf[3]);
> > >>>>>                    break;
> > >>>>>                case 3:
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > >>>> sizeof(int32_t)), buf[3]);
> > >>>>>                    break;
> > >>>>>            }
> > >>>>>            srca = RTE_PTR_ADD(srca, (4 - offset) *
> > sizeof(int32_t));
> > >>>>>            dst = RTE_PTR_ADD(dst, (4 - offset) *
> > sizeof(int32_t));
> > >>>>>        }
> > >>>>>
> > >>>>>        /* Copy middle, __m128i aligned, part of source data. */
> > >>>>>        while (srca != srcenda) {
> > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > const
> > >>>> pointer as parameter. */
> > >>>>> #pragma GCC diagnostic push
> > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > >>>>>            xmm0 = _mm_stream_load_si128(srca);
> > >>>>> #pragma GCC diagnostic pop
> > >>>>>            _mm_store_si128((void *)buf, xmm0);
> > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)),
> > >>>> buf[0]);
> > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)),
> > >>>> buf[1]);
> > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)),
> > >>>> buf[2]);
> > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)),
> > >>>> buf[3]);
> > >>>>>            srca = RTE_PTR_ADD(srca, sizeof(__m128i));
> > >>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
> > >>>>>        }
> > >>>>>
> > >>>>>        /* Copy last, non-__m128i aligned, part of source data. */
> > >>>>>        if (RTE_PTR_DIFF(srca, src) != 4) {
> > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > const
> > >>>> pointer as parameter. */
> > >>>>> #pragma GCC diagnostic push
> > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > >>>>>            xmm0 = _mm_stream_load_si128(srca);
> > >>>>>            _mm_store_si128((void *)buf, xmm0);
> > >>>>> #pragma GCC diagnostic pop
> > >>>>>            switch (offset) {
> > >>>>>                case 1:
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > >>>> sizeof(int32_t)), buf[0]);
> > >>>>>                    break;
> > >>>>>                case 2:
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > >>>> sizeof(int32_t)), buf[0]);
> > >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> > >>>> sizeof(int32_t))) return;
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > >>>> sizeof(int32_t)), buf[1]);
> > >>>>>                    break;
> > >>>>>                case 3:
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > >>>> sizeof(int32_t)), buf[0]);
> > >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1 *
> > >>>> sizeof(int32_t))) return;
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > >>>> sizeof(int32_t)), buf[1]);
> > >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 2 *
> > >>>> sizeof(int32_t))) return;
> > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> > >>>> sizeof(int32_t)), buf[2]);
> > >>>>>                    break;
> > >>>>>            }
> > >>>>>        }
> > >>>>> }
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> >


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 10:46           ` Morten Brørup
  2022-07-29 11:50             ` Konstantin Ananyev
@ 2022-07-29 12:13             ` Konstantin Ananyev
  2022-07-29 16:05               ` Stephen Hemminger
  2022-07-29 18:13               ` Morten Brørup
  1 sibling, 2 replies; 57+ messages in thread
From: Konstantin Ananyev @ 2022-07-29 12:13 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach


Sorry, missed that part.

> 
> > Another question - who will do 'sfence' after the copying?
> > Would it be inside memcpy_nt (seems quite costly), or would
> > it be another API function for that: memcpy_nt_flush() or so?
> 
> Outside. Only the developer knows when it is required, so it wouldn't make any sense to add the cost inside memcpy_nt().
> 
> I don't think we should add a flush function; it would just be another name for an already existing function. Referring to the required
> operation in the memcpy_nt() function documentation should suffice.
> 

Ok, but again, wouldn't it be arch-specific?
AFAIK for x86 it needs to boil down to sfence; for other architectures - I don't know.
If you think there already is some generic function (rte_wmb?) that would always produce
correct instructions - sure, let's use it.
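
For illustration, a hedged sketch of how a thin generic wrapper could look, if a dedicated name is wanted at all (rte_memcpy_nt_flush() is made up here; on x86 the generic rte_wmb() already boils down to sfence):

/*
 * Hypothetical wrapper - just an alias for the existing generic store
 * barrier, so every architecture gets its own correct instruction
 * (sfence on x86) without adding a new arch-specific implementation.
 */
static __rte_always_inline void
rte_memcpy_nt_flush(void)
{
    rte_wmb();
}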
 
 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-29 12:13             ` Konstantin Ananyev
@ 2022-07-29 16:05               ` Stephen Hemminger
  2022-07-29 17:29                 ` Morten Brørup
  2022-08-07 20:40                 ` Mattias Rönnblom
  2022-07-29 18:13               ` Morten Brørup
  1 sibling, 2 replies; 57+ messages in thread
From: Stephen Hemminger @ 2022-07-29 16:05 UTC (permalink / raw)
  To: Konstantin Ananyev
  Cc: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson,
	Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

On Fri, 29 Jul 2022 12:13:52 +0000
Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:

> Sorry, missed that part.
> 
> >   
> > > Another question - who will do 'sfence' after the copying?
> > > Would it be inside memcpy_nt (seems quite costly), or would
> > > it be another API function for that: memcpy_nt_flush() or so?  
> > 
> > Outside. Only the developer knows when it is required, so it wouldn't make any sense to add the cost inside memcpy_nt().
> > 
> > I don't think we should add a flush function; it would just be another name for an already existing function. Referring to the required
> > operation in the memcpy_nt() function documentation should suffice.
> >   
> 
> Ok, but again, wouldn't it be arch-specific?
> AFAIK for x86 it needs to boil down to sfence; for other architectures - I don't know.
> If you think there already is some generic function (rte_wmb?) that would always produce
> correct instructions - sure, let's use it.
>  
>  

It makes sense to use non-temporal copy in a few select places.
But it would add unnecessary complexity to DPDK if every function in DPDK that could
cause a copy had a non-temporal variant.

Maybe rte_memcpy could just have a threshold (config value?) so that, if the copy is larger than
a certain size, it would automatically be non-temporal.  Small copies wouldn't matter;
the optimization is more about avoiding cache pollution caused by large streams of data.
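
A rough sketch of that idea (the names and threshold value below are made up; rte_memcpy_nt() is the function proposed in this RFC):

/* Hypothetical threshold-based dispatch - not an actual DPDK API. */
#define RTE_MEMCPY_NT_THRESHOLD 4096	/* made-up config value */

static __rte_always_inline void
rte_memcpy_auto(void *dst, const void *src, size_t n)
{
    if (n >= RTE_MEMCPY_NT_THRESHOLD)
        rte_memcpy_nt(dst, src, n);	/* large copy: bypass the cache */
    else
        rte_memcpy(dst, src, n);	/* small copy: normal, cached copy */
}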

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 11:50             ` Konstantin Ananyev
@ 2022-07-29 17:17               ` Morten Brørup
  2022-07-29 22:00                 ` Konstantin Ananyev
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-07-29 17:17 UTC (permalink / raw)
  To: Konstantin Ananyev, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Friday, 29 July 2022 13.50
> 
> > > From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> > > Sent: Friday, 29 July 2022 12.00
> > >
> > > 24/07/2022 23:18, Morten Brørup пишет:
> > > >> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> > > >> Sent: Sunday, 24 July 2022 15.35
> > > >>
> > > >> 22/07/2022 11:44, Morten Brørup пишет:
> > > >>>> From: Konstantin Ananyev
> [mailto:konstantin.v.ananyev@yandex.ru]
> > > >>>> Sent: Friday, 22 July 2022 01.20
> > > >>>>
> > > >>>> Hi Morten,
> > > >>>>
> > > >>>>> This RFC proposes a set of functions optimized for non-
> temporal
> > > >>>> memory copy.
> > > >>>>>
> > > >>>>> At this stage, I am asking for feedback on the concept.
> > > >>>>>
> > > >>>>> Applications sometimes data to another memory location, which
> is
> > > >> only
> > > >>>> used
> > > >>>>> much later.
> > > >>>>> In this case, it is inefficient to pollute the data cache
> with
> > > the
> > > >>>> copied
> > > >>>>> data.
> > > >>>>>
> > > >>>>> An example use case (originating from a real life
> application):
> > > >>>>> Copying filtered packets, or the first part of them, into a
> > > capture
> > > >>>> buffer
> > > >>>>> for offline analysis.
> > > >>>>>
> > > >>>>> The purpose of these functions is to achieve a performance
> gain
> > > by
> > > >>>> not
> > > >>>>> polluting the cache when copying data.
> > > >>>>> Although the throughput may be improved by further
> optimization,
> > > I
> > > >> do
> > > >>>> not
> > > >>>>> consider througput optimization relevant initially.
> > > >>>>>
> > > >>>>> The x86 non-temporal load instructions have 16 byte alignment
> > > >>>>> requirements [1], while ARM non-temporal load instructions
> are
> > > >>>> available with
> > > >>>>> 4 byte alignment requirements [2].
> > > >>>>> Both platforms offer non-temporal store instructions with 4
> byte
> > > >>>> alignment
> > > >>>>> requirements.
> > > >>>>>
> > > >>>>> In addition to the primary function without any alignment
> > > >>>> requirements, we
> > > >>>>> also provide functions for respectivly 16 and 4 byte aligned
> > > access
> > > >>>> for
> > > >>>>> performance purposes.
> > > >>>>>
> > > >>>>> The function names resemble standard C library function
> names,
> > > but
> > > >>>> their
> > > >>>>> signatures are intentionally different. No need to drag
> legacy
> > > into
> > > >>>> it.
> > > >>>>>
> > > >>>>> NB: Don't comment on spaces for indentation; a patch will
> follow
> > > >> DPDK
> > > >>>> coding
> > > >>>>> style and use TAB.
> > > >>>>
> > > >>>>
> > > >>>> I think there were discussions in other direction - remove
> > > >> rte_memcpy()
> > > >>>> completely and use memcpy() instead...
> > > >>>
> > > >>> Yes, the highly optimized rte_memcpy() implementation of
> memcpy()
> > > has
> > > >> become obsolete, now that modern compilers provide an efficient
> > > >> memcpy() implementation.
> > > >>>
> > > >>> It's an excellent reference, because we should learn from it,
> and
> > > >> avoid introducing similar mistakes with non-temporal memcpy.
> > > >>>
> > > >>>> But if we have a good use case for that, then I am positive in
> > > >>>> principle.
> > > >>>
> > > >>> The standard C library doesn't offer non-temporal memcpy(), so
> we
> > > >> need to implement it ourselves.
> > > >>>
> > > >>>> Though I think we need a clear use-case within dpdk for it
> > > >>>> to demonstrate perfomance gain.
> > > >>>
> > > >>> The performance gain is to avoid polluting the data cache. DPDK
> > > >> example applications, like l3fwd, are probably too primitive to
> > > measure
> > > >> any benefit in this regard.
> > > >>>
> > > >>>> Probably copying packets within pdump lib, or examples/dma. or
> ...
> > > >>>
> > > >>> Good point - the new functions should be used somewhere within
> > > DPDK.
> > > >> For this purpose, I will look into modifying rte_pktmbuf_copy(),
> > > which
> > > >> is used by pdump_copy(), to use non-temporal copying of the
> packet
> > > >> data.
> > > >>>
> > > >>>> Another thought - do we really need a separate inline function
> for
> > > >> each
> > > >>>> flavour?
> > > >>>> Might be just one non-inline rte_memcpy_nt(dst, src, size,
> flags),
> > > >>>> where flags could be combination of NT_SRC, NT_DST, and keep
> > > >> alignment
> > > >>>> detection/decisions to particular implementation?
> > > >>>
> > > >>> Thank you for the feedback, Konstantin.
> > > >>>
> > > >>> My answer to this suggestion gets a little longwinded...
> > > >>>
> > > >>> Looking at the DPDK pcapng library, it copies a 4 byte aligned
> > > >> metadata structure sized 28 byte. So it can do with 4 byte
> aligned
> > > >> functions.
> > > >>>
> > > >>> Our application can capture packets starting at the IP header,
> > > which
> > > >> is offset by 14 byte (Ethernet header size) from the packet
> buffer,
> > > so
> > > >> it requires 2 byte alignment. And thus, requiring 4 byte
> alignment
> > > is
> > > >> not acceptable.
> > > >>>
> > > >>> Our application uses 16 byte alignment in the capture buffer
> area,
> > > >> and can benefit from 16 byte aligned functions. Furthermore, x86
> > > >> processors require 16 byte alignment for non-temporal load
> > > >> instructions, so I think a 16 byte aligned non-temporal memcpy
> > > function
> > > >> should be offered.
> > > >>
> > > >>
> > > >> Yes, x86 needs 16B alignment for NT load/stores
> > > >> But that's supposed to be arch specific limitation,
> > > >> that we probably want to hide, no?
> > > >
> > > > Agree.
> > > >
> > > >> Inside the function can check alignment of both src and dst
> > > >> and decide should it use NT load/store instructions or just
> > > >> do normal copy.
> > > >
> > > > Yes, I'm experimenting with the x86 inline function shown below.
> And
> > > hopefully, with some "extern inline" or other magic, I can hide the
> > > different implementations in the arch specific headers, and only
> expose
> > > the function declaration of rte_memcpy_nt() in the common header.
> > > >
> > > > I'm currently working on the x86 implementation - when I'm
> satisfied
> > > with that, I'll look into how to hide the implementations in the
> arch
> > > specific header files, and only expose the common function
> declaration
> > > in the generic header file also used for documentation. I works for
> > > rte_memcpy(), so I can probably find the way to do it there.
> > > >
> > > > /*
> > > >   * Non-Temporal Memory Operations Flags.
> > > >   */
> > > >
> > > > #define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)   /**
> Length
> > > alignment mask. */
> > > > #define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)      /**
> Length is
> > > 2 byte aligned. */
> > > > #define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)      /**
> Length is
> > > 4 byte aligned. */
> > > > #define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)      /**
> Length is
> > > 8 byte aligned. */
> > > > #define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)     /**
> Length is
> > > 16 byte aligned. */
> > > > #define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)     /**
> Length is
> > > 32 byte aligned. */
> > > > #define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)     /**
> Length is
> > > 64 byte aligned. */
> > > > #define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)    /**
> Length is
> > > 128 byte aligned. */
> > > >
> > > > #define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 8)   /**
> > > Destination address alignment mask. */
> > > > #define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 8)      /**
> > > Destination address is 2 byte aligned. */
> > > > #define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 8)      /**
> > > Destination address is 4 byte aligned. */
> > > > #define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 8)      /**
> > > Destination address is 8 byte aligned. */
> > > > #define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 8)     /**
> > > Destination address is 16 byte aligned. */
> > > > #define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 8)     /**
> > > Destination address is 32 byte aligned. */
> > > > #define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 8)     /**
> > > Destination address is 64 byte aligned. */
> > > > #define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 8)    /**
> > > Destination address is 128 byte aligned. */
> > > >
> > > > #define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 16)  /**
> Source
> > > address alignment mask. */
> > > > #define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 16)     /**
> Source
> > > address is 2 byte aligned. */
> > > > #define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 16)     /**
> Source
> > > address is 4 byte aligned. */
> > > > #define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 16)     /**
> Source
> > > address is 8 byte aligned. */
> > > > #define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 16)    /**
> Source
> > > address is 16 byte aligned. */
> > > > #define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 16)    /**
> Source
> > > address is 32 byte aligned. */
> > > > #define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 16)    /**
> Source
> > > address is 64 byte aligned. */
> > > > #define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 16)   /**
> Source
> > > address is 128 byte aligned. */
> > > >
> > > > /**
> > > >   * @warning
> > > >   * @b EXPERIMENTAL: this API may change without prior notice.
> > > >   *
> > > >   * Non-temporal memory copy.
> > > >   * The memory areas must not overlap.
> > > >   *
> > > >   * @note
> > > >   * If the destination and/or length is unaligned, some copied
> bytes
> > > will be
> > > >   * stored in the destination memory area using temporal access.
> > > >   *
> > > >   * @param dst
> > > >   *   Pointer to the non-temporal destination memory area.
> > > >   * @param src
> > > >   *   Pointer to the non-temporal source memory area.
> > > >   * @param len
> > > >   *   Number of bytes to copy.
> > > >   * @param flags
> > > >   *   Hints for memory access.
> > > >   *   Any of the RTE_MEMOPS_F_LENnA, RTE_MEMOPS_F_DSTnA,
> > > RTE_MEMOPS_F_SRCnA flags.
> > > >   */
> > > > __rte_experimental
> > > > static __rte_always_inline
> > > > __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> > > __access__(read_only, 2, 3)))
> > > > void rte_memcpy_nt(void * __rte_restrict dst, const void *
> > > __rte_restrict src, size_t len,
> > > >          const uint64_t flags)
> > > > {
> > > >      if (__builtin_constant_p(flags) ?
> > > >              ((flags & RTE_MEMOPS_F_LENA_MASK) >=
> RTE_MEMOPS_F_LEN16A
> > > &&
> > > >              (flags & RTE_MEMOPS_F_DSTA_MASK) >=
> RTE_MEMOPS_F_DST16A)
> > > :
> > > >              !(((uintptr_t)dst | len) & (16 - 1))) {
> > > >          if (__builtin_constant_p(flags) ?
> > > >                  (flags & RTE_MEMOPS_F_SRCA_MASK) >=
> > > RTE_MEMOPS_F_SRC16A :
> > > >                  !((uintptr_t)src & (16 - 1)))
> > > >              rte_memcpy_nt16a(dst, src, len/*, flags*/);
> > > >          else
> > > >              rte_memcpy_nt16dla(dst, src, len/*, flags*/);
> > > >      }
> > > >      else if (__builtin_constant_p(flags) ? (
> > > >              (flags & RTE_MEMOPS_F_LENA_MASK) >=
> RTE_MEMOPS_F_LEN4A
> > > &&
> > > >              (flags & RTE_MEMOPS_F_DSTA_MASK) >=
> RTE_MEMOPS_F_DST4A
> > > &&
> > > >              (flags & RTE_MEMOPS_F_SRCA_MASK) >=
> RTE_MEMOPS_F_SRC4A)
> > > :
> > > >              !(((uintptr_t)dst | (uintptr_t)src | len) & (4 -
> 1))) {
> > > >          rte_memcpy_nt4a(dst, src, len/*, flags*/);
> > > >      }
> > > >      else
> > > >          rte_memcpy_nt_unaligned(dst, src, len/*, flags*/);
> > > > }
> > >
> > >
> > > Do we really need to expose all these dozen flags?
> > > My thought at least about x86 implementaion was about something
> more
> > > simple like:
> > > void rte_memcpy_nt(void * __rte_restrict dst,
> > > 	const void * __rte_restrict src, size_t len,
> > > 	const uint64_t flags)
> > > {
> > >
> > > 	if (flags == (SRC_NT | DST_NT) && ((dst | src) & 0xf) == 0) {
> > > 		_do_nt_src_nt_dst_nt(...);
> > > 	} else if (flags == DST_NT && (dst & 0xf) == 0) {
> > > 		_do_src_na_dst_nt(...);
> > > 	} else if (flags == SRC_NT && (src & 0xf) == 0) {
> > > 		_do_src_nt_dst_na(...);
> > > 	} else
> > > 		memcpy(dst, src, len);
> > > }
> >
> > The combination of flags, inline and __builtin_constant_p() allows
> the compiler to produce zero-overhead code. Without it, the
> > resulting code will contain a bunch of run-time bitmask comparisons
> and branches to determine the ultimate copy function.
> 
> I think it is unavoidable, unless your intention for this function to
> trust flags without checking actual addresses.

The intention is to trust the flags. Fast path functions should trust that parameters passed are valid and conform to specified requirements.

We can always throw in a few RTE_ASSERTs - they are omitted unless compiled for debug, and thus have zero cost in production.
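For example, the debug checks could look roughly like this (just a sketch, using the flags and masks from the code above; only active when built with RTE_ENABLE_ASSERT, zero cost otherwise):

	/* Verify that the alignment hints match the actual parameters. */
	if ((flags & RTE_MEMOPS_F_DSTA_MASK) >= RTE_MEMOPS_F_DST16A)
		RTE_ASSERT(rte_is_aligned(dst, 16));
	if ((flags & RTE_MEMOPS_F_SRCA_MASK) >= RTE_MEMOPS_F_SRC16A)
		RTE_ASSERT(rte_is_aligned(src, 16));
	if ((flags & RTE_MEMOPS_F_LENA_MASK) >= RTE_MEMOPS_F_LEN16A)
		RTE_ASSERT((len & (16 - 1)) == 0);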

> 
>  On x86
> > there are not only 16 byte, but also 4 byte alignment variants of
> non-temporal store. The beauty of it will be more obvious when the
> > patch is ready.
> 
> Ok, I will just wait for final version then :)
> 
> >
> > And in my current working version (not the code provided here), the
> flags are hints, so using them will be optional. The function
> > headers will look roughly like this:
> >
> > static inline void rte_memcpy_nt_ex(
> > 		void * dst, const void * src,
> > 		size_t len, uint64_t flags);
> >
> > static inline void rte_memcpy_nt(
> > 		void * dst, const void * src,
> > 		size_t len)
> > {
> > 	rte_memcpy_nt_ex(dst, src, len, 0);
> > }
> >
> > I might add an _ex postfix variant of the mbuf packet non-temporal
> copy function too, but I'm not working on that function yet, so I
> > don't yet know if it makes sense or not.
> >
> > My concept for build time alignment hints can also be copied into an
> rte_memcpy_ex() function for improved performance. But I
> > don't want my patch to expand too much outside its initial scope, so
> I will not modify rte_memcpy() with this patch.
> >
> > Alternatively, I could provide rte_memcpy_ex(d,s,l,flags) instead of
> rte_memcpy_nt[_ex](), and use the flags to indicate non-
> > temporal source and destination.
> >
> > This is only a question about which API the community prefers. I will
> not change the implementation of rte_memcpy() with this patch -
> > it's another job to do that.
> >
> > >
> > > >
> > > >
> > > >>
> > > >>
> > > >>> While working on these funtions, I experimented with an
> > > >> rte_memcpy_nt() taking flags, which is also my personal
> preference,
> > > but
> > > >> haven't succeed yet. Especially when copying a 16 byte aligned
> > > >> structure of only 16 byte, the overhead of the function call +
> > > >> comparing the flags + the copy loop overhead is significant,
> > > compared
> > > >> to inline code consisting of only one pair of "movntdqa
> > > (%rsi),%xmm0;
> > > >> movntdq %xmm0,(%rdi)" instructions.
> > > >>>
> > > >>> Remember that a non-inlined rte_memcpy_nt() will be called with
> > > very
> > > >> varying size, due to the typical mix of small and big packets,
> so
> > > >> branch prediction will not help.
> > > >>>
> > > >>> This RFC does not yet show the rte_memcpy_nt() function
> handling
> > > >> unaligned load/store, but it is more complex than the aligned
> > > >> functions. So I think the aligned variants are warranted - for
> > > >> performance reasons.
> > > >>>
> > > >>> Some of the need for exposing individual functions for
> different
> > > >> alignment stems from the compiler being unable to determine the
> > > >> alignment of the source and destination pointers at build time.
> So
> > > we
> > > >> need to help the compiler with this at build time, and thus the
> need
> > > >> for inlining the function. If we expose a bunch of small inline
> > > >> functions or a big inline function with flags seems to be a
> matter
> > > of
> > > >> taste.
> > > >>>
> > > >>> Thinking about it, you are probably right that exposing a
> single
> > > >> function with flags is better for documentation purposes and
> easier
> > > for
> > > >> other architectures to implement. But it still needs to be
> inline,
> > > for
> > > >> the reasons described above.
> > > >>
> > > >>
> > > >> Ok, my initial thought was that main use-case for it would be
> > > copying
> > > >> of
> > > >> big chunks of data, but from your description it might not be
> the
> > > case.
> > > >
> > > > This is for quickly copying relatively small pieces of data
> > > synchronously without polluting the CPUs data cache, e.g. just
> before
> > > passing on a packet to an Ethernet PMD for transmission.
> > > >
> > > > Big chunks of data should be copied asynchronously by DMA.
> > > >
> > > >> Yes, for just 16/32B copy function call overhead might be way
> too
> > > >> high...
> > > >> As another alternative - would memcpy_nt_bulk() help somehow?
> > > >> It can do copying for the several src/dst pairs at once and
> > > >> that might help to amortize cost of function call.
> > > >
> > > > In many cases, memcpy_nt() will replace memcpy() inside loops, so
> it
> > > should be just as easy to use as memcpy(). E.g. look at
> > > rte_pktmbuf_copy()... Building a memcopy array to pass to
> > > memcpy_nt_bulk() from rte_pktmbuf_copy() would require a
> significant
> > > rewrite of rte_pktmbuf_copy(), compared to just replacing
> rte_memcpy()
> > > with rte_memcpy_nt(). And this is just one function using memcpy().
> > >
> > > Actually, one question I have for such small data-transfer
> > > (16B per packet) - do you still see some noticable perfomance
> > > improvement for such scenario?
> >
> > Copying 16 byte from each packet in a burst of 32 packets would
> otherwise pollute 64 cache lines = 4 KB cache. With typically 64 KB L1
> > cache, I think it makes a difference.
> 
> I understand the intention behind, my question was - it is really
> measurable?
> Something like: using pktmbuf_copy_nt(len=16) over using
> pktmbuf_copy(len=16)
> on workload X gives Y% thoughtput improvement?

If the application is complex enough, and needs some of those 4 KB of cache otherwise wasted, there will be a significant throughput improvement; otherwise probably not.

I have a general problem with this type of question: I hate that throughput is the only KPI (Key Performance Indicator) getting any attention on the mailing list! Other KPIs, such as latency and resource conservation, are just as important in many real life use cases.

Here's a number for you: 6.25 % reduction in L1 data cache consumption. (Assuming 64 KB L1 cache with 64 byte cache lines and application burst length of 32 packets.)
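The arithmetic behind that number: 32 packets * 2 cache lines (source + destination) * 64 byte = 4 KB, and 4 KB / 64 KB L1 data cache = 6.25 %.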

> 
> >
> > > Another question - who will do 'sfence' after the copying?
> > > Would it be inside memcpy_nt (seems quite costly), or would
> > > it be another API function for that: memcpy_nt_flush() or so?
> >
> > Outside. Only the developer knows when it is required, so it wouldn't
> make any sense to add the cost inside memcpy_nt().
> >
> > I don't think we should add a flush function; it would just be
> another name for an already existing function. Referring to the
> required
> > operation in the memcpy_nt() function documentation should suffice.
> >
> > >
> > > >>
> > > >>
> > > >>>
> > > >>>>
> > > >>>>
> > > >>>>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
> > > >>>> guide/index.html#text=_mm_stream_load
> > > >>>>> [2] https://developer.arm.com/documentation/100076/0100/A64-
> > > >>>> Instruction-Set-Reference/A64-Floating-point-
> Instructions/LDNP--
> > > >> SIMD-
> > > >>>> and-FP-
> > > >>>>>
> > > >>>>> V2:
> > > >>>>> - Only copy from non-temporal source to non-temporal
> destination.
> > > >>>>>      I.e. remove the two variants with only source and/or
> > > >> destination
> > > >>>> being
> > > >>>>>      non-temporal.
> > > >>>>> - Do not require alignment.
> > > >>>>>      Instead, offer additional 4 and 16 byte aligned
> functions
> > > for
> > > >>>> performance
> > > >>>>>      purposes.
> > > >>>>> - Implemented two of the functions for x86.
> > > >>>>> - Remove memset function.
> > > >>>>>
> > > >>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > >>>>> ---
> > > >>>>>
> > > >>>>> /**
> > > >>>>>     * @warning
> > > >>>>>     * @b EXPERIMENTAL: this API may change without prior
> notice.
> > > >>>>>     *
> > > >>>>>     * Copy data from non-temporal source to non-temporal
> > > >> destination.
> > > >>>>>     *
> > > >>>>>     * @param dst
> > > >>>>>     *   Pointer to the non-temporal destination of the data.
> > > >>>>>     *   Should be 4 byte aligned, for optimal performance.
> > > >>>>>     * @param src
> > > >>>>>     *   Pointer to the non-temporal source data.
> > > >>>>>     *   No alignment requirements.
> > > >>>>>     * @param len
> > > >>>>>     *   Number of bytes to copy.
> > > >>>>>     *   Should be be divisible by 4, for optimal performance.
> > > >>>>>     */
> > > >>>>> __rte_experimental
> > > >>>>> static __rte_always_inline
> > > >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1,
> 3),
> > > >>>> __access__(read_only, 2, 3)))
> > > >>>>> void rte_memcpy_nt(void * __rte_restrict dst, const void *
> > > >>>> __rte_restrict src, size_t len)
> > > >>>>> /* Implementation T.B.D. */
> > > >>>>>
> > > >>>>> /**
> > > >>>>>     * @warning
> > > >>>>>     * @b EXPERIMENTAL: this API may change without prior
> notice.
> > > >>>>>     *
> > > >>>>>     * Copy data in blocks of 16 byte from aligned non-
> temporal
> > > >> source
> > > >>>>>     * to aligned non-temporal destination.
> > > >>>>>     *
> > > >>>>>     * @param dst
> > > >>>>>     *   Pointer to the non-temporal destination of the data.
> > > >>>>>     *   Must be 16 byte aligned.
> > > >>>>>     * @param src
> > > >>>>>     *   Pointer to the non-temporal source data.
> > > >>>>>     *   Must be 16 byte aligned.
> > > >>>>>     * @param len
> > > >>>>>     *   Number of bytes to copy.
> > > >>>>>     *   Must be divisible by 16.
> > > >>>>>     */
> > > >>>>> __rte_experimental
> > > >>>>> static __rte_always_inline
> > > >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1,
> 3),
> > > >>>> __access__(read_only, 2, 3)))
> > > >>>>> void rte_memcpy_nt16a(void * __rte_restrict dst, const void *
> > > >>>> __rte_restrict src, size_t len)
> > > >>>>> {
> > > >>>>>        const void * const  end = RTE_PTR_ADD(src, len);
> > > >>>>>
> > > >>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
> > > >>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
> > > >>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
> > > >>>>>
> > > >>>>>        /* Copy large portion of data. */
> > > >>>>>        while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i))
> {
> > > >>>>>            register __m128i    xmm0, xmm1, xmm2, xmm3;
> > > >>>>>
> > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > const
> > > >>>> pointer as parameter. */
> > > >>>>> #pragma GCC diagnostic push
> > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > >>>>>            xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 *
> > > >>>> sizeof(__m128i)));
> > > >>>>>            xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 *
> > > >>>> sizeof(__m128i)));
> > > >>>>>            xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 *
> > > >>>> sizeof(__m128i)));
> > > >>>>>            xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 *
> > > >>>> sizeof(__m128i)));
> > > >>>>> #pragma GCC diagnostic pop
> > > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 0 *
> > > sizeof(__m128i)),
> > > >>>> xmm0);
> > > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 1 *
> > > sizeof(__m128i)),
> > > >>>> xmm1);
> > > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 2 *
> > > sizeof(__m128i)),
> > > >>>> xmm2);
> > > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 3 *
> > > sizeof(__m128i)),
> > > >>>> xmm3);
> > > >>>>>            src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
> > > >>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
> > > >>>>>        }
> > > >>>>>
> > > >>>>>        /* Copy remaining data. */
> > > >>>>>        while (src != end) {
> > > >>>>>            register __m128i    xmm;
> > > >>>>>
> > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > const
> > > >>>> pointer as parameter. */
> > > >>>>> #pragma GCC diagnostic push
> > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > >>>>>            xmm = _mm_stream_load_si128(src);
> > > >>>>> #pragma GCC diagnostic pop
> > > >>>>>            _mm_stream_si128(dst, xmm);
> > > >>>>>            src = RTE_PTR_ADD(src, sizeof(__m128i));
> > > >>>>>            dst = RTE_PTR_ADD(dst, sizeof(__m128i));
> > > >>>>>        }
> > > >>>>> }
> > > >>>>>
> > > >>>>> /**
> > > >>>>>     * @warning
> > > >>>>>     * @b EXPERIMENTAL: this API may change without prior
> notice.
> > > >>>>>     *
> > > >>>>>     * Copy data in blocks of 4 byte from aligned non-temporal
> > > source
> > > >>>>>     * to aligned non-temporal destination.
> > > >>>>>     *
> > > >>>>>     * @param dst
> > > >>>>>     *   Pointer to the non-temporal destination of the data.
> > > >>>>>     *   Must be 4 byte aligned.
> > > >>>>>     * @param src
> > > >>>>>     *   Pointer to the non-temporal source data.
> > > >>>>>     *   Must be 4 byte aligned.
> > > >>>>>     * @param len
> > > >>>>>     *   Number of bytes to copy.
> > > >>>>>     *   Must be divisible by 4.
> > > >>>>>     */
> > > >>>>> __rte_experimental
> > > >>>>> static __rte_always_inline
> > > >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1,
> 3),
> > > >>>> __access__(read_only, 2, 3)))
> > > >>>>> void rte_memcpy_nt4a(void * __rte_restrict dst, const void *
> > > >>>> __rte_restrict src, size_t len)
> > > >>>>> {
> > > >>>>>        int32_t             buf[sizeof(__m128i) /
> sizeof(int32_t)]
> > > >>>> __rte_aligned(sizeof(__m128i));
> > > >>>>>        /** Address of source data, rounded down to achieve
> > > >> alignment.
> > > >>>> */
> > > >>>>>        const void *        srca = RTE_PTR_ALIGN_FLOOR(src,
> > > >>>> sizeof(__m128i));
> > > >>>>>        /** Address of end of source data, rounded down to
> achieve
> > > >>>> alignment. */
> > > >>>>>        const void * const  srcenda =
> > > >>>> RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
> > > >>>>>        const int           offset =  RTE_PTR_DIFF(src, srca)
> /
> > > >>>> sizeof(int32_t);
> > > >>>>>        register __m128i    xmm0;
> > > >>>>>
> > > >>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
> > > >>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
> > > >>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
> > > >>>>>
> > > >>>>>        if (unlikely(len == 0)) return;
> > > >>>>>
> > > >>>>>        /* Copy first, non-__m128i aligned, part of source
> data.
> > > */
> > > >>>>>        if (offset) {
> > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > const
> > > >>>> pointer as parameter. */
> > > >>>>> #pragma GCC diagnostic push
> > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > >>>>>            xmm0 = _mm_stream_load_si128(srca);
> > > >>>>>            _mm_store_si128((void *)buf, xmm0);
> > > >>>>> #pragma GCC diagnostic pop
> > > >>>>>            switch (offset) {
> > > >>>>>                case 1:
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > >>>> sizeof(int32_t)), buf[1]);
> > > >>>>>                    if (unlikely(len == 1 * sizeof(int32_t)))
> > > return;
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > > >>>> sizeof(int32_t)), buf[2]);
> > > >>>>>                    if (unlikely(len == 2 * sizeof(int32_t)))
> > > return;
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> > > >>>> sizeof(int32_t)), buf[3]);
> > > >>>>>                    break;
> > > >>>>>                case 2:
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > >>>> sizeof(int32_t)), buf[2]);
> > > >>>>>                    if (unlikely(len == 1 * sizeof(int32_t)))
> > > return;
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > > >>>> sizeof(int32_t)), buf[3]);
> > > >>>>>                    break;
> > > >>>>>                case 3:
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > >>>> sizeof(int32_t)), buf[3]);
> > > >>>>>                    break;
> > > >>>>>            }
> > > >>>>>            srca = RTE_PTR_ADD(srca, (4 - offset) *
> > > sizeof(int32_t));
> > > >>>>>            dst = RTE_PTR_ADD(dst, (4 - offset) *
> > > sizeof(int32_t));
> > > >>>>>        }
> > > >>>>>
> > > >>>>>        /* Copy middle, __m128i aligned, part of source data.
> */
> > > >>>>>        while (srca != srcenda) {
> > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > const
> > > >>>> pointer as parameter. */
> > > >>>>> #pragma GCC diagnostic push
> > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > >>>>>            xmm0 = _mm_stream_load_si128(srca);
> > > >>>>> #pragma GCC diagnostic pop
> > > >>>>>            _mm_store_si128((void *)buf, xmm0);
> > > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> sizeof(int32_t)),
> > > >>>> buf[0]);
> > > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> sizeof(int32_t)),
> > > >>>> buf[1]);
> > > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> sizeof(int32_t)),
> > > >>>> buf[2]);
> > > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 3 *
> sizeof(int32_t)),
> > > >>>> buf[3]);
> > > >>>>>            srca = RTE_PTR_ADD(srca, sizeof(__m128i));
> > > >>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
> > > >>>>>        }
> > > >>>>>
> > > >>>>>        /* Copy last, non-__m128i aligned, part of source
> data. */
> > > >>>>>        if (RTE_PTR_DIFF(srca, src) != 4) {
> > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > const
> > > >>>> pointer as parameter. */
> > > >>>>> #pragma GCC diagnostic push
> > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > >>>>>            xmm0 = _mm_stream_load_si128(srca);
> > > >>>>>            _mm_store_si128((void *)buf, xmm0);
> > > >>>>> #pragma GCC diagnostic pop
> > > >>>>>            switch (offset) {
> > > >>>>>                case 1:
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > >>>> sizeof(int32_t)), buf[0]);
> > > >>>>>                    break;
> > > >>>>>                case 2:
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > >>>> sizeof(int32_t)), buf[0]);
> > > >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1
> *
> > > >>>> sizeof(int32_t))) return;
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > > >>>> sizeof(int32_t)), buf[1]);
> > > >>>>>                    break;
> > > >>>>>                case 3:
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > >>>> sizeof(int32_t)), buf[0]);
> > > >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1
> *
> > > >>>> sizeof(int32_t))) return;
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > > >>>> sizeof(int32_t)), buf[1]);
> > > >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 2
> *
> > > >>>> sizeof(int32_t))) return;
> > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> > > >>>> sizeof(int32_t)), buf[2]);
> > > >>>>>                    break;
> > > >>>>>            }
> > > >>>>>        }
> > > >>>>> }
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > >


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 16:05               ` Stephen Hemminger
@ 2022-07-29 17:29                 ` Morten Brørup
  2022-08-07 20:40                 ` Mattias Rönnblom
  1 sibling, 0 replies; 57+ messages in thread
From: Morten Brørup @ 2022-07-29 17:29 UTC (permalink / raw)
  To: Stephen Hemminger, Konstantin Ananyev
  Cc: Konstantin Ananyev, dev, Bruce Richardson, Jan Viktorin,
	Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, 29 July 2022 18.06
> 
> On Fri, 29 Jul 2022 12:13:52 +0000
> Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> 
> > Sorry, missed that part.
> >
> > >
> > > > Another question - who will do 'sfence' after the copying?
> > > > Would it be inside memcpy_nt (seems quite costly), or would
> > > > it be another API function for that: memcpy_nt_flush() or so?
> > >
> > > Outside. Only the developer knows when it is required, so it
> wouldn't make any sense to add the cost inside memcpy_nt().
> > >
> > > I don't think we should add a flush function; it would just be
> another name for an already existing function. Referring to the
> required
> > > operation in the memcpy_nt() function documentation should suffice.
> > >
> >
> > Ok, but again wouldn't it be arch specific?
> > AFAIK for x86 it needs to boil down to sfence, for other
> architectures - I don't know.
> > If you think there already is some generic one (rte_wmb?) that would
> always produce
> > correct instructions - sure let's use it.
> >
> >
> 
> It makes sense in a few select places to use non-temporal copy.
> But it would add unnecessary complexity to DPDK if every function in
> DPDK that could
> cause a copy had a non-temporal variant.

Agree.

Packet capturing is one of those few places where it makes sense - the improvement scales with the number of packets, not just with the number of packet bursts.

> 
> Maybe just having rte_memcpy have a threshold (config value?) that if
> copy is larger than
> a certain size, then it would automatically be non-temporal.  Small
> copies wouldn't matter,
> the optimization is more about not stopping cache size issues with
> large streams of data.

Small copies matter too, if there are many of them. As shown in my previous response, a burst of 32 packets will save 6.25 % of a 64 KB L1 data cache, when copying 64 byte or less from each packet. The saving is per packet, so it quickly adds up.

Copying a burst of 32 1518 byte packets trashes 2 * 32 * 1536 byte = 96 KB of data cache, i.e. more than the entire L1 cache.

The threshold in glibc's memcpy() is much higher than 1536 byte. I don't think it will be possible to find a good threshold that works 99 % of the time. So we have to let the application developer make the choice.
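To illustrate, the threshold approach would boil down to something like this (the function name and the config value are invented here), and I don't see how to pick a RTE_MEMCPY_NT_THRESHOLD that is right for everyone:

static inline void
rte_memcpy_auto(void *dst, const void *src, size_t len)
{
	if (len >= RTE_MEMCPY_NT_THRESHOLD)	/* hypothetical config value */
		rte_memcpy_nt(dst, src, len);	/* the proposed NT copy */
	else
		rte_memcpy(dst, src, len);
}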


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 12:13             ` Konstantin Ananyev
  2022-07-29 16:05               ` Stephen Hemminger
@ 2022-07-29 18:13               ` Morten Brørup
  2022-07-29 19:49                 ` Konstantin Ananyev
  1 sibling, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-07-29 18:13 UTC (permalink / raw)
  To: Konstantin Ananyev, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Friday, 29 July 2022 14.14
> 
> 
> Sorry, missed that part.
> 
> >
> > > Another question - who will do 'sfence' after the copying?
> > > Would it be inside memcpy_nt (seems quite costly), or would
> > > it be another API function for that: memcpy_nt_flush() or so?
> >
> > Outside. Only the developer knows when it is required, so it wouldn't
> make any sense to add the cost inside memcpy_nt().
> >
> > I don't think we should add a flush function; it would just be
> another name for an already existing function. Referring to the
> required
> > operation in the memcpy_nt() function documentation should suffice.
> >
> 
> Ok, but again wouldn't it be arch specific?
> AFAIK for x86 it needs to boil down to sfence, for other architectures
> - I don't know.
> If you think there already is some generic one (rte_wmb?) that would
> always produce
> correct instructions - sure let's use it.
> 

DPDK has generic functions to wrap architecture specific stuff like memory barriers.

Because they are non-temporal stores, I suspect that rte_mb() is required before reading the data from the location it was copied to. Ensuring that STORE operations are ordered (rte_wmb) might not suffice. However, I'm not a CPU expert, so I will seek advice from more qualified people in the community on this.
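To make the question concrete, the producer side usage I have in mind is roughly this (a sketch only; rte_memcpy_nt() is the proposed function, the ready flag is just for illustration):

static inline void
capture_one(void *cap_buf, const void *pkt_data, size_t len,
		volatile uint32_t *ready)
{
	rte_memcpy_nt(cap_buf, pkt_data, len);	/* non-temporal stores */
	rte_wmb();	/* is this enough, or is rte_mb() needed here? */
	*ready = 1;	/* publish to the consumer */
}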


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 18:13               ` Morten Brørup
@ 2022-07-29 19:49                 ` Konstantin Ananyev
  2022-07-29 20:26                   ` Morten Brørup
  0 siblings, 1 reply; 57+ messages in thread
From: Konstantin Ananyev @ 2022-07-29 19:49 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach



> 
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Friday, 29 July 2022 14.14
> >
> >
> > Sorry, missed that part.
> >
> > >
> > > > Another question - who will do 'sfence' after the copying?
> > > > Would it be inside memcpy_nt (seems quite costly), or would
> > > > it be another API function for that: memcpy_nt_flush() or so?
> > >
> > > Outside. Only the developer knows when it is required, so it wouldn't
> > make any sense to add the cost inside memcpy_nt().
> > >
> > > I don't think we should add a flush function; it would just be
> > another name for an already existing function. Referring to the
> > required
> > > operation in the memcpy_nt() function documentation should suffice.
> > >
> >
> > Ok, but again wouldn't it be arch specific?
> > AFAIK for x86 it needs to boil down to sfence, for other architectures
> > - I don't know.
> > If you think there already is some generic one (rte_wmb?) that would
> > always produce
> > correct instructions - sure let's use it.
> >
> 
> DPDK has generic functions to wrap architecture specific stuff like memory barriers.
> 
> Because they are non-temporal stores, I suspect that rte_mb() is required before reading the data from the location it was copied to.
> Ensuring that STORE operations are ordered (rte_wmb) might not suffice. However, I'm not a CPU expert, so I will seek advice from
> more qualified people in the community on this.

I think for IA an sfence is enough, see the citation below;
for other architectures - no idea.
What I am trying to say is that it needs to be the *same* function on all archs we support. 

IA SW optimization manual:
9.4.2 Streaming Store Usage Models
The two primary usage domains for streaming store are coherent requests and non-coherent requests.
9.4.2.1 Coherent Requests
Coherent requests are normal loads and stores to system memory, which may also hit cache lines
present in another processor in a multiprocessor environment. With coherent requests, a streaming store
can be used in the same way as a regular store that has been mapped with a WC memory type (PAT or
MTRR). An SFENCE instruction must be used within a producer-consumer usage model in order to ensure
coherency and visibility of data between processors.
Within a single-processor system, the CPU can also re-read the same memory location and be assured of
coherence (that is, a single, consistent view of this memory location). The same is true for a multiprocessor
(MP) system, assuming an accepted MP software producer-consumer synchronization policy is
employed.

 
 




^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 19:49                 ` Konstantin Ananyev
@ 2022-07-29 20:26                   ` Morten Brørup
  2022-07-29 21:34                     ` Konstantin Ananyev
                                       ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Morten Brørup @ 2022-07-29 20:26 UTC (permalink / raw)
  To: Konstantin Ananyev, Konstantin Ananyev, dev, Bruce Richardson,
	Honnappa Nagarahalli
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

+TO: @Honnappa, we need input from ARM

> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Friday, 29 July 2022 21.49
> >
> > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > Sent: Friday, 29 July 2022 14.14
> > >
> > >
> > > Sorry, missed that part.
> > >
> > > >
> > > > > Another question - who will do 'sfence' after the copying?
> > > > > Would it be inside memcpy_nt (seems quite costly), or would
> > > > > it be another API function for that: memcpy_nt_flush() or so?
> > > >
> > > > Outside. Only the developer knows when it is required, so it
> wouldn't
> > > make any sense to add the cost inside memcpy_nt().
> > > >
> > > > I don't think we should add a flush function; it would just be
> > > another name for an already existing function. Referring to the
> > > required
> > > > operation in the memcpy_nt() function documentation should
> suffice.
> > > >
> > >
> > > Ok, but again wouldn't it be arch specific?
> > > AFAIK for x86 it needs to boil down to sfence, for other
> architectures
> > > - I don't know.
> > > If you think there already is some generic one (rte_wmb?) that
> would
> > > always produce
> > > correct instructions - sure let's use it.
> > >
> >
> > DPDK has generic functions to wrap architecture specific stuff like
> memory barriers.
> >
> > Because they are non-temporal stores, I suspect that rte_mb() is
> required before reading the data from the location it was copied to.
> > Ensuring that STORE operations are ordered (rte_wmb) might not
> suffice. However, I'm not a CPU expert, so I will seek advice from
> > more qualified people in the community on this.
> 
> I think for IA sfence is enough, see citation below,
> for other architectures - no idea.
> What I am trying to say - it needs to be the *same* function on all
> archs we support.

Now I get it: rte_wmb() might be appropriate on x86, but if any other architecture requires something else, we should add a new common function for flushing, e.g. rte_memcpy_nt_flush().
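Roughly along these lines (purely a sketch of the idea; the name is not final, and whether rte_wmb() suffices on non-x86 architectures is exactly the open question):

__rte_experimental
static __rte_always_inline void
rte_memcpy_nt_flush(void)
{
#if defined(RTE_ARCH_X86)
	_mm_sfence();	/* drain the WC buffers, order the NT stores */
#else
	rte_wmb();	/* assumption - to be confirmed per architecture */
#endif
}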

> 
> IA SW optimization manual:
> 9.4.2 Streaming Store Usage Models
> The two primary usage domains for streaming store are coherent requests
> and non-coherent requests.
> 9.4.2.1 Coherent Requests
> Coherent requests are normal loads and stores to system memory, which
> may also hit cache lines
> present in another processor in a multiprocessor environment. With
> coherent requests, a streaming store
> can be used in the same way as a regular store that has been mapped
> with a WC memory type (PAT or
> MTRR). An SFENCE instruction must be used within a producer-consumer
> usage model in order to ensure
> coherency and visibility of data between processors.
> Within a single-processor system, the CPU can also re-read the same
> memory location and be assured of
> coherence (that is, a single, consistent view of this memory location).
> The same is true for a multiprocessor
> (MP) system, assuming an accepted MP software producer-consumer
> synchronization policy is
> employed.
> 

With this reference, I am convinced that you are right about the SFENCE. This puts a checkmark on this item on my TODO list for the patch. Thank you, Konstantin!

Any ARM CPU experts on the mailing list seeing this, not on vacation? @Honnappa, I'm looking at you. :-)

Summing up, the question is:

After a bunch of *non-temporal* stores (STNP instruction) on ARM architecture, does calling rte_wmb() suffice to ensure the data is visible across the system?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 20:26                   ` Morten Brørup
@ 2022-07-29 21:34                     ` Konstantin Ananyev
  2022-08-07 20:20                     ` Mattias Rönnblom
  2022-08-10 21:05                     ` Honnappa Nagarahalli
  2 siblings, 0 replies; 57+ messages in thread
From: Konstantin Ananyev @ 2022-07-29 21:34 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson,
	Honnappa Nagarahalli
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach


> +TO: @Honnappa, we need input from ARM
> 
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Friday, 29 July 2022 21.49
> > >
> > > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > > Sent: Friday, 29 July 2022 14.14
> > > >
> > > >
> > > > Sorry, missed that part.
> > > >
> > > > >
> > > > > > Another question - who will do 'sfence' after the copying?
> > > > > > Would it be inside memcpy_nt (seems quite costly), or would
> > > > > > it be another API function for that: memcpy_nt_flush() or so?
> > > > >
> > > > > Outside. Only the developer knows when it is required, so it
> > wouldn't
> > > > make any sense to add the cost inside memcpy_nt().
> > > > >
> > > > > I don't think we should add a flush function; it would just be
> > > > another name for an already existing function. Referring to the
> > > > required
> > > > > operation in the memcpy_nt() function documentation should
> > suffice.
> > > > >
> > > >
> > > > Ok, but again wouldn't it be arch specific?
> > > > AFAIK for x86 it needs to boil down to sfence, for other
> > architectures
> > > > - I don't know.
> > > > If you think there already is some generic one (rte_wmb?) that
> > would
> > > > always produce
> > > > correct instructions - sure let's use it.
> > > >
> > >
> > > DPDK has generic functions to wrap architecture specific stuff like
> > memory barriers.
> > >
> > > Because they are non-temporal stores, I suspect that rte_mb() is
> > required before reading the data from the location it was copied to.
> > > Ensuring that STORE operations are ordered (rte_wmb) might not
> > suffice. However, I'm not a CPU expert, so I will seek advice from
> > > more qualified people in the community on this.
> >
> > I think for IA sfence is enough, see citation below,
> > for other architectures - no idea.
> > What I am trying to say - it needs to be the *same* function on all
> > archs we support.
> 
> Now I get it: rte_wmb() might be appropriate on x86, but if any other architecture requires something else, we should add a new
> common function for flushing, e.g. rte_memcpy_nt_flush().

Yep, that was my thought.
 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 17:17               ` Morten Brørup
@ 2022-07-29 22:00                 ` Konstantin Ananyev
  2022-07-30  9:51                   ` Morten Brørup
  0 siblings, 1 reply; 57+ messages in thread
From: Konstantin Ananyev @ 2022-07-29 22:00 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach



> > > > >>>> Hi Morten,
> > > > >>>>
> > > > >>>>> This RFC proposes a set of functions optimized for non-
> > temporal
> > > > >>>> memory copy.
> > > > >>>>>
> > > > >>>>> At this stage, I am asking for feedback on the concept.
> > > > >>>>>
> > > > >>>>> Applications sometimes data to another memory location, which
> > is
> > > > >> only
> > > > >>>> used
> > > > >>>>> much later.
> > > > >>>>> In this case, it is inefficient to pollute the data cache
> > with
> > > > the
> > > > >>>> copied
> > > > >>>>> data.
> > > > >>>>>
> > > > >>>>> An example use case (originating from a real life
> > application):
> > > > >>>>> Copying filtered packets, or the first part of them, into a
> > > > capture
> > > > >>>> buffer
> > > > >>>>> for offline analysis.
> > > > >>>>>
> > > > >>>>> The purpose of these functions is to achieve a performance
> > gain
> > > > by
> > > > >>>> not
> > > > >>>>> polluting the cache when copying data.
> > > > >>>>> Although the throughput may be improved by further
> > optimization,
> > > > I
> > > > >> do
> > > > >>>> not
> > > > >>>>> consider througput optimization relevant initially.
> > > > >>>>>
> > > > >>>>> The x86 non-temporal load instructions have 16 byte alignment
> > > > >>>>> requirements [1], while ARM non-temporal load instructions
> > are
> > > > >>>> available with
> > > > >>>>> 4 byte alignment requirements [2].
> > > > >>>>> Both platforms offer non-temporal store instructions with 4
> > byte
> > > > >>>> alignment
> > > > >>>>> requirements.
> > > > >>>>>
> > > > >>>>> In addition to the primary function without any alignment
> > > > >>>> requirements, we
> > > > >>>>> also provide functions for respectivly 16 and 4 byte aligned
> > > > access
> > > > >>>> for
> > > > >>>>> performance purposes.
> > > > >>>>>
> > > > >>>>> The function names resemble standard C library function
> > names,
> > > > but
> > > > >>>> their
> > > > >>>>> signatures are intentionally different. No need to drag
> > legacy
> > > > into
> > > > >>>> it.
> > > > >>>>>
> > > > >>>>> NB: Don't comment on spaces for indentation; a patch will
> > follow
> > > > >> DPDK
> > > > >>>> coding
> > > > >>>>> style and use TAB.
> > > > >>>>
> > > > >>>>
> > > > >>>> I think there were discussions in other direction - remove
> > > > >> rte_memcpy()
> > > > >>>> completely and use memcpy() instead...
> > > > >>>
> > > > >>> Yes, the highly optimized rte_memcpy() implementation of
> > memcpy()
> > > > has
> > > > >> become obsolete, now that modern compilers provide an efficient
> > > > >> memcpy() implementation.
> > > > >>>
> > > > >>> It's an excellent reference, because we should learn from it,
> > and
> > > > >> avoid introducing similar mistakes with non-temporal memcpy.
> > > > >>>
> > > > >>>> But if we have a good use case for that, then I am positive in
> > > > >>>> principle.
> > > > >>>
> > > > >>> The standard C library doesn't offer non-temporal memcpy(), so
> > we
> > > > >> need to implement it ourselves.
> > > > >>>
> > > > >>>> Though I think we need a clear use-case within dpdk for it
> > > > >>>> to demonstrate perfomance gain.
> > > > >>>
> > > > >>> The performance gain is to avoid polluting the data cache. DPDK
> > > > >> example applications, like l3fwd, are probably too primitive to
> > > > measure
> > > > >> any benefit in this regard.
> > > > >>>
> > > > >>>> Probably copying packets within pdump lib, or examples/dma. or
> > ...
> > > > >>>
> > > > >>> Good point - the new functions should be used somewhere within
> > > > DPDK.
> > > > >> For this purpose, I will look into modifying rte_pktmbuf_copy(),
> > > > which
> > > > >> is used by pdump_copy(), to use non-temporal copying of the
> > packet
> > > > >> data.
> > > > >>>
> > > > >>>> Another thought - do we really need a separate inline function
> > for
> > > > >> each
> > > > >>>> flavour?
> > > > >>>> Might be just one non-inline rte_memcpy_nt(dst, src, size,
> > flags),
> > > > >>>> where flags could be combination of NT_SRC, NT_DST, and keep
> > > > >> alignment
> > > > >>>> detection/decisions to particular implementation?
> > > > >>>
> > > > >>> Thank you for the feedback, Konstantin.
> > > > >>>
> > > > >>> My answer to this suggestion gets a little longwinded...
> > > > >>>
> > > > >>> Looking at the DPDK pcapng library, it copies a 4 byte aligned
> > > > >> metadata structure sized 28 byte. So it can do with 4 byte
> > aligned
> > > > >> functions.
> > > > >>>
> > > > >>> Our application can capture packets starting at the IP header,
> > > > which
> > > > >> is offset by 14 byte (Ethernet header size) from the packet
> > buffer,
> > > > so
> > > > >> it requires 2 byte alignment. And thus, requiring 4 byte
> > alignment
> > > > is
> > > > >> not acceptable.
> > > > >>>
> > > > >>> Our application uses 16 byte alignment in the capture buffer
> > area,
> > > > >> and can benefit from 16 byte aligned functions. Furthermore, x86
> > > > >> processors require 16 byte alignment for non-temporal load
> > > > >> instructions, so I think a 16 byte aligned non-temporal memcpy
> > > > function
> > > > >> should be offered.
> > > > >>
> > > > >>
> > > > >> Yes, x86 needs 16B alignment for NT load/stores
> > > > >> But that's supposed to be arch specific limitation,
> > > > >> that we probably want to hide, no?
> > > > >
> > > > > Agree.
> > > > >
> > > > >> Inside the function can check alignment of both src and dst
> > > > >> and decide should it use NT load/store instructions or just
> > > > >> do normal copy.
> > > > >
> > > > > Yes, I'm experimenting with the x86 inline function shown below.
> > And
> > > > hopefully, with some "extern inline" or other magic, I can hide the
> > > > different implementations in the arch specific headers, and only
> > expose
> > > > the function declaration of rte_memcpy_nt() in the common header.
> > > > >
> > > > > I'm currently working on the x86 implementation - when I'm
> > satisfied
> > > > with that, I'll look into how to hide the implementations in the
> > arch
> > > > specific header files, and only expose the common function
> > declaration
> > > > in the generic header file also used for documentation. I works for
> > > > rte_memcpy(), so I can probably find the way to do it there.
> > > > >
> > > > > /*
> > > > >   * Non-Temporal Memory Operations Flags.
> > > > >   */
> > > > >
> > > > > #define RTE_MEMOPS_F_LENA_MASK  (UINT64_C(0xFE) << 0)   /**
> > Length
> > > > alignment mask. */
> > > > > #define RTE_MEMOPS_F_LEN2A      (UINT64_C(2) << 0)      /**
> > Length is
> > > > 2 byte aligned. */
> > > > > #define RTE_MEMOPS_F_LEN4A      (UINT64_C(4) << 0)      /**
> > Length is
> > > > 4 byte aligned. */
> > > > > #define RTE_MEMOPS_F_LEN8A      (UINT64_C(8) << 0)      /**
> > Length is
> > > > 8 byte aligned. */
> > > > > #define RTE_MEMOPS_F_LEN16A     (UINT64_C(16) << 0)     /**
> > Length is
> > > > 16 byte aligned. */
> > > > > #define RTE_MEMOPS_F_LEN32A     (UINT64_C(32) << 0)     /**
> > Length is
> > > > 32 byte aligned. */
> > > > > #define RTE_MEMOPS_F_LEN64A     (UINT64_C(64) << 0)     /**
> > Length is
> > > > 64 byte aligned. */
> > > > > #define RTE_MEMOPS_F_LEN128A    (UINT64_C(128) << 0)    /**
> > Length is
> > > > 128 byte aligned. */
> > > > >
> > > > > #define RTE_MEMOPS_F_DSTA_MASK  (UINT64_C(0xFE) << 8)   /**
> > > > Destination address alignment mask. */
> > > > > #define RTE_MEMOPS_F_DST2A      (UINT64_C(2) << 8)      /**
> > > > Destination address is 2 byte aligned. */
> > > > > #define RTE_MEMOPS_F_DST4A      (UINT64_C(4) << 8)      /**
> > > > Destination address is 4 byte aligned. */
> > > > > #define RTE_MEMOPS_F_DST8A      (UINT64_C(8) << 8)      /**
> > > > Destination address is 8 byte aligned. */
> > > > > #define RTE_MEMOPS_F_DST16A     (UINT64_C(16) << 8)     /**
> > > > Destination address is 16 byte aligned. */
> > > > > #define RTE_MEMOPS_F_DST32A     (UINT64_C(32) << 8)     /**
> > > > Destination address is 32 byte aligned. */
> > > > > #define RTE_MEMOPS_F_DST64A     (UINT64_C(64) << 8)     /**
> > > > Destination address is 64 byte aligned. */
> > > > > #define RTE_MEMOPS_F_DST128A    (UINT64_C(128) << 8)    /**
> > > > Destination address is 128 byte aligned. */
> > > > >
> > > > > #define RTE_MEMOPS_F_SRCA_MASK  (UINT64_C(0xFE) << 16)  /**
> > Source
> > > > address alignment mask. */
> > > > > #define RTE_MEMOPS_F_SRC2A      (UINT64_C(2) << 16)     /**
> > Source
> > > > address is 2 byte aligned. */
> > > > > #define RTE_MEMOPS_F_SRC4A      (UINT64_C(4) << 16)     /**
> > Source
> > > > address is 4 byte aligned. */
> > > > > #define RTE_MEMOPS_F_SRC8A      (UINT64_C(8) << 16)     /**
> > Source
> > > > address is 8 byte aligned. */
> > > > > #define RTE_MEMOPS_F_SRC16A     (UINT64_C(16) << 16)    /**
> > Source
> > > > address is 16 byte aligned. */
> > > > > #define RTE_MEMOPS_F_SRC32A     (UINT64_C(32) << 16)    /**
> > Source
> > > > address is 32 byte aligned. */
> > > > > #define RTE_MEMOPS_F_SRC64A     (UINT64_C(64) << 16)    /**
> > Source
> > > > address is 64 byte aligned. */
> > > > > #define RTE_MEMOPS_F_SRC128A    (UINT64_C(128) << 16)   /**
> > Source
> > > > address is 128 byte aligned. */
> > > > >
> > > > > /**
> > > > >   * @warning
> > > > >   * @b EXPERIMENTAL: this API may change without prior notice.
> > > > >   *
> > > > >   * Non-temporal memory copy.
> > > > >   * The memory areas must not overlap.
> > > > >   *
> > > > >   * @note
> > > > >   * If the destination and/or length is unaligned, some copied
> > bytes
> > > > will be
> > > > >   * stored in the destination memory area using temporal access.
> > > > >   *
> > > > >   * @param dst
> > > > >   *   Pointer to the non-temporal destination memory area.
> > > > >   * @param src
> > > > >   *   Pointer to the non-temporal source memory area.
> > > > >   * @param len
> > > > >   *   Number of bytes to copy.
> > > > >   * @param flags
> > > > >   *   Hints for memory access.
> > > > >   *   Any of the RTE_MEMOPS_F_LENnA, RTE_MEMOPS_F_DSTnA,
> > > > RTE_MEMOPS_F_SRCnA flags.
> > > > >   */
> > > > > __rte_experimental
> > > > > static __rte_always_inline
> > > > > __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
> > > > __access__(read_only, 2, 3)))
> > > > > void rte_memcpy_nt(void * __rte_restrict dst, const void *
> > > > __rte_restrict src, size_t len,
> > > > >          const uint64_t flags)
> > > > > {
> > > > >      if (__builtin_constant_p(flags) ?
> > > > >              ((flags & RTE_MEMOPS_F_LENA_MASK) >=
> > RTE_MEMOPS_F_LEN16A
> > > > &&
> > > > >              (flags & RTE_MEMOPS_F_DSTA_MASK) >=
> > RTE_MEMOPS_F_DST16A)
> > > > :
> > > > >              !(((uintptr_t)dst | len) & (16 - 1))) {
> > > > >          if (__builtin_constant_p(flags) ?
> > > > >                  (flags & RTE_MEMOPS_F_SRCA_MASK) >=
> > > > RTE_MEMOPS_F_SRC16A :
> > > > >                  !((uintptr_t)src & (16 - 1)))
> > > > >              rte_memcpy_nt16a(dst, src, len/*, flags*/);
> > > > >          else
> > > > >              rte_memcpy_nt16dla(dst, src, len/*, flags*/);
> > > > >      }
> > > > >      else if (__builtin_constant_p(flags) ? (
> > > > >              (flags & RTE_MEMOPS_F_LENA_MASK) >=
> > RTE_MEMOPS_F_LEN4A
> > > > &&
> > > > >              (flags & RTE_MEMOPS_F_DSTA_MASK) >=
> > RTE_MEMOPS_F_DST4A
> > > > &&
> > > > >              (flags & RTE_MEMOPS_F_SRCA_MASK) >=
> > RTE_MEMOPS_F_SRC4A)
> > > > :
> > > > >              !(((uintptr_t)dst | (uintptr_t)src | len) & (4 -
> > 1))) {
> > > > >          rte_memcpy_nt4a(dst, src, len/*, flags*/);
> > > > >      }
> > > > >      else
> > > > >          rte_memcpy_nt_unaligned(dst, src, len/*, flags*/);
> > > > > }
> > > >
> > > >
> > > > Do we really need to expose all these dozen flags?
> > > > My thought at least about x86 implementaion was about something
> > more
> > > > simple like:
> > > > void rte_memcpy_nt(void * __rte_restrict dst,
> > > > 	const void * __rte_restrict src, size_t len,
> > > > 	const uint64_t flags)
> > > > {
> > > >
> > > > 	if (flags == (SRC_NT | DST_NT) && ((dst | src) & 0xf) == 0) {
> > > > 		_do_nt_src_nt_dst_nt(...);
> > > > 	} else if (flags == DST_NT && (dst & 0xf) == 0) {
> > > > 		_do_src_na_dst_nt(...);
> > > > 	} else if (flags == SRC_NT && (src & 0xf) == 0) {
> > > > 		_do_src_nt_dst_na(...);
> > > > 	} else
> > > > 		memcpy(dst, src, len);
> > > > }
> > >
> > > The combination of flags, inline and __builtin_constant_p() allows
> > the compiler to produce zero-overhead code. Without it, the
> > > resulting code will contain a bunch of run-time bitmask comparisons
> > and branches to determine the ultimate copy function.
> >
> > I think it is unavoidable, unless your intention for this function to
> > trust flags without checking actual addresses.
> 
> The intention is to trust the flags. Fast path functions should trust that parameters passed are valid and conform to specified
> requirements.
> 
> We can always throw in a few RTE_ASSERTs - they are omitted unless compiled for debug, and thus have zero cost in production.
> 
> >
> >  On x86
> > > there are not only 16 byte, but also 4 byte alignment variants of
> > non-temporal store. The beauty of it will be more obvious when the
> > > patch is ready.
> >
> > Ok, I will just wait for final version then :)
> >
> > >
> > > And in my current working version (not the code provided here), the
> > flags are hints, so using them will be optional. The function
> > > headers will look roughly like this:
> > >
> > > static inline void rte_memcpy_nt_ex(
> > > 		void * dst, const void * src,
> > > 		size_t len, uint64_t flags);
> > >
> > > static inline void rte_memcpy_nt(
> > > 		void * dst, const void * src,
> > > 		size_t len)
> > > {
> > > 	rte_memcpy_nt_ex(dst, src, len, 0);
> > > }
> > >
> > > I might add an _ex postfix variant of the mbuf packet non-temporal
> > copy function too, but I'm not working on that function yet, so I
> > > don't yet know if it makes sense or not.
> > >
> > > My concept for build time alignment hints can also be copied into an
> > rte_memcpy_ex() function for improved performance. But I
> > > don't want my patch to expand too much outside its initial scope, so
> > I will not modify rte_memcpy() with this patch.
> > >
> > > Alternatively, I could provide rte_memcpy_ex(d,s,l,flags) instead of
> > rte_memcpy_nt[_ex](), and use the flags to indicate non-
> > > temporal source and destination.
> > >
> > > This is only a question about which API the community prefers. I will
> > not change the implementation of rte_memcpy() with this patch -
> > > it's another job to do that.
> > >
> > > >
> > > > >
> > > > >
> > > > >>
> > > > >>
> > > > >>> While working on these funtions, I experimented with an
> > > > >> rte_memcpy_nt() taking flags, which is also my personal
> > preference,
> > > > but
> > > > >> haven't succeed yet. Especially when copying a 16 byte aligned
> > > > >> structure of only 16 byte, the overhead of the function call +
> > > > >> comparing the flags + the copy loop overhead is significant,
> > > > compared
> > > > >> to inline code consisting of only one pair of "movntdqa
> > > > (%rsi),%xmm0;
> > > > >> movntdq %xmm0,(%rdi)" instructions.
> > > > >>>
> > > > >>> Remember that a non-inlined rte_memcpy_nt() will be called with
> > > > very
> > > > >> varying size, due to the typical mix of small and big packets,
> > so
> > > > >> branch prediction will not help.
> > > > >>>
> > > > >>> This RFC does not yet show the rte_memcpy_nt() function
> > handling
> > > > >> unaligned load/store, but it is more complex than the aligned
> > > > >> functions. So I think the aligned variants are warranted - for
> > > > >> performance reasons.
> > > > >>>
> > > > >>> Some of the need for exposing individual functions for
> > different
> > > > >> alignment stems from the compiler being unable to determine the
> > > > >> alignment of the source and destination pointers at build time.
> > So
> > > > we
> > > > >> need to help the compiler with this at build time, and thus the
> > need
> > > > >> for inlining the function. If we expose a bunch of small inline
> > > > >> functions or a big inline function with flags seems to be a
> > matter
> > > > of
> > > > >> taste.
> > > > >>>
> > > > >>> Thinking about it, you are probably right that exposing a
> > single
> > > > >> function with flags is better for documentation purposes and
> > easier
> > > > for
> > > > >> other architectures to implement. But it still needs to be
> > inline,
> > > > for
> > > > >> the reasons described above.
> > > > >>
> > > > >>
> > > > >> Ok, my initial thought was that main use-case for it would be
> > > > copying
> > > > >> of
> > > > >> big chunks of data, but from your description it might not be
> > the
> > > > case.
> > > > >
> > > > > This is for quickly copying relatively small pieces of data
> > > > synchronously without polluting the CPUs data cache, e.g. just
> > before
> > > > passing on a packet to an Ethernet PMD for transmission.
> > > > >
> > > > > Big chunks of data should be copied asynchronously by DMA.
> > > > >
> > > > >> Yes, for just 16/32B copy function call overhead might be way
> > too
> > > > >> high...
> > > > >> As another alternative - would memcpy_nt_bulk() help somehow?
> > > > >> It can do copying for the several src/dst pairs at once and
> > > > >> that might help to amortize cost of function call.
> > > > >
> > > > > In many cases, memcpy_nt() will replace memcpy() inside loops, so
> > it
> > > > should be just as easy to use as memcpy(). E.g. look at
> > > > rte_pktmbuf_copy()... Building a memcopy array to pass to
> > > > memcpy_nt_bulk() from rte_pktmbuf_copy() would require a
> > significant
> > > > rewrite of rte_pktmbuf_copy(), compared to just replacing
> > rte_memcpy()
> > > > with rte_memcpy_nt(). And this is just one function using memcpy().
> > > >
> > > > Actually, one question I have for such small data-transfer
> > > > (16B per packet) - do you still see some noticable perfomance
> > > > improvement for such scenario?
> > >
> > > Copying 16 byte from each packet in a burst of 32 packets would
> > otherwise pollute 64 cache lines = 4 KB cache. With typically 64 KB L1
> > > cache, I think it makes a difference.
> >
> > I understand the intention behind, my question was - it is really
> > measurable?
> > Something like: using pktmbuf_copy_nt(len=16) over using
> > pktmbuf_copy(len=16)
> > on workload X gives Y% thoughtput improvement?
> 
> If the application is complex enough, and needs some of those 4 KB cache otherwise wasted, there will be a significant throughput
> improvement; otherwise probably not.
> 
> I have a general problem with this type of question: I hate that throughput is the only KPI (Key Performance Indicator) getting any
> attention on the mailing list! Other KPIs, such as latency and resource conservation, are just as important in many real life use cases.

Well, I suppose that is the sort of question to expect for a patch that introduces a performance optimization:
what is the benefit we expect to get and is it worth the effort?
Throughput or latency improvement seems like an obvious choice here.
About resource conservation - if the patch aims to improve cache consumption, then on some cache-bound
workloads it should result in throughput improvement, correct? 

> 
> Here's a number for you: 6.25 % reduction in L1 data cache consumption. (Assuming 64 KB L1 cache with 64 byte cache lines and
> application burst length of 32 packets.)

I understand that it should reduce the cache eviction rate.
The thing is that non-temporal stores are not free either: they consume WC buffers and some memory-bus bandwidth.
AFAIK, for 16B non-consecutive NT stores it means that only 25 % of the WC buffer capacity will be used,
and in theory that might lead to extra memory pressure and worse performance in general.
In fact, IA manuals explicitly recommend avoiding partial cache-line writes whenever possible.
Now, I don't know which would be more expensive in that case: re-filling the extra cache lines,
or the extra partial-write memory transactions.
That's why I asked for some performance numbers here.
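Just to spell out where the 25 % comes from (assuming the usual line sized WC buffers): a WC buffer covers one 64 byte cache line, so a single 16 byte NT store fills only 16/64 = 25 % of it, and if the buffer is evicted before the rest of the line is written, the data goes out as a partial cache-line transaction.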
 
> >
> > >
> > > > Another question - who will do 'sfence' after the copying?
> > > > Would it be inside memcpy_nt (seems quite costly), or would
> > > > it be another API function for that: memcpy_nt_flush() or so?
> > >
> > > Outside. Only the developer knows when it is required, so it wouldn't
> > make any sense to add the cost inside memcpy_nt().
> > >
> > > I don't think we should add a flush function; it would just be
> > another name for an already existing function. Referring to the
> > required
> > > operation in the memcpy_nt() function documentation should suffice.
> > >
> > > >
> > > > >>
> > > > >>
> > > > >>>
> > > > >>>>
> > > > >>>>
> > > > >>>>> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-
> > > > >>>> guide/index.html#text=_mm_stream_load
> > > > >>>>> [2] https://developer.arm.com/documentation/100076/0100/A64-
> > > > >>>> Instruction-Set-Reference/A64-Floating-point-
> > Instructions/LDNP--
> > > > >> SIMD-
> > > > >>>> and-FP-
> > > > >>>>>
> > > > >>>>> V2:
> > > > >>>>> - Only copy from non-temporal source to non-temporal
> > destination.
> > > > >>>>>      I.e. remove the two variants with only source and/or
> > > > >> destination
> > > > >>>> being
> > > > >>>>>      non-temporal.
> > > > >>>>> - Do not require alignment.
> > > > >>>>>      Instead, offer additional 4 and 16 byte aligned
> > functions
> > > > for
> > > > >>>> performance
> > > > >>>>>      purposes.
> > > > >>>>> - Implemented two of the functions for x86.
> > > > >>>>> - Remove memset function.
> > > > >>>>>
> > > > >>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > >>>>> ---
> > > > >>>>>
> > > > >>>>> /**
> > > > >>>>>     * @warning
> > > > >>>>>     * @b EXPERIMENTAL: this API may change without prior
> > notice.
> > > > >>>>>     *
> > > > >>>>>     * Copy data from non-temporal source to non-temporal
> > > > >> destination.
> > > > >>>>>     *
> > > > >>>>>     * @param dst
> > > > >>>>>     *   Pointer to the non-temporal destination of the data.
> > > > >>>>>     *   Should be 4 byte aligned, for optimal performance.
> > > > >>>>>     * @param src
> > > > >>>>>     *   Pointer to the non-temporal source data.
> > > > >>>>>     *   No alignment requirements.
> > > > >>>>>     * @param len
> > > > >>>>>     *   Number of bytes to copy.
> > > > >>>>>     *   Should be be divisible by 4, for optimal performance.
> > > > >>>>>     */
> > > > >>>>> __rte_experimental
> > > > >>>>> static __rte_always_inline
> > > > >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1,
> > 3),
> > > > >>>> __access__(read_only, 2, 3)))
> > > > >>>>> void rte_memcpy_nt(void * __rte_restrict dst, const void *
> > > > >>>> __rte_restrict src, size_t len)
> > > > >>>>> /* Implementation T.B.D. */
> > > > >>>>>
> > > > >>>>> /**
> > > > >>>>>     * @warning
> > > > >>>>>     * @b EXPERIMENTAL: this API may change without prior
> > notice.
> > > > >>>>>     *
> > > > >>>>>     * Copy data in blocks of 16 byte from aligned non-
> > temporal
> > > > >> source
> > > > >>>>>     * to aligned non-temporal destination.
> > > > >>>>>     *
> > > > >>>>>     * @param dst
> > > > >>>>>     *   Pointer to the non-temporal destination of the data.
> > > > >>>>>     *   Must be 16 byte aligned.
> > > > >>>>>     * @param src
> > > > >>>>>     *   Pointer to the non-temporal source data.
> > > > >>>>>     *   Must be 16 byte aligned.
> > > > >>>>>     * @param len
> > > > >>>>>     *   Number of bytes to copy.
> > > > >>>>>     *   Must be divisible by 16.
> > > > >>>>>     */
> > > > >>>>> __rte_experimental
> > > > >>>>> static __rte_always_inline
> > > > >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1,
> > 3),
> > > > >>>> __access__(read_only, 2, 3)))
> > > > >>>>> void rte_memcpy_nt16a(void * __rte_restrict dst, const void *
> > > > >>>> __rte_restrict src, size_t len)
> > > > >>>>> {
> > > > >>>>>        const void * const  end = RTE_PTR_ADD(src, len);
> > > > >>>>>
> > > > >>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
> > > > >>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
> > > > >>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
> > > > >>>>>
> > > > >>>>>        /* Copy large portion of data. */
> > > > >>>>>        while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i))
> > {
> > > > >>>>>            register __m128i    xmm0, xmm1, xmm2, xmm3;
> > > > >>>>>
> > > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > > const
> > > > >>>> pointer as parameter. */
> > > > >>>>> #pragma GCC diagnostic push
> > > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > > >>>>>            xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 *
> > > > >>>> sizeof(__m128i)));
> > > > >>>>>            xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 *
> > > > >>>> sizeof(__m128i)));
> > > > >>>>>            xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 *
> > > > >>>> sizeof(__m128i)));
> > > > >>>>>            xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 *
> > > > >>>> sizeof(__m128i)));
> > > > >>>>> #pragma GCC diagnostic pop
> > > > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 0 *
> > > > sizeof(__m128i)),
> > > > >>>> xmm0);
> > > > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 1 *
> > > > sizeof(__m128i)),
> > > > >>>> xmm1);
> > > > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 2 *
> > > > sizeof(__m128i)),
> > > > >>>> xmm2);
> > > > >>>>>            _mm_stream_si128(RTE_PTR_ADD(dst, 3 *
> > > > sizeof(__m128i)),
> > > > >>>> xmm3);
> > > > >>>>>            src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
> > > > >>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
> > > > >>>>>        }
> > > > >>>>>
> > > > >>>>>        /* Copy remaining data. */
> > > > >>>>>        while (src != end) {
> > > > >>>>>            register __m128i    xmm;
> > > > >>>>>
> > > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > > const
> > > > >>>> pointer as parameter. */
> > > > >>>>> #pragma GCC diagnostic push
> > > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > > >>>>>            xmm = _mm_stream_load_si128(src);
> > > > >>>>> #pragma GCC diagnostic pop
> > > > >>>>>            _mm_stream_si128(dst, xmm);
> > > > >>>>>            src = RTE_PTR_ADD(src, sizeof(__m128i));
> > > > >>>>>            dst = RTE_PTR_ADD(dst, sizeof(__m128i));
> > > > >>>>>        }
> > > > >>>>> }
> > > > >>>>>
> > > > >>>>> /**
> > > > >>>>>     * @warning
> > > > >>>>>     * @b EXPERIMENTAL: this API may change without prior
> > notice.
> > > > >>>>>     *
> > > > >>>>>     * Copy data in blocks of 4 byte from aligned non-temporal
> > > > source
> > > > >>>>>     * to aligned non-temporal destination.
> > > > >>>>>     *
> > > > >>>>>     * @param dst
> > > > >>>>>     *   Pointer to the non-temporal destination of the data.
> > > > >>>>>     *   Must be 4 byte aligned.
> > > > >>>>>     * @param src
> > > > >>>>>     *   Pointer to the non-temporal source data.
> > > > >>>>>     *   Must be 4 byte aligned.
> > > > >>>>>     * @param len
> > > > >>>>>     *   Number of bytes to copy.
> > > > >>>>>     *   Must be divisible by 4.
> > > > >>>>>     */
> > > > >>>>> __rte_experimental
> > > > >>>>> static __rte_always_inline
> > > > >>>>> __attribute__((__nonnull__(1, 2), __access__(write_only, 1,
> > 3),
> > > > >>>> __access__(read_only, 2, 3)))
> > > > >>>>> void rte_memcpy_nt4a(void * __rte_restrict dst, const void *
> > > > >>>> __rte_restrict src, size_t len)
> > > > >>>>> {
> > > > >>>>>        int32_t             buf[sizeof(__m128i) /
> > sizeof(int32_t)]
> > > > >>>> __rte_aligned(sizeof(__m128i));
> > > > >>>>>        /** Address of source data, rounded down to achieve
> > > > >> alignment.
> > > > >>>> */
> > > > >>>>>        const void *        srca = RTE_PTR_ALIGN_FLOOR(src,
> > > > >>>> sizeof(__m128i));
> > > > >>>>>        /** Address of end of source data, rounded down to
> > achieve
> > > > >>>> alignment. */
> > > > >>>>>        const void * const  srcenda =
> > > > >>>> RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
> > > > >>>>>        const int           offset =  RTE_PTR_DIFF(src, srca)
> > /
> > > > >>>> sizeof(int32_t);
> > > > >>>>>        register __m128i    xmm0;
> > > > >>>>>
> > > > >>>>>        RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
> > > > >>>>>        RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
> > > > >>>>>        RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
> > > > >>>>>
> > > > >>>>>        if (unlikely(len == 0)) return;
> > > > >>>>>
> > > > >>>>>        /* Copy first, non-__m128i aligned, part of source
> > data.
> > > > */
> > > > >>>>>        if (offset) {
> > > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > > const
> > > > >>>> pointer as parameter. */
> > > > >>>>> #pragma GCC diagnostic push
> > > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > > >>>>>            xmm0 = _mm_stream_load_si128(srca);
> > > > >>>>>            _mm_store_si128((void *)buf, xmm0);
> > > > >>>>> #pragma GCC diagnostic pop
> > > > >>>>>            switch (offset) {
> > > > >>>>>                case 1:
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > > >>>> sizeof(int32_t)), buf[1]);
> > > > >>>>>                    if (unlikely(len == 1 * sizeof(int32_t)))
> > > > return;
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > > > >>>> sizeof(int32_t)), buf[2]);
> > > > >>>>>                    if (unlikely(len == 2 * sizeof(int32_t)))
> > > > return;
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> > > > >>>> sizeof(int32_t)), buf[3]);
> > > > >>>>>                    break;
> > > > >>>>>                case 2:
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > > >>>> sizeof(int32_t)), buf[2]);
> > > > >>>>>                    if (unlikely(len == 1 * sizeof(int32_t)))
> > > > return;
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > > > >>>> sizeof(int32_t)), buf[3]);
> > > > >>>>>                    break;
> > > > >>>>>                case 3:
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > > >>>> sizeof(int32_t)), buf[3]);
> > > > >>>>>                    break;
> > > > >>>>>            }
> > > > >>>>>            srca = RTE_PTR_ADD(srca, (4 - offset) *
> > > > sizeof(int32_t));
> > > > >>>>>            dst = RTE_PTR_ADD(dst, (4 - offset) *
> > > > sizeof(int32_t));
> > > > >>>>>        }
> > > > >>>>>
> > > > >>>>>        /* Copy middle, __m128i aligned, part of source data.
> > */
> > > > >>>>>        while (srca != srcenda) {
> > > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > > const
> > > > >>>> pointer as parameter. */
> > > > >>>>> #pragma GCC diagnostic push
> > > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > > >>>>>            xmm0 = _mm_stream_load_si128(srca);
> > > > >>>>> #pragma GCC diagnostic pop
> > > > >>>>>            _mm_store_si128((void *)buf, xmm0);
> > > > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > sizeof(int32_t)),
> > > > >>>> buf[0]);
> > > > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > sizeof(int32_t)),
> > > > >>>> buf[1]);
> > > > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> > sizeof(int32_t)),
> > > > >>>> buf[2]);
> > > > >>>>>            _mm_stream_si32(RTE_PTR_ADD(dst, 3 *
> > sizeof(int32_t)),
> > > > >>>> buf[3]);
> > > > >>>>>            srca = RTE_PTR_ADD(srca, sizeof(__m128i));
> > > > >>>>>            dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
> > > > >>>>>        }
> > > > >>>>>
> > > > >>>>>        /* Copy last, non-__m128i aligned, part of source
> > data. */
> > > > >>>>>        if (RTE_PTR_DIFF(srca, src) != 4) {
> > > > >>>>> /* Note: Workaround for _mm_stream_load_si128() not taking a
> > > > const
> > > > >>>> pointer as parameter. */
> > > > >>>>> #pragma GCC diagnostic push
> > > > >>>>> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
> > > > >>>>>            xmm0 = _mm_stream_load_si128(srca);
> > > > >>>>>            _mm_store_si128((void *)buf, xmm0);
> > > > >>>>> #pragma GCC diagnostic pop
> > > > >>>>>            switch (offset) {
> > > > >>>>>                case 1:
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > > >>>> sizeof(int32_t)), buf[0]);
> > > > >>>>>                    break;
> > > > >>>>>                case 2:
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > > >>>> sizeof(int32_t)), buf[0]);
> > > > >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1
> > *
> > > > >>>> sizeof(int32_t))) return;
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > > > >>>> sizeof(int32_t)), buf[1]);
> > > > >>>>>                    break;
> > > > >>>>>                case 3:
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 0 *
> > > > >>>> sizeof(int32_t)), buf[0]);
> > > > >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 1
> > *
> > > > >>>> sizeof(int32_t))) return;
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 1 *
> > > > >>>> sizeof(int32_t)), buf[1]);
> > > > >>>>>                    if (unlikely(RTE_PTR_DIFF(srca, src) == 2
> > *
> > > > >>>> sizeof(int32_t))) return;
> > > > >>>>>                    _mm_stream_si32(RTE_PTR_ADD(dst, 2 *
> > > > >>>> sizeof(int32_t)), buf[2]);
> > > > >>>>>                    break;
> > > > >>>>>            }
> > > > >>>>>        }
> > > > >>>>> }
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > > >
> > > >


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 22:00                 ` Konstantin Ananyev
@ 2022-07-30  9:51                   ` Morten Brørup
  2022-08-02  9:05                     ` Konstantin Ananyev
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-07-30  9:51 UTC (permalink / raw)
  To: Konstantin Ananyev, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Saturday, 30 July 2022 00.00
> 
> > > > > Actually, one question I have for such small data-transfer
> > > > > (16B per packet) - do you still see some noticable perfomance
> > > > > improvement for such scenario?
> > > >
> > > > Copying 16 byte from each packet in a burst of 32 packets would
> > > otherwise pollute 64 cache lines = 4 KB cache. With typically 64 KB
> L1
> > > > cache, I think it makes a difference.
> > >
> > > I understand the intention behind, my question was - it is really
> > > measurable?
> > > Something like: using pktmbuf_copy_nt(len=16) over using
> > > pktmbuf_copy(len=16)
> > > on workload X gives Y% thoughtput improvement?
> >
> > If the application is complex enough, and needs some of those 4 KB
> cache otherwise wasted, there will be a significant throughput
> > improvement; otherwise probably not.
> >
> > I have a general problem with this type of question: I hate that
> throughput is the only KPI (Key Performance Indicator) getting any
> > attention on the mailing list! Other KPIs, such as latency and
> resource conservation, are just as important in many real life use
> cases.
> 
> Well, I suppose that sort of expected question for the patch that
> introduces performance optimization:
> what is the benefit we expect to get and is it worth the effort?
> Throughput or latency improvement seems like an obvious choice here.
> About resource conservation - if the patch aims to improve cache
> consumption, then on some cache-bound
> workloads it should result in throughput improvement, correct?

The benefit is cache conservation.

Copying a burst of 32 1518 byte packets - using memcpy() - pollutes the entire L1 cache. Not trashing the entire L1 cache - using memcpy_nt() - should provide derived benefits for latency and/or throughput for most applications copying entire packets.

> 
> >
> > Here's a number for you: 6.25 % reduction in L1 data cache
> consumption. (Assuming 64 KB L1 cache with 64 byte cache lines and
> > application burst length of 32 packets.)
> 
> I understand that it should reduce cache eviction rate.
> The thing is that non-temporal stores are not free also: they consume
> WC buffers and some memory-bus bandwidth.
> AFAIK, for 16B non-consecutive NT stores, it means that only 25% of WC
> buffers capacity will be used,
> and in theory it might lead to extra memory pressure and worse
> performance in general.

I'm not a CPU expert, so I wonder if it makes any difference if the 16B non-consecutive store is non-temporal or normal... intuitively, the need to use a WC buffer and memory-bus bandwidth seems similar to me.

Also, my 16B example might be a bit silly... I used it to argue for the execution performance cost of omitting the alignment hints (added compares and branches). I suppose most NT copies will be packets, so mostly 64 or 1518 byte copies.

And in our application, 16 byte metadata is also copied to the front of each packet. The copied packet follows immediately after the 16B metadata, so perhaps I should try to find a way to make these stores consecutive. Feature creep? ;-)
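
For illustration, a minimal sketch of what such a consecutive layout could look like - the record layout and names below are made up, and rte_memcpy_nt() is the function proposed in this RFC:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical capture record: the 16 byte metadata is immediately
 * followed by the packet bytes, so the NT stores to the destination
 * are consecutive and can fill whole 64 byte cache lines / WC buffers. */
struct capture_record {
    uint8_t metadata[16];
    uint8_t data[];     /* packet bytes follow directly */
};

static inline void
capture_one(struct capture_record *rec, const void *meta,
        const void *pkt, size_t pkt_len)
{
    /* Destination addresses are contiguous: metadata, then packet. */
    rte_memcpy_nt(rec->metadata, meta, sizeof(rec->metadata));
    rte_memcpy_nt(rec->data, pkt, pkt_len);
}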

> In fact, IA manuals explicitly recommend to avoid partial cach-line
> writes whenever possible.
> Now, I don't know what would be more expensive in that case: re-fill
> extra cache-lines,
> or extra partial write memory transactions.

Good input, Konstantin. I will take this into consideration when optimizing the copy loops.

> That's why I asked for some performance numbers here.
> 

Got it.

The benefit of the patch is to avoid data cache pollution, and we agree about this.

So, to consider the other side of the coin, i.e. the potentially degraded memory copy throughput, I will measure both the NT copy and the normal copy using our application's packet capture feature, and provide both performance numbers.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-30  9:51                   ` Morten Brørup
@ 2022-08-02  9:05                     ` Konstantin Ananyev
  0 siblings, 0 replies; 57+ messages in thread
From: Konstantin Ananyev @ 2022-08-02  9:05 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach


> 
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Saturday, 30 July 2022 00.00
> >
> > > > > > Actually, one question I have for such small data-transfer
> > > > > > (16B per packet) - do you still see some noticable perfomance
> > > > > > improvement for such scenario?
> > > > >
> > > > > Copying 16 byte from each packet in a burst of 32 packets would
> > > > otherwise pollute 64 cache lines = 4 KB cache. With typically 64 KB
> > L1
> > > > > cache, I think it makes a difference.
> > > >
> > > > I understand the intention behind, my question was - it is really
> > > > measurable?
> > > > Something like: using pktmbuf_copy_nt(len=16) over using
> > > > pktmbuf_copy(len=16)
> > > > on workload X gives Y% thoughtput improvement?
> > >
> > > If the application is complex enough, and needs some of those 4 KB
> > cache otherwise wasted, there will be a significant throughput
> > > improvement; otherwise probably not.
> > >
> > > I have a general problem with this type of question: I hate that
> > throughput is the only KPI (Key Performance Indicator) getting any
> > > attention on the mailing list! Other KPIs, such as latency and
> > resource conservation, are just as important in many real life use
> > cases.
> >
> > Well, I suppose that sort of expected question for the patch that
> > introduces performance optimization:
> > what is the benefit we expect to get and is it worth the effort?
> > Throughput or latency improvement seems like an obvious choice here.
> > About resource conservation - if the patch aims to improve cache
> > consumption, then on some cache-bound
> > workloads it should result in throughput improvement, correct?
> 
> The benefit is cache conservation.
> 
> Copying a burst of 32 1518 byte packets - using memcpy() - pollutes the entire L1 cache. Not trashing the entire L1 cache - using
> memcpy_nt() - should provide derived benefits for latency and/or throughput for most applications copying entire packets.
> 
> >
> > >
> > > Here's a number for you: 6.25 % reduction in L1 data cache
> > consumption. (Assuming 64 KB L1 cache with 64 byte cache lines and
> > > application burst length of 32 packets.)
> >
> > I understand that it should reduce cache eviction rate.
> > The thing is that non-temporal stores are not free also: they consume
> > WC buffers and some memory-bus bandwidth.
> > AFAIK, for 16B non-consecutive NT stores, it means that only 25% of WC
> > buffers capacity will be used,
> > and in theory it might lead to extra memory pressure and worse
> > performance in general.
> 
> I'm not a CPU expert, so I wonder if it makes any difference if the 16B non-consecutive store is non-temporal or normal... intuitively,
> the need to use a WC buffer and memory-bus bandwidth seems similar to me.

I don't consider myself a proper expert in that area either, but according to the IA optimization manual,
it does:

"11.5.5 Use Full Write Transactions to Achieve Higher Data Rate
Write transactions across the bus can result in write to physical memory either using the full line size of
64 bytes or less than the full line size. The latter is referred to as a partial write. Typically, writes to writeback
(WB) memory addresses are full-size and writes to write-combine (WC) or uncacheable (UC) type
memory addresses result in partial writes. Both cached WB store operations and WC store operations
utilize a set of six WC buffers (64 bytes wide) to manage the traffic of write transactions. When
competing traffic closes a WC buffer before all writes to the buffer are finished, this results in a series of
8-byte partial bus transactions rather than a single 64-byte write transaction.
....
When partial-writes are transacted on the bus, the effective data rate to system memory is reduced to
only 1/8 of the system bus bandwidth." 

> 
> Also, my 16B example might be a bit silly... I used it to argue for the execution performance cost of omitting the alignment hints
> (added compares and branches). I suppose most NT copies will be packets, so mostly 64 or 1518 byte copies.

Yes, for full packets I think it makes sense to use NT writes.
My concern is about small (less than cache-line) writes; I presume it might make things slower.
Though yes, right now it is just a guess - I don't have any data to prove it.

> 
> And in our application also 16 byte metadata being copied to the front of each packet. The copied packet follows immediately after
> the 16B metadata, so perhaps I should try to find a way to make these stores consecutive. Feature creep? ;-)
> 
> > In fact, IA manuals explicitly recommend to avoid partial cach-line
> > writes whenever possible.
> > Now, I don't know what would be more expensive in that case: re-fill
> > extra cache-lines,
> > or extra partial write memory transactions.
> 
> Good input, Konstantin. I will take this into consideration when optimizing the copy loops.
> 
> > That's why I asked for some performance numbers here.
> >
> 
> Got it.
> 
> The benefit of the patch is to avoid data cache pollution, and we agree about this.
> 
> So, to consider the other side of the coin, i.e. the potentially degraded memory copy throughput, I will measure both the NT copy and
> the normal copy using our application's packet capture feature, and provide both performance numbers.

Sounds like a good plan.
Thanks
Konstantin
 




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-29 20:26                   ` Morten Brørup
  2022-07-29 21:34                     ` Konstantin Ananyev
@ 2022-08-07 20:20                     ` Mattias Rönnblom
  2022-08-09  9:34                       ` Morten Brørup
  2022-08-10 21:05                     ` Honnappa Nagarahalli
  2 siblings, 1 reply; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-07 20:20 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, Konstantin Ananyev, dev,
	Bruce Richardson, Honnappa Nagarahalli
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

On 2022-07-29 22:26, Morten Brørup wrote:
> +TO: @Honnappa, we need input from ARM
> 
>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
>> Sent: Friday, 29 July 2022 21.49
>>>
>>>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
>>>> Sent: Friday, 29 July 2022 14.14
>>>>
>>>>
>>>> Sorry, missed that part.
>>>>
>>>>>
>>>>>> Another question - who will do 'sfence' after the copying?
>>>>>> Would it be inside memcpy_nt (seems quite costly), or would
>>>>>> it be another API function for that: memcpy_nt_flush() or so?
>>>>>
>>>>> Outside. Only the developer knows when it is required, so it
>> wouldn't
>>>> make any sense to add the cost inside memcpy_nt().
>>>>>
>>>>> I don't think we should add a flush function; it would just be
>>>> another name for an already existing function. Referring to the
>>>> required
>>>>> operation in the memcpy_nt() function documentation should
>> suffice.
>>>>>
>>>>
>>>> Ok, but again wouldn't it be arch specific?
>>>> AFAIK for x86 it needs to boil down to sfence, for other
>> architectures
>>>> - I don't know.
>>>> If you think there already is some generic one (rte_wmb?) that
>> would
>>>> always produce
>>>> correct instructions - sure let's use it.
>>>>
>>>
>>> DPDK has generic functions to wrap architecture specific stuff like
>> memory barriers.
>>>
>>> Because they are non-temporal stores, I suspect that rte_mb() is
>> required before reading the data from the location it was copied to.
>>> Ensuring that STORE operations are ordered (rte_wmb) might not
>> suffice. However, I'm not a CPU expert, so I will seek advice from
>>> more qualified people in the community on this.
>>
>> I think for IA sfence is enough, see citation below,
>> for other architectures - no idea.
>> What I am trying to say - it needs to be the *same* function on all
>> archs we support.
> 
> Now I get it: rte_wmb() might be appropriate on x86, but if any other architecture requires something else, we should add a new common function for flushing, e.g. rte_memcpy_nt_flush().
> 

rte_wmb() not being enough is also my understanding. NT stores are weakly 
ordered on x86, and require an sfence to be ordered with non-NT stores. 
Unfortunately, this per-memcpy sfence instruction makes even 1500-byte 
copy operations much slower - at least in micro benchmarks.
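
For the sake of discussion, a minimal sketch of the intended usage pattern - rte_memcpy_nt() being the function proposed in this RFC, the buffer/index names made up, and with no claim that rte_wmb() is the right portable primitive (that is exactly the open question):

#include <stdint.h>
#include <stddef.h>
#include <rte_atomic.h>

extern uint8_t capture_buf[];           /* hypothetical capture buffer */
extern volatile uint32_t capture_tail;  /* read by a consumer lcore */

static inline void
publish_capture(const void *pkt, size_t len, uint32_t off)
{
    rte_memcpy_nt(&capture_buf[off], pkt, len);

    /* NT stores are weakly ordered; fence before the normal store that
     * makes the copied data visible to the consumer (sfence on x86). */
    rte_wmb();

    capture_tail = off + len;
}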

>>
>> IA SW optimization manual:
>> 9.4.2 Streaming Store Usage Models
>> The two primary usage domains for streaming store are coherent requests
>> and non-coherent requests.
>> 9.4.2.1 Coherent Requests
>> Coherent requests are normal loads and stores to system memory, which
>> may also hit cache lines
>> present in another processor in a multiprocessor environment. With
>> coherent requests, a streaming store
>> can be used in the same way as a regular store that has been mapped
>> with a WC memory type (PAT or
>> MTRR). An SFENCE instruction must be used within a producer-consumer
>> usage model in order to ensure
>> coherency and visibility of data between processors.
>> Within a single-processor system, the CPU can also re-read the same
>> memory location and be assured of
>> coherence (that is, a single, consistent view of this memory location).
>> The same is true for a multiprocessor
>> (MP) system, assuming an accepted MP software producer-consumer
>> synchronization policy is
>> employed.
>>
> 
> With this reference, I am convinced that you are right about the SFENCE. This puts a checkmark on this item on my TODO list for the patch. Thank you, Konstantin!
> 
> Any ARM CPU experts on the mailing list seeing this, not on vacation? @Honnappa, I'm looking at you. :-)
> 
> Summing up, the question is:
> 
> After a bunch of *non-temporal* stores (STNP instruction) on ARM architecture, does calling rte_wmb() suffice to ensure the data is visible across the system?
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-19 15:26 [RFC v2] non-temporal memcpy Morten Brørup
  2022-07-19 18:00 ` David Christensen
  2022-07-21 23:19 ` Konstantin Ananyev
@ 2022-08-07 20:25 ` Mattias Rönnblom
  2022-08-09  9:46   ` Morten Brørup
  2 siblings, 1 reply; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-07 20:25 UTC (permalink / raw)
  To: Morten Brørup, dev, Bruce Richardson, Konstantin Ananyev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

On 2022-07-19 17:26, Morten Brørup wrote:
> This RFC proposes a set of functions optimized for non-temporal memory copy.
> 
> At this stage, I am asking for feedback on the concept.
> 
> Applications sometimes data to another memory location, which is only used
> much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
> 
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
> 
> The purpose of these functions is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput may be improved by further optimization, I do not
> consider througput optimization relevant initially.
> 
> The x86 non-temporal load instructions have 16 byte alignment
> requirements [1], while ARM non-temporal load instructions are available with
> 4 byte alignment requirements [2].
> Both platforms offer non-temporal store instructions with 4 byte alignment
> requirements.
> 

I don't think memcpy() functions should have alignment requirements. 
That's not very practical, and violates the principle of least surprise.

Use normal memcpy() for the unaligned parts, and for the whole thing for 
small sizes (at least on x86).
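
As a strawman, a hedged sketch of that approach - the 128 byte threshold is made up, rte_memcpy_nt16a() is the aligned variant from this RFC, and the sketch assumes src and dst share the same 16 byte misalignment (otherwise the NT-load side needs extra handling, e.g. a bounce buffer):

#include <stdint.h>
#include <stddef.h>
#include <rte_common.h>
#include <rte_memcpy.h>

static inline void
memcpy_nt_mixed(void *dst, const void *src, size_t len)
{
    size_t head, mid;

    /* Small copies: not worth the NT path at all. */
    if (len < 128) {
        rte_memcpy(dst, src, len);
        return;
    }

    /* Unaligned head: regular stores. */
    head = RTE_PTR_DIFF(RTE_PTR_ALIGN_CEIL(dst, 16), dst);
    if (head != 0) {
        rte_memcpy(dst, src, head);
        dst = RTE_PTR_ADD(dst, head);
        src = RTE_PTR_ADD(src, head);
        len -= head;
    }

    /* Aligned middle: non-temporal copy in 16 byte blocks.
     * NOTE: assumes src now has the same 16 byte alignment as dst. */
    mid = len & ~(size_t)15;
    rte_memcpy_nt16a(dst, src, mid);

    /* Unaligned tail: regular stores. */
    if (len != mid)
        rte_memcpy(RTE_PTR_ADD(dst, mid), RTE_PTR_ADD(src, mid), len - mid);
}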

> In addition to the primary function without any alignment requirements, we
> also provide functions for respectivly 16 and 4 byte aligned access for
> performance purposes.
> 
> The function names resemble standard C library function names, but their
> signatures are intentionally different. No need to drag legacy into it.
> 
> NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
> style and use TAB.
> 
> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
> [2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-
> 
> V2:
> - Only copy from non-temporal source to non-temporal destination.
>    I.e. remove the two variants with only source and/or destination being
>    non-temporal.
> - Do not require alignment.
>    Instead, offer additional 4 and 16 byte aligned functions for performance
>    purposes.
> - Implemented two of the functions for x86.
> - Remove memset function.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> 
> /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice.
>   *
>   * Copy data from non-temporal source to non-temporal destination.
>   *
>   * @param dst
>   *   Pointer to the non-temporal destination of the data.
>   *   Should be 4 byte aligned, for optimal performance.
>   * @param src
>   *   Pointer to the non-temporal source data.
>   *   No alignment requirements.
>   * @param len
>   *   Number of bytes to copy.
>   *   Should be be divisible by 4, for optimal performance.
>   */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> /* Implementation T.B.D. */
> 
> /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice.
>   *
>   * Copy data in blocks of 16 byte from aligned non-temporal source
>   * to aligned non-temporal destination.
>   *
>   * @param dst
>   *   Pointer to the non-temporal destination of the data.
>   *   Must be 16 byte aligned.
>   * @param src
>   *   Pointer to the non-temporal source data.
>   *   Must be 16 byte aligned.
>   * @param len
>   *   Number of bytes to copy.
>   *   Must be divisible by 16.
>   */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> {
>      const void * const  end = RTE_PTR_ADD(src, len);
> 
>      RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
>      RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
>      RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
> 
>      /* Copy large portion of data. */
>      while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
>          register __m128i    xmm0, xmm1, xmm2, xmm3;
> 
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 * sizeof(__m128i)));
>          xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 * sizeof(__m128i)));
>          xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 * sizeof(__m128i)));
>          xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 * sizeof(__m128i)));
> #pragma GCC diagnostic pop
>          _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)), xmm0);
>          _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)), xmm1);
>          _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)), xmm2);
>          _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)), xmm3);
>          src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
>          dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
>      }
> 
>      /* Copy remaining data. */
>      while (src != end) {
>          register __m128i    xmm;
> 
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm = _mm_stream_load_si128(src);
> #pragma GCC diagnostic pop
>          _mm_stream_si128(dst, xmm);
>          src = RTE_PTR_ADD(src, sizeof(__m128i));
>          dst = RTE_PTR_ADD(dst, sizeof(__m128i));
>      }
> }
> 
> /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice.
>   *
>   * Copy data in blocks of 4 byte from aligned non-temporal source
>   * to aligned non-temporal destination.
>   *
>   * @param dst
>   *   Pointer to the non-temporal destination of the data.
>   *   Must be 4 byte aligned.
>   * @param src
>   *   Pointer to the non-temporal source data.
>   *   Must be 4 byte aligned.
>   * @param len
>   *   Number of bytes to copy.
>   *   Must be divisible by 4.
>   */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> {
>      int32_t             buf[sizeof(__m128i) / sizeof(int32_t)] __rte_aligned(sizeof(__m128i));
>      /** Address of source data, rounded down to achieve alignment. */
>      const void *        srca = RTE_PTR_ALIGN_FLOOR(src, sizeof(__m128i));
>      /** Address of end of source data, rounded down to achieve alignment. */
>      const void * const  srcenda = RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
>      const int           offset =  RTE_PTR_DIFF(src, srca) / sizeof(int32_t);
>      register __m128i    xmm0;
> 
>      RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
>      RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
>      RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
> 
>      if (unlikely(len == 0)) return;
> 
>      /* Copy first, non-__m128i aligned, part of source data. */
>      if (offset) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm0 = _mm_stream_load_si128(srca);
>          _mm_store_si128((void *)buf, xmm0);
> #pragma GCC diagnostic pop
>          switch (offset) {
>              case 1:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[1]);
>                  if (unlikely(len == 1 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[2]);
>                  if (unlikely(len == 2 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[3]);
>                  break;
>              case 2:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[2]);
>                  if (unlikely(len == 1 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[3]);
>                  break;
>              case 3:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[3]);
>                  break;
>          }
>          srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
>          dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
>      }
> 
>      /* Copy middle, __m128i aligned, part of source data. */
>      while (srca != srcenda) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm0 = _mm_stream_load_si128(srca);
> #pragma GCC diagnostic pop
>          _mm_store_si128((void *)buf, xmm0);
>          _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>          _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>          _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>          _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)), buf[3]);
>          srca = RTE_PTR_ADD(srca, sizeof(__m128i));
>          dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
>      }
> 
>      /* Copy last, non-__m128i aligned, part of source data. */
>      if (RTE_PTR_DIFF(srca, src) != 4) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>          xmm0 = _mm_stream_load_si128(srca);
>          _mm_store_si128((void *)buf, xmm0);
> #pragma GCC diagnostic pop
>          switch (offset) {
>              case 1:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>                  break;
>              case 2:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>                  if (unlikely(RTE_PTR_DIFF(srca, src) == 1 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>                  break;
>              case 3:
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>                  if (unlikely(RTE_PTR_DIFF(srca, src) == 1 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>                  if (unlikely(RTE_PTR_DIFF(srca, src) == 2 * sizeof(int32_t))) return;
>                  _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>                  break;
>          }
>      }
> }
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-07-29 16:05               ` Stephen Hemminger
  2022-07-29 17:29                 ` Morten Brørup
@ 2022-08-07 20:40                 ` Mattias Rönnblom
  2022-08-09  9:24                   ` Morten Brørup
  1 sibling, 1 reply; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-07 20:40 UTC (permalink / raw)
  To: dev

On 2022-07-29 18:05, Stephen Hemminger wrote:
> On Fri, 29 Jul 2022 12:13:52 +0000
> Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> 
>> Sorry, missed that part.
>>
>>>    
>>>> Another question - who will do 'sfence' after the copying?
>>>> Would it be inside memcpy_nt (seems quite costly), or would
>>>> it be another API function for that: memcpy_nt_flush() or so?
>>>
>>> Outside. Only the developer knows when it is required, so it wouldn't make any sense to add the cost inside memcpy_nt().
>>>
>>> I don't think we should add a flush function; it would just be another name for an already existing function. Referring to the required
>>> operation in the memcpy_nt() function documentation should suffice.
>>>    
>>
>> Ok, but again wouldn't it be arch specific?
>> AFAIK for x86 it needs to boil down to sfence, for other architectures - I don't know.
>> If you think there already is some generic one (rte_wmb?) that would always produce
>> correct instructions - sure let's use it.
>>   
>>   
> 
> It makes sense in a few select places to use non-temporal copy.
> But it would add unnecessary complexity to DPDK if every function in DPDK that could
> cause a copy had a non-temporal variant.

A NT load and NT store variant, plus a NT load+store variant. :)

> 
> Maybe just having rte_memcpy have a threshold (config value?) that if copy is larger than
> a certain size, then it would automatically be non-temporal.  Small copies wouldn't matter,
> the optimization is more about not stopping cache size issues with large streams of data.

I don't think there's any way for rte_memcpy() to know if the 
application plans to use the source, the destination, both, or neither of 
the buffers in the immediate future. For huge copies (MBs or more) the 
size heuristic makes sense, but for medium sized copies (say a packet 
worth of data), I'm not so sure.

What is unclear to me is if there is a benefit (or drawback) of using 
the imaginary rte_memcpy_nt(), compared to doing rte_memcpy() + 
clflushopt or cldemote, in the typical use case (if there is such).
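
For reference, a minimal sketch of the clflushopt alternative (x86 only; requires CLFLUSHOPT support and the matching compiler flags, and whether it actually beats NT stores is an open question):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>
#include <rte_memcpy.h>

/* Copy normally, then flush the destination cache lines so the copy does
 * not stay resident in the cache. CLDEMOTE would be a gentler variant. */
static inline void
memcpy_then_flush(void *dst, const void *src, size_t len)
{
    uintptr_t p = (uintptr_t)dst & ~(uintptr_t)63;
    const uintptr_t end = (uintptr_t)dst + len;

    rte_memcpy(dst, src, len);

    for (; p < end; p += 64)
        _mm_clflushopt((void *)p);

    _mm_sfence();   /* order the flushes, if the use case needs it */
}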


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-07 20:40                 ` Mattias Rönnblom
@ 2022-08-09  9:24                   ` Morten Brørup
  2022-08-09 11:53                     ` Mattias Rönnblom
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-08-09  9:24 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, Stephen Hemminger, Konstantin Ananyev, Bruce Richardson,
	Honnappa Nagarahalli

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Sunday, 7 August 2022 22.41
> 
> On 2022-07-29 18:05, Stephen Hemminger wrote:
> >
> > It makes sense in a few select places to use non-temporal copy.
> > But it would add unnecessary complexity to DPDK if every function in
> DPDK that could
> > cause a copy had a non-temporal variant.
> 
> A NT load and NT store variant, plus a NT load+store variant. :)

I considered this, but it adds complexity, and our use case only needs the NT load+store. So I decided to only provide that variant.

I can prepare the API for all four combinations. The extended function would be renamed from rte_memcpy_nt_ex() to just rte_memcpy_ex(). And rte_memcpy_nt() would be omitted, rather than kept as a shorthand for rte_memcpy_ex(dst,src,len,F_DST_NT|F_SRC_NT).

What does the community prefer in this regard?
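
To make the question concrete, a sketch of the flag-based variant - the flag names follow the F_SRC_NT/F_DST_NT mentioned above, but the values and the dispatch are placeholders, not a settled API:

#include <stddef.h>
#include <rte_common.h>
#include <rte_memcpy.h>

#define F_SRC_NT    (1u << 0)   /* non-temporal loads from src */
#define F_DST_NT    (1u << 1)   /* non-temporal stores to dst */

static inline void
rte_memcpy_ex(void * __rte_restrict dst, const void * __rte_restrict src,
        size_t len, unsigned int flags)
{
    if ((flags & (F_SRC_NT | F_DST_NT)) == (F_SRC_NT | F_DST_NT)) {
        rte_memcpy_nt(dst, src, len);   /* the NT load+store path from this RFC */
        return;
    }
    /* NT-load-only and NT-store-only paths intentionally not sketched. */
    rte_memcpy(dst, src, len);
}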

> 
> >
> > Maybe just having rte_memcpy have a threshold (config value?) that if
> copy is larger than
> > a certain size, then it would automatically be non-temporal.  Small
> copies wouldn't matter,
> > the optimization is more about not stopping cache size issues with
> large streams of data.
> 
> I don't think there's any way for rte_memcpy() to know if the
> application plan to use the source, the destination, both, or neither
> of
> the buffers in the immediate future.

Agree. Which is why explicit NT function variants should be offered.

> For huge copies (MBs or more) the
> size heuristic makes sense, but for medium sized copies (say a packet
> worth of data), I'm not so sure.

This is the behavior of glibc memcpy().

> 
> What is unclear to me is if there is a benefit (or drawback) of using
> the imaginary rte_memcpy_nt(), compared to doing rte_memcpy() +
> clflushopt or cldemote, in the typical use case (if there is such).
> 

Our use case is packet capture (copying) to memory, where the copies will be read much later, so there is no need to pollute the cache with the copies.

Our application also doesn't look deep inside the original packets after copying them, there is also no need to pollute the cache with the originals.

And even though the application looked partially into the packets before copying them (and thus they are partially in cache) using NT load (instead of normal load) has no additional cost.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-07 20:20                     ` Mattias Rönnblom
@ 2022-08-09  9:34                       ` Morten Brørup
  2022-08-09 11:56                         ` Mattias Rönnblom
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-08-09  9:34 UTC (permalink / raw)
  To: Mattias Rönnblom, Konstantin Ananyev, Konstantin Ananyev,
	dev, Bruce Richardson, Honnappa Nagarahalli
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Sunday, 7 August 2022 22.20
> 
> On 2022-07-29 22:26, Morten Brørup wrote:
> > +TO: @Honnappa, we need input from ARM
> >
> >> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> >> Sent: Friday, 29 July 2022 21.49
> >>>
> >>>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> >>>> Sent: Friday, 29 July 2022 14.14
> >>>>
> >>>>
> >>>> Sorry, missed that part.
> >>>>
> >>>>>
> >>>>>> Another question - who will do 'sfence' after the copying?
> >>>>>> Would it be inside memcpy_nt (seems quite costly), or would
> >>>>>> it be another API function for that: memcpy_nt_flush() or so?
> >>>>>
> >>>>> Outside. Only the developer knows when it is required, so it
> >> wouldn't
> >>>> make any sense to add the cost inside memcpy_nt().
> >>>>>
> >>>>> I don't think we should add a flush function; it would just be
> >>>> another name for an already existing function. Referring to the
> >>>> required
> >>>>> operation in the memcpy_nt() function documentation should
> >> suffice.
> >>>>>
> >>>>
> >>>> Ok, but again wouldn't it be arch specific?
> >>>> AFAIK for x86 it needs to boil down to sfence, for other
> >> architectures
> >>>> - I don't know.
> >>>> If you think there already is some generic one (rte_wmb?) that
> >> would
> >>>> always produce
> >>>> correct instructions - sure let's use it.
> >>>>
> >>>
> >>> DPDK has generic functions to wrap architecture specific stuff like
> >> memory barriers.
> >>>
> >>> Because they are non-temporal stores, I suspect that rte_mb() is
> >> required before reading the data from the location it was copied to.
> >>> Ensuring that STORE operations are ordered (rte_wmb) might not
> >> suffice. However, I'm not a CPU expert, so I will seek advice from
> >>> more qualified people in the community on this.
> >>
> >> I think for IA sfence is enough, see citation below,
> >> for other architectures - no idea.
> >> What I am trying to say - it needs to be the *same* function on all
> >> archs we support.
> >
> > Now I get it: rte_wmb() might be appropriate on x86, but if any other
> architecture requires something else, we should add a new common
> function for flushing, e.g. rte_memcpy_nt_flush().
> >
> 
> rte_wmb() not being enough also my understanding. NT stores are weakly
> ordered on x86, and requires a sfence to be ordered with non-NT stores.
> 
> Unfortunately, this per-memcpy sfence instruction make even 1500-byte
> sized copy operations much slower - at least in micro benchmarks.

I agree that calling rte_mb() from each NT memcpy would be counterproductive on small/medium copy operations.

Which is why rte_mb() is not called by the NT memcpy function itself, but by the application. This requirement will be part of the function's documentation.

The application must call rte_mb() before it accesses the copy.

Alternatively, the application can call rte_mb() after a burst of NT memcopies.
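
A minimal sketch of that burst pattern, with one barrier amortized over the whole burst (rte_memcpy_nt() from this RFC; the flat destination layout is simplified):

#include <stdint.h>
#include <rte_atomic.h>
#include <rte_mbuf.h>

static inline void
capture_burst(uint8_t *dst, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t i;

    for (i = 0; i < nb_pkts; i++) {
        rte_memcpy_nt(dst, rte_pktmbuf_mtod(pkts[i], const void *),
                pkts[i]->data_len);
        dst += pkts[i]->data_len;
    }

    /* One barrier for the whole burst, before anything reads the copies. */
    rte_mb();
}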

> 
> >>
> >> IA SW optimization manual:
> >> 9.4.2 Streaming Store Usage Models
> >> The two primary usage domains for streaming store are coherent
> requests
> >> and non-coherent requests.
> >> 9.4.2.1 Coherent Requests
> >> Coherent requests are normal loads and stores to system memory,
> which
> >> may also hit cache lines
> >> present in another processor in a multiprocessor environment. With
> >> coherent requests, a streaming store
> >> can be used in the same way as a regular store that has been mapped
> >> with a WC memory type (PAT or
> >> MTRR). An SFENCE instruction must be used within a producer-consumer
> >> usage model in order to ensure
> >> coherency and visibility of data between processors.
> >> Within a single-processor system, the CPU can also re-read the same
> >> memory location and be assured of
> >> coherence (that is, a single, consistent view of this memory
> location).
> >> The same is true for a multiprocessor
> >> (MP) system, assuming an accepted MP software producer-consumer
> >> synchronization policy is
> >> employed.
> >>
> >
> > With this reference, I am convinced that you are right about the
> SFENCE. This puts a checkmark on this item on my TODO list for the
> patch. Thank you, Konstantin!
> >
> > Any ARM CPU experts on the mailing list seeing this, not on vacation?
> @Honnappa, I'm looking at you. :-)
> >
> > Summing up, the question is:
> >
> > After a bunch of *non-temporal* stores (STNP instruction) on ARM
> architecture, does calling rte_wmb() suffice to ensure the data is
> visible across the system?
> >


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-07 20:25 ` Mattias Rönnblom
@ 2022-08-09  9:46   ` Morten Brørup
  2022-08-09 12:05     ` Mattias Rönnblom
  2022-08-09 15:26     ` Stephen Hemminger
  0 siblings, 2 replies; 57+ messages in thread
From: Morten Brørup @ 2022-08-09  9:46 UTC (permalink / raw)
  To: Mattias Rönnblom, dev, Bruce Richardson, Konstantin Ananyev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Sunday, 7 August 2022 22.25
> 
> On 2022-07-19 17:26, Morten Brørup wrote:
> > This RFC proposes a set of functions optimized for non-temporal
> memory copy.
> >
> > At this stage, I am asking for feedback on the concept.
> >
> > Applications sometimes data to another memory location, which is only
> used
> > much later.
> > In this case, it is inefficient to pollute the data cache with the
> copied
> > data.
> >
> > An example use case (originating from a real life application):
> > Copying filtered packets, or the first part of them, into a capture
> buffer
> > for offline analysis.
> >
> > The purpose of these functions is to achieve a performance gain by
> not
> > polluting the cache when copying data.
> > Although the throughput may be improved by further optimization, I do
> not
> > consider througput optimization relevant initially.
> >
> > The x86 non-temporal load instructions have 16 byte alignment
> > requirements [1], while ARM non-temporal load instructions are
> available with
> > 4 byte alignment requirements [2].
> > Both platforms offer non-temporal store instructions with 4 byte
> alignment
> > requirements.
> >
> 
> I don't think memcpy() functions should have alignment requirements.
> That's not very practical, and violates the principle of least
> surprise.

I didn't make the CPUs with these alignment requirements.

However, I will offer optimized performance in a generic NT memcpy() function in the cases where the individual alignment requirements of various CPUs happen to be met.

> 
> Use normal memcpy() for the unaligned parts, and for the whole thing
> for
> small sizes (at least on x86).
> 

I'm not going to plunge into some advanced vector programming, so I'm working on an implementation where misalignment is handled by using a bounce buffer (allocated on the stack, which is probably cache hot anyway).
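
A rough sketch of the bounce-buffer idea for the NT-load side (only a single 16 byte block shown; the helper name and the const cast are illustrative):

#include <stdint.h>
#include <string.h>
#include <immintrin.h>
#include <rte_common.h>

/* NT-load one aligned 16 byte block into a (likely cache hot) stack
 * buffer, then copy the wanted bytes to dst. Only valid for
 * len <= 16 - (src misalignment); a full implementation would loop. */
static inline void
nt_load_bounce(void *dst, const void *src, size_t len)
{
    uint8_t buf[16] __rte_aligned(16);
    const void *srca = RTE_PTR_ALIGN_FLOOR(src, 16);
    const size_t off = RTE_PTR_DIFF(src, srca);

    __m128i xmm = _mm_stream_load_si128((__m128i *)(uintptr_t)srca);
    _mm_store_si128((__m128i *)buf, xmm);

    memcpy(dst, buf + off, len);
}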



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-08-09  9:24                   ` Morten Brørup
@ 2022-08-09 11:53                     ` Mattias Rönnblom
  2022-10-09 16:16                       ` Morten Brørup
  0 siblings, 1 reply; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-09 11:53 UTC (permalink / raw)
  To: Morten Brørup
  Cc: dev, Stephen Hemminger, Konstantin Ananyev, Bruce Richardson,
	Honnappa Nagarahalli

On 2022-08-09 11:24, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Sunday, 7 August 2022 22.41
>>
>> On 2022-07-29 18:05, Stephen Hemminger wrote:
>>>
>>> It makes sense in a few select places to use non-temporal copy.
>>> But it would add unnecessary complexity to DPDK if every function in
>> DPDK that could
>>> cause a copy had a non-temporal variant.
>>
>> A NT load and NT store variant, plus a NT load+store variant. :)
> 
> I considered this, but it adds complexity, and our use case only needs the NT load+store. So I decided to only provide that variant.
> 
> I can prepare the API for all four combinations. The extended function would be renamed from rte_memcpy_nt_ex() to just rte_memcpy_ex(). And the rte_memcpy_nt() would be omitted, rather than just perform rte_memcpy_ex(dst,src,len,F_DST_NT|F_SRC_NT).
> 
> What does the community prefer in this regard?
> 

I would suggest just having a single function, with a flags argument or an enum 
to signify whether load, store or both should be non-temporal. Whether all 
platforms honor all combinations is a different matter.

Is there something that suggests that this particular use case will be 
more common than others? When I've used non-temporal memcpy(), only the 
store side was NT, since the application would go on and use the source data.
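
For completeness, a minimal sketch of that NT-store-only flavour (assumes a 16 byte aligned dst and a len that is a multiple of 16; head/tail handling omitted):

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

static inline void
memcpy_nt_store_only(void *dst, const void *src, size_t len)
{
    size_t i;

    for (i = 0; i < len; i += 16) {
        /* Regular (cached) load from src, streaming store to dst. */
        const __m128i v =
            _mm_loadu_si128((const __m128i *)((const uint8_t *)src + i));
        _mm_stream_si128((__m128i *)((uint8_t *)dst + i), v);
    }
}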

>>
>>>
>>> Maybe just having rte_memcpy have a threshold (config value?) that if
>> copy is larger than
>>> a certain size, then it would automatically be non-temporal.  Small
>> copies wouldn't matter,
>>> the optimization is more about not stopping cache size issues with
>> large streams of data.
>>
>> I don't think there's any way for rte_memcpy() to know if the
>> application plan to use the source, the destination, both, or neither
>> of
>> the buffers in the immediate future.
> 
> Agree. Which is why explicit NT function variants should be offered.
> 
>> For huge copies (MBs or more) the
>> size heuristic makes sense, but for medium sized copies (say a packet
>> worth of data), I'm not so sure.
> 
> This is the behavior of glibc memcpy().
> 

Yes, but, from what I can tell, glibc issues an sfence at the end of the 
copy.

Having a non-temporal memcpy() with a different memory model than the 
compiler intrinsic memcpy(), the glibc memcpy() and the DPDK 
rte_memcpy() implementations seems like asking for trouble.

>>
>> What is unclear to me is if there is a benefit (or drawback) of using
>> the imaginary rte_memcpy_nt(), compared to doing rte_memcpy() +
>> clflushopt or cldemote, in the typical use case (if there is such).
>>
> 
> Our use case is packet capture (copying) to memory, where the copies will be read much later, so there is no need to pollute the cache with the copies.
> 

If you flush/demote each cache line more or less immediately after you've 
used it, there won't be much pollution. Especially if you include the 
clflushopt/cldemote in the copying routine, as opposed to doing a large 
flush at the end.

I haven't tried this in practice, but it seems to me it's an option 
worth exploring. It could be a way to implement a portable NT memcpy(), 
if nothing else.

> Our application also doesn't look deep inside the original packets after copying them, there is also no need to pollute the cache with the originals.
> 

See above.

> And even though the application looked partially into the packets before copying them (and thus they are partially in cache) using NT load (instead of normal load) has no additional cost.
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-08-09  9:34                       ` Morten Brørup
@ 2022-08-09 11:56                         ` Mattias Rönnblom
  0 siblings, 0 replies; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-09 11:56 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, Konstantin Ananyev, dev,
	Bruce Richardson, Honnappa Nagarahalli
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

On 2022-08-09 11:34, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Sunday, 7 August 2022 22.20
>>
>> On 2022-07-29 22:26, Morten Brørup wrote:
>>> +TO: @Honnappa, we need input from ARM
>>>
>>>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
>>>> Sent: Friday, 29 July 2022 21.49
>>>>>
>>>>>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
>>>>>> Sent: Friday, 29 July 2022 14.14
>>>>>>
>>>>>>
>>>>>> Sorry, missed that part.
>>>>>>
>>>>>>>
>>>>>>>> Another question - who will do 'sfence' after the copying?
>>>>>>>> Would it be inside memcpy_nt (seems quite costly), or would
>>>>>>>> it be another API function for that: memcpy_nt_flush() or so?
>>>>>>>
>>>>>>> Outside. Only the developer knows when it is required, so it
>>>> wouldn't
>>>>>> make any sense to add the cost inside memcpy_nt().
>>>>>>>
>>>>>>> I don't think we should add a flush function; it would just be
>>>>>> another name for an already existing function. Referring to the
>>>>>> required
>>>>>>> operation in the memcpy_nt() function documentation should
>>>> suffice.
>>>>>>>
>>>>>>
>>>>>> Ok, but again wouldn't it be arch specific?
>>>>>> AFAIK for x86 it needs to boil down to sfence, for other
>>>> architectures
>>>>>> - I don't know.
>>>>>> If you think there already is some generic one (rte_wmb?) that
>>>> would
>>>>>> always produce
>>>>>> correct instructions - sure let's use it.
>>>>>>
>>>>>
>>>>> DPDK has generic functions to wrap architecture specific stuff like
>>>> memory barriers.
>>>>>
>>>>> Because they are non-temporal stores, I suspect that rte_mb() is
>>>> required before reading the data from the location it was copied to.
>>>>> Ensuring that STORE operations are ordered (rte_wmb) might not
>>>> suffice. However, I'm not a CPU expert, so I will seek advice from
>>>>> more qualified people in the community on this.
>>>>
>>>> I think for IA sfence is enough, see citation below,
>>>> for other architectures - no idea.
>>>> What I am trying to say - it needs to be the *same* function on all
>>>> archs we support.
>>>
>>> Now I get it: rte_wmb() might be appropriate on x86, but if any other
>> architecture requires something else, we should add a new common
>> function for flushing, e.g. rte_memcpy_nt_flush().
>>>
>>
>> rte_wmb() not being enough also my understanding. NT stores are weakly
>> ordered on x86, and requires a sfence to be ordered with non-NT stores.
>>
>> Unfortunately, this per-memcpy sfence instruction make even 1500-byte
>> sized copy operations much slower - at least in micro benchmarks.
> 
> I agree that calling rte_mb() from each NT memcpy would be counterproductive on small/medium copy operations.
> 

There's no need for a full barrier after the copy operation. An sfence 
(e.g., rte_wmb()) is enough. I guess you will want an rte_rmb() before 
you start copying as well, if you are using NT loads.

> Which is why rte_mb() is not called by the NT memcpy function itself, but by the application. This requirement will be part of the function's documentation.
> 

rte_memcpy_surprise()

:)

> The application must call rte_mb() before it accesses the copy.
> 
> Alternatively, the application can call rte_mb() after a burst of NT memcopies.
> 
>>
>>>>
>>>> IA SW optimization manual:
>>>> 9.4.2 Streaming Store Usage Models
>>>> The two primary usage domains for streaming store are coherent
>> requests
>>>> and non-coherent requests.
>>>> 9.4.2.1 Coherent Requests
>>>> Coherent requests are normal loads and stores to system memory,
>> which
>>>> may also hit cache lines
>>>> present in another processor in a multiprocessor environment. With
>>>> coherent requests, a streaming store
>>>> can be used in the same way as a regular store that has been mapped
>>>> with a WC memory type (PAT or
>>>> MTRR). An SFENCE instruction must be used within a producer-consumer
>>>> usage model in order to ensure
>>>> coherency and visibility of data between processors.
>>>> Within a single-processor system, the CPU can also re-read the same
>>>> memory location and be assured of
>>>> coherence (that is, a single, consistent view of this memory
>> location).
>>>> The same is true for a multiprocessor
>>>> (MP) system, assuming an accepted MP software producer-consumer
>>>> synchronization policy is
>>>> employed.
>>>>
>>>
>>> With this reference, I am convinced that you are right about the
>> SFENCE. This puts a checkmark on this item on my TODO list for the
>> patch. Thank you, Konstantin!
>>>
>>> Any ARM CPU experts on the mailing list seeing this, not on vacation?
>> @Honnappa, I'm looking at you. :-)
>>>
>>> Summing up, the question is:
>>>
>>> After a bunch of *non-temporal* stores (STNP instruction) on ARM
>> architecture, does calling rte_wmb() suffice to ensure the data is
>> visible across the system?
>>>
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-08-09  9:46   ` Morten Brørup
@ 2022-08-09 12:05     ` Mattias Rönnblom
  2022-08-09 15:00       ` Morten Brørup
  2022-08-09 15:26     ` Stephen Hemminger
  1 sibling, 1 reply; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-09 12:05 UTC (permalink / raw)
  To: Morten Brørup, dev, Bruce Richardson, Konstantin Ananyev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

On 2022-08-09 11:46, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Sunday, 7 August 2022 22.25
>>
>> On 2022-07-19 17:26, Morten Brørup wrote:
>>> This RFC proposes a set of functions optimized for non-temporal
>> memory copy.
>>>
>>> At this stage, I am asking for feedback on the concept.
>>>
>>> Applications sometimes data to another memory location, which is only
>> used
>>> much later.
>>> In this case, it is inefficient to pollute the data cache with the
>> copied
>>> data.
>>>
>>> An example use case (originating from a real life application):
>>> Copying filtered packets, or the first part of them, into a capture
>> buffer
>>> for offline analysis.
>>>
>>> The purpose of these functions is to achieve a performance gain by
>> not
>>> polluting the cache when copying data.
>>> Although the throughput may be improved by further optimization, I do
>> not
>>> consider througput optimization relevant initially.
>>>
>>> The x86 non-temporal load instructions have 16 byte alignment
>>> requirements [1], while ARM non-temporal load instructions are
>> available with
>>> 4 byte alignment requirements [2].
>>> Both platforms offer non-temporal store instructions with 4 byte
>> alignment
>>> requirements.
>>>
>>
>> I don't think memcpy() functions should have alignment requirements.
>> That's not very practical, and violates the principle of least
>> surprise.
> 
> I didn't make the CPUs with these alignment requirements.
> 
> However, I will offer optimized performance in a generic NT memcpy() function in the cases where the individual alignment requirements of various CPUs happen to be met.
> 
>>
>> Use normal memcpy() for the unaligned parts, and for the whole thing
>> for
>> small sizes (at least on x86).
>>
> 
> I'm not going to plunge into some advanced vector programming, so I'm working on an implementation where misalignment is handled by using a bounce buffer (allocated on the stack, which is probably cache hot anyway).
> 
> 

I don't know for the NT load + NT store case, but for regular load + NT 
store, this is trivial. The implementation I've used is 36 
straight-forward lines of code.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-09 12:05     ` Mattias Rönnblom
@ 2022-08-09 15:00       ` Morten Brørup
  2022-08-10 11:47         ` Mattias Rönnblom
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-08-09 15:00 UTC (permalink / raw)
  To: Mattias Rönnblom, dev, Bruce Richardson, Konstantin Ananyev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 9 August 2022 14.05
> 
> On 2022-08-09 11:46, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Sunday, 7 August 2022 22.25
> >>
> >> On 2022-07-19 17:26, Morten Brørup wrote:
> >>> This RFC proposes a set of functions optimized for non-temporal
> >> memory copy.
> >>>
> >>> At this stage, I am asking for feedback on the concept.
> >>>
> >>> Applications sometimes data to another memory location, which is
> only
> >> used
> >>> much later.
> >>> In this case, it is inefficient to pollute the data cache with the
> >> copied
> >>> data.
> >>>
> >>> An example use case (originating from a real life application):
> >>> Copying filtered packets, or the first part of them, into a capture
> >> buffer
> >>> for offline analysis.
> >>>
> >>> The purpose of these functions is to achieve a performance gain by
> >> not
> >>> polluting the cache when copying data.
> >>> Although the throughput may be improved by further optimization, I
> do
> >> not
> >>> consider througput optimization relevant initially.
> >>>
> >>> The x86 non-temporal load instructions have 16 byte alignment
> >>> requirements [1], while ARM non-temporal load instructions are
> >> available with
> >>> 4 byte alignment requirements [2].
> >>> Both platforms offer non-temporal store instructions with 4 byte
> >> alignment
> >>> requirements.
> >>>
> >>
> >> I don't think memcpy() functions should have alignment requirements.
> >> That's not very practical, and violates the principle of least
> >> surprise.
> >
> > I didn't make the CPUs with these alignment requirements.
> >
> > However, I will offer optimized performance in a generic NT memcpy()
> function in the cases where the individual alignment requirements of
> various CPUs happen to be met.
> >
> >>
> >> Use normal memcpy() for the unaligned parts, and for the whole thing
> >> for
> >> small sizes (at least on x86).
> >>
> >
> > I'm not going to plunge into some advanced vector programming, so I'm
> working on an implementation where misalignment is handled by using a
> bounce buffer (allocated on the stack, which is probably cache hot
> anyway).
> >
> >
> 
> I don't know for the NT load + NT store case, but for regular load + NT
> store, this is trivial. The implementation I've used is 36
> straight-forward lines of code.

Is that implementation available for inspiration anywhere?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-08-09  9:46   ` Morten Brørup
  2022-08-09 12:05     ` Mattias Rönnblom
@ 2022-08-09 15:26     ` Stephen Hemminger
  2022-08-09 17:24       ` Morten Brørup
  2022-08-10 11:55       ` Mattias Rönnblom
  1 sibling, 2 replies; 57+ messages in thread
From: Stephen Hemminger @ 2022-08-09 15:26 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, dev, Bruce Richardson, Konstantin Ananyev,
	Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

On Tue, 9 Aug 2022 11:46:19 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > 
> > I don't think memcpy() functions should have alignment requirements.
> > That's not very practical, and violates the principle of least
> > surprise.  
> 
> I didn't make the CPUs with these alignment requirements.
> 
> However, I will offer optimized performance in a generic NT memcpy() function in the cases where the individual alignment requirements of various CPUs happen to be met.

Rather than making a generic equivalent memcpy function, why not have
something which only takes aligned data. And to avoid user confusion
change the name to be something not suggestive of memcpy.

Maybe rte_non_cache_copy()?

Want to avoid the naive user just doing s/memcpy/rte_memcpy_nt/ and expect
everything to work.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-09 15:26     ` Stephen Hemminger
@ 2022-08-09 17:24       ` Morten Brørup
  2022-08-10 11:59         ` Mattias Rönnblom
  2022-08-10 11:55       ` Mattias Rönnblom
  1 sibling, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-08-09 17:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Mattias Rönnblom, dev, Bruce Richardson, Konstantin Ananyev,
	Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, 9 August 2022 17.26
> 
> On Tue, 9 Aug 2022 11:46:19 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > >
> > > I don't think memcpy() functions should have alignment
> requirements.
> > > That's not very practical, and violates the principle of least
> > > surprise.
> >
> > I didn't make the CPUs with these alignment requirements.
> >
> > However, I will offer optimized performance in a generic NT memcpy()
> function in the cases where the individual alignment requirements of
> various CPUs happen to be met.
> 
> Rather than making a generic equivalent memcpy function, why not have
> something which only takes aligned data.

Our application is copying data not meeting x86 NT load alignment requirements (16 byte), so the function must support that. Specifically, our application is copying complete or truncated IP packets excl. the Ethernet and VLAN headers, i.e. offset by 14, 18 or 22 byte from the cache line aligned packet buffer.

> And to avoid user confusion
> change the name to be something not suggestive of memcpy.
> 
> Maybe rte_non_cache_copy()?
> 
> Want to avoid the naive user just doing s/memcpy/rte_memcpy_nt/ and
> expect
> everything to work.

I see the risk you point out here... But unlike rte_memcpy(), this function isn't advertised in presentations, whitepapers and elsewhere as having much better performance than classic memcpy(), which is the kind of claim that might lead to that misconception. So the probability should be low.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-08-09 15:00       ` Morten Brørup
@ 2022-08-10 11:47         ` Mattias Rönnblom
  0 siblings, 0 replies; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-10 11:47 UTC (permalink / raw)
  To: Morten Brørup, dev, Bruce Richardson, Konstantin Ananyev
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach

On 2022-08-09 17:00, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Tuesday, 9 August 2022 14.05
>>
>> On 2022-08-09 11:46, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Sunday, 7 August 2022 22.25
>>>>
>>>> On 2022-07-19 17:26, Morten Brørup wrote:
>>>>> This RFC proposes a set of functions optimized for non-temporal
>>>> memory copy.
>>>>>
>>>>> At this stage, I am asking for feedback on the concept.
>>>>>
>>>>> Applications sometimes data to another memory location, which is
>> only
>>>> used
>>>>> much later.
>>>>> In this case, it is inefficient to pollute the data cache with the
>>>> copied
>>>>> data.
>>>>>
>>>>> An example use case (originating from a real life application):
>>>>> Copying filtered packets, or the first part of them, into a capture
>>>> buffer
>>>>> for offline analysis.
>>>>>
>>>>> The purpose of these functions is to achieve a performance gain by
>>>> not
>>>>> polluting the cache when copying data.
>>>>> Although the throughput may be improved by further optimization, I
>> do
>>>> not
>>>>> consider througput optimization relevant initially.
>>>>>
>>>>> The x86 non-temporal load instructions have 16 byte alignment
>>>>> requirements [1], while ARM non-temporal load instructions are
>>>> available with
>>>>> 4 byte alignment requirements [2].
>>>>> Both platforms offer non-temporal store instructions with 4 byte
>>>> alignment
>>>>> requirements.
>>>>>
>>>>
>>>> I don't think memcpy() functions should have alignment requirements.
>>>> That's not very practical, and violates the principle of least
>>>> surprise.
>>>
>>> I didn't make the CPUs with these alignment requirements.
>>>
>>> However, I will offer optimized performance in a generic NT memcpy()
>> function in the cases where the individual alignment requirements of
>> various CPUs happen to be met.
>>>
>>>>
>>>> Use normal memcpy() for the unaligned parts, and for the whole thing
>>>> for
>>>> small sizes (at least on x86).
>>>>
>>>
>>> I'm not going to plunge into some advanced vector programming, so I'm
>> working on an implementation where misalignment is handled by using a
>> bounce buffer (allocated on the stack, which is probably cache hot
>> anyway).
>>>
>>>
>>
>> I don't know for the NT load + NT store case, but for regular load + NT
>> store, this is trivial. The implementation I've used is 36
>> straight-forward lines of code.
> 
> Is that implementation available for inspiration anywhere?
> 
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>	/* _mm_loadu_si128(), _mm_stream_si128(), _mm_sfence() */

#define CACHE_LINE_SIZE 64
#define NT_THRESHOLD (2 * CACHE_LINE_SIZE)

/* Note: the pointer arithmetic on void * below relies on the GCC/Clang
 * extension that treats sizeof(void) as 1. */
void nt_memcpy(void *__restrict dst, const void *__restrict src, size_t n)
{
	/* small copies are not worth the NT treatment */
	if (n < NT_THRESHOLD) {
		memcpy(dst, src, n);
		return;
	}

	/* copy the head with a regular memcpy(), so that the NT-store
	 * loop below starts on a cache line boundary */
	size_t n_unaligned = CACHE_LINE_SIZE - (uintptr_t)dst % CACHE_LINE_SIZE;

	if (n_unaligned > n)
		n_unaligned = n;

	memcpy(dst, src, n_unaligned);
	dst += n_unaligned;
	src += n_unaligned;
	n -= n_unaligned;

	size_t num_lines = n / CACHE_LINE_SIZE;

	size_t i;
	for (i = 0; i < num_lines; i++) {
		size_t j;
		for (j = 0; j < CACHE_LINE_SIZE / sizeof(__m128i); j++) {
			/* regular (unaligned) load */
			__m128i blk = _mm_loadu_si128((const __m128i *)src);
			/* non-temporal store to the aligned destination */
			_mm_stream_si128((__m128i *)dst, blk);
			src += sizeof(__m128i);
			dst += sizeof(__m128i);
		}
		n -= CACHE_LINE_SIZE;
	}

	/* make the NT stores ordered/visible to other observers */
	if (num_lines > 0)
		_mm_sfence();

	/* copy the tail with a regular memcpy() */
	memcpy(dst, src, n);
}

(This was written as a part of a benchmark exercise, and hasn't been 
properly tested.)

Use this for inspiration, or I can DPDK-ify this and make it a proper 
patch/RFC. I would try to add support for NT load as well, and make both 
NT load and store depend on a flags parameter.

The above threshold setting is completely arbitrary. What you should 
keep in mind when thinking about the threshold is that it may well be 
worth suffering slightly lower performance from NT store + sfence 
(compared to regular store), since you will benefit from not trashing 
the cache.

For example, back-to-back copying of 1500-byte buffers with this 
copying routine is much slower than regular memcpy() (measured in the 
core cycles spent in the copying), but nevertheless in a real-world 
application it may still improve the overall performance, since the 
packet copies don't evict useful data from the various caches. I know 
for sure that certain applications do benefit.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-08-09 15:26     ` Stephen Hemminger
  2022-08-09 17:24       ` Morten Brørup
@ 2022-08-10 11:55       ` Mattias Rönnblom
  2022-08-10 12:18         ` Morten Brørup
  1 sibling, 1 reply; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-10 11:55 UTC (permalink / raw)
  To: Stephen Hemminger, Morten Brørup
  Cc: dev, Bruce Richardson, Konstantin Ananyev, Jan Viktorin,
	Ruifeng Wang, David Christensen, Stanislaw Kardach

On 2022-08-09 17:26, Stephen Hemminger wrote:
> On Tue, 9 Aug 2022 11:46:19 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
>>>
>>> I don't think memcpy() functions should have alignment requirements.
>>> That's not very practical, and violates the principle of least
>>> surprise.
>>
>> I didn't make the CPUs with these alignment requirements.
>>
>> However, I will offer optimized performance in a generic NT memcpy() function in the cases where the individual alignment requirements of various CPUs happen to be met.
> 
> Rather than making a generic equivalent memcpy function, why not have
> something which only takes aligned data. And to avoid user confusion
> change the name to be something not suggestive of memcpy.
> 

Alignment seems like a non-issue to me. A NT-store memcpy() can be made 
free of alignment requirements, incurring only a very slight cost for 
the always-aligned case (who has their data always 16-byte aligned 
anyways?).

The memory barrier required on x86 seems like a bigger issue.

> Maybe rte_non_cache_copy()?
> 

rte_memcpy_nt_weakly_ordered(), or rte_memcpy_nt_weak(). And a 
rte_memcpy_nt() with the sfence in place, which the user hopefully will 
find first? I don't know. I would prefer not having the weak variant at all.

Accepting weak memory ordering (i.e., no sfence) could also be one of 
the flags, assuming rte_memcpy_nt() would have a flags parameter. 
Default is safe (=memcpy() semantics), but potentially slower.

> Want to avoid the naive user just doing s/memcpy/rte_memcpy_nt/ and expect
> everything to work.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-08-09 17:24       ` Morten Brørup
@ 2022-08-10 11:59         ` Mattias Rönnblom
  2022-08-10 12:12           ` Morten Brørup
  0 siblings, 1 reply; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-10 11:59 UTC (permalink / raw)
  To: Morten Brørup, Stephen Hemminger
  Cc: dev, Bruce Richardson, Konstantin Ananyev, Jan Viktorin,
	Ruifeng Wang, David Christensen, Stanislaw Kardach

On 2022-08-09 19:24, Morten Brørup wrote:
>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>> Sent: Tuesday, 9 August 2022 17.26
>>
>> On Tue, 9 Aug 2022 11:46:19 +0200
>> Morten Brørup <mb@smartsharesystems.com> wrote:
>>
>>>>
>>>> I don't think memcpy() functions should have alignment
>> requirements.
>>>> That's not very practical, and violates the principle of least
>>>> surprise.
>>>
>>> I didn't make the CPUs with these alignment requirements.
>>>
>>> However, I will offer optimized performance in a generic NT memcpy()
>> function in the cases where the individual alignment requirements of
>> various CPUs happen to be met.
>>
>> Rather than making a generic equivalent memcpy function, why not have
>> something which only takes aligned data.
> 
> Our application is copying data not meeting x86 NT load alignment requirements (16 byte), so the function must support that. Specifically, our application is copying complete or truncated IP packets excl. the Ethernet and VLAN headers, i.e. offset by 14, 18 or 22 byte from the cache line aligned packet buffer.
> 

Sure, but you can use regular loads for the non-aligned parts, and then 
you continue to use NT loads for the rest of the data. I suspect there is 
no point in doing NT loads for data on the same cache line that you've 
done regular loads for, so you might as well treat the alignment 
requirement as 64 bytes, not 16.

>> And to avoid user confusion
>> change the name to be something not suggestive of memcpy.
>>
>> Maybe rte_non_cache_copy()?
>>
>> Want to avoid the naive user just doing s/memcpy/rte_memcpy_nt/ and
>> expect
>> everything to work.
> 
> I see the risk you point out here... But it's not advertised in presentations, whitepapers and elsewhere like rte_memcpy() having much better performance than classic memcpy(), which might lead to that misconception. So the probability should be low.
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-10 11:59         ` Mattias Rönnblom
@ 2022-08-10 12:12           ` Morten Brørup
  0 siblings, 0 replies; 57+ messages in thread
From: Morten Brørup @ 2022-08-10 12:12 UTC (permalink / raw)
  To: Mattias Rönnblom, Stephen Hemminger
  Cc: dev, Bruce Richardson, Konstantin Ananyev, Jan Viktorin,
	Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 10 August 2022 14.00
> 
> On 2022-08-09 19:24, Morten Brørup wrote:
> >> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> >> Sent: Tuesday, 9 August 2022 17.26
> >>
> >> On Tue, 9 Aug 2022 11:46:19 +0200
> >> Morten Brørup <mb@smartsharesystems.com> wrote:
> >>
> >>>>
> >>>> I don't think memcpy() functions should have alignment
> >> requirements.
> >>>> That's not very practical, and violates the principle of least
> >>>> surprise.
> >>>
> >>> I didn't make the CPUs with these alignment requirements.
> >>>
> >>> However, I will offer optimized performance in a generic NT
> memcpy()
> >> function in the cases where the individual alignment requirements of
> >> various CPUs happen to be met.
> >>
> >> Rather than making a generic equivalent memcpy function, why not
> have
> >> something which only takes aligned data.
> >
> > Our application is copying data not meeting x86 NT load alignment
> requirements (16 byte), so the function must support that.
> Specifically, our application is copying complete or truncated IP
> packets excl. the Ethernet and VLAN headers, i.e. offset by 14, 18 or
> 22 byte from the cache line aligned packet buffer.
> >
> 
> Sure, but you can use regular loads for the non-aligned parts, and the
> you continue to use NT load for the rest of the data. I suspect there
> is
> no point in doing NT loads for data on the same cache line that you've
> done regular loads for, so you might as well treat the alignment
> requirements as 64 byte, not 16.

I'm NT loading from the aligned address preceding the source address, and when NT storing these data, I'm skipping past the initial few bytes that were too many.

In some scenarios, the lcore capturing the packets has not touched the packet data at all, so not even the Ethernet header is in its data cache. Remember, not all applications are run-to-completion, so the capturing lcore might not access the packet header, but only the mbuf structure.

> 
> >> And to avoid user confusion
> >> change the name to be something not suggestive of memcpy.
> >>
> >> Maybe rte_non_cache_copy()?
> >>
> >> Want to avoid the naive user just doing s/memcpy/rte_memcpy_nt/ and
> >> expect
> >> everything to work.
> >
> > I see the risk you point out here... But it's not advertised in
> presentations, whitepapers and elsewhere like rte_memcpy() having much
> better performance than classic memcpy(), which might lead to that
> misconception. So the probability should be low.
> >


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-10 11:55       ` Mattias Rönnblom
@ 2022-08-10 12:18         ` Morten Brørup
  2022-08-10 21:20           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 57+ messages in thread
From: Morten Brørup @ 2022-08-10 12:18 UTC (permalink / raw)
  To: Mattias Rönnblom, Stephen Hemminger
  Cc: dev, Bruce Richardson, Konstantin Ananyev, Jan Viktorin,
	Ruifeng Wang, David Christensen, Stanislaw Kardach

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 10 August 2022 13.56
> 
> On 2022-08-09 17:26, Stephen Hemminger wrote:

[...]

> 
> Alignment seems like a non-issue to me. A NT-store memcpy() can be made
> free of alignment requirements, incurring only a very slight cost for
> the always-aligned case (who has their data always 16-byte aligned
> anyways?).
> 
> The memory barrier required on x86 seems like a bigger issue.
> 
> > Maybe rte_non_cache_copy()?
> >
> 
> rte_memcpy_nt_weakly_ordered(), or rte_memcpy_nt_weak(). And a
> rte_memcpy_nt() with the sfence is place, which the user hopefully will
> find first? I don't know. I would prefer not having the weak variant at
> all.
> 
> Accepting weak memory ordering (i.e., no sfence) could also be one of
> the flags, assuming rte_memcpy_nt() would have a flags parameter.
> Default is safe (=memcpy() semantics), but potentially slower.

Excellent idea!

> 
> > Want to avoid the naive user just doing s/memcpy/rte_memcpy_nt/ and
> expect
> > everything to work.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-07-29 20:26                   ` Morten Brørup
  2022-07-29 21:34                     ` Konstantin Ananyev
  2022-08-07 20:20                     ` Mattias Rönnblom
@ 2022-08-10 21:05                     ` Honnappa Nagarahalli
  2022-08-11 11:50                       ` Mattias Rönnblom
  2 siblings, 1 reply; 57+ messages in thread
From: Honnappa Nagarahalli @ 2022-08-10 21:05 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, Konstantin Ananyev, dev,
	Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach,
	Honnappa Nagarahalli, nd, nd

<snip>

> 
> +TO: @Honnappa, we need input from ARM
> 
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Friday, 29 July 2022 21.49
> > >
> > > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > > Sent: Friday, 29 July 2022 14.14
> > > >
> > > >
> > > > Sorry, missed that part.
> > > >
> > > > >
> > > > > > Another question - who will do 'sfence' after the copying?
> > > > > > Would it be inside memcpy_nt (seems quite costly), or would it
> > > > > > be another API function for that: memcpy_nt_flush() or so?
> > > > >
> > > > > Outside. Only the developer knows when it is required, so it
> > wouldn't
> > > > make any sense to add the cost inside memcpy_nt().
> > > > >
> > > > > I don't think we should add a flush function; it would just be
> > > > another name for an already existing function. Referring to the
> > > > required
> > > > > operation in the memcpy_nt() function documentation should
> > suffice.
> > > > >
> > > >
> > > > Ok, but again wouldn't it be arch specific?
> > > > AFAIK for x86 it needs to boil down to sfence, for other
> > architectures
> > > > - I don't know.
> > > > If you think there already is some generic one (rte_wmb?) that
> > would
> > > > always produce
> > > > correct instructions - sure let's use it.
> > > >
> > >
> > > DPDK has generic functions to wrap architecture specific stuff like
> > memory barriers.
> > >
> > > Because they are non-temporal stores, I suspect that rte_mb() is
> > required before reading the data from the location it was copied to.
> > > Ensuring that STORE operations are ordered (rte_wmb) might not
> > suffice. However, I'm not a CPU expert, so I will seek advice from
> > > more qualified people in the community on this.
> >
> > I think for IA sfence is enough, see citation below, for other
> > architectures - no idea.
> > What I am trying to say - it needs to be the *same* function on all
> > archs we support.
> 
> Now I get it: rte_wmb() might be appropriate on x86, but if any other
> architecture requires something else, we should add a new common function
> for flushing, e.g. rte_memcpy_nt_flush().
> 
> >
> > IA SW optimization manual:
> > 9.4.2 Streaming Store Usage Models
> > The two primary usage domains for streaming store are coherent
> > requests and non-coherent requests.
> > 9.4.2.1 Coherent Requests
> > Coherent requests are normal loads and stores to system memory, which
> > may also hit cache lines present in another processor in a
> > multiprocessor environment. With coherent requests, a streaming store
> > can be used in the same way as a regular store that has been mapped
> > with a WC memory type (PAT or MTRR). An SFENCE instruction must be
> > used within a producer-consumer usage model in order to ensure
> > coherency and visibility of data between processors.
> > Within a single-processor system, the CPU can also re-read the same
> > memory location and be assured of coherence (that is, a single,
> > consistent view of this memory location).
> > The same is true for a multiprocessor
> > (MP) system, assuming an accepted MP software producer-consumer
> > synchronization policy is employed.
> >
> 
> With this reference, I am convinced that you are right about the SFENCE. This
> puts a checkmark on this item on my TODO list for the patch. Thank you,
> Konstantin!
> 
> Any ARM CPU experts on the mailing list seeing this, not on vacation?
> @Honnappa, I'm looking at you. :-)
> 
> Summing up, the question is:
> 
> After a bunch of *non-temporal* stores (STNP instruction) on ARM
> architecture, does calling rte_wmb() suffice to ensure the data is visible across
> the system?
Apologies for the late response; the docs did not have enough information. The internal dialogue is still going on, but I have some information now. There is some information in the ArmV8 programmer's guide [1], though it is not complete.
In summary, rte_wmb()/rte_mb() would not suffice; we need new APIs.

From my perspective, I see several scenarios:
1)	Need for ordering before the memcpy_nt. Here there are several cases:
	a.	LD – LDNP/STNP – DMB NSHLD
	b.	ST – LDNP/STNP – DMB NSH
2)	Need for ordering after the memcpy. Again, we have the similar use cases:
	a.	LDNP/STNP – LD – DMB NSH
	b.	LDNP/STNP – ST – DMB NSH

The 'ST - STNP' and 'STNP - ST' cases do not apply here, but it would be good to add an API for completeness.

So, maybe we could have rte_[r|w]mb_nt() APIs.
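
Purely as an illustration (these are not existing DPDK APIs, and the 
exact DMB variant for each case above is still being confirmed), such 
wrappers could look roughly like:

#if defined(__aarch64__)
/* Hypothetical sketch only; names and barrier choices follow the
 * scenarios listed above and are still under discussion. */
static inline void rte_rmb_nt(void)
{
	asm volatile("dmb nshld" : : : "memory");	/* e.g. case 1a */
}

static inline void rte_wmb_nt(void)
{
	asm volatile("dmb nsh" : : : "memory");	/* e.g. cases 1b/2a/2b */
}
#endif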

[1] https://developer.arm.com/documentation/den0024/a/The-A64-instruction-set/Memory-access-instructions/Non-temporal-load-and-store-pair

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-10 12:18         ` Morten Brørup
@ 2022-08-10 21:20           ` Honnappa Nagarahalli
  2022-08-11 11:53             ` Mattias Rönnblom
  0 siblings, 1 reply; 57+ messages in thread
From: Honnappa Nagarahalli @ 2022-08-10 21:20 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, Stephen Hemminger
  Cc: dev, Bruce Richardson, Konstantin Ananyev, Jan Viktorin,
	Ruifeng Wang, David Christensen, Stanislaw Kardach, nd, nd

<snip>

> 
> > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > Sent: Wednesday, 10 August 2022 13.56
> >
> > On 2022-08-09 17:26, Stephen Hemminger wrote:
> 
> [...]
> 
> >
> > Alignment seems like a non-issue to me. A NT-store memcpy() can be
> > made free of alignment requirements, incurring only a very slight cost
> > for the always-aligned case (who has their data always 16-byte aligned
> > anyways?).
> >
> > The memory barrier required on x86 seems like a bigger issue.
> >
> > > Maybe rte_non_cache_copy()?
> > >
> >
> > rte_memcpy_nt_weakly_ordered(), or rte_memcpy_nt_weak(). And a
> > rte_memcpy_nt() with the sfence is place, which the user hopefully
> > will find first? I don't know. I would prefer not having the weak
> > variant at all.
I think providing weakly ordered version is required to offset the cost of the barriers. One might be able to copy multiple packets and then issue a barrier.
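
For example (a usage sketch only; rte_memcpy_nt_weak() is an assumed 
name for the barrier-less variant discussed here, not an existing API):

#include <rte_atomic.h>
#include <rte_mbuf.h>

static void
capture_burst(void *cap[], struct rte_mbuf *pkts[], uint16_t nb_pkts)
{
	uint16_t i;

	/* weakly ordered NT copies, no per-copy barrier */
	for (i = 0; i < nb_pkts; i++)
		rte_memcpy_nt_weak(cap[i],
				rte_pktmbuf_mtod(pkts[i], const void *),
				rte_pktmbuf_data_len(pkts[i]));

	/* one ordering point for the whole burst; on x86 this boils down
	 * to sfence, on Arm a dedicated NT barrier may be needed instead */
	rte_wmb();
}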

> >
> > Accepting weak memory ordering (i.e., no sfence) could also be one of
> > the flags, assuming rte_memcpy_nt() would have a flags parameter.
> > Default is safe (=memcpy() semantics), but potentially slower.
> 
> Excellent idea!
> 
> >
> > > Want to avoid the naive user just doing s/memcpy/rte_memcpy_nt/ and
> > expect
> > > everything to work.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-08-10 21:05                     ` Honnappa Nagarahalli
@ 2022-08-11 11:50                       ` Mattias Rönnblom
  2022-08-11 16:26                         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-11 11:50 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Morten Brørup, Konstantin Ananyev,
	Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach, nd

On 2022-08-10 23:05, Honnappa Nagarahalli wrote:
> <snip>
> 
>>
>> +TO: @Honnappa, we need input from ARM
>>
>>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
>>> Sent: Friday, 29 July 2022 21.49
>>>>
>>>>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
>>>>> Sent: Friday, 29 July 2022 14.14
>>>>>
>>>>>
>>>>> Sorry, missed that part.
>>>>>
>>>>>>
>>>>>>> Another question - who will do 'sfence' after the copying?
>>>>>>> Would it be inside memcpy_nt (seems quite costly), or would it
>>>>>>> be another API function for that: memcpy_nt_flush() or so?
>>>>>>
>>>>>> Outside. Only the developer knows when it is required, so it
>>> wouldn't
>>>>> make any sense to add the cost inside memcpy_nt().
>>>>>>
>>>>>> I don't think we should add a flush function; it would just be
>>>>> another name for an already existing function. Referring to the
>>>>> required
>>>>>> operation in the memcpy_nt() function documentation should
>>> suffice.
>>>>>>
>>>>>
>>>>> Ok, but again wouldn't it be arch specific?
>>>>> AFAIK for x86 it needs to boil down to sfence, for other
>>> architectures
>>>>> - I don't know.
>>>>> If you think there already is some generic one (rte_wmb?) that
>>> would
>>>>> always produce
>>>>> correct instructions - sure let's use it.
>>>>>
>>>>
>>>> DPDK has generic functions to wrap architecture specific stuff like
>>> memory barriers.
>>>>
>>>> Because they are non-temporal stores, I suspect that rte_mb() is
>>> required before reading the data from the location it was copied to.
>>>> Ensuring that STORE operations are ordered (rte_wmb) might not
>>> suffice. However, I'm not a CPU expert, so I will seek advice from
>>>> more qualified people in the community on this.
>>>
>>> I think for IA sfence is enough, see citation below, for other
>>> architectures - no idea.
>>> What I am trying to say - it needs to be the *same* function on all
>>> archs we support.
>>
>> Now I get it: rte_wmb() might be appropriate on x86, but if any other
>> architecture requires something else, we should add a new common function
>> for flushing, e.g. rte_memcpy_nt_flush().
>>
>>>
>>> IA SW optimization manual:
>>> 9.4.2 Streaming Store Usage Models
>>> The two primary usage domains for streaming store are coherent
>>> requests and non-coherent requests.
>>> 9.4.2.1 Coherent Requests
>>> Coherent requests are normal loads and stores to system memory, which
>>> may also hit cache lines present in another processor in a
>>> multiprocessor environment. With coherent requests, a streaming store
>>> can be used in the same way as a regular store that has been mapped
>>> with a WC memory type (PAT or MTRR). An SFENCE instruction must be
>>> used within a producer-consumer usage model in order to ensure
>>> coherency and visibility of data between processors.
>>> Within a single-processor system, the CPU can also re-read the same
>>> memory location and be assured of coherence (that is, a single,
>>> consistent view of this memory location).
>>> The same is true for a multiprocessor
>>> (MP) system, assuming an accepted MP software producer-consumer
>>> synchronization policy is employed.
>>>
>>
>> With this reference, I am convinced that you are right about the SFENCE. This
>> puts a checkmark on this item on my TODO list for the patch. Thank you,
>> Konstantin!
>>
>> Any ARM CPU experts on the mailing list seeing this, not on vacation?
>> @Honnappa, I'm looking at you. :-)
>>
>> Summing up, the question is:
>>
>> After a bunch of *non-temporal* stores (STNP instruction) on ARM
>> architecture, does calling rte_wmb() suffice to ensure the data is visible across
>> the system?
> Apologies for the late response, the docs did not have enough information. The internal dialogue is still going on, but I have some information now. There is some information in ArmV8 programmer's guide [1], though it is not complete.
> In summary, rte_wmb()/rte_mb() would not suffice, we need new APIs.
> 
>  From my perspective, I see several scenarios:
> 1)	Need for ordering before the memcpy_nt. Here there are several cases:
> 	a.	LD – LDNP/STNP – DMB NSHLD
> 	b.	ST – LDNP/STNP – DMB NSH
> 2)	Need for ordering after the memcpy. Again, we have the similar use cases:
> 	a.	LDNP/STNP – LD – DMB NSH
> 	b.	LDNP/STNP – ST – DMB NSH
> 
> The 'ST - STNP' and 'STNP - ST' do not apply here, but good to add an API for completion.
> 
> So, may be we could have rte_[r|w]mb_nt() APIs.
> 

Is rte_smp_rmb()/rte_smp_wmb() also not enough on ARM?

> [1] https://developer.arm.com/documentation/den0024/a/The-A64-instruction-set/Memory-access-instructions/Non-temporal-load-and-store-pair

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC v2] non-temporal memcpy
  2022-08-10 21:20           ` Honnappa Nagarahalli
@ 2022-08-11 11:53             ` Mattias Rönnblom
  2022-08-11 22:24               ` Honnappa Nagarahalli
  0 siblings, 1 reply; 57+ messages in thread
From: Mattias Rönnblom @ 2022-08-11 11:53 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Morten Brørup, Stephen Hemminger
  Cc: dev, Bruce Richardson, Konstantin Ananyev, Jan Viktorin,
	Ruifeng Wang, David Christensen, Stanislaw Kardach, nd

On 2022-08-10 23:20, Honnappa Nagarahalli wrote:
> <snip>
> 
>>
>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>> Sent: Wednesday, 10 August 2022 13.56
>>>
>>> On 2022-08-09 17:26, Stephen Hemminger wrote:
>>
>> [...]
>>
>>>
>>> Alignment seems like a non-issue to me. A NT-store memcpy() can be
>>> made free of alignment requirements, incurring only a very slight cost
>>> for the always-aligned case (who has their data always 16-byte aligned
>>> anyways?).
>>>
>>> The memory barrier required on x86 seems like a bigger issue.
>>>
>>>> Maybe rte_non_cache_copy()?
>>>>
>>>
>>> rte_memcpy_nt_weakly_ordered(), or rte_memcpy_nt_weak(). And a
>>> rte_memcpy_nt() with the sfence is place, which the user hopefully
>>> will find first? I don't know. I would prefer not having the weak
>>> variant at all.
> I think providing weakly ordered version is required to offset the cost of the barriers. One might be able to copy multiple packets and then issue a barrier.
> 

On what architecture?

I assumed that only x86 had the peculiar property of having different 
memory models for regular and NT load/stores.

>>>
>>> Accepting weak memory ordering (i.e., no sfence) could also be one of
>>> the flags, assuming rte_memcpy_nt() would have a flags parameter.
>>> Default is safe (=memcpy() semantics), but potentially slower.
>>
>> Excellent idea!
>>
>>>
>>>> Want to avoid the naive user just doing s/memcpy/rte_memcpy_nt/ and
>>> expect
>>>> everything to work.
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-11 11:50                       ` Mattias Rönnblom
@ 2022-08-11 16:26                         ` Honnappa Nagarahalli
  0 siblings, 0 replies; 57+ messages in thread
From: Honnappa Nagarahalli @ 2022-08-11 16:26 UTC (permalink / raw)
  To: Mattias Rönnblom, Morten Brørup, Konstantin Ananyev,
	Konstantin Ananyev, dev, Bruce Richardson
  Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach, nd, nd

<snip>

> >>
> >> +TO: @Honnappa, we need input from ARM
> >>
> >>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> >>> Sent: Friday, 29 July 2022 21.49
> >>>>
> >>>>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> >>>>> Sent: Friday, 29 July 2022 14.14
> >>>>>
> >>>>>
> >>>>> Sorry, missed that part.
> >>>>>
> >>>>>>
> >>>>>>> Another question - who will do 'sfence' after the copying?
> >>>>>>> Would it be inside memcpy_nt (seems quite costly), or would it
> >>>>>>> be another API function for that: memcpy_nt_flush() or so?
> >>>>>>
> >>>>>> Outside. Only the developer knows when it is required, so it
> >>> wouldn't
> >>>>> make any sense to add the cost inside memcpy_nt().
> >>>>>>
> >>>>>> I don't think we should add a flush function; it would just be
> >>>>> another name for an already existing function. Referring to the
> >>>>> required
> >>>>>> operation in the memcpy_nt() function documentation should
> >>> suffice.
> >>>>>>
> >>>>>
> >>>>> Ok, but again wouldn't it be arch specific?
> >>>>> AFAIK for x86 it needs to boil down to sfence, for other
> >>> architectures
> >>>>> - I don't know.
> >>>>> If you think there already is some generic one (rte_wmb?) that
> >>> would
> >>>>> always produce
> >>>>> correct instructions - sure let's use it.
> >>>>>
> >>>>
> >>>> DPDK has generic functions to wrap architecture specific stuff like
> >>> memory barriers.
> >>>>
> >>>> Because they are non-temporal stores, I suspect that rte_mb() is
> >>> required before reading the data from the location it was copied to.
> >>>> Ensuring that STORE operations are ordered (rte_wmb) might not
> >>> suffice. However, I'm not a CPU expert, so I will seek advice from
> >>>> more qualified people in the community on this.
> >>>
> >>> I think for IA sfence is enough, see citation below, for other
> >>> architectures - no idea.
> >>> What I am trying to say - it needs to be the *same* function on all
> >>> archs we support.
> >>
> >> Now I get it: rte_wmb() might be appropriate on x86, but if any other
> >> architecture requires something else, we should add a new common
> >> function for flushing, e.g. rte_memcpy_nt_flush().
> >>
> >>>
> >>> IA SW optimization manual:
> >>> 9.4.2 Streaming Store Usage Models
> >>> The two primary usage domains for streaming store are coherent
> >>> requests and non-coherent requests.
> >>> 9.4.2.1 Coherent Requests
> >>> Coherent requests are normal loads and stores to system memory,
> >>> which may also hit cache lines present in another processor in a
> >>> multiprocessor environment. With coherent requests, a streaming
> >>> store can be used in the same way as a regular store that has been
> >>> mapped with a WC memory type (PAT or MTRR). An SFENCE instruction
> >>> must be used within a producer-consumer usage model in order to
> >>> ensure coherency and visibility of data between processors.
> >>> Within a single-processor system, the CPU can also re-read the same
> >>> memory location and be assured of coherence (that is, a single,
> >>> consistent view of this memory location).
> >>> The same is true for a multiprocessor
> >>> (MP) system, assuming an accepted MP software producer-consumer
> >>> synchronization policy is employed.
> >>>
> >>
> >> With this reference, I am convinced that you are right about the
> >> SFENCE. This puts a checkmark on this item on my TODO list for the
> >> patch. Thank you, Konstantin!
> >>
> >> Any ARM CPU experts on the mailing list seeing this, not on vacation?
> >> @Honnappa, I'm looking at you. :-)
> >>
> >> Summing up, the question is:
> >>
> >> After a bunch of *non-temporal* stores (STNP instruction) on ARM
> >> architecture, does calling rte_wmb() suffice to ensure the data is
> >> visible across the system?
> > Apologies for the late response, the docs did not have enough information.
> The internal dialogue is still going on, but I have some information now.
> There is some information in ArmV8 programmer's guide [1], though it is not
> complete.
> > In summary, rte_wmb()/rte_mb() would not suffice, we need new APIs.
> >
> >  From my perspective, I see several scenarios:
> > 1)	Need for ordering before the memcpy_nt. Here there are several
> cases:
> > 	a.	LD – LDNP/STNP – DMB NSHLD
> > 	b.	ST – LDNP/STNP – DMB NSH
> > 2)	Need for ordering after the memcpy. Again, we have the similar use
> cases:
> > 	a.	LDNP/STNP – LD – DMB NSH
> > 	b.	LDNP/STNP – ST – DMB NSH
> >
> > The 'ST - STNP' and 'STNP - ST' do not apply here, but good to add an API for
> completion.
> >
> > So, may be we could have rte_[r|w]mb_nt() APIs.
> >
> 
> Is rte_smp_rmb()/rte_smp_wmb() also not enough on ARM?
No, they are not, as they fall under the inner shareable domain, whereas non-temporal loads/stores fall under the non-shareable domain.

> 
> > [1]
> > https://developer.arm.com/documentation/den0024/a/The-A64-
> instruction-
> > set/Memory-access-instructions/Non-temporal-load-and-store-pair

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-11 11:53             ` Mattias Rönnblom
@ 2022-08-11 22:24               ` Honnappa Nagarahalli
  0 siblings, 0 replies; 57+ messages in thread
From: Honnappa Nagarahalli @ 2022-08-11 22:24 UTC (permalink / raw)
  To: Mattias Rönnblom, Morten Brørup, Stephen Hemminger
  Cc: dev, Bruce Richardson, Konstantin Ananyev, Jan Viktorin,
	Ruifeng Wang, David Christensen, Stanislaw Kardach, nd, nd

<snip>

> >
> >>
> >>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >>> Sent: Wednesday, 10 August 2022 13.56
> >>>
> >>> On 2022-08-09 17:26, Stephen Hemminger wrote:
> >>
> >> [...]
> >>
> >>>
> >>> Alignment seems like a non-issue to me. A NT-store memcpy() can be
> >>> made free of alignment requirements, incurring only a very slight
> >>> cost for the always-aligned case (who has their data always 16-byte
> >>> aligned anyways?).
> >>>
> >>> The memory barrier required on x86 seems like a bigger issue.
> >>>
> >>>> Maybe rte_non_cache_copy()?
> >>>>
> >>>
> >>> rte_memcpy_nt_weakly_ordered(), or rte_memcpy_nt_weak(). And a
> >>> rte_memcpy_nt() with the sfence is place, which the user hopefully
> >>> will find first? I don't know. I would prefer not having the weak
> >>> variant at all.
> > I think providing weakly ordered version is required to offset the cost of the
> barriers. One might be able to copy multiple packets and then issue a barrier.
> >
> 
> On what architecture?
I am talking about Arm architecture. Arm architecture needs barriers between normal and NT operations.

> 
> I assumed that only x86 had the peculiar property of having different memory
> models for regular and NT load/stores.
> 
> >>>
> >>> Accepting weak memory ordering (i.e., no sfence) could also be one
> >>> of the flags, assuming rte_memcpy_nt() would have a flags parameter.
> >>> Default is safe (=memcpy() semantics), but potentially slower.
> >>
> >> Excellent idea!
> >>
> >>>
> >>>> Want to avoid the naive user just doing s/memcpy/rte_memcpy_nt/ and
> >>> expect
> >>>> everything to work.
> >

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [RFC v2] non-temporal memcpy
  2022-08-09 11:53                     ` Mattias Rönnblom
@ 2022-10-09 16:16                       ` Morten Brørup
  0 siblings, 0 replies; 57+ messages in thread
From: Morten Brørup @ 2022-10-09 16:16 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, Stephen Hemminger, Konstantin Ananyev, Bruce Richardson,
	Honnappa Nagarahalli

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 9 August 2022 13.53
> 
> On 2022-08-09 11:24, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Sunday, 7 August 2022 22.41
> >>
> >> On 2022-07-29 18:05, Stephen Hemminger wrote:
> >>>
> >>> It makes sense in a few select places to use non-temporal copy.
> >>> But it would add unnecessary complexity to DPDK if every function
> in
> >> DPDK that could
> >>> cause a copy had a non-temporal variant.
> >>
> >> A NT load and NT store variant, plus a NT load+store variant. :)
> >
> > I considered this, but it adds complexity, and our use case only
> needs the NT load+store. So I decided to only provide that variant.
> >
> > I can prepare the API for all four combinations. The extended
> function would be renamed from rte_memcpy_nt_ex() to just
> rte_memcpy_ex(). And the rte_memcpy_nt() would be omitted, rather than
> just perform rte_memcpy_ex(dst,src,len,F_DST_NT|F_SRC_NT).
> >
> > What does the community prefer in this regard?
> >
> 
> I would suggest just having a single function, with a flags parameter or
> an enum to signify whether load, store or both should be non-temporal.
> Whether all platforms honor all combinations is a different matter.

Good input, thank you!

I have finally released a patch, and am iterating through versions to fix minor bugs detected by the CI system.

The public API is now a single rte_memcpy_ex(dst, src, len, flags) function, where the flags are also used to request non-temporal load and/or store.
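
For example (the flag names below are assumed for illustration; see the 
posted patch for the exact spelling):

	/* NT load from the packet data, NT store into the capture buffer. */
	rte_memcpy_ex(cap_buf, pkt_data, pkt_len,
			RTE_MEMOPS_F_SRC_NT | RTE_MEMOPS_F_DST_NT);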

> 
> Is there something that suggests that this particular use case will be
> more common than others? When I've used non-temporal memcpy(), only the
> store side was NT, since the application would go on and use the source
> data.

OK. For completeness, all three variants are now implemented: NT destination, NT source, and NT source and destination.

> 
> >>
> >>>
> >>> Maybe just having rte_memcpy have a threshold (config value?) that
> if
> >> copy is larger than
> >>> a certain size, then it would automatically be non-temporal.  Small
> >> copies wouldn't matter,
> >>> the optimization is more about not stopping cache size issues with
> >> large streams of data.
> >>
> >> I don't think there's any way for rte_memcpy() to know if the
> >> application plan to use the source, the destination, both, or
> neither
> >> of
> >> the buffers in the immediate future.
> >
> > Agree. Which is why explicit NT function variants should be offered.
> >
> >> For huge copies (MBs or more) the
> >> size heuristic makes sense, but for medium sized copies (say a
> packet
> >> worth of data), I'm not so sure.
> >
> > This is the behavior of glibc memcpy().
> >
> 
> Yes, but, from what I can tell, glibc issues a sfence at the end of the
> copy.
> 
> Having a non-temporal memcpy() with a different memory model than the
> compiler intrinsic memcpy(), the glibc memcpy() and the DPDK
> rte_memcpy() implementations seems like asking for trouble.
> 
> >>
> >> What is unclear to me is if there is a benefit (or drawback) of
> using
> >> the imaginary rte_memcpy_nt(), compared to doing rte_memcpy() +
> >> clflushopt or cldemote, in the typical use case (if there is such).
> >>
> >
> > Our use case is packet capture (copying) to memory, where the copies
> will be read much later, so there is no need to pollute the cache with
> the copies.
> >
> 
> If you flush/demote the cache line you've used more or less
> immediately,
> there won't be much pollution. Especially if you include the
> clflushopt/cldemote into the copying routine, as opposed to a large
> flush at the end.

The source data may already be in cache, and some applications might continue using it after the non-temporal memcpy; in this case, flushing the source data cache would be counterproductive.

However, flushing the destination cache might be simpler than using the non-temporal store instructions. Unfortunately, I didn't have time to explore this alternative.

> 
> I haven't tried this in practice, but it seems to me it's an option
> worth exploring. It could be a way to implement a portable NT memcpy(),
> if nothing else.
> 
> > Our application also doesn't look deep inside the original packets
> after copying them, there is also no need to pollute the cache with the
> originals.
> >
> 
> See above.
> 
> > And even though the application looked partially into the packets
> before copying them (and thus they are partially in cache) using NT
> load (instead of normal load) has no additional cost.
> >


^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2022-10-09 16:16 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-19 15:26 [RFC v2] non-temporal memcpy Morten Brørup
2022-07-19 18:00 ` David Christensen
2022-07-19 18:41   ` Morten Brørup
2022-07-19 18:51     ` Stanisław Kardach
2022-07-19 22:15       ` Morten Brørup
2022-07-21 23:19 ` Konstantin Ananyev
2022-07-22 10:44   ` Morten Brørup
2022-07-24 13:35     ` Konstantin Ananyev
2022-07-24 22:18       ` Morten Brørup
2022-07-29 10:00         ` Konstantin Ananyev
2022-07-29 10:46           ` Morten Brørup
2022-07-29 11:50             ` Konstantin Ananyev
2022-07-29 17:17               ` Morten Brørup
2022-07-29 22:00                 ` Konstantin Ananyev
2022-07-30  9:51                   ` Morten Brørup
2022-08-02  9:05                     ` Konstantin Ananyev
2022-07-29 12:13             ` Konstantin Ananyev
2022-07-29 16:05               ` Stephen Hemminger
2022-07-29 17:29                 ` Morten Brørup
2022-08-07 20:40                 ` Mattias Rönnblom
2022-08-09  9:24                   ` Morten Brørup
2022-08-09 11:53                     ` Mattias Rönnblom
2022-10-09 16:16                       ` Morten Brørup
2022-07-29 18:13               ` Morten Brørup
2022-07-29 19:49                 ` Konstantin Ananyev
2022-07-29 20:26                   ` Morten Brørup
2022-07-29 21:34                     ` Konstantin Ananyev
2022-08-07 20:20                     ` Mattias Rönnblom
2022-08-09  9:34                       ` Morten Brørup
2022-08-09 11:56                         ` Mattias Rönnblom
2022-08-10 21:05                     ` Honnappa Nagarahalli
2022-08-11 11:50                       ` Mattias Rönnblom
2022-08-11 16:26                         ` Honnappa Nagarahalli
2022-07-25  1:17       ` Honnappa Nagarahalli
2022-07-27 10:26         ` Morten Brørup
2022-07-27 17:37           ` Honnappa Nagarahalli
2022-07-27 18:49             ` Morten Brørup
2022-07-27 19:12               ` Stephen Hemminger
2022-07-28  9:00                 ` Morten Brørup
2022-07-27 19:52               ` Honnappa Nagarahalli
2022-07-27 22:02                 ` Stanisław Kardach
2022-07-28 10:51                   ` Morten Brørup
2022-07-29  9:21                     ` Konstantin Ananyev
2022-08-07 20:25 ` Mattias Rönnblom
2022-08-09  9:46   ` Morten Brørup
2022-08-09 12:05     ` Mattias Rönnblom
2022-08-09 15:00       ` Morten Brørup
2022-08-10 11:47         ` Mattias Rönnblom
2022-08-09 15:26     ` Stephen Hemminger
2022-08-09 17:24       ` Morten Brørup
2022-08-10 11:59         ` Mattias Rönnblom
2022-08-10 12:12           ` Morten Brørup
2022-08-10 11:55       ` Mattias Rönnblom
2022-08-10 12:18         ` Morten Brørup
2022-08-10 21:20           ` Honnappa Nagarahalli
2022-08-11 11:53             ` Mattias Rönnblom
2022-08-11 22:24               ` Honnappa Nagarahalli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).