DPDK patches and discussions
* [dpdk-dev] rte_memcpy
@ 2021-05-24  8:49 Manish Sharma
  2021-05-24 18:13 ` [dpdk-dev] rte_memcpy - fence and stream Manish Sharma
  0 siblings, 1 reply; 9+ messages in thread
From: Manish Sharma @ 2021-05-24  8:49 UTC (permalink / raw)
  To: dev

I am looking at the source for rte_memcpy (this is a discussion only for
x86-64)

For one of the cases, when aligned correctly, it uses

/**
 * Copy 64 bytes from one location to another,
 * locations should not overlap.
 */
static __rte_always_inline void
rte_mov64(uint8_t *dst, const uint8_t *src)
{
        __m512i zmm0;

        zmm0 = _mm512_loadu_si512((const void *)src);
        _mm512_storeu_si512((void *)dst, zmm0);
}

I had some questions about this:

1. I don't see any use of x86 fence (rmb, wmb) instructions. Is that
not required in this case, and if not, why isn't it needed?

2. Are _mm512_loadu_si512 and _mm512_storeu_si512 non-temporal?

3. Why isn't the code using the stream variants, _mm512_stream_load_si512 and
friends?

4. Does _mm512_stream_load_si512 need fence instructions?

TIA,
Manish

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [dpdk-dev] rte_memcpy - fence and stream
  2021-05-24  8:49 [dpdk-dev] rte_memcpy Manish Sharma
@ 2021-05-24 18:13 ` Manish Sharma
  2021-05-25  9:20   ` Bruce Richardson
  0 siblings, 1 reply; 9+ messages in thread
From: Manish Sharma @ 2021-05-24 18:13 UTC (permalink / raw)
  To: dev

I am looking at the source for rte_memcpy (this is a discussion only for
x86-64)

For one of the cases, when aligned correctly, it uses

/**
 * Copy 64 bytes from one location to another,
 * locations should not overlap.
 */
static __rte_always_inline void
rte_mov64(uint8_t *dst, const uint8_t *src)
{
        __m512i zmm0;

        zmm0 = _mm512_loadu_si512((const void *)src);
        _mm512_storeu_si512((void *)dst, zmm0);
}

I had some questions about this:

1. I don't see any use of x86 fence (rmb, wmb) instructions. Is that
not required in this case, and if not, why isn't it needed?

2. Are _mm512_loadu_si512 and _mm512_storeu_si512 non-temporal?

3. Why isn't the code using the stream variants, _mm512_stream_load_si512 and
friends? They would not pollute the cache, so they should be better - unless
the required fence instructions cause a drop in performance?

4. Does _mm512_stream_load_si512 need fence instructions? Based on my
reading of the spec, the answer is yes - but I wanted to confirm.

TIA,
Manish


* Re: [dpdk-dev] rte_memcpy - fence and stream
  2021-05-24 18:13 ` [dpdk-dev] rte_memcpy - fence and stream Manish Sharma
@ 2021-05-25  9:20   ` Bruce Richardson
  2021-05-27 15:49     ` Morten Brørup
  0 siblings, 1 reply; 9+ messages in thread
From: Bruce Richardson @ 2021-05-25  9:20 UTC (permalink / raw)
  To: Manish Sharma; +Cc: dev

On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> I am looking at the source for rte_memcpy (this is a discussion only for
> x86-64)
> 
> For one of the cases, when aligned correctly, it uses
> 
> /**
>  * Copy 64 bytes from one location to another,
>  * locations should not overlap.
>  */
> static __rte_always_inline void
> rte_mov64(uint8_t *dst, const uint8_t *src)
> {
>         __m512i zmm0;
> 
>         zmm0 = _mm512_loadu_si512((const void *)src);
>         _mm512_storeu_si512((void *)dst, zmm0);
> }
> 
> I had some questions about this:
> 
> 1. What I dont see is any use of x86 fence(rmb,wmb) instructions. Is that
> not required in this case and if not, why isnt it needed?
> 
> 2. Are the  mm512_loadu_si512 and  _mm512_storeu_si512 non temporal?
> 
They are not non-temporal, so they don't need fences.

> 3. Why isn't the code using  stream variants, _mm512_stream_load_si512 and
> friends? It would not pollute the cache, so should be better - unless the
> required fence instructions cause a drop in performance?
> 
Whether the stream variants perform better really depends on what you are
trying to measure. However, in the vast majority of cases a core is making
a copy of data to work on it, in which case having the data not in cache is
going to cause massive stalls on the next access to that data as it's
fetched from DRAM. Therefore, the best option is to use regular
instructions meaning that the data is in local cache afterwards, giving far
better performance when the data is actually used.

> 4. Do the _mm512_stream_load_si512 need fence instructions? Based on my
> reading of the spec, the answer is yes - but wanted to confirm.
>
I believe a fence would be necessary for safety, yes. 
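To make the fence pairing concrete: non-temporal stores are weakly ordered, so an sfence is needed before the copied data is published to another agent (note also that, as I understand the Intel docs, the stream *load* is only a hint and mainly matters on write-combining memory). A minimal sketch using the more widely available SSE2 intrinsics - illustrative only, not the actual DPDK implementation:

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stdint.h>

/* Illustrative sketch (not DPDK code): copy 64 bytes with non-temporal
 * stores, then make them globally visible with sfence.
 * dst must be 16-byte aligned for _mm_stream_si128. */
static void
copy64_nt(uint8_t *dst, const uint8_t *src)
{
	int i;

	for (i = 0; i < 64; i += 16) {
		__m128i x = _mm_loadu_si128((const __m128i *)(src + i));
		_mm_stream_si128((__m128i *)(dst + i), x);
	}
	_mm_sfence();  /* NT stores are weakly ordered; fence before publishing */
}
```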


* Re: [dpdk-dev] rte_memcpy - fence and stream
  2021-05-25  9:20   ` Bruce Richardson
@ 2021-05-27 15:49     ` Morten Brørup
  2021-05-27 16:25       ` Bruce Richardson
  0 siblings, 1 reply; 9+ messages in thread
From: Morten Brørup @ 2021-05-27 15:49 UTC (permalink / raw)
  To: Bruce Richardson, Manish Sharma; +Cc: dev

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Tuesday, 25 May 2021 11.20
> 
> On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > I am looking at the source for rte_memcpy (this is a discussion only
> > for x86-64)
> >
> > For one of the cases, when aligned correctly, it uses
> >
> > /**
> >  * Copy 64 bytes from one location to another,
> >  * locations should not overlap.
> >  */
> > static __rte_always_inline void
> > rte_mov64(uint8_t *dst, const uint8_t *src)
> > {
> >         __m512i zmm0;
> >
> >         zmm0 = _mm512_loadu_si512((const void *)src);
> >         _mm512_storeu_si512((void *)dst, zmm0);
> > }
> >
> > I had some questions about this:
> >

[snip]

> > 3. Why isn't the code using  stream variants,
> > _mm512_stream_load_si512 and friends?
> > It would not pollute the cache, so should be better - unless
> > the required fence instructions cause a drop in performance?
> >
> Whether the stream variants perform better really depends on what you
> are
> trying to measure. However, in the vast majority of cases a core is
> making
> a copy of data to work on it, in which case having the data not in
> cache is
> going to cause massive stalls on the next access to that data as it's
> fetched from DRAM. Therefore, the best option is to use regular
> instructions meaning that the data is in local cache afterwards, giving
> far
> better performance when the data is actually used.
> 

Good response, Bruce. And you are probably right about most use cases looking like you describe.

I'm certainly no expert on deep x86-64 optimization, but please let me think out loud here...

I can come up with a different scenario: One core is analyzing packet headers, and determines to copy some of the packets in their entirety (e.g. using rte_pktmbuf_copy) for another core to process later, so it enqueues the copies to a ring, which the other core dequeues from.

The first core doesn't care about the packet contents, and the second core will read the copy of the packet much later, because it needs to process the packets in front of the queue first.

Wouldn't this scenario benefit from a rte_memcpy variant that doesn't pollute the cache?

I know that rte_pktmbuf_clone might be better to use in the described scenario, but rte_pktmbuf_copy must be there for a reason - and I guess that some uses of rte_pktmbuf_copy could benefit from a non-cache polluting variant of rte_memcpy.

-Morten


* Re: [dpdk-dev] rte_memcpy - fence and stream
  2021-05-27 15:49     ` Morten Brørup
@ 2021-05-27 16:25       ` Bruce Richardson
  2021-05-27 17:09         ` Manish Sharma
  0 siblings, 1 reply; 9+ messages in thread
From: Bruce Richardson @ 2021-05-27 16:25 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Manish Sharma, dev

On Thu, May 27, 2021 at 05:49:19PM +0200, Morten Brørup wrote:
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Tuesday, 25 May 2021 11.20
> > 
> > On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > > I am looking at the source for rte_memcpy (this is a discussion only
> > > for x86-64)
> > >
> > > For one of the cases, when aligned correctly, it uses
> > >
> > > /**
> > >  * Copy 64 bytes from one location to another,
> > >  * locations should not overlap.
> > >  */
> > > static __rte_always_inline void
> > > rte_mov64(uint8_t *dst, const uint8_t *src)
> > > {
> > >         __m512i zmm0;
> > >
> > >         zmm0 = _mm512_loadu_si512((const void *)src);
> > >         _mm512_storeu_si512((void *)dst, zmm0);
> > > }
> > >
> > > I had some questions about this:
> > >
> 
> [snip]
> 
> > > 3. Why isn't the code using  stream variants,
> > > _mm512_stream_load_si512 and friends?
> > > It would not pollute the cache, so should be better - unless
> > > the required fence instructions cause a drop in performance?
> > >
> > Whether the stream variants perform better really depends on what you
> > are
> > trying to measure. However, in the vast majority of cases a core is
> > making
> > a copy of data to work on it, in which case having the data not in
> > cache is
> > going to cause massive stalls on the next access to that data as it's
> > fetched from DRAM. Therefore, the best option is to use regular
> > instructions meaning that the data is in local cache afterwards, giving
> > far
> > better performance when the data is actually used.
> > 
> 
> Good response, Bruce. And you are probably right about most use cases looking like you describe.
> 
> I'm certainly no expert on deep x86-64 optimization, but please let me think out loud here...
> 
> I can come up with a different scenario: One core is analyzing packet headers, and determines to copy some of the packets in their entirety (e.g. using rte_pktmbuf_copy) for another core to process later, so it enqueues the copies to a ring, which the other core dequeues from.
> 
> The first core doesn't care about the packet contents, and the second core will read the copy of the packet much later, because it needs to process the packets in front of the queue first.
> 
> Wouldn't this scenario benefit from a rte_memcpy variant that doesn't pollute the cache?
> 
> I know that rte_pktmbuf_clone might be better to use in the described scenario, but rte_pktmbuf_copy must be there for a reason - and I guess that some uses of rte_pktmbuf_copy could benefit from a non-cache polluting variant of rte_memcpy.
> 

That is indeed a possible scenario, but in that case we would probably want
to differentiate between different levels of cache. While we would not want
the copy to be polluting the local L1 or L2 cache of the core doing the
copy (unless the other core was a hyperthread), we probably would want any
copy to be present in any shared caches, rather than all the way in DRAM.
For Intel platforms and a scenario which you describe, I would actually
recommend using the "ioat" driver copy accelerator if cache pollution is a
concern. In the case of the copies being done in HW, the local cache of a
core would not be polluted, but the copied data could still end up in LLC
due to DDIO.

In terms of memcpy functions, given the number of possible
scenarios, in the absence of compelling data showing a meaningful benefit
for a common scenario, I'd be wary about trying to provide specialized
variants, since we could end up with a lot of them to maintain and tune.

/Bruce


* Re: [dpdk-dev] rte_memcpy - fence and stream
  2021-05-27 16:25       ` Bruce Richardson
@ 2021-05-27 17:09         ` Manish Sharma
  2021-05-27 17:22           ` Bruce Richardson
  0 siblings, 1 reply; 9+ messages in thread
From: Manish Sharma @ 2021-05-27 17:09 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: Morten Brørup, dev

For the case I have, hardly 2% of the data buffers being copied ever
get looked at - mostly it's for DMA. Having a version of the DPDK memcpy
that does non-temporal copies would definitely be good.

If, in my case, I have a lot of CPUs doing the copy in parallel, would the
I/OAT copy accelerator driver still help?

On Thu, May 27, 2021 at 9:55 PM Bruce Richardson <bruce.richardson@intel.com>
wrote:

> On Thu, May 27, 2021 at 05:49:19PM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Tuesday, 25 May 2021 11.20
> > >
> > > On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > > > I am looking at the source for rte_memcpy (this is a discussion only
> > > > for x86-64)
> > > >
> > > > For one of the cases, when aligned correctly, it uses
> > > >
> > > > /**
> > > >  * Copy 64 bytes from one location to another,
> > > >  * locations should not overlap.
> > > >  */
> > > > static __rte_always_inline void
> > > > rte_mov64(uint8_t *dst, const uint8_t *src)
> > > > {
> > > >         __m512i zmm0;
> > > >
> > > >         zmm0 = _mm512_loadu_si512((const void *)src);
> > > >         _mm512_storeu_si512((void *)dst, zmm0);
> > > > }
> > > >
> > > > I had some questions about this:
> > > >
> >
> > [snip]
> >
> > > > 3. Why isn't the code using  stream variants,
> > > > _mm512_stream_load_si512 and friends?
> > > > It would not pollute the cache, so should be better - unless
> > > > the required fence instructions cause a drop in performance?
> > > >
> > > Whether the stream variants perform better really depends on what you
> > > are
> > > trying to measure. However, in the vast majority of cases a core is
> > > making
> > > a copy of data to work on it, in which case having the data not in
> > > cache is
> > > going to cause massive stalls on the next access to that data as it's
> > > fetched from DRAM. Therefore, the best option is to use regular
> > > instructions meaning that the data is in local cache afterwards, giving
> > > far
> > > better performance when the data is actually used.
> > >
> >
> > Good response, Bruce. And you are probably right about most use cases
> looking like you describe.
> >
> > I'm certainly no expert on deep x86-64 optimization, but please let me
> think out loud here...
> >
> > I can come up with a different scenario: One core is analyzing packet
> headers, and determines to copy some of the packets in their entirety (e.g.
> using rte_pktmbuf_copy) for another core to process later, so it enqueues
> the copies to a ring, which the other core dequeues from.
> >
> > The first core doesn't care about the packet contents, and the second
> core will read the copy of the packet much later, because it needs to
> process the packets in front of the queue first.
> >
> > Wouldn't this scenario benefit from a rte_memcpy variant that doesn't
> pollute the cache?
> >
> > I know that rte_pktmbuf_clone might be better to use in the described
> scenario, but rte_pktmbuf_copy must be there for a reason - and I guess
> that some uses of rte_pktmbuf_copy could benefit from a non-cache polluting
> variant of rte_memcpy.
> >
>
> That is indeed a possible scenario, but in that case we would probably want
> to differentiate between different levels of cache. While we would not want
> the copy to be polluting the local L1 or L2 cache of the core doing the
> copy (unless the other core was a hyperthread), we probably would want any
> copy to be present in any shared caches, rather than all the way in DRAM.
> For Intel platforms and a scenario which you describe, I would actually
> recommend using the "ioat" driver copy accelerator if cache pollution is a
> concern. In the case of the copies being done in HW, the local cache of a
> core would not be polluted, but the copied data could still end up in LLC
> due to DDIO.
>
> In terms of memcpy functions, given the number of possibilities of
> scenarios, in the absense of compelling data showing a meaningful benefit
> for a common scenario, I'd be wary about trying to provide specialized
> varients, since we could end up with a lot of them to maintain and tune.
>
> /Bruce
>


* Re: [dpdk-dev] rte_memcpy - fence and stream
  2021-05-27 17:09         ` Manish Sharma
@ 2021-05-27 17:22           ` Bruce Richardson
  2021-05-27 18:15             ` Morten Brørup
  0 siblings, 1 reply; 9+ messages in thread
From: Bruce Richardson @ 2021-05-27 17:22 UTC (permalink / raw)
  To: Manish Sharma; +Cc: Morten Brørup, dev

On Thu, May 27, 2021 at 10:39:59PM +0530, Manish Sharma wrote:
>    For the case I have, hardly 2% of the data buffers which are being
>    copied get looked at - mostly its for DMA. Having a version of DPDK
>    memcopy that does non temporal copies would definitely be good.
>    If in my case, I have a lot of CPUs doing the copy in parallel, would
>    I/OAT driver copy accelerator still help?
> 
It will depend upon the size of the copies being done. For bigger packets
the accelerator can help free up CPU cycles for other things.

However, if only 2% of the data which is being copied gets looked at, why
does it need to be copied? Can the original buffers not be used in that
case?

Regards,
/Bruce


* Re: [dpdk-dev] rte_memcpy - fence and stream
  2021-05-27 17:22           ` Bruce Richardson
@ 2021-05-27 18:15             ` Morten Brørup
  2021-06-22 21:55               ` Morten Brørup
  0 siblings, 1 reply; 9+ messages in thread
From: Morten Brørup @ 2021-05-27 18:15 UTC (permalink / raw)
  To: Bruce Richardson, Manish Sharma; +Cc: dev

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Thursday, 27 May 2021 19.22
> 
> On Thu, May 27, 2021 at 10:39:59PM +0530, Manish Sharma wrote:
> >    For the case I have, hardly 2% of the data buffers which are being
> >    copied get looked at - mostly its for DMA. Having a version of
> DPDK
> >    memcopy that does non temporal copies would definitely be good.
> >    If in my case, I have a lot of CPUs doing the copy in parallel,
> would
> >    I/OAT driver copy accelerator still help?
> >
> It will depend upon the size of the copies being done. For bigger
> packets
> the accelerator can help free up CPU cycles for other things.
> 
> However, if only 2% of the data which is being copied gets looked at,
> why
> does it need to be copied? Can the original buffers not be used in that
> case?

I can only speak for myself here...

Our firmware has a packet capture feature with a filter.

If a packet matches the capture filter, a metadata header and the relevant part of the packet contents ("snap length" in tcpdump terminology) is appended to a large memory area (the "capture buffer") using rte_pktmbuf_read/rte_memcpy. This capture buffer is only read through the GUI or management API by the network administrator, i.e. it will only be read minutes or hours later, so there is no need to put any of it in any CPU cache.
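The append step described above has roughly this shape (a hedged sketch - all names here are illustrative, not from the firmware in question; real code would use rte_pktmbuf_read and rte_rdtsc, and the second memcpy is the one that a non-temporal variant would replace):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative capture record: a metadata header followed by the first
 * snap_len bytes of the packet, written sequentially into a big buffer. */
struct cap_hdr {
	uint64_t timestamp;
	uint32_t orig_len;  /* full packet length */
	uint32_t cap_len;   /* bytes actually stored (<= snap length) */
};

static size_t
capture_append(uint8_t *cap_buf, size_t off,
	       const uint8_t *pkt, uint32_t pkt_len, uint32_t snap_len)
{
	struct cap_hdr h = {
		.timestamp = 0,  /* e.g. a TSC read in real code */
		.orig_len = pkt_len,
		.cap_len = pkt_len < snap_len ? pkt_len : snap_len,
	};

	memcpy(cap_buf + off, &h, sizeof(h));               /* metadata header */
	memcpy(cap_buf + off + sizeof(h), pkt, h.cap_len);  /* truncated payload */
	return off + sizeof(h) + h.cap_len;                 /* new write offset */
}
```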

It does not make sense to clone and hold on to many thousands of mbufs when we only need some of their contents. So we copy the contents instead of increasing the mbuf refcount.

We currently only use our packet capture feature for R&D purposes, so we have not optimized it yet. However, we will need to optimize it for production use at some point. So I find this discussion initiated by Manish very interesting.

-Morten



* Re: [dpdk-dev] rte_memcpy - fence and stream
  2021-05-27 18:15             ` Morten Brørup
@ 2021-06-22 21:55               ` Morten Brørup
  0 siblings, 0 replies; 9+ messages in thread
From: Morten Brørup @ 2021-06-22 21:55 UTC (permalink / raw)
  To: Manish Sharma; +Cc: dev, Bruce Richardson

> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Morten Brørup
> Sent: Thursday, 27 May 2021 20.15
> 
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Thursday, 27 May 2021 19.22
> >
> > On Thu, May 27, 2021 at 10:39:59PM +0530, Manish Sharma wrote:
> > >    For the case I have, hardly 2% of the data buffers which are
> being
> > >    copied get looked at - mostly its for DMA.

Which data buffers are you not looking at, Manish? The original data buffers, or the copies, or both?

> > >    Having a version of DPDK
> > >    memcopy that does non temporal copies would definitely be good.
> > >    If in my case, I have a lot of CPUs doing the copy in parallel,
> > would
> > >    I/OAT driver copy accelerator still help?
> > >
> > It will depend upon the size of the copies being done. For bigger
> > packets
> > the accelerator can help free up CPU cycles for other things.
> >
> > However, if only 2% of the data which is being copied gets looked at,
> > why
> > does it need to be copied? Can the original buffers not be used in
> that
> > case?
> 
> I can only speak for myself here...
> 
> Our firmware has a packet capture feature with a filter.
> 
> If a packet matches the capture filter, a metadata header and the
> relevant part of the packet contents ("snap length" in tcpdump
> terminology) is appended to a large memory area (the "capture buffer")
> using rte_pktmbuf_read/rte_memcpy. This capture buffer is only read
> through the GUI or management API by the network administrator, i.e. it
> will only be read minutes or hours later, so there is no need to put
> any of it in any CPU cache.
> 
> It does not make sense to clone and hold on to many thousands of mbufs
> when we only need some of their contents. So we copy the contents
> instead of increasing the mbuf refcount.
> 
> We currently only use our packet capture feature for R&D purposes, so
> we have not optimized it yet. However, we will need to optimize it for
> production use at some point. So I find this discussion initiated by
> Manish very interesting.
> 
> -Morten

Here's some code for inspiration. I haven't tested it yet, and it can be further optimized.

#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_stream_si128/si64/si32 */

/**
 * Copy 16 bytes from one location to another, using non-temporal storage
 * at the destination.
 * The locations must not overlap.
 *
 * @param dst
 *   Pointer to the destination of the data.
 *   Must be aligned on a 16-byte boundary.
 * @param src
 *   Pointer to the source data.
 *   Does not need to be aligned on any particular boundary.
 */
static __rte_always_inline void
rte_mov16_aligned16_non_temporal(uint8_t *dst, const uint8_t *src)
{
    __m128i xmm0;

    xmm0 = _mm_loadu_si128((const __m128i *)src);
    _mm_stream_si128((__m128i *)dst, xmm0);
}

/**
 * Copy bytes from one location to another, using non-temporal storage
 * at the destination.
 * The locations must not overlap.
 *
 * @param dst
 *   Pointer to the destination of the data.
 *   Must be aligned on a 16-byte boundary.
 * @param src
 *   Pointer to the source data.
 *   Does not need to be aligned on any particular boundary.
 * @param n
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 * @return
 *   Pointer to the destination data.
 */
static __rte_always_inline void *
rte_memcpy_aligned16_non_temporal(void *dst, const void *src, size_t n)
{
    void * const ret = dst;

    RTE_ASSERT(!((uintptr_t)dst & 0xF));
    RTE_ASSERT(!(n & 3));

    while (n >= 16) {
        rte_mov16_aligned16_non_temporal(dst, src);
        src = (const uint8_t *)src + 16;
        dst = (uint8_t *)dst + 16;
        n -= 16;
    }
    if (n & 8) {
        int64_t a = *(const int64_t *)src;
        _mm_stream_si64((long long int *)dst, a);
        src = (const uint8_t *)src + 8;
        dst = (uint8_t *)dst + 8;
        n -= 8;
    }
    if (n & 4) {
        int32_t a = *(const int32_t *)src;
        _mm_stream_si32((int32_t *)dst, a);
        src = (const uint8_t *)src + 4;
        dst = (uint8_t *)dst + 4;
        n -= 4;
    }

    return ret;
}
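One caveat worth noting about a routine like the above: the non-temporal stores are weakly ordered, so - as discussed earlier in the thread - the producer needs an sfence before it publishes the copy to another core (e.g. before the ring enqueue). A minimal hedged sketch of that publish step (names here are illustrative, not DPDK APIs):

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si32, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: stream data out, fence, then set a ready flag.
 * Without the sfence, the consumer could observe the flag before the
 * weakly ordered NT stores have become globally visible. */
static void
publish_nt(int32_t *dst, const int32_t *src, size_t n_words,
	   volatile int *ready)
{
	size_t i;

	for (i = 0; i < n_words; i++)
		_mm_stream_si32((int *)&dst[i], src[i]);  /* NT store */
	_mm_sfence();  /* order the NT stores before the store to *ready */
	*ready = 1;    /* consumer may now safely read dst */
}
```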



Thread overview: 9+ messages
2021-05-24  8:49 [dpdk-dev] rte_memcpy Manish Sharma
2021-05-24 18:13 ` [dpdk-dev] rte_memcpy - fence and stream Manish Sharma
2021-05-25  9:20   ` Bruce Richardson
2021-05-27 15:49     ` Morten Brørup
2021-05-27 16:25       ` Bruce Richardson
2021-05-27 17:09         ` Manish Sharma
2021-05-27 17:22           ` Bruce Richardson
2021-05-27 18:15             ` Morten Brørup
2021-06-22 21:55               ` Morten Brørup
