DPDK patches and discussions
From: Manish Sharma <manish.sharmajee75@gmail.com>
To: Bruce Richardson <bruce.richardson@intel.com>
Cc: "Morten Brørup" <mb@smartsharesystems.com>, dev@dpdk.org
Subject: Re: [dpdk-dev] rte_memcpy - fence and stream
Date: Thu, 27 May 2021 22:39:59 +0530	[thread overview]
Message-ID: <CAOT2AHVrMkq=6nw-YLu+Bq34w-u4LJARSmoRgjia4m5DR3g0LA@mail.gmail.com> (raw)
In-Reply-To: <YK/H718QhozDFVlX@bricha3-MOBL.ger.corp.intel.com>

For the case I have, hardly 2% of the data buffers being copied ever get
looked at by the CPU - mostly it's for DMA. Having a version of the DPDK
memcpy that does non-temporal copies would definitely be good.
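
Roughly what I have in mind is something like the sketch below - this is
only an illustration, not existing DPDK code: the function names are made
up, the destination is assumed to be 64-byte aligned (as the streaming
store requires), and the caller still needs one sfence after the last
store of the whole copy:

#include <stdint.h>
#include <immintrin.h>

/*
 * Illustration only: copy 64 bytes using a regular load and a
 * non-temporal (streaming) store, so the destination data bypasses
 * the cache. dst is assumed to be 64-byte aligned, as MOVNTDQ
 * requires. A streaming load (_mm512_stream_load_si512) could be used
 * as well, but it mainly matters for write-combining memory.
 */
static inline void
copy64_nt(uint8_t *dst, const uint8_t *src)
{
        __m512i zmm0;

        zmm0 = _mm512_loadu_si512((const void *)src);
        _mm512_stream_si512((__m512i *)dst, zmm0);
}

/*
 * After the last streaming store of a copy, issue one sfence so the
 * stores become globally visible before the buffer is handed to
 * another core or to a device.
 */
static inline void
copy_nt_flush(void)
{
        _mm_sfence();
}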

If, in my case, I have a lot of CPUs doing the copies in parallel, would
the IOAT copy accelerator driver still help?

On Thu, May 27, 2021 at 9:55 PM Bruce Richardson <bruce.richardson@intel.com>
wrote:

> On Thu, May 27, 2021 at 05:49:19PM +0200, Morten Brørup wrote:
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > > Sent: Tuesday, 25 May 2021 11.20
> > >
> > > On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> > > > I am looking at the source for rte_memcpy (this is a discussion only
> > > > for x86-64)
> > > >
> > > > For one of the cases, when aligned correctly, it uses
> > > >
> > > > /**
> > > >  * Copy 64 bytes from one location to another,
> > > >  * locations should not overlap.
> > > >  */
> > > > static __rte_always_inline void
> > > > rte_mov64(uint8_t *dst, const uint8_t *src)
> > > > {
> > > >         __m512i zmm0;
> > > >
> > > >         zmm0 = _mm512_loadu_si512((const void *)src);
> > > >         _mm512_storeu_si512((void *)dst, zmm0);
> > > > }
> > > >
> > > > I had some questions about this:
> > > >
> >
> > [snip]
> >
> > > > 3. Why isn't the code using the stream variants,
> > > > _mm512_stream_load_si512 and friends? They would not pollute the
> > > > cache, so they should be better - unless the required fence
> > > > instructions cause a drop in performance?
> > > >
> > > Whether the stream variants perform better really depends on what
> > > you are trying to measure. However, in the vast majority of cases a
> > > core is making a copy of data to work on it, in which case having
> > > the data not in cache is going to cause massive stalls on the next
> > > access to that data as it's fetched from DRAM. Therefore, the best
> > > option is to use regular instructions, meaning that the data is in
> > > local cache afterwards, giving far better performance when the data
> > > is actually used.
> > >
> >
> > Good response, Bruce. And you are probably right about most use cases
> > looking like you describe.
> >
> > I'm certainly no expert on deep x86-64 optimization, but please let me
> > think out loud here...
> >
> > I can come up with a different scenario: one core is analyzing packet
> > headers and decides to copy some of the packets in their entirety
> > (e.g. using rte_pktmbuf_copy) for another core to process later, so it
> > enqueues the copies to a ring, which the other core dequeues from.
> >
> > The first core doesn't care about the packet contents, and the second
> > core will read the copy of the packet much later, because it needs to
> > process the packets in front of the queue first.
> >
> > Wouldn't this scenario benefit from an rte_memcpy variant that doesn't
> > pollute the cache?
> >
> > I know that rte_pktmbuf_clone might be better to use in the described
> > scenario, but rte_pktmbuf_copy must be there for a reason - and I
> > guess that some uses of rte_pktmbuf_copy could benefit from a
> > non-cache-polluting variant of rte_memcpy.
> >
>
> That is indeed a possible scenario, but in that case we would probably want
> to differentiate between different levels of cache. While we would not want
> the copy to be polluting the local L1 or L2 cache of the core doing the
> copy (unless the other core was a hyperthread), we probably would want any
> copy to be present in any shared caches, rather than all the way in DRAM.
> For Intel platforms and a scenario like the one you describe, I would actually
> recommend using the "ioat" driver copy accelerator if cache pollution is a
> concern. In the case of the copies being done in HW, the local cache of a
> core would not be polluted, but the copied data could still end up in LLC
> due to DDIO.
>
> In terms of memcpy functions, given the number of possible scenarios, in
> the absence of compelling data showing a meaningful benefit for a common
> scenario, I'd be wary about trying to provide specialized variants, since
> we could end up with a lot of them to maintain and tune.
>
> /Bruce
>
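
As a concrete picture of the scenario Morten describes above, the
producing core could look roughly like the sketch below - again only an
illustration, not existing DPDK code: should_copy(), copy_pool and
hand_off_ring are made-up placeholders for the application's own logic
and objects:

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

/* Placeholder for the application-specific test that selects packets. */
static int
should_copy(const struct rte_mbuf *m)
{
        (void)m;
        return 1;
}

/*
 * Sketch of the copy-and-enqueue pattern: core A makes full copies of
 * selected packets and hands them to core B through a ring; core B
 * dequeues and reads the copies much later.
 */
static void
copy_and_enqueue(struct rte_mbuf **pkts, uint16_t nb_rx,
                 struct rte_mempool *copy_pool,
                 struct rte_ring *hand_off_ring)
{
        uint16_t i;

        for (i = 0; i < nb_rx; i++) {
                struct rte_mbuf *dup;

                if (!should_copy(pkts[i]))
                        continue;

                /* Full copy: UINT32_MAX copies everything after offset 0. */
                dup = rte_pktmbuf_copy(pkts[i], copy_pool, 0, UINT32_MAX);
                if (dup == NULL)
                        continue;

                if (rte_ring_enqueue(hand_off_ring, dup) != 0)
                        rte_pktmbuf_free(dup);  /* ring full, drop the copy */
        }
}

A non-temporal copy (or the ioat engine) used underneath rte_pktmbuf_copy
is what would keep those copies out of core A's L1/L2 in this picture.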

Thread overview: 9+ messages
2021-05-24  8:49 [dpdk-dev] rte_memcpy Manish Sharma
2021-05-24 18:13 ` [dpdk-dev] rte_memcpy - fence and stream Manish Sharma
2021-05-25  9:20   ` Bruce Richardson
2021-05-27 15:49     ` Morten Brørup
2021-05-27 16:25       ` Bruce Richardson
2021-05-27 17:09         ` Manish Sharma [this message]
2021-05-27 17:22           ` Bruce Richardson
2021-05-27 18:15             ` Morten Brørup
2021-06-22 21:55               ` Morten Brørup
