DPDK patches and discussions
From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Mattias Rönnblom" <hofors@lysator.liu.se>,
	konstantin.v.ananyev@yandex.ru, Honnappa.Nagarahalli@arm.com,
	stephen@networkplumber.org
Cc: <mattias.ronnblom@ericsson.com>, <bruce.richardson@intel.com>,
	<kda@semihalf.com>, <drc@linux.vnet.ibm.com>, <dev@dpdk.org>
Subject: RE: [PATCH] eal: non-temporal memcpy
Date: Mon, 10 Oct 2022 11:36:11 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D873C3@smartserver.smartshare.dk>
In-Reply-To: <730193b1-9574-ff59-28be-c1449cba0ffc@lysator.liu.se>

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 10 October 2022 10.59
> 
> On 2022-10-10 09:35, Morten Brørup wrote:
> > Mattias, Konstantin, Honnappa, Stephen,
> >
> > In my patch for non-temporal memcpy, I have been aiming to use
> > non-temporal stores as much as possible. E.g. copying 16 bytes to a
> > 16-byte aligned address will be done using non-temporal store
> > instructions.
> >
> > Now, I am seriously considering this alternative:
> >
> > Only using non-temporal stores for complete cache lines, and using
> > normal stores for partial cache lines.
> >
> 
> This is how I've done it in the past, in DPDK applications. That was
> both to simplify (and potentially optimize) the code somewhat, and
> because I had my doubts that there were any actual benefits from using
> non-temporal stores for the beginning or the end of the memory block.
> 
> That latter reason however, was pure conjecture. I think it would be
> great if Intel, ARM, AMD, IBM etc. DPDK developers could dig in the
> manuals or go find the appropriate CPU expert, to find out if that is
> true.
> 
> More specifically, my question is:
> 
> A) Consider a scenario where a core does a regular store against some
> cache line, and then pretty much immediately does a non-temporal store
> against a different address in the same cache line. How will this cache
> line be treated?
> 
> B) Consider the same scenario, but where no regular stores preceded (or
> followed) the non-temporal store, and the non-temporal stores performed
> did not cover the entirety of the cache line.
> 
> Scenario A) would be common in the beginning of the copy, in case
> there's a header preceding the data, and writing that header
> non-temporally might be cumbersome. Scenario B) would be common at the end
> of the copy. Both assuming copies of memory blocks which are not
> cache-line aligned.
> 

Yeah, I wish some CPU expert from Intel/AMD and ARM would provide these functions instead of me. ;-)

> > I think it will make things simpler when an application mixes normal
> > and non-temporal stores. E.g. an application writing metadata (a pcap
> > header) followed by packet data.
> >
> 
> The application *could* use NT stores for the pcap header as well.

Our application does this. It also ensures 16 byte alignment for the stores. So our NT memcpy function is relatively simple.

However, I didn't think the DPDK community would accept a contribution with a requirement that the destination must be 16-byte aligned and the length must be divisible by 16. So the patch needs to consider all weird alignments, and thus grew an order of magnitude larger than the NT memcpy function we have in our application. Much more work than anticipated. :-(
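
For illustration only, a minimal sketch of what such a restricted function could look like. This is not the code from the patch or from the application mentioned above; the name nt_memcpy_aligned16() is hypothetical, the destination is assumed to be 16-byte aligned and the length a multiple of 16, and builds without SSE2 simply fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper: copy len bytes using non-temporal stores.
 * Assumptions: dst is 16-byte aligned and len is a multiple of 16.
 * src may have any alignment.
 */
static void nt_memcpy_aligned16(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert((len & 15) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++) {
        __m128i v = _mm_loadu_si128(s + i); /* normal (unaligned) load */
        _mm_stream_si128(d + i, v);         /* non-temporal 16-byte store */
    }
    _mm_sfence(); /* order the NT stores before any later stores */
#else
    memcpy(dst, src, len); /* no NT stores available in this build */
#endif
}
```

Handling arbitrary alignment and length means adding head and tail cases around this loop, which is where most of the extra code comes from.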

> 
> I haven't reviewed v3 of your patch, but in some earlier patch you did
> not use the movnti instruction to make smaller (< 16 bytes) stores.

I also use _mm_stream_si32() and _mm_stream_si64() now.
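
As a rough sketch of how these two intrinsics can cover a short remainder (not taken from the patch): the helper name nt_store_tail() is hypothetical, the destination is assumed to be 8-byte aligned, _mm_stream_si64() is only used on x86-64, and the last 0 to 3 bytes are written with normal stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper: copy a short remainder (typically < 16 bytes).
 * Assumption: dst is 8-byte aligned. src may have any alignment.
 */
static void nt_store_tail(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(((uintptr_t)dst & 7) == 0);
#if defined(__SSE2__) && defined(__x86_64__)
    while (len >= 8) {
        long long v;

        memcpy(&v, s, sizeof(v));            /* alignment-safe load */
        _mm_stream_si64((long long *)d, v);  /* 8-byte movnti */
        d += 8; s += 8; len -= 8;
    }
#endif
#if defined(__SSE2__)
    while (len >= 4) {
        int v;

        memcpy(&v, s, sizeof(v));
        _mm_stream_si32((int *)d, v);        /* 4-byte movnti */
        d += 4; s += 4; len -= 4;
    }
    _mm_sfence(); /* order the NT stores before the normal stores below */
#endif
    memcpy(d, s, len); /* whatever is left goes out as normal stores */
}
```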

> 
> 
> > The disadvantage is that copying a burst of 32 packets will, in the
> > worst case, pollute 64 cache lines (one at the start plus one at the
> > end of each packet's copied data), i.e. 4 KiB of data cache. If copying
> > to a consecutive memory area, e.g. a packet capture buffer, it will
> > pollute 33 cache lines (because the start of packet #2 is in the same
> > cache line as the end of packet #1, etc.).
> >
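
The arithmetic above can be checked with a small sketch, assuming 64-byte cache lines. The function names are hypothetical, and the model only counts cache lines that contain an unaligned packet boundary, i.e. the lines that would be written with normal stores.

```c
#include <assert.h>
#include <stddef.h>

enum { CACHE_LINE = 64 }; /* assumed cache line size in bytes */

/*
 * Count the distinct cache lines containing an unaligned packet boundary
 * when n packets of pkt_len bytes are copied back to back, starting at
 * byte offset start within a cache-line-aligned buffer.
 */
static size_t partial_lines_consecutive(size_t start, size_t pkt_len, size_t n)
{
    size_t count = 0;
    size_t last_line = (size_t)-1; /* no line counted yet */

    assert(pkt_len > 0);
    if (n == 0)
        return 0;
    for (size_t i = 0; i <= n; i++) {
        size_t boundary = start + i * pkt_len;
        size_t line = boundary / CACHE_LINE;

        /* Boundaries only increase, so comparing with the last line suffices. */
        if (boundary % CACHE_LINE != 0 && line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}

/* Worst case for n separate destination buffers: two partial lines each. */
static size_t partial_lines_worst_case(size_t n)
{
    return 2 * n;
}
```

With 32 packets of 100 bytes starting at offset 1, partial_lines_consecutive(1, 100, 32) gives 33, and partial_lines_worst_case(32) * CACHE_LINE gives 4096 bytes, i.e. 4 KiB.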
> > What do you think?
> >
> 
> For large copies, which I'm guessing is what non-temporal stores are
> usually used for, this is hair splitting. For DPDK applications, it
> might well be at least somewhat relevant, because such an application
> may make an enormous number of copies, each roughly the size of a
> packet.
> 
> If we had a rte_memcpy_ex() that only cared about copying whole cache
> lines in a NT manner, the application could add a clflushopt (or the
> equivalent) after the copy, flushing the beginning and end cache
> lines of the destination buffer.

That is a good idea.

Furthermore, POWER and RISC-V don't have NT stores, but if they have a cache line flush instruction, an NT-destination memcpy could be implemented for those architectures too, i.e. by storing cache-line-sized blocks and flushing them from the cache, and letting the application flush the cache lines at the ends if that is useful for the application.
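
A minimal sketch of the flush-after-copy idea on x86, under stated assumptions: the helper names are hypothetical, cache lines are assumed to be 64 bytes, and _mm_clflush() is used as a stand-in because it only needs SSE2. On CPUs with CLFLUSHOPT, _mm_clflushopt() could presumably be used instead. Only the partially covered first and last cache lines of the destination are flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

enum { CACHE_LINE_SIZE = 64 }; /* assumed cache line size in bytes */

/* Flush one cache line; a no-op in builds without SSE2. */
static void flush_line(uintptr_t line_addr)
{
#if defined(__SSE2__)
    _mm_clflush((const void *)line_addr);
#else
    (void)line_addr;
#endif
}

/*
 * Hypothetical helper: after copying len bytes to dst, flush the cache
 * lines at the edges that the copy only partially covered.
 * Returns the number of lines flushed (0, 1 or 2).
 */
static unsigned flush_edge_lines(const void *dst, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t begin = (uintptr_t)dst;
    unsigned flushed = 0;

    assert(dst != NULL);
    if (len == 0)
        return 0;

    uintptr_t end = begin + len;        /* one past the last byte */
    uintptr_t first = begin & mask;     /* line holding the first byte */
    uintptr_t last = (end - 1) & mask;  /* line holding the last byte */

    if (begin != first) {               /* unaligned start */
        flush_line(first);
        flushed++;
    }
    if ((end & ~mask) != 0 && !(last == first && flushed != 0)) {
        flush_line(last);               /* unaligned end, not yet flushed */
        flushed++;
    }
    return flushed;
}
```

Whether flushing these lines actually helps depends on the workload, since it also evicts any neighbouring data that shares them.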

> 
> >
> > PS: Non-temporal loads are easy to work with, so don't worry about
> > that.
> >
> >
> > Med venlig hilsen / Kind regards,
> > -Morten Brørup

Thank you, Mattias, for sharing your thoughts.

Now, let's wait and see if anyone else on the list has further input. :-)



