DPDK patches and discussions
From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	dev@dpdk.org, "Bruce Richardson" <bruce.richardson@intel.com>,
	"Konstantin Ananyev" <konstantin.v.ananyev@yandex.ru>
Cc: Jan Viktorin <viktorin@rehivetech.com>,
	Ruifeng Wang <ruifeng.wang@arm.com>,
	David Christensen <drc@linux.vnet.ibm.com>,
	Stanislaw Kardach <kda@semihalf.com>
Subject: Re: [RFC v2] non-temporal memcpy
Date: Wed, 10 Aug 2022 13:47:42 +0200
Message-ID: <fc427711-a7ab-a5ba-0523-5389c90580e6@lysator.liu.se>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D8724B@smartserver.smartshare.dk>

On 2022-08-09 17:00, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Tuesday, 9 August 2022 14.05
>>
>> On 2022-08-09 11:46, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Sunday, 7 August 2022 22.25
>>>>
>>>> On 2022-07-19 17:26, Morten Brørup wrote:
>>>>> This RFC proposes a set of functions optimized for non-temporal
>>>>> memory copy.
>>>>>
>>>>> At this stage, I am asking for feedback on the concept.
>>>>>
>>>>> Applications sometimes copy data to another memory location, which is
>>>>> only used much later.
>>>>> In this case, it is inefficient to pollute the data cache with the
>>>>> copied data.
>>>>>
>>>>> An example use case (originating from a real life application):
>>>>> Copying filtered packets, or the first part of them, into a capture
>>>>> buffer for offline analysis.
>>>>>
>>>>> The purpose of these functions is to achieve a performance gain by
>>>>> not polluting the cache when copying data.
>>>>> Although the throughput may be improved by further optimization, I
>>>>> do not consider throughput optimization relevant initially.
>>>>>
>>>>> The x86 non-temporal load instructions have 16 byte alignment
>>>>> requirements [1], while ARM non-temporal load instructions are
>>>>> available with 4 byte alignment requirements [2].
>>>>> Both platforms offer non-temporal store instructions with 4 byte
>>>>> alignment requirements.
>>>>>
>>>>>
>>>>
>>>> I don't think memcpy() functions should have alignment requirements.
>>>> That's not very practical, and violates the principle of least
>>>> surprise.
>>>
>>> I didn't make the CPUs with these alignment requirements.
>>>
>>> However, I will offer optimized performance in a generic NT memcpy()
>>> function in the cases where the individual alignment requirements of
>>> various CPUs happen to be met.
>>>
>>>>
>>>> Use normal memcpy() for the unaligned parts, and for the whole thing
>>>> for small sizes (at least on x86).
>>>>
>>>
>>> I'm not going to plunge into some advanced vector programming, so I'm
>>> working on an implementation where misalignment is handled by using a
>>> bounce buffer (allocated on the stack, which is probably cache hot
>>> anyway).
>>>
>>>
>>
>> I don't know about the NT load + NT store case, but for regular load +
>> NT store, this is trivial. The implementation I've used is 36
>> straightforward lines of code.
> 
> Is that implementation available for inspiration anywhere?
> 
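#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>	/* x86 SSE2 intrinsics */

#define CACHE_LINE_SIZE 64	/* assumed; typical for x86 */
#define NT_THRESHOLD (2 * CACHE_LINE_SIZE)

void nt_memcpy(void *__restrict dst, const void *__restrict src, size_t n)
{
	/* byte pointers, since arithmetic on void * is a GNU extension */
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* small copies use regular memcpy() */
	if (n < NT_THRESHOLD) {
		memcpy(d, s, n);
		return;
	}

	/* bytes up to the first cache line boundary in dst (0 if aligned) */
	size_t n_unaligned =
		(CACHE_LINE_SIZE - (uintptr_t)d % CACHE_LINE_SIZE) % CACHE_LINE_SIZE;

	if (n_unaligned > n)
		n_unaligned = n;

	/* regular copy of the unaligned head */
	memcpy(d, s, n_unaligned);
	d += n_unaligned;
	s += n_unaligned;
	n -= n_unaligned;

	size_t num_lines = n / CACHE_LINE_SIZE;

	size_t i;
	for (i = 0; i < num_lines; i++) {
		size_t j;
		for (j = 0; j < CACHE_LINE_SIZE / sizeof(__m128i); j++) {
			/* regular (possibly unaligned) load */
			__m128i blk = _mm_loadu_si128((const __m128i *)s);
			/* non-temporal store; d is 16-byte aligned here */
			_mm_stream_si128((__m128i *)d, blk);
			s += sizeof(__m128i);
			d += sizeof(__m128i);
		}
		n -= CACHE_LINE_SIZE;
	}

	/* order the NT stores before any later stores */
	if (num_lines > 0)
		_mm_sfence();

	/* regular copy of the remaining partial cache line */
	memcpy(d, s, n);
}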

(This was written as part of a benchmark exercise, and hasn't been 
properly tested.)
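
Since the routine is described as untested, here is a minimal sketch of
how a copy routine with this signature could be checked across
destination alignments and sizes on both sides of the threshold. The
names check_copy(), ref_copy() and copy_fn, and the chosen offset and
size ranges, are hypothetical and not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DST_OFFSETS 64	/* covers every alignment within an assumed 64-byte line */
#define SRC_OFFSETS 8
#define MAX_SIZE 256	/* exercises both sides of a 128-byte threshold */
#define GUARD 0xA5

/* Hypothetical common signature for the routine under test. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Reference routine, useful for sanity-checking the harness itself. */
static void ref_copy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
}

/* Returns 0 if fn copies correctly for all tested cases, -1 otherwise. */
static int check_copy(copy_fn fn)
{
	static uint8_t src[SRC_OFFSETS + MAX_SIZE];
	static uint8_t dst[DST_OFFSETS + MAX_SIZE + 1];
	size_t i, d, s, n;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (uint8_t)(i * 31 + 7);

	for (d = 0; d < DST_OFFSETS; d++)
		for (s = 0; s < SRC_OFFSETS; s++)
			for (n = 0; n <= MAX_SIZE; n++) {
				memset(dst, GUARD, sizeof(dst));
				fn(dst + d, src + s, n);

				/* the copied bytes must match the source */
				if (memcmp(dst + d, src + s, n) != 0)
					return -1;
				/* bytes before the destination must be untouched */
				for (i = 0; i < d; i++)
					if (dst[i] != GUARD)
						return -1;
				/* bytes after the destination must be untouched */
				for (i = d + n; i < sizeof(dst); i++)
					if (dst[i] != GUARD)
						return -1;
			}
	return 0;
}
```

Passing nt_memcpy instead of ref_copy would exercise the unaligned
head, the streamed cache lines and the tail in one run.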

Use this for inspiration, or I can DPDK-ify it and make it a proper 
patch/RFC. I would try to add support for NT load as well, and make both 
NT load and NT store depend on a flags parameter.
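
A rough sketch of what a flags-controlled variant might look like
follows. The function name nt_memcpy_flags(), the flag names NT_F_SRC
and NT_F_DST, and the NT_MIN_SIZE value are all hypothetical; nothing
here is an existing DPDK API. Only the NT store path is implemented;
the NT load flag is accepted but falls back to regular loads, since an
NT load path would bring its own alignment and ISA requirements. On
builds without SSE2 the whole copy falls back to memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical flag names, for illustration only. */
#define NT_F_SRC (1u << 0)	/* source will not be read again soon */
#define NT_F_DST (1u << 1)	/* destination will not be accessed soon */

#define NT_MIN_SIZE 128		/* arbitrary threshold, as in the routine above */

static void nt_memcpy_flags(void *dst, const void *src, size_t n,
			    unsigned int flags)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

#if defined(__SSE2__)
	if ((flags & NT_F_DST) && n >= NT_MIN_SIZE) {
		/* regular copy up to the first 16-byte aligned dst address */
		size_t head = (16 - (uintptr_t)d % 16) % 16;

		memcpy(d, s, head);
		d += head;
		s += head;
		n -= head;

		for (; n >= 16; n -= 16, d += 16, s += 16) {
			/* NT_F_SRC is not acted on: regular unaligned load */
			__m128i blk = _mm_loadu_si128((const __m128i *)s);
			/* non-temporal store to the aligned destination */
			_mm_stream_si128((__m128i *)d, blk);
		}
		/* order the NT stores before any later stores */
		_mm_sfence();
	}
#else
	(void)flags;	/* no NT path available in this build */
#endif
	/* remaining bytes, or the whole copy if the NT path was not taken */
	memcpy(d, s, n);
}
```

A caller filling a capture buffer might pass NT_F_DST only, since the
packet data being read is likely already in the cache.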

The above threshold setting is completely arbitrary. What you should 
keep in mind when thinking about the threshold is that it might well be 
worth accepting somewhat lower performance from NT store + sfence 
(compared to regular store), since you will benefit from not trashing 
the cache.

For example, back-to-back copying of 1500-byte buffers with this 
copying routine is much slower than regular memcpy() (measured in 
core cycles spent in the copying), but in a real-world application it 
may nevertheless improve overall performance, since the packet copies 
don't evict useful data from the various caches. I know for sure that 
certain applications do benefit.
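
For reference, a back-to-back copy measurement of this kind could be
sketched roughly as below. The names bench_copy(), plain_memcpy() and
bench_copy_fn, and the buffer count, are hypothetical, and clock() 
reports CPU time rather than core cycles, so this is only an
approximation of the measurement described. A number obtained this way
captures the cost of the copying alone; the effect on the rest of the
application's cache footprint only shows up in application-level
measurements.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE 1500
#define NUM_BUFS 1024

/* Hypothetical common signature for the routines being compared. */
typedef void (*bench_copy_fn)(void *dst, const void *src, size_t n);

static void plain_memcpy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
}

static uint8_t src_bufs[NUM_BUFS][BUF_SIZE];
static uint8_t dst_bufs[NUM_BUFS][BUF_SIZE];

/*
 * Copies NUM_BUFS buffers back-to-back, 'rounds' times.
 * Returns the average CPU time per copy in nanoseconds,
 * or -1.0 if the last buffer was not copied correctly.
 */
static double bench_copy(bench_copy_fn fn, unsigned int rounds)
{
	clock_t start, end;
	unsigned int r;
	size_t i;

	/* touch all memory up front so page faults are not timed */
	memset(src_bufs, 0x5a, sizeof(src_bufs));
	memset(dst_bufs, 0, sizeof(dst_bufs));

	start = clock();
	for (r = 0; r < rounds; r++)
		for (i = 0; i < NUM_BUFS; i++)
			fn(dst_bufs[i], src_bufs[i], BUF_SIZE);
	end = clock();

	/* read back a result so the copies cannot be discarded */
	if (memcmp(dst_bufs[NUM_BUFS - 1], src_bufs[NUM_BUFS - 1],
		   BUF_SIZE) != 0)
		return -1.0;

	return (double)(end - start) / CLOCKS_PER_SEC * 1e9 /
		((double)rounds * NUM_BUFS);
}
```

Comparing bench_copy(plain_memcpy, ...) against the same call with an
NT routine gives the per-copy difference only.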
