DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	"Bruce Richardson" <bruce.richardson@intel.com>
Cc: "Mattias Rönnblom" <mattias.ronnblom@ericsson.com>,
	dev@dpdk.org, "Stephen Hemminger" <stephen@networkplumber.org>
Subject: Re: [RFC v2] eal: provide option to use compiler memcpy instead of RTE
Date: Tue, 28 May 2024 18:17:51 +0200	[thread overview]
Message-ID: <184120df-26c4-446c-826e-19a8a3b1aa49@lysator.liu.se> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F4B7@smartserver.smartshare.dk>

On 2024-05-28 11:07, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Tuesday, 28 May 2024 11.00
>>
>> On 2024-05-28 10:27, Bruce Richardson wrote:
>>> On Tue, May 28, 2024 at 10:19:15AM +0200, Mattias Rönnblom wrote:
>>>> On 2024-05-28 09:43, Mattias Rönnblom wrote:
>>>>> Provide build option to have functions in <rte_memcpy.h> delegate to
>>>>> the standard compiler/libc memcpy(), instead of using the various
>>>>> traditional, handcrafted, per-architecture rte_memcpy()
>>>>> implementations.
>>>>>
>>>>> A new meson build option 'use_cc_memcpy' is added. The default is
>>>>> true. It's not obvious what should be the default, but compiler
>>>>> memcpy() is enabled by default in this RFC so any tests run with this
>>>>> patch use the new approach.
>>>>>
>>>>> One purpose of this RFC is to make it easy to evaluate the costs and
>>>>> benefits of a switch.
>>>>>
>>>>
>>>> I've tested this patch some with DSW micro benchmarks, and the result is a
>>>> 2.5% reduction of the DSW+testapp overhead with cc/libc memcpy. GCC 11.4.
>>>>
>>>> We've also run characteristic test suite of a large, real world app. Here,
>>>> we saw no effect. GCC 10.5.
>>>>
>>>> x86_64 in both cases (Skylake and Raptor Lake).
>>>>
>>>> Last time we did the same, there were a noticeable performance degradation
>>>> in both the above cases.
> 
> Mattias, which compiler was that?
> 

GCC 9, I think.

Not only the compiler changed between those two test runs.

It would be interesting with some ARM data points as well.

> As previously mentioned in another thread, I'm worried about memcpy performance with older compilers.
> DPDK officially supports GCC 4.9 and clang 3.4 [1].
> I don't think degrading performance when using supported compilers is considered acceptable.
> 
> Alternatively, we could change the DPDK compiler policy from "supported" to "works with (but might not perform optimally)".
> 

GCC 4.9 is ten years old.

If you are using an old compiler, odds are you don't really care too 
much about squeezing out max performance, considering how much better 
code generation is in newer compilers.

That said, we obviously don't want to cause large performance 
regressions for no good reason, even for old compilers.

> [1]: https://doc.dpdk.org/guides-21.11/linux_gsg/sys_reqs.html#compilation-of-the-dpdk
> 
>>>>
>>>> This is not a lot of data points, but I think it we should consider making
>>>> the custom RTE memcpy() implementations optional in the next release, and
>> if
>>>> no-one complains, remove the implementations in the next release.
>>>>
>>>> (Whether or not [or how long] to keep the wrapper API is another question.)
>>>>
>>>> <snip>
>>>
>>> The other instance I've heard mention of in the past is virtio/vhost, which
>>> used to have a speedup from the custom memcpy.
>>>
>>> My own thinking on these cases, is that for targetted settings like these,
>>> we should look to have local memcpy functions written - taking account of
>>> the specifics of each usecase. For virtio/vhost for example, we can have
>>> assumptions around host buffer alignment, and we also can be pretty
>>> confident we are copying to another CPU. For DSW, or other eventdev cases,
>>> we would only be looking at copies of multiples of 16, with guaranteed
>>> 8-byte alignment on both source and destination. Writing efficient copy fns
>>
>> In such cases, you should first try to tell the compiler that it's safe
>> to assume that the pointers have a certain alignment.
>>
>> void copy256(void *dst, const void *src)
>> {
>>       memcpy(dst, src, 256);
>> }
>>
>> void copy256_a(void *dst, const void *src)
>> {
>>       void *dst_a = __builtin_assume_aligned(dst, 32);
>>       const void *src_a = __builtin_assume_aligned(src, 32);
>>       memcpy(dst_a, src_a, 256);
>> }
>>
>> The first will generate loads/stores without alignment restrictions,
>> while the latter will use things like vmovdqa or vmovaps.
>>
>> (I doubt there's much of a performance difference though, if any at all.)
> 
> Interesting.
> 
>>
>>> for specific scenarios can be faster and more effective than trying to
>>> write a general, optimized in all cases, memcpy. It also discourages the
>>> use of non-libc memcpy except where really necessary.
> 
> Good idea, Bruce.
> I have previously worked on an optimized memcpy, where information about alignment, multiples, non-temporal source/destination, etc. is passed as flags to the function [2]. But it turned into too much work, so I never finished it.
> 
> If we start with local memcpy functions optimized for each specific use case, we still have the option of consolidating them into a common rte_memcpy function later. It will also reveal which flags/features such a common function needs to support.
> 
> [2]: https://inbox.dpdk.org/dev/20221010064600.16495-1-mb@smartsharesystems.com/
> 
>>>
>>> Naturally, if we find there are a lot of cases where use of libc memcpy
>>> slows us down, we will want to keep a general rte_memcpy. However, I'd hope
>>> the slowdown cases are very few.
>>>
>>> /Bruce

  reply	other threads:[~2024-05-28 16:17 UTC|newest]

Thread overview: 93+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-27 11:11 [RFC] " Mattias Rönnblom
2024-05-28  7:43 ` [RFC v2] " Mattias Rönnblom
2024-05-28  8:19   ` Mattias Rönnblom
2024-05-28  8:27     ` Bruce Richardson
2024-05-28  8:59       ` Mattias Rönnblom
2024-05-28  9:07         ` Morten Brørup
2024-05-28 16:17           ` Mattias Rönnblom [this message]
2024-05-28 14:59     ` Stephen Hemminger
2024-05-28 15:09       ` Bruce Richardson
2024-05-31  5:19         ` Mattias Rönnblom
2024-05-31 16:50           ` Stephen Hemminger
2024-06-02 11:33             ` Mattias Rönnblom
2024-05-28 16:03       ` Mattias Rönnblom
2024-05-29 21:55         ` Stephen Hemminger
2024-05-28  8:20   ` Bruce Richardson
2024-06-02 12:39   ` [RFC v3 0/5] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 1/5] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-05  6:49       ` [PATCH 0/5] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-05  6:49         ` [PATCH 1/5] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-05  6:49         ` [PATCH 2/5] net/octeon_ep: properly include vector API header file Mattias Rönnblom
2024-06-05  6:49         ` [PATCH 3/5] distributor: " Mattias Rönnblom
2024-06-10 14:27           ` Tyler Retzlaff
2024-06-05  6:49         ` [PATCH 4/5] fib: " Mattias Rönnblom
2024-06-10 14:28           ` Tyler Retzlaff
2024-06-05  6:49         ` [PATCH 5/5] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-20  7:24         ` [PATCH v2 0/6] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-20  7:24           ` [PATCH v2 1/6] net/fm10k: add missing intrinsic include Mattias Rönnblom
2024-06-20  9:02             ` Bruce Richardson
2024-06-20  9:28             ` Bruce Richardson
2024-06-20 11:40               ` Mattias Rönnblom
2024-06-20 11:59                 ` Bruce Richardson
2024-06-20 11:50             ` [PATCH v3 0/6] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 1/6] net/fm10k: add missing vector API header include Mattias Rönnblom
2024-06-20 12:34                 ` Bruce Richardson
2024-06-20 17:57                 ` [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 01/13] net/i40e: add missing vector API header include Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 02/13] net/iavf: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 03/13] net/ice: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 04/13] net/ixgbe: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 05/13] net/ngbe: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 06/13] net/txgbe: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 07/13] net/virtio: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 08/13] net/fm10k: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 09/13] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 10/13] net/octeon_ep: add missing vector API header include Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 11/13] distributor: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 12/13] fib: " Mattias Rönnblom
2024-06-20 17:57                   ` [PATCH v4 13/13] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-21 15:19                     ` Stephen Hemminger
2024-06-24 10:05                     ` Thomas Monjalon
2024-06-24 17:56                       ` Mattias Rönnblom
2024-06-25 13:06                       ` Mattias Rönnblom
2024-06-25 13:34                         ` Thomas Monjalon
2024-06-20 18:53                   ` [PATCH v4 00/13] Optionally have rte_memcpy delegate to compiler memcpy Morten Brørup
2024-06-21  6:56                   ` Mattias Rönnblom
2024-06-21  7:04                     ` David Marchand
2024-06-21  7:35                       ` Mattias Rönnblom
2024-06-21  7:41                         ` David Marchand
2024-06-25 15:29                   ` Maxime Coquelin
2024-06-25 15:44                     ` Stephen Hemminger
2024-06-25 19:27                     ` Mattias Rönnblom
2024-06-26  8:37                       ` Maxime Coquelin
2024-06-26 14:58                         ` Stephen Hemminger
2024-06-26 15:24                           ` Maxime Coquelin
2024-06-26 18:47                             ` Mattias Rönnblom
2024-06-26 20:16                               ` Morten Brørup
2024-06-27 11:06                                 ` Mattias Rönnblom
2024-06-27 15:10                                   ` Stephen Hemminger
2024-06-27 15:23                                     ` Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 2/6] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 3/6] net/octeon_ep: add missing vector API header include Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 4/6] distributor: " Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 5/6] fib: " Mattias Rönnblom
2024-06-20 11:50               ` [PATCH v3 6/6] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-20  7:24           ` [PATCH v2 2/6] event/dlb2: include headers for vector and memory copy APIs Mattias Rönnblom
2024-06-20  9:03             ` Bruce Richardson
2024-06-20  7:24           ` [PATCH v2 3/6] net/octeon_ep: properly include vector API header file Mattias Rönnblom
2024-06-20 14:43             ` Stephen Hemminger
2024-06-20  7:24           ` [PATCH v2 4/6] distributor: " Mattias Rönnblom
2024-06-20  9:13             ` Bruce Richardson
2024-06-20  7:24           ` [PATCH v2 5/6] fib: " Mattias Rönnblom
2024-06-20  9:14             ` Bruce Richardson
2024-06-20 14:43               ` Stephen Hemminger
2024-06-20  7:24           ` [PATCH v2 6/6] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 2/5] net/octeon_ep: properly include vector API header file Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 3/5] distributor: " Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 4/5] fib: " Mattias Rönnblom
2024-06-02 12:39     ` [RFC v3 5/5] eal: provide option to use compiler memcpy instead of RTE Mattias Rönnblom
2024-06-02 20:58       ` Morten Brørup
2024-06-03 17:04         ` Mattias Rönnblom
2024-06-03 17:08           ` Stephen Hemminger
2024-05-29 21:56 ` [RFC] " Stephen Hemminger
2024-06-02 11:30   ` Mattias Rönnblom

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=184120df-26c4-446c-826e-19a8a3b1aa49@lysator.liu.se \
    --to=hofors@lysator.liu.se \
    --cc=bruce.richardson@intel.com \
    --cc=dev@dpdk.org \
    --cc=mattias.ronnblom@ericsson.com \
    --cc=mb@smartsharesystems.com \
    --cc=stephen@networkplumber.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).