DPDK patches and discussions
From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	"Thomas Monjalon" <thomas@monjalon.net>
Cc: Bruce Richardson <bruce.richardson@intel.com>,
	dev@dpdk.org, olivier.matz@6wind.com,
	andrew.rybchenko@oktetlabs.ru, honnappa.nagarahalli@arm.com,
	konstantin.v.ananyev@yandex.ru, mattias.ronnblom@ericsson.com
Subject: Re: [RFC] cache guard
Date: Tue, 5 Sep 2023 07:50:06 +0200	[thread overview]
Message-ID: <66b862bc-ca22-02da-069c-a82c2d52ed49@lysator.liu.se> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87B6F@smartserver.smartshare.dk>

On 2023-09-04 14:48, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 4 September 2023 14.07
>>
>> On 2023-09-01 20:52, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Friday, 1 September 2023 18.58
>>>>
>>>> On 2023-09-01 14:26, Thomas Monjalon wrote:
>>>>> 27/08/2023 10:34, Morten Brørup:
>>>>>> +CC Honnappa and Konstantin, Ring lib maintainers
>>>>>> +CC Mattias, PRNG lib maintainer
>>>>>>
>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>> Sent: Friday, 25 August 2023 11.24
>>>>>>>
>>>>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>>>>>> +CC mempool maintainers
>>>>>>>>
>>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>>>>>
>>>>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>>>>>> Bruce,
>>>>>>>>>>
>>>>>>>>>> With this patch [1], it is noted that the ring producer and
>>>>>>>>>> consumer data should not be on adjacent cache lines, for
>>>>>>>>>> performance reasons.
>>>>>>>>>>
>>>>>>>>>> [1]:
>>>>>>>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
>>>>>>>>>>
>>>>>>>>>> (It's obvious that they cannot share the same cache line, because
>>>>>>>>>> they are accessed by two different threads.)
>>>>>>>>>>
>>>>>>>>>> Intuitively, I would think that having them on different cache
>>>>>>>>>> lines would suffice. Why does having an empty cache line between
>>>>>>>>>> them make a difference?
>>>>>>>>>>
>>>>>>>>>> And does it need to be an empty cache line? Or does it suffice
>>>>>>>>>> having the second structure start at two cache lines after the
>>>>>>>>>> start of the first structure (e.g. if the size of the first
>>>>>>>>>> structure is two cache lines)?
>>>>>>>>>>
>>>>>>>>>> I'm asking because the same principle might apply to other code
>>>>>>>>>> too.
>>>>>>>>>>
>>>>>>>>> Hi Morten,
>>>>>>>>>
>>>>>>>>> this was something we discovered when working on the distributor
>>>>>>>>> library. If we have cachelines per core where there is heavy
>>>>>>>>> access, having some cachelines as a gap between the content
>>>>>>>>> cachelines can help performance. We believe this helps due to
>>>>>>>>> avoiding issues with the HW prefetchers (e.g. adjacent cacheline
>>>>>>>>> prefetcher) bringing in the second cacheline speculatively when an
>>>>>>>>> operation is done on the first line.
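
To illustrate the shape of the idea (a sketch only, not the actual
rte_ring definition), a producer/consumer pair separated by an empty
guard line could look like this, using the existing __rte_cache_aligned
and RTE_CACHE_LINE_SIZE macros:

/* assumes <stdint.h> and the DPDK EAL headers */
struct pc_pair {
	/* written by the producer thread only */
	struct {
		uint32_t head;
		uint32_t tail;
	} prod __rte_cache_aligned;

	/* deliberately unused line, so a prefetch triggered by an access
	 * to 'prod' never speculatively pulls in the consumer's line
	 * (and vice versa) */
	char guard[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;

	/* written by the consumer thread only */
	struct {
		uint32_t head;
		uint32_t tail;
	} cons __rte_cache_aligned;
};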
>>>>>>>>
>>>>>>>> I guessed that it had something to do with speculative prefetching,
>>>>>>>> but wasn't sure. Good to get confirmation, and that it has a
>>>>>>>> measurable effect somewhere. Very interesting!
>>>>>>>>
>>>>>>>> NB: More comments in the ring lib about stuff like this would be
>>>>>>>> nice.
>>>>>>>>
>>>>>>>> So, for the mempool lib, what do you think about applying the same
>>>>>>>> technique to the rte_mempool_debug_stats structure (which is an
>>>>>>>> array indexed per lcore)... Two adjacent lcores heavily accessing
>>>>>>>> their local mempool caches seems likely to me. But how heavy does
>>>>>>>> the access need to be for this technique to be relevant?
>>>>>>>>
>>>>>>>
>>>>>>> No idea how heavy the accesses need to be for this to have a
>>>>>>> noticeable effect. For things like debug stats, I wonder how
>>>>>>> worthwhile making such a change would be, but then again, any change
>>>>>>> would have very low impact too in that case.
>>>>>>
>>>>>> I just tried adding padding to some of the hot structures in our own
>>>>>> application, and observed a significant performance improvement for
>>>>>> those.
>>>>>>
>>>>>> So I think this technique should have higher visibility in DPDK by
>>>>>> adding a new cache macro to rte_common.h:
>>>>>
>>>>> +1 for more visibility in the docs and for adding a macro, good idea!
>>>>>
>>>>>
>>>>>
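
One way such a macro could be defined (a sketch only; RTE_CACHE_GUARD_LINES
is an assumed build-time tunable, the "N" discussed below, and __COUNTER__
is only there so the padding field gets a unique name):

/* rte_common.h (sketch) */
#ifndef RTE_CACHE_GUARD_LINES
#define RTE_CACHE_GUARD_LINES 1 /* assumed config-time default */
#endif

#define _RTE_CACHE_GUARD_HELPER2(unique)				\
	char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE *		\
			RTE_CACHE_GUARD_LINES] __rte_cache_aligned
#define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)

With 64-byte cache lines and RTE_CACHE_GUARD_LINES set to 1, each guard
adds one 64-byte padding line per instance.
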
>>>>
>>>> A worry I have is that for CPUs with large (in this context) N, you
>>>> will end up with a lot of padding to avoid next-N-lines false sharing.
>>>> That would be padding after, and in the general (non-array) case also
>>>> before, the actual per-lcore data. A slight nuisance is also that those
>>>> prefetched lines of padding will never contain anything useful, and
>>>> thus fetching them will always be a waste.
>>>
>>> Out of curiosity, what is the largest N anyone here on the list is
>>> aware of?
>>>
>>>>
>>>> Padding/alignment may not be the only way to avoid HW-prefetcher-induced
>>>> false sharing for per-lcore data structures.
>>>>
>>>> What we are discussing here is organizing the statically allocated
>>>> per-lcore structs of a particular module in an array with the
>>>> appropriate padding/alignment. In this model, all data related to a
>>>> particular module is close (memory address/page-wise), but not so close
>>>> as to cause false sharing.
>>>>
>>>> /* rte_a.c */
>>>>
>>>> struct rte_a_state
>>>> {
>>>> 	int x;
>>>> 	RTE_CACHE_GUARD;
>>>> } __rte_cache_aligned;
>>>>
>>>> static struct rte_a_state a_states[RTE_MAX_LCORE];
>>>>
>>>> /* rte_b.c */
>>>>
>>>> struct rte_b_state
>>>> {
>>>> 	char y;
>>>> 	char z;
>>>> 	RTE_CACHE_GUARD;
>>>> } __rte_cache_aligned;
>>>>
>>>> static struct rte_b_state b_states[RTE_MAX_LCORE];
>>>>
>>>> What you would end up with at runtime, when the linker has done its
>>>> job, is something that essentially looks like this (in memory):
>>>>
>>>> struct {
>>>> 	struct rte_a_state a_states[RTE_MAX_LCORE];
>>>> 	struct rte_b_state b_states[RTE_MAX_LCORE];
>>>> };
>>>>
>>>> You could consider turning it around, and keeping data (i.e., module
>>>> structs) related to a particular lcore, for all modules, close. In
>>>> other words, keeping a per-lcore array of variable-sized elements.
>>>>
>>>> So, something that will end up looking like this (in memory, not in
>>>> the source code):
>>>>
>>>> struct rte_lcore_state
>>>> {
>>>> 	struct rte_a_state a_state;
>>>> 	struct rte_b_state b_state;
>>>> 	RTE_CACHE_GUARD;
>>>> };
>>>>
>>>> struct rte_lcore_state lcore_states[RTE_MAX_LCORE];
>>>>
>>>> In such a scenario, the per-lcore struct type for a module need not
>>>> (and should not) be cache-line-aligned (but may still have some
>>>> alignment requirements). Data will be more tightly packed, and the
>>>> "next lines" prefetched may actually be useful (although I'm guessing
>>>> in practice they will usually not).
>>>>
>>>> There may be several ways to implement that scheme. The above is to
>>>> illustrate how things would look in memory, not necessarily on the
>>>> level of the source code.
>>>>
>>>> One way could be to fit the per-module-per-lcore struct in a chunk of
>>>> memory allocated in a per-lcore heap. In such a case, the DPDK heap
>>>> would need extension, maybe with semantics similar to that of
>>>> NUMA-node specific allocations.
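
(Something along the lines of the existing rte_malloc_socket(), but keyed
on an lcore id instead of a NUMA node. The function below is purely
hypothetical, just to illustrate the kind of semantics meant:)

/* hypothetical allocator placing the chunk in a per-lcore heap,
 * analogous to rte_malloc_socket(type, size, align, socket_id) */
void *
rte_malloc_lcore(const char *type, size_t size, unsigned int align,
		 unsigned int lcore_id);

/* rte_a.c, per-lcore init path; 0 = default alignment */
struct rte_a_state *a_state =
	rte_malloc_lcore("rte_a_state", sizeof(struct rte_a_state), 0,
			 rte_lcore_id());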
>>>>
>>>> Another way would be to use thread-local storage (TLS, __thread),
>>>> although it's unclear to me how well TLS works with larger data
>>>> structures.
>>>>
>>>> A third way may be to somehow achieve something that looks like the
>>>> above example, using macros, without breaking module encapsulation or
>>>> generally being too intrusive or otherwise cumbersome.
>>>>
>>>> Not sure this is worth the trouble (compared to just more padding),
>>>> but I thought it was an idea worth sharing.
>>>
>>> I think what Mattias suggests is relevant, and it would be great if a
>>> generic solution could be found for DPDK.
>>>
>>> For reference, we initially used RTE_PER_LCORE(module_variable), i.e.
>>> thread local storage, extensively in our application modules. But it
>>> has two disadvantages:
>>> 1. TLS does not use hugepages. (The same applies to global and local
>>> variables, BTW.)
>>> 2. We need to set up global pointers to these TLS variables, so they
>>> can be accessed from the main lcore (e.g. for statistics). This means
>>> that every module needs some sort of module_per_lcore_init, called by
>>> the thread after its creation, to set the
>>> module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable).
>>>
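If I read that right, each module then ends up with boilerplate roughly
like the following sketch (names taken from your description; struct
module_state stands in for the module's per-lcore data; the macros and
rte_lcore_id() come from <rte_per_lcore.h> and <rte_lcore.h>):

/* module.c */
struct module_state { uint64_t stat_counter; /* ... */ };

RTE_DEFINE_PER_LCORE(struct module_state, module_variable);

/* filled in by every lcore, so the main lcore can reach the TLS copies
 * (e.g. when aggregating statistics) */
static struct module_state *module_global_ptr[RTE_MAX_LCORE];

/* must be called once by each thread after it has been created */
void
module_per_lcore_init(void)
{
	module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable);
}
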
>>
>> Good points. I never thought about the initialization issue.
>>
>> How about memory consumption and TLS? If you have many non-EAL-threads
>> in the DPDK process, would the system allocate TLS memory for DPDK
>> lcore-specific data structures? Assuming a scenario where __thread was
>> used instead of the standard DPDK pattern.
> 
> Room for all the __thread variables is allocated and initialized for every thread started.
> 
> Note: RTE_PER_LCORE variables are simply __thread-wrapped variables:
> 
> #define RTE_DEFINE_PER_LCORE(type, name)			\
> 	__thread __typeof__(type) per_lcore_##name
> 
>>
>>> Eventually, we gave up and migrated to the DPDK standard design
>>> pattern of instantiating a global module_variable[RTE_MAX_LCORE], and
>>> letting each thread use its own entry in that array.
>>>
>>> And as Mattias suggests, this adds a lot of useless padding, because
>>> each module's variables now need to start on their own cache line.
>>>
>>> So a generic approach with packed per-thread data would be a great
>>> long-term solution.
>>>
>>> Short term, I can live with the simple cache guard. It is very easy to
>>> implement and use.
>>>
>> The RTE_CACHE_GUARD pattern is also intrusive, in the sense that it
>> needs to be explicitly added everywhere (just like __rte_cache_aligned),
>> error prone, and somewhat brittle (in the face of a changed <N>).
> 
> Agree. Like a lot of other stuff in DPDK. DPDK is very explicit. :-)
> 
>>
>> (I mentioned this not to discourage the use of RTE_CACHE_GUARD - more to
>> somehow invite something more efficient, robust and easier to use.)
> 
> I get that you don't discourage it, but I don't expect the discussion to proceed further at this time. So please don’t forget to also ACK this patch. ;-)
> 

Acked-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

> There should be a wish list, where concepts like your suggested improvement could be easily noted.
> 

Thread overview: 19+ messages
2023-08-25  6:45 cache thrashing question Morten Brørup
2023-08-25  8:22 ` Bruce Richardson
2023-08-25  9:06   ` Morten Brørup
2023-08-25  9:23     ` Bruce Richardson
2023-08-27  8:34       ` [RFC] cache guard Morten Brørup
2023-08-27 13:55         ` Mattias Rönnblom
2023-08-27 15:40           ` Morten Brørup
2023-08-27 22:30             ` Mattias Rönnblom
2023-08-28  6:32               ` Morten Brørup
2023-08-28  8:46                 ` Mattias Rönnblom
2023-08-28  9:54                   ` Morten Brørup
2023-08-28 10:40                     ` Stephen Hemminger
2023-08-28  7:57             ` Bruce Richardson
2023-09-01 12:26         ` Thomas Monjalon
2023-09-01 16:57           ` Mattias Rönnblom
2023-09-01 18:52             ` Morten Brørup
2023-09-04 12:07               ` Mattias Rönnblom
2023-09-04 12:48                 ` Morten Brørup
2023-09-05  5:50                   ` Mattias Rönnblom [this message]
