DPDK patches and discussions
From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	"Thomas Monjalon" <thomas@monjalon.net>
Cc: Bruce Richardson <bruce.richardson@intel.com>,
	dev@dpdk.org, olivier.matz@6wind.com,
	andrew.rybchenko@oktetlabs.ru, honnappa.nagarahalli@arm.com,
	konstantin.v.ananyev@yandex.ru, mattias.ronnblom@ericsson.com
Subject: Re: [RFC] cache guard
Date: Mon, 4 Sep 2023 14:07:20 +0200	[thread overview]
Message-ID: <10be7ca8-fe5a-ba24-503a-9c6ce2fef710@lysator.liu.se> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87B69@smartserver.smartshare.dk>

On 2023-09-01 20:52, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Friday, 1 September 2023 18.58
>>
>> On 2023-09-01 14:26, Thomas Monjalon wrote:
>>> 27/08/2023 10:34, Morten Brørup:
>>>> +CC Honnappa and Konstantin, Ring lib maintainers
>>>> +CC Mattias, PRNG lib maintainer
>>>>
>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>> Sent: Friday, 25 August 2023 11.24
>>>>>
>>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>>>> +CC mempool maintainers
>>>>>>
>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>>>
>>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>>>> Bruce,
>>>>>>>>
>>>>>>>> With this patch [1], it is noted that the ring producer and
>>>>>>>> consumer data should not be on adjacent cache lines, for
>>>>>>>> performance reasons.
>>>>>>>>
>>>>>>>> [1]:
>>>>>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
>>>>>>>>
>>>>>>>> (It's obvious that they cannot share the same cache line, because
>>>>>>>> they are accessed by two different threads.)
>>>>>>>>
>>>>>>>> Intuitively, I would think that having them on different cache
>>>>>>>> lines would suffice. Why does having an empty cache line between
>>>>>>>> them make a difference?
>>>>>>>>
>>>>>>>> And does it need to be an empty cache line? Or does it suffice
>>>>>>>> having the second structure start at two cache lines after the
>>>>>>>> start of the first structure (e.g. if the size of the first
>>>>>>>> structure is two cache lines)?
>>>>>>>>
>>>>>>>> I'm asking because the same principle might apply to other code
>>>>>>>> too.
>>>>>>>>
>>>>>>> Hi Morten,
>>>>>>>
>>>>>>> this was something we discovered when working on the distributor
>>>>>>> library. If we have cachelines per core where there is heavy
>>>>>>> access, having some cachelines as a gap between the content
>>>>>>> cachelines can help performance. We believe this helps due to
>>>>>>> avoiding issues with the HW prefetchers (e.g. adjacent cacheline
>>>>>>> prefetcher) bringing in the second cacheline speculatively when an
>>>>>>> operation is done on the first line.
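
(To illustrate the point above: a simplified sketch - not the actual
rte_ring definition - of producer/consumer metadata with a guard line
in between:)

#include <stdint.h>

#include <rte_common.h>

struct ring_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

struct ring {
	/* producer metadata, on its own cache line */
	struct ring_headtail prod __rte_cache_aligned;
	/* empty guard line: when the producer's line is accessed, the
	 * adjacent-cacheline prefetcher pulls in this padding instead
	 * of the consumer's line
	 */
	char guard[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
	/* consumer metadata, on its own cache line */
	struct ring_headtail cons __rte_cache_aligned;
};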
>>>>>>
>>>>>> I guessed that it had something to do with speculative prefetching,
>>>>>> but wasn't sure. Good to get confirmation, and that it has a
>>>>>> measurable effect somewhere. Very interesting!
>>>>>>
>>>>>> NB: More comments in the ring lib about stuff like this would be
>>>>>> nice.
>>>>>>
>>>>>> So, for the mempool lib, what do you think about applying the same
>>>>>> technique to the rte_mempool_debug_stats structure (which is an
>>>>>> array indexed per lcore)... Two adjacent lcores heavily accessing
>>>>>> their local mempool caches seems likely to me. But how heavy does
>>>>>> the access need to be for this technique to be relevant?
>>>>>>
>>>>>
>>>>> No idea how heavy the accesses need to be for this to have a
>>>>> noticeable effect. For things like debug stats, I wonder how
>>>>> worthwhile making such a change would be, but then again, any change
>>>>> would have very low impact too in that case.
>>>>
>>>> I just tried adding padding to some of the hot structures in our own
>>>> application, and observed a significant performance improvement for
>>>> those.
>>>>
>>>> So I think this technique should have higher visibility in DPDK by
>>>> adding a new cache macro to rte_common.h:
>>>
>>> +1 for more visibility in the docs and for adding a macro, good idea!
>>>
>>
>> A worry I have is that for CPUs with large (in this context) N, you will
>> end up with a lot of padding to avoid next-N-lines false sharing. That
>> would be padding after, and in the general (non-array) case also before,
>> the actual per-lcore data. A slight nuisance is also that those
>> prefetched lines of padding will never contain anything useful, and
>> thus fetching them will always be a waste.
> 
> Out of curiosity, what is the largest N anyone here on the list is aware of?
> 
>>
>> Padding/alignment may not be the only way to avoid HW-prefetcher-induced
>> false sharing for per-lcore data structures.
>>
>> What we are discussing here is organizing the statically allocated
>> per-lcore structs of a particular module in an array with the
>> appropriate padding/alignment. In this model, all data related to a
>> particular module is close (memory address/page-wise), but not so close
>> as to cause false sharing.
>>
>> /* rte_a.c */
>>
>> struct rte_a_state
>> {
>> 	int x;
>> 	RTE_CACHE_GUARD;
>> } __rte_cache_aligned;
>>
>> static struct rte_a_state a_states[RTE_MAX_LCORE];
>>
>> /* rte_b.c */
>>
>> struct rte_b_state
>> {
>> 	char y;
>> 	char z;
>> 	RTE_CACHE_GUARD;
>> } __rte_cache_aligned;
>>
>> static struct rte_b_state b_states[RTE_MAX_LCORE];
>>
>> What you would end up with at runtime, when the linker has done its
>> job, is something that essentially looks like this (in memory):
>>
>> struct {
>> 	struct rte_a_state a_states[RTE_MAX_LCORE];
>> 	struct rte_b_state b_states[RTE_MAX_LCORE];
>> };
>>
>> You could consider turning it around, and keeping data (i.e., module
>> structs) related to a particular lcore, for all modules, close. In
>> other words, keeping a per-lcore array of variable-sized elements.
>>
>> So, something that will end up looking like this (in memory, not in the
>> source code):
>>
>> struct rte_lcore_state
>> {
>> 	struct rte_a_state a_state;
>> 	struct rte_b_state b_state;
>> 	RTE_CACHE_GUARD;
>> };
>>
>> struct rte_lcore_state lcore_states[RTE_MAX_LCORE];
>>
>> In such a scenario, the per-lcore struct type for a module need not
>> (and should not) be cache-line-aligned (but may still have some
>> alignment requirements). Data will be more tightly packed, and the
>> "next lines" prefetched may actually be useful (although I'm guessing
>> that in practice they usually will not be).
>>
>> There may be several ways to implement that scheme. The above is to
>> illustrate how things would look in memory, not necessarily on the
>> level of the source code.
>>
>> One way could be to fit the per-module-per-lcore struct in a chunk of
>> memory allocated in a per-lcore heap. In such a case, the DPDK heap
>> would need extension, maybe with semantics similar to that of NUMA-node
>> specific allocations.
>>
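
(As a purely hypothetical sketch - no such function exists in DPDK
today - the signature of such an allocation could mirror that of
rte_malloc_socket():)

/* Allocate memory from a heap local to the given lcore. Hypothetical. */
void *
rte_malloc_lcore(const char *type, size_t size, unsigned int align,
		 unsigned int lcore_id);
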
>> Another way would be to use thread-local storage (TLS, __thread),
>> although it's unclear to me how well TLS works with larger data
>> structures.
>>
>> A third way may be to somehow achieve something that looks like the
>> above example, using macros, without breaking module encapsulation or
>> generally being too intrusive or otherwise cumbersome.
>>
>> Not sure this is worth the trouble (compared to just more padding), but
>> I thought it was an idea worth sharing.
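
To make that third option slightly less abstract, here is one possible
shape of such a scheme (a sketch only, with hypothetical names; nothing
like this exists in DPDK): each module reserves an offset in a common
per-lcore buffer at initialization time, and thereafter accesses its
state through that offset.

#include <stddef.h>

#include <rte_common.h>
#include <rte_lcore.h>

#define LCORE_BUF_SIZE 4096 /* would include trailing guard space */

static char lcore_bufs[RTE_MAX_LCORE][LCORE_BUF_SIZE] __rte_cache_aligned;
static size_t lcore_buf_used;

/* Reserve room for one module's per-lcore state. Returns the offset,
 * valid for all lcores. 'align' must be a power of two. Called once
 * per module, at initialization.
 */
static size_t
lcore_state_alloc(size_t size, size_t align)
{
	size_t off = (lcore_buf_used + align - 1) & ~(align - 1);

	lcore_buf_used = off + size;
	return off;
}

/* Access a module's state for a particular lcore. */
#define LCORE_STATE(type, off, lcore_id) \
	((type *)(void *)&lcore_bufs[lcore_id][off])

With something like this, one lcore's data for all modules stays packed
(so the prefetched "next lines" may actually contain useful data), while
different lcores' data is kept LCORE_BUF_SIZE apart.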
> 
> I think what Mattias suggests is relevant, and it would be great if a generic solution could be found for DPDK.
> 
> For reference, we initially used RTE_PER_LCORE(module_variable), i.e. thread local storage, extensively in our application modules. But it has two disadvantages:
> 1. TLS does not use hugepages. (The same applies to global and local variables, BTW.)
> 2. We need to set up global pointers to these TLS variables, so they can be accessed from the main lcore (e.g. for statistics). This means that every module needs some sort of module_per_lcore_init, called by the thread after its creation, to set the module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable).
> 

Good points. I never thought about the initialization issue.
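
If I understand correctly, that initialization pattern looks roughly
like this (a sketch; the module names are hypothetical):

#include <stdint.h>

#include <rte_lcore.h>
#include <rte_per_lcore.h>

struct module_state {
	uint64_t stats;
};

RTE_DEFINE_PER_LCORE(struct module_state, module_variable);

/* Global pointers, so that e.g. the main lcore can read statistics
 * from the other threads' TLS instances.
 */
static struct module_state *module_global_ptr[RTE_MAX_LCORE];

/* Every thread must call this once, after creation. */
static void
module_per_lcore_init(void)
{
	module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable);
}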

How about memory consumption and TLS? If you have many non-EAL threads
in the DPDK process, would the system allocate TLS memory for DPDK
lcore-specific data structures? This assumes a scenario where __thread
was used instead of the standard DPDK pattern.

> Eventually, we gave up and migrated to the DPDK standard design pattern of instantiating a global module_variable[RTE_MAX_LCORE], and letting each thread use its own entry in that array.
> 
> And as Mattias suggests, this adds a lot of useless padding, because each module's variables now need to start on their own cache line.
> 
> So a generic solution with packed per-thread data would be a great long term solution.
> 
> Short term, I can live with the simple cache guard. It is very easy to implement and use.
> 
The RTE_CACHE_GUARD pattern is also intrusive, in the sense that it
needs to be explicitly added everywhere (just like __rte_cache_aligned),
as well as error-prone and somewhat brittle (in the face of a changed <N>).

(I mention this not to discourage the use of RTE_CACHE_GUARD, but rather
to invite something more efficient, robust, and easier to use.)
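
For concreteness, one way such a macro could be defined in rte_common.h
(a sketch; the actual definition, including how <N> is configured, may
well end up different):

/* Number of cache lines the HW prefetchers may pull in alongside an
 * accessed line - the "N" discussed above. Assumed to be a build-time
 * constant here.
 */
#ifndef RTE_CACHE_GUARD_LINES
#define RTE_CACHE_GUARD_LINES 1
#endif

/* Helper indirection so that __COUNTER__ is expanded before the token
 * paste, giving each guard member a unique name.
 */
#define _RTE_CACHE_GUARD_HELPER2(unique)                               \
	char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE *              \
			RTE_CACHE_GUARD_LINES] __rte_cache_aligned
#define _RTE_CACHE_GUARD_HELPER1(unique)                               \
	_RTE_CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)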

Thread overview: 19+ messages
2023-08-25  6:45 cache thrashing question Morten Brørup
2023-08-25  8:22 ` Bruce Richardson
2023-08-25  9:06   ` Morten Brørup
2023-08-25  9:23     ` Bruce Richardson
2023-08-27  8:34       ` [RFC] cache guard Morten Brørup
2023-08-27 13:55         ` Mattias Rönnblom
2023-08-27 15:40           ` Morten Brørup
2023-08-27 22:30             ` Mattias Rönnblom
2023-08-28  6:32               ` Morten Brørup
2023-08-28  8:46                 ` Mattias Rönnblom
2023-08-28  9:54                   ` Morten Brørup
2023-08-28 10:40                     ` Stephen Hemminger
2023-08-28  7:57             ` Bruce Richardson
2023-09-01 12:26         ` Thomas Monjalon
2023-09-01 16:57           ` Mattias Rönnblom
2023-09-01 18:52             ` Morten Brørup
2023-09-04 12:07               ` Mattias Rönnblom [this message]
2023-09-04 12:48                 ` Morten Brørup
2023-09-05  5:50                   ` Mattias Rönnblom
