From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Mattias Rönnblom" <hofors@lysator.liu.se>,
"Thomas Monjalon" <thomas@monjalon.net>
Cc: "Bruce Richardson" <bruce.richardson@intel.com>, <dev@dpdk.org>,
<olivier.matz@6wind.com>, <andrew.rybchenko@oktetlabs.ru>,
<honnappa.nagarahalli@arm.com>, <konstantin.v.ananyev@yandex.ru>,
<mattias.ronnblom@ericsson.com>
Subject: RE: [RFC] cache guard
Date: Mon, 4 Sep 2023 14:48:46 +0200 [thread overview]
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D87B6F@smartserver.smartshare.dk> (raw)
In-Reply-To: <10be7ca8-fe5a-ba24-503a-9c6ce2fef710@lysator.liu.se>
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 4 September 2023 14.07
>
> On 2023-09-01 20:52, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Friday, 1 September 2023 18.58
> >>
> >> On 2023-09-01 14:26, Thomas Monjalon wrote:
> >>> 27/08/2023 10:34, Morten Brørup:
> >>>> +CC Honnappa and Konstantin, Ring lib maintainers
> >>>> +CC Mattias, PRNG lib maintainer
> >>>>
> >>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>> Sent: Friday, 25 August 2023 11.24
> >>>>>
> >>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> >>>>>> +CC mempool maintainers
> >>>>>>
> >>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>>>> Sent: Friday, 25 August 2023 10.23
> >>>>>>>
> >>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> >>>>>>>> Bruce,
> >>>>>>>>
> >>>>>>>> With this patch [1], it is noted that the ring producer and
> >>>>>>>> consumer data should not be on adjacent cache lines, for
> >>>>>>>> performance reasons.
> >>>>>>>>
> >>>>>>>> [1]:
> >>>>>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
> >>>>>>>>
> >>>>>>>> (It's obvious that they cannot share the same cache line,
> >>>>>>>> because they are accessed by two different threads.)
> >>>>>>>>
> >>>>>>>> Intuitively, I would think that having them on different cache
> >>>>>>>> lines would suffice. Why does having an empty cache line between
> >>>>>>>> them make a difference?
> >>>>>>>>
> >>>>>>>> And does it need to be an empty cache line? Or does it suffice
> >>>>>>>> having the second structure start at two cache lines after the
> >>>>>>>> start of the first structure (e.g. if the size of the first
> >>>>>>>> structure is two cache lines)?
> >>>>>>>>
> >>>>>>>> I'm asking because the same principle might apply to other code
> >>>>>>>> too.
> >>>>>>>>
> >>>>>>> Hi Morten,
> >>>>>>>
> >>>>>>> this was something we discovered when working on the distributor
> >>>>>>> library. If we have cachelines per core where there is heavy
> >>>>>>> access, having some cachelines as a gap between the content
> >>>>>>> cachelines can help performance. We believe this helps due to
> >>>>>>> avoiding issues with the HW prefetchers (e.g. adjacent cacheline
> >>>>>>> prefetcher) bringing in the second cacheline speculatively when an
> >>>>>>> operation is done on the first line.
> >>>>>>
> >>>>>> I guessed that it had something to do with speculative
> >>>>>> prefetching, but wasn't sure. Good to get confirmation, and that
> >>>>>> it has a measurable effect somewhere. Very interesting!
> >>>>>>
> >>>>>> NB: More comments in the ring lib about stuff like this would be
> >>>>>> nice.
> >>>>>>
> >>>>>> So, for the mempool lib, what do you think about applying the same
> >>>>>> technique to the rte_mempool_debug_stats structure (which is an
> >>>>>> array indexed per lcore)... Two adjacent lcores heavily accessing
> >>>>>> their local mempool caches seems likely to me. But how heavy does
> >>>>>> the access need to be for this technique to be relevant?
> >>>>>>
> >>>>>
> >>>>> No idea how heavy the accesses need to be for this to have a
> >>>>> noticeable effect. For things like debug stats, I wonder how
> >>>>> worthwhile making such a change would be, but then again, any
> >>>>> change would have very low impact too in that case.
> >>>>
> >>>> I just tried adding padding to some of the hot structures in our own
> >>>> application, and observed a significant performance improvement for
> >>>> those.
> >>>>
> >>>> So I think this technique should have higher visibility in DPDK by
> >>>> adding a new cache macro to rte_common.h:
> >>>
> >>> +1 to more visibility in the doc and adding a macro, good idea!
> >>>
> >>>
> >>>
> >>
> >> A worry I have is that for CPUs with large (in this context) N, you
> >> will end up with a lot of padding to avoid next-N-lines false sharing.
> >> That would be padding after, and in the general (non-array) case also
> >> before, the actual per-lcore data. A slight nuisance is also that
> >> those prefetched lines of padding will never contain anything useful,
> >> and thus fetching them will always be a waste.
> >
> > Out of curiosity, what is the largest N anyone here on the list is
> > aware of?
> >
> >>
> >> Padding/alignment may not be the only way to avoid
> >> HW-prefetcher-induced false sharing for per-lcore data structures.
> >>
> >> What we are discussing here is organizing the statically allocated
> >> per-lcore structs of a particular module in an array with the
> >> appropriate padding/alignment. In this model, all data related to a
> >> particular module is close (memory address/page-wise), but not so
> >> close as to cause false sharing.
> >>
> >> /* rte_a.c */
> >>
> >> struct rte_a_state
> >> {
> >>         int x;
> >>         RTE_CACHE_GUARD;
> >> } __rte_cache_aligned;
> >>
> >> static struct rte_a_state a_states[RTE_MAX_LCORE];
> >>
> >> /* rte_b.c */
> >>
> >> struct rte_b_state
> >> {
> >>         char y;
> >>         char z;
> >>         RTE_CACHE_GUARD;
> >> } __rte_cache_aligned;
> >>
> >> static struct rte_b_state b_states[RTE_MAX_LCORE];
> >>
> >> What you would end up with at runtime, when the linker has done its
> >> job, is something that essentially looks like this (in memory):
> >>
> >> struct {
> >>         struct rte_a_state a_states[RTE_MAX_LCORE];
> >>         struct rte_b_state b_states[RTE_MAX_LCORE];
> >> };
> >>
> >> You could consider turning it around, and keeping data (i.e., module
> >> structs) related to a particular lcore, for all modules, close. In
> >> other words, keeping per-lcore arrays of variable-sized elements.
> >>
> >> So, something that will end up looking like this (in memory, not in
> >> the source code):
> >>
> >> struct rte_lcore_state
> >> {
> >>         struct rte_a_state a_state;
> >>         struct rte_b_state b_state;
> >>         RTE_CACHE_GUARD;
> >> };
> >>
> >> struct rte_lcore_state lcore_states[RTE_MAX_LCORE];
> >>
> >> In such a scenario, the per-lcore struct type for a module need not
> >> (and should not) be cache-line-aligned (but may still have some
> >> alignment requirements). Data will be more tightly packed, and the
> >> "next lines" prefetched may actually be useful (although I'm guessing
> >> in practice they will usually not).
> >>
> >> There may be several ways to implement that scheme. The above is to
> >> illustrate how things would look in memory, not necessarily on the
> >> level of the source code.
> >>
> >> One way could be to fit the per-module-per-lcore struct in a chunk of
> >> memory allocated in a per-lcore heap. In such a case, the DPDK heap
> >> would need extension, maybe with semantics similar to that of
> >> NUMA-node specific allocations.
> >>
> >> Another way would be to use thread-local storage (TLS, __thread),
> >> although it's unclear to me how well TLS works with larger data
> >> structures.
> >>
> >> A third way may be to somehow achieve something that looks like the
> >> above example, using macros, without breaking module encapsulation or
> >> generally being too intrusive or otherwise cumbersome.
> >>
> >> Not sure this is worth the trouble (compared to just more padding),
> >> but I thought it was an idea worth sharing.
> >
> > I think what Mattias suggests is relevant, and it would be great if a
> > generic solution could be found for DPDK.
> >
> > For reference, we initially used RTE_PER_LCORE(module_variable), i.e.
> > thread-local storage, extensively in our application modules. But it
> > has two disadvantages:
> > 1. TLS does not use hugepages. (The same applies to global and local
> > variables, BTW.)
> > 2. We need to set up global pointers to these TLS variables, so they
> > can be accessed from the main lcore (e.g. for statistics). This means
> > that every module needs some sort of module_per_lcore_init, called by
> > the thread after its creation, to set
> > module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable).
> >
>
> Good points. I never thought about the initialization issue.
>
> How about memory consumption and TLS? If you have many non-EAL-threads
> in the DPDK process, would the system allocate TLS memory for DPDK
> lcore-specific data structures? Assuming a scenario where __thread was
> used instead of the standard DPDK pattern.
Room for all __thread variables is allocated and initialized for every thread that is started.
Note: RTE_PER_LCORE variables are simply __thread-wrapped variables:
#define RTE_DEFINE_PER_LCORE(type, name) \
        __thread __typeof__(type) per_lcore_##name
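
To make the workaround from point 2 further up concrete, here is a minimal sketch of that pattern; the module name "foo", its stats struct and foo_per_lcore_init() are made-up names for illustration, not from any DPDK library:

#include <stdint.h>
#include <rte_config.h>
#include <rte_lcore.h>
#include <rte_per_lcore.h>

/* Hypothetical module "foo" keeping its per-lcore state in TLS. */
struct foo_stats {
        uint64_t pkts;
        uint64_t errors;
};

/* Thread-local instance; lives in normal memory, not in hugepages. */
static RTE_DEFINE_PER_LCORE(struct foo_stats, foo_stats);

/* Global pointers so the main lcore can read every thread's stats
 * (through an accessor, not shown). */
static struct foo_stats *foo_stats_ptr[RTE_MAX_LCORE];

/* Must be called by each thread after it has been created. */
void foo_per_lcore_init(void)
{
        foo_stats_ptr[rte_lcore_id()] = &RTE_PER_LCORE(foo_stats);
}

The per-thread init call is exactly the inconvenience described above; the module_variable[RTE_MAX_LCORE] pattern avoids it, at the cost of the padding discussed below.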
>
> > Eventually, we gave up and migrated to the DPDK standard design
> > pattern of instantiating a global module_variable[RTE_MAX_LCORE], and
> > letting each thread use its own entry in that array.
> >
> > And as Mattias suggests, this adds a lot of useless padding, because
> > each module's variables now need to start on their own cache line.
> >
> > So a generic solution with packed per-thread data would be a great
> > long term solution.
> >
> > Short term, I can live with the simple cache guard. It is very easy to
> > implement and use.
> >
> The RTE_CACHE_GUARD pattern is also intrusive, in the sense it needs to
> be explicitly added everywhere (just like __rte_cache_aligned) and error
> prone, and somewhat brittle (in the face of changed <N>).
Agree. Like a lot of other stuff in DPDK. DPDK is very explicit. :-)
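For readers joining the thread, a minimal sketch of what the proposed macro and its use could look like; the helper macros, the RTE_CACHE_GUARD_LINES build option and its default of 1 are illustrative assumptions here, not the final rte_common.h contents:

#include <stdint.h>
#include <rte_common.h>

/* Number of empty cache lines placed after the hot data; would be a
 * build-time option sized to the CPU's prefetch reach (<N>).
 * Assumed default for illustration only. */
#ifndef RTE_CACHE_GUARD_LINES
#define RTE_CACHE_GUARD_LINES 1
#endif

/* Padding member; __COUNTER__ gives each guard a unique field name,
 * so a struct can contain more than one guard. */
#define _RTE_CACHE_GUARD_HELPER2(unique) \
        unsigned char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * \
                RTE_CACHE_GUARD_LINES] __rte_cache_aligned
#define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)

/* Usage, following the per-lcore array pattern discussed above: */
struct module_state {
        uint64_t counter;
        RTE_CACHE_GUARD;
} __rte_cache_aligned;

static struct module_state states[RTE_MAX_LCORE];

On CPUs whose prefetchers reach further, RTE_CACHE_GUARD_LINES would simply be set higher, at the cost of proportionally more padding per lcore - the trade-off discussed above.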
>
> (I mentioned this not to discourage the use of RTE_CACHE_GUARD - more to
> somehow encourage and invite something more efficient, robust and
> easier to use.)
I get that you don't discourage it, but I don't expect the discussion to proceed further at this time. So please don't forget to also ACK this patch. ;-)
There should be a wish list where concepts like your suggested improvement could easily be noted.
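To put a rough number on the padding cost mentioned above (figures assumed purely for illustration: 64-byte cache lines, RTE_MAX_LCORE = 128, one guard line): a module with only 8 bytes of per-lcore data still occupies one data line plus one guard line per lcore, i.e. 128 * 2 * 64 bytes = 16 KiB, of which only 128 * 8 bytes = 1 KiB is useful data. That is the waste a packed per-lcore layout could reclaim.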
Thread overview: 19+ messages
2023-08-25 6:45 cache thrashing question Morten Brørup
2023-08-25 8:22 ` Bruce Richardson
2023-08-25 9:06 ` Morten Brørup
2023-08-25 9:23 ` Bruce Richardson
2023-08-27 8:34 ` [RFC] cache guard Morten Brørup
2023-08-27 13:55 ` Mattias Rönnblom
2023-08-27 15:40 ` Morten Brørup
2023-08-27 22:30 ` Mattias Rönnblom
2023-08-28 6:32 ` Morten Brørup
2023-08-28 8:46 ` Mattias Rönnblom
2023-08-28 9:54 ` Morten Brørup
2023-08-28 10:40 ` Stephen Hemminger
2023-08-28 7:57 ` Bruce Richardson
2023-09-01 12:26 ` Thomas Monjalon
2023-09-01 16:57 ` Mattias Rönnblom
2023-09-01 18:52 ` Morten Brørup
2023-09-04 12:07 ` Mattias Rönnblom
2023-09-04 12:48 ` Morten Brørup [this message]
2023-09-05 5:50 ` Mattias Rönnblom