DPDK patches and discussions
From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	"Bruce Richardson" <bruce.richardson@intel.com>
Cc: dev@dpdk.org, olivier.matz@6wind.com,
	andrew.rybchenko@oktetlabs.ru, honnappa.nagarahalli@arm.com,
	konstantin.v.ananyev@yandex.ru, mattias.ronnblom@ericsson.com
Subject: Re: [RFC] cache guard
Date: Mon, 28 Aug 2023 00:30:50 +0200	[thread overview]
Message-ID: <d757da9f-7dc9-9f78-24ce-ba82b8d8fc57@lysator.liu.se> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87B49@smartserver.smartshare.dk>

On 2023-08-27 17:40, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Sunday, 27 August 2023 15.55
>>
>> On 2023-08-27 10:34, Morten Brørup wrote:
>>> +CC Honnappa and Konstantin, Ring lib maintainers
>>> +CC Mattias, PRNG lib maintainer
>>>
>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>> Sent: Friday, 25 August 2023 11.24
>>>>
>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>>> +CC mempool maintainers
>>>>>
>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>>
>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>>> Bruce,
>>>>>>>
>>>>>>> With this patch [1], it is noted that the ring producer and
>>>>>>> consumer data should not be on adjacent cache lines, for
>>>>>>> performance reasons.
>>>>>>>
>>>>>>> [1]:
>>>>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
>>>>>>>
>>>>>>> (It's obvious that they cannot share the same cache line,
>>>>>>> because they are accessed by two different threads.)
>>>>>>>
>>>>>>> Intuitively, I would think that having them on different cache
>>>>>>> lines would suffice. Why does having an empty cache line between
>>>>>>> them make a difference?
>>>>>>>
>>>>>>> And does it need to be an empty cache line? Or does it suffice
>>>>>>> having the second structure start two cache lines after the
>>>>>>> start of the first structure (e.g. if the size of the first
>>>>>>> structure is two cache lines)?
>>>>>>>
>>>>>>> I'm asking because the same principle might apply to other code
>>>>>>> too.
>>>>>>>
>>>>>> Hi Morten,
>>>>>>
>>>>>> this was something we discovered when working on the distributor
>>>>>> library. If we have cachelines per core where there is heavy
>>>>>> access, having some cachelines as a gap between the content
>>>>>> cachelines can help performance. We believe this helps due to
>>>>>> avoiding issues with the HW prefetchers (e.g. adjacent cacheline
>>>>>> prefetcher) bringing in the second cacheline speculatively when
>>>>>> an operation is done on the first line.
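
(To illustrate the effect Bruce describes: a minimal sketch of my own,
with hypothetical names, not from any patch. Two per-lcore counters on
adjacent cache lines can be coupled by a next-line prefetcher unless a
guard line separates them.)

#include <stdint.h>

#include <rte_common.h>
#include <rte_config.h>

struct lcore_counter {
	uint64_t pkts; /* heavily updated by exactly one lcore */
} __rte_cache_aligned; /* type size padded to one full cache line */

/* counters[n] and counters[n + 1] occupy adjacent cache lines. When
 * the core owning counters[n] writes its line, an adjacent-cacheline
 * prefetcher may speculatively pull in the line holding
 * counters[n + 1], stealing it from the neighboring core. An empty
 * guard line between the two elements would break that coupling. */
static struct lcore_counter counters[RTE_MAX_LCORE];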
>>>>>
>>>>> I guessed that it had something to do with speculative
>>>>> prefetching, but wasn't sure. Good to get confirmation, and that
>>>>> it has a measurable effect somewhere. Very interesting!
>>>>>
>>>>> NB: More comments in the ring lib about stuff like this would be
>>>>> nice.
>>>>>
>>>>> So, for the mempool lib, what do you think about applying the same
>>>>> technique to the rte_mempool_debug_stats structure (which is an
>>>>> array indexed per lcore)... Two adjacent lcores heavily accessing
>>>>> their local mempool caches seems likely to me. But how heavy does
>>>>> the access need to be for this technique to be relevant?
>>>>>
>>>>
>>>> No idea how heavy the accesses need to be for this to have a
>>>> noticeable effect. For things like debug stats, I wonder how
>>>> worthwhile making such a change would be, but then again, any
>>>> change would have very low impact too in that case.
>>>
>>> I just tried adding padding to some of the hot structures in our
>>> own application, and observed a significant performance improvement
>>> for those.
>>>
>>> So I think this technique should have higher visibility in DPDK by
>>> adding a new cache macro to rte_common.h:
>>>
>>> /**
>>>    * Empty cache line, to guard against speculative prefetching.
>>>    *
>>
>> "to guard against false sharing-like effects on systems with a
>> next-N-lines hardware prefetcher"
>>
>>>    * Use as spacing between data accessed by different lcores,
>>>    * to prevent cache thrashing on CPUs with speculative prefetching.
>>>    */
>>> #define RTE_CACHE_GUARD(name) \
>>>         char cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
>>>
>>
>> You could have a macro which specifies how much guarding there needs
>> to be, ideally defined on a per-CPU basis. (These things have nothing
>> to do with the ISA, but everything to do with the implementation.)
>>
>> I'm not sure N is always 1.
>>
>> So the guard padding should be RTE_CACHE_LINE_SIZE *
>> RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
>> #if RTE_CACHE_GUARD_LINES > 0
>> #endif
>>
>> ...so you can disable this (cute!) hack (on custom DPDK builds) in case
>> you have disabled hardware prefetching, which seems generally to be a
>> good idea for packet processing type applications.
>>
>> ...which leads me to another suggestion: add a note on disabling
>> hardware prefetching in the optimization guide.
>>
>> Seems like a very good idea to have this in <rte_common.h>, and
>> otherwise make this issue visible and known.
> 
> Good points, Mattias!
> 
> I also prefer the name-less macro you suggested below.
> 
> So, this gets added to rte_common.h:
> 
> /**
>   * Empty cache lines, to guard against false sharing-like effects
>   * on systems with a next-N-lines hardware prefetcher.
>   *
>   * Use as spacing between data accessed by different lcores,
>   * to prevent cache thrashing on hardware with speculative prefetching.
>   */
> #if RTE_CACHE_GUARD_LINES > 0
> #define _RTE_CACHE_GUARD_HELPER2(unique) \
>          char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
>          __rte_cache_aligned;
> #define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
> #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
> #else
> #define RTE_CACHE_GUARD
> #endif
> 

Seems like a good solution. I considered using __LINE__ to build a
unique name, but __COUNTER__ is much cleaner, provided it's available
in the relevant compilers. (It's not in C11.)
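
In case __COUNTER__ does turn out to be a problem, a strict-C11
fallback (an untested sketch on my part, not part of the proposal)
could keep the two helper macros as they are and only swap in
__LINE__, at the cost of allowing at most one guard per source line:

/* __LINE__ is standard C; the indirection through the helpers is
 * still required so that __LINE__ expands before token pasting. */
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__LINE__)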

Should the semicolon be included or not in HELPER2? If left out, a 
lonely ";" will be left for RTE_CACHE_GUARD_LINES == 0, but I don't 
think that is a problem.

I don't see why __rte_cache_aligned is needed here; the adjacent
struct must be cache-line aligned anyway. Maybe it makes the code more
readable: the explicit guard padding then starts exactly at the first
guard cache line, rather than potentially at some earlier point, and
any non-guard padding ends up at the end of the struct (from
__rte_cache_aligned at the struct level).
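
To make that concrete (a hypothetical layout of my own, not from the
patch, assuming a 64-byte line size and RTE_CACHE_GUARD_LINES == 1):

struct guarded {
	/* 16 bytes of hot producer data; member-level alignment puts
	 * it at offset 0, in cache line 0. */
	struct { uint64_t head; uint64_t tail; } prod __rte_cache_aligned;
	/* With __rte_cache_aligned inside the macro, the guard covers
	 * exactly bytes 64..127, i.e. all of cache line 1. Without it,
	 * the guard would cover bytes 16..79 and the compiler would
	 * insert 48 bytes of anonymous padding before 'cons' instead;
	 * the spacing between prod and cons is identical either way. */
	RTE_CACHE_GUARD;
	struct { uint64_t head; uint64_t tail; } cons __rte_cache_aligned;
} __rte_cache_aligned;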

> And a line in /config/x86/meson.build for x86 architecture:
> 
>    dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> + dpdk_conf.set('RTE_CACHE_GUARD_LINES', 1)
> 
> I don't know about various architectures and implementations, so we should probably use a default of 1, matching the existing guard size in the ring lib.
> 
> @Bruce, I hope you can help with the configuration part of this.
> 
>>
>>> To be used like this:
>>>
>>> struct rte_ring {
>>> 	char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
>>> 	/**< Name of the ring. */
>>> 	int flags;               /**< Flags supplied at creation. */
>>> 	const struct rte_memzone *memzone;
>>> 			/**< Memzone, if any, containing the rte_ring */
>>> 	uint32_t size;           /**< Size of ring. */
>>> 	uint32_t mask;           /**< Mask (size-1) of ring. */
>>> 	uint32_t capacity;       /**< Usable size of ring */
>>>
>>> -	char pad0 __rte_cache_aligned; /**< empty cache line */
>>> +	RTE_CACHE_GUARD(prod);  /**< Isolate producer status. */
> 
> +	RTE_CACHE_GUARD;
> 
> Note: Commenting anonymous cache guard fields seems somewhat superfluous, and would largely be a copy of the general RTE_CACHE_GUARD description anyway, so I have removed the comments.
> 
>>>
>>> 	/** Ring producer status. */
>>> 	union {
>>> 		struct rte_ring_headtail prod;
>>> 		struct rte_ring_hts_headtail hts_prod;
>>> 		struct rte_ring_rts_headtail rts_prod;
>>> 	}  __rte_cache_aligned;
>>>
>>> -	char pad1 __rte_cache_aligned; /**< empty cache line */
>>> +	RTE_CACHE_GUARD(both);  /**< Isolate producer from consumer. */
> 
> +	RTE_CACHE_GUARD;
> 
>>>
>>> 	/** Ring consumer status. */
>>> 	union {
>>> 		struct rte_ring_headtail cons;
>>> 		struct rte_ring_hts_headtail hts_cons;
>>> 		struct rte_ring_rts_headtail rts_cons;
>>> 	}  __rte_cache_aligned;
>>>
>>> -	char pad2 __rte_cache_aligned; /**< empty cache line */
>>> +	RTE_CACHE_GUARD(cons);  /**< Isolate consumer status. */
> 
> +	RTE_CACHE_GUARD;
> 
>>> };
>>>
>>>
>>> And for the mempool library:
>>>
>>> #ifdef RTE_LIBRTE_MEMPOOL_STATS
>>> /**
>>>    * A structure that stores the mempool statistics (per-lcore).
>>>    * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs)
>>>    * are not captured since they can be calculated from other stats.
>>>    * For example: put_cache_objs = put_objs - put_common_pool_objs.
>>>    */
>>> struct rte_mempool_debug_stats {
>>> 	uint64_t put_bulk;             /**< Number of puts. */
>>> 	uint64_t put_objs;             /**< Number of objects successfully put. */
>>> 	uint64_t put_common_pool_bulk; /**< Number of bulks enqueued in common pool. */
>>> 	uint64_t put_common_pool_objs; /**< Number of objects enqueued in common pool. */
>>> 	uint64_t get_common_pool_bulk; /**< Number of bulks dequeued from common pool. */
>>> 	uint64_t get_common_pool_objs; /**< Number of objects dequeued from common pool. */
>>> 	uint64_t get_success_bulk;     /**< Successful allocation number. */
>>> 	uint64_t get_success_objs;     /**< Objects successfully allocated. */
>>> 	uint64_t get_fail_bulk;        /**< Failed allocation number. */
>>> 	uint64_t get_fail_objs;        /**< Objects that failed to be allocated. */
>>> 	uint64_t get_success_blks;     /**< Successful allocation number of contiguous blocks. */
>>> 	uint64_t get_fail_blks;        /**< Failed allocation number of contiguous blocks. */
>>> +	RTE_CACHE_GUARD(debug_stats);  /**< Isolation between lcores. */
> 
> +	RTE_CACHE_GUARD;
> 
>>> } __rte_cache_aligned;
>>> #endif
>>>
>>> struct rte_mempool {
>>>    [...]
>>> #ifdef RTE_LIBRTE_MEMPOOL_STATS
>>> 	/** Per-lcore statistics.
>>> 	 *
>>> 	 * Plus one, for unregistered non-EAL threads.
>>> 	 */
>>> 	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
>>> #endif
>>> }  __rte_cache_aligned;
>>>
>>>
>>> It also seems relevant for the PRNG library:
>>>
>>> /lib/eal/common/rte_random.c:
>>>
>>> struct rte_rand_state {
>>> 	uint64_t z1;
>>> 	uint64_t z2;
>>> 	uint64_t z3;
>>> 	uint64_t z4;
>>> 	uint64_t z5;
>>> +	RTE_CACHE_GUARD(z);
> 
> +	RTE_CACHE_GUARD;
> 
>>> } __rte_cache_aligned;
>>>
>>
>> Yes.
>>
>> Should there be two cache guard macros? One parameter-free
>> RTE_CACHE_GUARD and a RTE_CACHE_NAMED_GUARD(name) macro?
>>
>> Maybe it's better just to keep the single macro, but have a convention
>> with some generic name (i.e., not 'z' above) for the guard field, like
>> 'cache_guard' or just 'guard'. Having a unique name makes no sense, except
>> in rare cases where you need multiple guard lines per struct.
> 
> There is no need to access these fields, so let's use auto-generated unique names, and not offer an API with a name. This also makes the API as simple as possible.
> 
>>
>>> /* One instance each for every lcore id-equipped thread, and one
>>>    * additional instance to be shared by all other threads (i.e., all
>>>    * unregistered non-EAL threads).
>>>    */
>>> static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
>>>

Thread overview: 19+ messages
2023-08-25  6:45 cache thrashing question Morten Brørup
2023-08-25  8:22 ` Bruce Richardson
2023-08-25  9:06   ` Morten Brørup
2023-08-25  9:23     ` Bruce Richardson
2023-08-27  8:34       ` [RFC] cache guard Morten Brørup
2023-08-27 13:55         ` Mattias Rönnblom
2023-08-27 15:40           ` Morten Brørup
2023-08-27 22:30             ` Mattias Rönnblom [this message]
2023-08-28  6:32               ` Morten Brørup
2023-08-28  8:46                 ` Mattias Rönnblom
2023-08-28  9:54                   ` Morten Brørup
2023-08-28 10:40                     ` Stephen Hemminger
2023-08-28  7:57             ` Bruce Richardson
2023-09-01 12:26         ` Thomas Monjalon
2023-09-01 16:57           ` Mattias Rönnblom
2023-09-01 18:52             ` Morten Brørup
2023-09-04 12:07               ` Mattias Rönnblom
2023-09-04 12:48                 ` Morten Brørup
2023-09-05  5:50                   ` Mattias Rönnblom
