DPDK patches and discussions
* cache thrashing question
@ 2023-08-25  6:45 Morten Brørup
  2023-08-25  8:22 ` Bruce Richardson
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-08-25  6:45 UTC (permalink / raw)
  To: bruce.richardson; +Cc: dev

Bruce,

With this patch [1], it is noted that the ring producer and consumer data should not be on adjacent cache lines, for performance reasons.

[1]: https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0

(It's obvious that they cannot share the same cache line, because they are accessed by two different threads.)

Intuitively, I would think that having them on different cache lines would suffice. Why does having an empty cache line between them make a difference?

And does it need to be an empty cache line? Or does it suffice having the second structure start at two cache lines after the start of the first structure (e.g. if the size of the first structure is two cache lines)?

I'm asking because the same principle might apply to other code too.


Med venlig hilsen / Kind regards,
-Morten Brørup


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: cache thrashing question
  2023-08-25  6:45 cache thrashing question Morten Brørup
@ 2023-08-25  8:22 ` Bruce Richardson
  2023-08-25  9:06   ` Morten Brørup
  0 siblings, 1 reply; 19+ messages in thread
From: Bruce Richardson @ 2023-08-25  8:22 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev

On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> Bruce,
> 
> With this patch [1], it is noted that the ring producer and consumer data should not be on adjacent cache lines, for performance reasons.
> 
> [1]: https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
> 
> (It's obvious that they cannot share the same cache line, because they are accessed by two different threads.)
> 
> Intuitively, I would think that having them on different cache lines would suffice. Why does having an empty cache line between them make a difference?
> 
> And does it need to be an empty cache line? Or does it suffice having the second structure start at two cache lines after the start of the first structure (e.g. if the size of the first structure is two cache lines)?
> 
> I'm asking because the same principle might apply to other code too.
> 
Hi Morten,

this was something we discovered when working on the distributor library.
If we have cachelines per core where there is heavy access, having some
cachelines as a gap between the content cachelines can help performance. We
believe this helps by avoiding issues with the HW prefetchers (e.g. the
adjacent cacheline prefetcher) speculatively bringing in the second
cacheline when an operation is done on the first line.
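
As a minimal sketch of the kind of layout this leads to (hypothetical code,
not the actual distributor structures; assuming a 64-byte cache line):

#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* One slot per core: the hot data gets its own cache line, followed by an
 * unused guard line, so a next-line hardware prefetcher touching the hot
 * line never speculatively pulls in a neighbouring core's hot line.
 */
struct per_core_slot {
	uint64_t hot_counter;	/* heavily accessed by exactly one core */
	char pad[CACHE_LINE_SIZE - sizeof(uint64_t)];	/* fill out the hot line */
	char guard[CACHE_LINE_SIZE];	/* empty guard cache line */
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct per_core_slot slots[64];	/* e.g. one slot per core */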

/Bruce

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: cache thrashing question
  2023-08-25  8:22 ` Bruce Richardson
@ 2023-08-25  9:06   ` Morten Brørup
  2023-08-25  9:23     ` Bruce Richardson
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-08-25  9:06 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, olivier.matz, andrew.rybchenko

+CC mempool maintainers

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 25 August 2023 10.23
> 
> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > Bruce,
> >
> > With this patch [1], it is noted that the ring producer and consumer data
> should not be on adjacent cache lines, for performance reasons.
> >
> > [1]:
> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66
> e75485cc8b63b9aedfbdfe8b0
> >
> > (It's obvious that they cannot share the same cache line, because they are
> accessed by two different threads.)
> >
> > Intuitively, I would think that having them on different cache lines would
> suffice. Why does having an empty cache line between them make a difference?
> >
> > And does it need to be an empty cache line? Or does it suffice having the
> second structure start at two cache lines after the start of the first
> structure (e.g. if the size of the first structure is two cache lines)?
> >
> > I'm asking because the same principle might apply to other code too.
> >
> Hi Morten,
> 
> this was something we discovered when working on the distributor library.
> If we have cachelines per core where there is heavy access, having some
> cachelines as a gap between the content cachelines can help performance. We
> believe this helps due to avoiding issues with the HW prefetchers (e.g.
> adjacent cacheline prefetcher) bringing in the second cacheline
> speculatively when an operation is done on the first line.

I guessed that it had something to do with speculative prefetching, but wasn't sure. Good to get confirmation, and to know that it has a measurable effect somewhere. Very interesting!

NB: More comments in the ring lib about stuff like this would be nice.

So, for the mempool lib, what do you think about applying the same technique to the rte_mempool_debug_stats structure (which is an array indexed per lcore)... Two adjacent lcores heavily accessing their local mempool caches seems likely to me. But how heavy does the access need to be for this technique to be relevant?

For the rte_mempool_cache structure (also an array indexed per lcore), the last entries of the "objs" array at the end of the structure are unlikely to be used, so they already serve as a gap, and an additional gap seems irrelevant here.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: cache thrashing question
  2023-08-25  9:06   ` Morten Brørup
@ 2023-08-25  9:23     ` Bruce Richardson
  2023-08-27  8:34       ` [RFC] cache guard Morten Brørup
  0 siblings, 1 reply; 19+ messages in thread
From: Bruce Richardson @ 2023-08-25  9:23 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev, olivier.matz, andrew.rybchenko

On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> +CC mempool maintainers
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 25 August 2023 10.23
> > 
> > On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > > Bruce,
> > >
> > > With this patch [1], it is noted that the ring producer and consumer data
> > should not be on adjacent cache lines, for performance reasons.
> > >
> > > [1]:
> > https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66
> > e75485cc8b63b9aedfbdfe8b0
> > >
> > > (It's obvious that they cannot share the same cache line, because they are
> > accessed by two different threads.)
> > >
> > > Intuitively, I would think that having them on different cache lines would
> > suffice. Why does having an empty cache line between them make a difference?
> > >
> > > And does it need to be an empty cache line? Or does it suffice having the
> > second structure start at two cache lines after the start of the first
> > structure (e.g. if the size of the first structure is two cache lines)?
> > >
> > > I'm asking because the same principle might apply to other code too.
> > >
> > Hi Morten,
> > 
> > this was something we discovered when working on the distributor library.
> > If we have cachelines per core where there is heavy access, having some
> > cachelines as a gap between the content cachelines can help performance. We
> > believe this helps due to avoiding issues with the HW prefetchers (e.g.
> > adjacent cacheline prefetcher) bringing in the second cacheline
> > speculatively when an operation is done on the first line.
> 
> I guessed that it had something to do with speculative prefetching, but wasn't sure. Good to get confirmation, and that it has a measureable effect somewhere. Very interesting!
> 
> NB: More comments in the ring lib about stuff like this would be nice.
> 
> So, for the mempool lib, what do you think about applying the same technique to the rte_mempool_debug_stats structure (which is an array indexed per lcore)... Two adjacent lcores heavily accessing their local mempool caches seems likely to me. But how heavy does the access need to be for this technique to be relevant?
>

No idea how heavy the accesses need to be for this to have a noticeable
effect. For things like debug stats, I wonder how worthwhile making such a
change would be, but then again, any change would also have very low impact
in that case.

/Bruce

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC] cache guard
  2023-08-25  9:23     ` Bruce Richardson
@ 2023-08-27  8:34       ` Morten Brørup
  2023-08-27 13:55         ` Mattias Rönnblom
  2023-09-01 12:26         ` Thomas Monjalon
  0 siblings, 2 replies; 19+ messages in thread
From: Morten Brørup @ 2023-08-27  8:34 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: dev, olivier.matz, andrew.rybchenko, honnappa.nagarahalli,
	konstantin.v.ananyev, mattias.ronnblom

+CC Honnappa and Konstantin, Ring lib maintainers
+CC Mattias, PRNG lib maintainer

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 25 August 2023 11.24
> 
> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> > +CC mempool maintainers
> >
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Friday, 25 August 2023 10.23
> > >
> > > On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > > > Bruce,
> > > >
> > > > With this patch [1], it is noted that the ring producer and
> consumer data
> > > should not be on adjacent cache lines, for performance reasons.
> > > >
> > > > [1]:
> > >
> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
> fd4b66
> > > e75485cc8b63b9aedfbdfe8b0
> > > >
> > > > (It's obvious that they cannot share the same cache line, because
> they are
> > > accessed by two different threads.)
> > > >
> > > > Intuitively, I would think that having them on different cache
> lines would
> > > suffice. Why does having an empty cache line between them make a
> difference?
> > > >
> > > > And does it need to be an empty cache line? Or does it suffice
> having the
> > > second structure start at two cache lines after the start of the
> first
> > > structure (e.g. if the size of the first structure is two cache
> lines)?
> > > >
> > > > I'm asking because the same principle might apply to other code
> too.
> > > >
> > > Hi Morten,
> > >
> > > this was something we discovered when working on the distributor
> library.
> > > If we have cachelines per core where there is heavy access, having
> some
> > > cachelines as a gap between the content cachelines can help
> performance. We
> > > believe this helps due to avoiding issues with the HW prefetchers
> (e.g.
> > > adjacent cacheline prefetcher) bringing in the second cacheline
> > > speculatively when an operation is done on the first line.
> >
> > I guessed that it had something to do with speculative prefetching,
> but wasn't sure. Good to get confirmation, and that it has a measureable
> effect somewhere. Very interesting!
> >
> > NB: More comments in the ring lib about stuff like this would be nice.
> >
> > So, for the mempool lib, what do you think about applying the same
> technique to the rte_mempool_debug_stats structure (which is an array
> indexed per lcore)... Two adjacent lcores heavily accessing their local
> mempool caches seems likely to me. But how heavy does the access need to
> be for this technique to be relevant?
> >
> 
> No idea how heavy the accesses need to be for this to have a noticable
> effect. For things like debug stats, I wonder how worthwhile making such
> a
> change would be, but then again, any change would have very low impact
> too
> in that case.

I just tried adding padding to some of the hot structures in our own application, and observed a significant performance improvement for those.

So I think this technique should have higher visibility in DPDK by adding a new cache macro to rte_common.h:

/**
 * Empty cache line, to guard against speculative prefetching.
 *
 * Use as spacing between data accessed by different lcores,
 * to prevent cache thrashing on CPUs with speculative prefetching.
 */
#define RTE_CACHE_GUARD(name) char cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
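
For example, RTE_CACHE_GUARD(prod) would expand to:

	char cache_guard_prod[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;

i.e. one full, otherwise unused, cache-aligned line named after the data it guards.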

To be used like this:

struct rte_ring {
	char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
	/**< Name of the ring. */
	int flags;               /**< Flags supplied at creation. */
	const struct rte_memzone *memzone;
			/**< Memzone, if any, containing the rte_ring */
	uint32_t size;           /**< Size of ring. */
	uint32_t mask;           /**< Mask (size-1) of ring. */
	uint32_t capacity;       /**< Usable size of ring */

-	char pad0 __rte_cache_aligned; /**< empty cache line */
+	RTE_CACHE_GUARD(prod);  /**< Isolate producer status. */

	/** Ring producer status. */
	union {
		struct rte_ring_headtail prod;
		struct rte_ring_hts_headtail hts_prod;
		struct rte_ring_rts_headtail rts_prod;
	}  __rte_cache_aligned;

-	char pad1 __rte_cache_aligned; /**< empty cache line */
+	RTE_CACHE_GUARD(both);  /**< Isolate producer from consumer. */

	/** Ring consumer status. */
	union {
		struct rte_ring_headtail cons;
		struct rte_ring_hts_headtail hts_cons;
		struct rte_ring_rts_headtail rts_cons;
	}  __rte_cache_aligned;

-	char pad2 __rte_cache_aligned; /**< empty cache line */
+	RTE_CACHE_GUARD(cons);  /**< Isolate consumer status. */
};


And for the mempool library:

#ifdef RTE_LIBRTE_MEMPOOL_STATS
/**
 * A structure that stores the mempool statistics (per-lcore).
 * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs) are not
 * captured since they can be calculated from other stats.
 * For example: put_cache_objs = put_objs - put_common_pool_objs.
 */
struct rte_mempool_debug_stats {
	uint64_t put_bulk;             /**< Number of puts. */
	uint64_t put_objs;             /**< Number of objects successfully put. */
	uint64_t put_common_pool_bulk; /**< Number of bulks enqueued in common pool. */
	uint64_t put_common_pool_objs; /**< Number of objects enqueued in common pool. */
	uint64_t get_common_pool_bulk; /**< Number of bulks dequeued from common pool. */
	uint64_t get_common_pool_objs; /**< Number of objects dequeued from common pool. */
	uint64_t get_success_bulk;     /**< Successful allocation number. */
	uint64_t get_success_objs;     /**< Objects successfully allocated. */
	uint64_t get_fail_bulk;        /**< Failed allocation number. */
	uint64_t get_fail_objs;        /**< Objects that failed to be allocated. */
	uint64_t get_success_blks;     /**< Successful allocation number of contiguous blocks. */
	uint64_t get_fail_blks;        /**< Failed allocation number of contiguous blocks. */
+	RTE_CACHE_GUARD(debug_stats);  /**< Isolation between lcores. */
} __rte_cache_aligned;
#endif

struct rte_mempool {
 [...]
#ifdef RTE_LIBRTE_MEMPOOL_STATS
	/** Per-lcore statistics.
	 *
	 * Plus one, for unregistered non-EAL threads.
	 */
	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
#endif
}  __rte_cache_aligned;


It also seems relevant for the PRNG library:

/lib/eal/common/rte_random.c:

struct rte_rand_state {
	uint64_t z1;
	uint64_t z2;
	uint64_t z3;
	uint64_t z4;
	uint64_t z5;
+	RTE_CACHE_GUARD(z);
} __rte_cache_aligned;

/* One instance each for every lcore id-equipped thread, and one
 * additional instance to be shared by all others threads (i.e., all
 * unregistered non-EAL threads).
 */
static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
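
For scale (back-of-the-envelope, assuming a 64-byte cache line and a single guard line): the five uint64_t fields occupy 40 bytes, which __rte_cache_aligned already pads to 64 bytes; with the guard, each element grows to 128 bytes, so every lcore's state is followed by a full unused cache line before the next lcore's state begins. The cost is one extra cache line per rand_states entry, i.e. (RTE_MAX_LCORE + 1) * 64 bytes of static storage.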


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] cache guard
  2023-08-27  8:34       ` [RFC] cache guard Morten Brørup
@ 2023-08-27 13:55         ` Mattias Rönnblom
  2023-08-27 15:40           ` Morten Brørup
  2023-09-01 12:26         ` Thomas Monjalon
  1 sibling, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2023-08-27 13:55 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: dev, olivier.matz, andrew.rybchenko, honnappa.nagarahalli,
	konstantin.v.ananyev, mattias.ronnblom

On 2023-08-27 10:34, Morten Brørup wrote:
> +CC Honnappa and Konstantin, Ring lib maintainers
> +CC Mattias, PRNG lib maintainer
> 
>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>> Sent: Friday, 25 August 2023 11.24
>>
>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>> +CC mempool maintainers
>>>
>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>> Sent: Friday, 25 August 2023 10.23
>>>>
>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>> Bruce,
>>>>>
>>>>> With this patch [1], it is noted that the ring producer and
>> consumer data
>>>> should not be on adjacent cache lines, for performance reasons.
>>>>>
>>>>> [1]:
>>>>
>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
>> fd4b66
>>>> e75485cc8b63b9aedfbdfe8b0
>>>>>
>>>>> (It's obvious that they cannot share the same cache line, because
>> they are
>>>> accessed by two different threads.)
>>>>>
>>>>> Intuitively, I would think that having them on different cache
>> lines would
>>>> suffice. Why does having an empty cache line between them make a
>> difference?
>>>>>
>>>>> And does it need to be an empty cache line? Or does it suffice
>> having the
>>>> second structure start at two cache lines after the start of the
>> first
>>>> structure (e.g. if the size of the first structure is two cache
>> lines)?
>>>>>
>>>>> I'm asking because the same principle might apply to other code
>> too.
>>>>>
>>>> Hi Morten,
>>>>
>>>> this was something we discovered when working on the distributor
>> library.
>>>> If we have cachelines per core where there is heavy access, having
>> some
>>>> cachelines as a gap between the content cachelines can help
>> performance. We
>>>> believe this helps due to avoiding issues with the HW prefetchers
>> (e.g.
>>>> adjacent cacheline prefetcher) bringing in the second cacheline
>>>> speculatively when an operation is done on the first line.
>>>
>>> I guessed that it had something to do with speculative prefetching,
>> but wasn't sure. Good to get confirmation, and that it has a measureable
>> effect somewhere. Very interesting!
>>>
>>> NB: More comments in the ring lib about stuff like this would be nice.
>>>
>>> So, for the mempool lib, what do you think about applying the same
>> technique to the rte_mempool_debug_stats structure (which is an array
>> indexed per lcore)... Two adjacent lcores heavily accessing their local
>> mempool caches seems likely to me. But how heavy does the access need to
>> be for this technique to be relevant?
>>>
>>
>> No idea how heavy the accesses need to be for this to have a noticable
>> effect. For things like debug stats, I wonder how worthwhile making such
>> a
>> change would be, but then again, any change would have very low impact
>> too
>> in that case.
> 
> I just tried adding padding to some of the hot structures in our own application, and observed a significant performance improvement for those.
> 
> So I think this technique should have higher visibility in DPDK by adding a new cache macro to rte_common.h:
> 
> /**
>   * Empty cache line, to guard against speculative prefetching.
>   *

"to guard against false sharing-like effects on systems with a 
next-N-lines hardware prefetcher"

>   * Use as spacing between data accessed by different lcores,
>   * to prevent cache thrashing on CPUs with speculative prefetching.
>   */
> #define RTE_CACHE_GUARD(name) char cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
> 

You could have a macro which specifies how much guarding there needs to 
be, ideally defined on a per-CPU basis. (These things have nothing to do 
with the ISA, but everything to do with the implementation.)

I'm not sure N is always 1.

So the guard padding should be RTE_CACHE_LINE_SIZE * 
RTE_CACHE_GUARD_LINES bytes, with the whole thing wrapped in
#if RTE_CACHE_GUARD_LINES > 0
#endif

...so you can disable this (cute!) hack (on custom DPDK builds) in case 
you have disabled hardware prefetching, which seems generally to be a 
good idea for packet processing type applications.

...which leads me to another suggestion: add a note on disabling 
hardware prefetching in the optimization guide.

Seems like a very good idea to have this in <rte_common.h>, and 
otherwise make this issue visible and known.

> To be used like this:
> 
> struct rte_ring {
> 	char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
> 	/**< Name of the ring. */
> 	int flags;               /**< Flags supplied at creation. */
> 	const struct rte_memzone *memzone;
> 			/**< Memzone, if any, containing the rte_ring */
> 	uint32_t size;           /**< Size of ring. */
> 	uint32_t mask;           /**< Mask (size-1) of ring. */
> 	uint32_t capacity;       /**< Usable size of ring */
> 
> -	char pad0 __rte_cache_aligned; /**< empty cache line */
> +	RTE_CACHE_GUARD(prod);  /**< Isolate producer status. */
> 
> 	/** Ring producer status. */
> 	union {
> 		struct rte_ring_headtail prod;
> 		struct rte_ring_hts_headtail hts_prod;
> 		struct rte_ring_rts_headtail rts_prod;
> 	}  __rte_cache_aligned;
> 
> -	char pad1 __rte_cache_aligned; /**< empty cache line */
> +	RTE_CACHE_GUARD(both);  /**< Isolate producer from consumer. */
> 
> 	/** Ring consumer status. */
> 	union {
> 		struct rte_ring_headtail cons;
> 		struct rte_ring_hts_headtail hts_cons;
> 		struct rte_ring_rts_headtail rts_cons;
> 	}  __rte_cache_aligned;
> 
> -	char pad2 __rte_cache_aligned; /**< empty cache line */
> +	RTE_CACHE_GUARD(cons);  /**< Isolate consumer status. */
> };
> 
> 
> And for the mempool library:
> 
> #ifdef RTE_LIBRTE_MEMPOOL_STATS
> /**
>   * A structure that stores the mempool statistics (per-lcore).
>   * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs) are not
>   * captured since they can be calculated from other stats.
>   * For example: put_cache_objs = put_objs - put_common_pool_objs.
>   */
> struct rte_mempool_debug_stats {
> 	uint64_t put_bulk;             /**< Number of puts. */
> 	uint64_t put_objs;             /**< Number of objects successfully put. */
> 	uint64_t put_common_pool_bulk; /**< Number of bulks enqueued in common pool. */
> 	uint64_t put_common_pool_objs; /**< Number of objects enqueued in common pool. */
> 	uint64_t get_common_pool_bulk; /**< Number of bulks dequeued from common pool. */
> 	uint64_t get_common_pool_objs; /**< Number of objects dequeued from common pool. */
> 	uint64_t get_success_bulk;     /**< Successful allocation number. */
> 	uint64_t get_success_objs;     /**< Objects successfully allocated. */
> 	uint64_t get_fail_bulk;        /**< Failed allocation number. */
> 	uint64_t get_fail_objs;        /**< Objects that failed to be allocated. */
> 	uint64_t get_success_blks;     /**< Successful allocation number of contiguous blocks. */
> 	uint64_t get_fail_blks;        /**< Failed allocation number of contiguous blocks. */
> +	RTE_CACHE_GUARD(debug_stats);  /**< Isolation between lcores. */
> } __rte_cache_aligned;
> #endif
> 
> struct rte_mempool {
>   [...]
> #ifdef RTE_LIBRTE_MEMPOOL_STATS
> 	/** Per-lcore statistics.
> 	 *
> 	 * Plus one, for unregistered non-EAL threads.
> 	 */
> 	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
> #endif
> }  __rte_cache_aligned;
> 
> 
> It also seems relevant for the PRNG library:
> 
> /lib/eal/common/rte_random.c:
> 
> struct rte_rand_state {
> 	uint64_t z1;
> 	uint64_t z2;
> 	uint64_t z3;
> 	uint64_t z4;
> 	uint64_t z5;
> +	RTE_CACHE_GUARD(z);
> } __rte_cache_aligned;
> 

Yes.

Should there be two cache guard macros? One parameter-free 
RTE_CACHE_GUARD and a RTE_CACHE_NAMED_GUARD(name) macro?

Maybe it's better just to keep the single macro, but have a convention 
with some generic name (i.e., not 'z' above) for the guard field, like 
'cache_guard' or just 'guard'. Having unique names makes no sense, except 
in the rare cases where you need multiple guard lines per struct.

> /* One instance each for every lcore id-equipped thread, and one
>   * additional instance to be shared by all others threads (i.e., all
>   * unregistered non-EAL threads).
>   */
> static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC] cache guard
  2023-08-27 13:55         ` Mattias Rönnblom
@ 2023-08-27 15:40           ` Morten Brørup
  2023-08-27 22:30             ` Mattias Rönnblom
  2023-08-28  7:57             ` Bruce Richardson
  0 siblings, 2 replies; 19+ messages in thread
From: Morten Brørup @ 2023-08-27 15:40 UTC (permalink / raw)
  To: Mattias Rönnblom, Bruce Richardson
  Cc: dev, olivier.matz, andrew.rybchenko, honnappa.nagarahalli,
	konstantin.v.ananyev, mattias.ronnblom

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Sunday, 27 August 2023 15.55
> 
> On 2023-08-27 10:34, Morten Brørup wrote:
> > +CC Honnappa and Konstantin, Ring lib maintainers
> > +CC Mattias, PRNG lib maintainer
> >
> >> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >> Sent: Friday, 25 August 2023 11.24
> >>
> >> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> >>> +CC mempool maintainers
> >>>
> >>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>> Sent: Friday, 25 August 2023 10.23
> >>>>
> >>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> >>>>> Bruce,
> >>>>>
> >>>>> With this patch [1], it is noted that the ring producer and
> >> consumer data
> >>>> should not be on adjacent cache lines, for performance reasons.
> >>>>>
> >>>>> [1]:
> >>>>
> >>
> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
> >> fd4b66
> >>>> e75485cc8b63b9aedfbdfe8b0
> >>>>>
> >>>>> (It's obvious that they cannot share the same cache line, because
> >> they are
> >>>> accessed by two different threads.)
> >>>>>
> >>>>> Intuitively, I would think that having them on different cache
> >> lines would
> >>>> suffice. Why does having an empty cache line between them make a
> >> difference?
> >>>>>
> >>>>> And does it need to be an empty cache line? Or does it suffice
> >> having the
> >>>> second structure start at two cache lines after the start of the
> >> first
> >>>> structure (e.g. if the size of the first structure is two cache
> >> lines)?
> >>>>>
> >>>>> I'm asking because the same principle might apply to other code
> >> too.
> >>>>>
> >>>> Hi Morten,
> >>>>
> >>>> this was something we discovered when working on the distributor
> >> library.
> >>>> If we have cachelines per core where there is heavy access, having
> >> some
> >>>> cachelines as a gap between the content cachelines can help
> >> performance. We
> >>>> believe this helps due to avoiding issues with the HW prefetchers
> >> (e.g.
> >>>> adjacent cacheline prefetcher) bringing in the second cacheline
> >>>> speculatively when an operation is done on the first line.
> >>>
> >>> I guessed that it had something to do with speculative prefetching,
> >> but wasn't sure. Good to get confirmation, and that it has a
> measureable
> >> effect somewhere. Very interesting!
> >>>
> >>> NB: More comments in the ring lib about stuff like this would be
> nice.
> >>>
> >>> So, for the mempool lib, what do you think about applying the same
> >> technique to the rte_mempool_debug_stats structure (which is an array
> >> indexed per lcore)... Two adjacent lcores heavily accessing their
> local
> >> mempool caches seems likely to me. But how heavy does the access need
> to
> >> be for this technique to be relevant?
> >>>
> >>
> >> No idea how heavy the accesses need to be for this to have a
> noticable
> >> effect. For things like debug stats, I wonder how worthwhile making
> such
> >> a
> >> change would be, but then again, any change would have very low
> impact
> >> too
> >> in that case.
> >
> > I just tried adding padding to some of the hot structures in our own
> application, and observed a significant performance improvement for
> those.
> >
> > So I think this technique should have higher visibility in DPDK by
> adding a new cache macro to rte_common.h:
> >
> > /**
> >   * Empty cache line, to guard against speculative prefetching.
> >   *
> 
> "to guard against false sharing-like effects on systems with a
> next-N-lines hardware prefetcher"
> 
> >   * Use as spacing between data accessed by different lcores,
> >   * to prevent cache thrashing on CPUs with speculative prefetching.
> >   */
> > #define RTE_CACHE_GUARD(name) char
> cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
> >
> 
> You could have a macro which specified how much guarding there needs to
> be, ideally defined on a per-CPU basis. (These things has nothing to do
> with the ISA, but everything to do with the implementation.)
> 
> I'm not sure N is always 1.
> 
> So the guard padding should be RTE_CACHE_LINE_SIZE *
> RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
> #if RTE_CACHE_GUARD_LINES > 0
> #endif
> 
> ...so you can disable this (cute!) hack (on custom DPDK builds) in case
> you have disabled hardware prefetching, which seems generally to be a
> good idea for packet processing type applications.
> 
> ...which leads me to another suggestions: add a note on disabling
> hardware prefetching in the optimization guide.
> 
> Seems like a very good idea to have this in <rte_common.h>, and
> otherwise make this issue visible and known.

Good points, Mattias!

I also prefer the name-less macro you suggested below.

So, this gets added to rte_common.h:

/**
 * Empty cache lines, to guard against false sharing-like effects
 * on systems with a next-N-lines hardware prefetcher.
 *
 * Use as spacing between data accessed by different lcores,
 * to prevent cache thrashing on hardware with speculative prefetching.
 */
#if RTE_CACHE_GUARD_LINES > 0
#define _RTE_CACHE_GUARD_HELPER2(unique) \
        char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
        __rte_cache_aligned;
#define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
#else
#define RTE_CACHE_GUARD
#endif
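
To illustrate (hypothetical struct; assuming RTE_CACHE_LINE_SIZE is 64, RTE_CACHE_GUARD_LINES is 1, and no earlier __COUNTER__ uses in the translation unit), the helper indirection makes __COUNTER__ expand before the token pasting, so each use gets a unique field name:

struct foo {
	uint64_t a;
	RTE_CACHE_GUARD;	/* char cache_guard_0[64 * 1] __rte_cache_aligned; */
	uint64_t b;
	RTE_CACHE_GUARD;	/* char cache_guard_1[64 * 1] __rte_cache_aligned; */
};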

And a line in /config/x86/meson.build for x86 architecture:

  dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
+ dpdk_conf.set('RTE_CACHE_GUARD_LINES', 1)

I don't know about various architectures and implementations, so we should probably use a default of 1, matching the existing guard size in the ring lib.

@Bruce, I hope you can help with the configuration part of this.

> 
> > To be used like this:
> >
> > struct rte_ring {
> > 	char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
> > 	/**< Name of the ring. */
> > 	int flags;               /**< Flags supplied at creation. */
> > 	const struct rte_memzone *memzone;
> > 			/**< Memzone, if any, containing the rte_ring */
> > 	uint32_t size;           /**< Size of ring. */
> > 	uint32_t mask;           /**< Mask (size-1) of ring. */
> > 	uint32_t capacity;       /**< Usable size of ring */
> >
> > -	char pad0 __rte_cache_aligned; /**< empty cache line */
> > +	RTE_CACHE_GUARD(prod);  /**< Isolate producer status. */

+	RTE_CACHE_GUARD;

Note: Commenting anonymous cache guard fields seems somewhat superfluous, and would largely be a copy of the general RTE_CACHE_GUARD description anyway, so I have removed the comments. 

> >
> > 	/** Ring producer status. */
> > 	union {
> > 		struct rte_ring_headtail prod;
> > 		struct rte_ring_hts_headtail hts_prod;
> > 		struct rte_ring_rts_headtail rts_prod;
> > 	}  __rte_cache_aligned;
> >
> > -	char pad1 __rte_cache_aligned; /**< empty cache line */
> > +	RTE_CACHE_GUARD(both);  /**< Isolate producer from consumer. */

+	RTE_CACHE_GUARD;

> >
> > 	/** Ring consumer status. */
> > 	union {
> > 		struct rte_ring_headtail cons;
> > 		struct rte_ring_hts_headtail hts_cons;
> > 		struct rte_ring_rts_headtail rts_cons;
> > 	}  __rte_cache_aligned;
> >
> > -	char pad2 __rte_cache_aligned; /**< empty cache line */
> > +	RTE_CACHE_GUARD(cons);  /**< Isolate consumer status. */

+	RTE_CACHE_GUARD;

> > };
> >
> >
> > And for the mempool library:
> >
> > #ifdef RTE_LIBRTE_MEMPOOL_STATS
> > /**
> >   * A structure that stores the mempool statistics (per-lcore).
> >   * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs) are
> not
> >   * captured since they can be calculated from other stats.
> >   * For example: put_cache_objs = put_objs - put_common_pool_objs.
> >   */
> > struct rte_mempool_debug_stats {
> > 	uint64_t put_bulk;             /**< Number of puts. */
> > 	uint64_t put_objs;             /**< Number of objects successfully
> put. */
> > 	uint64_t put_common_pool_bulk; /**< Number of bulks enqueued in
> common pool. */
> > 	uint64_t put_common_pool_objs; /**< Number of objects enqueued in
> common pool. */
> > 	uint64_t get_common_pool_bulk; /**< Number of bulks dequeued from
> common pool. */
> > 	uint64_t get_common_pool_objs; /**< Number of objects dequeued
> from common pool. */
> > 	uint64_t get_success_bulk;     /**< Successful allocation number.
> */
> > 	uint64_t get_success_objs;     /**< Objects successfully
> allocated. */
> > 	uint64_t get_fail_bulk;        /**< Failed allocation number. */
> > 	uint64_t get_fail_objs;        /**< Objects that failed to be
> allocated. */
> > 	uint64_t get_success_blks;     /**< Successful allocation number
> of contiguous blocks. */
> > 	uint64_t get_fail_blks;        /**< Failed allocation number of
> contiguous blocks. */
> > +	RTE_CACHE_GUARD(debug_stats);  /**< Isolation between lcores. */

+	RTE_CACHE_GUARD;

> > } __rte_cache_aligned;
> > #endif
> >
> > struct rte_mempool {
> >   [...]
> > #ifdef RTE_LIBRTE_MEMPOOL_STATS
> > 	/** Per-lcore statistics.
> > 	 *
> > 	 * Plus one, for unregistered non-EAL threads.
> > 	 */
> > 	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
> > #endif
> > }  __rte_cache_aligned;
> >
> >
> > It also seems relevant for the PRNG library:
> >
> > /lib/eal/common/rte_random.c:
> >
> > struct rte_rand_state {
> > 	uint64_t z1;
> > 	uint64_t z2;
> > 	uint64_t z3;
> > 	uint64_t z4;
> > 	uint64_t z5;
> > +	RTE_CACHE_GUARD(z);

+	RTE_CACHE_GUARD;

> > } __rte_cache_aligned;
> >
> 
> Yes.
> 
> Should there be two cache guard macros? One parameter-free
> RTE_CACHE_GUARD and a RTE_CACHE_NAMED_GUARD(name) macro?
> 
> Maybe it's better just to keep the single macro, but have a convention
> with some generic name (i.e., not 'z' above) for the guard field, like
> 'cache_guard' or just 'guard'. Having unique name makes no sense, except
> in rare cases where you need multiple guard lines per struct.

There is no need to access these fields, so let's use auto-generated unique names, and not offer an API with a name. This also makes the API as simple as possible.

> 
> > /* One instance each for every lcore id-equipped thread, and one
> >   * additional instance to be shared by all others threads (i.e., all
> >   * unregistered non-EAL threads).
> >   */
> > static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] cache guard
  2023-08-27 15:40           ` Morten Brørup
@ 2023-08-27 22:30             ` Mattias Rönnblom
  2023-08-28  6:32               ` Morten Brørup
  2023-08-28  7:57             ` Bruce Richardson
  1 sibling, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2023-08-27 22:30 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: dev, olivier.matz, andrew.rybchenko, honnappa.nagarahalli,
	konstantin.v.ananyev, mattias.ronnblom

On 2023-08-27 17:40, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Sunday, 27 August 2023 15.55
>>
>> On 2023-08-27 10:34, Morten Brørup wrote:
>>> +CC Honnappa and Konstantin, Ring lib maintainers
>>> +CC Mattias, PRNG lib maintainer
>>>
>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>> Sent: Friday, 25 August 2023 11.24
>>>>
>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>>> +CC mempool maintainers
>>>>>
>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>>
>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>>> Bruce,
>>>>>>>
>>>>>>> With this patch [1], it is noted that the ring producer and
>>>> consumer data
>>>>>> should not be on adjacent cache lines, for performance reasons.
>>>>>>>
>>>>>>> [1]:
>>>>>>
>>>>
>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
>>>> fd4b66
>>>>>> e75485cc8b63b9aedfbdfe8b0
>>>>>>>
>>>>>>> (It's obvious that they cannot share the same cache line, because
>>>> they are
>>>>>> accessed by two different threads.)
>>>>>>>
>>>>>>> Intuitively, I would think that having them on different cache
>>>> lines would
>>>>>> suffice. Why does having an empty cache line between them make a
>>>> difference?
>>>>>>>
>>>>>>> And does it need to be an empty cache line? Or does it suffice
>>>> having the
>>>>>> second structure start at two cache lines after the start of the
>>>> first
>>>>>> structure (e.g. if the size of the first structure is two cache
>>>> lines)?
>>>>>>>
>>>>>>> I'm asking because the same principle might apply to other code
>>>> too.
>>>>>>>
>>>>>> Hi Morten,
>>>>>>
>>>>>> this was something we discovered when working on the distributor
>>>> library.
>>>>>> If we have cachelines per core where there is heavy access, having
>>>> some
>>>>>> cachelines as a gap between the content cachelines can help
>>>> performance. We
>>>>>> believe this helps due to avoiding issues with the HW prefetchers
>>>> (e.g.
>>>>>> adjacent cacheline prefetcher) bringing in the second cacheline
>>>>>> speculatively when an operation is done on the first line.
>>>>>
>>>>> I guessed that it had something to do with speculative prefetching,
>>>> but wasn't sure. Good to get confirmation, and that it has a
>> measureable
>>>> effect somewhere. Very interesting!
>>>>>
>>>>> NB: More comments in the ring lib about stuff like this would be
>> nice.
>>>>>
>>>>> So, for the mempool lib, what do you think about applying the same
>>>> technique to the rte_mempool_debug_stats structure (which is an array
>>>> indexed per lcore)... Two adjacent lcores heavily accessing their
>> local
>>>> mempool caches seems likely to me. But how heavy does the access need
>> to
>>>> be for this technique to be relevant?
>>>>>
>>>>
>>>> No idea how heavy the accesses need to be for this to have a
>> noticable
>>>> effect. For things like debug stats, I wonder how worthwhile making
>> such
>>>> a
>>>> change would be, but then again, any change would have very low
>> impact
>>>> too
>>>> in that case.
>>>
>>> I just tried adding padding to some of the hot structures in our own
>> application, and observed a significant performance improvement for
>> those.
>>>
>>> So I think this technique should have higher visibility in DPDK by
>> adding a new cache macro to rte_common.h:
>>>
>>> /**
>>>    * Empty cache line, to guard against speculative prefetching.
>>>    *
>>
>> "to guard against false sharing-like effects on systems with a
>> next-N-lines hardware prefetcher"
>>
>>>    * Use as spacing between data accessed by different lcores,
>>>    * to prevent cache thrashing on CPUs with speculative prefetching.
>>>    */
>>> #define RTE_CACHE_GUARD(name) char
>> cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
>>>
>>
>> You could have a macro which specified how much guarding there needs to
>> be, ideally defined on a per-CPU basis. (These things has nothing to do
>> with the ISA, but everything to do with the implementation.)
>>
>> I'm not sure N is always 1.
>>
>> So the guard padding should be RTE_CACHE_LINE_SIZE *
>> RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
>> #if RTE_CACHE_GUARD_LINES > 0
>> #endif
>>
>> ...so you can disable this (cute!) hack (on custom DPDK builds) in case
>> you have disabled hardware prefetching, which seems generally to be a
>> good idea for packet processing type applications.
>>
>> ...which leads me to another suggestions: add a note on disabling
>> hardware prefetching in the optimization guide.
>>
>> Seems like a very good idea to have this in <rte_common.h>, and
>> otherwise make this issue visible and known.
> 
> Good points, Mattias!
> 
> I also prefer the name-less macro you suggested below.
> 
> So, this gets added to rte_common.h:
> 
> /**
>   * Empty cache lines, to guard against false sharing-like effects
>   * on systems with a next-N-lines hardware prefetcher.
>   *
>   * Use as spacing between data accessed by different lcores,
>   * to prevent cache thrashing on hardware with speculative prefetching.
>   */
> #if RTE_CACHE_GUARD_LINES > 0
> #define _RTE_CACHE_GUARD_HELPER2(unique) \
>          char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
>          __rte_cache_aligned;
> #define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
> #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
> #else
> #define RTE_CACHE_GUARD
> #endif
> 

Seems like a good solution. I got as far as thinking of using __LINE__ to 
build a unique name, but __COUNTER__ is much cleaner, provided it's 
available in the relevant compilers. (It's not in C11.)

Should the semicolon be included or not in HELPER2? If left out, a 
lonely ";" will be left for RTE_CACHE_GUARD_LINES == 0, but I don't 
think that is a problem.

I don't see why __rte_cache_aligned is needed here. The adjacent struct 
must be cache-line aligned anyway. Maybe it makes things more readable, 
having the explicit guard padding start at the beginning of the actual 
guard cache lines, rather than potentially at some earlier point, and 
having the non-guard padding at the end of the struct (from 
__rte_cache_aligned at the struct level).

> And a line in /config/x86/meson.build for x86 architecture:
> 
>    dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> + dpdk_conf.set('RTE_CACHE_GUARD_LINES', 1)
> 
> I don't know about various architectures and implementations, so we should probably use a default of 1, matching the existing guard size in the ring lib.
> 
> @Bruce, I hope you can help with the configuration part of this.
> 
>>
>>> To be used like this:
>>>
>>> struct rte_ring {
>>> 	char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
>>> 	/**< Name of the ring. */
>>> 	int flags;               /**< Flags supplied at creation. */
>>> 	const struct rte_memzone *memzone;
>>> 			/**< Memzone, if any, containing the rte_ring */
>>> 	uint32_t size;           /**< Size of ring. */
>>> 	uint32_t mask;           /**< Mask (size-1) of ring. */
>>> 	uint32_t capacity;       /**< Usable size of ring */
>>>
>>> -	char pad0 __rte_cache_aligned; /**< empty cache line */
>>> +	RTE_CACHE_GUARD(prod);  /**< Isolate producer status. */
> 
> +	RTE_CACHE_GUARD;
> 
> Note: Commenting anonymous cache guard fields seems somewhat superfluous, and would largely be a copy of the general RTE_CACHE_GUARD description anyway, so I have removed the comments.
> 
>>>
>>> 	/** Ring producer status. */
>>> 	union {
>>> 		struct rte_ring_headtail prod;
>>> 		struct rte_ring_hts_headtail hts_prod;
>>> 		struct rte_ring_rts_headtail rts_prod;
>>> 	}  __rte_cache_aligned;
>>>
>>> -	char pad1 __rte_cache_aligned; /**< empty cache line */
>>> +	RTE_CACHE_GUARD(both);  /**< Isolate producer from consumer. */
> 
> +	RTE_CACHE_GUARD;
> 
>>>
>>> 	/** Ring consumer status. */
>>> 	union {
>>> 		struct rte_ring_headtail cons;
>>> 		struct rte_ring_hts_headtail hts_cons;
>>> 		struct rte_ring_rts_headtail rts_cons;
>>> 	}  __rte_cache_aligned;
>>>
>>> -	char pad2 __rte_cache_aligned; /**< empty cache line */
>>> +	RTE_CACHE_GUARD(cons);  /**< Isolate consumer status. */
> 
> +	RTE_CACHE_GUARD;
> 
>>> };
>>>
>>>
>>> And for the mempool library:
>>>
>>> #ifdef RTE_LIBRTE_MEMPOOL_STATS
>>> /**
>>>    * A structure that stores the mempool statistics (per-lcore).
>>>    * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs) are
>> not
>>>    * captured since they can be calculated from other stats.
>>>    * For example: put_cache_objs = put_objs - put_common_pool_objs.
>>>    */
>>> struct rte_mempool_debug_stats {
>>> 	uint64_t put_bulk;             /**< Number of puts. */
>>> 	uint64_t put_objs;             /**< Number of objects successfully
>> put. */
>>> 	uint64_t put_common_pool_bulk; /**< Number of bulks enqueued in
>> common pool. */
>>> 	uint64_t put_common_pool_objs; /**< Number of objects enqueued in
>> common pool. */
>>> 	uint64_t get_common_pool_bulk; /**< Number of bulks dequeued from
>> common pool. */
>>> 	uint64_t get_common_pool_objs; /**< Number of objects dequeued
>> from common pool. */
>>> 	uint64_t get_success_bulk;     /**< Successful allocation number.
>> */
>>> 	uint64_t get_success_objs;     /**< Objects successfully
>> allocated. */
>>> 	uint64_t get_fail_bulk;        /**< Failed allocation number. */
>>> 	uint64_t get_fail_objs;        /**< Objects that failed to be
>> allocated. */
>>> 	uint64_t get_success_blks;     /**< Successful allocation number
>> of contiguous blocks. */
>>> 	uint64_t get_fail_blks;        /**< Failed allocation number of
>> contiguous blocks. */
>>> +	RTE_CACHE_GUARD(debug_stats);  /**< Isolation between lcores. */
> 
> +	RTE_CACHE_GUARD;
> 
>>> } __rte_cache_aligned;
>>> #endif
>>>
>>> struct rte_mempool {
>>>    [...]
>>> #ifdef RTE_LIBRTE_MEMPOOL_STATS
>>> 	/** Per-lcore statistics.
>>> 	 *
>>> 	 * Plus one, for unregistered non-EAL threads.
>>> 	 */
>>> 	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
>>> #endif
>>> }  __rte_cache_aligned;
>>>
>>>
>>> It also seems relevant for the PRNG library:
>>>
>>> /lib/eal/common/rte_random.c:
>>>
>>> struct rte_rand_state {
>>> 	uint64_t z1;
>>> 	uint64_t z2;
>>> 	uint64_t z3;
>>> 	uint64_t z4;
>>> 	uint64_t z5;
>>> +	RTE_CACHE_GUARD(z);
> 
> +	RTE_CACHE_GUARD;
> 
>>> } __rte_cache_aligned;
>>>
>>
>> Yes.
>>
>> Should there be two cache guard macros? One parameter-free
>> RTE_CACHE_GUARD and a RTE_CACHE_NAMED_GUARD(name) macro?
>>
>> Maybe it's better just to keep the single macro, but have a convention
>> with some generic name (i.e., not 'z' above) for the guard field, like
>> 'cache_guard' or just 'guard'. Having unique name makes no sense, except
>> in rare cases where you need multiple guard lines per struct.
> 
> There is no need to access these fields, so let's use auto-generated unique names, and not offer an API with a name. This also makes the API as simple as possible.
> 
>>
>>> /* One instance each for every lcore id-equipped thread, and one
>>>    * additional instance to be shared by all others threads (i.e., all
>>>    * unregistered non-EAL threads).
>>>    */
>>> static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
>>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC] cache guard
  2023-08-27 22:30             ` Mattias Rönnblom
@ 2023-08-28  6:32               ` Morten Brørup
  2023-08-28  8:46                 ` Mattias Rönnblom
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-08-28  6:32 UTC (permalink / raw)
  To: Mattias Rönnblom, Bruce Richardson
  Cc: dev, olivier.matz, andrew.rybchenko, honnappa.nagarahalli,
	konstantin.v.ananyev, mattias.ronnblom

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 28 August 2023 00.31
> 
> On 2023-08-27 17:40, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Sunday, 27 August 2023 15.55
> >>
> >> On 2023-08-27 10:34, Morten Brørup wrote:
> >>> +CC Honnappa and Konstantin, Ring lib maintainers
> >>> +CC Mattias, PRNG lib maintainer
> >>>
> >>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>> Sent: Friday, 25 August 2023 11.24
> >>>>
> >>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> >>>>> +CC mempool maintainers
> >>>>>
> >>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>>> Sent: Friday, 25 August 2023 10.23
> >>>>>>
> >>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> >>>>>>> Bruce,
> >>>>>>>
> >>>>>>> With this patch [1], it is noted that the ring producer and
> >>>> consumer data
> >>>>>> should not be on adjacent cache lines, for performance reasons.
> >>>>>>>
> >>>>>>> [1]:
> >>>>>>
> >>>>
> >> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
> >>>> fd4b66
> >>>>>> e75485cc8b63b9aedfbdfe8b0
> >>>>>>>
> >>>>>>> (It's obvious that they cannot share the same cache line, because
> >>>> they are
> >>>>>> accessed by two different threads.)
> >>>>>>>
> >>>>>>> Intuitively, I would think that having them on different cache
> >>>> lines would
> >>>>>> suffice. Why does having an empty cache line between them make a
> >>>> difference?
> >>>>>>>
> >>>>>>> And does it need to be an empty cache line? Or does it suffice
> >>>> having the
> >>>>>> second structure start at two cache lines after the start of the
> >>>> first
> >>>>>> structure (e.g. if the size of the first structure is two cache
> >>>> lines)?
> >>>>>>>
> >>>>>>> I'm asking because the same principle might apply to other code
> >>>> too.
> >>>>>>>
> >>>>>> Hi Morten,
> >>>>>>
> >>>>>> this was something we discovered when working on the distributor
> >>>> library.
> >>>>>> If we have cachelines per core where there is heavy access, having
> >>>> some
> >>>>>> cachelines as a gap between the content cachelines can help
> >>>> performance. We
> >>>>>> believe this helps due to avoiding issues with the HW prefetchers
> >>>> (e.g.
> >>>>>> adjacent cacheline prefetcher) bringing in the second cacheline
> >>>>>> speculatively when an operation is done on the first line.
> >>>>>
> >>>>> I guessed that it had something to do with speculative prefetching,
> >>>> but wasn't sure. Good to get confirmation, and that it has a
> >> measureable
> >>>> effect somewhere. Very interesting!
> >>>>>
> >>>>> NB: More comments in the ring lib about stuff like this would be
> >> nice.
> >>>>>
> >>>>> So, for the mempool lib, what do you think about applying the same
> >>>> technique to the rte_mempool_debug_stats structure (which is an array
> >>>> indexed per lcore)... Two adjacent lcores heavily accessing their
> >> local
> >>>> mempool caches seems likely to me. But how heavy does the access need
> >> to
> >>>> be for this technique to be relevant?
> >>>>>
> >>>>
> >>>> No idea how heavy the accesses need to be for this to have a
> >> noticable
> >>>> effect. For things like debug stats, I wonder how worthwhile making
> >> such
> >>>> a
> >>>> change would be, but then again, any change would have very low
> >> impact
> >>>> too
> >>>> in that case.
> >>>
> >>> I just tried adding padding to some of the hot structures in our own
> >> application, and observed a significant performance improvement for
> >> those.
> >>>
> >>> So I think this technique should have higher visibility in DPDK by
> >> adding a new cache macro to rte_common.h:
> >>>
> >>> /**
> >>>    * Empty cache line, to guard against speculative prefetching.
> >>>    *
> >>
> >> "to guard against false sharing-like effects on systems with a
> >> next-N-lines hardware prefetcher"
> >>
> >>>    * Use as spacing between data accessed by different lcores,
> >>>    * to prevent cache thrashing on CPUs with speculative prefetching.
> >>>    */
> >>> #define RTE_CACHE_GUARD(name) char
> >> cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
> >>>
> >>
> >> You could have a macro which specified how much guarding there needs to
> >> be, ideally defined on a per-CPU basis. (These things has nothing to do
> >> with the ISA, but everything to do with the implementation.)
> >>
> >> I'm not sure N is always 1.
> >>
> >> So the guard padding should be RTE_CACHE_LINE_SIZE *
> >> RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
> >> #if RTE_CACHE_GUARD_LINES > 0
> >> #endif
> >>
> >> ...so you can disable this (cute!) hack (on custom DPDK builds) in case
> >> you have disabled hardware prefetching, which seems generally to be a
> >> good idea for packet processing type applications.
> >>
> >> ...which leads me to another suggestions: add a note on disabling
> >> hardware prefetching in the optimization guide.
> >>
> >> Seems like a very good idea to have this in <rte_common.h>, and
> >> otherwise make this issue visible and known.
> >
> > Good points, Mattias!
> >
> > I also prefer the name-less macro you suggested below.
> >
> > So, this gets added to rte_common.h:
> >
> > /**
> >   * Empty cache lines, to guard against false sharing-like effects
> >   * on systems with a next-N-lines hardware prefetcher.
> >   *
> >   * Use as spacing between data accessed by different lcores,
> >   * to prevent cache thrashing on hardware with speculative prefetching.
> >   */
> > #if RTE_CACHE_GUARD_LINES > 0
> > #define _RTE_CACHE_GUARD_HELPER2(unique) \
> >          char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE *
> RTE_CACHE_GUARD_LINES] \
> >          __rte_cache_aligned;
> > #define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
> > #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
> > #else
> > #define RTE_CACHE_GUARD
> > #endif
> >
> 
> Seems like a good solution. I thought as far as using __LINE__ to build
> a unique name, but __COUNTER__ is much cleaner, provided it's available
> in relevant compilers. (It's not in C11.)

I considered __LINE__ too, but came to the same conclusion... __COUNTER__ is cleaner for this purpose.

And since __COUNTER__ is being used elsewhere in DPDK, I assume it is available for use here too.

If it turns out to cause problems, we can easily switch to __LINE__ instead.

> 
> Should the semicolon be included or not in HELPER2? If left out, a
> lonely ";" will be left for RTE_CACHE_GUARD_LINES == 0, but I don't
> think that is a problem.

I tested it on Godbolt, and the lonely ";" in a struct didn't seem to be a problem.

With the semicolon in HELPER2, there will be a lonely ";" in the struct in both cases, i.e. with and without cache guards enabled.

> 
> I don't see why __rte_cache_aligned is needed here. The adjacent struct
> must be cache-line aligned. Maybe it makes it more readable, having the
> explicit guard padding starting at the start of the actual guard cache
> lines, rather than potentially at some earlier point before, and having
> non-guard padding at the end of the struct (from __rte_cache_aligned on
> the struct level).

Having both __rte_cache_aligned and the char array with full cache lines ensures that the guard field itself is on its own separate cache line, regardless of the organization of adjacent fields in the struct. E.g. this will also work:

struct test {
    char x;
    RTE_CACHE_GUARD;
    char y;
};
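
And a quick way to convince oneself (standalone sketch, not DPDK code, mirroring the proposed macro with a 64-byte cache line and one guard line):

#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64
#define CACHE_GUARD \
	char cache_guard[CACHE_LINE_SIZE] __attribute__((aligned(CACHE_LINE_SIZE)))

struct test {
	char x;
	CACHE_GUARD;
	char y;
};

/* x occupies the first cache line and the guard the second, so y starts
 * two cache lines after x, although neither x nor y is explicitly aligned. */
static_assert(offsetof(struct test, y) == 2 * CACHE_LINE_SIZE, "y not isolated from x");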

> 
> > And a line in /config/x86/meson.build for x86 architecture:
> >
> >    dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> > + dpdk_conf.set('RTE_CACHE_GUARD_LINES', 1)
> >
> > I don't know about various architectures and implementations, so we should
> probably use a default of 1, matching the existing guard size in the ring lib.
> >
> > @Bruce, I hope you can help with the configuration part of this.
> >
> >>
> >>> To be used like this:
> >>>
> >>> struct rte_ring {
> >>> 	char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
> >>> 	/**< Name of the ring. */
> >>> 	int flags;               /**< Flags supplied at creation. */
> >>> 	const struct rte_memzone *memzone;
> >>> 			/**< Memzone, if any, containing the rte_ring */
> >>> 	uint32_t size;           /**< Size of ring. */
> >>> 	uint32_t mask;           /**< Mask (size-1) of ring. */
> >>> 	uint32_t capacity;       /**< Usable size of ring */
> >>>
> >>> -	char pad0 __rte_cache_aligned; /**< empty cache line */
> >>> +	RTE_CACHE_GUARD(prod);  /**< Isolate producer status. */
> >
> > +	RTE_CACHE_GUARD;
> >
> > Note: Commenting anonymous cache guard fields seems somewhat superfluous,
> and would largely be a copy of the general RTE_CACHE_GUARD description anyway,
> so I have removed the comments.
> >
> >>>
> >>> 	/** Ring producer status. */
> >>> 	union {
> >>> 		struct rte_ring_headtail prod;
> >>> 		struct rte_ring_hts_headtail hts_prod;
> >>> 		struct rte_ring_rts_headtail rts_prod;
> >>> 	}  __rte_cache_aligned;
> >>>
> >>> -	char pad1 __rte_cache_aligned; /**< empty cache line */
> >>> +	RTE_CACHE_GUARD(both);  /**< Isolate producer from consumer. */
> >
> > +	RTE_CACHE_GUARD;
> >
> >>>
> >>> 	/** Ring consumer status. */
> >>> 	union {
> >>> 		struct rte_ring_headtail cons;
> >>> 		struct rte_ring_hts_headtail hts_cons;
> >>> 		struct rte_ring_rts_headtail rts_cons;
> >>> 	}  __rte_cache_aligned;
> >>>
> >>> -	char pad2 __rte_cache_aligned; /**< empty cache line */
> >>> +	RTE_CACHE_GUARD(cons);  /**< Isolate consumer status. */
> >
> > +	RTE_CACHE_GUARD;
> >
> >>> };
> >>>
> >>>
> >>> And for the mempool library:
> >>>
> >>> #ifdef RTE_LIBRTE_MEMPOOL_STATS
> >>> /**
> >>>    * A structure that stores the mempool statistics (per-lcore).
> >>>    * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs) are
> >> not
> >>>    * captured since they can be calculated from other stats.
> >>>    * For example: put_cache_objs = put_objs - put_common_pool_objs.
> >>>    */
> >>> struct rte_mempool_debug_stats {
> >>> 	uint64_t put_bulk;             /**< Number of puts. */
> >>> 	uint64_t put_objs;             /**< Number of objects successfully
> >> put. */
> >>> 	uint64_t put_common_pool_bulk; /**< Number of bulks enqueued in
> >> common pool. */
> >>> 	uint64_t put_common_pool_objs; /**< Number of objects enqueued in
> >> common pool. */
> >>> 	uint64_t get_common_pool_bulk; /**< Number of bulks dequeued from
> >> common pool. */
> >>> 	uint64_t get_common_pool_objs; /**< Number of objects dequeued
> >> from common pool. */
> >>> 	uint64_t get_success_bulk;     /**< Successful allocation number.
> >> */
> >>> 	uint64_t get_success_objs;     /**< Objects successfully
> >> allocated. */
> >>> 	uint64_t get_fail_bulk;        /**< Failed allocation number. */
> >>> 	uint64_t get_fail_objs;        /**< Objects that failed to be
> >> allocated. */
> >>> 	uint64_t get_success_blks;     /**< Successful allocation number
> >> of contiguous blocks. */
> >>> 	uint64_t get_fail_blks;        /**< Failed allocation number of
> >> contiguous blocks. */
> >>> +	RTE_CACHE_GUARD(debug_stats);  /**< Isolation between lcores. */
> >
> > +	RTE_CACHE_GUARD;
> >
> >>> } __rte_cache_aligned;
> >>> #endif
> >>>
> >>> struct rte_mempool {
> >>>    [...]
> >>> #ifdef RTE_LIBRTE_MEMPOOL_STATS
> >>> 	/** Per-lcore statistics.
> >>> 	 *
> >>> 	 * Plus one, for unregistered non-EAL threads.
> >>> 	 */
> >>> 	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
> >>> #endif
> >>> }  __rte_cache_aligned;
> >>>
> >>>
> >>> It also seems relevant for the PRNG library:
> >>>
> >>> /lib/eal/common/rte_random.c:
> >>>
> >>> struct rte_rand_state {
> >>> 	uint64_t z1;
> >>> 	uint64_t z2;
> >>> 	uint64_t z3;
> >>> 	uint64_t z4;
> >>> 	uint64_t z5;
> >>> +	RTE_CACHE_GUARD(z);
> >
> > +	RTE_CACHE_GUARD;
> >
> >>> } __rte_cache_aligned;
> >>>
> >>
> >> Yes.
> >>
> >> Should there be two cache guard macros? One parameter-free
> >> RTE_CACHE_GUARD and a RTE_CACHE_NAMED_GUARD(name) macro?
> >>
> >> Maybe it's better just to keep the single macro, but have a convention
> >> with some generic name (i.e., not 'z' above) for the guard field, like
> >> 'cache_guard' or just 'guard'. Having unique name makes no sense, except
> >> in rare cases where you need multiple guard lines per struct.
> >
> > There is no need to access these fields, so let's use auto-generated unique
> names, and not offer an API with a name. This also makes the API as simple as
> possible.
> >
> >>
> >>> /* One instance each for every lcore id-equipped thread, and one
> >>>    * additional instance to be shared by all others threads (i.e., all
> >>>    * unregistered non-EAL threads).
> >>>    */
> >>> static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
> >>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] cache guard
  2023-08-27 15:40           ` Morten Brørup
  2023-08-27 22:30             ` Mattias Rönnblom
@ 2023-08-28  7:57             ` Bruce Richardson
  1 sibling, 0 replies; 19+ messages in thread
From: Bruce Richardson @ 2023-08-28  7:57 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, dev, olivier.matz, andrew.rybchenko,
	honnappa.nagarahalli, konstantin.v.ananyev, mattias.ronnblom

On Sun, Aug 27, 2023 at 05:40:33PM +0200, Morten Brørup wrote:
> > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > Sent: Sunday, 27 August 2023 15.55
> > 
> > On 2023-08-27 10:34, Morten Brørup wrote:
> > > +CC Honnappa and Konstantin, Ring lib maintainers
> > > +CC Mattias, PRNG lib maintainer
> > >
> > >> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > >> Sent: Friday, 25 August 2023 11.24
> > >>
> > >> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> > >>> +CC mempool maintainers
> > >>>
> > >>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > >>>> Sent: Friday, 25 August 2023 10.23
> > >>>>
> > >>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > >>>>> Bruce,
> > >>>>>
> > >>>>> With this patch [1], it is noted that the ring producer and
> > >> consumer data
> > >>>> should not be on adjacent cache lines, for performance reasons.
> > >>>>>
> > >>>>> [1]:
> > >>>>
> > >>
> > https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
> > >> fd4b66
> > >>>> e75485cc8b63b9aedfbdfe8b0
> > >>>>>
> > >>>>> (It's obvious that they cannot share the same cache line, because
> > >> they are
> > >>>> accessed by two different threads.)
> > >>>>>
> > >>>>> Intuitively, I would think that having them on different cache
> > >> lines would
> > >>>> suffice. Why does having an empty cache line between them make a
> > >> difference?
> > >>>>>
> > >>>>> And does it need to be an empty cache line? Or does it suffice
> > >> having the
> > >>>> second structure start at two cache lines after the start of the
> > >> first
> > >>>> structure (e.g. if the size of the first structure is two cache
> > >> lines)?
> > >>>>>
> > >>>>> I'm asking because the same principle might apply to other code
> > >> too.
> > >>>>>
> > >>>> Hi Morten,
> > >>>>
> > >>>> this was something we discovered when working on the distributor
> > >> library.
> > >>>> If we have cachelines per core where there is heavy access, having
> > >> some
> > >>>> cachelines as a gap between the content cachelines can help
> > >> performance. We
> > >>>> believe this helps due to avoiding issues with the HW prefetchers
> > >> (e.g.
> > >>>> adjacent cacheline prefetcher) bringing in the second cacheline
> > >>>> speculatively when an operation is done on the first line.
> > >>>
> > >>> I guessed that it had something to do with speculative prefetching,
> > >> but wasn't sure. Good to get confirmation, and that it has a
> > measureable
> > >> effect somewhere. Very interesting!
> > >>>
> > >>> NB: More comments in the ring lib about stuff like this would be
> > nice.
> > >>>
> > >>> So, for the mempool lib, what do you think about applying the same
> > >> technique to the rte_mempool_debug_stats structure (which is an array
> > >> indexed per lcore)... Two adjacent lcores heavily accessing their
> > local
> > >> mempool caches seems likely to me. But how heavy does the access need
> > to
> > >> be for this technique to be relevant?
> > >>>
> > >>
> > >> No idea how heavy the accesses need to be for this to have a
> > noticable
> > >> effect. For things like debug stats, I wonder how worthwhile making
> > such
> > >> a
> > >> change would be, but then again, any change would have very low
> > impact
> > >> too
> > >> in that case.
> > >
> > > I just tried adding padding to some of the hot structures in our own
> > application, and observed a significant performance improvement for
> > those.
> > >
> > > So I think this technique should have higher visibility in DPDK by
> > adding a new cache macro to rte_common.h:
> > >
> > > /**
> > >   * Empty cache line, to guard against speculative prefetching.
> > >   *
> > 
> > "to guard against false sharing-like effects on systems with a
> > next-N-lines hardware prefetcher"
> > 
> > >   * Use as spacing between data accessed by different lcores,
> > >   * to prevent cache thrashing on CPUs with speculative prefetching.
> > >   */
> > > #define RTE_CACHE_GUARD(name) char
> > cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
> > >
> > 
> > You could have a macro which specified how much guarding there needs to
> > be, ideally defined on a per-CPU basis. (These things has nothing to do
> > with the ISA, but everything to do with the implementation.)
> > 
> > I'm not sure N is always 1.
> > 
> > So the guard padding should be RTE_CACHE_LINE_SIZE *
> > RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
> > #if RTE_CACHE_GUARD_LINES > 0
> > #endif
> > 
> > ...so you can disable this (cute!) hack (on custom DPDK builds) in case
> > you have disabled hardware prefetching, which seems generally to be a
> > good idea for packet processing type applications.
> > 
> > ...which leads me to another suggestions: add a note on disabling
> > hardware prefetching in the optimization guide.
> > 
> > Seems like a very good idea to have this in <rte_common.h>, and
> > otherwise make this issue visible and known.
> 
> Good points, Mattias!
> 
> I also prefer the name-less macro you suggested below.
> 
> So, this gets added to rte_common.h:
> 
> /**
>  * Empty cache lines, to guard against false sharing-like effects
>  * on systems with a next-N-lines hardware prefetcher.
>  *
>  * Use as spacing between data accessed by different lcores,
>  * to prevent cache thrashing on hardware with speculative prefetching.
>  */
> #if RTE_CACHE_GUARD_LINES > 0
> #define _RTE_CACHE_GUARD_HELPER2(unique) \
>         char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
>         __rte_cache_aligned;
> #define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
> #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
> #else
> #define RTE_CACHE_GUARD
> #endif
> 
> And a line in /config/x86/meson.build for x86 architecture:
> 
>   dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> + dpdk_conf.set('RTE_CACHE_GUARD_LINES', 1)
> 
> I don't know about various architectures and implementations, so we should probably use a default of 1, matching the existing guard size in the ring lib.
> 
> @Bruce, I hope you can help with the configuration part of this.
> 
This all seems like a good idea. For the config, I'm not sure what is best
because I can't see many folks wanting to change the default very often.
I'd probably tend towards a value in the rte_config.h file, but putting a
per-architecture default in meson.build is probably ok too, if we see
different archs wanting different defaults. A third alternative is maybe
just to put the #define in rte_common.h alongside the macro definition.

I don't think we want an actual meson config option for this, as I see it
being too rarely used to make it worth expanding out that list.

/Bruce

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] cache guard
  2023-08-28  6:32               ` Morten Brørup
@ 2023-08-28  8:46                 ` Mattias Rönnblom
  2023-08-28  9:54                   ` Morten Brørup
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2023-08-28  8:46 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: dev, olivier.matz, andrew.rybchenko, honnappa.nagarahalli,
	konstantin.v.ananyev, mattias.ronnblom

On 2023-08-28 08:32, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 28 August 2023 00.31
>>
>> On 2023-08-27 17:40, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Sunday, 27 August 2023 15.55
>>>>
>>>> On 2023-08-27 10:34, Morten Brørup wrote:
>>>>> +CC Honnappa and Konstantin, Ring lib maintainers
>>>>> +CC Mattias, PRNG lib maintainer
>>>>>
>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>> Sent: Friday, 25 August 2023 11.24
>>>>>>
>>>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>>>>> +CC mempool maintainers
>>>>>>>
>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>>>>
>>>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>>>>> Bruce,
>>>>>>>>>
>>>>>>>>> With this patch [1], it is noted that the ring producer and
>>>>>> consumer data
>>>>>>>> should not be on adjacent cache lines, for performance reasons.
>>>>>>>>>
>>>>>>>>> [1]:
>>>>>>>>
>>>>>>
>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
>>>>>> fd4b66
>>>>>>>> e75485cc8b63b9aedfbdfe8b0
>>>>>>>>>
>>>>>>>>> (It's obvious that they cannot share the same cache line, because
>>>>>> they are
>>>>>>>> accessed by two different threads.)
>>>>>>>>>
>>>>>>>>> Intuitively, I would think that having them on different cache
>>>>>> lines would
>>>>>>>> suffice. Why does having an empty cache line between them make a
>>>>>> difference?
>>>>>>>>>
>>>>>>>>> And does it need to be an empty cache line? Or does it suffice
>>>>>> having the
>>>>>>>> second structure start at two cache lines after the start of the
>>>>>> first
>>>>>>>> structure (e.g. if the size of the first structure is two cache
>>>>>> lines)?
>>>>>>>>>
>>>>>>>>> I'm asking because the same principle might apply to other code
>>>>>> too.
>>>>>>>>>
>>>>>>>> Hi Morten,
>>>>>>>>
>>>>>>>> this was something we discovered when working on the distributor
>>>>>> library.
>>>>>>>> If we have cachelines per core where there is heavy access, having
>>>>>> some
>>>>>>>> cachelines as a gap between the content cachelines can help
>>>>>> performance. We
>>>>>>>> believe this helps due to avoiding issues with the HW prefetchers
>>>>>> (e.g.
>>>>>>>> adjacent cacheline prefetcher) bringing in the second cacheline
>>>>>>>> speculatively when an operation is done on the first line.
>>>>>>>
>>>>>>> I guessed that it had something to do with speculative prefetching,
>>>>>> but wasn't sure. Good to get confirmation, and that it has a
>>>> measureable
>>>>>> effect somewhere. Very interesting!
>>>>>>>
>>>>>>> NB: More comments in the ring lib about stuff like this would be
>>>> nice.
>>>>>>>
>>>>>>> So, for the mempool lib, what do you think about applying the same
>>>>>> technique to the rte_mempool_debug_stats structure (which is an array
>>>>>> indexed per lcore)... Two adjacent lcores heavily accessing their
>>>> local
>>>>>> mempool caches seems likely to me. But how heavy does the access need
>>>> to
>>>>>> be for this technique to be relevant?
>>>>>>>
>>>>>>
>>>>>> No idea how heavy the accesses need to be for this to have a
>>>> noticable
>>>>>> effect. For things like debug stats, I wonder how worthwhile making
>>>> such
>>>>>> a
>>>>>> change would be, but then again, any change would have very low
>>>> impact
>>>>>> too
>>>>>> in that case.
>>>>>
>>>>> I just tried adding padding to some of the hot structures in our own
>>>> application, and observed a significant performance improvement for
>>>> those.
>>>>>
>>>>> So I think this technique should have higher visibility in DPDK by
>>>> adding a new cache macro to rte_common.h:
>>>>>
>>>>> /**
>>>>>     * Empty cache line, to guard against speculative prefetching.
>>>>>     *
>>>>
>>>> "to guard against false sharing-like effects on systems with a
>>>> next-N-lines hardware prefetcher"
>>>>
>>>>>     * Use as spacing between data accessed by different lcores,
>>>>>     * to prevent cache thrashing on CPUs with speculative prefetching.
>>>>>     */
>>>>> #define RTE_CACHE_GUARD(name) char
>>>> cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
>>>>>
>>>>
>>>> You could have a macro which specified how much guarding there needs to
>>>> be, ideally defined on a per-CPU basis. (These things has nothing to do
>>>> with the ISA, but everything to do with the implementation.)
>>>>
>>>> I'm not sure N is always 1.
>>>>
>>>> So the guard padding should be RTE_CACHE_LINE_SIZE *
>>>> RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
>>>> #if RTE_CACHE_GUARD_LINES > 0
>>>> #endif
>>>>
>>>> ...so you can disable this (cute!) hack (on custom DPDK builds) in case
>>>> you have disabled hardware prefetching, which seems generally to be a
>>>> good idea for packet processing type applications.
>>>>
>>>> ...which leads me to another suggestions: add a note on disabling
>>>> hardware prefetching in the optimization guide.
>>>>
>>>> Seems like a very good idea to have this in <rte_common.h>, and
>>>> otherwise make this issue visible and known.
>>>
>>> Good points, Mattias!
>>>
>>> I also prefer the name-less macro you suggested below.
>>>
>>> So, this gets added to rte_common.h:
>>>
>>> /**
>>>    * Empty cache lines, to guard against false sharing-like effects
>>>    * on systems with a next-N-lines hardware prefetcher.
>>>    *
>>>    * Use as spacing between data accessed by different lcores,
>>>    * to prevent cache thrashing on hardware with speculative prefetching.
>>>    */
>>> #if RTE_CACHE_GUARD_LINES > 0
>>> #define _RTE_CACHE_GUARD_HELPER2(unique) \
>>>           char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE *
>> RTE_CACHE_GUARD_LINES] \
>>>           __rte_cache_aligned;
>>> #define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
>>> #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
>>> #else
>>> #define RTE_CACHE_GUARD
>>> #endif
>>>
>>
>> Seems like a good solution. I thought as far as using __LINE__ to build
>> a unique name, but __COUNTER__ is much cleaner, provided it's available
>> in relevant compilers. (It's not in C11.)
> 
> I considered __LINE__ too, but came to the same conclusion... __COUNTER__ is cleaner for this purpose.
> 
> And since __COUNTER__ is being used elsewhere in DPDK, I assume it is available for use here too.
> 
> If it turns out causing problems, we can easily switch to __LINE__ instead.
> 
>>
>> Should the semicolon be included or not in HELPER2? If left out, a
>> lonely ";" will be left for RTE_CACHE_GUARD_LINES == 0, but I don't
>> think that is a problem.
> 
> I tested it on Godbolt, and the lonely ";" in a struct didn't seem to be a problem.
> 
> With the semicolon in HELPER2, there will be a lonely ";" in the struct in both cases, i.e. with and without cache guards enabled.
> 
>>
>> I don't see why __rte_cache_aligned is needed here. The adjacent struct
>> must be cache-line aligned. Maybe it makes it more readable, having the
>> explicit guard padding starting at the start of the actual guard cache
>> lines, rather than potentially at some earlier point before, and having
>> non-guard padding at the end of the struct (from __rte_cache_aligned on
>> the struct level).
> 
> Having both __rte_cache_aligned and the char array with full cache lines ensures that the guard field itself is on its own separate cache line, regardless of the organization of adjacent fields in the struct. E.g. this will also work:
> 
> struct test {
>      char x;
>      RTE_CACHE_GUARD;
>      char y;
> };
> 

That struct declaration is broken, since it will create false sharing 
between x and y, in case RTE_CACHE_GUARD_LINES is defined to 0.

Maybe the most intuitive function (semantics) of the RTE_CACHE_GUARD 
macro would be to have it deal exclusively with the issue resulting from 
next-N-line (and similar) hardware prefetching, and leave 
__rte_cache_aligned to deal with "classic" (same-cache line) false sharing.

Otherwise you would have to have something like

struct test
{
	char x;
	RTE_CACHE_GUARD(char, y);
};

...so that 'y' can be made __rte_cache_aligned by the macro.
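
Roughly, such a macro could look something like this (just a sketch to make the shape concrete, not a worked-out proposal):

	#define RTE_CACHE_GUARD(type, name)                                \
		char cache_guard_ ## name[RTE_CACHE_LINE_SIZE *            \
					  RTE_CACHE_GUARD_LINES];          \
		type name __rte_cache_aligned

With RTE_CACHE_GUARD_LINES == 0 the padding disappears, but 'y' still gets cache-line aligned, so the classic false sharing between 'x' and 'y' is still avoided.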

RTE_HW_PREFETCH_GUARD could be an alternative name, but I think I like 
RTE_CACHE_GUARD better.

>>
>>> And a line in /config/x86/meson.build for x86 architecture:
>>>
>>>     dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
>>> + dpdk_conf.set('RTE_CACHE_GUARD_LINES', 1)
>>>
>>> I don't know about various architectures and implementations, so we should
>> probably use a default of 1, matching the existing guard size in the ring lib.
>>>
>>> @Bruce, I hope you can help with the configuration part of this.
>>>
>>>>
>>>>> To be used like this:
>>>>>
>>>>> struct rte_ring {
>>>>> 	char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
>>>>> 	/**< Name of the ring. */
>>>>> 	int flags;               /**< Flags supplied at creation. */
>>>>> 	const struct rte_memzone *memzone;
>>>>> 			/**< Memzone, if any, containing the rte_ring */
>>>>> 	uint32_t size;           /**< Size of ring. */
>>>>> 	uint32_t mask;           /**< Mask (size-1) of ring. */
>>>>> 	uint32_t capacity;       /**< Usable size of ring */
>>>>>
>>>>> -	char pad0 __rte_cache_aligned; /**< empty cache line */
>>>>> +	RTE_CACHE_GUARD(prod);  /**< Isolate producer status. */
>>>
>>> +	RTE_CACHE_GUARD;
>>>
>>> Note: Commenting anonymous cache guard fields seems somewhat superfluous,
>> and would largely be a copy of the general RTE_CACHE_GUARD description anyway,
>> so I have removed the comments.
>>>
>>>>>
>>>>> 	/** Ring producer status. */
>>>>> 	union {
>>>>> 		struct rte_ring_headtail prod;
>>>>> 		struct rte_ring_hts_headtail hts_prod;
>>>>> 		struct rte_ring_rts_headtail rts_prod;
>>>>> 	}  __rte_cache_aligned;
>>>>>
>>>>> -	char pad1 __rte_cache_aligned; /**< empty cache line */
>>>>> +	RTE_CACHE_GUARD(both);  /**< Isolate producer from consumer. */
>>>
>>> +	RTE_CACHE_GUARD;
>>>
>>>>>
>>>>> 	/** Ring consumer status. */
>>>>> 	union {
>>>>> 		struct rte_ring_headtail cons;
>>>>> 		struct rte_ring_hts_headtail hts_cons;
>>>>> 		struct rte_ring_rts_headtail rts_cons;
>>>>> 	}  __rte_cache_aligned;
>>>>>
>>>>> -	char pad2 __rte_cache_aligned; /**< empty cache line */
>>>>> +	RTE_CACHE_GUARD(cons);  /**< Isolate consumer status. */
>>>
>>> +	RTE_CACHE_GUARD;
>>>
>>>>> };
>>>>>
>>>>>
>>>>> And for the mempool library:
>>>>>
>>>>> #ifdef RTE_LIBRTE_MEMPOOL_STATS
>>>>> /**
>>>>>     * A structure that stores the mempool statistics (per-lcore).
>>>>>     * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs) are
>>>> not
>>>>>     * captured since they can be calculated from other stats.
>>>>>     * For example: put_cache_objs = put_objs - put_common_pool_objs.
>>>>>     */
>>>>> struct rte_mempool_debug_stats {
>>>>> 	uint64_t put_bulk;             /**< Number of puts. */
>>>>> 	uint64_t put_objs;             /**< Number of objects successfully
>>>> put. */
>>>>> 	uint64_t put_common_pool_bulk; /**< Number of bulks enqueued in
>>>> common pool. */
>>>>> 	uint64_t put_common_pool_objs; /**< Number of objects enqueued in
>>>> common pool. */
>>>>> 	uint64_t get_common_pool_bulk; /**< Number of bulks dequeued from
>>>> common pool. */
>>>>> 	uint64_t get_common_pool_objs; /**< Number of objects dequeued
>>>> from common pool. */
>>>>> 	uint64_t get_success_bulk;     /**< Successful allocation number.
>>>> */
>>>>> 	uint64_t get_success_objs;     /**< Objects successfully
>>>> allocated. */
>>>>> 	uint64_t get_fail_bulk;        /**< Failed allocation number. */
>>>>> 	uint64_t get_fail_objs;        /**< Objects that failed to be
>>>> allocated. */
>>>>> 	uint64_t get_success_blks;     /**< Successful allocation number
>>>> of contiguous blocks. */
>>>>> 	uint64_t get_fail_blks;        /**< Failed allocation number of
>>>> contiguous blocks. */
>>>>> +	RTE_CACHE_GUARD(debug_stats);  /**< Isolation between lcores. */
>>>
>>> +	RTE_CACHE_GUARD;
>>>
>>>>> } __rte_cache_aligned;
>>>>> #endif
>>>>>
>>>>> struct rte_mempool {
>>>>>     [...]
>>>>> #ifdef RTE_LIBRTE_MEMPOOL_STATS
>>>>> 	/** Per-lcore statistics.
>>>>> 	 *
>>>>> 	 * Plus one, for unregistered non-EAL threads.
>>>>> 	 */
>>>>> 	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
>>>>> #endif
>>>>> }  __rte_cache_aligned;
>>>>>
>>>>>
>>>>> It also seems relevant for the PRNG library:
>>>>>
>>>>> /lib/eal/common/rte_random.c:
>>>>>
>>>>> struct rte_rand_state {
>>>>> 	uint64_t z1;
>>>>> 	uint64_t z2;
>>>>> 	uint64_t z3;
>>>>> 	uint64_t z4;
>>>>> 	uint64_t z5;
>>>>> +	RTE_CACHE_GUARD(z);
>>>
>>> +	RTE_CACHE_GUARD;
>>>
>>>>> } __rte_cache_aligned;
>>>>>
>>>>
>>>> Yes.
>>>>
>>>> Should there be two cache guard macros? One parameter-free
>>>> RTE_CACHE_GUARD and a RTE_CACHE_NAMED_GUARD(name) macro?
>>>>
>>>> Maybe it's better just to keep the single macro, but have a convention
>>>> with some generic name (i.e., not 'z' above) for the guard field, like
>>>> 'cache_guard' or just 'guard'. Having unique name makes no sense, except
>>>> in rare cases where you need multiple guard lines per struct.
>>>
>>> There is no need to access these fields, so let's use auto-generated unique
>> names, and not offer an API with a name. This also makes the API as simple as
>> possible.
>>>
>>>>
>>>>> /* One instance each for every lcore id-equipped thread, and one
>>>>>     * additional instance to be shared by all others threads (i.e., all
>>>>>     * unregistered non-EAL threads).
>>>>>     */
>>>>> static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
>>>>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC] cache guard
  2023-08-28  8:46                 ` Mattias Rönnblom
@ 2023-08-28  9:54                   ` Morten Brørup
  2023-08-28 10:40                     ` Stephen Hemminger
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-08-28  9:54 UTC (permalink / raw)
  To: Mattias Rönnblom, Bruce Richardson
  Cc: dev, olivier.matz, andrew.rybchenko, honnappa.nagarahalli,
	konstantin.v.ananyev, mattias.ronnblom

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 28 August 2023 10.46
> 
> On 2023-08-28 08:32, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Monday, 28 August 2023 00.31
> >>
> >> On 2023-08-27 17:40, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >>>> Sent: Sunday, 27 August 2023 15.55

[...]

> >>> So, this gets added to rte_common.h:
> >>>
> >>> /**
> >>>    * Empty cache lines, to guard against false sharing-like effects
> >>>    * on systems with a next-N-lines hardware prefetcher.
> >>>    *
> >>>    * Use as spacing between data accessed by different lcores,
> >>>    * to prevent cache thrashing on hardware with speculative
> prefetching.
> >>>    */
> >>> #if RTE_CACHE_GUARD_LINES > 0
> >>> #define _RTE_CACHE_GUARD_HELPER2(unique) \
> >>>           char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE *
> >> RTE_CACHE_GUARD_LINES] \
> >>>           __rte_cache_aligned;
> >>> #define _RTE_CACHE_GUARD_HELPER1(unique)
> _RTE_CACHE_GUARD_HELPER2(unique)
> >>> #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
> >>> #else
> >>> #define RTE_CACHE_GUARD
> >>> #endif
> >>>
> >>
> >> Seems like a good solution. I thought as far as using __LINE__ to
> build
> >> a unique name, but __COUNTER__ is much cleaner, provided it's
> available
> >> in relevant compilers. (It's not in C11.)
> >
> > I considered __LINE__ too, but came to the same conclusion...
> __COUNTER__ is cleaner for this purpose.
> >
> > And since __COUNTER__ is being used elsewhere in DPDK, I assume it is
> available for use here too.
> >
> > If it turns out causing problems, we can easily switch to __LINE__
> instead.
> >
> >>
> >> Should the semicolon be included or not in HELPER2? If left out, a
> >> lonely ";" will be left for RTE_CACHE_GUARD_LINES == 0, but I don't
> >> think that is a problem.
> >
> > I tested it on Godbolt, and the lonely ";" in a struct didn't seem to
> be a problem.
> >
> > With the semicolon in HELPER2, there will be a lonely ";" in the
> struct in both cases, i.e. with and without cache guards enabled.
> >
> >>
> >> I don't see why __rte_cache_aligned is needed here. The adjacent
> struct
> >> must be cache-line aligned. Maybe it makes it more readable, having
> the
> >> explicit guard padding starting at the start of the actual guard
> cache
> >> lines, rather than potentially at some earlier point before, and
> having
> >> non-guard padding at the end of the struct (from __rte_cache_aligned
> on
> >> the struct level).
> >
> > Having both __rte_cache_aligned and the char array with full cache
> lines ensures that the guard field itself is on its own separate cache
> line, regardless of the organization of adjacent fields in the struct.
> E.g. this will also work:
> >
> > struct test {
> >      char x;
> >      RTE_CACHE_GUARD;
> >      char y;
> > };
> >
> 
> That struct declaration is broken, since it will create false sharing
> between x and y, in case RTE_CACHE_GUARD_LINES is defined to 0.
> 
> Maybe the most intuitive function (semantics) of the RTE_CACHE_GUARD
> macro would be have it deal exclusively with the issue resulting from
> next-N-line (and similar) hardware prefetching, and leave
> __rte_cache_aligned to deal with "classic" (same-cache line) false
> sharing.

Excellent review feedback!

I only thought of the cache guard as a means to provide spacing between elements where the developer already prevented (same-cache line) false sharing by some other means. I didn't even consider the alternative interpretation of its purpose.

Your feedback leaves no doubt that we should extend the cache guard's purpose to also enforce cache alignment (under all circumstances, also when RTE_CACHE_GUARD_LINES is 0).

> 
> Otherwise you would have to have something like
> 
> struct test
> {
> 	char x;
> 	RTE_CACHE_GUARD(char, y);
> };
> 
> ...so that 'y' can be made __rte_cache_aligned by the macro.

There's an easier solution...

We can copy the concept from the RTE_MARKER type, which uses a zero-length array. By simply omitting the #if RTE_CACHE_GUARD_LINES > 0, the macro will serve both purposes:

#define _RTE_CACHE_GUARD_HELPER2(unique) \
        char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
        __rte_cache_aligned;
#define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)

I have verified on Godbolt that this works. The memif driver also uses RTE_MARKER this way [1].

[1]: https://elixir.bootlin.com/dpdk/latest/source/drivers/net/memif/memif.h#L173
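
To illustrate, a sketch of what the guard degenerates to when RTE_CACHE_GUARD_LINES is defined to 0 (assuming 64-byte cache lines; zero-length arrays are a GNU extension, as used by RTE_MARKER):

	struct test {
		char x;
		char cache_guard_0[64 * 0] __rte_cache_aligned; /* zero bytes, */
		char y;   /* ...but y is still pushed to the next 64-byte boundary */
	};

So the extra prefetch-guard lines disappear, while the guard still cache-aligns the following field and thereby prevents classic (same-cache line) false sharing between x and y.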

> 
> RTE_HW_PREFETCH_GUARD could be an alternative name, but I think I like
> RTE_CACHE_GUARD better.
> 

When the macro serves both purposes (regardless of the value of RTE_CACHE_GUARD_LINES), I think we can stick with the RTE_CACHE_GUARD name.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] cache guard
  2023-08-28  9:54                   ` Morten Brørup
@ 2023-08-28 10:40                     ` Stephen Hemminger
  0 siblings, 0 replies; 19+ messages in thread
From: Stephen Hemminger @ 2023-08-28 10:40 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, Bruce Richardson, dev, Olivier Matz,
	Andrew Rybchenko, Honnappa Nagarahalli, Konstantin Ananyev,
	Mattias Rönnblom

[-- Attachment #1: Type: text/plain, Size: 5477 bytes --]

A quick hack might be just to increase the cache line size as an experiment

On Mon, Aug 28, 2023, 11:54 AM Morten Brørup <mb@smartsharesystems.com>
wrote:

> > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > Sent: Monday, 28 August 2023 10.46
> >
> > On 2023-08-28 08:32, Morten Brørup wrote:
> > >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > >> Sent: Monday, 28 August 2023 00.31
> > >>
> > >> On 2023-08-27 17:40, Morten Brørup wrote:
> > >>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > >>>> Sent: Sunday, 27 August 2023 15.55
>
> [...]
>
> > >>> So, this gets added to rte_common.h:
> > >>>
> > >>> /**
> > >>>    * Empty cache lines, to guard against false sharing-like effects
> > >>>    * on systems with a next-N-lines hardware prefetcher.
> > >>>    *
> > >>>    * Use as spacing between data accessed by different lcores,
> > >>>    * to prevent cache thrashing on hardware with speculative
> > prefetching.
> > >>>    */
> > >>> #if RTE_CACHE_GUARD_LINES > 0
> > >>> #define _RTE_CACHE_GUARD_HELPER2(unique) \
> > >>>           char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE *
> > >> RTE_CACHE_GUARD_LINES] \
> > >>>           __rte_cache_aligned;
> > >>> #define _RTE_CACHE_GUARD_HELPER1(unique)
> > _RTE_CACHE_GUARD_HELPER2(unique)
> > >>> #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
> > >>> #else
> > >>> #define RTE_CACHE_GUARD
> > >>> #endif
> > >>>
> > >>
> > >> Seems like a good solution. I thought as far as using __LINE__ to
> > build
> > >> a unique name, but __COUNTER__ is much cleaner, provided it's
> > available
> > >> in relevant compilers. (It's not in C11.)
> > >
> > > I considered __LINE__ too, but came to the same conclusion...
> > __COUNTER__ is cleaner for this purpose.
> > >
> > > And since __COUNTER__ is being used elsewhere in DPDK, I assume it is
> > available for use here too.
> > >
> > > If it turns out causing problems, we can easily switch to __LINE__
> > instead.
> > >
> > >>
> > >> Should the semicolon be included or not in HELPER2? If left out, a
> > >> lonely ";" will be left for RTE_CACHE_GUARD_LINES == 0, but I don't
> > >> think that is a problem.
> > >
> > > I tested it on Godbolt, and the lonely ";" in a struct didn't seem to
> > be a problem.
> > >
> > > With the semicolon in HELPER2, there will be a lonely ";" in the
> > struct in both cases, i.e. with and without cache guards enabled.
> > >
> > >>
> > >> I don't see why __rte_cache_aligned is needed here. The adjacent
> > struct
> > >> must be cache-line aligned. Maybe it makes it more readable, having
> > the
> > >> explicit guard padding starting at the start of the actual guard
> > cache
> > >> lines, rather than potentially at some earlier point before, and
> > having
> > >> non-guard padding at the end of the struct (from __rte_cache_aligned
> > on
> > >> the struct level).
> > >
> > > Having both __rte_cache_aligned and the char array with full cache
> > lines ensures that the guard field itself is on its own separate cache
> > line, regardless of the organization of adjacent fields in the struct.
> > E.g. this will also work:
> > >
> > > struct test {
> > >      char x;
> > >      RTE_CACHE_GUARD;
> > >      char y;
> > > };
> > >
> >
> > That struct declaration is broken, since it will create false sharing
> > between x and y, in case RTE_CACHE_GUARD_LINES is defined to 0.
> >
> > Maybe the most intuitive function (semantics) of the RTE_CACHE_GUARD
> > macro would be have it deal exclusively with the issue resulting from
> > next-N-line (and similar) hardware prefetching, and leave
> > __rte_cache_aligned to deal with "classic" (same-cache line) false
> > sharing.
>
> Excellent review feedback!
>
> I only thought of the cache guard as a means to provide spacing between
> elements where the developer already prevented (same-cache line) false
> sharing by some other means. I didn't even consider the alternative
> interpretation of its purpose.
>
> Your feedback leaves no doubt that we should extend the cache guard's
> purpose to also enforce cache alignment (under all circumstances, also when
> RTE_CACHE_GUARD_LINES is 0).
>
> >
> > Otherwise you would have to have something like
> >
> > struct test
> > {
> >       char x;
> >       RTE_CACHE_GUARD(char, y);
> > };
> >
> > ...so that 'y' can be made __rte_cache_aligned by the macro.
>
> There's an easier solution...
>
> We can copy the concept from the RTE_MARKER type, which uses a zero-length
> array. By simply omitting the #if RTE_CACHE_GUARD_LINES > 0, the macro will
> serve both purposes:
>
> #define _RTE_CACHE_GUARD_HELPER2(unique) \
>         char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE *
> RTE_CACHE_GUARD_LINES] \
>         __rte_cache_aligned;
> #define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
> #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
>
> I have verified on Godbolt that this works. The memif driver also uses
> RTE_MARKER this way [1].
>
> [1]:
> https://elixir.bootlin.com/dpdk/latest/source/drivers/net/memif/memif.h#L173
>
> >
> > RTE_HW_PREFETCH_GUARD could be an alternative name, but I think I like
> > RTE_CACHE_GUARD better.
> >
>
> When the macro serves both purposes (regardless of the value of
> RTE_CACHE_GUARD_LINES), I think we can stick with the RTE_CACHE_GUARD name.
>
>
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] cache guard
  2023-08-27  8:34       ` [RFC] cache guard Morten Brørup
  2023-08-27 13:55         ` Mattias Rönnblom
@ 2023-09-01 12:26         ` Thomas Monjalon
  2023-09-01 16:57           ` Mattias Rönnblom
  1 sibling, 1 reply; 19+ messages in thread
From: Thomas Monjalon @ 2023-09-01 12:26 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko,
	honnappa.nagarahalli, konstantin.v.ananyev, mattias.ronnblom

27/08/2023 10:34, Morten Brørup:
> +CC Honnappa and Konstantin, Ring lib maintainers
> +CC Mattias, PRNG lib maintainer
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Friday, 25 August 2023 11.24
> > 
> > On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> > > +CC mempool maintainers
> > >
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Friday, 25 August 2023 10.23
> > > >
> > > > On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > > > > Bruce,
> > > > >
> > > > > With this patch [1], it is noted that the ring producer and
> > consumer data
> > > > should not be on adjacent cache lines, for performance reasons.
> > > > >
> > > > > [1]:
> > > >
> > https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
> > fd4b66
> > > > e75485cc8b63b9aedfbdfe8b0
> > > > >
> > > > > (It's obvious that they cannot share the same cache line, because
> > they are
> > > > accessed by two different threads.)
> > > > >
> > > > > Intuitively, I would think that having them on different cache
> > lines would
> > > > suffice. Why does having an empty cache line between them make a
> > difference?
> > > > >
> > > > > And does it need to be an empty cache line? Or does it suffice
> > having the
> > > > second structure start at two cache lines after the start of the
> > first
> > > > structure (e.g. if the size of the first structure is two cache
> > lines)?
> > > > >
> > > > > I'm asking because the same principle might apply to other code
> > too.
> > > > >
> > > > Hi Morten,
> > > >
> > > > this was something we discovered when working on the distributor
> > library.
> > > > If we have cachelines per core where there is heavy access, having
> > some
> > > > cachelines as a gap between the content cachelines can help
> > performance. We
> > > > believe this helps due to avoiding issues with the HW prefetchers
> > (e.g.
> > > > adjacent cacheline prefetcher) bringing in the second cacheline
> > > > speculatively when an operation is done on the first line.
> > >
> > > I guessed that it had something to do with speculative prefetching,
> > but wasn't sure. Good to get confirmation, and that it has a measureable
> > effect somewhere. Very interesting!
> > >
> > > NB: More comments in the ring lib about stuff like this would be nice.
> > >
> > > So, for the mempool lib, what do you think about applying the same
> > technique to the rte_mempool_debug_stats structure (which is an array
> > indexed per lcore)... Two adjacent lcores heavily accessing their local
> > mempool caches seems likely to me. But how heavy does the access need to
> > be for this technique to be relevant?
> > >
> > 
> > No idea how heavy the accesses need to be for this to have a noticable
> > effect. For things like debug stats, I wonder how worthwhile making such
> > a
> > change would be, but then again, any change would have very low impact
> > too
> > in that case.
> 
> I just tried adding padding to some of the hot structures in our own application, and observed a significant performance improvement for those.
> 
> So I think this technique should have higher visibility in DPDK by adding a new cache macro to rte_common.h:

+1 for more visibility in the docs and for adding a macro, good idea!




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] cache guard
  2023-09-01 12:26         ` Thomas Monjalon
@ 2023-09-01 16:57           ` Mattias Rönnblom
  2023-09-01 18:52             ` Morten Brørup
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2023-09-01 16:57 UTC (permalink / raw)
  To: Thomas Monjalon, Morten Brørup
  Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko,
	honnappa.nagarahalli, konstantin.v.ananyev, mattias.ronnblom

On 2023-09-01 14:26, Thomas Monjalon wrote:
> 27/08/2023 10:34, Morten Brørup:
>> +CC Honnappa and Konstantin, Ring lib maintainers
>> +CC Mattias, PRNG lib maintainer
>>
>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>> Sent: Friday, 25 August 2023 11.24
>>>
>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>> +CC mempool maintainers
>>>>
>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>
>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>> Bruce,
>>>>>>
>>>>>> With this patch [1], it is noted that the ring producer and
>>> consumer data
>>>>> should not be on adjacent cache lines, for performance reasons.
>>>>>>
>>>>>> [1]:
>>>>>
>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
>>> fd4b66
>>>>> e75485cc8b63b9aedfbdfe8b0
>>>>>>
>>>>>> (It's obvious that they cannot share the same cache line, because
>>> they are
>>>>> accessed by two different threads.)
>>>>>>
>>>>>> Intuitively, I would think that having them on different cache
>>> lines would
>>>>> suffice. Why does having an empty cache line between them make a
>>> difference?
>>>>>>
>>>>>> And does it need to be an empty cache line? Or does it suffice
>>> having the
>>>>> second structure start at two cache lines after the start of the
>>> first
>>>>> structure (e.g. if the size of the first structure is two cache
>>> lines)?
>>>>>>
>>>>>> I'm asking because the same principle might apply to other code
>>> too.
>>>>>>
>>>>> Hi Morten,
>>>>>
>>>>> this was something we discovered when working on the distributor
>>> library.
>>>>> If we have cachelines per core where there is heavy access, having
>>> some
>>>>> cachelines as a gap between the content cachelines can help
>>> performance. We
>>>>> believe this helps due to avoiding issues with the HW prefetchers
>>> (e.g.
>>>>> adjacent cacheline prefetcher) bringing in the second cacheline
>>>>> speculatively when an operation is done on the first line.
>>>>
>>>> I guessed that it had something to do with speculative prefetching,
>>> but wasn't sure. Good to get confirmation, and that it has a measureable
>>> effect somewhere. Very interesting!
>>>>
>>>> NB: More comments in the ring lib about stuff like this would be nice.
>>>>
>>>> So, for the mempool lib, what do you think about applying the same
>>> technique to the rte_mempool_debug_stats structure (which is an array
>>> indexed per lcore)... Two adjacent lcores heavily accessing their local
>>> mempool caches seems likely to me. But how heavy does the access need to
>>> be for this technique to be relevant?
>>>>
>>>
>>> No idea how heavy the accesses need to be for this to have a noticable
>>> effect. For things like debug stats, I wonder how worthwhile making such
>>> a
>>> change would be, but then again, any change would have very low impact
>>> too
>>> in that case.
>>
>> I just tried adding padding to some of the hot structures in our own application, and observed a significant performance improvement for those.
>>
>> So I think this technique should have higher visibility in DPDK by adding a new cache macro to rte_common.h:
> 
> +1 to make more visibility in doc and adding a macro, good idea!
> 
> 
> 

A worry I have is that for CPUs with large (in this context) N, you will 
end up with a lot of padding to avoid next-N-lines false sharing. That 
would be padding after, and in the general (non-array) case also before, 
the actual per-lcore data. A slight nuisance is also that those 
prefetched lines of padding will never contain anything useful, and 
thus fetching them will always be a waste.

Padding/alignment may not be the only way to avoid HW-prefetcher-induced 
false sharing for per-lcore data structures.

What we are discussing here is organizing the statically allocated 
per-lcore structs of a particular module in an array with the 
appropriate padding/alignment. In this model, all data related to a 
particular module is close (memory address/page-wise), but not so close 
as to cause false sharing.

/* rte_a.c */

struct rte_a_state
{
	int x;
         RTE_CACHE_GUARD;
} __rte_cache_aligned;

static struct rte_a_state a_states[RTE_MAX_LCORE];

/* rte_b.c */

struct rte_b_state
{
	char y;
         char z;
         RTE_CACHE_GUARD;
} __rte_cache_aligned;


static struct rte_b_state b_states[RTE_MAX_LCORE];

What you would end up with at runtime when the linker has done its job 
is something that essentially looks like this (in memory):

struct {
	struct rte_a_state a_states[RTE_MAX_LCORE];
	struct rte_b_state b_states[RTE_MAX_LCORE];
};

You could consider turning it around, and keeping data (i.e., module 
structs) related to a particular lcore, for all modules, close. In other 
words, keeping per-lcore arrays of variable-sized elements.

So, something that will end up looking like this (in memory, not in the 
source code):

struct rte_lcore_state
{
	struct rte_a_state a_state;
	struct rte_b_state b_state;
         RTE_CACHE_GUARD;
};

struct rte_lcore_state lcore_states[RTE_LCORE_MAX];

In such a scenario, the per-lcore struct type for a module need not (and 
should not) be cache-line-aligned (but may still have some alignment 
requirements). Data will be more tightly packed, and the "next lines" 
prefetched may actually be useful (although I'm guessing in practice 
they will usually not).

There may be several ways to implement that scheme. The above is to 
illustrate how things would look in memory, not necessarily on the level 
of the source code.

One way could be to fit the per-module-per-lcore struct in a chunk of 
memory allocated in a per-lcore heap. In such a case, the DPDK heap 
would need to be extended, maybe with semantics similar to those of NUMA-node 
specific allocations.

Another way would be to use thread-local storage (TLS, __thread), 
although it's unclear to me how well TLS works with larger data structures.

A third way may be to somehow achieve something that looks like the 
above example, using macros, without breaking module encapsulation or 
generally being too intrusive or otherwise cumbersome.

Not sure this is worth the trouble (compared to just more padding), but 
I thought it was an idea worth sharing.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC] cache guard
  2023-09-01 16:57           ` Mattias Rönnblom
@ 2023-09-01 18:52             ` Morten Brørup
  2023-09-04 12:07               ` Mattias Rönnblom
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-09-01 18:52 UTC (permalink / raw)
  To: Mattias Rönnblom, Thomas Monjalon
  Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko,
	honnappa.nagarahalli, konstantin.v.ananyev, mattias.ronnblom

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Friday, 1 September 2023 18.58
> 
> On 2023-09-01 14:26, Thomas Monjalon wrote:
> > 27/08/2023 10:34, Morten Brørup:
> >> +CC Honnappa and Konstantin, Ring lib maintainers
> >> +CC Mattias, PRNG lib maintainer
> >>
> >>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>> Sent: Friday, 25 August 2023 11.24
> >>>
> >>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> >>>> +CC mempool maintainers
> >>>>
> >>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>> Sent: Friday, 25 August 2023 10.23
> >>>>>
> >>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> >>>>>> Bruce,
> >>>>>>
> >>>>>> With this patch [1], it is noted that the ring producer and
> >>> consumer data
> >>>>> should not be on adjacent cache lines, for performance reasons.
> >>>>>>
> >>>>>> [1]:
> >>>>>
> >>>
> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
> >>> fd4b66
> >>>>> e75485cc8b63b9aedfbdfe8b0
> >>>>>>
> >>>>>> (It's obvious that they cannot share the same cache line, because
> >>> they are
> >>>>> accessed by two different threads.)
> >>>>>>
> >>>>>> Intuitively, I would think that having them on different cache
> >>> lines would
> >>>>> suffice. Why does having an empty cache line between them make a
> >>> difference?
> >>>>>>
> >>>>>> And does it need to be an empty cache line? Or does it suffice
> >>> having the
> >>>>> second structure start at two cache lines after the start of the
> >>> first
> >>>>> structure (e.g. if the size of the first structure is two cache
> >>> lines)?
> >>>>>>
> >>>>>> I'm asking because the same principle might apply to other code
> >>> too.
> >>>>>>
> >>>>> Hi Morten,
> >>>>>
> >>>>> this was something we discovered when working on the distributor
> >>> library.
> >>>>> If we have cachelines per core where there is heavy access, having
> >>> some
> >>>>> cachelines as a gap between the content cachelines can help
> >>> performance. We
> >>>>> believe this helps due to avoiding issues with the HW prefetchers
> >>> (e.g.
> >>>>> adjacent cacheline prefetcher) bringing in the second cacheline
> >>>>> speculatively when an operation is done on the first line.
> >>>>
> >>>> I guessed that it had something to do with speculative prefetching,
> >>> but wasn't sure. Good to get confirmation, and that it has a
> measureable
> >>> effect somewhere. Very interesting!
> >>>>
> >>>> NB: More comments in the ring lib about stuff like this would be
> nice.
> >>>>
> >>>> So, for the mempool lib, what do you think about applying the same
> >>> technique to the rte_mempool_debug_stats structure (which is an
> array
> >>> indexed per lcore)... Two adjacent lcores heavily accessing their
> local
> >>> mempool caches seems likely to me. But how heavy does the access
> need to
> >>> be for this technique to be relevant?
> >>>>
> >>>
> >>> No idea how heavy the accesses need to be for this to have a
> noticable
> >>> effect. For things like debug stats, I wonder how worthwhile making
> such
> >>> a
> >>> change would be, but then again, any change would have very low
> impact
> >>> too
> >>> in that case.
> >>
> >> I just tried adding padding to some of the hot structures in our own
> application, and observed a significant performance improvement for
> those.
> >>
> >> So I think this technique should have higher visibility in DPDK by
> adding a new cache macro to rte_common.h:
> >
> > +1 to make more visibility in doc and adding a macro, good idea!
> >
> >
> >
> 
> A worry I have is that for CPUs with large (in this context) N, you will
> end up with a lot of padding to avoid next-N-lines false sharing. That
> would be padding after, and in the general (non-array) case also before,
> the actual per-lcore data. A slight nuisance is also that those
> prefetched lines of padding, will never contain anything useful, and
> thus fetching them will always be a waste.

Out of curiosity, what is the largest N anyone here on the list is aware of?

> 
> Padding/alignment may not be the only way to avoid HW-prefetcher-induced
> false sharing for per-lcore data structures.
> 
> What we are discussing here is organizing the statically allocated
> per-lcore structs of a particular module in an array with the
> appropriate padding/alignment. In this model, all data related to a
> particular module is close (memory address/page-wise), but not so close
> to cause false sharing.
> 
> /* rte_a.c */
> 
> struct rte_a_state
> {
> 	int x;
>          RTE_CACHE_GUARD;
> } __rte_cache_aligned;
> 
> static struct rte_a_state a_states[RTE_MAX_LCORE];
> 
> /* rte_b.c */
> 
> struct rte_b_state
> {
> 	char y;
>          char z;
>          RTE_CACHE_GUARD;
> } __rte_cache_aligned;
> 
> 
> static struct rte_b_state b_states[RTE_MAX_LCORE];
> 
> What you would end up with in runtime when the linker has done its job
> is something that essentially looks like this (in memory):
> 
> struct {
> 	struct rte_a_state a_states[RTE_MAX_LCORE];
> 	struct rte_b_state b_states[RTE_MAX_LCORE];
> };
> 
> You could consider turning it around, and keeping data (i.e., module
> structs) related to a particular lcore, for all modules, close. In other
> words, keeping a per-lcore arrays of variable-sized elements.
> 
> So, something that will end up looking like this (in memory, not in the
> source code):
> 
> struct rte_lcore_state
> {
> 	struct rte_a_state a_state;
> 	struct rte_b_state b_state;
>          RTE_CACHE_GUARD;
> };
> 
> struct rte_lcore_state lcore_states[RTE_LCORE_MAX];
> 
> In such a scenario, the per-lcore struct type for a module need not (and
> should not) be cache-line-aligned (but may still have some alignment
> requirements). Data will be more tightly packed, and the "next lines"
> prefetched may actually be useful (although I'm guessing in practice
> they will usually not).
> 
> There may be several ways to implement that scheme. The above is to
> illustrate how thing would look in memory, not necessarily on the level
> of the source code.
> 
> One way could be to fit the per-module-per-lcore struct in a chunk of
> memory allocated in a per-lcore heap. In such a case, the DPDK heap
> would need extension, maybe with semantics similar to that of NUMA-node
> specific allocations.
> 
> Another way would be to use thread-local storage (TLS, __thread),
> although it's unclear to me how well TLS works with larger data
> structures.
> 
> A third way may be to somehow achieve something that looks like the
> above example, using macros, without breaking module encapsulation or
> generally be too intrusive or otherwise cumbersome.
> 
> Not sure this is worth the trouble (compared to just more padding), but
> I thought it was an idea worth sharing.

I think what Mattias suggests is relevant, and it would be great if a generic solution could be found for DPDK.

For reference, we initially used RTE_PER_LCORE(module_variable), i.e. thread local storage, extensively in our application modules. But it has two disadvantages:
1. TLS does not use hugepages. (The same applies to global and local variables, BTW.)
2. We need to set up global pointers to these TLS variables, so they can be accessed from the main lcore (e.g. for statistics). This means that every module needs some sort of module_per_lcore_init, called by the thread after its creation, to set the module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable).
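
A minimal sketch of that pattern (module_state, module_variable, module_per_lcore_init and module_global_ptr are illustrative names only, not an existing API):

	/* uses <rte_per_lcore.h> and <rte_lcore.h> */
	struct module_state {
		uint64_t counter;  /* whatever per-thread data the module keeps */
	};

	static RTE_DEFINE_PER_LCORE(struct module_state, module_variable); /* TLS */
	static struct module_state *module_global_ptr[RTE_MAX_LCORE];

	/* every lcore thread must call this once after it has been created,
	 * so that e.g. the main lcore can reach the other threads' state */
	static void
	module_per_lcore_init(void)
	{
		module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable);
	}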

Eventually, we gave up and migrated to the DPDK standard design pattern of instantiating a global module_variable[RTE_MAX_LCORE], and letting each thread use its own entry in that array.

And as Mattias suggests, this adds a lot of useless padding, because each module's variables now need to start on their own cache line.

So a generic approach with packed per-thread data would be a great long-term solution.

Short term, I can live with the simple cache guard. It is very easy to implement and use.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] cache guard
  2023-09-01 18:52             ` Morten Brørup
@ 2023-09-04 12:07               ` Mattias Rönnblom
  2023-09-04 12:48                 ` Morten Brørup
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2023-09-04 12:07 UTC (permalink / raw)
  To: Morten Brørup, Thomas Monjalon
  Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko,
	honnappa.nagarahalli, konstantin.v.ananyev, mattias.ronnblom

On 2023-09-01 20:52, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Friday, 1 September 2023 18.58
>>
>> On 2023-09-01 14:26, Thomas Monjalon wrote:
>>> 27/08/2023 10:34, Morten Brørup:
>>>> +CC Honnappa and Konstantin, Ring lib maintainers
>>>> +CC Mattias, PRNG lib maintainer
>>>>
>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>> Sent: Friday, 25 August 2023 11.24
>>>>>
>>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>>>> +CC mempool maintainers
>>>>>>
>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>>>
>>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>>>> Bruce,
>>>>>>>>
>>>>>>>> With this patch [1], it is noted that the ring producer and
>>>>> consumer data
>>>>>>> should not be on adjacent cache lines, for performance reasons.
>>>>>>>>
>>>>>>>> [1]:
>>>>>>>
>>>>>
>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
>>>>> fd4b66
>>>>>>> e75485cc8b63b9aedfbdfe8b0
>>>>>>>>
>>>>>>>> (It's obvious that they cannot share the same cache line, because
>>>>> they are
>>>>>>> accessed by two different threads.)
>>>>>>>>
>>>>>>>> Intuitively, I would think that having them on different cache
>>>>> lines would
>>>>>>> suffice. Why does having an empty cache line between them make a
>>>>> difference?
>>>>>>>>
>>>>>>>> And does it need to be an empty cache line? Or does it suffice
>>>>> having the
>>>>>>> second structure start at two cache lines after the start of the
>>>>> first
>>>>>>> structure (e.g. if the size of the first structure is two cache
>>>>> lines)?
>>>>>>>>
>>>>>>>> I'm asking because the same principle might apply to other code
>>>>> too.
>>>>>>>>
>>>>>>> Hi Morten,
>>>>>>>
>>>>>>> this was something we discovered when working on the distributor
>>>>> library.
>>>>>>> If we have cachelines per core where there is heavy access, having
>>>>> some
>>>>>>> cachelines as a gap between the content cachelines can help
>>>>> performance. We
>>>>>>> believe this helps due to avoiding issues with the HW prefetchers
>>>>> (e.g.
>>>>>>> adjacent cacheline prefetcher) bringing in the second cacheline
>>>>>>> speculatively when an operation is done on the first line.
>>>>>>
>>>>>> I guessed that it had something to do with speculative prefetching,
>>>>> but wasn't sure. Good to get confirmation, and that it has a
>> measureable
>>>>> effect somewhere. Very interesting!
>>>>>>
>>>>>> NB: More comments in the ring lib about stuff like this would be
>> nice.
>>>>>>
>>>>>> So, for the mempool lib, what do you think about applying the same
>>>>> technique to the rte_mempool_debug_stats structure (which is an
>> array
>>>>> indexed per lcore)... Two adjacent lcores heavily accessing their
>> local
>>>>> mempool caches seems likely to me. But how heavy does the access
>> need to
>>>>> be for this technique to be relevant?
>>>>>>
>>>>>
>>>>> No idea how heavy the accesses need to be for this to have a
>> noticable
>>>>> effect. For things like debug stats, I wonder how worthwhile making
>> such
>>>>> a
>>>>> change would be, but then again, any change would have very low
>> impact
>>>>> too
>>>>> in that case.
>>>>
>>>> I just tried adding padding to some of the hot structures in our own
>> application, and observed a significant performance improvement for
>> those.
>>>>
>>>> So I think this technique should have higher visibility in DPDK by
>> adding a new cache macro to rte_common.h:
>>>
>>> +1 to make more visibility in doc and adding a macro, good idea!
>>>
>>>
>>>
>>
>> A worry I have is that for CPUs with large (in this context) N, you will
>> end up with a lot of padding to avoid next-N-lines false sharing. That
>> would be padding after, and in the general (non-array) case also before,
>> the actual per-lcore data. A slight nuisance is also that those
>> prefetched lines of padding, will never contain anything useful, and
>> thus fetching them will always be a waste.
> 
> Out of curiosity, what is the largest N anyone here on the list is aware of?
> 
>>
>> Padding/alignment may not be the only way to avoid HW-prefetcher-induced
>> false sharing for per-lcore data structures.
>>
>> What we are discussing here is organizing the statically allocated
>> per-lcore structs of a particular module in an array with the
>> appropriate padding/alignment. In this model, all data related to a
>> particular module is close (memory address/page-wise), but not so close
>> to cause false sharing.
>>
>> /* rte_a.c */
>>
>> struct rte_a_state
>> {
>> 	int x;
>>           RTE_CACHE_GUARD;
>> } __rte_cache_aligned;
>>
>> static struct rte_a_state a_states[RTE_MAX_LCORE];
>>
>> /* rte_b.c */
>>
>> struct rte_b_state
>> {
>> 	char y;
>>           char z;
>>           RTE_CACHE_GUARD;
>> } __rte_cache_aligned;
>>
>>
>> static struct rte_b_state b_states[RTE_MAX_LCORE];
>>
>> What you would end up with in runtime when the linker has done its job
>> is something that essentially looks like this (in memory):
>>
>> struct {
>> 	struct rte_a_state a_states[RTE_MAX_LCORE];
>> 	struct rte_b_state b_states[RTE_MAX_LCORE];
>> };
>>
>> You could consider turning it around, and keeping data (i.e., module
>> structs) related to a particular lcore, for all modules, close. In other
>> words, keeping a per-lcore arrays of variable-sized elements.
>>
>> So, something that will end up looking like this (in memory, not in the
>> source code):
>>
>> struct rte_lcore_state
>> {
>> 	struct rte_a_state a_state;
>> 	struct rte_b_state b_state;
>>           RTE_CACHE_GUARD;
>> };
>>
>> struct rte_lcore_state lcore_states[RTE_LCORE_MAX];
>>
>> In such a scenario, the per-lcore struct type for a module need not (and
>> should not) be cache-line-aligned (but may still have some alignment
>> requirements). Data will be more tightly packed, and the "next lines"
>> prefetched may actually be useful (although I'm guessing in practice
>> they will usually not).
>>
>> There may be several ways to implement that scheme. The above is to
>> illustrate how thing would look in memory, not necessarily on the level
>> of the source code.
>>
>> One way could be to fit the per-module-per-lcore struct in a chunk of
>> memory allocated in a per-lcore heap. In such a case, the DPDK heap
>> would need extension, maybe with semantics similar to that of NUMA-node
>> specific allocations.
>>
>> Another way would be to use thread-local storage (TLS, __thread),
>> although it's unclear to me how well TLS works with larger data
>> structures.
>>
>> A third way may be to somehow achieve something that looks like the
>> above example, using macros, without breaking module encapsulation or
>> generally be too intrusive or otherwise cumbersome.
>>
>> Not sure this is worth the trouble (compared to just more padding), but
>> I thought it was an idea worth sharing.
> 
> I think what Mattias suggests is relevant, and it would be great if a generic solution could be found for DPDK.
> 
> For reference, we initially used RTE_PER_LCORE(module_variable), i.e. thread local storage, extensively in our application modules. But it has two disadvantages:
> 1. TLS does not use hugepages. (The same applies to global and local variables, BTW.)
> 2. We need to set up global pointers to these TLS variables, so they can be accessed from the main lcore (e.g. for statistics). This means that every module needs some sort of module_per_lcore_init, called by the thread after its creation, to set the module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable).
> 

Good points. I never thought about the initialization issue.

How about memory consumption and TLS? If you have many non-EAL-threads 
in the DPDK process, would the system allocate TLS memory for DPDK 
lcore-specific data structures? Assuming a scenario where __thread was 
used instead of the standard DPDK pattern.

> Eventually, we gave up and migrated to the DPDK standard design pattern of instantiating a global module_variable[RTE_MAX_LCORE], and letting each thread use its own entry in that array.
> 
> And as Mattias suggests, this adds a lot of useless padding, because each modules' variables now need to start on its own cache line.
> 
> So a generic solution with packed per-thread data would be a great long term solution.
> 
> Short term, I can live with the simple cache guard. It is very easy to implement and use.
> 
The RTE_CACHE_GUARD pattern is also intrusive, in the sense that it needs to
be explicitly added everywhere (just like __rte_cache_aligned), error-prone,
and somewhat brittle (in the face of a changed <N>).

(I mention this not to discourage the use of RTE_CACHE_GUARD - rather to
invite something more efficient, robust and easier to use.)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC] cache guard
  2023-09-04 12:07               ` Mattias Rönnblom
@ 2023-09-04 12:48                 ` Morten Brørup
  2023-09-05  5:50                   ` Mattias Rönnblom
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2023-09-04 12:48 UTC (permalink / raw)
  To: Mattias Rönnblom, Thomas Monjalon
  Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko,
	honnappa.nagarahalli, konstantin.v.ananyev, mattias.ronnblom

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 4 September 2023 14.07
> 
> On 2023-09-01 20:52, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Friday, 1 September 2023 18.58
> >>
> >> On 2023-09-01 14:26, Thomas Monjalon wrote:
> >>> 27/08/2023 10:34, Morten Brørup:
> >>>> +CC Honnappa and Konstantin, Ring lib maintainers
> >>>> +CC Mattias, PRNG lib maintainer
> >>>>
> >>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>> Sent: Friday, 25 August 2023 11.24
> >>>>>
> >>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> >>>>>> +CC mempool maintainers
> >>>>>>
> >>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >>>>>>> Sent: Friday, 25 August 2023 10.23
> >>>>>>>
> >>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> >>>>>>>> Bruce,
> >>>>>>>>
> >>>>>>>> With this patch [1], it is noted that the ring producer and
> >>>>> consumer data
> >>>>>>> should not be on adjacent cache lines, for performance reasons.
> >>>>>>>>
> >>>>>>>> [1]:
> >>>>>>>
> >>>>>
> >>
> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
> >>>>> fd4b66
> >>>>>>> e75485cc8b63b9aedfbdfe8b0
> >>>>>>>>
> >>>>>>>> (It's obvious that they cannot share the same cache line,
> because
> >>>>> they are
> >>>>>>> accessed by two different threads.)
> >>>>>>>>
> >>>>>>>> Intuitively, I would think that having them on different cache
> >>>>> lines would
> >>>>>>> suffice. Why does having an empty cache line between them make a
> >>>>> difference?
> >>>>>>>>
> >>>>>>>> And does it need to be an empty cache line? Or does it suffice
> >>>>> having the
> >>>>>>> second structure start at two cache lines after the start of the
> >>>>> first
> >>>>>>> structure (e.g. if the size of the first structure is two cache
> >>>>> lines)?
> >>>>>>>>
> >>>>>>>> I'm asking because the same principle might apply to other code
> >>>>> too.
> >>>>>>>>
> >>>>>>> Hi Morten,
> >>>>>>>
> >>>>>>> this was something we discovered when working on the distributor
> >>>>> library.
> >>>>>>> If we have cachelines per core where there is heavy access,
> having
> >>>>> some
> >>>>>>> cachelines as a gap between the content cachelines can help
> >>>>> performance. We
> >>>>>>> believe this helps due to avoiding issues with the HW
> prefetchers
> >>>>> (e.g.
> >>>>>>> adjacent cacheline prefetcher) bringing in the second cacheline
> >>>>>>> speculatively when an operation is done on the first line.
> >>>>>>
> >>>>>> I guessed that it had something to do with speculative
> prefetching,
> >>>>> but wasn't sure. Good to get confirmation, and that it has a
> >> measureable
> >>>>> effect somewhere. Very interesting!
> >>>>>>
> >>>>>> NB: More comments in the ring lib about stuff like this would be
> >> nice.
> >>>>>>
> >>>>>> So, for the mempool lib, what do you think about applying the
> same
> >>>>> technique to the rte_mempool_debug_stats structure (which is an
> >> array
> >>>>> indexed per lcore)... Two adjacent lcores heavily accessing their
> >> local
> >>>>> mempool caches seems likely to me. But how heavy does the access
> >> need to
> >>>>> be for this technique to be relevant?
> >>>>>>
> >>>>>
> >>>>> No idea how heavy the accesses need to be for this to have a
> >> noticable
> >>>>> effect. For things like debug stats, I wonder how worthwhile
> making
> >> such
> >>>>> a
> >>>>> change would be, but then again, any change would have very low
> >> impact
> >>>>> too
> >>>>> in that case.
> >>>>
> >>>> I just tried adding padding to some of the hot structures in our
> own
> >> application, and observed a significant performance improvement for
> >> those.
> >>>>
> >>>> So I think this technique should have higher visibility in DPDK by
> >> adding a new cache macro to rte_common.h:
> >>>
> >>> +1 to make more visibility in doc and adding a macro, good idea!
> >>>
> >>>
> >>>
> >>
> >> A worry I have is that for CPUs with large (in this context) N, you
> will
> >> end up with a lot of padding to avoid next-N-lines false sharing.
> That
> >> would be padding after, and in the general (non-array) case also
> before,
> >> the actual per-lcore data. A slight nuisance is also that those
> >> prefetched lines of padding, will never contain anything useful, and
> >> thus fetching them will always be a waste.
> >
> > Out of curiosity, what is the largest N anyone here on the list is
> aware of?
> >
> >>
> >> Padding/alignment may not be the only way to avoid HW-prefetcher-
> induced
> >> false sharing for per-lcore data structures.
> >>
> >> What we are discussing here is organizing the statically allocated
> >> per-lcore structs of a particular module in an array with the
> >> appropriate padding/alignment. In this model, all data related to a
> >> particular module is close (memory address/page-wise), but not so
> close
> >> to cause false sharing.
> >>
> >> /* rte_a.c */
> >>
> >> struct rte_a_state
> >> {
> >> 	int x;
> >>           RTE_CACHE_GUARD;
> >> } __rte_cache_aligned;
> >>
> >> static struct rte_a_state a_states[RTE_MAX_LCORE];
> >>
> >> /* rte_b.c */
> >>
> >> struct rte_b_state
> >> {
> >> 	char y;
> >>           char z;
> >>           RTE_CACHE_GUARD;
> >> } __rte_cache_aligned;
> >>
> >>
> >> static struct rte_b_state b_states[RTE_MAX_LCORE];
> >>
> >> What you would end up with in runtime when the linker has done its
> job
> >> is something that essentially looks like this (in memory):
> >>
> >> struct {
> >> 	struct rte_a_state a_states[RTE_MAX_LCORE];
> >> 	struct rte_b_state b_states[RTE_MAX_LCORE];
> >> };
> >>
> >> You could consider turning it around, and keeping data (i.e., module
> >> structs) related to a particular lcore, for all modules, close. In
> other
> >> words, keeping a per-lcore arrays of variable-sized elements.
> >>
> >> So, something that will end up looking like this (in memory, not in
> the
> >> source code):
> >>
> >> struct rte_lcore_state
> >> {
> >> 	struct rte_a_state a_state;
> >> 	struct rte_b_state b_state;
> >>           RTE_CACHE_GUARD;
> >> };
> >>
> >> struct rte_lcore_state lcore_states[RTE_LCORE_MAX];
> >>
> >> In such a scenario, the per-lcore struct type for a module need not
> (and
> >> should not) be cache-line-aligned (but may still have some alignment
> >> requirements). Data will be more tightly packed, and the "next lines"
> >> prefetched may actually be useful (although I'm guessing in practice
> >> they will usually not).
> >>
> >> There may be several ways to implement that scheme. The above is to
> >> illustrate how thing would look in memory, not necessarily on the
> level
> >> of the source code.
> >>
> >> One way could be to fit the per-module-per-lcore struct in a chunk of
> >> memory allocated in a per-lcore heap. In such a case, the DPDK heap
> >> would need extension, maybe with semantics similar to that of NUMA-
> node
> >> specific allocations.
> >>
> >> Another way would be to use thread-local storage (TLS, __thread),
> >> although it's unclear to me how well TLS works with larger data
> >> structures.
> >>
> >> A third way may be to somehow achieve something that looks like the
> >> above example, using macros, without breaking module encapsulation or
> >> generally be too intrusive or otherwise cumbersome.
> >>
> >> Not sure this is worth the trouble (compared to just more padding),
> but
> >> I thought it was an idea worth sharing.
> >
> > I think what Mattias suggests is relevant, and it would be great if a
> generic solution could be found for DPDK.
> >
> > For reference, we initially used RTE_PER_LCORE(module_variable), i.e.
> thread local storage, extensively in our application modules. But it has
> two disadvantages:
> > 1. TLS does not use hugepages. (The same applies to global and local
> variables, BTW.)
> > 2. We need to set up global pointers to these TLS variables, so they
> can be accessed from the main lcore (e.g. for statistics). This means
> that every module needs some sort of module_per_lcore_init, called by
> the thread after its creation, to set the
> module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable).
> >
> 
> Good points. I never thought about the initialization issue.
> 
> How about memory consumption and TLS? If you have many non-EAL-threads
> in the DPDK process, would the system allocate TLS memory for DPDK
> lcore-specific data structures? Assuming a scenario where __thread was
> used instead of the standard DPDK pattern.

Room for all the __thread variables is allocated and initialized for every thread started.

Note: RTE_PER_LCORE variables are simply __thread-wrapped variables:

#define RTE_DEFINE_PER_LCORE(type, name)			\
	__thread __typeof__(type) per_lcore_##name
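
So, purely as an illustration (the variable name is made up), a definition
and an access such as:

	RTE_DEFINE_PER_LCORE(uint64_t, module_counter);
	...
	RTE_PER_LCORE(module_counter)++;

boil down to a plain __thread uint64_t per_lcore_module_counter and an
ordinary increment of that thread-local variable.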

> 
> > Eventually, we gave up and migrated to the DPDK standard design
> pattern of instantiating a global module_variable[RTE_MAX_LCORE], and
> letting each thread use its own entry in that array.
> >
> > And as Mattias suggests, this adds a lot of useless padding, because
> each modules' variables now need to start on its own cache line.
> >
> > So a generic solution with packed per-thread data would be a great
> long term solution.
> >
> > Short term, I can live with the simple cache guard. It is very easy to
> implement and use.
> >
> The RTE_CACHE_GUARD pattern is also intrusive, in the sense it needs to
> be explicitly added everywhere (just like __rte_cache_aligned) and error
> prone, and somewhat brittle (in the face of changed <N>).

Agree. Like a lot of other stuff in DPDK. DPDK is very explicit. :-)

> 
> (I mentioned this not to discourage the use of RTE_CACHE_GUARD - more to
> encourage somehow to invite something more efficient, robust and
> easier-to-use.)

I get that you don't discourage it, but I don't expect the discussion to proceed further at this time. So please don’t forget to also ACK this patch. ;-)

There should be a wish list, where concepts like your suggested improvement could be easily noted.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] cache guard
  2023-09-04 12:48                 ` Morten Brørup
@ 2023-09-05  5:50                   ` Mattias Rönnblom
  0 siblings, 0 replies; 19+ messages in thread
From: Mattias Rönnblom @ 2023-09-05  5:50 UTC (permalink / raw)
  To: Morten Brørup, Thomas Monjalon
  Cc: Bruce Richardson, dev, olivier.matz, andrew.rybchenko,
	honnappa.nagarahalli, konstantin.v.ananyev, mattias.ronnblom

On 2023-09-04 14:48, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 4 September 2023 14.07
>>
>> On 2023-09-01 20:52, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Friday, 1 September 2023 18.58
>>>>
>>>> On 2023-09-01 14:26, Thomas Monjalon wrote:
>>>>> 27/08/2023 10:34, Morten Brørup:
>>>>>> +CC Honnappa and Konstantin, Ring lib maintainers
>>>>>> +CC Mattias, PRNG lib maintainer
>>>>>>
>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>> Sent: Friday, 25 August 2023 11.24
>>>>>>>
>>>>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>>>>>> +CC mempool maintainers
>>>>>>>>
>>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>>>>>
>>>>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>>>>>> Bruce,
>>>>>>>>>>
>>>>>>>>>> With this patch [1], it is noted that the ring producer and
>>>>>>> consumer data
>>>>>>>>> should not be on adjacent cache lines, for performance reasons.
>>>>>>>>>>
>>>>>>>>>> [1]:
>>>>>>>>>
>>>>>>>
>>>>
>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1f
>>>>>>> fd4b66
>>>>>>>>> e75485cc8b63b9aedfbdfe8b0
>>>>>>>>>>
>>>>>>>>>> (It's obvious that they cannot share the same cache line,
>> because
>>>>>>> they are
>>>>>>>>> accessed by two different threads.)
>>>>>>>>>>
>>>>>>>>>> Intuitively, I would think that having them on different cache
>>>>>>> lines would
>>>>>>>>> suffice. Why does having an empty cache line between them make a
>>>>>>> difference?
>>>>>>>>>>
>>>>>>>>>> And does it need to be an empty cache line? Or does it suffice
>>>>>>> having the
>>>>>>>>> second structure start at two cache lines after the start of the
>>>>>>> first
>>>>>>>>> structure (e.g. if the size of the first structure is two cache
>>>>>>> lines)?
>>>>>>>>>>
>>>>>>>>>> I'm asking because the same principle might apply to other code
>>>>>>> too.
>>>>>>>>>>
>>>>>>>>> Hi Morten,
>>>>>>>>>
>>>>>>>>> this was something we discovered when working on the distributor
>>>>>>> library.
>>>>>>>>> If we have cachelines per core where there is heavy access,
>> having
>>>>>>> some
>>>>>>>>> cachelines as a gap between the content cachelines can help
>>>>>>> performance. We
>>>>>>>>> believe this helps due to avoiding issues with the HW
>> prefetchers
>>>>>>> (e.g.
>>>>>>>>> adjacent cacheline prefetcher) bringing in the second cacheline
>>>>>>>>> speculatively when an operation is done on the first line.
>>>>>>>>
>>>>>>>> I guessed that it had something to do with speculative
>> prefetching,
>>>>>>> but wasn't sure. Good to get confirmation, and that it has a
>>>> measureable
>>>>>>> effect somewhere. Very interesting!
>>>>>>>>
>>>>>>>> NB: More comments in the ring lib about stuff like this would be
>>>> nice.
>>>>>>>>
>>>>>>>> So, for the mempool lib, what do you think about applying the
>> same
>>>>>>> technique to the rte_mempool_debug_stats structure (which is an
>>>> array
>>>>>>> indexed per lcore)... Two adjacent lcores heavily accessing their
>>>> local
>>>>>>> mempool caches seems likely to me. But how heavy does the access
>>>> need to
>>>>>>> be for this technique to be relevant?
>>>>>>>>
>>>>>>>
>>>>>>> No idea how heavy the accesses need to be for this to have a
>>>> noticable
>>>>>>> effect. For things like debug stats, I wonder how worthwhile
>> making
>>>> such
>>>>>>> a
>>>>>>> change would be, but then again, any change would have very low
>>>> impact
>>>>>>> too
>>>>>>> in that case.
>>>>>>
>>>>>> I just tried adding padding to some of the hot structures in our
>> own
>>>> application, and observed a significant performance improvement for
>>>> those.
>>>>>>
>>>>>> So I think this technique should have higher visibility in DPDK by
>>>> adding a new cache macro to rte_common.h:
>>>>>
>>>>> +1 to make more visibility in doc and adding a macro, good idea!
>>>>>
>>>>>
>>>>>
>>>>
>>>> A worry I have is that for CPUs with large (in this context) N, you
>> will
>>>> end up with a lot of padding to avoid next-N-lines false sharing.
>> That
>>>> would be padding after, and in the general (non-array) case also
>> before,
>>>> the actual per-lcore data. A slight nuisance is also that those
>>>> prefetched lines of padding, will never contain anything useful, and
>>>> thus fetching them will always be a waste.
>>>
>>> Out of curiosity, what is the largest N anyone here on the list is
>> aware of?
>>>
>>>>
>>>> Padding/alignment may not be the only way to avoid HW-prefetcher-
>> induced
>>>> false sharing for per-lcore data structures.
>>>>
>>>> What we are discussing here is organizing the statically allocated
>>>> per-lcore structs of a particular module in an array with the
>>>> appropriate padding/alignment. In this model, all data related to a
>>>> particular module is close (memory address/page-wise), but not so
>> close
>>>> to cause false sharing.
>>>>
>>>> /* rte_a.c */
>>>>
>>>> struct rte_a_state
>>>> {
>>>> 	int x;
>>>>            RTE_CACHE_GUARD;
>>>> } __rte_cache_aligned;
>>>>
>>>> static struct rte_a_state a_states[RTE_MAX_LCORE];
>>>>
>>>> /* rte_b.c */
>>>>
>>>> struct rte_b_state
>>>> {
>>>> 	char y;
>>>>            char z;
>>>>            RTE_CACHE_GUARD;
>>>> } __rte_cache_aligned;
>>>>
>>>>
>>>> static struct rte_b_state b_states[RTE_MAX_LCORE];
>>>>
>>>> What you would end up with in runtime when the linker has done its
>> job
>>>> is something that essentially looks like this (in memory):
>>>>
>>>> struct {
>>>> 	struct rte_a_state a_states[RTE_MAX_LCORE];
>>>> 	struct rte_b_state b_states[RTE_MAX_LCORE];
>>>> };
>>>>
>>>> You could consider turning it around, and keeping data (i.e., module
>>>> structs) related to a particular lcore, for all modules, close. In
>> other
>>>> words, keeping a per-lcore arrays of variable-sized elements.
>>>>
>>>> So, something that will end up looking like this (in memory, not in
>> the
>>>> source code):
>>>>
>>>> struct rte_lcore_state
>>>> {
>>>> 	struct rte_a_state a_state;
>>>> 	struct rte_b_state b_state;
>>>>            RTE_CACHE_GUARD;
>>>> };
>>>>
>>>> struct rte_lcore_state lcore_states[RTE_LCORE_MAX];
>>>>
>>>> In such a scenario, the per-lcore struct type for a module need not
>> (and
>>>> should not) be cache-line-aligned (but may still have some alignment
>>>> requirements). Data will be more tightly packed, and the "next lines"
>>>> prefetched may actually be useful (although I'm guessing in practice
>>>> they will usually not).
>>>>
>>>> There may be several ways to implement that scheme. The above is to
>>>> illustrate how thing would look in memory, not necessarily on the
>> level
>>>> of the source code.
>>>>
>>>> One way could be to fit the per-module-per-lcore struct in a chunk of
>>>> memory allocated in a per-lcore heap. In such a case, the DPDK heap
>>>> would need extension, maybe with semantics similar to that of NUMA-
>> node
>>>> specific allocations.
>>>>
>>>> Another way would be to use thread-local storage (TLS, __thread),
>>>> although it's unclear to me how well TLS works with larger data
>>>> structures.
>>>>
>>>> A third way may be to somehow achieve something that looks like the
>>>> above example, using macros, without breaking module encapsulation or
>>>> generally be too intrusive or otherwise cumbersome.
>>>>
>>>> Not sure this is worth the trouble (compared to just more padding),
>> but
>>>> I thought it was an idea worth sharing.
>>>
>>> I think what Mattias suggests is relevant, and it would be great if a
>> generic solution could be found for DPDK.
>>>
>>> For reference, we initially used RTE_PER_LCORE(module_variable), i.e.
>> thread local storage, extensively in our application modules. But it has
>> two disadvantages:
>>> 1. TLS does not use hugepages. (The same applies to global and local
>> variables, BTW.)
>>> 2. We need to set up global pointers to these TLS variables, so they
>> can be accessed from the main lcore (e.g. for statistics). This means
>> that every module needs some sort of module_per_lcore_init, called by
>> the thread after its creation, to set the
>> module_global_ptr[rte_lcore_id()] = &RTE_PER_LCORE(module_variable).
>>>
>>
>> Good points. I never thought about the initialization issue.
>>
>> How about memory consumption and TLS? If you have many non-EAL-threads
>> in the DPDK process, would the system allocate TLS memory for DPDK
>> lcore-specific data structures? Assuming a scenario where __thread was
>> used instead of the standard DPDK pattern.
> 
> Room for all the __thread variables is allocated and initialized for every thread started.
> 
> Note: RTE_PER_LCORE variables are simply __thread-wrapped variables:
> 
> #define RTE_DEFINE_PER_LCORE(type, name)			\
> 	__thread __typeof__(type) per_lcore_##name
> 
>>
>>> Eventually, we gave up and migrated to the DPDK standard design
>> pattern of instantiating a global module_variable[RTE_MAX_LCORE], and
>> letting each thread use its own entry in that array.
>>>
>>> And as Mattias suggests, this adds a lot of useless padding, because
>> each modules' variables now need to start on its own cache line.
>>>
>>> So a generic solution with packed per-thread data would be a great
>> long term solution.
>>>
>>> Short term, I can live with the simple cache guard. It is very easy to
>> implement and use.
>>>
>> The RTE_CACHE_GUARD pattern is also intrusive, in the sense it needs to
>> be explicitly added everywhere (just like __rte_cache_aligned) and error
>> prone, and somewhat brittle (in the face of changed <N>).
> 
> Agree. Like a lot of other stuff in DPDK. DPDK is very explicit. :-)
> 
>>
>> (I mentioned this not to discourage the use of RTE_CACHE_GUARD - more to
>> encourage somehow to invite something more efficient, robust and
>> easier-to-use.)
> 
> I get that you don't discourage it, but I don't expect the discussion to proceed further at this time. So please don’t forget to also ACK this patch. ;-)
> 

Acked-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

> There should be a wish list, where concepts like your suggested improvement could be easily noted.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-09-05  5:50 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-25  6:45 cache thrashing question Morten Brørup
2023-08-25  8:22 ` Bruce Richardson
2023-08-25  9:06   ` Morten Brørup
2023-08-25  9:23     ` Bruce Richardson
2023-08-27  8:34       ` [RFC] cache guard Morten Brørup
2023-08-27 13:55         ` Mattias Rönnblom
2023-08-27 15:40           ` Morten Brørup
2023-08-27 22:30             ` Mattias Rönnblom
2023-08-28  6:32               ` Morten Brørup
2023-08-28  8:46                 ` Mattias Rönnblom
2023-08-28  9:54                   ` Morten Brørup
2023-08-28 10:40                     ` Stephen Hemminger
2023-08-28  7:57             ` Bruce Richardson
2023-09-01 12:26         ` Thomas Monjalon
2023-09-01 16:57           ` Mattias Rönnblom
2023-09-01 18:52             ` Morten Brørup
2023-09-04 12:07               ` Mattias Rönnblom
2023-09-04 12:48                 ` Morten Brørup
2023-09-05  5:50                   ` Mattias Rönnblom
