A quick hack might just to increase cache line size as experiment 

On Mon, Aug 28, 2023, 11:54 AM Morten Brørup <mb@smartsharesystems.com> wrote:
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 28 August 2023 10.46
>
> On 2023-08-28 08:32, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Monday, 28 August 2023 00.31
> >>
> >> On 2023-08-27 17:40, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >>>> Sent: Sunday, 27 August 2023 15.55

[...]

> >>> So, this gets added to rte_common.h:
> >>>
> >>> /**
> >>>    * Empty cache lines, to guard against false sharing-like effects
> >>>    * on systems with a next-N-lines hardware prefetcher.
> >>>    *
> >>>    * Use as spacing between data accessed by different lcores,
> >>>    * to prevent cache thrashing on hardware with speculative
> prefetching.
> >>>    */
> >>> #if RTE_CACHE_GUARD_LINES > 0
> >>> #define _RTE_CACHE_GUARD_HELPER2(unique) \
> >>>           char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE *
> >> RTE_CACHE_GUARD_LINES] \
> >>>           __rte_cache_aligned;
> >>> #define _RTE_CACHE_GUARD_HELPER1(unique)
> _RTE_CACHE_GUARD_HELPER2(unique)
> >>> #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
> >>> #else
> >>> #define RTE_CACHE_GUARD
> >>> #endif
> >>>
> >>
> >> Seems like a good solution. I thought as far as using __LINE__ to
> build
> >> a unique name, but __COUNTER__ is much cleaner, provided it's
> available
> >> in relevant compilers. (It's not in C11.)
> >
> > I considered __LINE__ too, but came to the same conclusion...
> __COUNTER__ is cleaner for this purpose.
> >
> > And since __COUNTER__ is being used elsewhere in DPDK, I assume it is
> available for use here too.
> >
> > If it turns out causing problems, we can easily switch to __LINE__
> instead.
> >
> >>
> >> Should the semicolon be included or not in HELPER2? If left out, a
> >> lonely ";" will be left for RTE_CACHE_GUARD_LINES == 0, but I don't
> >> think that is a problem.
> >
> > I tested it on Godbolt, and the lonely ";" in a struct didn't seem to
> be a problem.
> >
> > With the semicolon in HELPER2, there will be a lonely ";" in the
> struct in both cases, i.e. with and without cache guards enabled.
> >
> >>
> >> I don't see why __rte_cache_aligned is needed here. The adjacent
> struct
> >> must be cache-line aligned. Maybe it makes it more readable, having
> the
> >> explicit guard padding starting at the start of the actual guard
> cache
> >> lines, rather than potentially at some earlier point before, and
> having
> >> non-guard padding at the end of the struct (from __rte_cache_aligned
> on
> >> the struct level).
> >
> > Having both __rte_cache_aligned and the char array with full cache
> lines ensures that the guard field itself is on its own separate cache
> line, regardless of the organization of adjacent fields in the struct.
> E.g. this will also work:
> >
> > struct test {
> >      char x;
> >      RTE_CACHE_GUARD;
> >      char y;
> > };
> >
>
> That struct declaration is broken, since it will create false sharing
> between x and y, in case RTE_CACHE_GUARD_LINES is defined to 0.
>
> Maybe the most intuitive function (semantics) of the RTE_CACHE_GUARD
> macro would be have it deal exclusively with the issue resulting from
> next-N-line (and similar) hardware prefetching, and leave
> __rte_cache_aligned to deal with "classic" (same-cache line) false
> sharing.

Excellent review feedback!

I only thought of the cache guard as a means to provide spacing between elements where the developer already prevented (same-cache line) false sharing by some other means. I didn't even consider the alternative interpretation of its purpose.

Your feedback leaves no doubt that we should extend the cache guard's purpose to also enforce cache alignment (under all circumstances, also when RTE_CACHE_GUARD_LINES is 0).

>
> Otherwise you would have to have something like
>
> struct test
> {
>       char x;
>       RTE_CACHE_GUARD(char, y);
> };
>
> ...so that 'y' can be made __rte_cache_aligned by the macro.

There's an easier solution...

We can copy the concept from the RTE_MARKER type, which uses a zero-length array. By simply omitting the #if RTE_CACHE_GUARD_LINES > 0, the macro will serve both purposes:

#define _RTE_CACHE_GUARD_HELPER2(unique) \
        char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
        __rte_cache_aligned;
#define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)

I have verified on Godbolt that this works. The memif driver also uses RTE_MARKER this way [1].

[1]: https://elixir.bootlin.com/dpdk/latest/source/drivers/net/memif/memif.h#L173

>
> RTE_HW_PREFETCH_GUARD could be an alternative name, but I think I like
> RTE_CACHE_GUARD better.
>

When the macro serves both purposes (regardless of the value of RTE_CACHE_GUARD_LINES), I think we can stick with the RTE_CACHE_GUARD name.