From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <66b862bc-ca22-02da-069c-a82c2d52ed49@lysator.liu.se>
Date: Tue, 5 Sep 2023 07:50:06 +0200
Subject: Re: [RFC] cache guard
To: Morten Brørup, Thomas Monjalon
Cc: Bruce Richardson, dev@dpdk.org, olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru, honnappa.nagarahalli@arm.com, konstantin.v.ananyev@yandex.ru, mattias.ronnblom@ericsson.com
References: <98CBD80474FA8B44BF855DF32C47DC35D87B39@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D87B47@smartserver.smartshare.dk>
 <2507011.Sgy9Pd6rRy@thomas>
 <98CBD80474FA8B44BF855DF32C47DC35D87B69@smartserver.smartshare.dk>
 <10be7ca8-fe5a-ba24-503a-9c6ce2fef710@lysator.liu.se>
 <98CBD80474FA8B44BF855DF32C47DC35D87B6F@smartserver.smartshare.dk>
From: Mattias Rönnblom
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87B6F@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

On 2023-09-04 14:48, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 4 September 2023 14.07
>>
>> On 2023-09-01 20:52, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Friday, 1 September 2023 18.58
>>>>
>>>> On 2023-09-01 14:26, Thomas Monjalon wrote:
>>>>> 27/08/2023 10:34, Morten Brørup:
>>>>>> +CC Honnappa and Konstantin, Ring lib maintainers
>>>>>> +CC Mattias, PRNG lib maintainer
>>>>>>
>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>> Sent: Friday, 25 August 2023 11.24
>>>>>>>
>>>>>>> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
>>>>>>>> +CC mempool maintainers
>>>>>>>>
>>>>>>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>>>>>>>>> Sent: Friday, 25 August 2023 10.23
>>>>>>>>>
>>>>>>>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
>>>>>>>>>> Bruce,
>>>>>>>>>>
>>>>>>>>>> With this patch [1], it is noted that the ring producer and
>>>>>>>>>> consumer data should not be on adjacent cache lines, for
>>>>>>>>>> performance reasons.
>>>>>>>>>>
>>>>>>>>>> [1]:
>>>>>>>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
>>>>>>>>>>
>>>>>>>>>> (It's obvious that they cannot share the same cache line,
>>>>>>>>>> because they are accessed by two different threads.)
>>>>>>>>>>
>>>>>>>>>> Intuitively, I would think that having them on different
>>>>>>>>>> cache lines would suffice. Why does having an empty cache
>>>>>>>>>> line between them make a difference?
>>>>>>>>>>
>>>>>>>>>> And does it need to be an empty cache line? Or does it
>>>>>>>>>> suffice having the second structure start two cache lines
>>>>>>>>>> after the start of the first structure (e.g. if the size of
>>>>>>>>>> the first structure is two cache lines)?
>>>>>>>>>>
>>>>>>>>>> I'm asking because the same principle might apply to other
>>>>>>>>>> code too.
>>>>>>>>>>
>>>>>>>>> Hi Morten,
>>>>>>>>>
>>>>>>>>> this was something we discovered when working on the
>>>>>>>>> distributor library. If we have cachelines per core where
>>>>>>>>> there is heavy access, having some cachelines as a gap
>>>>>>>>> between the content cachelines can help performance. We
>>>>>>>>> believe this helps due to avoiding issues with the HW
>>>>>>>>> prefetchers (e.g. the adjacent cacheline prefetcher) bringing
>>>>>>>>> in the second cacheline speculatively when an operation is
>>>>>>>>> done on the first line.
>>>>>>>>
>>>>>>>> I guessed that it had something to do with speculative
>>>>>>>> prefetching, but wasn't sure. Good to get confirmation, and
>>>>>>>> that it has a measurable effect somewhere. Very interesting!
>>>>>>>>
>>>>>>>> NB: More comments in the ring lib about stuff like this would
>>>>>>>> be nice.
>>>>>>>>
>>>>>>>> So, for the mempool lib, what do you think about applying the
>>>>>>>> same technique to the rte_mempool_debug_stats structure (which
>>>>>>>> is an array indexed per lcore)...
>>>>>>>> Two adjacent lcores heavily accessing their local mempool
>>>>>>>> caches seems likely to me. But how heavy does the access need
>>>>>>>> to be for this technique to be relevant?
>>>>>>>>
>>>>>>>
>>>>>>> No idea how heavy the accesses need to be for this to have a
>>>>>>> noticeable effect. For things like debug stats, I wonder how
>>>>>>> worthwhile making such a change would be, but then again, any
>>>>>>> change would have very low impact too in that case.
>>>>>>
>>>>>> I just tried adding padding to some of the hot structures in our
>>>>>> own application, and observed a significant performance
>>>>>> improvement for those.
>>>>>>
>>>>>> So I think this technique should have higher visibility in DPDK
>>>>>> by adding a new cache macro to rte_common.h:
>>>>>
>>>>> +1 to make more visibility in doc and adding a macro, good idea!
>>>>>
>>>>
>>>> A worry I have is that for CPUs with large (in this context) N,
>>>> you will end up with a lot of padding to avoid next-N-lines false
>>>> sharing. That would be padding after, and in the general
>>>> (non-array) case also before, the actual per-lcore data. A slight
>>>> nuisance is also that those prefetched lines of padding will never
>>>> contain anything useful, and thus fetching them will always be a
>>>> waste.
>>>
>>> Out of curiosity, what is the largest N anyone here on the list is
>>> aware of?
>>>
>>>>
>>>> Padding/alignment may not be the only way to avoid
>>>> HW-prefetcher-induced false sharing for per-lcore data structures.
>>>>
>>>> What we are discussing here is organizing the statically allocated
>>>> per-lcore structs of a particular module in an array with the
>>>> appropriate padding/alignment. In this model, all data related to
>>>> a particular module is close (memory address/page-wise), but not
>>>> so close as to cause false sharing.
>>>>
>>>> /* rte_a.c */
>>>>
>>>> struct rte_a_state
>>>> {
>>>>         int x;
>>>>         RTE_CACHE_GUARD;
>>>> } __rte_cache_aligned;
>>>>
>>>> static struct rte_a_state a_states[RTE_MAX_LCORE];
>>>>
>>>> /* rte_b.c */
>>>>
>>>> struct rte_b_state
>>>> {
>>>>         char y;
>>>>         char z;
>>>>         RTE_CACHE_GUARD;
>>>> } __rte_cache_aligned;
>>>>
>>>> static struct rte_b_state b_states[RTE_MAX_LCORE];
>>>>
>>>> What you would end up with at runtime, when the linker has done
>>>> its job, is something that essentially looks like this (in
>>>> memory):
>>>>
>>>> struct {
>>>>         struct rte_a_state a_states[RTE_MAX_LCORE];
>>>>         struct rte_b_state b_states[RTE_MAX_LCORE];
>>>> };
>>>>
>>>> You could consider turning it around, and keeping data (i.e.,
>>>> module structs) related to a particular lcore, for all modules,
>>>> close. In other words, keeping per-lcore arrays of variable-sized
>>>> elements.
>>>>
>>>> So, something that will end up looking like this (in memory, not
>>>> in the source code):
>>>>
>>>> struct rte_lcore_state
>>>> {
>>>>         struct rte_a_state a_state;
>>>>         struct rte_b_state b_state;
>>>>         RTE_CACHE_GUARD;
>>>> };
>>>>
>>>> struct rte_lcore_state lcore_states[RTE_MAX_LCORE];
>>>>
>>>> In such a scenario, the per-lcore struct type for a module need
>>>> not (and should not) be cache-line-aligned (but may still have
>>>> some alignment requirements). Data will be more tightly packed,
>>>> and the "next lines" prefetched may actually be useful (although
>>>> I'm guessing in practice they usually will not be).
>>>>
>>>> There may be several ways to implement that scheme. The above is
>>>> to illustrate how things would look in memory, not necessarily at
>>>> the level of the source code.
>>>>
>>>> One way could be to fit the per-module-per-lcore struct in a chunk
>>>> of memory allocated from a per-lcore heap. In such a case, the
>>>> DPDK heap would need an extension, maybe with semantics similar to
>>>> those of NUMA-node-specific allocations.
>>>>
>>>> Another way would be to use thread-local storage (TLS, __thread),
>>>> although it's unclear to me how well TLS works with larger data
>>>> structures.
>>>>
>>>> A third way may be to somehow achieve something that looks like
>>>> the above example, using macros, without breaking module
>>>> encapsulation or generally being too intrusive or otherwise
>>>> cumbersome.
>>>>
>>>> Not sure this is worth the trouble (compared to just more
>>>> padding), but I thought it was an idea worth sharing.
>>>
>>> I think what Mattias suggests is relevant, and it would be great if
>>> a generic solution could be found for DPDK.
>>>
>>> For reference, we initially used RTE_PER_LCORE(module_variable),
>>> i.e. thread-local storage, extensively in our application modules.
>>> But it has two disadvantages:
>>> 1. TLS does not use hugepages. (The same applies to global and
>>> local variables, BTW.)
>>> 2. We need to set up global pointers to these TLS variables, so
>>> they can be accessed from the main lcore (e.g. for statistics).
>>> This means that every module needs some sort of
>>> module_per_lcore_init, called by the thread after its creation, to
>>> set module_global_ptr[rte_lcore_id()] =
>>> &RTE_PER_LCORE(module_variable).
>>>
>>
>> Good points. I never thought about the initialization issue.
>>
>> How about memory consumption and TLS? If you have many non-EAL
>> threads in the DPDK process, would the system allocate TLS memory
>> for DPDK lcore-specific data structures? Assuming a scenario where
>> __thread was used instead of the standard DPDK pattern.

> Room for all the __thread variables is allocated and initialized for
> every thread started.
>
> Note: RTE_PER_LCORE variables are simply __thread-wrapped variables:
>
> #define RTE_DEFINE_PER_LCORE(type, name) \
>         __thread __typeof__(type) per_lcore_##name
>
>>
>>> Eventually, we gave up and migrated to the DPDK standard design
>>> pattern of instantiating a global module_variable[RTE_MAX_LCORE],
>>> and letting each thread use its own entry in that array.
>>>
>>> And as Mattias suggests, this adds a lot of useless padding,
>>> because each module's variables now need to start on their own
>>> cache line.
>>>
>>> So a generic solution with packed per-thread data would be a great
>>> long-term solution.
>>>
>>> Short term, I can live with the simple cache guard. It is very easy
>>> to implement and use.
>>>
>> The RTE_CACHE_GUARD pattern is also intrusive, in the sense that it
>> needs to be explicitly added everywhere (just like
>> __rte_cache_aligned), error-prone, and somewhat brittle (in the face
>> of changed ).
>
> Agree. Like a lot of other stuff in DPDK. DPDK is very explicit. :-)
>
>>
>> (I mentioned this not to discourage the use of RTE_CACHE_GUARD, but
>> rather to invite something more efficient, robust and easier to
>> use.)
>
> I get that you don't discourage it, but I don't expect the discussion
> to proceed further at this time. So please don't forget to also ACK
> this patch. ;-)
>

Acked-by: Mattias Rönnblom

> There should be a wish list, where concepts like your suggested
> improvement could be easily noted.
>