From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mattias Rönnblom
Date: Thu, 12 Sep 2024 08:38:32 +0200
Subject: Re: [RFC 0/2] introduce LLC aware functions
To: "Varghese, Vipin", Honnappa Nagarahalli
Cc: "Yigit, Ferruh", dev@dpdk.org, nd
References: <20240827151014.201-1-vipin.varghese@amd.com>
 <45f26104-ad6c-4e42-8446-d8b51ac3f2dd@lysator.liu.se>
 <38d0336d-ea9e-41b3-b3d8-333efb70eb1f@lysator.liu.se>
 <716375DE-0C2F-4983-934A-144D7DE342C6@arm.com>

On 2024-09-12 03:33, Varghese, Vipin wrote:
> [Public]
>
> Snipped
>
>>>>>>>> Thank you Mattias for the comments and question, please let me
>>>>>>>> try to explain the same below.
>>>>>>>>
>>>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>>>>
>>>>>>>> Based on the intention to bring in CPU lcores which share the
>>>>>>>> same L3 (for better cache hits and a less noisy neighbor), the
>>>>>>>> current API focuses on using the last level cache. But if the
>>>>>>>> suggestion is `there are SoCs where the L2 cache is also
>>>>>>>> shared, and the new API should be provisioned`, I am also
>>>>>>>> comfortable with the thought.
>>>>>>>
>>>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>,
>>>>>>> I think we are better off with no DPDK API at all for this kind
>>>>>>> of functionality.
>>>>>>
>>>>>> Hi Mattias, as shared in the earlier email thread, this is not an
>>>>>> AMD special case at all. Let me try to explain this one more
>>>>>> time. One of the techniques used to increase core count in a
>>>>>> cost-effective way is to go for tiles of compute complexes. This
>>>>>> introduces a bunch of cores sharing the same last level cache
>>>>>> (namely L2, L3 or even L4), depending upon the cache topology of
>>>>>> the architecture.
>>>>>>
>>>>>> The API suggested in the RFC is to help end users selectively use
>>>>>> cores under the same last level cache hierarchy, as advertised by
>>>>>> the OS (irrespective of the BIOS settings used). This is useful
>>>>>> in both bare-metal and container environments.
>>>>>
>>>>> I'm pretty familiar with AMD CPUs and the use of tiles (including
>>>>> the challenges these kinds of non-uniformities pose for work
>>>>> scheduling).
>>>>>
>>>>> To maximize performance, caring about the core<->LLC relationship
>>>>> may well not be enough, and more HT/core/cache/memory topology
>>>>> information is required. That's what I meant by special case. A
>>>>> proper API should allow access to information about which lcores
>>>>> are SMT siblings, cores on the same L2, and cores on the same L3,
>>>>> to name a few things. Probably you want to fit NUMA into the same
>>>>> API as well, although that is available already in <rte_lcore.h>.
>>>>
>>>> Thank you Mattias for the information. As shared in the reply to
>>>> Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which
>>>> takes an extra argument `u32 flags`. The flags can be
>>>> RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3,
>>>> RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
>>>
>>> Wouldn't using that API be pretty awkward to use?
>
> The current API available in DPDK is `rte_get_next_lcore`, which is
> used within the DPDK examples and in customer solutions. Based on the
> comments from others, we responded to the idea of changing the new
> API from `rte_get_next_lcore_llc` to `rte_get_next_lcore_extnd`.
>
> Can you please help us understand what is `awkward`?
>

The awkwardness starts when you are trying to provide hwloc-type
information over an API that was designed for iterating over lcores.

It seems to me that you should either have:

A) An API in <rte_hwtopo.h> similar to that of hwloc (or any DOM-like
API), which would give a low-level description of the hardware in
implementation terms. The topology would consist of nodes, with
attributes, etc., where nodes are things like cores or instances of
caches of some level, and attributes are things like CPU actual and
nominal (and maybe max) frequency, cache size, or memory size.

or

B) An API to be directly useful for a work scheduler, in which case
you should abstract away things like "boost" (and fold them into some
abstract capacity notion, together with core "size" [in
big-little/heterogeneous systems]), and have an abstract notion of
what core is "close" to some other core. This would be something like
Linux' scheduling domains.

If you want B you probably need A as a part of its implementation, so
you may just as well start with A, I suppose.

What you could do to explore the API design is to add support for, for
example, boost core awareness or SMT affinity in the SW scheduler. You
could also do an "lstopo" equivalent, since that's needed for
debugging and exploration, if nothing else.

One question that will have to be answered in a work scheduling
scenario is "are these two lcores SMT siblings", or "are these two
lcores on the same LLC", or "give me all lcores on a particular L2
cache".
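
To make A) more concrete, here is a rough sketch of what I have in
mind. All of these names are hypothetical; nothing like this exists in
DPDK today:

#include <stdint.h>

enum rte_hwtopo_node_type {
        RTE_HWTOPO_NODE_MACHINE,
        RTE_HWTOPO_NODE_PACKAGE,
        RTE_HWTOPO_NODE_NUMA,
        RTE_HWTOPO_NODE_CACHE_L3,
        RTE_HWTOPO_NODE_CACHE_L2,
        RTE_HWTOPO_NODE_CORE,
        RTE_HWTOPO_NODE_HWTHREAD
};

struct rte_hwtopo_node; /* opaque topology node */

/* Root of the topology tree (the machine). */
struct rte_hwtopo_node *rte_hwtopo_root(void);

struct rte_hwtopo_node *rte_hwtopo_parent(struct rte_hwtopo_node *node);

/* Pass child == NULL to get the first child. */
struct rte_hwtopo_node *
rte_hwtopo_child_next(struct rte_hwtopo_node *node,
                      struct rte_hwtopo_node *child);

enum rte_hwtopo_node_type
rte_hwtopo_type(const struct rte_hwtopo_node *node);

/* Node attributes: cache size, nominal frequency, and the like. */
int rte_hwtopo_attr_u64(const struct rte_hwtopo_node *node, int attr,
                        uint64_t *value);

/* Mapping between hwthread nodes and EAL lcore ids. */
int rte_hwtopo_node_lcore(const struct rte_hwtopo_node *node);
struct rte_hwtopo_node *rte_hwtopo_lcore_node(unsigned int lcore_id);

"Give me all lcores on a particular L2 cache" then falls out
naturally: find the lcore's hwthread node, walk up to the enclosing L2
node, and enumerate the hwthreads below it.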
>>> I mean, what you have is a topology, with nodes of different types
>>> and with different properties, and you want to present it to the
>>> user.
>
> Let me be clear, what we want via DPDK is to help customers use a
> unified API which works across multiple platforms.
> Example - let a vendor have two products, namely A and B. CPU-A has
> all cores within the same sub-NUMA domain, and CPU-B has cores split
> into 2 sub-NUMA domains based on a split LLC.
> When `rte_get_next_lcore_extnd` is invoked for `LLC` on
> 1. CPU-A: it returns all cores, as there is no split
> 2. CPU-B: it returns cores from the specific sub-NUMA domain which is
> partitioned by L3
>

I think the function name rte_get_next_lcore_extnd() alone makes clear
this is an awkward API. :)

My gut feeling is to make it more explicit and forget about
<rte_lcore.h>. <rte_hwtopo.h>? Could and should still be EAL.
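
With explicit queries, your CPU-A/CPU-B example needs no special
casing in the application either; it just asks the question it
actually has. A sketch, where rte_hwtopo_same_l3() is hypothetical
like everything else in <rte_hwtopo.h>:

#include <stdbool.h>
#include <rte_lcore.h>

bool rte_hwtopo_same_l3(unsigned int a, unsigned int b); /* hypothetical */

/* Pick a worker lcore sharing LLC (L3) with ref_lcore, if any. */
static unsigned int
pick_worker_same_llc(unsigned int ref_lcore)
{
        unsigned int lcore;

        RTE_LCORE_FOREACH_WORKER(lcore) {
                if (lcore != ref_lcore &&
                    rte_hwtopo_same_l3(ref_lcore, lcore))
                        return lcore;
        }

        return RTE_MAX_LCORE; /* no such lcore */
}

On CPU-A every worker qualifies (single LLC); on CPU-B only the lcores
in the same L3 partition do. Same application code in both cases.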
>>> In a sense, it's similar to XCM and DOM versus SAX. The above is
>>> SAX-style, and what I have in mind is something DOM-like.
>>>
>>> What use case do you have in mind? What's on top of my list is a
>>> scenario where a DPDK app gets a bunch of cores (e.g., -l <core
>>> list>) and tries to figure out how to best make use of them.
>
> Exactly.
>
>>> It's not going to "skip" (ignore, leave unused) SMT siblings, or
>>> skip non-boosted cores, it would just try to be clever in regards
>>> to which cores to use for what purpose.
>
> Let me try to share my idea on SMT siblings. When
> `rte_get_next_lcore_extnd` is invoked with the `L1 | SMT` flag and an
> `lcore`, the API first identifies whether the given lcore is part of
> the enabled core list. If yes, it programmatically identifies the
> sibling thread, either using sysfs or the hwloc library (shared the
> version concern on distros; will recheck again), and returns it. If
> there is no sibling thread available under DPDK, it will fetch the
> next lcore (probably lcore + 1).
>

Distributions having old hwloc versions isn't an argument for a new
DPDK library or new API. If only that was the issue, then it would be
better to help hwloc and/or the distributions, rather than the DPDK
project.

>>>> This is AMD EPYC SoC agnostic and tries to address all generic
>>>> cases. Please do let us know if we (Ferruh & myself) can sync up
>>>> via call?
>>>
>>> Sure, I can do that.
>
> Let me sync with Ferruh and get a time slot for an internal sync.
>
>> Can this be opened to the rest of the community? This is a common
>> problem that needs to be solved for multiple architectures. I would
>> be interested in attending.
>
> Thank you Mattias; at the DPDK Bangkok summit 2024 we did bring this
> up. As per the suggestions from Thomas and Jerin, we brought in the
> RFC for discussion. For DPDK Montreal 2024, Keesang and Ferruh (most
> likely) are travelling to the summit and presenting this as a talk to
> get things moving.
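
PS: For what it's worth, the sysfs route you mention doesn't require
hwloc at all on Linux. A rough sketch (the sysfs path is real; parsing
of the returned CPU list and most error handling are left out):

#include <stdio.h>
#include <limits.h>

/* Read a CPU's SMT sibling list from sysfs, e.g. "0,128". */
static int
cpu_smt_siblings(unsigned int cpu, char *buf, size_t len)
{
        char path[PATH_MAX];
        FILE *f;
        int rc = -1;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%u/topology/"
                 "thread_siblings_list", cpu);

        f = fopen(path, "r");
        if (f == NULL)
                return -1;

        if (fgets(buf, (int)len, f) != NULL)
                rc = 0;

        fclose(f);

        return rc;
}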