From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <38d0336d-ea9e-41b3-b3d8-333efb70eb1f@lysator.liu.se>
Date: Wed, 11 Sep 2024 17:55:15 +0200
From: Mattias Rönnblom
Subject: Re: [RFC 0/2] introduce LLC aware functions
To: "Varghese, Vipin", "Yigit, Ferruh", "dev@dpdk.org"
References: <20240827151014.201-1-vipin.varghese@amd.com>
 <45f26104-ad6c-4e42-8446-d8b51ac3f2dd@lysator.liu.se>
List-Id: DPDK patches and discussions <dev@dpdk.org>

On 2024-09-11 05:26, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>>
>> On 2024-09-09 16:22, Varghese, Vipin wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>>>
>>>>> Thank you Mattias for the comments and question; please let me try
>>>>> to explain the same below.
>>>>>
>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>
>>>>> Based on the intention to bring in CPU lcores which share the same
>>>>> L3 (for better cache hits and less noisy neighbors), the current
>>>>> API focuses on using the Last Level Cache. But if the suggestion is
>>>>> `there are SoCs where the L2 cache is also shared, and the new API
>>>>> should be provisioned`, I am also comfortable with the thought.
>>>>>
>>>>
>>>> Rather than some AMD special case API hacked into , I
>>>> think we are better off with no DPDK API at all for this kind of
>>>> functionality.
>>>
>>> Hi Mattias, as shared in the earlier email thread, this is not an
>>> AMD special case at all. Let me try to explain this one more time.
>>> One of the techniques used to increase core counts in a
>>> cost-effective way is to go for tiles of compute complexes.
>>> This introduces a bunch of cores sharing the same Last Level Cache
>>> (namely L2, L3 or even L4), depending upon the cache topology
>>> architecture.
>>>
>>> The API suggested in the RFC is to help end users selectively use
>>> cores under the same Last Level Cache hierarchy, as advertised by
>>> the OS (irrespective of the BIOS settings used). This is useful in
>>> both bare-metal and container environments.
>>>
>>
>> I'm pretty familiar with AMD CPUs and the use of tiles (including the
>> challenges these kinds of non-uniformities pose for work scheduling).
>>
>> To maximize performance, caring about the core<->LLC relationship may
>> well not be enough, and more HT/core/cache/memory topology
>> information is required. That's what I meant by special case. A
>> proper API should allow access to information about which lcores are
>> SMT siblings, cores on the same L2, and cores on the same L3, to name
>> a few things. Probably you want to fit NUMA into the same API as
>> well, although that is available already in .
>
> Thank you Mattias for the information; as shared in the reply to
> Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which
> takes an extra argument `u32 flags`.
> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> RTE_GET_LCORE_BOOST_DISABLED.
>

Wouldn't that API be pretty awkward to use? I mean, what you have is a
topology, with nodes of different types and with different properties,
and you want to present it to the user. In a sense, it's similar to XML
and DOM versus SAX. The above is SAX-style, and what I have in mind is
something DOM-like.

What use case do you have in mind? What's on top of my list is a
scenario where a DPDK app gets a bunch of cores (e.g., -l ) and tries
to figure out how best to make use of them. It's not going to "skip"
(ignore, leave unused) SMT siblings, or skip non-boosted cores; it
would just try to be clever in regards to which cores to use for what
purpose.

> This is AMD EPYC SoC agnostic and trying to address all generic
> cases.
>
> Please do let us know if we (Ferruh & myself) can sync up via a call?
>

Sure, I can do that.

>>
>> One can have a look at how scheduling domains work in the Linux
>> kernel. They model this kind of thing.
>>
>>> As shared in the response to the cover letter, +1 to expand it to
>>> more than just LLC cores.
>>> We have also confirmed the same at
>>> https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.varghese@amd.com/
>>>
>>>>
>>>> A DPDK CPU/memory hierarchy topology API very much makes sense, but
>>>> it should be reasonably generic and complete from the start.
>>>>
>>>>>>
>>>>>> Could potentially be built on the 'hwloc' library.
>>>>>
>>>>> There are 3 reasons we did not explore this path on AMD SoCs:
>>>>>
>>>>> 1. depending on the hwloc version and kernel version, certain SoC
>>>>> hierarchies are not available
>>>>>
>>>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD
>>>>> EPYC SoCs.
>>>>>
>>>>> 3. it adds an extra library dependency that has to be made
>>>>> available for this to work.
>>>>>
>>>>> Hence we have tried to use the Linux documented generic layer of
>>>>> `sysfs CPU cache`.
>>>>>
>>>>> I will try to explore more on hwloc and check if other libraries
>>>>> within DPDK leverage the same.
>>>>>
>>>>>>
>>>>>> I much agree cache/core topology may be of interest to the
>>>>>> application (or a work scheduler, like a DPDK event device), but
>>>>>> it's not limited to LLC. It may well be worthwhile to care about
>>>>>> which cores share L2 cache, for example. Not sure the
>>>>>> RTE_LCORE_FOREACH_* approach scales.
>>>>>
>>>>> Yes, totally understood; on some SoCs, multiple lcores share the
>>>>> same L2 cache.
>>>>>
>>>>> Can we rework the API to be rte_get_cache_ where the user
>>>>> argument is the desired lcore index?
>>>>>
>>>>> 1. index-1: SMT threads
>>>>>
>>>>> 2. index-2: threads sharing same L2 cache
>>>>>
>>>>> 3. index-3: threads sharing same L3 cache
>>>>>
>>>>> 4. index-MAX: identify the threads sharing the last level cache.
>>>>>
>>>>>>
>>>>>>> < Function: Purpose >
>>>>>>> ---------------------
>>>>>>> - rte_get_llc_first_lcores: Retrieves all the first lcores in
>>>>>>> the shared LLC.
>>>>>>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>>>>>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first
>>>>>>> n lcores in the shared LLC.
>>>>>>>
>>>>>>> < MACRO: Purpose >
>>>>>>> ------------------
>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore
>>>>>>> from each LLC.
>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first
>>>>>>> worker lcore from each LLC.
>>>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from the LLC based
>>>>>>> on a hint (lcore id).
>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from
>>>>>>> the LLC while skipping the first worker.
>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n`
>>>>>>> lcores from each LLC.
>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores,
>>>>>>> then iterates through the remaining lcores in each LLC.
>>>>>
>>>>> While the MACROs are simple wrappers invoking the appropriate API,
>>>>> can this be worked out in this fashion?
>>>>>