DPDK patches and discussions
From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: "Varghese, Vipin" <Vipin.Varghese@amd.com>,
	Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Cc: "Yigit, Ferruh" <Ferruh.Yigit@amd.com>,
	"dev@dpdk.org" <dev@dpdk.org>, nd <nd@arm.com>
Subject: Re: [RFC 0/2] introduce LLC aware functions
Date: Thu, 12 Sep 2024 09:02:36 +0200
Message-ID: <50ee2d2f-00c0-488c-a80e-1d3021103060@lysator.liu.se>
In-Reply-To: <c590b06b-6d26-4766-92cb-4cc3f1c6e164@lysator.liu.se>

On 2024-09-12 08:38, Mattias Rönnblom wrote:
> On 2024-09-12 03:33, Varghese, Vipin wrote:
>>
>> Snipped
>>
>>>>>>>
>>>>>>> <snipped>
>>>>>>>
>>>>>>>>> <snipped>
>>>>>>>>>
>>>>>>>>> Thank you Mattias for the comments and questions; please let
>>>>>>>>> me try to explain below.
>>>>>>>>>
>>>>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>>>>>
>>>>>>>>> Based on the intention to bring in CPU lcores which share the
>>>>>>>>> same L3 (for better cache hits and a less noisy neighbor), the
>>>>>>>>> current API focuses on using the Last Level Cache.
>>>>>>>>> But if the suggestion is `there are SoCs where the L2 cache is
>>>>>>>>> also shared, and the new API should provision for that`, I am
>>>>>>>>> also comfortable with the thought.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Rather than some AMD special case API hacked into
>>>>>>>> <rte_lcore.h>, I think we are better off with no DPDK API at
>>>>>>>> all for this kind of functionality.
>>>>>>>
>>>>>>> Hi Mattias, as shared in the earlier email thread, this is not
>>>>>>> an AMD special case at all. Let me try to explain this one more
>>>>>>> time. One of the techniques used to increase core count in a
>>>>>>> cost-effective way is to go for tiles of compute complexes.
>>>>>>> This introduces a bunch of cores sharing the same Last Level
>>>>>>> Cache (namely L2, L3 or even L4), depending upon the cache
>>>>>>> topology architecture.
>>>>>>>
>>>>>>> The API suggested in the RFC is to help end users selectively
>>>>>>> use cores under the same Last Level Cache hierarchy, as
>>>>>>> advertised by the OS (irrespective of the BIOS settings used).
>>>>>>> This is useful in both bare-metal and container environments.
>>>>>>>
>>>>>>
>>>>>> I'm pretty familiar with AMD CPUs and the use of tiles (including
>>>>>> the challenges these kinds of non-uniformities pose for work 
>>>>>> scheduling).
>>>>>>
>>>>>> To maximize performance, caring about core<->LLC relationship may
>>>>>> well not be enough, and more HT/core/cache/memory topology
>>>>>> information is required. That's what I meant by special case. A
>>>>>> proper API should allow access to information about which lcores are
>>>>>> SMT siblings, cores on the same L2, and cores on the same L3, to
>>>>>> name a few things. Probably you want to fit NUMA into the same API
>>>>>> as well, although that is available already in <rte_lcore.h>.
>>>>> Thank you Mattias for the information. As shared in the reply to
>>>>> Anatoly, we want to expose a new API `rte_get_next_lcore_ex`
>>>>> which takes an extra argument `u32 flags`.
>>>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>>>>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>>>>> RTE_GET_LCORE_BOOST_DISABLED.
>>>>
>>>> Wouldn't that API be pretty awkward to use?
>> The current API available under DPDK is `rte_get_next_lcore`, which
>> is used within DPDK examples and in customer solutions.
>> Based on the comments from others, we responded with the idea of
>> changing the new API from `rte_get_next_lcore_llc` to
>> `rte_get_next_lcore_extnd`.
>>
>> Can you please help us understand what is `awkward`?
>>
> 
> The awkwardness starts when you are trying to provide hwloc-type 
> information over an API that was designed for iterating over lcores.
> 
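To illustrate (the rte_get_next_lcore_ex() signature and flag 
semantics are assumed from your proposal, and do_something() is just a 
placeholder), iterating over the lcores sharing an L3 with the calling 
lcore would presumably mirror the rte_get_next_lcore() idiom:

unsigned int lcore;

/* Assumed semantics: iterate over enabled lcores, restricted to
 * those on the same L3 as the calling lcore. */
for (lcore = rte_get_next_lcore_ex(-1, RTE_GET_LCORE_L3);
     lcore < RTE_MAX_LCORE;
     lcore = rte_get_next_lcore_ex(lcore, RTE_GET_LCORE_L3))
	do_something(lcore);

Each call answers exactly one question, and combined queries (e.g., 
"on the same L3, but not an SMT sibling") don't fall out naturally 
from a flat flags argument.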
> It seems to me that you should either have:
> A) An API similar to that of hwloc (or any DOM-like API), which would 
> give a low-level description of the hardware in implementation terms. 
> The topology would consist of nodes, with attributes, etc., where 
> nodes are things like cores or instances of caches of some level, and 
> attributes are things like actual, nominal, and maybe max CPU 
> frequency, cache size, or memory size.


To be clear: it's something like this I think of when I say 
"DOM-style" API.

#ifndef RTE_HWTOPO_H
#define RTE_HWTOPO_H

#include <stddef.h>
#include <stdint.h>

struct rte_hwtopo_node;

enum rte_hwtopo_node_type {
     RTE_HWTOPO_NODE_TYPE_CPU_CORE,
     RTE_HWTOPO_NODE_TYPE_CACHE,
     RTE_HWTOPO_NODE_TYPE_NUMA
};

int
rte_hwtopo_init(void);

/* Map an lcore id or an OS CPU id to its node in the topology. */
struct rte_hwtopo_node *
rte_hwtopo_get_core_by_lcore(unsigned int lcore);

struct rte_hwtopo_node *
rte_hwtopo_get_core_by_id(unsigned int os_cpu_id);

/* Tree traversal. */
struct rte_hwtopo_node *
rte_hwtopo_parent(struct rte_hwtopo_node *node);

struct rte_hwtopo_node *
rte_hwtopo_first_child(struct rte_hwtopo_node *node);

struct rte_hwtopo_node *
rte_hwtopo_next_child(struct rte_hwtopo_node *node,
		      struct rte_hwtopo_node *child);

struct rte_hwtopo_node *
rte_hwtopo_first_sibling(struct rte_hwtopo_node *node);

struct rte_hwtopo_node *
rte_hwtopo_next_sibling(struct rte_hwtopo_node *node,
			struct rte_hwtopo_node *sibling);

enum rte_hwtopo_node_type
rte_hwtopo_get_type(struct rte_hwtopo_node *node);

/* Generic, node-type-dependent attributes. */
#define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0
#define RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1
#define RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2

int
rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node,
			  unsigned int attr_name,
			  int64_t *attr_value);

int
rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node,
			unsigned int attr_name,
			char *attr_value, size_t capacity);

#endif

Surely, this too would be awkward (or should I say cumbersome) to use in 
certain scenarios. You could have syntactic sugar/special case helpers 
which address common use cases. You would also build abstractions on top 
of this (like the B case below).

One could have node-type-specific functions instead of generic getters 
and setters. Anyway, this is not a counter-proposal, but rather just to 
make clear what I had in mind.
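
As a usage sketch, built only on the header above (lcore_l3() is a 
hypothetical helper, and I'm assuming the cache-level attribute works 
as the names suggest):

/* Walk up from an lcore's core node until an L3 cache node is
 * found; returns NULL if there is none. */
static struct rte_hwtopo_node *
lcore_l3(unsigned int lcore)
{
	struct rte_hwtopo_node *node =
		rte_hwtopo_get_core_by_lcore(lcore);

	while (node != NULL) {
		int64_t level;

		if (rte_hwtopo_get_type(node) ==
		    RTE_HWTOPO_NODE_TYPE_CACHE &&
		    rte_hwtopo_get_attr_int64(node,
			RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL, &level) == 0 &&
		    level == 3)
			return node;

		node = rte_hwtopo_parent(node);
	}

	return NULL;
}

"Give me all lcores on this L3" would then be a walk over that node's 
children with rte_hwtopo_first_child()/rte_hwtopo_next_child(). Note 
that going from a core node back to an lcore id needs a getter the 
sketch above doesn't include.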

> or
> B) An API to be directly useful for a work scheduler, in which case you 
> should abstract away things like "boost" (and fold them into some 
> abstract capacity notion, together with core "size" [in 
> big-little/heterogeneous systems]), and have an abstract notion of what 
> core is "close" to some other core. This would be something like 
> Linux's scheduling domains.
> 
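A usable B-style API could be small. A hypothetical sketch (all names 
made up for illustration, not a proposal):

/* Relative cost of moving work between two lcores: 0 for SMT
 * siblings, increasing with topological distance (shared L2,
 * shared L3, same NUMA node, remote). */
int rte_lcore_distance(unsigned int lcore_a, unsigned int lcore_b);

/* Abstract per-lcore compute capacity, folding in boost/nominal
 * frequency and core "size" on heterogeneous systems. */
int rte_lcore_capacity(unsigned int lcore);

A work scheduler could then rank candidate lcores by distance and 
capacity without knowing anything about tiles, CCDs, or cache levels.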
> If you want B you probably need A as a part of its implementation, so 
> you may just as well start with A, I suppose.
> 
> What you could do to explore the API design is to add support for, for 
> example, boost core awareness or SMT affinity in the SW scheduler. You 
> could also do an "lstopo" equivalent, since that's needed for debugging 
> and exploration, if nothing else.
> 
> Questions that will have to be answered in a work scheduling scenario 
> are "are these two lcores SMT siblings?", "are these two lcores on 
> the same LLC?", or "give me all lcores on a particular L2 cache".
> 
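With an A-style API underneath, such questions become small helpers. 
For instance (hypothetical, and assuming rte_hwtopo_get_core_by_lcore() 
maps a hardware thread to its physical core node, which the sketch 
above doesn't spell out):

#include <stdbool.h>

/* Two lcores are SMT siblings iff they resolve to the same
 * physical core node (under the mapping assumption above). */
static bool
lcores_are_smt_siblings(unsigned int lcore_a, unsigned int lcore_b)
{
	struct rte_hwtopo_node *a = rte_hwtopo_get_core_by_lcore(lcore_a);
	struct rte_hwtopo_node *b = rte_hwtopo_get_core_by_lcore(lcore_b);

	return a != NULL && b != NULL && a == b;
}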
>>>>
>>>> I mean, what you have is a topology, with nodes of different types 
>>>> and with different properties, and you want to present it to the 
>>>> user.
>> Let me be clear: what we want is for DPDK to help customers use a 
>> unified API which works across multiple platforms.
>> Example: let a vendor have 2 products, namely A and B. CPU-A has all 
>> cores within the same sub-NUMA domain, and CPU-B has cores split 
>> into 2 sub-NUMA domains based on a split LLC.
>> When `rte_get_next_lcore_extnd` is invoked for `LLC` on
>> 1. CPU-A: it returns all cores, as there is no split
>> 2. CPU-B: it returns cores from the specific sub-NUMA domain 
>> partitioned by L3
>>
> 
> I think the function name rte_get_next_lcore_extnd() alone makes clear 
> this is an awkward API. :)
> 
> My gut feeling is to make it more explicit and forget about 
> <rte_lcore.h>. <rte_hwtopo.h>? It could and should still be part of 
> EAL.
> 
>>>>
>>>> In a sense, it's similar to XML and DOM versus SAX. The above is 
>>>> SAX-style, and what I have in mind is something DOM-like.
>>>>
>>>> What use case do you have in mind? What's on top of my list is a 
>>>> scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) 
>>>> and tries to figure out how best to make use of them.
>> Exactly.
>>
>>>> It's not going to "skip" (ignore, leave unused) SMT siblings, or 
>>>> skip non-boosted cores; it would just try to be clever in regards 
>>>> to which cores to use for what purpose.
>> Let me try to share my idea on SMT siblings. When 
>> `rte_get_next_lcore_extnd` is invoked with the `L1 | SMT` flag and 
>> an `lcore`, the API first identifies whether the given lcore is 
>> part of the enabled core list.
>> If yes, it programmatically identifies the sibling thread and 
>> returns it, using either `sysfs` or the hwloc library (we shared 
>> the version concern on distros; will recheck again).
>> If there is no sibling thread available under DPDK, it will fetch 
>> the next lcore (probably lcore + 1).
>>
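For reference, on Linux the sibling lookup described above boils down 
to reading a sysfs file; a minimal sketch (error handling kept short, 
parsing of the returned list omitted):

#include <limits.h>
#include <stdio.h>

/* Read the SMT sibling list (e.g., "0,64") for an OS CPU from
 * /sys/devices/system/cpu/cpu<N>/topology/thread_siblings_list. */
static int
read_thread_siblings(unsigned int os_cpu_id, char *buf, size_t len)
{
	char path[PATH_MAX];
	FILE *f;
	int rc = -1;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/topology/thread_siblings_list",
		 os_cpu_id);

	f = fopen(path, "r");
	if (f == NULL)
		return -1;

	if (fgets(buf, (int)len, f) != NULL)
		rc = 0;

	fclose(f);

	return rc;
}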
> 
> Distributions having old hwloc versions isn't an argument for a new DPDK 
> library or new API. If only that was the issue, then it would be 
> better to help the hwloc project and/or the distributions, rather 
> than the DPDK project.
> 
>>>>
>>>>> This is AMD EPYC SoC agnostic and tries to address all generic 
>>>>> cases.
>>>>> Please do let us know if we (Ferruh & myself) can sync up via a 
>>>>> call.
>>>>
>>>> Sure, I can do that.
>>
>> Let me sync with Ferruh and get a time slot for internal sync.
>>
>>>>
>>> Can this be opened to the rest of the community? This is a common 
>>> problem
>>> that needs to be solved for multiple architectures. I would be 
>>> interested in
>>> attending.
>> Thank you Mattias. At the DPDK Bangkok summit 2024 we did bring this 
>> up, and as per the suggestion from Thomas and Jerin we tried to 
>> bring the RFC up for discussion.
>> For DPDK Montreal 2024, Keesang and (most likely) Ferruh are 
>> travelling to the summit and presenting this as a talk to get 
>> things moving.
>>
>>>
>>>>>>
>> <snipped>
