DPDK patches and discussions
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
To: "Varghese, Vipin" <Vipin.Varghese@amd.com>,
	Bruce Richardson <bruce.richardson@intel.com>
Cc: "Mattias Rönnblom" <hofors@lysator.liu.se>,
	"Yigit, Ferruh" <Ferruh.Yigit@amd.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [RFC 0/2] introduce LLC aware functions
Date: Fri, 13 Sep 2024 16:15:24 +0200	[thread overview]
Message-ID: <da8dd475-c379-446b-90e2-d6b9dd62e155@intel.com> (raw)
In-Reply-To: <PH7PR12MB8596CB9F11A1719C33E26A5582642@PH7PR12MB8596.namprd12.prod.outlook.com>

On 9/12/2024 1:50 PM, Varghese, Vipin wrote:
> [Public]
> 
> Snipped
> 
>>>
>>>
>>>     Based on the discussions, we agreed on sharing a version-2 RFC
>>>     extending the API as `rte_get_next_lcore_extnd` with an extra
>>>     `flags` argument.
>>>
>>>     As per my ideation, for the API `rte_get_next_sibling_core`, the above
>>>     API can easily cover it with the flag `RTE_GET_LCORE_L1` (SMT). Is this
>>>     the right understanding?
>>>
>>>     We can easily have simple macros like `RTE_LCORE_FOREACH_L1` which
>>>     allow iterating over SMT sibling threads.
>>>
>>>
>>
>> This seems like a lot of new macro and API additions! I'd really like to cut that
>> back and simplify the amount of new things we are adding to DPDK for this.
> I disagree, Bruce; as per the recent conversation with Anatoly and you, it was shared that the new APIs are
> ```
> 1. rte_get_next_lcore_extnd
> 2. rte_get_next_n_lcore_extnd
> ```
> 
> While I mentioned that custom macros can be added based on typical flag usage, similar to `RTE_LCORE_FOREACH` and `RTE_LCORE_FOREACH_WORKER`, such as
> ```
> RTE_LCORE_FOREACH_FLAG
> RTE_LCORE_FOREACH_WORKER_FLAG
> 
> Or
> 
> RTE_LCORE_FOREACH_LLC
> RTE_LCORE_FOREACH_WORKER_LLC
> ```
> 
> Please note I have not even shared version-2 of RFC yet.
> 
>> I tend to agree with others that external libs would be better for apps that really want to deal with all this.
> I have covered why this is not a good idea in my reply to Mattias' query.
> 
>>
>>>
>>>     >
>>>
>>>     > Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED flags you propose
>>>     The idea for BOOST_ENABLED & BOOST_DISABLED comes from the DPDK power
>>>     library, which allows enabling boost; the flags let the user select
>>>     lcores where boost is enabled or disabled, via a macro or API.
> Maybe there is confusion, so let me be explicit here. The intention of any `rte_get_next_lcore##` API is to fetch lcores.
> Hence the newly proposed API `rte_get_next_lcore_extnd` with the boost flag set is to fetch lcores where boost is enabled.
> There is no intention to enable or disable boost on an lcore with a `get` API.
> 
>>>
>>>
>>>
>>>     - in a system with multiple possible
>>>
>>>     > standard and boost frequencies what would those correspond to?
>>>
>>>     I now understand the confusion, apologies for mixing the AMD EPYC SoC
>>>     boost with Intel Turbo.
>>>
>>>
>>>
>>>     Thank you for pointing out, we will use the terminology `
>>>     RTE_GET_LCORE_TURBO`.
>>>
>>>
>>
>> That still doesn't clarify it for me. If you start mixing power-management-related functions in with topology ones, things will turn into a real headache.
> Can you please tell me what is not clarified? DPDK lcores as of today have no notion of cache, NUMA, power, turbo, or any other SoC feature.
> The initial API introduced was to expose lcores sharing the same last-level cache. Based on the interaction with Anatoly, extending this to support multiple features turned out to be a possibility.
> Hence we said we can share a v2 of the RFC based on this idea.
> 
> But if the consensus is not to include TURBO, I am also OK with that. Let's keep only the cache and NUMA-IO domains.
> 
>> What does boost or turbo correspond to? Is it for cores that have the feature enabled - whether or not it's currently in use - or is it for finding cores that are
>> currently boosted? Do we need additions for cores that are boosted by 100 MHz vs. say 300 MHz? What about cores running at lower frequencies for
>> power saving? Do we add macros for finding those?
> Why are we talking about freq-up and freq-down? This was not even discussed in this RFC patch at all.
> 
>>>
>>>     What's also
>>>
>>>     > missing is a define for getting actual NUMA siblings i.e. those
>>>     sharing common memory but not an L3 or anything else.
>>>
>>>     This can be extended into `rte_get_next_lcore_extnd` with the flag
>>>     `RTE_GET_LCORE_NUMA`. This will allow grabbing all lcores under the
>>>     same sub-memory NUMA domain as the given lcore.
>>>
>>>     If SMT siblings are enabled and the DPDK lcore mask covers the sibling
>>>     threads, then `RTE_GET_LCORE_NUMA` gets all lcores and sibling threads
>>>     under the same memory NUMA domain as the given lcore.
>>>
>>>
>>
>> Yes. That can work. But it means we are basing the implementation on a fixed idea of what topologies there are or can exist.
>> My suggestion below is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a way to find its nearest nodes.
> Bruce, for different vendors' SoCs, the architecture implementation differs. Let me share what I know:
> 1. using L1, we can fetch SMT threads;
> 2. using L2, we can identify certain SoCs on Arm, Intel and PowerPC that group efficiency cores this way;
> 3. using L3, we can identify certain SoCs, like AMD and AF64x, that follow a chiplet or tile-split L3 domain.
> 
>>
>> After all, the app doesn't want to know the topology just for the sake of knowing it - it wants it to ensure best placement of work on cores! To that end, it just needs to know what cores are near to each other and what are far away.
> Exactly, and that is why we want to minimize new libraries and stick to the format of the existing API `rte_get_next_lcore`. Otherwise the end user needs to deploy another, external library and then map it onto DPDK's lcore mapping to identify what is where.
> So as an end user, I prefer a simple API which gets my work done.
> 
>>
>>>
>>>     >
>>>
>>>     > My suggestion would be to have the function take just an integer-type
>>>     e.g.
>>>
>>>     > uint16_t parameter which defines the memory/cache hierarchy level to
>>>     use, 0
>>>
>>>     > being lowest, 1 next, and so on. Different systems may have different
>>>     numbers
>>>
>>>     > of cache levels so lets just make it a zero-based index of levels,
>>>     rather than
>>>
>>>     > giving explicit defines (except for memory which should probably
>>>     always be
>>>
>>>     > last). The zero-level will be for "closest neighbour"
>>>
>>>     Good idea; we did prototype this internally. But the issue is that it
>>>     keeps adding to the number of APIs in the lcore library.
>>>
>>>     To keep the API count low, we are using the lcore id as a hint to the sub-NUMA domain.
>>>
>>
>> I'm unclear about this keeping the API count down - you are proposing a lot of APIs and macros up above.
> No, I am not. As I shared, based on the last discussion with Anatoly, we will end up with only two APIs in lcore. This is explained in the response above.
> 
>> My suggestion is basically to add two APIs and no macros: one API to get the max number of topology-nearness levels, and a
>> second API to get the next sibling at a given nearness level from
>> 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too.
>>
>> Overall, though, as I say above, let's focus on the problem the app actually
>> wants these APIs for, not how we think we should solve it. Apps don't want to
>> know the topology for knowledge sake, they want to use that knowledge to
>> improve performance by pinning tasks to cores. What is the minimum that we
>> need to provide to enable the app to do that? For example, if there are no
>> lcores that share an L1, then from an app topology viewpoint that L1 level may
>> as well not exist, because it provides us no details on how to place our work.
> I have shared above why we need vendor-agnostic L1, L2, L3 and sub-NUMA-IO.
> 
> Snipped

Just to add my 2c here, since my name is being thrown around a lot in 
this discussion :)

I tend to agree with Bruce here in the sense that if we want this API to be 
used to group cores together, then ideally we shouldn't really 
explicitly call out the principle by which we group them unless we have 
to. My main contention with the initial RFC *was* the fact that it was 
tied to specific HW arch stuff in the API.

Vipin has suggested using a "flags" value to discriminate between 
L1/L2/L3/NUMA/whatever ways of grouping cores, and I agree that it's 
better than what was initially proposed (at least from my vantage 
point), but what's even better is not to have any flags at all! As in, I 
think the thing we're presumably trying to achieve here just as well 
could be achieved simply by returning a number of "levels" we have in 
our hierarchy, and user then being able to iterate over nearest 
neighbours sitting on that "level" without explicitly specifying what 
that level is.

So, for some systems level 0 would be SMT, for others - L3, for some - 
NUMA, for yet others - efficiency/performance cores, etc. Bruce's 
suggestion is that we don't explicitly call out the thing we use to 
group the cores by, and instead rely on EAL to parse that information 
out for us into a set of "levels". I would agree that for anything more 
complicated an external library would be the way to go, because, well, 
we're DPDK, not Linux kernel.

But, just to be clear, this is not mutually exclusive with some kind of 
topology-style API. If we do go down that route, then the point about 
"attaching to specific architectural features" becomes moot, as by 
necessity any DOM-style API would have to represent topology in some 
way, which then gets used by DPDK.

The main question here (and other people have rightly asked this 
question) would be, do we want a topology API, or do we want an API to 
assist with scheduling. My impression so far has been that Vipin is 
looking for the latter rather than the former, as no topology-related 
use cases were mentioned in the discussion except as a proxy for scheduling.

-- 
Thanks,
Anatoly



Thread overview: 53+ messages
2024-08-27 15:10 Vipin Varghese
2024-08-27 15:10 ` [RFC 1/2] eal: add llc " Vipin Varghese
2024-08-27 17:36   ` Stephen Hemminger
2024-09-02  0:27     ` Varghese, Vipin
2024-08-27 20:56   ` Wathsala Wathawana Vithanage
2024-08-29  3:21     ` 答复: " Feifei Wang
2024-09-02  1:20     ` Varghese, Vipin
2024-09-03 17:54       ` Wathsala Wathawana Vithanage
2024-09-04  8:18         ` Bruce Richardson
2024-09-06 11:59         ` Varghese, Vipin
2024-09-12 16:58           ` Wathsala Wathawana Vithanage
2024-08-27 15:10 ` [RFC 2/2] eal/lcore: add llc aware for each macro Vipin Varghese
2024-08-27 21:23 ` [RFC 0/2] introduce LLC aware functions Mattias Rönnblom
2024-09-02  0:39   ` Varghese, Vipin
2024-09-04  9:30     ` Mattias Rönnblom
2024-09-04 14:37       ` Stephen Hemminger
2024-09-11  3:13         ` Varghese, Vipin
2024-09-11  3:53           ` Stephen Hemminger
2024-09-12  1:11             ` Varghese, Vipin
2024-09-09 14:22       ` Varghese, Vipin
2024-09-09 14:52         ` Mattias Rönnblom
2024-09-11  3:26           ` Varghese, Vipin
2024-09-11 15:55             ` Mattias Rönnblom
2024-09-11 17:04               ` Honnappa Nagarahalli
2024-09-12  1:33                 ` Varghese, Vipin
2024-09-12  6:38                   ` Mattias Rönnblom
2024-09-12  7:02                     ` Mattias Rönnblom
2024-09-12 11:23                       ` Varghese, Vipin
2024-09-12 12:12                         ` Mattias Rönnblom
2024-09-12 15:50                           ` Stephen Hemminger
2024-09-12 11:17                     ` Varghese, Vipin
2024-09-12 11:59                       ` Mattias Rönnblom
2024-09-12 13:30                         ` Bruce Richardson
2024-09-12 16:32                           ` Mattias Rönnblom
2024-09-12  2:28                 ` Varghese, Vipin
2024-09-11 16:01             ` Bruce Richardson
2024-09-11 22:25               ` Konstantin Ananyev
2024-09-12  2:38                 ` Varghese, Vipin
2024-09-12  2:19               ` Varghese, Vipin
2024-09-12  9:17                 ` Bruce Richardson
2024-09-12 11:50                   ` Varghese, Vipin
2024-09-13 14:15                     ` Burakov, Anatoly [this message]
2024-09-12 13:18                   ` Mattias Rönnblom
2024-08-28  8:38 ` Burakov, Anatoly
2024-09-02  1:08   ` Varghese, Vipin
2024-09-02 14:17     ` Burakov, Anatoly
2024-09-02 15:33       ` Varghese, Vipin
2024-09-03  8:50         ` Burakov, Anatoly
2024-09-05 13:05           ` Ferruh Yigit
2024-09-05 14:45             ` Burakov, Anatoly
2024-09-05 15:34               ` Ferruh Yigit
2024-09-06  8:44                 ` Burakov, Anatoly
2024-09-09 14:14                   ` Varghese, Vipin
