Re: [RFC 0/2] introduce LLC aware functions

From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: Bruce Richardson <bruce.richardson@intel.com>,
	"Varghese, Vipin" <Vipin.Varghese@amd.com>
Cc: "Yigit, Ferruh" <Ferruh.Yigit@amd.com>, "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [RFC 0/2] introduce LLC aware functions
Date: Thu, 12 Sep 2024 15:18:43 +0200
Message-ID: <0add3e61-570c-4d61-bb4e-c11747e75690@lysator.liu.se>
In-Reply-To: <ZuKxoJT0ELst7arJ@bricha3-mobl1.ger.corp.intel.com>

On 2024-09-12 11:17, Bruce Richardson wrote:
> On Thu, Sep 12, 2024 at 02:19:07AM +0000, Varghese, Vipin wrote:
>> [Public]
>>
>> <snipped>
>>
>> > > > > <snipped>
>> > > > >
>> > > > >>> <snipped>
>> > > > >>>
>> > > > >>> Thank you Mattias for the comments and question, please let
>> > > > >>> me try to explain the same below
>> > > > >>>
>> > > > >>>> We shouldn't have a separate CPU/cache hierarchy API
>> > > > >>>> instead?
>> > > > >>>
>> > > > >>> Based on the intention to bring in CPU lcores which share
>> > > > >>> same L3 (for better cache hits and less noisy neighbor)
>> > > > >>> current API focuses on using Last Level Cache. But if the
>> > > > >>> suggestion is `there are SoC where L2 cache are also shared,
>> > > > >>> and the new API should be provisioned`, I am also
>> > > > >>> comfortable with the thought.
>> > > > >>>
>> > > > >>
>> > > > >> Rather than some AMD special case API hacked into
>> > > > >> <rte_lcore.h>, I think we are better off with no DPDK API at
>> > > > >> all for this kind of functionality.
>> > > > >
>> > > > > Hi Mattias, as shared in the earlier email thread, this is not
>> > > > > a AMD special case at all. Let me try to explain this one more
>> > > > > time. One of techniques used to increase cores cost effective
>> > > > > way to go for tiles of compute complexes.
>> > > > > This introduces a bunch of cores in sharing same Last Level
>> > > > > Cache (namely L2, L3 or even L4) depending upon cache topology
>> > > > > architecture.
>> > > > >
>> > > > > The API suggested in RFC is to help end users to selectively
>> > > > > use cores under same Last Level Cache Hierarchy as advertised
>> > > > > by OS (irrespective of the BIOS settings used). This is useful
>> > > > > in both bare-metal and container environment.
>> > > > >
>> > > >
>> > > > I'm pretty familiar with AMD CPUs and the use of tiles (including
>> > > > the challenges these kinds of non-uniformities pose for work
>> > > > scheduling).
>> > > >
>> > > > To maximize performance, caring about core<->LLC relationship may
>> > > > well not be enough, and more HT/core/cache/memory topology
>> > > > information is required. That's what I meant by special case. A
>> > > > proper API should allow access to information about which lcores
>> > > > are SMT siblings, cores on the same L2, and cores on the same L3,
>> > > > to name a few things. Probably you want to fit NUMA into the same
>> > > > API as well, although that is available already in <rte_lcore.h>.
>> > >
>> > > Thank you Mattias for the information, as shared by in the reply
>> > > with Anatoly we want expose a new API `rte_get_next_lcore_ex` which
>> > > intakes a extra argument `u32 flags`.
>> > > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>> > > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>> > > RTE_GET_LCORE_BOOST_DISABLED.
>> > >
>> >
>> > For the naming, would "rte_get_next_sibling_core" (or lcore if you
>> > prefer) be a clearer name than just adding "ex" on to the end of the
>> > existing function?
>>
>> Thank you Bruce, please find my answer below.
>>
>> Functions shared as per the RFC were
>>
>> ```
>> - rte_get_llc_first_lcores: Retrieves all the first lcores in the
>>   shared LLC.
>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first n
>>   lcores in the shared LLC.
>> ```
>>
>> MACROs extending the usability were
>>
>> ```
>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each
>> LLC.
>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker
>> lcore from each LLC.
>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint
>> (lcore id).
>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while
>> skipping first worker.
>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from
>> each LLC.
>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then
>> iterates through remaining lcores in each LLC.
>> ```
>>
>> Based on the discussions we agreed on sharing a version-2 RFC for
>> extending the API as `rte_get_next_lcore_extnd` with an extra argument,
>> `flags`.
>>
>> As per my ideation, for the API `rte_get_next_sibling_core`, the above
>> API can easily cover this with flag `RTE_GET_LCORE_L1` (SMT). Is this
>> the right understanding?
>>
>> We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which
>> allow iterating over SMT sibling threads.
>>
> 
> This seems like a lot of new macro and API additions! I'd really like to
> cut that back and simplify the amount of new things we are adding to DPDK
> for this. I tend to agree with others that external libs would be better
> for apps that really want to deal with all this.
> 

Conveying HW topology will require a fair bit of API verbiage. I think 
there's no way around it, other than giving the API user half of the 
story (or 1% of the story).

That's one of the reasons I think it should be in a separate header file 
in EAL.

>>
>> > Looking logically, I'm not sure about the BOOST_ENABLED and
>> > BOOST_DISABLED flags you propose
>>
>> The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power
>> library which allows to enable boost.
>>
>> Allow user to select lcores where BOOST is enabled|disabled using MACRO
>> or API.
>>
>> > - in a system with multiple possible
>> > standard and boost frequencies what would those correspond to?
>>
>> I now understand the confusion, apologies for mixing the AMD EPYC SoC
>> boost with Intel Turbo.
>>
>> Thank you for pointing out, we will use the terminology
>> `RTE_GET_LCORE_TURBO`.
>>
> 
> That still doesn't clarify it for me. If you start mixing in power
> management related functions in with topology ones things will turn into a
> real headache. What does boost or turbo correspond to? Is it for cores that
> have the feature enabled - whether or not it's currently in use - or is it
> for finding cores that are currently boosted? Do we need additions for
> cores that are boosted by 100MHz vs say 300MHz? What about cores that are
> in lower frequencies for power-saving? Do we add macros for finding those?
> 

In my world, the operating frequency is a property of a CPU core node in 
the hardware topology.

lcore discrimination (or classification) shouldn't be built as a myriad 
of FOREACH macros, but rather generic iteration + app domain logic.

For example, the size of the L3 could be a factor. Should we have a 
FOREACH_BIG_L3? No.
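
To make "generic iteration + app domain logic" concrete, here is a rough
sketch in plain C. Nothing in it is an existing DPDK API: `struct topo_core`,
`topo_next_core()`, `topo_count()` and the `l3_at_least()` predicate are all
hypothetical names, and the per-core fields are only assumed examples of what
a topology node might carry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-core topology record; not an existing DPDK type. */
struct topo_core {
	unsigned int lcore_id;
	unsigned int l3_id;   /* which L3 this core sits under */
	uint32_t l3_size_kb;  /* size of that L3 */
	uint32_t freq_mhz;    /* operating frequency as a node property */
};

/* App-supplied predicate: this is where the app domain logic lives. */
typedef bool (*topo_core_pred)(const struct topo_core *core, void *arg);

/*
 * Generic iteration: index of the first core at or after 'start' that
 * satisfies 'pred' (NULL matches every core), or 'n' if there is none.
 */
static size_t
topo_next_core(const struct topo_core *cores, size_t n, size_t start,
	       topo_core_pred pred, void *arg)
{
	assert(cores != NULL || n == 0);
	for (size_t i = start; i < n; i++)
		if (pred == NULL || pred(&cores[i], arg))
			return i;
	return n;
}

/* Example predicate: what counts as a "big L3" is decided by the app. */
static bool
l3_at_least(const struct topo_core *core, void *arg)
{
	return core->l3_size_kb >= *(const uint32_t *)arg;
}

/* Count the matching cores by walking the generic iterator. */
static size_t
topo_count(const struct topo_core *cores, size_t n,
	   topo_core_pred pred, void *arg)
{
	size_t count = 0;

	for (size_t i = topo_next_core(cores, n, 0, pred, arg); i < n;
	     i = topo_next_core(cores, n, i + 1, pred, arg))
		count++;
	return count;
}
```

With something of this shape, selecting by L3 size, frequency, or any
combination is a new predicate in the application, not a new macro in EAL.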

>>
>> > What's also
>> > missing is a define for getting actual NUMA siblings i.e. those
>> > sharing common memory but not an L3 or anything else.
>>
>> This can be extended into `rte_get_next_lcore_extnd` with flag
>> `RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same
>> sub-memory NUMA as shared by LCORE.
>>
>> If SMT sibling is enabled and DPDK Lcore mask covers the sibling
>> threads, then `RTE_GET_LCORE_NUMA` get all lcore and sibling threads
>> under same memory NUMA of lcore shared.
>>
> 
> Yes. That can work. But it means we are basing the implementation on a
> fixed idea of what topologies there are or can exist. My suggestion below
> is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a
> way to find its nearest nodes.
> 

I think we need to agree on what the purpose of this API is. Is it to 
describe the hardware topology in some detail for general-purpose use 
(including informing the operator, lstopo-style), or just some abstract, 
simplified representation to be used purely for work scheduling?

> After all, the app doesn't want to know the topology just for the sake of
> knowing it - it wants it to ensure best placement of work on cores! To that
> end, it just needs to know what cores are near to each other and what are
> far away.
> 
>>
>> > My suggestion would be to have the function take just an integer-type
>> > e.g. uint16_t parameter which defines the memory/cache hierarchy level
>> > to use, 0 being lowest, 1 next, and so on. Different systems may have
>> > different numbers of cache levels so let's just make it a zero-based
>> > index of levels, rather than giving explicit defines (except for
>> > memory which should probably always be last). The zero-level will be
>> > for "closest neighbour"
>>
>> Good idea, we did prototype this internally. But the issue is that it
>> will keep on adding to the number of APIs in the lcore library.
>>
>> To keep the API count low, we are using lcore id as a hint to sub-NUMA.
>>
> 
> I'm unclear about this keeping the API count down - you are proposing a lot
> of APIs and macros up above. My suggestion is basically to add two APIs and
> no macros: one API to get the max number of topology-nearness levels, and a
> second API to get the next sibling at a given nearness level from
> 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too.
> 
> Overall, though, as I say above, let's focus on the problem the app
> actually wants these APIs for, not how we think we should solve it. Apps
> don't want to know the topology for knowledge's sake, they want to use that
> knowledge to improve performance by pinning tasks to cores. What is the
> minimum that we need to provide to enable the app to do that? For example,
> if there are no lcores that share an L1, then from an app topology
> viewpoint that L1 level may as well not exist, because it provides us no
> details on how to place our work.
> 
> For the rare app that does have some esoteric use-case that does actually
> want to know some intricate details of the topology, then having that app
> use an external lib is probably a better solution than us trying to cover
> all possible options in DPDK.
> 
> My 2c. on this at this stage anyway.
> 
> /Bruce
>
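
For reference, the two-function shape Bruce describes (one call for the
number of nearness levels, one for the next sibling at a given level) could
look roughly like the sketch below. This is only an illustration in plain C:
`topo_nearness_levels()`, `topo_next_sibling()`, `TOPO_LCORE_NONE` and the
static table are all made up, the level-to-hardware mapping in the comments
is invented data, and it assumes that a further level also includes the
siblings of the nearer levels.

```c
#include <assert.h>

#define TOPO_NUM_LCORES 8
#define TOPO_NUM_LEVELS 3
#define TOPO_LCORE_NONE (-1)

/*
 * topo_group[level][lcore]: lcores with the same id at a level are
 * siblings at that level. Made-up data for an 8-lcore system:
 * level 0 = SMT pairs, level 1 = shared L3, level 2 = same memory node.
 */
static const int topo_group[TOPO_NUM_LEVELS][TOPO_NUM_LCORES] = {
	{0, 0, 1, 1, 2, 2, 3, 3},
	{0, 0, 0, 0, 1, 1, 1, 1},
	{0, 0, 0, 0, 0, 0, 0, 0},
};

/* First call: how many nearness levels does this system expose? */
static int
topo_nearness_levels(void)
{
	return TOPO_NUM_LEVELS;
}

/*
 * Second call: next lcore after 'prev' (pass TOPO_LCORE_NONE to start)
 * that is a sibling of 'lcore' at 'level', skipping 'lcore' itself.
 * Returns TOPO_LCORE_NONE once the siblings are exhausted.
 */
static int
topo_next_sibling(int lcore, int level, int prev)
{
	assert(lcore >= 0 && lcore < TOPO_NUM_LCORES);
	assert(level >= 0 && level < TOPO_NUM_LEVELS);
	assert(prev >= TOPO_LCORE_NONE && prev < TOPO_NUM_LCORES);

	for (int i = prev + 1; i < TOPO_NUM_LCORES; i++)
		if (i != lcore &&
		    topo_group[level][i] == topo_group[level][lcore])
			return i;
	return TOPO_LCORE_NONE;
}
```

An application would loop from level 0 upward and stop at the first level
that yields enough siblings for the work it wants to co-locate. A level at
which no lcores share anything simply returns no siblings, which matches the
point that such a level may as well not exist from the app's viewpoint.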
