Re: [RFC 0/2] introduce LLC aware functions

From: "Mattias Rönnblom" <hofors@lysator.liu.se>
To: "Varghese, Vipin" <Vipin.Varghese@amd.com>,
	Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Cc: "Yigit, Ferruh" <Ferruh.Yigit@amd.com>,
	"dev@dpdk.org" <dev@dpdk.org>, nd <nd@arm.com>
Subject: Re: [RFC 0/2] introduce LLC aware functions
Date: Thu, 12 Sep 2024 13:59:34 +0200
Message-ID: <42b8749d-ef6d-4857-bf2c-0a5d700405eb@lysator.liu.se>
In-Reply-To: <PH7PR12MB8596B0BFCBD9D8AF75FD2C6B82642@PH7PR12MB8596.namprd12.prod.outlook.com>

On 2024-09-12 13:17, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> <snipped>
>> >>>> Thank you, Mattias, for the information. As shared in the reply
>> >>>> to Anatoly, we want to expose a new API, `rte_get_next_lcore_ex`,
>> >>>> which takes an extra argument, `u32 flags`.
>> >>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>> >>>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>> >>>> RTE_GET_LCORE_BOOST_DISABLED.
>> >>>
>> >>> Wouldn't that API be pretty awkward to use?
>> > The current API available in DPDK is `rte_get_next_lcore`, which is used
>> > within DPDK examples and in customer solutions.
>> > Based on the comments from others, we responded to the idea of renaming
>> > the new API from `rte_get_next_lcore_llc` to `rte_get_next_lcore_extnd`.
>> >
>> > Can you please help us understand what is `awkward`?
>> >
>> 
>> The awkwardness starts when you are trying to provide hwloc-type
>> information over an API that was designed for iterating over lcores.
> I disagree with this point. The current implementation of the lcore library
> is focused only on iterating through the list of enabled cores, the
> core-mask, and the lcore-map.
> With ever-increasing core counts, memory, IO and accelerators on SoCs,
> sub-NUMA partitioning is common across vendor SoCs. Enhancing or
> augmenting the lcore API to extract or provision NUMA and cache topology
> is not awkward.

DPDK providing an API for this information makes sense to me, as I've 
mentioned before. What I questioned was the way it was done (i.e., the 
API design) in your RFC, and the limited scope (which in part you have 
addressed).

> If memory, IO and accelerator can have sub-NUMA domain, why is it 
> awkward to have lcore in domains? Hence I do not agree on the 
> awkwardness argument.
>> 
>> It seems to me that you should either have:
>> A) An API similar to that of hwloc (or any DOM-like API), which would give a
>> low-level description of the hardware in implementation terms.
>> The topology would consist of nodes, with attributes, etc, where nodes are
>> things like cores or instances of caches of some level and attributes are things
>> like CPU actual and nominal, and maybe max frequency, cache size, or memory
>> size.
> Here is the catch: `rte_eal_init` internally invokes `get_cpu|lcores`
> and maps each thread (lcore) to a physical CPU. But there is more than
> just CPU mapping, as we have seen in SoC architectures. The argument
> shared by many is `DPDK is not the place for such topology discovery`.
> As per my current understanding, I have to disagree with the above, because it
> 1. forces the user to use external libraries such as hwloc
> 2. forces the user to create an internal mapping for lcore, core-mask, and
> lcore-map with topology awareness code.
> My intention is to `enable end user to leverage the API format or
> similar API format (rte_get_next_lcore)` to get the best results on any SoC
> (vendor agnostic).
> I fail to grasp why we are asking for CPU topology to be exported via
> external libraries like hwloc, while NIC, PCIe and accelerators are not.
> Hence, let us set up a tech call on Slack or Teams to understand this better.
>> or
>> B) An API to be directly useful for a work scheduler, in which case you should
>> abstract away things like "boost"
> Please note, as shared in my earlier reply to Bruce, I made a mistake in
> calling it boost (AMD SoC terminology). Instead it should be DPDK_TURBO.
> There are use cases and DPDK examples where crypto and compression are
> run on cores where TURBO is enabled. This allows end users to boost when
> there is more work and disable boost when there is less or no work.
>>  (and fold them into some abstract capacity notion, together with core "size" [in big-little/heterogeneous systems]), and
>> have an abstract notion of what core is "close" to some other core. This would
>> be something like Linux's scheduling domains.
> We had a similar discussion with Jerrin on the last day of the Bangkok DPDK
> summit. This RFC was intended to help capture this relevant point. As per
> my current understanding of selected SoCs, the little cores on ARM SoCs
> share an L2 cache; this analogy does not cover all cases, but it
> would be a good start.
>> 
>> If you want B you probably need A as a part of its implementation, so you may
>> just as well start with A, I suppose.
>> 
>> What you could do to explore the API design is to add support for, for
>> example, boost core awareness or SMT affinity in the SW scheduler. You could
>> also do an "lstopo" equivalent, since that's needed for debugging and
>> exploration, if nothing else.
> Not following this analogy; will discuss in detail in the tech talk.
>> 
>> One question that will have to be answered in a work scheduling scenario is
>> "are these two lcores SMT siblings," or "are these two cores on the same LLC",
>> or "give me all lcores on a particular L2 cache".
>> 
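To make the three questions above concrete, here is a minimal sketch of a DOM-like view in which each lcore is a leaf in a tree of cores and caches. It is not taken from the RFC, DPDK, or hwloc: every name in it (`topo_node`, `topo_kind`, `topo_ancestor`, `topo_share`) and the sample topology are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node kinds; these names are not part of DPDK or hwloc. */
enum topo_kind { TOPO_L3, TOPO_L2, TOPO_CORE, TOPO_LCORE };

struct topo_node {
	enum topo_kind kind;
	int id;                          /* lcore id for TOPO_LCORE, instance id otherwise */
	const struct topo_node *parent;  /* NULL at the root */
};

/* Nearest ancestor of the requested kind, or NULL if there is none. */
static const struct topo_node *
topo_ancestor(const struct topo_node *n, enum topo_kind kind)
{
	assert(n != NULL);
	for (n = n->parent; n != NULL; n = n->parent)
		if (n->kind == kind)
			return n;
	return NULL;
}

/* "Are these two lcores under the same core / L2 / L3?" */
static bool
topo_share(const struct topo_node *a, const struct topo_node *b,
	   enum topo_kind kind)
{
	const struct topo_node *pa = topo_ancestor(a, kind);

	return pa != NULL && pa == topo_ancestor(b, kind);
}

/* Made-up sample: one L3, two L2s, two cores; core 0 has two SMT threads. */
static const struct topo_node l3 = { TOPO_L3, 0, NULL };
static const struct topo_node l2[] = {
	{ TOPO_L2, 0, &l3 }, { TOPO_L2, 1, &l3 },
};
static const struct topo_node core[] = {
	{ TOPO_CORE, 0, &l2[0] }, { TOPO_CORE, 1, &l2[1] },
};
static const struct topo_node lcore[] = {
	{ TOPO_LCORE, 0, &core[0] }, { TOPO_LCORE, 1, &core[0] },
	{ TOPO_LCORE, 2, &core[1] },
};
```

With this shape, "SMT siblings" is `topo_share(a, b, TOPO_CORE)`, "same LLC" is `topo_share(a, b, TOPO_L3)`, and "all lcores on a particular L2" is a loop over the lcores that compares `topo_ancestor(x, TOPO_L2)` against the cache node of interest.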
> Is that not what we have been trying to address, based on Anatoly's request
> to generalize beyond LLC? Hence we agreed on sharing version 2 of the RFC
> with `rte_get_next_lcore_extnd` and `flags`.
> May I ask where the disconnect is?
>> >>>
>> >>> I mean, what you have is a topology, with nodes of different types
>> >>> and with different properties, and you want to present it to the user.
>> > Let me be clear: what we want is for DPDK to help customers use a unified
>> > API which works across multiple platforms.
>> > Example: let a vendor have 2 products, namely A and B. CPU-A has all cores
>> > within the same sub-NUMA domain, and CPU-B has cores split into 2 sub-NUMA
>> > domains based on a split LLC.
>> > When `rte_get_next_lcore_extnd` is invoked for `LLC` on
>> > 1. CPU-A: it returns all cores, as there is no split
>> > 2. CPU-B: it returns cores from the specific sub-NUMA domain which is
>> > partitioned by L3
>> >
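The CPU-A versus CPU-B behaviour described above can be sketched with a plain filtered iterator. This is not the RFC's implementation and does not use its signature: `next_lcore_in_llc`, `NO_LCORE`, and the two lookup tables are hypothetical, and the lcore-to-LLC mapping is assumed to be already known.

```c
#include <assert.h>
#include <stddef.h>

#define NO_LCORE (-1)

/*
 * Hypothetical helper, not the RFC's code: return the first lcore after
 * 'prev' whose LLC id equals 'llc', or NO_LCORE when none is left.
 * llc_of[i] is the LLC id of lcore i; pass prev = -1 to start.
 */
static int
next_lcore_in_llc(const int *llc_of, size_t n, int prev, int llc)
{
	assert(llc_of != NULL && prev >= -1);
	for (size_t i = (size_t)(prev + 1); i < n; i++)
		if (llc_of[i] == llc)
			return (int)i;
	return NO_LCORE;
}

/* CPU-A: one LLC shared by all four lcores. */
static const int cpu_a_llc[] = { 0, 0, 0, 0 };
/* CPU-B: LLC split in two, giving two sub-NUMA domains. */
static const int cpu_b_llc[] = { 0, 0, 1, 1 };
```

Iterating over `cpu_a_llc` for LLC 0 visits every lcore, while the same loop over `cpu_b_llc` stops after the two lcores of the first domain.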
>> 
>> I think the function name rte_get_next_lcore_extnd() alone makes clear this is an awkward API. :)
> I humbly disagree with this statement, as explained above.
>> 
>> My gut feeling is to make it more explicit and forget about <rte_lcore.h>.
>> <rte_hwtopo.h>? Could and should still be EAL.
> For me this is like adding a new level of library and more code, while
> the easiest way was to add an API similar to the existing `get_next_lcore`
> style for easy adoption.

A poorly designed, special-case API is not less work. It's just less 
work for *you* *now*, and much more work for someone in the future to 
clean it up.

>> 
>> >>>
>> >>> In a sense, it's similar to XML and DOM versus SAX. The above is
>> >>> SAX-style, and what I have in mind is something DOM-like.
>> >>>
>> >>> What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how best to make use of them.
>> > Exactly.
>> >
>> >>> It's not going to "skip" (ignore, leave unused) SMT siblings, or
>> >>> skip non-boosted cores; it would just try to be clever in regards
>> >>> to which cores to use for what purpose.
>> > Let me try to share my idea on SMT siblings. When
>> > `rte_get_next_lcore_extnd` is invoked with the `L1 | SMT` flag and an
>> > `lcore`, the API first checks whether the given lcore is part of the
>> > enabled core list.
>> > If yes, it programmatically identifies the sibling thread, using either
>> > `sysfs` or the `hwloc` library (I shared the version concern on distros;
>> > will recheck), and returns it.
>> > If there is no sibling thread available under DPDK, it fetches the next
>> > lcore (probably lcore + 1).
>> >
>> 
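The `sysfs` route mentioned above can be illustrated as follows. On Linux, `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` holds a cpulist string such as `0,64` or `2-3`. The RFC's own parsing code is not shown in this thread, so the sketch below is only an assumption of what such a lookup could look like; `first_sibling` is a hypothetical name, and reading the file and mapping CPU ids to enabled lcores are left out.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Return the first CPU in a Linux cpulist string (e.g. "0,64" or "2-3")
 * that differs from 'self', or -1 if there is none or the string is
 * malformed.
 */
static int
first_sibling(const char *list, int self)
{
	assert(list != NULL);
	const char *p = list;

	while (*p != '\0' && *p != '\n') {
		char *end;
		long lo = strtol(p, &end, 10);
		long hi = lo;

		if (end == p)
			return -1;      /* no digits where a number was expected */
		if (*end == '-') {      /* range such as "2-3" */
			p = end + 1;
			hi = strtol(p, &end, 10);
			if (end == p)
				return -1;
		}
		for (long c = lo; c <= hi; c++)
			if (c != self)
				return (int)c;
		p = (*end == ',') ? end + 1 : end;
	}
	return -1;
}
```

A caller would read the file for the CPU backing a given lcore, pass its contents to `first_sibling`, and fall back to the next enabled lcore when the result is -1 or the sibling is not in the enabled set.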
>> Distributions having old hwloc versions isn't an argument for a new DPDK library or new API. If only that were the issue, it would be better to help hwloc and/or the distributions, rather than the DPDK project.
> I do not agree with the statement `Distributions having old hwloc versions
> isn't an argument for a new DPDK library or new API.`, because this is
> not my intention. Let me be clear: on Ampere and AMD there are 2 BIOS
> settings:
> 1. SLC or L3 as NUMA enable
> 2. NUMA for IO|memory
> When `NUMA for IO|memory` is set, the hwloc library works as expected. But
> when `L3 as NUMA` is set, it gives incorrect details. We have been fixing
> this and pushing upstream. But as I clearly shared, almost no distro
> ships the latest hwloc.
> Hence, to keep things simple, the DPDK documentation points to the AMD
> SoC tuning guide, where we have been recommending not to enable `L3 as NUMA`.
> Now the end goal for me is to allow a vendor-agnostic API which is easy to
> understand and use, and which works irrespective of BIOS settings. I have
> enabled parsing of the OS `sysfs` as an RFC. But if the comment is to use
> `hwloc`, as shared in my response to Stephen, I am open to trying this again.
> <snipped>
