Subject: Re: [RFC 0/2] introduce LLC aware functions
From: Mattias Rönnblom
To: "Varghese, Vipin", Honnappa Nagarahalli
Cc: "Yigit, Ferruh", dev@dpdk.org, nd
Date: Thu, 12 Sep 2024 09:02:36 +0200
Message-ID: <50ee2d2f-00c0-488c-a80e-1d3021103060@lysator.liu.se>

On 2024-09-12 08:38, Mattias Rönnblom wrote:
> On 2024-09-12 03:33, Varghese, Vipin wrote:
>> [Public]
>>
>> Snipped
>>
>>>>>>>>> Thank you Mattias for the comments and question, please let me
>>>>>>>>> try to explain the same below
>>>>>>>>>
>>>>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>>>>>
>>>>>>>>> Based on the intention to bring in CPU lcores which share the
>>>>>>>>> same L3 (for better cache hits and fewer noisy neighbors), the
>>>>>>>>> current API focuses on using the Last Level Cache. But if the
>>>>>>>>> suggestion is `there are SoCs where L2 caches are also shared,
>>>>>>>>> and the new API should be provisioned for that`, I am also
>>>>>>>>> comfortable with the thought.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>,
>>>>>>>> I think we are better off with no DPDK API at all for this kind
>>>>>>>> of functionality.
>>>>>>>
>>>>>>> Hi Mattias, as shared in the earlier email thread, this is not an
>>>>>>> AMD special case at all. Let me try to explain this one more
>>>>>>> time. One of the techniques used to increase core count in a
>>>>>>> cost-effective way is to go for tiles of compute complexes.
>>>>>>> This introduces a bunch of cores sharing the same Last Level
>>>>>>> Cache (namely L2, L3 or even L4), depending upon the cache
>>>>>>> topology of the architecture.
>>>>>>>
>>>>>>> The API suggested in the RFC is to help end users selectively use
>>>>>>> cores under the same Last Level Cache hierarchy, as advertised by
>>>>>>> the OS (irrespective of the BIOS settings used). This is useful
>>>>>>> in both bare-metal and container environments.
>>>>>>>
>>>>>>
>>>>>> I'm pretty familiar with AMD CPUs and the use of tiles (including
>>>>>> the challenges these kinds of non-uniformities pose for work
>>>>>> scheduling).
>>>>>>
>>>>>> To maximize performance, caring about the core<->LLC relationship
>>>>>> may well not be enough, and more HT/core/cache/memory topology
>>>>>> information is required. That's what I meant by special case. A
>>>>>> proper API should allow access to information about which lcores
>>>>>> are SMT siblings, cores on the same L2, and cores on the same L3,
>>>>>> to name a few things. Probably you want to fit NUMA into the same
>>>>>> API as well, although that is available already in <rte_lcore.h>.
>>>>>
>>>>> Thank you Mattias for the information. As shared in the reply to
>>>>> Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which
>>>>> takes an extra argument `u32 flags`.
>>>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>>>>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>>>>> RTE_GET_LCORE_BOOST_DISABLED.
>>>>
>>>> Wouldn't that API be pretty awkward to use?
>>
>> The current API available under DPDK is `rte_get_next_lcore`, which
>> is used within the DPDK examples and in customer solutions.
>> Based on the comments from others, we responded to the idea of
>> changing the new API from `rte_get_next_lcore_llc` to
>> `rte_get_next_lcore_exntd`.
>>
>> Can you please help us understand what is `awkward`?
>>
>
> The awkwardness starts when you are trying to provide hwloc-type
> information over an API that was designed for iterating over lcores.
>
> It seems to me that you should either have:
>
> A) An API similar to that of hwloc (or any DOM-like API), which would
> give a low-level description of the hardware in implementation terms.
> The topology would consist of nodes, with attributes, etc., where
> nodes are things like cores or instances of caches of some level, and
> attributes are things like the CPU's actual, nominal, and maybe max
> frequency, cache size, or memory size.

To be clear: it's something like this I think of when I say "DOM-style"
API.
#ifndef RTE_HWTOPO_H
#define RTE_HWTOPO_H

struct rte_hwtopo_node;

enum rte_hwtopo_node_type {
        RTE_HWTOPO_NODE_TYPE_CPU_CORE,
        RTE_HWTOPO_NODE_TYPE_CACHE,
        RTE_HWTOPO_NODE_TYPE_NUMA
};

int
rte_hwtopo_init(void);

struct rte_hwtopo_node *
rte_hwtopo_get_core_by_lcore(unsigned int lcore);

struct rte_hwtopo_node *
rte_hwtopo_get_core_by_id(unsigned int os_cpu_id);

struct rte_hwtopo_node *
rte_hwtopo_parent(struct rte_hwtopo_node *node);

struct rte_hwtopo_node *
rte_hwtopo_first_child(struct rte_hwtopo_node *node);

struct rte_hwtopo_node *
rte_hwtopo_next_child(struct rte_hwtopo_node *node,
                      struct rte_hwtopo_node *child);

struct rte_hwtopo_node *
rte_hwtopo_first_sibling(struct rte_hwtopo_node *node);

struct rte_hwtopo_node *
rte_hwtopo_next_sibling(struct rte_hwtopo_node *node,
                        struct rte_hwtopo_node *child);

enum rte_hwtopo_node_type
rte_hwtopo_get_type(struct rte_hwtopo_node *node);

#define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0
#define RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1
#define RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2

int
rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node,
                          unsigned int attr_name,
                          int64_t *attr_value);

int
rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node,
                        unsigned int attr_name,
                        char *attr_value, size_t capacity);

#endif

Surely, this too would be awkward (or should I say cumbersome) to use
in certain scenarios. You could have syntactic sugar/special-case
helpers which address common use cases. You would also build
abstractions on top of this (like the B case below). One could have
node-type-specific functions instead of generic getters and setters.

Anyway, this is not a counter-proposal, but rather just to make clear
what I had in mind.

> or
>
> B) An API to be directly useful for a work scheduler, in which case
> you should abstract away things like "boost" (and fold them into some
> abstract capacity notion, together with core "size" [in
> big-little/heterogeneous systems]), and have an abstract notion of
> which core is "close" to some other core. This would be something like
> Linux's scheduling domains.
>
> If you want B you probably need A as a part of its implementation, so
> you may just as well start with A, I suppose.
>
> What you could do to explore the API design is to add support for,
> for example, boost core awareness or SMT affinity in the SW scheduler.
> You could also do an "lstopo" equivalent, since that's needed for
> debugging and exploration, if nothing else.
>
> One question that will have to be answered in a work scheduling
> scenario is "are these two lcores SMT siblings", or "are these two
> cores on the same LLC", or "give me all lcores on a particular L2
> cache".
>
>>>> I mean, what you have is a topology, with nodes of different types
>>>> and with different properties, and you want to present it to the
>>>> user.
>>
>> Let me be clear: what we want is, via DPDK, to help customers use a
>> unified API which works across multiple platforms.
>> Example - let a vendor have 2 products, namely A and B. CPU-A has all
>> cores within the same sub-NUMA domain, and CPU-B has cores split into
>> 2 sub-NUMA domains based on a split LLC.
>> When `rte_get_next_lcore_extnd` is invoked for `LLC` on
>> 1. CPU-A: it returns all cores, as there is no split
>> 2. CPU-B: it returns cores from the specific sub-NUMA domain which is
>> partitioned by L3
>>
>
> I think the function name rte_get_next_lcore_extnd() alone makes clear
> this is an awkward API. :)
>
> My gut feeling is to make it more explicit and forget about
> <rte_lcore.h>. <rte_hwtopo.h>? Could and should still be EAL.
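
Just to illustrate: with the sketch further up, a work scheduler
question like "are these two lcores below the same L3?" could be
answered with a small helper along these lines. Untested, and using
only the hypothetical functions declared in my sketch; it also assumes
cache nodes are ancestors of core nodes in the topology tree.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <rte_hwtopo.h> /* i.e., the sketched header above */

/* Walk from an lcore's core node towards the root, and return the
 * first cache node of the requested level (or NULL, if there is
 * none). */
static struct rte_hwtopo_node *
hwtopo_cache_of(unsigned int lcore, int64_t level)
{
        struct rte_hwtopo_node *node =
                rte_hwtopo_get_core_by_lcore(lcore);

        while (node != NULL) {
                int64_t node_level;

                if (rte_hwtopo_get_type(node) ==
                                RTE_HWTOPO_NODE_TYPE_CACHE &&
                    rte_hwtopo_get_attr_int64(node,
                                RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL,
                                &node_level) == 0 &&
                    node_level == level)
                        return node;

                node = rte_hwtopo_parent(node);
        }

        return NULL;
}

static bool
hwtopo_same_l3(unsigned int lcore_a, unsigned int lcore_b)
{
        struct rte_hwtopo_node *a = hwtopo_cache_of(lcore_a, 3);
        struct rte_hwtopo_node *b = hwtopo_cache_of(lcore_b, 3);

        /* Same node object => same L3 cache instance. */
        return a != NULL && a == b;
}

Syntactic sugar could obviously shrink this a lot, but the point is
that the generic node/attribute model is enough to express such
queries.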
>
>>>> In a sense, it's similar to XCM and DOM versus SAX. The above is
>>>> SAX-style, and what I have in mind is something DOM-like.
>>>>
>>>> What use case do you have in mind? What's on top of my list is a
>>>> scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>)
>>>> and tries to figure out how best to make use of them.
>>
>> Exactly.
>>
>>>> It's not going to "skip" (ignore, leave unused) SMT siblings, or
>>>> skip non-boosted cores, it would just try to be clever in regards
>>>> to which cores to use for what purpose.
>>
>> Let me try to share my idea on SMT siblings. When
>> `rte_get_next_lcore_extnd` is invoked with the `L1 | SMT` flag and an
>> `lcore`, the API first identifies whether the given lcore is part of
>> the enabled core list.
>> If yes, it programmatically, either using `sysfs` or the `hwloc`
>> library (shared the version concern on distros; will recheck again),
>> identifies the sibling thread and returns it.
>> If there is no sibling thread available under DPDK, it will fetch the
>> next lcore (probably lcore + 1).
>>
>
> Distributions having old hwloc versions isn't an argument for a new
> DPDK library or new API. If only that was the issue, then it would be
> better to help the hwloc project and/or the distributions, rather than
> the DPDK project.
>
>>>>> This is AMD EPYC SoC agnostic and is trying to address all generic
>>>>> cases.
>>>>> Please do let us know if we (Ferruh & myself) can sync up via
>>>>> call?
>>>>
>>>> Sure, I can do that.
>>
>> Let me sync with Ferruh and get a time slot for an internal sync.
>>
>>> Can this be opened to the rest of the community? This is a common
>>> problem that needs to be solved for multiple architectures. I would
>>> be interested in attending.
>>
>> Thank you Mattias. At the DPDK Bangkok summit 2024 we did bring this
>> up. As per the suggestion from Thomas and Jerin, we tried to bring
>> the RFC up for discussion.
>> For DPDK Montreal 2024, Keesang and Ferruh (most likely) are
>> travelling to the summit and will present this as a talk to get
>> things moving.
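
Going back to the flag-based iterator for a moment: below is roughly
what I would expect application code to look like with that kind of
API. Note that rte_get_next_lcore_ex() does not exist today; the
signature below (rte_get_next_lcore()'s i/skip_main/wrap parameters
plus a flags argument) and the RTE_GET_LCORE_L3 flag are just my
reading of what has been proposed in this thread, not a settled
design.

#include <rte_lcore.h>
#include <rte_launch.h>

/* Purely illustrative worker; stands in for real application work. */
static int
worker_fn(void *arg)
{
        (void)arg;
        return 0;
}

static void
launch_on_llc_mates(void)
{
        unsigned int lcore;

        /* Launch work on the set of lcores sharing an L3. The
         * hypothetical rte_get_next_lcore_ex() is assumed to behave
         * like rte_get_next_lcore(i, skip_main, wrap), but to skip
         * lcores not matching the flags. */
        for (lcore = rte_get_next_lcore_ex(-1, 1, 0, RTE_GET_LCORE_L3);
             lcore < RTE_MAX_LCORE;
             lcore = rte_get_next_lcore_ex(lcore, 1, 0,
                                           RTE_GET_LCORE_L3))
                rte_eal_remote_launch(worker_fn, NULL, lcore);
}

Even in this, presumably common, case, the iterator-with-flags model
leaves open exactly the kind of questions (relative to which lcore is
"same L3" evaluated?) that a topology-style API answers directly.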