From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <42b8749d-ef6d-4857-bf2c-0a5d700405eb@lysator.liu.se>
Date: Thu, 12 Sep 2024 13:59:34 +0200
From: Mattias Rönnblom
To: "Varghese, Vipin", Honnappa Nagarahalli
Cc: "Yigit, Ferruh", dev@dpdk.org, nd
Subject: Re: [RFC 0/2] introduce LLC aware functions
References: <20240827151014.201-1-vipin.varghese@amd.com>
 <45f26104-ad6c-4e42-8446-d8b51ac3f2dd@lysator.liu.se>
 <38d0336d-ea9e-41b3-b3d8-333efb70eb1f@lysator.liu.se>
 <716375DE-0C2F-4983-934A-144D7DE342C6@arm.com>
List-Id: DPDK patches and discussions

On 2024-09-12 13:17, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>>>>> Thank you Mattias for the information. As shared in the reply to
>>>>> Anatoly, we want to expose a new API `rte_get_next_lcore_ex`
>>>>> which takes an extra argument `u32 flags`. The flags can be
>>>>> RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3,
>>>>> RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
>>>>
>>>> Wouldn't that API be pretty awkward to use?
>>>
>>> The current API available in DPDK is `rte_get_next_lcore`, which is
>>> used within the DPDK examples and in customer solutions. Based on
>>> the comments from others, we responded with the idea of changing
>>> the new API from `rte_get_next_lcore_llc` to
>>> `rte_get_next_lcore_extnd`.
>>>
>>> Can you please help us understand what is `awkward`?
>>
>> The awkwardness starts when you are trying to provide hwloc-type
>> information over an API that was designed for iterating over lcores.
>
> I disagree with this point. The current implementation of the lcore
> library is only focused on iterating through the list of enabled
> cores, the core-mask, and the lcore-map. With ever-increasing core
> counts, memory, IO and accelerators on SoCs, sub-NUMA partitioning is
> common in various vendors' SoCs. Enhancing or augmenting the lcore
> API to extract or provision NUMA and cache topology is not awkward.
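For concreteness, the proposal quoted at the top suggests a call
signature along the lines of the sketch below. This is a sketch only:
the RFC in this thread does not spell out the final prototype, so the
parameter list simply mirrors the existing rte_get_next_lcore(i,
skip_main, wrap), and the flag values are invented for illustration.

#include <stdint.h>

/* Hypothetical flag values; the RFC names the flags but not their
 * encoding. */
#define RTE_GET_LCORE_L1             (UINT32_C(1) << 0) /* SMT siblings */
#define RTE_GET_LCORE_L2             (UINT32_C(1) << 1) /* same L2 cache */
#define RTE_GET_LCORE_L3             (UINT32_C(1) << 2) /* same L3/LLC */
#define RTE_GET_LCORE_BOOST_ENABLED  (UINT32_C(1) << 3) /* turbo-capable */
#define RTE_GET_LCORE_BOOST_DISABLED (UINT32_C(1) << 4)

/*
 * Hypothetical semantics: return the next enabled lcore after i that
 * matches all selectors in flags (e.g., shares the LLC with i when
 * RTE_GET_LCORE_L3 is set), or RTE_MAX_LCORE when exhausted.
 */
unsigned int
rte_get_next_lcore_ex(unsigned int i, int skip_main, int wrap,
                      uint32_t flags);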
DPDK providing an API for this information makes sense to me, as I've
mentioned before. What I questioned was the way it was done (i.e., the
API design) in your RFC, and the limited scope (which in part you have
addressed).

> If memory, IO and accelerators can have sub-NUMA domains, why is it
> awkward to have lcores in domains? Hence I do not agree with the
> awkwardness argument.
>
>> It seems to me that you should either have:
>>
>> A) An API similar to that of hwloc (or any DOM-like API), which
>> would give a low-level description of the hardware in implementation
>> terms. The topology would consist of nodes, with attributes, etc.,
>> where nodes are things like cores or instances of caches of some
>> level, and attributes are things like actual and nominal (and maybe
>> max) CPU frequency, cache size, or memory size.
>
> Here is the catch: `rte_eal_init` internally invokes `get_cpu|lcores`
> and populates the thread (lcore) to physical-CPU mapping. But there
> is more than just CPU mapping, as we have been seeing in SoC
> architectures. The argument shared by many is `DPDK is not the place
> for such topology discovery`. As per my current understanding, I have
> to disagree with the above, because it
>
> 1. forces the user to use external libraries, for example hwloc
> 2. forces the user to create internal mappings for lcore, core-mask,
>    and lcore-map with topology-awareness code.
>
> My intention is to `enable the end user to leverage the API format or
> a similar API format (rte_get_next_lcore)` to get the best results on
> any SoC (vendor agnostic). I fail to grasp why we are asking for CPU
> topology to be exported via external libraries like hwloc, while NIC,
> PCIe and accelerator topology is not. Hence let us set up a tech call
> in Slack or Teams to understand this better.
>
>> or
>>
>> B) An API to be directly useful for a work scheduler, in which case
>> you should abstract away things like "boost"
>
> Please note, as shared in an earlier reply to Bruce, I made a mistake
> of calling it boost (AMD SoC terminology). Instead it should be
> DPDK_TURBO. There are use cases and DPDK examples where crypto and
> compression are run on cores where TURBO is enabled. This allows end
> users to boost when there is more work and disable boost when there
> is less or no work.
>
>> (and fold them into some abstract capacity notion, together with
>> core "size" [in big-little/heterogeneous systems]), and have an
>> abstract notion of what core is "close" to some other core. This
>> would be something like Linux's scheduling domains.
>
> We had a similar discussion with Jerrin on the last day of the
> Bangkok DPDK summit. This RFC was intended to help capture this
> relevant point. To my current understanding, on selected SoCs the
> little cores on an ARM SoC share the L2 cache, though this analogy
> does not cover all cases. But it would be a good start.
>
>> If you want B you probably need A as a part of its implementation,
>> so you may just as well start with A, I suppose.
>>
>> What you could do to explore the API design is to add support for,
>> for example, boost core awareness or SMT affinity in the SW
>> scheduler. You could also do an "lstopo" equivalent, since that's
>> needed for debugging and exploration, if nothing else.
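To make option A concrete, here is a minimal sketch of a
node-with-attributes model in the spirit of hwloc. None of these types
or functions exist in DPDK today; every name below is invented purely
for illustration.

#include <stdint.h>
#include <stddef.h>

/* Illustrative node types, from package down to hardware thread. */
enum topo_node_type {
        TOPO_NODE_PACKAGE,
        TOPO_NODE_NUMA,
        TOPO_NODE_L3,
        TOPO_NODE_L2,
        TOPO_NODE_CORE,
        TOPO_NODE_HWTHREAD
};

struct topo_node {
        enum topo_node_type type;
        uint64_t nominal_hz;    /* nominal frequency; core nodes only */
        uint64_t max_hz;        /* max (e.g., turbo) frequency */
        size_t cache_size;      /* bytes; cache nodes only */
        unsigned int nb_children;
        struct topo_node **children;
        struct topo_node *parent;
};

/* Walk upward from a node (e.g., a hardware thread) to the enclosing
 * node of the requested type, e.g., its L3 domain. */
static struct topo_node *
topo_enclosing(struct topo_node *n, enum topo_node_type type)
{
        while (n != NULL && n->type != type)
                n = n->parent;
        return n;
}

With such a model, "give me all lcores on a particular L2 cache"
becomes a tree query rather than an iteration flag.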
> Not following this analogy; will discuss in detail in the tech talk.
>
>> One question that will have to be answered in a work scheduling
>> scenario is "are these two lcores SMT siblings," or "are these two
>> cores on the same LLC," or "give me all lcores on a particular L2
>> cache."
>
> Is that not what we have been trying to address, based on Anatoly's
> request to generalize beyond LLC? Hence we agreed on sharing
> version 2 of the RFC with `rte_get_next_lcore_extnd` with `flags`.
> May I ask where the disconnect is?
>
>>>> I mean, what you have is a topology, with nodes of different
>>>> types and with different properties, and you want to present it
>>>> to the user.
>>>
>>> Let me be clear: what we want via DPDK is to help the customer use
>>> a unified API which works across multiple platforms. Example: let
>>> a vendor have 2 products, namely A and B. CPU-A has all cores
>>> within the same sub-NUMA domain, and CPU-B has cores split into 2
>>> sub-NUMA domains based on a split LLC. When
>>> `rte_get_next_lcore_extnd` is invoked for `LLC` on:
>>> 1. CPU-A: it returns all cores, as there is no split
>>> 2. CPU-B: it returns the cores from the specific sub-NUMA domain
>>>    which is partitioned by L3
>>
>> I think the function name rte_get_next_lcore_extnd() alone makes
>> clear this is an awkward API. :)
>
> I humbly disagree with this statement, as explained above.
>
>> My gut feeling is to make it more explicit and forget about [the
>> existing lcore API]. [A separate topology API]? Could and should
>> still be EAL.
>
> For me this is like adding a new level of library and more code,
> while the easiest way was to add an API similar to the existing
> `get_next_lcore` style, for easy adoption.

A poorly designed, special-case API is not less work. It's just less
work for *you* *now*, and much more work for someone in the future to
clean it up.

>>>> In a sense, it's similar to XCM and DOM versus SAX. The above is
>>>> SAX-style, and what I have in mind is something DOM-like.
>>>>
>>>> What use case do you have in mind? What's on top of my list is a
>>>> scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>)
>>>> and tries to figure out how best to make use of them.
>>>
>>> Exactly.
>>>
>>>> It's not going to "skip" (ignore, leave unused) SMT siblings, or
>>>> skip non-boosted cores; it would just try to be clever in regards
>>>> to which cores to use for what purpose.
>>>
>>> Let me try to share my idea on SMT siblings. When
>>> `rte_get_next_lcore_extnd` is invoked with the `L1 | SMT` flag and
>>> an `lcore`, the API first identifies whether the given lcore is
>>> part of the enabled core list. If yes, it programmatically, either
>>> via `sysfs` or via the `hwloc` library (shared the version concern
>>> on distros; will recheck again), identifies the sibling thread and
>>> returns it. If there is no sibling thread available under DPDK, it
>>> will fetch the next lcore (probably lcore + 1).
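As an illustration of the sysfs route mentioned above, the sketch
below looks up a CPU's SMT sibling from the standard Linux topology
files. This is not DPDK code; error handling and cpulist parsing are
deliberately minimal.

#include <stdio.h>

/*
 * Return the first sibling != cpu listed in thread_siblings_list, or
 * -1 if none is found. Assumes a simple "a,b" or "a-b" entry, i.e.,
 * 2-way SMT; real parsing must handle the full Linux cpulist syntax.
 */
static int
smt_sibling(unsigned int cpu)
{
        char path[128];
        unsigned int a, b;
        int rc = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%u/topology/"
                 "thread_siblings_list", cpu);

        f = fopen(path, "r");
        if (f == NULL)
                return -1;

        /* "2,66" (list) or "2-3" (range) both name two CPUs. */
        if (fscanf(f, "%u%*[,-]%u", &a, &b) == 2)
                rc = (a == cpu) ? (int)b : (int)a;

        fclose(f);
        return rc;
}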
>> Distributions having old hwloc versions isn't an argument for a new
>> DPDK library or new API. If only that was the issue, then it would
>> be better to help the hwloc project and/or the distributions, rather
>> than the DPDK project.
>
> I do not agree with `Distributions having old hwloc versions isn't
> an argument for a new DPDK library or new API`, because that is not
> my intention. Let me be clear: on Ampere and AMD there are 2 relevant
> BIOS settings:
>
> 1. SLC or L3 as NUMA
> 2. NUMA for IO|memory
>
> With `NUMA for IO|memory` set, the hwloc library works as expected.
> But when `L3 as NUMA` is set, it gives incorrect details. We have
> been fixing this and pushing the fixes upstream. But, as I clearly
> shared, the number of distros shipping the latest hwloc is almost
> nil. Hence, to keep things simple, the DPDK documentation points to
> the AMD SoC tuning guide, where we have been recommending not to
> enable `L3 as NUMA`.
>
> Now, the end goal for me is to allow a vendor-agnostic API which is
> easy to understand and use, and which works irrespective of BIOS
> settings. I have enabled parsing of the OS `sysfs` in the RFC. But if
> the comment is to use `hwloc`, as shared in the response to Stephen,
> I am open to trying this again.
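For reference, enumerating LLC domains with the stock hwloc 2.x API
looks like the sketch below. Whether the reported L3 objects match
expectations under the `L3 as NUMA` BIOS setting is exactly the
concern raised above.

/* Build with: cc llc.c -lhwloc */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
        hwloc_topology_t topo;
        int i, n;

        if (hwloc_topology_init(&topo) != 0 ||
            hwloc_topology_load(topo) != 0)
                return 1;

        /* One HWLOC_OBJ_L3CACHE object per LLC domain. */
        n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
        for (i = 0; i < n; i++) {
                hwloc_obj_t l3 =
                    hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
                char cpus[256];

                hwloc_bitmap_snprintf(cpus, sizeof(cpus), l3->cpuset);
                printf("L3 #%d: %llu bytes, PUs %s\n", i,
                       (unsigned long long)l3->attr->cache.size, cpus);
        }

        hwloc_topology_destroy(topo);
        return 0;
}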