Subject: Re: [RFC 0/2] introduce LLC aware functions
Date: Thu, 12 Sep 2024 15:18:43 +0200
From: Mattias Rönnblom
To: Bruce Richardson, "Varghese, Vipin"
Cc: "Yigit, Ferruh", dev@dpdk.org
Message-ID: <0add3e61-570c-4d61-bb4e-c11747e75690@lysator.liu.se>
References: <20240827151014.201-1-vipin.varghese@amd.com>
 <45f26104-ad6c-4e42-8446-d8b51ac3f2dd@lysator.liu.se>

On 2024-09-12 11:17, Bruce Richardson wrote:
> On Thu, Sep 12, 2024 at 02:19:07AM +0000, Varghese, Vipin wrote:
>> [Public]
>>
>>>>>>>> Thank you Mattias for the comments and question; please let me
>>>>>>>> try to explain the same below.
>>>>>>>>
>>>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>>>>
>>>>>>>> Based on the intention to bring in CPU lcores which share the
>>>>>>>> same L3 (for better cache hits and fewer noisy neighbours), the
>>>>>>>> current API focuses on using the Last Level Cache. But if the
>>>>>>>> suggestion is "there are SoCs where L2 caches are also shared,
>>>>>>>> and the new API should be provisioned for that", I am also
>>>>>>>> comfortable with the thought.
>>>>>>>
>>>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>,
>>>>>>> I think we are better off with no DPDK API at all for this kind
>>>>>>> of functionality.
>>>>>>
>>>>>> Hi Mattias, as shared in the earlier email thread, this is not an
>>>>>> AMD special case at all. Let me try to explain this one more
>>>>>> time. One of the techniques used to increase core count in a
>>>>>> cost-effective way is to go for tiles of compute complexes. This
>>>>>> introduces a bunch of cores sharing the same Last Level Cache
>>>>>> (namely L2, L3 or even L4), depending upon the cache topology
>>>>>> architecture.
>>>>>>
>>>>>> The API suggested in the RFC is to help end users selectively use
>>>>>> cores under the same Last Level Cache hierarchy, as advertised by
>>>>>> the OS (irrespective of the BIOS settings used). This is useful
>>>>>> in both bare-metal and container environments.
>>>>>
>>>>> I'm pretty familiar with AMD CPUs and the use of tiles (including
>>>>> the challenges these kinds of non-uniformities pose for work
>>>>> scheduling).
>>>>>
>>>>> To maximize performance, caring about the core<->LLC relationship
>>>>> may well not be enough, and more HT/core/cache/memory topology
>>>>> information is required. That's what I meant by special case. A
>>>>> proper API should allow access to information about which lcores
>>>>> are SMT siblings, cores on the same L2, and cores on the same L3,
>>>>> to name a few things. Probably you want to fit NUMA into the same
>>>>> API as well, although that is available already in <rte_lcore.h>.
>>>>
>>>> Thank you Mattias for the information. As shared in the reply to
>>>> Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which
>>>> takes an extra argument `u32 flags`. The flags can be
>>>> RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3,
>>>> RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
>>>>
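To make the proposal concrete, a minimal sketch of the API being
suggested above might look like the following. Nothing here exists in
DPDK today; the flag values and the exact iteration semantics are
assumptions based purely on the names given in this thread.

```c
/*
 * Sketch only: rte_get_next_lcore_ex() as proposed above. The flag
 * values and iteration semantics are assumptions, not an existing
 * DPDK API.
 */
#include <stdint.h>
#include <stdio.h>

#include <rte_lcore.h>

#define RTE_GET_LCORE_L1 (1U << 0) /* SMT siblings (same core) */
#define RTE_GET_LCORE_L2 (1U << 1) /* lcores sharing an L2 */
#define RTE_GET_LCORE_L3 (1U << 2) /* lcores sharing an L3/LLC */

/* Assumed: returns the next enabled lcore after 'i' within the same
 * cache domain, or RTE_MAX_LCORE when exhausted. */
unsigned int rte_get_next_lcore_ex(unsigned int i, uint32_t flags);

/* Example: visit every enabled lcore sharing an LLC with the main
 * lcore. */
static void
visit_llc_neighbours(void)
{
	unsigned int i = rte_get_main_lcore();

	while ((i = rte_get_next_lcore_ex(i, RTE_GET_LCORE_L3)) <
	       RTE_MAX_LCORE)
		printf("lcore %u shares the LLC\n", i);
}
```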
>>> For the naming, would "rte_get_next_sibling_core" (or lcore if you
>>> prefer) be a clearer name than just adding "ex" on to the end of
>>> the existing function?
>>
>> Thank you Bruce, please find my answer below.
>>
>> Functions shared as per the RFC were:
>> ```
>> - rte_get_llc_first_lcores: retrieves all the first lcores in the
>>   shared LLC.
>> - rte_get_llc_lcore: retrieves all lcores that share the LLC.
>> - rte_get_llc_n_lcore: retrieves the first n, or skips the first n,
>>   lcores in the shared LLC.
>> ```
>>
>> Macros extending the usability were:
>> ```
>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore from
>>   each LLC.
>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first
>>   worker lcore from each LLC.
>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from an LLC based on a
>>   hint (lcore id).
>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from an LLC
>>   while skipping the first worker.
>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores
>>   from each LLC.
>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores,
>>   then iterates through the remaining lcores in each LLC.
>> ```
>>
>> Based on the discussions, we agreed on sharing a version-2 RFC
>> extending the API as `rte_get_next_lcore_extnd` with an extra
>> argument `flags`. As per my ideation, the suggested
>> `rte_get_next_sibling_core` can easily be covered by the above API
>> with the flag `RTE_GET_LCORE_L1` (SMT). Is this the right
>> understanding? We can easily have simple macros like
>> `RTE_LCORE_FOREACH_L1` which allow iterating over SMT sibling
>> threads.
>>
> This seems like a lot of new macro and API additions! I'd really like
> to cut that back and simplify the amount of new things we are adding
> to DPDK for this. I tend to agree with others that external libs
> would be better for apps that really want to deal with all this.
>

Conveying HW topology will require a fair bit of API verbiage. I think
there's no way around it, other than giving the API user half of the
story (or 1% of the story). That's one of the reasons I think it
should be in a separate header file in EAL.
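To illustrate what a separate header could mean in practice: a
dedicated, if minimal, EAL topology API might look something like the
sketch below. The header name, types and functions are all
hypothetical.

```c
/*
 * Hypothetical <rte_topology.h>: a rough sketch of what a dedicated
 * EAL topology header could expose. None of this exists in DPDK.
 */
#include <stdbool.h>

enum rte_topo_level {
	RTE_TOPO_SMT,  /* hardware threads on the same core */
	RTE_TOPO_L2,   /* cores sharing an L2 */
	RTE_TOPO_L3,   /* cores sharing an L3/LLC */
	RTE_TOPO_NUMA  /* cores on the same NUMA node */
};

/* Number of distinct domains at a given level (e.g., L3 count). */
unsigned int rte_topo_domain_count(enum rte_topo_level level);

/* Domain id of an lcore at a given level. */
unsigned int rte_topo_domain_id(unsigned int lcore_id,
				enum rte_topo_level level);

/* True if two lcores fall in the same domain at the given level. */
bool rte_topo_shares(unsigned int a, unsigned int b,
		     enum rte_topo_level level);
```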
>>> Looking logically, I'm not sure about the BOOST_ENABLED and
>>> BOOST_DISABLED flags you propose
>>
>> The idea for BOOST_ENABLED & BOOST_DISABLED is based on the DPDK
>> power library, which allows enabling boost. It lets the user select
>> lcores where boost is enabled or disabled using a macro or API.
>>
>>> - in a system with multiple possible standard and boost
>>> frequencies, what would those correspond to?
>>
>> I now understand the confusion; apologies for mixing up the AMD EPYC
>> SoC boost with Intel Turbo. Thank you for pointing it out; we will
>> use the terminology `RTE_GET_LCORE_TURBO`.
>>
> That still doesn't clarify it for me. If you start mixing power
> management related functions in with topology ones, things will turn
> into a real headache. What does boost or turbo correspond to? Is it
> for cores that have the feature enabled - whether or not it's
> currently in use - or is it for finding cores that are currently
> boosted? Do we need additions for cores that are boosted by 100 MHz
> vs, say, 300 MHz? What about cores that are running at lower
> frequencies for power saving - do we add macros for finding those?
>

In my world, the operating frequency is a property of a CPU core node
in the hardware topology. lcore discrimination (or classification)
shouldn't be built as a myriad of FOREACH macros, but rather as
generic iteration plus app domain logic. For example, the size of the
L3 could be a factor. Should we have a FOREACH_BIG_L3? No.
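A sketch of what "generic iteration plus app domain logic" means: one
ordinary iterator, with the classification policy kept in the
application. `rte_topo_l3_size()` below is hypothetical;
`RTE_LCORE_FOREACH_WORKER()` is existing DPDK.

```c
/*
 * Generic iteration + app-side classification, rather than one
 * FOREACH macro per conceivable lcore property.
 */
#include <stddef.h>

#include <rte_lcore.h>

/* Hypothetical query: size (in bytes) of the L3 serving this lcore. */
size_t rte_topo_l3_size(unsigned int lcore_id);

static void
visit_big_l3_lcores(size_t min_l3_bytes, void (*visit)(unsigned int))
{
	unsigned int lcore_id;

	/* One generic iterator; the "big L3" policy stays in the app. */
	RTE_LCORE_FOREACH_WORKER(lcore_id) {
		if (rte_topo_l3_size(lcore_id) >= min_l3_bytes)
			visit(lcore_id);
	}
}
```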
>>> What's also missing is a define for getting actual NUMA siblings,
>>> i.e. those sharing common memory but not an L3 or anything else.
>>
>> This can be covered by `rte_get_next_lcore_extnd` with a flag
>> `RTE_GET_LCORE_NUMA`. This will allow grabbing all lcores under the
>> same NUMA sub-domain as the given lcore. If SMT is enabled and the
>> DPDK lcore mask covers the sibling threads, then
>> `RTE_GET_LCORE_NUMA` returns all lcores and sibling threads under
>> the same memory NUMA domain as the given lcore.
>>
> Yes. That can work. But it means we are basing the implementation on
> a fixed idea of what topologies there are or can exist. My
> suggestion below is just to ignore the whole idea of L1 vs L2 vs
> NUMA - just give the app a way to find its nearest nodes.
>

I think we need to agree on what the purpose of this API is. Is it to
describe the hardware topology in some detail for general-purpose use
(including informing the operator, lstopo-style), or just some
abstract, simplified representation to be used purely for work
scheduling?

> After all, the app doesn't want to know the topology just for the
> sake of knowing it - it wants it to ensure best placement of work on
> cores! To that end, it just needs to know what cores are near to
> each other and what are far away.
>
>>> My suggestion would be to have the function take just an
>>> integer-type, e.g. uint16_t, parameter which defines the
>>> memory/cache hierarchy level to use: 0 being lowest, 1 next, and
>>> so on. Different systems may have different numbers of cache
>>> levels, so let's just make it a zero-based index of levels rather
>>> than giving explicit defines (except for memory, which should
>>> probably always be last). The zero level will be for "closest
>>> neighbour".
>>
>> Good idea; we did prototype this internally. But the issue is that
>> it keeps adding to the number of APIs in the lcore library. To keep
>> the API count low, we are using the lcore id as a hint to the
>> sub-NUMA domain.
>>
> I'm unclear about this keeping the API count down - you are proposing
> a lot of APIs and macros up above. My suggestion is basically to add
> two APIs and no macros: one API to get the max number of
> topology-nearness levels, and a second API to get the next sibling
> at a given nearness level, from 0 (nearest) to N (furthest). If we
> want, we can also add a FOREACH macro too.
>
> Overall, though, as I say above, let's focus on the problem the app
> actually wants these APIs for, not on how we think we should solve
> it. Apps don't want to know the topology for knowledge's sake; they
> want to use that knowledge to improve performance by pinning tasks
> to cores. What is the minimum that we need to provide to enable the
> app to do that? For example, if there are no lcores that share an
> L1, then from an app topology viewpoint that L1 level may as well
> not exist, because it provides no details on how to place our work.
>
> For the rare app that has some esoteric use case and actually wants
> to know some intricate details of the topology, having that app use
> an external lib is probably a better solution than us trying to
> cover all possible options in DPDK.
>
> My 2c. on this at this stage anyway.
>
> /Bruce
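For reference, the two-call "nearness level" design suggested above
might look something like the sketch below. All names and signatures
are hypothetical; this is just the minimal surface described, not an
existing or agreed-upon DPDK API.

```c
/*
 * Sketch of the two-API nearness-level design. All names and
 * signatures are hypothetical.
 */
#include <stdint.h>

#include <rte_lcore.h>

/* Number of topology-nearness levels on this system; level 0 is
 * "closest neighbour", higher levels are progressively further away,
 * with the last level covering shared memory (NUMA). */
uint16_t rte_lcore_nearness_levels(void);

/* Next enabled lcore after 'i' which is within 'level' of
 * 'lcore_id'; assumed not to return 'lcore_id' itself. Pass i = -1
 * to start; returns RTE_MAX_LCORE when exhausted (mirroring
 * rte_get_next_lcore()). */
unsigned int rte_lcore_next_near(unsigned int lcore_id, unsigned int i,
				 uint16_t level);

/* Example: find an lcore as near as possible to the main lcore. */
static unsigned int
nearest_worker(void)
{
	unsigned int self = rte_get_main_lcore();
	uint16_t level;

	for (level = 0; level < rte_lcore_nearness_levels(); level++) {
		unsigned int i = rte_lcore_next_near(self, -1, level);

		if (i < RTE_MAX_LCORE)
			return i;
	}

	return RTE_MAX_LCORE; /* no other lcore found */
}
```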