Subject: Re: [RFC 0/2] introduce LLC aware functions
From: Mattias Rönnblom
To: Bruce Richardson
Cc: "Varghese, Vipin", Honnappa Nagarahalli, "Yigit, Ferruh", dev@dpdk.org, nd
Date: Thu, 12 Sep 2024 18:32:38 +0200
Message-ID: <8a882827-874b-4e2d-ae89-d0b243ff7e77@lysator.liu.se>

On 2024-09-12 15:30, Bruce Richardson wrote:
> On Thu, Sep 12, 2024 at 01:59:34PM +0200, Mattias Rönnblom wrote:
>> On 2024-09-12 13:17, Varghese, Vipin wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>
>>>>>>>> Thank you Mattias for the information. As shared in the reply to
>>>>>> Anatoly, we want to expose a new API `rte_get_next_lcore_ex`, which
>>>>>> takes an extra argument, `u32 flags`.
>>>>>>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>>>>>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>>>>>> RTE_GET_LCORE_BOOST_DISABLED.
>>>>>>>
>>>>>>> Wouldn't that API be pretty awkward to use?
>>>>> The current API available in DPDK is `rte_get_next_lcore`, which is
>>>> used within the DPDK examples and in customer solutions.
>>>>> Based on the comments from others, we responded to the idea of
>>>> changing the new API from `rte_get_next_lcore_llc` to
>>>> `rte_get_next_lcore_exntd`.
>>>>>
>>>>> Can you please help us understand what is `awkward`?
>>>>>
>>>>
>>>> The awkwardness starts when you are trying to provide hwloc-type
>>>> information over an API that was designed for iterating over lcores.
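For concreteness, a minimal sketch of how the proposed iterator might
look in use. The prototype and the flag name below are assumptions
inferred from the description above; nothing like this exists in DPDK
today:

/* Assumed prototype, modeled on rte_get_next_lcore():
 *   unsigned int rte_get_next_lcore_ex(unsigned int i, uint32_t flags);
 *
 * Visit every enabled lcore sharing the LLC (L3) with the current one.
 */
#include <stdint.h>
#include <rte_lcore.h>

static void
visit_l3_siblings(void)
{
	unsigned int i;

	for (i = rte_get_next_lcore_ex(-1, RTE_GET_LCORE_L3);
	     i < RTE_MAX_LCORE;
	     i = rte_get_next_lcore_ex(i, RTE_GET_LCORE_L3)) {
		/* launch work on, or otherwise make use of, lcore i */
	}
}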
>>> I disagree with this point. The current implementation of the lcore
>>> library is focused only on iterating through the list of enabled
>>> cores, the core mask, and the lcore map.
>>> With ever-increasing core counts, memory, I/O and accelerators on
>>> SoCs, sub-NUMA partitioning is common in SoCs from various vendors.
>>> Enhancing or augmenting the lcore API to extract or provision NUMA
>>> and cache topology is not awkward.
>>
>> DPDK providing an API for this information makes sense to me, as I've
>> mentioned before. What I questioned was the way it was done (i.e., the
>> API design) in your RFC, and the limited scope (which in part you have
>> addressed).
>>
>
> Actually, I'd like to touch on this first item a little bit. What is
> the main benefit of providing this information in EAL? To me, it seems
> like something that is for apps to try and be super-smart and select
> particular cores out of a set of cores to run on. However, is that not
> taking on work that should really be the job of the person deploying
> the app? The deployer - if I can use that term - has already selected a
> set of cores and NICs for a DPDK application to use. Should they not
> also be the one selecting - via app argument, via the --lcores flag to
> map one core id to another, or otherwise - which part of an application
> should run on what particular piece of hardware?
>

Scheduling in one form or another will happen on a number of levels.

One level is what you call the "deployer". Whether man or machine, it
will allocate a bunch of lcores to the application - either statically,
by using -l <core list>, or dynamically, by giving a very large core
mask combined with having an agent in the app responsible for scaling
the number of cores actually used up or down (allowing coexistence with
other, non-DPDK, Linux process scheduler-scheduled processes on the same
set of cores, although not at the same time).

I think the "deployer" level should generally not be aware of the DPDK
app's internals, including how to assign different tasks to different
cores. That is consistent with how things work in a general-purpose
operating system, where you allocate cores, memory and I/O devices to an
instance (e.g., a VM), but then the OS scheduler figures out how best to
use them.

The app's internals may be complicated, may change across software
versions and traffic mixes/patterns, and, most of all, may not lend
themselves to static at-start configuration at all.

> In summary, what is the final real-world intended usecase for this
> work?

One real-world example is an Eventdev app with some atomic processing
stage, using DSW, and SMT. Hardware threading on Intel x86 generally
improves performance by ~25%, which in my experience seems to hold true
for data plane apps as well. So that's a (not-so-)freebie you don't want
to miss out on.

To max out single-flow performance, the work scheduler may need to give
the bottleneck stage's atomic processing of that elephant flow not only
100% of an lcore, but a *full* physical core (i.e., assure that the SMT
sibling is idle). But DSW doesn't understand the CPU topology, so you
have to choose between max multi-flow throughput and max single-flow
throughput at deployment time.

An RTE hwtopo API would certainly help in the implementation of
SMT-aware scheduling.
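To make that concrete, here is a sketch of what such SMT-aware selection
could build on. The rte_hwtopo_smt_sibling() call below is hypothetical;
it stands in for whatever query a real topology API would provide:

/* Reserve a full physical core for a bottleneck stage by picking a
 * worker lcore whose SMT sibling is not enabled in this application.
 */
#include <rte_lcore.h>

static int
find_full_physical_lcore(void)
{
	unsigned int lcore;

	RTE_LCORE_FOREACH_WORKER(lcore) {
		/* Hypothetical: the id of the HW thread sharing a
		 * physical core with 'lcore', or RTE_MAX_LCORE if none.
		 */
		unsigned int sibling = rte_hwtopo_smt_sibling(lcore);

		if (sibling == RTE_MAX_LCORE ||
		    !rte_lcore_is_enabled(sibling))
			return (int)lcore;
	}

	return -1; /* all workers share their physical core */
}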
Another example could be the use of bigger or turbo-capable cores to run
CPU-hungry, singleton services (e.g., an Eventdev RX timer adapter
core), or the use of a hardware thread to run the SW scheduler service
(which needs to react quickly to incoming scheduling events, but maybe
does not need all the cycles of a full physical core).

Yet another example would be an event device which understands how to
spread a particular flow across multiple cores, but uses only cores
sharing the same L2. Or, keep processing of a certain kind (e.g., a
certain Eventdev queue) only on cores sharing the same L2, to improve L2
hit rates for the instructions and data related to that processing
stage. (A sketch of this L2-grouping idea follows at the end of this
mail.)

> DPDK already tries to be smart about cores and NUMA, and in some cases
> we have hit issues where users have - for their own valid reasons -
> wanted to run DPDK in a sub-optimal way, and they end up having to
> fight DPDK's smarts in order to do so! Ref: [1]
>
> /Bruce
>
> [1] https://git.dpdk.org/dpdk/commit/?id=ed34d87d9cfbae8b908159f60df2008e45e4c39f
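P.S. The L2-grouping sketch mentioned above. rte_hwtopo_l2_id() is made
up for illustration and is not a real DPDK function:

/* Collect into 'group' all enabled worker lcores sharing an L2 cache
 * with 'ref_lcore'; return the number of lcores found.
 */
#include <rte_lcore.h>

static unsigned int
lcores_sharing_l2(unsigned int ref_lcore, unsigned int *group,
		  unsigned int max_group_size)
{
	unsigned int n = 0;
	unsigned int lcore;
	/* Hypothetical: an opaque id identifying the L2 cache instance
	 * serving a particular lcore.
	 */
	int ref_l2 = rte_hwtopo_l2_id(ref_lcore);

	RTE_LCORE_FOREACH_WORKER(lcore) {
		if (n == max_group_size)
			break;
		if (rte_hwtopo_l2_id(lcore) == ref_l2)
			group[n++] = lcore;
	}

	return n;
}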