From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <38d0336d-ea9e-41b3-b3d8-333efb70eb1f@lysator.liu.se>
Date: Wed, 11 Sep 2024 17:55:15 +0200
From: Mattias Rönnblom
Subject: Re: [RFC 0/2] introduce LLC aware functions
To: "Varghese, Vipin", "Yigit, Ferruh", "dev@dpdk.org"
References: <20240827151014.201-1-vipin.varghese@amd.com>
 <45f26104-ad6c-4e42-8446-d8b51ac3f2dd@lysator.liu.se>
List-Id: DPDK patches and discussions <dev@dpdk.org>

On 2024-09-11 05:26, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>>
>> On 2024-09-09 16:22, Varghese, Vipin wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>>>
>>>>> Thank you Mattias for the comments and question; please let me try
>>>>> to explain the same below.
>>>>>
>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>
>>>>> Based on the intention to bring in CPU lcores which share the same
>>>>> L3 (for better cache hits and less noisy neighbors), the current
>>>>> API focuses on using the Last Level Cache. But if the suggestion is
>>>>> `there are SoCs where the L2 cache is also shared, and the new API
>>>>> should be provisioned`, I am also comfortable with the thought.
>>>>>
>>>>
>>>> Rather than some AMD special case API hacked into , I
>>>> think we are better off with no DPDK API at all for this kind of
>>>> functionality.
>>>
>>> Hi Mattias, as shared in the earlier email thread, this is not an
>>> AMD special case at all. Let me try to explain this one more time.
>>> One of the techniques used to increase core counts in a
>>> cost-effective way is to go for tiles of compute complexes.
>>> This introduces a bunch of cores sharing the same Last Level Cache
>>> (namely L2, L3 or even L4), depending upon the cache topology
>>> architecture.
>>>
>>> The API suggested in the RFC is to help end users selectively use
>>> cores under the same Last Level Cache hierarchy, as advertised by
>>> the OS (irrespective of the BIOS settings used). This is useful in
>>> both bare-metal and container environments.
>>>
>>
>> I'm pretty familiar with AMD CPUs and the use of tiles (including the
>> challenges these kinds of non-uniformities pose for work scheduling).
>>
>> To maximize performance, caring about the core<->LLC relationship may
>> well not be enough, and more HT/core/cache/memory topology
>> information is required. That's what I meant by special case. A
>> proper API should allow access to information about which lcores are
>> SMT siblings, cores on the same L2, and cores on the same L3, to name
>> a few things. Probably you want to fit NUMA into the same API as
>> well, although that is available already in .
>
> Thank you Mattias for the information; as shared in the reply to
> Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which
> takes an extra argument `u32 flags`.
> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> RTE_GET_LCORE_BOOST_DISABLED.
>

Wouldn't that API be pretty awkward to use? I mean, what you have is a
topology, with nodes of different types and with different properties,
and you want to present it to the user. In a sense, it's similar to XML
and DOM versus SAX. The above is SAX-style, and what I have in mind is
something DOM-like.

What use case do you have in mind? What's on top of my list is a
scenario where a DPDK app gets a bunch of cores (e.g., -l ) and tries
to figure out how best to make use of them. It's not going to "skip"
(ignore, leave unused) SMT siblings, or skip non-boosted cores; it
would just try to be clever in regards to which cores to use for what
purpose.

> This is AMD EPYC SoC agnostic and trying to address all generic
> cases.
>
> Please do let us know if we (Ferruh & myself) can sync up via a call?
>

Sure, I can do that.

>>
>> One can have a look at how scheduling domains work in the Linux
>> kernel. They model this kind of thing.
>>
>>> As shared in the response to the cover letter, +1 to expand it to
>>> more than just LLC cores.
>>> We have also confirmed the same at
>>> https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.varghese@amd.com/
>>>
>>>>
>>>> A DPDK CPU/memory hierarchy topology API very much makes sense, but
>>>> it should be reasonably generic and complete from the start.
>>>>
>>>>>>
>>>>>> Could potentially be built on the 'hwloc' library.
>>>>>
>>>>> There are 3 reasons we did not explore this path on AMD SoCs:
>>>>>
>>>>> 1. depending on the hwloc version and kernel version, certain SoC
>>>>> hierarchies are not available
>>>>>
>>>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD
>>>>> EPYC SoCs.
>>>>>
>>>>> 3. it adds an extra library dependency that has to be made
>>>>> available for this to work.
>>>>>
>>>>> Hence we have tried to use the Linux documented generic layer of
>>>>> `sysfs CPU cache`.
>>>>>
>>>>> I will try to explore more on hwloc and check if other libraries
>>>>> within DPDK leverage the same.
>>>>>
>>>>>>
>>>>>> I much agree cache/core topology may be of interest to the
>>>>>> application (or a work scheduler, like a DPDK event device), but
>>>>>> it's not limited to LLC. It may well be worthwhile to care about
>>>>>> which cores share L2 cache, for example. Not sure the
>>>>>> RTE_LCORE_FOREACH_* approach scales.
>>>>>
>>>>> Yes, totally understood; on some SoCs, multiple lcores share the
>>>>> same L2 cache.
>>>>>
>>>>> Can we rework the API to be rte_get_cache_ where the user
>>>>> argument is the desired lcore index?
>>>>>
>>>>> 1. index-1: SMT threads
>>>>>
>>>>> 2. index-2: threads sharing same L2 cache
>>>>>
>>>>> 3. index-3: threads sharing same L3 cache
>>>>>
>>>>> 4. index-MAX: identify the threads sharing the last level cache.
>>>>>
>>>>>>
>>>>>>> < Function: Purpose >
>>>>>>> ---------------------
>>>>>>> - rte_get_llc_first_lcores: Retrieves all the first lcores in
>>>>>>> the shared LLC.
>>>>>>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>>>>>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first
>>>>>>> n lcores in the shared LLC.
>>>>>>>
>>>>>>> < MACRO: Purpose >
>>>>>>> ------------------
>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore
>>>>>>> from each LLC.
>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first
>>>>>>> worker lcore from each LLC.
>>>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from the LLC based
>>>>>>> on a hint (lcore id).
>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from
>>>>>>> the LLC while skipping the first worker.
>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n`
>>>>>>> lcores from each LLC.
>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores,
>>>>>>> then iterates through the remaining lcores in each LLC.
>>>>>
>>>>> While the MACROs are simple wrappers invoking the appropriate API,
>>>>> can this be worked out in this fashion?
>>>>>