From: Mattias Rönnblom
Date: Mon, 9 Sep 2024 16:52:56 +0200
Subject: Re: [RFC 0/2] introduce LLC aware functions
To: "Varghese, Vipin", "Yigit, Ferruh", "dev@dpdk.org"
On 2024-09-09 16:22, Varghese, Vipin wrote:
>>>
>>> Thank you Mattias for the comments and question, please let me try to
>>> explain the same below.
>>>
>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>
>>> Based on the intention to bring in CPU lcores which share the same L3
>>> (for better cache hits and fewer noisy neighbors), the current API
>>> focuses on using the Last Level Cache. But if the suggestion is `there
>>> are SoCs where L2 cache is also shared, and the new API should be
>>> provisioned`, I am also comfortable with the thought.
>>>
>>
>> Rather than some AMD special case API hacked into , I think we
>> are better off with no DPDK API at all for this kind of functionality.
>
> Hi Mattias, as shared in the earlier email thread, this is not an AMD
> special case at all. Let me try to explain this one more time. One of
> the techniques used to increase core counts in a cost-effective way is
> to use tiles of compute complexes. This introduces groups of cores
> sharing the same Last Level Cache (namely L2, L3 or even L4), depending
> upon the cache topology architecture.
>
> The API suggested in the RFC is to help end users selectively use cores
> under the same Last Level Cache hierarchy as advertised by the OS
> (irrespective of the BIOS settings used). This is useful in both
> bare-metal and container environments.
>

I'm pretty familiar with AMD CPUs and the use of tiles (including the
challenges these kinds of non-uniformities pose for work scheduling).

To maximize performance, caring about the core<->LLC relationship may
well not be enough, and more HT/core/cache/memory topology information
is required. That's what I meant by special case.
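As a minimal sketch of where such per-level information can be read on Linux, the C below returns the set of CPUs that share a cache of a given level with a given CPU. It assumes the kernel's sysfs cache attributes under /sys/devices/system/cpu/cpuN/cache/indexM/ (`level`, `type`, `shared_cpu_list`) are present. The helper names `parse_cpulist`, `read_sysfs_line` and `cpus_sharing_cache` are hypothetical and not part of DPDK, and they operate on OS CPU ids, not DPDK lcore ids.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a kernel "cpulist" string such as "0-3,8-11" into a boolean set
 * indexed by CPU id. Returns the number of distinct CPUs set, or -1 if the
 * string is malformed or names a CPU id >= max_cpus. */
int parse_cpulist(const char *s, bool *set, size_t max_cpus)
{
	int count = 0;

	memset(set, 0, max_cpus * sizeof(*set));
	while (*s != '\0' && *s != '\n') {
		char *end;
		unsigned long first = strtoul(s, &end, 10);
		unsigned long last = first;

		if (end == s)
			return -1;
		s = end;
		if (*s == '-') { /* range "a-b" */
			s++;
			last = strtoul(s, &end, 10);
			if (end == s || last < first)
				return -1;
			s = end;
		}
		if (last >= max_cpus)
			return -1;
		for (unsigned long c = first; c <= last; c++) {
			if (!set[c]) {
				set[c] = true;
				count++;
			}
		}
		if (*s == ',')
			s++;
		else if (*s != '\0' && *s != '\n')
			return -1;
	}
	return count;
}

/* Read the first line of a small sysfs file into buf, newline stripped. */
int read_sysfs_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return -1;
	if (fgets(buf, (int)len, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Mark in 'set' the CPUs sharing the data or unified cache of the given
 * level with 'cpu'. Returns the number of sharing CPUs, or -1 if no such
 * cache is described in sysfs. */
int cpus_sharing_cache(unsigned int cpu, unsigned int level,
		       bool *set, size_t max_cpus)
{
	char path[128], buf[4096];

	for (unsigned int idx = 0;; idx++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%u/cache/index%u/level", cpu, idx);
		if (read_sysfs_line(path, buf, sizeof(buf)) != 0)
			return -1; /* ran out of cache indexes */
		if ((unsigned int)atoi(buf) != level)
			continue;
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%u/cache/index%u/type", cpu, idx);
		if (read_sysfs_line(path, buf, sizeof(buf)) != 0 ||
		    strcmp(buf, "Instruction") == 0)
			continue; /* skip instruction caches */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%u/cache/index%u/shared_cpu_list",
			 cpu, idx);
		if (read_sysfs_line(path, buf, sizeof(buf)) != 0)
			return -1;
		return parse_cpulist(buf, set, max_cpus);
	}
}
```

For example, `cpus_sharing_cache(0, 3, set, n)` would mark the CPUs sharing CPU 0's L3, and level 2 gives its L2 siblings. SMT siblings are exposed in the same cpulist format via /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.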
A proper API should allow access to information about which lcores are
SMT siblings, cores on the same L2, and cores on the same L3, to name a
few things. Probably you want to fit NUMA into the same API as well,
although that is available already in .

One can have a look at how scheduling domains work in the Linux kernel.
They model this kind of thing.

> As shared in the response to the cover letter, +1 to expanding it to
> more than just LLC cores. We have also confirmed the same in
> https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.varghese@amd.com/
>
>>
>> A DPDK CPU/memory hierarchy topology API very much makes sense, but it
>> should be reasonably generic and complete from the start.
>>
>>>>
>>>> Could potentially be built on the 'hwloc' library.
>>>
>>> There are 3 reasons we did not explore this path for AMD SoCs:
>>>
>>> 1. depending on the hwloc version and kernel version, certain SoC
>>> hierarchies are not available
>>>
>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD
>>> EPYC SoCs.
>>>
>>> 3. it adds an extra library dependency that has to be made available
>>> for this to work.
>>>
>>> Hence we have tried to use the documented generic Linux layer of
>>> `sysfs CPU cache`.
>>>
>>> I will try to explore more on hwloc and check if other libraries
>>> within DPDK leverage the same.
>>>
>>>>
>>>> I much agree cache/core topology may be of interest to the
>>>> application (or a work scheduler, like a DPDK event device), but it's
>>>> not limited to LLC. It may well be worthwhile to care about which
>>>> cores share an L2 cache, for example. Not sure the RTE_LCORE_FOREACH_*
>>>> approach scales.
>>>
>>> Yes, totally understood, as on some SoCs multiple lcores share the
>>> same L2 cache.
>>>
>>> Can we rework the API to be rte_get_cache_ where the user
>>> argument is the desired lcore index:
>>>
>>> 1. index-1: SMT threads
>>>
>>> 2. index-2: threads sharing the same L2 cache
>>>
>>> 3. index-3: threads sharing the same L3 cache
>>>
>>> 4. index-MAX: identify the threads sharing the last level cache.
>>>
>>>>
>>>>> < Function: Purpose >
>>>>> ---------------------
>>>>> - rte_get_llc_first_lcores: Retrieves all the first lcores in the
>>>>> shared LLC.
>>>>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>>>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first n
>>>>> lcores in the shared LLC.
>>>>>
>>>>> < MACRO: Purpose >
>>>>> ------------------
>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore of
>>>>> each LLC.
>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first
>>>>> worker lcore of each LLC.
>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on a
>>>>> hint (lcore id).
>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC
>>>>> while skipping the first worker.
>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores
>>>>> from each LLC.
>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores, then
>>>>> iterates through the remaining lcores in each LLC.
>>>>>
>>> While the MACROs are simple wrappers invoking the appropriate API, can
>>> this be worked out in this fashion?
>>>
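The "first lcore per LLC" and "first n / skip n" semantics listed in the RFC can be illustrated with a small sketch. It assumes a hypothetical per-CPU array `cache_id[]` that gives each CPU's cache domain (for example, derived from `shared_cpu_list` or, where the kernel exposes it, the per-cache `id` attribute). `first_cpu_per_cache` and `cpus_in_cache` are illustrative helpers only. They are not the RFC's implementation or an existing DPDK API, and mapping DPDK lcores to CPUs is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: write into 'out' the first CPU of every distinct
 * cache domain, in ascending CPU order. cache_id[i] identifies the cache
 * used by CPU i. Returns the number of CPUs written. */
size_t first_cpu_per_cache(const int *cache_id, size_t n_cpus,
			   unsigned int *out)
{
	size_t n_out = 0;

	for (size_t cpu = 0; cpu < n_cpus; cpu++) {
		bool seen = false;

		/* Has this cache domain already contributed a CPU? */
		for (size_t j = 0; j < n_out; j++) {
			if (cache_id[out[j]] == cache_id[cpu]) {
				seen = true;
				break;
			}
		}
		if (!seen)
			out[n_out++] = (unsigned int)cpu;
	}
	return n_out;
}

/* Hypothetical helper mirroring the "first n" / "skip n" variants: write
 * the CPUs of cache domain 'id', skipping its first 'skip' members and
 * emitting at most 'max' CPUs (0 means no limit). Returns the count. */
size_t cpus_in_cache(const int *cache_id, size_t n_cpus, int id,
		     size_t skip, size_t max, unsigned int *out)
{
	size_t n_out = 0, seen = 0;

	for (size_t cpu = 0; cpu < n_cpus; cpu++) {
		if (cache_id[cpu] != id)
			continue;
		if (seen++ < skip)
			continue;
		if (max != 0 && n_out == max)
			break;
		out[n_out++] = (unsigned int)cpu;
	}
	return n_out;
}
```

Keying the grouping on a cache identifier rather than on a fixed "LLC" notion lets the same helpers serve any cache level. This is close to the per-level index variant proposed in the thread.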