Subject: Re: [RFC 0/2] introduce LLC aware functions
From: Mattias Rönnblom
To: Bruce Richardson
Cc: "Varghese, Vipin", Honnappa Nagarahalli, "Yigit, Ferruh", dev@dpdk.org, nd
Date: Thu, 12 Sep 2024 18:32:38 +0200
Message-ID: <8a882827-874b-4e2d-ae89-d0b243ff7e77@lysator.liu.se>

On 2024-09-12 15:30, Bruce Richardson wrote:
> On Thu, Sep 12, 2024 at 01:59:34PM +0200, Mattias Rönnblom wrote:
>> On 2024-09-12 13:17, Varghese, Vipin wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>>
>>>>>>>> Thank you Mattias for the information. As shared in the reply to
>>>>>> Anatoly, we want to expose a new API `rte_get_next_lcore_ex`, which
>>>>>> takes an extra argument, `u32 flags`.
>>>>>>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>>>>>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>>>>>> RTE_GET_LCORE_BOOST_DISABLED.
>>>>>>>
>>>>>>> Wouldn't that API be pretty awkward to use?
>>>>> The current API available in DPDK is `rte_get_next_lcore`, which is
>>>> used within the DPDK examples and in customer solutions.
>>>>> Based on the comments from others, we responded to the idea of
>>>> changing the new API from `rte_get_next_lcore_llc` to
>>>> `rte_get_next_lcore_exntd`.
>>>>>
>>>>> Can you please help us understand what is `awkward`?
>>>>>
>>>>
>>>> The awkwardness starts when you are trying to provide hwloc-type
>>>> information over an API that was designed for iterating over lcores.
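For concreteness, a minimal sketch of how the proposed iterator might
look in use. The prototype and the flag name below are assumptions
inferred from the description above; nothing like this exists in DPDK
today:

/* Assumed prototype, modeled on rte_get_next_lcore():
 *   unsigned int rte_get_next_lcore_ex(unsigned int i, uint32_t flags);
 *
 * Visit every enabled lcore sharing the LLC (L3) with the current one.
 */
#include <stdint.h>
#include <rte_lcore.h>

static void
visit_l3_siblings(void)
{
	unsigned int i;

	for (i = rte_get_next_lcore_ex(-1, RTE_GET_LCORE_L3);
	     i < RTE_MAX_LCORE;
	     i = rte_get_next_lcore_ex(i, RTE_GET_LCORE_L3)) {
		/* launch work on, or otherwise make use of, lcore i */
	}
}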
>>> I disagree with this point. The current implementation of the lcore
>>> library is focused only on iterating through the list of enabled
>>> cores, the core mask, and the lcore map.
>>> With ever-increasing core counts, memory, I/O and accelerators on
>>> SoCs, sub-NUMA partitioning is common in SoCs from various vendors.
>>> Enhancing or augmenting the lcore API to extract or provision NUMA
>>> and cache topology is not awkward.
>>
>> DPDK providing an API for this information makes sense to me, as I've
>> mentioned before. What I questioned was the way it was done (i.e., the
>> API design) in your RFC, and the limited scope (which in part you have
>> addressed).
>>
>
> Actually, I'd like to touch on this first item a little bit. What is
> the main benefit of providing this information in EAL? To me, it seems
> like something that is for apps to try and be super-smart and select
> particular cores out of a set of cores to run on. However, is that not
> taking on work that should really be the job of the person deploying
> the app? The deployer - if I can use that term - has already selected a
> set of cores and NICs for a DPDK application to use. Should they not
> also be the one selecting - via app argument, via the --lcores flag to
> map one core id to another, or otherwise - which part of an application
> should run on what particular piece of hardware?
>

Scheduling in one form or another will happen on a number of levels.

One level is what you call the "deployer". Whether man or machine, it
will allocate a bunch of lcores to the application - either statically,
by using -l <core list>, or dynamically, by giving a very large core
mask combined with having an agent in the app responsible for scaling
the number of cores actually used up or down (allowing coexistence with
other, non-DPDK, Linux process scheduler-scheduled processes on the same
set of cores, although not at the same time).

I think the "deployer" level should generally not be aware of the DPDK
app's internals, including how to assign different tasks to different
cores. That is consistent with how things work in a general-purpose
operating system, where you allocate cores, memory and I/O devices to an
instance (e.g., a VM), but then the OS scheduler figures out how best to
use them.

The app's internals may be complicated, may change across software
versions and traffic mixes/patterns, and, most of all, may not lend
themselves to static at-start configuration at all.

> In summary, what is the final real-world intended usecase for this
> work?

One real-world example is an Eventdev app with some atomic processing
stage, using DSW, and SMT. Hardware threading on Intel x86 generally
improves performance by ~25%, which in my experience seems to hold true
for data plane apps as well. So that's a (not-so-)freebie you don't want
to miss out on.

To max out single-flow performance, the work scheduler may need to give
the bottleneck stage's atomic processing of that elephant flow not only
100% of an lcore, but a *full* physical core (i.e., assure that the SMT
sibling is idle). But DSW doesn't understand the CPU topology, so you
have to choose between max multi-flow throughput and max single-flow
throughput at deployment time.

An RTE hwtopo API would certainly help in the implementation of
SMT-aware scheduling.
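To make that concrete, here is a sketch of what such SMT-aware selection
could build on. The rte_hwtopo_smt_sibling() call below is hypothetical;
it stands in for whatever query a real topology API would provide:

/* Reserve a full physical core for a bottleneck stage by picking a
 * worker lcore whose SMT sibling is not enabled in this application.
 */
#include <rte_lcore.h>

static int
find_full_physical_lcore(void)
{
	unsigned int lcore;

	RTE_LCORE_FOREACH_WORKER(lcore) {
		/* Hypothetical: the id of the HW thread sharing a
		 * physical core with 'lcore', or RTE_MAX_LCORE if none.
		 */
		unsigned int sibling = rte_hwtopo_smt_sibling(lcore);

		if (sibling == RTE_MAX_LCORE ||
		    !rte_lcore_is_enabled(sibling))
			return (int)lcore;
	}

	return -1; /* all workers share their physical core */
}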
Another example could be the use of bigger or turbo-capable cores to run
CPU-hungry, singleton services (e.g., an Eventdev RX timer adapter
core), or the use of a hardware thread to run the SW scheduler service
(which needs to react quickly to incoming scheduling events, but maybe
does not need all the cycles of a full physical core).

Yet another example would be an event device which understands how to
spread a particular flow across multiple cores, but uses only cores
sharing the same L2. Or, keep processing of a certain kind (e.g., a
certain Eventdev queue) only on cores sharing the same L2, to improve L2
hit rates for the instructions and data related to that processing
stage. (A sketch of this L2-grouping idea follows at the end of this
mail.)

> DPDK already tries to be smart about cores and NUMA, and in some cases
> we have hit issues where users have - for their own valid reasons -
> wanted to run DPDK in a sub-optimal way, and they end up having to
> fight DPDK's smarts in order to do so! Ref: [1]
>
> /Bruce
>
> [1] https://git.dpdk.org/dpdk/commit/?id=ed34d87d9cfbae8b908159f60df2008e45e4c39f
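P.S. The L2-grouping sketch mentioned above. rte_hwtopo_l2_id() is made
up for illustration and is not a real DPDK function:

/* Collect into 'group' all enabled worker lcores sharing an L2 cache
 * with 'ref_lcore'; return the number of lcores found.
 */
#include <rte_lcore.h>

static unsigned int
lcores_sharing_l2(unsigned int ref_lcore, unsigned int *group,
		  unsigned int max_group_size)
{
	unsigned int n = 0;
	unsigned int lcore;
	/* Hypothetical: an opaque id identifying the L2 cache instance
	 * serving a particular lcore.
	 */
	int ref_l2 = rte_hwtopo_l2_id(ref_lcore);

	RTE_LCORE_FOREACH_WORKER(lcore) {
		if (n == max_group_size)
			break;
		if (rte_hwtopo_l2_id(lcore) == ref_l2)
			group[n++] = lcore;
	}

	return n;
}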