DPDK patches and discussions
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
To: "Varghese, Vipin" <vipin.varghese@amd.com>,
	<ferruh.yigit@amd.com>, <dev@dpdk.org>
Subject: Re: [RFC 0/2] introduce LLC aware functions
Date: Tue, 3 Sep 2024 10:50:05 +0200
Message-ID: <3edc8a89-7d10-47f4-8f95-856c2a7fc7ba@intel.com>
In-Reply-To: <db3af9ba-6b7b-4c43-bce0-d85d222cfa99@amd.com>

On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
> <snipped>
>>>
>>>> I recently looked into how Intel's Sub-NUMA Clustering would work
>>>> within DPDK, and found that I actually didn't have to do anything,
>>>> because the SNC "clusters" present themselves as NUMA nodes, which
>>>> DPDK already supports natively.
>>>
>>> Yes, this is correct. In the Intel Xeon Platinum BIOS one can set
>>> `Cluster per NUMA` to `1, 2 or 4`.
>>>
>>> This divides the tiles into Sub-NUMA partitions, each having separate
>>> lcores, memory controllers, PCIe and accelerators.
>>>
>>>>
>>>> Does AMD's implementation of chiplets not report themselves as separate
>>>> NUMA nodes?
>>>
>>> In the AMD EPYC SoC, this is different. There are 2 BIOS settings, namely
>>>
>>> 1. NPS: `Numa Per Socket`, which allows the IO tile (memory, PCIe and
>>> accelerator) to be partitioned as NUMA 0, 1, 2 or 4.
>>>
>>> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
>>> all CPU tiles to be independent NUMA nodes.
>>>
>>>
>>> The above settings are possible because the CPU is independent from the
>>> IO tile, thus allowing 4 combinations to be available for use.
>>
>> Sure, but presumably if the user wants to distinguish this, they have to
>> configure their system appropriately. If user wants to take advantage of
>> L3 as NUMA (which is what your patch proposes), then they can enable the
>> BIOS knob and get that functionality for free. DPDK already supports this.
>>
> The intent of the RFC is to introduce the ability to select lcores within
> the same L3 cache whether the BIOS is set or unset for `L3 as NUMA`. This
> is also achieved and tested on platforms where the OS kernel advertises
> this via sysfs, thus eliminating the dependency on hwloc and libnuma,
> which can be different versions in different distros.

But we do depend on libnuma, so we might as well depend on it? Are there 
different versions of libnuma that interfere with what you're trying to 
do? You keep coming back to this "whether the BIOS is set or unset" for 
L3 as NUMA, but I'm still unclear as to what issues your patch is 
solving assuming "knob is set". When the system is configured correctly, 
it already works and reports cores as part of NUMA nodes (as L3) 
correctly. It is only when the system is configured *not* to do that 
that issues arise, is it not? In which case IMO the easier solution 
would be to just tell the user to enable that knob in BIOS?
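
For reference, the sysfs route mentioned in the quoted text can be sketched roughly as below. This is only an illustration and not code from the RFC: it assumes that `cache/index3` under each CPU directory is the L3 (a robust version would check the `level` file), and the helper names `parse_cpulist`, `group_by_l3` and `read_l3_shared_lists` are made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8,10-11" into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_l3(shared_lists):
    """Group CPUs by L3 domain.

    shared_lists maps a CPU id to the text of its
    cache/index3/shared_cpu_list file (assumed to describe the L3).
    """
    groups = defaultdict(set)
    for cpu, text in shared_lists.items():
        siblings = parse_cpulist(text)
        if siblings:
            # CPUs reporting the same sibling list share one L3.
            groups[tuple(siblings)].add(cpu)
    return [sorted(members) for _, members in sorted(groups.items())]


def read_l3_shared_lists(root="/sys/devices/system/cpu"):
    """Read shared_cpu_list for every CPU that exposes an index3 cache."""
    result = {}
    for path in Path(root).glob("cpu[0-9]*/cache/index3/shared_cpu_list"):
        cpu = int(path.parts[-4][3:])  # directory name is "cpuN"
        result[cpu] = path.read_text()
    return result
```

On a Linux host, `group_by_l3(read_l3_shared_lists())` would give one list of CPU ids per L3 domain, regardless of how the BIOS exposes NUMA nodes.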

> 
> 
>>>
>>> These are covered in the tuning guide for the SoC: "How to get best
>>> performance on AMD platform", Data Plane Development Kit 24.07.0
>>> documentation
>>> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.
>>>
>>>
>>>> Because if it does, I don't really think any changes are
>>>> required because NUMA nodes would give you the same thing, would it not?
>>>
>>> I have a different opinion to this outlook. An end user can
>>>
>>> 1. Identify the lcores and their NUMA nodes using `usertools/cpu-layout.py`
>>
>> I recently submitted an enhancement for the CPU layout script to print out
>> NUMA separately from physical socket [1].
>>
>> [1]
>> https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/
>>
>> I believe when "L3 as NUMA" is enabled in BIOS, the script will display
>> both physical package ID as well as NUMA nodes reported by the system,
>> which will be different from physical package ID, and which will display
>> information you were looking for.
> 
> As AMD, we had submitted earlier work on the same: "usertools: enhance
> logic to display NUMA"
> <https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>.
> 
> This clearly distinguished NUMA and physical socket.

Oh, cool, I didn't see that patch. I would argue my visual format is 
more readable though, so perhaps we can get that in :)

> Agreed, but as pointed out, in the case of Intel Xeon Platinum SPR the
> tile consists of CPU, memory, PCIe and accelerator.
> 
> Hence, with the BIOS option `Cluster per NUMA` set, the OS kernel and
> libnuma display the appropriate domain with memory, PCIe and CPU.
> 
> 
> In the case of the AMD SoC, libnuma for CPU is different from memory NUMA
> per socket.

I'm curious how the kernel handles this then, and what you are 
getting from libnuma. You seem to be implying that there are two 
different NUMA nodes on your SoC, and either the kernel or libnuma is in 
conflict as to what belongs to what NUMA node?

> 
>>
>>>
>>> 3. There is no API which distinguishes the L3 NUMA domain. Function
>>> `rte_socket_id`
>>> <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>
>>> for CPU tiles like AMD SoC will return the physical socket.
>>
>> Sure, but I would think the answer to that would be to introduce an API
>> to distinguish between NUMA (socket ID in DPDK parlance) and package
>> (physical socket ID in the "traditional NUMA" sense). Once we can
>> distinguish between those, DPDK can just rely on NUMA information
>> provided by the OS, while still being capable of identifying physical
>> sockets if the user so desires.
> Agreed, +1 for the idea of a physical socket API and changes in the
> library to exploit the same.
>>
>> I am actually going to introduce API to get *physical socket* (as
>> opposed to NUMA node) in the next few days.
>>
> But how does it solve the end customer issues:
> 
> 1. If there are multiple NICs or accelerators on multiple sockets, but the
> IO tile is partitioned into sub-domains.

At least on Intel platforms, NUMA node gets assigned correctly - that 
is, if my Xeon with SNC enabled has NUMA nodes 3,4 on socket 1, and 
there's a NIC connected to socket 1, it's going to show up as being on 
NUMA node 3 or 4 depending on where exactly I plugged it in. Everything 
already works as expected, and there is no need for any changes for 
Intel platforms (at least none that I can see).
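
As an illustration of where that assignment can be inspected on Linux, a PCI device's NUMA node is exposed under `/sys/bus/pci/devices/<BDF>/numa_node`. The sketch below is not taken from DPDK (applications would normally ask `rte_eth_dev_socket_id()` instead); the helper name `pci_numa_node` and the example address are hypothetical.

```python
from pathlib import Path


def pci_numa_node(bdf, root="/sys/bus/pci/devices"):
    """Return the NUMA node the kernel reports for a PCI device.

    bdf is the PCI address, e.g. "0000:17:00.0" (hypothetical example).
    Returns None when the kernel reports -1, i.e. no node affinity known.
    """
    text = (Path(root) / bdf / "numa_node").read_text().strip()
    node = int(text)
    return None if node < 0 else node
```

With SNC enabled, a NIC plugged into socket 1 would be expected to report one of that socket's NUMA nodes here.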

My proposed API is really for those users who wish to explicitly allow 
for reserving memory/cores on "the same physical socket", as "on the 
same tile" is already taken care of by NUMA nodes.

> 
> 2. If RTE_FLOW steering is applied on a NIC whose traffic needs to be
> processed under the same L3; this reduces noisy neighbors and gives better
> cache hits.
> 
> 3. For the PKT-distribute library, which needs to run within the same
> worker lcore set as RX-Distributor-TX.
> 

Same as above: on Intel platforms, NUMA nodes already solve this.

<snip>

> Totally agree, that is what the RFC is also doing, based on what OS sees 
> as NUMA we are using it.
> 
> The only addition is that within the NUMA node, if there are split LLCs,
> we allow selection of those lcores, rather than blindly choosing an lcore
> using rte_lcore_get_next.

It feels like we're working around a problem that shouldn't exist in the 
first place, because the kernel should already report this information. 
Within the NUMA subsystem, there is a sysfs node "distance" that, at least 
on Intel platforms and in certain BIOS configurations, reports the distance 
between NUMA nodes, from which one can make inferences about how far a 
specific NUMA node is from any other NUMA node. This could have been 
used to encode L3 cache information. Do AMD platforms not do that? In 
that case, "lcore next" for a particular socket ID (NUMA node, in 
reality) should already get us any cores that are close to each other, 
because all of this information is already encoded in NUMA nodes by the 
system.
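
To make the "distance" point concrete, here is a rough sketch of ranking NUMA nodes by proximity from `/sys/devices/system/node/nodeN/distance`. It assumes the i-th value in each file corresponds to the i-th online node in ascending id order; the helper names are invented for this example.

```python
from pathlib import Path


def read_distance_files(root="/sys/devices/system/node"):
    """Map each NUMA node id to the raw text of its distance file."""
    files = {}
    for path in Path(root).glob("node[0-9]*/distance"):
        files[int(path.parent.name[4:])] = path.read_text()  # "nodeN"
    return files


def nearest_nodes(distance_files, node):
    """Return the other node ids ordered by increasing distance from node.

    Assumes the i-th entry of a distance row refers to the i-th node id
    in ascending order.
    """
    nodes = sorted(distance_files)
    row = [int(tok) for tok in distance_files[node].split()]
    pairs = [(dist, other) for other, dist in zip(nodes, row) if other != node]
    return [other for _, other in sorted(pairs)]
```

For a four-node system whose node 0 row reads `10 12 20 20`, node 1 would come out as the closest neighbour of node 0.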

I feel like there's a disconnect between my understanding of the problem 
space, and yours, so I'm going to ask a very basic question:

Assuming the user has configured their AMD system correctly (i.e. 
enabled L3 as NUMA), is there any problem to be solved by adding a new 
API? Does the system not report each L3 as a separate NUMA node?
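
One way to answer that question on a given machine is to compare the CPU sets of the NUMA nodes with the CPU sets sharing an L3. The sketch below takes both as plain Python data (in practice they could come from the `cpulist` files under `/sys/devices/system/node/nodeN/` and the per-CPU cache directories); the function names are hypothetical.

```python
def l3_is_numa(node_cpus, l3_groups):
    """True if every NUMA node with CPUs maps to exactly one L3 domain.

    node_cpus: dict of NUMA node id -> iterable of CPU ids.
    l3_groups: iterable of CPU id collections, one per L3 domain.
    """
    nodes = {frozenset(cpus) for cpus in node_cpus.values() if cpus}
    l3 = {frozenset(group) for group in l3_groups if group}
    return nodes == l3


def split_llc_nodes(node_cpus, l3_groups):
    """Return {node: [L3 groups]} for nodes spanning more than one L3."""
    result = {}
    for node, cpus in sorted(node_cpus.items()):
        cpuset = set(cpus)
        inside = [sorted(g) for g in l3_groups if set(g) & cpuset]
        if len(inside) > 1:
            result[node] = sorted(inside)
    return result
```

If `l3_is_numa` returns True, the NUMA node already identifies the L3 domain; a non-empty result from `split_llc_nodes` is the "split LLC within a NUMA node" case the RFC is concerned with.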

> 
> 
>> We force the user to configure their system
>> correctly as it is, and I see no reason to second-guess user's BIOS
>> configuration otherwise.
> 
> To reiterate, the changes suggested in the RFC are agnostic to what BIOS 
> options are used,

But that is exactly my contention: are we not effectively working around 
users' misconfiguration of a system then?

-- 
Thanks,
Anatoly

