<snipped>

I recently looked into how Intel's Sub-NUMA Clustering would work within
DPDK, and found that I actually didn't have to do anything, because the
SNC "clusters" present themselves as NUMA nodes, which DPDK already
supports natively.

Yes, this is correct. In the Intel Xeon Platinum BIOS one can set
`Cluster per NUMA` to `1, 2 or 4`.

This divides the tiles into Sub-NUMA partitions, each having separate
lcores, memory controllers, PCIe and accelerators.


Does AMD's implementation of chiplets not report themselves as separate
NUMA nodes?

On AMD EPYC SoCs this is different. There are 2 BIOS settings, namely

1. NPS: `Numa Per Socket`, which allows the IO tile (memory, PCIe and
accelerator) to be partitioned as NUMA 0, 1, 2 or 4.

2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
each CPU tile to appear as an independent NUMA node.


The above settings are possible because the CPU tiles are independent of
the IO tile, which makes 4 combinations available for use.

Sure, but presumably if the user wants to distinguish this, they have to
configure their system appropriately. If user wants to take advantage of
L3 as NUMA (which is what your patch proposes), then they can enable the
BIOS knob and get that functionality for free. DPDK already supports this.

The intent of the RFC is to introduce the ability to select lcores within
the same L3 cache, whether or not the BIOS option `L3 as NUMA` is set.
This has been implemented and tested on platforms where the OS kernel
advertises the cache topology via sysfs, thus eliminating the dependency
on hwloc and libnuma, whose versions can differ between distros.
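
To illustrate the sysfs route mentioned above, here is a minimal Python
sketch (not the RFC's code). It assumes the usual Linux layout
`/sys/devices/system/cpu/cpuN/cache/indexM/` with `level` and
`shared_cpu_list` files; the function names `parse_cpu_list` and
`l3_domains` are made up for this example.

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8,10-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_domains(sysfs_cpu="/sys/devices/system/cpu"):
    """Return the distinct groups of CPUs that share an L3 cache, as sorted tuples."""
    domains = set()
    pattern = os.path.join(sysfs_cpu, "cpu[0-9]*", "cache", "index[0-9]*")
    for cache_dir in glob.glob(pattern):
        try:
            # Do not assume index3 is the L3: check the advertised level.
            with open(os.path.join(cache_dir, "level")) as f:
                if f.read().strip() != "3":
                    continue
            with open(os.path.join(cache_dir, "shared_cpu_list")) as f:
                domains.add(tuple(parse_cpu_list(f.read())))
        except OSError:
            continue  # entry missing or unreadable, skip it
    return sorted(domains)
```

On a system with split L3, `l3_domains()` should return one tuple per L3
cluster regardless of how the BIOS maps those clusters to NUMA nodes.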



These are covered in the tuning guide for the SoC, in section 12, "How to
get best performance on AMD platform", of the Data Plane Development Kit
24.07.0 documentation (dpdk.org)
<https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.


Because if it does, I don't really think any changes are
required because NUMA nodes would give you the same thing, would it not?

I have a different opinion on this. An end user can

1. Identify the lcores and their NUMA nodes using `usertools/cpu-layout.py`

I recently submitted an enhancement for the CPU layout script to print out
NUMA separately from physical socket [1].

[1]
https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/

I believe when "L3 as NUMA" is enabled in BIOS, the script will display
both physical package ID as well as NUMA nodes reported by the system,
which will be different from physical package ID, and which will display
information you were looking for.
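
For reference, the distinction between physical package and NUMA node can
be read straight from sysfs. The following is a rough Python sketch, not
the code of `usertools/cpu-layout.py`; it assumes the kernel exposes
`topology/physical_package_id` and a `nodeN` entry under each
`/sys/devices/system/cpu/cpuM/` directory, and the function name is
hypothetical.

```python
import glob
import os
import re


def cpu_package_and_node(sysfs_cpu="/sys/devices/system/cpu"):
    """Return {cpu: (physical_package_id, numa_node)} as reported by the kernel.

    numa_node is None when no nodeN entry exists for that CPU.
    """
    layout = {}
    for cpu_dir in glob.glob(os.path.join(sysfs_cpu, "cpu[0-9]*")):
        cpu = int(re.search(r"cpu(\d+)$", cpu_dir).group(1))
        try:
            with open(os.path.join(cpu_dir, "topology", "physical_package_id")) as f:
                package = int(f.read())
        except OSError:
            continue  # no readable topology information for this CPU
        nodes = glob.glob(os.path.join(cpu_dir, "node[0-9]*"))
        node = int(re.search(r"node(\d+)$", nodes[0]).group(1)) if nodes else None
        layout[cpu] = (package, node)
    return layout
```

If the two values differ for CPUs on the same package, the OS is reporting
more than one NUMA node per physical socket.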

From the AMD side, we had submitted earlier work on the same topic:
"usertools: enhance logic to display NUMA" (Patchwork, dpdk.org).

It clearly distinguished between NUMA node and physical socket.



2. But it is the core mask in the EAL arguments that makes the threads
available for use in a process.

See above: if the OS already reports NUMA information, this is not a
problem to be solved, CPU layout script can give this information to the
user.

Agreed, but as pointed out, in the case of Intel Xeon Platinum SPR the
tile consists of CPU, memory, PCIe and accelerator.

Hence, with the BIOS option `Cluster per NUMA` set, the OS kernel and
libnuma display the appropriate domain with memory, PCIe and CPU.


In the case of the AMD SoC, the NUMA domain libnuma reports for the CPU
is different from the memory NUMA domain per socket.



3. There is no API which distinguishes the L3 NUMA domain. On SoCs with
CPU tiles, such as AMD's, the function `rte_socket_id`
<https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>
will return the physical socket.

Sure, but I would think the answer to that would be to introduce an API
to distinguish between NUMA (socket ID in DPDK parlance) and package
(physical socket ID in the "traditional NUMA" sense). Once we can
distinguish between those, DPDK can just rely on NUMA information
provided by the OS, while still being capable of identifying physical
sockets if the user so desires.

Agreed, +1 for the idea of a physical socket API and for library changes
to exploit it.

I am actually going to introduce API to get *physical socket* (as
opposed to NUMA node) in the next few days.

But how does it solve the following end customer issues?

1. There are multiple NICs or accelerators on multiple sockets, but the
IO tile is partitioned into sub-domains.

2. RTE_FLOW steering is applied on a NIC, and the flows need to be
processed under the same L3. This reduces noisy neighbors and gives
better cache hits.

3. The packet distributor library needs to run within the same worker
lcore set as RX-Distributor-TX.


The current RFC addresses the above by helping end users identify the
lcores within the same L3 domain under a NUMA node or physical socket,
irrespective of the BIOS setting.



Example: In AMD EPYC Genoa, there are a total of 13 tiles: 12 CPU tiles
and 1 IO tile. Setting

1. NPS to 4 will divide the memory, PCIe and accelerator into 4 domains,
while all the CPUs appear as a single NUMA, even though each of the 12
tiles has its own independent L3 cache.

2. `L3 as NUMA` allows each tile to appear as a separate L3 cluster.


Hence, adding an API which allows selecting the available lcores based on
split L3 is essential, irrespective of the BIOS setting.
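
To make the two cases concrete, here is a small Python sketch that groups
L3 domains under the NUMA nodes the OS reports. The 8-CPU topology and
its numbering are made up for illustration (they are not Genoa's real
layout), and `group_l3_by_numa` is a hypothetical helper, not a DPDK API.

```python
def group_l3_by_numa(numa_cpus, l3_domains):
    """Map each NUMA node to the L3 domains found among that node's CPUs."""
    result = {}
    for node, cpus in sorted(numa_cpus.items()):
        node_set = set(cpus)
        groups = []
        for domain in l3_domains:
            common = sorted(node_set.intersection(domain))
            if common:
                groups.append(common)
        result[node] = groups
    return result


# Toy topology (made-up numbering): 8 CPUs, 4 L3 domains of 2 CPUs each.
l3 = [(0, 1), (2, 3), (4, 5), (6, 7)]

# `L3 as NUMA` disabled: one NUMA node that contains several L3 domains.
print(group_l3_by_numa({0: list(range(8))}, l3))

# `L3 as NUMA` enabled: one NUMA node per L3 domain.
print(group_l3_by_numa({0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}, l3))
```

In the first case NUMA information alone cannot tell the L3 domains
apart, which is the gap the proposed API is meant to cover.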


I think the crucial issue here is the "irrespective of BIOS setting"
bit.

That is what the current RFC achieves.

If EAL is getting into the game of figuring out exact intricacies
of physical layout of the system, then there's a lot more work to be
done as there are lots of different topologies, as other people have
already commented, and such an API needs *a lot* of thought put into it.

There are standard sysfs interfaces for CPU cache topology (OS kernel).
As mentioned earlier, the problem with hwloc and libnuma is that
different distros ship different versions. There are solutions for
specific SoC architectures, as per the latest comment.


But we can always limit the API to selected SoCs, while on all other SoCs
the call will simply invoke rte_get_next_lcore.



If, on the other hand, we leave this issue to the kernel, and only
gather NUMA information provided by the kernel, then nothing has to be
done - DPDK already supports all of this natively, provided the user has
configured the system correctly.

As shared above, we tried to bring this in with "usertools: enhance logic
to display NUMA" (Patchwork, dpdk.org).

DPDK support for lcores is being enhanced to allow the user to use more
favorable lcores within the same tile.



Moreover, arguably DPDK already works that way: technically you can get
physical socket information even absent of NUMA support in BIOS, but
DPDK does not do that. Instead, if OS reports NUMA node as 0, that's
what we're going with (even if we could detect multiple sockets from
sysfs),

The argument above says that the OS kernel detects the NUMA node or
domain, and that this is what DPDK uses, right?

The suggested RFC adheres to the same, i.e. what the OS sees. Can you
please explain, for better understanding, what the RFC is doing
differently?


and IMO it should stay that way unless there is a strong
argument otherwise.

Totally agree, that is what the RFC is also doing: we use whatever the OS
sees as NUMA.

The only addition is that, if there are split LLCs within a NUMA node, it
allows selection of those lcores rather than blindly choosing an lcore
using rte_get_next_lcore.
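
The difference between the two selection strategies can be modelled in a
few lines of Python. This is a simplified model only: it does not use or
reproduce DPDK's `rte_get_next_lcore`, and both function names below are
hypothetical.

```python
def next_lcore(enabled, current):
    """Next enabled lcore after `current`, or None (plain iteration)."""
    later = [c for c in sorted(enabled) if c > current]
    return later[0] if later else None


def next_lcore_same_l3(enabled, l3_domains, current):
    """Next enabled lcore after `current` that shares its L3 cache."""
    for domain in l3_domains:
        if current in domain:
            return next_lcore(set(enabled) & set(domain), current)
    # L3 topology unknown for this lcore: fall back to plain iteration.
    return next_lcore(enabled, current)


# Made-up example: lcores 0, 2, 3, 5 enabled, two L3 domains.
enabled = {0, 2, 3, 5}
l3 = [(0, 1, 4, 5), (2, 3, 6, 7)]
print(next_lcore(enabled, 0))              # 2, which sits on the other L3
print(next_lcore_same_l3(enabled, l3, 0))  # 5, which shares the L3 with lcore 0
```

Plain iteration hands back lcore 2 even though it sits behind a different
L3, whereas the L3-aware variant skips to lcore 5.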


We force the user to configure their system
correctly as it is, and I see no reason to second-guess user's BIOS
configuration otherwise.

To reiterate, the changes suggested in the RFC are agnostic to which BIOS
options are used.

The BIOS settings came up only in answer to the earlier question `is AMD
configuration same as Intel tile`, where I used them to explain that it is
not.



--
Thanks,
Anatoly