>>> I recently looked into how Intel's Sub-NUMA Clustering would work
>>> within DPDK, and found that I actually didn't have to do anything,
>>> because the SNC "clusters" present themselves as NUMA nodes, which
>>> DPDK already supports natively.
>>
>> Yes, this is correct. In the Intel Xeon Platinum BIOS one can set
>> `Cluster per NUMA` to `1, 2 or 4`.
>>
>> This divides the tiles into Sub-NUMA partitions, each having separate
>> lcores, memory controllers, PCIe and accelerator.
>>
>>> Does AMD's implementation of chiplets not report themselves as
>>> separate NUMA nodes?
>>
>> In AMD EPYC SoC, this is different. There are 2 BIOS settings, namely
>>
>> 1. NPS: `Numa Per Socket`, which allows the IO tile (memory, PCIe and
>> accelerator) to be partitioned as NUMA 0, 1, 2 or 4.
>>
>> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
>> all CPU tiles to be independent NUMA cores.
>>
>> The above settings are possible because the CPU is independent from
>> the IO tile, thus allowing 4 combinations to be available for use.
>
> Sure, but presumably if the user wants to distinguish this, they have to
> configure their system appropriately. If user wants to take advantage of
> L3 as NUMA (which is what your patch proposes), then they can enable the
> BIOS knob and get that functionality for free. DPDK already supports
> this.

The intent of the RFC is to introduce the ability to select lcores within
the same L3 cache whether `L3 as NUMA` is set or unset in the BIOS. This
has also been achieved and tested on platforms where the OS kernel
advertises the cache topology via sysfs, thus eliminating the dependency
on hwloc and libnuma, which can be different versions in different
distros.

>> These are covered in the tuning guide for the SoC: "12. How to get best
>> performance on AMD platform", Data Plane Development Kit 24.07.0
>> documentation (dpdk.org).
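On the sysfs point above, a minimal sketch (not the RFC code) of how the
L3 sharing can be read on Linux: each CPU's cache directory under
`/sys/devices/system/cpu/cpuN/cache/indexM/` exposes `level` and
`shared_cpu_list` files. The helper names `parse_cpu_list`, `group_by_l3`
and `read_l3_lists` are hypothetical, and the sample strings at the end
are invented for illustration, not taken from a real machine.

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a kernel cpulist string such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_l3(shared_lists):
    """Turn {cpu: shared_cpu_list string} into a sorted list of distinct L3 groups."""
    groups = {tuple(parse_cpu_list(text)) for text in shared_lists.values()}
    return [list(group) for group in sorted(groups)]


def read_l3_lists(root="/sys/devices/system/cpu"):
    """Collect shared_cpu_list for every CPU's level-3 cache entry (Linux only)."""
    result = {}
    pattern = os.path.join(root, "cpu[0-9]*", "cache", "index*", "level")
    for level_path in glob.glob(pattern):
        with open(level_path) as f:
            if f.read().strip() != "3":
                continue
        index_dir = os.path.dirname(level_path)
        # index_dir looks like .../cpu12/cache/index3; recover the "12".
        cpu_dir = os.path.dirname(os.path.dirname(index_dir))
        cpu = int(os.path.basename(cpu_dir)[len("cpu"):])
        with open(os.path.join(index_dir, "shared_cpu_list")) as f:
            result[cpu] = f.read().strip()
    return result


# Invented sample input: two L3 domains of four CPUs each.
sample = {0: "0-3", 1: "0-3", 4: "4-7", 5: "4-7"}
print(group_by_l3(sample))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

On a real system, `group_by_l3(read_l3_lists())` would give the same kind
of grouping from whatever the kernel reports, independent of the hwloc or
libnuma version installed.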
>>
>>> Because if it does, I don't really think any changes are
>>> required because NUMA nodes would give you the same thing, would it
>>> not?
>>
>> I have a different opinion to this outlook. An end user can
>>
>> 1. Identify the lcores and their NUMA node using
>> `usertools/cpu-layout.py`.
>
> I recently submitted an enhancement for CPU layout script to print out
> NUMA separately from physical socket [1].
>
> [1]
> https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/
>
> I believe when "L3 as NUMA" is enabled in BIOS, the script will display
> both physical package ID as well as NUMA nodes reported by the system,
> which will be different from physical package ID, and which will display
> information you were looking for.

As AMD, we had submitted earlier work on the same via "usertools: enhance
logic to display NUMA" (Patchwork, dpdk.org). It clearly distinguished
NUMA from physical socket.

>> 2. But it is the core mask in EAL arguments which makes the threads
>> available to be used in a process.
>
> See above: if the OS already reports NUMA information, this is not a
> problem to be solved, CPU layout script can give this information to the
> user.

Agreed, but as pointed out, in the case of Intel Xeon Platinum SPR the
tile consists of CPU, memory, PCIe and accelerator. Hence, with the BIOS
option `Cluster per NUMA` set, the OS kernel and libnuma display the
appropriate domain with memory, PCIe and CPU. In the case of AMD SoC, the
NUMA reported by libnuma for the CPU is different from the memory NUMA
per socket.

>> 3. There is no API which distinguishes the L3 NUMA domain. Function
>> `rte_socket_id` for CPU tiles like AMD SoC will return the physical
>> socket.
>
> Sure, but I would think the answer to that would be to introduce an API
> to distinguish between NUMA (socket ID in DPDK parlance) and package
> (physical socket ID in the "traditional NUMA" sense).
> Once we can distinguish between those, DPDK can just rely on NUMA
> information provided by the OS, while still being capable of
> identifying physical sockets if the user so desires.

Agreed, +1 for the idea of physical socket and changes in the library to
exploit the same.

> I am actually going to introduce API to get *physical socket* (as
> opposed to NUMA node) in the next few days.

But how does it solve the end customer issues?

1. If there are multiple NICs or accelerators on multiple sockets, but
the IO tile is partitioned into sub-domains.

2. If RTE_FLOW steering is applied on a NIC whose traffic needs to be
processed under the same L3, which reduces noisy neighbors and gives
better cache hits.

3. For the PKT-distribute library, which needs to run within the same
worker lcore set as RX-Distributor-TX.

The current RFC addresses the above by helping end users identify the
lcores within the same L3 domain under a NUMA|physical socket,
irrespective of the BIOS setting.

>> Example: In AMD EPYC Genoa, there are a total of 13 tiles: 12 CPU
>> tiles and 1 IO tile. Setting
>>
>> 1. NPS to 4 will divide the memory, PCIe and accelerator into 4
>> domains, while all the CPUs will appear as a single NUMA, but with
>> each of the 12 tiles having independent L3 caches.
>>
>> 2. Setting `L3 as NUMA` allows each tile to appear as separate L3
>> clusters.
>>
>> Hence, adding an API which allows selecting available lcores based on
>> split L3 is essential irrespective of the BIOS setting.
>
> I think the crucial issue here is the "irrespective of BIOS setting"
> bit.

That is what the current RFC achieves.

> If EAL is getting into the game of figuring out exact intricacies
> of physical layout of the system, then there's a lot more work to be
> done as there are lots of different topologies, as other people have
> already commented, and such an API needs *a lot* of thought put into it.

There are standard sysfs interfaces for CPU cache topology (OS kernel);
as mentioned earlier, the problem with hwloc and libnuma is that
different distros have different versions. There are solutions for
specific SoC architectures as per the latest comment. But we can always
limit the API to selected SoCs, while on all other SoCs it will invoke
rte_get_next_lcore.

> If, on the other hand, we leave this issue to the kernel, and only
> gather NUMA information provided by the kernel, then nothing has to be
> done - DPDK already supports all of this natively, provided the user has
> configured the system correctly.

As shared above, we tried to bring this in via "usertools: enhance logic
to display NUMA" (Patchwork, dpdk.org). DPDK support for lcores is being
enhanced to allow the user to use more favorable lcores within the same
tile.

> Moreover, arguably DPDK already works that way: technically you can get
> physical socket information even absent of NUMA support in BIOS, but
> DPDK does not do that. Instead, if OS reports NUMA node as 0, that's
> what we're going with (even if we could detect multiple sockets from
> sysfs),

In the above argument, the point is that the OS kernel detects the NUMA
or domain, which is then used by DPDK, right? The suggested RFC adheres
to the same: what the OS sees. Can you please explain, for better
understanding, what the RFC is doing differently?

> and IMO it should stay that way unless there is a strong
> argument otherwise.

Totally agree, that is what the RFC is also doing: we use what the OS
sees as NUMA. The only addition is that, within the NUMA, if there are
split LLCs, it allows selection of those lcores rather than blindly
choosing an lcore using rte_get_next_lcore.

> We force the user to configure their system
> correctly as it is, and I see no reason to second-guess user's BIOS
> configuration otherwise.
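
On the split-LLC selection and the fallback to plain lcore iteration
discussed above, here is a small Python model of the intended behaviour.
It is a sketch under stated assumptions, not the RFC's C code and not a
DPDK API: `next_lcore` is a simplified stand-in for plain iteration
(it ignores the `skip_main` and `wrap` arguments that
rte_get_next_lcore takes), `next_lcore_same_l3` is a hypothetical name,
and the lcore numbers and L3 groups are invented.

```python
def next_lcore(current, enabled):
    """Plain iteration: next enabled lcore after `current`, or None at the end."""
    later = [c for c in sorted(enabled) if c > current]
    return later[0] if later else None


def next_lcore_same_l3(current, enabled, l3_groups):
    """Next enabled lcore that shares an L3 group with `current`.

    Falls back to plain iteration when no L3 group is known for `current`,
    for example on a platform where the cache topology is not available.
    """
    for group in l3_groups:
        if current in group:
            later = [c for c in sorted(group) if c > current and c in enabled]
            return later[0] if later else None
    return next_lcore(current, enabled)


# Invented example: lcores 0-3 share one L3, lcores 4-7 share another.
enabled = {0, 1, 2, 4, 5}
l3_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]

print(next_lcore_same_l3(1, enabled, l3_groups))  # 2
print(next_lcore_same_l3(2, enabled, l3_groups))  # None
print(next_lcore(2, enabled))                     # 4
print(next_lcore_same_l3(1, enabled, []))         # 2
```

The second and third calls show the difference: plain iteration crosses
into the next L3 domain (lcore 4), while the L3-aware variant stops at
the domain boundary. With no group information, both behave the same.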

To reiterate, the changes suggested in the RFC are agnostic to which BIOS
options are used. The BIOS settings came up only in answer to the earlier
question `is AMD configuration same as Intel tile`, where I explained
that it is not.

> --
> Thanks,
> Anatoly
>