DPDK patches and discussions
From: "Varghese, Vipin" <vipin.varghese@amd.com>
To: "Burakov, Anatoly" <anatoly.burakov@intel.com>,
	ferruh.yigit@amd.com, dev@dpdk.org
Subject: Re: [RFC 0/2] introduce LLC aware functions
Date: Mon, 2 Sep 2024 21:03:24 +0530	[thread overview]
Message-ID: <db3af9ba-6b7b-4c43-bce0-d85d222cfa99@amd.com> (raw)
In-Reply-To: <8addd7f6-fac8-45ec-a44f-f81eb008cc36@intel.com>

<snipped>
>>
>>> I recently looked into how Intel's Sub-NUMA Clustering would work 
>>> within
>>> DPDK, and found that I actually didn't have to do anything, because the
>>> SNC "clusters" present themselves as NUMA nodes, which DPDK already
>>> supports natively.
>>
>> Yes, this is correct. In the Intel Xeon Platinum BIOS one can set
>> `Cluster per NUMA` to `1, 2 or 4`.
>>
>> This divides the tiles into Sub-NUMA partitions, each having separate
>> lcores, memory controllers, PCIe and accelerators.
>>
>>>
>>> Does AMD's implementation of chiplets not report themselves as separate
>>> NUMA nodes?
>>
>> In AMD EPYC SoC, this is different. There are 2 BIOS settings, namely:
>>
>> 1. NPS: `Numa Per Socket`, which allows the IO tile (memory, PCIe and
>> accelerators) to be partitioned as NUMA 0, 1, 2 or 4.
>>
>> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
>> all CPU tiles to be independent NUMA domains.
>>
>>
>> The above settings are possible because the CPU tiles are independent of
>> the IO tile, thus allowing 4 combinations to be available for use.
>
> Sure, but presumably if the user wants to distinguish this, they have to
> configure their system appropriately. If user wants to take advantage of
> L3 as NUMA (which is what your patch proposes), then they can enable the
> BIOS knob and get that functionality for free. DPDK already supports 
> this.
>
The intent of the RFC is to introduce the ability to select lcores within
the same L3 cache, whether the BIOS option `L3 as NUMA` is set or unset.
This is achieved using the cache topology that the OS kernel advertises
via sysfs, and was tested on platforms that expose it, thus eliminating
the dependency on hwloc and libnuma, which can be at different versions
in different distros.
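For illustration only: the kernel's sysfs cache topology lists the CPUs sharing a cache in `/sys/devices/system/cpu/cpuN/cache/indexM/shared_cpu_list`, using the kernel cpulist format (e.g. `0-7,16-23`). A minimal sketch of turning such a list into a bitmask without hwloc or libnuma might look like the following; `parse_cpulist` is a hypothetical helper, not code from the RFC, and it assumes at most 64 CPUs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a kernel cpulist string such as "0-7,16-23\n"
 * into a 64-bit mask. Only CPU ids 0..63 are supported in this sketch.
 * Returns 0 on success, -1 on malformed input or out-of-range ids.
 */
static int parse_cpulist(const char *s, uint64_t *mask)
{
    uint64_t m = 0;
    const char *p;

    assert(s != NULL && mask != NULL);
    p = s;
    while (*p != '\0' && *p != '\n') {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        unsigned long hi = lo;

        if (end == p)
            return -1;              /* expected a CPU id here */
        p = end;
        if (*p == '-') {            /* a range "lo-hi" */
            p++;
            hi = strtoul(p, &end, 10);
            if (end == p)
                return -1;
            p = end;
        }
        if (hi < lo || hi > 63)
            return -1;
        for (unsigned long c = lo; c <= hi; c++)
            m |= UINT64_C(1) << c;
        if (*p == ',')
            p++;                    /* next element of the list */
        else if (*p != '\0' && *p != '\n')
            return -1;
    }
    *mask = m;
    return 0;
}
```

With this, `parse_cpulist("0-3,8\n", &m)` yields `m == 0x10F`.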


>>
>> These are covered in the tuning guide for the SoC: "12. How to get best
>> performance on AMD platform", DPDK 24.07.0 documentation (dpdk.org)
>> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.
>>
>>
>>> Because if it does, I don't really think any changes are
>>> required because NUMA nodes would give you the same thing, would it 
>>> not?
>>
>> I have a different view on this. An end user can:
>>
>> 1. Identify the lcores and their NUMA nodes using `usertools/cpu-layout.py`
>
> I recently submitted an enhancement for CPU layout script to print out
> NUMA separately from physical socket [1].
>
> [1]
> https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/ 
>
>
> I believe when "L3 as NUMA" is enabled in BIOS, the script will display
> both physical package ID as well as NUMA nodes reported by the system,
> which will be different from physical package ID, and which will display
> information you were looking for.

At AMD, we had submitted earlier work on the same: "usertools: enhance
logic to display NUMA" (Patchwork, dpdk.org)
<https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>.

That patch clearly distinguished NUMA node from physical socket.

>
>>
>> 2. But it is the core mask in the EAL arguments that makes the threads
>> available for use by a process.
>
> See above: if the OS already reports NUMA information, this is not a
> problem to be solved, CPU layout script can give this information to the
> user.

Agreed, but as pointed out, in the case of Intel Xeon Platinum SPR each
tile consists of CPU, memory, PCIe and accelerator.

Hence, when the BIOS option `Cluster per NUMA` is set, the OS kernel and
libnuma display the appropriate domain with memory, PCIe and CPU together.

In the case of AMD SoC, the CPU NUMA view reported by libnuma can differ
from the memory NUMA-per-socket partitioning.

>
>>
>> 3. There is no API that distinguishes the L3 NUMA domain. The function
>> `rte_socket_id
>> <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>`
>> will return the physical socket for CPU tiles such as on AMD SoC.
>
> Sure, but I would think the answer to that would be to introduce an API
> to distinguish between NUMA (socket ID in DPDK parlance) and package
> (physical socket ID in the "traditional NUMA" sense). Once we can
> distinguish between those, DPDK can just rely on NUMA information
> provided by the OS, while still being capable of identifying physical
> sockets if the user so desires.
Agreed, +1 for the idea of a physical socket API and for changes in the
libraries to exploit it.
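As a rough sketch of the distinction under discussion, assuming Linux sysfs: the physical package of a CPU is in `/sys/devices/system/cpu/cpuN/topology/physical_package_id`, while on NUMA-enabled kernels the NUMA node appears as a `nodeM` link inside `/sys/devices/system/cpu/cpuN/`. The helper names below are hypothetical and are not the physical-socket API mentioned in this thread.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: physical package (socket) id of `cpu`, or -1 if unavailable. */
static int read_package_id(unsigned int cpu)
{
    char path[128];
    int id = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/topology/physical_package_id", cpu);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &id) != 1)
        id = -1;
    fclose(f);
    return id;
}

/* Hypothetical: NUMA node of `cpu` from its "nodeM" sysfs entry, or -1. */
static int read_numa_node(unsigned int cpu)
{
    char path[64];
    struct dirent *e;
    int node = -1;
    DIR *d;

    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%u", cpu);
    d = opendir(path);
    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL) {
        /* match an entry named "node" followed by a digit */
        if (strncmp(e->d_name, "node", 4) == 0 &&
            e->d_name[4] >= '0' && e->d_name[4] <= '9') {
            node = atoi(e->d_name + 4);
            break;
        }
    }
    closedir(d);
    return node;
}

/* Report both; with "L3 as NUMA" or NPS > 1 the two values may differ. */
static void cpu_socket_and_node(unsigned int cpu, int *pkg, int *node)
{
    assert(pkg != NULL && node != NULL);
    *pkg = read_package_id(cpu);
    *node = read_numa_node(cpu);
}
```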
>
> I am actually going to introduce API to get *physical socket* (as
> opposed to NUMA node) in the next few days.
>
But how does it solve the following end-customer issues?

1. If there are multiple NICs or accelerators on multiple sockets, but the
IO tile is partitioned into sub-domains.

2. If RTE_FLOW steering is applied on a NIC whose traffic needs to be
processed under the same L3, to reduce noisy neighbors and get better
cache hits.

3. For the packet distributor library, which needs to run RX, distributor
and TX within the same worker lcore set.


The current RFC addresses the above by helping end users identify the
lcores within the same L3 domain under a NUMA node or physical socket,
irrespective of the BIOS setting.
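Purely as an illustration of use case 3, assuming the set of CPUs sharing one L3 is already known as a 64-bit mask, a hypothetical helper could turn it into the range syntax accepted by the EAL `-l` (corelist) option, so that RX, distributor and TX lcores are all placed in one L3 domain. `mask_to_corelist` is not part of DPDK or the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 64-bit CPU mask as a corelist string such
 * as "4-7,12". Returns the number of characters written (excluding the
 * terminating NUL), or -1 if `buf` is too small.
 */
static int mask_to_corelist(uint64_t mask, char *buf, size_t len)
{
    size_t pos = 0;
    int first = 1;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (unsigned int c = 0; c < 64; c++) {
        if (!(mask & (UINT64_C(1) << c)))
            continue;
        unsigned int start = c;
        /* extend over a run of consecutive set bits */
        while (c + 1 < 64 && (mask & (UINT64_C(1) << (c + 1))))
            c++;
        int n = (start == c)
            ? snprintf(buf + pos, len - pos, "%s%u", first ? "" : ",", start)
            : snprintf(buf + pos, len - pos, "%s%u-%u", first ? "" : ",",
                       start, c);
        if (n < 0 || (size_t)n >= len - pos)
            return -1;              /* output truncated */
        pos += (size_t)n;
        first = 0;
    }
    return (int)pos;
}
```

For example, a mask of `0x10F0` becomes `4-7,12`.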

>>
>>
>> Example: In AMD EPYC Genoa, there are a total of 13 tiles: 12 CPU tiles
>> and 1 IO tile. Setting
>>
>> 1. NPS to 4 will divide the memory, PCIe and accelerators into 4 domains,
>> while all CPUs will appear as a single NUMA node, with each of the 12
>> tiles having an independent L3 cache.
>>
>> 2. `L3 as NUMA` allows each tile to appear as a separate L3 cluster.
>>
>>
>> Hence, adding an API that allows selecting available lcores based on
>> split L3 is essential, irrespective of the BIOS setting.
>>
>
> I think the crucial issue here is the "irrespective of BIOS setting"
> bit.

That is what the current RFC achieves.

> If EAL is getting into the game of figuring out exact intricacies
> of physical layout of the system, then there's a lot more work to be
> done as there are lots of different topologies, as other people have
> already commented, and such an API needs *a lot* of thought put into it.

There are standard sysfs interfaces for CPU cache topology (provided by
the OS kernel). As mentioned earlier, the problem with hwloc and libnuma
is that different distros ship different versions. There are also
solutions for specific SoC architectures, as per the latest comment.

But we can always limit the API to selected SoCs, while on all other SoCs
the API, when invoked, will fall back to rte_get_next_lcore.
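A minimal sketch of reading that standard sysfs interface directly, assuming Linux and the `level` and `shared_cpu_list` files under `/sys/devices/system/cpu/cpuN/cache/indexM/`. `read_l3_shared_cpus` is hypothetical; a return of -1 marks the case where a caller would fall back to plain iteration with rte_get_next_lcore. The limit of 16 cache indices is an arbitrary assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical: copy the shared_cpu_list of the level-3 cache of `cpu`
 * into buf. Returns 0 on success, -1 if no L3 entry is exposed (for
 * example on a non-Linux system), signalling the caller to fall back.
 */
static int read_l3_shared_cpus(unsigned int cpu, char *buf, size_t len)
{
    char path[128];

    assert(buf != NULL && len > 0);
    for (unsigned int idx = 0; idx < 16; idx++) {
        FILE *f;
        int level = -1;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%u/cache/index%u/level", cpu, idx);
        f = fopen(path, "r");
        if (f == NULL)
            break;                      /* no more cache indices */
        if (fscanf(f, "%d", &level) != 1)
            level = -1;
        fclose(f);
        if (level != 3)
            continue;                   /* L1/L2: keep looking */

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%u/cache/index%u/shared_cpu_list",
                 cpu, idx);
        f = fopen(path, "r");
        if (f == NULL)
            return -1;
        if (fgets(buf, (int)len, f) == NULL) {
            fclose(f);
            return -1;
        }
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0'; /* strip trailing newline */
        return 0;
    }
    return -1;
}
```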


>
> If, on the other hand, we leave this issue to the kernel, and only
> gather NUMA information provided by the kernel, then nothing has to be
> done - DPDK already supports all of this natively, provided the user has
> configured the system correctly.

As shared above, we tried to bring this in with "usertools: enhance logic
to display NUMA" (Patchwork, dpdk.org)
<https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>.


DPDK support for lcores is being enhanced to allow users to use more
favorable lcores within the same tile.


>
> Moreover, arguably DPDK already works that way: technically you can get
> physical socket information even in the absence of NUMA support in BIOS, but
> DPDK does not do that. Instead, if OS reports NUMA node as 0, that's
> what we're going with (even if we could detect multiple sockets from
> sysfs), 

In the above argument, it is stated that the OS kernel detects the NUMA
node or domain, which is then used by DPDK, right?

The suggested RFC also adheres to the same, i.e. what the OS sees. Can you
please explain, for better understanding, what the RFC is doing
differently?


> and IMO it should stay that way unless there is a strong
> argument otherwise.

Totally agree; that is what the RFC is also doing: we use whatever the OS
reports as NUMA.

The only addition is that, within a NUMA node with split LLCs, it allows
selecting those lcores, rather than blindly choosing lcores using
rte_get_next_lcore.
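To make the difference from plain iteration concrete, here is a hypothetical analogue of rte_get_next_lcore restricted to one LLC. It assumes the enabled lcores and the CPUs sharing the LLC (already limited to the NUMA node the OS reports) are given as 64-bit masks; `next_lcore_same_llc` and `MAX_LCORE` are illustrative stand-ins, not the RFC's actual API.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LCORE 64u   /* stand-in for RTE_MAX_LCORE in this sketch */

/*
 * Hypothetical: return the next lcore after `prev` that is both enabled
 * and inside `llc_mask`, or MAX_LCORE when none remain.
 * Pass prev = -1 to start the iteration.
 */
static unsigned int next_lcore_same_llc(int prev, uint64_t enabled_mask,
                                        uint64_t llc_mask)
{
    uint64_t candidates = enabled_mask & llc_mask;

    assert(prev >= -1 && prev < (int)MAX_LCORE);
    for (unsigned int i = (unsigned int)(prev + 1); i < MAX_LCORE; i++)
        if (candidates & (UINT64_C(1) << i))
            return i;
    return MAX_LCORE;
}
```

A caller would loop with `for (lc = next_lcore_same_llc(-1, en, llc); lc < MAX_LCORE; lc = next_lcore_same_llc((int)lc, en, llc))`. With lcores 0-7 enabled and an LLC mask of `0xF0`, this visits lcores 4, 5, 6 and 7 only.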


> We force the user to configure their system
> correctly as it is, and I see no reason to second-guess user's BIOS
> configuration otherwise.

To reiterate, the changes suggested in the RFC are agnostic to which BIOS
options are used.

Regarding the earlier question, `is AMD configuration same as Intel tile`,
I have explained, in terms of the BIOS settings, that it is not.


>
> -- 
> Thanks,
> Anatoly
>


Thread overview: 53+ messages
2024-08-27 15:10 Vipin Varghese
2024-08-27 15:10 ` [RFC 1/2] eal: add llc " Vipin Varghese
2024-08-27 17:36   ` Stephen Hemminger
2024-09-02  0:27     ` Varghese, Vipin
2024-08-27 20:56   ` Wathsala Wathawana Vithanage
2024-08-29  3:21     ` Re: " Feifei Wang
2024-09-02  1:20     ` Varghese, Vipin
2024-09-03 17:54       ` Wathsala Wathawana Vithanage
2024-09-04  8:18         ` Bruce Richardson
2024-09-06 11:59         ` Varghese, Vipin
2024-09-12 16:58           ` Wathsala Wathawana Vithanage
2024-08-27 15:10 ` [RFC 2/2] eal/lcore: add llc aware for each macro Vipin Varghese
2024-08-27 21:23 ` [RFC 0/2] introduce LLC aware functions Mattias Rönnblom
2024-09-02  0:39   ` Varghese, Vipin
2024-09-04  9:30     ` Mattias Rönnblom
2024-09-04 14:37       ` Stephen Hemminger
2024-09-11  3:13         ` Varghese, Vipin
2024-09-11  3:53           ` Stephen Hemminger
2024-09-12  1:11             ` Varghese, Vipin
2024-09-09 14:22       ` Varghese, Vipin
2024-09-09 14:52         ` Mattias Rönnblom
2024-09-11  3:26           ` Varghese, Vipin
2024-09-11 15:55             ` Mattias Rönnblom
2024-09-11 17:04               ` Honnappa Nagarahalli
2024-09-12  1:33                 ` Varghese, Vipin
2024-09-12  6:38                   ` Mattias Rönnblom
2024-09-12  7:02                     ` Mattias Rönnblom
2024-09-12 11:23                       ` Varghese, Vipin
2024-09-12 12:12                         ` Mattias Rönnblom
2024-09-12 15:50                           ` Stephen Hemminger
2024-09-12 11:17                     ` Varghese, Vipin
2024-09-12 11:59                       ` Mattias Rönnblom
2024-09-12 13:30                         ` Bruce Richardson
2024-09-12 16:32                           ` Mattias Rönnblom
2024-09-12  2:28                 ` Varghese, Vipin
2024-09-11 16:01             ` Bruce Richardson
2024-09-11 22:25               ` Konstantin Ananyev
2024-09-12  2:38                 ` Varghese, Vipin
2024-09-12  2:19               ` Varghese, Vipin
2024-09-12  9:17                 ` Bruce Richardson
2024-09-12 11:50                   ` Varghese, Vipin
2024-09-13 14:15                     ` Burakov, Anatoly
2024-09-12 13:18                   ` Mattias Rönnblom
2024-08-28  8:38 ` Burakov, Anatoly
2024-09-02  1:08   ` Varghese, Vipin
2024-09-02 14:17     ` Burakov, Anatoly
2024-09-02 15:33       ` Varghese, Vipin [this message]
2024-09-03  8:50         ` Burakov, Anatoly
2024-09-05 13:05           ` Ferruh Yigit
2024-09-05 14:45             ` Burakov, Anatoly
2024-09-05 15:34               ` Ferruh Yigit
2024-09-06  8:44                 ` Burakov, Anatoly
2024-09-09 14:14                   ` Varghese, Vipin
