DPDK patches and discussions
From: Ferruh Yigit <ferruh.yigit@amd.com>
To: "Burakov, Anatoly" <anatoly.burakov@intel.com>,
	"Varghese, Vipin" <vipin.varghese@amd.com>,
	dev@dpdk.org
Cc: "Mattias Rönnblom" <hofors@lysator.liu.se>
Subject: Re: [RFC 0/2] introduce LLC aware functions
Date: Thu, 5 Sep 2024 14:05:46 +0100	[thread overview]
Message-ID: <3eae1577-f06f-48f2-863a-faf70b97bc72@amd.com> (raw)
In-Reply-To: <3edc8a89-7d10-47f4-8f95-856c2a7fc7ba@intel.com>

On 9/3/2024 9:50 AM, Burakov, Anatoly wrote:
> On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
>> <snipped>
>>>>
>>>>> I recently looked into how Intel's Sub-NUMA Clustering would work
>>>>> within
>>>>> DPDK, and found that I actually didn't have to do anything, because
>>>>> the
>>>>> SNC "clusters" present themselves as NUMA nodes, which DPDK already
>>>>> supports natively.
>>>>
>>>> yes, this is correct. In Intel Xeon Platinum BIOS one can enable
>>>> `Cluster per NUMA` as `1, 2 or 4`.
>>>>
>>>> This divides the tiles into Sub-NUMA partitions, each having
>>>> separate lcores, memory controllers, PCIe and accelerators.
>>>>
>>>>>
>>>>> Does AMD's implementation of chiplets not report themselves as
>>>>> separate
>>>>> NUMA nodes?
>>>>
>>>> In AMD EPYC SoC, this is different. There are 2 BIOS settings,
>>>> namely
>>>>
>>>> 1. NPS: `NUMA Per Socket`, which allows the IO tile (memory, PCIe
>>>> and accelerator) to be partitioned as NPS 0, 1, 2 or 4.
>>>>
>>>> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This
>>>> allows all CPU tiles to be independent NUMA nodes.
>>>>
>>>> The above settings are possible because the CPU tiles are
>>>> independent from the IO tile, thus allowing 4 combinations to be
>>>> available for use.
>>>
>>> Sure, but presumably if the user wants to distinguish this, they have to
>>> configure their system appropriately. If the user wants to take advantage of
>>> L3 as NUMA (which is what your patch proposes), then they can enable the
>>> BIOS knob and get that functionality for free. DPDK already supports
>>> this.
>>>
>> The intent of the RFC is to introduce the ability to select lcores
>> within the same L3 cache whether the BIOS is set or unset for `L3 as
>> NUMA`. This is also achieved and tested on platforms which advertise
>> this via sysfs by the OS kernel, thus eliminating the dependency on
>> hwloc and libnuma, which can be different versions in different
>> distros.
> 
> But we do depend on libnuma, so we might as well depend on it? Are there
> different versions of libnuma that interfere with what you're trying to
> do? You keep coming back to this "whether the BIOS is set or unset" for
> L3 as NUMA, but I'm still unclear as to what issues your patch is
> solving assuming "knob is set". When the system is configured correctly,
> it already works and reports cores as part of NUMA nodes (as L3)
> correctly. It is only when the system is configured *not* to do that
> that issues arise, is it not? In which case IMO the easier solution
> would be to just tell the user to enable that knob in BIOS?
> 
>>
>>
>>>>
>>>> These are covered in the tuning guide for the SoC, in "12. How to
>>>> get best performance on AMD platform" in the DPDK 24.07.0
>>>> documentation
>>>> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.
>>>>
>>>>
>>>>> Because if it does, I don't really think any changes are
>>>>> required, because NUMA nodes would give you the same thing, would
>>>>> they not?
>>>>
>>>> I have a different opinion on this outlook. An end user can
>>>>
>>>> 1. Identify the lcores and their NUMA using `usertools/cpu_layout.py`
>>>
>>> I recently submitted an enhancement for the CPU layout script to print out
>>> NUMA separately from physical socket [1].
>>>
>>> [1]
>>> https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/
>>>
>>> I believe when "L3 as NUMA" is enabled in the BIOS, the script will
>>> display both the physical package ID as well as the NUMA nodes
>>> reported by the system, which will be different from the physical
>>> package ID, and which will display the information you were looking
>>> for.
>>
>> As AMD, we had submitted earlier work on the same via usertools:
>> enhance logic to display NUMA - Patchwork (dpdk.org)
>> <https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>.
>>
>> This clearly distinguished NUMA and physical socket.
> 
> Oh, cool, I didn't see that patch. I would argue my visual format is
> more readable though, so perhaps we can get that in :)
> 
>> Agreed, but as pointed out in the case of Intel Xeon Platinum SPR,
>> the tile consists of CPU, memory, PCIe and accelerator. Hence, when
>> the BIOS option `Cluster per NUMA` is set, the OS kernel & libnuma
>> display the appropriate domain with memory, PCIe and CPU.
>>
>> In the case of AMD SoC, libnuma for CPU is different from memory NUMA
>> per socket.
> 
> I'm curious how the kernel handles this then, and what you are
> getting from libnuma. You seem to be implying that there are two
> different NUMA nodes on your SoC, and either the kernel or libnuma is
> in conflict as to what belongs to what NUMA node?
> 
>>
>>>
>>>>
>>>> 3. There is no API which distinguishes the L3 NUMA domain. Function
>>>> `rte_socket_id`
>>>> <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>
>>>> for CPU tiles like AMD SoC will return the physical socket.
>>>
>>> Sure, but I would think the answer to that would be to introduce an API
>>> to distinguish between NUMA (socket ID in DPDK parlance) and package
>>> (physical socket ID in the "traditional NUMA" sense). Once we can
>>> distinguish between those, DPDK can just rely on NUMA information
>>> provided by the OS, while still being capable of identifying physical
>>> sockets if the user so desires.
>> Agreed, +1 for the idea of a physical socket API and changes in the
>> library to exploit the same.
>>>
>>> I am actually going to introduce API to get *physical socket* (as
>>> opposed to NUMA node) in the next few days.
>>>
>> But how does it solve the end customer's issues?
>>
>> 1. If there are multiple NICs or accelerators on multiple sockets,
>> but the IO tile is partitioned into sub-domains.
> 
> At least on Intel platforms, NUMA node gets assigned correctly - that
> is, if my Xeon with SNC enabled has NUMA nodes 3,4 on socket 1, and
> there's a NIC connected to socket 1, it's going to show up as being on
> NUMA node 3 or 4 depending on where exactly I plugged it in. Everything
> already works as expected, and there is no need for any changes for
> Intel platforms (at least none that I can see).
> 
> My proposed API is really for those users who wish to explicitly allow
> for reserving memory/cores on "the same physical socket", as "on the
> same tile" is already taken care of by NUMA nodes.
> 
>>
>> 2. If RTE_FLOW steering is applied on a NIC which needs to be
>> processed under the same L3 - reducing noisy neighbors and getting
>> better cache hits.
>>
>> 3. For the PKT-distribute library, which needs to run within the same
>> worker lcore set as RX-Distributor-TX.
>>
> 
> Same as above: on Intel platforms, NUMA nodes already solve this.
> 
> <snip>
> 
>> Totally agree, that is what the RFC is also doing: based on what the
>> OS sees as NUMA, we are using it.
>>
>> The only addition is, within the NUMA node, if there are split LLCs,
>> to allow selection of those lcores, rather than blindly choosing an
>> lcore using rte_get_next_lcore.
> 
> It feels like we're working around a problem that shouldn't exist in
> the first place, because the kernel should already report this
> information. Within the NUMA subsystem, there is a sysfs node
> "distance" that, at least on Intel platforms and in certain BIOS
> configurations, reports the distance between NUMA nodes, from which
> one can make inferences about how far a specific NUMA node is from any
> other NUMA node. This could have been used to encode L3 cache
> information. Do AMD platforms not do that? In that case, "lcore next"
> for a particular socket ID (NUMA node, in reality) should already get
> us any cores that are close to each other, because all of this
> information is already encoded in NUMA nodes by the system.
> 
> I feel like there's a disconnect between my understanding of the problem
> space, and yours, so I'm going to ask a very basic question:
> 
> Assuming the user has configured their AMD system correctly (i.e.
> enabled L3 as NUMA), are there any problems to be solved by adding a
> new API? Does the system not report each L3 as a separate NUMA node?
> 

Hi Anatoly,

Let me try to answer.

To start with, Intel "Sub-NUMA Clustering" and AMD NUMA are different;
as far as I understand, SNC is more similar to classic physical socket
based NUMA.

The following is the AMD CPU layout:
      ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      │TILE1││TILE2││          ││TILE5││TILE6│
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      └─────┘└─────┘│    IO    │└─────┘└─────┘
      ┌─────┐┌─────┐│   TILE   │┌─────┐┌─────┐
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      │TILE3││TILE4││          ││TILE7││TILE8│
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      └─────┘└─────┘└──────────┘└─────┘└─────┘

Each 'tile' has multiple cores, and the 'IO tile' has the memory
controller, bus controllers, etc.

When NPS=x is configured in the BIOS, the IO tile resources are split
and each partition is seen as a NUMA node.

The following is NPS=4:
      ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
      │     ││     ││     .    ││     ││     │
      │     ││     ││     .    ││     ││     │
      │TILE1││TILE2││     .    ││TILE5││TILE6│
      │     ││     ││NUMA .NUMA││     ││     │
      │     ││     ││ 0   . 1  ││     ││     │
      │     ││     ││     .    ││     ││     │
      └─────┘└─────┘│     .    │└─────┘└─────┘
      ┌─────┐┌─────┐│..........│┌─────┐┌─────┐
      │     ││     ││     .    ││     ││     │
      │     ││     ││NUMA .NUMA││     ││     │
      │TILE3││TILE4││ 2   . 3  ││TILE7││TILE8│
      │     ││     ││     .    ││     ││     │
      │     ││     ││     .    ││     ││     │
      │     ││     ││     .    ││     ││     │
      └─────┘└─────┘└─────.────┘└─────┘└─────┘

The benefit of this approach is that now all cores can access all NUMA
nodes without any penalty. For example, a DPDK application can use
cores from 'TILE1', 'TILE4' & 'TILE7' to access NUMA0 (or any NUMA
node's) resources with high performance.
This is different from SNC, where cores accessing cross-NUMA resources
are hit by a performance penalty.
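
To illustrate what this buys an application, here is a minimal sketch
(mine, not part of the RFC) using the existing socket-aware allocator:
with NPS=4, a worker on any tile can use memory from any IO-tile NUMA
node with uniform performance.

    #include <rte_common.h>
    #include <rte_malloc.h>

    /* Sketch: allocate a buffer on NUMA node 0. With NPS=4, a worker
     * running on TILE1, TILE4 or TILE7 can use this memory equally
     * well, since CPU tiles are not tied to one IO-tile NUMA node. */
    static void *
    alloc_on_numa0(size_t len)
    {
        return rte_malloc_socket("buf", len, RTE_CACHE_LINE_SIZE,
                                 0 /* NUMA (socket) id */);
    }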

Now, although which tile the cores come from doesn't matter from a NUMA
perspective, it may matter (based on the workload) to have them under
the same LLC.

One way to make sure all cores are under the same LLC is to enable the
"L3 as NUMA" BIOS option, which will make each tile show up as a
different NUMA node, and let the user select cores from one NUMA node.
This is sufficient up to a point, but not enough when the application
needs a number of cores that spans multiple tiles.
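
With "L3 as NUMA" enabled, picking same-LLC cores reduces to filtering
lcores by the NUMA node the OS reports. A minimal sketch with existing
EAL APIs ('target_numa' is an assumed parameter of mine):

    #include <rte_lcore.h>

    /* Sketch: collect worker lcores that the OS reports on one NUMA
     * node. With "L3 as NUMA", each tile is a NUMA node, so the cores
     * returned here share an LLC. */
    static unsigned int
    collect_lcores_on_numa(unsigned int target_numa, unsigned int *out,
                           unsigned int cap)
    {
        unsigned int lcore_id, n = 0;

        RTE_LCORE_FOREACH_WORKER(lcore_id) {
            if (rte_lcore_to_socket_id(lcore_id) == target_numa &&
                n < cap)
                out[n++] = lcore_id;
        }
        return n;
    }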

Assume each tile has 8 cores, and the application needs 24 cores. When
the user provides all cores from TILE1, TILE2 & TILE3, in DPDK right
now there is no way for the application to figure out how to
group/select these cores to use them efficiently.

Indeed, this is what Vipin is enabling: from a core, he is finding the
list of cores that will work efficiently with that core. In this
perspective, this is nothing really related to NUMA configuration, and
nothing really specific to AMD, as the defined Linux sysfs interface is
used for this.
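
For reference, the underlying mechanism is generic: the kernel exports
which CPUs share each cache level, so the LLC siblings of a core can be
read straight from sysfs. A rough sketch (the helper name is mine, not
the RFC's API; error handling trimmed):

    #include <stdio.h>

    /* Sketch: read the LLC (L3, sysfs cache index3) sibling list of a
     * CPU, e.g. "0-7", from the generic Linux cache topology. */
    static int
    llc_siblings(unsigned int cpu, char *buf, size_t len)
    {
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%u/cache/index3/shared_cpu_list",
                 cpu);
        f = fopen(path, "r");
        if (f == NULL)
            return -1;
        if (fgets(buf, (int)len, f) == NULL) {
            fclose(f);
            return -1;
        }
        fclose(f);
        return 0;
    }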

There are other architectures around that have a similar NUMA
configuration, and they can also use the same logic; at worst we can
introduce architecture-specific code so that all architectures have a
way to find the other cores that work more efficiently with a given
core. This is a useful feature for DPDK.

Let's look into another example: an application uses 24 cores in a
graph-library-like usage, where we want to group each set of three
cores to process a graph node. The application needs a way to select
which three cores work most efficiently with each other; that is what
this patch enables. In this case, enabling "L3 as NUMA" does not help
at all. With this patch both BIOS configs work, but of course the user
should select the cores to provide to the application based on the
configuration.
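
As an illustration of how an application might consume this, a sketch
building on the hypothetical llc_siblings() helper above (MAX_LLCS and
GROUP_SIZE are arbitrary; this is not the RFC's API):

    #include <string.h>
    #include <rte_lcore.h>

    #define MAX_LLCS   16
    #define GROUP_SIZE 3

    /* Sketch: bucket worker lcores by LLC, keyed on the sysfs sibling
     * string; each bucket can then be carved into GROUP_SIZE-core
     * sets, one per graph node. */
    static void
    bucket_lcores_by_llc(void)
    {
        char key[64], keys[MAX_LLCS][64];
        unsigned int buckets[MAX_LLCS][RTE_MAX_LCORE];
        unsigned int count[MAX_LLCS] = {0};
        unsigned int lcore_id, nllc = 0, i;

        RTE_LCORE_FOREACH_WORKER(lcore_id) {
            if (llc_siblings((unsigned int)rte_lcore_to_cpu_id(lcore_id),
                             key, sizeof(key)) != 0)
                continue;
            for (i = 0; i < nllc; i++)
                if (strcmp(keys[i], key) == 0)
                    break;
            if (i == nllc && nllc < MAX_LLCS)
                strcpy(keys[nllc++], key);
            if (i < MAX_LLCS)
                buckets[i][count[i]++] = lcore_id;
        }
        /* buckets[i][0..count[i]) share an LLC; hand out GROUP_SIZE
         * consecutive entries per graph node. */
        (void)buckets;
    }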


And we can even improve this effective core selection: as Mattias
suggested, we can select cores that share L2 caches, with an expansion
of this patch. This is unrelated to NUMA, and again it does not
introduce architecture details into DPDK, as this implementation
already relies on the Linux sysfs interface.
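
In the sysfs sketch above, that would simply mean reading
"cache/index2/shared_cpu_list" (L2) instead of "index3" (L3), since the
kernel exposes every cache level through the same interface.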

I hope this clarifies it a little more.


Thanks,
ferruh

Thread overview: 53+ messages
2024-08-27 15:10 Vipin Varghese
2024-08-27 15:10 ` [RFC 1/2] eal: add llc " Vipin Varghese
2024-08-27 17:36   ` Stephen Hemminger
2024-09-02  0:27     ` Varghese, Vipin
2024-08-27 20:56   ` Wathsala Wathawana Vithanage
2024-08-29  3:21     ` Re: " Feifei Wang
2024-09-02  1:20     ` Varghese, Vipin
2024-09-03 17:54       ` Wathsala Wathawana Vithanage
2024-09-04  8:18         ` Bruce Richardson
2024-09-06 11:59         ` Varghese, Vipin
2024-09-12 16:58           ` Wathsala Wathawana Vithanage
2024-08-27 15:10 ` [RFC 2/2] eal/lcore: add llc aware for each macro Vipin Varghese
2024-08-27 21:23 ` [RFC 0/2] introduce LLC aware functions Mattias Rönnblom
2024-09-02  0:39   ` Varghese, Vipin
2024-09-04  9:30     ` Mattias Rönnblom
2024-09-04 14:37       ` Stephen Hemminger
2024-09-11  3:13         ` Varghese, Vipin
2024-09-11  3:53           ` Stephen Hemminger
2024-09-12  1:11             ` Varghese, Vipin
2024-09-09 14:22       ` Varghese, Vipin
2024-09-09 14:52         ` Mattias Rönnblom
2024-09-11  3:26           ` Varghese, Vipin
2024-09-11 15:55             ` Mattias Rönnblom
2024-09-11 17:04               ` Honnappa Nagarahalli
2024-09-12  1:33                 ` Varghese, Vipin
2024-09-12  6:38                   ` Mattias Rönnblom
2024-09-12  7:02                     ` Mattias Rönnblom
2024-09-12 11:23                       ` Varghese, Vipin
2024-09-12 12:12                         ` Mattias Rönnblom
2024-09-12 15:50                           ` Stephen Hemminger
2024-09-12 11:17                     ` Varghese, Vipin
2024-09-12 11:59                       ` Mattias Rönnblom
2024-09-12 13:30                         ` Bruce Richardson
2024-09-12 16:32                           ` Mattias Rönnblom
2024-09-12  2:28                 ` Varghese, Vipin
2024-09-11 16:01             ` Bruce Richardson
2024-09-11 22:25               ` Konstantin Ananyev
2024-09-12  2:38                 ` Varghese, Vipin
2024-09-12  2:19               ` Varghese, Vipin
2024-09-12  9:17                 ` Bruce Richardson
2024-09-12 11:50                   ` Varghese, Vipin
2024-09-13 14:15                     ` Burakov, Anatoly
2024-09-12 13:18                   ` Mattias Rönnblom
2024-08-28  8:38 ` Burakov, Anatoly
2024-09-02  1:08   ` Varghese, Vipin
2024-09-02 14:17     ` Burakov, Anatoly
2024-09-02 15:33       ` Varghese, Vipin
2024-09-03  8:50         ` Burakov, Anatoly
2024-09-05 13:05           ` Ferruh Yigit [this message]
2024-09-05 14:45             ` Burakov, Anatoly
2024-09-05 15:34               ` Ferruh Yigit
2024-09-06  8:44                 ` Burakov, Anatoly
2024-09-09 14:14                   ` Varghese, Vipin
