[Public]

> > > >>> Thank you Mattias for the comments and question, please let me
> > > >>> try to explain the same below
> > > >>>
> > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> > > >>>
> > > >>> Based on the intention to bring in CPU lcores which share the same
> > > >>> L3 (for better cache hits and less noisy neighbors), the current
> > > >>> API focuses on using the Last Level Cache. But if the suggestion
> > > >>> is `there are SoCs where the L2 cache is also shared, and the new
> > > >>> API should be provisioned`, I am also comfortable with the
> > > >>> thought.
> > > >>>
> > > >>
> > > >> Rather than some AMD special case API hacked into ,
> > > >> I think we are better off with no DPDK API at all for this kind of
> > > >> functionality.
> > > >
> > > > Hi Mattias, as shared in the earlier email thread, this is not an
> > > > AMD special case at all. Let me try to explain this one more time.
> > > > One of the techniques used to increase core count in a
> > > > cost-effective way is to go for tiles of compute complexes. This
> > > > introduces a bunch of cores sharing the same Last Level Cache
> > > > (namely L2, L3 or even L4), depending upon the cache topology
> > > > architecture.
> > > >
> > > > The API suggested in the RFC is to help end users selectively use
> > > > cores under the same Last Level Cache hierarchy, as advertised by
> > > > the OS (irrespective of the BIOS settings used). This is useful in
> > > > both bare-metal and container environments.
> > > >
> > >
> > > I'm pretty familiar with AMD CPUs and the use of tiles (including
> > > the challenges these kinds of non-uniformities pose for work
> > > scheduling).
> > >
> > > To maximize performance, caring about the core<->LLC relationship may
> > > well not be enough, and more HT/core/cache/memory topology
> > > information is required. That's what I meant by special case.
> > > A proper API should allow access to information about which lcores
> > > are SMT siblings, cores on the same L2, and cores on the same L3, to
> > > name a few things. Probably you want to fit NUMA into the same API
> > > as well, although that is available already in .
> >
> > Thank you Mattias for the information. As shared in the reply to
> > Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which
> > takes an extra argument `u32 flags`. The flags can be
> > RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3,
> > RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
> >
>
> For the naming, would "rte_get_next_sibling_core" (or lcore if you
> prefer) be a clearer name than just adding "ex" on to the end of the
> existing function?

Thank you Bruce, please find my answer below.

Functions shared as per the RFC were
```
- rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC.
- rte_get_llc_lcore: Retrieves all lcores that share the LLC.
- rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC.
```

Macros extending the usability were
```
RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcores from each LLC.
RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcores from each LLC.
RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from the LLC based on a hint (lcore id).
RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from the LLC while skipping the first worker.
RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC.
RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores, then iterates through the remaining lcores in each LLC.
```

Based on the discussions we agreed on sharing a version-2 RFC extending the API as `rte_get_next_lcore_extnd` with an extra argument `flags`. As per my ideation, the suggested `rte_get_next_sibling_core` can easily be covered with the flag `RTE_GET_LCORE_L1 (SMT)`. Is this the right understanding?
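To make the flag-based iteration concrete, here is a minimal self-contained sketch of how `rte_get_next_lcore_extnd` could behave. This is not DPDK code: the flag values, the hard-coded topology tables, and the `get_next_lcore_extnd` helper are all assumptions of mine for illustration (a real implementation would read the topology from the OS, and DPDK's convention is to return RTE_MAX_LCORE on exhaustion).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values -- the RFC only names the flags. */
#define RTE_GET_LCORE_L2 (1u << 1)
#define RTE_GET_LCORE_L3 (1u << 2)

/* Mock topology: 8 lcores, two L3 complexes of 4 cores, private L2s.
 * A real implementation would build this from sysfs/hwloc. */
#define NUM_LCORES 8
static const unsigned int l3_id[NUM_LCORES] = { 0, 0, 0, 0, 1, 1, 1, 1 };
static const unsigned int l2_id[NUM_LCORES] = { 0, 1, 2, 3, 4, 5, 6, 7 };

/* Return the next lcore after 'i' sharing the flagged cache level(s)
 * with lcore 'ref', or NUM_LCORES when no more exist (mirroring the
 * RTE_MAX_LCORE convention of rte_get_next_lcore()). */
static unsigned int
get_next_lcore_extnd(unsigned int i, unsigned int ref, uint32_t flags)
{
	for (i = i + 1; i < NUM_LCORES; i++) {
		if ((flags & RTE_GET_LCORE_L3) && l3_id[i] != l3_id[ref])
			continue;
		if ((flags & RTE_GET_LCORE_L2) && l2_id[i] != l2_id[ref])
			continue;
		return i;
	}
	return NUM_LCORES;
}
```

With such a helper, a `RTE_LCORE_FOREACH_L3`-style macro reduces to a plain for-loop calling it until the sentinel value comes back.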
We can easily have simple macros like `RTE_LCORE_FOREACH_L1` which allow
iterating over SMT sibling threads.

> Looking logically, I'm not sure about the BOOST_ENABLED and
> BOOST_DISABLED flags you propose

The idea behind BOOST_ENABLED & BOOST_DISABLED comes from the DPDK power
library, which allows enabling boost. It lets the user select lcores
where boost is enabled|disabled using a macro or API.

> - in a system with multiple possible standard and boost frequencies
> what would those correspond to?

I now understand the confusion; apologies for mixing the AMD EPYC SoC
boost with Intel Turbo. Thank you for pointing this out, we will use the
terminology `RTE_GET_LCORE_TURBO`.

> What's also missing is a define for getting actual NUMA siblings i.e.
> those sharing common memory but not an L3 or anything else.

This can be extended into `rte_get_next_lcore_extnd` with the flag
`RTE_GET_LCORE_NUMA`. This will allow grabbing all lcores under the same
sub-memory NUMA as the given lcore. If SMT siblings are enabled and the
DPDK lcore mask covers the sibling threads, then `RTE_GET_LCORE_NUMA`
gets all lcores and sibling threads under the same memory NUMA as the
given lcore.

> My suggestion would be to have the function take just an integer-type
> e.g. uint16_t parameter which defines the memory/cache hierarchy level
> to use, 0 being lowest, 1 next, and so on. Different systems may have
> different numbers of cache levels so lets just make it a zero-based
> index of levels, rather than giving explicit defines (except for memory
> which should probably always be last). The zero-level will be for
> "closest neighbour"

Good idea, we did prototype this internally. But the issue is that it
keeps adding to the number of APIs in the lcore library. To keep the API
count low, we are using the lcore id as a hint to the sub-NUMA.

> whatever that happens to be, with as many levels as is necessary to
> express the topology, e.g. without SMT, but with 3 cache levels, level
> 0 would be an L2 neighbour, level 1 an L3 neighbour.
> If the L3 was split within a memory NUMA node, then level 2 would give
> the NUMA siblings. We'd just need an API to return the max number of
> levels along with the iterator.

We are using the lcore numa as the hint.

> Regards,
> /Bruce
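For reference, Bruce's zero-based level index could be sketched as below. Again this is only a self-contained mock, not DPDK: the topology tables, the `next_lcore_at_level` helper and `max_hierarchy_levels` are illustrative names of mine, assuming level 0 = L2 complex, level 1 = L3 complex, level 2 = NUMA node.

```c
#include <assert.h>
#include <stdint.h>

/* Mock topology: per-lcore domain ids for each hierarchy level,
 * innermost level first.  A real implementation would derive this
 * from the OS-reported topology at EAL init. */
#define NUM_LCORES 8
#define NUM_LEVELS 3
static const unsigned int domain_id[NUM_LEVELS][NUM_LCORES] = {
	{ 0, 0, 1, 1, 2, 2, 3, 3 },  /* level 0: L2 pairs        */
	{ 0, 0, 0, 0, 1, 1, 1, 1 },  /* level 1: two L3 complexes */
	{ 0, 0, 0, 0, 0, 0, 0, 0 },  /* level 2: one NUMA node    */
};

/* The companion API Bruce mentions: how deep the hierarchy goes. */
static unsigned int max_hierarchy_levels(void) { return NUM_LEVELS; }

/* Next lcore after 'i' that is a 'level'-neighbour of lcore 'ref',
 * or NUM_LCORES when exhausted. */
static unsigned int
next_lcore_at_level(unsigned int i, unsigned int ref, uint16_t level)
{
	if (level >= NUM_LEVELS)
		return NUM_LCORES;
	for (i = i + 1; i < NUM_LCORES; i++)
		if (domain_id[level][i] == domain_id[level][ref])
			return i;
	return NUM_LCORES;
}
```

The appeal of this shape is that no per-level defines are needed; a system with a different cache depth just reports a different `max_hierarchy_levels()` and the same loop works unchanged.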