From: Don Wallwork <donw@xsightlabs.com>
To: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [RFC] EAL: legacy memory fixed address translations
Date: Wed, 27 Jul 2022 13:20:22 -0400
Message-ID: <e426c21b-0235-11a7-7039-0c55dcc15cde@xsightlabs.com>
In-Reply-To: <20220726225910.26159820@sovereign>

On 7/26/2022 3:59 PM, Dmitry Kozlyuk wrote:
> Hi Don,
>
> 2022-07-26 14:33 (UTC-0400), Don Wallwork:
>> This proposal describes a method for translating any huge page
>> address from virtual to physical or vice versa using simple
>> addition or subtraction of a single fixed value. This allows
>> devices to efficiently access arbitrary huge page memory, even
>> stack data when worker stacks are in huge pages.
> What is the use case and how much is the benefit?

Several examples where this could help include:

1. A device could return flow lookup results containing the physical
address of a matching entry that needs to be translated to a virtual
address.

2. Hardware can perform offloads on dynamically allocated heap
memory objects and would need the PA to avoid requiring an IOMMU.

3. It may be useful to prepare data such as descriptors in stack
variables, then pass the PA to hardware which can DMA directly
from stack memory.

4. The CPU instruction set provides memory operations such as
prefetch, atomics, ALU operations and so on which operate on
virtual addresses, with no software requirement to provide
physical addresses. A device may be able to provide a more
optimized implementation of such operations, but providing it
with virtual addresses would mean taking the performance
degradation associated with a hardware IOMMU. Having the ability
to offload such operations without requiring data structure
modifications to store an IOVA for every virtual address is
desirable.

All of these cases can run at packet rate and do not operate on
mbuf data, so they would benefit from efficient address translation
in the same way that mbufs already do. Unlike mbuf translation,
which only covers VA to PA, this translation can go both ways, VA
to PA and PA to VA, with equal efficiency.
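
To make this concrete: the translation itself reduces to a single
add or subtract of an offset computed once at init time, after the
hugepage region has been mapped at its fixed virtual address.  A
rough sketch, with made-up names (this is not existing EAL API):

#include <stdint.h>
#include <rte_memory.h>

/*
 * Illustration only.  hp_offset = hugepage_base_va - hugepage_base_pa,
 * computed once at init; unsigned/modular arithmetic, so the sign of
 * the difference does not matter.
 */
static uintptr_t hp_offset;

static inline rte_iova_t
hp_va2pa(const void *va)
{
	return (rte_iova_t)((uintptr_t)va - hp_offset);
}

static inline void *
hp_pa2va(rte_iova_t pa)
{
	return (void *)(uintptr_t)(pa + hp_offset);
}

With something like that in place, case 3 above is just passing
hp_va2pa(&desc) to the device, and case 1 is hp_pa2va(result_pa)
on the address the device returns; no per-object IOVA field is
needed.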

>
> When drivers need to process a large number of memory blocks,
> these are typically packets in the form of mbufs,
> which already have IOVA attached, so there is no translation.
> Does translation of mbuf VA to PA with the proposed method
> show significant improvement over reading mbuf->iova?

This proposal does not relate to mbufs.  As you say, there is
already an efficient VA to PA mechanism in place for those.

>
> When drivers need to process a few IOVA-contiguous memory blocks,
> they can calculate VA-to-PA offsets in advance,
> amortizing translation cost.
> Hugepage stack falls within this category.

As the examples listed above hopefully show, there are cases where
it is not practical or desirable to precalculate the offsets.

>
>> When legacy memory mode is used, it is possible to map a single
>> virtual memory region large enough to cover all huge pages. During
>> legacy hugepage init, each hugepage is mapped into that region.
> Legacy mode is called "legacy" with an intent to be deprecated :)

Understood.  For our initial implementation, we were okay with
that limitation, given that supporting only legacy mode was simpler.

> There is initial allocation (-m) and --socket-limit in dynamic mode.
> When initial allocation is equal to the socket limit,
> it should be the same behavior as in legacy mode:
> the number of hugepages mapped is constant and cannot grow,
> so the feature seems applicable as well.

It seems feasible to implement this feature in non-legacy mode as
well. The approach would be similar: reserve a region of virtual
address space large enough to cover all huge pages before they are
allocated.  As huge pages are allocated, they are mapped into the
appropriate location within that virtual address space.

>
>> Once all pages have been mapped, any unused holes in that memory
>> region are unmapped.
> Who tracks these holes and prevents translation from their VA?

Since the holes are unmapped, any reference to a location in an
unused region will result in a segmentation fault.

> Why do the holes appear?

Memory layout for different NUMA nodes may cause holes.  Also,
there is no guarantee that all huge pages are physically contiguous.
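
Roughly, the init flow we have in mind looks like the following
(heavily simplified; names, alignment handling and error cleanup
are illustrative only, not the actual patch):

#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

struct hp_info {
	int fd;        /* hugetlbfs file backing this page */
	uint64_t pa;   /* physical address of the page */
	size_t len;    /* hugepage size */
};

/*
 * Map every hugepage at (base + pa - pa_min) so that VA - PA is the
 * same constant for all pages.
 */
static int
map_at_fixed_offset(const struct hp_info *pages, unsigned int n,
		    uint64_t pa_min, uint64_t pa_max, uintptr_t *offset)
{
	size_t span = pa_max - pa_min;
	unsigned int i;

	/* 1. Reserve one VA window covering the whole PA range. */
	uint8_t *base = mmap(NULL, span, PROT_NONE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	/* 2. Map each page into its slot within the window. */
	for (i = 0; i < n; i++) {
		void *va = base + (pages[i].pa - pa_min);

		if (mmap(va, pages[i].len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_FIXED, pages[i].fd, 0) ==
		    MAP_FAILED)
			return -1;
	}

	/*
	 * 3. munmap() the holes, i.e. the parts of the window not
	 * covered by any page (NUMA gaps, physically discontiguous
	 * pages), so a stray access there faults instead of silently
	 * succeeding.  Omitted here for brevity.
	 */

	*offset = (uintptr_t)base - (uintptr_t)pa_min;
	return 0;
}

The offset written out at the end is the single fixed value that
the translation helpers sketched earlier would use.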

>
>> This feature is applicable when rte_eal_iova_mode() == RTE_IOVA_PA
> One can say it always works for RTE_IOVA_VA with VA-to-PA offset of 0.

This is true, but it requires the use of a hardware IOMMU, which
degrades performance.

>
>> and could be enabled either by default when the legacy memory EAL
>> option is given, or a new EAL option could be added to specifically
>> enable this feature.
>>
>> It may be desirable to set a capability bit when this feature is
>> enabled to allow drivers to behave differently depending on the
>> state of that flag.
> The feature requires, in IOVA-as-PA mode:
> 1) that hugepage mapping is static (legacy mode or "-m" == "--socket-limit");
> 2) that EAL has succeeded in mapping all hugepages in one PA-contiguous block.

It does not require huge pages to be physically contiguous.
Theoretically, mapping such a giant VA region could fail, but
we have not seen this in practice even when running on x86_64
servers with multiple NUMA nodes, many cores and huge pages
that span TBs of physical address space.

> As userspace code, DPDK cannot guarantee 2).
> Because this mode breaks nothing and just makes translation more efficient,
> DPDK can always try to implement it and then report whether it has succeeded.
> Applications and drivers can decide what to do by querying this API.

Yes, providing an API to check this capability would
definitely work.
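
For instance, a driver could do something along these lines at init
time (the query function name below is made up for illustration;
rte_mem_virt2iova() is the existing generic path, and hp_va2pa() is
the constant-offset helper sketched earlier):

#include <rte_memory.h>

static int have_fixed_offset;

static void
drv_dma_init(void)
{
	/*
	 * Hypothetical EAL call: reports whether all hugepages were
	 * successfully mapped at one constant VA-to-PA offset.
	 */
	have_fixed_offset = rte_mem_fixed_offset_enabled();
}

static inline rte_iova_t
drv_va2iova(const void *va)
{
	if (have_fixed_offset)
		return hp_va2pa(va);      /* constant-offset fast path */
	return rte_mem_virt2iova(va);     /* existing generic lookup */
}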

Thanks for all the good feedback.

-Don
