DPDK patches and discussions
From: "Tan, Jianfeng" <jianfeng.tan@intel.com>
To: Sergio Gonzalez Monroy <sergio.gonzalez.monroy@intel.com>,
	"Walker, Benjamin" <benjamin.walker@intel.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] Running DPDK as an unprivileged user
Date: Thu, 5 Jan 2017 22:58:22 +0800	[thread overview]
Message-ID: <e3352616-8334-7a00-0860-63f075aaddf6@intel.com> (raw)
In-Reply-To: <e554e14f-6e3a-7d12-2df0-690a6a06df80@intel.com>

Hi,


On 1/5/2017 6:16 PM, Sergio Gonzalez Monroy wrote:
> On 05/01/2017 10:09, Sergio Gonzalez Monroy wrote:
>> On 04/01/2017 21:34, Walker, Benjamin wrote:
>>> On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote:
>>>> Hi Benjamin,
>>>>
>>>>
>>>> On 12/30/2016 4:41 AM, Walker, Benjamin wrote:
>>>>> DPDK today begins by allocating all of the required
>>>>> hugepages, then finds all of the physical addresses for
>>>>> those hugepages using /proc/self/pagemap, sorts the
>>>>> hugepages by physical address, then remaps the pages to
>>>>> contiguous virtual addresses. Later on and if vfio is
>>>>> enabled, it asks vfio to pin the hugepages and to set their
>>>>> DMA addresses in the IOMMU to be the physical addresses
>>>>> discovered earlier. Of course, running as an unprivileged
>>>>> user means all of the physical addresses in
>>>>> /proc/self/pagemap are just 0, so this doesn't end up
>>>>> working. Further, there is no real reason to choose the
>>>>> physical address as the DMA address in the IOMMU - it would
>>>>> be better to just count up starting at 0.
>>>> Why not just use the virtual address as the DMA address in this case,
>>>> to avoid maintaining another kind of address?
>>> That's a valid choice, although I'm just storing the DMA address in the
>>> physical address field that already exists. You either have a physical
>>> address or a DMA address and never both.
>>>
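
For anyone following along, below is a minimal sketch of the
/proc/self/pagemap lookup being described (my own illustration, not the
actual DPDK code). On recent kernels reading the PFN field requires
CAP_SYS_ADMIN; for an unprivileged user it reads back as 0, which is
exactly the problem being discussed:

    #include <stdint.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>

    /* Translate a virtual address to a physical address via
     * /proc/self/pagemap.  Returns 0 on failure or when the PFN is
     * hidden from unprivileged users. */
    static uint64_t virt_to_phys(const void *vaddr)
    {
        long pgsz = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        uint64_t entry = 0;

        if (fd < 0)
            return 0;

        /* One 64-bit entry per 4K page of virtual address space. */
        off_t off = ((uintptr_t)vaddr / pgsz) * sizeof(uint64_t);
        if (pread(fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry))
            entry = 0;
        close(fd);

        if (!(entry & (1ULL << 63)))            /* page not present */
            return 0;

        /* Bits 0-54 hold the PFN (reads as 0 without CAP_SYS_ADMIN). */
        uint64_t pfn = entry & ((1ULL << 55) - 1);
        return pfn * pgsz + (uintptr_t)vaddr % pgsz;
    }
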
>>>>>    Also, because the
>>>>> pages are pinned after the virtual to physical mapping is
>>>>> looked up, there is a window where a page could be moved.
>>>>> Hugepage mappings can be moved on more recent kernels (at
>>>>> least 4.x), and the reliability of hugepages having static
>>>>> mappings decreases with every kernel release.
>>>> Do you mean the kernel might take back a physical page after mapping
>>>> it to a virtual page (perhaps copying the data to another physical
>>>> page)? Could you please point to some links or kernel commits?
>>> Yes - the kernel can move a physical page to another physical page
>>> and change the virtual mapping at any time. For a concise example
>>> see 'man migrate_pages(2)', or for a more serious example the code
>>> that performs memory page compaction in the kernel which was
>>> recently extended to support hugepages.
>>>
>>> Before we go down the path of me proving that the mapping isn't static,
>>> let me turn that line of thinking around. Do you have any documentation
>>> demonstrating that the mapping is static? It's not static for 4k pages,
>>> so why are we assuming that it is static for 2MB pages? I understand
>>> that it happened to be static for some versions of the kernel, but my
>>> understanding is that this was purely by coincidence and never by
>>> intention.
>>
>> It looks to me as if you are talking about transparent hugepages, and
>> not hugetlbfs-managed hugepages (the DPDK use case).
>> AFAIK memory (hugepages) managed by hugetlbfs is not compacted and/or
>> moved; it is not part of the kernel memory management.
>>
>
> Please forgive my loose/poor use of words here when saying that "they
> are not part of the kernel memory management"; I meant to say that
> they are not part of the kernel memory management processes you were
> mentioning, i.e. compacting, moving, etc.
>
> Sergio
>
>> So again, do you have some references to code/articles where this 
>> "dynamic" behavior of hugepages managed by hugetlbfs is mentioned?
>>
>> Sergio

Following up on the information Benjamin provided, I did some homework and
found the kernel config option CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION, and
further the function hugepage_migration_supported().

It seems there are at least three ways this behavior can be triggered (I'm
basing this on Linux 4.8.1):

a) through the migrate_pages() syscall;
b) through the move_pages() syscall (a small sketch follows below);
c) since some kernel version, there is a kthread named kcompactd per NUMA
node which performs memory compaction.
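
As a quick illustration of (b), here is a minimal sketch (my own, not from
DPDK or the kernel) that maps one 2MB hugepage and then asks the kernel to
migrate it to NUMA node 0 via move_pages(2); passing NULL for the 'nodes'
array would instead only query the page's current node:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <linux/mempolicy.h>   /* MPOL_MF_MOVE */

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;          /* one 2MB hugepage */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED)
            return 1;
        memset(buf, 0, len);                   /* fault the page in */

        void *pages[1]  = { buf };
        int   nodes[1]  = { 0 };               /* request NUMA node 0 */
        int   status[1] = { -1 };

        /* Succeeds only if the architecture/kernel supports hugepage
         * migration (CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION). */
        long rc = syscall(SYS_move_pages, 0 /* self */, 1UL, pages,
                          nodes, status, MPOL_MF_MOVE);
        printf("rc=%ld status=%d\n", rc, status[0]);
        return 0;
    }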

Thanks,
Jianfeng

>>
>>>>> Note that this
>>>>> probably means that using uio on recent kernels is subtly
>>>>> broken and cannot be supported going forward because there
>>>>> is no uio mechanism to pin the memory.
>>>>>
>>>>> The first open question I have is whether DPDK should allow
>>>>> uio at all on recent (4.x) kernels. My current understanding
>>>>> is that there is no way to pin memory and hugepages can now
>>>>> be moved around, so uio would be unsafe. What does the
>>>>> community think here?
>>>>>
>>>>> My second question is whether the user should be allowed to
>>>>> mix uio and vfio usage simultaneously. For vfio, the
>>>>> physical addresses are really DMA addresses and are best
>>>>> when arbitrarily chosen to appear sequential relative to
>>>>> their virtual addresses.
>>>> Why "sequential relative to their virtual addresses"? IOMMU table 
>>>> is for
>>>> DMA addr -> physical addr mapping. So we need to DMA addresses
>>>> "sequential relative to their physical addresses"? Based on your above
>>>> analysis on how hugepages are initialized, virtual addresses is a good
>>>> candidate for DMA address?
>>> The code already goes through a separate organizational step on all of
>>> the pages that remaps the virtual addresses such that they're sequential
>>> relative to the physical backing pages, so this mostly ends up as the
>>> same thing.
>>> Choosing to use the virtual address is a totally valid choice, but I
>>> worry it may lead to confusion during debugging or in a multi-process
>>> scenario.
>>> I'm open to making this choice instead of starting from zero, though.
>>>
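
For reference, here is a minimal sketch of what using the virtual address
as the IOVA looks like when programming the IOMMU through the
vfio-iommu-type1 interface. It assumes an already opened and configured
VFIO container fd, and is only an illustration of the choice discussed
above, not the DPDK code itself:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Pin 'len' bytes at 'vaddr' and map them into the IOMMU with
     * IOVA == virtual address.  vfio pins the backing pages, so they
     * can no longer be migrated while the mapping exists. */
    static int dma_map_va(int container, void *vaddr, uint64_t len)
    {
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)vaddr;   /* process virtual address   */
        map.iova  = (uintptr_t)vaddr;   /* DMA address == VA here    */
        map.size  = len;

        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }
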
>>>> Thanks,
>>>> Jianfeng
>>
>>
>


Thread overview: 18+ messages
2016-12-29 20:41 Walker, Benjamin
2016-12-30  1:14 ` Stephen Hemminger
2017-01-02 14:32   ` Thomas Monjalon
2017-01-02 19:47     ` Stephen Hemminger
2017-01-03 22:50       ` Walker, Benjamin
2017-01-04 10:11         ` Thomas Monjalon
2017-01-04 21:35           ` Walker, Benjamin
2017-01-04 11:39 ` Tan, Jianfeng
2017-01-04 21:34   ` Walker, Benjamin
2017-01-05 10:09     ` Sergio Gonzalez Monroy
2017-01-05 10:16       ` Sergio Gonzalez Monroy
2017-01-05 14:58         ` Tan, Jianfeng [this message]
2017-01-05 15:52     ` Tan, Jianfeng
2017-11-05  0:17       ` Thomas Monjalon
2017-11-27 17:58         ` Walker, Benjamin
2017-11-28 14:16           ` Alejandro Lucero
2017-11-28 17:50             ` Walker, Benjamin
2017-11-28 19:13               ` Alejandro Lucero
