From: "Tan, Jianfeng"
To: Sergio Gonzalez Monroy, "Walker, Benjamin", "dev@dpdk.org"
Date: Thu, 5 Jan 2017 22:58:22 +0800
Subject: Re: [dpdk-dev] Running DPDK as an unprivileged user
References: <1483044080.11975.1.camel@intel.com> <685186b4-e50e-c122-459b-e4635404c3f8@intel.com> <1483565664.9482.3.camel@intel.com>

Hi,

On 1/5/2017 6:16 PM, Sergio Gonzalez Monroy wrote:
> On 05/01/2017 10:09, Sergio Gonzalez Monroy wrote:
>> On 04/01/2017 21:34, Walker, Benjamin wrote:
>>> On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote:
>>>> Hi Benjamin,
>>>>
>>>>
>>>> On 12/30/2016 4:41 AM, Walker, Benjamin wrote:
>>>>> DPDK today begins by allocating all of the required
>>>>> hugepages, then finds all of the physical addresses for
>>>>> those hugepages using /proc/self/pagemap, sorts the
>>>>> hugepages by physical address, then remaps the pages to
>>>>> contiguous virtual addresses.
>>>>> Later on and if vfio is
>>>>> enabled, it asks vfio to pin the hugepages and to set their
>>>>> DMA addresses in the IOMMU to be the physical addresses
>>>>> discovered earlier. Of course, running as an unprivileged
>>>>> user means all of the physical addresses in
>>>>> /proc/self/pagemap are just 0, so this doesn't end up
>>>>> working. Further, there is no real reason to choose the
>>>>> physical address as the DMA address in the IOMMU - it would
>>>>> be better to just count up starting at 0.
>>>> Why not just using virtual address as the DMA address in this case to
>>>> avoid maintaining another kind of addresses?
>>> That's a valid choice, although I'm just storing the DMA address in the
>>> physical address field that already exists. You either have a physical
>>> address or a DMA address and never both.
>>>
>>>>> Also, because the
>>>>> pages are pinned after the virtual to physical mapping is
>>>>> looked up, there is a window where a page could be moved.
>>>>> Hugepage mappings can be moved on more recent kernels (at
>>>>> least 4.x), and the reliability of hugepages having static
>>>>> mappings decreases with every kernel release.
>>>> Do you mean kernel might take back a physical page after mapping it
>>>> to a virtual page (maybe copy the data to another physical page)?
>>>> Could you please show some links or kernel commits?
>>> Yes - the kernel can move a physical page to another physical page
>>> and change the virtual mapping at any time. For a concise example
>>> see 'man migrate_pages(2)', or for a more serious example the code
>>> that performs memory page compaction in the kernel which was
>>> recently extended to support hugepages.
>>>
>>> Before we go down the path of me proving that the mapping isn't static,
>>> let me turn that line of thinking around. Do you have any documentation
>>> demonstrating that the mapping is static? It's not static for 4k
>>> pages, so why are we assuming that it is static for 2MB pages?
>>> I understand that
>>> it happened to be static for some versions of the kernel, but my
>>> understanding is that this was purely by coincidence and never by
>>> intention.
>>
>> It looks to me as if you are talking about Transparent hugepages, and
>> not hugetlbfs managed hugepages (DPDK usecase).
>> AFAIK memory (hugepages) managed by hugetlbfs is not compacted and/or
>> moved, they are not part of the kernel memory management.
>>
>
> Please forgive my loose/poor use of words here when saying that "they
> are not part of the kernel memory management", I mean to say that
> they are not part of the kernel memory management process you were
> mentioning, ie. compacting, moving, etc.
>
> Sergio
>
>> So again, do you have some references to code/articles where this
>> "dynamic" behavior of hugepages managed by hugetlbfs is mentioned?
>>
>> Sergio

Following the information Benjamin provided, I did some homework and
found this macro in the kernel config,
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION, and further the function
hugepage_migration_supported().

It seems that there are at least three ways to make this behavior happen
(I'm basing this on Linux 4.8.1):
a) through the migrate_pages() syscall;
b) through the move_pages() syscall;
c) since some kernel version, there is a kthread named kcompactd for
   each NUMA node, which performs memory compaction.

Thanks,
Jianfeng

>>
>>>>> Note that this
>>>>> probably means that using uio on recent kernels is subtly
>>>>> broken and cannot be supported going forward because there
>>>>> is no uio mechanism to pin the memory.
>>>>>
>>>>> The first open question I have is whether DPDK should allow
>>>>> uio at all on recent (4.x) kernels. My current understanding
>>>>> is that there is no way to pin memory and hugepages can now
>>>>> be moved around, so uio would be unsafe. What does the
>>>>> community think here?
>>>>>
>>>>> My second question is whether the user should be allowed to
>>>>> mix uio and vfio usage simultaneously.
>>>>> For vfio, the
>>>>> physical addresses are really DMA addresses and are best
>>>>> when arbitrarily chosen to appear sequential relative to
>>>>> their virtual addresses.
>>>> Why "sequential relative to their virtual addresses"? IOMMU table
>>>> is for DMA addr -> physical addr mapping. So we need to DMA addresses
>>>> "sequential relative to their physical addresses"? Based on your above
>>>> analysis on how hugepages are initialized, virtual addresses is a good
>>>> candidate for DMA address?
>>> The code already goes through a separate organizational step on all of
>>> the pages that remaps the virtual addresses such that they're
>>> sequential relative to the physical backing pages, so this mostly ends
>>> up as the same thing.
>>> Choosing to use the virtual address is a totally valid choice, but I
>>> worry it may lead to confusion during debugging or in a multi-process
>>> scenario.
>>> I'm open to making this choice instead of starting from zero, though.
>>>
>>>> Thanks,
>>>> Jianfeng
>>
>>
>
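
To make the address schemes discussed in this thread concrete, here is a
minimal sketch, not DPDK code. It assumes the pagemap entry layout from the
kernel's pagemap documentation (bits 0-54 hold the PFN, bit 63 is the
"present" flag) and that an unprivileged reader sees a PFN of 0, as Benjamin
describes. All function names are hypothetical and chosen only for
illustration; the two DMA helpers show the "count up starting at 0" and
"use the virtual address" options.

```python
import struct

PFN_MASK = (1 << 55) - 1   # bits 0-54: page frame number (assumed layout)
PAGE_PRESENT = 1 << 63     # bit 63: page is present in RAM


def decode_pagemap_entry(raw, base_page_size):
    """Decode one 8-byte /proc/self/pagemap entry (hypothetical helper).

    Returns the physical address of the page, or None when the page is not
    present or the PFN reads as 0 (what an unprivileged process sees).
    """
    (entry,) = struct.unpack("=Q", raw)
    if not entry & PAGE_PRESENT:
        return None
    pfn = entry & PFN_MASK
    if pfn == 0:
        return None
    return pfn * base_page_size


def dma_counting_from_zero(segments):
    """Assign DMA addresses that count up from 0 in virtual-address order.

    segments: iterable of (virt_addr, length) tuples.
    Returns a dict mapping virt_addr -> dma_addr.
    """
    dma = {}
    next_dma = 0
    for virt, length in sorted(segments):
        dma[virt] = next_dma
        next_dma += length
    return dma


def dma_from_virtual(segments):
    """Alternative scheme: reuse each virtual address as its DMA address."""
    return {virt: virt for virt, _length in segments}
```

With either DMA helper no physical address is needed at all, which is the
property that matters for an unprivileged process using vfio; the first
keeps DMA addresses small and dense, the second avoids a separate address
space to reason about.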