From mboxrd@z Thu Jan 1 00:00:00 1970
To: "Walker, Benjamin", "Tan, Jianfeng", "dev@dpdk.org"
References: <1483044080.11975.1.camel@intel.com>
 <685186b4-e50e-c122-459b-e4635404c3f8@intel.com>
 <1483565664.9482.3.camel@intel.com>
From: Sergio Gonzalez Monroy
Date: Thu, 5 Jan 2017 10:16:53 +0000
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.1.1
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [dpdk-dev] Running DPDK as an unprivileged user
List-Id: DPDK patches and discussions

On 05/01/2017 10:09, Sergio Gonzalez Monroy wrote:
> On 04/01/2017 21:34, Walker, Benjamin wrote:
>> On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote:
>>> Hi Benjamin,
>>>
>>> On 12/30/2016 4:41 AM, Walker, Benjamin wrote:
>>>> DPDK today begins by allocating all of the required
>>>> hugepages, then finds all of the physical addresses for
>>>> those hugepages using /proc/self/pagemap, sorts the
>>>> hugepages by physical address, then remaps the pages to
>>>> contiguous virtual addresses. Later on, and if vfio is
>>>> enabled, it asks vfio to pin the hugepages and to set their
>>>> DMA addresses in the IOMMU to be the physical addresses
>>>> discovered earlier. Of course, running as an unprivileged
>>>> user means all of the physical addresses in
>>>> /proc/self/pagemap are just 0, so this doesn't end up
>>>> working. Further, there is no real reason to choose the
>>>> physical address as the DMA address in the IOMMU - it would
>>>> be better to just count up starting at 0.
>>> Why not just use the virtual address as the DMA address in this
>>> case, to avoid maintaining another kind of address?
>> That's a valid choice, although I'm just storing the DMA address in
>> the physical address field that already exists. You either have a
>> physical address or a DMA address, and never both.
>>
>>>> Also, because the
>>>> pages are pinned after the virtual to physical mapping is
>>>> looked up, there is a window where a page could be moved.
>>>> Hugepage mappings can be moved on more recent kernels (at
>>>> least 4.x), and the reliability of hugepages having static
>>>> mappings decreases with every kernel release.
>>> Do you mean the kernel might take back a physical page after
>>> mapping it to a virtual page (maybe copying the data to another
>>> physical page)? Could you please show some links or kernel commits?
>> Yes - the kernel can move a physical page to another physical page
>> and change the virtual mapping at any time. For a concise example
>> see 'man migrate_pages(2)', or for a more serious example the code
>> that performs memory page compaction in the kernel, which was
>> recently extended to support hugepages.
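
For concreteness, the migrate_pages(2) call referenced above can be
exercised with a minimal sketch like the following (node numbers are
hypothetical; link with -lnuma):

    #include <numaif.h>     /* migrate_pages(); link with -lnuma */
    #include <stdio.h>

    int main(void)
    {
        unsigned long from_nodes = 1UL << 0;  /* hypothetical: NUMA node 0 */
        unsigned long to_nodes   = 1UL << 1;  /* hypothetical: NUMA node 1 */

        /* Ask the kernel to move every page of this process (pid 0 =
         * self) from node 0 to node 1. On kernels with hugepage
         * migration support this includes hugetlbfs-backed pages, so
         * the virtual-to-physical mapping changes underneath the
         * process. */
        if (migrate_pages(0, 8 * sizeof(unsigned long),
                          &from_nodes, &to_nodes) < 0)
            perror("migrate_pages");
        return 0;
    }

A process watching /proc/self/pagemap across such a call would see the
physical addresses of its pages change while the virtual addresses stay
fixed.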
>>
>> Before we go down the path of me proving that the mapping isn't
>> static, let me turn that line of thinking around. Do you have any
>> documentation demonstrating that the mapping is static? It's not
>> static for 4k pages, so why are we assuming that it is static for 2MB
>> pages? I understand that it happened to be static for some versions
>> of the kernel, but my understanding is that this was purely by
>> coincidence and never by intention.
>
> It looks to me as if you are talking about Transparent Hugepages, and
> not hugetlbfs-managed hugepages (the DPDK use case).
> AFAIK, memory (hugepages) managed by hugetlbfs is not compacted and/or
> moved; they are not part of the kernel memory management.
>

Please forgive my loose/poor use of words here: when saying that "they
are not part of the kernel memory management", I mean that they are not
part of the kernel memory management processes you were mentioning,
i.e. compacting, moving, etc.

Sergio

> So again, do you have some references to code/articles where this
> "dynamic" behavior of hugepages managed by hugetlbfs is mentioned?
>
> Sergio
>
>>>> Note that this
>>>> probably means that using uio on recent kernels is subtly
>>>> broken and cannot be supported going forward, because there
>>>> is no uio mechanism to pin the memory.
>>>>
>>>> The first open question I have is whether DPDK should allow
>>>> uio at all on recent (4.x) kernels. My current understanding
>>>> is that there is no way to pin memory and hugepages can now
>>>> be moved around, so uio would be unsafe. What does the
>>>> community think here?
>>>>
>>>> My second question is whether the user should be allowed to
>>>> mix uio and vfio usage simultaneously. For vfio, the
>>>> physical addresses are really DMA addresses and are best
>>>> when arbitrarily chosen to appear sequential relative to
>>>> their virtual addresses.
>>> Why "sequential relative to their virtual addresses"? The IOMMU
>>> table is for DMA addr -> physical addr mapping, so don't we need DMA
>>> addresses "sequential relative to their physical addresses"? Based
>>> on your above analysis of how hugepages are initialized, aren't
>>> virtual addresses a good candidate for DMA addresses?
>> The code already goes through a separate organizational step on all
>> of the pages that remaps the virtual addresses such that they're
>> sequential relative to the physical backing pages, so this mostly
>> ends up as the same thing.
>> Choosing to use the virtual address is a totally valid choice, but I
>> worry it may lead to confusion during debugging or in a multi-process
>> scenario. I'm open to making this choice instead of starting from
>> zero, though.
>>
>>> Thanks,
>>> Jianfeng
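
As a footnote to the pagemap discussion at the top of the thread, the
virtual-to-physical lookup being described boils down to something like
this sketch (not the actual DPDK implementation; error handling
trimmed). Without CAP_SYS_ADMIN the kernel hides the PFN field and the
read comes back as 0, which is exactly the unprivileged failure mode
Benjamin describes:

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    static uint64_t virt_to_phys(const void *vaddr)
    {
        long pgsz = sysconf(_SC_PAGESIZE);
        off_t off = ((uintptr_t)vaddr / pgsz) * sizeof(uint64_t);
        uint64_t entry = 0;
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd < 0)
            return 0;
        if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
            entry = 0;
        close(fd);

        if (!(entry & (1ULL << 63)))    /* bit 63: page present */
            return 0;
        /* bits 0-54: page frame number; reads as 0 without CAP_SYS_ADMIN */
        return (entry & ((1ULL << 55) - 1)) * pgsz +
               ((uintptr_t)vaddr & (pgsz - 1));
    }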
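
And on the question of what to program into the IOMMU, a hedged sketch
of the vfio type1 mapping call with the iova chosen equal to the
virtual address (container_fd is assumed to be an already-configured
VFIO_TYPE1_IOMMU container; this is an illustration, not DPDK's code):

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    static int vfio_map_region(int container_fd, void *vaddr,
                               uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)vaddr;   /* backing pages get pinned here */
        map.iova  = (uintptr_t)vaddr;   /* DMA address = virtual address */
        map.size  = size;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }

Because VFIO_IOMMU_MAP_DMA pins the backing pages for the lifetime of
the mapping, the page migration concern discussed above does not apply
to memory mapped this way; that pinning guarantee is what uio lacks.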