* [dpdk-dev] Running DPDK as an unprivileged user
@ 2016-12-29 20:41 Walker, Benjamin
  2016-12-30  1:14 ` Stephen Hemminger
  2017-01-04 11:39 ` Tan, Jianfeng
  0 siblings, 2 replies; 18+ messages in thread

From: Walker, Benjamin @ 2016-12-29 20:41 UTC (permalink / raw)
To: dev

Hi all,

I've been digging into what it would take to run DPDK as an unprivileged user, and I have some findings that I thought were worthy of discussion. The assumptions here are that I'm using a very recent Linux kernel (4.8.15 to be specific) and that I'm using vfio with my IOMMU enabled. I'm only interested in making it possible to run as an unprivileged user in this type of environment.

There are a few key things that DPDK needs to do in order to run as an unprivileged user:

1) Allocate hugepages
2) Map device resources
3) Map hugepage virtual addresses to DMA addresses

For #1 and #2, DPDK works just fine today. You simply chown the relevant resources in sysfs to the desired user and everything is happy.

The problem is #3. This currently relies on looking up the mappings in /proc/self/pagemap, but the ability to get physical addresses from /proc/self/pagemap as an unprivileged user was removed from the kernel in the 4.x timeframe due to the Rowhammer vulnerability. At this time, it is not possible to run DPDK as an unprivileged user on a 4.x Linux kernel.

There is a way to make this work, though, which I'll outline now. Unfortunately, I think it is going to require some very significant changes to the initialization flow in the EAL. One bit of background before I go into how to fix this - there are three types of memory addresses: virtual addresses, physical addresses, and DMA addresses. Sometimes DMA addresses are called bus addresses or I/O addresses, but I'll call them DMA addresses because I think that's the clearest name.
In a system without an IOMMU, DMA addresses and physical addresses are equivalent, but in a system with an IOMMU any arbitrary DMA address can be chosen by the user to map to a given physical address. For security reasons (Rowhammer), it is no longer considered safe to expose physical addresses to userspace, but it is perfectly fine to expose DMA addresses when an IOMMU is present.

DPDK today begins by allocating all of the required hugepages, then finds all of the physical addresses for those hugepages using /proc/self/pagemap, sorts the hugepages by physical address, then remaps the pages to contiguous virtual addresses. Later on, if vfio is enabled, it asks vfio to pin the hugepages and to set their DMA addresses in the IOMMU to be the physical addresses discovered earlier. Of course, running as an unprivileged user means all of the physical addresses in /proc/self/pagemap are just 0, so this doesn't end up working. Further, there is no real reason to choose the physical address as the DMA address in the IOMMU - it would be better to just count up starting at 0.

Also, because the pages are pinned after the virtual-to-physical mapping is looked up, there is a window during which a page could be moved. Hugepage mappings can be moved on more recent kernels (at least 4.x), and the reliability of hugepages having static mappings decreases with every kernel release. Note that this probably means that using uio on recent kernels is subtly broken and cannot be supported going forward, because there is no uio mechanism to pin the memory.

The first open question I have is whether DPDK should allow uio at all on recent (4.x) kernels. My current understanding is that there is no way to pin memory and hugepages can now be moved around, so uio would be unsafe. What does the community think here?

My second question is whether the user should be allowed to mix uio and vfio usage simultaneously.
For vfio, the physical addresses are really DMA addresses and are best when arbitrarily chosen to appear sequential relative to their virtual addresses. For uio, they are physical addresses and are not chosen at all. It seems that these two things are in conflict and that it will be difficult, ugly, and maybe impossible to resolve the simultaneous use of both.

Once we agree on the above two things, we can try to talk through some solutions in the code.

Thanks,
Ben

^ permalink raw reply [flat|nested] 18+ messages in thread
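Walker's step #3 can be illustrated with a short sketch (the helper names here are hypothetical, not DPDK's actual code). Each 8-byte /proc/self/pagemap entry encodes the page frame number in bits 0-54 and a "present" flag in bit 63; on hardened 4.x kernels an unprivileged reader sees the PFN field zeroed, which is exactly why the lookup returns 0:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1) /* bits 0-54: page frame number */
#define PAGEMAP_PRESENT  (1ULL << 63)       /* bit 63: page is present in RAM */

/* Decode one 64-bit pagemap entry into a physical address.
 * Returns 0 when the page is absent or the kernel hid the PFN,
 * which is what an unprivileged reader sees on 4.x kernels. */
uint64_t pagemap_entry_to_phys(uint64_t entry, uint64_t vaddr, uint64_t page_size)
{
    if (!(entry & PAGEMAP_PRESENT))
        return 0;
    uint64_t pfn = entry & PAGEMAP_PFN_MASK;
    if (pfn == 0)
        return 0; /* PFN zeroed: caller lacks CAP_SYS_ADMIN */
    return pfn * page_size + (vaddr % page_size);
}

/* Look up the physical address backing vaddr in this process. */
uint64_t virt_to_phys(void *vaddr)
{
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t entry = 0;
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return 0;
    /* One 8-byte entry per page, indexed by virtual page number. */
    off_t offset = (off_t)((uintptr_t)vaddr / page_size) * sizeof(uint64_t);
    if (pread(fd, &entry, sizeof(entry), offset) != (ssize_t)sizeof(entry))
        entry = 0;
    close(fd);
    return pagemap_entry_to_phys(entry, (uintptr_t)vaddr, page_size);
}
```

Run as root, `virt_to_phys()` returns a real physical address; run unprivileged on a 4.x kernel, it returns 0 for every page, matching the behavior described in the message above.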
* Re: [dpdk-dev] Running DPDK as an unprivileged user
  2016-12-29 20:41 [dpdk-dev] Running DPDK as an unprivileged user Walker, Benjamin
@ 2016-12-30  1:14 ` Stephen Hemminger
  2017-01-02 14:32   ` Thomas Monjalon
  2017-01-04 11:39 ` Tan, Jianfeng
  1 sibling, 1 reply; 18+ messages in thread

From: Stephen Hemminger @ 2016-12-30 1:14 UTC (permalink / raw)
To: Walker, Benjamin; +Cc: dev

On Thu, 29 Dec 2016 20:41:21 +0000
"Walker, Benjamin" <benjamin.walker@intel.com> wrote:

> The first open question I have is whether DPDK should allow
> uio at all on recent (4.x) kernels. My current understanding
> is that there is no way to pin memory and hugepages can now
> be moved around, so uio would be unsafe. What does the
> community think here?

DMA access without an IOMMU (i.e. UIO) is not safe from a security point of view. A malicious app could program a device (like an Ethernet NIC) to change its current privilege level in kernel memory. Therefore, ignore UIO as an option if you want to allow unprivileged access.

But there are many, many systems without a working IOMMU. Not just broken motherboards, but virtualization environments (Xen, Hyper-V, and KVM until very recently) where the IOMMU is not going to work. In these environments, DPDK is still useful where the security risks are known.

If the kernel broke pinning of hugepages, then it is an upstream kernel bug.

> My second question is whether the user should be allowed to
> mix uio and vfio usage simultaneously. For vfio, the
> physical addresses are really DMA addresses and are best
> when arbitrarily chosen to appear sequential relative to
> their virtual addresses. For uio, they are physical
> addresses and are not chosen at all. It seems that these two
> things are in conflict and that it will be difficult, ugly,
> and maybe impossible to resolve the simultaneous use of
> both.

Unless the application is running as a privileged user (i.e. root), UIO is not going to work. Therefore don't worry about a mixed environment.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user
  2016-12-30  1:14 ` Stephen Hemminger
@ 2017-01-02 14:32   ` Thomas Monjalon
  2017-01-02 19:47     ` Stephen Hemminger
  0 siblings, 1 reply; 18+ messages in thread

From: Thomas Monjalon @ 2017-01-02 14:32 UTC (permalink / raw)
To: Walker, Benjamin; +Cc: dev, Stephen Hemminger

2016-12-29 17:14, Stephen Hemminger:
> On Thu, 29 Dec 2016 20:41:21 +0000
> "Walker, Benjamin" <benjamin.walker@intel.com> wrote:
> > My second question is whether the user should be allowed to
> > mix uio and vfio usage simultaneously. For vfio, the
> > physical addresses are really DMA addresses and are best
> > when arbitrarily chosen to appear sequential relative to
> > their virtual addresses. For uio, they are physical
> > addresses and are not chosen at all. It seems that these two
> > things are in conflict and that it will be difficult, ugly,
> > and maybe impossible to resolve the simultaneous use of
> > both.
>
> Unless application is running as privileged user (ie root), UIO
> is not going to work. Therefore don't worry about mixed environment.

Yes, mixing UIO and VFIO is possible only as root.
However, what is the benefit of mixing them?

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-01-02 14:32 ` Thomas Monjalon @ 2017-01-02 19:47 ` Stephen Hemminger 2017-01-03 22:50 ` Walker, Benjamin 0 siblings, 1 reply; 18+ messages in thread From: Stephen Hemminger @ 2017-01-02 19:47 UTC (permalink / raw) To: Thomas Monjalon; +Cc: Walker, Benjamin, dev On Mon, 02 Jan 2017 15:32:08 +0100 Thomas Monjalon <thomas.monjalon@6wind.com> wrote: > 2016-12-29 17:14, Stephen Hemminger: > > On Thu, 29 Dec 2016 20:41:21 +0000 > > "Walker, Benjamin" <benjamin.walker@intel.com> wrote: > > > My second question is whether the user should be allowed to > > > mix uio and vfio usage simultaneously. For vfio, the > > > physical addresses are really DMA addresses and are best > > > when arbitrarily chosen to appear sequential relative to > > > their virtual addresses. For uio, they are physical > > > addresses and are not chosen at all. It seems that these two > > > things are in conflict and that it will be difficult, ugly, > > > and maybe impossible to resolve the simultaneous use of > > > both. > > > > Unless application is running as privileged user (ie root), UIO > > is not going to work. Therefore don't worry about mixed environment. > > Yes, mixing UIO and VFIO is possible only as root. > However, what is the benefit of mixing them? One possible case where this could be used, Hyper-V/Azure and SR-IOV. The VF interface will show up on an isolated PCI bus and the virtual NIC is on VMBUS. It is possible to use VFIO on the PCI to get MSI-X per queue interrupts, but there is no support for VFIO on VMBUS. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-01-02 19:47 ` Stephen Hemminger @ 2017-01-03 22:50 ` Walker, Benjamin 2017-01-04 10:11 ` Thomas Monjalon 0 siblings, 1 reply; 18+ messages in thread From: Walker, Benjamin @ 2017-01-03 22:50 UTC (permalink / raw) To: stephen, thomas.monjalon; +Cc: dev On Thu, 2016-12-29 at 17:14 -0800, Stephen Hemminger wrote: > If kernel broke pinning of hugepages, then it is an upstream kernel bug. The kernel, under a myriad of circumstances, will change the mapping of virtual to physical addresses for hugepages. This behavior began somewhere around kernel 3.16 and with each release more cases where the mapping can change are introduced. DPDK should not be relying on that mapping staying static, and instead should be using vfio to explicitly pin the pages. I've consulted the relevant kernel developers who write the code in this area and they are universally in agreement that this is not a kernel bug and the mappings will get less static over time. On Mon, 2017-01-02 at 11:47 -0800, Stephen Hemminger wrote: > On Mon, 02 Jan 2017 15:32:08 +0100 > Thomas Monjalon <thomas.monjalon@6wind.com> wrote: > > > 2016-12-29 17:14, Stephen Hemminger: > > > On Thu, 29 Dec 2016 20:41:21 +0000 > > > "Walker, Benjamin" <benjamin.walker@intel.com> wrote: > > > > My second question is whether the user should be allowed to > > > > mix uio and vfio usage simultaneously. For vfio, the > > > > physical addresses are really DMA addresses and are best > > > > when arbitrarily chosen to appear sequential relative to > > > > their virtual addresses. For uio, they are physical > > > > addresses and are not chosen at all. It seems that these two > > > > things are in conflict and that it will be difficult, ugly, > > > > and maybe impossible to resolve the simultaneous use of > > > > both. > > > > > > Unless application is running as privileged user (ie root), UIO > > > is not going to work. Therefore don't worry about mixed environment. 
> >
> > Yes, mixing UIO and VFIO is possible only as root.
> > However, what is the benefit of mixing them?
>
> One possible case where this could be used, Hyper-V/Azure and SR-IOV.
> The VF interface will show up on an isolated PCI bus and the virtual NIC
> is on VMBUS. It is possible to use VFIO on the PCI to get MSI-X per queue
> interrupts, but there is no support for VFIO on VMBUS.

I sent out a patch a little while ago that makes DPDK work when running as an unprivileged user with an IOMMU. I allow mixing of uio/vfio when root (I choose the DMA address to be the physical address), but only vfio when unprivileged (I choose the DMA addresses to start at 0).

Unfortunately, there are a few more wrinkles for systems that do not have an IOMMU. These systems still need to explicitly pin memory, but they need to use physical addresses instead of DMA addresses. There are three concerns with this:

1) Physical addresses cannot be exposed to unprivileged users due to security concerns (the fallout of Rowhammer). Therefore, systems without an IOMMU can only support privileged users. I think this is probably fine.
2) The IOCTL from vfio to pin the memory is tied to specifying the DMA address and programming the IOMMU. This is unfortunate - systems without an IOMMU still want to do the pinning, but they need to be given the physical address instead of specifying a DMA address.
3) Not all device types, particularly in virtualization environments, support vfio today. These devices have no way to explicitly pin memory.

I think this is going to take a kernel patch or two to resolve, unless someone has a good idea.

^ permalink raw reply [flat|nested] 18+ messages in thread
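Concern #2 is visible in the vfio type1 API itself: VFIO_IOMMU_MAP_DMA both pins the user pages and installs the IOVA-to-physical translation in a single ioctl, so there is no pin-only operation. A minimal sketch (the helper names are mine, not DPDK's, and it assumes a vfio container fd that has already been configured with VFIO_SET_IOMMU):

```c
#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Build the argument for VFIO_IOMMU_MAP_DMA: map 'len' bytes at
 * process virtual address 'vaddr' so the device sees them at 'iova'. */
struct vfio_iommu_type1_dma_map make_dma_map(void *vaddr, uint64_t iova,
                                             uint64_t len)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uintptr_t)vaddr;
    map.iova  = iova;
    map.size  = len;
    return map;
}

/* Pin the pages and program the IOMMU in one operation; vfio type1
 * offers no way to request the pinning alone, which is the crux of
 * concern #2 for systems without an IOMMU. */
int vfio_pin_and_map(int container_fd, void *vaddr, uint64_t iova,
                     uint64_t len)
{
    struct vfio_iommu_type1_dma_map map = make_dma_map(vaddr, iova, len);

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```

Because the iova field is mandatory, a no-IOMMU system that only wants the pinning side effect has nothing meaningful to pass here, as the message above points out.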
* Re: [dpdk-dev] Running DPDK as an unprivileged user
  2017-01-03 22:50 ` Walker, Benjamin
@ 2017-01-04 10:11   ` Thomas Monjalon
  2017-01-04 21:35     ` Walker, Benjamin
  0 siblings, 1 reply; 18+ messages in thread

From: Thomas Monjalon @ 2017-01-04 10:11 UTC (permalink / raw)
To: Walker, Benjamin; +Cc: stephen, dev

2017-01-03 22:50, Walker, Benjamin:
> 1) Physical addresses cannot be exposed to unprivileged users due to security
> concerns (the fallout of rowhammer). Therefore, systems without an IOMMU can
> only support privileged users. I think this is probably fine.
> 2) The IOCTL from vfio to pin the memory is tied to specifying the DMA address
> and programming the IOMMU. This is unfortunate - systems without an IOMMU still
> want to do the pinning, but they need to be given the physical address instead
> of specifying a DMA address.
> 3) Not all device types, particularly in virtualization environments, support
> vfio today. These devices have no way to explicitly pin memory.

In a VM we can use VFIO-noiommu. Does that help with the mapping?

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-01-04 10:11 ` Thomas Monjalon @ 2017-01-04 21:35 ` Walker, Benjamin 0 siblings, 0 replies; 18+ messages in thread From: Walker, Benjamin @ 2017-01-04 21:35 UTC (permalink / raw) To: thomas.monjalon; +Cc: stephen, dev On Wed, 2017-01-04 at 11:11 +0100, Thomas Monjalon wrote: > 2017-01-03 22:50, Walker, Benjamin: > > 1) Physical addresses cannot be exposed to unprivileged users due to > > security > > concerns (the fallout of rowhammer). Therefore, systems without an IOMMU can > > only support privileged users. I think this is probably fine. > > 2) The IOCTL from vfio to pin the memory is tied to specifying the DMA > > address > > and programming the IOMMU. This is unfortunate - systems without an IOMMU > > still > > want to do the pinning, but they need to be given the physical address > > instead > > of specifying a DMA address. > > 3) Not all device types, particularly in virtualization environments, > > support > > vfio today. These devices have no way to explicitly pin memory. > > In VM we can use VFIO-noiommu. Is it helping for mapping? There does not appear to be a vfio IOCTL that pins memory without also programming the IOMMU, so vfio-noiommu is broken in the same way that uio is for drivers that require physical memory. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2016-12-29 20:41 [dpdk-dev] Running DPDK as an unprivileged user Walker, Benjamin 2016-12-30 1:14 ` Stephen Hemminger @ 2017-01-04 11:39 ` Tan, Jianfeng 2017-01-04 21:34 ` Walker, Benjamin 1 sibling, 1 reply; 18+ messages in thread From: Tan, Jianfeng @ 2017-01-04 11:39 UTC (permalink / raw) To: Walker, Benjamin, dev Hi Benjamin, On 12/30/2016 4:41 AM, Walker, Benjamin wrote: > Hi all, > > I've been digging in to what it would take to run DPDK as an > unprivileged user and I have some findings that I thought > were worthy of discussion. The assumptions here are that I'm > using a very recent Linux kernel (4.8.15 to be specific) and > I'm using vfio with my IOMMU enabled. I'm only interested in > making it possible to run as an unprivileged user in this > type of environment. > > There are a few key things that DPDK needs to do in order to > run as an unprivileged user: > > 1) Allocate hugepages > 2) Map device resources > 3) Map hugepage virtual addresses to DMA addresses. > > For #1 and #2, DPDK works just fine today. You simply chown > the relevant resources in sysfs to the desired user and > everything is happy. > > The problem is #3. This currently relies on looking up the > mappings in /proc/self/pagemap, but the ability to get > physical addresses in /proc/self/pagemap as an unprivileged > user was removed from the kernel in the 4.x timeframe due to > the Rowhammer vulnerability. At this time, it is not > possible to run DPDK as an unprivileged user on a 4.x Linux > kernel. > > There is a way to make this work though, which I'll outline > now. Unfortunately, I think it is going to require some very > significant changes to the initialization flow in the EAL. > One bit of of background before I go into how to fix this - > there are three types of memory addresses - virtual > addresses, physical addresses, and DMA addresses. 
Sometimes > DMA addresses are called bus addresses or I/O addresses, but > I'll call them DMA addresses because I think that's the > clearest name. In a system without an IOMMU, DMA addresses > and physical addresses are equivalent, but in a system with > an IOMMU any arbitrary DMA address can be chosen by the user > to map to a given physical address. For security reasons > (rowhammer), it is no longer considered safe to expose > physical addresses to userspace, but it is perfectly fine to > expose DMA addresses when an IOMMU is present. > > DPDK today begins by allocating all of the required > hugepages, then finds all of the physical addresses for > those hugepages using /proc/self/pagemap, sorts the > hugepages by physical address, then remaps the pages to > contiguous virtual addresses. Later on and if vfio is > enabled, it asks vfio to pin the hugepages and to set their > DMA addresses in the IOMMU to be the physical addresses > discovered earlier. Of course, running as an unprivileged > user means all of the physical addresses in > /proc/self/pagemap are just 0, so this doesn't end up > working. Further, there is no real reason to choose the > physical address as the DMA address in the IOMMU - it would > be better to just count up starting at 0. Why not just using virtual address as the DMA address in this case to avoid maintaining another kind of addresses? > Also, because the > pages are pinned after the virtual to physical mapping is > looked up, there is a window where a page could be moved. > Hugepage mappings can be moved on more recent kernels (at > least 4.x), and the reliability of hugepages having static > mappings decreases with every kernel release. Do you mean kernel might take back a physical page after mapping it to a virtual page (maybe copy the data to another physical page)? Could you please show some links or kernel commits? 
> Note that this
> probably means that using uio on recent kernels is subtly
> broken and cannot be supported going forward because there
> is no uio mechanism to pin the memory.
>
> The first open question I have is whether DPDK should allow
> uio at all on recent (4.x) kernels. My current understanding
> is that there is no way to pin memory and hugepages can now
> be moved around, so uio would be unsafe. What does the
> community think here?
>
> My second question is whether the user should be allowed to
> mix uio and vfio usage simultaneously. For vfio, the
> physical addresses are really DMA addresses and are best
> when arbitrarily chosen to appear sequential relative to
> their virtual addresses.

Why "sequential relative to their virtual addresses"? The IOMMU table holds the DMA addr -> physical addr mapping, so don't we need DMA addresses "sequential relative to their physical addresses"? Based on your above analysis of how hugepages are initialized, wouldn't virtual addresses be a good candidate for the DMA addresses?

Thanks,
Jianfeng

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-01-04 11:39 ` Tan, Jianfeng @ 2017-01-04 21:34 ` Walker, Benjamin 2017-01-05 10:09 ` Sergio Gonzalez Monroy 2017-01-05 15:52 ` Tan, Jianfeng 0 siblings, 2 replies; 18+ messages in thread From: Walker, Benjamin @ 2017-01-04 21:34 UTC (permalink / raw) To: Tan, Jianfeng, dev On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote: > Hi Benjamin, > > > On 12/30/2016 4:41 AM, Walker, Benjamin wrote: > > DPDK today begins by allocating all of the required > > hugepages, then finds all of the physical addresses for > > those hugepages using /proc/self/pagemap, sorts the > > hugepages by physical address, then remaps the pages to > > contiguous virtual addresses. Later on and if vfio is > > enabled, it asks vfio to pin the hugepages and to set their > > DMA addresses in the IOMMU to be the physical addresses > > discovered earlier. Of course, running as an unprivileged > > user means all of the physical addresses in > > /proc/self/pagemap are just 0, so this doesn't end up > > working. Further, there is no real reason to choose the > > physical address as the DMA address in the IOMMU - it would > > be better to just count up starting at 0. > > Why not just using virtual address as the DMA address in this case to > avoid maintaining another kind of addresses? That's a valid choice, although I'm just storing the DMA address in the physical address field that already exists. You either have a physical address or a DMA address and never both. > > > Also, because the > > pages are pinned after the virtual to physical mapping is > > looked up, there is a window where a page could be moved. > > Hugepage mappings can be moved on more recent kernels (at > > least 4.x), and the reliability of hugepages having static > > mappings decreases with every kernel release. > > Do you mean kernel might take back a physical page after mapping it to a > virtual page (maybe copy the data to another physical page)? 
Could you > please show some links or kernel commits? Yes - the kernel can move a physical page to another physical page and change the virtual mapping at any time. For a concise example see 'man migrate_pages(2)', or for a more serious example the code that performs memory page compaction in the kernel which was recently extended to support hugepages. Before we go down the path of me proving that the mapping isn't static, let me turn that line of thinking around. Do you have any documentation demonstrating that the mapping is static? It's not static for 4k pages, so why are we assuming that it is static for 2MB pages? I understand that it happened to be static for some versions of the kernel, but my understanding is that this was purely by coincidence and never by intention. > > > Note that this > > probably means that using uio on recent kernels is subtly > > broken and cannot be supported going forward because there > > is no uio mechanism to pin the memory. > > > > The first open question I have is whether DPDK should allow > > uio at all on recent (4.x) kernels. My current understanding > > is that there is no way to pin memory and hugepages can now > > be moved around, so uio would be unsafe. What does the > > community think here? > > > > My second question is whether the user should be allowed to > > mix uio and vfio usage simultaneously. For vfio, the > > physical addresses are really DMA addresses and are best > > when arbitrarily chosen to appear sequential relative to > > their virtual addresses. > > Why "sequential relative to their virtual addresses"? IOMMU table is for > DMA addr -> physical addr mapping. So we need to DMA addresses > "sequential relative to their physical addresses"? Based on your above > analysis on how hugepages are initialized, virtual addresses is a good > candidate for DMA address? 
The code already goes through a separate organizational step on all of the pages that remaps the virtual addresses such that they're sequential relative to the physical backing pages, so this mostly ends up as the same thing. Choosing to use the virtual address is a totally valid choice, but I worry it may lead to confusion during debugging or in a multi-process scenario. I'm open to making this choice instead of starting from zero, though. > > Thanks, > Jianfeng ^ permalink raw reply [flat|nested] 18+ messages in thread
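The two IOVA policies being weighed here differ only in a line of arithmetic, which may make the trade-off easier to see (hypothetical helper names; 'i' is the hugepage index after the pages have been remapped to contiguous virtual addresses):

```c
#include <stdint.h>

/* Policy in Walker's patch: dense DMA addresses counting up from zero,
 * one page-sized slot per hugepage, independent of any physical or
 * virtual address. */
uint64_t iova_counting_up(unsigned int i, uint64_t page_size)
{
    return (uint64_t)i * page_size;
}

/* Policy suggested by Jianfeng: reuse the virtual address as the DMA
 * address, so only one address space has to be tracked and debugged.
 * In a multi-process setup each process may map the same page at a
 * different virtual address, which is the confusion Walker mentions. */
uint64_t iova_as_va(void *vaddr)
{
    return (uint64_t)(uintptr_t)vaddr;
}
```

Either value would be fed to the vfio DMA-map ioctl as the IOVA; the IOMMU makes both equally valid choices.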
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-01-04 21:34 ` Walker, Benjamin @ 2017-01-05 10:09 ` Sergio Gonzalez Monroy 2017-01-05 10:16 ` Sergio Gonzalez Monroy 2017-01-05 15:52 ` Tan, Jianfeng 1 sibling, 1 reply; 18+ messages in thread From: Sergio Gonzalez Monroy @ 2017-01-05 10:09 UTC (permalink / raw) To: Walker, Benjamin, Tan, Jianfeng, dev On 04/01/2017 21:34, Walker, Benjamin wrote: > On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote: >> Hi Benjamin, >> >> >> On 12/30/2016 4:41 AM, Walker, Benjamin wrote: >>> DPDK today begins by allocating all of the required >>> hugepages, then finds all of the physical addresses for >>> those hugepages using /proc/self/pagemap, sorts the >>> hugepages by physical address, then remaps the pages to >>> contiguous virtual addresses. Later on and if vfio is >>> enabled, it asks vfio to pin the hugepages and to set their >>> DMA addresses in the IOMMU to be the physical addresses >>> discovered earlier. Of course, running as an unprivileged >>> user means all of the physical addresses in >>> /proc/self/pagemap are just 0, so this doesn't end up >>> working. Further, there is no real reason to choose the >>> physical address as the DMA address in the IOMMU - it would >>> be better to just count up starting at 0. >> Why not just using virtual address as the DMA address in this case to >> avoid maintaining another kind of addresses? > That's a valid choice, although I'm just storing the DMA address in the > physical address field that already exists. You either have a physical > address or a DMA address and never both. > >>> Also, because the >>> pages are pinned after the virtual to physical mapping is >>> looked up, there is a window where a page could be moved. >>> Hugepage mappings can be moved on more recent kernels (at >>> least 4.x), and the reliability of hugepages having static >>> mappings decreases with every kernel release. 
>> Do you mean kernel might take back a physical page after mapping it to a >> virtual page (maybe copy the data to another physical page)? Could you >> please show some links or kernel commits? > Yes - the kernel can move a physical page to another physical page > and change the virtual mapping at any time. For a concise example > see 'man migrate_pages(2)', or for a more serious example the code > that performs memory page compaction in the kernel which was > recently extended to support hugepages. > > Before we go down the path of me proving that the mapping isn't static, > let me turn that line of thinking around. Do you have any documentation > demonstrating that the mapping is static? It's not static for 4k pages, so > why are we assuming that it is static for 2MB pages? I understand that > it happened to be static for some versions of the kernel, but my understanding > is that this was purely by coincidence and never by intention. It looks to me as if you are talking about Transparent hugepages, and not hugetlbfs managed hugepages (DPDK usecase). AFAIK memory (hugepages) managed by hugetlbfs is not compacted and/or moved, they are not part of the kernel memory management. So again, do you have some references to code/articles where this "dynamic" behavior of hugepages managed by hugetlbfs is mentioned? Sergio >>> Note that this >>> probably means that using uio on recent kernels is subtly >>> broken and cannot be supported going forward because there >>> is no uio mechanism to pin the memory. >>> >>> The first open question I have is whether DPDK should allow >>> uio at all on recent (4.x) kernels. My current understanding >>> is that there is no way to pin memory and hugepages can now >>> be moved around, so uio would be unsafe. What does the >>> community think here? >>> >>> My second question is whether the user should be allowed to >>> mix uio and vfio usage simultaneously. 
For vfio, the >>> physical addresses are really DMA addresses and are best >>> when arbitrarily chosen to appear sequential relative to >>> their virtual addresses. >> Why "sequential relative to their virtual addresses"? IOMMU table is for >> DMA addr -> physical addr mapping. So we need to DMA addresses >> "sequential relative to their physical addresses"? Based on your above >> analysis on how hugepages are initialized, virtual addresses is a good >> candidate for DMA address? > The code already goes through a separate organizational step on all of > the pages that remaps the virtual addresses such that they're sequential > relative to the physical backing pages, so this mostly ends up as the same > thing. > Choosing to use the virtual address is a totally valid choice, but I worry it > may lead to confusion during debugging or in a multi-process scenario. > I'm open to making this choice instead of starting from zero, though. > >> Thanks, >> Jianfeng ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-01-05 10:09 ` Sergio Gonzalez Monroy @ 2017-01-05 10:16 ` Sergio Gonzalez Monroy 2017-01-05 14:58 ` Tan, Jianfeng 0 siblings, 1 reply; 18+ messages in thread From: Sergio Gonzalez Monroy @ 2017-01-05 10:16 UTC (permalink / raw) To: Walker, Benjamin, Tan, Jianfeng, dev On 05/01/2017 10:09, Sergio Gonzalez Monroy wrote: > On 04/01/2017 21:34, Walker, Benjamin wrote: >> On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote: >>> Hi Benjamin, >>> >>> >>> On 12/30/2016 4:41 AM, Walker, Benjamin wrote: >>>> DPDK today begins by allocating all of the required >>>> hugepages, then finds all of the physical addresses for >>>> those hugepages using /proc/self/pagemap, sorts the >>>> hugepages by physical address, then remaps the pages to >>>> contiguous virtual addresses. Later on and if vfio is >>>> enabled, it asks vfio to pin the hugepages and to set their >>>> DMA addresses in the IOMMU to be the physical addresses >>>> discovered earlier. Of course, running as an unprivileged >>>> user means all of the physical addresses in >>>> /proc/self/pagemap are just 0, so this doesn't end up >>>> working. Further, there is no real reason to choose the >>>> physical address as the DMA address in the IOMMU - it would >>>> be better to just count up starting at 0. >>> Why not just using virtual address as the DMA address in this case to >>> avoid maintaining another kind of addresses? >> That's a valid choice, although I'm just storing the DMA address in the >> physical address field that already exists. You either have a physical >> address or a DMA address and never both. >> >>>> Also, because the >>>> pages are pinned after the virtual to physical mapping is >>>> looked up, there is a window where a page could be moved. >>>> Hugepage mappings can be moved on more recent kernels (at >>>> least 4.x), and the reliability of hugepages having static >>>> mappings decreases with every kernel release. 
>>> Do you mean kernel might take back a physical page after mapping it >>> to a >>> virtual page (maybe copy the data to another physical page)? Could you >>> please show some links or kernel commits? >> Yes - the kernel can move a physical page to another physical page >> and change the virtual mapping at any time. For a concise example >> see 'man migrate_pages(2)', or for a more serious example the code >> that performs memory page compaction in the kernel which was >> recently extended to support hugepages. >> >> Before we go down the path of me proving that the mapping isn't static, >> let me turn that line of thinking around. Do you have any documentation >> demonstrating that the mapping is static? It's not static for 4k >> pages, so >> why are we assuming that it is static for 2MB pages? I understand that >> it happened to be static for some versions of the kernel, but my >> understanding >> is that this was purely by coincidence and never by intention. > > It looks to me as if you are talking about Transparent hugepages, and > not hugetlbfs managed hugepages (DPDK usecase). > AFAIK memory (hugepages) managed by hugetlbfs is not compacted and/or > moved, they are not part of the kernel memory management. > Please forgive my loose/poor use of words here when saying that "they are not part of the kernel memory management", I mean to say that they are not part of the kernel memory management process you were mentioning, ie. compacting, moving, etc. Sergio > So again, do you have some references to code/articles where this > "dynamic" behavior of hugepages managed by hugetlbfs is mentioned? > > Sergio > >>>> Note that this >>>> probably means that using uio on recent kernels is subtly >>>> broken and cannot be supported going forward because there >>>> is no uio mechanism to pin the memory. >>>> >>>> The first open question I have is whether DPDK should allow >>>> uio at all on recent (4.x) kernels. 
My current understanding >>>> is that there is no way to pin memory and hugepages can now >>>> be moved around, so uio would be unsafe. What does the >>>> community think here? >>>> >>>> My second question is whether the user should be allowed to >>>> mix uio and vfio usage simultaneously. For vfio, the >>>> physical addresses are really DMA addresses and are best >>>> when arbitrarily chosen to appear sequential relative to >>>> their virtual addresses. >>> Why "sequential relative to their virtual addresses"? IOMMU table is >>> for >>> DMA addr -> physical addr mapping. So we need to DMA addresses >>> "sequential relative to their physical addresses"? Based on your above >>> analysis on how hugepages are initialized, virtual addresses is a good >>> candidate for DMA address? >> The code already goes through a separate organizational step on all of >> the pages that remaps the virtual addresses such that they're sequential >> relative to the physical backing pages, so this mostly ends up as the >> same >> thing. >> Choosing to use the virtual address is a totally valid choice, but I >> worry it >> may lead to confusion during debugging or in a multi-process scenario. >> I'm open to making this choice instead of starting from zero, though. >> >>> Thanks, >>> Jianfeng > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-01-05 10:16 ` Sergio Gonzalez Monroy @ 2017-01-05 14:58 ` Tan, Jianfeng 0 siblings, 0 replies; 18+ messages in thread From: Tan, Jianfeng @ 2017-01-05 14:58 UTC (permalink / raw) To: Sergio Gonzalez Monroy, Walker, Benjamin, dev Hi, On 1/5/2017 6:16 PM, Sergio Gonzalez Monroy wrote: > On 05/01/2017 10:09, Sergio Gonzalez Monroy wrote: >> On 04/01/2017 21:34, Walker, Benjamin wrote: >>> On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote: >>>> Hi Benjamin, >>>> >>>> >>>> On 12/30/2016 4:41 AM, Walker, Benjamin wrote: >>>>> DPDK today begins by allocating all of the required >>>>> hugepages, then finds all of the physical addresses for >>>>> those hugepages using /proc/self/pagemap, sorts the >>>>> hugepages by physical address, then remaps the pages to >>>>> contiguous virtual addresses. Later on and if vfio is >>>>> enabled, it asks vfio to pin the hugepages and to set their >>>>> DMA addresses in the IOMMU to be the physical addresses >>>>> discovered earlier. Of course, running as an unprivileged >>>>> user means all of the physical addresses in >>>>> /proc/self/pagemap are just 0, so this doesn't end up >>>>> working. Further, there is no real reason to choose the >>>>> physical address as the DMA address in the IOMMU - it would >>>>> be better to just count up starting at 0. >>>> Why not just using virtual address as the DMA address in this case to >>>> avoid maintaining another kind of addresses? >>> That's a valid choice, although I'm just storing the DMA address in the >>> physical address field that already exists. You either have a physical >>> address or a DMA address and never both. >>> >>>>> Also, because the >>>>> pages are pinned after the virtual to physical mapping is >>>>> looked up, there is a window where a page could be moved. 
>>>>> Hugepage mappings can be moved on more recent kernels (at >>>>> least 4.x), and the reliability of hugepages having static >>>>> mappings decreases with every kernel release. >>>> Do you mean kernel might take back a physical page after mapping it >>>> to a >>>> virtual page (maybe copy the data to another physical page)? Could you >>>> please show some links or kernel commits? >>> Yes - the kernel can move a physical page to another physical page >>> and change the virtual mapping at any time. For a concise example >>> see 'man migrate_pages(2)', or for a more serious example the code >>> that performs memory page compaction in the kernel which was >>> recently extended to support hugepages. >>> >>> Before we go down the path of me proving that the mapping isn't static, >>> let me turn that line of thinking around. Do you have any documentation >>> demonstrating that the mapping is static? It's not static for 4k >>> pages, so >>> why are we assuming that it is static for 2MB pages? I understand that >>> it happened to be static for some versions of the kernel, but my >>> understanding >>> is that this was purely by coincidence and never by intention. >> >> It looks to me as if you are talking about Transparent hugepages, and >> not hugetlbfs managed hugepages (DPDK usecase). >> AFAIK memory (hugepages) managed by hugetlbfs is not compacted and/or >> moved, they are not part of the kernel memory management. >> > > Please forgive my loose/poor use of words here when saying that "they > are not part of the kernel memory management", I mean to say that > they are not part of the kernel memory management process you were > mentioning, ie. compacting, moving, etc. > > Sergio > >> So again, do you have some references to code/articles where this >> "dynamic" behavior of hugepages managed by hugetlbfs is mentioned? 
>> >> Sergio According to the information Benjamin provided, I did some homework and found this macro in the kernel config, CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION, and further the function hugepage_migration_supported(). It seems there are at least three ways this behavior can be triggered (based on Linux 4.8.1): a) through the migrate_pages() syscall; b) through the move_pages() syscall; c) since some kernel version, there is a kthread named kcompactd for each NUMA node that performs memory compaction. Thanks, Jianfeng >> >>>>> Note that this >>>>> probably means that using uio on recent kernels is subtly >>>>> broken and cannot be supported going forward because there >>>>> is no uio mechanism to pin the memory. >>>>> >>>>> The first open question I have is whether DPDK should allow >>>>> uio at all on recent (4.x) kernels. My current understanding >>>>> is that there is no way to pin memory and hugepages can now >>>>> be moved around, so uio would be unsafe. What does the >>>>> community think here? >>>>> >>>>> My second question is whether the user should be allowed to >>>>> mix uio and vfio usage simultaneously. For vfio, the >>>>> physical addresses are really DMA addresses and are best >>>>> when arbitrarily chosen to appear sequential relative to >>>>> their virtual addresses. >>>> Why "sequential relative to their virtual addresses"? IOMMU table >>>> is for >>>> DMA addr -> physical addr mapping. So we need to DMA addresses >>>> "sequential relative to their physical addresses"? Based on your above >>>> analysis on how hugepages are initialized, virtual addresses is a good >>>> candidate for DMA address? >>> The code already goes through a separate organizational step on all of >>> the pages that remaps the virtual addresses such that they're >>> sequential >>> relative to the physical backing pages, so this mostly ends up as >>> the same >>> thing. 
>>> Choosing to use the virtual address is a totally valid choice, but I >>> worry it >>> may lead to confusion during debugging or in a multi-process scenario. >>> I'm open to making this choice instead of starting from zero, though. >>> >>>> Thanks, >>>> Jianfeng >> >> > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-01-04 21:34 ` Walker, Benjamin 2017-01-05 10:09 ` Sergio Gonzalez Monroy @ 2017-01-05 15:52 ` Tan, Jianfeng 2017-11-05 0:17 ` Thomas Monjalon 1 sibling, 1 reply; 18+ messages in thread From: Tan, Jianfeng @ 2017-01-05 15:52 UTC (permalink / raw) To: Walker, Benjamin, dev Hi Benjamin, On 1/5/2017 5:34 AM, Walker, Benjamin wrote: > On Wed, 2017-01-04 at 19:39 +0800, Tan, Jianfeng wrote: >> Hi Benjamin, >> >> >> On 12/30/2016 4:41 AM, Walker, Benjamin wrote: >>> DPDK today begins by allocating all of the required >>> hugepages, then finds all of the physical addresses for >>> those hugepages using /proc/self/pagemap, sorts the >>> hugepages by physical address, then remaps the pages to >>> contiguous virtual addresses. Later on and if vfio is >>> enabled, it asks vfio to pin the hugepages and to set their >>> DMA addresses in the IOMMU to be the physical addresses >>> discovered earlier. Of course, running as an unprivileged >>> user means all of the physical addresses in >>> /proc/self/pagemap are just 0, so this doesn't end up >>> working. Further, there is no real reason to choose the >>> physical address as the DMA address in the IOMMU - it would >>> be better to just count up starting at 0. >> Why not just using virtual address as the DMA address in this case to >> avoid maintaining another kind of addresses? > That's a valid choice, although I'm just storing the DMA address in the > physical address field that already exists. You either have a physical > address or a DMA address and never both. Yes, I understand that's why you cast the second question below. > >>> Also, because the >>> pages are pinned after the virtual to physical mapping is >>> looked up, there is a window where a page could be moved. >>> Hugepage mappings can be moved on more recent kernels (at >>> least 4.x), and the reliability of hugepages having static >>> mappings decreases with every kernel release. 
>> Do you mean kernel might take back a physical page after mapping it to a >> virtual page (maybe copy the data to another physical page)? Could you >> please show some links or kernel commits? > Yes - the kernel can move a physical page to another physical page > and change the virtual mapping at any time. For a concise example > see 'man migrate_pages(2)', or for a more serious example the code > that performs memory page compaction in the kernel which was > recently extended to support hugepages. > > Before we go down the path of me proving that the mapping isn't static, > let me turn that line of thinking around. Do you have any documentation > demonstrating that the mapping is static? It's not static for 4k pages, so > why are we assuming that it is static for 2MB pages? I understand that > it happened to be static for some versions of the kernel, but my understanding > is that this was purely by coincidence and never by intention. Thank you for the information. Based on what you provide above, I realize this behavior could happen since long time ago. > >>> Note that this >>> probably means that using uio on recent kernels is subtly >>> broken and cannot be supported going forward because there >>> is no uio mechanism to pin the memory. >>> >>> The first open question I have is whether DPDK should allow >>> uio at all on recent (4.x) kernels. My current understanding >>> is that there is no way to pin memory and hugepages can now >>> be moved around, so uio would be unsafe. What does the >>> community think here? Back to this question, removing uio support in DPDK seems a little overkill to me. Can we just document it down? Like, firstly warn users do not invoke migrate_pages() or move_pages() to a DPDK process; as for the kcompactd daemon and some more cases (like compaction could be triggered by alloc_pages()), could we just recommend to disable CONFIG_COMPACTION? Another side, how does vfio pin those memory? Through memlock (from code in vfio_pin_pages())? 
So why not just mlock those hugepages? >>> >>> My second question is whether the user should be allowed to >>> mix uio and vfio usage simultaneously. For vfio, the >>> physical addresses are really DMA addresses and are best >>> when arbitrarily chosen to appear sequential relative to >>> their virtual addresses. >> Why "sequential relative to their virtual addresses"? IOMMU table is for >> DMA addr -> physical addr mapping. So we need to DMA addresses >> "sequential relative to their physical addresses"? Based on your above >> analysis on how hugepages are initialized, virtual addresses is a good >> candidate for DMA address? > The code already goes through a separate organizational step on all of > the pages that remaps the virtual addresses such that they're sequential > relative to the physical backing pages, so this mostly ends up as the same > thing. Agreed. > Choosing to use the virtual address is a totally valid choice, but I worry it > may lead to confusion during debugging or in a multi-process scenario. Makes sense. Thanks, Jianfeng ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-01-05 15:52 ` Tan, Jianfeng @ 2017-11-05 0:17 ` Thomas Monjalon 2017-11-27 17:58 ` Walker, Benjamin 0 siblings, 1 reply; 18+ messages in thread From: Thomas Monjalon @ 2017-11-05 0:17 UTC (permalink / raw) To: Tan, Jianfeng, Walker, Benjamin, sergio.gonzalez.monroy, anatoly.burakov Cc: dev Hi, restarting an old topic, 05/01/2017 16:52, Tan, Jianfeng: > On 1/5/2017 5:34 AM, Walker, Benjamin wrote: > >>> Note that this > >>> probably means that using uio on recent kernels is subtly > >>> broken and cannot be supported going forward because there > >>> is no uio mechanism to pin the memory. > >>> > >>> The first open question I have is whether DPDK should allow > >>> uio at all on recent (4.x) kernels. My current understanding > >>> is that there is no way to pin memory and hugepages can now > >>> be moved around, so uio would be unsafe. What does the > >>> community think here? > > Back to this question, removing uio support in DPDK seems a little > overkill to me. Can we just document it down? Like, firstly warn users > do not invoke migrate_pages() or move_pages() to a DPDK process; as for > the kcompactd daemon and some more cases (like compaction could be > triggered by alloc_pages()), could we just recommend to disable > CONFIG_COMPACTION? We really need to better document the limitations of UIO. May we have some suggestions here? > Another side, how does vfio pin those memory? Through memlock (from code > in vfio_pin_pages())? So why not just mlock those hugepages? Good question. Why not mlock the hugepages? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-11-05 0:17 ` Thomas Monjalon @ 2017-11-27 17:58 ` Walker, Benjamin 2017-11-28 14:16 ` Alejandro Lucero 0 siblings, 1 reply; 18+ messages in thread From: Walker, Benjamin @ 2017-11-27 17:58 UTC (permalink / raw) To: thomas, Gonzalez Monroy, Sergio, Burakov, Anatoly, Tan, Jianfeng; +Cc: dev On Sun, 2017-11-05 at 01:17 +0100, Thomas Monjalon wrote: > Hi, restarting an old topic, > > 05/01/2017 16:52, Tan, Jianfeng: > > On 1/5/2017 5:34 AM, Walker, Benjamin wrote: > > > > > Note that this > > > > > probably means that using uio on recent kernels is subtly > > > > > broken and cannot be supported going forward because there > > > > > is no uio mechanism to pin the memory. > > > > > > > > > > The first open question I have is whether DPDK should allow > > > > > uio at all on recent (4.x) kernels. My current understanding > > > > > is that there is no way to pin memory and hugepages can now > > > > > be moved around, so uio would be unsafe. What does the > > > > > community think here? > > > > Back to this question, removing uio support in DPDK seems a little > > overkill to me. Can we just document it down? Like, firstly warn users > > do not invoke migrate_pages() or move_pages() to a DPDK process; as for > > the kcompactd daemon and some more cases (like compaction could be > > triggered by alloc_pages()), could we just recommend to disable > > CONFIG_COMPACTION? > > We really need to better document the limitations of UIO. > May we have some suggestions here? > > > Another side, how does vfio pin those memory? Through memlock (from code > > in vfio_pin_pages())? So why not just mlock those hugepages? > > Good question. Why not mlock the hugepages? mlock just guarantees that a virtual page is always backed by *some* physical page of memory. It does not guarantee that over the lifetime of the process a virtual page is mapped to the *same* physical page. 
The kernel is free to transparently move memory around, compress it, dedupe it, etc. vfio is not pinning the memory, but instead is using the IOMMU (a piece of hardware) to participate in the memory management on the platform. If a device begins a DMA transfer to an I/O virtual address, the IOMMU will coordinate with the main MMU to make sure that the data ends up in the correct location, even as the virtual to physical mappings are being modified. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-11-27 17:58 ` Walker, Benjamin @ 2017-11-28 14:16 ` Alejandro Lucero 2017-11-28 17:50 ` Walker, Benjamin 0 siblings, 1 reply; 18+ messages in thread From: Alejandro Lucero @ 2017-11-28 14:16 UTC (permalink / raw) To: Walker, Benjamin Cc: thomas, Gonzalez Monroy, Sergio, Burakov, Anatoly, Tan, Jianfeng, dev On Mon, Nov 27, 2017 at 5:58 PM, Walker, Benjamin <benjamin.walker@intel.com > wrote: > On Sun, 2017-11-05 at 01:17 +0100, Thomas Monjalon wrote: > > Hi, restarting an old topic, > > > > 05/01/2017 16:52, Tan, Jianfeng: > > > On 1/5/2017 5:34 AM, Walker, Benjamin wrote: > > > > > > Note that this > > > > > > probably means that using uio on recent kernels is subtly > > > > > > broken and cannot be supported going forward because there > > > > > > is no uio mechanism to pin the memory. > > > > > > > > > > > > The first open question I have is whether DPDK should allow > > > > > > uio at all on recent (4.x) kernels. My current understanding > > > > > > is that there is no way to pin memory and hugepages can now > > > > > > be moved around, so uio would be unsafe. What does the > > > > > > community think here? > > > > > > Back to this question, removing uio support in DPDK seems a little > > > overkill to me. Can we just document it down? Like, firstly warn users > > > do not invoke migrate_pages() or move_pages() to a DPDK process; as for > > > the kcompactd daemon and some more cases (like compaction could be > > > triggered by alloc_pages()), could we just recommend to disable > > > CONFIG_COMPACTION? > > > > We really need to better document the limitations of UIO. > > May we have some suggestions here? > > > > > Another side, how does vfio pin those memory? Through memlock (from > code > > > in vfio_pin_pages())? So why not just mlock those hugepages? > > > > Good question. Why not mlock the hugepages? > > mlock just guarantees that a virtual page is always backed by *some* > physical > page of memory. 
It does not guarantee that over the lifetime of the > process a > virtual page is mapped to the *same* physical page. The kernel is free to > transparently move memory around, compress it, dedupe it, etc. > > vfio is not pinning the memory, but instead is using the IOMMU (a piece of > hardware) to participate in the memory management on the platform. If a > device > begins a DMA transfer to an I/O virtual address, the IOMMU will coordinate > with > the main MMU to make sure that the data ends up in the correct location, > even as > the virtual to physical mappings are being modified. This last comment confused me because you said VFIO did the page pinning in your first email. I have been looking at the kernel code, and the VFIO driver does pin the pages, at least for IOMMU type 1. I can see a problem adding the same support to UIO, because that implies there is a device doing DMA programmed from user space, which is something the UIO maintainer is against. But because vfio-noiommu mode was implemented just for this, I guess that could be added to the VFIO driver. This does not solve the problem of software not using vfio, though. Apart from improving the UIO documentation when used with DPDK, maybe some sort of check could be done, with DPDK requiring an explicit parameter to make the user aware of the potential risk when UIO is used and kernel page migration is enabled. I am not sure whether this last condition can easily be detected from user space. On another note, we suffered a similar problem when VMs were using SR-IOV and memory ballooning. The IOMMU mapping was removed for the reclaimed memory, but the kernel inside the VM did not get any event, and the device ended up performing erroneous DMA operations. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-11-28 14:16 ` Alejandro Lucero @ 2017-11-28 17:50 ` Walker, Benjamin 2017-11-28 19:13 ` Alejandro Lucero 0 siblings, 1 reply; 18+ messages in thread From: Walker, Benjamin @ 2017-11-28 17:50 UTC (permalink / raw) To: alejandro.lucero Cc: thomas, Gonzalez Monroy, Sergio, Burakov, Anatoly, Tan, Jianfeng, dev On Tue, 2017-11-28 at 14:16 +0000, Alejandro Lucero wrote: > > > On Mon, Nov 27, 2017 at 5:58 PM, Walker, Benjamin <benjamin.walker@intel.com> > wrote: > > On Sun, 2017-11-05 at 01:17 +0100, Thomas Monjalon wrote: > > > Hi, restarting an old topic, > > > > > > 05/01/2017 16:52, Tan, Jianfeng: > > > > On 1/5/2017 5:34 AM, Walker, Benjamin wrote: > > > > > > > Note that this > > > > > > > probably means that using uio on recent kernels is subtly > > > > > > > broken and cannot be supported going forward because there > > > > > > > is no uio mechanism to pin the memory. > > > > > > > > > > > > > > The first open question I have is whether DPDK should allow > > > > > > > uio at all on recent (4.x) kernels. My current understanding > > > > > > > is that there is no way to pin memory and hugepages can now > > > > > > > be moved around, so uio would be unsafe. What does the > > > > > > > community think here? > > > > > > > > Back to this question, removing uio support in DPDK seems a little > > > > overkill to me. Can we just document it down? Like, firstly warn users > > > > do not invoke migrate_pages() or move_pages() to a DPDK process; as for > > > > the kcompactd daemon and some more cases (like compaction could be > > > > triggered by alloc_pages()), could we just recommend to disable > > > > CONFIG_COMPACTION? > > > > > > We really need to better document the limitations of UIO. > > > May we have some suggestions here? > > > > > > > Another side, how does vfio pin those memory? Through memlock (from code > > > > in vfio_pin_pages())? So why not just mlock those hugepages? > > > > > > Good question. 
Why not mlock the hugepages? > > > > mlock just guarantees that a virtual page is always backed by *some* > > physical > > page of memory. It does not guarantee that over the lifetime of the process > > a > > virtual page is mapped to the *same* physical page. The kernel is free to > > transparently move memory around, compress it, dedupe it, etc. > > > > vfio is not pinning the memory, but instead is using the IOMMU (a piece of > > hardware) to participate in the memory management on the platform. If a > > device > > begins a DMA transfer to an I/O virtual address, the IOMMU will coordinate > > with > > the main MMU to make sure that the data ends up in the correct location, > > even as > > the virtual to physical mappings are being modified. > > This last comment confused me because you said VFIO did the page pinning in > your first email. > I have been looking at the kernel code and the VFIO driver does pin the pages, > at least the iommu type 1. The vfio driver does flag the page in a way that prevents some types of movement, so in that sense it is pinning it. I haven't done an audit to guarantee that it prevents all types of movement - that would be very difficult. My point was more that vfio is not strictly relying on pinning to function, but instead relying on the IOMMU. In my previous email I said "pinning" when I really meant "programs the IOMMU". Of course, with vfio-noiommu you'd be back to relying on pinning again, in which case you'd really have to do that full audit of the kernel memory manager to confirm that the flags vfio is setting prevent all movement for any reason. > > I can see a problem adding support to UIO for doing the same, because that > implies there is a device > doing DMAs and programmed from user space, which is something the UIO > maintainer is against. But because > vfio-noiommu mode was implemented just for this, I guess that could be added > to the VFIO driver. This does not > solve the problem of software not using vfio though. 
vfio-noiommu is intended for devices programmed in user space, but primarily for devices that don't require physical addresses to perform data transfers (like RDMA NICs). Those devices don't actually require pinned memory and already participate in the regular memory management on the platform, so putting them behind an IOMMU is of no additional value. > > Apart from improving the UIO documentation when used with DPDK, maybe some > sort of check could be done > and DPDK requiring a explicit parameter for making the user aware of the > potential risk when UIO is used and the > kernel page migration is enabled. Not sure if this last thing could be easily > known from user space. The challenge is that there are so many reasons for a page to move, and more are added all the time. It would be really hard to correctly prevent the user from using uio in every case. Further, if the user is using uio inside of a virtual machine that happens to be deployed using the IOMMU on the host system, most of the reasons for a page to move (besides explicit requests to move pages) are alleviated and it is more or less safe. But the user would have no idea from within the guest that they're actually protected. I think this case - using uio from within a guest VM that is protected by the IOMMU - is common. > > On another side, we suffered a similar problem when VMs were using SRIOV and > memory balloning. The IOMMU was > removing the mapping for the memory removed, but the kernel inside the VM did > not get any event and the device > ended up doing some wrong DMA operation. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [dpdk-dev] Running DPDK as an unprivileged user 2017-11-28 17:50 ` Walker, Benjamin @ 2017-11-28 19:13 ` Alejandro Lucero 0 siblings, 0 replies; 18+ messages in thread From: Alejandro Lucero @ 2017-11-28 19:13 UTC (permalink / raw) To: Walker, Benjamin Cc: thomas, Gonzalez Monroy, Sergio, Burakov, Anatoly, Tan, Jianfeng, dev On Tue, Nov 28, 2017 at 5:50 PM, Walker, Benjamin <benjamin.walker@intel.com > wrote: > On Tue, 2017-11-28 at 14:16 +0000, Alejandro Lucero wrote: > > > > > > On Mon, Nov 27, 2017 at 5:58 PM, Walker, Benjamin < > benjamin.walker@intel.com> > > wrote: > > > On Sun, 2017-11-05 at 01:17 +0100, Thomas Monjalon wrote: > > > > Hi, restarting an old topic, > > > > > > > > 05/01/2017 16:52, Tan, Jianfeng: > > > > > On 1/5/2017 5:34 AM, Walker, Benjamin wrote: > > > > > > > > Note that this > > > > > > > > probably means that using uio on recent kernels is subtly > > > > > > > > broken and cannot be supported going forward because there > > > > > > > > is no uio mechanism to pin the memory. > > > > > > > > > > > > > > > > The first open question I have is whether DPDK should allow > > > > > > > > uio at all on recent (4.x) kernels. My current understanding > > > > > > > > is that there is no way to pin memory and hugepages can now > > > > > > > > be moved around, so uio would be unsafe. What does the > > > > > > > > community think here? > > > > > > > > > > Back to this question, removing uio support in DPDK seems a little > > > > > overkill to me. Can we just document it down? Like, firstly warn > users > > > > > do not invoke migrate_pages() or move_pages() to a DPDK process; > as for > > > > > the kcompactd daemon and some more cases (like compaction could be > > > > > triggered by alloc_pages()), could we just recommend to disable > > > > > CONFIG_COMPACTION? > > > > > > > > We really need to better document the limitations of UIO. > > > > May we have some suggestions here? > > > > > > > > > Another side, how does vfio pin those memory? 
Through memlock > (from code > > > > > in vfio_pin_pages())? So why not just mlock those hugepages? > > > > > > > > Good question. Why not mlock the hugepages? > > > > > > mlock just guarantees that a virtual page is always backed by *some* > > > physical > > > page of memory. It does not guarantee that over the lifetime of the > process > > > a > > > virtual page is mapped to the *same* physical page. The kernel is free > to > > > transparently move memory around, compress it, dedupe it, etc. > > > > > > vfio is not pinning the memory, but instead is using the IOMMU (a > piece of > > > hardware) to participate in the memory management on the platform. If a > > > device > > > begins a DMA transfer to an I/O virtual address, the IOMMU will > coordinate > > > with > > > the main MMU to make sure that the data ends up in the correct > location, > > > even as > > > the virtual to physical mappings are being modified. > > > > This last comment confused me because you said VFIO did the page pinning > in > > your first email. > > I have been looking at the kernel code and the VFIO driver does pin the > pages, > > at least the iommu type 1. > > The vfio driver does flag the page in a way that prevents some types of > movement, so in that sense it is pinning it. I haven't done an audit to > guarantee that it prevents all types of movement - that would be very > difficult. > My point was more that vfio is not strictly relying on pinning to > function, but > instead relying on the IOMMU. In my previous email I said "pinning" when I > really meant "programs the IOMMU". Of course, with vfio-noiommu you'd be > back to > relying on pinning again, in which case you'd really have to do that full > audit > of the kernel memory manager to confirm that the flags vfio is setting > prevent > all movement for any reason. > > If you are saying the kernel code related to page migration will know how to reprogram the IOMMU, I think that is unlikely. 
What the VFIO code does is set a flag on the involved pages marking them as "writable", so that it is not safe to migrate them. If that mm code needed to reprogram the IOMMU, it would need to know not just the process whose page table will be modified, but also the device assigned to that process, because the IOMMU mapping is tied to devices, not processes. So I'm not 100% sure, but I don't think the kernel is doing that. > > > > I can see a problem adding support to UIO for doing the same, because > that > > implies there is a device > > doing DMAs and programmed from user space, which is something the UIO > > maintainer is against. But because > > vfio-noiommu mode was implemented just for this, I guess that could be > added > > to the VFIO driver. This does not > > solve the problem of software not using vfio though. > > vfio-noiommu is intended for devices programmed in user space, but > primarily for > devices that don't require physical addresses to perform data transfers > (like > RDMA NICs). Those devices don't actually require pinned memory and already > participate in the regular memory management on the platform, so putting > them > behind an IOMMU is of no additional value. > > AFAIK, noiommu mode was added to VFIO mainly for DPDK, to solve the problem of the unupstreamable igb_uio module and the resistance to adding more features to uio.ko > > > > Apart from improving the UIO documentation when used with DPDK, maybe > some > > sort of check could be done > > and DPDK requiring a explicit parameter for making the user aware of the > > potential risk when UIO is used and the > > kernel page migration is enabled. Not sure if this last thing could be > easily > > known from user space. > > The challenge is that there are so many reasons for a page to move, and > more are > added all the time. It would be really hard to correctly prevent the user > from > using uio in every case. 
Further, if the user is using uio inside of a > virtual > machine that happens to be deployed using the IOMMU on the host system, > most of > the reasons for a page to move (besides explicit requests to move pages) > are > alleviated and it is more or less safe. But the user would have no idea > from > within the guest that they're actually protected. I think this case - > using uio > from within a guest VM that is protected by the IOMMU - is common. > > That is true, but a driver can know if the system is a virtualized one, so that explicit flag might not be needed. > > > > On another side, we suffered a similar problem when VMs were using SRIOV > and > > memory balloning. The IOMMU was > > removing the mapping for the memory removed, but the kernel inside the > VM did > > not get any event and the device > > ended up doing some wrong DMA operation. > ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2017-11-28 19:13 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-12-29 20:41 [dpdk-dev] Running DPDK as an unprivileged user Walker, Benjamin 2016-12-30 1:14 ` Stephen Hemminger 2017-01-02 14:32 ` Thomas Monjalon 2017-01-02 19:47 ` Stephen Hemminger 2017-01-03 22:50 ` Walker, Benjamin 2017-01-04 10:11 ` Thomas Monjalon 2017-01-04 21:35 ` Walker, Benjamin 2017-01-04 11:39 ` Tan, Jianfeng 2017-01-04 21:34 ` Walker, Benjamin 2017-01-05 10:09 ` Sergio Gonzalez Monroy 2017-01-05 10:16 ` Sergio Gonzalez Monroy 2017-01-05 14:58 ` Tan, Jianfeng 2017-01-05 15:52 ` Tan, Jianfeng 2017-11-05 0:17 ` Thomas Monjalon 2017-11-27 17:58 ` Walker, Benjamin 2017-11-28 14:16 ` Alejandro Lucero 2017-11-28 17:50 ` Walker, Benjamin 2017-11-28 19:13 ` Alejandro Lucero