* [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
@ 2015-09-27  7:05 Vlad Zolotarov
  2015-09-27  9:43 ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-27  7:05 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin

Hi,
I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
Amazon EC2 instances with Enhanced Networking enabled.
The idea is to create a DPDK environment that doesn't require compiling
kernel modules (igb_uio).
However I was surprised to discover that uio_pci_generic refuses to work
with an EN device on AWS:

$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

$ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
Error: bind failed for 0000:00:04.0 - Cannot bind to driver uio_pci_generic

$ dmesg
--> snip <---
[  816.655575] uio_pci_generic 0000:00:04.0: No IRQ assigned to device: no support for interrupts?

$ sudo lspci -s 00:04.0 -vvv
00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
	Physical Slot: 4
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
	Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
	Capabilities: [70] MSI-X: Enable- Count=3 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000
	Kernel modules: ixgbevf

So, as we can see, the PCI device indeed doesn't have an INTx interrupt
line assigned. It does have an MSI-X capability, however.
Looking at the uio_pci_generic code, it seems to require INTx:

uio_pci_generic.c: line 74: probe():

	if (!pdev->irq) {
		dev_warn(&pdev->dev, "No IRQ assigned to device: "
			 "no support for interrupts?\n");
		pci_disable_device(pdev);
		return -ENODEV;
	}

Is this a known limitation? Michael, could you please comment on this?

thanks,
vlad

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-27  7:05 [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance Vlad Zolotarov
@ 2015-09-27  9:43 ` Michael S. Tsirkin
  2015-09-27 10:50   ` Vladislav Zolotarov
  2015-09-29 16:41   ` Vlad Zolotarov
  0 siblings, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-27  9:43 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Sun, Sep 27, 2015 at 10:05:11AM +0300, Vlad Zolotarov wrote:
> Hi,
> I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
> Amazon EC2 instances with Enhanced Networking enabled.
> The idea is to create a DPDK environment that doesn't require compiling
> kernel modules (igb_uio).
> However I was surprised to discover that uio_pci_generic refuses to work
> with EN device on AWS:
>
> $ lspci
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
> 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
> 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> 00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
>
> $ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
> Error: bind failed for 0000:00:04.0 - Cannot bind to driver uio_pci_generic
> $dmesg
>
> --> snip <---
> [  816.655575] uio_pci_generic 0000:00:04.0: No IRQ assigned to device: no support for interrupts?
>
> $ sudo lspci -s 00:04.0 -vvv
> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> 	Physical Slot: 4
> 	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
> 	Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
> 	Capabilities: [70] MSI-X: Enable- Count=3 Masked-
> 		Vector table: BAR=3 offset=00000000
> 		PBA: BAR=3 offset=00002000
> 	Kernel modules: ixgbevf
>
> So, as we may see the PCI device doesn't have an INTX interrupt line
> assigned indeed. It has an MSI-X capability however.
> Looking at the uio_pci_generic code it seems to require the INTX:
>
> uio_pci_generic.c: line 74: probe():
>
> 	if (!pdev->irq) {
> 		dev_warn(&pdev->dev, "No IRQ assigned to device: "
> 			 "no support for interrupts?\n");
> 		pci_disable_device(pdev);
> 		return -ENODEV;
> 	}
>
> Is it a known limitation? Michael, could u, pls., comment on this?
>
> thanks,
> vlad

This is expected. uio_pci_generic forwards INT#x interrupts from device
to userspace, but VF devices never assert INT#x.

So it doesn't seem to make any sense to bind uio_pci_generic there.

I think that DPDK should be fixed to not require uio_pci_generic
for VF devices (or any devices without INT#x).

If DPDK requires a place-holder driver, the pci-stub driver should
do this adequately. See ./drivers/pci/pci-stub.c

--
MST

^ permalink raw reply [flat|nested] 100+ messages in thread
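For readers following along: pci-stub has no bind helper of its own and is driven entirely through sysfs. A minimal sketch in C of claiming the VF above with it; the 8086 10ed vendor/device pair is assumed to be the 82599 VF's usual ID (the lspci output above doesn't show numeric IDs), the sysfs paths are the standard PCI driver interface, and error handling is kept to a minimum:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static int write_str(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror(path);
			return -1;
		}
		/* sysfs attributes take a single short write */
		if (write(fd, val, strlen(val)) < 0) {
			perror(path);
			close(fd);
			return -1;
		}
		close(fd);
		return 0;
	}

	int main(void)
	{
		/* Teach pci-stub about the 82599 VF vendor/device pair ... */
		write_str("/sys/bus/pci/drivers/pci-stub/new_id", "8086 10ed");
		/* ... unbind the VF from its current driver ... */
		write_str("/sys/bus/pci/devices/0000:00:04.0/driver/unbind",
			  "0000:00:04.0");
		/* ... and bind it to the stub. */
		write_str("/sys/bus/pci/drivers/pci-stub/bind", "0000:00:04.0");
		return 0;
	}

Note that pci-stub only parks the device: unlike uio_pci_generic, it exposes no interrupt or memory-mapping interface to userspace, which is why the rest of the thread keeps pushing on UIO.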
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-27  9:43 ` Michael S. Tsirkin
@ 2015-09-27 10:50   ` Vladislav Zolotarov
  0 siblings, 0 replies; 100+ messages in thread
From: Vladislav Zolotarov @ 2015-09-27 10:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On Sep 27, 2015 12:43 PM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> On Sun, Sep 27, 2015 at 10:05:11AM +0300, Vlad Zolotarov wrote:
> > Hi,
> > I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
> > Amazon EC2 instances with Enhanced Networking enabled.
> > The idea is to create a DPDK environment that doesn't require compiling
> > kernel modules (igb_uio).
> > However I was surprised to discover that uio_pci_generic refuses to work
> > with EN device on AWS:
> >
> > $ lspci
> > 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> > 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> > 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
> > 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
> > 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
> > 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> > 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> > 00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
> >
> > $ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
> > Error: bind failed for 0000:00:04.0 - Cannot bind to driver uio_pci_generic
> >
> > $dmesg
> >
> > --> snip <---
> > [  816.655575] uio_pci_generic 0000:00:04.0: No IRQ assigned to device: no support for interrupts?
> >
> > $ sudo lspci -s 00:04.0 -vvv
> > 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> > Physical Slot: 4
> > Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
> > Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
> > Capabilities: [70] MSI-X: Enable- Count=3 Masked-
> > Vector table: BAR=3 offset=00000000
> > PBA: BAR=3 offset=00002000
> > Kernel modules: ixgbevf
> >
> > So, as we may see the PCI device doesn't have an INTX interrupt line
> > assigned indeed. It has an MSI-X capability however.
> > Looking at the uio_pci_generic code it seems to require the INTX:
> >
> > uio_pci_generic.c: line 74: probe():
> >
> > if (!pdev->irq) {
> > dev_warn(&pdev->dev, "No IRQ assigned to device: "
> > "no support for interrupts?\n");
> > pci_disable_device(pdev);
> > return -ENODEV;
> > }
> >
> > Is it a known limitation? Michael, could u, pls., comment on this?
> >
> > thanks,
> > vlad
>
> This is expected. uio_pci_generic forwards INT#x interrupts from device
> to userspace, but VF devices never assert INT#x.
>
> So it doesn't seem to make any sense to bind uio_pci_generic there.
>
> I think that DPDK should be fixed to not require uio_pci_generic
> for VF devices (or any devices without INT#x).
>
> If DPDK requires a place-holder driver, the pci-stub driver should
> do this adequately. See ./drivers/pci/pci-stub.c

Thanks for the clarification, Michael. I'll take a look.

>
> --
> MST

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-27  9:43 ` Michael S. Tsirkin
  2015-09-27 10:50   ` Vladislav Zolotarov
@ 2015-09-29 16:41   ` Vlad Zolotarov
  2015-09-29 20:54     ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-29 16:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/27/15 12:43, Michael S. Tsirkin wrote:
> On Sun, Sep 27, 2015 at 10:05:11AM +0300, Vlad Zolotarov wrote:
>> Hi,
>> I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
>> Amazon EC2 instances with Enhanced Networking enabled.
>> The idea is to create a DPDK environment that doesn't require compiling
>> kernel modules (igb_uio).
>> However I was surprised to discover that uio_pci_generic refuses to work
>> with EN device on AWS:
>>
>> $ lspci
>> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
>> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
>> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
>> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
>> 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
>> 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
>> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
>> 00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
>>
>> $ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
>> Error: bind failed for 0000:00:04.0 - Cannot bind to driver uio_pci_generic
>> $dmesg
>>
>> --> snip <---
>> [  816.655575] uio_pci_generic 0000:00:04.0: No IRQ assigned to device: no support for interrupts?
>>
>> $ sudo lspci -s 00:04.0 -vvv
>> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
>> Physical Slot: 4
>> Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
>> Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
>> Capabilities: [70] MSI-X: Enable- Count=3 Masked-
>> Vector table: BAR=3 offset=00000000
>> PBA: BAR=3 offset=00002000
>> Kernel modules: ixgbevf
>>
>> So, as we may see the PCI device doesn't have an INTX interrupt line
>> assigned indeed. It has an MSI-X capability however.
>> Looking at the uio_pci_generic code it seems to require the INTX:
>>
>> uio_pci_generic.c: line 74: probe():
>>
>> if (!pdev->irq) {
>> dev_warn(&pdev->dev, "No IRQ assigned to device: "
>> "no support for interrupts?\n");
>> pci_disable_device(pdev);
>> return -ENODEV;
>> }
>>
>> Is it a known limitation? Michael, could u, pls., comment on this?
>>
>> thanks,
>> vlad

Michael, I took a look at the pci_stub driver and at the reason why DPDK
uses UIO in the first place, and I have some comments below.

> This is expected. uio_pci_generic forwards INT#x interrupts from device
> to userspace, but VF devices never assert INT#x.
>
> So it doesn't seem to make any sense to bind uio_pci_generic there.

Well, it's not completely correct to put it this way. The thing is that
DPDK (and it could be any other framework/developer) uses uio_pci_generic
to actually get interrupts from the device, and it makes perfect sense to
be able to do so with SR-IOV devices too.
The problem is, as you've described above, that the current
implementation of uio_pci_generic won't let them do so, and that seems
like bogus behavior to me. There is no reason why uio_pci_generic
couldn't work the same way it does today but with MSI-X interrupts; to
keep things simple, an initial implementation could forward just the
first vector.

The security breach motivation you brought up in the "[RFC PATCH] uio:
uio_pci_generic: Add support for MSI interrupts" thread seems a bit
weak, since once you let userland access the BAR it may do any funny
thing using the DMA engine of the device. This kind of stuff should be
prevented using the IOMMU, and if it's enabled then any funny tricks
using the MSI/MSI-X configuration will be prevented too.

I'm about to send the patch to the main Linux mailing list. Let's
continue this discussion there.

>
> I think that DPDK should be fixed to not require uio_pci_generic
> for VF devices (or any devices without INT#x).
>
> If DPDK requires a place-holder driver, the pci-stub driver should
> do this adequately. See ./drivers/pci/pci-stub.c
>

^ permalink raw reply [flat|nested] 100+ messages in thread
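To make the "forward just the first vector" proposal concrete, here is a minimal sketch of what it could look like inside a UIO driver, using the stock kernel MSI-X API (pci_enable_msix_range(), request_irq()) and UIO's uio_event_notify(); this illustrates the approach under discussion, not the actual patch being prepared:

	#include <linux/interrupt.h>
	#include <linux/pci.h>
	#include <linux/uio_driver.h>

	static irqreturn_t uio_msix_handler(int irq, void *arg)
	{
		struct uio_info *info = arg;

		/* Wake up whoever is blocked in read() on /dev/uioX. */
		uio_event_notify(info);
		return IRQ_HANDLED;
	}

	static int setup_first_msix_vector(struct pci_dev *pdev,
					   struct uio_info *info)
	{
		struct msix_entry entry = { .entry = 0 };	/* vector 0 only */
		int err;

		err = pci_enable_msix_range(pdev, &entry, 1, 1);
		if (err < 0)
			return err;

		/* UIO_IRQ_CUSTOM: this driver, not the UIO core, owns the IRQ. */
		info->irq = UIO_IRQ_CUSTOM;
		return request_irq(entry.vector, uio_msix_handler, 0,
				   "uio_msix", info);
	}

Userspace would then consume these events exactly as it consumes INT#x ones today, by blocking in read() on /dev/uioX.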
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-29 16:41   ` Vlad Zolotarov
@ 2015-09-29 20:54     ` Michael S. Tsirkin
  2015-09-29 21:46       ` Stephen Hemminger
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-29 20:54 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> The security breach motivation u brought in "[RFC PATCH] uio:
> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> since one u let the userland access to the bar it may do any funny thing
> using the DMA engine of the device. This kind of stuff should be prevented
> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> configuration will be prevented too.
>
> I'm about to send the patch to main Linux mailing list. Let's continue this
> discussion there.
>

Basically UIO shouldn't be used with devices capable of DMA.
Use VFIO for that (yes, this implies an emulated or PV IOMMU).
I don't think this can change.

> >
> >I think that DPDK should be fixed to not require uio_pci_generic
> >for VF devices (or any devices without INT#x).
> >
> >If DPDK requires a place-holder driver, the pci-stub driver should
> >do this adequately. See ./drivers/pci/pci-stub.c
> >

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-29 20:54     ` Michael S. Tsirkin
@ 2015-09-29 21:46       ` Stephen Hemminger
  2015-09-29 21:49         ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Stephen Hemminger @ 2015-09-29 21:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On Tue, 29 Sep 2015 23:54:54 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > The security breach motivation u brought in "[RFC PATCH] uio:
> > uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > since one u let the userland access to the bar it may do any funny thing
> > using the DMA engine of the device. This kind of stuff should be prevented
> > using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > configuration will be prevented too.
> >
> > I'm about to send the patch to main Linux mailing list. Let's continue this
> > discussion there.
> >
>
> Basically UIO shouldn't be used with devices capable of DMA.
> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> I don't think this can change.

Given there is no PV IOMMU, and even if there were one it would be too
slow for DPDK use, I can't accept that.

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-29 21:46       ` Stephen Hemminger
@ 2015-09-29 21:49         ` Michael S. Tsirkin
  2015-09-30 10:37           ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-29 21:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> On Tue, 29 Sep 2015 23:54:54 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > The security breach motivation u brought in "[RFC PATCH] uio:
> > > uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > > since one u let the userland access to the bar it may do any funny thing
> > > using the DMA engine of the device. This kind of stuff should be prevented
> > > using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > > configuration will be prevented too.
> > >
> > > I'm about to send the patch to main Linux mailing list. Let's continue this
> > > discussion there.
> > >
> >
> > Basically UIO shouldn't be used with devices capable of DMA.
> > Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > I don't think this can change.
>
> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> use, I can't accept that.

QEMU does allow emulating an IOMMU. DPDK uses static mappings, so I
doubt its speed matters at all.

Anyway, DPDK is doing polling all the time. I don't see why it insists
on using interrupts to detect link-up events. Just poll for that too.

--
MST

^ permalink raw reply [flat|nested] 100+ messages in thread
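Michael's "just poll for that too" is already expressible with the DPDK API of the time; a sketch (rte_eth_link_get_nowait() is the non-blocking link query, and the 100 ms back-off is an arbitrary choice):

	#include <string.h>
	#include <rte_cycles.h>
	#include <rte_ethdev.h>

	/* Detect link-up by polling instead of relying on an LSC interrupt. */
	static void wait_for_link_up(uint8_t port_id)
	{
		struct rte_eth_link link;

		do {
			memset(&link, 0, sizeof(link));
			rte_eth_link_get_nowait(port_id, &link);
			if (!link.link_status)
				rte_delay_ms(100);	/* back off between polls */
		} while (!link.link_status);
	}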
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-29 21:49         ` Michael S. Tsirkin
@ 2015-09-30 10:37           ` Vlad Zolotarov
  2015-09-30 10:58             ` Michael S. Tsirkin
  2015-09-30 17:28             ` Stephen Hemminger
  0 siblings, 2 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 10:37 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stephen Hemminger; +Cc: dev

On 09/30/15 00:49, Michael S. Tsirkin wrote:
> On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
>> On Tue, 29 Sep 2015 23:54:54 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
>>>> The security breach motivation u brought in "[RFC PATCH] uio:
>>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
>>>> since one u let the userland access to the bar it may do any funny thing
>>>> using the DMA engine of the device. This kind of stuff should be prevented
>>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
>>>> configuration will be prevented too.
>>>>
>>>> I'm about to send the patch to main Linux mailing list. Let's continue this
>>>> discussion there.
>>>>
>>> Basically UIO shouldn't be used with devices capable of DMA.
>>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).

If there is an IOMMU in the picture, there shouldn't be any problem
using UIO with DMA-capable devices.

>>> I don't think this can change.
>> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
>> use, I can't accept that.
> QEMU does allow emulating an iommu.

Amazon's EC2 Xen HV doesn't. At least today. Therefore VFIO is not an
option there. And again, it's a general issue, not a DPDK-specific one.
Today one has to develop proprietary modules (like igb_uio) to work
around the issue, and this is lame. IMHO uio_pci_generic should be
fixed to work properly within any virtualized environment, not only
with KVM.

> DPDK uses static mappings, so I
> doubt it's speed matters at all.
>
> Anyway, DPDK is doing polling all the time. I don't see why does it
> insist on using interrupts to detect link up events. Just poll for that
> too.
>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 10:37           ` Vlad Zolotarov
@ 2015-09-30 10:58             ` Michael S. Tsirkin
  2015-09-30 11:26               ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 10:58 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 01:37:22PM +0300, Vlad Zolotarov wrote:
>
> On 09/30/15 00:49, Michael S. Tsirkin wrote:
> >On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> >>On Tue, 29 Sep 2015 23:54:54 +0300
> >>"Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>
> >>>On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> >>>>The security breach motivation u brought in "[RFC PATCH] uio:
> >>>>uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> >>>>since one u let the userland access to the bar it may do any funny thing
> >>>>using the DMA engine of the device. This kind of stuff should be prevented
> >>>>using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> >>>>configuration will be prevented too.
> >>>>
> >>>>I'm about to send the patch to main Linux mailing list. Let's continue this
> >>>>discussion there.
> >>>Basically UIO shouldn't be used with devices capable of DMA.
> >>>Use VFIO for that (yes, this implies an emulated or PV IOMMU).
>
> If there is an IOMMU in the picture there shouldn't be any problem to use
> UIO with DMA capable devices.

UIO doesn't enforce the IOMMU though. That's why it's not a good fit.

> >>>I don't think this can change.
> >>Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> >>use, I can't accept that.
> >QEMU does allow emulating an iommu.
>
> Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an option
> there.

Not only that, a bunch of boxes have their IOMMU disabled.
So for such systems, you can't have userspace poking at
device registers. You need a kernel driver to validate
userspace requests before passing them on to devices.

> And again, it's a general issue not DPDK specific.
> Today one has to develop some proprietary modules (like igb_uio) to
> workaround the issue and this is lame.

Of course it is lame. So don't bypass the kernel then, use the
upstream drivers.

> IMHO uio_pci_generic should
> be fixed to be able to properly work within any virtualized environment and
> not only with KVM.

The motivation for UIO is pretty clear:

	For many types of devices, creating a Linux kernel driver is
	overkill. All that is really needed is some way to handle an
	interrupt and provide access to the memory space of the
	device. The logic of controlling the device does not
	necessarily have to be within the kernel, as the device does
	not need to take advantage of any of other resources that the
	kernel provides. One such common class of devices that are
	like this are for industrial I/O cards.

Devices doing DMA do need to take advantage of memory protection
that the kernel provides.

>
> > DPDK uses static mappings, so I
> >doubt it's speed matters at all.
> >
> >Anyway, DPDK is doing polling all the time. I don't see why does it
> >insist on using interrupts to detect link up events. Just poll for that
> >too.
> >

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 10:58             ` Michael S. Tsirkin
@ 2015-09-30 11:26               ` Vlad Zolotarov
       [not found]                 ` <20150930143927-mutt-send-email-mst@redhat.com>
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 11:26 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/15 13:58, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 01:37:22PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 00:49, Michael S. Tsirkin wrote:
>>> On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
>>>> On Tue, 29 Sep 2015 23:54:54 +0300
>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>
>>>>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
>>>>>> The security breach motivation u brought in "[RFC PATCH] uio:
>>>>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
>>>>>> since one u let the userland access to the bar it may do any funny thing
>>>>>> using the DMA engine of the device. This kind of stuff should be prevented
>>>>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
>>>>>> configuration will be prevented too.
>>>>>>
>>>>>> I'm about to send the patch to main Linux mailing list. Let's continue this
>>>>>> discussion there.
>>>>> Basically UIO shouldn't be used with devices capable of DMA.
>>>>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
>> If there is an IOMMU in the picture there shouldn't be any problem to use
>> UIO with DMA capable devices.
> UIO doesn't enforce the IOMMU though. That's why it's not a good fit.

Having said all that - does UIO refuse to work with DMA-capable devices
today? Either I have missed that logic or it's not there. So everything
you are so worried about may already be done today. That's why I don't
understand why adding support for MSI/MSI-X interrupts would change
anything here. You are right that UIO *today* has a security hole;
however, it should be addressed separately, and the same solution that
covers the security breach in the current code will cover the
"MSI/MSI-X security vulnerability", because they are actually exactly
the same issue.

>
>>>>> I don't think this can change.
>>>> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
>>>> use, I can't accept that.
>>> QEMU does allow emulating an iommu.
>> Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an option
>> there.
> Not only that, a bunch of boxes have their IOMMU disabled.
> So for such systems, you can't have userspace poking at
> device registers. You need a kernel driver to validate
> userspace requests before passing them on to devices.

I think you are describing HV functionality here. ;) And yes, you are
absolutely right: the HV has to control the non-privileged userland. For
HV/non-virtualized boxes, a possible solution could be to allow UIO only
for some privileged group of processes.

>
>> And again, it's a general issue not DPDK specific.
>> Today one has to develop some proprietary modules (like igb_uio) to
>> workaround the issue and this is lame.
> Of course it is lame. So don't bypass the kernel then, use the upstream drivers.

This would impose a heavy performance penalty. The whole idea is to
bypass the kernel. Especially for networking...

>
>> IMHO uio_pci_generic should
>> be fixed to be able to properly work within any virtualized environment and
>> not only with KVM.
> The motivation for UIO is pretty clear:
>
> 	For many types of devices, creating a Linux kernel driver is
> 	overkill. All that is really needed is some way to handle an
> 	interrupt and provide access to the memory space of the
> 	device. The logic of controlling the device does not
> 	necessarily have to be within the kernel, as the device does
> 	not need to take advantage of any of other resources that the
> 	kernel provides. One such common class of devices that are
> 	like this are for industrial I/O cards.
>
> Devices doing DMA do need to take advantage of memory protection
> that the kernel provides.

Well, yeah - but who said I have to be forbidden from working with the
device if MSI-X interrupts are my only option? The kernel may provide
protection by checking process permissions and denying UIO access to
non-privileged processes. I'm not sure that's the case today, and if
it's not then, as mentioned above, this should be fixed ASAP exactly for
the reasons you bring up here. And once that's done, there shouldn't be
any limitation on allowing MSI or MSI-X interrupts along with INT#x.

>
>>> DPDK uses static mappings, so I
>>> doubt it's speed matters at all.
>>>
>>> Anyway, DPDK is doing polling all the time. I don't see why does it
>>> insist on using interrupts to detect link up events. Just poll for that
>>> too.
>>>

^ permalink raw reply [flat|nested] 100+ messages in thread
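The permission check Vlad sketches has a natural kernel idiom; a hypothetical gate on the UIO open() path could look like this (illustrative only - no such check exists in uio_pci_generic):

	#include <linux/capability.h>
	#include <linux/errno.h>
	#include <linux/uio_driver.h>

	/* Hooked up via uio_info->open; refuses access to unprivileged callers. */
	static int uio_priv_open(struct uio_info *info, struct inode *inode)
	{
		/* CAP_SYS_RAWIO covers raw access to device memory and ports. */
		if (!capable(CAP_SYS_RAWIO))
			return -EPERM;
		return 0;
	}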
[parent not found: <20150930143927-mutt-send-email-mst@redhat.com>]
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
       [not found] ` <20150930143927-mutt-send-email-mst@redhat.com>
@ 2015-09-30 11:53   ` Vlad Zolotarov
  2015-09-30 12:03     ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 11:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/15 14:41, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>> The whole idea is to bypass kernel. Especially for networking...
> ... on dumb hardware that doesn't support doing that securely.

On very capable HW that supports whatever security requirements are
needed (e.g. Intel's 82599 SR-IOV VF devices).

> Colour me unimpressed.
>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 11:53   ` Vlad Zolotarov
@ 2015-09-30 12:03     ` Michael S. Tsirkin
  2015-09-30 12:16       ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 12:03 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>
> On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>The whole idea is to bypass kernel. Especially for networking...
> >... on dumb hardware that doesn't support doing that securely.
>
> On a very capable HW that supports whatever security requirements needed
> (e.g. 82599 Intel's SR-IOV VF devices).

Network card type is irrelevant as long as you do not have an IOMMU,
otherwise you would just use e.g. VFIO.

> >Colour me unimpressed.
> >

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 12:03     ` Michael S. Tsirkin
@ 2015-09-30 12:16       ` Vlad Zolotarov
  2015-09-30 12:27         ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 12:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/15 15:03, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 14:41, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>>>> The whole idea is to bypass kernel. Especially for networking...
>>> ... on dumb hardware that doesn't support doing that securely.
>> On a very capable HW that supports whatever security requirements needed
>> (e.g. 82599 Intel's SR-IOV VF devices).
> Network card type is irrelevant as long as you do not have an IOMMU,
> otherwise you would just use e.g. VFIO.

Sorry, but I don't follow your logic here - the Amazon EC2 environment
is an example where there *is* an IOMMU but it's not virtualized, and
thus VFIO is useless; and there is an option to use a directly assigned
SR-IOV networking device there, where using the kernel drivers imposes
a performance penalty compared to a UIO-based user-space kernel-bypass
mode of usage. How is it irrelevant? Could you please clarify your
point?

>
>>> Colour me unimpressed.
>>>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 12:16       ` Vlad Zolotarov
@ 2015-09-30 12:27         ` Michael S. Tsirkin
  2015-09-30 12:50           ` Vlad Zolotarov
  2015-09-30 13:05           ` Avi Kivity
  0 siblings, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 12:27 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
>
> On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> >>
> >>On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>>>The whole idea is to bypass kernel. Especially for networking...
> >>>... on dumb hardware that doesn't support doing that securely.
> >>On a very capable HW that supports whatever security requirements needed
> >>(e.g. 82599 Intel's SR-IOV VF devices).
> >Network card type is irrelevant as long as you do not have an IOMMU,
> >otherwise you would just use e.g. VFIO.
>
> Sorry, but I don't follow your logic here - Amazon EC2 environment is a
> example where there *is* iommu but it's not virtualized and thus VFIO is
> useless and there is an option to use directly assigned SR-IOV networking
> device there where using the kernel drivers impose a performance impact
> compared to user space UIO-based user space kernel bypass mode of usage. How
> is it irrelevant? Could u, pls, clarify your point?
>

So it's not even dumb hardware, it's another piece of software
that forces an "all or nothing" approach where either the
device has access to all VM memory, or none.
And this, unfortunately, leaves you with no secure way to
allow userspace drivers.

So it makes even less sense to add insecure work-arounds in the kernel.
It seems quite likely that by the time the new kernel reaches
production X years from now, EC2 will have a virtual iommu.

>
>
> >>>Colour me unimpressed.
> >>>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 12:27         ` Michael S. Tsirkin
@ 2015-09-30 12:50           ` Vlad Zolotarov
  2015-09-30 15:26             ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 12:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/15 15:27, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 15:03, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>>>> On 09/30/15 14:41, Michael S. Tsirkin wrote:
>>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>>>>>> The whole idea is to bypass kernel. Especially for networking...
>>>>> ... on dumb hardware that doesn't support doing that securely.
>>>> On a very capable HW that supports whatever security requirements needed
>>>> (e.g. 82599 Intel's SR-IOV VF devices).
>>> Network card type is irrelevant as long as you do not have an IOMMU,
>>> otherwise you would just use e.g. VFIO.
>> Sorry, but I don't follow your logic here - Amazon EC2 environment is a
>> example where there *is* iommu but it's not virtualized and thus VFIO is
>> useless and there is an option to use directly assigned SR-IOV networking
>> device there where using the kernel drivers impose a performance impact
>> compared to user space UIO-based user space kernel bypass mode of usage. How
>> is it irrelevant? Could u, pls, clarify your point?
>>
> So it's not even dumb hardware, it's another piece of software
> that forces an "all or nothing" approach where either
> device has access to all VM memory, or none.
> And this, unfortunately, leaves you with no secure way to
> allow userspace drivers.

UIO is not secure even today, so what are we arguing about? ;) Adding
MSI/MSI-X support won't change this state, so please discard the
security argument unless you think that UIO is a completely secure piece
of software today. In the latter case, could you please clarify what
would prevent a userspace program from configuring a DMA controller via
registers and doing whatever it wants?

How does not virtualizing the IOMMU force an "all or nothing" approach?
What is insecure in relying on the HV to control the IOMMU and not
letting the VF any access to it? As far as I see it - there isn't any
security problem here at all.

The only problem I see here is that the dumb current uio_pci_generic
implementation forces people to go and invent workarounds instead of
having proper MSI/MSI-X support implemented. And as I've mentioned
above, it has nothing to do with security, because there is no such
thing as security (at the UIO driver level) when we talk about UIO - it
has to be ensured by some other entity, like the HV.

>
> So it makes even less sense to add insecure work-arounds in the kernel.
> It seems quite likely that by the time the new kernel reaches
> production X years from now, EC2 will have a virtual iommu.

I'd bet that the new kernel would reach production long before Amazon
does that... ;)

>
>
>>>>> Colour me unimpressed.
>>>>>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 12:50           ` Vlad Zolotarov
@ 2015-09-30 15:26             ` Michael S. Tsirkin
  2015-09-30 18:15               ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 15:26 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
> How not virtualizing iommu forces "all or nothing" approach?

Looks like you can't limit an assigned device to only access part of
guest memory that belongs to a given process. Either let it access all
of guest memory ("all") or don't assign the device ("nothing").

--
MST

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 15:26             ` Michael S. Tsirkin
@ 2015-09-30 18:15               ` Vlad Zolotarov
  2015-09-30 18:55                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 18:15 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/15 18:26, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>> How not virtualizing iommu forces "all or nothing" approach?
> Looks like you can't limit an assigned device to only access part of
> guest memory that belongs to a given process. Either let it access all
> of guest memory ("all") or don't assign the device ("nothing").

Ok. A question then: can you limit the assigned device to only access
part of the guest memory even if the IOMMU were virtualized? How would
IOMMU virtualization change anything? And why do we care about an
assigned device being able to access all guest memory?

>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 18:15               ` Vlad Zolotarov
@ 2015-09-30 18:55                 ` Michael S. Tsirkin
  2015-09-30 19:06                   ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 18:55 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
>
> On 09/30/15 18:26, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
> >>How not virtualizing iommu forces "all or nothing" approach?
> >Looks like you can't limit an assigned device to only access part of
> >guest memory that belongs to a given process. Either let it access all
> >of guest memory ("all") or don't assign the device ("nothing").
>
> Ok. A question then: can u limit the assigned device to only access part of
> the guest memory even if iommu was virtualized?

That's exactly what an IOMMU does - limit the device's I/O access to
memory.

> How would iommu
> virtualization change anything?

The kernel can use an IOMMU to limit device access to the memory of
the controlling application.

> And why do we care about an assigned device
> to be able to access all Guest memory?

Because we want to be reasonably sure a kernel memory corruption
is not a result of a bug in a userspace application.

--
MST

^ permalink raw reply [flat|nested] 100+ messages in thread
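Concretely, "limit the device's I/O access to memory" with VFIO means userspace maps each buffer it owns into the IOMMU explicitly, and the device can reach nothing else. A sketch of the relevant ioctl (standard VFIO type1 API; container and group setup elided):

	#include <linux/vfio.h>
	#include <string.h>
	#include <sys/ioctl.h>

	/* Allow the device to DMA into this one buffer and nothing else. */
	static int map_for_dma(int container_fd, void *buf, size_t len,
			       unsigned long iova)
	{
		struct vfio_iommu_type1_dma_map map;

		memset(&map, 0, sizeof(map));
		map.argsz = sizeof(map);
		map.vaddr = (unsigned long)buf;
		map.iova  = iova;	/* address the device will use */
		map.size  = len;
		map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

		return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
	}

Any DMA the device attempts outside the mapped iova ranges is blocked by the IOMMU - which is exactly the guarantee UIO cannot give on its own.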
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 18:55                 ` Michael S. Tsirkin
@ 2015-09-30 19:06                   ` Vlad Zolotarov
  2015-09-30 19:10                     ` Vlad Zolotarov
  2015-09-30 19:39                     ` Michael S. Tsirkin
  0 siblings, 2 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 19:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/15 21:55, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 18:26, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>>>> How not virtualizing iommu forces "all or nothing" approach?
>>> Looks like you can't limit an assigned device to only access part of
>>> guest memory that belongs to a given process. Either let it access all
>>> of guest memory ("all") or don't assign the device ("nothing").
>> Ok. A question then: can u limit the assigned device to only access part of
>> the guest memory even if iommu was virtualized?
> That's exactly what an iommu does - limit the device io access to memory.

If it does, it will continue to do so with or without the patch, and if
it doesn't (for any reason), it won't do it even without the patch. So,
again, the above (rhetorical) question stands. ;)

I think Avi has already explained quite in detail why security is
absolutely a non-issue in regard to this patch or in regard to UIO in
general. Security has to be enforced by some other means, like the
IOMMU.

>
>> How would iommu
>> virtualization change anything?
> Kernel can use an iommu to limit device access to memory of
> the controlling application.

Ok, this is obvious, but what does it have to do with enabling MSI/MSI-X
interrupt support in uio_pci_generic? The kernel may continue to limit
the above access with this support as well.

>
>> And why do we care about an assigned device
>> to be able to access all Guest memory?
> Because we want to be reasonably sure a kernel memory corruption
> is not a result of a bug in a userspace application.

Corrupting the Guest's memory due to any SW misbehavior (including bugs)
is a non-issue by design - this is what the HV and Guest machines were
invented for. So, like Avi also said, instead of trying to enforce
something nobody cares about, we'd rather make the developers' life
easier (by applying the not-yet-completed patch I'm working on).

>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 19:06                   ` Vlad Zolotarov
@ 2015-09-30 19:10                     ` Vlad Zolotarov
  2015-09-30 19:11                       ` Vlad Zolotarov
  1 sibling, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 19:10 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/15 22:06, Vlad Zolotarov wrote:
>
>
> On 09/30/15 21:55, Michael S. Tsirkin wrote:
>> On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
>>>
>>> On 09/30/15 18:26, Michael S. Tsirkin wrote:
>>>> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>>>>> How not virtualizing iommu forces "all or nothing" approach?
>>>> Looks like you can't limit an assigned device to only access part of
>>>> guest memory that belongs to a given process. Either let it access
>>>> all
>>>> of guest memory ("all") or don't assign the device ("nothing").
>>> Ok. A question then: can u limit the assigned device to only access
>>> part of
>>> the guest memory even if iommu was virtualized?
>> That's exactly what an iommu does - limit the device io access to
>> memory.
>
> If it does - it will continue to do so with or without the patch and
> if it doesn't (for any reason) it won't do it even without the patch.
> So, again, the above (rhetorical) question stands. ;)
>
> I think Avi has already explained quite in detail why security is
> absolutely a non issue in regard to this patch or in regard to UIO in
> general. Security has to be enforced by some other means like iommu.
>
>>
>>> How would iommu
>>> virtualization change anything?
>> Kernel can use an iommu to limit device access to memory of
>> the controlling application.
>
> Ok, this is obvious but what it has to do with enabling using
> MSI/MSI-X interrupts support in uio_pci_generic? kernel may continue
> to limit the above access with this support as well.
>
>>
>>> And why do we care about an assigned device
>>> to be able to access all Guest memory?
>> Because we want to be reasonably sure a kernel memory corruption
>> is not a result of a bug in a userspace application.
>
> Corrupting Guest's memory due to any SW misbehavior (including bugs)
> is a non-issue by design - this is what HV and Guest machines were
> invented for. So, like Avi also said, instead of trying to enforce
> nobody cares about

Let me rephrase: by pretending to enforce some security promise that
you don't actually fulfill... ;)

> we'd rather make the
> developers life easier instead (by applying the
> not-yet-completed patch I'm working on).
>>
>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 19:10                     ` Vlad Zolotarov
@ 2015-09-30 19:11                       ` Vlad Zolotarov
  0 siblings, 0 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 19:11 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/15 22:10, Vlad Zolotarov wrote:
>
>
> On 09/30/15 22:06, Vlad Zolotarov wrote:
>>
>>
>> On 09/30/15 21:55, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
>>>>
>>>> On 09/30/15 18:26, Michael S. Tsirkin wrote:
>>>>> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>>>>>> How not virtualizing iommu forces "all or nothing" approach?
>>>>> Looks like you can't limit an assigned device to only access part of
>>>>> guest memory that belongs to a given process. Either let it
>>>>> access all
>>>>> of guest memory ("all") or don't assign the device ("nothing").
>>>> Ok. A question then: can u limit the assigned device to only access
>>>> part of
>>>> the guest memory even if iommu was virtualized?
>>> That's exactly what an iommu does - limit the device io access to
>>> memory.
>>
>> If it does - it will continue to do so with or without the patch and
>> if it doesn't (for any reason) it won't do it even without the patch.
>> So, again, the above (rhetorical) question stands. ;)
>>
>> I think Avi has already explained quite in detail why security is
>> absolutely a non issue in regard to this patch or in regard to UIO in
>> general. Security has to be enforced by some other means like iommu.
>>
>>>
>>>> How would iommu
>>>> virtualization change anything?
>>> Kernel can use an iommu to limit device access to memory of
>>> the controlling application.
>>
>> Ok, this is obvious but what it has to do with enabling using
>> MSI/MSI-X interrupts support in uio_pci_generic? kernel may continue
>> to limit the above access with this support as well.
>>
>>>
>>>> And why do we care about an assigned device
>>>> to be able to access all Guest memory?
>>> Because we want to be reasonably sure a kernel memory corruption
>>> is not a result of a bug in a userspace application.
>>
>> Corrupting Guest's memory due to any SW misbehavior (including bugs)
>> is a non-issue by design - this is what HV and Guest machines were
>> invented for. So, like Avi also said, instead of trying to enforce
>> nobody cares about
>
> Let me rephrase: by pretending enforcing some security promise that u
> don't actually fulfill... ;)

...the promise nobody really cares about...

>
>> we'd rather make the developers life easier instead (by applying the
>> not-yet-completed patch I'm working on).
>>>
>>
>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 19:06                   ` Vlad Zolotarov
  2015-09-30 19:10                     ` Vlad Zolotarov
@ 2015-09-30 19:39                     ` Michael S. Tsirkin
  2015-09-30 20:09                       ` Vlad Zolotarov
  1 sibling, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 19:39 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> >>How would iommu
> >>virtualization change anything?
> >Kernel can use an iommu to limit device access to memory of
> >the controlling application.
>
> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
> interrupts support in uio_pci_generic? kernel may continue to limit the
> above access with this support as well.

It could, maybe. So if you write a patch to allow MSI while at the same
time creating an isolated IOMMU group and blocking DMA from the device
in question anywhere, that sounds reasonable.

--
MST

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 19:39                     ` Michael S. Tsirkin
@ 2015-09-30 20:09                       ` Vlad Zolotarov
  2015-09-30 21:36                         ` Stephen Hemminger
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 20:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/15 22:39, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
>>>> How would iommu
>>>> virtualization change anything?
>>> Kernel can use an iommu to limit device access to memory of
>>> the controlling application.
>> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
>> interrupts support in uio_pci_generic? kernel may continue to limit the
>> above access with this support as well.
> It could maybe. So if you write a patch to allow MSI by at the same time
> creating an isolated IOMMU group and blocking DMA from device in
> question anywhere, that sounds reasonable.

No, I'm only planning to add MSI and MSI-X interrupt support for the
uio_pci_generic device. The rest mentioned above should naturally be a
matter of a different patch, and writing it is orthogonal to the patch
I'm working on, as has been extensively discussed in this thread.

>

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 20:09                       ` Vlad Zolotarov
@ 2015-09-30 21:36                         ` Stephen Hemminger
  2015-09-30 21:53                           ` Michael S. Tsirkin
                                             ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Stephen Hemminger @ 2015-09-30 21:36 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev, Michael S. Tsirkin

On Wed, 30 Sep 2015 23:09:33 +0300
Vlad Zolotarov <vladz@cloudius-systems.com> wrote:

>
>
> On 09/30/15 22:39, Michael S. Tsirkin wrote:
> > On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> >>>> How would iommu
> >>>> virtualization change anything?
> >>> Kernel can use an iommu to limit device access to memory of
> >>> the controlling application.
> >> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
> >> interrupts support in uio_pci_generic? kernel may continue to limit the
> >> above access with this support as well.
> > It could maybe. So if you write a patch to allow MSI by at the same time
> > creating an isolated IOMMU group and blocking DMA from device in
> > question anywhere, that sounds reasonable.
>
> No, I'm only planning to add MSI and MSI-X interrupts support for
> uio_pci_generic device.
> The rest mentioned above should naturally be a matter of a different
> patch and writing it is orthogonal to the patch I'm working on as has
> been extensively discussed in this thread.
>

I have a generic MSI and MSI-X driver (posted earlier on this list).
I'm about to post it to the upstream kernel.

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 21:36                         ` Stephen Hemminger
@ 2015-09-30 21:53                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 21:53 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Wed, Sep 30, 2015 at 02:36:48PM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 23:09:33 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
> >
> >
> > On 09/30/15 22:39, Michael S. Tsirkin wrote:
> > > On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> > >>>> How would iommu
> > >>>> virtualization change anything?
> > >>> Kernel can use an iommu to limit device access to memory of
> > >>> the controlling application.
> > >> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
> > >> interrupts support in uio_pci_generic? kernel may continue to limit the
> > >> above access with this support as well.
> > > It could maybe. So if you write a patch to allow MSI by at the same time
> > > creating an isolated IOMMU group and blocking DMA from device in
> > > question anywhere, that sounds reasonable.
> >
> > No, I'm only planning to add MSI and MSI-X interrupts support for
> > uio_pci_generic device.
> > The rest mentioned above should naturally be a matter of a different
> > patch and writing it is orthogonal to the patch I'm working on as has
> > been extensively discussed in this thread.
> >
>
> I have a generic MSI and MSI-X driver (posted earlier on this list).
> About to post to upstream kernel.

If Linux holds out and refuses to support insecure interfaces,
hypervisor vendors will add secure ones. If Linux lets them ignore guest
security, they will.

--
MST

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 21:36                         ` Stephen Hemminger
  2015-09-30 21:53                           ` Michael S. Tsirkin
@ 2015-09-30 22:20                           ` Vlad Zolotarov
  2015-10-01  8:00                           ` Vlad Zolotarov
  2 siblings, 0 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 22:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Michael S. Tsirkin

On 10/01/15 00:36, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 23:09:33 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
>>
>> On 09/30/15 22:39, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
>>>>>> How would iommu
>>>>>> virtualization change anything?
>>>>> Kernel can use an iommu to limit device access to memory of
>>>>> the controlling application.
>>>> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
>>>> interrupts support in uio_pci_generic? kernel may continue to limit the
>>>> above access with this support as well.
>>> It could maybe. So if you write a patch to allow MSI by at the same time
>>> creating an isolated IOMMU group and blocking DMA from device in
>>> question anywhere, that sounds reasonable.
>> No, I'm only planning to add MSI and MSI-X interrupts support for
>> uio_pci_generic device.
>> The rest mentioned above should naturally be a matter of a different
>> patch and writing it is orthogonal to the patch I'm working on as has
>> been extensively discussed in this thread.
>>
> I have a generic MSI and MSI-X driver (posted earlier on this list).
> About to post to upstream kernel.

Great! It would save me a few working days... ;) Thanks, Stephen!

^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 21:36                         ` Stephen Hemminger
  2015-09-30 21:53                           ` Michael S. Tsirkin
  2015-09-30 22:20                           ` Vlad Zolotarov
@ 2015-10-01  8:00                           ` Vlad Zolotarov
  2015-10-01 14:47                             ` Stephen Hemminger
  2 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-10-01  8:00 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Michael S. Tsirkin

On 10/01/15 00:36, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 23:09:33 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
>>
>> On 09/30/15 22:39, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
>>>>>> How would iommu
>>>>>> virtualization change anything?
>>>>> Kernel can use an iommu to limit device access to memory of
>>>>> the controlling application.
>>>> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
>>>> interrupts support in uio_pci_generic? kernel may continue to limit the
>>>> above access with this support as well.
>>> It could maybe. So if you write a patch to allow MSI by at the same time
>>> creating an isolated IOMMU group and blocking DMA from device in
>>> question anywhere, that sounds reasonable.
>> No, I'm only planning to add MSI and MSI-X interrupts support for
>> uio_pci_generic device.
>> The rest mentioned above should naturally be a matter of a different
>> patch and writing it is orthogonal to the patch I'm working on as has
>> been extensively discussed in this thread.
>>
> I have a generic MSI and MSI-X driver (posted earlier on this list).
> About to post to upstream kernel.

Stephen, hi!

I found the mentioned series, and the first thing I noticed was that it
was sent in May, so the first question is: how high on your list of
tasks is submitting it upstream? We need it more or less yesterday, and
I'm working on it right now. Therefore, if you don't have time for it,
I'd like to help... ;) However, I'd like you to clarify a few small
things. Please see below...

I noticed that you've created a separate msi_msix driver, and the second
question is: what do you plan for upstream? I was thinking of extending
the existing uio_pci_generic with MSI-X functionality similar to your
code, while preserving the INT#x functionality as it is now:

 * INT#x and MSI would provide the IRQ number to the UIO module, while
   only the MSI-X case would register with UIO_IRQ_CUSTOM.

I also noticed that you enable MSI-X on the first open() call. I assume
there was a good reason (that I'm missing) for not doing it in probe().
Could you please clarify?

^ permalink raw reply [flat|nested] 100+ messages in thread
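For context on what either design means for the application: with classic UIO (INT#x today, and MSI under the scheme Vlad sketches), userspace blocks in read() on the UIO device node and receives a running interrupt count - the eventfd mechanism in uio_msi replaces exactly this step. A minimal sketch:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		uint32_t event_count;
		int fd = open("/dev/uio0", O_RDWR);

		if (fd < 0)
			return 1;
		for (;;) {
			/* Blocks until the kernel driver calls uio_event_notify(). */
			if (read(fd, &event_count, sizeof(event_count)) !=
			    sizeof(event_count))
				break;
			printf("interrupt, total so far: %u\n", event_count);
		}
		close(fd);
		return 0;
	}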
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 8:00 ` Vlad Zolotarov @ 2015-10-01 14:47 ` Stephen Hemminger 2015-10-01 15:03 ` Vlad Zolotarov 0 siblings, 1 reply; 100+ messages in thread From: Stephen Hemminger @ 2015-10-01 14:47 UTC (permalink / raw) To: Vlad Zolotarov; +Cc: dev, Michael S. Tsirkin On Thu, 1 Oct 2015 11:00:28 +0300 Vlad Zolotarov <vladz@cloudius-systems.com> wrote: > > > On 10/01/15 00:36, Stephen Hemminger wrote: > > On Wed, 30 Sep 2015 23:09:33 +0300 > > Vlad Zolotarov <vladz@cloudius-systems.com> wrote: > > > >> > >> On 09/30/15 22:39, Michael S. Tsirkin wrote: > >>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote: > >>>>>> How would iommu > >>>>>> virtualization change anything? > >>>>> Kernel can use an iommu to limit device access to memory of > >>>>> the controlling application. > >>>> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X > >>>> interrupts support in uio_pci_generic? kernel may continue to limit the > >>>> above access with this support as well. > >>> It could maybe. So if you write a patch to allow MSI by at the same time > >>> creating an isolated IOMMU group and blocking DMA from device in > >>> question anywhere, that sounds reasonable. > >> No, I'm only planning to add MSI and MSI-X interrupts support for > >> uio_pci_generic device. > >> The rest mentioned above should naturally be a matter of a different > >> patch and writing it is orthogonal to the patch I'm working on as has > >> been extensively discussed in this thread. > >> > > I have a generic MSI and MSI-X driver (posted earlier on this list). > > About to post to upstream kernel. > > Stephen, hi! > > I found the mentioned series and first thing I noticed was that it's > been sent in May so the first question is how far in your list of tasks > submitting it upstream is? We need it more or less yesterday and I'm > working on it right now. Therefore if u don't have time for it I'd like > to help... ;) However I'd like u to clarify a few small things. Pls., > see below... > > I noticed that u've created a separate msi_msix driver and the second > question is what do u plan for the upstream? I was thinking of extending > the existing uio_pci_generic with the MSI-X functionality similar to > your code and preserving the INT#X functionality as it is now: The igb_uio has a bunch of other things I didn't want to deal with: the name (being specific to old Intel driver); compatibility with older kernels; legacy ABI support. Therefore in effect uio_msi is a rebase of igb_uio. The submission upstream yesterday is the first step, I expect lots of review feedback. > * INT#X and MSI would provide the IRQ number to the UIO module while > only MSI-X case would register with UIO_IRQ_CUSTOM. I wanted all IRQ's to be the same for the driver, ie all go through eventfd mechanism. This makes code on DPDK side consistent with less special cases. > I also noticed that u enable MSI-X on a first open() call. I assume > there was a good reason (that I miss) for not doing it in probe(). Could > u, pls., clarify? ^ permalink raw reply [flat|nested] 100+ messages in thread
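For context, the "everything through eventfd" scheme Stephen describes could look roughly like this on the kernel side (a sketch with assumed names, not the actual uio_msi code): userspace passes in one eventfd per MSI-X vector, and the interrupt handler does nothing but signal it.

#include <linux/interrupt.h>
#include <linux/eventfd.h>
#include <linux/err.h>

struct vec_ctx {
	struct eventfd_ctx *trigger;
};

static irqreturn_t vec_handler(int irq, void *arg)
{
	struct vec_ctx *ctx = arg;

	eventfd_signal(ctx->trigger, 1);	/* wake the userspace poller */
	return IRQ_HANDLED;
}

/* called once per enabled MSI-X vector, with an eventfd fd from userspace */
static int wire_vector(unsigned int vector, int efd, struct vec_ctx *ctx)
{
	ctx->trigger = eventfd_ctx_fdget(efd);
	if (IS_ERR(ctx->trigger))
		return PTR_ERR(ctx->trigger);

	return request_irq(vector, vec_handler, 0, "uio-msix-sketch", ctx);
}

The application then epoll()s the eventfds; this is essentially the VFIO interrupt model, which is presumably what keeps the DPDK-side code free of special cases.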
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 14:47 ` Stephen Hemminger @ 2015-10-01 15:03 ` Vlad Zolotarov 0 siblings, 0 replies; 100+ messages in thread From: Vlad Zolotarov @ 2015-10-01 15:03 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev, Michael S. Tsirkin On 10/01/15 17:47, Stephen Hemminger wrote: > On Thu, 1 Oct 2015 11:00:28 +0300 > Vlad Zolotarov <vladz@cloudius-systems.com> wrote: > >> >> On 10/01/15 00:36, Stephen Hemminger wrote: >>> On Wed, 30 Sep 2015 23:09:33 +0300 >>> Vlad Zolotarov <vladz@cloudius-systems.com> wrote: >>> >>>> On 09/30/15 22:39, Michael S. Tsirkin wrote: >>>>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote: >>>>>>>> How would iommu >>>>>>>> virtualization change anything? >>>>>>> Kernel can use an iommu to limit device access to memory of >>>>>>> the controlling application. >>>>>> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X >>>>>> interrupts support in uio_pci_generic? kernel may continue to limit the >>>>>> above access with this support as well. >>>>> It could maybe. So if you write a patch to allow MSI by at the same time >>>>> creating an isolated IOMMU group and blocking DMA from device in >>>>> question anywhere, that sounds reasonable. >>>> No, I'm only planning to add MSI and MSI-X interrupts support for >>>> uio_pci_generic device. >>>> The rest mentioned above should naturally be a matter of a different >>>> patch and writing it is orthogonal to the patch I'm working on as has >>>> been extensively discussed in this thread. >>>> >>> I have a generic MSI and MSI-X driver (posted earlier on this list). >>> About to post to upstream kernel. >> Stephen, hi! >> >> I found the mentioned series and first thing I noticed was that it's >> been sent in May so the first question is how far in your list of tasks >> submitting it upstream is? We need it more or less yesterday and I'm >> working on it right now. Therefore if u don't have time for it I'd like >> to help... ;) However I'd like u to clarify a few small things. Pls., >> see below... >> >> I noticed that u've created a separate msi_msix driver and the second >> question is what do u plan for the upstream? I was thinking of extending >> the existing uio_pci_generic with the MSI-X functionality similar to >> your code and preserving the INT#X functionality as it is now: > The igb_uio has a bunch of other things I didn't want to deal with: > the name (being specific to old Intel driver); compatibility with older > kernels; legacy ABI support. Therefore in effect uio_msi is a rebase > of igb_uio. > > The submission upstream yesterday is the first step, I expect lots > of review feedback. Sure, we have lots of feedback already even before the patch has been sent... ;) So, I'm preparing the uio_pci_generic patch. Just wanted to make sure we are not working on the same patch at the same time... ;) It's going to enable both MSI and MSI-X support. For a backward compatibility it'll enable INT#X by default. It follows the concepts and uses some code pieces from your uio_msi patch. If u want I'll put u as a signed-off when I send it. > >> * INT#X and MSI would provide the IRQ number to the UIO module while >> only MSI-X case would register with UIO_IRQ_CUSTOM. > I wanted all IRQ's to be the same for the driver, ie all go through > eventfd mechanism. This makes code on DPDK side consistent with less > special cases. Of course. The name (uio_msi) is a bit confusing since it only adds MSI-X support. 
I had mistakenly assumed it adds both MSI and MSI-X; since it adds only MSI-X, there are no further questions... ;) > >> I also noticed that u enable MSI-X on a first open() call. I assume >> there was a good reason (that I miss) for not doing it in probe(). Could >> u, pls., clarify? What about this? ^ permalink raw reply [flat|nested] 100+ messages in thread
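One plausible reason for the open()-time enabling, offered purely as an assumption since the question goes unanswered here: deferring MSI-X setup to the first open() means vectors are allocated, and the device can raise interrupts, only while some process actually holds the device, with release() undoing the work. A sketch (struct sketch_dev and sketch_enable_msix() are hypothetical):

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/uio_driver.h>

struct sketch_dev {
	struct uio_info info;
	struct mutex lock;
	int users;
};

int sketch_enable_msix(struct sketch_dev *sdev);	/* hypothetical helper */

static int sketch_open(struct uio_info *info, struct inode *inode)
{
	struct sketch_dev *sdev = container_of(info, struct sketch_dev, info);
	int err = 0;

	mutex_lock(&sdev->lock);
	if (sdev->users == 0)		/* first opener enables MSI-X */
		err = sketch_enable_msix(sdev);
	if (!err)
		sdev->users++;
	mutex_unlock(&sdev->lock);
	return err;
}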
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 12:27 ` Michael S. Tsirkin 2015-09-30 12:50 ` Vlad Zolotarov @ 2015-09-30 13:05 ` Avi Kivity 2015-09-30 14:39 ` Michael S. Tsirkin 1 sibling, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-09-30 13:05 UTC (permalink / raw) To: Michael S. Tsirkin, Vlad Zolotarov; +Cc: dev On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote: >> >> On 09/30/15 15:03, Michael S. Tsirkin wrote: >>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote: >>>> On 09/30/15 14:41, Michael S. Tsirkin wrote: >>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote: >>>>>> The whole idea is to bypass kernel. Especially for networking... >>>>> ... on dumb hardware that doesn't support doing that securely. >>>> On a very capable HW that supports whatever security requirements needed >>>> (e.g. 82599 Intel's SR-IOV VF devices). >>> Network card type is irrelevant as long as you do not have an IOMMU, >>> otherwise you would just use e.g. VFIO. >> Sorry, but I don't follow your logic here - Amazon EC2 environment is a >> example where there *is* iommu but it's not virtualized >> and thus VFIO is >> useless and there is an option to use directly assigned SR-IOV networking >> device there where using the kernel drivers impose a performance impact >> compared to user space UIO-based user space kernel bypass mode of usage. How >> is it irrelevant? Could u, pls, clarify your point? >> > So it's not even dumb hardware, it's another piece of software > that forces an "all or nothing" approach where either > device has access to all VM memory, or none. > And this, unfortunately, leaves you with no secure way to > allow userspace drivers. Some setups don't need security (they are single-user, single application). But do need a lot of performance (like 5X-10X performance). An example is OpenVSwitch, security doesn't help it at all and if you force it to use the kernel drivers you cripple it. Also, I'm root. I can do anything I like, including loading a patched pci_uio_generic. You're not providing _any_ security, you're simply making life harder for users. > So it makes even less sense to add insecure work-arounds in the kernel. > It seems quite likely that by the time the new kernel reaches > production X years from now, EC2 will have a virtual iommu. I can adopt a new kernel tomorrow. I have no influence on EC2. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 13:05 ` Avi Kivity @ 2015-09-30 14:39 ` Michael S. Tsirkin 2015-09-30 14:53 ` Avi Kivity 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-09-30 14:39 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote: > > > On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote: > >On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote: > >> > >>On 09/30/15 15:03, Michael S. Tsirkin wrote: > >>>On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote: > >>>>On 09/30/15 14:41, Michael S. Tsirkin wrote: > >>>>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote: > >>>>>>The whole idea is to bypass kernel. Especially for networking... > >>>>>... on dumb hardware that doesn't support doing that securely. > >>>>On a very capable HW that supports whatever security requirements needed > >>>>(e.g. 82599 Intel's SR-IOV VF devices). > >>>Network card type is irrelevant as long as you do not have an IOMMU, > >>>otherwise you would just use e.g. VFIO. > >>Sorry, but I don't follow your logic here - Amazon EC2 environment is a > >>example where there *is* iommu but it's not virtualized > >>and thus VFIO is > >>useless and there is an option to use directly assigned SR-IOV networking > >>device there where using the kernel drivers impose a performance impact > >>compared to user space UIO-based user space kernel bypass mode of usage. How > >>is it irrelevant? Could u, pls, clarify your point? > >> > >So it's not even dumb hardware, it's another piece of software > >that forces an "all or nothing" approach where either > >device has access to all VM memory, or none. > >And this, unfortunately, leaves you with no secure way to > >allow userspace drivers. > > Some setups don't need security (they are single-user, single application). > But do need a lot of performance (like 5X-10X performance). An example is > OpenVSwitch, security doesn't help it at all and if you force it to use the > kernel drivers you cripple it. We'd have to see there are actual users that need this. So far, dpdk seems like the only one, and it wants to use UIO for slow path stuff like polling link status. Why this needs kernel bypass support, I don't know. I asked, and got no answer. > > Also, I'm root. I can do anything I like, including loading a patched > pci_uio_generic. You're not providing _any_ security, you're simply making > life harder for users. Maybe that's true on your system. But I guess you know that's not true for everyone, not in 2015. > >So it makes even less sense to add insecure work-arounds in the kernel. > >It seems quite likely that by the time the new kernel reaches > >production X years from now, EC2 will have a virtual iommu. > > I can adopt a new kernel tomorrow. I have no influence on EC2. > > Xen grant tables sound like they could be the right interface for EC2. google search for "grant tables iommu" immediately gives me: http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html Maybe latest Xen is already doing the right thing, and it's just the question of making VFIO use that. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 14:39 ` Michael S. Tsirkin @ 2015-09-30 14:53 ` Avi Kivity 2015-09-30 15:21 ` Michael S. Tsirkin 0 siblings, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-09-30 14:53 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote: >> >> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote: >>> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote: >>>> On 09/30/15 15:03, Michael S. Tsirkin wrote: >>>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote: >>>>>> On 09/30/15 14:41, Michael S. Tsirkin wrote: >>>>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote: >>>>>>>> The whole idea is to bypass kernel. Especially for networking... >>>>>>> ... on dumb hardware that doesn't support doing that securely. >>>>>> On a very capable HW that supports whatever security requirements needed >>>>>> (e.g. 82599 Intel's SR-IOV VF devices). >>>>> Network card type is irrelevant as long as you do not have an IOMMU, >>>>> otherwise you would just use e.g. VFIO. >>>> Sorry, but I don't follow your logic here - Amazon EC2 environment is a >>>> example where there *is* iommu but it's not virtualized >>>> and thus VFIO is >>>> useless and there is an option to use directly assigned SR-IOV networking >>>> device there where using the kernel drivers impose a performance impact >>>> compared to user space UIO-based user space kernel bypass mode of usage. How >>>> is it irrelevant? Could u, pls, clarify your point? >>>> >>> So it's not even dumb hardware, it's another piece of software >>> that forces an "all or nothing" approach where either >>> device has access to all VM memory, or none. >>> And this, unfortunately, leaves you with no secure way to >>> allow userspace drivers. >> Some setups don't need security (they are single-user, single application). >> But do need a lot of performance (like 5X-10X performance). An example is >> OpenVSwitch, security doesn't help it at all and if you force it to use the >> kernel drivers you cripple it. > We'd have to see there are actual users that need this. So far, dpdk > seems like the only one, dpdk is a whole class if users. It's not a specific application. > and it wants to use UIO for slow path stuff > like polling link status. Why this needs kernel bypass support, I don't > know. I asked, and got no answer. First, it's more than link status. dpdk also has an interrupt mode, which applications can fall back to when when the load is light in order to save power (and in order not to get support calls about 100% cpu when idle). Even for link status, you don't want to poll for that, because accessing device registers is expensive. An interrupt is the best approach for rare events like link changed. > >> Also, I'm root. I can do anything I like, including loading a patched >> pci_uio_generic. You're not providing _any_ security, you're simply making >> life harder for users. > Maybe that's true on your system. But I guess you know that's not true > for everyone, not in 2015. Why is it not true? if I'm root, I can do anything I like to my system, and everyone is root in 2015. I can access the BARs directly and program DMA, how am I more secure by uio not allowing me to setup msix? Non-root users are already secured by their inability to load the module, and by the device permissions. 
> >>> So it makes even less sense to add insecure work-arounds in the kernel. >>> It seems quite likely that by the time the new kernel reaches >>> production X years from now, EC2 will have a virtual iommu. >> I can adopt a new kernel tomorrow. I have no influence on EC2. >> >> > Xen grant tables sound like they could be the right interface > for EC2. google search for "grant tables iommu" immediately gives me: > http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html > Maybe latest Xen is already doing the right thing, and it's just the > question of making VFIO use that. > grant tables only work for virtual devices, not physical devices. ^ permalink raw reply [flat|nested] 100+ messages in thread
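The interrupt mode Avi mentions rides on the standard UIO ABI: a blocking read() on /dev/uioN returns a 4-byte event count, and a write() re-arms the interrupt for drivers that implement irqcontrol (igb_uio does; whether a given uio_pci_generic version does is another matter). A minimal sketch of the fallback loop, with the device path as an example:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/uio0", O_RDWR);	/* example device node */
	uint32_t count, enable = 1;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (;;) {
		/* re-arm, then sleep until the next interrupt fires */
		write(fd, &enable, sizeof(enable));
		if (read(fd, &count, sizeof(count)) == sizeof(count))
			printf("interrupt #%u, resume polling\n", count);
	}
}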
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 14:53 ` Avi Kivity @ 2015-09-30 15:21 ` Michael S. Tsirkin 2015-09-30 15:36 ` Avi Kivity 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-09-30 15:21 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Wed, Sep 30, 2015 at 05:53:54PM +0300, Avi Kivity wrote: > On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote: > >On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote: > >> > >>On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote: > >>>On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote: > >>>>On 09/30/15 15:03, Michael S. Tsirkin wrote: > >>>>>On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote: > >>>>>>On 09/30/15 14:41, Michael S. Tsirkin wrote: > >>>>>>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote: > >>>>>>>>The whole idea is to bypass kernel. Especially for networking... > >>>>>>>... on dumb hardware that doesn't support doing that securely. > >>>>>>On a very capable HW that supports whatever security requirements needed > >>>>>>(e.g. 82599 Intel's SR-IOV VF devices). > >>>>>Network card type is irrelevant as long as you do not have an IOMMU, > >>>>>otherwise you would just use e.g. VFIO. > >>>>Sorry, but I don't follow your logic here - Amazon EC2 environment is a > >>>>example where there *is* iommu but it's not virtualized > >>>>and thus VFIO is > >>>>useless and there is an option to use directly assigned SR-IOV networking > >>>>device there where using the kernel drivers impose a performance impact > >>>>compared to user space UIO-based user space kernel bypass mode of usage. How > >>>>is it irrelevant? Could u, pls, clarify your point? > >>>> > >>>So it's not even dumb hardware, it's another piece of software > >>>that forces an "all or nothing" approach where either > >>>device has access to all VM memory, or none. > >>>And this, unfortunately, leaves you with no secure way to > >>>allow userspace drivers. > >>Some setups don't need security (they are single-user, single application). > >>But do need a lot of performance (like 5X-10X performance). An example is > >>OpenVSwitch, security doesn't help it at all and if you force it to use the > >>kernel drivers you cripple it. > >We'd have to see there are actual users that need this. So far, dpdk > >seems like the only one, > > dpdk is a whole class if users. It's not a specific application. > > > and it wants to use UIO for slow path stuff > >like polling link status. Why this needs kernel bypass support, I don't > >know. I asked, and got no answer. > > First, it's more than link status. dpdk also has an interrupt mode, which > applications can fall back to when when the load is light in order to save > power (and in order not to get support calls about 100% cpu when idle). Aha, looks like it appeared in June. Interesting, thanks for the info. > Even for link status, you don't want to poll for that, because accessing > device registers is expensive. An interrupt is the best approach for rare > events like link changed. Yea, but you probably can get by with a timer for that, even if it's ugly. > >>Also, I'm root. I can do anything I like, including loading a patched > >>pci_uio_generic. You're not providing _any_ security, you're simply making > >>life harder for users. > >Maybe that's true on your system. But I guess you know that's not true > >for everyone, not in 2015. > > Why is it not true? 
if I'm root, I can do anything I like to my > system, and everyone is root in 2015. I can access the BARs directly > and program DMA, how am I more secure by uio not allowing me to setup > msix? That's not the point. The point always was that using uio for these devices (capable of DMA, in particular of msix) isn't possible in a secure way. And yes, if same device happens to also do interrupts, UIO does not reject it as it probably should, and we can't change this without breaking some working setups. But this doesn't mean we should add more setups like this that we'll then be forced to maintain. > Non-root users are already secured by their inability to load the module, > and by the device permissions. > > > > >>>So it makes even less sense to add insecure work-arounds in the kernel. > >>>It seems quite likely that by the time the new kernel reaches > >>>production X years from now, EC2 will have a virtual iommu. > >>I can adopt a new kernel tomorrow. I have no influence on EC2. > >> > >> > >Xen grant tables sound like they could be the right interface > >for EC2. google search for "grant tables iommu" immediately gives me: > >http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html > >Maybe latest Xen is already doing the right thing, and it's just the > >question of making VFIO use that. > > > > grant tables only work for virtual devices, not physical devices. Why not? That's what the patches above seem to do. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 15:21 ` Michael S. Tsirkin @ 2015-09-30 15:36 ` Avi Kivity 2015-09-30 20:40 ` Michael S. Tsirkin 0 siblings, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-09-30 15:36 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 09/30/2015 06:21 PM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 05:53:54PM +0300, Avi Kivity wrote: >> On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote: >>> On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote: >>>> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote: >>>>> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote: >>>>>> On 09/30/15 15:03, Michael S. Tsirkin wrote: >>>>>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote: >>>>>>>> On 09/30/15 14:41, Michael S. Tsirkin wrote: >>>>>>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote: >>>>>>>>>> The whole idea is to bypass kernel. Especially for networking... >>>>>>>>> ... on dumb hardware that doesn't support doing that securely. >>>>>>>> On a very capable HW that supports whatever security requirements needed >>>>>>>> (e.g. 82599 Intel's SR-IOV VF devices). >>>>>>> Network card type is irrelevant as long as you do not have an IOMMU, >>>>>>> otherwise you would just use e.g. VFIO. >>>>>> Sorry, but I don't follow your logic here - Amazon EC2 environment is a >>>>>> example where there *is* iommu but it's not virtualized >>>>>> and thus VFIO is >>>>>> useless and there is an option to use directly assigned SR-IOV networking >>>>>> device there where using the kernel drivers impose a performance impact >>>>>> compared to user space UIO-based user space kernel bypass mode of usage. How >>>>>> is it irrelevant? Could u, pls, clarify your point? >>>>>> >>>>> So it's not even dumb hardware, it's another piece of software >>>>> that forces an "all or nothing" approach where either >>>>> device has access to all VM memory, or none. >>>>> And this, unfortunately, leaves you with no secure way to >>>>> allow userspace drivers. >>>> Some setups don't need security (they are single-user, single application). >>>> But do need a lot of performance (like 5X-10X performance). An example is >>>> OpenVSwitch, security doesn't help it at all and if you force it to use the >>>> kernel drivers you cripple it. >>> We'd have to see there are actual users that need this. So far, dpdk >>> seems like the only one, >> dpdk is a whole class if users. It's not a specific application. >> >>> and it wants to use UIO for slow path stuff >>> like polling link status. Why this needs kernel bypass support, I don't >>> know. I asked, and got no answer. >> First, it's more than link status. dpdk also has an interrupt mode, which >> applications can fall back to when when the load is light in order to save >> power (and in order not to get support calls about 100% cpu when idle). > Aha, looks like it appeared in June. Interesting, thanks for the info. > >> Even for link status, you don't want to poll for that, because accessing >> device registers is expensive. An interrupt is the best approach for rare >> events like link changed. > Yea, but you probably can get by with a timer for that, even if it's ugly. Maybe you can, but (a) why increase link status change detection latency (b) link status change detection is not the only user of the feature, since June. >>>> Also, I'm root. I can do anything I like, including loading a patched >>>> pci_uio_generic. 
You're not providing _any_ security, you're simply making >>>> life harder for users. >>> Maybe that's true on your system. But I guess you know that's not true >>> for everyone, not in 2015. >> Why is it not true? if I'm root, I can do anything I like to my >> system, and everyone is root in 2015. I can access the BARs directly >> and program DMA, how am I more secure by uio not allowing me to setup >> msix? > That's not the point. The point always was that using uio for these > devices (capable of DMA, in particular of msix) isn't possible in a > secure way. uio is used today for DMA-capable devices. Some users are perfectly willing to give up security for functionality (that's all users who have root access to their machines, not just uio users). You aren't adding any security by disallowing uio, you're just removing functionality. As it happens, you're removing the functionality from the users who have no other option. They can't use vfio because it doesn't work on virtualized setups. (note even on a setup that does support vfio, high performance users will want to avoid it). > And yes, if same device happens to also do interrupts, UIO > does not reject it as it probably should, and we can't change this > without breaking some working setups. But this doesn't mean we should > add more setups like this that we'll then be forced to maintain. pci_uio_generic is maybe the driver with the lowest maintenance burden in the entire kernel. One driver supporting all pci devices, if you don't need msi/msix. And with the patch, it will be one driver supporting all pci devices. I don't really understand the tradeoff. By rejecting the patch you're denying users the ability to use their devices, except through the much slower kernel drivers. The patch would not allow a non-root user to do ANYTHING. Root can already do anything. So what security issue is there? > > >> Non-root users are already secured by their inability to load the module, >> and by the device permissions. >> >>>>> So it makes even less sense to add insecure work-arounds in the kernel. >>>>> It seems quite likely that by the time the new kernel reaches >>>>> production X years from now, EC2 will have a virtual iommu. >>>> I can adopt a new kernel tomorrow. I have no influence on EC2. >>>> >>>> >>> Xen grant tables sound like they could be the right interface >>> for EC2. google search for "grant tables iommu" immediately gives me: >>> http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html >>> Maybe latest Xen is already doing the right thing, and it's just the >>> question of making VFIO use that. >>> >> grant tables only work for virtual devices, not physical devices. > Why not? That's what the patches above seem to do. > Oh, I think those are for emulating transient iommu maps (new map for every request) on top of a real iommu. The dpdk use case is permanently mapping a large chunk of guest userspace, I don't think Xen exposes enough grant table entries for that. In addition, that leaves users of kvm, vmware, older Xen, or bare metal machines without iommus out in the cold; and bare metal users that want the iommu off for performance are forced to use it. And for what, to prevent root from touching memory via dma that they can access in a million other ways? ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 15:36 ` Avi Kivity @ 2015-09-30 20:40 ` Michael S. Tsirkin 2015-09-30 21:00 ` Avi Kivity 2015-10-01 8:44 ` Michael S. Tsirkin 0 siblings, 2 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-09-30 20:40 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Wed, Sep 30, 2015 at 06:36:17PM +0300, Avi Kivity wrote: > As it happens, you're removing the functionality from the users who have no > other option. They can't use vfio because it doesn't work on virtualized > setups. ... > Root can already do anything. I think there's a contradiction between the two claims above. > So what security issue is there? A buggy userspace can and will corrupt kernel memory. ... > And for what, to prevent > root from touching memory via dma that they can access in a million other > ways? So one can be reasonably sure a kernel oops is not a result of a userspace bug. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 20:40 ` Michael S. Tsirkin @ 2015-09-30 21:00 ` Avi Kivity 2015-10-01 8:44 ` Michael S. Tsirkin 1 sibling, 0 replies; 100+ messages in thread From: Avi Kivity @ 2015-09-30 21:00 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 09/30/2015 11:40 PM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 06:36:17PM +0300, Avi Kivity wrote: >> As it happens, you're removing the functionality from the users who have no >> other option. They can't use vfio because it doesn't work on virtualized >> setups. > ... > >> Root can already do anything. > I think there's a contradiction between the two claims above. Yes, root can replace the current kernel with a patched kernel. In that sense, root can do anything, and the kernel is complete. Now let's stop playing word games. >> So what security issue is there? > A buggy userspace can and will corrupt kernel memory. > > ... > >> And for what, to prevent >> root from touching memory via dma that they can access in a million other >> ways? > So one can be reasonably sure a kernel oops is not a result of a > userspace bug. > That's not security. It's a legitimate concern though, one that is addressed by tainting the kernel. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 20:40 ` Michael S. Tsirkin 2015-09-30 21:00 ` Avi Kivity @ 2015-10-01 8:44 ` Michael S. Tsirkin 2015-10-01 8:46 ` Vlad Zolotarov ` (2 more replies) 1 sibling, 3 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 8:44 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote: > > And for what, to prevent > > root from touching memory via dma that they can access in a million other > > ways? > > So one can be reasonably sure a kernel oops is not a result of a > userspace bug. Actually, I thought about this overnight, and it should be possible to drive it securely from userspace, without hypervisor changes. See https://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com > -- > MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 8:44 ` Michael S. Tsirkin @ 2015-10-01 8:46 ` Vlad Zolotarov 2015-10-01 8:52 ` Avi Kivity 2015-10-01 9:16 ` Michael S. Tsirkin 2 siblings, 0 replies; 100+ messages in thread From: Vlad Zolotarov @ 2015-10-01 8:46 UTC (permalink / raw) To: Michael S. Tsirkin, Avi Kivity; +Cc: dev On 10/01/15 11:44, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote: >>> And for what, to prevent >>> root from touching memory via dma that they can access in a million other >>> ways? >> So one can be reasonably sure a kernel oops is not a result of a >> userspace bug. > Actually, I thought about this overnight, and it should be possible to > drive it securely from userspace, without hypervisor changes. > > See > > https://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com Looks like a dead link. > > > >> -- >> MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 8:44 ` Michael S. Tsirkin 2015-10-01 8:46 ` Vlad Zolotarov @ 2015-10-01 8:52 ` Avi Kivity 2015-10-01 9:15 ` Michael S. Tsirkin 2015-10-01 9:15 ` Avi Kivity 1 sibling, 2 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 8:52 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 11:44 AM, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote: >>> And for what, to prevent >>> root from touching memory via dma that they can access in a million other >>> ways? >> So one can be reasonably sure a kernel oops is not a result of a >> userspace bug. > Actually, I thought about this overnight, and it should be possible to > drive it securely from userspace, without hypervisor changes. Also without the performance that was the whole reason for doing it in userspace in the first place. I still don't understand your objection to the patch: > MSI messages are memory writes so any generic device capable > of MSI is capable of corrupting kernel memory. > This means that a bug in userspace will lead to kernel memory corruption > and crashes. This is something distributions can't support. If a distribution feels it can't support this configuration, it can disable the uio_pci_generic driver, or refuse to support tainted kernels. If it feels it can (and many distributions are starting to support dpdk), then you're just denying it the ability to serve its users. > See > > https://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com > > > ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 8:52 ` Avi Kivity @ 2015-10-01 9:15 ` Michael S. Tsirkin 2015-10-01 9:22 ` Avi Kivity 2015-10-01 9:15 ` Avi Kivity 1 sibling, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 9:15 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 11:52:26AM +0300, Avi Kivity wrote: > I still don't understand your objection to the patch: > > > MSI messages are memory writes so any generic device capable > of MSI is capable of corrupting kernel memory. > This means that a bug in userspace will lead to kernel memory corruption > and crashes. This is something distributions can't support. > > > If a distribution feels it can't support this configuration, it can disable the > uio_pci_generic driver, or refuse to support tainted kernels. If it feels it > can (and many distributions are starting to support dpdk), then you're just > denying it the ability to serve its users. I don't, and can't deny users anything. I merely think upstream should avoid putting this driver in-tree. By doing this, driver writers will be pushed to develop solutions that can't crash kernel. I pointed out one way to build it, there are sure to be more. As far as I could see, without this kind of motivation, people do not even want to try. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:15 ` Michael S. Tsirkin @ 2015-10-01 9:22 ` Avi Kivity 2015-10-01 9:42 ` Michael S. Tsirkin ` (2 more replies) 0 siblings, 3 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 9:22 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 12:15 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 11:52:26AM +0300, Avi Kivity wrote: >> I still don't understand your objection to the patch: >> >> >> MSI messages are memory writes so any generic device capable >> of MSI is capable of corrupting kernel memory. >> This means that a bug in userspace will lead to kernel memory corruption >> and crashes. This is something distributions can't support. >> >> >> If a distribution feels it can't support this configuration, it can disable the >> uio_pci_generic driver, or refuse to support tainted kernels. If it feels it >> can (and many distributions are starting to support dpdk), then you're just >> denying it the ability to serve its users. > I don't, and can't deny users anything. I merely think upstream should > avoid putting this driver in-tree. By doing this, driver writers will > be pushed to develop solutions that can't crash kernel. > > I pointed out one way to build it, there are sure to be more. And I pointed out that your solution is unworkable. It's easy to claim that a solution is around the corner, only no one was looking for it, but the reality is that kernel bypass has been a solution for years for high performance users, that it cannot be made safe without an iommu, and that iommus are not available everywhere; and even when they are some users prefer to avoid the performance penalty. > As far as I could see, without this kind of motivation, people do not > even want to try. You are mistaken. The problem is a lot harder than you think. People didn't go and write userspace drivers because they were lazy. They wrote them because there was no other way. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:22 ` Avi Kivity @ 2015-10-01 9:42 ` Michael S. Tsirkin 2015-10-01 9:53 ` Avi Kivity 2015-10-01 21:17 ` Alexander Duyck 2015-10-01 9:42 ` Vincent JARDIN 2015-10-01 9:55 ` Michael S. Tsirkin 2 siblings, 2 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 9:42 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: > even when they are some users > prefer to avoid the performance penalty. I don't think there's a measureable penalty from passing through the IOMMU, as long as mappings are mostly static (i.e. iommu=pt). I sure never saw any numbers that show such. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
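For reference, the mode Michael refers to is selected with kernel boot parameters; on an Intel host the usual combination is (example grub fragment):

    GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"

With iommu=pt the kernel's own drivers get an identity (pass-through) mapping, so host-driver DMA skips per-mapping IOMMU work, while devices assigned to userspace or guests still get isolated mappings.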
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:42 ` Michael S. Tsirkin @ 2015-10-01 9:53 ` Avi Kivity 2015-10-01 10:17 ` Michael S. Tsirkin 2015-10-01 21:17 ` Alexander Duyck 1 sibling, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-10-01 9:53 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 12:42 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: >> even when they are some users >> prefer to avoid the performance penalty. > I don't think there's a measureable penalty from passing through the > IOMMU, as long as mappings are mostly static (i.e. iommu=pt). I sure > never saw any numbers that show such. > Maybe not. But again, virtualized setups will not have a guest iommu and therefore can't use it; and those happen to be exactly the setups you're blocking. Non-virtualized setups have an iommu available, but they can also use pci_uio_generic without patching if they like. The virtualized setups have no other option; you're leaving them out in the cold. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:53 ` Avi Kivity @ 2015-10-01 10:17 ` Michael S. Tsirkin 2015-10-01 10:24 ` Avi Kivity 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 10:17 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote: > Non-virtualized setups have an iommu available, but they can also use > pci_uio_generic without patching if they like. Not with VFs, they can't. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:17 ` Michael S. Tsirkin @ 2015-10-01 10:24 ` Avi Kivity 2015-10-01 10:25 ` Avi Kivity 0 siblings, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-10-01 10:24 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 01:17 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote: >> Non-virtualized setups have an iommu available, but they can also use >> pci_uio_generic without patching if they like. > Not with VFs, they can't. > They can and they do (I use it myself). ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:24 ` Avi Kivity @ 2015-10-01 10:25 ` Avi Kivity 2015-10-01 10:44 ` Michael S. Tsirkin 0 siblings, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-10-01 10:25 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 01:24 PM, Avi Kivity wrote: > On 10/01/2015 01:17 PM, Michael S. Tsirkin wrote: >> On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote: >>> Non-virtualized setups have an iommu available, but they can also use >>> pci_uio_generic without patching if they like. >> Not with VFs, they can't. >> > > They can and they do (I use it myself). I mean with a PF. Why use a VF on a non-virtualized system? ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:25 ` Avi Kivity @ 2015-10-01 10:44 ` Michael S. Tsirkin 2015-10-01 10:55 ` Avi Kivity 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 10:44 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 01:25:17PM +0300, Avi Kivity wrote: > Why use a VF on a non-virtualized system? 1. So a userspace bug does not destroy your hardware (PFs generally assume trusted non-buggy drivers, VFs generally don't). 2. So you can use a PF or another VF for regular networking. 3. So you can manage this system, to some level. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:44 ` Michael S. Tsirkin @ 2015-10-01 10:55 ` Avi Kivity 0 siblings, 0 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 10:55 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 01:44 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 01:25:17PM +0300, Avi Kivity wrote: >> Why use a VF on a non-virtualized system? > 1. So a userspace bug does not destroy your hardware > (PFs generally assume trusted non-buggy drivers, VFs > generally don't). People who use dpdk trust their drivers (those drivers are the reason for the system to exist in the first place). > 2. So you can use a PF or another VF for regular networking. This is valid, but usually those systems have a separate management network. Unfortunately VFs limit the number of queues you can expose, making them less performant than PFs. The "bifurcated drivers" were meant as a way of enabling this functionality without resorting to VFs, but it seems they are stalled. > 3. So you can manage this system, to some level. > Again existing practice doesn't follow this. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:42 ` Michael S. Tsirkin 2015-10-01 9:53 ` Avi Kivity @ 2015-10-01 21:17 ` Alexander Duyck 2015-10-02 13:50 ` Michael S. Tsirkin 1 sibling, 1 reply; 100+ messages in thread From: Alexander Duyck @ 2015-10-01 21:17 UTC (permalink / raw) To: Michael S. Tsirkin, Avi Kivity; +Cc: dev On 10/01/2015 02:42 AM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: >> even when they are some users >> prefer to avoid the performance penalty. > I don't think there's a measureable penalty from passing through the > IOMMU, as long as mappings are mostly static (i.e. iommu=pt). I sure > never saw any numbers that show such. It depends on the IOMMU. I believe Intel had a performance penalty on all CPUs prior to Ivy Bridge. Since then things have improved to where they are comparable to bare metal. The graph on page 5 of https://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf shows the penalty clear as day. Pretty much anything before Ivy Bridge w/ small packets is slowed to a crawl with an IOMMU enabled. - Alex ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 21:17 ` Alexander Duyck @ 2015-10-02 13:50 ` Michael S. Tsirkin 0 siblings, 0 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-02 13:50 UTC (permalink / raw) To: Alexander Duyck; +Cc: dev, Avi Kivity On Thu, Oct 01, 2015 at 02:17:49PM -0700, Alexander Duyck wrote: > On 10/01/2015 02:42 AM, Michael S. Tsirkin wrote: > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: > >>even when they are some users > >>prefer to avoid the performance penalty. > >I don't think there's a measureable penalty from passing through the > >IOMMU, as long as mappings are mostly static (i.e. iommu=pt). I sure > >never saw any numbers that show such. > > It depends on the IOMMU. I believe Intel had a performance penalty on all > CPUs prior to Ivy Bridge. Since then things have improved to where they are > comparable to bare metal. > > The graph on page 5 of > https://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf > shows the penalty clear as day. Pretty much anything before Ivy Bridge w/ > small packets is slowed to a crawl with an IOMMU enabled. > > - Alex VMs are running with IOMMU enabled anyway. Avi here tells us no one uses SRIOV on bare metal so ... we don't need to argue about that. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:22 ` Avi Kivity 2015-10-01 9:42 ` Michael S. Tsirkin @ 2015-10-01 9:42 ` Vincent JARDIN 2015-10-01 9:43 ` Avi Kivity 2015-10-01 14:54 ` Stephen Hemminger 2015-10-01 9:55 ` Michael S. Tsirkin 2 siblings, 2 replies; 100+ messages in thread From: Vincent JARDIN @ 2015-10-01 9:42 UTC (permalink / raw) To: Avi Kivity, Michael S. Tsirkin; +Cc: dev On 01/10/2015 11:22, Avi Kivity wrote: >> As far as I could see, without this kind of motivation, people do not >> even want to try. > > You are mistaken. The problem is a lot harder than you think. > > People didn't go and write userspace drivers because they were lazy. > They wrote them because there was no other way. I disagree, it is possible to write a 'partial' userspace driver. Here it is an example: http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4 It benefits of the kernel's capabilities while the userland manages only the IOs. There were some tentative to get it for other (older) drivers, named 'bifurcated drivers', but it is stalled. best regards, Vincent ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:42 ` Vincent JARDIN @ 2015-10-01 9:43 ` Avi Kivity 2015-10-01 9:48 ` Vincent JARDIN 2015-10-01 10:14 ` Michael S. Tsirkin 2015-10-01 14:54 ` Stephen Hemminger 1 sibling, 2 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 9:43 UTC (permalink / raw) To: Vincent JARDIN, Michael S. Tsirkin; +Cc: dev On 10/01/2015 12:42 PM, Vincent JARDIN wrote: > On 01/10/2015 11:22, Avi Kivity wrote: >>> As far as I could see, without this kind of motivation, people do not >>> even want to try. >> >> You are mistaken. The problem is a lot harder than you think. >> >> People didn't go and write userspace drivers because they were lazy. >> They wrote them because there was no other way. > > I disagree, it is possible to write a 'partial' userspace driver. > > Here it is an example: > http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4 > > It benefits of the kernel's capabilities while the userland manages > only the IOs. > That is because the device itself contains an iommu. > There were some tentative to get it for other (older) drivers, named > 'bifurcated drivers', but it is stalled. IIRC they still exposed the ring to userspace. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:43 ` Avi Kivity @ 2015-10-01 9:48 ` Vincent JARDIN 2015-10-01 9:54 ` Avi Kivity 2015-10-01 10:14 ` Michael S. Tsirkin 1 sibling, 1 reply; 100+ messages in thread From: Vincent JARDIN @ 2015-10-01 9:48 UTC (permalink / raw) To: Avi Kivity, Michael S. Tsirkin; +Cc: dev On 01/10/2015 11:43, Avi Kivity wrote: > > That is because the device itself contains an iommu. Yes. It could be an option: - we could flag the Linux system unsafe when the device does not have any IOMMU - we flag the Linux system safe when the device has an IOMMU ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:48 ` Vincent JARDIN @ 2015-10-01 9:54 ` Avi Kivity 0 siblings, 0 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 9:54 UTC (permalink / raw) To: Vincent JARDIN, Michael S. Tsirkin; +Cc: dev On 10/01/2015 12:48 PM, Vincent JARDIN wrote: > On 01/10/2015 11:43, Avi Kivity wrote: >> >> That is because the device itself contains an iommu. > > Yes. > > It could be an option: > - we could flag the Linux system unsafe when the device does not > have any IOMMU > - we flag the Linux system safe when the device has an IOMMU This already exists: it's called the tainted flag. I don't know if uio_pci_generic already taints the kernel; it certainly should with DMA-capable devices. ^ permalink raw reply [flat|nested] 100+ messages in thread
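What Avi suggests would look something like this in the driver (a sketch of an assumed behavior, not necessarily what upstream uio_pci_generic does): taint the kernel when a DMA-capable device is bound, so a later oops is visibly attributable to a userspace driver rather than to the kernel.

#include <linux/kernel.h>
#include <linux/pci.h>

static void sketch_taint_on_bind(struct pci_dev *pdev)
{
	dev_warn(&pdev->dev,
		 "DMA-capable device bound to a userspace driver; tainting kernel\n");
	add_taint(TAINT_USER, LOCKDEP_STILL_OK);	/* shows up in oops reports */
}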
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:43 ` Avi Kivity 2015-10-01 9:48 ` Vincent JARDIN @ 2015-10-01 10:14 ` Michael S. Tsirkin 2015-10-01 10:23 ` Avi Kivity 2015-10-01 14:55 ` Stephen Hemminger 1 sibling, 2 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 10:14 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote: > >There were some tentative to get it for other (older) drivers, named > >'bifurcated drivers', but it is stalled. > > IIRC they still exposed the ring to userspace. How much would a ring write syscall cost? 1-2 microseconds, isn't it? Measureable, but it's not the end of the world. ring read might be safe to allow. Should buy us enough time until hypervisors support IOMMU. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:14 ` Michael S. Tsirkin @ 2015-10-01 10:23 ` Avi Kivity 0 siblings, 0 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 10:23 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 01:14 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote: >>> There were some tentative to get it for other (older) drivers, named >>> 'bifurcated drivers', but it is stalled. >> IIRC they still exposed the ring to userspace. > How much would a ring write syscall cost? 1-2 microseconds, isn't it? > Measureable, but it's not the end of the world. Plus a page table walk per packet fragment (dpdk has the physical address prepared in the mbuf IIRC). The 10Mpps+ users of dpdk should comment on whether the performance hit is acceptable (my use case is much more modest). > ring read might be safe to allow. > Should buy us enough time until hypervisors support IOMMU. All the relevant drivers would need to be converted to support ring translation and to expose the ring to userspace in the new API. It shouldn't take more than 3-4 years. Meanwhile, users of virtualized systems that need interrupt support cannot use their machines, while non-virtualized users are free to DMA wherever they like, in the name of security. btw, an API like you describe already exists -- vhost. Of course the virtio protocol is nowhere near fast enough, but at least it's an example. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:14 ` Michael S. Tsirkin 2015-10-01 10:23 ` Avi Kivity @ 2015-10-01 14:55 ` Stephen Hemminger 2015-10-01 15:49 ` Michael S. Tsirkin 1 sibling, 1 reply; 100+ messages in thread From: Stephen Hemminger @ 2015-10-01 14:55 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev, Avi Kivity On Thu, 1 Oct 2015 13:14:08 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote: > > >There were some tentative to get it for other (older) drivers, named > > >'bifurcated drivers', but it is stalled. > > > > IIRC they still exposed the ring to userspace. > > How much would a ring write syscall cost? 1-2 microseconds, isn't it? The per-packet budget at 10G is 62ns, a syscall just doesn't cut it. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 14:55 ` Stephen Hemminger @ 2015-10-01 15:49 ` Michael S. Tsirkin 0 siblings, 0 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 15:49 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev, Avi Kivity On Thu, Oct 01, 2015 at 07:55:20AM -0700, Stephen Hemminger wrote: > On Thu, 1 Oct 2015 13:14:08 +0300 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote: > > > >There were some tentative to get it for other (older) drivers, named > > > >'bifurcated drivers', but it is stalled. > > > > > > IIRC they still exposed the ring to userspace. > > > > How much would a ring write syscall cost? 1-2 microseconds, isn't it? > > The per-packet budget at 10G is 62ns, a syscall just doesn't cut it. If you give up on privacy and only insist on security (userspace can read all kernel memory but cannot corrupt it), then you only need the syscall to re-arm RX descriptors, and these can be batched aggressively without impacting latency. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
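To illustrate the amortization Michael has in mind: at roughly 0.2 us per system call, re-arming 128 RX buffers per call costs about 1.6 ns per packet. The interface below is purely hypothetical, invented for illustration; no such ioctl exists in this thread or upstream.

#include <stdint.h>
#include <sys/ioctl.h>

#define RX_REARM_BATCH 128
#define UIO_IOC_REARM_RX 0	/* hypothetical ioctl number, illustration only */

struct rearm_batch {
	uint32_t count;
	uint64_t dma_addr[RX_REARM_BATCH];	/* buffers handed back to the NIC */
};

/* one syscall per batch; the kernel would validate each address and
 * write the RX descriptors itself, so userspace never touches the ring */
static int rearm(int uio_fd, struct rearm_batch *b)
{
	int rc = ioctl(uio_fd, UIO_IOC_REARM_RX, b);

	b->count = 0;
	return rc;
}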
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:42 ` Vincent JARDIN 2015-10-01 9:43 ` Avi Kivity @ 2015-10-01 14:54 ` Stephen Hemminger 1 sibling, 0 replies; 100+ messages in thread From: Stephen Hemminger @ 2015-10-01 14:54 UTC (permalink / raw) To: Vincent JARDIN; +Cc: dev, Avi Kivity, Michael S. Tsirkin On Thu, 01 Oct 2015 11:42:23 +0200 Vincent JARDIN <vincent.jardin@6wind.com> wrote: > On 01/10/2015 11:22, Avi Kivity wrote: > >> As far as I could see, without this kind of motivation, people do not > >> even want to try. > > > > You are mistaken. The problem is a lot harder than you think. > > > > People didn't go and write userspace drivers because they were lazy. > > They wrote them because there was no other way. > > I disagree, it is possible to write a 'partial' userspace driver. > > Here it is an example: > http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4 > > It benefits of the kernel's capabilities while the userland manages only > the IOs. > > There were some tentative to get it for other (older) drivers, named > 'bifurcated drivers', but it is stalled. > And in our testing the mlx4 driver performance is terrible. That may be because of the overhead of the InfiniBand library. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:22 ` Avi Kivity 2015-10-01 9:42 ` Michael S. Tsirkin 2015-10-01 9:42 ` Vincent JARDIN @ 2015-10-01 9:55 ` Michael S. Tsirkin 2015-10-01 9:59 ` Avi Kivity 2 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 9:55 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: > It's easy to claim that > a solution is around the corner, only no one was looking for it, but the > reality is that kernel bypass has been a solution for years for high > performance users, I never said that it's trivial. It's probably a lot of work. It's definitely more work than just abusing sysfs. But it looks like a write system call into an eventfd is about 1.5 microseconds on my laptop. Even with a system call per packet, system call overhead is not what makes DPDK drivers outperform Linux ones. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:55 ` Michael S. Tsirkin @ 2015-10-01 9:59 ` Avi Kivity 2015-10-01 10:38 ` Michael S. Tsirkin 0 siblings, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-10-01 9:59 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: >> It's easy to claim that >> a solution is around the corner, only no one was looking for it, but the >> reality is that kernel bypass has been a solution for years for high >> performance users, > I never said that it's trivial. > > It's probably a lot of work. It's definitely more work than just abusing > sysfs. > > But it looks like a write system call into an eventfd is about 1.5 > microseconds on my laptop. Even with a system call per packet, system > call overhead is not what makes DPDK drivers outperform Linux ones. > 1.5 us = 0.6 Mpps per core limit. dpdk performance is in the tens of millions of packets per system. It's not just the lack of system calls, of course, the architecture is completely different. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:59 ` Avi Kivity @ 2015-10-01 10:38 ` Michael S. Tsirkin 2015-10-01 10:50 ` Avi Kivity 2015-10-01 11:08 ` Bruce Richardson 0 siblings, 2 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 10:38 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote: > > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote: > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: > >>It's easy to claim that > >>a solution is around the corner, only no one was looking for it, but the > >>reality is that kernel bypass has been a solution for years for high > >>performance users, > >I never said that it's trivial. > > > >It's probably a lot of work. It's definitely more work than just abusing > >sysfs. > > > >But it looks like a write system call into an eventfd is about 1.5 > >microseconds on my laptop. Even with a system call per packet, system > >call overhead is not what makes DPDK drivers outperform Linux ones. > > > > 1.5 us = 0.6 Mpps per core limit. Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps. But for RX, you can batch a lot of packets. You can see by now I'm not that good at benchmarking. Here's what I wrote: #include <stdbool.h> #include <sys/eventfd.h> #include <inttypes.h> #include <unistd.h> int main(int argc, char **argv) { int e = eventfd(0, 0); uint64_t v = 1; int i; for (i = 0; i < 10000000; ++i) { write(e, &v, sizeof v); } } This takes 1.5 seconds to run on my laptop: $ time ./a.out real 0m1.507s user 0m0.179s sys 0m1.328s > dpdk performance is in the tens of > millions of packets per system. I think that's with a bunch of batching though. > It's not just the lack of system calls, of course, the architecture is > completely different. Absolutely - I'm not saying move all of DPDK into kernel. We just need to protect the RX rings so hardware does not corrupt kernel memory. Thinking about it some more, many devices have separate rings for DMA: TX (device reads memory) and RX (device writes memory). With such devices, a mode where userspace can write TX ring but not RX ring might make sense. This will mean userspace might read kernel memory through the device, but can not corrupt it. That's already a big win! And RX buffers do not have to be added one at a time. If we assume 0.2usec per system call, batching some 100 buffers per system call gives you 2 nano seconds overhead. That seems quite reasonable. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
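To make the TX-writable/RX-protected mode Michael describes concrete: a minimal userspace sketch, assuming a hypothetical UIO-style device that exposes the TX ring as map 0 and the RX ring as map 1 (the device node, map layout, and ring size are invented for illustration; the real protection would come from the kernel side refusing PROT_WRITE on the RX map):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define RING_SZ 4096    /* hypothetical: one page per ring */

    int main(void)
    {
            int fd = open("/dev/uio0", O_RDWR);  /* hypothetical device node */
            if (fd < 0)
                    return 1;

            /* TX ring: the device only reads it, so userspace may write it.
             * UIO convention: mmap offset N * pagesize selects map N. */
            void *tx = mmap(NULL, RING_SZ, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0 * getpagesize());

            /* RX ring: the device writes it; map it read-only so userspace
             * can scan completed descriptors but cannot rearm addresses. */
            void *rx = mmap(NULL, RING_SZ, PROT_READ,
                            MAP_SHARED, fd, 1 * getpagesize());

            (void)tx; (void)rx;  /* descriptor processing would go here */
            return 0;
    }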
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:38 ` Michael S. Tsirkin @ 2015-10-01 10:50 ` Avi Kivity 2015-10-01 11:09 ` Michael S. Tsirkin 2015-10-01 11:08 ` Bruce Richardson 1 sibling, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-10-01 10:50 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 01:38 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote: >> >> On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote: >>> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: >>>> It's easy to claim that >>>> a solution is around the corner, only no one was looking for it, but the >>>> reality is that kernel bypass has been a solution for years for high >>>> performance users, >>> I never said that it's trivial. >>> >>> It's probably a lot of work. It's definitely more work than just abusing >>> sysfs. >>> >>> But it looks like a write system call into an eventfd is about 1.5 >>> microseconds on my laptop. Even with a system call per packet, system >>> call overhead is not what makes DPDK drivers outperform Linux ones. >>> >> 1.5 us = 0.6 Mpps per core limit. > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps. You also trimmed the extra work that needs to be done, that I mentioned. Maybe your ring proxy can work, maybe it can't. In any case it's a hefty chunk of work. Should this work block users from using their VFs, if they happen to need interrupt support? > But for RX, you can batch a lot of packets. > > You can see by now I'm not that good at benchmarking. > Here's what I wrote: > > > #include <stdbool.h> > #include <sys/eventfd.h> > #include <inttypes.h> > #include <unistd.h> > > > int main(int argc, char **argv) > { > int e = eventfd(0, 0); > uint64_t v = 1; > > int i; > > for (i = 0; i < 10000000; ++i) { > write(e, &v, sizeof v); > } > } > > > This takes 1.5 seconds to run on my laptop: > > $ time ./a.out > > real 0m1.507s > user 0m0.179s > sys 0m1.328s > > >> dpdk performance is in the tens of >> millions of packets per system. > I think that's with a bunch of batching though. Yes, it's also with their application code running as well. They didn't reach this kind of performance by spending cycles unnecessarily. I'm not saying that the ring proxy is not workable; just that we don't know whether it is or not, and I don't think that a patch that enables _existing functionality_ for VFs should be blocked in favor of a new and unproven approach. > >> It's not just the lack of system calls, of course, the architecture is >> completely different. > Absolutely - I'm not saying move all of DPDK into kernel. > We just need to protect the RX rings so hardware does > not corrupt kernel memory. > > > Thinking about it some more, many devices > have separate rings for DMA: TX (device reads memory) > and RX (device writes memory). > With such devices, a mode where userspace can write TX ring > but not RX ring might make sense. I'm sure you can cause havoc just by reading, if you read from I/O memory. > > This will mean userspace might read kernel memory > through the device, but can not corrupt it. > > That's already a big win! > > And RX buffers do not have to be added one at a time. > If we assume 0.2usec per system call, batching some 100 buffers per > system call gives you 2 nano seconds overhead. That seems quite > reasonable. You're ignoring the page table walk and other per-descriptor processing. Again^2, maybe this can work. 
But it shouldn't block a patch enabling interrupt support of VFs. After the ring proxy is available and proven for a few years, we can deprecate bus mastering from uio, and after a few more years remove it. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:50 ` Avi Kivity @ 2015-10-01 11:09 ` Michael S. Tsirkin 2015-10-01 11:20 ` Avi Kivity 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 11:09 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote: > > > >>It's not just the lack of system calls, of course, the architecture is > >>completely different. > >Absolutely - I'm not saying move all of DPDK into kernel. > >We just need to protect the RX rings so hardware does > >not corrupt kernel memory. > > > > > >Thinking about it some more, many devices > >have separate rings for DMA: TX (device reads memory) > >and RX (device writes memory). > >With such devices, a mode where userspace can write TX ring > >but not RX ring might make sense. > > I'm sure you can cause havoc just by reading, if you read from I/O memory. Not talking about I/O memory here. These are device rings in RAM. > > > >This will mean userspace might read kernel memory > >through the device, but can not corrupt it. > > > >That's already a big win! > > > >And RX buffers do not have to be added one at a time. > >If we assume 0.2usec per system call, batching some 100 buffers per > >system call gives you 2 nano seconds overhead. That seems quite > >reasonable. > > You're ignoring the page table walk Some caching strategy might work here. > and other per-descriptor processing. You probably can let userspace pre-format it all, just validate addresses. > Again^2, maybe this can work. But it shouldn't block a patch enabling > interrupt support of VFs. After the ring proxy is available and proven for > a few years, we can deprecate bus mastering from uio, and after a few more > years remove it. We are talking about DPDK patches posted in June 2015. It's not some software proven for years. If Linux keeps enabling hacks, no one will bother doing the right thing. Upstream inclusion is the only carrot Linux has to make people do the right thing. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 11:09 ` Michael S. Tsirkin @ 2015-10-01 11:20 ` Avi Kivity 2015-10-01 11:27 ` Michael S. Tsirkin 2015-10-01 11:31 ` Michael S. Tsirkin 0 siblings, 2 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 11:20 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote: >>>> It's not just the lack of system calls, of course, the architecture is >>>> completely different. >>> Absolutely - I'm not saying move all of DPDK into kernel. >>> We just need to protect the RX rings so hardware does >>> not corrupt kernel memory. >>> >>> >>> Thinking about it some more, many devices >>> have separate rings for DMA: TX (device reads memory) >>> and RX (device writes memory). >>> With such devices, a mode where userspace can write TX ring >>> but not RX ring might make sense. >> I'm sure you can cause havoc just by reading, if you read from I/O memory. > Not talking about I/O memory here. These are device rings in RAM. Right. But you program them with DMA addresses, so the device can read another device's memory. >>> This will mean userspace might read kernel memory >>> through the device, but can not corrupt it. >>> >>> That's already a big win! >>> >>> And RX buffers do not have to be added one at a time. >>> If we assume 0.2usec per system call, batching some 100 buffers per >>> system call gives you 2 nano seconds overhead. That seems quite >>> reasonable. >> You're ignoring the page table walk > Some caching strategy might work here. It may, or it may not. I'm not against this. I'm against blocking users' access to their hardware, using an existing, established interface, for a small subset of setups. It doesn't help you in any way (you can still get reports of oopses due to buggy userspace drivers on physical machines, or on virtual machines that don't require interrupts), and it harms them. >> and other per-descriptor processing. > You probably can let userspace pre-format it all, > just validate addresses. You have to figure out if the descriptor contains an address or not (many devices have several descriptor formats, some with addresses and some without, which are intermixed). You also have to parse the descriptor size and see if it crosses a page boundary or not. > >> Again^2, maybe this can work. But it shouldn't block a patch enabling >> interrupt support of VFs. After the ring proxy is available and proven for >> a few years, we can deprecate bus mastering from uio, and after a few more >> years remove it. > We are talking about DPDK patches posted in June 2015. It's not some > software proven for years. dpdk has been used for years; it just won't work on VFs if you need interrupt support. > If Linux keeps enabling hacks, no one will > bother doing the right thing. Upstream inclusion is the only carrot > Linux has to make people do the right thing. It's not a carrot, it's a stick. Implementing your scheme will take a huge effort, is not guaranteed to provide the performance needed, and will not be available for years. Meanwhile, exactly the same thing on physical machines is supported. People will just use out of tree drivers (dpdk has several already). It's a pain, but nowhere near what you are proposing. ^ permalink raw reply [flat|nested] 100+ messages in thread
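To make the per-descriptor validation cost concrete: even the simple checks Avi lists - is there an address in this descriptor, and does the buffer cross a page boundary - look something like the following sketch (real descriptor layouts are per-NIC; the allowed-page callback is an assumed interface):

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u

    /* Does a DMA buffer [addr, addr+len) cross a page boundary?
     * If it does, each page it touches must be validated separately. */
    static bool crosses_page(uint64_t addr, uint32_t len)
    {
            return (addr & (PAGE_SIZE - 1)) + len > PAGE_SIZE;
    }

    /* Sketch of per-descriptor validation: check every page the buffer
     * touches against an allowed-region table supplied by the caller. */
    static bool validate_buffer(uint64_t addr, uint32_t len,
                                bool (*page_allowed)(uint64_t pfn))
    {
            uint64_t first = addr / PAGE_SIZE;
            uint64_t last = (addr + len - 1) / PAGE_SIZE;

            for (uint64_t pfn = first; pfn <= last; pfn++)
                    if (!page_allowed(pfn))
                            return false;
            return true;
    }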
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 11:20 ` Avi Kivity @ 2015-10-01 11:27 ` Michael S. Tsirkin 2015-10-01 11:32 ` Avi Kivity 2015-10-01 11:31 ` Michael S. Tsirkin 1 sibling, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 11:27 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote: > People will just use out of tree drivers (dpdk has several already). It's a > pain, but nowhere near what you are proposing. What's the issue with that? We already agreed this kernel is going to be tainted, and unsupportable. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 11:27 ` Michael S. Tsirkin @ 2015-10-01 11:32 ` Avi Kivity 2015-10-01 15:01 ` Stephen Hemminger 2015-10-01 15:11 ` Michael S. Tsirkin 0 siblings, 2 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 11:32 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote: >> People will just use out of tree drivers (dpdk has several already). It's a >> pain, but nowhere near what you are proposing. > What's the issue with that? Out of tree drivers have to be compiled on the target system (cannot ship a binary package), and occasionally break. dkms helps with that, as do distributions that promise binary compatibility, but it is still a pain, compared to just shipping a userspace package. > We already agreed this kernel > is going to be tainted, and unsupportable. Yes. So your only motivation in rejecting the patch is to get the author to write the ring translation patch and port it to all relevant drivers instead? ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 11:32 ` Avi Kivity @ 2015-10-01 15:01 ` Stephen Hemminger 2015-10-01 15:08 ` Avi Kivity 2015-10-01 15:46 ` Michael S. Tsirkin 0 siblings, 2 replies; 100+ messages in thread From: Stephen Hemminger @ 2015-10-01 15:01 UTC (permalink / raw) To: Avi Kivity; +Cc: dev, Michael S. Tsirkin On Thu, 1 Oct 2015 14:32:19 +0300 Avi Kivity <avi@scylladb.com> wrote: > On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote: > > On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote: > >> People will just use out of tree drivers (dpdk has several already). It's a > >> pain, but nowhere near what you are proposing. > > What's the issue with that? > > Out of tree drivers have to be compiled on the target system (cannot > ship a binary package), and occasionally break. > > dkms helps with that, as do distributions that promise binary > compatibility, but it is still a pain, compared to just shipping a > userspace package. > > > We already agreed this kernel > > is going to be tainted, and unsupportable. > > Yes. So your only motivation in rejecting the patch is to get the > author to write the ring translation patch and port it to all relevant > drivers instead? The per-driver ring method is what netmap did. The problem with that is that it forces infrastructure into already complex network drivers. It was never accepted. There were also still security issues, like time-of-check/time-of-use with the ring. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 15:01 ` Stephen Hemminger @ 2015-10-01 15:08 ` Avi Kivity 2015-10-01 15:46 ` Michael S. Tsirkin 1 sibling, 0 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 15:08 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev, Michael S. Tsirkin On 10/01/2015 06:01 PM, Stephen Hemminger wrote: > On Thu, 1 Oct 2015 14:32:19 +0300 > Avi Kivity <avi@scylladb.com> wrote: > >> On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote: >>> On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote: >>>> People will just use out of tree drivers (dpdk has several already). It's a >>>> pain, but nowhere near what you are proposing. >>> What's the issue with that? >> Out of tree drivers have to be compiled on the target system (cannot >> ship a binary package), and occasionally break. >> >> dkms helps with that, as do distributions that promise binary >> compatibility, but it is still a pain, compared to just shipping a >> userspace package. >> >>> We already agreed this kernel >>> is going to be tainted, and unsupportable. >> Yes. So your only motivation in rejecting the patch is to get the >> author to write the ring translation patch and port it to all relevant >> drivers instead? > The per-driver ring method is what netmap did. > The problem with that is that it forces infrastructure into already > complex network driver. It never was accepted. There were also still > security issues like time of check/time of use with the ring. There would have to be two rings, with the driver picking up descriptors from the software ring, translating virtual addresses, and pushing them into the hardware ring. I'm not familiar enough with the truly high performance dpdk applications to estimate the performance impact. Seastar/scylla gets a huge benefit from dpdk, but is still nowhere near line rate. ^ permalink raw reply [flat|nested] 100+ messages in thread
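A sketch of the two-ring proxy Avi describes, with invented sw_desc/hw_desc formats; the real cost lies in doing this per descriptor, for every NIC's own format (the translate callback, which validates and converts a user address to a DMA address, is an assumed interface):

    #include <stddef.h>
    #include <stdint.h>

    struct sw_desc { uint64_t user_addr; uint32_t len; };   /* invented */
    struct hw_desc { uint64_t dma_addr;  uint32_t len; };   /* invented */

    /* Translate one batch from the software ring into the hardware ring.
     * translate() returns 0 for addresses outside the registered region. */
    static size_t proxy_ring(const struct sw_desc *sw, struct hw_desc *hw,
                             size_t n,
                             uint64_t (*translate)(uint64_t, uint32_t))
    {
            size_t out = 0;

            for (size_t i = 0; i < n; i++) {
                    uint64_t dma = translate(sw[i].user_addr, sw[i].len);
                    if (!dma)
                            continue;  /* drop descriptors that fail validation */
                    hw[out].dma_addr = dma;
                    hw[out].len = sw[i].len;
                    out++;
            }
            return out;
    }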
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 15:01 ` Stephen Hemminger 2015-10-01 15:08 ` Avi Kivity @ 2015-10-01 15:46 ` Michael S. Tsirkin 1 sibling, 0 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 15:46 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev, Avi Kivity On Thu, Oct 01, 2015 at 08:01:00AM -0700, Stephen Hemminger wrote: > The per-driver ring method is what netmap did. IIUC netmap has a standard format for descriptors, so it was slower: it still had to do all networking in the kernel (it only bypasses part of the networking stack), and it needed a thread to translate between software and hardware formats. > The problem with that is that it forces infrastructure into already > complex network driver. It never was accepted. There were also still > security issues like time of check/time of use with the ring. Right, because people do care about security. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 11:32 ` Avi Kivity 2015-10-01 15:01 ` Stephen Hemminger @ 2015-10-01 15:11 ` Michael S. Tsirkin 2015-10-01 15:19 ` Avi Kivity 1 sibling, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 15:11 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote: > > We already agreed this kernel > >is going to be tainted, and unsupportable. > > Yes. So your only motivation in rejecting the patch is to get the author to > write the ring translation patch and port it to all relevant drivers > instead? Not only that. To make sure users are aware they are doing insecure things when using software poking at device BARs in sysfs. To avoid giving virtualization a bad name for security. To get people to work on safe, maintainable solutions. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 15:11 ` Michael S. Tsirkin @ 2015-10-01 15:19 ` Avi Kivity 2015-10-01 15:40 ` Michael S. Tsirkin 0 siblings, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-10-01 15:19 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 06:11 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote: >>> We already agreed this kernel >>> is going to be tainted, and unsupportable. >> Yes. So your only motivation in rejecting the patch is to get the author to >> write the ring translation patch and port it to all relevant drivers >> instead? > Not only that. > > To make sure users are aware they are doing insecure > things when using software poking at device BARs in sysfs. I don't think you need to worry about that. People who program DMA are aware of the damage it can cause. If you want to be extra sure, have uio taint the kernel when bus mastering is enabled. > To avoid giving virtualization a bad name for security. There is no security issue here. Those VMs run a single application, and cannot attack the host or other VMs. > To get people to work on safe, maintainable solutions. That's a great goal, but I don't think it can be achieved without sacrificing performance, which is the only reason for dpdk's existence. If safe and maintainable were the only requirements, people would not bypass the kernel. The only thing you are really achieving by blocking this is causing pain. ^ permalink raw reply [flat|nested] 100+ messages in thread
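Avi's taint suggestion is cheap to implement; a rough sketch of what it could look like in a uio driver's probe path (the hook point - probe time versus intercepting the config-space write that actually sets the bus-master bit - is a design choice, and this fragment assumes kernel context):

    #include <linux/kernel.h>
    #include <linux/pci.h>

    /* Sketch: taint the kernel as soon as a device is bound to a
     * userspace driver that may program DMA through it. */
    static void uio_warn_and_taint(struct pci_dev *pdev)
    {
            dev_warn(&pdev->dev,
                     "userspace driver can program DMA; tainting kernel\n");
            add_taint(TAINT_USER, LOCKDEP_STILL_OK);
    }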
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 15:19 ` Avi Kivity @ 2015-10-01 15:40 ` Michael S. Tsirkin 0 siblings, 0 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 15:40 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 06:19:33PM +0300, Avi Kivity wrote: > On 10/01/2015 06:11 PM, Michael S. Tsirkin wrote: > >On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote: > >>> We already agreed this kernel > >>>is going to be tainted, and unsupportable. > >>Yes. So your only motivation in rejecting the patch is to get the author to > >>write the ring translation patch and port it to all relevant drivers > >>instead? > >Not only that. > > > >To make sure users are aware they are doing insecure > >things when using software poking at device BARs in sysfs. > > I don't think you need to worry about that. People who program DMA are > aware of the damage is can cause. People just install software and run it. They don't program DMA. And I notice that no software (ab)using this seems to come with documentation explaining the implications. > If you want to be extra sure, have uio > taint the kernel when bus mastering is enabled. People don't notice that the kernel is tainted. Denying module load will make them notice. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 11:20 ` Avi Kivity 2015-10-01 11:27 ` Michael S. Tsirkin @ 2015-10-01 11:31 ` Michael S. Tsirkin 2015-10-01 11:34 ` Avi Kivity 1 sibling, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 11:31 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote: > > > On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote: > >On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote: > >>>>It's not just the lack of system calls, of course, the architecture is > >>>>completely different. > >>>Absolutely - I'm not saying move all of DPDK into kernel. > >>>We just need to protect the RX rings so hardware does > >>>not corrupt kernel memory. > >>> > >>> > >>>Thinking about it some more, many devices > >>>have separate rings for DMA: TX (device reads memory) > >>>and RX (device writes memory). > >>>With such devices, a mode where userspace can write TX ring > >>>but not RX ring might make sense. > >>I'm sure you can cause havoc just by reading, if you read from I/O memory. > >Not talking about I/O memory here. These are device rings in RAM. > > Right. But you program them with DMA addresses, so the device can read > another device's memory. It can't if the host has limited it to DMA only into guest RAM, which is pretty common. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 11:31 ` Michael S. Tsirkin @ 2015-10-01 11:34 ` Avi Kivity 0 siblings, 0 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 11:34 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 02:31 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote: >> >> On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote: >>> On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote: >>>>>> It's not just the lack of system calls, of course, the architecture is >>>>>> completely different. >>>>> Absolutely - I'm not saying move all of DPDK into kernel. >>>>> We just need to protect the RX rings so hardware does >>>>> not corrupt kernel memory. >>>>> >>>>> >>>>> Thinking about it some more, many devices >>>>> have separate rings for DMA: TX (device reads memory) >>>>> and RX (device writes memory). >>>>> With such devices, a mode where userspace can write TX ring >>>>> but not RX ring might make sense. >>>> I'm sure you can cause havoc just by reading, if you read from I/O memory. >>> Not talking about I/O memory here. These are device rings in RAM. >> Right. But you program them with DMA addresses, so the device can read >> another device's memory. > It can't if host has limited it to only DMA into guest RAM, which is > pretty common. > Ok. So yes, the tx ring can be mapped R/W. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:38 ` Michael S. Tsirkin 2015-10-01 10:50 ` Avi Kivity @ 2015-10-01 11:08 ` Bruce Richardson 2015-10-01 11:23 ` Michael S. Tsirkin 1 sibling, 1 reply; 100+ messages in thread From: Bruce Richardson @ 2015-10-01 11:08 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev, Avi Kivity On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote: > > > > > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote: > > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: > > >>It's easy to claim that > > >>a solution is around the corner, only no one was looking for it, but the > > >>reality is that kernel bypass has been a solution for years for high > > >>performance users, > > >I never said that it's trivial. > > > > > >It's probably a lot of work. It's definitely more work than just abusing > > >sysfs. > > > > > >But it looks like a write system call into an eventfd is about 1.5 > > >microseconds on my laptop. Even with a system call per packet, system > > >call overhead is not what makes DPDK drivers outperform Linux ones. > > > > > > > 1.5 us = 0.6 Mpps per core limit. > > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps. > But for RX, you can batch a lot of packets. > > You can see by now I'm not that good at benchmarking. > Here's what I wrote: > > > #include <stdbool.h> > #include <sys/eventfd.h> > #include <inttypes.h> > #include <unistd.h> > > > int main(int argc, char **argv) > { > int e = eventfd(0, 0); > uint64_t v = 1; > > int i; > > for (i = 0; i < 10000000; ++i) { > write(e, &v, sizeof v); > } > } > > > This takes 1.5 seconds to run on my laptop: > > $ time ./a.out > > real 0m1.507s > user 0m0.179s > sys 0m1.328s > > > > dpdk performance is in the tens of > > millions of packets per system. > > I think that's with a bunch of batching though. > > > It's not just the lack of system calls, of course, the architecture is > > completely different. > > Absolutely - I'm not saying move all of DPDK into kernel. > We just need to protect the RX rings so hardware does > not corrupt kernel memory. > > > Thinking about it some more, many devices > have separate rings for DMA: TX (device reads memory) > and RX (device writes memory). > With such devices, a mode where userspace can write TX ring > but not RX ring might make sense. > > This will mean userspace might read kernel memory > through the device, but can not corrupt it. > > That's already a big win! > > And RX buffers do not have to be added one at a time. > If we assume 0.2usec per system call, batching some 100 buffers per > system call gives you 2 nano seconds overhead. That seems quite > reasonable. > Hi, just to jump in a bit on this. Batching of 100 packets is a very large batch, and will add to latency. The standard batch size in DPDK right now is 32, and even that may be too high for applications in certain domains. However, even with that 2ns of overhead calculation, I'd make a few additional points. * For DPDK, we are reasonably close to being able to do 40GB of IO - both RX and TX on a single thread. 10GB of IO doesn't really stress a core any more. For 40GB of small packet traffic, the packet arrival rate is 16.8ns, so even with a huge batch size of 100 packets, your system call overhead on RX is taking almost 12% of our processing time. For a batch size of 32 this overhead would rise to over 35% of our packet processing time. 
For 100G line rate, the packet arrival rate is just 6.7ns... * As well as this overhead from the system call itself, you are also omitting the overhead of scanning the RX descriptors. This in itself is going to use up a good proportion of the processing time, on top of the cycles we have to spend copying the descriptors from one ring in memory to another. Given that right now with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen cycles on modern cores, every additional cycle (fraction of a nanosecond) has an impact. Regards, /Bruce ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 11:08 ` Bruce Richardson @ 2015-10-01 11:23 ` Michael S. Tsirkin 2015-10-01 12:07 ` Bruce Richardson 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 11:23 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev, Avi Kivity On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote: > On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote: > > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote: > > > > > > > > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote: > > > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: > > > >>It's easy to claim that > > > >>a solution is around the corner, only no one was looking for it, but the > > > >>reality is that kernel bypass has been a solution for years for high > > > >>performance users, > > > >I never said that it's trivial. > > > > > > > >It's probably a lot of work. It's definitely more work than just abusing > > > >sysfs. > > > > > > > >But it looks like a write system call into an eventfd is about 1.5 > > > >microseconds on my laptop. Even with a system call per packet, system > > > >call overhead is not what makes DPDK drivers outperform Linux ones. > > > > > > > > > > 1.5 us = 0.6 Mpps per core limit. > > > > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps. > > But for RX, you can batch a lot of packets. > > > > You can see by now I'm not that good at benchmarking. > > Here's what I wrote: > > > > > > #include <stdbool.h> > > #include <sys/eventfd.h> > > #include <inttypes.h> > > #include <unistd.h> > > > > > > int main(int argc, char **argv) > > { > > int e = eventfd(0, 0); > > uint64_t v = 1; > > > > int i; > > > > for (i = 0; i < 10000000; ++i) { > > write(e, &v, sizeof v); > > } > > } > > > > > > This takes 1.5 seconds to run on my laptop: > > > > $ time ./a.out > > > > real 0m1.507s > > user 0m0.179s > > sys 0m1.328s > > > > > > > dpdk performance is in the tens of > > > millions of packets per system. > > > > I think that's with a bunch of batching though. > > > > > It's not just the lack of system calls, of course, the architecture is > > > completely different. > > > > Absolutely - I'm not saying move all of DPDK into kernel. > > We just need to protect the RX rings so hardware does > > not corrupt kernel memory. > > > > > > Thinking about it some more, many devices > > have separate rings for DMA: TX (device reads memory) > > and RX (device writes memory). > > With such devices, a mode where userspace can write TX ring > > but not RX ring might make sense. > > > > This will mean userspace might read kernel memory > > through the device, but can not corrupt it. > > > > That's already a big win! > > > > And RX buffers do not have to be added one at a time. > > If we assume 0.2usec per system call, batching some 100 buffers per > > system call gives you 2 nano seconds overhead. That seems quite > > reasonable. > > > Hi, > > just to jump in a bit on this. > > Batching of 100 packets is a very large batch, and will add to latency. This is not on transmit or receive path! This is only for re-adding buffers to the receive ring. This batching should not add latency at all: process rx: get packet packets[n] = alloc packet if (++n > 100) { system call: add bufs(packets, n); } > The > standard batch size in DPDK right now is 32, and even that may be too high for > applications in certain domains. 
> > However, even with that 2ns of overhead calculation, I'd make a few additional > points. > * For DPDK, we are reasonably close to being able to do 40GB of IO - both RX > and TX on a single thread. 10GB of IO doesn't really stress a core any more. For > 40GB of small packet traffic, the packet arrival rate is 16.8ns, so even with a > huge batch size of 100 packets, your system call overhead on RX is taking almost > 12% of our processing time. For a batch size of 32 this overhead would rise to > over 35% of our packet processing time. As I said, yes, measurable, but not breaking the bank, and that's with 40GB, which is still not widespread. With 10GB and 100 packets, only 3% overhead. > For 100G line rate, the packet arrival > rate is just 6.7ns... Hypervisors still have time to get their act together and support IOMMUs by the time 100G systems become widespread. > * As well as this overhead from the system call itself, you are also omitting > the overhead of scanning the RX descriptors. I omit it because scanning descriptors can still be done in userspace, just write-protect the RX ring page. > This in itself is going to use up > a good proportion of the processing time, as well as that we have to spend cycles > copying the descriptors from one ring in memory to another. Given that right now > with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen > cycles on modern cores, every additional cycle (fraction of a nanosecond) has > an impact. > > Regards, > /Bruce See above. There is no need for that on the data path. Only re-adding buffers requires a system call. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
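Michael's rearm loop written out in C; note the counter reset after the batched call, which the pseudocode above leaves implicit (alloc_packet(), get_packet(), process(), and the add_bufs() system call wrapper are all assumed interfaces):

    #include <stddef.h>

    #define BATCH 100

    extern void *alloc_packet(void);               /* assumed allocator */
    extern void *get_packet(void);                 /* assumed RX path */
    extern void process(void *pkt);                /* assumed application work */
    extern int add_bufs(void **bufs, size_t n);    /* hypothetical syscall wrapper */

    static void rx_loop(void)
    {
            void *replenish[BATCH];
            size_t n = 0;

            for (;;) {
                    void *pkt = get_packet();
                    process(pkt);

                    replenish[n++] = alloc_packet();
                    if (n == BATCH) {
                            /* one system call re-arms 100 RX buffers */
                            add_bufs(replenish, n);
                            n = 0;  /* reset the batch counter */
                    }
            }
    }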
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 11:23 ` Michael S. Tsirkin @ 2015-10-01 12:07 ` Bruce Richardson 2015-10-01 13:14 ` Michael S. Tsirkin 0 siblings, 1 reply; 100+ messages in thread From: Bruce Richardson @ 2015-10-01 12:07 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev, Avi Kivity On Thu, Oct 01, 2015 at 02:23:17PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote: > > On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote: > > > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote: > > > > > > > > > > > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote: > > > > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote: > > > > >>It's easy to claim that > > > > >>a solution is around the corner, only no one was looking for it, but the > > > > >>reality is that kernel bypass has been a solution for years for high > > > > >>performance users, > > > > >I never said that it's trivial. > > > > > > > > > >It's probably a lot of work. It's definitely more work than just abusing > > > > >sysfs. > > > > > > > > > >But it looks like a write system call into an eventfd is about 1.5 > > > > >microseconds on my laptop. Even with a system call per packet, system > > > > >call overhead is not what makes DPDK drivers outperform Linux ones. > > > > > > > > > > > > > 1.5 us = 0.6 Mpps per core limit. > > > > > > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps. > > > But for RX, you can batch a lot of packets. > > > > > > You can see by now I'm not that good at benchmarking. > > > Here's what I wrote: > > > > > > > > > #include <stdbool.h> > > > #include <sys/eventfd.h> > > > #include <inttypes.h> > > > #include <unistd.h> > > > > > > > > > int main(int argc, char **argv) > > > { > > > int e = eventfd(0, 0); > > > uint64_t v = 1; > > > > > > int i; > > > > > > for (i = 0; i < 10000000; ++i) { > > > write(e, &v, sizeof v); > > > } > > > } > > > > > > > > > This takes 1.5 seconds to run on my laptop: > > > > > > $ time ./a.out > > > > > > real 0m1.507s > > > user 0m0.179s > > > sys 0m1.328s > > > > > > > > > > dpdk performance is in the tens of > > > > millions of packets per system. > > > > > > I think that's with a bunch of batching though. > > > > > > > It's not just the lack of system calls, of course, the architecture is > > > > completely different. > > > > > > Absolutely - I'm not saying move all of DPDK into kernel. > > > We just need to protect the RX rings so hardware does > > > not corrupt kernel memory. > > > > > > > > > Thinking about it some more, many devices > > > have separate rings for DMA: TX (device reads memory) > > > and RX (device writes memory). > > > With such devices, a mode where userspace can write TX ring > > > but not RX ring might make sense. > > > > > > This will mean userspace might read kernel memory > > > through the device, but can not corrupt it. > > > > > > That's already a big win! > > > > > > And RX buffers do not have to be added one at a time. > > > If we assume 0.2usec per system call, batching some 100 buffers per > > > system call gives you 2 nano seconds overhead. That seems quite > > > reasonable. > > > > > Hi, > > > > just to jump in a bit on this. > > > > Batching of 100 packets is a very large batch, and will add to latency. > > > > This is not on transmit or receive path! > This is only for re-adding buffers to the receive ring. 
> This batching should not add latency at all: > > > process rx: > get packet > packets[n] = alloc packet > if (++n > 100) { > system call: add bufs(packets, n); > } > > > > > > > The > standard batch size in DPDK right now is 32, and even that may be too high for > applications in certain domains. > > > > However, even with that 2ns of overhead calculation, I'd make a few additional > > points. > > * For DPDK, we are reasonably close to being able to do 40GB of IO - both RX > > and TX on a single thread. 10GB of IO doesn't really stress a core any more. For > > 40GB of small packet traffic, the packet arrival rate is 16.8ns, so even with a > > huge batch size of 100 packets, your system call overhead on RX is taking almost > > 12% of our processing time. For a batch size of 32 this overhead would rise to > > over 35% of our packet processing time. > > As I said, yes, measureable, but not breaking the bank, and that's with > 40GB which still are not widespread. > With 10GB and 100 packets, only 3% overhead. > > > For 100G line rate, the packet arrival > > rate is just 6.7ns... > > Hypervisors still have time get their act together and support IOMMUs > by the time 100G systems become widespread. > > > * As well as this overhead from the system call itself, you are also omitting > > the overhead of scanning the RX descriptors. > > I omit it because scanning descriptors can still be done in userspace, > just write-protect the RX ring page. > > > > This in itself is going to use up > > a good proportion of the processing time, as well as that we have to spend cycles > > copying the descriptors from one ring in memory to another. Given that right now > > with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen > > cycles on modern cores, every additional cycle (fraction of a nanosecond) has > > an impact. > > > > Regards, > > /Bruce > > See above. There is no need for that on data path. Only re-adding > buffers requires a system call. > Re-adding buffers is a key part of the data path! Ok, the fact that it's only on descriptor rearm does allow somewhat bigger batches, but the whole point of having the kernel do this extra work you propose is to allow the kernel to scan and sanitize the physical addresses - and that will take a lot of cycles, especially if it has to handle all the different descriptor formats of all the different NICs, as has already been pointed out. /Bruce > -- > MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 12:07 ` Bruce Richardson @ 2015-10-01 13:14 ` Michael S. Tsirkin 2015-10-01 16:04 ` Michael S. Tsirkin 2015-10-01 21:02 ` Alexander Duyck 0 siblings, 2 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 13:14 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev, Avi Kivity On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote: > > > This in itself is going to use up > > > a good proportion of the processing time, as well as that we have to spend cycles > > > copying the descriptors from one ring in memory to another. Given that right now > > > with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen > > > cycles on modern cores, every additional cycle (fraction of a nanosecond) has > > > an impact. > > > > > > Regards, > > > /Bruce > > > > See above. There is no need for that on data path. Only re-adding > > buffers requires a system call. > > > > Re-adding buffers is a key part of the data path! Ok, the fact that its only on > descriptor rearm does allow somewhat bigger batches, That was the point, yes. > but the whole point of having > the kernel do this extra work you propose is to allow the kernel to scan and > sanitize the physical addresses - and that will take a lot of cycles, especially > if it has to handle all the different descriptor formats of all the different NICs, > as has already been pointed out. > > /Bruce Well, the driver would be per NIC, so it would only need to support the specific formats of that NIC. An alternative is to format the descriptors in the kernel, based on just the list of addresses. This seems cleaner, but I don't know how efficient it would be. Device vendors and dpdk developers are probably the best people to figure out what's the best thing to do here. But it looks like it's not going to happen unless security is made a requirement for upstreaming code. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 13:14 ` Michael S. Tsirkin @ 2015-10-01 16:04 ` Michael S. Tsirkin 0 siblings, 0 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 16:04 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev, Avi Kivity On Thu, Oct 01, 2015 at 04:14:33PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote: > > > > This in itself is going to use up > > > > a good proportion of the processing time, as well as that we have to spend cycles > > > > copying the descriptors from one ring in memory to another. Given that right now > > > > with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen > > > > cycles on modern cores, every additional cycle (fraction of a nanosecond) has > > > > an impact. > > > > > > > > Regards, > > > > /Bruce > > > > > > See above. There is no need for that on data path. Only re-adding > > > buffers requires a system call. > > > > > > > Re-adding buffers is a key part of the data path! Ok, the fact that its only on > > descriptor rearm does allow somewhat bigger batches, > > That was the point, yes. > > > but the whole point of having > > the kernel do this extra work you propose is to allow the kernel to scan and > > sanitize the physical addresses - and that will take a lot of cycles, especially > > if it has to handle all the different descriptor formats of all the different NICs, > > as has already been pointed out. > > > > /Bruce > > Well the driver would be per NIC, so there's only need to support > specific formats supported by a given NIC. > > An alternative is to format the descriptors in kernel, based > on just the list of addresses. This seems cleaner, but I don't > know how efficient it would be. Additionally, rearming descriptors can happen on another core, in parallel with processing packets on the first one. This will use more CPU but help you stay within your PPS limits. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 13:14 ` Michael S. Tsirkin 2015-10-01 16:04 ` Michael S. Tsirkin @ 2015-10-01 21:02 ` Alexander Duyck 2015-10-02 14:00 ` Michael S. Tsirkin 1 sibling, 1 reply; 100+ messages in thread From: Alexander Duyck @ 2015-10-01 21:02 UTC (permalink / raw) To: Michael S. Tsirkin, Bruce Richardson; +Cc: dev, Avi Kivity On 10/01/2015 06:14 AM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote: >>>> This in itself is going to use up >>>> a good proportion of the processing time, as well as that we have to spend cycles >>>> copying the descriptors from one ring in memory to another. Given that right now >>>> with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen >>>> cycles on modern cores, every additional cycle (fraction of a nanosecond) has >>>> an impact. >>>> >>>> Regards, >>>> /Bruce >>> See above. There is no need for that on data path. Only re-adding >>> buffers requires a system call. >>> >> Re-adding buffers is a key part of the data path! Ok, the fact that its only on >> descriptor rearm does allow somewhat bigger batches, > That was the point, yes. > >> but the whole point of having >> the kernel do this extra work you propose is to allow the kernel to scan and >> sanitize the physical addresses - and that will take a lot of cycles, especially >> if it has to handle all the different descriptor formats of all the different NICs, >> as has already been pointed out. >> >> /Bruce > Well the driver would be per NIC, so there's only need to support > specific formats supported by a given NIC. One thing that seems to be overlooked in your discussion is the cost to translate these descriptors. It isn't as if most systems running DPDK have the cycles to spare. As I believe was brought up in another thread, we are looking at a budget of something like 68ns at 10Gbps line rate. The overhead for having to go through and translate/parse/validate the descriptors would end up being pretty significant. If you need proof of that, just try running the ixgbe driver and route small packets. We end up spending something like 40ns in ixgbe_clean_rx_irq and that is mostly just translating the descriptor bits into the correct sk_buff bits. Also trying to maintain a user-space ring in addition to the kernel-space ring means that much more memory overhead and increases the likelihood of things getting pushed out of the L1 cache. As far as the descriptor validation itself goes, the overhead for that would guarantee that you cannot get any performance out of the device. There are too many corner cases that would have to be addressed in validating user-space input to allow for us to process packets in any sort of timely fashion. For starters, we would have to validate the size, alignment, and ownership of a given buffer. If it is a transmit buffer, we have to go through and validate any offloads being requested. Likely just the validation and translation would add 10s if not 100s of nanoseconds to the time needed to process each packet. In addition, we are talking about doing this in kernel space, which means we wouldn't really be able to take advantage of things like SSE or AVX instructions. > An alternative is to format the descriptors in kernel, based > on just the list of addresses. This seems cleaner, but I don't > know how efficient it would be. 
> > Device vendors and dpdk developers are probably the best people to > figure out what's the best thing to do here. As far as the bifurcated driver approach goes, the only way something like that would ever work is if you could limit the access via an IOMMU. At least everything I have seen proposed for a bifurcated driver still involved one if they expected to get any performance. > But it looks like it's not going to happen unless security is made > a requirement for upstreaming code. The fact is we already ship uio_pci_generic. User space drivers are here to stay. What is being asked for is an extension to the existing infrastructure to allow MSI-X interrupts to trigger an event on a file descriptor. As far as I know that doesn't add any additional security risk since it is the kernel PCIe subsystem itself that would be programming the address and data for said device; it wouldn't actually grant any more access other than the additional file descriptors to support MSI-X vectors. Anyway, that is just my $.02 on this. - Alex ^ permalink raw reply [flat|nested] 100+ messages in thread
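From userspace, the MSI-X extension Alexander describes would be consumed roughly like this - one file descriptor per vector, multiplexed with epoll (the per-vector fds and their exact semantics are the hypothetical part here; this is essentially the eventfd model vfio already uses for interrupts):

    #include <stdint.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    /* Wait for an MSI-X interrupt on a set of file descriptors, one per
     * vector, previously registered with the epoll instance epfd.
     * Returns the fd of the vector that fired, or -1 on error. */
    static int wait_for_vector(int epfd)
    {
            struct epoll_event ev;
            uint64_t count;

            if (epoll_wait(epfd, &ev, 1, -1) < 1)
                    return -1;

            /* reading clears the eventfd-style counter for that vector */
            if (read(ev.data.fd, &count, sizeof(count)) != sizeof(count))
                    return -1;
            return ev.data.fd;
    }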
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 21:02 ` Alexander Duyck @ 2015-10-02 14:00 ` Michael S. Tsirkin 2015-10-02 14:07 ` Bruce Richardson ` (2 more replies) 0 siblings, 3 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-02 14:00 UTC (permalink / raw) To: Alexander Duyck; +Cc: dev, Avi Kivity On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote: > validation and translation would add 10s if not 100s of nanoseconds to the > time needed to process each packet. In addition we are talking about doing > this in kernel space which means we wouldn't really be able to take > advantage of things like SSE or AVX instructions. Yes. But the nice thing is that it's rearming, so it can happen on a separate core, in parallel with packet processing. It does not need to add to latency. You will burn up more CPU, but again, all this for boxes/hypervisors without an IOMMU. I'm sure people can come up with even better approaches, once enough people get it that the kernel absolutely needs to be protected from userspace. Long term, the right thing to do is to focus on IOMMU support. This gives you hardware-based memory protection without the need to burn up CPU cycles. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-02 14:00 ` Michael S. Tsirkin @ 2015-10-02 14:07 ` Bruce Richardson 2015-10-04 9:07 ` Michael S. Tsirkin 0 siblings, 1 reply; 100+ messages in thread From: Bruce Richardson @ 2015-10-02 14:07 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev, Avi Kivity On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote: > > validation and translation would add 10s if not 100s of nanoseconds to the > > time needed to process each packet. In addition we are talking about doing > > this in kernel space which means we wouldn't really be able to take > > advantage of things like SSE or AVX instructions. > > Yes. But the nice thing is that it's rearming so it can happen on > a separate core, in parallel with packet processing. > It does not need to add to latency. > > You will burn up more CPU, but again, all this for boxes/hypervisors > without an IOMMU. > > I'm sure people can come up with even better approaches, once enough > people get it that kernel absolutely needs to be protected from > userspace. > > Long term, the right thing to do is to focus on IOMMU support. This > gives you hardware-based memory protection without need to burn up CPU > cycles. > > -- > MST Running it on another core will have its own problems. The main one that springs to mind for me is the performance impact of having all those cache lines shared between the two cores. /Bruce ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-02 14:07 ` Bruce Richardson @ 2015-10-04 9:07 ` Michael S. Tsirkin 0 siblings, 0 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-04 9:07 UTC (permalink / raw) To: Bruce Richardson; +Cc: dev, Avi Kivity On Fri, Oct 02, 2015 at 03:07:24PM +0100, Bruce Richardson wrote: > On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote: > > On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote: > > > validation and translation would add 10s if not 100s of nanoseconds to the > > > time needed to process each packet. In addition we are talking about doing > > > this in kernel space which means we wouldn't really be able to take > > > advantage of things like SSE or AVX instructions. > > > > Yes. But the nice thing is that it's rearming so it can happen on > > a separate core, in parallel with packet processing. > > It does not need to add to latency. > > > > You will burn up more CPU, but again, all this for boxes/hypervisors > > without an IOMMU. > > > > I'm sure people can come up with even better approaches, once enough > > people get it that kernel absolutely needs to be protected from > > userspace. > > > > Long term, the right thing to do is to focus on IOMMU support. This > > gives you hardware-based memory protection without need to burn up CPU > > cycles. > > > > -- > > MST > > Running it on another will have it's own problems. The main one that springs to > mind for me is the performance impact of having all those cache lines shared > between the two cores. > > /Bruce The cache line is currently invalidated by the device write: packet processing core -> device -> packet processing core. We are adding another stage: packet processing core -> rearming core -> device -> packet processing core. But the amount of sharing per core isn't increased. This is something that can be tried immediately without kernel changes. Who knows, maybe you will actually be able to push more pps this way. Further, rearming is not doing a lot besides moving bits around in memory, and it's in the kernel, so it uses very limited resources - maybe we can efficiently use an HT logical core for this task. This remains to be seen. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
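A sketch of trying exactly that without kernel changes - a rearm thread pinned to a chosen logical core, e.g. the HT sibling of the packet-processing core (rearm_rx_ring() is a hypothetical stand-in for whatever batched re-add mechanism ends up existing):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    extern void rearm_rx_ring(void);  /* hypothetical: batch re-add RX buffers */

    static void *rearm_thread(void *arg)
    {
            (void)arg;
            for (;;)
                    rearm_rx_ring();
            return NULL;
    }

    /* Start the rearming work on a chosen logical core. */
    static int start_rearm_core(int cpu)
    {
            pthread_t tid;
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);

            if (pthread_create(&tid, NULL, rearm_thread, NULL))
                    return -1;
            return pthread_setaffinity_np(tid, sizeof(set), &set);
    }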
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-02 14:00 ` Michael S. Tsirkin 2015-10-02 14:07 ` Bruce Richardson @ 2015-10-02 15:56 ` Gleb Natapov 2015-10-02 16:57 ` Alexander Duyck 2 siblings, 0 replies; 100+ messages in thread From: Gleb Natapov @ 2015-10-02 15:56 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev, Avi Kivity On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote: > > validation and translation would add 10s if not 100s of nanoseconds to the > > time needed to process each packet. In addition we are talking about doing > > this in kernel space which means we wouldn't really be able to take > > advantage of things like SSE or AVX instructions. > > Yes. But the nice thing is that it's rearming so it can happen on > a separate core, in parallel with packet processing. > It does not need to add to latency. > Modern NICs have no fewer queues than most machines have cores. There is no such thing as a free core to offload your processing to; otherwise you designed your application wrong and are wasting CPU cycles. > You will burn up more CPU, but again, all this for boxes/hypervisors > without an IOMMU. > > I'm sure people can come up with even better approaches, once enough > people get it that the kernel absolutely needs to be protected from > userspace. > People should not "get" things which are, let's be polite here, untrue. The kernel never tried to protect itself from userspace running on behalf of root. Secure boot, which is quite recent, is maybe the only instance where the kernel tries to do so (unfortunately), and it does so by disabling things if boot is secure. Linux was always a "jack of all trades" and was suitable to run on a machine with secure boot, on a VM that acts as an application container, or on an embedded device running packet forwarding. The only valid point is that nobody should have to debug crashes that may be caused by buggy userspace, and tainting the kernel solves that. > Long term, the right thing to do is to focus on IOMMU support. This > gives you hardware-based memory protection without the need to burn up CPU > cycles. > > -- > MST -- Gleb. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-02 14:00 ` Michael S. Tsirkin 2015-10-02 14:07 ` Bruce Richardson 2015-10-02 15:56 ` Gleb Natapov @ 2015-10-02 16:57 ` Alexander Duyck 2 siblings, 0 replies; 100+ messages in thread From: Alexander Duyck @ 2015-10-02 16:57 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev, Avi Kivity On 10/02/2015 07:00 AM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote: >> validation and translation would add 10s if not 100s of nanoseconds to the >> time needed to process each packet. In addition we are talking about doing >> this in kernel space which means we wouldn't really be able to take >> advantage of things like SSE or AVX instructions. > Yes. But the nice thing is that it's rearming so it can happen on > a separate core, in parallel with packet processing. > It does not need to add to latency. Moving it to another core is automatically going to add extra latency. You will have to evict the data out of the L1 cache for one core and into the L1 cache for another when you update it, and then reading it will force it to transition back out. If you are lucky it is only evicted to L2; if not, then to L3, or possibly even back to memory. Odds are that alone will add tens of nanoseconds to the process, and you would need three or more cores to do the same workload, as running the process over multiple threads means having to add synchronization primitives to the whole mess. Then there is the NUMA factor on top of that. > You will burn up more CPU, but again, all this for boxes/hypervisors > without an IOMMU. There are use cases this will make completely useless. If, for example, you are running a workload that needs three cores with DPDK, bumping it to nine or more will likely push you out of being able to do the workload on some systems. > I'm sure people can come up with even better approaches, once enough > people get it that the kernel absolutely needs to be protected from > userspace. I don't see that happening. Many people don't care about kernel security that much. If they did, something like DPDK wouldn't have gotten off the ground. Once someone has the ability to load kernel modules, any protection of the kernel from userspace pretty much goes right out the window. You are just as much at risk from a buggy driver in userspace as you are from one that can be added to the kernel. > Long term, the right thing to do is to focus on IOMMU support. This > gives you hardware-based memory protection without the need to burn up CPU > cycles. We have a solution that makes use of IOMMU support with vfio. The problem is there are multiple cases where that support is either not available, or where using the IOMMU adds excessive overhead. - Alex ^ permalink raw reply [flat|nested] 100+ messages in thread
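Alex's cache-line numbers are easy to sanity-check with a toy micro-benchmark. The sketch below (illustrative, not from the thread) bounces one cache line between two threads and prints the average cost per transfer; pin the threads to separate physical cores (e.g. with pthread_setaffinity_np, omitted here for brevity) if you want stable numbers.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000L

    /* one cache line ping-ponged between the two threads */
    static _Alignas(64) _Atomic long line;

    static void *bouncer(void *arg)
    {
        long me = (long)arg;
        for (long i = 0; i < ITERS; i++) {
            /* spin until it's our turn, then hand the line back */
            while (atomic_load_explicit(&line, memory_order_acquire) % 2 != me)
                ;
            atomic_fetch_add_explicit(&line, 1, memory_order_release);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&a, NULL, bouncer, (void *)0L);
        pthread_create(&b, NULL, bouncer, (void *)1L);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* 2 * ITERS increments, each one a core-to-core transfer */
        printf("%.1f ns per cache-line transfer\n", ns / (2.0 * ITERS));
        return 0;
    }

On typical same-socket hardware this lands roughly in the tens of nanoseconds per hop, consistent with the estimate above, and gets worse across NUMA nodes.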
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 8:52 ` Avi Kivity 2015-10-01 9:15 ` Michael S. Tsirkin @ 2015-10-01 9:15 ` Avi Kivity 2015-10-01 9:29 ` Michael S. Tsirkin 1 sibling, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-10-01 9:15 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 11:52 AM, Avi Kivity wrote: > > > On 10/01/2015 11:44 AM, Michael S. Tsirkin wrote: >> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote: >>>> And for what, to prevent >>>> root from touching memory via dma that they can access in a million other >>>> ways? >>> So one can be reasonably sure a kernel oops is not a result of a >>> userspace bug. >> Actually, I thought about this overnight, and it should be possible to >> drive it securely from userspace, without hypervisor changes. > > Also without the performance that was the whole reason for doing it > in userspace in the first place. > > I still don't understand your objection to the patch: > >> MSI messages are memory writes so any generic device capable >> of MSI is capable of corrupting kernel memory. >> This means that a bug in userspace will lead to kernel memory corruption >> and crashes. This is something distributions can't support. > And this: > What userspace can't be allowed to do: > > access BAR > write rings > It can access the BAR by mmap()ing the resourceN files under sysfs. You're not denying userspace the ability to oops the kernel, just the ability to do useful things with hardware. ^ permalink raw reply [flat|nested] 100+ messages in thread
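For anyone who has not seen the sysfs route Avi refers to: every PCI device's BARs are exposed as resourceN files, and root can simply mmap() them. A minimal sketch, reusing the 82599 VF address and the 16K BAR0 size from the lspci output earlier in the thread (run as root; the register read is purely illustrative):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/sys/bus/pci/devices/0000:00:04.0/resource0";
        int fd = open(path, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        /* BAR0 is 16K according to the lspci output above */
        volatile uint32_t *bar = mmap(NULL, 16384, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* From here every device register - including the ring base
         * address registers - is readable and writable, which is
         * exactly Avi's point: the DMA door is already open. */
        printf("first register: 0x%08x\n", bar[0]);

        munmap((void *)bar, 16384);
        close(fd);
        return 0;
    }

No driver needs to be bound to the device for this to work, which is why restricting uio_pci_generic alone buys little isolation.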
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:15 ` Avi Kivity @ 2015-10-01 9:29 ` Michael S. Tsirkin 2015-10-01 9:38 ` Avi Kivity 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 9:29 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 12:15:49PM +0300, Avi Kivity wrote: > What userspace can't be allowed to do: > > access BAR > write rings > > > > > It can access the BAR by mmap()ing the resourceN files under sysfs. You're not > denying userspace the ability to oops the kernel, just the ability to do useful > things with hardware. This interface has to stay there to support existing applications. A variety of measures (selinux, secureboot) can be used to make sure modern ones do not get to touch it. Most distributions enable some or all of these by default. And it doesn't mean modern drivers can do this kind of thing. Look, without an IOMMU, sysfs cannot be used securely: you need some other interface. This is what this driver is missing. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:29 ` Michael S. Tsirkin @ 2015-10-01 9:38 ` Avi Kivity 2015-10-01 10:07 ` Michael S. Tsirkin 0 siblings, 1 reply; 100+ messages in thread From: Avi Kivity @ 2015-10-01 9:38 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 12:29 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:15:49PM +0300, Avi Kivity wrote: >> What userspace can't be allowed to do: >> >> access BAR >> write rings >> >> >> >> >> It can access the BAR by mmap()ing the resourceN files under sysfs. You're not >> denying userspace the ability to oops the kernel, just the ability to do useful >> things with hardware. > > This interface has to stay there to support existing applications. A > variety of measures (selinux, secureboot) can be used to make sure > modern ones do not get to touch it. By all means, secure the driver with selinux as well. > Most distributions enable > some or all of these by default. There is no problem accessing the BARs on the most modern and secure enterprise distribution I am aware of (CentOS 7.1). > > And it doesn't mean modern drivers can do this kind of thing. > > Look, without an IOMMU, sysfs cannot be used securely: > you need some other interface. This is what this driver is missing. What is this magical missing interface? It simply cannot be done. You either have an iommu, or you accept that userspace can access anything via DMA. The sad thing is that you can do this since forever on a non-virtualized system, or on a virtualized system if you don't need interrupt support. All you're doing is blocking interrupt support on virtualized systems. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 9:38 ` Avi Kivity @ 2015-10-01 10:07 ` Michael S. Tsirkin 2015-10-01 10:11 ` Avi Kivity 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 10:07 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 12:38:51PM +0300, Avi Kivity wrote: > The sad thing is that you can do this since forever on a non-virtualized > system, or on a virtualized system if you don't need interrupt support. All > you're doing is blocking interrupt support on virtualized systems. True, Linux could do more to prevent this kind of abuse. In fact IIRC, if you enable secureboot, it does exactly that. A generic uio driver isn't a good interface because it relies on these sysfs files. We are lucky it doesn't work for VFs; I don't think we should do anything that relies on this interface in future applications. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 10:07 ` Michael S. Tsirkin @ 2015-10-01 10:11 ` Avi Kivity 0 siblings, 0 replies; 100+ messages in thread From: Avi Kivity @ 2015-10-01 10:11 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On 10/01/2015 01:07 PM, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 12:38:51PM +0300, Avi Kivity wrote: >> The sad thing is that you can do this since forever on a non-virtualized >> system, or on a virtualized system if you don't need interrupt support. All >> you're doing is blocking interrupt support on virtualized systems. > True, Linux could do more to prevent this kind of abuse. > In fact IIRC, if you enable secureboot, it does exactly that. > > A generic uio driver isn't a good interface because it relies on these > sysfs files. We are lucky it doesn't work for VFs; I don't think we > should do anything that relies on this interface in future applications. > I agree that uio is not a good solution. But for some users, whom we are discussing now, it is the only solution. A bad solution is better than no solution. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-10-01 8:44 ` Michael S. Tsirkin 2015-10-01 8:46 ` Vlad Zolotarov 2015-10-01 8:52 ` Avi Kivity @ 2015-10-01 9:16 ` Michael S. Tsirkin 2 siblings, 0 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-10-01 9:16 UTC (permalink / raw) To: Avi Kivity; +Cc: dev On Thu, Oct 01, 2015 at 11:44:28AM +0300, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote: > > > And for what, to prevent > > > root from touching memory via dma that they can access in a million other > > > ways? > > > > So one can be reasonably sure a kernel oops is not a result of a > > userspace bug. > > Actually, I thought about this overnight, and it should be possible to > drive it securely from userspace, without hypervisor changes. > > See > > https://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com Ouch, looks like gmane doesn't do https. Sorry, this is the correct link: http://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com > > > > -- > > MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 10:37 ` Vlad Zolotarov 2015-09-30 10:58 ` Michael S. Tsirkin @ 2015-09-30 17:28 ` Stephen Hemminger 2015-09-30 17:39 ` Michael S. Tsirkin 1 sibling, 1 reply; 100+ messages in thread From: Stephen Hemminger @ 2015-09-30 17:28 UTC (permalink / raw) To: Vlad Zolotarov; +Cc: dev, Michael S. Tsirkin On Wed, 30 Sep 2015 13:37:22 +0300 Vlad Zolotarov <vladz@cloudius-systems.com> wrote: > > > On 09/30/15 00:49, Michael S. Tsirkin wrote: > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote: > >> On Tue, 29 Sep 2015 23:54:54 +0300 > >> "Michael S. Tsirkin" <mst@redhat.com> wrote: > >> > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote: > >>>> The security breach motivation u brought in "[RFC PATCH] uio: > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak > >>>> since once u let the userland access the bar it may do any funny thing > >>>> using the DMA engine of the device. This kind of stuff should be prevented > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X > >>>> configuration will be prevented too. > >>>> > >>>> I'm about to send the patch to main Linux mailing list. Let's continue this > >>>> discussion there. > >>>> > >>> Basically UIO shouldn't be used with devices capable of DMA. > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU). > > If there is an IOMMU in the picture there shouldn't be any problem to > use UIO with DMA capable devices. > > >>> I don't think this can change. > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK > >> use, I can't accept that. > > QEMU does allow emulating an iommu. > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an > option there. And again, it's a general issue not DPDK specific. > Today one has to develop some proprietary modules (like igb_uio) to > workaround the issue and this is lame. IMHO uio_pci_generic should > be fixed to be able to properly work within any virtualized environment > and not only with KVM. > Also VMware (bigger problem) has no IOMMU emulation. Other environments as well (Windriver, GCE) have no IOMMU. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 17:28 ` Stephen Hemminger @ 2015-09-30 17:39 ` Michael S. Tsirkin 2015-09-30 17:43 ` Stephen Hemminger 2015-09-30 17:44 ` Gleb Natapov 1 sibling, 2 replies; 100+ messages in thread From: Michael S. Tsirkin @ 2015-09-30 17:39 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote: > On Wed, 30 Sep 2015 13:37:22 +0300 > Vlad Zolotarov <vladz@cloudius-systems.com> wrote: > > > > > > > On 09/30/15 00:49, Michael S. Tsirkin wrote: > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote: > > >> On Tue, 29 Sep 2015 23:54:54 +0300 > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote: > > >> > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote: > > >>>> The security breach motivation u brought in "[RFC PATCH] uio: > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak > > >>>> since once u let the userland access the bar it may do any funny thing > > >>>> using the DMA engine of the device. This kind of stuff should be prevented > > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X > > >>>> configuration will be prevented too. > > >>>> > > >>>> I'm about to send the patch to main Linux mailing list. Let's continue this > > >>>> discussion there. > > >>>> > > >>> Basically UIO shouldn't be used with devices capable of DMA. > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU). > > > > If there is an IOMMU in the picture there shouldn't be any problem to > > use UIO with DMA capable devices. > > > > >>> I don't think this can change. > > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK > > >> use, I can't accept that. > > > QEMU does allow emulating an iommu. > > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an > > option there. And again, it's a general issue not DPDK specific. > > Today one has to develop some proprietary modules (like igb_uio) to > > workaround the issue and this is lame. IMHO uio_pci_generic should > > be fixed to be able to properly work within any virtualized environment > > and not only with KVM. > > > > Also VMware (bigger problem) has no IOMMU emulation. > Other environments as well (Windriver, GCE) have no IOMMU. Because the use-case of userspace drivers is not important enough? Without an IOMMU, there's no way to have secure userspace drivers. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 17:39 ` Michael S. Tsirkin @ 2015-09-30 17:43 ` Stephen Hemminger 2015-09-30 18:50 ` Michael S. Tsirkin 2015-09-30 17:44 ` Gleb Natapov 1 sibling, 1 reply; 100+ messages in thread From: Stephen Hemminger @ 2015-09-30 17:43 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On Wed, 30 Sep 2015 20:39:43 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote: > > On Wed, 30 Sep 2015 13:37:22 +0300 > > Vlad Zolotarov <vladz@cloudius-systems.com> wrote: > > > > > > > > > > > On 09/30/15 00:49, Michael S. Tsirkin wrote: > > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote: > > > >> On Tue, 29 Sep 2015 23:54:54 +0300 > > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > >> > > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote: > > > >>>> The security breach motivation u brought in "[RFC PATCH] uio: > > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak > > > >>>> since once u let the userland access the bar it may do any funny thing > > > >>>> using the DMA engine of the device. This kind of stuff should be prevented > > > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X > > > >>>> configuration will be prevented too. > > > >>>> > > > >>>> I'm about to send the patch to main Linux mailing list. Let's continue this > > > >>>> discussion there. > > > >>>> > > > >>> Basically UIO shouldn't be used with devices capable of DMA. > > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU). > > > > > > If there is an IOMMU in the picture there shouldn't be any problem to > > > use UIO with DMA capable devices. > > > > > > >>> I don't think this can change. > > > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK > > > >> use, I can't accept that. > > > > QEMU does allow emulating an iommu. > > > > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an > > > option there. And again, it's a general issue not DPDK specific. > > > Today one has to develop some proprietary modules (like igb_uio) to > > > workaround the issue and this is lame. IMHO uio_pci_generic should > > > be fixed to be able to properly work within any virtualized environment > > > and not only with KVM. > > > > > > > Also VMware (bigger problem) has no IOMMU emulation. > > Other environments as well (Windriver, GCE) have no IOMMU. > > Because the use-case of userspace drivers is not important enough? > Without an IOMMU, there's no way to have secure userspace drivers. Look at Cloudius, there is no necessity of security in the guest. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 17:43 ` Stephen Hemminger @ 2015-09-30 18:50 ` Michael S. Tsirkin 2015-09-30 20:00 ` Gleb Natapov 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-09-30 18:50 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev On Wed, Sep 30, 2015 at 10:43:04AM -0700, Stephen Hemminger wrote: > On Wed, 30 Sep 2015 20:39:43 +0300 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote: > > > On Wed, 30 Sep 2015 13:37:22 +0300 > > > Vlad Zolotarov <vladz@cloudius-systems.com> wrote: > > > > > > > > > > > > > > > On 09/30/15 00:49, Michael S. Tsirkin wrote: > > > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote: > > > > >> On Tue, 29 Sep 2015 23:54:54 +0300 > > > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > >> > > > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote: > > > > >>>> The security breach motivation u brought in "[RFC PATCH] uio: > > > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak > > > > >>>> since once u let the userland access the bar it may do any funny thing > > > > >>>> using the DMA engine of the device. This kind of stuff should be prevented > > > > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X > > > > >>>> configuration will be prevented too. > > > > >>>> > > > > >>>> I'm about to send the patch to main Linux mailing list. Let's continue this > > > > >>>> discussion there. > > > > >>>> > > > > >>> Basically UIO shouldn't be used with devices capable of DMA. > > > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU). > > > > > > > > If there is an IOMMU in the picture there shouldn't be any problem to > > > > use UIO with DMA capable devices. > > > > > > > > >>> I don't think this can change. > > > > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK > > > > >> use, I can't accept that. > > > > > QEMU does allow emulating an iommu. > > > > > > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an > > > > option there. And again, it's a general issue not DPDK specific. > > > > Today one has to develop some proprietary modules (like igb_uio) to > > > > workaround the issue and this is lame. IMHO uio_pci_generic should > > > > be fixed to be able to properly work within any virtualized environment > > > > and not only with KVM. > > > > > > > > > > Also VMware (bigger problem) has no IOMMU emulation. > > > Other environments as well (Windriver, GCE) have no IOMMU. > > > > Because the use-case of userspace drivers is not important enough? > > Without an IOMMU, there's no way to have secure userspace drivers. > > Look at Cloudius, there is no necessity of security in the guest. It's an interesting concept, isn't it? So why not do what Cloudius does, and run this task code in ring 0 then, allocating all memory in the kernel range? You are increasing interrupt latency by a huge factor by channeling interrupts through a scheduler. Let the user install an interrupt handler function, and be done with it. -- MST ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 18:50 ` Michael S. Tsirkin @ 2015-09-30 20:00 ` Gleb Natapov 2015-09-30 20:36 ` Michael S. Tsirkin 0 siblings, 1 reply; 100+ messages in thread From: Gleb Natapov @ 2015-09-30 20:00 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On Wed, Sep 30, 2015 at 09:50:08PM +0300, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 10:43:04AM -0700, Stephen Hemminger wrote: > > On Wed, 30 Sep 2015 20:39:43 +0300 > > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote: > > > > On Wed, 30 Sep 2015 13:37:22 +0300 > > > > Vlad Zolotarov <vladz@cloudius-systems.com> wrote: > > > > > > > > > > > > > > > > > > > On 09/30/15 00:49, Michael S. Tsirkin wrote: > > > > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote: > > > > > >> On Tue, 29 Sep 2015 23:54:54 +0300 > > > > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > > >> > > > > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote: > > > > > >>>> The security breach motivation u brought in "[RFC PATCH] uio: > > > > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak > > > > > >>>> since once u let the userland access the bar it may do any funny thing > > > > > >>>> using the DMA engine of the device. This kind of stuff should be prevented > > > > > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X > > > > > >>>> configuration will be prevented too. > > > > > >>>> > > > > > >>>> I'm about to send the patch to main Linux mailing list. Let's continue this > > > > > >>>> discussion there. > > > > > >>>> > > > > > >>> Basically UIO shouldn't be used with devices capable of DMA. > > > > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU). > > > > > > > > > > If there is an IOMMU in the picture there shouldn't be any problem to > > > > > use UIO with DMA capable devices. > > > > > > > > > > >>> I don't think this can change. > > > > > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK > > > > > >> use, I can't accept that. > > > > > > QEMU does allow emulating an iommu. > > > > > > > > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an > > > > > option there. And again, it's a general issue not DPDK specific. > > > > > Today one has to develop some proprietary modules (like igb_uio) to > > > > > workaround the issue and this is lame. IMHO uio_pci_generic should > > > > > be fixed to be able to properly work within any virtualized environment > > > > > and not only with KVM. > > > > > > > > > > > > > Also VMware (bigger problem) has no IOMMU emulation. > > > > Other environments as well (Windriver, GCE) have no IOMMU. > > > > > > Because the use-case of userspace drivers is not important enough? > > > Without an IOMMU, there's no way to have secure userspace drivers. > > > > Look at Cloudius, there is no necessity of security in the guest. > > It's an interesting concept, isn't it? > It is. > So why not do what Cloudius does, and run this task code in ring 0 then, > allocating all memory in the kernel range? > Except this is not what Cloudius does. The idea of OSv is that it can run your regular userspace application, but remove unneeded levels of indirection by bypassing userspace/kernelspace communication (among other things). The application still uses virtual memory, not directly mapped physical memory like Linux ring 0 has. 
You can achieve most of the benefits of kernel bypass on Linux too, but unlike OSv you need to code for it. UIO is one of those things that allows that. > You are increasing interrupt latency by a huge factor by channeling > interrupts through a scheduler. Let the user install an > interrupt handler function, and be done with it. > Interrupt latency is not always hugely important. If you enter interrupt mode only when idle, a hundred more us on the first packet will not kill you. If interrupt latency is important then uio may not be the right solution, but then neither is vfio. -- Gleb. ^ permalink raw reply [flat|nested] 100+ messages in thread
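As an aside for readers following the interrupt-mode idea: the UIO file interface supports exactly this pattern. Reading /dev/uioN blocks until the next interrupt and returns an event count, and for uio_pci_generic a 4-byte write of 1 or 0 unmasks or masks the interrupt. A rough sketch of a poll-then-sleep loop follows, where poll_rx_ring() is a hypothetical stand-in for the real receive path and the idle threshold is arbitrary - and which of course assumes a device whose interrupt actually fires, the very problem this thread started with for VFs.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* stand-in for the PMD's receive loop: packets handled this pass */
    static int poll_rx_ring(void) { return 0; }

    int main(void)
    {
        int uio = open("/dev/uio0", O_RDWR);
        if (uio < 0) { perror("open /dev/uio0"); return 1; }

        for (;;) {
            /* busy-poll until the ring has been quiet for a while */
            int idle = 0;
            while (idle < 100000)
                idle = poll_rx_ring() ? 0 : idle + 1;

            uint32_t unmask = 1, events;
            /* irqcontrol: re-enable the interrupt before sleeping */
            if (write(uio, &unmask, sizeof(unmask)) != sizeof(unmask))
                break;
            /* blocks until the next interrupt; the kernel handler
             * masks it again, so we simply go back to polling */
            if (read(uio, &events, sizeof(events)) != sizeof(events))
                break;
        }
        close(uio);
        return 0;
    }

The extra hundred microseconds Gleb is willing to pay is the wakeup through read(); by construction it is only ever paid on a first packet after the ring has gone idle.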
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 20:00 ` Gleb Natapov @ 2015-09-30 20:36 ` Michael S. Tsirkin 2015-10-01 5:04 ` Gleb Natapov 0 siblings, 1 reply; 100+ messages in thread From: Michael S. Tsirkin @ 2015-09-30 20:36 UTC (permalink / raw) To: Gleb Natapov; +Cc: dev On Wed, Sep 30, 2015 at 11:00:49PM +0300, Gleb Natapov wrote: > > You are increasing interrupt latency by a huge factor by channeling > > interrupts through a scheduler. Let the user install an > > interrupt handler function, and be done with it. > > > Interrupt latency is not always hugely important. If you enter interrupt > mode only when idle, a hundred more us on the first packet will not kill you. It certainly affects worst-case latency. And if you lower interrupt latency, you can go idle faster, so it affects power too. > If > interrupt latency is important then uio may not be the right solution, > but then neither is vfio. That's what I'm saying: if you don't need memory isolation you can do better than just slightly tweaking existing drivers. > -- > Gleb. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 20:36 ` Michael S. Tsirkin @ 2015-10-01 5:04 ` Gleb Natapov 0 siblings, 0 replies; 100+ messages in thread From: Gleb Natapov @ 2015-10-01 5:04 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On Wed, Sep 30, 2015 at 11:36:58PM +0300, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 11:00:49PM +0300, Gleb Natapov wrote: > > > You are increasing interrupt latency by a huge factor by channeling > > > interrupts through a scheduler. Let the user install an > > > interrupt handler function, and be done with it. > > > > > Interrupt latency is not always hugely important. If you enter interrupt > > mode only when idle, a hundred more us on the first packet will not kill you. > > It certainly affects worst-case latency. And if you lower interrupt > latency, you can go idle faster, so it affects power too. > We are polling 100% now. Going idle faster is the least of our concerns. > > If > > interrupt latency is important then uio may not be the right solution, > > but then neither is vfio. > > That's what I'm saying: if you don't need memory isolation you can do > better than just slightly tweaking existing drivers. > No, you are forcing everyone to code in the kernel no matter if it makes sense or not. You decide for everyone what is good for them. Believe me, people here know about the trade-offs and have made the appropriate considerations. -- Gleb. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance 2015-09-30 17:39 ` Michael S. Tsirkin 2015-09-30 17:43 ` Stephen Hemminger @ 2015-09-30 17:44 ` Gleb Natapov 1 sibling, 0 replies; 100+ messages in thread From: Gleb Natapov @ 2015-09-30 17:44 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: dev On Wed, Sep 30, 2015 at 08:39:43PM +0300, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote: > > On Wed, 30 Sep 2015 13:37:22 +0300 > > Vlad Zolotarov <vladz@cloudius-systems.com> wrote: > > > > > > > > > > > On 09/30/15 00:49, Michael S. Tsirkin wrote: > > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote: > > > >> On Tue, 29 Sep 2015 23:54:54 +0300 > > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > >> > > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote: > > > >>>> The security breach motivation u brought in "[RFC PATCH] uio: > > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak > > > >>>> since once u let the userland access the bar it may do any funny thing > > > >>>> using the DMA engine of the device. This kind of stuff should be prevented > > > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X > > > >>>> configuration will be prevented too. > > > >>>> > > > >>>> I'm about to send the patch to main Linux mailing list. Let's continue this > > > >>>> discussion there. > > > >>>> > > > >>> Basically UIO shouldn't be used with devices capable of DMA. > > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU). > > > > > > If there is an IOMMU in the picture there shouldn't be any problem to > > > use UIO with DMA capable devices. > > > > > > >>> I don't think this can change. > > > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK > > > >> use, I can't accept that. > > > > QEMU does allow emulating an iommu. > > > > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an > > > option there. And again, it's a general issue not DPDK specific. > > > Today one has to develop some proprietary modules (like igb_uio) to > > > workaround the issue and this is lame. IMHO uio_pci_generic should > > > be fixed to be able to properly work within any virtualized environment > > > and not only with KVM. > > > > > > > Also VMware (bigger problem) has no IOMMU emulation. > > Other environments as well (Windriver, GCE) have no IOMMU. > > Because the use-case of userspace drivers is not important enough? Because "secure" userspace drivers are not important enough. > Without an IOMMU, there's no way to have secure userspace drivers. > People use VMs as application containers, not as machines that need to be secured for a multiuser scenario. -- Gleb. ^ permalink raw reply [flat|nested] 100+ messages in thread