DPDK patches and discussions
* [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
@ 2015-09-27  7:05 Vlad Zolotarov
  2015-09-27  9:43 ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-27  7:05 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin

Hi,
I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on 
Amazon EC2 instances with Enhanced Networking enabled.
The idea is to create a DPDK environment that doesn't require compiling 
kernel modules (igb_uio).
However, I was surprised to discover that uio_pci_generic refuses to work 
with an EN device on AWS:

$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

$ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
Error: bind failed for 0000:00:04.0 - Cannot bind to driver uio_pci_generic

$ dmesg

--> snip <---
[  816.655575] uio_pci_generic 0000:00:04.0: No IRQ assigned to device: no support for interrupts?

$ sudo lspci -s 00:04.0 -vvv
00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
	Physical Slot: 4
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
	Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
	Capabilities: [70] MSI-X: Enable- Count=3 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000
	Kernel modules: ixgbevf

So, as we can see, the PCI device indeed has no INTx interrupt line 
assigned. It does, however, have an MSI-X capability.
Looking at the uio_pci_generic code, it seems to require INTx:

uio_pci_generic.c: line 74: probe():

	if (!pdev->irq) {
		dev_warn(&pdev->dev, "No IRQ assigned to device: "
			 "no support for interrupts?\n");
		pci_disable_device(pdev);
		return -ENODEV;
	}

Is this a known limitation? Michael, could you please comment on this?

thanks,
vlad


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-27  7:05 [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance Vlad Zolotarov
@ 2015-09-27  9:43 ` Michael S. Tsirkin
  2015-09-27 10:50   ` Vladislav Zolotarov
  2015-09-29 16:41   ` Vlad Zolotarov
  0 siblings, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-27  9:43 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Sun, Sep 27, 2015 at 10:05:11AM +0300, Vlad Zolotarov wrote:
> Hi,
> I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
> Amazon EC2 instances with Enhanced Networking enabled.
> The idea is to create a DPDK environment that doesn't require compiling
> kernel modules (igb_uio).
> However, I was surprised to discover that uio_pci_generic refuses to work
> with an EN device on AWS:
> 
> $ lspci
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
> 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
> 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> 00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
> 
> $ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
> Error: bind failed for 0000:00:04.0 - Cannot bind to driver uio_pci_generic

> $ dmesg
> 
> --> snip <---
> [  816.655575] uio_pci_generic 0000:00:04.0: No IRQ assigned to device: no support for interrupts?
> 
> $ sudo lspci -s 00:04.0 -vvv
> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> 	Physical Slot: 4
> 	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> 	Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
> 	Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
> 	Capabilities: [70] MSI-X: Enable- Count=3 Masked-
> 		Vector table: BAR=3 offset=00000000
> 		PBA: BAR=3 offset=00002000
> 	Kernel modules: ixgbevf
> 
> So, as we can see, the PCI device indeed has no INTx interrupt line
> assigned. It does, however, have an MSI-X capability.
> Looking at the uio_pci_generic code, it seems to require INTx:
> 
> uio_pci_generic.c: line 74: probe():
> 
> 	if (!pdev->irq) {
> 		dev_warn(&pdev->dev, "No IRQ assigned to device: "
> 			 "no support for interrupts?\n");
> 		pci_disable_device(pdev);
> 		return -ENODEV;
> 	}
> 
> Is this a known limitation? Michael, could you please comment on this?
> 
> thanks,
> vlad

This is expected. uio_pci_generic forwards INT#x interrupts from device
to userspace, but VF devices never assert INT#x.

So it doesn't seem to make any sense to bind uio_pci_generic there.

I think that DPDK should be fixed to not require uio_pci_generic
for VF devices (or any devices without INT#x).

If DPDK requires a place-holder driver, the pci-stub driver should
do this adequately. See ./drivers/pci/pci-stub.c
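
Binding to pci-stub is plain sysfs; roughly (as root; 8086:10ed is the
82599 VF, and the device may first need unbinding from its current driver):

# echo "8086 10ed" > /sys/bus/pci/drivers/pci-stub/new_id
# echo 0000:00:04.0 > /sys/bus/pci/drivers/pci-stub/bind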

-- 
MST


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-27  9:43 ` Michael S. Tsirkin
@ 2015-09-27 10:50   ` Vladislav Zolotarov
  2015-09-29 16:41   ` Vlad Zolotarov
  1 sibling, 0 replies; 100+ messages in thread
From: Vladislav Zolotarov @ 2015-09-27 10:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On Sep 27, 2015 12:43 PM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> On Sun, Sep 27, 2015 at 10:05:11AM +0300, Vlad Zolotarov wrote:
> > Hi,
> > I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
> > Amazon EC2 instances with Enhanced Networking enabled.
> > The idea is to create a DPDK environment that doesn't require compiling
> > kernel modules (igb_uio).
> > However, I was surprised to discover that uio_pci_generic refuses to work
> > with an EN device on AWS:
> >
> > $ lspci
> > 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> > 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> > 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
> > 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
> > 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
> > 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> > 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> > 00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
> >
> > $ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
> > Error: bind failed for 0000:00:04.0 - Cannot bind to driver uio_pci_generic
>
> > $ dmesg
> >
> > --> snip <---
> > [  816.655575] uio_pci_generic 0000:00:04.0: No IRQ assigned to device: no support for interrupts?
> >
> > $ sudo lspci -s 00:04.0 -vvv
> > 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
> >       Physical Slot: 4
> >       Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> >       Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >       Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
> >       Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
> >       Capabilities: [70] MSI-X: Enable- Count=3 Masked-
> >               Vector table: BAR=3 offset=00000000
> >               PBA: BAR=3 offset=00002000
> >       Kernel modules: ixgbevf
> >
> > So, as we can see, the PCI device indeed has no INTx interrupt line
> > assigned. It does, however, have an MSI-X capability.
> > Looking at the uio_pci_generic code, it seems to require INTx:
> >
> > uio_pci_generic.c: line 74: probe():
> >
> >       if (!pdev->irq) {
> >               dev_warn(&pdev->dev, "No IRQ assigned to device: "
> >                        "no support for interrupts?\n");
> >               pci_disable_device(pdev);
> >               return -ENODEV;
> >       }
> >
> > Is this a known limitation? Michael, could you please comment on this?
> >
> > thanks,
> > vlad
>
> This is expected. uio_pci_generic forwards INT#x interrupts from device
> to userspace, but VF devices never assert INT#x.
>
> So it doesn't seem to make any sense to bind uio_pci_generic there.
>
> I think that DPDK should be fixed to not require uio_pci_generic
> for VF devices (or any devices without INT#x).
>
> If DPDK requires a place-holder driver, the pci-stub driver should
> do this adequately. See ./drivers/pci/pci-stub.c

Thanks for the clarification, Michael. I'll take a look.

>
> --
> MST


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-27  9:43 ` Michael S. Tsirkin
  2015-09-27 10:50   ` Vladislav Zolotarov
@ 2015-09-29 16:41   ` Vlad Zolotarov
  2015-09-29 20:54     ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-29 16:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/27/15 12:43, Michael S. Tsirkin wrote:
> On Sun, Sep 27, 2015 at 10:05:11AM +0300, Vlad Zolotarov wrote:
>> Hi,
>> I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
>> Amazon EC2 instances with Enhanced Networking enabled.
>> The idea is to create a DPDK environment that doesn't require compiling
>> kernel modules (igb_uio).
>> However, I was surprised to discover that uio_pci_generic refuses to work
>> with an EN device on AWS:
>>
>> $ lspci
>> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
>> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
>> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
>> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
>> 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
>> 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
>> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
>> 00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
>>
>> $ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
>> Error: bind failed for 0000:00:04.0 - Cannot bind to driver uio_pci_generic
>> $ dmesg
>>
>> --> snip <---
>> [  816.655575] uio_pci_generic 0000:00:04.0: No IRQ assigned to device: no support for interrupts?
>>
>> $ sudo lspci -s 00:04.0 -vvv
>> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
>> 	Physical Slot: 4
>> 	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> 	Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
>> 	Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
>> 	Capabilities: [70] MSI-X: Enable- Count=3 Masked-
>> 		Vector table: BAR=3 offset=00000000
>> 		PBA: BAR=3 offset=00002000
>> 	Kernel modules: ixgbevf
>>
>> So, as we can see, the PCI device indeed has no INTx interrupt line
>> assigned. It does, however, have an MSI-X capability.
>> Looking at the uio_pci_generic code, it seems to require INTx:
>>
>> uio_pci_generic.c: line 74: probe():
>>
>> 	if (!pdev->irq) {
>> 		dev_warn(&pdev->dev, "No IRQ assigned to device: "
>> 			 "no support for interrupts?\n");
>> 		pci_disable_device(pdev);
>> 		return -ENODEV;
>> 	}
>>
>> Is this a known limitation? Michael, could you please comment on this?
>>
>> thanks,
>> vlad

Michael, I took a look at the pci-stub driver and at the reason why DPDK 
uses UIO in the first place, and I have some comments below.

> This is expected. uio_pci_generic forwards INT#x interrupts from device
> to userspace, but VF devices never assert INT#x.
>
> So it doesn't seem to make any sense to bind uio_pci_generic there.

Well, it's not completely correct to put it this way. The thing is that 
DPDK (and it could be any other framework/developer) uses uio_pci_generic 
to actually get interrupts from the device, and it makes perfect sense to 
be able to do so with SR-IOV devices too. The problem is, as you've 
described above, that the current implementation of uio_pci_generic 
won't let them do so, and that seems like bogus behavior to me. There 
is no reason why uio_pci_generic couldn't work the same way it does today 
but with MSI-X interrupts; to keep things simple, an initial 
implementation could forward just the first vector.
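
A minimal sketch of what I have in mind, using the standard kernel MSI-X
API (pci_enable_msix_range()/request_irq()); error handling is elided and
the handler name is illustrative:

	/* Forward MSI-X vector 0 to userspace via the UIO event queue. */
	static irqreturn_t uio_msix_handler(int irq, void *dev_id)
	{
		struct uio_pci_generic_dev *gdev = dev_id;

		uio_event_notify(&gdev->info);
		return IRQ_HANDLED;
	}

	/* In probe(), when pdev->irq is 0 but an MSI-X capability exists: */
	struct msix_entry msix = { .entry = 0 };

	if (pci_enable_msix_range(pdev, &msix, 1, 1) == 1)
		err = request_irq(msix.vector, uio_msix_handler, 0,
				  "uio_pci_generic", gdev);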

The security-breach motivation you brought up in the "[RFC PATCH] uio: 
uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak,
since once you let userland access the BAR it may do any funny thing 
using the DMA engine of the device. This kind of thing should be prevented
using the IOMMU, and if the IOMMU is enabled then any funny tricks using 
the MSI/MSI-X configuration will be prevented too.

I'm about to send the patch to the main Linux mailing list. Let's continue 
this discussion there.

>
> I think that DPDK should be fixed to not require uio_pci_generic
> for VF devices (or any devices without INT#x).
>
> If DPDK requires a place-holder driver, the pci-stub driver should
> do this adequately. See ./drivers/pci/pci-stub.c
>


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-29 16:41   ` Vlad Zolotarov
@ 2015-09-29 20:54     ` Michael S. Tsirkin
  2015-09-29 21:46       ` Stephen Hemminger
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-29 20:54 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> The security-breach motivation you brought up in the "[RFC PATCH] uio:
> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak,
> since once you let userland access the BAR it may do any funny thing
> using the DMA engine of the device. This kind of thing should be prevented
> using the IOMMU, and if the IOMMU is enabled then any funny tricks using
> the MSI/MSI-X configuration will be prevented too.
> 
> I'm about to send the patch to the main Linux mailing list. Let's continue
> this discussion there.
> 

Basically UIO shouldn't be used with devices capable of DMA.
Use VFIO for that (yes, this implies an emulated or PV IOMMU).
I don't think this can change.

> >
> >I think that DPDK should be fixed to not require uio_pci_generic
> >for VF devices (or any devices without INT#x).
> >
> >If DPDK requires a place-holder driver, the pci-stub driver should
> >do this adequately. See ./drivers/pci/pci-stub.c
> >


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-29 20:54     ` Michael S. Tsirkin
@ 2015-09-29 21:46       ` Stephen Hemminger
  2015-09-29 21:49         ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Stephen Hemminger @ 2015-09-29 21:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On Tue, 29 Sep 2015 23:54:54 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > The security-breach motivation you brought up in the "[RFC PATCH] uio:
> > uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak,
> > since once you let userland access the BAR it may do any funny thing
> > using the DMA engine of the device. This kind of thing should be prevented
> > using the IOMMU, and if the IOMMU is enabled then any funny tricks using
> > the MSI/MSI-X configuration will be prevented too.
> > 
> > I'm about to send the patch to the main Linux mailing list. Let's continue
> > this discussion there.
> >   
> 
> Basically UIO shouldn't be used with devices capable of DMA.
> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> I don't think this can change.

Given there is no PV IOMMU and even if there was it would be too slow for DPDK
use, I can't accept that. 


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-29 21:46       ` Stephen Hemminger
@ 2015-09-29 21:49         ` Michael S. Tsirkin
  2015-09-30 10:37           ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-29 21:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> On Tue, 29 Sep 2015 23:54:54 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > The security-breach motivation you brought up in the "[RFC PATCH] uio:
> > > uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak,
> > > since once you let userland access the BAR it may do any funny thing
> > > using the DMA engine of the device. This kind of thing should be prevented
> > > using the IOMMU, and if the IOMMU is enabled then any funny tricks using
> > > the MSI/MSI-X configuration will be prevented too.
> > > 
> > > I'm about to send the patch to the main Linux mailing list. Let's continue
> > > this discussion there.
> > >   
> > 
> > Basically UIO shouldn't be used with devices capable of DMA.
> > Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > I don't think this can change.
> 
> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> use, I can't accept that. 

QEMU does allow emulating an iommu.  DPDK uses static mappings, so I
doubt its speed matters at all.

Anyway, DPDK is doing polling all the time. I don't see why it insists
on using interrupts to detect link-up events. Just poll for that too.
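
Something like this from the main loop would do (a sketch against the
DPDK API; rte_eth_link_get_nowait() does not block):

	/* Poll link status instead of taking an interrupt. */
	struct rte_eth_link link;

	rte_eth_link_get_nowait(port_id, &link);
	if (link.link_status != old_status) {
		handle_link_change(port_id, &link); /* illustrative name */
		old_status = link.link_status;
	}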

-- 
MST


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-29 21:49         ` Michael S. Tsirkin
@ 2015-09-30 10:37           ` Vlad Zolotarov
  2015-09-30 10:58             ` Michael S. Tsirkin
  2015-09-30 17:28             ` Stephen Hemminger
  0 siblings, 2 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 10:37 UTC (permalink / raw)
  To: Michael S. Tsirkin, Stephen Hemminger; +Cc: dev



On 09/30/15 00:49, Michael S. Tsirkin wrote:
> On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
>> On Tue, 29 Sep 2015 23:54:54 +0300
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
>>>> The security-breach motivation you brought up in the "[RFC PATCH] uio:
>>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak,
>>>> since once you let userland access the BAR it may do any funny thing
>>>> using the DMA engine of the device. This kind of thing should be prevented
>>>> using the IOMMU, and if the IOMMU is enabled then any funny tricks using
>>>> the MSI/MSI-X configuration will be prevented too.
>>>>
>>>> I'm about to send the patch to the main Linux mailing list. Let's continue
>>>> this discussion there.
>>>>    
>>> Basically UIO shouldn't be used with devices capable of DMA.
>>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).

If there is an IOMMU in the picture, there shouldn't be any problem using 
UIO with DMA-capable devices.

>>> I don't think this can change.
>> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
>> use, I can't accept that.
> QEMU does allow emulating an iommu.

Amazon's EC2 Xen HV doesn't, at least today, so VFIO is not an option 
there. And again, this is a general issue, not a DPDK-specific one.
Today one has to develop proprietary modules (like igb_uio) to work 
around the issue, and this is lame. IMHO uio_pci_generic should be fixed 
to work properly in any virtualized environment, not only with KVM.



>   DPDK uses static mappings, so I
> doubt its speed matters at all.
>
> Anyway, DPDK is doing polling all the time. I don't see why it insists
> on using interrupts to detect link-up events. Just poll for that too.
>


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 10:37           ` Vlad Zolotarov
@ 2015-09-30 10:58             ` Michael S. Tsirkin
  2015-09-30 11:26               ` Vlad Zolotarov
  2015-09-30 17:28             ` Stephen Hemminger
  1 sibling, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 10:58 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 01:37:22PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 00:49, Michael S. Tsirkin wrote:
> >On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> >>On Tue, 29 Sep 2015 23:54:54 +0300
> >>"Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>
> >>>On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> >>>>The security-breach motivation you brought up in the "[RFC PATCH] uio:
> >>>>uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak,
> >>>>since once you let userland access the BAR it may do any funny thing
> >>>>using the DMA engine of the device. This kind of thing should be prevented
> >>>>using the IOMMU, and if the IOMMU is enabled then any funny tricks using
> >>>>the MSI/MSI-X configuration will be prevented too.
> >>>>
> >>>>I'm about to send the patch to the main Linux mailing list. Let's continue
> >>>>this discussion there.
> >>>Basically UIO shouldn't be used with devices capable of DMA.
> >>>Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> 
> If there is an IOMMU in the picture, there shouldn't be any problem using
> UIO with DMA-capable devices.

UIO doesn't enforce the IOMMU though. That's why it's not a good fit.

> >>>I don't think this can change.
> >>Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> >>use, I can't accept that.
> >QEMU does allow emulating an iommu.
> 
> Amazon's EC2 Xen HV doesn't, at least today, so VFIO is not an option
> there.

Not only that, a bunch of boxes have their IOMMU disabled.
So for such systems, you can't have userspace poking at
device registers. You need a kernel driver to validate
userspace requests before passing them on to devices.

> And again, this is a general issue, not a DPDK-specific one.
> Today one has to develop proprietary modules (like igb_uio) to work
> around the issue, and this is lame.

Of course it is lame. So don't bypass the kernel then, use the upstream drivers.

> IMHO uio_pci_generic should be fixed to work properly in any virtualized
> environment, not only with KVM.

The motivation for UIO is pretty clear:

        For many types of devices, creating a Linux kernel driver is
        overkill.  All that is really needed is some way to handle an
        interrupt and provide access to the memory space of the
        device.  The logic of controlling the device does not
        necessarily have to be within the kernel, as the device does
        not need to take advantage of any of other resources that the
        kernel provides.  One such common class of devices that are
        like this are for industrial I/O cards.

Devices doing DMA do need to take advantage of memory protection
that the kernel provides.

> 
> >  DPDK uses static mappings, so I
> >doubt its speed matters at all.
> >
> >Anyway, DPDK is doing polling all the time. I don't see why it insists
> >on using interrupts to detect link-up events. Just poll for that too.
> >


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 10:58             ` Michael S. Tsirkin
@ 2015-09-30 11:26               ` Vlad Zolotarov
       [not found]                 ` <20150930143927-mutt-send-email-mst@redhat.com>
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 11:26 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/30/15 13:58, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 01:37:22PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 00:49, Michael S. Tsirkin wrote:
>>> On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
>>>> On Tue, 29 Sep 2015 23:54:54 +0300
>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>
>>>>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
>>>>>> The security-breach motivation you brought up in the "[RFC PATCH] uio:
>>>>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak,
>>>>>> since once you let userland access the BAR it may do any funny thing
>>>>>> using the DMA engine of the device. This kind of thing should be prevented
>>>>>> using the IOMMU, and if the IOMMU is enabled then any funny tricks using
>>>>>> the MSI/MSI-X configuration will be prevented too.
>>>>>>
>>>>>> I'm about to send the patch to the main Linux mailing list. Let's continue
>>>>>> this discussion there.
>>>>> Basically UIO shouldn't be used with devices capable of DMA.
>>>>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
>> If there is an IOMMU in the picture, there shouldn't be any problem using
>> UIO with DMA-capable devices.
> UIO doesn't enforce the IOMMU though. That's why it's not a good fit.

Having said all that - does UIO refuse to work with DMA-capable devices 
today? Either I have missed that logic or it's not there.
So everything you are so worried about may already be possible today. 
That's why I don't understand why adding support for MSI/MSI-X interrupts
would change anything here. You are right that UIO *today* has a security 
hole; however, it should be addressed separately, and the same solution
that covers the security breach in the current code will cover the 
"MSI/MSI-X security vulnerability", because they are actually exactly the 
same issue.

>
>>>>> I don't think this can change.
>>>> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
>>>> use, I can't accept that.
>>> QEMU does allow emulating an iommu.
>> Amazon's EC2 Xen HV doesn't, at least today, so VFIO is not an option
>> there.
> Not only that, a bunch of boxes have their IOMMU disabled.
> So for such systems, you can't have userspace poking at
> device registers. You need a kernel driver to validate
> userspace requests before passing them on to devices.

I think you are describing HV functionality here. ;) And yes, you are 
absolutely right, the HV has to control the non-privileged userland.
For HV/non-virtualized boxes a possible solution could be to allow UIO 
only for some privileged group of processes.
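
Concretely, something along these lines on the UIO device node (a sketch; 
the group name is illustrative):

# chgrp dpdk /dev/uio0
# chmod 660 /dev/uio0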

>
>> And again, this is a general issue, not a DPDK-specific one.
>> Today one has to develop proprietary modules (like igb_uio) to work
>> around the issue, and this is lame.
> Of course it is lame. So don't bypass the kernel then, use the upstream drivers.

This would impose a heavy performance penalty. The whole idea is to 
bypass kernel. Especially for networking...

>
>> IMHO uio_pci_generic should be fixed to work properly in any virtualized
>> environment, not only with KVM.
> The motivation for UIO is pretty clear:
>
>          For many types of devices, creating a Linux kernel driver is
>          overkill.  All that is really needed is some way to handle an
>          interrupt and provide access to the memory space of the
>          device.  The logic of controlling the device does not
>          necessarily have to be within the kernel, as the device does
>          not need to take advantage of any of other resources that the
>          kernel provides.  One such common class of devices that are
>          like this are for industrial I/O cards.
>
> Devices doing DMA do need to take advantage of memory protection
> that the kernel provides.
Well, yeah - but who said I have to be forbidden to work with the device 
if MSI-X interrupts are my only option?

The kernel may provide protection by checking process permissions and 
denying UIO access to non-privileged processes.
I'm not sure that's the case today, and if it's not then, as 
mentioned above, it should be fixed ASAP, exactly for the reasons you 
bring up here. And once that's done there shouldn't be any limitation on 
allowing MSI or MSI-X interrupts along with INT#x.

>
>>>   DPDK uses static mappings, so I
>>> doubt its speed matters at all.
>>>
>>> Anyway, DPDK is doing polling all the time. I don't see why it insists
>>> on using interrupts to detect link-up events. Just poll for that too.
>>>


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
       [not found]                 ` <20150930143927-mutt-send-email-mst@redhat.com>
@ 2015-09-30 11:53                   ` Vlad Zolotarov
  2015-09-30 12:03                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 11:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/30/15 14:41, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>> The whole idea is to bypass kernel. Especially for networking...
> ... on dumb hardware that doesn't support doing that securely.

On a very capable HW that supports whatever security requirements needed 
(e.g. 82599 Intel's SR-IOV VF devices).

> Colour me unimpressed.
>


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 11:53                   ` Vlad Zolotarov
@ 2015-09-30 12:03                     ` Michael S. Tsirkin
  2015-09-30 12:16                       ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 12:03 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>The whole idea is to bypass kernel. Especially for networking...
> >... on dumb hardware that doesn't support doing that securely.
> 
> On a very capable HW that supports whatever security requirements needed
> (e.g. 82599 Intel's SR-IOV VF devices).

Network card type is irrelevant as long as you do not have an IOMMU,
otherwise you would just use e.g. VFIO.

> >Colour me unimpressed.
> >


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 12:03                     ` Michael S. Tsirkin
@ 2015-09-30 12:16                       ` Vlad Zolotarov
  2015-09-30 12:27                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 12:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/30/15 15:03, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 14:41, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>>>> The whole idea is to bypass kernel. Especially for networking...
>>> ... on dumb hardware that doesn't support doing that securely.
>> On a very capable HW that supports whatever security requirements needed
>> (e.g. 82599 Intel's SR-IOV VF devices).
> Network card type is irrelevant as long as you do not have an IOMMU,
> otherwise you would just use e.g. VFIO.

Sorry, but I don't follow your logic here - the Amazon EC2 environment is 
an example where there *is* an IOMMU, but it is not virtualized, and thus 
VFIO is useless; there is an option to use a directly assigned SR-IOV 
networking device there, where using the kernel drivers imposes a 
performance penalty compared to UIO-based userspace kernel-bypass usage. 
How is it irrelevant? Could you please clarify your point?

>
>>> Colour me unimpressed.
>>>


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 12:16                       ` Vlad Zolotarov
@ 2015-09-30 12:27                         ` Michael S. Tsirkin
  2015-09-30 12:50                           ` Vlad Zolotarov
  2015-09-30 13:05                           ` Avi Kivity
  0 siblings, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 12:27 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> >>
> >>On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>>>The whole idea is to bypass kernel. Especially for networking...
> >>>... on dumb hardware that doesn't support doing that securely.
> >>On a very capable HW that supports whatever security requirements needed
> >>(e.g. 82599 Intel's SR-IOV VF devices).
> >Network card type is irrelevant as long as you do not have an IOMMU,
> >otherwise you would just use e.g. VFIO.
> 
> Sorry, but I don't follow your logic here - the Amazon EC2 environment is
> an example where there *is* an IOMMU, but it is not virtualized, and thus
> VFIO is useless; there is an option to use a directly assigned SR-IOV
> networking device there, where using the kernel drivers imposes a
> performance penalty compared to UIO-based userspace kernel-bypass usage.
> How is it irrelevant? Could you please clarify your point?
> 

So it's not even dumb hardware, it's another piece of software
that forces an "all or nothing" approach where either
device has access to all VM memory, or none.
And this, unfortunately, leaves you with no secure way to
allow userspace drivers.

So it makes even less sense to add insecure work-arounds in the kernel.
It seems quite likely that by the time the new kernel reaches
production X years from now, EC2 will have a virtual iommu.


> >
> >>>Colour me unimpressed.
> >>>


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 12:27                         ` Michael S. Tsirkin
@ 2015-09-30 12:50                           ` Vlad Zolotarov
  2015-09-30 15:26                             ` Michael S. Tsirkin
  2015-09-30 13:05                           ` Avi Kivity
  1 sibling, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 12:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/30/15 15:27, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 15:03, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>>>> On 09/30/15 14:41, Michael S. Tsirkin wrote:
>>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>>>>>>   The whole idea is to bypass kernel. Especially for networking...
>>>>> ... on dumb hardware that doesn't support doing that securely.
>>>> On a very capable HW that supports whatever security requirements needed
>>>> (e.g. 82599 Intel's SR-IOV VF devices).
>>> Network card type is irrelevant as long as you do not have an IOMMU,
>>> otherwise you would just use e.g. VFIO.
>> Sorry, but I don't follow your logic here - the Amazon EC2 environment is
>> an example where there *is* an IOMMU, but it is not virtualized, and thus
>> VFIO is useless; there is an option to use a directly assigned SR-IOV
>> networking device there, where using the kernel drivers imposes a
>> performance penalty compared to UIO-based userspace kernel-bypass usage.
>> How is it irrelevant? Could you please clarify your point?
>>
> So it's not even dumb hardware, it's another piece of software
> that forces an "all or nothing" approach where either
> device has access to all VM memory, or none.
> And this, unfortunately, leaves you with no secure way to
> allow userspace drivers.
UIO is not secure even today, so what are we arguing about? ;)
Adding MSI/MSI-X support won't change this state, so please discard the 
security argument unless you think that UIO is a completely secure piece 
of software today. In the latter case, could you please clarify what 
would prevent a userspace program from configuring a DMA controller via 
its registers and doing whatever it wants?


How does not virtualizing the IOMMU force an "all or nothing" approach? 
What is insecure in relying on the HV to control the IOMMU and not 
giving the VF any access to it?
As far as I see it, there isn't any security problem here at all. The 
only problem I see is that the dumb current uio_pci_generic 
implementation forces people to go and invent workarounds instead of 
having proper MSI/MSI-X support implemented. And as I've mentioned 
above, it has nothing to do with security, because there is no such thing 
as security on the UIO driver level - it has to be ensured by some other 
entity, like the HV.

>
> So it makes even less sense to add insecure work-arounds in the kernel.
> It seems quite likely that by the time the new kernel reaches
> production X years from now, EC2 will have a virtual iommu.

I'd bet that the new kernel would reach production long before Amazon 
does that... ;)

>
>
>>>>> Colour me unimpressed.
>>>>>


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 12:27                         ` Michael S. Tsirkin
  2015-09-30 12:50                           ` Vlad Zolotarov
@ 2015-09-30 13:05                           ` Avi Kivity
  2015-09-30 14:39                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-09-30 13:05 UTC (permalink / raw)
  To: Michael S. Tsirkin, Vlad Zolotarov; +Cc: dev



On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 15:03, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>>>> On 09/30/15 14:41, Michael S. Tsirkin wrote:
>>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>>>>>> The whole idea is to bypass kernel. Especially for networking...
>>>>> ... on dumb hardware that doesn't support doing that securely.
>>>> On a very capable HW that supports whatever security requirements needed
>>>> (e.g. 82599 Intel's SR-IOV VF devices).
>>> Network card type is irrelevant as long as you do not have an IOMMU,
>>> otherwise you would just use e.g. VFIO.
>> Sorry, but I don't follow your logic here - the Amazon EC2 environment is
>> an example where there *is* an IOMMU, but it is not virtualized, and thus
>> VFIO is useless; there is an option to use a directly assigned SR-IOV
>> networking device there, where using the kernel drivers imposes a
>> performance penalty compared to UIO-based userspace kernel-bypass usage.
>> How is it irrelevant? Could you please clarify your point?
>>
> So it's not even dumb hardware, it's another piece of software
> that forces an "all or nothing" approach where either
> device has access to all VM memory, or none.
> And this, unfortunately, leaves you with no secure way to
> allow userspace drivers.

Some setups don't need security (they are single-user, single 
application), but they do need a lot of performance (like 5X-10X 
performance).  An example is OpenVSwitch: security doesn't help it at 
all, and if you force it to use the kernel drivers you cripple it.

Also, I'm root.  I can do anything I like, including loading a patched 
uio_pci_generic.  You're not providing _any_ security, you're simply 
making life harder for users.

> So it makes even less sense to add insecure work-arounds in the kernel.
> It seems quite likely that by the time the new kernel reaches
> production X years from now, EC2 will have a virtual iommu.

I can adopt a new kernel tomorrow.  I have no influence on EC2.


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 13:05                           ` Avi Kivity
@ 2015-09-30 14:39                             ` Michael S. Tsirkin
  2015-09-30 14:53                               ` Avi Kivity
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 14:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
> 
> 
> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
> >>
> >>On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> >>>>On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >>>>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>>>>>The whole idea is to bypass kernel. Especially for networking...
> >>>>>... on dumb hardware that doesn't support doing that securely.
> >>>>On a very capable HW that supports whatever security requirements needed
> >>>>(e.g. 82599 Intel's SR-IOV VF devices).
> >>>Network card type is irrelevant as long as you do not have an IOMMU,
> >>>otherwise you would just use e.g. VFIO.
> >>Sorry, but I don't follow your logic here - the Amazon EC2 environment is
> >>an example where there *is* an IOMMU, but it is not virtualized, and thus
> >>VFIO is useless; there is an option to use a directly assigned SR-IOV
> >>networking device there, where using the kernel drivers imposes a
> >>performance penalty compared to UIO-based userspace kernel-bypass usage.
> >>How is it irrelevant? Could you please clarify your point?
> >>
> >So it's not even dumb hardware, it's another piece of software
> >that forces an "all or nothing" approach where either
> >device has access to all VM memory, or none.
> >And this, unfortunately, leaves you with no secure way to
> >allow userspace drivers.
> 
> Some setups don't need security (they are single-user, single application),
> but they do need a lot of performance (like 5X-10X performance).  An example is
> OpenVSwitch: security doesn't help it at all, and if you force it to use the
> kernel drivers you cripple it.

We'd have to see there are actual users that need this.  So far, dpdk
seems like the only one, and it wants to use UIO for slow path stuff
like polling link status.  Why this needs kernel bypass support, I don't
know.  I asked, and got no answer.

> 
> Also, I'm root.  I can do anything I like, including loading a patched
> uio_pci_generic.  You're not providing _any_ security, you're simply making
> life harder for users.

Maybe that's true on your system. But I guess you know that's not true
for everyone, not in 2015.

> >So it makes even less sense to add insecure work-arounds in the kernel.
> >It seems quite likely that by the time the new kernel reaches
> >production X years from now, EC2 will have a virtual iommu.
> 
> I can adopt a new kernel tomorrow.  I have no influence on EC2.
> 
>

Xen grant tables sound like they could be the right interface
for EC2.  google search for "grant tables iommu" immediately gives me:
http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html
Maybe latest Xen is already doing the right thing, and it's just the
question of making VFIO use that.

-- 
MST


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 14:39                             ` Michael S. Tsirkin
@ 2015-09-30 14:53                               ` Avi Kivity
  2015-09-30 15:21                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-09-30 14:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
>>
>> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
>>>> On 09/30/15 15:03, Michael S. Tsirkin wrote:
>>>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>>>>>> On 09/30/15 14:41, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>>>>>>>> The whole idea is to bypass kernel. Especially for networking...
>>>>>>> ... on dumb hardware that doesn't support doing that securely.
>>>>>> On a very capable HW that supports whatever security requirements needed
>>>>>> (e.g. 82599 Intel's SR-IOV VF devices).
>>>>> Network card type is irrelevant as long as you do not have an IOMMU,
>>>>> otherwise you would just use e.g. VFIO.
>>>> Sorry, but I don't follow your logic here - the Amazon EC2 environment is
>>>> an example where there *is* an IOMMU, but it is not virtualized, and thus
>>>> VFIO is useless; there is an option to use a directly assigned SR-IOV
>>>> networking device there, where using the kernel drivers imposes a
>>>> performance penalty compared to UIO-based userspace kernel-bypass usage.
>>>> How is it irrelevant? Could you please clarify your point?
>>>>
>>> So it's not even dumb hardware, it's another piece of software
>>> that forces an "all or nothing" approach where either
>>> device has access to all VM memory, or none.
>>> And this, unfortunately, leaves you with no secure way to
>>> allow userspace drivers.
>> Some setups don't need security (they are single-user, single application),
>> but they do need a lot of performance (like 5X-10X performance).  An example is
>> OpenVSwitch: security doesn't help it at all, and if you force it to use the
>> kernel drivers you cripple it.
> We'd have to see there are actual users that need this.  So far, dpdk
> seems like the only one,

dpdk is a whole class of users.  It's not a specific application.

>   and it wants to use UIO for slow path stuff
> like polling link status.  Why this needs kernel bypass support, I don't
> know.  I asked, and got no answer.

First, it's more than link status.  dpdk also has an interrupt mode, 
which applications can fall back to when the load is light in order 
to save power (and in order not to get support calls about 100% cpu when 
idle).

Even for link status, you don't want to poll for that, because accessing 
device registers is expensive.  An interrupt is the best approach for 
rare events like link changed.
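
For reference, this is what DPDK's link-status-change interrupt looks like
from the application side (a sketch; the callback body is illustrative):

	/* Register for link state change (LSC) interrupts. */
	static void lsc_cb(uint8_t port_id, enum rte_eth_event_type event,
			   void *arg)
	{
		struct rte_eth_link link;

		rte_eth_link_get_nowait(port_id, &link);
		printf("port %u link %s\n", (unsigned)port_id,
		       link.link_status ? "up" : "down");
	}

	port_conf.intr_conf.lsc = 1; /* before rte_eth_dev_configure() */
	rte_eth_dev_callback_register(port_id, RTE_ETH_EVENT_INTR_LSC,
				      lsc_cb, NULL);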

>
>> Also, I'm root.  I can do anything I like, including loading a patched
>> uio_pci_generic.  You're not providing _any_ security, you're simply making
>> life harder for users.
> Maybe that's true on your system. But I guess you know that's not true
> for everyone, not in 2015.

Why is it not true?  If I'm root, I can do anything I like to my system, 
and everyone is root in 2015.  I can access the BARs directly and 
program DMA, so how am I more secure by uio not allowing me to set up msix?

Non-root users are already secured by their inability to load the 
module, and by the device permissions.

>
>>> So it makes even less sense to add insecure work-arounds in the kernel.
>>> It seems quite likely that by the time the new kernel reaches
>>> production X years from now, EC2 will have a virtual iommu.
>> I can adopt a new kernel tomorrow.  I have no influence on EC2.
>>
>>
> Xen grant tables sound like they could be the right interface
> for EC2.  google search for "grant tables iommu" immediately gives me:
> http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html
> Maybe latest Xen is already doing the right thing, and it's just the
> question of making VFIO use that.
>

grant tables only work for virtual devices, not physical devices.


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 14:53                               ` Avi Kivity
@ 2015-09-30 15:21                                 ` Michael S. Tsirkin
  2015-09-30 15:36                                   ` Avi Kivity
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 15:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Wed, Sep 30, 2015 at 05:53:54PM +0300, Avi Kivity wrote:
> On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
> >>
> >>On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
> >>>>On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >>>>>On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> >>>>>>On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >>>>>>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>>>>>>>The whole idea is to bypass kernel. Especially for networking...
> >>>>>>>... on dumb hardware that doesn't support doing that securely.
> >>>>>>On a very capable HW that supports whatever security requirements needed
> >>>>>>(e.g. 82599 Intel's SR-IOV VF devices).
> >>>>>Network card type is irrelevant as long as you do not have an IOMMU,
> >>>>>otherwise you would just use e.g. VFIO.
> >>>>Sorry, but I don't follow your logic here - the Amazon EC2 environment is
> >>>>an example where there *is* an IOMMU, but it is not virtualized, and thus
> >>>>VFIO is useless; there is an option to use a directly assigned SR-IOV
> >>>>networking device there, where using the kernel drivers imposes a
> >>>>performance penalty compared to UIO-based userspace kernel-bypass usage.
> >>>>How is it irrelevant? Could you please clarify your point?
> >>>>
> >>>So it's not even dumb hardware, it's another piece of software
> >>>that forces an "all or nothing" approach where either
> >>>device has access to all VM memory, or none.
> >>>And this, unfortunately, leaves you with no secure way to
> >>>allow userspace drivers.
> >>Some setups don't need security (they are single-user, single application),
> >>but they do need a lot of performance (like 5X-10X performance).  An example is
> >>OpenVSwitch: security doesn't help it at all, and if you force it to use the
> >>kernel drivers you cripple it.
> >We'd have to see there are actual users that need this.  So far, dpdk
> >seems like the only one,
> 
> dpdk is a whole class of users.  It's not a specific application.
> 
> >  and it wants to use UIO for slow path stuff
> >like polling link status.  Why this needs kernel bypass support, I don't
> >know.  I asked, and got no answer.
> 
> First, it's more than link status.  dpdk also has an interrupt mode, which
> applications can fall back to when the load is light in order to save
> power (and in order not to get support calls about 100% cpu when idle).

Aha, looks like it appeared in June. Interesting, thanks for the info.

> Even for link status, you don't want to poll for that, because accessing
> device registers is expensive.  An interrupt is the best approach for rare
> events like link changed.

Yea, but you probably can get by with a timer for that, even if it's ugly.

> >>Also, I'm root.  I can do anything I like, including loading a patched
> >>uio_pci_generic.  You're not providing _any_ security, you're simply making
> >>life harder for users.
> >Maybe that's true on your system. But I guess you know that's not true
> >for everyone, not in 2015.
> 
> Why is it not true?  If I'm root, I can do anything I like to my
> system, and everyone is root in 2015.  I can access the BARs directly
> and program DMA, so how am I more secure by uio not allowing me to set up
> msix?

That's not the point.  The point always was that using uio for these
devices (capable of DMA, in particular of msix) isn't possible in a
secure way. And yes, if the same device happens to also do interrupts, UIO
does not reject it as it probably should, and we can't change this
without breaking some working setups.  But this doesn't mean we should
add more setups like this that we'll then be forced to maintain.


> Non-root users are already secured by their inability to load the module,
> and by the device permissions.
> 
> >
> >>>So it makes even less sense to add insecure work-arounds in the kernel.
> >>>It seems quite likely that by the time the new kernel reaches
> >>>production X years from now, EC2 will have a virtual iommu.
> >>I can adopt a new kernel tomorrow.  I have no influence on EC2.
> >>
> >>
> >Xen grant tables sound like they could be the right interface
> >for EC2.  google search for "grant tables iommu" immediately gives me:
> >http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html
> >Maybe latest Xen is already doing the right thing, and it's just the
> >question of making VFIO use that.
> >
> 
> grant tables only work for virtual devices, not physical devices.

Why not? That's what the patches above seem to do.

-- 
MST


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 12:50                           ` Vlad Zolotarov
@ 2015-09-30 15:26                             ` Michael S. Tsirkin
  2015-09-30 18:15                               ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 15:26 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
> How does not virtualizing the IOMMU force an "all or nothing" approach?

Looks like you can't limit an assigned device to only access part of
guest memory that belongs to a given process.  Either let it access all
of guest memory ("all") or don't assign the device ("nothing").

-- 
MST


* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 15:21                                 ` Michael S. Tsirkin
@ 2015-09-30 15:36                                   ` Avi Kivity
  2015-09-30 20:40                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-09-30 15:36 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/2015 06:21 PM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 05:53:54PM +0300, Avi Kivity wrote:
>> On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
>>>> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
>>>>>> On 09/30/15 15:03, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>>>>>>>> On 09/30/15 14:41, Michael S. Tsirkin wrote:
>>>>>>>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>>>>>>>>>> The whole idea is to bypass kernel. Especially for networking...
>>>>>>>>> ... on dumb hardware that doesn't support doing that securely.
>>>>>>>> On a very capable HW that supports whatever security requirements needed
>>>>>>>> (e.g. 82599 Intel's SR-IOV VF devices).
>>>>>>> Network card type is irrelevant as long as you do not have an IOMMU,
>>>>>>> otherwise you would just use e.g. VFIO.
>>>>>> Sorry, but I don't follow your logic here - the Amazon EC2 environment is
>>>>>> an example where there *is* an iommu but it's not virtualized,
>>>>>> and thus VFIO is
>>>>>> useless, and there is an option to use a directly assigned SR-IOV networking
>>>>>> device there, where using the kernel drivers imposes a performance impact
>>>>>> compared to a user space UIO-based kernel bypass mode of usage. How
>>>>>> is it irrelevant? Could u, pls, clarify your point?
>>>>>>
>>>>> So it's not even dumb hardware, it's another piece of software
>>>>> that forces an "all or nothing" approach where either
>>>>> device has access to all VM memory, or none.
>>>>> And this, unfortunately, leaves you with no secure way to
>>>>> allow userspace drivers.
>>>> Some setups don't need security (they are single-user, single application).
>>>> But they do need a lot of performance (like 5X-10X performance).  An example
>>>> is Open vSwitch: security doesn't help it at all, and if you force it to use
>>>> the kernel drivers you cripple it.
>>> We'd have to see there are actual users that need this.  So far, dpdk
>>> seems like the only one,
>> dpdk is a whole class of users.  It's not a specific application.
>>
>>>   and it wants to use UIO for slow path stuff
>>> like polling link status.  Why this needs kernel bypass support, I don't
>>> know.  I asked, and got no answer.
>> First, it's more than link status.  dpdk also has an interrupt mode, which
>> applications can fall back to when the load is light in order to save
>> power (and in order not to get support calls about 100% cpu when idle).
> Aha, looks like it appeared in June. Interesting, thanks for the info.
>
>> Even for link status, you don't want to poll for that, because accessing
>> device registers is expensive.  An interrupt is the best approach for rare
>> events like link changed.
> Yea, but you probably can get by with a timer for that, even if it's ugly.

Maybe you can, but (a) why increase link status change detection latency, 
and (b) link status change detection is not the only user of the feature 
since June.
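
For reference, the interrupt-mode pattern in question looks roughly like
this (a sketch modeled loosely on dpdk's l3fwd-power example; port/queue
numbers and burst size are illustrative, error handling omitted):

	#include <rte_ethdev.h>
	#include <rte_interrupts.h>

	static void rx_loop(uint8_t port, uint16_t queue)
	{
		struct rte_mbuf *bufs[32];
		struct rte_epoll_event ev;

		/* Register this queue's interrupt with a per-thread epoll fd. */
		rte_eth_dev_rx_intr_ctl_q(port, queue, RTE_EPOLL_PER_THREAD,
					  RTE_INTR_EVENT_ADD, NULL);
		for (;;) {
			uint16_t n = rte_eth_rx_burst(port, queue, bufs, 32);
			if (n > 0) {
				/* ... busy: process and free the packets,
				 * stay in poll mode ... */
				continue;
			}
			/* Idle: arm the interrupt and sleep until traffic. */
			rte_eth_dev_rx_intr_enable(port, queue);
			rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, -1);
			rte_eth_dev_rx_intr_disable(port, queue);
		}
	}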

>>>> Also, I'm root.  I can do anything I like, including loading a patched
> >>>> uio_pci_generic.  You're not providing _any_ security, you're simply making
>>>> life harder for users.
>>> Maybe that's true on your system. But I guess you know that's not true
>>> for everyone, not in 2015.
>> Why is it not true?  if I'm root, I can do anything I like to my
>> system, and everyone is root in 2015.  I can access the BARs directly
>> and program DMA, how am I more secure by uio not allowing me to setup
>> msix?
> That's not the point.  The point always was that using UIO for these
> devices (capable of DMA, in particular of MSI-X) isn't possible in a
> secure way.

uio is used today for DMA-capable devices.  Some users are perfectly 
willing to give up security for functionality (that's all users who have 
root access to their machines, not just uio users).  You aren't adding 
any security by disallowing uio, you're just removing functionality.

As it happens, you're removing the functionality from the users who have 
no other option.  They can't use vfio because it doesn't work on 
virtualized setups.

(note even on a setup that does support vfio, high performance users 
will want to avoid it).

>   And yes, if the same device happens to also do interrupts, UIO
> does not reject it as it probably should, and we can't change this
> without breaking some working setups.  But this doesn't mean we should
> add more setups like this that we'll then be forced to maintain.

uio_pci_generic is maybe the driver with the lowest maintenance burden 
in the entire kernel.  One driver supporting all pci devices, if you 
don't need msi/msix.  And with the patch, it will be one driver 
supporting all pci devices.

I don't really understand the tradeoff.  By rejecting the patch you're 
denying users the ability to use their devices, except through the much 
slower kernel drivers.  The patch would not allow a non-root user to do 
ANYTHING.  Root can already do anything.  So what security issue is there?

>
>
>> Non-root users are already secured by their inability to load the module,
>> and by the device permissions.
>>
>>>>> So it makes even less sense to add insecure work-arounds in the kernel.
>>>>> It seems quite likely that by the time the new kernel reaches
>>>>> production X years from now, EC2 will have a virtual iommu.
>>>> I can adopt a new kernel tomorrow.  I have no influence on EC2.
>>>>
>>>>
>>> Xen grant tables sound like they could be the right interface
>>> for EC2.  google search for "grant tables iommu" immediately gives me:
>>> http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html
>>> Maybe latest Xen is already doing the right thing, and it's just the
>>> question of making VFIO use that.
>>>
>> grant tables only work for virtual devices, not physical devices.
> Why not? That's what the patches above seem to do.
>

Oh, I think those are for emulating transient iommu maps (a new map for 
every request) on top of a real iommu.  The dpdk use case is permanently 
mapping a large chunk of guest userspace; I don't think Xen exposes 
enough grant table entries for that.

In addition, that leaves users of kvm, vmware, older Xen, or bare metal 
machines without iommus out in the cold; and bare metal users that want 
the iommu off for performance are forced to use it.  And for what, to 
prevent root from touching memory via dma that they can access in a 
million other ways?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 10:37           ` Vlad Zolotarov
  2015-09-30 10:58             ` Michael S. Tsirkin
@ 2015-09-30 17:28             ` Stephen Hemminger
  2015-09-30 17:39               ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Stephen Hemminger @ 2015-09-30 17:28 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev, Michael S. Tsirkin

On Wed, 30 Sep 2015 13:37:22 +0300
Vlad Zolotarov <vladz@cloudius-systems.com> wrote:

> 
> 
> On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> >> On Tue, 29 Sep 2015 23:54:54 +0300
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>
> >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> >>>> The security breach motivation u brought in "[RFC PATCH] uio:
> >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> >>>> since once u let the userland access the bar it may do any funny thing
> >>>> using the DMA engine of the device. This kind of stuff should be prevented
> >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> >>>> configuration will be prevented too.
> >>>>
> >>>> I'm about to send the patch to the main Linux mailing list. Let's continue this
> >>>> discussion there.
> >>>>    
> >>> Basically UIO shouldn't be used with devices capable of DMA.
> >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> 
> If there is an IOMMU in the picture there shouldn't be any problem 
> using UIO with DMA-capable devices.
> 
> >>> I don't think this can change.
> >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> >> use, I can't accept that.
> > QEMU does allow emulating an iommu.
> 
> Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> option there. And again, it's a general issue, not DPDK-specific.
> Today one has to develop some proprietary modules (like igb_uio) to 
> work around the issue, and this is lame. IMHO uio_pci_generic should
> be fixed to be able to properly work within any virtualized environment 
> and not only with KVM.
> 

Also VMware (a bigger problem) has no IOMMU emulation.
Other environments (Windriver, GCE) have no IOMMU either.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 17:28             ` Stephen Hemminger
@ 2015-09-30 17:39               ` Michael S. Tsirkin
  2015-09-30 17:43                 ` Stephen Hemminger
  2015-09-30 17:44                 ` Gleb Natapov
  0 siblings, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 17:39 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 13:37:22 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
> 
> > 
> > 
> > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >>
> > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > >>>> The security breach motivation u brought in "[RFC PATCH] uio:
> > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > >>>> since once u let the userland access the bar it may do any funny thing
> > >>>> using the DMA engine of the device. This kind of stuff should be prevented
> > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > >>>> configuration will be prevented too.
> > >>>>
> > >>>> I'm about to send the patch to the main Linux mailing list. Let's continue this
> > >>>> discussion there.
> > >>>>    
> > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > 
> > If there is an IOMMU in the picture there shouldn't be any problem 
> > using UIO with DMA-capable devices.
> > 
> > >>> I don't think this can change.
> > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> > >> use, I can't accept that.
> > > QEMU does allow emulating an iommu.
> > 
> > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > option there. And again, it's a general issue, not DPDK-specific.
> > Today one has to develop some proprietary modules (like igb_uio) to 
> > work around the issue, and this is lame. IMHO uio_pci_generic should
> > be fixed to be able to properly work within any virtualized environment 
> > and not only with KVM.
> > 
> 
> Also VMware (a bigger problem) has no IOMMU emulation.
> Other environments (Windriver, GCE) have no IOMMU either.

Because the use-case of userspace drivers is not important enough?
Without an IOMMU, there's no way to have secure userspace drivers.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 17:39               ` Michael S. Tsirkin
@ 2015-09-30 17:43                 ` Stephen Hemminger
  2015-09-30 18:50                   ` Michael S. Tsirkin
  2015-09-30 17:44                 ` Gleb Natapov
  1 sibling, 1 reply; 100+ messages in thread
From: Stephen Hemminger @ 2015-09-30 17:43 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On Wed, 30 Sep 2015 20:39:43 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> > On Wed, 30 Sep 2015 13:37:22 +0300
> > Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
> > 
> > > 
> > > 
> > > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >>
> > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > >>>> The security breach motivation u brought in "[RFC PATCH] uio:
> > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > > >>>> since once u let the userland access the bar it may do any funny thing
> > > >>>> using the DMA engine of the device. This kind of stuff should be prevented
> > > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > > >>>> configuration will be prevented too.
> > > >>>>
> > > >>>> I'm about to send the patch to the main Linux mailing list. Let's continue this
> > > >>>> discussion there.
> > > >>>>    
> > > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > > 
> > > If there is an IOMMU in the picture there shouldn't be any problem 
> > > using UIO with DMA-capable devices.
> > > 
> > > >>> I don't think this can change.
> > > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> > > >> use, I can't accept that.
> > > > QEMU does allow emulating an iommu.
> > > 
> > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > > option there. And again, it's a general issue, not DPDK-specific.
> > > Today one has to develop some proprietary modules (like igb_uio) to 
> > > work around the issue, and this is lame. IMHO uio_pci_generic should
> > > be fixed to be able to properly work within any virtualized environment 
> > > and not only with KVM.
> > > 
> > 
> > Also VMware (a bigger problem) has no IOMMU emulation.
> > Other environments (Windriver, GCE) have no IOMMU either.
> 
> Because the use-case of userspace drivers is not important enough?
> Without an IOMMU, there's no way to have secure userspace drivers.

Look at Cloudius: there is no necessity for security in the guest.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 17:39               ` Michael S. Tsirkin
  2015-09-30 17:43                 ` Stephen Hemminger
@ 2015-09-30 17:44                 ` Gleb Natapov
  1 sibling, 0 replies; 100+ messages in thread
From: Gleb Natapov @ 2015-09-30 17:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On Wed, Sep 30, 2015 at 08:39:43PM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> > On Wed, 30 Sep 2015 13:37:22 +0300
> > Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
> > 
> > > 
> > > 
> > > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >>
> > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > >>>> The security breach motivation u brought in "[RFC PATCH] uio:
> > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > > >>>> since once u let the userland access the bar it may do any funny thing
> > > >>>> using the DMA engine of the device. This kind of stuff should be prevented
> > > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > > >>>> configuration will be prevented too.
> > > >>>>
> > > >>>> I'm about to send the patch to the main Linux mailing list. Let's continue this
> > > >>>> discussion there.
> > > >>>>    
> > > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > > 
> > > If there is an IOMMU in the picture there shouldn't be any problem 
> > > using UIO with DMA-capable devices.
> > > 
> > > >>> I don't think this can change.
> > > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> > > >> use, I can't accept that.
> > > > QEMU does allow emulating an iommu.
> > > 
> > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > > option there. And again, it's a general issue, not DPDK-specific.
> > > Today one has to develop some proprietary modules (like igb_uio) to 
> > > work around the issue, and this is lame. IMHO uio_pci_generic should
> > > be fixed to be able to properly work within any virtualized environment 
> > > and not only with KVM.
> > > 
> > 
> > Also VMware (a bigger problem) has no IOMMU emulation.
> > Other environments (Windriver, GCE) have no IOMMU either.
> 
> Because the use-case of userspace drivers is not important enough?
Because "secure" userspace drivers is not important enough.

> Without an IOMMU, there's no way to have secure userspace drivers.
> 
People use VMs as application containers, not as machines that need
to be secured for a multiuser scenario.

--
			Gleb.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 15:26                             ` Michael S. Tsirkin
@ 2015-09-30 18:15                               ` Vlad Zolotarov
  2015-09-30 18:55                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 18:15 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/30/15 18:26, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>> How does not virtualizing the iommu force an "all or nothing" approach?
> Looks like you can't limit an assigned device to only access part of
> guest memory that belongs to a given process.  Either let it access all
> of guest memory ("all") or don't assign the device ("nothing").

Ok. A question then: can u limit the assigned device to only access part 
of the guest memory even if iommu was virtualized? How would iommu 
virtualization change anything? And why do we care about an assigned 
device being able to access all Guest memory?

>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 17:43                 ` Stephen Hemminger
@ 2015-09-30 18:50                   ` Michael S. Tsirkin
  2015-09-30 20:00                     ` Gleb Natapov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 18:50 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Wed, Sep 30, 2015 at 10:43:04AM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 20:39:43 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> > > On Wed, 30 Sep 2015 13:37:22 +0300
> > > Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
> > > 
> > > > 
> > > > 
> > > > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > > > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > >>
> > > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > > >>>> The security breach motivation u brought in "[RFC PATCH] uio:
> > > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > > > >>>> since once u let the userland access the bar it may do any funny thing
> > > > >>>> using the DMA engine of the device. This kind of stuff should be prevented
> > > > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > > > >>>> configuration will be prevented too.
> > > > >>>>
> > > > >>>> I'm about to send the patch to the main Linux mailing list. Let's continue this
> > > > >>>> discussion there.
> > > > >>>>    
> > > > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > > > 
> > > > If there is an IOMMU in the picture there shouldn't be any problem 
> > > > using UIO with DMA-capable devices.
> > > > 
> > > > >>> I don't think this can change.
> > > > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> > > > >> use, I can't accept that.
> > > > > QEMU does allow emulating an iommu.
> > > > 
> > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > > > option there. And again, it's a general issue, not DPDK-specific.
> > > > Today one has to develop some proprietary modules (like igb_uio) to 
> > > > work around the issue, and this is lame. IMHO uio_pci_generic should
> > > > be fixed to be able to properly work within any virtualized environment 
> > > > and not only with KVM.
> > > > 
> > > 
> > > Also VMware (a bigger problem) has no IOMMU emulation.
> > > Other environments (Windriver, GCE) have no IOMMU either.
> > 
> > Because the use-case of userspace drivers is not important enough?
> > Without an IOMMU, there's no way to have secure userspace drivers.
> 
> > Look at Cloudius: there is no necessity for security in the guest.

It's an interesting concept, isn't it?

So why not do what Cloudius does, and run this task code in ring 0 then,
allocating all memory in the kernel range?

You are increasing interrupt latency by a huge factor by channeling
interrupts through a scheduler.  Let user install an
interrupt handler function, and be done with it.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 18:15                               ` Vlad Zolotarov
@ 2015-09-30 18:55                                 ` Michael S. Tsirkin
  2015-09-30 19:06                                   ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 18:55 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 18:26, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
> >>How does not virtualizing the iommu force an "all or nothing" approach?
> >Looks like you can't limit an assigned device to only access part of
> >guest memory that belongs to a given process.  Either let it access all
> >of guest memory ("all") or don't assign the device ("nothing").
> 
> Ok. A question then: can u limit the assigned device to only access part of
> the guest memory even if iommu was virtualized?

That's exactly what an iommu does - limit the device io access to memory.

> How would iommu
> virtualization change anything?

Kernel can use an iommu to limit device access to memory of
the controlling application.
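
To make that concrete: with VFIO, only what the application explicitly
maps becomes visible to the device. A minimal sketch (the group number
is made up, error handling omitted):

	#include <fcntl.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	static int limit_device_to_buffer(void *buf, uint64_t len)
	{
		int container = open("/dev/vfio/vfio", O_RDWR);
		int group = open("/dev/vfio/12", O_RDWR);  /* iommu group 12 */
		struct vfio_iommu_type1_dma_map map = {
			.argsz = sizeof(map),
			.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
			.vaddr = (uintptr_t)buf,
			.iova  = 0,
			.size  = len,
		};

		ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
		ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
		/* Device DMA can now reach this buffer and nothing else;
		 * any other iova faults in the iommu instead of hitting RAM. */
		return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
	}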

> And why do we care about an assigned device
> being able to access all Guest memory?

Because we want to be reasonably sure a kernel memory corruption
is not a result of a bug in a userspace application.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 18:55                                 ` Michael S. Tsirkin
@ 2015-09-30 19:06                                   ` Vlad Zolotarov
  2015-09-30 19:10                                     ` Vlad Zolotarov
  2015-09-30 19:39                                     ` Michael S. Tsirkin
  0 siblings, 2 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 19:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/30/15 21:55, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 18:26, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>>>> How does not virtualizing the iommu force an "all or nothing" approach?
>>> Looks like you can't limit an assigned device to only access part of
>>> guest memory that belongs to a given process.  Either let it access all
>>> of guest memory ("all") or don't assign the device ("nothing").
>> Ok. A question then: can u limit the assigned device to only access part of
>> the guest memory even if iommu was virtualized?
> That's exactly what an iommu does - limit the device io access to memory.

If it does, it will continue to do so with or without the patch, and if 
it doesn't (for any reason), it won't, with or without the patch.
So, again, the above (rhetorical) question stands. ;)

I think Avi has already explained in quite some detail why security is 
absolutely a non-issue in regard to this patch, or in regard to UIO in 
general. Security has to be enforced by some other means, like the iommu.

>
>> How would iommu
>> virtualization change anything?
> Kernel can use an iommu to limit device access to memory of
> the controlling application.

Ok, this is obvious, but what does it have to do with enabling MSI/MSI-X 
interrupt support in uio_pci_generic? The kernel may continue to limit the 
above access with this support as well.

>
>> And why do we care about an assigned device
>> being able to access all Guest memory?
> Because we want to be reasonably sure a kernel memory corruption
> is not a result of a bug in a userspace application.

Corrupting Guest's memory due to any SW misbehavior (including bugs) is 
a non-issue by design - this is what HV and Guest machines were invented 
for. So, like Avi also said, instead of trying to enforce nobody cares 
about we'd rather make the developers life easier instead (by applying 
the not-yet-completed patch I'm working on).
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 19:06                                   ` Vlad Zolotarov
@ 2015-09-30 19:10                                     ` Vlad Zolotarov
  2015-09-30 19:11                                       ` Vlad Zolotarov
  2015-09-30 19:39                                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 19:10 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/30/15 22:06, Vlad Zolotarov wrote:
>
>
> On 09/30/15 21:55, Michael S. Tsirkin wrote:
>> On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
>>>
>>> On 09/30/15 18:26, Michael S. Tsirkin wrote:
>>>> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>>>>> How does not virtualizing the iommu force an "all or nothing" approach?
>>>> Looks like you can't limit an assigned device to only access part of
>>>> guest memory that belongs to a given process.  Either let it access 
>>>> all
>>>> of guest memory ("all") or don't assign the device ("nothing").
>>> Ok. A question then: can u limit the assigned device to only access 
>>> part of
>>> the guest memory even if iommu was virtualized?
>> That's exactly what an iommu does - limit the device io access to 
>> memory.
>
> If it does, it will continue to do so with or without the patch, and 
> if it doesn't (for any reason), it won't, with or without the patch.
> So, again, the above (rhetorical) question stands. ;)
>
> I think Avi has already explained in quite some detail why security is 
> absolutely a non-issue in regard to this patch, or in regard to UIO in 
> general. Security has to be enforced by some other means, like the iommu.
>
>>
>>> How would iommu
>>> virtualization change anything?
>> Kernel can use an iommu to limit device access to memory of
>> the controlling application.
>
> Ok, this is obvious, but what does it have to do with enabling 
> MSI/MSI-X interrupt support in uio_pci_generic? The kernel may continue 
> to limit the above access with this support as well.
>
>>
>>> And why do we care about an assigned device
>>> being able to access all Guest memory?
>> Because we want to be reasonably sure a kernel memory corruption
>> is not a result of a bug in a userspace application.
>
> Corrupting Guest's memory due to any SW misbehavior (including bugs) 
> is a non-issue by design - this is what HV and Guest machines were 
> invented for. So, like Avi also said, instead of trying to enforce 
> nobody cares about 

Let me rephrase: by pretending to enforce some security promise that u 
don't actually fulfill... ;)

> we'd rather make the developers life easier instead (by applying the 
> not-yet-completed patch I'm working on).
>>
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 19:10                                     ` Vlad Zolotarov
@ 2015-09-30 19:11                                       ` Vlad Zolotarov
  0 siblings, 0 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 19:11 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/30/15 22:10, Vlad Zolotarov wrote:
>
>
> On 09/30/15 22:06, Vlad Zolotarov wrote:
>>
>>
>> On 09/30/15 21:55, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
>>>>
>>>> On 09/30/15 18:26, Michael S. Tsirkin wrote:
>>>>> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>>>>>> How does not virtualizing the iommu force an "all or nothing" approach?
>>>>> Looks like you can't limit an assigned device to only access part of
>>>>> guest memory that belongs to a given process.  Either let it 
>>>>> access all
>>>>> of guest memory ("all") or don't assign the device ("nothing").
>>>> Ok. A question then: can u limit the assigned device to only access 
>>>> part of
>>>> the guest memory even if iommu was virtualized?
>>> That's exactly what an iommu does - limit the device io access to 
>>> memory.
>>
>> If it does, it will continue to do so with or without the patch, and 
>> if it doesn't (for any reason), it won't, with or without the patch.
>> So, again, the above (rhetorical) question stands. ;)
>>
>> I think Avi has already explained in quite some detail why security is 
>> absolutely a non-issue in regard to this patch, or in regard to UIO in 
>> general. Security has to be enforced by some other means, like the iommu.
>>
>>>
>>>> How would iommu
>>>> virtualization change anything?
>>> Kernel can use an iommu to limit device access to memory of
>>> the controlling application.
>>
>> Ok, this is obvious, but what does it have to do with enabling 
>> MSI/MSI-X interrupt support in uio_pci_generic? The kernel may continue 
>> to limit the above access with this support as well.
>>
>>>
>>>> And why do we care about an assigned device
>>>> being able to access all Guest memory?
>>> Because we want to be reasonably sure a kernel memory corruption
>>> is not a result of a bug in a userspace application.
>>
>> Corrupting Guest's memory due to any SW misbehavior (including bugs) 
>> is a non-issue by design - this is what HV and Guest machines were 
>> invented for. So, like Avi also said, instead of trying to enforce 
>> nobody cares about 
>
> Let me rephrase: by pretending to enforce some security promise that u 
> don't actually fulfill... ;)

...the promise nobody really cares about...

>
>> we'd rather make the developers life easier instead (by applying the 
>> not-yet-completed patch I'm working on).
>>>
>>
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 19:06                                   ` Vlad Zolotarov
  2015-09-30 19:10                                     ` Vlad Zolotarov
@ 2015-09-30 19:39                                     ` Michael S. Tsirkin
  2015-09-30 20:09                                       ` Vlad Zolotarov
  1 sibling, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 19:39 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev

On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> >>How would iommu
> >>virtualization change anything?
> >Kernel can use an iommu to limit device access to memory of
> >the controlling application.
> 
> Ok, this is obvious, but what does it have to do with enabling MSI/MSI-X
> interrupt support in uio_pci_generic? The kernel may continue to limit the
> above access with this support as well.

It could maybe. So if you write a patch to allow MSI while at the same time
creating an isolated IOMMU group and blocking DMA from the device in
question anywhere, that sounds reasonable.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 18:50                   ` Michael S. Tsirkin
@ 2015-09-30 20:00                     ` Gleb Natapov
  2015-09-30 20:36                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Gleb Natapov @ 2015-09-30 20:00 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On Wed, Sep 30, 2015 at 09:50:08PM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 10:43:04AM -0700, Stephen Hemminger wrote:
> > On Wed, 30 Sep 2015 20:39:43 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> > > > On Wed, 30 Sep 2015 13:37:22 +0300
> > > > Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
> > > > 
> > > > > 
> > > > > 
> > > > > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > > > > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > > > > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > >>
> > > > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > > > >>>> The security breach motivation u brought in "[RFC PATCH] uio:
> > > > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > > > > >>>> since once u let the userland access the bar it may do any funny thing
> > > > > >>>> using the DMA engine of the device. This kind of stuff should be prevented
> > > > > >>>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > > > > >>>> configuration will be prevented too.
> > > > > >>>>
> > > > > >>>> I'm about to send the patch to the main Linux mailing list. Let's continue this
> > > > > >>>> discussion there.
> > > > > >>>>    
> > > > > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > > > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > > > > 
> > > > > If there is an IOMMU in the picture there shouldn't be any problem 
> > > > > using UIO with DMA-capable devices.
> > > > > 
> > > > > >>> I don't think this can change.
> > > > > >> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> > > > > >> use, I can't accept that.
> > > > > > QEMU does allow emulating an iommu.
> > > > > 
> > > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > > > > option there. And again, it's a general issue, not DPDK-specific.
> > > > > Today one has to develop some proprietary modules (like igb_uio) to 
> > > > > work around the issue, and this is lame. IMHO uio_pci_generic should
> > > > > be fixed to be able to properly work within any virtualized environment 
> > > > > and not only with KVM.
> > > > > 
> > > > 
> > > > Also VMware (a bigger problem) has no IOMMU emulation.
> > > > Other environments (Windriver, GCE) have no IOMMU either.
> > > 
> > > Because the use-case of userspace drivers is not important enough?
> > > Without an IOMMU, there's no way to have secure userspace drivers.
> > 
> > Look at Cloudius: there is no necessity for security in the guest.
> 
> It's an interesting concept, isn't it?
> 
It is.

> So why not do what Cloudius does, and run this task code in ring 0 then,
> allocating all memory in the kernel range?
> 
Except this is not what Cloudius does. The idea of OSv is that it can
run your regular userspace application, but remove an unneeded level of
indirection by bypassing userspace/kernelspace communication (among
other things).  The application still uses virtual memory, not the
directly mapped physical memory that Linux ring 0 has.

You can achieve most of the benefits of kernel bypass on Linux too, but
unlike OSv you need to code for it. UIO is one of those things that
allows that.
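
For the record, the UIO side of that coding is tiny. A sketch (device
node illustrative): the fast path polls rings mapped into userspace,
and the slow path is just a blocking read of /dev/uioX, which returns
the interrupt count:

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	int main(void)
	{
		int uio = open("/dev/uio0", O_RDWR);
		uint32_t count;

		for (;;) {
			/* Blocks until the next interrupt fires; 'count'
			 * is the total number of interrupts seen so far. */
			if (read(uio, &count, sizeof(count)) != sizeof(count))
				break;
			/* ... drain the RX ring in userspace, re-arm ... */
		}
		return 0;
	}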

> You are increasing interrupt latency by a huge factor by channeling
> interrupts through a scheduler.  Let user install an
> interrupt handler function, and be done with it.
> 
Interrupt latency is not always hugely important. If you enter interrupt
mode only when idle, a hundred more us on the first packet will not kill
you. If interrupt latency is important then uio may not be the right
solution, but then neither is vfio.

--
			Gleb.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 19:39                                     ` Michael S. Tsirkin
@ 2015-09-30 20:09                                       ` Vlad Zolotarov
  2015-09-30 21:36                                         ` Stephen Hemminger
  0 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 20:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 09/30/15 22:39, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
>>>> How would iommu
>>>> virtualization change anything?
>>> Kernel can use an iommu to limit device access to memory of
>>> the controlling application.
>> Ok, this is obvious, but what does it have to do with enabling MSI/MSI-X
>> interrupt support in uio_pci_generic? The kernel may continue to limit the
>> above access with this support as well.
> It could maybe. So if you write a patch to allow MSI while at the same time
> creating an isolated IOMMU group and blocking DMA from the device in
> question anywhere, that sounds reasonable.

No, I'm only planning to add MSI and MSI-X interrupts support for 
uio_pci_generic device.
The rest mentioned above should naturally be a matter of a different 
patch and writing it is orthogonal to the patch I'm working on as has 
been extensively discussed in this thread.

>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 20:00                     ` Gleb Natapov
@ 2015-09-30 20:36                       ` Michael S. Tsirkin
  2015-10-01  5:04                         ` Gleb Natapov
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 20:36 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: dev

On Wed, Sep 30, 2015 at 11:00:49PM +0300, Gleb Natapov wrote:
> > You are increasing interrupt latency by a huge factor by channeling
> > interrupts through a scheduler.  Let user install an
> > interrupt handler function, and be done with it.
> > 
> > Interrupt latency is not always hugely important. If you enter interrupt
> > mode only when idle, a hundred more us on the first packet will not kill you.

It certainly affects worst-case latency.  And if you lower interrupt
latency, you can go idle faster, so it affects power too.

> > If
> > interrupt latency is important then uio may not be the right solution,
> > but then neither is vfio.

That's what I'm saying: if you don't need memory isolation, you can do
better than just slightly tweaking existing drivers.

> --
> 			Gleb.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 15:36                                   ` Avi Kivity
@ 2015-09-30 20:40                                     ` Michael S. Tsirkin
  2015-09-30 21:00                                       ` Avi Kivity
  2015-10-01  8:44                                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 20:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Wed, Sep 30, 2015 at 06:36:17PM +0300, Avi Kivity wrote:
> As it happens, you're removing the functionality from the users who have no
> other option.  They can't use vfio because it doesn't work on virtualized
> setups.

...

> Root can already do anything.

I think there's a contradiction between the two claims above.

>  So what security issue is there?

A buggy userspace can and will corrupt kernel memory.

...

> And for what, to prevent
> root from touching memory via dma that they can access in a million other
> ways?

So one can be reasonably sure a kernel oops is not a result of a
userspace bug.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 20:40                                     ` Michael S. Tsirkin
@ 2015-09-30 21:00                                       ` Avi Kivity
  2015-10-01  8:44                                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 100+ messages in thread
From: Avi Kivity @ 2015-09-30 21:00 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 09/30/2015 11:40 PM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 06:36:17PM +0300, Avi Kivity wrote:
>> As it happens, you're removing the functionality from the users who have no
>> other option.  They can't use vfio because it doesn't work on virtualized
>> setups.
> ...
>
>> Root can already do anything.
> I think there's a contradiction between the two claims above.

Yes, root can replace the current kernel with a patched kernel.  In that 
sense, root can do anything, and the kernel is complete.  Now let's stop 
playing word games.

>>   So what security issue is there?
> A buggy userspace can and will corrupt kernel memory.
>
> ...
>
>> And for what, to prevent
>> root from touching memory via dma that they can access in a million other
>> ways?
> So one can be reasonably sure a kernel oops is not a result of a
> userspace bug.
>

That's not security.  It's a legitimate concern though, one that is 
addressed by tainting the kernel.
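
What that would look like in the driver is a one-liner. A sketch, not
anyone's actual patch (the hook name is made up):

	#include <linux/fs.h>
	#include <linux/kernel.h>
	#include <linux/uio_driver.h>

	/* Mark the kernel tainted the moment userspace takes the device,
	 * so any later oops report shows TAINT_USER was in effect. */
	static int my_uio_open(struct uio_info *info, struct inode *inode)
	{
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
		return 0;
	}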

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 20:09                                       ` Vlad Zolotarov
@ 2015-09-30 21:36                                         ` Stephen Hemminger
  2015-09-30 21:53                                           ` Michael S. Tsirkin
                                                             ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Stephen Hemminger @ 2015-09-30 21:36 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev, Michael S. Tsirkin

On Wed, 30 Sep 2015 23:09:33 +0300
Vlad Zolotarov <vladz@cloudius-systems.com> wrote:

> 
> 
> On 09/30/15 22:39, Michael S. Tsirkin wrote:
> > On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> >>>> How would iommu
> >>>> virtualization change anything?
> >>> Kernel can use an iommu to limit device access to memory of
> >>> the controlling application.
> >> Ok, this is obvious, but what does it have to do with enabling MSI/MSI-X
> >> interrupt support in uio_pci_generic? The kernel may continue to limit the
> >> above access with this support as well.
> > It could maybe. So if you write a patch to allow MSI while at the same time
> > creating an isolated IOMMU group and blocking DMA from the device in
> > question anywhere, that sounds reasonable.
> 
> No, I'm only planning to add MSI and MSI-X interrupts support for 
> uio_pci_generic device.
> The rest mentioned above should naturally be a matter of a different 
> patch and writing it is orthogonal to the patch I'm working on as has 
> been extensively discussed in this thread.
> 
> >
> 

I have a generic MSI and MSI-X driver (posted earlier on this list).
About to post to upstream kernel.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 21:36                                         ` Stephen Hemminger
@ 2015-09-30 21:53                                           ` Michael S. Tsirkin
  2015-09-30 22:20                                           ` Vlad Zolotarov
  2015-10-01  8:00                                           ` Vlad Zolotarov
  2 siblings, 0 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-09-30 21:53 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Wed, Sep 30, 2015 at 02:36:48PM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 23:09:33 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
> 
> > 
> > 
> > On 09/30/15 22:39, Michael S. Tsirkin wrote:
> > > On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> > >>>> How would iommu
> > >>>> virtualization change anything?
> > >>> Kernel can use an iommu to limit device access to memory of
> > >>> the controlling application.
> > >> Ok, this is obvious, but what does it have to do with enabling MSI/MSI-X
> > >> interrupt support in uio_pci_generic? The kernel may continue to limit the
> > >> above access with this support as well.
> > > It could maybe. So if you write a patch to allow MSI while at the same time
> > > creating an isolated IOMMU group and blocking DMA from the device in
> > > question anywhere, that sounds reasonable.
> > 
> > No, I'm only planning to add MSI and MSI-X interrupts support for 
> > uio_pci_generic device.
> > The rest mentioned above should naturally be a matter of a different 
> > patch and writing it is orthogonal to the patch I'm working on as has 
> > been extensively discussed in this thread.
> > 
> > >
> > 
> 
> I have a generic MSI and MSI-X driver (posted earlier on this list).
> About to post to upstream kernel.

If Linux holds out and refuses to support insecure interfaces,
hypervisor vendors will add secure ones. If Linux lets them ignore guest
security, they will.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 21:36                                         ` Stephen Hemminger
  2015-09-30 21:53                                           ` Michael S. Tsirkin
@ 2015-09-30 22:20                                           ` Vlad Zolotarov
  2015-10-01  8:00                                           ` Vlad Zolotarov
  2 siblings, 0 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-09-30 22:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Michael S. Tsirkin



On 10/01/15 00:36, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 23:09:33 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
>>
>> On 09/30/15 22:39, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
>>>>>> How would iommu
>>>>>> virtualization change anything?
>>>>> Kernel can use an iommu to limit device access to memory of
>>>>> the controlling application.
>>>> Ok, this is obvious, but what does it have to do with enabling MSI/MSI-X
>>>> interrupt support in uio_pci_generic? The kernel may continue to limit the
>>>> above access with this support as well.
>>> It could maybe. So if you write a patch to allow MSI while at the same time
>>> creating an isolated IOMMU group and blocking DMA from the device in
>>> question anywhere, that sounds reasonable.
>> No, I'm only planning to add MSI and MSI-X interrupts support for
>> uio_pci_generic device.
>> The rest mentioned above should naturally be a matter of a different
>> patch and writing it is orthogonal to the patch I'm working on as has
>> been extensively discussed in this thread.
>>
> I have a generic MSI and MSI-X driver (posted earlier on this list).
> About to post to upstream kernel.

Great! It would save me a few working days... ;) Thanks, Stephen!

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 20:36                       ` Michael S. Tsirkin
@ 2015-10-01  5:04                         ` Gleb Natapov
  0 siblings, 0 replies; 100+ messages in thread
From: Gleb Natapov @ 2015-10-01  5:04 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On Wed, Sep 30, 2015 at 11:36:58PM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 11:00:49PM +0300, Gleb Natapov wrote:
> > > You are increasing interrupt latency by a huge factor by channeling
> > > interrupts through a scheduler.  Let user install an
> > > interrupt handler function, and be done with it.
> > > 
> > Interrupt latency is not always hugely important. If you enter interrupt
> > mode only when idle, a hundred more us on the first packet will not kill you.
> 
> It certainly affects worst-case latency.  And if you lower interrupt
> latency, you can go idle faster, so it affects power too.
> 
We are polling 100% now. Going idle faster is the least of our concerns.

> > If
> > interrupt latency is important then uio may not be the right solution,
> > but then neither is vfio.
> 
> That's what I'm saying: if you don't need memory isolation, you can do
> better than just slightly tweaking existing drivers.
> 
No, you are forcing everyone to code in the kernel whether it makes
sense or not. You decide for everyone what is good for them. Believe me,
people here know about the trade-offs and have made the appropriate
considerations.

--
			Gleb.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 21:36                                         ` Stephen Hemminger
  2015-09-30 21:53                                           ` Michael S. Tsirkin
  2015-09-30 22:20                                           ` Vlad Zolotarov
@ 2015-10-01  8:00                                           ` Vlad Zolotarov
  2015-10-01 14:47                                             ` Stephen Hemminger
  2 siblings, 1 reply; 100+ messages in thread
From: Vlad Zolotarov @ 2015-10-01  8:00 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Michael S. Tsirkin



On 10/01/15 00:36, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 23:09:33 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
>>
>> On 09/30/15 22:39, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
>>>>>> How would iommu
>>>>>> virtualization change anything?
>>>>> Kernel can use an iommu to limit device access to memory of
>>>>> the controlling application.
>>>> Ok, this is obvious, but what does it have to do with enabling MSI/MSI-X
>>>> interrupt support in uio_pci_generic? The kernel may continue to limit the
>>>> above access with this support as well.
>>> It could maybe. So if you write a patch to allow MSI while at the same time
>>> creating an isolated IOMMU group and blocking DMA from the device in
>>> question anywhere, that sounds reasonable.
>> No, I'm only planning to add MSI and MSI-X interrupts support for
>> uio_pci_generic device.
>> The rest mentioned above should naturally be a matter of a different
>> patch and writing it is orthogonal to the patch I'm working on as has
>> been extensively discussed in this thread.
>>
> I have a generic MSI and MSI-X driver (posted earlier on this list).
> About to post to upstream kernel.

Stephen, hi!

I found the mentioned series, and the first thing I noticed was that it 
was sent in May, so the first question is how far down your list of tasks 
submitting it upstream is. We need it more or less yesterday and I'm 
working on it right now. Therefore if u don't have time for it I'd like 
to help... ;) However I'd like u to clarify a few small things. Pls. 
see below...

I noticed that u've created a separate msi_msix driver, and the second 
question is what u plan for upstream. I was thinking of extending the 
existing uio_pci_generic with MSI-X functionality similar to your code, 
while preserving the INT#X functionality as it is now:

  *   INT#X and MSI would provide the IRQ number to the UIO module, while
    only the MSI-X case would register with UIO_IRQ_CUSTOM.

I also noticed that u enable MSI-X on the first open() call. I assume 
there was a good reason (that I'm missing) for not doing it in probe(). 
Could u, pls., clarify?
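
For concreteness, the extension I have in mind would look roughly like
the sketch below (a sketch only - names are made up, error handling and
teardown are omitted, and this is not Stephen's actual patch):

	#include <linux/pci.h>
	#include <linux/interrupt.h>
	#include <linux/uio_driver.h>

	static irqreturn_t msix_handler(int irq, void *arg)
	{
		struct uio_info *info = arg;

		/* Wake up whoever is blocked in read() on /dev/uioX. */
		uio_event_notify(info);
		return IRQ_HANDLED;
	}

	static int setup_msix(struct pci_dev *pdev, struct uio_info *info)
	{
		struct msix_entry entry = { .entry = 0 };
		int err;

		err = pci_enable_msix_range(pdev, &entry, 1, 1);
		if (err < 0)
			return err;
		/* UIO_IRQ_CUSTOM: the UIO core won't request_irq() itself;
		 * we forward events from our own handler instead. */
		info->irq = UIO_IRQ_CUSTOM;
		return request_irq(entry.vector, msix_handler, 0,
				   "uio_pci_generic-msix", info);
	}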

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-09-30 20:40                                     ` Michael S. Tsirkin
  2015-09-30 21:00                                       ` Avi Kivity
@ 2015-10-01  8:44                                       ` Michael S. Tsirkin
  2015-10-01  8:46                                         ` Vlad Zolotarov
                                                           ` (2 more replies)
  1 sibling, 3 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01  8:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
> > And for what, to prevent
> > root from touching memory via dma that they can access in a million other
> > ways?
> 
> So one can be reasonably sure a kernel oops is not a result of a
> userspace bug.

Actually, I thought about this overnight, and  it should be possible to
drive it securely from userspace, without hypervisor changes.

See

https://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com



> -- 
> MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  8:44                                       ` Michael S. Tsirkin
@ 2015-10-01  8:46                                         ` Vlad Zolotarov
  2015-10-01  8:52                                         ` Avi Kivity
  2015-10-01  9:16                                         ` Michael S. Tsirkin
  2 siblings, 0 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-10-01  8:46 UTC (permalink / raw)
  To: Michael S. Tsirkin, Avi Kivity; +Cc: dev



On 10/01/15 11:44, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
>>> And for what, to prevent
>>> root from touching memory via dma that they can access in a million other
>>> ways?
>> So one can be reasonably sure a kernel oops is not a result of a
>> userspace bug.
> Actually, I thought about this overnight, and  it should be possible to
> drive it securely from userspace, without hypervisor changes.
>
> See
>
> https://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com

Looks like a dead link.

>
>
>
>> -- 
>> MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  8:44                                       ` Michael S. Tsirkin
  2015-10-01  8:46                                         ` Vlad Zolotarov
@ 2015-10-01  8:52                                         ` Avi Kivity
  2015-10-01  9:15                                           ` Michael S. Tsirkin
  2015-10-01  9:15                                           ` Avi Kivity
  2015-10-01  9:16                                         ` Michael S. Tsirkin
  2 siblings, 2 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01  8:52 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 11:44 AM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
>>> And for what, to prevent
>>> root from touching memory via dma that they can access in a million other
>>> ways?
>> So one can be reasonably sure a kernel oops is not a result of a
>> userspace bug.
> Actually, I thought about this overnight, and  it should be possible to
> drive it securely from userspace, without hypervisor changes.

Also without the performance that was the whole reason for doing it in 
userspace in the first place.

I still don't understand your objection to the patch:

> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.

If a distribution feels it can't support this configuration, it can 
disable the uio_pci_generic driver, or refuse to support tainted 
kernels.  If it feels it can (and many distributions are starting to 
support dpdk), then you're just denying it the ability to serve its users.

> See
>
> https://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com
>
>
>

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  8:52                                         ` Avi Kivity
@ 2015-10-01  9:15                                           ` Michael S. Tsirkin
  2015-10-01  9:22                                             ` Avi Kivity
  2015-10-01  9:15                                           ` Avi Kivity
  1 sibling, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01  9:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 11:52:26AM +0300, Avi Kivity wrote:
> I still don't understand your objection to the patch:
> 
> 
>     MSI messages are memory writes so any generic device capable
>     of MSI is capable of corrupting kernel memory.
>     This means that a bug in userspace will lead to kernel memory corruption
>     and crashes.  This is something distributions can't support.
> 
> 
> If a distribution feels it can't support this configuration, it can disable the
> uio_pci_generic driver, or refuse to support tainted kernels.  If it feels it
> can (and many distributions are starting to support dpdk), then you're just
> denying it the ability to serve its users.

I don't, and can't deny users anything.  I merely think upstream should
avoid putting this driver in-tree.  By doing this, driver writers will
be pushed to develop solutions that can't crash the kernel.

I pointed out one way to build it, there are sure to be more.

As far as I could see, without this kind of motivation, people do not
even want to try.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  8:52                                         ` Avi Kivity
  2015-10-01  9:15                                           ` Michael S. Tsirkin
@ 2015-10-01  9:15                                           ` Avi Kivity
  2015-10-01  9:29                                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-10-01  9:15 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 11:52 AM, Avi Kivity wrote:
>
>
> On 10/01/2015 11:44 AM, Michael S. Tsirkin wrote:
>> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
>>>> And for what, to prevent
>>>> root from touching memory via dma that they can access in a million other
>>>> ways?
>>> So one can be reasonably sure a kernel oops is not a result of a
>>> userspace bug.
>> Actually, I thought about this overnight, and  it should be possible to
>> drive it securely from userspace, without hypervisor changes.
>
> Also without the performance that was the whole reason for doing it 
> in userspace in the first place.
>
> I still don't understand your objection to the patch:
>
>> MSI messages are memory writes so any generic device capable
>> of MSI is capable of corrupting kernel memory.
>> This means that a bug in userspace will lead to kernel memory corruption
>> and crashes.  This is something distributions can't support.
>

And this:

> What userspace can't be allowed to do:
>
> 	access BAR
> 	write rings
>

It can access the BAR by mmap()ing the resourceN files under sysfs. 
You're not denying userspace the ability to oops the kernel, just the 
ability to do useful things with hardware.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  8:44                                       ` Michael S. Tsirkin
  2015-10-01  8:46                                         ` Vlad Zolotarov
  2015-10-01  8:52                                         ` Avi Kivity
@ 2015-10-01  9:16                                         ` Michael S. Tsirkin
  2 siblings, 0 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01  9:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 11:44:28AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
> > > And for what, to prevent
> > > root from touching memory via dma that they can access in a million other
> > > ways?
> > 
> > So one can be reasonably sure a kernel oops is not a result of a
> > userspace bug.
> 
> Actually, I thought about this overnight, and  it should be possible to
> drive it securely from userspace, without hypervisor changes.
> 
> See
> 
> https://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com

Ouch, looks like gmane doesn't do https. Sorry, this is the correct
link:

http://mid.gmane.org/20151001104505-mutt-send-email-mst@redhat.com

> 
> 
> > -- 
> > MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:15                                           ` Michael S. Tsirkin
@ 2015-10-01  9:22                                             ` Avi Kivity
  2015-10-01  9:42                                               ` Michael S. Tsirkin
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01  9:22 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 12:15 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 11:52:26AM +0300, Avi Kivity wrote:
>> I still don't understand your objection to the patch:
>>
>>
>>      MSI messages are memory writes so any generic device capable
>>      of MSI is capable of corrupting kernel memory.
>>      This means that a bug in userspace will lead to kernel memory corruption
>>      and crashes.  This is something distributions can't support.
>>
>>
>> If a distribution feels it can't support this configuration, it can disable the
>> uio_pci_generic driver, or refuse to support tainted kernels.  If it feels it
>> can (and many distributions are starting to support dpdk), then you're just
>> denying it the ability to serve its users.
> I don't, and can't deny users anything.  I merely think upstream should
> avoid putting this driver in-tree.  By doing this, driver writers will
> be pushed to develop solutions that can't crash the kernel.
>
> I pointed out one way to build it, there are sure to be more.

And I pointed out that your solution is unworkable.  It's easy to claim 
that a solution is around the corner, only no one was looking for it, 
but the reality is that kernel bypass has been a solution for years for 
high performance users, that it cannot be made safe without an iommu, 
and that iommus are not available everywhere; and even when they are, 
some users prefer to avoid the performance penalty.

> As far as I could see, without this kind of motivation, people do not
> even want to try.

You are mistaken.  The problem is a lot harder than you think.

People didn't go and write userspace drivers because they were lazy.  
They wrote them because there was no other way.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:15                                           ` Avi Kivity
@ 2015-10-01  9:29                                             ` Michael S. Tsirkin
  2015-10-01  9:38                                               ` Avi Kivity
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01  9:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 12:15:49PM +0300, Avi Kivity wrote:
>     What userspace can't be allowed to do:
> 
>             access BAR
>             write rings
> 
> 
> 
> 
> It can access the BAR by mmap()ing the resourceN files under sysfs.  You're not
> denying userspace the ability to oops the kernel, just the ability to do useful
> things with hardware.


This interface has to stay there to support existing applications.  A
variety of measures (selinux, secureboot) can be used to make sure
modern ones do not get to touch it. Most distributions enable
some or all of these by default.

And it doesn't mean modern drivers can do this kind of thing.

Look, without an IOMMU, sysfs cannot be used securely:
you need some other interface. This is what this driver is missing.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:29                                             ` Michael S. Tsirkin
@ 2015-10-01  9:38                                               ` Avi Kivity
  2015-10-01 10:07                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-10-01  9:38 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 12:29 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:15:49PM +0300, Avi Kivity wrote:
>>      What userspace can't be allowed to do:
>>
>>              access BAR
>>              write rings
>>
>>
>>
>>
>> It can access the BAR by mmap()ing the resourceN files under sysfs.  You're not
>> denying userspace the ability to oops the kernel, just the ability to do useful
>> things with hardware.
>
> This interface has to stay there to support existing applications.  A
> variety of measures (selinux, secureboot) can be used to make sure
> modern ones do not get to touch it.

By all means, secure the driver with selinux as well.

>   Most distributions enable
> some or all of these by default.

There is no problem accessing the BARs on the most modern and secure 
enterprise distribution I am aware of (CentOS 7.1).

>
> And it doesn't mean modern drivers can do this kind of thing.
>
> Look, without an IOMMU, sysfs cannot be used securely:
> you need some other interface. This is what this driver is missing.

What is this magical missing interface?

It simply cannot be done.  You either have an iommu, or you accept that 
userspace can access anything via DMA.

The sad thing is that you can do this since forever on a non-virtualized 
system, or on a virtualized system if you don't need interrupt support.  
All you're doing is blocking interrupt support on virtualized systems.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:22                                             ` Avi Kivity
@ 2015-10-01  9:42                                               ` Michael S. Tsirkin
  2015-10-01  9:53                                                 ` Avi Kivity
  2015-10-01 21:17                                                 ` Alexander Duyck
  2015-10-01  9:42                                               ` Vincent JARDIN
  2015-10-01  9:55                                               ` Michael S. Tsirkin
  2 siblings, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01  9:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> even when they are, some users
> prefer to avoid the performance penalty.

I don't think there's a measurable penalty from passing through the
IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I have
never seen any numbers that show one.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:22                                             ` Avi Kivity
  2015-10-01  9:42                                               ` Michael S. Tsirkin
@ 2015-10-01  9:42                                               ` Vincent JARDIN
  2015-10-01  9:43                                                 ` Avi Kivity
  2015-10-01 14:54                                                 ` Stephen Hemminger
  2015-10-01  9:55                                               ` Michael S. Tsirkin
  2 siblings, 2 replies; 100+ messages in thread
From: Vincent JARDIN @ 2015-10-01  9:42 UTC (permalink / raw)
  To: Avi Kivity, Michael S. Tsirkin; +Cc: dev

On 01/10/2015 11:22, Avi Kivity wrote:
>> As far as I could see, without this kind of motivation, people do not
>> even want to try.
>
> You are mistaken.  The problem is a lot harder than you think.
>
> People didn't go and write userspace drivers because they were lazy.
> They wrote them because there was no other way.

I disagree, it is possible to write a 'partial' userspace driver.

Here it is an example:
   http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4

It benefits from the kernel's capabilities while the userland manages only 
the I/O.

There were some attempts to get it for other (older) drivers, named 
'bifurcated drivers', but the effort has stalled.

best regards,
   Vincent

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:42                                               ` Vincent JARDIN
@ 2015-10-01  9:43                                                 ` Avi Kivity
  2015-10-01  9:48                                                   ` Vincent JARDIN
  2015-10-01 10:14                                                   ` Michael S. Tsirkin
  2015-10-01 14:54                                                 ` Stephen Hemminger
  1 sibling, 2 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01  9:43 UTC (permalink / raw)
  To: Vincent JARDIN, Michael S. Tsirkin; +Cc: dev



On 10/01/2015 12:42 PM, Vincent JARDIN wrote:
> On 01/10/2015 11:22, Avi Kivity wrote:
>>> As far as I could see, without this kind of motivation, people do not
>>> even want to try.
>>
>> You are mistaken.  The problem is a lot harder than you think.
>>
>> People didn't go and write userspace drivers because they were lazy.
>> They wrote them because there was no other way.
>
> I disagree, it is possible to write a 'partial' userspace driver.
>
> Here it is an example:
>   http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4
>
> It benefits from the kernel's capabilities while the userland manages 
> only the I/O.
>

That is because the device itself contains an iommu.

> There were some attempts to get it for other (older) drivers, named 
> 'bifurcated drivers', but the effort has stalled.

IIRC they still exposed the ring to userspace.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:43                                                 ` Avi Kivity
@ 2015-10-01  9:48                                                   ` Vincent JARDIN
  2015-10-01  9:54                                                     ` Avi Kivity
  2015-10-01 10:14                                                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Vincent JARDIN @ 2015-10-01  9:48 UTC (permalink / raw)
  To: Avi Kivity, Michael S. Tsirkin; +Cc: dev

On 01/10/2015 11:43, Avi Kivity wrote:
>
> That is because the device itself contains an iommu.

Yes.

It could be an option:
   - we flag the Linux system unsafe when the device does not have an 
IOMMU
   - we flag the Linux system safe when the device has an IOMMU

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:42                                               ` Michael S. Tsirkin
@ 2015-10-01  9:53                                                 ` Avi Kivity
  2015-10-01 10:17                                                   ` Michael S. Tsirkin
  2015-10-01 21:17                                                 ` Alexander Duyck
  1 sibling, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-10-01  9:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 10/01/2015 12:42 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
>> even when they are some users
>> prefer to avoid the performance penalty.
> I don't think there's a measurable penalty from passing through the
> IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I have
> never seen any numbers that show one.
>

Maybe not.  But again, virtualized setups will not have a guest iommu 
and therefore can't use it; and those happen to be exactly the setups 
you're blocking.

Non-virtualized setups have an iommu available, but they can also use 
uio_pci_generic without patching if they like.

The virtualized setups have no other option; you're leaving them out in 
the cold.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:48                                                   ` Vincent JARDIN
@ 2015-10-01  9:54                                                     ` Avi Kivity
  0 siblings, 0 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01  9:54 UTC (permalink / raw)
  To: Vincent JARDIN, Michael S. Tsirkin; +Cc: dev



On 10/01/2015 12:48 PM, Vincent JARDIN wrote:
> On 01/10/2015 11:43, Avi Kivity wrote:
>>
>> That is because the device itself contains an iommu.
>
> Yes.
>
> It could be an option:
>   - we could flag the Linux system unsafe when the device does not 
> have any IOMMU
>   - we flag the Linux system safe when the device has an IOMMU

This already exists, it's called the tainted flag.

I don't know if uio_pci_generic already taints the kernel; it certainly 
should with DMA-capable devices.
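
A sketch of what that could look like at probe time; add_taint() and
iommu_present() are real kernel interfaces, but the policy around them
here is hypothetical:

#include <linux/iommu.h>
#include <linux/kernel.h>
#include <linux/pci.h>

static void uio_taint_if_unsafe(struct pci_dev *pdev)
{
	/* handing a DMA-capable device to userspace without an
	 * IOMMU means the kernel can no longer vouch for itself */
	if (!iommu_present(pdev->dev.bus)) {
		dev_warn(&pdev->dev,
			 "userspace DMA without an IOMMU; tainting kernel\n");
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	}
}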

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:22                                             ` Avi Kivity
  2015-10-01  9:42                                               ` Michael S. Tsirkin
  2015-10-01  9:42                                               ` Vincent JARDIN
@ 2015-10-01  9:55                                               ` Michael S. Tsirkin
  2015-10-01  9:59                                                 ` Avi Kivity
  2 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01  9:55 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> It's easy to claim that
> a solution is around the corner, only no one was looking for it, but the
> reality is that kernel bypass has been a solution for years for high
> performance users,

I never said that it's trivial.

It's probably a lot of work. It's definitely more work than just abusing
sysfs.

But it looks like a write system call into an eventfd is about 1.5
microseconds on my laptop. Even with a system call per packet, system
call overhead is not what makes DPDK drivers outperform Linux ones.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:55                                               ` Michael S. Tsirkin
@ 2015-10-01  9:59                                                 ` Avi Kivity
  2015-10-01 10:38                                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-10-01  9:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
>> It's easy to claim that
>> a solution is around the corner, only no one was looking for it, but the
>> reality is that kernel bypass has been a solution for years for high
>> performance users,
> I never said that it's trivial.
>
> It's probably a lot of work. It's definitely more work than just abusing
> sysfs.
>
> But it looks like a write system call into an eventfd is about 1.5
> microseconds on my laptop. Even with a system call per packet, system
> call overhead is not what makes DPDK drivers outperform Linux ones.
>

1.5 us = 0.6 Mpps per core limit.  dpdk performance is in the tens of 
millions of packets per second per system.

It's not just the lack of system calls, of course, the architecture is 
completely different.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:38                                               ` Avi Kivity
@ 2015-10-01 10:07                                                 ` Michael S. Tsirkin
  2015-10-01 10:11                                                   ` Avi Kivity
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 10:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 12:38:51PM +0300, Avi Kivity wrote:
> The sad thing is that you can do this since forever on a non-virtualized
> system, or on a virtualized system if you don't need interrupt support.  All
> you're doing is blocking interrupt support on virtualized systems.

True, Linux could do more to prevent this kind of abuse.
In fact IIRC, if you enable secureboot, it does exactly that.

A generic uio driver isn't a good interface because it relies on these
sysfs files. We are lucky it doesn't work for VFs; I don't think we
should do anything that relies on this interface in future applications.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:07                                                 ` Michael S. Tsirkin
@ 2015-10-01 10:11                                                   ` Avi Kivity
  0 siblings, 0 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 10:11 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 01:07 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:38:51PM +0300, Avi Kivity wrote:
>> The sad thing is that you can do this since forever on a non-virtualized
>> system, or on a virtualized system if you don't need interrupt support.  All
>> you're doing is blocking interrupt support on virtualized systems.
> True, Linux could do more to prevent this kind of abuse.
> In fact IIRC, if you enable secureboot, it does exactly that.
>
> A generic uio driver isn't a good interface because it relies on these
> sysfs files. We are lucky it doesn't work for VFs; I don't think we
> should do anything that relies on this interface in future applications.
>

I agree that uio is not a good solution.  But for some users, which we 
are discussing now, it is the only solution.

A bad solution is better than no solution.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:43                                                 ` Avi Kivity
  2015-10-01  9:48                                                   ` Vincent JARDIN
@ 2015-10-01 10:14                                                   ` Michael S. Tsirkin
  2015-10-01 10:23                                                     ` Avi Kivity
  2015-10-01 14:55                                                     ` Stephen Hemminger
  1 sibling, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 10:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
> >There were some attempts to get it for other (older) drivers, named
> >'bifurcated drivers', but the effort has stalled.
> 
> IIRC they still exposed the ring to userspace.

How much would a ring write syscall cost? 1-2 microseconds, isn't it?
Measurable, but it's not the end of the world.
Ring read might be safe to allow.
Should buy us enough time until hypervisors support IOMMU.
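
For concreteness, such an interface might look something like this
(entirely hypothetical; neither this ioctl nor the struct exists):

#include <stdint.h>
#include <sys/ioctl.h>

/* userspace fills a descriptor; the kernel validates the DMA
 * address against the pinned region before writing the real ring */
struct ring_write_req {
	uint32_t slot;     /* descriptor index in the TX ring */
	uint64_t dma_addr;
	uint32_t len;
};

#define UIO_RING_WRITE _IOW('u', 0x40, struct ring_write_req)

/* usage: ioctl(uio_fd, UIO_RING_WRITE, &req); */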

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:53                                                 ` Avi Kivity
@ 2015-10-01 10:17                                                   ` Michael S. Tsirkin
  2015-10-01 10:24                                                     ` Avi Kivity
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 10:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote:
> Non-virtualized setups have an iommu available, but they can also use
> uio_pci_generic without patching if they like.

Not with VFs, they can't.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:14                                                   ` Michael S. Tsirkin
@ 2015-10-01 10:23                                                     ` Avi Kivity
  2015-10-01 14:55                                                     ` Stephen Hemminger
  1 sibling, 0 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 10:23 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 01:14 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
>>> There were some attempts to get it for other (older) drivers, named
>>> 'bifurcated drivers', but the effort has stalled.
>> IIRC they still exposed the ring to userspace.
> How much would a ring write syscall cost? 1-2 microseconds, isn't it?
> Measurable, but it's not the end of the world.

Plus a page table walk per packet fragment (dpdk has the physical 
address prepared in the mbuf IIRC).  The 10Mpps+ users of dpdk should 
comment on whether the performance hit is acceptable (my use case is 
much more modest).

> ring read might be safe to allow.
> Should buy us enough time until hypervisors support IOMMU.

All the relevant drivers need to be converted to support ring 
translation, and to expose the ring to userspace in the new API.  It 
shouldn't take more than 3-4 years.

Meanwhile, users of virtualized systems that need interrupt support 
cannot use their machines, while non-virtualized users are free to DMA 
wherever they like, in the name of security.

btw, an API like you describe already exists -- vhost.  Of course the 
virtio protocol is nowhere near fast enough, but at least it's an example.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:17                                                   ` Michael S. Tsirkin
@ 2015-10-01 10:24                                                     ` Avi Kivity
  2015-10-01 10:25                                                       ` Avi Kivity
  0 siblings, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 10:24 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 10/01/2015 01:17 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote:
>> Non-virtualized setups have an iommu available, but they can also use
>> uio_pci_generic without patching if they like.
> Not with VFs, they can't.
>

They can and they do (I use it myself).

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:24                                                     ` Avi Kivity
@ 2015-10-01 10:25                                                       ` Avi Kivity
  2015-10-01 10:44                                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 10:25 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 01:24 PM, Avi Kivity wrote:
> On 10/01/2015 01:17 PM, Michael S. Tsirkin wrote:
>> On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote:
>>> Non-virtualized setups have an iommu available, but they can also use
>>> uio_pci_generic without patching if they like.
>> Not with VFs, they can't.
>>
>
> They can and they do (I use it myself).

I mean with a PF.  Why use a VF on a non-virtualized system?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:59                                                 ` Avi Kivity
@ 2015-10-01 10:38                                                   ` Michael S. Tsirkin
  2015-10-01 10:50                                                     ` Avi Kivity
  2015-10-01 11:08                                                     ` Bruce Richardson
  0 siblings, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 10:38 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> 
> 
> On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> >>It's easy to claim that
> >>a solution is around the corner, only no one was looking for it, but the
> >>reality is that kernel bypass has been a solution for years for high
> >>performance users,
> >I never said that it's trivial.
> >
> >It's probably a lot of work. It's definitely more work than just abusing
> >sysfs.
> >
> >But it looks like a write system call into an eventfd is about 1.5
> >microseconds on my laptop. Even with a system call per packet, system
> >call overhead is not what makes DPDK drivers outperform Linux ones.
> >
> 
> 1.5 us = 0.6 Mpps per core limit.

Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
But for RX, you can batch a lot of packets.

You can see by now I'm not that good at benchmarking.
Here's what I wrote:


#include <stdbool.h>
#include <sys/eventfd.h>
#include <inttypes.h>
#include <unistd.h>


int main(int argc, char **argv)
{
        int e = eventfd(0, 0);
        uint64_t v = 1;

        int i;

        for (i = 0; i < 10000000; ++i) {
                write(e, &v, sizeof v);
        }
}


This takes 1.5 seconds to run on my laptop:

$ time ./a.out 

real    0m1.507s
user    0m0.179s
sys     0m1.328s


> dpdk performance is in the tens of
> millions of packets per second per system.

I think that's with a bunch of batching though.

> It's not just the lack of system calls, of course, the architecture is
> completely different.

Absolutely - I'm not saying move all of DPDK into kernel.
We just need to protect the RX rings so hardware does
not corrupt kernel memory.


Thinking about it some more, many devices
have separate rings for DMA: TX (device reads memory)
and RX (device writes memory).
With such devices, a mode where userspace can write TX ring
but not RX ring might make sense.

This will mean userspace might read kernel memory
through the device, but can not corrupt it.

That's already a big win!

And RX buffers do not have to be added one at a time.
If we assume 0.2usec per system call, batching some 100 buffers per
system call gives you 2 nanoseconds of overhead per buffer.  That seems
quite reasonable.







-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:25                                                       ` Avi Kivity
@ 2015-10-01 10:44                                                         ` Michael S. Tsirkin
  2015-10-01 10:55                                                           ` Avi Kivity
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 10:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 01:25:17PM +0300, Avi Kivity wrote:
> Why use a VF on a non-virtualized system?

1. So a userspace bug does not destroy your hardware
   (PFs generally assume trusted non-buggy drivers, VFs
    generally don't).
2. So you can use a PF or another VF for regular networking.
3. So you can manage this system, to some level.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:38                                                   ` Michael S. Tsirkin
@ 2015-10-01 10:50                                                     ` Avi Kivity
  2015-10-01 11:09                                                       ` Michael S. Tsirkin
  2015-10-01 11:08                                                     ` Bruce Richardson
  1 sibling, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 10:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 01:38 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
>>
>> On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
>>>> It's easy to claim that
>>>> a solution is around the corner, only no one was looking for it, but the
>>>> reality is that kernel bypass has been a solution for years for high
>>>> performance users,
>>> I never said that it's trivial.
>>>
>>> It's probably a lot of work. It's definitely more work than just abusing
>>> sysfs.
>>>
>>> But it looks like a write system call into an eventfd is about 1.5
>>> microseconds on my laptop. Even with a system call per packet, system
>>> call overhead is not what makes DPDK drivers outperform Linux ones.
>>>
>> 1.5 us = 0.6 Mpps per core limit.
> Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.

You also trimmed the extra work that needs to be done, which I 
mentioned.  Maybe your ring proxy can work, maybe it can't.  In any case 
it's a hefty chunk of work.  Should this work block users from using 
their VFs, if they happen to need interrupt support?

> But for RX, you can batch a lot of packets.
>
> You can see by now I'm not that good at benchmarking.
> Here's what I wrote:
>
>
> #include <stdbool.h>
> #include <sys/eventfd.h>
> #include <inttypes.h>
> #include <unistd.h>
>
>
> int main(int argc, char **argv)
> {
>          int e = eventfd(0, 0);
>          uint64_t v = 1;
>
>          int i;
>
>          for (i = 0; i < 10000000; ++i) {
>                  write(e, &v, sizeof v);
>          }
> }
>
>
> This takes 1.5 seconds to run on my laptop:
>
> $ time ./a.out
>
> real    0m1.507s
> user    0m0.179s
> sys     0m1.328s
>
>
>> dpdk performance is in the tens of
>> millions of packets per second per system.
> I think that's with a bunch of batching though.

Yes, and that's with their application code running as well.  They didn't 
reach this kind of performance by spending cycles unnecessarily.

I'm not saying that the ring proxy is not workable; just that we don't 
know whether it is or not, and I don't think that a patch that enables 
_existing functionality_ for VFs should be blocked in favor of a new and 
unproven approach.

>
>> It's not just the lack of system calls, of course, the architecture is
>> completely different.
> Absolutely - I'm not saying move all of DPDK into kernel.
> We just need to protect the RX rings so hardware does
> not corrupt kernel memory.
>
>
> Thinking about it some more, many devices
> have separate rings for DMA: TX (device reads memory)
> and RX (device writes memory).
> With such devices, a mode where userspace can write TX ring
> but not RX ring might make sense.

I'm sure you can cause havoc just by reading, if you read from I/O memory.

>
> This will mean userspace might read kernel memory
> through the device, but can not corrupt it.
>
> That's already a big win!
>
> And RX buffers do not have to be added one at a time.
> If we assume 0.2usec per system call, batching some 100 buffers per
> system call gives you 2 nanoseconds of overhead per buffer.  That seems
> quite reasonable.

You're ignoring the page table walk and other per-descriptor processing.

Again^2, maybe this can work.  But it shouldn't block a patch enabling 
interrupt support of VFs.  After the ring proxy is available and proven 
for a few years, we can deprecate bus mastering from uio, and after a 
few more years remove it.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:44                                                         ` Michael S. Tsirkin
@ 2015-10-01 10:55                                                           ` Avi Kivity
  0 siblings, 0 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 10:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 01:44 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 01:25:17PM +0300, Avi Kivity wrote:
>> Why use a VF on a non-virtualized system?
> 1. So a userspace bug does not destroy your hardware
>     (PFs generally assume trusted non-buggy drivers, VFs
>      generally don't).

People who use dpdk trust their drivers (those drivers are the reason 
for the system to exist in the first place).

> 2. So you can use a PF or another VF for regular networking.

This is valid, but usually those systems have a separate management 
network.  Unfortunately VFs limit the number of queues you can expose, 
making them less performant than PFs.

The "bifurcated drivers" were meant as a way of enabling this 
functionality without resorting to VFs, but it seems they are stalled.

> 3. So you can manage this system, to some level.
>

Again, existing practice doesn't follow this.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:38                                                   ` Michael S. Tsirkin
  2015-10-01 10:50                                                     ` Avi Kivity
@ 2015-10-01 11:08                                                     ` Bruce Richardson
  2015-10-01 11:23                                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Bruce Richardson @ 2015-10-01 11:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, Avi Kivity

On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > 
> > 
> > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > >>It's easy to claim that
> > >>a solution is around the corner, only no one was looking for it, but the
> > >>reality is that kernel bypass has been a solution for years for high
> > >>performance users,
> > >I never said that it's trivial.
> > >
> > >It's probably a lot of work. It's definitely more work than just abusing
> > >sysfs.
> > >
> > >But it looks like a write system call into an eventfd is about 1.5
> > >microseconds on my laptop. Even with a system call per packet, system
> > >call overhead is not what makes DPDK drivers outperform Linux ones.
> > >
> > 
> > 1.5 us = 0.6 Mpps per core limit.
> 
> Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
> But for RX, you can batch a lot of packets.
> 
> You can see by now I'm not that good at benchmarking.
> Here's what I wrote:
> 
> 
> #include <stdbool.h>
> #include <sys/eventfd.h>
> #include <inttypes.h>
> #include <unistd.h>
> 
> 
> int main(int argc, char **argv)
> {
>         int e = eventfd(0, 0);
>         uint64_t v = 1;
> 
>         int i;
> 
>         for (i = 0; i < 10000000; ++i) {
>                 write(e, &v, sizeof v);
>         }
> }
> 
> 
> This takes 1.5 seconds to run on my laptop:
> 
> $ time ./a.out 
> 
> real    0m1.507s
> user    0m0.179s
> sys     0m1.328s
> 
> 
> > dpdk performance is in the tens of
> > millions of packets per second per system.
> 
> I think that's with a bunch of batching though.
> 
> > It's not just the lack of system calls, of course, the architecture is
> > completely different.
> 
> Absolutely - I'm not saying move all of DPDK into kernel.
> We just need to protect the RX rings so hardware does
> not corrupt kernel memory.
> 
> 
> Thinking about it some more, many devices
> have separate rings for DMA: TX (device reads memory)
> and RX (device writes memory).
> With such devices, a mode where userspace can write TX ring
> but not RX ring might make sense.
> 
> This will mean userspace might read kernel memory
> through the device, but can not corrupt it.
> 
> That's already a big win!
> 
> And RX buffers do not have to be added one at a time.
> If we assume 0.2usec per system call, batching some 100 buffers per
> system call gives you 2 nanoseconds of overhead per buffer.  That seems
> quite reasonable.
> 
Hi,

just to jump in a bit on this.

Batching of 100 packets is a very large batch, and will add to latency. The
standard batch size in DPDK right now is 32, and even that may be too high for
applications in certain domains.

However, even with that 2ns of overhead calculation, I'd make a few additional
points.
* For DPDK, we are reasonably close to being able to do 40Gb of IO - both RX 
and TX on a single thread. 10Gb of IO doesn't really stress a core any more. For
40Gb of small packet traffic, packets arrive every 16.8ns, so even with a huge
batch size of 100 packets, your system call overhead on RX is taking almost 12%
of our processing time (the arithmetic is spelled out after this list). For a
batch size of 32 this overhead would rise to over 35% of our packet processing
time. For 100G line rate, packets arrive just 6.7ns apart...

* As well as this overhead from the system call itself, you are also omitting
the overhead of scanning the RX descriptors. This in itself is going to use up
a good proportion of the processing time, on top of the cycles we have to spend
copying the descriptors from one ring in memory to another. Given that right now
with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen
cycles on modern cores, every additional cycle (fraction of a nanosecond) has
an impact.
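
For reference, here is the arithmetic behind those percentages as a small
program (assuming 64-byte frames, i.e. 84 bytes on the wire with preamble
and inter-frame gap, and the 0.2us-per-syscall figure from upthread):

#include <stdio.h>

int main(void)
{
	double ns_per_pkt = 84 * 8 / 40e9 * 1e9; /* 16.8ns at 40Gb/s */
	double syscall_ns = 200.0;               /* 0.2us per call */

	/* per-packet syscall cost as a share of the packet budget */
	printf("batch 100: %.0f%%\n",
	       100.0 * (syscall_ns / 100) / ns_per_pkt); /* ~12% */
	printf("batch  32: %.0f%%\n",
	       100.0 * (syscall_ns / 32) / ns_per_pkt);  /* ~37% */
	return 0;
}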

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:50                                                     ` Avi Kivity
@ 2015-10-01 11:09                                                       ` Michael S. Tsirkin
  2015-10-01 11:20                                                         ` Avi Kivity
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 11:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
> >
> >>It's not just the lack of system calls, of course, the architecture is
> >>completely different.
> >Absolutely - I'm not saying move all of DPDK into kernel.
> >We just need to protect the RX rings so hardware does
> >not corrupt kernel memory.
> >
> >
> >Thinking about it some more, many devices
> >have separate rings for DMA: TX (device reads memory)
> >and RX (device writes memory).
> >With such devices, a mode where userspace can write TX ring
> >but not RX ring might make sense.
> 
> I'm sure you can cause havoc just by reading, if you read from I/O memory.

Not talking about I/O memory here. These are device rings in RAM.

> >
> >This will mean userspace might read kernel memory
> >through the device, but can not corrupt it.
> >
> >That's already a big win!
> >
> >And RX buffers do not have to be added one at a time.
> >If we assume 0.2usec per system call, batching some 100 buffers per
> >system call gives you 2 nanoseconds of overhead per buffer.  That seems
> >quite reasonable.
> 
> You're ignoring the page table walk

Some caching strategy might work here.

> and other per-descriptor processing.

You probably can let userspace pre-format it all,
just validate addresses.
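
A sketch of that validation step (the descriptor layout and region
bounds here are invented, not from any real driver):

#include <linux/types.h>

struct user_rx_desc {
	u64 dma_addr;
	u32 len;
};

/* reject any userspace-prepared RX descriptor whose buffer does
 * not sit entirely inside the pinned DMA region */
static bool desc_valid(const struct user_rx_desc *d,
		       u64 region_start, u64 region_len)
{
	return d->len > 0 && d->len <= region_len &&
	       d->dma_addr >= region_start &&
	       d->dma_addr - region_start <= region_len - d->len;
}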

> Again^2, maybe this can work.  But it shouldn't block a patch enabling
> interrupt support of VFs.  After the ring proxy is available and proven for
> a few years, we can deprecate bus mastering from uio, and after a few more
> years remove it.

We are talking about DPDK patches posted in June 2015.  It's not some
software proven for years.  If Linux keeps enabling hacks, no one will
bother doing the right thing.  Upstream inclusion is the only carrot
Linux has to make people do the right thing.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 11:09                                                       ` Michael S. Tsirkin
@ 2015-10-01 11:20                                                         ` Avi Kivity
  2015-10-01 11:27                                                           ` Michael S. Tsirkin
  2015-10-01 11:31                                                           ` Michael S. Tsirkin
  0 siblings, 2 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 11:20 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
>>>> It's not just the lack of system calls, of course, the architecture is
>>>> completely different.
>>> Absolutely - I'm not saying move all of DPDK into kernel.
>>> We just need to protect the RX rings so hardware does
>>> not corrupt kernel memory.
>>>
>>>
>>> Thinking about it some more, many devices
>>> have separate rings for DMA: TX (device reads memory)
>>> and RX (device writes memory).
>>> With such devices, a mode where userspace can write TX ring
>>> but not RX ring might make sense.
>> I'm sure you can cause havoc just by reading, if you read from I/O memory.
> Not talking about I/O memory here. These are device rings in RAM.

Right.  But you program them with DMA addresses, so the device can read 
another device's memory.

>>> This will mean userspace might read kernel memory
>>> through the device, but can not corrupt it.
>>>
>>> That's already a big win!
>>>
>>> And RX buffers do not have to be added one at a time.
>>> If we assume 0.2usec per system call, batching some 100 buffers per
>>> system call gives you 2 nanoseconds of overhead per buffer.  That seems
>>> quite reasonable.
>> You're ignoring the page table walk
> Some caching strategy might work here.

It may, or it may not.  I'm not against this.  I'm against blocking 
users' access to their hardware, using an existing, established 
interface, for a small subset of setups.  It doesn't help you in any way 
(you can still get reports of oopses due to buggy userspace drivers on 
physical machines, or on virtual machines that don't require 
interrupts), and it harms them.

>> and other per-descriptor processing.
> You probably can let userspace pre-format it all,
> just validate addresses.

You have to figure out if the descriptor contains an address or not 
(many devices have several descriptor formats, some with addresses and 
some without, which are intermixed).  You also have to parse the 
descriptor size and see if it crosses a page boundary or not.

>
>> Again^2, maybe this can work.  But it shouldn't block a patch enabling
>> interrupt support of VFs.  After the ring proxy is available and proven for
>> a few years, we can deprecate bus mastering from uio, and after a few more
>> years remove it.
> We are talking about DPDK patches posted in June 2015.  It's not some
> software proven for years.

dpdk has been used for years; it just won't work on VFs if you need 
interrupt support.

>    If Linux keeps enabling hacks, no one will
> bother doing the right thing.  Upstream inclusion is the only carrot
> Linux has to make people do the right thing.

It's not a carrot, it's a stick.  Implementing your scheme will take a 
huge effort, is not guaranteed to provide the performance needed, and 
will not be available for years.  Meanwhile exactly the same thing on 
physical machines is supported.

People will just use out of tree drivers (dpdk has several already).  
It's a pain, but nowhere near what you are proposing.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 11:08                                                     ` Bruce Richardson
@ 2015-10-01 11:23                                                       ` Michael S. Tsirkin
  2015-10-01 12:07                                                         ` Bruce Richardson
  0 siblings, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 11:23 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, Avi Kivity

On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote:
> On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > > 
> > > 
> > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > > >>It's easy to claim that
> > > >>a solution is around the corner, only no one was looking for it, but the
> > > >>reality is that kernel bypass has been a solution for years for high
> > > >>performance users,
> > > >I never said that it's trivial.
> > > >
> > > >It's probably a lot of work. It's definitely more work than just abusing
> > > >sysfs.
> > > >
> > > >But it looks like a write system call into an eventfd is about 1.5
> > > >microseconds on my laptop. Even with a system call per packet, system
> > > >call overhead is not what makes DPDK drivers outperform Linux ones.
> > > >
> > > 
> > > 1.5 us = 0.6 Mpps per core limit.
> > 
> > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
> > But for RX, you can batch a lot of packets.
> > 
> > You can see by now I'm not that good at benchmarking.
> > Here's what I wrote:
> > 
> > 
> > #include <stdbool.h>
> > #include <sys/eventfd.h>
> > #include <inttypes.h>
> > #include <unistd.h>
> > 
> > 
> > int main(int argc, char **argv)
> > {
> >         int e = eventfd(0, 0);
> >         uint64_t v = 1;
> > 
> >         int i;
> > 
> >         for (i = 0; i < 10000000; ++i) {
> >                 write(e, &v, sizeof v);
> >         }
> > }
> > 
> > 
> > This takes 1.5 seconds to run on my laptop:
> > 
> > $ time ./a.out 
> > 
> > real    0m1.507s
> > user    0m0.179s
> > sys     0m1.328s
> > 
> > 
> > > dpdk performance is in the tens of
> > > millions of packets per second per system.
> > 
> > I think that's with a bunch of batching though.
> > 
> > > It's not just the lack of system calls, of course, the architecture is
> > > completely different.
> > 
> > Absolutely - I'm not saying move all of DPDK into kernel.
> > We just need to protect the RX rings so hardware does
> > not corrupt kernel memory.
> > 
> > 
> > Thinking about it some more, many devices
> > have separate rings for DMA: TX (device reads memory)
> > and RX (device writes memory).
> > With such devices, a mode where userspace can write TX ring
> > but not RX ring might make sense.
> > 
> > This will mean userspace might read kernel memory
> > through the device, but can not corrupt it.
> > 
> > That's already a big win!
> > 
> > And RX buffers do not have to be added one at a time.
> > If we assume 0.2usec per system call, batching some 100 buffers per
> system call gives you 2 nanoseconds of overhead per buffer.  That seems
> quite reasonable.
> > 
> Hi,
> 
> just to jump in a bit on this.
> 
> Batching of 100 packets is a very large batch, and will add to latency.



This is not on the transmit or receive path!
This is only for re-adding buffers to the receive ring.
This batching should not add latency at all:


process rx:
	get packet
	packets[n] = alloc packet
	if (++n >= 100) {
		system call: add_bufs(packets, n);
		n = 0;
	}





> The
> standard batch size in DPDK right now is 32, and even that may be too high for
> applications in certain domains.
> 
> However, even with that 2ns of overhead calculation, I'd make a few additional
> points.
> * For DPDK, we are reasonably close to being able to do 40Gb of IO - both RX 
> and TX on a single thread. 10Gb of IO doesn't really stress a core any more. For
> 40Gb of small packet traffic, packets arrive every 16.8ns, so even with a huge
> batch size of 100 packets, your system call overhead on RX is taking almost 12%
> of our processing time. For a batch size of 32 this overhead would rise to
> over 35% of our packet processing time.

As I said, yes, measurable, but not breaking the bank, and that's with
40Gb, which is still not widespread.
With 10Gb and 100 packets, it's only 3% overhead.

> For 100G line rate, packets arrive
> just 6.7ns apart...

Hypervisors still have time to get their act together and support IOMMUs
by the time 100G systems become widespread.

> * As well as this overhead from the system call itself, you are also omitting
> the overhead of scanning the RX descriptors.

I omit it because scanning descriptors can still be done in userspace,
just write-protect the RX ring page.
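
On the userspace side that could be as little as the kernel handing out
a read-only mapping of the ring; a sketch (the fd and ring size are
hypothetical):

#include <err.h>
#include <stddef.h>
#include <sys/mman.h>

void *map_rx_ring_readonly(int uio_fd, size_t ring_bytes)
{
	/* the app can scan completed descriptors, but any attempt
	 * to rewrite a DMA address faults instead of reaching HW */
	void *ring = mmap(NULL, ring_bytes, PROT_READ,
			  MAP_SHARED, uio_fd, 0);
	if (ring == MAP_FAILED)
		err(1, "mmap rx ring");
	return ring;
}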


> This in itself is going to use up
> a good proportion of the processing time, on top of the cycles we have to spend
> copying the descriptors from one ring in memory to another. Given that right now
> with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen
> cycles on modern cores, every additional cycle (fraction of a nanosecond) has
> an impact.
> 
> Regards,
> /Bruce

See above.  There is no need for that on the data path. Only re-adding
buffers requires a system call.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 11:20                                                         ` Avi Kivity
@ 2015-10-01 11:27                                                           ` Michael S. Tsirkin
  2015-10-01 11:32                                                             ` Avi Kivity
  2015-10-01 11:31                                                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 11:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
> People will just use out of tree drivers (dpdk has several already).  It's a
> pain, but nowhere near what you are proposing.

What's the issue with that? We already agreed this kernel
is going to be tainted, and unsupportable.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 11:20                                                         ` Avi Kivity
  2015-10-01 11:27                                                           ` Michael S. Tsirkin
@ 2015-10-01 11:31                                                           ` Michael S. Tsirkin
  2015-10-01 11:34                                                             ` Avi Kivity
  1 sibling, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 11:31 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
> 
> 
> On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
> >>>>It's not just the lack of system calls, of course, the architecture is
> >>>>completely different.
> >>>Absolutely - I'm not saying move all of DPDK into kernel.
> >>>We just need to protect the RX rings so hardware does
> >>>not corrupt kernel memory.
> >>>
> >>>
> >>>Thinking about it some more, many devices
> >>>have separate rings for DMA: TX (device reads memory)
> >>>and RX (device writes memory).
> >>>With such devices, a mode where userspace can write TX ring
> >>>but not RX ring might make sense.
> >>I'm sure you can cause havoc just by reading, if you read from I/O memory.
> >Not talking about I/O memory here. These are device rings in RAM.
> 
> Right.  But you program them with DMA addresses, so the device can read
> another device's memory.

It can't if the host has limited it to DMA only into guest RAM, which is
pretty common.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 11:27                                                           ` Michael S. Tsirkin
@ 2015-10-01 11:32                                                             ` Avi Kivity
  2015-10-01 15:01                                                               ` Stephen Hemminger
  2015-10-01 15:11                                                               ` Michael S. Tsirkin
  0 siblings, 2 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 11:32 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
>> People will just use out of tree drivers (dpdk has several already).  It's a
>> pain, but nowhere near what you are proposing.
> What's the issue with that?

Out of tree drivers have to be compiled on the target system (cannot 
ship a binary package), and occasionally break.

dkms helps with that, as do distributions that promise binary 
compatibility, but it is still a pain, compared to just shipping a 
userspace package.

>   We already agreed this kernel
> is going to be tainted, and unsupportable.

Yes.  So your only motivation in rejecting the patch is to get the 
author to write the ring translation patch and port it to all relevant 
drivers instead?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 11:31                                                           ` Michael S. Tsirkin
@ 2015-10-01 11:34                                                             ` Avi Kivity
  0 siblings, 0 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 11:34 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev



On 10/01/2015 02:31 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
>>
>> On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
>>>>>> It's not just the lack of system calls, of course, the architecture is
>>>>>> completely different.
>>>>> Absolutely - I'm not saying move all of DPDK into kernel.
>>>>> We just need to protect the RX rings so hardware does
>>>>> not corrupt kernel memory.
>>>>>
>>>>>
>>>>> Thinking about it some more, many devices
>>>>> have separate rings for DMA: TX (device reads memory)
>>>>> and RX (device writes memory).
>>>>> With such devices, a mode where userspace can write TX ring
>>>>> but not RX ring might make sense.
>>>> I'm sure you can cause havoc just by reading, if you read from I/O memory.
>>> Not talking about I/O memory here. These are device rings in RAM.
>> Right.  But you program them with DMA addresses, so the device can read
>> another device's memory.
> It can't if the host has limited it to only DMA into guest RAM, which is
> pretty common.
>

Ok.  So yes, the tx ring can be mapped R/W.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 11:23                                                       ` Michael S. Tsirkin
@ 2015-10-01 12:07                                                         ` Bruce Richardson
  2015-10-01 13:14                                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Bruce Richardson @ 2015-10-01 12:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, Avi Kivity

On Thu, Oct 01, 2015 at 02:23:17PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote:
> > On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > > > 
> > > > 
> > > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > > > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > > > >>It's easy to claim that
> > > > >>a solution is around the corner, only no one was looking for it, but the
> > > > >>reality is that kernel bypass has been a solution for years for high
> > > > >>performance users,
> > > > >I never said that it's trivial.
> > > > >
> > > > >It's probably a lot of work. It's definitely more work than just abusing
> > > > >sysfs.
> > > > >
> > > > >But it looks like a write system call into an eventfd is about 1.5
> > > > >microseconds on my laptop. Even with a system call per packet, system
> > > > >call overhead is not what makes DPDK drivers outperform Linux ones.
> > > > >
> > > > 
> > > > 1.5 us = 0.6 Mpps per core limit.
> > > 
> > > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
> > > But for RX, you can batch a lot of packets.
> > > 
> > > You can see by now I'm not that good at benchmarking.
> > > Here's what I wrote:
> > > 
> > > 
> > > #include <stdbool.h>
> > > #include <sys/eventfd.h>
> > > #include <inttypes.h>
> > > #include <unistd.h>
> > > 
> > > 
> > > int main(int argc, char **argv)
> > > {
> > >         int e = eventfd(0, 0);
> > >         uint64_t v = 1;
> > > 
> > >         int i;
> > > 
> > >         for (i = 0; i < 10000000; ++i) {
> > >                 write(e, &v, sizeof v);
> > >         }
> > > }
> > > 
> > > 
> > > This takes 1.5 seconds to run on my laptop:
> > > 
> > > $ time ./a.out 
> > > 
> > > real    0m1.507s
> > > user    0m0.179s
> > > sys     0m1.328s
> > > 
> > > 
> > > > dpdk performance is in the tens of
> > > > millions of packets per system.
> > > 
> > > I think that's with a bunch of batching though.
> > > 
> > > > It's not just the lack of system calls, of course, the architecture is
> > > > completely different.
> > > 
> > > Absolutely - I'm not saying move all of DPDK into kernel.
> > > We just need to protect the RX rings so hardware does
> > > not corrupt kernel memory.
> > > 
> > > 
> > > Thinking about it some more, many devices
> > > have separate rings for DMA: TX (device reads memory)
> > > and RX (device writes memory).
> > > With such devices, a mode where userspace can write TX ring
> > > but not RX ring might make sense.
> > > 
> > > This will mean userspace might read kernel memory
> > > through the device, but can not corrupt it.
> > > 
> > > That's already a big win!
> > > 
> > > And RX buffers do not have to be added one at a time.
> > > If we assume 0.2usec per system call, batching some 100 buffers per
> > > system call gives you 2 nanoseconds of overhead per packet.  That
> > > seems quite reasonable.
> > > 
> > Hi,
> > 
> > just to jump in a bit on this.
> > 
> > Batching of 100 packets is a very large batch, and will add to latency.
> 
> 
> 
> This is not on transmit or receive path!
> This is only for re-adding buffers to the receive ring.
> This batching should not add latency at all:
> 
> 
> process rx:
> 	get packet
> 	packets[n] = alloc packet
> 	if (++n > 100) {
> 		system call: add bufs(packets, n);
> 	}
> 
> 
> 
> 
> 
> > The
> > standard batch size in DPDK right now is 32, and even that may be too high for
> > applications in certain domains.
> > 
> > However, even with that 2ns of overhead calculation, I'd make a few additional
> > points.
> > * For DPDK, we are reasonably close to being able to do 40GB of IO - both RX 
> > and TX on a single thread. 10GB of IO doesn't really stress a core any more. For
> > 40GB of small packet traffic, the packet arrival rate is 16.8ns, so even with a
> > huge batch size of 100 packets, your system call overhead on RX is taking almost
> > 12% of our processing time. For a batch size of 32 this overhead would rise to
> > over 35% of our packet processing time.
> 
> As I said, yes, measurable, but not breaking the bank, and that's with
> 40GB, which still is not widespread.
> With 10GB and 100 packets, only 3% overhead.
> 
> > For 100G line rate, the packet arrival
> > rate is just 6.7ns...
> 
> Hypervisors still have time to get their act together and support IOMMUs
> by the time 100G systems become widespread.
> 
> > * As well as this overhead from the system call itself, you are also omitting
> > the overhead of scanning the RX descriptors.
> 
> I omit it because scanning descriptors can still be done in userspace,
> just write-protect the RX ring page.
> 
> 
> > This in itself is going to use up
> > a good proportion of the processing time, as well as that we have to spend cycles
> > copying the descriptors from one ring in memory to another. Given that right now
> > with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen
> > cycles on modern cores, every additional cycle (fraction of a nanosecond) has
> > an impact.
> > 
> > Regards,
> > /Bruce
> 
> See above.  There is no need for that on data path. Only re-adding
> buffers requires a system call.
> 

Re-adding buffers is a key part of the data path! Ok, the fact that its only on
descriptor rearm does allow somewhat bigger batches, but the whole point of having
the kernel do this extra work you propose is to allow the kernel to scan and
sanitize the physical addresses - and that will take a lot of cycles, especially
if it has to handle all the different descriptor formats of all the different NICs,
as has already been pointed out.

/Bruce

> -- 
> MST

^ permalink raw reply	[flat|nested] 100+ messages in thread
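
For concreteness, the batching loop in the pseudocode above might look like
this in userspace C. This is a minimal sketch only: add_bufs() stands in for a
hypothetical kernel interface that validates and re-posts RX buffers (no such
syscall exists today), and the packet helpers are application-defined.

#include <stddef.h>

#define BATCH 100

/* hypothetical kernel interface: validates and re-posts RX buffers */
extern int add_bufs(int fd, void **bufs, size_t n);

/* application-defined helpers, assumed for this sketch */
extern void *receive_packet(void);
extern void process_packet(void *pkt);
extern void *alloc_buffer(void);

void rx_loop(int fd)
{
        void *batch[BATCH];
        size_t n = 0;

        for (;;) {
                void *pkt = receive_packet();   /* poll the RX ring directly */
                process_packet(pkt);

                batch[n++] = alloc_buffer();    /* replacement buffer */
                if (n == BATCH) {
                        add_bufs(fd, batch, n); /* one syscall per 100 buffers */
                        n = 0;                  /* reset the counter after flushing */
                }
        }
}

With a 0.2usec syscall amortized over 100 buffers, this is where the
2-nanoseconds-per-packet figure above comes from.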

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 12:07                                                         ` Bruce Richardson
@ 2015-10-01 13:14                                                           ` Michael S. Tsirkin
  2015-10-01 16:04                                                             ` Michael S. Tsirkin
  2015-10-01 21:02                                                             ` Alexander Duyck
  0 siblings, 2 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 13:14 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, Avi Kivity

On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote:
> > > This in itself is going to use up
> > > a good proportion of the processing time, as well as that we have to spend cycles
> > > copying the descriptors from one ring in memory to another. Given that right now
> > > with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen
> > > cycles on modern cores, every additional cycle (fraction of a nanosecond) has
> > > an impact.
> > > 
> > > Regards,
> > > /Bruce
> > 
> > See above.  There is no need for that on data path. Only re-adding
> > buffers requires a system call.
> > 
> 
> Re-adding buffers is a key part of the data path! Ok, the fact that its only on
> descriptor rearm does allow somewhat bigger batches,

That was the point, yes.

> but the whole point of having
> the kernel do this extra work you propose is to allow the kernel to scan and
> sanitize the physical addresses - and that will take a lot of cycles, especially
> if it has to handle all the different descriptor formats of all the different NICs,
> as has already been pointed out.
> 
> /Bruce

Well, the driver would be per NIC, so there's only a need to support
the specific formats supported by a given NIC.

An alternative is to format the descriptors in kernel, based
on just the list of addresses. This seems cleaner, but I don't
know how efficient it would be.

Device vendors and dpdk developers are probably the best people to
figure out what's the best thing to do here.

But it looks like it's not going to happen unless security is made
a requirement for upstreaming code.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread
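
To make the "format the descriptors in kernel" alternative concrete, here is a
minimal sketch of what such an interface might look like, assuming userspace
registers a single DMA region up front and then re-posts buffers by offset.
Every name here is hypothetical; no such ABI exists.

#include <linux/types.h>

/*
 * Hypothetical ioctl payload: userspace supplies only offsets into the
 * memory region it registered at startup.  The per-NIC kernel driver
 * bounds-checks the offsets and writes the NIC-specific descriptor
 * format itself, so userspace never passes raw bus addresses.
 */
struct uio_rx_refill {
        __u32 count;      /* number of buffers being re-posted */
        __u32 buf_len;    /* uniform buffer length, checked by the kernel */
        __u64 offsets[];  /* offsets into the registered region */
};

/* usage (illustrative only): ioctl(uio_fd, UIO_RX_REFILL, req); */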

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  8:00                                           ` Vlad Zolotarov
@ 2015-10-01 14:47                                             ` Stephen Hemminger
  2015-10-01 15:03                                               ` Vlad Zolotarov
  0 siblings, 1 reply; 100+ messages in thread
From: Stephen Hemminger @ 2015-10-01 14:47 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev, Michael S. Tsirkin

On Thu, 1 Oct 2015 11:00:28 +0300
Vlad Zolotarov <vladz@cloudius-systems.com> wrote:

> 
> 
> On 10/01/15 00:36, Stephen Hemminger wrote:
> > On Wed, 30 Sep 2015 23:09:33 +0300
> > Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
> >
> >>
> >> On 09/30/15 22:39, Michael S. Tsirkin wrote:
> >>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> >>>>>> How would iommu
> >>>>>> virtualization change anything?
> >>>>> Kernel can use an iommu to limit device access to memory of
> >>>>> the controlling application.
> >>>> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
> >>>> interrupts support in uio_pci_generic? kernel may continue to limit the
> >>>> above access with this support as well.
> >>> It could maybe. So if you write a patch to allow MSI by at the same time
> >>> creating an isolated IOMMU group and blocking DMA from device in
> >>> question anywhere, that sounds reasonable.
> >> No, I'm only planning to add MSI and MSI-X interrupts support for
> >> uio_pci_generic device.
> >> The rest mentioned above should naturally be a matter of a different
> >> patch and writing it is orthogonal to the patch I'm working on as has
> >> been extensively discussed in this thread.
> >>
> > I have a generic MSI and MSI-X driver (posted earlier on this list).
> > About to post to upstream kernel.
> 
> Stephen, hi!
> 
> I found the mentioned series and the first thing I noticed was that it
> was sent in May, so the first question is how high on your list of tasks
> submitting it upstream is? We need it more or less yesterday and I'm
> working on it right now. Therefore if u don't have time for it I'd like
> to help... ;) However I'd like u to clarify a few small things. Pls.,
> see below...
> 
> I noticed that u've created a separate msi_msix driver and the second 
> question is what do u plan for the upstream? I was thinking of extending 
> the existing uio_pci_generic with the MSI-X functionality similar to 
> your code and preserving the INT#X functionality as it is now:

The igb_uio has a bunch of other things I didn't want to deal with:
the name (being specific to old Intel driver); compatibility with older
kernels; legacy ABI support. Therefore in effect uio_msi is a rebase
of igb_uio.

The submission upstream yesterday is the first step, I expect lots
of review feedback.

>   *   INT#X and MSI would provide the IRQ number to the UIO module while
>     only MSI-X case would register with UIO_IRQ_CUSTOM.

I wanted all IRQs to be the same for the driver, i.e. all going through
the eventfd mechanism. This makes the code on the DPDK side consistent,
with fewer special cases.

> I also noticed that u enable MSI-X on a first open() call. I assume 
> there was a good reason (that I miss) for not doing it in probe(). Could 
> u, pls., clarify?

^ permalink raw reply	[flat|nested] 100+ messages in thread
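
For reference, the "all IRQs go through the eventfd mechanism" scheme Stephen
describes boils down to signalling an eventfd from the interrupt handler. A
minimal kernel-side sketch, assuming one eventfd per MSI-X vector handed in
from userspace; this is not the posted uio_msi patch, and setup/teardown and
error handling are omitted.

#include <linux/eventfd.h>
#include <linux/interrupt.h>

struct msix_vec {
        struct eventfd_ctx *trigger;  /* eventfd supplied by userspace */
        int irq;                      /* Linux IRQ for this MSI-X vector */
};

static irqreturn_t msix_irq_handler(int irq, void *arg)
{
        struct msix_vec *vec = arg;

        /* wake userspace blocked in read()/poll()/epoll_wait() on the fd */
        eventfd_signal(vec->trigger, 1);
        return IRQ_HANDLED;
}

/*
 * Per vector, after pci_enable_msix_range():
 *      request_irq(vec->irq, msix_irq_handler, 0, "uio-msix", vec);
 */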

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:42                                               ` Vincent JARDIN
  2015-10-01  9:43                                                 ` Avi Kivity
@ 2015-10-01 14:54                                                 ` Stephen Hemminger
  1 sibling, 0 replies; 100+ messages in thread
From: Stephen Hemminger @ 2015-10-01 14:54 UTC (permalink / raw)
  To: Vincent JARDIN; +Cc: dev, Avi Kivity, Michael S. Tsirkin

On Thu, 01 Oct 2015 11:42:23 +0200
Vincent JARDIN <vincent.jardin@6wind.com> wrote:

> On 01/10/2015 11:22, Avi Kivity wrote:
> >> As far as I could see, without this kind of motivation, people do not
> >> even want to try.
> >
> > You are mistaken.  The problem is a lot harder than you think.
> >
> > People didn't go and write userspace drivers because they were lazy.
> > They wrote them because there was no other way.
> 
> I disagree, it is possible to write a 'partial' userspace driver.
> 
> Here it is an example:
>    http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4
> 
> It benefits from the kernel's capabilities while the userland manages only
> the I/Os.
> 
> There were some attempts to get it for other (older) drivers, named
> 'bifurcated drivers', but it is stalled.
> 

And in our testing the mlx4 driver performance is terrible.
That may be because of the overhead of the InfiniBand library.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 10:14                                                   ` Michael S. Tsirkin
  2015-10-01 10:23                                                     ` Avi Kivity
@ 2015-10-01 14:55                                                     ` Stephen Hemminger
  2015-10-01 15:49                                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Stephen Hemminger @ 2015-10-01 14:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, Avi Kivity

On Thu, 1 Oct 2015 13:14:08 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
> > >There were some attempts to get it for other (older) drivers, named
> > >'bifurcated drivers', but it is stalled.
> > 
> > IIRC they still exposed the ring to userspace.
> 
> How much would a ring write syscall cost? 1-2 microseconds, isn't it?

The per-packet budget at 10G is 62ns, a syscall just doesn't cut it.

^ permalink raw reply	[flat|nested] 100+ messages in thread
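
For context, the arithmetic behind budgets like this: a minimum-size Ethernet
frame occupies 64 bytes plus 8 bytes of preamble and 12 bytes of inter-frame
gap, i.e. 84 bytes (672 bits) on the wire, so 10Gbps works out to about
14.88Mpps, or roughly 67ns per packet at line rate. The 62-68ns figures
quoted in this thread are all of that order.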

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 11:32                                                             ` Avi Kivity
@ 2015-10-01 15:01                                                               ` Stephen Hemminger
  2015-10-01 15:08                                                                 ` Avi Kivity
  2015-10-01 15:46                                                                 ` Michael S. Tsirkin
  2015-10-01 15:11                                                               ` Michael S. Tsirkin
  1 sibling, 2 replies; 100+ messages in thread
From: Stephen Hemminger @ 2015-10-01 15:01 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev, Michael S. Tsirkin

On Thu, 1 Oct 2015 14:32:19 +0300
Avi Kivity <avi@scylladb.com> wrote:

> On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
> >> People will just use out of tree drivers (dpdk has several already).  It's a
> >> pain, but nowhere near what you are proposing.
> > What's the issue with that?
> 
> Out of tree drivers have to be compiled on the target system (cannot 
> ship a binary package), and occasionally break.
> 
> dkms helps with that, as do distributions that promise binary 
> compatibility, but it is still a pain, compared to just shipping a 
> userspace package.
> 
> >   We already agreed this kernel
> > is going to be tainted, and unsupportable.
> 
> Yes.  So your only motivation in rejecting the patch is to get the 
> author to write the ring translation patch and port it to all relevant 
> drivers instead?

The per-driver ring method is what netmap did.
The problem with that is that it forces infrastructure into an already
complex network driver. It was never accepted. There were also still
security issues, like time-of-check/time-of-use races with the ring.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 14:47                                             ` Stephen Hemminger
@ 2015-10-01 15:03                                               ` Vlad Zolotarov
  0 siblings, 0 replies; 100+ messages in thread
From: Vlad Zolotarov @ 2015-10-01 15:03 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Michael S. Tsirkin



On 10/01/15 17:47, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 11:00:28 +0300
> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>
>>
>> On 10/01/15 00:36, Stephen Hemminger wrote:
>>> On Wed, 30 Sep 2015 23:09:33 +0300
>>> Vlad Zolotarov <vladz@cloudius-systems.com> wrote:
>>>
>>>> On 09/30/15 22:39, Michael S. Tsirkin wrote:
>>>>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
>>>>>>>> How would iommu
>>>>>>>> virtualization change anything?
>>>>>>> Kernel can use an iommu to limit device access to memory of
>>>>>>> the controlling application.
>>>>>> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
>>>>>> interrupts support in uio_pci_generic? kernel may continue to limit the
>>>>>> above access with this support as well.
>>>>> It could maybe. So if you write a patch to allow MSI by at the same time
>>>>> creating an isolated IOMMU group and blocking DMA from device in
>>>>> question anywhere, that sounds reasonable.
>>>> No, I'm only planning to add MSI and MSI-X interrupts support for
>>>> uio_pci_generic device.
>>>> The rest mentioned above should naturally be a matter of a different
>>>> patch and writing it is orthogonal to the patch I'm working on as has
>>>> been extensively discussed in this thread.
>>>>
>>> I have a generic MSI and MSI-X driver (posted earlier on this list).
>>> About to post to upstream kernel.
>> Stephen, hi!
>>
>> I found the mentioned series and the first thing I noticed was that it
>> was sent in May, so the first question is how high on your list of tasks
>> submitting it upstream is? We need it more or less yesterday and I'm
>> working on it right now. Therefore if u don't have time for it I'd like
>> to help... ;) However I'd like u to clarify a few small things. Pls.,
>> see below...
>>
>> I noticed that u've created a separate msi_msix driver and the second
>> question is what do u plan for the upstream? I was thinking of extending
>> the existing uio_pci_generic with the MSI-X functionality similar to
>> your code and preserving the INT#X functionality as it is now:
> The igb_uio has a bunch of other things I didn't want to deal with:
> the name (being specific to old Intel driver); compatibility with older
> kernels; legacy ABI support. Therefore in effect uio_msi is a rebase
> of igb_uio.
>
> The submission upstream yesterday is the first step, I expect lots
> of review feedback.

Sure, we have lots of feedback already even before the patch has been 
sent... ;)
So, I'm preparing the uio_pci_generic patch. Just wanted to make sure we 
are not working on the same patch at the same time... ;)

It's going to enable both MSI and MSI-X support.
For backward compatibility it'll enable INT#X by default.
It follows the concepts and uses some code pieces from your uio_msi 
patch. If u want I'll add your Signed-off-by when I send it.


>
>>    *   INT#X and MSI would provide the IRQ number to the UIO module while
>>      only MSI-X case would register with UIO_IRQ_CUSTOM.
> I wanted all IRQs to be the same for the driver, i.e. all going through
> the eventfd mechanism. This makes the code on the DPDK side consistent,
> with fewer special cases.

Of course. The name (uio_msi) is a bit confusing since it only adds 
MSI-X support. I mistakenly thought that it added both MSI and MSI-X, but 
since it only adds MSI-X there are no further questions... ;)

>
>> I also noticed that u enable MSI-X on a first open() call. I assume
>> there was a good reason (that I miss) for not doing it in probe(). Could
>> u, pls., clarify?

What about this?

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 15:01                                                               ` Stephen Hemminger
@ 2015-10-01 15:08                                                                 ` Avi Kivity
  2015-10-01 15:46                                                                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 15:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Michael S. Tsirkin

On 10/01/2015 06:01 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 14:32:19 +0300
> Avi Kivity <avi@scylladb.com> wrote:
>
>> On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
>>>> People will just use out of tree drivers (dpdk has several already).  It's a
>>>> pain, but nowhere near what you are proposing.
>>> What's the issue with that?
>> Out of tree drivers have to be compiled on the target system (cannot
>> ship a binary package), and occasionally break.
>>
>> dkms helps with that, as do distributions that promise binary
>> compatibility, but it is still a pain, compared to just shipping a
>> userspace package.
>>
>>>    We already agreed this kernel
>>> is going to be tainted, and unsupportable.
>> Yes.  So your only motivation in rejecting the patch is to get the
>> author to write the ring translation patch and port it to all relevant
>> drivers instead?
> The per-driver ring method is what netmap did.
> The problem with that is that it forces infrastructure into an already
> complex network driver. It was never accepted. There were also still
> security issues, like time-of-check/time-of-use races with the ring.

There would have to be two rings, with the driver picking up descriptors 
from the software ring, translating virtual addresses, and pushing them 
into the hardware ring.

I'm not familiar enough with the truly high performance dpdk 
applications to estimate the performance impact.  Seastar/scylla gets a 
huge benefit from dpdk, but is still nowhere near line rate.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 11:32                                                             ` Avi Kivity
  2015-10-01 15:01                                                               ` Stephen Hemminger
@ 2015-10-01 15:11                                                               ` Michael S. Tsirkin
  2015-10-01 15:19                                                                 ` Avi Kivity
  1 sibling, 1 reply; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 15:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote:
> >  We already agreed this kernel
> >is going to be tainted, and unsupportable.
> 
> Yes.  So your only motivation in rejecting the patch is to get the author to
> write the ring translation patch and port it to all relevant drivers
> instead?

Not only that.

To make sure users are aware they are doing insecure
things when using software poking at device BARs in sysfs.
To avoid giving virtualization a bad name for security.
To get people to work on safe, maintainable solutions.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 15:11                                                               ` Michael S. Tsirkin
@ 2015-10-01 15:19                                                                 ` Avi Kivity
  2015-10-01 15:40                                                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 100+ messages in thread
From: Avi Kivity @ 2015-10-01 15:19 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev

On 10/01/2015 06:11 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote:
>>>   We already agreed this kernel
>>> is going to be tainted, and unsupportable.
>> Yes.  So your only motivation in rejecting the patch is to get the author to
>> write the ring translation patch and port it to all relevant drivers
>> instead?
> Not only that.
>
> To make sure users are aware they are doing insecure
> things when using software poking at device BARs in sysfs.

I don't think you need to worry about that.  People who program DMA are 
aware of the damage it can cause.  If you want to be extra sure, have 
uio taint the kernel when bus mastering is enabled.

> To avoid giving virtualization a bad name for security.

There is no security issue here.  Those VMs run a single application, 
and cannot attack the host or other VMs.

> To get people to work on safe, maintainable solutions.

That's a great goal but I don't think it can be achieved without 
sacrificing performance, which is the only reason for dpdk's existence.  
If safe and maintainable were the only requirements, people would not 
bypass the kernel.

The only thing you are really achieving by blocking this is causing pain.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 15:19                                                                 ` Avi Kivity
@ 2015-10-01 15:40                                                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 15:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev

On Thu, Oct 01, 2015 at 06:19:33PM +0300, Avi Kivity wrote:
> On 10/01/2015 06:11 PM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote:
> >>>  We already agreed this kernel
> >>>is going to be tainted, and unsupportable.
> >>Yes.  So your only motivation in rejecting the patch is to get the author to
> >>write the ring translation patch and port it to all relevant drivers
> >>instead?
> >Not only that.
> >
> >To make sure users are aware they are doing insecure
> >things when using software poking at device BARs in sysfs.
> 
> I don't think you need to worry about that.  People who program DMA are
> aware of the damage it can cause.

People just install software and run it. They don't program DMA.

And I notice that no software (ab)using this seems to come with
documentation explaining the implications.

> If you want to be extra sure, have uio
> taint the kernel when bus mastering is enabled.

People don't notice that the kernel is tainted.  Denying module load
will make them notice.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 15:01                                                               ` Stephen Hemminger
  2015-10-01 15:08                                                                 ` Avi Kivity
@ 2015-10-01 15:46                                                                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 15:46 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Avi Kivity

On Thu, Oct 01, 2015 at 08:01:00AM -0700, Stephen Hemminger wrote:
> The per-driver ring method is what netmap did.

IIUC netmap has a standard format for descriptors, so it was slower: it
still had to do all networking in the kernel (it only bypasses part of
the networking stack), and to have a thread translate between software
and hardware formats.

> The problem with that is that it forces infrastructure into an already
> complex network driver.  It was never accepted.  There were also still
> security issues, like time-of-check/time-of-use races with the ring.

Right, because people do care about security.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 14:55                                                     ` Stephen Hemminger
@ 2015-10-01 15:49                                                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 15:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Avi Kivity

On Thu, Oct 01, 2015 at 07:55:20AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 13:14:08 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
> > > >There were some attempts to get it for other (older) drivers, named
> > > >'bifurcated drivers', but it is stalled.
> > > 
> > > IIRC they still exposed the ring to userspace.
> > 
> > How much would a ring write syscall cost? 1-2 microseconds, isn't it?
> 
> The per-packet budget at 10G is 62ns, a syscall just doesn't cut it.

If you give up on privacy and only insist on security
(can read all kernel memory, can't corrupt it), then
you only need the syscall to re-arm RX descriptors,
and these can be batched aggressively without impacting
latency.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 13:14                                                           ` Michael S. Tsirkin
@ 2015-10-01 16:04                                                             ` Michael S. Tsirkin
  2015-10-01 21:02                                                             ` Alexander Duyck
  1 sibling, 0 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 16:04 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, Avi Kivity

On Thu, Oct 01, 2015 at 04:14:33PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote:
> > > > This in itself is going to use up
> > > > a good proportion of the processing time, as well as that we have to spend cycles
> > > > copying the descriptors from one ring in memory to another. Given that right now
> > > > with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen
> > > > cycles on modern cores, every additional cycle (fraction of a nanosecond) has
> > > > an impact.
> > > > 
> > > > Regards,
> > > > /Bruce
> > > 
> > > See above.  There is no need for that on data path. Only re-adding
> > > buffers requires a system call.
> > > 
> > 
> > Re-adding buffers is a key part of the data path! Ok, the fact that its only on
> > descriptor rearm does allow somewhat bigger batches,
> 
> That was the point, yes.
> 
> > but the whole point of having
> > the kernel do this extra work you propose is to allow the kernel to scan and
> > sanitize the physical addresses - and that will take a lot of cycles, especially
> > if it has to handle all the different descriptor formats of all the different NICs,
> > as has already been pointed out.
> > 
> > /Bruce
> 
> Well, the driver would be per NIC, so there's only a need to support
> the specific formats supported by a given NIC.
> 
> An alternative is to format the descriptors in kernel, based
> on just the list of addresses. This seems cleaner, but I don't
> know how efficient it would be.


Additionally, rearming descriptors can happen on another
core in parallel with processing packets on the first one.

This will use more CPU but help you stay within your PPS limits.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 13:14                                                           ` Michael S. Tsirkin
  2015-10-01 16:04                                                             ` Michael S. Tsirkin
@ 2015-10-01 21:02                                                             ` Alexander Duyck
  2015-10-02 14:00                                                               ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Alexander Duyck @ 2015-10-01 21:02 UTC (permalink / raw)
  To: Michael S. Tsirkin, Bruce Richardson; +Cc: dev, Avi Kivity

On 10/01/2015 06:14 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote:
>>>> This in itself is going to use up
>>>> a good proportion of the processing time, as well as that we have to spend cycles
>>>> copying the descriptors from one ring in memory to another. Given that right now
>>>> with the vector ixgbe driver, the cycle cost per packet of RX is just a few dozen
>>>> cycles on modern cores, every additional cycle (fraction of a nanosecond) has
>>>> an impact.
>>>>
>>>> Regards,
>>>> /Bruce
>>> See above.  There is no need for that on data path. Only re-adding
>>> buffers requires a system call.
>>>
>> Re-adding buffers is a key part of the data path! Ok, the fact that its only on
>> descriptor rearm does allow somewhat bigger batches,
> That was the point, yes.
>
>> but the whole point of having
>> the kernel do this extra work you propose is to allow the kernel to scan and
>> sanitize the physical addresses - and that will take a lot of cycles, especially
>> if it has to handle all the different descriptor formats of all the different NICs,
>> as has already been pointed out.
>>
>> /Bruce
> Well, the driver would be per NIC, so there's only a need to support
> the specific formats supported by a given NIC.

One thing that seems to be overlooked in your discussion is the cost to 
translate these descriptors.  It isn't as if most systems running DPDK 
have the cycles to spare.  As I believe was brought up in another thread 
we are looking at a budget of something like 68ns at 10Gbps line rate.
The overhead of having to go through and translate/parse/validate the 
descriptors would end up being pretty significant.  If you need proof of 
that, just try running the ixgbe driver and routing small packets.  We end 
up spending something like 40ns in ixgbe_clean_rx_irq and that is mostly 
just translating the descriptor bits into the correct sk_buff bits.  
Also, trying to maintain a user-space ring in addition to the 
kernel-space ring means that much more memory overhead and increases 
the likelihood of things getting pushed out of the L1 cache.

As far as the descriptor validation itself goes, the overhead would 
guarantee that you cannot get any performance out of the device.  There 
are too many corner cases that would have to be addressed in validating 
user-space input to allow for us to process packets in any sort of 
timely fashion.  For starters we would have to validate the size, 
alignment, and ownership of a given buffer. If it is a transmit buffer 
we have to go through and validate any offloads being requested.  Likely 
just the validation and translation would add 10s if not 100s of 
nanoseconds to the time needed to process each packet.  In addition we 
are talking about doing this in kernel space which means we wouldn't 
really be able to take advantage of things like SSE or AVX instructions.

> An alternative is to format the descriptors in kernel, based
> on just the list of addresses. This seems cleaner, but I don't
> know how efficient it would be.
>
> Device vendors and dpdk developers are probably the best people to
> figure out what's the best thing to do here.

As far as the bifurcated driver approach goes, the only way something like 
that would ever work is if you could limit the access via an IOMMU. At 
least everything I have seen proposed for a bifurcated driver still 
involved one if they expected to get any performance.

> But it looks like it's not going to happen unless security is made
> a requirement for upstreaming code.

The fact is we already ship uio_pci_generic.  User space drivers are 
here to stay.  What is being asked for is an extension to the existing 
infrastructure to allow MSI-X interrupts to trigger an event on a file 
descriptor.  As far as I know that doesn't add any additional security 
risk, since it is the kernel PCIe subsystem itself that would be 
programming the address and data for said device; it wouldn't actually 
grant any more access other than the additional file descriptors to 
support MSI-X vectors.

Anyway that is just my $.02 on this.

- Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread
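
To illustrate the validation cost Alex describes, even the cheapest case - a
single pre-registered DMA region with uniform buffers - needs several checks
per buffer, before any of the TX offload validation he mentions. A sketch
only; the dma_region bookkeeping struct is made up and this is not from any
posted patch.

#include <linux/types.h>

struct dma_region {
        u64 dma_base;  /* bus address of the registered region */
        u64 size;      /* region length in bytes */
        u32 align;     /* device alignment requirement, power of two */
};

/* minimum per-buffer validation for a kernel-side rearm path */
static bool buf_ok(const struct dma_region *r, u64 addr, u32 len)
{
        return len && len <= r->size &&
               addr >= r->dma_base &&
               addr - r->dma_base <= r->size - len &&  /* overflow-safe end check */
               !(addr & ((u64)r->align - 1));
}

Multiply a handful of compares and branches like these by tens of millions of
packets per second and it is easy to see where the extra nanoseconds go.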

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01  9:42                                               ` Michael S. Tsirkin
  2015-10-01  9:53                                                 ` Avi Kivity
@ 2015-10-01 21:17                                                 ` Alexander Duyck
  2015-10-02 13:50                                                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 100+ messages in thread
From: Alexander Duyck @ 2015-10-01 21:17 UTC (permalink / raw)
  To: Michael S. Tsirkin, Avi Kivity; +Cc: dev

On 10/01/2015 02:42 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
>> even when they are some users
>> prefer to avoid the performance penalty.
> I don't think there's a measureable penalty from passing through the
> IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I sure
> never saw any numbers that show such.

It depends on the IOMMU.  I believe Intel had a performance penalty on 
all CPUs prior to Ivy Bridge.  Since then things have improved to where 
they are comparable to bare metal.

The graph on page 5 of 
https://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf 
shows the penalty clear as day.  Pretty much anything before Ivy Bridge 
w/ small packets is slowed to a crawl with an IOMMU enabled.

- Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread
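
For reference, the pass-through mode referred to above is a real kernel boot
option; on Intel hardware it is typically enabled together with the IOMMU
itself, e.g. on the kernel command line:

        intel_iommu=on iommu=pt

With iommu=pt the host's own drivers get static identity mappings (avoiding
per-mapping IOTLB work), while devices assigned to guests still go through
real translation.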

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 21:17                                                 ` Alexander Duyck
@ 2015-10-02 13:50                                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-02 13:50 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: dev, Avi Kivity

On Thu, Oct 01, 2015 at 02:17:49PM -0700, Alexander Duyck wrote:
> On 10/01/2015 02:42 AM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> >>even when they are some users
> >>prefer to avoid the performance penalty.
> >I don't think there's a measureable penalty from passing through the
> >IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I sure
> >never saw any numbers that show such.
> 
> It depends on the IOMMU.  I believe Intel had a performance penalty on all
> CPUs prior to Ivy Bridge.  Since then things have improved to where they are
> comparable to bare metal.
> 
> The graph on page 5 of
> https://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf
> shows the penalty clear as day.  Pretty much anything before Ivy Bridge w/
> small packets is slowed to a crawl with an IOMMU enabled.
> 
> - Alex

VMs are running with IOMMU enabled anyway.
Avi here tells us no one uses SRIOV on bare metal so ...
we don't need to argue about that.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-01 21:02                                                             ` Alexander Duyck
@ 2015-10-02 14:00                                                               ` Michael S. Tsirkin
  2015-10-02 14:07                                                                 ` Bruce Richardson
                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-02 14:00 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: dev, Avi Kivity

On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> validation and translation would add 10s if not 100s of nanoseconds to the
> time needed to process each packet.  In addition we are talking about doing
> this in kernel space which means we wouldn't really be able to take
> advantage of things like SSE or AVX instructions.

Yes. But the nice thing is that it's rearming so it can happen on
a separate core, in parallel with packet processing.
It does not need to add to latency.

You will burn up more CPU, but again, all this for boxes/hypervisors
without an IOMMU.

I'm sure people can come up with even better approaches, once enough
people get it that kernel absolutely needs to be protected from
userspace.

Long term, the right thing to do is to focus on IOMMU support. This
gives you hardware-based memory protection without need to burn up CPU
cycles.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-02 14:00                                                               ` Michael S. Tsirkin
@ 2015-10-02 14:07                                                                 ` Bruce Richardson
  2015-10-04  9:07                                                                   ` Michael S. Tsirkin
  2015-10-02 15:56                                                                 ` Gleb Natapov
  2015-10-02 16:57                                                                 ` Alexander Duyck
  2 siblings, 1 reply; 100+ messages in thread
From: Bruce Richardson @ 2015-10-02 14:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, Avi Kivity

On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> > validation and translation would add 10s if not 100s of nanoseconds to the
> > time needed to process each packet.  In addition we are talking about doing
> > this in kernel space which means we wouldn't really be able to take
> > advantage of things like SSE or AVX instructions.
> 
> Yes. But the nice thing is that it's rearming so it can happen on
> a separate core, in parallel with packet processing.
> It does not need to add to latency.
> 
> You will burn up more CPU, but again, all this for boxes/hypervisors
> without an IOMMU.
> 
> I'm sure people can come up with even better approaches, once enough
> people get it that kernel absolutely needs to be protected from
> userspace.
> 
> Long term, the right thing to do is to focus on IOMMU support. This
> gives you hardware-based memory protection without need to burn up CPU
> cycles.
> 
> -- 
> MST

Running it on another core will have its own problems. The main one that springs to
mind for me is the performance impact of having all those cache lines shared
between the two cores.

/Bruce

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-02 14:00                                                               ` Michael S. Tsirkin
  2015-10-02 14:07                                                                 ` Bruce Richardson
@ 2015-10-02 15:56                                                                 ` Gleb Natapov
  2015-10-02 16:57                                                                 ` Alexander Duyck
  2 siblings, 0 replies; 100+ messages in thread
From: Gleb Natapov @ 2015-10-02 15:56 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, Avi Kivity

On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> > validation and translation would add 10s if not 100s of nanoseconds to the
> > time needed to process each packet.  In addition we are talking about doing
> > this in kernel space which means we wouldn't really be able to take
> > advantage of things like SSE or AVX instructions.
> 
> Yes. But the nice thing is that it's rearming so it can happen on
> a separate core, in parallel with packet processing.
> It does not need to add to latency.
> 
Modern NICs have no fewer queues than most machines have cores. There is
no such thing as a free core to offload your processing to; otherwise you
designed your application wrong and are wasting CPU cycles.

> You will burn up more CPU, but again, all this for boxes/hypervisors
> without an IOMMU.
> 
> I'm sure people can come up with even better approaches, once enough
> people get it that kernel absolutely needs to be protected from
> userspace.
> 
People should not "get" things which are, let's be polite here, untrue.
The kernel never tried to protect itself from userspace running on
behalf of root. Secure boot, which is quite recent, is maybe the only
instance where the kernel tries to do so (unfortunately), and it does so
by disabling things if boot is secure. Linux was always a "jack of all
trades" and was suitable to run on a machine with secure boot, on a VM
that acts as an application container, or on an embedded device running
packet forwarding.

The only valid point is that nobody should have to debug crashes that may
be caused by buggy userspace, and tainting the kernel solves that.

> Long term, the right thing to do is to focus on IOMMU support. This
> gives you hardware-based memory protection without need to burn up CPU
> cycles.
> 
> -- 
> MST

--
			Gleb.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-02 14:00                                                               ` Michael S. Tsirkin
  2015-10-02 14:07                                                                 ` Bruce Richardson
  2015-10-02 15:56                                                                 ` Gleb Natapov
@ 2015-10-02 16:57                                                                 ` Alexander Duyck
  2 siblings, 0 replies; 100+ messages in thread
From: Alexander Duyck @ 2015-10-02 16:57 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, Avi Kivity

On 10/02/2015 07:00 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
>> validation and translation would add 10s if not 100s of nanoseconds to the
>> time needed to process each packet.  In addition we are talking about doing
>> this in kernel space which means we wouldn't really be able to take
>> advantage of things like SSE or AVX instructions.
> Yes. But the nice thing is that it's rearming so it can happen on
> a separate core, in parallel with packet processing.
> It does not need to add to latency.

Moving it to another core is automatically going to add extra latency.  
You will have to evict the data out of the L1 cache for one core and 
into the L1 cache for another when you update it, and then reading it 
will force it to have to transition back out.  If you are lucky it is 
only evicted to L2, if not then to L3, or possibly even back to memory.  
Odds are that alone will add tens of nanoseconds to the process, and you 
would need three or more cores to do the same workload as running the 
process over multiple threads means having to add synchronization 
primitives to the whole mess. Then there is the NUMA factor on top of that.

> You will burn up more CPU, but again, all this for boxes/hypervisors
> without an IOMMU.

There are use cases this will make completely useless.  If, for example, 
you are running a workload that needs three cores with DPDK, bumping it 
to nine or more will likely push the workload out of reach on some 
systems.

> I'm sure people can come up with even better approaches, once enough
> people get it that kernel absolutely needs to be protected from
> userspace.

I don't see that happening.  Many people don't care about kernel 
security that much.  If they did, something like DPDK wouldn't have 
gotten off the ground.  Once someone has the ability to load kernel 
modules, any protection of the kernel from userspace pretty much goes 
right out the window.  You are just as much at risk from a buggy driver 
in userspace as you are from one that can be added to the kernel.

> Long term, the right thing to do is to focus on IOMMU support. This
> gives you hardware-based memory protection without need to burn up CPU
> cycles.

We have a solution that makes use of IOMMU support with vfio.  The 
problem is there are multiple cases where that support is either not 
available, or where using the IOMMU imposes excessive overhead.

- Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance
  2015-10-02 14:07                                                                 ` Bruce Richardson
@ 2015-10-04  9:07                                                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 100+ messages in thread
From: Michael S. Tsirkin @ 2015-10-04  9:07 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev, Avi Kivity

On Fri, Oct 02, 2015 at 03:07:24PM +0100, Bruce Richardson wrote:
> On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> > > validation and translation would add 10s if not 100s of nanoseconds to the
> > > time needed to process each packet.  In addition we are talking about doing
> > > this in kernel space which means we wouldn't really be able to take
> > > advantage of things like SSE or AVX instructions.
> > 
> > Yes. But the nice thing is that it's rearming so it can happen on
> > a separate core, in parallel with packet processing.
> > It does not need to add to latency.
> > 
> > You will burn up more CPU, but again, all this for boxes/hypervisors
> > without an IOMMU.
> > 
> > I'm sure people can come up with even better approaches, once enough
> > people get it that kernel absolutely needs to be protected from
> > userspace.
> > 
> > Long term, the right thing to do is to focus on IOMMU support. This
> > gives you hardware-based memory protection without need to burn up CPU
> > cycles.
> > 
> > -- 
> > MST
> 
> Running it on another core will have its own problems. The main one that springs to
> mind for me is the performance impact of having all those cache lines shared
> between the two cores.
> 
> /Bruce

The cache line is currently invalidated by the device write:

	packet processing core -> device -> packet processing core

We are adding another stage:

	packet processing core -> rearming core -> device -> packet processing core

but the amount of sharing per core isn't increased.

This is something that can be tried immediately without kernel changes.
Who knows, maybe you will actually be able to push more pps this way.


Further, rearming is not doing a lot besides moving bits around in
memory, and it's in the kernel, so it uses very limited resources - maybe
we can efficiently use an HT logical core for this task.
This remains to be seen.

-- 
MST

^ permalink raw reply	[flat|nested] 100+ messages in thread

end of thread, other threads:[~2015-10-04  9:07 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-27  7:05 [dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance Vlad Zolotarov
2015-09-27  9:43 ` Michael S. Tsirkin
2015-09-27 10:50   ` Vladislav Zolotarov
2015-09-29 16:41   ` Vlad Zolotarov
2015-09-29 20:54     ` Michael S. Tsirkin
2015-09-29 21:46       ` Stephen Hemminger
2015-09-29 21:49         ` Michael S. Tsirkin
2015-09-30 10:37           ` Vlad Zolotarov
2015-09-30 10:58             ` Michael S. Tsirkin
2015-09-30 11:26               ` Vlad Zolotarov
     [not found]                 ` <20150930143927-mutt-send-email-mst@redhat.com>
2015-09-30 11:53                   ` Vlad Zolotarov
2015-09-30 12:03                     ` Michael S. Tsirkin
2015-09-30 12:16                       ` Vlad Zolotarov
2015-09-30 12:27                         ` Michael S. Tsirkin
2015-09-30 12:50                           ` Vlad Zolotarov
2015-09-30 15:26                             ` Michael S. Tsirkin
2015-09-30 18:15                               ` Vlad Zolotarov
2015-09-30 18:55                                 ` Michael S. Tsirkin
2015-09-30 19:06                                   ` Vlad Zolotarov
2015-09-30 19:10                                     ` Vlad Zolotarov
2015-09-30 19:11                                       ` Vlad Zolotarov
2015-09-30 19:39                                     ` Michael S. Tsirkin
2015-09-30 20:09                                       ` Vlad Zolotarov
2015-09-30 21:36                                         ` Stephen Hemminger
2015-09-30 21:53                                           ` Michael S. Tsirkin
2015-09-30 22:20                                           ` Vlad Zolotarov
2015-10-01  8:00                                           ` Vlad Zolotarov
2015-10-01 14:47                                             ` Stephen Hemminger
2015-10-01 15:03                                               ` Vlad Zolotarov
2015-09-30 13:05                           ` Avi Kivity
2015-09-30 14:39                             ` Michael S. Tsirkin
2015-09-30 14:53                               ` Avi Kivity
2015-09-30 15:21                                 ` Michael S. Tsirkin
2015-09-30 15:36                                   ` Avi Kivity
2015-09-30 20:40                                     ` Michael S. Tsirkin
2015-09-30 21:00                                       ` Avi Kivity
2015-10-01  8:44                                       ` Michael S. Tsirkin
2015-10-01  8:46                                         ` Vlad Zolotarov
2015-10-01  8:52                                         ` Avi Kivity
2015-10-01  9:15                                           ` Michael S. Tsirkin
2015-10-01  9:22                                             ` Avi Kivity
2015-10-01  9:42                                               ` Michael S. Tsirkin
2015-10-01  9:53                                                 ` Avi Kivity
2015-10-01 10:17                                                   ` Michael S. Tsirkin
2015-10-01 10:24                                                     ` Avi Kivity
2015-10-01 10:25                                                       ` Avi Kivity
2015-10-01 10:44                                                         ` Michael S. Tsirkin
2015-10-01 10:55                                                           ` Avi Kivity
2015-10-01 21:17                                                 ` Alexander Duyck
2015-10-02 13:50                                                   ` Michael S. Tsirkin
2015-10-01  9:42                                               ` Vincent JARDIN
2015-10-01  9:43                                                 ` Avi Kivity
2015-10-01  9:48                                                   ` Vincent JARDIN
2015-10-01  9:54                                                     ` Avi Kivity
2015-10-01 10:14                                                   ` Michael S. Tsirkin
2015-10-01 10:23                                                     ` Avi Kivity
2015-10-01 14:55                                                     ` Stephen Hemminger
2015-10-01 15:49                                                       ` Michael S. Tsirkin
2015-10-01 14:54                                                 ` Stephen Hemminger
2015-10-01  9:55                                               ` Michael S. Tsirkin
2015-10-01  9:59                                                 ` Avi Kivity
2015-10-01 10:38                                                   ` Michael S. Tsirkin
2015-10-01 10:50                                                     ` Avi Kivity
2015-10-01 11:09                                                       ` Michael S. Tsirkin
2015-10-01 11:20                                                         ` Avi Kivity
2015-10-01 11:27                                                           ` Michael S. Tsirkin
2015-10-01 11:32                                                             ` Avi Kivity
2015-10-01 15:01                                                               ` Stephen Hemminger
2015-10-01 15:08                                                                 ` Avi Kivity
2015-10-01 15:46                                                                 ` Michael S. Tsirkin
2015-10-01 15:11                                                               ` Michael S. Tsirkin
2015-10-01 15:19                                                                 ` Avi Kivity
2015-10-01 15:40                                                                   ` Michael S. Tsirkin
2015-10-01 11:31                                                           ` Michael S. Tsirkin
2015-10-01 11:34                                                             ` Avi Kivity
2015-10-01 11:08                                                     ` Bruce Richardson
2015-10-01 11:23                                                       ` Michael S. Tsirkin
2015-10-01 12:07                                                         ` Bruce Richardson
2015-10-01 13:14                                                           ` Michael S. Tsirkin
2015-10-01 16:04                                                             ` Michael S. Tsirkin
2015-10-01 21:02                                                             ` Alexander Duyck
2015-10-02 14:00                                                               ` Michael S. Tsirkin
2015-10-02 14:07                                                                 ` Bruce Richardson
2015-10-04  9:07                                                                   ` Michael S. Tsirkin
2015-10-02 15:56                                                                 ` Gleb Natapov
2015-10-02 16:57                                                                 ` Alexander Duyck
2015-10-01  9:15                                           ` Avi Kivity
2015-10-01  9:29                                             ` Michael S. Tsirkin
2015-10-01  9:38                                               ` Avi Kivity
2015-10-01 10:07                                                 ` Michael S. Tsirkin
2015-10-01 10:11                                                   ` Avi Kivity
2015-10-01  9:16                                         ` Michael S. Tsirkin
2015-09-30 17:28             ` Stephen Hemminger
2015-09-30 17:39               ` Michael S. Tsirkin
2015-09-30 17:43                 ` Stephen Hemminger
2015-09-30 18:50                   ` Michael S. Tsirkin
2015-09-30 20:00                     ` Gleb Natapov
2015-09-30 20:36                       ` Michael S. Tsirkin
2015-10-01  5:04                         ` Gleb Natapov
2015-09-30 17:44                 ` Gleb Natapov
