DPDK patches and discussions
* [dpdk-dev] [PATCH 0/2] uio_msi: device driver
@ 2015-09-30 22:28 Stephen Hemminger
  2015-09-30 22:28 ` [dpdk-dev] [PATCH 1/2] uio: add support for ioctls Stephen Hemminger
                   ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Stephen Hemminger @ 2015-09-30 22:28 UTC (permalink / raw)
  To: hjk, gregkh; +Cc: dev, linux-kernel

This is a new UIO device driver that allows MSI-X and MSI devices to be
supported in userspace.  It has been used in environments such as VMware and
older versions of QEMU/KVM where no IOMMU support is available.

Stephen Hemminger (2):
  uio: add support for ioctls
  uio: new driver to support PCI MSI-X

 drivers/uio/Kconfig          |   9 ++
 drivers/uio/Makefile         |   1 +
 drivers/uio/uio.c            |  15 ++
 drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/uio_driver.h   |   3 +
 include/uapi/linux/Kbuild    |   1 +
 include/uapi/linux/uio_msi.h |  22 +++
 7 files changed, 429 insertions(+)
 create mode 100644 drivers/uio/uio_msi.c
 create mode 100644 include/uapi/linux/uio_msi.h

-- 
2.1.4

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [dpdk-dev] [PATCH 1/2] uio: add support for ioctls
  2015-09-30 22:28 [dpdk-dev] [PATCH 0/2] uio_msi: device driver Stephen Hemminger
@ 2015-09-30 22:28 ` Stephen Hemminger
  2015-09-30 22:28 ` [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X Stephen Hemminger
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: Stephen Hemminger @ 2015-09-30 22:28 UTC (permalink / raw)
  To: hjk, gregkh; +Cc: dev, linux-kernel

Allow a UIO device driver to provide an ioctl interface.
This allows additional APIs for UIO.
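
As a rough illustration (the foo_uio names are hypothetical, not part of
this series), a driver would wire up the new callback through uio_info:

#include <linux/device.h>
#include <linux/errno.h>
#include <linux/uio_driver.h>

static int foo_uio_ioctl(struct uio_info *info, unsigned int cmd,
			 unsigned long arg)
{
	switch (cmd) {
	/* driver-specific commands would be handled here */
	default:
		return -ENOTTY;
	}
}

static int foo_uio_setup(struct device *parent, struct uio_info *info)
{
	info->name = "foo_uio";
	info->version = "0.1";
	info->irq = UIO_IRQ_NONE;
	info->ioctl = foo_uio_ioctl;	/* new hook added by this patch */

	return uio_register_device(parent, info);
}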

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 drivers/uio/uio.c          | 15 +++++++++++++++
 include/linux/uio_driver.h |  3 +++
 2 files changed, 18 insertions(+)

diff --git a/drivers/uio/uio.c b/drivers/uio/uio.c
index 8196581..5ab32ab 100644
--- a/drivers/uio/uio.c
+++ b/drivers/uio/uio.c
@@ -576,6 +576,20 @@ static ssize_t uio_write(struct file *filep, const char __user *buf,
 	return retval ? retval : sizeof(s32);
 }
 
+static long uio_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
+{
+	struct uio_listener *listener = filep->private_data;
+	struct uio_device *idev = listener->dev;
+
+	if (!idev->info)
+		return -EIO;
+
+	if (!idev->info->ioctl)
+		return -ENOTTY;
+
+	return idev->info->ioctl(idev->info, cmd, arg);
+}
+
 static int uio_find_mem_index(struct vm_area_struct *vma)
 {
 	struct uio_device *idev = vma->vm_private_data;
@@ -712,6 +726,7 @@ static const struct file_operations uio_fops = {
 	.write		= uio_write,
 	.mmap		= uio_mmap,
 	.poll		= uio_poll,
+	.unlocked_ioctl	= uio_ioctl,
 	.fasync		= uio_fasync,
 	.llseek		= noop_llseek,
 };
diff --git a/include/linux/uio_driver.h b/include/linux/uio_driver.h
index 32c0e83..10d7833 100644
--- a/include/linux/uio_driver.h
+++ b/include/linux/uio_driver.h
@@ -89,6 +89,7 @@ struct uio_device {
  * @mmap:		mmap operation for this uio device
  * @open:		open operation for this uio device
  * @release:		release operation for this uio device
+ * @ioctl:		ioctl handler
  * @irqcontrol:		disable/enable irqs when 0/1 is written to /dev/uioX
  */
 struct uio_info {
@@ -105,6 +106,8 @@ struct uio_info {
 	int (*open)(struct uio_info *info, struct inode *inode);
 	int (*release)(struct uio_info *info, struct inode *inode);
 	int (*irqcontrol)(struct uio_info *info, s32 irq_on);
+	int (*ioctl)(struct uio_info *info, unsigned int cmd,
+		     unsigned long arg);
 };
 
 extern int __must_check
-- 
2.1.4

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-09-30 22:28 [dpdk-dev] [PATCH 0/2] uio_msi: device driver Stephen Hemminger
  2015-09-30 22:28 ` [dpdk-dev] [PATCH 1/2] uio: add support for ioctls Stephen Hemminger
@ 2015-09-30 22:28 ` Stephen Hemminger
  2015-10-01  8:33   ` Michael S. Tsirkin
  2015-10-01 23:40   ` Alexander Duyck
  2015-10-01  8:36 ` [dpdk-dev] [PATCH 0/2] uio_msi: device driver Michael S. Tsirkin
  2015-10-01 10:59 ` Avi Kivity
  3 siblings, 2 replies; 44+ messages in thread
From: Stephen Hemminger @ 2015-09-30 22:28 UTC (permalink / raw)
  To: hjk, gregkh; +Cc: dev, linux-kernel

This driver allows using a PCI device with Message Signalled Interrupts
from userspace. The API is similar to the igb_uio driver used by DPDK.
Via ioctl it provides a mechanism to map MSI-X interrupts into event
file descriptors, similar to VFIO.
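
For example, a userspace process could wire MSI-X vector 0 to an eventfd
roughly as follows (an illustrative sketch, with error handling omitted and
assuming the new header is installed as <linux/uio_msi.h>):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/uio_msi.h>

int main(void)
{
	/* first open() of the bound device enables MSI-X in the driver */
	int uio_fd = open("/dev/uio0", O_RDWR);
	int efd = eventfd(0, 0);
	struct uio_msi_irq_set req = { .vec = 0, .fd = efd };
	uint64_t count;

	/* map vector 0 to the eventfd */
	ioctl(uio_fd, UIO_MSI_IRQ_SET, &req);

	/* block until the vector fires; read() drains the event count */
	read(efd, &count, sizeof(count));

	/* passing fd = -1 releases the vector again */
	req.fd = -1;
	ioctl(uio_fd, UIO_MSI_IRQ_SET, &req);
	return 0;
}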

VFIO is a better choice if an IOMMU is available, but userspace drivers often
have to work in environments where IOMMU support (real or emulated) is
not available.  No UIO driver that supports DMA is secure against a
rogue userspace application programming the DMA hardware to access
private memory; this driver is no less secure than the existing code.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 drivers/uio/Kconfig          |   9 ++
 drivers/uio/Makefile         |   1 +
 drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/Kbuild    |   1 +
 include/uapi/linux/uio_msi.h |  22 +++
 5 files changed, 411 insertions(+)
 create mode 100644 drivers/uio/uio_msi.c
 create mode 100644 include/uapi/linux/uio_msi.h

diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
index 52c98ce..04adfa0 100644
--- a/drivers/uio/Kconfig
+++ b/drivers/uio/Kconfig
@@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
 	  primarily, for virtualization scenarios.
 	  If you compile this as a module, it will be called uio_pci_generic.
 
+config UIO_PCI_MSI
+	tristate "Generic driver supporting MSI-X on PCI Express cards"
+	depends on PCI
+	help
+	  Generic driver that provides Message Signalled IRQ events
+	  similar to VFIO. If an IOMMU is available, please use VFIO
+	  instead since it provides more security.
+	  If you compile this as a module, it will be called uio_msi.
+
 config UIO_NETX
 	tristate "Hilscher NetX Card driver"
 	depends on PCI
diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
index 8560dad..62fc44b 100644
--- a/drivers/uio/Makefile
+++ b/drivers/uio/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
 obj-$(CONFIG_UIO_PRUSS)         += uio_pruss.o
 obj-$(CONFIG_UIO_MF624)         += uio_mf624.o
 obj-$(CONFIG_UIO_FSL_ELBC_GPCM)	+= uio_fsl_elbc_gpcm.o
+obj-$(CONFIG_UIO_PCI_MSI)	+= uio_msi.o
diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
new file mode 100644
index 0000000..802b5c4
--- /dev/null
+++ b/drivers/uio/uio_msi.c
@@ -0,0 +1,378 @@
+/*-
+ *
+ * Copyright (c) 2015 by Brocade Communications Systems, Inc.
+ * Author: Stephen Hemminger <stephen@networkplumber.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 only.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/uio_driver.h>
+#include <linux/msi.h>
+#include <linux/uio_msi.h>
+
+#define DRIVER_VERSION		"0.1.1"
+#define MAX_MSIX_VECTORS	64
+
+/* MSI-X vector information */
+struct uio_msi_pci_dev {
+	struct uio_info info;		/* UIO driver info */
+	struct pci_dev *pdev;		/* PCI device */
+	struct mutex	mutex;		/* open/release/ioctl mutex */
+	int		ref_cnt;	/* references to device */
+	unsigned int	max_vectors;	/* MSI-X slots available */
+	struct msix_entry *msix;	/* MSI-X vector table */
+	struct uio_msi_irq_ctx {
+		struct eventfd_ctx *trigger; /* vector to eventfd */
+		char *name;		/* name in /proc/interrupts */
+	} *ctx;
+};
+
+static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
+{
+	struct uio_msi_pci_dev *udev = arg;
+
+	if (pci_check_and_mask_intx(udev->pdev)) {
+		eventfd_signal(udev->ctx->trigger, 1);
+		return IRQ_HANDLED;
+	}
+
+	return IRQ_NONE;
+}
+
+static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
+{
+	struct eventfd_ctx *trigger = arg;
+
+	eventfd_signal(trigger, 1);
+	return IRQ_HANDLED;
+}
+
+/* set the mapping between vector # and existing eventfd. */
+static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
+{
+	struct eventfd_ctx *trigger;
+	int irq, err;
+
+	if (vec >= udev->max_vectors) {
+		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
+			   vec, udev->max_vectors);
+		return -ERANGE;
+	}
+
+	irq = udev->msix[vec].vector;
+	trigger = udev->ctx[vec].trigger;
+	if (trigger) {
+		/* Clear up existing irq mapping */
+		free_irq(irq, trigger);
+		eventfd_ctx_put(trigger);
+		udev->ctx[vec].trigger = NULL;
+	}
+
+	/* Passing -1 is used to disable interrupt */
+	if (fd < 0)
+		return 0;
+
+	trigger = eventfd_ctx_fdget(fd);
+	if (IS_ERR(trigger)) {
+		err = PTR_ERR(trigger);
+		dev_notice(&udev->pdev->dev,
+			   "eventfd ctx get failed: %d\n", err);
+		return err;
+	}
+
+	if (udev->msix)
+		err = request_irq(irq, uio_msi_irqhandler, 0,
+				  udev->ctx[vec].name, trigger);
+	else
+		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
+				  udev->ctx[vec].name, udev);
+
+	if (err) {
+		dev_notice(&udev->pdev->dev,
+			   "request irq failed: %d\n", err);
+		eventfd_ctx_put(trigger);
+		return err;
+	}
+
+	udev->ctx[vec].trigger = trigger;
+	return 0;
+}
+
+static int
+uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
+{
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	struct uio_msi_irq_set hdr;
+	int err;
+
+	switch (cmd) {
+	case UIO_MSI_IRQ_SET:
+		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
+			return -EFAULT;
+
+		mutex_lock(&udev->mutex);
+		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
+		mutex_unlock(&udev->mutex);
+		break;
+	default:
+		err = -EOPNOTSUPP;
+	}
+	return err;
+}
+
+/* Opening the UIO device for the first time enables MSI-X */
+static int
+uio_msi_open(struct uio_info *info, struct inode *inode)
+{
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	int err = 0;
+
+	mutex_lock(&udev->mutex);
+	if (udev->ref_cnt++ == 0) {
+		if (udev->msix)
+			err = pci_enable_msix(udev->pdev, udev->msix,
+					      udev->max_vectors);
+	}
+	mutex_unlock(&udev->mutex);
+
+	return err;
+}
+
+/* Last close of the UIO device releases/disables all IRQs */
+static int
+uio_msi_release(struct uio_info *info, struct inode *inode)
+{
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	int i;
+
+	mutex_lock(&udev->mutex);
+	if (--udev->ref_cnt == 0) {
+		for (i = 0; i < udev->max_vectors; i++) {
+			int irq = udev->msix[i].vector;
+			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
+
+			if (!trigger)
+				continue;
+
+			free_irq(irq, trigger);
+			eventfd_ctx_put(trigger);
+			udev->ctx[i].trigger = NULL;
+		}
+
+		if (udev->msix)
+			pci_disable_msix(udev->pdev);
+	}
+	mutex_unlock(&udev->mutex);
+
+	return 0;
+}
+
+/* Unmap previously ioremap'd resources */
+static void
+release_iomaps(struct uio_mem *mem)
+{
+	int i;
+
+	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
+		if (mem->internal_addr)
+			iounmap(mem->internal_addr);
+	}
+}
+
+static int
+setup_maps(struct pci_dev *pdev, struct uio_info *info)
+{
+	int i, m = 0, p = 0, err;
+	static const char * const bar_names[] = {
+		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
+	};
+
+	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
+		unsigned long start = pci_resource_start(pdev, i);
+		unsigned long flags = pci_resource_flags(pdev, i);
+		unsigned long len = pci_resource_len(pdev, i);
+
+		if (start == 0 || len == 0)
+			continue;
+
+		if (flags & IORESOURCE_MEM) {
+			void __iomem *addr;
+
+			if (m >= MAX_UIO_MAPS)
+				continue;
+
+			addr = ioremap(start, len);
+			if (addr == NULL) {
+				err = -EINVAL;
+				goto fail;
+			}
+
+			info->mem[m].name = bar_names[i];
+			info->mem[m].addr = start;
+			info->mem[m].internal_addr = addr;
+			info->mem[m].size = len;
+			info->mem[m].memtype = UIO_MEM_PHYS;
+			++m;
+		} else if (flags & IORESOURCE_IO) {
+			if (p >= MAX_UIO_PORT_REGIONS)
+				continue;
+
+			info->port[p].name = bar_names[i];
+			info->port[p].start = start;
+			info->port[p].size = len;
+			info->port[p].porttype = UIO_PORT_X86;
+			++p;
+		}
+	}
+
+	return 0;
+ fail:
+	for (i = 0; i < m; i++)
+		iounmap(info->mem[i].internal_addr);
+	return err;
+}
+
+static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct uio_msi_pci_dev *udev;
+	int i, err, vectors;
+
+	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
+	if (!udev)
+		return -ENOMEM;
+
+	err = pci_enable_device(pdev);
+	if (err != 0) {
+		dev_err(&pdev->dev, "cannot enable PCI device\n");
+		goto fail_free;
+	}
+
+	err = pci_request_regions(pdev, "uio_msi");
+	if (err != 0) {
+		dev_err(&pdev->dev, "Cannot request regions\n");
+		goto fail_disable;
+	}
+
+	pci_set_master(pdev);
+
+	/* remap resources */
+	err = setup_maps(pdev, &udev->info);
+	if (err)
+		goto fail_release_iomem;
+
+	/* fill uio infos */
+	udev->info.name = "uio_msi";
+	udev->info.version = DRIVER_VERSION;
+	udev->info.priv = udev;
+	udev->pdev = pdev;
+	udev->info.ioctl = uio_msi_ioctl;
+	udev->info.open = uio_msi_open;
+	udev->info.release = uio_msi_release;
+	udev->info.irq = UIO_IRQ_CUSTOM;
+	mutex_init(&udev->mutex);
+
+	vectors = pci_msix_vec_count(pdev);
+	if (vectors > 0) {
+		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
+		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
+			udev->max_vectors);
+
+		err = -ENOMEM;
+		udev->msix = kcalloc(udev->max_vectors,
+				     sizeof(struct msix_entry), GFP_KERNEL);
+		if (!udev->msix)
+			goto fail_release_iomem;
+	} else if (!pci_intx_mask_supported(pdev)) {
+		dev_err(&pdev->dev,
+			"device does not support MSI-X or INTX\n");
+		err = -EINVAL;
+		goto fail_release_iomem;
+	} else {
+		dev_notice(&pdev->dev, "using INTX\n");
+		udev->info.irq_flags = IRQF_SHARED;
+		udev->max_vectors = 1;
+	}
+
+	udev->ctx = kcalloc(udev->max_vectors,
+			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
+	if (!udev->ctx)
+		goto fail_free_msix;
+
+	for (i = 0; i < udev->max_vectors; i++) {
+		udev->msix[i].entry = i;
+
+		udev->ctx[i].name = kasprintf(GFP_KERNEL,
+					      KBUILD_MODNAME "[%d](%s)",
+					      i, pci_name(pdev));
+		if (!udev->ctx[i].name)
+			goto fail_free_ctx;
+	}
+
+	/* register uio driver */
+	err = uio_register_device(&pdev->dev, &udev->info);
+	if (err != 0)
+		goto fail_free_ctx;
+
+	pci_set_drvdata(pdev, udev);
+	return 0;
+
+fail_free_ctx:
+	for (i = 0; i < udev->max_vectors; i++)
+		kfree(udev->ctx[i].name);
+	kfree(udev->ctx);
+fail_free_msix:
+	kfree(udev->msix);
+fail_release_iomem:
+	release_iomaps(udev->info.mem);
+	pci_release_regions(pdev);
+fail_disable:
+	pci_disable_device(pdev);
+fail_free:
+	kfree(udev);
+
+	pr_notice("%s ret %d\n", __func__, err);
+	return err;
+}
+
+static void uio_msi_remove(struct pci_dev *pdev)
+{
+	struct uio_info *info = pci_get_drvdata(pdev);
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	int i;
+
+	uio_unregister_device(info);
+	release_iomaps(info->mem);
+
+	pci_release_regions(pdev);
+	for (i = 0; i < udev->max_vectors; i++)
+		kfree(udev->ctx[i].name);
+	kfree(udev->ctx);
+	kfree(udev->msix);
+	pci_disable_device(pdev);
+
+	pci_set_drvdata(pdev, NULL);
+	kfree(udev);
+}
+
+static struct pci_driver uio_msi_pci_driver = {
+	.name = "uio_msi",
+	.probe = uio_msi_probe,
+	.remove = uio_msi_remove,
+};
+
+module_pci_driver(uio_msi_pci_driver);
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
+MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index f7b2db4..d9497691 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -411,6 +411,7 @@ header-y += udp.h
 header-y += uhid.h
 header-y += uinput.h
 header-y += uio.h
+header-y += uio_msi.h
 header-y += ultrasound.h
 header-y += un.h
 header-y += unistd.h
diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
new file mode 100644
index 0000000..297de00
--- /dev/null
+++ b/include/uapi/linux/uio_msi.h
@@ -0,0 +1,22 @@
+/*
+ * UIO_MSI API definition
+ *
+ * Copyright (c) 2015 by Brocade Communications Systems, Inc.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _UIO_PCI_MSI_H
+#define _UIO_PCI_MSI_H
+
+struct uio_msi_irq_set {
+	u32 vec;
+	int fd;
+};
+
+#define UIO_MSI_BASE	0x86
+#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
+
+#endif
-- 
2.1.4

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-09-30 22:28 ` [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X Stephen Hemminger
@ 2015-10-01  8:33   ` Michael S. Tsirkin
  2015-10-01 10:37     ` Michael S. Tsirkin
                       ` (3 more replies)
  2015-10-01 23:40   ` Alexander Duyck
  1 sibling, 4 replies; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01  8:33 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> This driver allows using PCI device with Message Signalled Interrupt
> from userspace. The API is similar to the igb_uio driver used by the DPDK.
> Via ioctl it provides a mechanism to map MSI-X interrupts into event
> file descriptors similar to VFIO.
>
> VFIO is a better choice if IOMMU is available, but often userspace drivers
> have to work in environments where IOMMU support (real or emulated) is
> not available.  All UIO drivers that support DMA are not secure against
> rogue userspace applications programming DMA hardware to access
> private memory; this driver is no less secure than existing code.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

I don't think copying the igb_uio interface is a good idea.
What DPDK is doing with igb_uio (and indeed uio_pci_generic)
is abusing the sysfs BAR access to provide unlimited
access to hardware.

MSI messages are memory writes so any generic device capable
of MSI is capable of corrupting kernel memory.
This means that a bug in userspace will lead to kernel memory corruption
and crashes.  This is something distributions can't support.

uio_pci_generic is already abused like that, mostly
because when I wrote it, I didn't add enough protections
against using it with DMA capable devices,
and we can't go back and break working userspace.
But at least it does not bind to VFs, all of which
are capable of DMA.

The result of merging this driver will be userspace abusing the
sysfs BAR access with VFs as well, and we do not want that.


Just forwarding events is not enough to make a valid driver.
What is missing is a way to access the device in a safe way.

On a more positive note:

What would be a reasonable interface? One that does the following
in kernel:

1. initializes device rings (can be in pinned userspace memory,
   but cannot be writable by userspace), brings up the interface link
2. pins userspace memory (unless using e.g. hugetlbfs)
3. gets a request, makes sure it's valid and belongs to
   the correct task, puts it in the ring
4. in the reverse direction, notifies userspace when buffers
   are available in the ring
5. notifies userspace about MSI (what this driver does)

What userspace can be allowed to do:

	format requests (e.g. transmit, receive) in userspace
	read ring contents

What userspace can't be allowed to do:

	access BAR
	write rings


This means that the driver cannot be a generic one,
and there will be a system call overhead when you
write the ring, but that's the price you have to
pay for the ability to run on systems without an IOMMU.
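
Purely as a sketch of that idea (all names invented for illustration, not a
concrete uapi proposal), the submission side might look something like:

/* The kernel owns and writes the hardware rings; userspace only
 * formats requests and hands them over in batches for validation.
 */
#include <linux/ioctl.h>
#include <linux/types.h>

#define XYZ_MAX_BATCH	64

struct xyz_xmit_req {
	__u64 addr;	/* address inside memory the task pinned earlier */
	__u32 len;
	__u32 flags;
};

struct xyz_xmit_batch {
	__u32 count;				/* <= XYZ_MAX_BATCH */
	__u32 pad;
	struct xyz_xmit_req req[XYZ_MAX_BATCH];
};

/* one system call posts a whole batch; the kernel verifies that every
 * buffer belongs to the calling task before it touches the TX ring
 */
#define XYZ_XMIT_SUBMIT	_IOW('X', 0x01, struct xyz_xmit_batch)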





^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-09-30 22:28 [dpdk-dev] [PATCH 0/2] uio_msi: device driver Stephen Hemminger
  2015-09-30 22:28 ` [dpdk-dev] [PATCH 1/2] uio: add support for ioctls Stephen Hemminger
  2015-09-30 22:28 ` [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X Stephen Hemminger
@ 2015-10-01  8:36 ` Michael S. Tsirkin
  2015-10-01 10:59 ` Avi Kivity
  3 siblings, 0 replies; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01  8:36 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On Wed, Sep 30, 2015 at 03:28:56PM -0700, Stephen Hemminger wrote:
> This is a new UIO device driver to allow supporting MSI-X and MSI devices
> in userspace.  It has been used in environments like VMware and older versions
> of QEMU/KVM where no IOMMU support is available.
> 
> Stephen Hemminger (2):
> 
> *** BLURB HERE ***

I think this needs some more work to make sure
userspace can drive devices without a risk of
crashing the kernel even when there's no IOMMU.

I sent some comments in response to the patch itself.

Please copy me on future versions of this driver.

Thanks!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01  8:33   ` Michael S. Tsirkin
@ 2015-10-01 10:37     ` Michael S. Tsirkin
  2015-10-01 16:06       ` Michael S. Tsirkin
  2015-10-01 14:50     ` Stephen Hemminger
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 10:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > This driver allows using PCI device with Message Signalled Interrupt
> > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > file descriptors similar to VFIO.
> >
> > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > have to work in environments where IOMMU support (real or emulated) is
> > not available.  All UIO drivers that support DMA are not secure against
> > rogue userspace applications programming DMA hardware to access
> > private memory; this driver is no less secure than existing code.
> > 
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> I don't think copying the igb_uio interface is a good idea.
> What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> is abusing the sysfs BAR access to provide unlimited
> access to hardware.
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> uio_pci_generic is already abused like that, mostly
> because when I wrote it, I didn't add enough protections
> against using it with DMA capable devices,
> and we can't go back and break working userspace.
> But at least it does not bind to VFs which all of
> them are capable of DMA.
> 
> The result of merging this driver will be userspace abusing the
> sysfs BAR access with VFs as well, and we do not want that.
> 
> 
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.
> 
> On a more positive note:
> 
> What would be a reasonable interface? One that does the following
> in kernel:
> 
> 1. initializes device rings (can be in pinned userspace memory,
>    but can not be writeable by userspace), brings up interface link
> 2. pins userspace memory (unless using e.g. hugetlbfs)
> 3. gets request, make sure it's valid and belongs to
>    the correct task, put it in the ring
> 4. in the reverse direction, notify userspace when buffers
>    are available in the ring
> 5. notify userspace about MSI (what this driver does)
> 
> What userspace can be allowed to do:
> 
> 	format requests (e.g. transmit, receive) in userspace
> 	read ring contents
> 
> What userspace can't be allowed to do:
> 
> 	access BAR
> 	write rings

Thinking about it some more, many devices have separate rings for
DMA: TX (device reads memory) and RX (device writes memory).
With such devices, a mode where userspace can write TX ring
but not RX ring might make sense.

This will mean userspace might read kernel memory
through the device, but can not corrupt it.

That's already a big win!

And RX buffers do not have to be added one at a time.
If we assume 0.2 microseconds per system call, batching some 100 buffers per
system call gives you about 2 nanoseconds of overhead per buffer.  That seems
quite reasonable.
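
As a crude illustration of that TX-writable/RX-protected split (offsets and
sizes invented for the example; a real driver would define how each ring is
exposed through its own mmap region):

#include <stddef.h>
#include <sys/mman.h>

#define RING_SIZE	4096
#define TX_RING_OFFSET	0
#define RX_RING_OFFSET	RING_SIZE

static int map_rings(int dev_fd, void **tx_ring, void **rx_ring)
{
	/* TX ring: userspace formats transmit descriptors directly */
	*tx_ring = mmap(NULL, RING_SIZE, PROT_READ | PROT_WRITE,
			MAP_SHARED, dev_fd, TX_RING_OFFSET);

	/* RX ring: mapped read-only, so only the kernel decides which
	 * buffers the device may write into
	 */
	*rx_ring = mmap(NULL, RING_SIZE, PROT_READ,
			MAP_SHARED, dev_fd, RX_RING_OFFSET);

	return (*tx_ring == MAP_FAILED || *rx_ring == MAP_FAILED) ? -1 : 0;
}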





> 
> This means that the driver can not be a generic one,
> and there will be a system call overhead when you
> write the ring, but that's the price you have to
> pay for ability to run on systems without an IOMMU.
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-09-30 22:28 [dpdk-dev] [PATCH 0/2] uio_msi: device driver Stephen Hemminger
                   ` (2 preceding siblings ...)
  2015-10-01  8:36 ` [dpdk-dev] [PATCH 0/2] uio_msi: device driver Michael S. Tsirkin
@ 2015-10-01 10:59 ` Avi Kivity
  2015-10-01 14:57   ` Stephen Hemminger
  3 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2015-10-01 10:59 UTC (permalink / raw)
  To: Stephen Hemminger, hjk, gregkh; +Cc: dev, linux-kernel

On 10/01/2015 01:28 AM, Stephen Hemminger wrote:
> This is a new UIO device driver to allow supporting MSI-X and MSI devices
> in userspace.  It has been used in environments like VMware and older versions
> of QEMU/KVM where no IOMMU support is available.

Why not add msi/msix support to uio_pci_generic?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01  8:33   ` Michael S. Tsirkin
  2015-10-01 10:37     ` Michael S. Tsirkin
@ 2015-10-01 14:50     ` Stephen Hemminger
  2015-10-01 15:22       ` Michael S. Tsirkin
  2015-10-01 16:31     ` Michael S. Tsirkin
  2015-10-05 21:54     ` Michael S. Tsirkin
  3 siblings, 1 reply; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-01 14:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, 1 Oct 2015 11:33:06 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > This driver allows using PCI device with Message Signalled Interrupt
> > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > file descriptors similar to VFIO.
> >
> > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > have to work in environments where IOMMU support (real or emulated) is
> > not available.  All UIO drivers that support DMA are not secure against
> > rogue userspace applications programming DMA hardware to access
> > private memory; this driver is no less secure than existing code.
> > 
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> I don't think copying the igb_uio interface is a good idea.
> What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> is abusing the sysfs BAR access to provide unlimited
> access to hardware.
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> uio_pci_generic is already abused like that, mostly
> because when I wrote it, I didn't add enough protections
> against using it with DMA capable devices,
> and we can't go back and break working userspace.
> But at least it does not bind to VFs which all of
> them are capable of DMA.
> 
> The result of merging this driver will be userspace abusing the
> sysfs BAR access with VFs as well, and we do not want that.
> 
> 
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.
> 
> On a more positive note:
> 
> What would be a reasonable interface? One that does the following
> in kernel:
> 
> 1. initializes device rings (can be in pinned userspace memory,
>    but can not be writeable by userspace), brings up interface link
> 2. pins userspace memory (unless using e.g. hugetlbfs)
> 3. gets request, make sure it's valid and belongs to
>    the correct task, put it in the ring
> 4. in the reverse direction, notify userspace when buffers
>    are available in the ring
> 5. notify userspace about MSI (what this driver does)
> 
> What userspace can be allowed to do:
> 
> 	format requests (e.g. transmit, receive) in userspace
> 	read ring contents
> 
> What userspace can't be allowed to do:
> 
> 	access BAR
> 	write rings
> 
> 
> This means that the driver can not be a generic one,
> and there will be a system call overhead when you
> write the ring, but that's the price you have to
> pay for ability to run on systems without an IOMMU.

I think I understand what you are proposing, but it really doesn't
fit into the high speed userspace networking model.

1. Device rings are device specific, can't be in a generic driver.
2. DPDK uses hugepage memory.
3. Performance requires all ring requests be done in pure userspace,
   (ie no syscalls)
4. Ditto, can't have kernel to userspace notification per packet

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-01 10:59 ` Avi Kivity
@ 2015-10-01 14:57   ` Stephen Hemminger
  2015-10-01 19:48     ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-01 14:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, 1 Oct 2015 13:59:02 +0300
Avi Kivity <avi@scylladb.com> wrote:

> On 10/01/2015 01:28 AM, Stephen Hemminger wrote:
> > This is a new UIO device driver to allow supporting MSI-X and MSI devices
> > in userspace.  It has been used in environments like VMware and older versions
> > of QEMU/KVM where no IOMMU support is available.  
> 
> Why not add msi/msix support to uio_pci_generic?

That is possible but that would meet ABI and other resistance from the author.
Also, uio_pci_generic makes it harder to find resources since it doesn't fully
utilize UIO infrastructure.
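
For context, when a driver registers the memory BARs as UIO maps (as
setup_maps() in this patch does), userspace can discover and map them
through the standard UIO sysfs layout; that is the kind of resource
lookup referred to above.  A minimal sketch, assuming the device came
up as /dev/uio0 and BAR0 became map0 (paths and indices are
illustrative):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		unsigned long size;
		void *bar0;
		FILE *f;
		int fd;

		/* each registered BAR shows up as /sys/class/uio/uio0/maps/mapN */
		f = fopen("/sys/class/uio/uio0/maps/map0/size", "r");
		if (!f)
			return 1;
		if (fscanf(f, "%lx", &size) != 1)
			return 1;
		fclose(f);

		fd = open("/dev/uio0", O_RDWR);
		if (fd < 0)
			return 1;

		/* UIO convention: map N is selected by offset N * page size */
		bar0 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			    fd, 0 * getpagesize());
		if (bar0 == MAP_FAILED)
			return 1;

		printf("BAR0 mapped, %lu bytes\n", size);
		return 0;
	}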

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01 14:50     ` Stephen Hemminger
@ 2015-10-01 15:22       ` Michael S. Tsirkin
  0 siblings, 0 replies; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 15:22 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, Oct 01, 2015 at 07:50:37AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 11:33:06 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > This driver allows using PCI device with Message Signalled Interrupt
> > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > file descriptors similar to VFIO.
> > >
> > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > have to work in environments where IOMMU support (real or emulated) is
> > > not available.  All UIO drivers that support DMA are not secure against
> > > rogue userspace applications programming DMA hardware to access
> > > private memory; this driver is no less secure than existing code.
> > > 
> > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > 
> > I don't think copying the igb_uio interface is a good idea.
> > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > is abusing the sysfs BAR access to provide unlimited
> > access to hardware.
> > 
> > MSI messages are memory writes so any generic device capable
> > of MSI is capable of corrupting kernel memory.
> > This means that a bug in userspace will lead to kernel memory corruption
> > and crashes.  This is something distributions can't support.
> > 
> > uio_pci_generic is already abused like that, mostly
> > because when I wrote it, I didn't add enough protections
> > against using it with DMA capable devices,
> > and we can't go back and break working userspace.
> > But at least it does not bind to VFs which all of
> > them are capable of DMA.
> > 
> > The result of merging this driver will be userspace abusing the
> > sysfs BAR access with VFs as well, and we do not want that.
> > 
> > 
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
> > 
> > On a more positive note:
> > 
> > What would be a reasonable interface? One that does the following
> > in kernel:
> > 
> > 1. initializes device rings (can be in pinned userspace memory,
> >    but can not be writeable by userspace), brings up interface link
> > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > 3. gets request, make sure it's valid and belongs to
> >    the correct task, put it in the ring
> > 4. in the reverse direction, notify userspace when buffers
> >    are available in the ring
> > 5. notify userspace about MSI (what this driver does)
> > 
> > What userspace can be allowed to do:
> > 
> > 	format requests (e.g. transmit, receive) in userspace
> > 	read ring contents
> > 
> > What userspace can't be allowed to do:
> > 
> > 	access BAR
> > 	write rings
> > 
> > 
> > This means that the driver can not be a generic one,
> > and there will be a system call overhead when you
> > write the ring, but that's the price you have to
> > pay for ability to run on systems without an IOMMU.
> 
> I think I understand what you are proposing, but it really doesn't
> fit into the high speed userspace networking model.

I'm aware of the fact that the model currently does everything, including
bringing up the link, in userspace.
But there's really no justification for this.
Only data-path things should be in userspace.

A userspace bug should not be able to do things like over-writing the
on-device EEPROM.


> 1. Device rings are device specific, can't be in a generic driver.

So that's more work, and it is not going to happen if people
can get by with insecure hacks.

> 2. DPDK uses hugepage memory.

Hugetlbfs? Don't see why this is an issue. Might make things simpler.

> 3. Performance requires all ring requests be done in pure userspace,
>    (ie no syscalls)

Make only the TX ring writeable then. At least you won't be
able to corrupt the kernel memory.

> 4. Ditto, can't have kernel to userspace notification per packet

RX ring can be read-only, so userspace can read it directly.

-- 
MST

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01 10:37     ` Michael S. Tsirkin
@ 2015-10-01 16:06       ` Michael S. Tsirkin
  0 siblings, 0 replies; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 16:06 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, Oct 01, 2015 at 01:37:12PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > This driver allows using PCI device with Message Signalled Interrupt
> > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > file descriptors similar to VFIO.
> > >
> > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > have to work in environments where IOMMU support (real or emulated) is
> > > not available.  All UIO drivers that support DMA are not secure against
> > > rogue userspace applications programming DMA hardware to access
> > > private memory; this driver is no less secure than existing code.
> > > 
> > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > 
> > I don't think copying the igb_uio interface is a good idea.
> > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > is abusing the sysfs BAR access to provide unlimited
> > access to hardware.
> > 
> > MSI messages are memory writes so any generic device capable
> > of MSI is capable of corrupting kernel memory.
> > This means that a bug in userspace will lead to kernel memory corruption
> > and crashes.  This is something distributions can't support.
> > 
> > uio_pci_generic is already abused like that, mostly
> > because when I wrote it, I didn't add enough protections
> > against using it with DMA capable devices,
> > and we can't go back and break working userspace.
> > But at least it does not bind to VFs which all of
> > them are capable of DMA.
> > 
> > The result of merging this driver will be userspace abusing the
> > sysfs BAR access with VFs as well, and we do not want that.
> > 
> > 
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
> > 
> > On a more positive note:
> > 
> > What would be a reasonable interface? One that does the following
> > in kernel:
> > 
> > 1. initializes device rings (can be in pinned userspace memory,
> >    but can not be writeable by userspace), brings up interface link
> > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > 3. gets request, make sure it's valid and belongs to
> >    the correct task, put it in the ring
> > 4. in the reverse direction, notify userspace when buffers
> >    are available in the ring
> > 5. notify userspace about MSI (what this driver does)
> > 
> > What userspace can be allowed to do:
> > 
> > 	format requests (e.g. transmit, receive) in userspace
> > 	read ring contents
> > 
> > What userspace can't be allowed to do:
> > 
> > 	access BAR
> > 	write rings
> 
> Thinking about it some more, many devices
> have separate rings for DMA: TX (device reads memory)
> and RX (device writes memory).
> With such devices, a mode where userspace can write TX ring
> but not RX ring might make sense.
> 
> This will mean userspace might read kernel memory
> through the device, but can not corrupt it.
> 
> That's already a big win!
> 
> And RX buffers do not have to be added one at a time.
> If we assume 0.2 usec per system call, batching some 100 buffers per
> system call gives you about 2 nanoseconds of overhead per buffer.
> That seems quite reasonable.
> 

To add to that, there's no reason to do this on the same core.
Re-arming descriptors can happen in parallel with packet
processing, so this overhead won't affect PPS or latency at all:
only the CPU utilization.
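
To spell out the arithmetic: 0.2 usec per system call amortized over
~100 buffers is roughly 2 ns of overhead per buffer.  A sketch of what
such a batched refill path could look like from userspace follows;
UIO_MSI_RX_REFILL, struct uio_rx_refill and alloc_rx_buffer() are
hypothetical names invented for illustration and are not part of this
patch or any existing interface:

	/* Sketch only: the ioctl, struct and allocator below are hypothetical. */
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/ioctl.h>

	#define RX_BATCH 100

	struct uio_rx_refill {
		uint32_t count;
		uint64_t addr[RX_BATCH];	/* buffer addresses to post */
	};
	#define UIO_MSI_RX_REFILL _IOW('I', 0x90, struct uio_rx_refill)

	extern uint64_t alloc_rx_buffer(void);	/* hypothetical allocator */

	/* Runs on a separate core, so the syscall cost stays off the RX path. */
	void refill_loop(int uio_fd)
	{
		struct uio_rx_refill req;

		for (;;) {
			for (req.count = 0; req.count < RX_BATCH; req.count++)
				req.addr[req.count] = alloc_rx_buffer();

			/* one system call re-arms RX_BATCH descriptors:
			 * ~0.2 usec / 100 buffers ~= 2 ns per buffer
			 */
			if (ioctl(uio_fd, UIO_MSI_RX_REFILL, &req) < 0)
				break;
		}
	}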


> 
> 
> 
> > 
> > This means that the driver can not be a generic one,
> > and there will be a system call overhead when you
> > write the ring, but that's the price you have to
> > pay for ability to run on systems without an IOMMU.
> > 
> > 
> > 
> > 
> > > ---
> > >  drivers/uio/Kconfig          |   9 ++
> > >  drivers/uio/Makefile         |   1 +
> > >  drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/Kbuild    |   1 +
> > >  include/uapi/linux/uio_msi.h |  22 +++
> > >  5 files changed, 411 insertions(+)
> > >  create mode 100644 drivers/uio/uio_msi.c
> > >  create mode 100644 include/uapi/linux/uio_msi.h
> > > 
> > > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> > > index 52c98ce..04adfa0 100644
> > > --- a/drivers/uio/Kconfig
> > > +++ b/drivers/uio/Kconfig
> > > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
> > >  	  primarily, for virtualization scenarios.
> > >  	  If you compile this as a module, it will be called uio_pci_generic.
> > >  
> > > +config UIO_PCI_MSI
> > > +       tristate "Generic driver supporting MSI-x on PCI Express cards"
> > > +	depends on PCI
> > > +	help
> > > +	  Generic driver that provides Message Signalled IRQ events
> > > +	  similar to VFIO. If IOMMMU is available please use VFIO
> > > +	  instead since it provides more security.
> > > +	  If you compile this as a module, it will be called uio_msi.
> > > +
> > >  config UIO_NETX
> > >  	tristate "Hilscher NetX Card driver"
> > >  	depends on PCI
> > > diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> > > index 8560dad..62fc44b 100644
> > > --- a/drivers/uio/Makefile
> > > +++ b/drivers/uio/Makefile
> > > @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
> > >  obj-$(CONFIG_UIO_PRUSS)         += uio_pruss.o
> > >  obj-$(CONFIG_UIO_MF624)         += uio_mf624.o
> > >  obj-$(CONFIG_UIO_FSL_ELBC_GPCM)	+= uio_fsl_elbc_gpcm.o
> > > +obj-$(CONFIG_UIO_PCI_MSI)	+= uio_msi.o
> > > diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
> > > new file mode 100644
> > > index 0000000..802b5c4
> > > --- /dev/null
> > > +++ b/drivers/uio/uio_msi.c
> > > @@ -0,0 +1,378 @@
> > > +/*-
> > > + *
> > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> > > + * Author: Stephen Hemminger <stephen@networkplumber.org>
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2 only.
> > > + */
> > > +
> > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > +
> > > +#include <linux/device.h>
> > > +#include <linux/interrupt.h>
> > > +#include <linux/eventfd.h>
> > > +#include <linux/module.h>
> > > +#include <linux/pci.h>
> > > +#include <linux/uio_driver.h>
> > > +#include <linux/msi.h>
> > > +#include <linux/uio_msi.h>
> > > +
> > > +#define DRIVER_VERSION		"0.1.1"
> > > +#define MAX_MSIX_VECTORS	64
> > > +
> > > +/* MSI-X vector information */
> > > +struct uio_msi_pci_dev {
> > > +	struct uio_info info;		/* UIO driver info */
> > > +	struct pci_dev *pdev;		/* PCI device */
> > > +	struct mutex	mutex;		/* open/release/ioctl mutex */
> > > +	int		ref_cnt;	/* references to device */
> > > +	unsigned int	max_vectors;	/* MSI-X slots available */
> > > +	struct msix_entry *msix;	/* MSI-X vector table */
> > > +	struct uio_msi_irq_ctx {
> > > +		struct eventfd_ctx *trigger; /* vector to eventfd */
> > > +		char *name;		/* name in /proc/interrupts */
> > > +	} *ctx;
> > > +};
> > > +
> > > +static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
> > > +{
> > > +	struct uio_msi_pci_dev *udev = arg;
> > > +
> > > +	if (pci_check_and_mask_intx(udev->pdev)) {
> > > +		eventfd_signal(udev->ctx->trigger, 1);
> > > +		return IRQ_HANDLED;
> > > +	}
> > > +
> > > +	return IRQ_NONE;
> > > +}
> > > +
> > > +static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
> > > +{
> > > +	struct eventfd_ctx *trigger = arg;
> > > +
> > > +	eventfd_signal(trigger, 1);
> > > +	return IRQ_HANDLED;
> > > +}
> > > +
> > > +/* set the mapping between vector # and existing eventfd. */
> > > +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
> > > +{
> > > +	struct eventfd_ctx *trigger;
> > > +	int irq, err;
> > > +
> > > +	if (vec >= udev->max_vectors) {
> > > +		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
> > > +			   vec, udev->max_vectors);
> > > +		return -ERANGE;
> > > +	}
> > > +
> > > +	irq = udev->msix[vec].vector;
> > > +	trigger = udev->ctx[vec].trigger;
> > > +	if (trigger) {
> > > +		/* Clearup existing irq mapping */
> > > +		free_irq(irq, trigger);
> > > +		eventfd_ctx_put(trigger);
> > > +		udev->ctx[vec].trigger = NULL;
> > > +	}
> > > +
> > > +	/* Passing -1 is used to disable interrupt */
> > > +	if (fd < 0)
> > > +		return 0;
> > > +
> > > +	trigger = eventfd_ctx_fdget(fd);
> > > +	if (IS_ERR(trigger)) {
> > > +		err = PTR_ERR(trigger);
> > > +		dev_notice(&udev->pdev->dev,
> > > +			   "eventfd ctx get failed: %d\n", err);
> > > +		return err;
> > > +	}
> > > +
> > > +	if (udev->msix)
> > > +		err = request_irq(irq, uio_msi_irqhandler, 0,
> > > +				  udev->ctx[vec].name, trigger);
> > > +	else
> > > +		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
> > > +				  udev->ctx[vec].name, udev);
> > > +
> > > +	if (err) {
> > > +		dev_notice(&udev->pdev->dev,
> > > +			   "request irq failed: %d\n", err);
> > > +		eventfd_ctx_put(trigger);
> > > +		return err;
> > > +	}
> > > +
> > > +	udev->ctx[vec].trigger = trigger;
> > > +	return 0;
> > > +}
> > > +
> > > +static int
> > > +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
> > > +{
> > > +	struct uio_msi_pci_dev *udev
> > > +		= container_of(info, struct uio_msi_pci_dev, info);
> > > +	struct uio_msi_irq_set hdr;
> > > +	int err;
> > > +
> > > +	switch (cmd) {
> > > +	case UIO_MSI_IRQ_SET:
> > > +		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
> > > +			return -EFAULT;
> > > +
> > > +		mutex_lock(&udev->mutex);
> > > +		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
> > > +		mutex_unlock(&udev->mutex);
> > > +		break;
> > > +	default:
> > > +		err = -EOPNOTSUPP;
> > > +	}
> > > +	return err;
> > > +}
> > > +
> > > +/* Opening the UIO device for first time enables MSI-X */
> > > +static int
> > > +uio_msi_open(struct uio_info *info, struct inode *inode)
> > > +{
> > > +	struct uio_msi_pci_dev *udev
> > > +		= container_of(info, struct uio_msi_pci_dev, info);
> > > +	int err = 0;
> > > +
> > > +	mutex_lock(&udev->mutex);
> > > +	if (udev->ref_cnt++ == 0) {
> > > +		if (udev->msix)
> > > +			err = pci_enable_msix(udev->pdev, udev->msix,
> > > +					      udev->max_vectors);
> > > +	}
> > > +	mutex_unlock(&udev->mutex);
> > > +
> > > +	return err;
> > > +}
> > > +
> > > +/* Last close of the UIO device releases/disables all IRQ's */
> > > +static int
> > > +uio_msi_release(struct uio_info *info, struct inode *inode)
> > > +{
> > > +	struct uio_msi_pci_dev *udev
> > > +		= container_of(info, struct uio_msi_pci_dev, info);
> > > +	int i;
> > > +
> > > +	mutex_lock(&udev->mutex);
> > > +	if (--udev->ref_cnt == 0) {
> > > +		for (i = 0; i < udev->max_vectors; i++) {
> > > +			int irq = udev->msix[i].vector;
> > > +			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
> > > +
> > > +			if (!trigger)
> > > +				continue;
> > > +
> > > +			free_irq(irq, trigger);
> > > +			eventfd_ctx_put(trigger);
> > > +			udev->ctx[i].trigger = NULL;
> > > +		}
> > > +
> > > +		if (udev->msix)
> > > +			pci_disable_msix(udev->pdev);
> > > +	}
> > > +	mutex_unlock(&udev->mutex);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/* Unmap previously ioremap'd resources */
> > > +static void
> > > +release_iomaps(struct uio_mem *mem)
> > > +{
> > > +	int i;
> > > +
> > > +	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
> > > +		if (mem->internal_addr)
> > > +			iounmap(mem->internal_addr);
> > > +	}
> > > +}
> > > +
> > > +static int
> > > +setup_maps(struct pci_dev *pdev, struct uio_info *info)
> > > +{
> > > +	int i, m = 0, p = 0, err;
> > > +	static const char * const bar_names[] = {
> > > +		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
> > > +	};
> > > +
> > > +	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
> > > +		unsigned long start = pci_resource_start(pdev, i);
> > > +		unsigned long flags = pci_resource_flags(pdev, i);
> > > +		unsigned long len = pci_resource_len(pdev, i);
> > > +
> > > +		if (start == 0 || len == 0)
> > > +			continue;
> > > +
> > > +		if (flags & IORESOURCE_MEM) {
> > > +			void __iomem *addr;
> > > +
> > > +			if (m >= MAX_UIO_MAPS)
> > > +				continue;
> > > +
> > > +			addr = ioremap(start, len);
> > > +			if (addr == NULL) {
> > > +				err = -EINVAL;
> > > +				goto fail;
> > > +			}
> > > +
> > > +			info->mem[m].name = bar_names[i];
> > > +			info->mem[m].addr = start;
> > > +			info->mem[m].internal_addr = addr;
> > > +			info->mem[m].size = len;
> > > +			info->mem[m].memtype = UIO_MEM_PHYS;
> > > +			++m;
> > > +		} else if (flags & IORESOURCE_IO) {
> > > +			if (p >= MAX_UIO_PORT_REGIONS)
> > > +				continue;
> > > +
> > > +			info->port[p].name = bar_names[i];
> > > +			info->port[p].start = start;
> > > +			info->port[p].size = len;
> > > +			info->port[p].porttype = UIO_PORT_X86;
> > > +			++p;
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > + fail:
> > > +	for (i = 0; i < m; i++)
> > > +		iounmap(info->mem[i].internal_addr);
> > > +	return err;
> > > +}
> > > +
> > > +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > > +{
> > > +	struct uio_msi_pci_dev *udev;
> > > +	int i, err, vectors;
> > > +
> > > +	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
> > > +	if (!udev)
> > > +		return -ENOMEM;
> > > +
> > > +	err = pci_enable_device(pdev);
> > > +	if (err != 0) {
> > > +		dev_err(&pdev->dev, "cannot enable PCI device\n");
> > > +		goto fail_free;
> > > +	}
> > > +
> > > +	err = pci_request_regions(pdev, "uio_msi");
> > > +	if (err != 0) {
> > > +		dev_err(&pdev->dev, "Cannot request regions\n");
> > > +		goto fail_disable;
> > > +	}
> > > +
> > > +	pci_set_master(pdev);
> > > +
> > > +	/* remap resources */
> > > +	err = setup_maps(pdev, &udev->info);
> > > +	if (err)
> > > +		goto fail_release_iomem;
> > > +
> > > +	/* fill uio infos */
> > > +	udev->info.name = "uio_msi";
> > > +	udev->info.version = DRIVER_VERSION;
> > > +	udev->info.priv = udev;
> > > +	udev->pdev = pdev;
> > > +	udev->info.ioctl = uio_msi_ioctl;
> > > +	udev->info.open = uio_msi_open;
> > > +	udev->info.release = uio_msi_release;
> > > +	udev->info.irq = UIO_IRQ_CUSTOM;
> > > +	mutex_init(&udev->mutex);
> > > +
> > > +	vectors = pci_msix_vec_count(pdev);
> > > +	if (vectors > 0) {
> > > +		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
> > > +		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
> > > +			udev->max_vectors);
> > > +
> > > +		err = -ENOMEM;
> > > +		udev->msix = kcalloc(udev->max_vectors,
> > > +				     sizeof(struct msix_entry), GFP_KERNEL);
> > > +		if (!udev->msix)
> > > +			goto fail_release_iomem;
> > > +	} else if (!pci_intx_mask_supported(pdev)) {
> > > +		dev_err(&pdev->dev,
> > > +			"device does not support MSI-X or INTX\n");
> > > +		err = -EINVAL;
> > > +		goto fail_release_iomem;
> > > +	} else {
> > > +		dev_notice(&pdev->dev, "using INTX\n");
> > > +		udev->info.irq_flags = IRQF_SHARED;
> > > +		udev->max_vectors = 1;
> > > +	}
> > > +
> > > +	udev->ctx = kcalloc(udev->max_vectors,
> > > +			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
> > > +	if (!udev->ctx)
> > > +		goto fail_free_msix;
> > > +
> > > +	for (i = 0; i < udev->max_vectors; i++) {
> > > +		udev->msix[i].entry = i;
> > > +
> > > +		udev->ctx[i].name = kasprintf(GFP_KERNEL,
> > > +					      KBUILD_MODNAME "[%d](%s)",
> > > +					      i, pci_name(pdev));
> > > +		if (!udev->ctx[i].name)
> > > +			goto fail_free_ctx;
> > > +	}
> > > +
> > > +	/* register uio driver */
> > > +	err = uio_register_device(&pdev->dev, &udev->info);
> > > +	if (err != 0)
> > > +		goto fail_free_ctx;
> > > +
> > > +	pci_set_drvdata(pdev, udev);
> > > +	return 0;
> > > +
> > > +fail_free_ctx:
> > > +	for (i = 0; i < udev->max_vectors; i++)
> > > +		kfree(udev->ctx[i].name);
> > > +	kfree(udev->ctx);
> > > +fail_free_msix:
> > > +	kfree(udev->msix);
> > > +fail_release_iomem:
> > > +	release_iomaps(udev->info.mem);
> > > +	pci_release_regions(pdev);
> > > +fail_disable:
> > > +	pci_disable_device(pdev);
> > > +fail_free:
> > > +	kfree(udev);
> > > +
> > > +	pr_notice("%s ret %d\n", __func__, err);
> > > +	return err;
> > > +}
> > > +
> > > +static void uio_msi_remove(struct pci_dev *pdev)
> > > +{
> > > +	struct uio_info *info = pci_get_drvdata(pdev);
> > > +	struct uio_msi_pci_dev *udev
> > > +		= container_of(info, struct uio_msi_pci_dev, info);
> > > +	int i;
> > > +
> > > +	uio_unregister_device(info);
> > > +	release_iomaps(info->mem);
> > > +
> > > +	pci_release_regions(pdev);
> > > +	for (i = 0; i < udev->max_vectors; i++)
> > > +		kfree(udev->ctx[i].name);
> > > +	kfree(udev->ctx);
> > > +	kfree(udev->msix);
> > > +	pci_disable_device(pdev);
> > > +
> > > +	pci_set_drvdata(pdev, NULL);
> > > +	kfree(udev);
> > > +}
> > > +
> > > +static struct pci_driver uio_msi_pci_driver = {
> > > +	.name = "uio_msi",
> > > +	.probe = uio_msi_probe,
> > > +	.remove = uio_msi_remove,
> > > +};
> > > +
> > > +module_pci_driver(uio_msi_pci_driver);
> > > +MODULE_VERSION(DRIVER_VERSION);
> > > +MODULE_LICENSE("GPL v2");
> > > +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
> > > +MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
> > > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> > > index f7b2db4..d9497691 100644
> > > --- a/include/uapi/linux/Kbuild
> > > +++ b/include/uapi/linux/Kbuild
> > > @@ -411,6 +411,7 @@ header-y += udp.h
> > >  header-y += uhid.h
> > >  header-y += uinput.h
> > >  header-y += uio.h
> > > +header-y += uio_msi.h
> > >  header-y += ultrasound.h
> > >  header-y += un.h
> > >  header-y += unistd.h
> > > diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
> > > new file mode 100644
> > > index 0000000..297de00
> > > --- /dev/null
> > > +++ b/include/uapi/linux/uio_msi.h
> > > @@ -0,0 +1,22 @@
> > > +/*
> > > + * UIO_MSI API definition
> > > + *
> > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> > > + * All rights reserved.
> > > + *
> > > + * This program is free software; you can redistribute it and/or modify
> > > + * it under the terms of the GNU General Public License version 2 as
> > > + * published by the Free Software Foundation.
> > > + */
> > > +#ifndef _UIO_PCI_MSI_H
> > > +#define _UIO_PCI_MSI_H
> > > +
> > > +struct uio_msi_irq_set {
> > > +	u32 vec;
> > > +	int fd;
> > > +};
> > > +
> > > +#define UIO_MSI_BASE	0x86
> > > +#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
> > > +
> > > +#endif
> > > -- 
> > > 2.1.4
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at  http://www.tux.org/lkml/
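
For reference, a minimal userspace sketch of the UIO_MSI_IRQ_SET flow
defined in the uapi header quoted above might look like the following.
The struct layout and ioctl number mirror the patch; the device path
and vector number are illustrative:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/eventfd.h>
	#include <sys/ioctl.h>
	#include <unistd.h>

	struct uio_msi_irq_set {	/* mirrors include/uapi/linux/uio_msi.h */
		uint32_t vec;
		int fd;
	};
	#define UIO_MSI_BASE	0x86
	#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE + 1, struct uio_msi_irq_set)

	int main(void)
	{
		struct uio_msi_irq_set set;
		uint64_t count;
		int uio_fd, efd;

		uio_fd = open("/dev/uio0", O_RDWR);	/* first open enables MSI-X */
		efd = eventfd(0, 0);
		if (uio_fd < 0 || efd < 0)
			return 1;

		/* bind MSI-X vector 0 to the eventfd; fd = -1 would disable it */
		set.vec = 0;
		set.fd = efd;
		if (ioctl(uio_fd, UIO_MSI_IRQ_SET, &set) < 0)
			return 1;

		/* block until vector 0 fires; the read returns the event count */
		if (read(efd, &count, sizeof(count)) == sizeof(count))
			printf("vector 0 fired %llu time(s)\n",
			       (unsigned long long)count);
		return 0;
	}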

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01  8:33   ` Michael S. Tsirkin
  2015-10-01 10:37     ` Michael S. Tsirkin
  2015-10-01 14:50     ` Stephen Hemminger
@ 2015-10-01 16:31     ` Michael S. Tsirkin
  2015-10-01 17:26       ` Stephen Hemminger
  2015-10-05 21:54     ` Michael S. Tsirkin
  3 siblings, 1 reply; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 16:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > This driver allows using PCI device with Message Signalled Interrupt
> > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > file descriptors similar to VFIO.
> >
> > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > have to work in environments where IOMMU support (real or emulated) is
> > not available.  All UIO drivers that support DMA are not secure against
> > rogue userspace applications programming DMA hardware to access
> > private memory; this driver is no less secure than existing code.
> > 
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> 
> I don't think copying the igb_uio interface is a good idea.
> What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> is abusing the sysfs BAR access to provide unlimited
> access to hardware.
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> uio_pci_generic is already abused like that, mostly
> because when I wrote it, I didn't add enough protections
> against using it with DMA capable devices,
> and we can't go back and break working userspace.
> But at least it does not bind to VFs which all of
> them are capable of DMA.
> 
> The result of merging this driver will be userspace abusing the
> sysfs BAR access with VFs as well, and we do not want that.
> 
> 
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.
> 
> On a more positive note:
> 
> What would be a reasonable interface? One that does the following
> in kernel:
> 
> 1. initializes device rings (can be in pinned userspace memory,
>    but can not be writeable by userspace), brings up interface link
> 2. pins userspace memory (unless using e.g. hugetlbfs)
> 3. gets request, make sure it's valid and belongs to
>    the correct task, put it in the ring
> 4. in the reverse direction, notify userspace when buffers
>    are available in the ring
> 5. notify userspace about MSI (what this driver does)
> 
> What userspace can be allowed to do:
> 
> 	format requests (e.g. transmit, receive) in userspace
> 	read ring contents
> 
> What userspace can't be allowed to do:
> 
> 	access BAR
> 	write rings
> 
> 
> This means that the driver can not be a generic one,
> and there will be a system call overhead when you
> write the ring, but that's the price you have to
> pay for ability to run on systems without an IOMMU.
> 


The device specific parts can be taken from John Fastabend's patches
BTW:

https://patchwork.ozlabs.org/patch/396713/

IIUC what was missing there was exactly the memory protection
we are looking for here.


> 
> 
> > ---
> >  drivers/uio/Kconfig          |   9 ++
> >  drivers/uio/Makefile         |   1 +
> >  drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/Kbuild    |   1 +
> >  include/uapi/linux/uio_msi.h |  22 +++
> >  5 files changed, 411 insertions(+)
> >  create mode 100644 drivers/uio/uio_msi.c
> >  create mode 100644 include/uapi/linux/uio_msi.h
> > 
> > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> > index 52c98ce..04adfa0 100644
> > --- a/drivers/uio/Kconfig
> > +++ b/drivers/uio/Kconfig
> > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
> >  	  primarily, for virtualization scenarios.
> >  	  If you compile this as a module, it will be called uio_pci_generic.
> >  
> > +config UIO_PCI_MSI
> > +       tristate "Generic driver supporting MSI-x on PCI Express cards"
> > +	depends on PCI
> > +	help
> > +	  Generic driver that provides Message Signalled IRQ events
> > +	  similar to VFIO. If IOMMMU is available please use VFIO
> > +	  instead since it provides more security.
> > +	  If you compile this as a module, it will be called uio_msi.
> > +
> >  config UIO_NETX
> >  	tristate "Hilscher NetX Card driver"
> >  	depends on PCI
> > diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> > index 8560dad..62fc44b 100644
> > --- a/drivers/uio/Makefile
> > +++ b/drivers/uio/Makefile
> > @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
> >  obj-$(CONFIG_UIO_PRUSS)         += uio_pruss.o
> >  obj-$(CONFIG_UIO_MF624)         += uio_mf624.o
> >  obj-$(CONFIG_UIO_FSL_ELBC_GPCM)	+= uio_fsl_elbc_gpcm.o
> > +obj-$(CONFIG_UIO_PCI_MSI)	+= uio_msi.o
> > diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
> > new file mode 100644
> > index 0000000..802b5c4
> > --- /dev/null
> > +++ b/drivers/uio/uio_msi.c
> > @@ -0,0 +1,378 @@
> > +/*-
> > + *
> > + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> > + * Author: Stephen Hemminger <stephen@networkplumber.org>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 only.
> > + */
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +#include <linux/device.h>
> > +#include <linux/interrupt.h>
> > +#include <linux/eventfd.h>
> > +#include <linux/module.h>
> > +#include <linux/pci.h>
> > +#include <linux/uio_driver.h>
> > +#include <linux/msi.h>
> > +#include <linux/uio_msi.h>
> > +
> > +#define DRIVER_VERSION		"0.1.1"
> > +#define MAX_MSIX_VECTORS	64
> > +
> > +/* MSI-X vector information */
> > +struct uio_msi_pci_dev {
> > +	struct uio_info info;		/* UIO driver info */
> > +	struct pci_dev *pdev;		/* PCI device */
> > +	struct mutex	mutex;		/* open/release/ioctl mutex */
> > +	int		ref_cnt;	/* references to device */
> > +	unsigned int	max_vectors;	/* MSI-X slots available */
> > +	struct msix_entry *msix;	/* MSI-X vector table */
> > +	struct uio_msi_irq_ctx {
> > +		struct eventfd_ctx *trigger; /* vector to eventfd */
> > +		char *name;		/* name in /proc/interrupts */
> > +	} *ctx;
> > +};
> > +
> > +static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
> > +{
> > +	struct uio_msi_pci_dev *udev = arg;
> > +
> > +	if (pci_check_and_mask_intx(udev->pdev)) {
> > +		eventfd_signal(udev->ctx->trigger, 1);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	return IRQ_NONE;
> > +}
> > +
> > +static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
> > +{
> > +	struct eventfd_ctx *trigger = arg;
> > +
> > +	eventfd_signal(trigger, 1);
> > +	return IRQ_HANDLED;
> > +}
> > +
> > +/* set the mapping between vector # and existing eventfd. */
> > +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
> > +{
> > +	struct eventfd_ctx *trigger;
> > +	int irq, err;
> > +
> > +	if (vec >= udev->max_vectors) {
> > +		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
> > +			   vec, udev->max_vectors);
> > +		return -ERANGE;
> > +	}
> > +
> > +	irq = udev->msix[vec].vector;
> > +	trigger = udev->ctx[vec].trigger;
> > +	if (trigger) {
> > +		/* Clearup existing irq mapping */
> > +		free_irq(irq, trigger);
> > +		eventfd_ctx_put(trigger);
> > +		udev->ctx[vec].trigger = NULL;
> > +	}
> > +
> > +	/* Passing -1 is used to disable interrupt */
> > +	if (fd < 0)
> > +		return 0;
> > +
> > +	trigger = eventfd_ctx_fdget(fd);
> > +	if (IS_ERR(trigger)) {
> > +		err = PTR_ERR(trigger);
> > +		dev_notice(&udev->pdev->dev,
> > +			   "eventfd ctx get failed: %d\n", err);
> > +		return err;
> > +	}
> > +
> > +	if (udev->msix)
> > +		err = request_irq(irq, uio_msi_irqhandler, 0,
> > +				  udev->ctx[vec].name, trigger);
> > +	else
> > +		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
> > +				  udev->ctx[vec].name, udev);
> > +
> > +	if (err) {
> > +		dev_notice(&udev->pdev->dev,
> > +			   "request irq failed: %d\n", err);
> > +		eventfd_ctx_put(trigger);
> > +		return err;
> > +	}
> > +
> > +	udev->ctx[vec].trigger = trigger;
> > +	return 0;
> > +}
> > +
> > +static int
> > +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
> > +{
> > +	struct uio_msi_pci_dev *udev
> > +		= container_of(info, struct uio_msi_pci_dev, info);
> > +	struct uio_msi_irq_set hdr;
> > +	int err;
> > +
> > +	switch (cmd) {
> > +	case UIO_MSI_IRQ_SET:
> > +		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
> > +			return -EFAULT;
> > +
> > +		mutex_lock(&udev->mutex);
> > +		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
> > +		mutex_unlock(&udev->mutex);
> > +		break;
> > +	default:
> > +		err = -EOPNOTSUPP;
> > +	}
> > +	return err;
> > +}
> > +
> > +/* Opening the UIO device for first time enables MSI-X */
> > +static int
> > +uio_msi_open(struct uio_info *info, struct inode *inode)
> > +{
> > +	struct uio_msi_pci_dev *udev
> > +		= container_of(info, struct uio_msi_pci_dev, info);
> > +	int err = 0;
> > +
> > +	mutex_lock(&udev->mutex);
> > +	if (udev->ref_cnt++ == 0) {
> > +		if (udev->msix)
> > +			err = pci_enable_msix(udev->pdev, udev->msix,
> > +					      udev->max_vectors);
> > +	}
> > +	mutex_unlock(&udev->mutex);
> > +
> > +	return err;
> > +}
> > +
> > +/* Last close of the UIO device releases/disables all IRQ's */
> > +static int
> > +uio_msi_release(struct uio_info *info, struct inode *inode)
> > +{
> > +	struct uio_msi_pci_dev *udev
> > +		= container_of(info, struct uio_msi_pci_dev, info);
> > +	int i;
> > +
> > +	mutex_lock(&udev->mutex);
> > +	if (--udev->ref_cnt == 0) {
> > +		for (i = 0; i < udev->max_vectors; i++) {
> > +			int irq = udev->msix[i].vector;
> > +			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
> > +
> > +			if (!trigger)
> > +				continue;
> > +
> > +			free_irq(irq, trigger);
> > +			eventfd_ctx_put(trigger);
> > +			udev->ctx[i].trigger = NULL;
> > +		}
> > +
> > +		if (udev->msix)
> > +			pci_disable_msix(udev->pdev);
> > +	}
> > +	mutex_unlock(&udev->mutex);
> > +
> > +	return 0;
> > +}
> > +
> > +/* Unmap previously ioremap'd resources */
> > +static void
> > +release_iomaps(struct uio_mem *mem)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
> > +		if (mem->internal_addr)
> > +			iounmap(mem->internal_addr);
> > +	}
> > +}
> > +
> > +static int
> > +setup_maps(struct pci_dev *pdev, struct uio_info *info)
> > +{
> > +	int i, m = 0, p = 0, err;
> > +	static const char * const bar_names[] = {
> > +		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
> > +	};
> > +
> > +	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
> > +		unsigned long start = pci_resource_start(pdev, i);
> > +		unsigned long flags = pci_resource_flags(pdev, i);
> > +		unsigned long len = pci_resource_len(pdev, i);
> > +
> > +		if (start == 0 || len == 0)
> > +			continue;
> > +
> > +		if (flags & IORESOURCE_MEM) {
> > +			void __iomem *addr;
> > +
> > +			if (m >= MAX_UIO_MAPS)
> > +				continue;
> > +
> > +			addr = ioremap(start, len);
> > +			if (addr == NULL) {
> > +				err = -EINVAL;
> > +				goto fail;
> > +			}
> > +
> > +			info->mem[m].name = bar_names[i];
> > +			info->mem[m].addr = start;
> > +			info->mem[m].internal_addr = addr;
> > +			info->mem[m].size = len;
> > +			info->mem[m].memtype = UIO_MEM_PHYS;
> > +			++m;
> > +		} else if (flags & IORESOURCE_IO) {
> > +			if (p >= MAX_UIO_PORT_REGIONS)
> > +				continue;
> > +
> > +			info->port[p].name = bar_names[i];
> > +			info->port[p].start = start;
> > +			info->port[p].size = len;
> > +			info->port[p].porttype = UIO_PORT_X86;
> > +			++p;
> > +		}
> > +	}
> > +
> > +	return 0;
> > + fail:
> > +	for (i = 0; i < m; i++)
> > +		iounmap(info->mem[i].internal_addr);
> > +	return err;
> > +}
> > +
> > +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > +{
> > +	struct uio_msi_pci_dev *udev;
> > +	int i, err, vectors;
> > +
> > +	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
> > +	if (!udev)
> > +		return -ENOMEM;
> > +
> > +	err = pci_enable_device(pdev);
> > +	if (err != 0) {
> > +		dev_err(&pdev->dev, "cannot enable PCI device\n");
> > +		goto fail_free;
> > +	}
> > +
> > +	err = pci_request_regions(pdev, "uio_msi");
> > +	if (err != 0) {
> > +		dev_err(&pdev->dev, "Cannot request regions\n");
> > +		goto fail_disable;
> > +	}
> > +
> > +	pci_set_master(pdev);
> > +
> > +	/* remap resources */
> > +	err = setup_maps(pdev, &udev->info);
> > +	if (err)
> > +		goto fail_release_iomem;
> > +
> > +	/* fill uio infos */
> > +	udev->info.name = "uio_msi";
> > +	udev->info.version = DRIVER_VERSION;
> > +	udev->info.priv = udev;
> > +	udev->pdev = pdev;
> > +	udev->info.ioctl = uio_msi_ioctl;
> > +	udev->info.open = uio_msi_open;
> > +	udev->info.release = uio_msi_release;
> > +	udev->info.irq = UIO_IRQ_CUSTOM;
> > +	mutex_init(&udev->mutex);
> > +
> > +	vectors = pci_msix_vec_count(pdev);
> > +	if (vectors > 0) {
> > +		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
> > +		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
> > +			udev->max_vectors);
> > +
> > +		err = -ENOMEM;
> > +		udev->msix = kcalloc(udev->max_vectors,
> > +				     sizeof(struct msix_entry), GFP_KERNEL);
> > +		if (!udev->msix)
> > +			goto fail_release_iomem;
> > +	} else if (!pci_intx_mask_supported(pdev)) {
> > +		dev_err(&pdev->dev,
> > +			"device does not support MSI-X or INTX\n");
> > +		err = -EINVAL;
> > +		goto fail_release_iomem;
> > +	} else {
> > +		dev_notice(&pdev->dev, "using INTX\n");
> > +		udev->info.irq_flags = IRQF_SHARED;
> > +		udev->max_vectors = 1;
> > +	}
> > +
> > +	udev->ctx = kcalloc(udev->max_vectors,
> > +			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
> > +	if (!udev->ctx)
> > +		goto fail_free_msix;
> > +
> > +	for (i = 0; i < udev->max_vectors; i++) {
> > +		udev->msix[i].entry = i;
> > +
> > +		udev->ctx[i].name = kasprintf(GFP_KERNEL,
> > +					      KBUILD_MODNAME "[%d](%s)",
> > +					      i, pci_name(pdev));
> > +		if (!udev->ctx[i].name)
> > +			goto fail_free_ctx;
> > +	}
> > +
> > +	/* register uio driver */
> > +	err = uio_register_device(&pdev->dev, &udev->info);
> > +	if (err != 0)
> > +		goto fail_free_ctx;
> > +
> > +	pci_set_drvdata(pdev, udev);
> > +	return 0;
> > +
> > +fail_free_ctx:
> > +	for (i = 0; i < udev->max_vectors; i++)
> > +		kfree(udev->ctx[i].name);
> > +	kfree(udev->ctx);
> > +fail_free_msix:
> > +	kfree(udev->msix);
> > +fail_release_iomem:
> > +	release_iomaps(udev->info.mem);
> > +	pci_release_regions(pdev);
> > +fail_disable:
> > +	pci_disable_device(pdev);
> > +fail_free:
> > +	kfree(udev);
> > +
> > +	pr_notice("%s ret %d\n", __func__, err);
> > +	return err;
> > +}
> > +
> > +static void uio_msi_remove(struct pci_dev *pdev)
> > +{
> > +	struct uio_info *info = pci_get_drvdata(pdev);
> > +	struct uio_msi_pci_dev *udev
> > +		= container_of(info, struct uio_msi_pci_dev, info);
> > +	int i;
> > +
> > +	uio_unregister_device(info);
> > +	release_iomaps(info->mem);
> > +
> > +	pci_release_regions(pdev);
> > +	for (i = 0; i < udev->max_vectors; i++)
> > +		kfree(udev->ctx[i].name);
> > +	kfree(udev->ctx);
> > +	kfree(udev->msix);
> > +	pci_disable_device(pdev);
> > +
> > +	pci_set_drvdata(pdev, NULL);
> > +	kfree(udev);
> > +}
> > +
> > +static struct pci_driver uio_msi_pci_driver = {
> > +	.name = "uio_msi",
> > +	.probe = uio_msi_probe,
> > +	.remove = uio_msi_remove,
> > +};
> > +
> > +module_pci_driver(uio_msi_pci_driver);
> > +MODULE_VERSION(DRIVER_VERSION);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
> > +MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
> > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> > index f7b2db4..d9497691 100644
> > --- a/include/uapi/linux/Kbuild
> > +++ b/include/uapi/linux/Kbuild
> > @@ -411,6 +411,7 @@ header-y += udp.h
> >  header-y += uhid.h
> >  header-y += uinput.h
> >  header-y += uio.h
> > +header-y += uio_msi.h
> >  header-y += ultrasound.h
> >  header-y += un.h
> >  header-y += unistd.h
> > diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
> > new file mode 100644
> > index 0000000..297de00
> > --- /dev/null
> > +++ b/include/uapi/linux/uio_msi.h
> > @@ -0,0 +1,22 @@
> > +/*
> > + * UIO_MSI API definition
> > + *
> > + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> > + * All rights reserved.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +#ifndef _UIO_PCI_MSI_H
> > +#define _UIO_PCI_MSI_H
> > +
> > +struct uio_msi_irq_set {
> > +	u32 vec;
> > +	int fd;
> > +};
> > +
> > +#define UIO_MSI_BASE	0x86
> > +#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
> > +
> > +#endif
> > -- 
> > 2.1.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01 16:31     ` Michael S. Tsirkin
@ 2015-10-01 17:26       ` Stephen Hemminger
  2015-10-01 18:25         ` Michael S. Tsirkin
  0 siblings, 1 reply; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-01 17:26 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, 1 Oct 2015 19:31:08 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > This driver allows using PCI device with Message Signalled Interrupt
> > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > file descriptors similar to VFIO.
> > >
> > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > have to work in environments where IOMMU support (real or emulated) is
> > > not available.  All UIO drivers that support DMA are not secure against
> > > rogue userspace applications programming DMA hardware to access
> > > private memory; this driver is no less secure than existing code.
> > > 
> > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > 
> > I don't think copying the igb_uio interface is a good idea.
> > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > is abusing the sysfs BAR access to provide unlimited
> > access to hardware.
> > 
> > MSI messages are memory writes so any generic device capable
> > of MSI is capable of corrupting kernel memory.
> > This means that a bug in userspace will lead to kernel memory corruption
> > and crashes.  This is something distributions can't support.
> > 
> > uio_pci_generic is already abused like that, mostly
> > because when I wrote it, I didn't add enough protections
> > against using it with DMA capable devices,
> > and we can't go back and break working userspace.
> > But at least it does not bind to VFs which all of
> > them are capable of DMA.
> > 
> > The result of merging this driver will be userspace abusing the
> > sysfs BAR access with VFs as well, and we do not want that.
> > 
> > 
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
> > 
> > On a more positive note:
> > 
> > What would be a reasonable interface? One that does the following
> > in kernel:
> > 
> > 1. initializes device rings (can be in pinned userspace memory,
> >    but can not be writeable by userspace), brings up interface link
> > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > 3. gets request, make sure it's valid and belongs to
> >    the correct task, put it in the ring
> > 4. in the reverse direction, notify userspace when buffers
> >    are available in the ring
> > 5. notify userspace about MSI (what this driver does)
> > 
> > What userspace can be allowed to do:
> > 
> > 	format requests (e.g. transmit, receive) in userspace
> > 	read ring contents
> > 
> > What userspace can't be allowed to do:
> > 
> > 	access BAR
> > 	write rings
> > 
> > 
> > This means that the driver can not be a generic one,
> > and there will be a system call overhead when you
> > write the ring, but that's the price you have to
> > pay for ability to run on systems without an IOMMU.
> > 
> 
> 
> The device specific parts can be taken from John Fastabend's patches
> BTW:
> 
> https://patchwork.ozlabs.org/patch/396713/
> 
> IIUC what was missing there was exactly the memory protection
> we are looking for here.

The bifurcated drivers are interesting from an architecture
point of view, but do nothing to solve the immediate use case.
The problem is not in bare-metal environments; most of those already have an IOMMU.
The issues are in environments like VMware with SR-IOV or vmxnet3,
and neither of those is really helped by a bifurcated driver or VFIO.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01 17:26       ` Stephen Hemminger
@ 2015-10-01 18:25         ` Michael S. Tsirkin
  0 siblings, 0 replies; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-01 18:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, Oct 01, 2015 at 10:26:19AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 19:31:08 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > > This driver allows using PCI device with Message Signalled Interrupt
> > > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > > file descriptors similar to VFIO.
> > > >
> > > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > > have to work in environments where IOMMU support (real or emulated) is
> > > > not available.  All UIO drivers that support DMA are not secure against
> > > > rogue userspace applications programming DMA hardware to access
> > > > private memory; this driver is no less secure than existing code.
> > > > 
> > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > > 
> > > I don't think copying the igb_uio interface is a good idea.
> > > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > > is abusing the sysfs BAR access to provide unlimited
> > > access to hardware.
> > > 
> > > MSI messages are memory writes so any generic device capable
> > > of MSI is capable of corrupting kernel memory.
> > > This means that a bug in userspace will lead to kernel memory corruption
> > > and crashes.  This is something distributions can't support.
> > > 
> > > uio_pci_generic is already abused like that, mostly
> > > because when I wrote it, I didn't add enough protections
> > > against using it with DMA capable devices,
> > > and we can't go back and break working userspace.
> > > But at least it does not bind to VFs which all of
> > > them are capable of DMA.
> > > 
> > > The result of merging this driver will be userspace abusing the
> > > sysfs BAR access with VFs as well, and we do not want that.
> > > 
> > > 
> > > Just forwarding events is not enough to make a valid driver.
> > > What is missing is a way to access the device in a safe way.
> > > 
> > > On a more positive note:
> > > 
> > > What would be a reasonable interface? One that does the following
> > > in kernel:
> > > 
> > > 1. initializes device rings (can be in pinned userspace memory,
> > >    but can not be writeable by userspace), brings up interface link
> > > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > > 3. gets request, make sure it's valid and belongs to
> > >    the correct task, put it in the ring
> > > 4. in the reverse direction, notify userspace when buffers
> > >    are available in the ring
> > > 5. notify userspace about MSI (what this driver does)
> > > 
> > > What userspace can be allowed to do:
> > > 
> > > 	format requests (e.g. transmit, receive) in userspace
> > > 	read ring contents
> > > 
> > > What userspace can't be allowed to do:
> > > 
> > > 	access BAR
> > > 	write rings
> > > 
> > > 
> > > This means that the driver can not be a generic one,
> > > and there will be a system call overhead when you
> > > write the ring, but that's the price you have to
> > > pay for ability to run on systems without an IOMMU.
> > > 
> > 
> > 
> > The device specific parts can be taken from John Fastabend's patches
> > BTW:
> > 
> > https://patchwork.ozlabs.org/patch/396713/
> > 
> > IIUC what was missing there was exactly the memory protection
> > we are looking for here.
> 
> The bifurcated drivers are interesting from an architecture
> point of view, but do nothing to solve the immediate use case.
> The problem is not in bare-metal environments; most of those already have an IOMMU.
> The issues are in environments like VMware with SR-IOV or vmxnet3,
> and neither of those is really helped by a bifurcated driver or VFIO.

Two points I tried to make (and apparently failed, so I'm trying again,
more verbosely):
- bifurcated drivers do DMA into both kernel and userspace
  memory from the same PCI address (bus/dev/fn). As the IOMMU uses this
  source address to validate accesses, there is no way to
  have the IOMMU prevent userspace from accessing kernel memory.
  If you are prepared to use dynamic mappings for kernel
  memory, it might be possible to limit the harm that userspace
  can do, but this will slow down kernel networking
  (changing IOMMU mappings is expensive) and userspace will
  likely be able to at least disrupt kernel networking.
  So what I am discussing might still have value there.

- bifurcated drivers have code to bring up the link and map rings into
  userspace (they also map other rings into the kernel, and tweak the rx
  filter in hardware, which might not be necessary for this use case).
  What I proposed above can use that code, with
  the twist that the RX ring is made read-only for userspace, and a system
  call to safely copy from the userspace ring into it is supported.
  In other words, here's the device-specific part in
  the kernel that you wanted to use, which will only
  need some tweaks.

-- 
MST

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-01 14:57   ` Stephen Hemminger
@ 2015-10-01 19:48     ` Alexander Duyck
  2015-10-01 22:00       ` Stephen Hemminger
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2015-10-01 19:48 UTC (permalink / raw)
  To: Stephen Hemminger, Avi Kivity; +Cc: dev, hjk, gregkh, linux-kernel

On 10/01/2015 07:57 AM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 13:59:02 +0300
> Avi Kivity <avi@scylladb.com> wrote:
>
>> On 10/01/2015 01:28 AM, Stephen Hemminger wrote:
>>> This is a new UIO device driver to allow supporting MSI-X and MSI devices
>>> in userspace.  It has been used in environments like VMware and older versions
>>> of QEMU/KVM where no IOMMU support is available.
>> Why not add msi/msix support to uio_pci_generic?
> That is possible but that would meet ABI and other resistance from the author.
> Also, uio_pci_generic makes it harder to find resources since it doesn't fully
> utilize UIO infrastructure.

I'd say you are better off actually taking this in the other direction.  
From what I have seen it seems like this driver is meant to deal with 
mapping VFs contained inside of guests.  If you are going to fork off 
and create a UIO driver for mapping VFs why not just make it specialize 
in that.  You could probably simplify the code by dropping support for 
legacy interrupts and IO regions since all that is already covered by 
uio_pci_generic anyway if I am not mistaken.

You could then look at naming it something like uio_vf since the uio_msi 
is a bit of a misnomer since it is MSI-X it supports, not MSI interrupts.

- Alex

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-01 19:48     ` Alexander Duyck
@ 2015-10-01 22:00       ` Stephen Hemminger
  2015-10-01 23:03         ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-01 22:00 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: dev, Avi Kivity, hjk, gregkh, linux-kernel

On Thu, 1 Oct 2015 12:48:36 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On 10/01/2015 07:57 AM, Stephen Hemminger wrote:
> > On Thu, 1 Oct 2015 13:59:02 +0300
> > Avi Kivity <avi@scylladb.com> wrote:
> >
> >> On 10/01/2015 01:28 AM, Stephen Hemminger wrote:
> >>> This is a new UIO device driver to allow supporting MSI-X and MSI devices
> >>> in userspace.  It has been used in environments like VMware and older versions
> >>> of QEMU/KVM where no IOMMU support is available.
> >> Why not add msi/msix support to uio_pci_generic?
> > That is possible but that would meet ABI and other resistance from the author.
> > Also, uio_pci_generic makes it harder to find resources since it doesn't fully
> > utilize UIO infrastructure.
> 
> I'd say you are better off actually taking this in the other direction.  
> From what I have seen it seems like this driver is meant to deal with 
> mapping VFs contained inside of guests.  If you are going to fork off 
> and create a UIO driver for mapping VFs why not just make it specialize 
> in that.  You could probably simplify the code by dropping support for 
> legacy interrupts and IO regions since all that is already covered by 
> uio_pci_generic anyway if I am not mistaken.
> 
> You could then look at naming it something like uio_vf since the uio_msi 
> is a bit of a misnomer since it is MSI-X it supports, not MSI interrupts.

The support needs to cover:
  - VF in guest
  - VNIC in guest (vmxnet3)
it isn't just about VFs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-01 22:00       ` Stephen Hemminger
@ 2015-10-01 23:03         ` Alexander Duyck
  2015-10-01 23:39           ` Stephen Hemminger
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2015-10-01 23:03 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Avi Kivity, hjk, gregkh, linux-kernel

On 10/01/2015 03:00 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 12:48:36 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On 10/01/2015 07:57 AM, Stephen Hemminger wrote:
>>> On Thu, 1 Oct 2015 13:59:02 +0300
>>> Avi Kivity <avi@scylladb.com> wrote:
>>>
>>>> On 10/01/2015 01:28 AM, Stephen Hemminger wrote:
>>>>> This is a new UIO device driver to allow supporting MSI-X and MSI devices
>>>>> in userspace.  It has been used in environments like VMware and older versions
>>>>> of QEMU/KVM where no IOMMU support is available.
>>>> Why not add msi/msix support to uio_pci_generic?
>>> That is possible but that would meet ABI and other resistance from the author.
>>> Also, uio_pci_generic makes it harder to find resources since it doesn't fully
>>> utilize UIO infrastructure.
>> I'd say you are better off actually taking this in the other direction.
>>  From what I have seen it seems like this driver is meant to deal with
>> mapping VFs contained inside of guests.  If you are going to fork off
>> and create a UIO driver for mapping VFs why not just make it specialize
>> in that.  You could probably simplify the code by dropping support for
>> legacy interrupts and IO regions since all that is already covered by
>> uio_pci_generic anyway if I am not mistaken.
>>
>> You could then look at naming it something like uio_vf since the uio_msi
>> is a bit of a misnomer since it is MSI-X it supports, not MSI interrupts.
> The support needs to cover:
>    - VF in guest
>    - VNIC in guest (vmxnet3)
> it isn't just about VF's

I get that, but the driver you are talking about adding is duplicating 
much of what is already there in uio_pci_generic.  If nothing else it 
might be worthwhile to look at replacing the legacy interrupt with
MSI.  Maybe look at naming it something like uio_pcie to indicate that 
we are focusing on assigning PCIe and virtual devices that support MSI 
and MSI-X and use memory BARs rather than legacy PCI devices that are 
doing things like mapping I/O BARs and using INTx signaling.

My main argument is that we should probably look at dropping support for 
anything that isn't going to be needed.  If it is really important we 
can always add it later.  I just don't see the value in having code 
around for things we aren't likely to ever use with real devices as we 
are stuck supporting it for the life of the driver. I'll go ahead and 
provide an inline review of your patch 2/2 as I think my feedback might
make a bit more sense that way.

- Alex

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-01 23:03         ` Alexander Duyck
@ 2015-10-01 23:39           ` Stephen Hemminger
  2015-10-01 23:43             ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-01 23:39 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: dev, Avi Kivity, hjk, gregkh, linux-kernel

On Thu, 1 Oct 2015 16:03:06 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On 10/01/2015 03:00 PM, Stephen Hemminger wrote:
> > On Thu, 1 Oct 2015 12:48:36 -0700
> > Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >
> >> On 10/01/2015 07:57 AM, Stephen Hemminger wrote:
> >>> On Thu, 1 Oct 2015 13:59:02 +0300
> >>> Avi Kivity <avi@scylladb.com> wrote:
> >>>
> >>>> On 10/01/2015 01:28 AM, Stephen Hemminger wrote:
> >>>>> This is a new UIO device driver to allow supporting MSI-X and MSI devices
> >>>>> in userspace.  It has been used in environments like VMware and older versions
> >>>>> of QEMU/KVM where no IOMMU support is available.
> >>>> Why not add msi/msix support to uio_pci_generic?
> >>> That is possible but that would meet ABI and other resistance from the author.
> >>> Also, uio_pci_generic makes it harder to find resources since it doesn't fully
> >>> utilize UIO infrastructure.
> >> I'd say you are better off actually taking this in the other direction.
> >>  From what I have seen it seems like this driver is meant to deal with
> >> mapping VFs contained inside of guests.  If you are going to fork off
> >> and create a UIO driver for mapping VFs why not just make it specialize
> >> in that.  You could probably simplify the code by dropping support for
> >> legacy interrupts and IO regions since all that is already covered by
> >> uio_pci_generic anyway if I am not mistaken.
> >>
> >> You could then look at naming it something like uio_vf since the uio_msi
> >> is a bit of a misnomer since it is MSI-X it supports, not MSI interrupts.
> > The support needs to cover:
> >    - VF in guest
> >    - VNIC in guest (vmxnet3)
> > it isn't just about VF's
> 
> I get that, but the driver you are talking about adding is duplicating 
> much of what is already there in uio_pci_generic.  If nothing else it 
> might be worth while to look at replacing the legacy interrupt with 
> MSI.  Maybe look at naming it something like uio_pcie to indicate that 
> we are focusing on assigning PCIe and virtual devices that support MSI 
> and MSI-X and use memory BARs rather than legacy PCI devices that are 
> doing things like mapping I/O BARs and using INTx signaling.
> 
> My main argument is that we should probably look at dropping support for 
> anything that isn't going to be needed.  If it is really important we 
> can always add it later.  I just don't see the value in having code 
> around for things we aren't likely to ever use with real devices as we 
> are stuck supporting it for the life of the driver. I'll go ahead and 
> provide a inline review of your patch 2/2 as I think my feedback might 
> make a bit more sense that way.

Ok, but having one driver that can deal with MSI-X vector setup failures
and fall back seemed like a better strategy.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-09-30 22:28 ` [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X Stephen Hemminger
  2015-10-01  8:33   ` Michael S. Tsirkin
@ 2015-10-01 23:40   ` Alexander Duyck
  2015-10-02  0:01     ` Stephen Hemminger
  2015-10-02  0:04     ` Stephen Hemminger
  1 sibling, 2 replies; 44+ messages in thread
From: Alexander Duyck @ 2015-10-01 23:40 UTC (permalink / raw)
  To: Stephen Hemminger, hjk, gregkh; +Cc: dev, linux-kernel

On 09/30/2015 03:28 PM, Stephen Hemminger wrote:
> This driver allows using PCI device with Message Signalled Interrupt
> from userspace. The API is similar to the igb_uio driver used by the DPDK.
> Via ioctl it provides a mechanism to map MSI-X interrupts into event
> file descriptors similar to VFIO.
>
> VFIO is a better choice if IOMMU is available, but often userspace drivers
> have to work in environments where IOMMU support (real or emulated) is
> not available.  All UIO drivers that support DMA are not secure against
> rogue userspace applications programming DMA hardware to access
> private memory; this driver is no less secure than existing code.
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>   drivers/uio/Kconfig          |   9 ++
>   drivers/uio/Makefile         |   1 +
>   drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/Kbuild    |   1 +
>   include/uapi/linux/uio_msi.h |  22 +++
>   5 files changed, 411 insertions(+)
>   create mode 100644 drivers/uio/uio_msi.c
>   create mode 100644 include/uapi/linux/uio_msi.h
>
> diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> index 52c98ce..04adfa0 100644
> --- a/drivers/uio/Kconfig
> +++ b/drivers/uio/Kconfig
> @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
>   	  primarily, for virtualization scenarios.
>   	  If you compile this as a module, it will be called uio_pci_generic.
>
> +config UIO_PCI_MSI
> +       tristate "Generic driver supporting MSI-x on PCI Express cards"
> +	depends on PCI
> +	help
> +	  Generic driver that provides Message Signalled IRQ events
> +	  similar to VFIO. If IOMMMU is available please use VFIO
> +	  instead since it provides more security.
> +	  If you compile this as a module, it will be called uio_msi.
> +
>   config UIO_NETX
>   	tristate "Hilscher NetX Card driver"
>   	depends on PCI

Should you maybe instead depend on CONFIG_PCI_MSI?  Without MSI this is
essentially just uio_pci_generic with a bit more greedy mapping setup.

> diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> index 8560dad..62fc44b 100644
> --- a/drivers/uio/Makefile
> +++ b/drivers/uio/Makefile
> @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)	+= uio_netx.o
>   obj-$(CONFIG_UIO_PRUSS)         += uio_pruss.o
>   obj-$(CONFIG_UIO_MF624)         += uio_mf624.o
>   obj-$(CONFIG_UIO_FSL_ELBC_GPCM)	+= uio_fsl_elbc_gpcm.o
> +obj-$(CONFIG_UIO_PCI_MSI)	+= uio_msi.o
> diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
> new file mode 100644
> index 0000000..802b5c4
> --- /dev/null
> +++ b/drivers/uio/uio_msi.c
> @@ -0,0 +1,378 @@
> +/*-
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * Author: Stephen Hemminger <stephen@networkplumber.org>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 only.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +#include <linux/eventfd.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/uio_driver.h>
> +#include <linux/msi.h>
> +#include <linux/uio_msi.h>
> +
> +#define DRIVER_VERSION		"0.1.1"
> +#define MAX_MSIX_VECTORS	64
> +
> +/* MSI-X vector information */
> +struct uio_msi_pci_dev {
> +	struct uio_info info;		/* UIO driver info */
> +	struct pci_dev *pdev;		/* PCI device */
> +	struct mutex	mutex;		/* open/release/ioctl mutex */
> +	int		ref_cnt;	/* references to device */
> +	unsigned int	max_vectors;	/* MSI-X slots available */
> +	struct msix_entry *msix;	/* MSI-X vector table */
> +	struct uio_msi_irq_ctx {
> +		struct eventfd_ctx *trigger; /* vector to eventfd */
> +		char *name;		/* name in /proc/interrupts */
> +	} *ctx;
> +};
> +

I would move the definition of uio_msi_irq_ctx out of uio_msi_pci_dev. 
It would help to make this a bit more readable.
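
Something like this (same fields as your version, just hoisted out;
untested, purely to illustrate the layout):

struct uio_msi_irq_ctx {
        struct eventfd_ctx *trigger;    /* vector to eventfd */
        char *name;                     /* name in /proc/interrupts */
};

struct uio_msi_pci_dev {
        struct uio_info info;           /* UIO driver info */
        struct pci_dev *pdev;           /* PCI device */
        struct mutex    mutex;          /* open/release/ioctl mutex */
        int             ref_cnt;        /* references to device */
        unsigned int    max_vectors;    /* MSI-X slots available */
        struct msix_entry *msix;        /* MSI-X vector table */
        struct uio_msi_irq_ctx *ctx;    /* per-vector eventfd context */
};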

> +static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
> +{
> +	struct uio_msi_pci_dev *udev = arg;
> +
> +	if (pci_check_and_mask_intx(udev->pdev)) {
> +		eventfd_signal(udev->ctx->trigger, 1);
> +		return IRQ_HANDLED;
> +	}
> +
> +	return IRQ_NONE;
> +}
> +

I would really prefer to see the intx handling dropped since there are 
already 2 different UIO drivers set up for handling INTx-style
interrupts.  Let's focus on the parts from the last decade and drop
support for INTx now in favor of MSI-X and maybe MSI.  If we _REALLY_ 
need it we can always come back later and add it.

> +static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
> +{
> +	struct eventfd_ctx *trigger = arg;
> +
> +	eventfd_signal(trigger, 1);
> +	return IRQ_HANDLED;
> +}
> +
> +/* set the mapping between vector # and existing eventfd. */
> +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
> +{
> +	struct eventfd_ctx *trigger;
> +	int irq, err;
> +
> +	if (vec >= udev->max_vectors) {
> +		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
> +			   vec, udev->max_vectors);
> +		return -ERANGE;
> +	}
> +
> +	irq = udev->msix[vec].vector;
> +	trigger = udev->ctx[vec].trigger;
> +	if (trigger) {
> +		/* Clearup existing irq mapping */

Minor spelling issue here, "Clear up" should be 2 words, "Cleanup" can 
be one.

> +		free_irq(irq, trigger);
> +		eventfd_ctx_put(trigger);
> +		udev->ctx[vec].trigger = NULL;
> +	}
> +
> +	/* Passing -1 is used to disable interrupt */
> +	if (fd < 0)
> +		return 0;
> +
> +	trigger = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(trigger)) {
> +		err = PTR_ERR(trigger);
> +		dev_notice(&udev->pdev->dev,
> +			   "eventfd ctx get failed: %d\n", err);
> +		return err;
> +	}
> +
> +	if (udev->msix)
> +		err = request_irq(irq, uio_msi_irqhandler, 0,
> +				  udev->ctx[vec].name, trigger);
> +	else
> +		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
> +				  udev->ctx[vec].name, udev);
> +
> +	if (err) {
> +		dev_notice(&udev->pdev->dev,
> +			   "request irq failed: %d\n", err);
> +		eventfd_ctx_put(trigger);
> +		return err;
> +	}
> +
> +	udev->ctx[vec].trigger = trigger;
> +	return 0;
> +}
> +
> +static int
> +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	struct uio_msi_irq_set hdr;
> +	int err;
> +
> +	switch (cmd) {
> +	case UIO_MSI_IRQ_SET:
> +		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
> +			return -EFAULT;
> +
> +		mutex_lock(&udev->mutex);
> +		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
> +		mutex_unlock(&udev->mutex);
> +		break;
> +	default:
> +		err = -EOPNOTSUPP;
> +	}
> +	return err;
> +}
> +
> +/* Opening the UIO device for first time enables MSI-X */
> +static int
> +uio_msi_open(struct uio_info *info, struct inode *inode)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int err = 0;
> +
> +	mutex_lock(&udev->mutex);
> +	if (udev->ref_cnt++ == 0) {
> +		if (udev->msix)
> +			err = pci_enable_msix(udev->pdev, udev->msix,
> +					      udev->max_vectors);
> +	}
> +	mutex_unlock(&udev->mutex);
> +
> +	return err;
> +}
> +

I agree with some other reviewers.  Why call pci_enable_msix in open? 
It seems like it would make much more sense to do this on probe, and 
then disable MSI-X on free.  I can only assume you are trying to do it 
to save on resources, but the fact is this is a driver you have to
explicitly force onto a device, so you would probably be safe to assume
that whoever bound it plans to use it in the near future.
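
Roughly what I have in mind (untested; it assumes the msix[].entry numbers
are filled in before the call, as they already are later in probe):

        /* in uio_msi_probe(), once udev->msix has been allocated and filled */
        err = pci_enable_msix(pdev, udev->msix, udev->max_vectors);
        if (err) {      /* a positive return means fewer vectors were available */
                dev_err(&pdev->dev, "cannot enable MSI-X: %d\n", err);
                goto fail_free_msix;
        }

        /* and in uio_msi_remove(), before kfree(udev->msix) */
        pci_disable_msix(pdev);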

> +/* Last close of the UIO device releases/disables all IRQ's */
> +static int
> +uio_msi_release(struct uio_info *info, struct inode *inode)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int i;
> +
> +	mutex_lock(&udev->mutex);
> +	if (--udev->ref_cnt == 0) {
> +		for (i = 0; i < udev->max_vectors; i++) {
> +			int irq = udev->msix[i].vector;
> +			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
> +
> +			if (!trigger)
> +				continue;
> +
> +			free_irq(irq, trigger);
> +			eventfd_ctx_put(trigger);
> +			udev->ctx[i].trigger = NULL;
> +		}
> +
> +		if (udev->msix)
> +			pci_disable_msix(udev->pdev);
> +	}
> +	mutex_unlock(&udev->mutex);
> +
> +	return 0;
> +}
> +
> +/* Unmap previously ioremap'd resources */
> +static void
> +release_iomaps(struct uio_mem *mem)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
> +		if (mem->internal_addr)
> +			iounmap(mem->internal_addr);
> +	}
> +}
> +
> +static int
> +setup_maps(struct pci_dev *pdev, struct uio_info *info)
> +{
> +	int i, m = 0, p = 0, err;
> +	static const char * const bar_names[] = {
> +		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
> +	};
> +
> +	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
> +		unsigned long start = pci_resource_start(pdev, i);
> +		unsigned long flags = pci_resource_flags(pdev, i);
> +		unsigned long len = pci_resource_len(pdev, i);
> +
> +		if (start == 0 || len == 0)
> +			continue;
> +
> +		if (flags & IORESOURCE_MEM) {
> +			void __iomem *addr;
> +
> +			if (m >= MAX_UIO_MAPS)
> +				continue;
> +
> +			addr = ioremap(start, len);
> +			if (addr == NULL) {
> +				err = -EINVAL;
> +				goto fail;
> +			}
> +
> +			info->mem[m].name = bar_names[i];
> +			info->mem[m].addr = start;
> +			info->mem[m].internal_addr = addr;
> +			info->mem[m].size = len;
> +			info->mem[m].memtype = UIO_MEM_PHYS;
> +			++m;
> +		} else if (flags & IORESOURCE_IO) {
> +			if (p >= MAX_UIO_PORT_REGIONS)
> +				continue;
> +
> +			info->port[p].name = bar_names[i];
> +			info->port[p].start = start;
> +			info->port[p].size = len;
> +			info->port[p].porttype = UIO_PORT_X86;
> +			++p;
> +		}
> +	}
> +
> +	return 0;
> + fail:
> +	for (i = 0; i < m; i++)
> +		iounmap(info->mem[i].internal_addr);
> +	return err;
> +}
> +

Do you really need to map IORESOURCE_IO BARs?  Most drivers I can think of
don't use IO BARs anymore.  Maybe we could look at just dropping the 
code and adding it back later if we have a use case that absolutely 
needs it.

Also how many devices actually need resources beyond BAR 0?  I'm just 
curious as I know BAR 2 on many of the Intel devices is the register 
space related to MSI-X so now we have both the PCIe subsystem and user 
space with access to this region.

> +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> +{
> +	struct uio_msi_pci_dev *udev;
> +	int i, err, vectors;
> +
> +	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
> +	if (!udev)
> +		return -ENOMEM;
> +
> +	err = pci_enable_device(pdev);
> +	if (err != 0) {
> +		dev_err(&pdev->dev, "cannot enable PCI device\n");
> +		goto fail_free;
> +	}
> +
> +	err = pci_request_regions(pdev, "uio_msi");
> +	if (err != 0) {
> +		dev_err(&pdev->dev, "Cannot request regions\n");
> +		goto fail_disable;
> +	}
> +
> +	pci_set_master(pdev);
> +
> +	/* remap resources */
> +	err = setup_maps(pdev, &udev->info);
> +	if (err)
> +		goto fail_release_iomem;
> +
> +	/* fill uio infos */
> +	udev->info.name = "uio_msi";
> +	udev->info.version = DRIVER_VERSION;
> +	udev->info.priv = udev;
> +	udev->pdev = pdev;
> +	udev->info.ioctl = uio_msi_ioctl;
> +	udev->info.open = uio_msi_open;
> +	udev->info.release = uio_msi_release;
> +	udev->info.irq = UIO_IRQ_CUSTOM;
> +	mutex_init(&udev->mutex);
> +
> +	vectors = pci_msix_vec_count(pdev);
> +	if (vectors > 0) {
> +		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
> +		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
> +			udev->max_vectors);
> +
> +		err = -ENOMEM;
> +		udev->msix = kcalloc(udev->max_vectors,
> +				     sizeof(struct msix_entry), GFP_KERNEL);
> +		if (!udev->msix)
> +			goto fail_release_iomem;
> +	} else if (!pci_intx_mask_supported(pdev)) {
> +		dev_err(&pdev->dev,
> +			"device does not support MSI-X or INTX\n");
> +		err = -EINVAL;
> +		goto fail_release_iomem;
> +	} else {
> +		dev_notice(&pdev->dev, "using INTX\n");
> +		udev->info.irq_flags = IRQF_SHARED;
> +		udev->max_vectors = 1;
> +	}
> +
> +	udev->ctx = kcalloc(udev->max_vectors,
> +			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
> +	if (!udev->ctx)
> +		goto fail_free_msix;
> +
> +	for (i = 0; i < udev->max_vectors; i++) {
> +		udev->msix[i].entry = i;
> +
> +		udev->ctx[i].name = kasprintf(GFP_KERNEL,
> +					      KBUILD_MODNAME "[%d](%s)",
> +					      i, pci_name(pdev));
> +		if (!udev->ctx[i].name)
> +			goto fail_free_ctx;
> +	}
> +
> +	/* register uio driver */
> +	err = uio_register_device(&pdev->dev, &udev->info);
> +	if (err != 0)
> +		goto fail_free_ctx;
> +
> +	pci_set_drvdata(pdev, udev);
> +	return 0;
> +
> +fail_free_ctx:
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +fail_free_msix:
> +	kfree(udev->msix);
> +fail_release_iomem:
> +	release_iomaps(udev->info.mem);
> +	pci_release_regions(pdev);
> +fail_disable:
> +	pci_disable_device(pdev);
> +fail_free:
> +	kfree(udev);
> +
> +	pr_notice("%s ret %d\n", __func__, err);
> +	return err;
> +}
> +
> +static void uio_msi_remove(struct pci_dev *pdev)
> +{
> +	struct uio_info *info = pci_get_drvdata(pdev);
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int i;
> +
> +	uio_unregister_device(info);
> +	release_iomaps(info->mem);
> +
> +	pci_release_regions(pdev);
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +	kfree(udev->msix);
> +	pci_disable_device(pdev);
> +
> +	pci_set_drvdata(pdev, NULL);
> +	kfree(udev);
> +}
> +
> +static struct pci_driver uio_msi_pci_driver = {
> +	.name = "uio_msi",
> +	.probe = uio_msi_probe,
> +	.remove = uio_msi_remove,
> +};
> +
> +module_pci_driver(uio_msi_pci_driver);
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
> +MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index f7b2db4..d9497691 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -411,6 +411,7 @@ header-y += udp.h
>   header-y += uhid.h
>   header-y += uinput.h
>   header-y += uio.h
> +header-y += uio_msi.h
>   header-y += ultrasound.h
>   header-y += un.h
>   header-y += unistd.h
> diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
> new file mode 100644
> index 0000000..297de00
> --- /dev/null
> +++ b/include/uapi/linux/uio_msi.h
> @@ -0,0 +1,22 @@
> +/*
> + * UIO_MSI API definition
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +#ifndef _UIO_PCI_MSI_H
> +#define _UIO_PCI_MSI_H
> +
> +struct uio_msi_irq_set {
> +	u32 vec;
> +	int fd;
> +};
> +
> +#define UIO_MSI_BASE	0x86
> +#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
> +
> +#endif
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-01 23:39           ` Stephen Hemminger
@ 2015-10-01 23:43             ` Alexander Duyck
  2015-10-02  0:04               ` Stephen Hemminger
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2015-10-01 23:43 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Avi Kivity, hjk, gregkh, linux-kernel

On 10/01/2015 04:39 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 16:03:06 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On 10/01/2015 03:00 PM, Stephen Hemminger wrote:
>>> On Thu, 1 Oct 2015 12:48:36 -0700
>>> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>>
>>>> On 10/01/2015 07:57 AM, Stephen Hemminger wrote:
>>>>> On Thu, 1 Oct 2015 13:59:02 +0300
>>>>> Avi Kivity <avi@scylladb.com> wrote:
>>>>>
>>>>>> On 10/01/2015 01:28 AM, Stephen Hemminger wrote:
>>>>>>> This is a new UIO device driver to allow supporting MSI-X and MSI devices
>>>>>>> in userspace.  It has been used in environments like VMware and older versions
>>>>>>> of QEMU/KVM where no IOMMU support is available.
>>>>>> Why not add msi/msix support to uio_pci_generic?
>>>>> That is possible but that would meet ABI and other resistance from the author.
>>>>> Also, uio_pci_generic makes it harder to find resources since it doesn't fully
>>>>> utilize UIO infrastructure.
>>>> I'd say you are better off actually taking this in the other direction.
>>>>   From what I have seen it seems like this driver is meant to deal with
>>>> mapping VFs contained inside of guests.  If you are going to fork off
>>>> and create a UIO driver for mapping VFs why not just make it specialize
>>>> in that.  You could probably simplify the code by dropping support for
>>>> legacy interrupts and IO regions since all that is already covered by
>>>> uio_pci_generic anyway if I am not mistaken.
>>>>
>>>> You could then look at naming it something like uio_vf since the uio_msi
>>>> is a bit of a misnomer since it is MSI-X it supports, not MSI interrupts.
>>> The support needs to cover:
>>>     - VF in guest
>>>     - VNIC in guest (vmxnet3)
>>> it isn't just about VF's
>> I get that, but the driver you are talking about adding is duplicating
>> much of what is already there in uio_pci_generic.  If nothing else it
>> might be worth while to look at replacing the legacy interrupt with
>> MSI.  Maybe look at naming it something like uio_pcie to indicate that
>> we are focusing on assigning PCIe and virtual devices that support MSI
>> and MSI-X and use memory BARs rather than legacy PCI devices that are
>> doing things like mapping I/O BARs and using INTx signaling.
>>
>> My main argument is that we should probably look at dropping support for
>> anything that isn't going to be needed.  If it is really important we
>> can always add it later.  I just don't see the value in having code
>> around for things we aren't likely to ever use with real devices as we
>> are stuck supporting it for the life of the driver. I'll go ahead and
>> provide a inline review of your patch 2/2 as I think my feedback might
>> make a bit more sense that way.
> Ok, but having one driver that can deal with failures with msi-x vector
> setup and fallback seemed like a better strategy.

Yes, but in the case of something like a VF it is going to just make a 
bigger mess of things since INTx doesn't work.  So what would you expect 
your driver to do in that case?  Also we have to keep in mind that the 
MSI-X failure case is very unlikely.

One other thing that just occurred to me is that you may want to try 
using the range allocation call instead of a hard set number of 
interrupts.  Then if you start running short on vectors you don't hard 
fail and instead just allocate what you can.
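
Something along these lines (untested):

        /* take anything from one vector up to what we asked for */
        ret = pci_enable_msix_range(pdev, udev->msix, 1, udev->max_vectors);
        if (ret < 0)
                return ret;             /* not even one vector available */
        udev->max_vectors = ret;        /* remember what we actually got */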

- Alex

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01 23:40   ` Alexander Duyck
@ 2015-10-02  0:01     ` Stephen Hemminger
  2015-10-02  1:21       ` Alexander Duyck
  2015-10-02  0:04     ` Stephen Hemminger
  1 sibling, 1 reply; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-02  0:01 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, 1 Oct 2015 16:40:10 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> I agree with some other reviewers.  Why call pci_enable_msix in open? 
> It seems like it would make much more sense to do this on probe, and 
> then disable MSI-X on free.  I can only assume you are trying to do it 
> to save on resources but the fact is this is a driver you have to 
> explicitly force onto a device so you would probably be safe to assume 
> that they plan to use it in the near future.

Because if the interface is not up, the MSI handle doesn't have to be open.
This saves resources and avoids some races.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-01 23:43             ` Alexander Duyck
@ 2015-10-02  0:04               ` Stephen Hemminger
  2015-10-02  1:39                 ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-02  0:04 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: dev, Avi Kivity, hjk, gregkh, linux-kernel

On Thu, 1 Oct 2015 16:43:23 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Yes, but in the case of something like a VF it is going to just make a 
> bigger mess of things since INTx doesn't work.  So what would you expect 
> your driver to do in that case?  Also we have to keep in mind that the 
> MSI-X failure case is very unlikely.
> 
> One other thing that just occurred to me is that you may want to try 
> using the range allocation call instead of a hard set number of 
> interrupts.  Then if you start running short on vectors you don't hard 
> fail and instead just allocate what you can.

I tried that but the bookkeeping gets messy since there is no good
way to communicate that back to userspace and have it adapt.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01 23:40   ` Alexander Duyck
  2015-10-02  0:01     ` Stephen Hemminger
@ 2015-10-02  0:04     ` Stephen Hemminger
  2015-10-02  2:33       ` Alexander Duyck
  1 sibling, 1 reply; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-02  0:04 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, 1 Oct 2015 16:40:10 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Do you really need to map IORESOURCE bars?  Most drivers I can think of 
> don't use IO BARs anymore.  Maybe we could look at just dropping the 
> code and adding it back later if we have a use case that absolutely 
> needs it.
Mapping is not strictly necessary, but for virtio it acts as a way to communicate
the regions.

> Also how many devices actually need resources beyond BAR 0?  I'm just 
> curious as I know BAR 2 on many of the Intel devices is the register 
> space related to MSI-X so now we have both the PCIe subsystem and user 
> space with access to this region.

VMXNet3 needs 2 BARs. Most devices use only one.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-02  0:01     ` Stephen Hemminger
@ 2015-10-02  1:21       ` Alexander Duyck
  0 siblings, 0 replies; 44+ messages in thread
From: Alexander Duyck @ 2015-10-02  1:21 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On 10/01/2015 05:01 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 16:40:10 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> I agree with some other reviewers.  Why call pci_enable_msix in open?
>> It seems like it would make much more sense to do this on probe, and
>> then disable MSI-X on free.  I can only assume you are trying to do it
>> to save on resources but the fact is this is a driver you have to
>> explicitly force onto a device so you would probably be safe to assume
>> that they plan to use it in the near future.
> Because if interface is not up, the MSI handle doesn't have to be open.
> This saves resources and avoids some races.

Yes, but it makes things a bit messier for the interrupts.  Most drivers 
take care of interrupts during probe so that if there are any allocation 
problems they can take care of them then instead of leaving an interface 
out that will later fail when it is brought up.  It ends up being a way 
to deal with the whole MSI-X fall-back issue.

- Alex

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-02  0:04               ` Stephen Hemminger
@ 2015-10-02  1:39                 ` Alexander Duyck
  2015-10-04 16:49                   ` Vlad Zolotarov
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2015-10-02  1:39 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Avi Kivity, hjk, gregkh, linux-kernel

On 10/01/2015 05:04 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 16:43:23 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> Yes, but in the case of something like a VF it is going to just make a
>> bigger mess of things since INTx doesn't work.  So what would you expect
>> your driver to do in that case?  Also we have to keep in mind that the
>> MSI-X failure case is very unlikely.
>>
>> One other thing that just occurred to me is that you may want to try
>> using the range allocation call instead of a hard set number of
>> interrupts.  Then if you start running short on vectors you don't hard
>> fail and instead just allocate what you can.
> I tried that but the bookkeeping gets messy since there is no good
> way to communicate that back to userspace and have it adapt.

Actually I kind of just realized that uio_msi_open is kind of messed 
up.  So if the MSI-X allocation fails due to no resources it will return 
a positive value indicating the number of vectors that could be 
allocated, a negative value if one of the input values is invalid, or 
0.  I'm not sure if returning a positive value on failure is an issue or 
not.  I know the open call is supposed to return a negative value or the 
file descriptor if not negative.  I don't know if the return value might 
be interpreted as a file descriptor or not.
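
If the allocation does stay in open(), it should probably at least
normalize that return value, e.g. (untested):

        err = pci_enable_msix(udev->pdev, udev->msix, udev->max_vectors);
        if (err > 0)            /* only some of the vectors were available */
                err = -ENOSPC;  /* don't leak a positive value out of open() */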

Also if MSI-X is supported by the hardware, but disabled for some reason 
by the kernel ("pci=nomsi")  then this driver is rendered inoperable 
since it will never give you anything but -EINVAL from the open call.

I really think you should probably look at taking care of enabling MSI-X 
and maybe MSI as a fall-back in probe.  At least then you can post a 
message about how many vectors are enabled and what type. Then if you 
cannot enable any interrupts due to MSI being disabled you can simply 
fail at probe time and let them load a different driver.
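
Roughly (untested sketch, using the range call I mentioned earlier; the
MSI case would use pdev->irq rather than the msix[] vectors):

        /* in probe: MSI-X first, MSI as a fall-back, otherwise don't bind */
        ret = pci_enable_msix_range(pdev, udev->msix, 1, udev->max_vectors);
        if (ret > 0) {
                udev->max_vectors = ret;
                dev_info(&pdev->dev, "using %d MSI-X vectors\n", ret);
        } else if (!pci_enable_msi(pdev)) {
                udev->max_vectors = 1;
                dev_info(&pdev->dev, "MSI-X unavailable, falling back to MSI\n");
        } else {
                dev_err(&pdev->dev, "no MSI/MSI-X support, not binding\n");
                return -ENODEV;
        }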

- Alex

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-02  0:04     ` Stephen Hemminger
@ 2015-10-02  2:33       ` Alexander Duyck
  0 siblings, 0 replies; 44+ messages in thread
From: Alexander Duyck @ 2015-10-02  2:33 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On 10/01/2015 05:04 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 16:40:10 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> Do you really need to map IORESOURCE bars?  Most drivers I can think of
>> don't use IO BARs anymore.  Maybe we could look at just dropping the
>> code and adding it back later if we have a use case that absolutely
>> needs it.
> Mapping is not strictly necessary, but for virtio it acts a way to communicate
> the regions.

I think I see what you are saying.  I was hoping we could get away from 
having to map any I/O ports but it looks like virtio is still using them 
for BAR 0, or at least that is what I am seeing on my VM with 
virtio_net.  I was really hoping we could get away from that since a 16b 
address space is far too restrictive anyway.

>> Also how many devices actually need resources beyond BAR 0?  I'm just
>> curious as I know BAR 2 on many of the Intel devices is the register
>> space related to MSI-X so now we have both the PCIe subsystem and user
>> space with access to this region.
> VMXNet3 needs 2 bars. Most use only one.

So essentially we need to make exceptions for the virtual interfaces.

I guess there isn't much we can do then and we probably need to map any 
and all base address registers we can find for the given device.  I was 
hoping for something a bit more surgical since we are opening a security 
hole of sorts, but I guess it can't be helped if we want to support 
multiple devices and they all have such radically different configurations.

- Alex

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-02  1:39                 ` Alexander Duyck
@ 2015-10-04 16:49                   ` Vlad Zolotarov
  2015-10-04 19:03                     ` Greg KH
  0 siblings, 1 reply; 44+ messages in thread
From: Vlad Zolotarov @ 2015-10-04 16:49 UTC (permalink / raw)
  To: Alexander Duyck, Stephen Hemminger
  Cc: dev, Avi Kivity, hjk, gregkh, linux-kernel

FYI: I've just posted to the linux-kernel list patches that add support for
both MSI and MSI-X interrupt modes to the uio_pci_generic driver.
It addresses most (all) remarks on this thread and also fixes some 
issues this code has, e.g. not disabling msi-x in remove(), etc.

U are all welcome to comment... ;)

thanks,
vlad

On 10/02/15 04:39, Alexander Duyck wrote:
> On 10/01/2015 05:04 PM, Stephen Hemminger wrote:
>> On Thu, 1 Oct 2015 16:43:23 -0700
>> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>>
>>> Yes, but in the case of something like a VF it is going to just make a
>>> bigger mess of things since INTx doesn't work.  So what would you 
>>> expect
>>> your driver to do in that case?  Also we have to keep in mind that the
>>> MSI-X failure case is very unlikely.
>>>
>>> One other thing that just occurred to me is that you may want to try
>>> using the range allocation call instead of a hard set number of
>>> interrupts.  Then if you start running short on vectors you don't hard
>>> fail and instead just allocate what you can.
>> I tried that but the bookkeeping gets messy since there is no good
>> way to communicate that back to userspace and have it adapt.
>
> Actually I kind of just realized that uio_msi_open is kind of messed 
> up.  So if the MSI-X allocation fails due to no resources it will 
> return a positive value indicating the number of vectors that could be 
> allocated, a negative value if one of the input values is invalid, or 
> 0.  I'm not sure if returning a positive value on failure is an issue 
> or not.  I know the open call is supposed to return a negative value 
> or the file descriptor if not negative.  I don't know if the return 
> value might be interpreted as a file descriptor or not.
>
> Also if MSI-X is supported by the hardware, but disabled for some 
> reason by the kernel ("pci=nomsi")  then this driver is rendered 
> inoperable since it will never give you anything but -EINVAL from the 
> open call.
>
> I really think you should probably look at taking care of enabling 
> MSI-X and maybe MSI as a fall-back in probe.  At least then you can 
> post a message about how many vectors are enabled and what type. Then 
> if you cannot enable any interrupts due to MSI being disabled you can 
> simply fail at probe time and let then load a different driver.
>
> - Alex

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-04 16:49                   ` Vlad Zolotarov
@ 2015-10-04 19:03                     ` Greg KH
  2015-10-04 20:49                       ` Vlad Zolotarov
  0 siblings, 1 reply; 44+ messages in thread
From: Greg KH @ 2015-10-04 19:03 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: Avi Kivity, dev, linux-kernel, hjk

On Sun, Oct 04, 2015 at 07:49:35PM +0300, Vlad Zolotarov wrote:
> FYI: I've just posted to linux-kernel list patches that add support for both
> MSI and MSI-X interrupt modes to uio_pci_generic driver.
> It addresses most (all) remarks on this thread and also fixes some issues
> this code has, e.g. not disabling msi-x in remove(), etc.
> 
> U are all welcome to comment... ;)

Not if you don't at least cc: all of the uio maintainers :(

I'm just going to ignore the things, as obviously you don't want them
merged, quite strange...

greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 0/2] uio_msi: device driver
  2015-10-04 19:03                     ` Greg KH
@ 2015-10-04 20:49                       ` Vlad Zolotarov
  0 siblings, 0 replies; 44+ messages in thread
From: Vlad Zolotarov @ 2015-10-04 20:49 UTC (permalink / raw)
  To: Greg KH; +Cc: Avi Kivity, dev, linux-kernel, hjk



On 10/04/15 22:03, Greg KH wrote:
> On Sun, Oct 04, 2015 at 07:49:35PM +0300, Vlad Zolotarov wrote:
>> FYI: I've just posted to linux-kernel list patches that add support for both
>> MSI and MSI-X interrupt modes to uio_pci_generic driver.
>> It addresses most (all) remarks on this thread and also fixes some issues
>> this code has, e.g. not disabling msi-x in remove(), etc.
>>
>> U are all welcome to comment... ;)
> Not if you don't at least cc: all of the uio maintainers :(
>
> I'm just going to ignore the things, as obviously you don't want them
> merged, quite strange...

I actually do mean them to be merged and I did (try to) cc all the
maintainers. Unfortunately I missed the first letter when I copied your
email from the get_maintainers.pl output. I resent v3 with the correct 
email of yours. Hope u don't have (too) hard feelings about the first 
iterations of the series. Pls., believe me there was nothing personal, 
just a typo... ;)

>
> greg k-h

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-01  8:33   ` Michael S. Tsirkin
                       ` (2 preceding siblings ...)
  2015-10-01 16:31     ` Michael S. Tsirkin
@ 2015-10-05 21:54     ` Michael S. Tsirkin
  2015-10-05 22:09       ` Vladislav Zolotarov
  3 siblings, 1 reply; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-05 21:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.

Thinking about it some more, maybe some devices don't do DMA, and merely
signal events with MSI/MSI-X.

The fact you mention igb_uio in the cover letter seems to hint that this
isn't the case, and that the real intent is to abuse it for DMA-capable
devices, but still ...

If we assume such a simple device, we need to block userspace from
tweaking at least the MSI control and the MSI-X table.
And changing BARs might make someone else corrupt the MSI-X
table, so we need to block it from changing BARs, too.

Things like device reset will clear the table.  I guess this means we
need to track access to reset, too, and make sure we restore the
table to a sane config.

PM capability can be used to reset things too, I think. Better be
careful about that.

And a bunch of devices could be doing weird things that need
to be special-cased.

All of this is what VFIO is already dealing with.

Maybe extending VFIO for this usecase, or finding another way to share
code might be a better idea than duplicating the code within uio?

-- 
MST

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-05 21:54     ` Michael S. Tsirkin
@ 2015-10-05 22:09       ` Vladislav Zolotarov
  2015-10-05 22:49         ` Michael S. Tsirkin
  0 siblings, 1 reply; 44+ messages in thread
From: Vladislav Zolotarov @ 2015-10-05 22:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, hjk, gregkh, linux-kernel

On Oct 6, 2015 12:55 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
>
> Thinking about it some more, maybe some devices don't do DMA, and merely
> signal events with MSI/MSI-X.
>
> The fact you mention igb_uio in the cover letter seems to hint that this
> isn't the case, and that the real intent is to abuse it for DMA-capable
> devices, but still ...
>
> If we assume such a simple device, we need to block userspace from
> tweaking at least the MSI control and the MSI-X table.
> And changing BARs might make someone else corrupt the MSI-X
> table, so we need to block it from changing BARs, too.
>
> Things like device reset will clear the table.  I guess this means we
> need to track access to reset, too, make sure we restore the
> table to a sane config.
>
> PM  capability can be used to reset things tooI think. Better be
> careful about that.
>
> And a bunch of devices could be doing weird things that need
> to be special-cased.
>
> All of this is what VFIO is already dealing with.
>
> Maybe extending VFIO for this usecase, or finding another way to share
> code might be a better idea than duplicating the code within uio?

How about, instead of trying to reinvent the wheel, we just go and attack the
problem directly, as I've already proposed a few times in the last few
days: instead of limiting UIO, limit the users that are allowed to use
UIO to privileged users only (e.g. root). This would solve all the clearly
unresolvable issues you are raising here altogether, wouldn't it?

>
> --
> MST

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-05 22:09       ` Vladislav Zolotarov
@ 2015-10-05 22:49         ` Michael S. Tsirkin
  2015-10-06  7:33           ` Stephen Hemminger
  2015-10-06  8:23           ` Vlad Zolotarov
  0 siblings, 2 replies; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-05 22:49 UTC (permalink / raw)
  To: Vladislav Zolotarov; +Cc: dev, hjk, gregkh, linux-kernel

On Tue, Oct 06, 2015 at 01:09:55AM +0300, Vladislav Zolotarov wrote:
> How about instead of trying to invent the wheel just go and attack the problem
> directly just like i've proposed already a few times in the last days: instead
> of limiting the UIO limit the users that are allowed to use UIO to privileged
> users only (e.g. root). This would solve all clearly unresolvable issues u are
> raising here all together, wouldn't it?

No - root or no root, if the user can modify the addresses in the MSI-X
table and make the chip corrupt random memory, this is IMHO a non-starter.

And tainting kernel is not a solution - your patch adds a pile of
code that either goes completely unused or taints the kernel.
Not just that - it's a dedicated userspace API that either
goes completely unused or taints the kernel.

> >
> > --
> > MST
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-05 22:49         ` Michael S. Tsirkin
@ 2015-10-06  7:33           ` Stephen Hemminger
  2015-10-06 12:15             ` Avi Kivity
  2015-10-06 13:42             ` Michael S. Tsirkin
  2015-10-06  8:23           ` Vlad Zolotarov
  1 sibling, 2 replies; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-06  7:33 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, hjk, gregkh, linux-kernel

Other than implementation objections, so far the two main arguments
against this reduce to:
  1. If you allow UIO ioctl then it opens an API hook for all the crap
     out-of-tree UIO drivers to do what they want.
  2. If you allow UIO MSI-X then you are expanding the usage of userspace
     device access in an insecure manner.

Another alternative which I explored was making a version of VFIO that
works without an IOMMU. It solves #1 but actually increases the likely negative
response to argument #2. This would keep the same API, and avoid having to
modify UIO. But we would still have the same (if not more) resistance
from IOMMU developers who believe all systems have to be secure against
root.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-05 22:49         ` Michael S. Tsirkin
  2015-10-06  7:33           ` Stephen Hemminger
@ 2015-10-06  8:23           ` Vlad Zolotarov
  2015-10-06 13:58             ` Michael S. Tsirkin
  1 sibling, 1 reply; 44+ messages in thread
From: Vlad Zolotarov @ 2015-10-06  8:23 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, hjk, gregkh, linux-kernel



On 10/06/15 01:49, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 01:09:55AM +0300, Vladislav Zolotarov wrote:
>> How about instead of trying to invent the wheel just go and attack the problem
>> directly just like i've proposed already a few times in the last days: instead
>> of limiting the UIO limit the users that are allowed to use UIO to privileged
>> users only (e.g. root). This would solve all clearly unresolvable issues u are
>> raising here all together, wouldn't it?
> No - root or no root, if the user can modify the addresses in the MSI-X
> table and make the chip corrupt random memory, this is IMHO a non-starter.

Michael, how is this or any other related patch related to the problem u
r describing? The above ability has been there for years and if memory serves
me well it was u who wrote uio_pci_generic with this "security flaw".  ;)

This patch in general only adds the ability to receive notifications per 
MSI-X interrupt and it has nothing to do with the ability to reprogram 
the MSI-X related registers from the user space which was always there.

>
> And tainting kernel is not a solution - your patch adds a pile of
> code that either goes completely unused or taints the kernel.
> Not just that - it's a dedicated userspace API that either
> goes completely unused or taints the kernel.
>
>>> --
>>> MST

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-06  7:33           ` Stephen Hemminger
@ 2015-10-06 12:15             ` Avi Kivity
  2015-10-06 14:07               ` Michael S. Tsirkin
  2015-10-16 17:11               ` Thomas Monjalon
  2015-10-06 13:42             ` Michael S. Tsirkin
  1 sibling, 2 replies; 44+ messages in thread
From: Avi Kivity @ 2015-10-06 12:15 UTC (permalink / raw)
  To: Stephen Hemminger, Michael S. Tsirkin; +Cc: dev, hjk, gregkh, linux-kernel

On 10/06/2015 10:33 AM, Stephen Hemminger wrote:
> Other than implementation objections, so far the two main arguments
> against this reduce to:
>    1. If you allow UIO ioctl then it opens an API hook for all the crap out
>       of tree UIO drivers to do what they want.
>    2. If you allow UIO MSI-X then you are expanding the usage of userspace
>       device access in an insecure manner.
>
> Another alternative which I explored was making a version of VFIO that
> works without IOMMU. It solves #1 but actually increases the likely negative
> response to arguent #2. This would keep same API, and avoid having to
> modify UIO. But we would still have the same (if not more resistance)
> from IOMMU developers who believe all systems have to be secure against
> root.

vfio's charter was explicitly aiming for modern setups with iommus.

This could be revisited, but I agree it will have even more resistance, 
justified IMO.

btw, (2) doesn't really add any insecurity.  The user could already poke 
at the msix tables (as well as perform DMA); they just couldn't get a 
useful interrupt out of them.

Maybe a module parameter "allow_insecure_dma" can be added to 
uio_pci_generic.  Without the parameter, bus mastering and MSI-X are
disabled; with the parameter they are allowed.  This requires the sysadmin
to take a positive step in order to make use of their hardware.
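
Roughly (untested, and the parameter name is only illustrative):

static bool allow_insecure_dma;
module_param(allow_insecure_dma, bool, 0444);
MODULE_PARM_DESC(allow_insecure_dma,
                 "Enable bus mastering and MSI-X without an IOMMU (unsafe)");

        /* then, in probe(): */
        if (allow_insecure_dma) {
                pci_set_master(pdev);
                /* MSI-X setup would go here */
        } else {
                dev_info(&pdev->dev,
                         "DMA and MSI-X disabled, pass allow_insecure_dma=1 to enable\n");
        }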

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-06  7:33           ` Stephen Hemminger
  2015-10-06 12:15             ` Avi Kivity
@ 2015-10-06 13:42             ` Michael S. Tsirkin
  1 sibling, 0 replies; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-06 13:42 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, hjk, gregkh, linux-kernel

On Tue, Oct 06, 2015 at 08:33:56AM +0100, Stephen Hemminger wrote:
> Other than implementation objections, so far the two main arguments
> against this reduce to:
>   1. If you allow UIO ioctl then it opens an API hook for all the crap out
>      of tree UIO drivers to do what they want.
>   2. If you allow UIO MSI-X then you are expanding the usage of userspace
>      device access in an insecure manner.

That's not all. Without MSI one can detect insecure usage by detecting
userspace enabling bus mastering.  This can be detected simply using
lspci.  Or one can also imagine a configuration where this ability is
disabled, is logged, or taints the kernel.  This seems like something that
might be worth having for some locked-down systems.

OTOH enabling MSI requires enabling bus mastering, so suddenly we have no
idea whether the device can be/is used in a safe way.

> 
> Another alternative which I explored was making a version of VFIO that
> works without IOMMU. It solves #1 but actually increases the likely negative
> response to arguent #2.

No - because VFIO has limited protection against device misuse by
userspace, by limiting access to sub-ranges of device BARs and config
space.  For a device that doesn't do DMA, that will be enough to make it
secure to use.

That's a pretty weak excuse to support userspace drivers for PCI devices
without an IOMMU, but it's the best I heard so far.

Is that worth the security trade-off? I'm still not sure.

> This would keep same API, and avoid having to
> modify UIO. But we would still have the same (if not more resistance)
> from IOMMU developers who believe all systems have to be secure against
> root.

"Secure against root" is a confusing way to put it IMHO. We are talking
about memory protection.

So that's not IOMMU developers IIUC. I believe most kernel developers will
agree it's not a good idea to let userspace corrupt kernel memory.
Otherwise, the driver can't be supported, and maintaining upstream
drivers that can't be supported serves no useful purpose.  Anyone can
load out of tree ones just as well.

VFIO already supports MSI so VFIO developers already have a lot of
experience with these issues. Getting their input would be valuable.

-- 
MST

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-06  8:23           ` Vlad Zolotarov
@ 2015-10-06 13:58             ` Michael S. Tsirkin
  2015-10-06 14:49               ` Vlad Zolotarov
  0 siblings, 1 reply; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-06 13:58 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev, hjk, gregkh, linux-kernel

On Tue, Oct 06, 2015 at 11:23:11AM +0300, Vlad Zolotarov wrote:
> Michael, how this or any other related patch is related to the problem u r
> describing?
> The above ability is there for years and if memory serves me
> well it was u who wrote uio_pci_generic with this "security flaw".  ;)

I answered all this already.

This patch enables bus mastering, enables MSI or MSI-X, and requires
userspace to map the MSI-X table and read/write the config space.
This means that a single userspace bug is enough to corrupt kernel
memory.

uio_pci_generic does not enable bus mastering or MSI, and
it might be a good idea to have uio_pci_generic block
access to MSI/MSI-X config.
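
For example, if uio_pci_generic grew its own config-space access hook (it
doesn't have one today, so this is only a sketch), the filter for the
MSI-X capability could be as small as:

static bool hits_msix_cap(struct pci_dev *pdev, loff_t off, size_t len)
{
        int pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);

        /* does [off, off + len) overlap the 12-byte MSI-X capability? */
        return pos && off < pos + PCI_CAP_MSIX_SIZEOF && off + len > pos;
}

The MSI capability would need a similar (slightly larger) check.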
-- 
MST

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-06 12:15             ` Avi Kivity
@ 2015-10-06 14:07               ` Michael S. Tsirkin
  2015-10-06 15:41                 ` Avi Kivity
  2015-10-16 17:11               ` Thomas Monjalon
  1 sibling, 1 reply; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-06 14:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: dev, hjk, gregkh, linux-kernel

On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote:
> btw, (2) doesn't really add any insecurity.  The user could already poke at
> the msix tables (as well as perform DMA); they just couldn't get a useful
> interrupt out of them.

Poking at msix tables won't cause memory corruption unless msix and bus
mastering are enabled.  It's true root can enable msix and bus mastering
through sysfs - but that's easy to block or detect. Even if you don't
buy a security story, it seems less likely to trigger as a result
of a userspace bug.

-- 
MST

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-06 13:58             ` Michael S. Tsirkin
@ 2015-10-06 14:49               ` Vlad Zolotarov
  2015-10-06 15:00                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 44+ messages in thread
From: Vlad Zolotarov @ 2015-10-06 14:49 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, hjk, gregkh, linux-kernel



On 10/06/15 16:58, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 11:23:11AM +0300, Vlad Zolotarov wrote:
>> Michael, how this or any other related patch is related to the problem u r
>> describing?
>> The above ability is there for years and if memory serves me
>> well it was u who wrote uio_pci_generic with this "security flaw".  ;)
> I answered all this already.
>
> This patch enables bus mastering, enables MSI or MSI-X

This may be done from the user space right now without this patch...

> , and requires
> userspace to map the MSI-X table

Hmmm... I must have missed this requirement. Could u, pls., clarify? 
 From what I see, MSI/MSI-X table is configured completely in the kernel 
here...

> and read/write the config space.
> This means that a single userspace bug is enough to corrupt kernel
> memory.

Could u, pls., provide an example of this simple bug? Because it's
absolutely not obvious...

>
> uio_pci_generic does not enable bus mastering or MSI, and
> it might be a good idea to have uio_pci_generic block
> access to MSI/MSI-X config.

Since device BARs may be mapped bypassing UIO/uio_pci_generic, this
won't solve any issue.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-06 14:49               ` Vlad Zolotarov
@ 2015-10-06 15:00                 ` Michael S. Tsirkin
  2015-10-06 16:40                   ` Vlad Zolotarov
  0 siblings, 1 reply; 44+ messages in thread
From: Michael S. Tsirkin @ 2015-10-06 15:00 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: dev, hjk, gregkh, linux-kernel

On Tue, Oct 06, 2015 at 05:49:21PM +0300, Vlad Zolotarov wrote:
> >and read/write the config space.
> >This means that a single userspace bug is enough to corrupt kernel
> >memory.
> 
> Could you please provide an example of this simple bug? Because it's
> absolutely not obvious...

Stick a value that happens to match a kernel address in the Msg Addr field
of an unmasked MSI-X entry.
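
To make the failure mode concrete: with the MSI-X table BAR mapped into the
process and MSI-X plus bus mastering enabled, a stray write like the following
is enough. The offsets follow the 16-byte MSI-X entry layout; the address and
data values are made up.

#include <stdint.h>

/* Each MSI-X entry: Msg Addr Lo, Msg Addr Hi, Msg Data, Vector Control. */
static void buggy_write(volatile uint32_t *msix_table)
{
	msix_table[0] = 0x12345000;   /* Msg Addr Lo: happens to be a kernel page */
	msix_table[1] = 0x00000000;   /* Msg Addr Hi */
	msix_table[2] = 0xdeadbeef;   /* Msg Data written by the device */
	msix_table[3] = 0x00000000;   /* Vector Control: 0 = unmasked */
	/* The next interrupt is now a 32-bit DMA write of 0xdeadbeef to 0x12345000. */
}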

-- 
MST

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-06 14:07               ` Michael S. Tsirkin
@ 2015-10-06 15:41                 ` Avi Kivity
  0 siblings, 0 replies; 44+ messages in thread
From: Avi Kivity @ 2015-10-06 15:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, hjk, gregkh, linux-kernel



On 10/06/2015 05:07 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote:
>> btw, (2) doesn't really add any insecurity.  The user could already poke at
>> the msix tables (as well as perform DMA); they just couldn't get a useful
>> interrupt out of them.
> Poking at msix tables won't cause memory corruption unless msix and bus
> mastering are enabled.

It's a given that bus mastering is enabled.  It's true that msix is 
unlikely to be enabled, unless msix support is added.

>    It's true root can enable msix and bus mastering
> through sysfs - but that's easy to block or detect. Even if you don't
> buy a security story, it seems less likely to trigger as a result
> of a userspace bug.

If you're doing DMA, that's the least of your worries.

Still, zero-mapping the msix space seems reasonable, and can protect 
userspace from silly stuff.  It can't be considered to have anything to 
do with security though, as long as users can simply DMA to every bit of 
RAM in the system they want to.
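
A sketch of what excluding (or zero-mapping) the table could start from:
locate the MSI-X table within its BAR using the standard capability
registers, then leave those pages out of whatever the driver hands to mmap().
Only the lookup is shown; the mmap plumbing is omitted and the function name
is illustrative.

#include <linux/pci.h>

/* Find the BAR, offset and length of the MSI-X table. */
static int msix_table_region(struct pci_dev *pdev, int *bar,
			     resource_size_t *off, resource_size_t *len)
{
	u16 control;
	u32 table;

	if (!pdev->msix_cap)
		return -ENODEV;

	pci_read_config_word(pdev, pdev->msix_cap + PCI_MSIX_FLAGS, &control);
	pci_read_config_dword(pdev, pdev->msix_cap + PCI_MSIX_TABLE, &table);

	*bar = table & PCI_MSIX_TABLE_BIR;
	*off = table & PCI_MSIX_TABLE_OFFSET;
	*len = ((control & PCI_MSIX_FLAGS_QSIZE) + 1) * 16; /* 16 bytes per entry */
	return 0;
}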

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-06 15:00                 ` Michael S. Tsirkin
@ 2015-10-06 16:40                   ` Vlad Zolotarov
  0 siblings, 0 replies; 44+ messages in thread
From: Vlad Zolotarov @ 2015-10-06 16:40 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, hjk, gregkh, linux-kernel



On 10/06/15 18:00, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 05:49:21PM +0300, Vlad Zolotarov wrote:
>>> and read/write the config space.
>>> This means that a single userspace bug is enough to corrupt kernel
>>> memory.
>> Could you please provide an example of this simple bug? Because it's
>> absolutely not obvious...
> Stick a value that happens to match a kernel address in the Msg Addr field
> of an unmasked MSI-X entry.

This patch neither configures MSI-X entries in user space nor 
provides additional means to do so; therefore this "sticking" would require 
extra code that is completely unrelated to this patch. 
So, this example seems irrelevant to this particular discussion.

thanks,
vlad

>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-06 12:15             ` Avi Kivity
  2015-10-06 14:07               ` Michael S. Tsirkin
@ 2015-10-16 17:11               ` Thomas Monjalon
  2015-10-16 17:20                 ` Stephen Hemminger
  1 sibling, 1 reply; 44+ messages in thread
From: Thomas Monjalon @ 2015-10-16 17:11 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: Avi Kivity, dev, hjk, gregkh

To sum it up,
we want to remove the need for the out-of-tree module igb_uio.
Three possible implementations have been discussed so far:
- new UIO driver
- extend uio_pci_generic
- VFIO without IOMMU

We would prefer to avoid creating yet another module to support;
that's why extending uio_pci_generic would be nice.
In my understanding, there are currently two issues with the patches
from Vlad and Stephen:
- IRQ must be mapped to a fd without using a new ioctl
- MSI-X handling in userspace breaks the memory protection

I'm confident the first issue can be fixed with something like sysfs.
As for the "security" concern, raised mainly by MST, I think Avi's idea
(below) deserves to be discussed.

2015-10-06 15:15, Avi Kivity:
> On 10/06/2015 10:33 AM, Stephen Hemminger wrote:
> > Other than implementation objections, so far the two main arguments
> > against this reduce to:
> >    1. If you allow UIO ioctl then it opens an API hook for all the crap out
> >       of tree UIO drivers to do what they want.
> >    2. If you allow UIO MSI-X then you are expanding the usage of userspace
> >       device access in an insecure manner.
[...]
> btw, (2) doesn't really add any insecurity.  The user could already poke 
> at the msix tables (as well as perform DMA); they just couldn't get a 
> useful interrupt out of them.
> 
> Maybe a module parameter "allow_insecure_dma" can be added to 
> uio_pci_generic.  Without the parameter, bus mastering and msix are 
> disabled; with the parameter they are allowed.  This requires the sysadmin 
> to take a positive step in order to make use of their hardware.

Giving control of the memory protection level to the distribution or
the administrator looks like a good idea.
When insecure DMA is allowed, a log message will make clear whether it is
supported by the system provider or not.
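
A minimal sketch of the opt-in Avi describes above, assuming a boolean module
parameter checked in the probe path; the parameter name comes from his mail,
everything else is illustrative:

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/pci.h>

static bool allow_insecure_dma;
module_param(allow_insecure_dma, bool, 0444);
MODULE_PARM_DESC(allow_insecure_dma,
		 "Allow bus mastering and MSI-X without an IOMMU (insecure)");

/* In the probe path, something along these lines: */
static void maybe_enable_dma(struct pci_dev *pdev)
{
	if (!allow_insecure_dma)
		return;		/* default: no bus mastering, no MSI-X */

	dev_warn(&pdev->dev, "insecure DMA enabled by administrator\n");
	pci_set_master(pdev);
	/* ... MSI-X setup would follow here ... */
}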

From another thread:
2015-10-01 14:09, Michael S. Tsirkin:
> If Linux keeps enabling hacks, no one will bother doing the right thing.
> Upstream inclusion is the only carrot Linux has to make people do the
> right thing.

The "right thing" should be guided by the users needs at a given time.
The "carrot" for a better solution will be to have a well protected system.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X
  2015-10-16 17:11               ` Thomas Monjalon
@ 2015-10-16 17:20                 ` Stephen Hemminger
  0 siblings, 0 replies; 44+ messages in thread
From: Stephen Hemminger @ 2015-10-16 17:20 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Avi Kivity, Michael S. Tsirkin, dev, linux-kernel, hjk, gregkh

On Fri, 16 Oct 2015 19:11:35 +0200
Thomas Monjalon <thomas.monjalon@6wind.com> wrote:

> To sum it up,
> We want to remove the need of the out-of-tree module igb_uio.
> 3 possible implementations were discussed so far:
> - new UIO driver
> - extend uio_pci_generic
> - VFIO without IOMMU

There has been recent progress on VFIO without an IOMMU. This looks like the
most promising long-term solution.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2015-10-16 17:20 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-30 22:28 [dpdk-dev] [PATCH 0/2] uio_msi: device driver Stephen Hemminger
2015-09-30 22:28 ` [dpdk-dev] [PATCH 1/2] uio: add support for ioctls Stephen Hemminger
2015-09-30 22:28 ` [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X Stephen Hemminger
2015-10-01  8:33   ` Michael S. Tsirkin
2015-10-01 10:37     ` Michael S. Tsirkin
2015-10-01 16:06       ` Michael S. Tsirkin
2015-10-01 14:50     ` Stephen Hemminger
2015-10-01 15:22       ` Michael S. Tsirkin
2015-10-01 16:31     ` Michael S. Tsirkin
2015-10-01 17:26       ` Stephen Hemminger
2015-10-01 18:25         ` Michael S. Tsirkin
2015-10-05 21:54     ` Michael S. Tsirkin
2015-10-05 22:09       ` Vladislav Zolotarov
2015-10-05 22:49         ` Michael S. Tsirkin
2015-10-06  7:33           ` Stephen Hemminger
2015-10-06 12:15             ` Avi Kivity
2015-10-06 14:07               ` Michael S. Tsirkin
2015-10-06 15:41                 ` Avi Kivity
2015-10-16 17:11               ` Thomas Monjalon
2015-10-16 17:20                 ` Stephen Hemminger
2015-10-06 13:42             ` Michael S. Tsirkin
2015-10-06  8:23           ` Vlad Zolotarov
2015-10-06 13:58             ` Michael S. Tsirkin
2015-10-06 14:49               ` Vlad Zolotarov
2015-10-06 15:00                 ` Michael S. Tsirkin
2015-10-06 16:40                   ` Vlad Zolotarov
2015-10-01 23:40   ` Alexander Duyck
2015-10-02  0:01     ` Stephen Hemminger
2015-10-02  1:21       ` Alexander Duyck
2015-10-02  0:04     ` Stephen Hemminger
2015-10-02  2:33       ` Alexander Duyck
2015-10-01  8:36 ` [dpdk-dev] [PATCH 0/2] uio_msi: device driver Michael S. Tsirkin
2015-10-01 10:59 ` Avi Kivity
2015-10-01 14:57   ` Stephen Hemminger
2015-10-01 19:48     ` Alexander Duyck
2015-10-01 22:00       ` Stephen Hemminger
2015-10-01 23:03         ` Alexander Duyck
2015-10-01 23:39           ` Stephen Hemminger
2015-10-01 23:43             ` Alexander Duyck
2015-10-02  0:04               ` Stephen Hemminger
2015-10-02  1:39                 ` Alexander Duyck
2015-10-04 16:49                   ` Vlad Zolotarov
2015-10-04 19:03                     ` Greg KH
2015-10-04 20:49                       ` Vlad Zolotarov
