From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by dpdk.org (Postfix) with ESMTP id AFDDE8D9F for ; Thu, 1 Oct 2015 18:06:39 +0200 (CEST) Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) by mx1.redhat.com (Postfix) with ESMTPS id 0EB99319B6E; Thu, 1 Oct 2015 16:06:39 +0000 (UTC) Received: from redhat.com (ovpn-116-83.ams2.redhat.com [10.36.116.83]) by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with SMTP id t91G6Y8E029726; Thu, 1 Oct 2015 12:06:35 -0400 Date: Thu, 1 Oct 2015 19:06:33 +0300 From: "Michael S. Tsirkin" To: Stephen Hemminger Message-ID: <20151001190458-mutt-send-email-mst@redhat.com> References: <1443652138-31782-1-git-send-email-stephen@networkplumber.org> <1443652138-31782-3-git-send-email-stephen@networkplumber.org> <20151001104505-mutt-send-email-mst@redhat.com> <20151001133114-mutt-send-email-mst@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151001133114-mutt-send-email-mst@redhat.com> X-Scanned-By: MIMEDefang 2.68 on 10.5.11.22 Cc: dev@dpdk.org, hjk@hansjkoch.de, gregkh@linux-foundation.org, linux-kernel@vger.kernel.org Subject: Re: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Oct 2015 16:06:40 -0000 On Thu, Oct 01, 2015 at 01:37:12PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote: > > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote: > > > This driver allows using PCI device with Message Signalled Interrupt > > > from userspace. The API is similar to the igb_uio driver used by the DPDK. > > > Via ioctl it provides a mechanism to map MSI-X interrupts into event > > > file descriptors similar to VFIO. > > > > > > VFIO is a better choice if IOMMU is available, but often userspace drivers > > > have to work in environments where IOMMU support (real or emulated) is > > > not available. All UIO drivers that support DMA are not secure against > > > rogue userspace applications programming DMA hardware to access > > > private memory; this driver is no less secure than existing code. > > > > > > Signed-off-by: Stephen Hemminger > > > > I don't think copying the igb_uio interface is a good idea. > > What DPDK is doing with igb_uio (and indeed uio_pci_generic) > > is abusing the sysfs BAR access to provide unlimited > > access to hardware. > > > > MSI messages are memory writes so any generic device capable > > of MSI is capable of corrupting kernel memory. > > This means that a bug in userspace will lead to kernel memory corruption > > and crashes. This is something distributions can't support. > > > > uio_pci_generic is already abused like that, mostly > > because when I wrote it, I didn't add enough protections > > against using it with DMA capable devices, > > and we can't go back and break working userspace. > > But at least it does not bind to VFs which all of > > them are capable of DMA. > > > > The result of merging this driver will be userspace abusing the > > sysfs BAR access with VFs as well, and we do not want that. > > > > > > Just forwarding events is not enough to make a valid driver. > > What is missing is a way to access the device in a safe way. > > > > On a more positive note: > > > > What would be a reasonable interface? One that does the following > > in kernel: > > > > 1. initializes device rings (can be in pinned userspace memory, > > but can not be writeable by userspace), brings up interface link > > 2. pins userspace memory (unless using e.g. hugetlbfs) > > 3. gets request, make sure it's valid and belongs to > > the correct task, put it in the ring > > 4. in the reverse direction, notify userspace when buffers > > are available in the ring > > 5. notify userspace about MSI (what this driver does) > > > > What userspace can be allowed to do: > > > > format requests (e.g. transmit, receive) in userspace > > read ring contents > > > > What userspace can't be allowed to do: > > > > access BAR > > write rings > > Thinking about it some more, many devices > > > Have separate rings for DMA: TX (device reads memory) > and RX (device writes memory). > With such devices, a mode where userspace can write TX ring > but not RX ring might make sense. > > This will mean userspace might read kernel memory > through the device, but can not corrupt it. > > That's already a big win! > > And RX buffers do not have to be added one at a time. > If we assume 0.2usec per system call, batching some 100 buffers per > system call gives you 2 nano seconds overhead. That seems quite > reasonable. > To add to that, there's no reason to do this on the same core. Re-arming descriptors can happen in parallel with packet processing, so this overhead won't affect PPS or latency at all: only the CPU utilization. > > > > > > > This means that the driver can not be a generic one, > > and there will be a system call overhead when you > > write the ring, but that's the price you have to > > pay for ability to run on systems without an IOMMU. > > > > > > > > > > > --- > > > drivers/uio/Kconfig | 9 ++ > > > drivers/uio/Makefile | 1 + > > > drivers/uio/uio_msi.c | 378 +++++++++++++++++++++++++++++++++++++++++++ > > > include/uapi/linux/Kbuild | 1 + > > > include/uapi/linux/uio_msi.h | 22 +++ > > > 5 files changed, 411 insertions(+) > > > create mode 100644 drivers/uio/uio_msi.c > > > create mode 100644 include/uapi/linux/uio_msi.h > > > > > > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig > > > index 52c98ce..04adfa0 100644 > > > --- a/drivers/uio/Kconfig > > > +++ b/drivers/uio/Kconfig > > > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC > > > primarily, for virtualization scenarios. > > > If you compile this as a module, it will be called uio_pci_generic. > > > > > > +config UIO_PCI_MSI > > > + tristate "Generic driver supporting MSI-x on PCI Express cards" > > > + depends on PCI > > > + help > > > + Generic driver that provides Message Signalled IRQ events > > > + similar to VFIO. If IOMMMU is available please use VFIO > > > + instead since it provides more security. > > > + If you compile this as a module, it will be called uio_msi. > > > + > > > config UIO_NETX > > > tristate "Hilscher NetX Card driver" > > > depends on PCI > > > diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile > > > index 8560dad..62fc44b 100644 > > > --- a/drivers/uio/Makefile > > > +++ b/drivers/uio/Makefile > > > @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX) += uio_netx.o > > > obj-$(CONFIG_UIO_PRUSS) += uio_pruss.o > > > obj-$(CONFIG_UIO_MF624) += uio_mf624.o > > > obj-$(CONFIG_UIO_FSL_ELBC_GPCM) += uio_fsl_elbc_gpcm.o > > > +obj-$(CONFIG_UIO_PCI_MSI) += uio_msi.o > > > diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c > > > new file mode 100644 > > > index 0000000..802b5c4 > > > --- /dev/null > > > +++ b/drivers/uio/uio_msi.c > > > @@ -0,0 +1,378 @@ > > > +/*- > > > + * > > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc. > > > + * Author: Stephen Hemminger > > > + * > > > + * This work is licensed under the terms of the GNU GPL, version 2 only. > > > + */ > > > + > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > > + > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > + > > > +#define DRIVER_VERSION "0.1.1" > > > +#define MAX_MSIX_VECTORS 64 > > > + > > > +/* MSI-X vector information */ > > > +struct uio_msi_pci_dev { > > > + struct uio_info info; /* UIO driver info */ > > > + struct pci_dev *pdev; /* PCI device */ > > > + struct mutex mutex; /* open/release/ioctl mutex */ > > > + int ref_cnt; /* references to device */ > > > + unsigned int max_vectors; /* MSI-X slots available */ > > > + struct msix_entry *msix; /* MSI-X vector table */ > > > + struct uio_msi_irq_ctx { > > > + struct eventfd_ctx *trigger; /* vector to eventfd */ > > > + char *name; /* name in /proc/interrupts */ > > > + } *ctx; > > > +}; > > > + > > > +static irqreturn_t uio_intx_irqhandler(int irq, void *arg) > > > +{ > > > + struct uio_msi_pci_dev *udev = arg; > > > + > > > + if (pci_check_and_mask_intx(udev->pdev)) { > > > + eventfd_signal(udev->ctx->trigger, 1); > > > + return IRQ_HANDLED; > > > + } > > > + > > > + return IRQ_NONE; > > > +} > > > + > > > +static irqreturn_t uio_msi_irqhandler(int irq, void *arg) > > > +{ > > > + struct eventfd_ctx *trigger = arg; > > > + > > > + eventfd_signal(trigger, 1); > > > + return IRQ_HANDLED; > > > +} > > > + > > > +/* set the mapping between vector # and existing eventfd. */ > > > +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd) > > > +{ > > > + struct eventfd_ctx *trigger; > > > + int irq, err; > > > + > > > + if (vec >= udev->max_vectors) { > > > + dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n", > > > + vec, udev->max_vectors); > > > + return -ERANGE; > > > + } > > > + > > > + irq = udev->msix[vec].vector; > > > + trigger = udev->ctx[vec].trigger; > > > + if (trigger) { > > > + /* Clearup existing irq mapping */ > > > + free_irq(irq, trigger); > > > + eventfd_ctx_put(trigger); > > > + udev->ctx[vec].trigger = NULL; > > > + } > > > + > > > + /* Passing -1 is used to disable interrupt */ > > > + if (fd < 0) > > > + return 0; > > > + > > > + trigger = eventfd_ctx_fdget(fd); > > > + if (IS_ERR(trigger)) { > > > + err = PTR_ERR(trigger); > > > + dev_notice(&udev->pdev->dev, > > > + "eventfd ctx get failed: %d\n", err); > > > + return err; > > > + } > > > + > > > + if (udev->msix) > > > + err = request_irq(irq, uio_msi_irqhandler, 0, > > > + udev->ctx[vec].name, trigger); > > > + else > > > + err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED, > > > + udev->ctx[vec].name, udev); > > > + > > > + if (err) { > > > + dev_notice(&udev->pdev->dev, > > > + "request irq failed: %d\n", err); > > > + eventfd_ctx_put(trigger); > > > + return err; > > > + } > > > + > > > + udev->ctx[vec].trigger = trigger; > > > + return 0; > > > +} > > > + > > > +static int > > > +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg) > > > +{ > > > + struct uio_msi_pci_dev *udev > > > + = container_of(info, struct uio_msi_pci_dev, info); > > > + struct uio_msi_irq_set hdr; > > > + int err; > > > + > > > + switch (cmd) { > > > + case UIO_MSI_IRQ_SET: > > > + if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr))) > > > + return -EFAULT; > > > + > > > + mutex_lock(&udev->mutex); > > > + err = set_irq_eventfd(udev, hdr.vec, hdr.fd); > > > + mutex_unlock(&udev->mutex); > > > + break; > > > + default: > > > + err = -EOPNOTSUPP; > > > + } > > > + return err; > > > +} > > > + > > > +/* Opening the UIO device for first time enables MSI-X */ > > > +static int > > > +uio_msi_open(struct uio_info *info, struct inode *inode) > > > +{ > > > + struct uio_msi_pci_dev *udev > > > + = container_of(info, struct uio_msi_pci_dev, info); > > > + int err = 0; > > > + > > > + mutex_lock(&udev->mutex); > > > + if (udev->ref_cnt++ == 0) { > > > + if (udev->msix) > > > + err = pci_enable_msix(udev->pdev, udev->msix, > > > + udev->max_vectors); > > > + } > > > + mutex_unlock(&udev->mutex); > > > + > > > + return err; > > > +} > > > + > > > +/* Last close of the UIO device releases/disables all IRQ's */ > > > +static int > > > +uio_msi_release(struct uio_info *info, struct inode *inode) > > > +{ > > > + struct uio_msi_pci_dev *udev > > > + = container_of(info, struct uio_msi_pci_dev, info); > > > + int i; > > > + > > > + mutex_lock(&udev->mutex); > > > + if (--udev->ref_cnt == 0) { > > > + for (i = 0; i < udev->max_vectors; i++) { > > > + int irq = udev->msix[i].vector; > > > + struct eventfd_ctx *trigger = udev->ctx[i].trigger; > > > + > > > + if (!trigger) > > > + continue; > > > + > > > + free_irq(irq, trigger); > > > + eventfd_ctx_put(trigger); > > > + udev->ctx[i].trigger = NULL; > > > + } > > > + > > > + if (udev->msix) > > > + pci_disable_msix(udev->pdev); > > > + } > > > + mutex_unlock(&udev->mutex); > > > + > > > + return 0; > > > +} > > > + > > > +/* Unmap previously ioremap'd resources */ > > > +static void > > > +release_iomaps(struct uio_mem *mem) > > > +{ > > > + int i; > > > + > > > + for (i = 0; i < MAX_UIO_MAPS; i++, mem++) { > > > + if (mem->internal_addr) > > > + iounmap(mem->internal_addr); > > > + } > > > +} > > > + > > > +static int > > > +setup_maps(struct pci_dev *pdev, struct uio_info *info) > > > +{ > > > + int i, m = 0, p = 0, err; > > > + static const char * const bar_names[] = { > > > + "BAR0", "BAR1", "BAR2", "BAR3", "BAR4", "BAR5", > > > + }; > > > + > > > + for (i = 0; i < ARRAY_SIZE(bar_names); i++) { > > > + unsigned long start = pci_resource_start(pdev, i); > > > + unsigned long flags = pci_resource_flags(pdev, i); > > > + unsigned long len = pci_resource_len(pdev, i); > > > + > > > + if (start == 0 || len == 0) > > > + continue; > > > + > > > + if (flags & IORESOURCE_MEM) { > > > + void __iomem *addr; > > > + > > > + if (m >= MAX_UIO_MAPS) > > > + continue; > > > + > > > + addr = ioremap(start, len); > > > + if (addr == NULL) { > > > + err = -EINVAL; > > > + goto fail; > > > + } > > > + > > > + info->mem[m].name = bar_names[i]; > > > + info->mem[m].addr = start; > > > + info->mem[m].internal_addr = addr; > > > + info->mem[m].size = len; > > > + info->mem[m].memtype = UIO_MEM_PHYS; > > > + ++m; > > > + } else if (flags & IORESOURCE_IO) { > > > + if (p >= MAX_UIO_PORT_REGIONS) > > > + continue; > > > + > > > + info->port[p].name = bar_names[i]; > > > + info->port[p].start = start; > > > + info->port[p].size = len; > > > + info->port[p].porttype = UIO_PORT_X86; > > > + ++p; > > > + } > > > + } > > > + > > > + return 0; > > > + fail: > > > + for (i = 0; i < m; i++) > > > + iounmap(info->mem[i].internal_addr); > > > + return err; > > > +} > > > + > > > +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id) > > > +{ > > > + struct uio_msi_pci_dev *udev; > > > + int i, err, vectors; > > > + > > > + udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL); > > > + if (!udev) > > > + return -ENOMEM; > > > + > > > + err = pci_enable_device(pdev); > > > + if (err != 0) { > > > + dev_err(&pdev->dev, "cannot enable PCI device\n"); > > > + goto fail_free; > > > + } > > > + > > > + err = pci_request_regions(pdev, "uio_msi"); > > > + if (err != 0) { > > > + dev_err(&pdev->dev, "Cannot request regions\n"); > > > + goto fail_disable; > > > + } > > > + > > > + pci_set_master(pdev); > > > + > > > + /* remap resources */ > > > + err = setup_maps(pdev, &udev->info); > > > + if (err) > > > + goto fail_release_iomem; > > > + > > > + /* fill uio infos */ > > > + udev->info.name = "uio_msi"; > > > + udev->info.version = DRIVER_VERSION; > > > + udev->info.priv = udev; > > > + udev->pdev = pdev; > > > + udev->info.ioctl = uio_msi_ioctl; > > > + udev->info.open = uio_msi_open; > > > + udev->info.release = uio_msi_release; > > > + udev->info.irq = UIO_IRQ_CUSTOM; > > > + mutex_init(&udev->mutex); > > > + > > > + vectors = pci_msix_vec_count(pdev); > > > + if (vectors > 0) { > > > + udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS); > > > + dev_info(&pdev->dev, "using up to %u MSI-X vectors\n", > > > + udev->max_vectors); > > > + > > > + err = -ENOMEM; > > > + udev->msix = kcalloc(udev->max_vectors, > > > + sizeof(struct msix_entry), GFP_KERNEL); > > > + if (!udev->msix) > > > + goto fail_release_iomem; > > > + } else if (!pci_intx_mask_supported(pdev)) { > > > + dev_err(&pdev->dev, > > > + "device does not support MSI-X or INTX\n"); > > > + err = -EINVAL; > > > + goto fail_release_iomem; > > > + } else { > > > + dev_notice(&pdev->dev, "using INTX\n"); > > > + udev->info.irq_flags = IRQF_SHARED; > > > + udev->max_vectors = 1; > > > + } > > > + > > > + udev->ctx = kcalloc(udev->max_vectors, > > > + sizeof(struct uio_msi_irq_ctx), GFP_KERNEL); > > > + if (!udev->ctx) > > > + goto fail_free_msix; > > > + > > > + for (i = 0; i < udev->max_vectors; i++) { > > > + udev->msix[i].entry = i; > > > + > > > + udev->ctx[i].name = kasprintf(GFP_KERNEL, > > > + KBUILD_MODNAME "[%d](%s)", > > > + i, pci_name(pdev)); > > > + if (!udev->ctx[i].name) > > > + goto fail_free_ctx; > > > + } > > > + > > > + /* register uio driver */ > > > + err = uio_register_device(&pdev->dev, &udev->info); > > > + if (err != 0) > > > + goto fail_free_ctx; > > > + > > > + pci_set_drvdata(pdev, udev); > > > + return 0; > > > + > > > +fail_free_ctx: > > > + for (i = 0; i < udev->max_vectors; i++) > > > + kfree(udev->ctx[i].name); > > > + kfree(udev->ctx); > > > +fail_free_msix: > > > + kfree(udev->msix); > > > +fail_release_iomem: > > > + release_iomaps(udev->info.mem); > > > + pci_release_regions(pdev); > > > +fail_disable: > > > + pci_disable_device(pdev); > > > +fail_free: > > > + kfree(udev); > > > + > > > + pr_notice("%s ret %d\n", __func__, err); > > > + return err; > > > +} > > > + > > > +static void uio_msi_remove(struct pci_dev *pdev) > > > +{ > > > + struct uio_info *info = pci_get_drvdata(pdev); > > > + struct uio_msi_pci_dev *udev > > > + = container_of(info, struct uio_msi_pci_dev, info); > > > + int i; > > > + > > > + uio_unregister_device(info); > > > + release_iomaps(info->mem); > > > + > > > + pci_release_regions(pdev); > > > + for (i = 0; i < udev->max_vectors; i++) > > > + kfree(udev->ctx[i].name); > > > + kfree(udev->ctx); > > > + kfree(udev->msix); > > > + pci_disable_device(pdev); > > > + > > > + pci_set_drvdata(pdev, NULL); > > > + kfree(udev); > > > +} > > > + > > > +static struct pci_driver uio_msi_pci_driver = { > > > + .name = "uio_msi", > > > + .probe = uio_msi_probe, > > > + .remove = uio_msi_remove, > > > +}; > > > + > > > +module_pci_driver(uio_msi_pci_driver); > > > +MODULE_VERSION(DRIVER_VERSION); > > > +MODULE_LICENSE("GPL v2"); > > > +MODULE_AUTHOR("Stephen Hemminger "); > > > +MODULE_DESCRIPTION("UIO driver for MSI PCI devices"); > > > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild > > > index f7b2db4..d9497691 100644 > > > --- a/include/uapi/linux/Kbuild > > > +++ b/include/uapi/linux/Kbuild > > > @@ -411,6 +411,7 @@ header-y += udp.h > > > header-y += uhid.h > > > header-y += uinput.h > > > header-y += uio.h > > > +header-y += uio_msi.h > > > header-y += ultrasound.h > > > header-y += un.h > > > header-y += unistd.h > > > diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h > > > new file mode 100644 > > > index 0000000..297de00 > > > --- /dev/null > > > +++ b/include/uapi/linux/uio_msi.h > > > @@ -0,0 +1,22 @@ > > > +/* > > > + * UIO_MSI API definition > > > + * > > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc. > > > + * All rights reserved. > > > + * > > > + * This program is free software; you can redistribute it and/or modify > > > + * it under the terms of the GNU General Public License version 2 as > > > + * published by the Free Software Foundation. > > > + */ > > > +#ifndef _UIO_PCI_MSI_H > > > +#define _UIO_PCI_MSI_H > > > + > > > +struct uio_msi_irq_set { > > > + u32 vec; > > > + int fd; > > > +}; > > > + > > > +#define UIO_MSI_BASE 0x86 > > > +#define UIO_MSI_IRQ_SET _IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set) > > > + > > > +#endif > > > -- > > > 2.1.4 > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > Please read the FAQ at http://www.tux.org/lkml/