From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from dpdk.org (dpdk.org [92.243.14.124]) by dpdk.space (Postfix) with ESMTP id C09A8A0679 for ; Thu, 4 Apr 2019 06:19:59 +0200 (CEST) Received: from [92.243.14.124] (localhost [127.0.0.1]) by dpdk.org (Postfix) with ESMTP id B75E64C8E; Thu, 4 Apr 2019 06:19:57 +0200 (CEST) Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by dpdk.org (Postfix) with ESMTP id C845234F3 for ; Thu, 4 Apr 2019 06:19:54 +0200 (CEST) X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by fmsmga107.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 03 Apr 2019 21:19:52 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.60,306,1549958400"; d="scan'208";a="137451809" Received: from dpdk-tbie.sh.intel.com ([10.67.104.173]) by fmsmga008.fm.intel.com with ESMTP; 03 Apr 2019 21:19:51 -0700 Date: Thu, 4 Apr 2019 12:19:25 +0800 From: Tiwei Bie To: "Wiles, Keith" Cc: dpdk-dev , "Liang, Cunming" , "Richardson, Bruce" , "alejandro.lucero@netronome.com" Message-ID: <20190404041925.GA26931@dpdk-tbie.sh.intel.com> References: <20190403071844.21126-1-tiwei.bie@intel.com> <20190403071844.21126-4-tiwei.bie@intel.com> <055CA897-B4BA-446F-BE9B-620DAEAB02EC@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <055CA897-B4BA-446F-BE9B-620DAEAB02EC@intel.com> User-Agent: Mutt/1.9.4 (2018-02-28) Subject: Re: [dpdk-dev] [RFC 3/3] bus/pci: add mdev support X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" Message-ID: <20190404041925.x1ucNUFFu2cGqpbbJLRK9Rip0RogMjvaKnRBl0MDUwg@z> On Wed, Apr 03, 2019 at 10:13:25PM +0800, Wiles, Keith wrote: > Some minor nits. > > > On Apr 3, 2019, at 2:18 AM, Tiwei Bie wrote: > > > > This patch adds the mdev support in PCI bus driver. A mdev > > driver is introduced to probe the mdev devices whose device > > API is "vfio-pci" on the mdev bus. > > > > PS. There are some hacks in this patch for now. > > > > Signed-off-by: Cunming Liang > > Signed-off-by: Tiwei Bie > > --- > > drivers/bus/pci/Makefile | 3 + > > drivers/bus/pci/linux/Makefile | 4 + > > drivers/bus/pci/linux/pci_vfio.c | 35 ++- > > drivers/bus/pci/linux/pci_vfio_mdev.c | 305 ++++++++++++++++++++++++++ > > drivers/bus/pci/meson.build | 4 +- > > drivers/bus/pci/pci_common.c | 17 +- > > drivers/bus/pci/private.h | 9 + > > drivers/bus/pci/rte_bus_pci.h | 11 +- > > 8 files changed, 370 insertions(+), 18 deletions(-) > > create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c > > > > diff --git a/drivers/bus/pci/Makefile b/drivers/bus/pci/Makefile > > index de53ce1bf..085ec9066 100644 > > --- a/drivers/bus/pci/Makefile > > +++ b/drivers/bus/pci/Makefile > > @@ -27,6 +27,9 @@ CFLAGS += -DALLOW_EXPERIMENTAL_API > > This define is enabled in 50-70 Makefiles, we can leave this here, but we should refactor this to a common place in the future. > > > > LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring > > LDLIBS += -lrte_ethdev -lrte_pci -lrte_kvargs > > +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y) > > +LDLIBS += -lrte_bus_mdev > > +endif > > See comment below. > > > > include $(RTE_SDK)/drivers/bus/pci/$(SYSTEM)/Makefile > > SRCS-$(CONFIG_RTE_LIBRTE_PCI_BUS) := $(addprefix $(SYSTEM)/,$(SRCS)) > > diff --git a/drivers/bus/pci/linux/Makefile b/drivers/bus/pci/linux/Makefile > > index 90404468b..88bbc2390 100644 > > --- a/drivers/bus/pci/linux/Makefile > > +++ b/drivers/bus/pci/linux/Makefile > > @@ -4,3 +4,7 @@ > > SRCS += pci.c > > SRCS += pci_uio.c > > SRCS += pci_vfio.c > > + > > +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y) > > + SRCS += pci_vfio_mdev.c > > +endif > > Do we need a configuration option for MDEV? > Can it be enabled for all builds or reuse a current configuration if only for some OS or arch? I think it's possible. > [...] > > +static int > > +get_pci_id(const char *sysfs_base, const char *dev_addr, > > + struct rte_pci_id *pci_id) > > +{ > > + int ret = 0; > > + int iommu_group_num; > > + int vfio_group_fd; > > + int vfio_dev_fd; > > + int container; > > + int class; > > + char name[PATH_MAX]; > > + struct vfio_group_status group_status = { > > + .argsz = sizeof(group_status) }; > > + > > + container = open("/dev/vfio/vfio", O_RDWR); > > Should this one use the VFIO_CONTAINER_PATH define in rte_vfio.h? > The define is gated by VFIO_PRESENT in that header. Yeah! And all the code in this file should be gated by VFIO_PRESENT as well. > > + if (container < 0) { > > + RTE_LOG(WARNING, EAL, "Failed to open VFIO container\n"); > > + ret = -1; > > + goto out; > > + } > > + > > + if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) { > > + /* Unknown API version */ > > + RTE_LOG(WARNING, EAL, "Unknown VFIO API version\n"); > > + ret = -1; > > + goto close_container; > > + } > > + > > + if (rte_vfio_get_group_num(sysfs_base, dev_addr, > > + &iommu_group_num) <= 0) { > > + RTE_LOG(WARNING, EAL, "%s not managed by VFIO driver\n", > > + dev_addr); > > + ret = -1; > > + goto close_container; > > + } > > + > > + snprintf(name, sizeof(name), "/dev/vfio/%d", iommu_group_num); > > We should be testing the return value from snprintf, but it is not done anyplace else in the code? > We need to look at fixing this in a different patch, but not here. > > + > > + vfio_group_fd = open(name, O_RDWR); > > + if (vfio_group_fd < 0) { > > + ret = -1; > > + goto close_container; > > + } [...] > > + /* class_id */ > > + if (pread64(vfio_dev_fd, &class, sizeof(uint32_t), > > + VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) + > > + PCI_CLASS_REVISION) != sizeof(uint32_t)) { > > + RTE_LOG(ERR, EAL, "Cannot read ClassID from PCI config space\n”); > > These should possible be DEBUG messages, but ERR is ok I guess. To me filling up the log with a bunch of messages when it is also flagged and log at a higher layer to many log messages. It would require us to look and make a cleaner > > + ret = -1; > > + goto close_device; > > + } > > + pci_id->class_id = class >> 8; > > + > > +close_device: > > + if (close(vfio_dev_fd) < 0) { > > + RTE_LOG(INFO, EAL, "Error when closing VFIO device for %s\n", > > + dev_addr); > > These should be ERR, DEBUG or WARN not INFO IMO or no log message at all. You are right. I missed this one. > > + ret = -1; > > + } > > + > > +close_group: > > + if (close(vfio_group_fd) < 0) { > > + RTE_LOG(INFO, EAL, "Error when closing VFIO group for %s\n", > > + dev_addr); > > + ret = -1; > > + } > > + > > +close_container: > > + if (close(container) < 0) { > > + RTE_LOG(INFO, EAL, "Error when closing VFIO container\n"); > > + ret = -1; > > + } > > + > > +out: > > Jumping to 4 different exit points makes this function complex, would it not be better to have one error exit point and test if the fds need to be closed > e.g. > if (pread64(...)) { > RTE_LOG(ERR, EAL, “Error message”); > goto err_exit; > } > > return 0; > err_exit: > if (vfio_dev_fd && close(vfio_dev_fd) < 0) { > > } > if (…) { > } > return -1; > > This should eliminate the variable ret and reduce the lines of code. Thanks for the suggestion. I can do that. > > + return ret; > > +} > > + > > +static int vfio_pci_probe(struct rte_mdev_driver *mdev_drv __rte_unused, > > + struct rte_mdev_device *mdev_dev) > > +{ > > + char name[RTE_UUID_STRLEN]; > > + struct rte_pci_device *dev; > > + struct rte_bus *bus; > > + int ret; > > + > > + bus = rte_bus_find_by_name("pci"); > > + if (bus == NULL) { > > + RTE_LOG(ERR, EAL, "Cannot find bus pci\n"); > > + return -ENOENT; > > + } > > + > > + if (bus->plug == NULL) { > > + RTE_LOG(ERR, EAL, "Function plug not supported by bus (%s)\n", > > + bus->name); > > + return -ENOTSUP; > > + } > > + > > + dev = malloc(sizeof(*dev)); > > + if (dev == NULL) > > + return -ENOMEM; > > If going to add error logs for the above tests, why does this one not get one? > Should we just remove them and check in the calling function instead? Then convert these to DEBUG logs or remove them. Thanks for the suggestion. Will improve the logs. > > + > > + memset(dev, 0, sizeof(*dev)); > > + dev->device.bus = &rte_pci_bus.bus; > > + rte_uuid_unparse(mdev_dev->addr, name, sizeof(name)); > > + > > + if (get_pci_id(rte_mdev_get_sysfs_path(), name, &dev->id)) { > > + free(dev); > > + return -1; > > + } > > + > > + snprintf(dev->name, sizeof(dev->name), "%s", name); > > This should be strlcpy() > > + dev->device.name = dev->name; > > + dev->kdrv = RTE_KDRV_VFIO; > > + dev->use_uuid = 1; > > + rte_uuid_copy(dev->uuid, mdev_dev->addr); > > + > > + // TODO: dev->device.devargs, etc > > + > > + memset(&dev->addr, -1, sizeof(dev->addr)); // XXX: TODO > > I have seen in the past that TODO or FIXME is not something that should be in the code. The TODO items should be removed and tracked outside the code if needed to be done later. Sorry for the confusion. There are some quick hacks in this RFC (especially in this function). I highlighted them with XXX or TODO. I didn't get rid of them for now, because there are different possible ways to add the mdev support in DPDK, and this RFC is just to demonstrate one possible way that we can do and to hear people's thoughts/opinions. PS. All the hacks (including comments starting with //) in this RFC are temporary. They will be fixed or removed in the formal patch. > > + > > + /* device is valid, add to the list (sorted) */ > > + if (TAILQ_EMPTY(&rte_pci_bus.device_list)) { > > + rte_pci_add_device(dev); > > + } else { > > + struct rte_pci_device *dev2; > > + int ret; > > + > > + TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) { > > + // XXX > > What does this comment mean? remove it or explain it. It's to indicate that there is a quick hack here. It won't exist in the formal patch. > > + ret = rte_pci_addr_cmp(&dev->addr, &dev2->addr); > > + if (ret == 0) > > + ret = strncmp(dev->name, dev2->name, > > + sizeof(dev->name)); > > + if (ret > 0) > > + continue; > > + if (ret < 0) { > > + rte_pci_insert_device(dev2, dev); > > + goto plug; > > + } > > + /* already registered */ > > + free(dev); > > + return 0; > > + } > > + > > + rte_pci_add_device(dev); > > + } > > + > > +plug: > > + ret = bus->plug(&dev->device); > > + if (ret != 0) { > > + rte_pci_remove_device(dev); > > + free(dev); > > + } else { > > + mdev_dev->private = dev; > > + } > > The coding guide states we remove {} around single line statements. > > + return ret; > > +} > > + > > +static int vfio_pci_remove(struct rte_mdev_device *mdev_dev) > > +{ > > + struct rte_pci_device *dev = mdev_dev->private; > > + struct rte_bus *bus; > > + int ret; > > + > > + if (dev == NULL) > > + return 0; > > + > > + bus = rte_bus_find_by_name("pci"); > > + if (bus == NULL) { > > + RTE_LOG(ERR, EAL, "Cannot find bus pci\n"); > > + return -ENOENT; > > + } > > + > > + if (bus->unplug == NULL) { > > + RTE_LOG(ERR, EAL, "Function unplug not supported by bus (%s)\n", > > + bus->name); > > + return -ENOTSUP; > > + } > > + > > + ret = bus->unplug(&dev->device); > > + if (ret == 0) > > + mdev_dev->private = NULL; > > + > > + return ret; > > +} > > + > > +static struct rte_mdev_driver vfio_pci_drv = { > > + .dev_api = RTE_MDEV_DEV_API_VFIO_PCI, > > + .probe = vfio_pci_probe, > > + .remove = vfio_pci_remove > > +}; > > + > > +RTE_MDEV_REGISTER_DRIVER(mdev_vfio_pci, vfio_pci_drv); > > diff --git a/drivers/bus/pci/meson.build b/drivers/bus/pci/meson.build > > index a3140ff97..c3e884657 100644 > > --- a/drivers/bus/pci/meson.build > > +++ b/drivers/bus/pci/meson.build > > @@ -11,8 +11,10 @@ sources = files('pci_common.c', > > if host_machine.system() == 'linux' > > sources += files('linux/pci.c', > > 'linux/pci_uio.c', > > - 'linux/pci_vfio.c') > > + 'linux/pci_vfio.c', > > + 'linux/pci_vfio_mdev.c’) > > If you need the RTE_LIBRTE_MDEV define then pci_vfio_mdev.c needs to be built conditionally? > > includes += include_directories('linux') > > + deps += ['bus_mdev’] > > If this was added form dev then is too should be conditional. Yeah. There should be a check of dpdk_conf.has('...') > > else > > sources += files('bsd/pci.c') > > includes += include_directories('bsd') > > diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c > > index 704b9d71a..6b47333e6 100644 > > --- a/drivers/bus/pci/pci_common.c > > +++ b/drivers/bus/pci/pci_common.c > > @@ -124,21 +124,17 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr, > > { > > int ret; > > bool already_probed; > > - struct rte_pci_addr *loc; > > > > if ((dr == NULL) || (dev == NULL)) > > return -EINVAL; > > > > - loc = &dev->addr; > > - > > /* The device is not blacklisted; Check if driver supports it */ > > if (!rte_pci_match(dr, dev)) > > /* Match of device and driver failed */ > > return 1; > > > > - RTE_LOG(INFO, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n", > > - loc->domain, loc->bus, loc->devid, loc->function, > > - dev->device.numa_node); > > + RTE_LOG(INFO, EAL, "PCI device %s on NUMA socket %i\n", > > + dev->name, dev->device.numa_node); > > > > /* no initialization when blacklisted, return without error */ > > if (dev->device.devargs != NULL && > > @@ -208,7 +204,6 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr, > > static int > > rte_pci_detach_dev(struct rte_pci_device *dev) > > { > > - struct rte_pci_addr *loc; > > struct rte_pci_driver *dr; > > int ret = 0; > > > > @@ -216,11 +211,9 @@ rte_pci_detach_dev(struct rte_pci_device *dev) > > return -EINVAL; > > > > dr = dev->driver; > > - loc = &dev->addr; > > > > - RTE_LOG(DEBUG, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n", > > - loc->domain, loc->bus, loc->devid, > > - loc->function, dev->device.numa_node); > > + RTE_LOG(DEBUG, EAL, "PCI device %s on NUMA socket %i\n", > > + dev->name, dev->device.numa_node); > > > > RTE_LOG(DEBUG, EAL, " remove driver: %x:%x %s\n", dev->id.vendor_id, > > dev->id.device_id, dr->driver.name); > > @@ -387,7 +380,7 @@ rte_pci_insert_device(struct rte_pci_device *exist_pci_dev, > > } > > > > /* Remove a device from PCI bus */ > > -static void > > +void > > rte_pci_remove_device(struct rte_pci_device *pci_dev) > > Have not looked yet, but did this function get added to the version.map file? > Does converting a function to public function require experimental tag too, maybe not? This is just to make it a global function declared in private.h, so that we can call it from other C files inside PCI bus. > > { > > TAILQ_REMOVE(&rte_pci_bus.device_list, pci_dev, next); > > diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h > > index 13c3324bb..d5815ee44 100644 > > --- a/drivers/bus/pci/private.h > > +++ b/drivers/bus/pci/private.h > > @@ -67,6 +67,15 @@ void rte_pci_add_device(struct rte_pci_device *pci_dev); > > void rte_pci_insert_device(struct rte_pci_device *exist_pci_dev, > > struct rte_pci_device *new_pci_dev); > > > > +/** > > + * Remove a PCI device from the PCI Bus. > > + * > > + * @param pci_dev > > + * PCI device to remove > > + * @return void > > + */ > > +void rte_pci_remove_device(struct rte_pci_device *pci_dev); > > + > > /** > > * Update a pci device object by asking the kernel for the latest information. > > * > > diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h > > index 06e004cd3..465a44935 100644 > > --- a/drivers/bus/pci/rte_bus_pci.h > > +++ b/drivers/bus/pci/rte_bus_pci.h > > @@ -51,6 +51,13 @@ TAILQ_HEAD(rte_pci_driver_list, rte_pci_driver); > > > > struct rte_devargs; > > > > +/* It's RTE_UUID_STRLEN, which is bigger than PCI_PRI_STR_SIZE. */ > > +#define RTE_PCI_NAME_LEN (36 + 1) > > + > > +// XXX: we can't include rte_uuid.h directly due to the conflicts > > +// introduced by stdbool.h > > +typedef unsigned char rte_uuid_t[16]; > > Does this need to have a the string ‘XXX’ in the comment? Note maybe a better word. OK. Thanks for the reviews/suggestions! Do appreciate it! Regards, Tiwei