DPDK patches and discussions
* [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK
@ 2019-04-03  7:18 Tiwei Bie
  2019-04-03  7:18 ` Tiwei Bie
                   ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-04-03  7:18 UTC (permalink / raw)
  To: dev; +Cc: cunming.liang, bruce.richardson, alejandro.lucero

Hi everyone,

This is a draft implementation of mdev (Mediated device [1]) bus
support in DPDK. Mdev is a way to virtualize devices in the Linux
kernel. Depending on the device-api (mdev_type/device_api), there
can be different types of mdev devices (e.g. vfio-pci). This RFC
introduces an mdev bus that scans the mdev devices in the system
and probes them based on their device-api.
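
To make the scan step concrete, here is a rough standalone sketch (not
the code from this series). It assumes mdev devices show up as
UUID-named entries under a sysfs directory such as
/sys/bus/mdev/devices. The names is_uuid_name() and
count_mdev_entries() are made up for illustration; the actual patch
uses rte_uuid_parse() for the name check.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if name has the canonical 8-4-4-4-12 hex UUID layout. */
int
is_uuid_name(const char *name)
{
	size_t i;

	assert(name != NULL);
	if (strlen(name) != 36)
		return 0;
	for (i = 0; i < 36; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			if (name[i] != '-')
				return 0;
		} else if (!isxdigit((unsigned char)name[i])) {
			return 0;
		}
	}
	return 1;
}

/*
 * Count the UUID-named entries under path (hypothetical helper).
 * Returns -1 if the directory cannot be opened.
 */
int
count_mdev_entries(const char *path)
{
	struct dirent *e;
	DIR *dir;
	int n = 0;

	assert(path != NULL);
	dir = opendir(path);
	if (dir == NULL)
		return -1;
	while ((e = readdir(dir)) != NULL) {
		/* "." and ".." fail the UUID check and are skipped. */
		if (is_uuid_name(e->d_name))
			n++;
	}
	closedir(dir);
	return n;
}
```

Calling count_mdev_entries("/sys/bus/mdev/devices") on a host without
the mdev framework loaded would be expected to return -1, since the
directory does not exist there.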

Take the mdev devices whose device-api is "vfio-pci" as an example.
In this RFC, these devices are probed by an mdev driver provided by
the PCI bus, which plugs them into the PCI bus. They are then probed
by the drivers registered on the PCI bus based on VendorID/
DeviceID/...

                     +----------+
                     | mdev bus |
                     +----+-----+
                          |
         +----------------+----+------+------+
         |                     |      |      |
   mdev_vfio_pci               ......
(device-api: vfio-pci)
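
The dispatch in the diagram can be pictured as a table lookup from the
device-api string to the mdev driver that claims it. The sketch below
is illustrative only: struct api_driver, api_table and
lookup_device_api() are hypothetical names, and "vfio-pci" is the only
device-api this RFC handles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical identifiers; only vfio-pci appears in this RFC. */
enum dev_api {
	DEV_API_UNKNOWN = -1,
	DEV_API_VFIO_PCI = 0,
};

struct api_driver {
	const char *api_name;	/* string read from mdev_type/device_api */
	enum dev_api api;
	const char *driver;	/* mdev driver that claims this device-api */
};

static const struct api_driver api_table[] = {
	{ "vfio-pci", DEV_API_VFIO_PCI, "mdev_vfio_pci" },
};

/* Return the table entry for a device-api string, or NULL if unsupported. */
const struct api_driver *
lookup_device_api(const char *device_api)
{
	size_t i;

	assert(device_api != NULL);
	for (i = 0; i < sizeof(api_table) / sizeof(api_table[0]); i++) {
		if (strcmp(device_api, api_table[i].api_name) == 0)
			return &api_table[i];
	}
	return NULL;
}
```

Supporting another device-api in this model would mean adding one more
row to the table, which corresponds to the "......" branches in the
diagram.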

There are also other ways to add mdev device support in DPDK (e.g.
letting the PCI bus scan /sys/bus/mdev/devices directly). Comments
would be appreciated!

[1] https://github.com/torvalds/linux/blob/master/Documentation/vfio-mediated-device.txt

Thanks,
Tiwei

Tiwei Bie (3):
  eal: add a helper for reading string from sysfs
  bus/mdev: add mdev bus support
  bus/pci: add mdev support

 config/common_base                        |   5 +
 config/common_linux                       |   1 +
 drivers/bus/Makefile                      |   1 +
 drivers/bus/mdev/Makefile                 |  41 +++
 drivers/bus/mdev/linux/Makefile           |   6 +
 drivers/bus/mdev/linux/mdev.c             | 117 ++++++++
 drivers/bus/mdev/mdev.c                   | 310 ++++++++++++++++++++++
 drivers/bus/mdev/meson.build              |  15 ++
 drivers/bus/mdev/private.h                |  90 +++++++
 drivers/bus/mdev/rte_bus_mdev.h           | 141 ++++++++++
 drivers/bus/mdev/rte_bus_mdev_version.map |  12 +
 drivers/bus/meson.build                   |   2 +-
 drivers/bus/pci/Makefile                  |   3 +
 drivers/bus/pci/linux/Makefile            |   4 +
 drivers/bus/pci/linux/pci_vfio.c          |  35 ++-
 drivers/bus/pci/linux/pci_vfio_mdev.c     | 305 +++++++++++++++++++++
 drivers/bus/pci/meson.build               |   4 +-
 drivers/bus/pci/pci_common.c              |  17 +-
 drivers/bus/pci/private.h                 |   9 +
 drivers/bus/pci/rte_bus_pci.h             |  11 +-
 lib/librte_eal/common/eal_filesystem.h    |   7 +
 lib/librte_eal/freebsd/eal/eal.c          |  22 ++
 lib/librte_eal/linux/eal/eal.c            |  22 ++
 lib/librte_eal/rte_eal_version.map        |   1 +
 mk/rte.app.mk                             |   1 +
 25 files changed, 1163 insertions(+), 19 deletions(-)
 create mode 100644 drivers/bus/mdev/Makefile
 create mode 100644 drivers/bus/mdev/linux/Makefile
 create mode 100644 drivers/bus/mdev/linux/mdev.c
 create mode 100644 drivers/bus/mdev/mdev.c
 create mode 100644 drivers/bus/mdev/meson.build
 create mode 100644 drivers/bus/mdev/private.h
 create mode 100644 drivers/bus/mdev/rte_bus_mdev.h
 create mode 100644 drivers/bus/mdev/rte_bus_mdev_version.map
 create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c

-- 
2.17.1

* [dpdk-dev] [RFC 1/3] eal: add a helper for reading string from sysfs
  2019-04-03  7:18 [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Tiwei Bie
  2019-04-03  7:18 ` Tiwei Bie
@ 2019-04-03  7:18 ` Tiwei Bie
  2019-04-03  7:18   ` Tiwei Bie
  2019-04-03  7:18 ` [dpdk-dev] [RFC 2/3] bus/mdev: add mdev bus support Tiwei Bie
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Tiwei Bie @ 2019-04-03  7:18 UTC (permalink / raw)
  To: dev; +Cc: cunming.liang, bruce.richardson, alejandro.lucero

This patch adds a helper for reading a string from sysfs.

Signed-off-by: Cunming Liang <cunming.liang@intel.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
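For readers who want to try the idea outside DPDK, below is a rough
standalone equivalent using only stdio. It is a sketch, not the helper
added by this patch: read_first_line() is a made-up name, and unlike
the helper in the diff it also strips the trailing newline, which the
mdev bus code in patch 2/3 does separately with strchr().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of filename into buf (at most sz - 1 characters)
 * and strip the trailing newline, if any.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
int
read_first_line(const char *filename, char *buf, size_t sz)
{
	FILE *f;

	assert(filename != NULL && buf != NULL && sz > 0);
	f = fopen(filename, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, (int)sz, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	/* strcspn() returns the index of '\n' or of the terminator. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

A caller would typically pass a path such as
<mdev sysfs dir>/<uuid>/mdev_type/device_api and compare the result
against "vfio-pci".
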
 lib/librte_eal/common/eal_filesystem.h |  7 +++++++
 lib/librte_eal/freebsd/eal/eal.c       | 22 ++++++++++++++++++++++
 lib/librte_eal/linux/eal/eal.c         | 22 ++++++++++++++++++++++
 lib/librte_eal/rte_eal_version.map     |  1 +
 4 files changed, 52 insertions(+)

diff --git a/lib/librte_eal/common/eal_filesystem.h b/lib/librte_eal/common/eal_filesystem.h
index 89a3added..2c823b27d 100644
--- a/lib/librte_eal/common/eal_filesystem.h
+++ b/lib/librte_eal/common/eal_filesystem.h
@@ -116,4 +116,11 @@ eal_get_hugefile_lock_path(char *buffer, size_t buflen, int f_id)
  * Used to read information from files on /sys */
 int eal_parse_sysfs_value(const char *filename, unsigned long *val);
 
+/**
+ * Function to read a line from a file on the filesystem.
+ * Used to read information from files on /sys
+ */
+int __rte_experimental
+rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz);
+
 #endif /* EAL_FILESYSTEM_H */
diff --git a/lib/librte_eal/freebsd/eal/eal.c b/lib/librte_eal/freebsd/eal/eal.c
index 4e86b10b1..816cb9b91 100644
--- a/lib/librte_eal/freebsd/eal/eal.c
+++ b/lib/librte_eal/freebsd/eal/eal.c
@@ -208,6 +208,28 @@ eal_parse_sysfs_value(const char *filename, unsigned long *val)
 	return 0;
 }
 
+int
+rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
+{
+	FILE *f;
+
+	f = fopen(filename, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
+			__func__, filename);
+		return -1;
+	}
+
+	if (fgets(buf, sz, f) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs file %s\n",
+			__func__, filename);
+		fclose(f);
+		return -1;
+	}
+
+	fclose(f);
+	return 0;
+}
 
 /* create memory configuration in shared/mmap memory. Take out
  * a write lock on the memsegs, so we can auto-detect primary/secondary.
diff --git a/lib/librte_eal/linux/eal/eal.c b/lib/librte_eal/linux/eal/eal.c
index 13f401684..865cb19d7 100644
--- a/lib/librte_eal/linux/eal/eal.c
+++ b/lib/librte_eal/linux/eal/eal.c
@@ -293,6 +293,28 @@ eal_parse_sysfs_value(const char *filename, unsigned long *val)
 	return 0;
 }
 
+int
+rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
+{
+	FILE *f;
+
+	f = fopen(filename, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
+			__func__, filename);
+		return -1;
+	}
+
+	if (fgets(buf, sz, f) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs file %s\n",
+			__func__, filename);
+		fclose(f);
+		return -1;
+	}
+
+	fclose(f);
+	return 0;
+}
 
 /* create memory configuration in shared/mmap memory. Take out
  * a write lock on the memsegs, so we can auto-detect primary/secondary.
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d6e375135..d16258ffc 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -298,6 +298,7 @@ EXPERIMENTAL {
 	rte_devargs_remove;
 	rte_devargs_type_count;
 	rte_eal_cleanup;
+	rte_eal_parse_sysfs_str;
 	rte_extmem_attach;
 	rte_extmem_detach;
 	rte_extmem_register;
-- 
2.17.1

* [dpdk-dev] [RFC 2/3] bus/mdev: add mdev bus support
  2019-04-03  7:18 [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Tiwei Bie
  2019-04-03  7:18 ` Tiwei Bie
  2019-04-03  7:18 ` [dpdk-dev] [RFC 1/3] eal: add a helper for reading string from sysfs Tiwei Bie
@ 2019-04-03  7:18 ` Tiwei Bie
  2019-04-03  7:18   ` Tiwei Bie
  2019-04-03  7:18 ` [dpdk-dev] [RFC 3/3] bus/pci: add mdev support Tiwei Bie
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Tiwei Bie @ 2019-04-03  7:18 UTC (permalink / raw)
  To: dev; +Cc: cunming.liang, bruce.richardson, alejandro.lucero

This patch adds mdev (Mediated device) bus support to DPDK.
The bus driver scans all the mdev devices in the system and
probes them based on their device API (mdev_type/device_api).

Signed-off-by: Cunming Liang <cunming.liang@intel.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 config/common_base                        |   5 +
 config/common_linux                       |   1 +
 drivers/bus/Makefile                      |   1 +
 drivers/bus/mdev/Makefile                 |  41 +++
 drivers/bus/mdev/linux/Makefile           |   6 +
 drivers/bus/mdev/linux/mdev.c             | 117 ++++++++
 drivers/bus/mdev/mdev.c                   | 310 ++++++++++++++++++++++
 drivers/bus/mdev/meson.build              |  15 ++
 drivers/bus/mdev/private.h                |  90 +++++++
 drivers/bus/mdev/rte_bus_mdev.h           | 141 ++++++++++
 drivers/bus/mdev/rte_bus_mdev_version.map |  12 +
 drivers/bus/meson.build                   |   2 +-
 mk/rte.app.mk                             |   1 +
 13 files changed, 741 insertions(+), 1 deletion(-)
 create mode 100644 drivers/bus/mdev/Makefile
 create mode 100644 drivers/bus/mdev/linux/Makefile
 create mode 100644 drivers/bus/mdev/linux/mdev.c
 create mode 100644 drivers/bus/mdev/mdev.c
 create mode 100644 drivers/bus/mdev/meson.build
 create mode 100644 drivers/bus/mdev/private.h
 create mode 100644 drivers/bus/mdev/rte_bus_mdev.h
 create mode 100644 drivers/bus/mdev/rte_bus_mdev_version.map

diff --git a/config/common_base b/config/common_base
index 6292bc4af..d29e9a089 100644
--- a/config/common_base
+++ b/config/common_base
@@ -168,6 +168,11 @@ CONFIG_RTE_LIBRTE_COMMON_DPAAX=n
 #
 CONFIG_RTE_LIBRTE_IFPGA_BUS=y
 
+#
+# Compile the mdev bus
+#
+CONFIG_RTE_LIBRTE_MDEV_BUS=n
+
 #
 # Compile PCI bus driver
 #
diff --git a/config/common_linux b/config/common_linux
index 75334273d..7de9624c0 100644
--- a/config/common_linux
+++ b/config/common_linux
@@ -25,6 +25,7 @@ CONFIG_RTE_LIBRTE_AVP_PMD=y
 CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=y
 CONFIG_RTE_LIBRTE_NFP_PMD=y
 CONFIG_RTE_LIBRTE_POWER=y
+CONFIG_RTE_LIBRTE_MDEV_BUS=y
 CONFIG_RTE_VIRTIO_USER=y
 CONFIG_RTE_PROC_INFO=y
 
diff --git a/drivers/bus/Makefile b/drivers/bus/Makefile
index cea3b55e6..b2144ee63 100644
--- a/drivers/bus/Makefile
+++ b/drivers/bus/Makefile
@@ -8,6 +8,7 @@ ifeq ($(CONFIG_RTE_EAL_VFIO),y)
 DIRS-$(CONFIG_RTE_LIBRTE_FSLMC_BUS) += fslmc
 endif
 DIRS-$(CONFIG_RTE_LIBRTE_IFPGA_BUS) += ifpga
+DIRS-$(CONFIG_RTE_LIBRTE_MDEV_BUS) += mdev
 DIRS-$(CONFIG_RTE_LIBRTE_PCI_BUS) += pci
 DIRS-$(CONFIG_RTE_LIBRTE_VDEV_BUS) += vdev
 DIRS-$(CONFIG_RTE_LIBRTE_VMBUS) += vmbus
diff --git a/drivers/bus/mdev/Makefile b/drivers/bus/mdev/Makefile
new file mode 100644
index 000000000..b2faee395
--- /dev/null
+++ b/drivers/bus/mdev/Makefile
@@ -0,0 +1,41 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_bus_mdev.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += -I$(SRCDIR)
+
+# versioning export map
+EXPORT_MAP := rte_bus_mdev_version.map
+
+# library version
+LIBABIVER := 1
+
+ifneq ($(CONFIG_RTE_EXEC_ENV_LINUX),)
+SYSTEM := linux
+endif
+ifneq ($(CONFIG_RTE_EXEC_ENV_FREEBSD),)
+$(error "Mdev bus not implemented for BSD yet")
+endif
+
+CFLAGS += -I$(RTE_SDK)/drivers/bus/mdev/$(SYSTEM)
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/$(SYSTEM)/eal
+
+LDLIBS += -lrte_eal
+
+include $(RTE_SDK)/drivers/bus/mdev/$(SYSTEM)/Makefile
+SRCS-$(CONFIG_RTE_LIBRTE_MDEV_BUS) := $(addprefix $(SYSTEM)/,$(SRCS))
+SRCS-$(CONFIG_RTE_LIBRTE_MDEV_BUS) += mdev.c
+
+SYMLINK-$(CONFIG_RTE_LIBRTE_MDEV_BUS)-include += rte_bus_mdev.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/bus/mdev/linux/Makefile b/drivers/bus/mdev/linux/Makefile
new file mode 100644
index 000000000..a777ad3d4
--- /dev/null
+++ b/drivers/bus/mdev/linux/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+SRCS += mdev.c
+
+CFLAGS += -D_GNU_SOURCE
diff --git a/drivers/bus/mdev/linux/mdev.c b/drivers/bus/mdev/linux/mdev.c
new file mode 100644
index 000000000..ecfe0eba6
--- /dev/null
+++ b/drivers/bus/mdev/linux/mdev.c
@@ -0,0 +1,117 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <string.h>
+#include <dirent.h>
+
+#include <rte_log.h>
+#include <rte_bus_mdev.h>
+
+#include "eal_filesystem.h"
+
+#include "private.h"
+
+static int
+mdev_scan_one(const char *dirname, const rte_uuid_t addr)
+{
+	struct rte_mdev_device *mdev;
+	char device_api[PATH_MAX];
+	char filename[PATH_MAX];
+	char *ptr;
+
+	mdev = malloc(sizeof(*mdev));
+	if (mdev == NULL)
+		return -1;
+
+	memset(mdev, 0, sizeof(*mdev));
+	mdev->device.bus = &rte_mdev_bus.bus;
+	rte_uuid_copy(mdev->addr, addr);
+
+	/* get device_api */
+	snprintf(filename, sizeof(filename), "%s/mdev_type/device_api",
+		 dirname);
+	if (rte_eal_parse_sysfs_str(filename, device_api,
+				    sizeof(device_api)) < 0) {
+		free(mdev);
+		return -1;
+	}
+
+	ptr = strchr(device_api, '\n');
+	if (ptr != NULL)
+		*ptr = '\0';
+
+	mdev_name_set(mdev);
+
+	if (strcmp(device_api, "vfio-pci") == 0) {
+		/* device api */
+		mdev->dev_api = RTE_MDEV_DEV_API_VFIO_PCI;
+
+		if (TAILQ_EMPTY(&rte_mdev_bus.device_list))
+			rte_mdev_add_device(mdev);
+		else {
+			struct rte_mdev_device *dev;
+			int ret;
+
+			TAILQ_FOREACH(dev, &rte_mdev_bus.device_list, next) {
+				ret = rte_uuid_compare(mdev->addr, dev->addr);
+				if (ret > 0)
+					continue;
+
+				if (ret < 0)
+					rte_mdev_insert_device(dev, mdev);
+				else /* already registered */
+					free(mdev);
+
+				return 0;
+			}
+
+			rte_mdev_add_device(mdev);
+		}
+	} else {
+		RTE_LOG(DEBUG, EAL, "%s(): mdev device_api %s is not supported\n",
+			__func__, device_api);
+		free(mdev); /* unsupported device_api: do not leak the entry */
+	}
+	return 0;
+}
+
+/*
+ * Scan the mdev bus and add the discovered devices to the device
+ * list
+ */
+int
+rte_mdev_scan(void)
+{
+	struct dirent *e;
+	DIR *dir;
+	char dirname[PATH_MAX];
+	rte_uuid_t addr;
+
+	dir = opendir(rte_mdev_get_sysfs_path());
+	if (dir == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): opendir failed: %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+
+	while ((e = readdir(dir)) != NULL) {
+		if (e->d_name[0] == '.')
+			continue;
+
+		if (rte_uuid_parse(e->d_name, addr) != 0)
+			continue;
+
+		snprintf(dirname, sizeof(dirname), "%s/%s",
+			 rte_mdev_get_sysfs_path(), e->d_name);
+
+		if (mdev_scan_one(dirname, addr) < 0)
+			goto error;
+	}
+	closedir(dir);
+	return 0;
+
+error:
+	closedir(dir);
+	return -1;
+}
diff --git a/drivers/bus/mdev/mdev.c b/drivers/bus/mdev/mdev.c
new file mode 100644
index 000000000..2f9209cca
--- /dev/null
+++ b/drivers/bus/mdev/mdev.c
@@ -0,0 +1,310 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <string.h>
+#include <inttypes.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/queue.h>
+#include <sys/mman.h>
+
+#include <rte_errno.h>
+#include <rte_interrupts.h>
+#include <rte_log.h>
+#include <rte_per_lcore.h>
+#include <rte_memory.h>
+#include <rte_eal.h>
+#include <rte_common.h>
+#include <rte_devargs.h>
+#include <rte_uuid.h>
+#include <rte_bus_mdev.h>
+
+#include "private.h"
+
+#define SYSFS_MDEV_DEVICES "/sys/bus/mdev/devices"
+
+const char *rte_mdev_get_sysfs_path(void)
+{
+	const char *path = NULL;
+
+	path = getenv("SYSFS_MDEV_DEVICES");
+	if (path == NULL)
+		return SYSFS_MDEV_DEVICES;
+
+	return path;
+}
+
+static void
+rte_mdev_device_name(const rte_uuid_t addr, char *output, size_t size)
+{
+	RTE_VERIFY(size >= RTE_UUID_STRLEN);
+	rte_uuid_unparse(addr, output, size);
+}
+
+static struct rte_devargs *
+mdev_devargs_lookup(struct rte_mdev_device *dev)
+{
+	struct rte_devargs *devargs;
+	rte_uuid_t addr;
+
+	RTE_EAL_DEVARGS_FOREACH("mdev", devargs) {
+		if (devargs->bus->parse(devargs->name, addr) == 0 &&
+		    rte_uuid_compare(dev->addr, addr) == 0)
+			return devargs;
+	}
+	return NULL;
+}
+
+void
+mdev_name_set(struct rte_mdev_device *dev)
+{
+	struct rte_devargs *devargs;
+
+	/* Each device has its internal, canonical name set. */
+	rte_mdev_device_name(dev->addr, dev->name, sizeof(dev->name));
+	devargs = mdev_devargs_lookup(dev);
+	dev->device.devargs = devargs;
+	/* In blacklist mode, if the device is not blacklisted, no
+	 * rte_devargs exists for it.
+	 */
+	if (devargs != NULL)
+		/* If an rte_devargs exists, the generic rte_device uses the
+		 * given name as its name.
+		 */
+		dev->device.name = dev->device.devargs->name;
+	else
+		/* Otherwise, it uses the internal, canonical form. */
+		dev->device.name = dev->name;
+}
+
+void
+rte_mdev_register(struct rte_mdev_driver *driver)
+{
+	TAILQ_INSERT_TAIL(&rte_mdev_bus.driver_list, driver, next);
+	driver->bus = &rte_mdev_bus;
+}
+
+void
+rte_mdev_unregister(struct rte_mdev_driver *driver)
+{
+	TAILQ_REMOVE(&rte_mdev_bus.driver_list, driver, next);
+	driver->bus = NULL;
+}
+
+void
+rte_mdev_add_device(struct rte_mdev_device *mdev)
+{
+	TAILQ_INSERT_TAIL(&rte_mdev_bus.device_list, mdev, next);
+}
+
+void
+rte_mdev_insert_device(struct rte_mdev_device *exist_mdev,
+		       struct rte_mdev_device *new_mdev)
+{
+	TAILQ_INSERT_BEFORE(exist_mdev, new_mdev, next);
+}
+
+void
+rte_mdev_remove_device(struct rte_mdev_device *mdev)
+{
+	TAILQ_REMOVE(&rte_mdev_bus.device_list, mdev, next);
+}
+
+static struct rte_device *
+mdev_find_device(const struct rte_device *start, rte_dev_cmp_t cmp,
+		 const void *data)
+{
+	const struct rte_mdev_device *pstart;
+	struct rte_mdev_device *pdev;
+
+	if (start != NULL) {
+		pstart = RTE_DEV_TO_MDEV_CONST(start);
+		pdev = TAILQ_NEXT(pstart, next);
+	} else {
+		pdev = TAILQ_FIRST(&rte_mdev_bus.device_list);
+	}
+	while (pdev != NULL) {
+		if (cmp(&pdev->device, data) == 0)
+			return &pdev->device;
+		pdev = TAILQ_NEXT(pdev, next);
+	}
+	return NULL;
+}
+
+int
+rte_mdev_match(const struct rte_mdev_driver *mdev_drv,
+	       const struct rte_mdev_device *mdev_dev)
+{
+	if (mdev_drv->dev_api == mdev_dev->dev_api)
+		return 1;
+
+	return 0;
+}
+
+static int
+rte_mdev_probe_one_driver(struct rte_mdev_driver *dr,
+			  struct rte_mdev_device *dev)
+{
+	int ret;
+
+	if (dr == NULL || dev == NULL)
+		return -EINVAL;
+
+	/* no initialization when blacklisted, return without error */
+	if (dev->device.devargs != NULL &&
+	    dev->device.devargs->policy == RTE_DEV_BLACKLISTED) {
+		RTE_LOG(INFO, EAL, "Device is blacklisted, not initializing\n");
+		return 1;
+	}
+
+	/* The device is not blacklisted; Check if driver supports it */
+	if (!rte_mdev_match(dr, dev)) {
+		/* Match of device and driver failed */
+		return 1;
+	}
+
+	/* reference driver structure */
+	dev->driver = dr;
+
+	/* call the driver probe() function */
+	ret = dr->probe(dr, dev);
+	if (ret != 0)
+		dev->driver = NULL;
+
+	return ret;
+}
+
+static int
+mdev_probe_all_drivers(struct rte_mdev_device *dev)
+{
+	struct rte_mdev_driver *dr = NULL;
+	int rc = 0;
+
+	if (dev == NULL)
+		return -1;
+
+	/* Check if a driver is already loaded */
+	if (dev->driver != NULL)
+		return 0;
+
+	FOREACH_DRIVER_ON_MDEV_BUS(dr) {
+		rc = rte_mdev_probe_one_driver(dr, dev);
+		if (rc < 0)
+			/* negative value is an error */
+			return -1;
+		if (rc > 0)
+			/* positive value means driver doesn't support it */
+			continue;
+		return 0;
+	}
+	return 1;
+}
+
+int
+rte_mdev_probe(void)
+{
+	struct rte_mdev_device *mdev = NULL;
+	size_t probed = 0, failed = 0;
+	struct rte_devargs *devargs;
+	int probe_all = 0;
+	int ret = 0;
+
+	if (rte_mdev_bus.bus.conf.scan_mode != RTE_BUS_SCAN_WHITELIST)
+		probe_all = 1;
+
+	FOREACH_DEVICE_ON_MDEV_BUS(mdev) {
+		probed++;
+
+		devargs = mdev->device.devargs;
+		/* probe all or only whitelisted devices */
+		if (probe_all)
+			ret = mdev_probe_all_drivers(mdev);
+		else if (devargs != NULL &&
+			devargs->policy == RTE_DEV_WHITELISTED)
+			ret = mdev_probe_all_drivers(mdev);
+		if (ret < 0) {
+			char name[RTE_UUID_STRLEN];
+			rte_uuid_unparse(mdev->addr, name, sizeof(name));
+			RTE_LOG(ERR, EAL, "Requested device %s cannot be used\n",
+				name);
+			rte_errno = errno;
+			failed++;
+			ret = 0;
+		}
+	}
+
+	return (probed && probed == failed) ? -1 : 0;
+}
+
+static int
+mdev_plug(struct rte_device *dev)
+{
+	return mdev_probe_all_drivers(RTE_DEV_TO_MDEV(dev));
+}
+
+static int
+rte_mdev_detach_dev(struct rte_mdev_device *dev)
+{
+	struct rte_mdev_driver *dr;
+	int ret = 0;
+
+	if (dev == NULL)
+		return -EINVAL;
+
+	dr = dev->driver;
+
+	if (dr != NULL && dr->remove) {
+		ret = dr->remove(dev);
+		if (ret != 0)
+			return ret;
+	}
+
+	/* clear driver structure */
+	dev->driver = NULL;
+
+	return 0;
+}
+
+static int
+mdev_unplug(struct rte_device *dev)
+{
+	struct rte_mdev_device *pmdev;
+	int ret;
+
+	pmdev = RTE_DEV_TO_MDEV(dev);
+	ret = rte_mdev_detach_dev(pmdev);
+	if (ret == 0) {
+		rte_mdev_remove_device(pmdev);
+		free(pmdev);
+	}
+	return ret;
+}
+
+static int
+mdev_parse(const char *name, void *addr)
+{
+	rte_uuid_t uuid;
+	int parse;
+
+	parse = (rte_uuid_parse(name, uuid) == 0);
+	if (parse && addr != NULL)
+		rte_uuid_copy(addr, uuid);
+	return parse == false;
+}
+
+struct rte_mdev_bus rte_mdev_bus = {
+	.bus = {
+		.scan = rte_mdev_scan,
+		.probe = rte_mdev_probe,
+		.find_device = mdev_find_device,
+		.plug = mdev_plug,
+		.unplug = mdev_unplug,
+		.parse = mdev_parse,
+	},
+	.device_list = TAILQ_HEAD_INITIALIZER(rte_mdev_bus.device_list),
+	.driver_list = TAILQ_HEAD_INITIALIZER(rte_mdev_bus.driver_list),
+};
+
+RTE_REGISTER_BUS(mdev, rte_mdev_bus.bus);
diff --git a/drivers/bus/mdev/meson.build b/drivers/bus/mdev/meson.build
new file mode 100644
index 000000000..33c701cb9
--- /dev/null
+++ b/drivers/bus/mdev/meson.build
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+version = 1
+allow_experimental_apis = true
+install_headers('rte_bus_mdev.h')
+sources = files('mdev.c')
+
+if host_machine.system() == 'linux'
+	sources += files('linux/mdev.c')
+	includes += include_directories('linux')
+	cflags += ['-D_GNU_SOURCE']
+else
+	build = false
+endif
diff --git a/drivers/bus/mdev/private.h b/drivers/bus/mdev/private.h
new file mode 100644
index 000000000..81cfe3045
--- /dev/null
+++ b/drivers/bus/mdev/private.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _MDEV_PRIVATE_H_
+#define _MDEV_PRIVATE_H_
+
+#include <stdbool.h>
+#include <stdio.h>
+#include <rte_bus_mdev.h>
+
+struct rte_mdev_driver;
+struct rte_mdev_device;
+
+extern struct rte_mdev_bus rte_mdev_bus;
+
+/**
+ * Probe the mdev bus.
+ *
+ * @return
+ *   - 0 on success.
+ *   - !0 on error.
+ */
+int rte_mdev_probe(void);
+
+/**
+ * Scan the mdev bus and add the discovered devices to the device
+ * list.
+ *
+ * @return
+ *  0 on success, negative on error
+ */
+int rte_mdev_scan(void);
+
+/**
+ * Set the name of a mdev device.
+ */
+void mdev_name_set(struct rte_mdev_device *dev);
+
+/**
+ * Add a mdev device to the mdev bus (append to mdev device list). This function
+ * also updates the bus references of the mdev device (and the generic device
+ * object embedded within).
+ *
+ * @param mdev
+ *	mdev device to add
+ * @return void
+ */
+void rte_mdev_add_device(struct rte_mdev_device *mdev);
+
+/**
+ * Insert a mdev device in the mdev bus at a particular location in the device
+ * list. It also updates the mdev bus reference of the new devices to be
+ * inserted.
+ *
+ * @param exist_mdev
+ *	existing mdev device in mdev bus
+ * @param new_mdev
+ *	mdev device to be added before exist_mdev
+ * @return void
+ */
+void rte_mdev_insert_device(struct rte_mdev_device *exist_mdev,
+			    struct rte_mdev_device *new_mdev);
+
+/**
+ * Remove a mdev device from the mdev bus. This sets to NULL the bus references
+ * in the mdev device object as well as the generic device object.
+ *
+ * @param mdev_device
+ *	mdev device to be removed from mdev bus
+ * @return void
+ */
+void rte_mdev_remove_device(struct rte_mdev_device *mdev_device);
+
+/**
+ * Match the mdev driver and device using mdev device_api.
+ *
+ * @param mdev_drv
+ *      mdev driver from which device_api would be extracted
+ * @param mdev_dev
+ *      mdev device to match against the driver
+ * @return
+ *      1 for successful match
+ *      0 for unsuccessful match
+ */
+int
+rte_mdev_match(const struct rte_mdev_driver *mdev_drv,
+	       const struct rte_mdev_device *mdev_dev);
+
+#endif /* _MDEV_PRIVATE_H_ */
diff --git a/drivers/bus/mdev/rte_bus_mdev.h b/drivers/bus/mdev/rte_bus_mdev.h
new file mode 100644
index 000000000..913521ace
--- /dev/null
+++ b/drivers/bus/mdev/rte_bus_mdev.h
@@ -0,0 +1,141 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _RTE_BUS_MDEV_H_
+#define _RTE_BUS_MDEV_H_
+
+/**
+ * @file
+ *
+ * RTE Mdev Bus Interface
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <limits.h>
+#include <errno.h>
+#include <sys/queue.h>
+#include <stdint.h>
+#include <inttypes.h>
+
+#include <rte_debug.h>
+#include <rte_interrupts.h>
+#include <rte_dev.h>
+#include <rte_uuid.h>
+#include <rte_bus.h>
+
+struct rte_devargs;
+
+enum rte_mdev_device_api {
+	RTE_MDEV_DEV_API_VFIO_PCI = 0,
+	RTE_MDEV_DEV_API_MAX,
+};
+
+struct rte_mdev_bus;
+struct rte_mdev_driver;
+struct rte_mdev_device;
+
+/** Pathname of mdev devices directory. */
+const char * __rte_experimental rte_mdev_get_sysfs_path(void);
+
+/**
+ * Register a mdev driver.
+ *
+ * @param driver
+ *   A pointer to a rte_mdev_driver structure describing the driver
+ *   to be registered.
+ */
+void __rte_experimental rte_mdev_register(struct rte_mdev_driver *driver);
+
+#define RTE_MDEV_REGISTER_DRIVER(nm, mdev_drv) \
+RTE_INIT(mdevinitfn_ ##nm) \
+{ \
+	(mdev_drv).driver.name = RTE_STR(nm); \
+	rte_mdev_register(&mdev_drv); \
+} \
+RTE_PMD_EXPORT_NAME(nm, __COUNTER__)
+
+/**
+ * Unregister a mdev driver.
+ *
+ * @param driver
+ *   A pointer to a rte_mdev_driver structure describing the driver
+ *   to be unregistered.
+ */
+void __rte_experimental rte_mdev_unregister(struct rte_mdev_driver *driver);
+
+/**
+ * Initialisation function for the driver called during mdev probing.
+ */
+typedef int (mdev_probe_t)(struct rte_mdev_driver *, struct rte_mdev_device *);
+
+/**
+ * Uninitialisation function for the driver called during hotplugging.
+ */
+typedef int (mdev_remove_t)(struct rte_mdev_device *);
+
+/**
+ * A structure describing a mdev driver.
+ */
+struct rte_mdev_driver {
+	TAILQ_ENTRY(rte_mdev_driver) next; /**< Next in list. */
+	struct rte_driver driver;          /**< Inherit core driver. */
+	struct rte_mdev_bus *bus;          /**< Mdev bus reference. */
+	mdev_probe_t *probe;               /**< Device probe function. */
+	mdev_remove_t *remove;             /**< Device remove function. */
+	enum rte_mdev_device_api dev_api;  /**< Device API. */
+};
+
+/**
+ * A structure describing a mdev device.
+ */
+struct rte_mdev_device {
+	TAILQ_ENTRY(rte_mdev_device) next; /**< Next mdev device. */
+	struct rte_device device;	   /**< Inherit core device. */
+	enum rte_mdev_device_api dev_api;  /**< Device API. */
+	struct rte_mdev_driver *driver;    /**< Associated driver. */
+	rte_uuid_t addr;                   /**< Location. */
+	char name[RTE_UUID_STRLEN];        /**< Location (ASCII). */
+	void *private;                     /**< Driver-specific data. */
+};
+
+/**
+ * @internal
+ * Helper macro for drivers that need to convert to struct rte_mdev_device.
+ */
+#define RTE_DEV_TO_MDEV(ptr) container_of(ptr, struct rte_mdev_device, device)
+
+#define RTE_DEV_TO_MDEV_CONST(ptr) \
+	container_of(ptr, const struct rte_mdev_device, device)
+
+/** List of mdev devices */
+TAILQ_HEAD(rte_mdev_device_list, rte_mdev_device);
+/** List of mdev drivers */
+TAILQ_HEAD(rte_mdev_driver_list, rte_mdev_driver);
+
+/**
+ * Structure describing the mdev bus
+ */
+struct rte_mdev_bus {
+	struct rte_bus bus;                /**< Inherit the generic class */
+	struct rte_mdev_device_list device_list;  /**< List of mdev devices */
+	struct rte_mdev_driver_list driver_list;  /**< List of mdev drivers */
+};
+
+/* Mdev Bus iterators */
+#define FOREACH_DEVICE_ON_MDEV_BUS(p)	\
+		TAILQ_FOREACH(p, &(rte_mdev_bus.device_list), next)
+
+#define FOREACH_DRIVER_ON_MDEV_BUS(p)	\
+		TAILQ_FOREACH(p, &(rte_mdev_bus.driver_list), next)
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_BUS_MDEV_H_ */
diff --git a/drivers/bus/mdev/rte_bus_mdev_version.map b/drivers/bus/mdev/rte_bus_mdev_version.map
new file mode 100644
index 000000000..7f73bf96b
--- /dev/null
+++ b/drivers/bus/mdev/rte_bus_mdev_version.map
@@ -0,0 +1,12 @@
+DPDK_19.05 {
+
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	rte_mdev_get_sysfs_path;
+	rte_mdev_register;
+	rte_mdev_unregister;
+};
diff --git a/drivers/bus/meson.build b/drivers/bus/meson.build
index 80de2d91d..f0ab19a03 100644
--- a/drivers/bus/meson.build
+++ b/drivers/bus/meson.build
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
-drivers = ['dpaa', 'fslmc', 'ifpga', 'pci', 'vdev', 'vmbus']
+drivers = ['dpaa', 'fslmc', 'ifpga', 'mdev', 'pci', 'vdev', 'vmbus']
 std_deps = ['eal']
 config_flag_fmt = 'RTE_LIBRTE_@0@_BUS'
 driver_name_fmt = 'rte_bus_@0@'
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 262132fc6..f8abe8237 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -123,6 +123,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_FSLMC_BUS),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_COMMON_DPAAX)   += -lrte_common_dpaax
 endif
 
+_LDLIBS-$(CONFIG_RTE_LIBRTE_MDEV_BUS)       += -lrte_bus_mdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PCI_BUS)        += -lrte_bus_pci
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_BUS)       += -lrte_bus_vdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA_BUS)       += -lrte_bus_dpaa
-- 
2.17.1
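
The driver-facing pieces of this patch (rte_mdev_match() and the
probe callback in struct rte_mdev_driver) follow a three-way return
convention when the bus walks its driver list. Below is a minimal,
self-contained sketch of that convention. It does not use the DPDK
headers: every sk_* name is a hypothetical stand-in introduced for
illustration, and only the control flow is meant to mirror the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-ins for the RFC types, reduced to the fields used here. */
enum sk_device_api { SK_API_VFIO_PCI = 0, SK_API_OTHER };

struct sk_device;

struct sk_driver {
	TAILQ_ENTRY(sk_driver) next;
	enum sk_device_api dev_api;
	int (*probe)(struct sk_driver *, struct sk_device *);
};

struct sk_device {
	enum sk_device_api dev_api;
	struct sk_driver *driver; /* NULL until a probe succeeds */
};

TAILQ_HEAD(sk_driver_list, sk_driver);

/* 1 on match, 0 otherwise: the same convention as rte_mdev_match(). */
static int
sk_match(const struct sk_driver *drv, const struct sk_device *dev)
{
	return drv->dev_api == dev->dev_api;
}

/*
 * Returns 0 when a driver claimed the device (or one was already bound),
 * 1 when no registered driver supports it, -1 when a probe failed.
 */
static int
sk_probe_all_drivers(struct sk_driver_list *drivers, struct sk_device *dev)
{
	struct sk_driver *drv;
	int rc;

	if (dev->driver != NULL)
		return 0;

	TAILQ_FOREACH(drv, drivers, next) {
		if (!sk_match(drv, dev))
			continue; /* device API differs, try the next driver */
		assert(drv->probe != NULL);
		dev->driver = drv;
		rc = drv->probe(drv, dev);
		if (rc == 0)
			return 0;
		dev->driver = NULL; /* undo the binding on any failure */
		if (rc < 0)
			return -1;
	}
	return 1;
}

/* Two trivial probe callbacks for trying the sketch out. */
static int
sk_probe_ok(struct sk_driver *drv, struct sk_device *dev)
{
	(void)drv;
	(void)dev;
	return 0;
}

static int
sk_probe_fail(struct sk_driver *drv, struct sk_device *dev)
{
	(void)drv;
	(void)dev;
	return -1;
}
```

A caller would fill a struct sk_driver_list with TAILQ_INSERT_TAIL()
and treat a return of 1 as "skipped", not as an error, which is how
rte_mdev_probe() in the patch only counts negative results as failures.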

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [dpdk-dev] [RFC 2/3] bus/mdev: add mdev bus support
  2019-04-03  7:18 ` [dpdk-dev] [RFC 2/3] bus/mdev: add mdev bus support Tiwei Bie
@ 2019-04-03  7:18   ` Tiwei Bie
  0 siblings, 0 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-04-03  7:18 UTC (permalink / raw)
  To: dev; +Cc: cunming.liang, bruce.richardson, alejandro.lucero

This patch adds mdev (Mediated device) bus support to DPDK.
The bus driver scans all the mdev devices in the system and
probes them based on their device API (mdev_type/device_api).
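
As a rough illustration of the "device API" classification described
above: the sysfs attribute mdev_type/device_api holds a short string
followed by a newline, which the scan code strips and compares against
"vfio-pci". The sketch below is self-contained and does not read sysfs.
The da_* names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum; the patch itself only defines the vfio-pci value. */
enum da_device_api { DA_API_VFIO_PCI = 0, DA_API_UNSUPPORTED };

/*
 * Strip the trailing newline in place, then map the string to an enum.
 * buf must be a writable, NUL-terminated buffer.
 */
static enum da_device_api
da_parse_device_api(char *buf)
{
	char *nl;

	assert(buf != NULL);
	nl = strchr(buf, '\n');
	if (nl != NULL)
		*nl = '\0';

	if (strcmp(buf, "vfio-pci") == 0)
		return DA_API_VFIO_PCI;

	return DA_API_UNSUPPORTED;
}
```

Returning an explicit "unsupported" value lets the caller decide what
to do with devices it cannot handle, for example release them instead
of adding them to the bus.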

Signed-off-by: Cunming Liang <cunming.liang@intel.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 config/common_base                        |   5 +
 config/common_linux                       |   1 +
 drivers/bus/Makefile                      |   1 +
 drivers/bus/mdev/Makefile                 |  41 +++
 drivers/bus/mdev/linux/Makefile           |   6 +
 drivers/bus/mdev/linux/mdev.c             | 117 ++++++++
 drivers/bus/mdev/mdev.c                   | 310 ++++++++++++++++++++++
 drivers/bus/mdev/meson.build              |  15 ++
 drivers/bus/mdev/private.h                |  90 +++++++
 drivers/bus/mdev/rte_bus_mdev.h           | 141 ++++++++++
 drivers/bus/mdev/rte_bus_mdev_version.map |  12 +
 drivers/bus/meson.build                   |   2 +-
 mk/rte.app.mk                             |   1 +
 13 files changed, 741 insertions(+), 1 deletion(-)
 create mode 100644 drivers/bus/mdev/Makefile
 create mode 100644 drivers/bus/mdev/linux/Makefile
 create mode 100644 drivers/bus/mdev/linux/mdev.c
 create mode 100644 drivers/bus/mdev/mdev.c
 create mode 100644 drivers/bus/mdev/meson.build
 create mode 100644 drivers/bus/mdev/private.h
 create mode 100644 drivers/bus/mdev/rte_bus_mdev.h
 create mode 100644 drivers/bus/mdev/rte_bus_mdev_version.map

diff --git a/config/common_base b/config/common_base
index 6292bc4af..d29e9a089 100644
--- a/config/common_base
+++ b/config/common_base
@@ -168,6 +168,11 @@ CONFIG_RTE_LIBRTE_COMMON_DPAAX=n
 #
 CONFIG_RTE_LIBRTE_IFPGA_BUS=y
 
+#
+# Compile the mdev bus
+#
+CONFIG_RTE_LIBRTE_MDEV_BUS=n
+
 #
 # Compile PCI bus driver
 #
diff --git a/config/common_linux b/config/common_linux
index 75334273d..7de9624c0 100644
--- a/config/common_linux
+++ b/config/common_linux
@@ -25,6 +25,7 @@ CONFIG_RTE_LIBRTE_AVP_PMD=y
 CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=y
 CONFIG_RTE_LIBRTE_NFP_PMD=y
 CONFIG_RTE_LIBRTE_POWER=y
+CONFIG_RTE_LIBRTE_MDEV_BUS=y
 CONFIG_RTE_VIRTIO_USER=y
 CONFIG_RTE_PROC_INFO=y
 
diff --git a/drivers/bus/Makefile b/drivers/bus/Makefile
index cea3b55e6..b2144ee63 100644
--- a/drivers/bus/Makefile
+++ b/drivers/bus/Makefile
@@ -8,6 +8,7 @@ ifeq ($(CONFIG_RTE_EAL_VFIO),y)
 DIRS-$(CONFIG_RTE_LIBRTE_FSLMC_BUS) += fslmc
 endif
 DIRS-$(CONFIG_RTE_LIBRTE_IFPGA_BUS) += ifpga
+DIRS-$(CONFIG_RTE_LIBRTE_MDEV_BUS) += mdev
 DIRS-$(CONFIG_RTE_LIBRTE_PCI_BUS) += pci
 DIRS-$(CONFIG_RTE_LIBRTE_VDEV_BUS) += vdev
 DIRS-$(CONFIG_RTE_LIBRTE_VMBUS) += vmbus
diff --git a/drivers/bus/mdev/Makefile b/drivers/bus/mdev/Makefile
new file mode 100644
index 000000000..b2faee395
--- /dev/null
+++ b/drivers/bus/mdev/Makefile
@@ -0,0 +1,41 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_bus_mdev.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += -I$(SRCDIR)
+
+# versioning export map
+EXPORT_MAP := rte_bus_mdev_version.map
+
+# library version
+LIBABIVER := 1
+
+ifneq ($(CONFIG_RTE_EXEC_ENV_LINUX),)
+SYSTEM := linux
+endif
+ifneq ($(CONFIG_RTE_EXEC_ENV_FREEBSD),)
+$(error "Mdev bus not implemented for BSD yet")
+endif
+
+CFLAGS += -I$(RTE_SDK)/drivers/bus/mdev/$(SYSTEM)
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/$(SYSTEM)/eal
+
+LDLIBS += -lrte_eal
+
+include $(RTE_SDK)/drivers/bus/mdev/$(SYSTEM)/Makefile
+SRCS-$(CONFIG_RTE_LIBRTE_MDEV_BUS) := $(addprefix $(SYSTEM)/,$(SRCS))
+SRCS-$(CONFIG_RTE_LIBRTE_MDEV_BUS) += mdev.c
+
+SYMLINK-$(CONFIG_RTE_LIBRTE_MDEV_BUS)-include += rte_bus_mdev.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/bus/mdev/linux/Makefile b/drivers/bus/mdev/linux/Makefile
new file mode 100644
index 000000000..a777ad3d4
--- /dev/null
+++ b/drivers/bus/mdev/linux/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+SRCS += mdev.c
+
+CFLAGS += -D_GNU_SOURCE
diff --git a/drivers/bus/mdev/linux/mdev.c b/drivers/bus/mdev/linux/mdev.c
new file mode 100644
index 000000000..ecfe0eba6
--- /dev/null
+++ b/drivers/bus/mdev/linux/mdev.c
@@ -0,0 +1,117 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <string.h>
+#include <dirent.h>
+
+#include <rte_log.h>
+#include <rte_bus_mdev.h>
+
+#include "eal_filesystem.h"
+
+#include "private.h"
+
+static int
+mdev_scan_one(const char *dirname, const rte_uuid_t addr)
+{
+	struct rte_mdev_device *mdev;
+	char device_api[PATH_MAX];
+	char filename[PATH_MAX];
+	char *ptr;
+
+	mdev = malloc(sizeof(*mdev));
+	if (mdev == NULL)
+		return -1;
+
+	memset(mdev, 0, sizeof(*mdev));
+	mdev->device.bus = &rte_mdev_bus.bus;
+	rte_uuid_copy(mdev->addr, addr);
+
+	/* get device_api */
+	snprintf(filename, sizeof(filename), "%s/mdev_type/device_api",
+		 dirname);
+	if (rte_eal_parse_sysfs_str(filename, device_api,
+				    sizeof(device_api)) < 0) {
+		free(mdev);
+		return -1;
+	}
+
+	ptr = strchr(device_api, '\n');
+	if (ptr != NULL)
+		*ptr = '\0';
+
+	mdev_name_set(mdev);
+
+	if (strcmp(device_api, "vfio-pci") == 0) {
+		/* device api */
+		mdev->dev_api = RTE_MDEV_DEV_API_VFIO_PCI;
+
+		if (TAILQ_EMPTY(&rte_mdev_bus.device_list))
+			rte_mdev_add_device(mdev);
+		else {
+			struct rte_mdev_device *dev;
+			int ret;
+
+			TAILQ_FOREACH(dev, &rte_mdev_bus.device_list, next) {
+				ret = rte_uuid_compare(mdev->addr, dev->addr);
+				if (ret > 0)
+					continue;
+
+				if (ret < 0)
+					rte_mdev_insert_device(dev, mdev);
+				else /* already registered */
+					free(mdev);
+
+				return 0;
+			}
+
+			rte_mdev_add_device(mdev);
+		}
+	} else {
+		RTE_LOG(DEBUG, EAL, "%s(): mdev device_api %s is not supported\n",
+			__func__, device_api);
+		free(mdev); /* never linked into the device list */
+	}
+	return 0;
+}
+
+/*
+ * Scan the content of the mdev bus, and add the devices to the devices
+ * list
+ */
+int
+rte_mdev_scan(void)
+{
+	struct dirent *e;
+	DIR *dir;
+	char dirname[PATH_MAX];
+	rte_uuid_t addr;
+
+	dir = opendir(rte_mdev_get_sysfs_path());
+	if (dir == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): opendir failed: %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+
+	while ((e = readdir(dir)) != NULL) {
+		if (e->d_name[0] == '.')
+			continue;
+
+		if (rte_uuid_parse(e->d_name, addr) != 0)
+			continue;
+
+		snprintf(dirname, sizeof(dirname), "%s/%s",
+			 rte_mdev_get_sysfs_path(), e->d_name);
+
+		if (mdev_scan_one(dirname, addr) < 0)
+			goto error;
+	}
+	closedir(dir);
+	return 0;
+
+error:
+	closedir(dir);
+	return -1;
+}
diff --git a/drivers/bus/mdev/mdev.c b/drivers/bus/mdev/mdev.c
new file mode 100644
index 000000000..2f9209cca
--- /dev/null
+++ b/drivers/bus/mdev/mdev.c
@@ -0,0 +1,310 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <string.h>
+#include <inttypes.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/queue.h>
+#include <sys/mman.h>
+
+#include <rte_errno.h>
+#include <rte_interrupts.h>
+#include <rte_log.h>
+#include <rte_per_lcore.h>
+#include <rte_memory.h>
+#include <rte_eal.h>
+#include <rte_common.h>
+#include <rte_devargs.h>
+#include <rte_uuid.h>
+#include <rte_bus_mdev.h>
+
+#include "private.h"
+
+#define SYSFS_MDEV_DEVICES "/sys/bus/mdev/devices"
+
+const char *rte_mdev_get_sysfs_path(void)
+{
+	const char *path = NULL;
+
+	path = getenv("SYSFS_MDEV_DEVICES");
+	if (path == NULL)
+		return SYSFS_MDEV_DEVICES;
+
+	return path;
+}
+
+static void
+rte_mdev_device_name(const rte_uuid_t addr, char *output, size_t size)
+{
+	RTE_VERIFY(size >= RTE_UUID_STRLEN);
+	rte_uuid_unparse(addr, output, size);
+}
+
+static struct rte_devargs *
+mdev_devargs_lookup(struct rte_mdev_device *dev)
+{
+	struct rte_devargs *devargs;
+	rte_uuid_t addr;
+
+	RTE_EAL_DEVARGS_FOREACH("mdev", devargs) {
+		devargs->bus->parse(devargs->name, addr);
+		if (!rte_uuid_compare(dev->addr, addr))
+			return devargs;
+	}
+	return NULL;
+}
+
+void
+mdev_name_set(struct rte_mdev_device *dev)
+{
+	struct rte_devargs *devargs;
+
+	/* Each device has its internal, canonical name set. */
+	rte_mdev_device_name(dev->addr, dev->name, sizeof(dev->name));
+	devargs = mdev_devargs_lookup(dev);
+	dev->device.devargs = devargs;
+	/* In blacklist mode, if the device is not blacklisted, no
+	 * rte_devargs exists for it.
+	 */
+	if (devargs != NULL)
+		/* If an rte_devargs exists, the generic rte_device uses the
+		 * given name as its name.
+		 */
+		dev->device.name = dev->device.devargs->name;
+	else
+		/* Otherwise, it uses the internal, canonical form. */
+		dev->device.name = dev->name;
+}
+
+void
+rte_mdev_register(struct rte_mdev_driver *driver)
+{
+	TAILQ_INSERT_TAIL(&rte_mdev_bus.driver_list, driver, next);
+	driver->bus = &rte_mdev_bus;
+}
+
+void
+rte_mdev_unregister(struct rte_mdev_driver *driver)
+{
+	TAILQ_REMOVE(&rte_mdev_bus.driver_list, driver, next);
+	driver->bus = NULL;
+}
+
+void
+rte_mdev_add_device(struct rte_mdev_device *mdev)
+{
+	TAILQ_INSERT_TAIL(&rte_mdev_bus.device_list, mdev, next);
+}
+
+void
+rte_mdev_insert_device(struct rte_mdev_device *exist_mdev,
+		       struct rte_mdev_device *new_mdev)
+{
+	TAILQ_INSERT_BEFORE(exist_mdev, new_mdev, next);
+}
+
+void
+rte_mdev_remove_device(struct rte_mdev_device *mdev)
+{
+	TAILQ_REMOVE(&rte_mdev_bus.device_list, mdev, next);
+}
+
+static struct rte_device *
+mdev_find_device(const struct rte_device *start, rte_dev_cmp_t cmp,
+		 const void *data)
+{
+	const struct rte_mdev_device *pstart;
+	struct rte_mdev_device *pdev;
+
+	if (start != NULL) {
+		pstart = RTE_DEV_TO_MDEV_CONST(start);
+		pdev = TAILQ_NEXT(pstart, next);
+	} else {
+		pdev = TAILQ_FIRST(&rte_mdev_bus.device_list);
+	}
+	while (pdev != NULL) {
+		if (cmp(&pdev->device, data) == 0)
+			return &pdev->device;
+		pdev = TAILQ_NEXT(pdev, next);
+	}
+	return NULL;
+}
+
+int
+rte_mdev_match(const struct rte_mdev_driver *mdev_drv,
+	       const struct rte_mdev_device *mdev_dev)
+{
+	if (mdev_drv->dev_api == mdev_dev->dev_api)
+		return 1;
+
+	return 0;
+}
+
+static int
+rte_mdev_probe_one_driver(struct rte_mdev_driver *dr,
+			  struct rte_mdev_device *dev)
+{
+	int ret;
+
+	if (dr == NULL || dev == NULL)
+		return -EINVAL;
+
+	/* no initialization when blacklisted, return without error */
+	if (dev->device.devargs != NULL &&
+	    dev->device.devargs->policy == RTE_DEV_BLACKLISTED) {
+		RTE_LOG(INFO, EAL, "Device is blacklisted, not initializing\n");
+		return 1;
+	}
+
+	/* The device is not blacklisted; check if the driver supports it */
+	if (!rte_mdev_match(dr, dev)) {
+		/* Match of device and driver failed */
+		return 1;
+	}
+
+	/* reference driver structure */
+	dev->driver = dr;
+
+	/* call the driver probe() function */
+	ret = dr->probe(dr, dev);
+	if (ret != 0)
+		dev->driver = NULL;
+
+	return ret;
+}
+
+static int
+mdev_probe_all_drivers(struct rte_mdev_device *dev)
+{
+	struct rte_mdev_driver *dr = NULL;
+	int rc = 0;
+
+	if (dev == NULL)
+		return -1;
+
+	/* Check if a driver is already loaded */
+	if (dev->driver != NULL)
+		return 0;
+
+	FOREACH_DRIVER_ON_MDEV_BUS(dr) {
+		rc = rte_mdev_probe_one_driver(dr, dev);
+		if (rc < 0)
+			/* negative value is an error */
+			return -1;
+		if (rc > 0)
+			/* positive value means driver doesn't support it */
+			continue;
+		return 0;
+	}
+	return 1;
+}
+
+int
+rte_mdev_probe(void)
+{
+	struct rte_mdev_device *mdev = NULL;
+	size_t probed = 0, failed = 0;
+	struct rte_devargs *devargs;
+	int probe_all = 0;
+	int ret = 0;
+
+	if (rte_mdev_bus.bus.conf.scan_mode != RTE_BUS_SCAN_WHITELIST)
+		probe_all = 1;
+
+	FOREACH_DEVICE_ON_MDEV_BUS(mdev) {
+		probed++;
+
+		devargs = mdev->device.devargs;
+		/* probe all or only whitelisted devices */
+		if (probe_all)
+			ret = mdev_probe_all_drivers(mdev);
+		else if (devargs != NULL &&
+			devargs->policy == RTE_DEV_WHITELISTED)
+			ret = mdev_probe_all_drivers(mdev);
+		if (ret < 0) {
+			char name[RTE_UUID_STRLEN];
+			rte_uuid_unparse(mdev->addr, name, sizeof(name));
+			RTE_LOG(ERR, EAL, "Requested device %s cannot be used\n",
+				name);
+			rte_errno = errno;
+			failed++;
+			ret = 0;
+		}
+	}
+
+	return (probed && probed == failed) ? -1 : 0;
+}
+
+static int
+mdev_plug(struct rte_device *dev)
+{
+	return mdev_probe_all_drivers(RTE_DEV_TO_MDEV(dev));
+}
+
+static int
+rte_mdev_detach_dev(struct rte_mdev_device *dev)
+{
+	struct rte_mdev_driver *dr;
+	int ret = 0;
+
+	if (dev == NULL)
+		return -EINVAL;
+
+	dr = dev->driver;
+
+	if (dr != NULL && dr->remove != NULL) {
+		ret = dr->remove(dev);
+		if (ret != 0)
+			return ret;
+	}
+
+	/* clear driver structure */
+	dev->driver = NULL;
+
+	return 0;
+}
+
+static int
+mdev_unplug(struct rte_device *dev)
+{
+	struct rte_mdev_device *pmdev;
+	int ret;
+
+	pmdev = RTE_DEV_TO_MDEV(dev);
+	ret = rte_mdev_detach_dev(pmdev);
+	if (ret == 0) {
+		rte_mdev_remove_device(pmdev);
+		free(pmdev);
+	}
+	return ret;
+}
+
+static int
+mdev_parse(const char *name, void *addr)
+{
+	rte_uuid_t uuid;
+	int parse;
+
+	parse = (rte_uuid_parse(name, uuid) == 0);
+	if (parse && addr != NULL)
+		rte_uuid_copy(addr, uuid);
+	return parse == false;
+}
+
+struct rte_mdev_bus rte_mdev_bus = {
+	.bus = {
+		.scan = rte_mdev_scan,
+		.probe = rte_mdev_probe,
+		.find_device = mdev_find_device,
+		.plug = mdev_plug,
+		.unplug = mdev_unplug,
+		.parse = mdev_parse,
+	},
+	.device_list = TAILQ_HEAD_INITIALIZER(rte_mdev_bus.device_list),
+	.driver_list = TAILQ_HEAD_INITIALIZER(rte_mdev_bus.driver_list),
+};
+
+RTE_REGISTER_BUS(mdev, rte_mdev_bus.bus);
diff --git a/drivers/bus/mdev/meson.build b/drivers/bus/mdev/meson.build
new file mode 100644
index 000000000..33c701cb9
--- /dev/null
+++ b/drivers/bus/mdev/meson.build
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+version = 1
+allow_experimental_apis = true
+install_headers('rte_bus_mdev.h')
+sources = files('mdev.c')
+
+if host_machine.system() == 'linux'
+	sources += files('linux/mdev.c')
+	includes += include_directories('linux')
+	cflags += ['-D_GNU_SOURCE']
+else
+	build = false
+endif
diff --git a/drivers/bus/mdev/private.h b/drivers/bus/mdev/private.h
new file mode 100644
index 000000000..81cfe3045
--- /dev/null
+++ b/drivers/bus/mdev/private.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _MDEV_PRIVATE_H_
+#define _MDEV_PRIVATE_H_
+
+#include <stdbool.h>
+#include <stdio.h>
+#include <rte_bus_mdev.h>
+
+struct rte_mdev_driver;
+struct rte_mdev_device;
+
+extern struct rte_mdev_bus rte_mdev_bus;
+
+/**
+ * Probe the mdev bus.
+ *
+ * @return
+ *   - 0 on success.
+ *   - !0 on error.
+ */
+int rte_mdev_probe(void);
+
+/**
+ * Scan the content of the mdev bus, and add the devices to the devices
+ * list.
+ *
+ * @return
+ *  0 on success, negative on error
+ */
+int rte_mdev_scan(void);
+
+/**
+ * Set the name of a mdev device.
+ */
+void mdev_name_set(struct rte_mdev_device *dev);
+
+/**
+ * Add a mdev device to the mdev bus (append to mdev device list). This function
+ * also updates the bus references of the mdev device (and the generic device
+ * object embedded within).
+ *
+ * @param mdev
+ *	mdev device to add
+ * @return void
+ */
+void rte_mdev_add_device(struct rte_mdev_device *mdev);
+
+/**
+ * Insert a mdev device in the mdev bus at a particular location in the device
+ * list. It also updates the mdev bus reference of the new device being
+ * inserted.
+ *
+ * @param exist_mdev
+ *	existing mdev device in mdev bus
+ * @param new_mdev
+ *	mdev device to be added before exist_mdev
+ * @return void
+ */
+void rte_mdev_insert_device(struct rte_mdev_device *exist_mdev,
+			    struct rte_mdev_device *new_mdev);
+
+/**
+ * Remove a mdev device from the mdev bus. This sets to NULL the bus references
+ * in the mdev device object as well as the generic device object.
+ *
+ * @param mdev_device
+ *	mdev device to be removed from mdev bus
+ * @return void
+ */
+void rte_mdev_remove_device(struct rte_mdev_device *mdev_device);
+
+/**
+ * Match the mdev driver and device using mdev device_api.
+ *
+ * @param mdev_drv
+ *      mdev driver from which device_api would be extracted
+ * @param mdev_dev
+ *      mdev device to match against the driver
+ * @return
+ *      1 for successful match
+ *      0 for unsuccessful match
+ */
+int
+rte_mdev_match(const struct rte_mdev_driver *mdev_drv,
+	       const struct rte_mdev_device *mdev_dev);
+
+#endif /* _MDEV_PRIVATE_H_ */
diff --git a/drivers/bus/mdev/rte_bus_mdev.h b/drivers/bus/mdev/rte_bus_mdev.h
new file mode 100644
index 000000000..913521ace
--- /dev/null
+++ b/drivers/bus/mdev/rte_bus_mdev.h
@@ -0,0 +1,141 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _RTE_BUS_MDEV_H_
+#define _RTE_BUS_MDEV_H_
+
+/**
+ * @file
+ *
+ * RTE Mdev Bus Interface
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <limits.h>
+#include <errno.h>
+#include <sys/queue.h>
+#include <stdint.h>
+#include <inttypes.h>
+
+#include <rte_debug.h>
+#include <rte_interrupts.h>
+#include <rte_dev.h>
+#include <rte_uuid.h>
+#include <rte_bus.h>
+
+struct rte_devargs;
+
+enum rte_mdev_device_api {
+	RTE_MDEV_DEV_API_VFIO_PCI = 0,
+	RTE_MDEV_DEV_API_MAX,
+};
+
+struct rte_mdev_bus;
+struct rte_mdev_driver;
+struct rte_mdev_device;
+
+/** Pathname of mdev devices directory. */
+const char * __rte_experimental rte_mdev_get_sysfs_path(void);
+
+/**
+ * Register a mdev driver.
+ *
+ * @param driver
+ *   A pointer to a rte_mdev_driver structure describing the driver
+ *   to be registered.
+ */
+void __rte_experimental rte_mdev_register(struct rte_mdev_driver *driver);
+
+#define RTE_MDEV_REGISTER_DRIVER(nm, mdev_drv) \
+RTE_INIT(mdevinitfn_ ##nm) \
+{ \
+	(mdev_drv).driver.name = RTE_STR(nm); \
+	rte_mdev_register(&mdev_drv); \
+} \
+RTE_PMD_EXPORT_NAME(nm, __COUNTER__)
+
+/**
+ * Unregister a mdev driver.
+ *
+ * @param driver
+ *   A pointer to a rte_mdev_driver structure describing the driver
+ *   to be unregistered.
+ */
+void __rte_experimental rte_mdev_unregister(struct rte_mdev_driver *driver);
+
+/**
+ * Initialisation function for the driver called during mdev probing.
+ */
+typedef int (mdev_probe_t)(struct rte_mdev_driver *, struct rte_mdev_device *);
+
+/**
+ * Uninitialisation function for the driver called during hotplugging.
+ */
+typedef int (mdev_remove_t)(struct rte_mdev_device *);
+
+/**
+ * A structure describing a mdev driver.
+ */
+struct rte_mdev_driver {
+	TAILQ_ENTRY(rte_mdev_driver) next; /**< Next in list. */
+	struct rte_driver driver;          /**< Inherit core driver. */
+	struct rte_mdev_bus *bus;          /**< Mdev bus reference. */
+	mdev_probe_t *probe;               /**< Device probe function. */
+	mdev_remove_t *remove;             /**< Device remove function. */
+	enum rte_mdev_device_api dev_api;  /**< Device API. */
+};
+
+/**
+ * A structure describing a mdev device.
+ */
+struct rte_mdev_device {
+	TAILQ_ENTRY(rte_mdev_device) next; /**< Next mdev device. */
+	struct rte_device device;	   /**< Inherit core device. */
+	enum rte_mdev_device_api dev_api;  /**< Device API. */
+	struct rte_mdev_driver *driver;    /**< Associated driver. */
+	rte_uuid_t addr;                   /**< Location. */
+	char name[RTE_UUID_STRLEN];        /**< Location (ASCII). */
+	void *private;                     /**< Driver-specific data. */
+};
+
+/**
+ * @internal
+ * Helper macro for drivers that need to convert to struct rte_mdev_device.
+ */
+#define RTE_DEV_TO_MDEV(ptr) container_of(ptr, struct rte_mdev_device, device)
+
+#define RTE_DEV_TO_MDEV_CONST(ptr) \
+	container_of(ptr, const struct rte_mdev_device, device)
+
+/** List of mdev devices */
+TAILQ_HEAD(rte_mdev_device_list, rte_mdev_device);
+/** List of mdev drivers */
+TAILQ_HEAD(rte_mdev_driver_list, rte_mdev_driver);
+
+/**
+ * Structure describing the mdev bus
+ */
+struct rte_mdev_bus {
+	struct rte_bus bus;                /**< Inherit the generic class */
+	struct rte_mdev_device_list device_list;  /**< List of mdev devices */
+	struct rte_mdev_driver_list driver_list;  /**< List of mdev drivers */
+};
+
+/* Mdev Bus iterators */
+#define FOREACH_DEVICE_ON_MDEV_BUS(p)	\
+		TAILQ_FOREACH(p, &(rte_mdev_bus.device_list), next)
+
+#define FOREACH_DRIVER_ON_MDEV_BUS(p)	\
+		TAILQ_FOREACH(p, &(rte_mdev_bus.driver_list), next)
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_BUS_MDEV_H_ */
diff --git a/drivers/bus/mdev/rte_bus_mdev_version.map b/drivers/bus/mdev/rte_bus_mdev_version.map
new file mode 100644
index 000000000..7f73bf96b
--- /dev/null
+++ b/drivers/bus/mdev/rte_bus_mdev_version.map
@@ -0,0 +1,12 @@
+DPDK_19.05 {
+
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	rte_mdev_get_sysfs_path;
+	rte_mdev_register;
+	rte_mdev_unregister;
+};
diff --git a/drivers/bus/meson.build b/drivers/bus/meson.build
index 80de2d91d..f0ab19a03 100644
--- a/drivers/bus/meson.build
+++ b/drivers/bus/meson.build
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
-drivers = ['dpaa', 'fslmc', 'ifpga', 'pci', 'vdev', 'vmbus']
+drivers = ['dpaa', 'fslmc', 'ifpga', 'mdev', 'pci', 'vdev', 'vmbus']
 std_deps = ['eal']
 config_flag_fmt = 'RTE_LIBRTE_@0@_BUS'
 driver_name_fmt = 'rte_bus_@0@'
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 262132fc6..f8abe8237 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -123,6 +123,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_FSLMC_BUS),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_COMMON_DPAAX)   += -lrte_common_dpaax
 endif
 
+_LDLIBS-$(CONFIG_RTE_LIBRTE_MDEV_BUS)       += -lrte_bus_mdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PCI_BUS)        += -lrte_bus_pci
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_BUS)       += -lrte_bus_vdev
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA_BUS)       += -lrte_bus_dpaa
-- 
2.17.1
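
The scan path in drivers/bus/mdev/linux/mdev.c keeps the device list
ordered by UUID and drops devices that are already present. A minimal
sketch of that sorted insert is shown below. It assumes a plain 16-byte
UUID compared with memcmp() as a stand-in for rte_uuid_compare(), and
all ul_* names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/queue.h>

struct ul_dev {
	TAILQ_ENTRY(ul_dev) next;
	unsigned char uuid[16];
};

TAILQ_HEAD(ul_dev_list, ul_dev);

/*
 * Insert dev in ascending UUID order.
 * Returns 1 if inserted, 0 if an equal UUID was already listed (dev is
 * freed in that case), -1 if dev is NULL.
 */
static int
ul_insert_sorted(struct ul_dev_list *list, struct ul_dev *dev)
{
	struct ul_dev *cur;
	int cmp;

	if (dev == NULL)
		return -1;

	TAILQ_FOREACH(cur, list, next) {
		cmp = memcmp(dev->uuid, cur->uuid, sizeof(dev->uuid));
		if (cmp > 0)
			continue; /* keep walking until a larger entry */
		if (cmp < 0) {
			TAILQ_INSERT_BEFORE(cur, dev, next);
			return 1;
		}
		free(dev); /* already registered */
		return 0;
	}

	/* Empty list, or dev sorts after every existing entry. */
	TAILQ_INSERT_TAIL(list, dev, next);
	return 1;
}

/* Allocate a zeroed device whose UUID differs only in its first byte. */
static struct ul_dev *
ul_dev_new(unsigned char first_byte)
{
	struct ul_dev *dev = calloc(1, sizeof(*dev));

	if (dev != NULL)
		dev->uuid[0] = first_byte;
	return dev;
}
```

Because TAILQ_FOREACH() simply does nothing on an empty list, the
sketch needs no separate empty-list branch; the tail insert after the
loop covers that case as well.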


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [dpdk-dev] [RFC 3/3] bus/pci: add mdev support
  2019-04-03  7:18 [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Tiwei Bie
                   ` (2 preceding siblings ...)
  2019-04-03  7:18 ` [dpdk-dev] [RFC 2/3] bus/mdev: add mdev bus support Tiwei Bie
@ 2019-04-03  7:18 ` Tiwei Bie
  2019-04-03  7:18   ` Tiwei Bie
  2019-04-03 14:13   ` Wiles, Keith
  2019-04-08  8:44 ` [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Alejandro Lucero
  2019-07-15  7:52 ` [dpdk-dev] [RFC v2 0/5] " Tiwei Bie
  5 siblings, 2 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-04-03  7:18 UTC (permalink / raw)
  To: dev; +Cc: cunming.liang, bruce.richardson, alejandro.lucero

This patch adds mdev support to the PCI bus driver. An mdev
driver is introduced to probe the mdev devices on the mdev bus
whose device API is "vfio-pci".

PS. There are some hacks in this patch for now.
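
One of the changes in pci_vfio.c is that the string handed to
rte_vfio_setup_device() becomes either the mdev UUID or the usual PCI
address, depending on a use_uuid flag. The sketch below shows that
selection in isolation. It assumes a simplified device struct (af_dev
and af_format_addr are hypothetical names), formats the UUID by hand
in the common 8-4-4-4-12 lowercase form rather than calling
rte_uuid_unparse(), and leaves out the sysfs base path selection.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, reduced view of a PCI device that may be backed by mdev. */
struct af_dev {
	int use_uuid;           /* non-zero: address the device by UUID */
	unsigned char uuid[16];
	uint32_t domain;
	uint8_t bus;
	uint8_t devid;
	uint8_t function;
};

/* Write the device address string into out; returns snprintf()'s result. */
static int
af_format_addr(const struct af_dev *dev, char *out, size_t len)
{
	const unsigned char *u = dev->uuid;

	assert(out != NULL && len > 0);

	if (dev->use_uuid)
		return snprintf(out, len,
			"%02x%02x%02x%02x-%02x%02x-%02x%02x-"
			"%02x%02x-%02x%02x%02x%02x%02x%02x",
			u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
			u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);

	/* domain:bus:device.function, as in a PCI sysfs entry name */
	return snprintf(out, len, "%04x:%02x:%02x.%x",
			(unsigned int)dev->domain, (unsigned int)dev->bus,
			(unsigned int)dev->devid, (unsigned int)dev->function);
}
```

Keeping the choice in one helper would avoid repeating the same
if/else in both the primary and the secondary mapping paths.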

Signed-off-by: Cunming Liang <cunming.liang@intel.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/bus/pci/Makefile              |   3 +
 drivers/bus/pci/linux/Makefile        |   4 +
 drivers/bus/pci/linux/pci_vfio.c      |  35 ++-
 drivers/bus/pci/linux/pci_vfio_mdev.c | 305 ++++++++++++++++++++++++++
 drivers/bus/pci/meson.build           |   4 +-
 drivers/bus/pci/pci_common.c          |  17 +-
 drivers/bus/pci/private.h             |   9 +
 drivers/bus/pci/rte_bus_pci.h         |  11 +-
 8 files changed, 370 insertions(+), 18 deletions(-)
 create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c

diff --git a/drivers/bus/pci/Makefile b/drivers/bus/pci/Makefile
index de53ce1bf..085ec9066 100644
--- a/drivers/bus/pci/Makefile
+++ b/drivers/bus/pci/Makefile
@@ -27,6 +27,9 @@ CFLAGS += -DALLOW_EXPERIMENTAL_API
 
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_pci -lrte_kvargs
+ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
+LDLIBS += -lrte_bus_mdev
+endif
 
 include $(RTE_SDK)/drivers/bus/pci/$(SYSTEM)/Makefile
 SRCS-$(CONFIG_RTE_LIBRTE_PCI_BUS) := $(addprefix $(SYSTEM)/,$(SRCS))
diff --git a/drivers/bus/pci/linux/Makefile b/drivers/bus/pci/linux/Makefile
index 90404468b..88bbc2390 100644
--- a/drivers/bus/pci/linux/Makefile
+++ b/drivers/bus/pci/linux/Makefile
@@ -4,3 +4,7 @@
 SRCS += pci.c
 SRCS += pci_uio.c
 SRCS += pci_vfio.c
+
+ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
+	SRCS += pci_vfio_mdev.c
+endif
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index ebf6ccd3c..c2c4c6a50 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -13,6 +13,9 @@
 
 #include <rte_log.h>
 #include <rte_pci.h>
+#ifdef RTE_LIBRTE_MDEV_BUS
+#include <rte_bus_mdev.h>
+#endif
 #include <rte_bus_pci.h>
 #include <rte_eal_memconfig.h>
 #include <rte_malloc.h>
@@ -20,6 +23,7 @@
 #include <rte_eal.h>
 #include <rte_bus.h>
 #include <rte_spinlock.h>
+#include <rte_uuid.h>
 
 #include "eal_filesystem.h"
 
@@ -648,6 +652,7 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 {
 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
 	char pci_addr[PATH_MAX] = {0};
+	const char *sysfs_path;
 	int vfio_dev_fd;
 	struct rte_pci_addr *loc = &dev->addr;
 	int i, ret;
@@ -663,10 +668,20 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 #endif
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->use_uuid) {
+#ifdef RTE_LIBRTE_MDEV_BUS
+		sysfs_path = rte_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+#else
+		return -1;
+#endif
+	} else {
+		sysfs_path = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
 
-	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
+	ret = rte_vfio_setup_device(sysfs_path, pci_addr,
 					&vfio_dev_fd, &device_info);
 	if (ret)
 		return ret;
@@ -793,6 +808,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 {
 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
 	char pci_addr[PATH_MAX] = {0};
+	const char *sysfs_path;
 	int vfio_dev_fd;
 	struct rte_pci_addr *loc = &dev->addr;
 	int i, ret;
@@ -808,8 +824,19 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 #endif
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->use_uuid) {
+#ifdef RTE_LIBRTE_MDEV_BUS
+		sysfs_path = rte_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+#else
+		return -1;
+#endif
+	} else {
+		sysfs_path = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
+
 
 	/* if we're in a secondary process, just find our tailq entry */
 	TAILQ_FOREACH(vfio_res, vfio_res_list, next) {
@@ -825,7 +852,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 		return -1;
 	}
 
-	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
+	ret = rte_vfio_setup_device(sysfs_path, pci_addr,
 					&vfio_dev_fd, &device_info);
 	if (ret)
 		return ret;
diff --git a/drivers/bus/pci/linux/pci_vfio_mdev.c b/drivers/bus/pci/linux/pci_vfio_mdev.c
new file mode 100644
index 000000000..92498c2fe
--- /dev/null
+++ b/drivers/bus/pci/linux/pci_vfio_mdev.c
@@ -0,0 +1,305 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <string.h>
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/pci_regs.h>
+
+#include <rte_log.h>
+#include <rte_pci.h>
+#include <rte_eal_memconfig.h>
+#include <rte_malloc.h>
+#include <rte_devargs.h>
+#include <rte_memcpy.h>
+#include <rte_vfio.h>
+#include <rte_bus_mdev.h>
+
+#include "eal_private.h"
+#include "eal_filesystem.h"
+
+#include "private.h"
+
+extern struct rte_pci_bus rte_pci_bus;
+
+static int
+get_pci_id(const char *sysfs_base, const char *dev_addr,
+	   struct rte_pci_id *pci_id)
+{
+	int ret = 0;
+	int iommu_group_num;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	int container;
+	int class;
+	char name[PATH_MAX];
+	struct vfio_group_status group_status = {
+		.argsz = sizeof(group_status) };
+
+	container = open("/dev/vfio/vfio", O_RDWR);
+	if (container < 0) {
+		RTE_LOG(WARNING, EAL, "Failed to open VFIO container\n");
+		ret = -1;
+		goto out;
+	}
+
+	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
+		/* Unknown API version */
+		RTE_LOG(WARNING, EAL, "Unknown VFIO API version\n");
+		ret = -1;
+		goto close_container;
+	}
+
+	if (rte_vfio_get_group_num(sysfs_base, dev_addr,
+				   &iommu_group_num) <= 0) {
+		RTE_LOG(WARNING, EAL, "%s not managed by VFIO driver\n",
+			dev_addr);
+		ret = -1;
+		goto close_container;
+	}
+
+	snprintf(name, sizeof(name), "/dev/vfio/%d", iommu_group_num);
+
+	vfio_group_fd = open(name, O_RDWR);
+	if (vfio_group_fd < 0) {
+		ret = -1;
+		goto close_container;
+	}
+
+	/* if group_fd == 0, that means the device isn't managed by VFIO */
+	if (vfio_group_fd == 0) {
+		RTE_LOG(WARNING, EAL, "%s not managed by VFIO driver\n",
+			dev_addr);
+		ret = -1;
+		goto close_group;
+	}
+
+	if (ioctl(vfio_group_fd, VFIO_GROUP_GET_STATUS, &group_status)) {
+		RTE_LOG(ERR, EAL, "%s cannot get group status, error %i (%s)\n",
+			dev_addr, errno, strerror(errno));
+		ret = -1;
+		goto close_group;
+	}
+
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+		RTE_LOG(ERR, EAL, "%s VFIO group is not viable!\n", dev_addr);
+		ret = -1;
+		goto close_group;
+	}
+
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
+		if (ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
+			    &container)) {
+			RTE_LOG(ERR, EAL, "%s cannot add VFIO group to container, error %i (%s)\n",
+				dev_addr, errno, strerror(errno));
+			ret = -1;
+			goto close_group;
+		}
+	}
+
+	if (ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
+		RTE_LOG(ERR, EAL, "%s cannot set iommu, error %i (%s)\n",
+			dev_addr, errno, strerror(errno));
+		ret = -1;
+		goto close_group;
+	}
+
+	vfio_dev_fd = ioctl(vfio_group_fd, VFIO_GROUP_GET_DEVICE_FD, dev_addr);
+	if (vfio_dev_fd < 0) {
+		/* if we cannot get a device fd, this implies a problem with
+		 * the VFIO group or the container not having IOMMU configured.
+		 */
+		RTE_LOG(ERR, EAL, "Getting a vfio_dev_fd for %s failed errno %d\n",
+			dev_addr, errno);
+		ret = -1;
+		goto close_group;
+	}
+
+	/* vendor_id */
+	if (pread64(vfio_dev_fd, &pci_id->vendor_id, sizeof(uint16_t),
+		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
+		      PCI_VENDOR_ID) != sizeof(uint16_t)) {
+		RTE_LOG(ERR, EAL, "Cannot read VendorID from PCI config space\n");
+		ret = -1;
+		goto close_device;
+	}
+
+	/* device_id */
+	if (pread64(vfio_dev_fd, &pci_id->device_id, sizeof(uint16_t),
+		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
+		      PCI_DEVICE_ID) != sizeof(uint16_t)) {
+		RTE_LOG(ERR, EAL, "Cannot read DeviceID from PCI config space\n");
+		ret = -1;
+		goto close_device;
+	}
+
+	/* subsystem_vendor_id */
+	if (pread64(vfio_dev_fd, &pci_id->subsystem_vendor_id, sizeof(uint16_t),
+		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
+		      PCI_SUBSYSTEM_VENDOR_ID) != sizeof(uint16_t)) {
+		RTE_LOG(ERR, EAL, "Cannot read SubVendorID from PCI config space\n");
+		ret = -1;
+		goto close_device;
+	}
+
+	/* subsystem_device_id */
+	if (pread64(vfio_dev_fd, &pci_id->subsystem_device_id, sizeof(uint16_t),
+		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
+		      PCI_SUBSYSTEM_ID) != sizeof(uint16_t)) {
+		RTE_LOG(ERR, EAL, "Cannot read SubDeviceID from PCI config space\n");
+		ret = -1;
+		goto close_device;
+	}
+
+	/* class_id */
+	if (pread64(vfio_dev_fd, &class, sizeof(uint32_t),
+		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
+		      PCI_CLASS_REVISION) != sizeof(uint32_t)) {
+		RTE_LOG(ERR, EAL, "Cannot read ClassID from PCI config space\n");
+		ret = -1;
+		goto close_device;
+	}
+	pci_id->class_id = class >> 8;
+
+close_device:
+	if (close(vfio_dev_fd) < 0) {
+		RTE_LOG(INFO, EAL, "Error when closing VFIO device for %s\n",
+			dev_addr);
+		ret = -1;
+	}
+
+close_group:
+	if (close(vfio_group_fd) < 0) {
+		RTE_LOG(INFO, EAL, "Error when closing VFIO group for %s\n",
+			dev_addr);
+		ret = -1;
+	}
+
+close_container:
+	if (close(container) < 0) {
+		RTE_LOG(INFO, EAL, "Error when closing VFIO container\n");
+		ret = -1;
+	}
+
+out:
+	return ret;
+}
+
+static int vfio_pci_probe(struct rte_mdev_driver *mdev_drv __rte_unused,
+			  struct rte_mdev_device *mdev_dev)
+{
+	char name[RTE_UUID_STRLEN];
+	struct rte_pci_device *dev;
+	struct rte_bus *bus;
+	int ret;
+
+	bus = rte_bus_find_by_name("pci");
+	if (bus == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
+		return -ENOENT;
+	}
+
+	if (bus->plug == NULL) {
+		RTE_LOG(ERR, EAL, "Function plug not supported by bus (%s)\n",
+			bus->name);
+		return -ENOTSUP;
+	}
+
+	dev = malloc(sizeof(*dev));
+	if (dev == NULL)
+		return -ENOMEM;
+
+	memset(dev, 0, sizeof(*dev));
+	dev->device.bus = &rte_pci_bus.bus;
+	rte_uuid_unparse(mdev_dev->addr, name, sizeof(name));
+
+	if (get_pci_id(rte_mdev_get_sysfs_path(), name, &dev->id)) {
+		free(dev);
+		return -1;
+	}
+
+	snprintf(dev->name, sizeof(dev->name), "%s", name);
+	dev->device.name = dev->name;
+	dev->kdrv = RTE_KDRV_VFIO;
+	dev->use_uuid = 1;
+	rte_uuid_copy(dev->uuid, mdev_dev->addr);
+
+	// TODO: dev->device.devargs, etc
+
+	memset(&dev->addr, -1, sizeof(dev->addr)); // XXX: TODO
+
+	/* device is valid, add to the list (sorted) */
+	if (TAILQ_EMPTY(&rte_pci_bus.device_list)) {
+		rte_pci_add_device(dev);
+	} else {
+		struct rte_pci_device *dev2;
+		int ret;
+
+		TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) {
+			// XXX
+			ret = rte_pci_addr_cmp(&dev->addr, &dev2->addr);
+			if (ret == 0)
+				ret = strncmp(dev->name, dev2->name,
+					      sizeof(dev->name));
+			if (ret > 0)
+				continue;
+			if (ret < 0) {
+				rte_pci_insert_device(dev2, dev);
+				goto plug;
+			}
+			/* already registered */
+			free(dev);
+			return 0;
+		}
+
+		rte_pci_add_device(dev);
+	}
+
+plug:
+	ret = bus->plug(&dev->device);
+	if (ret != 0) {
+		rte_pci_remove_device(dev);
+		free(dev);
+	} else {
+		mdev_dev->private = dev;
+	}
+	return ret;
+}
+
+static int vfio_pci_remove(struct rte_mdev_device *mdev_dev)
+{
+	struct rte_pci_device *dev = mdev_dev->private;
+	struct rte_bus *bus;
+	int ret;
+
+	if (dev == NULL)
+		return 0;
+
+	bus = rte_bus_find_by_name("pci");
+	if (bus == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
+		return -ENOENT;
+	}
+
+	if (bus->unplug == NULL) {
+		RTE_LOG(ERR, EAL, "Function unplug not supported by bus (%s)\n",
+			bus->name);
+		return -ENOTSUP;
+	}
+
+	ret = bus->unplug(&dev->device);
+	if (ret == 0)
+		mdev_dev->private = NULL;
+
+	return ret;
+}
+
+static struct rte_mdev_driver vfio_pci_drv = {
+	.dev_api = RTE_MDEV_DEV_API_VFIO_PCI,
+	.probe = vfio_pci_probe,
+	.remove = vfio_pci_remove
+};
+
+RTE_MDEV_REGISTER_DRIVER(mdev_vfio_pci, vfio_pci_drv);
diff --git a/drivers/bus/pci/meson.build b/drivers/bus/pci/meson.build
index a3140ff97..c3e884657 100644
--- a/drivers/bus/pci/meson.build
+++ b/drivers/bus/pci/meson.build
@@ -11,8 +11,10 @@ sources = files('pci_common.c',
 if host_machine.system() == 'linux'
 	sources += files('linux/pci.c',
 			'linux/pci_uio.c',
-			'linux/pci_vfio.c')
+			'linux/pci_vfio.c',
+			'linux/pci_vfio_mdev.c')
 	includes += include_directories('linux')
+	deps += ['bus_mdev']
 else
 	sources += files('bsd/pci.c')
 	includes += include_directories('bsd')
diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
index 704b9d71a..6b47333e6 100644
--- a/drivers/bus/pci/pci_common.c
+++ b/drivers/bus/pci/pci_common.c
@@ -124,21 +124,17 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
 {
 	int ret;
 	bool already_probed;
-	struct rte_pci_addr *loc;
 
 	if ((dr == NULL) || (dev == NULL))
 		return -EINVAL;
 
-	loc = &dev->addr;
-
 	/* The device is not blacklisted; Check if driver supports it */
 	if (!rte_pci_match(dr, dev))
 		/* Match of device and driver failed */
 		return 1;
 
-	RTE_LOG(INFO, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
-			loc->domain, loc->bus, loc->devid, loc->function,
-			dev->device.numa_node);
+	RTE_LOG(INFO, EAL, "PCI device %s on NUMA socket %i\n",
+		dev->name, dev->device.numa_node);
 
 	/* no initialization when blacklisted, return without error */
 	if (dev->device.devargs != NULL &&
@@ -208,7 +204,6 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
 static int
 rte_pci_detach_dev(struct rte_pci_device *dev)
 {
-	struct rte_pci_addr *loc;
 	struct rte_pci_driver *dr;
 	int ret = 0;
 
@@ -216,11 +211,9 @@ rte_pci_detach_dev(struct rte_pci_device *dev)
 		return -EINVAL;
 
 	dr = dev->driver;
-	loc = &dev->addr;
 
-	RTE_LOG(DEBUG, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
-			loc->domain, loc->bus, loc->devid,
-			loc->function, dev->device.numa_node);
+	RTE_LOG(DEBUG, EAL, "PCI device %s on NUMA socket %i\n",
+		dev->name, dev->device.numa_node);
 
 	RTE_LOG(DEBUG, EAL, "  remove driver: %x:%x %s\n", dev->id.vendor_id,
 			dev->id.device_id, dr->driver.name);
@@ -387,7 +380,7 @@ rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
 }
 
 /* Remove a device from PCI bus */
-static void
+void
 rte_pci_remove_device(struct rte_pci_device *pci_dev)
 {
 	TAILQ_REMOVE(&rte_pci_bus.device_list, pci_dev, next);
diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
index 13c3324bb..d5815ee44 100644
--- a/drivers/bus/pci/private.h
+++ b/drivers/bus/pci/private.h
@@ -67,6 +67,15 @@ void rte_pci_add_device(struct rte_pci_device *pci_dev);
 void rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
 		struct rte_pci_device *new_pci_dev);
 
+/**
+ * Remove a PCI device from the PCI Bus.
+ *
+ * @param pci_dev
+ *	PCI device to remove
+ * @return void
+ */
+void rte_pci_remove_device(struct rte_pci_device *pci_dev);
+
 /**
  * Update a pci device object by asking the kernel for the latest information.
  *
diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
index 06e004cd3..465a44935 100644
--- a/drivers/bus/pci/rte_bus_pci.h
+++ b/drivers/bus/pci/rte_bus_pci.h
@@ -51,6 +51,13 @@ TAILQ_HEAD(rte_pci_driver_list, rte_pci_driver);
 
 struct rte_devargs;
 
+/* It's RTE_UUID_STRLEN, which is bigger than PCI_PRI_STR_SIZE. */
+#define RTE_PCI_NAME_LEN		(36 + 1)
+
+// XXX: we can't include rte_uuid.h directly due to the conflicts
+//      introduced by stdbool.h
+typedef unsigned char rte_uuid_t[16];
+
 /**
  * A structure describing a PCI device.
  */
@@ -58,6 +65,8 @@ struct rte_pci_device {
 	TAILQ_ENTRY(rte_pci_device) next;   /**< Next probed PCI device. */
 	struct rte_device device;           /**< Inherit core device */
 	struct rte_pci_addr addr;           /**< PCI location. */
+	rte_uuid_t uuid;                    /**< Mdev location. */
+	uint8_t use_uuid;                   /**< True if uuid field valid. */
 	struct rte_pci_id id;               /**< PCI ID. */
 	struct rte_mem_resource mem_resource[PCI_MAX_RESOURCE];
 					    /**< PCI Memory Resource */
@@ -65,7 +74,7 @@ struct rte_pci_device {
 	struct rte_pci_driver *driver;      /**< PCI driver used in probing */
 	uint16_t max_vfs;                   /**< sriov enable if not zero */
 	enum rte_kernel_driver kdrv;        /**< Kernel driver passthrough */
-	char name[PCI_PRI_STR_SIZE+1];      /**< PCI location (ASCII) */
+	char name[RTE_PCI_NAME_LEN];        /**< PCI/Mdev location (ASCII) */
 	struct rte_intr_handle vfio_req_intr_handle;
 				/**< Handler of VFIO request interrupt */
 };
-- 
2.17.1

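A note on the class_id read in get_pci_id() above: the 32-bit value at PCI_CLASS_REVISION (offset 0x08) packs the revision ID into bits 7:0 and the class code into bits 31:8, which is why the patch shifts right by 8. The following is a minimal standalone sketch of that split, not code from the patch. The helper names are hypothetical, and it assumes the dword has already been read into a host-order unsigned value (a little-endian host, as with the pread64() calls above). It uses uint32_t rather than int so that a class code with the top bit set (for example 0xff) is not subject to implementation-defined signed right shift.

```c
#include <stdint.h>

/* Offset of the Class Code / Revision ID dword in PCI config space.
 * Same value as PCI_CLASS_REVISION in <linux/pci_regs.h>. */
#define CLASS_REVISION_OFFSET 0x08

/* Hypothetical helper: extract the 24-bit class code (bits 31:8). */
static inline uint32_t
class_id_from_dword(uint32_t class_revision)
{
	return class_revision >> 8;
}

/* Hypothetical helper: extract the revision ID (bits 7:0). */
static inline uint8_t
revision_from_dword(uint32_t class_revision)
{
	return (uint8_t)(class_revision & 0xffu);
}
```

For an Ethernet controller reporting 0x02000003 at that offset, class_id_from_dword() yields 0x020000 and revision_from_dword() yields 0x03.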
^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC 3/3] bus/pci: add mdev support
  2019-04-03  7:18 ` [dpdk-dev] [RFC 3/3] bus/pci: add mdev support Tiwei Bie
  2019-04-03  7:18   ` Tiwei Bie
@ 2019-04-03 14:13   ` Wiles, Keith
  2019-04-03 14:13     ` Wiles, Keith
  2019-04-04  4:19     ` Tiwei Bie
  1 sibling, 2 replies; 41+ messages in thread
From: Wiles, Keith @ 2019-04-03 14:13 UTC (permalink / raw)
  To: Bie, Tiwei; +Cc: dpdk-dev, Liang, Cunming, Richardson, Bruce, alejandro.lucero

Some minor nits.

> On Apr 3, 2019, at 2:18 AM, Tiwei Bie <tiwei.bie@intel.com> wrote:
> 
> This patch adds the mdev support in PCI bus driver. A mdev
> driver is introduced to probe the mdev devices whose device
> API is "vfio-pci" on the mdev bus.
> 
> PS. There are some hacks in this patch for now.
> 
> Signed-off-by: Cunming Liang <cunming.liang@intel.com>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
> drivers/bus/pci/Makefile              |   3 +
> drivers/bus/pci/linux/Makefile        |   4 +
> drivers/bus/pci/linux/pci_vfio.c      |  35 ++-
> drivers/bus/pci/linux/pci_vfio_mdev.c | 305 ++++++++++++++++++++++++++
> drivers/bus/pci/meson.build           |   4 +-
> drivers/bus/pci/pci_common.c          |  17 +-
> drivers/bus/pci/private.h             |   9 +
> drivers/bus/pci/rte_bus_pci.h         |  11 +-
> 8 files changed, 370 insertions(+), 18 deletions(-)
> create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c
> 
> diff --git a/drivers/bus/pci/Makefile b/drivers/bus/pci/Makefile
> index de53ce1bf..085ec9066 100644
> --- a/drivers/bus/pci/Makefile
> +++ b/drivers/bus/pci/Makefile
> @@ -27,6 +27,9 @@ CFLAGS += -DALLOW_EXPERIMENTAL_API

This define is enabled in 50-70 Makefiles. We can leave it here, but we should refactor it into a common place in the future.
> 
> LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
> LDLIBS += -lrte_ethdev -lrte_pci -lrte_kvargs
> +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
> +LDLIBS += -lrte_bus_mdev
> +endif

See comment below.
> 
> include $(RTE_SDK)/drivers/bus/pci/$(SYSTEM)/Makefile
> SRCS-$(CONFIG_RTE_LIBRTE_PCI_BUS) := $(addprefix $(SYSTEM)/,$(SRCS))
> diff --git a/drivers/bus/pci/linux/Makefile b/drivers/bus/pci/linux/Makefile
> index 90404468b..88bbc2390 100644
> --- a/drivers/bus/pci/linux/Makefile
> +++ b/drivers/bus/pci/linux/Makefile
> @@ -4,3 +4,7 @@
> SRCS += pci.c
> SRCS += pci_uio.c
> SRCS += pci_vfio.c
> +
> +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
> +	SRCS += pci_vfio_mdev.c
> +endif

Do we need a configuration option for MDEV?
Can it be enabled for all builds, or can we reuse an existing configuration option if it only applies to some OSes or architectures?

> diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
> index ebf6ccd3c..c2c4c6a50 100644
> --- a/drivers/bus/pci/linux/pci_vfio.c
> +++ b/drivers/bus/pci/linux/pci_vfio.c
> @@ -13,6 +13,9 @@
> 
> #include <rte_log.h>
> #include <rte_pci.h>
> +#ifdef RTE_LIBRTE_MDEV_BUS
> +#include <rte_bus_mdev.h>
> +#endif
> #include <rte_bus_pci.h>
> #include <rte_eal_memconfig.h>
> #include <rte_malloc.h>
> @@ -20,6 +23,7 @@
> #include <rte_eal.h>
> #include <rte_bus.h>
> #include <rte_spinlock.h>
> +#include <rte_uuid.h>
> 
> #include "eal_filesystem.h"
> 
> @@ -648,6 +652,7 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
> {
> 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
> 	char pci_addr[PATH_MAX] = {0};
> +	const char *sysfs_path;
> 	int vfio_dev_fd;
> 	struct rte_pci_addr *loc = &dev->addr;
> 	int i, ret;
> @@ -663,10 +668,20 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
> #endif
> 
> 	/* store PCI address string */
> -	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
> +	if (dev->use_uuid) {
> +#ifdef RTE_LIBRTE_MDEV_BUS
> +		sysfs_path = rte_mdev_get_sysfs_path();
> +		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
> +#else
> +		return -1;
> +#endif
> +	} else {
> +		sysfs_path = rte_pci_get_sysfs_path();
> +		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
> 			loc->domain, loc->bus, loc->devid, loc->function);
> +	}
> 
> -	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
> +	ret = rte_vfio_setup_device(sysfs_path, pci_addr,
> 					&vfio_dev_fd, &device_info);
> 	if (ret)
> 		return ret;
> @@ -793,6 +808,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
> {
> 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
> 	char pci_addr[PATH_MAX] = {0};
> +	const char *sysfs_path;
> 	int vfio_dev_fd;
> 	struct rte_pci_addr *loc = &dev->addr;
> 	int i, ret;
> @@ -808,8 +824,19 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
> #endif
> 
> 	/* store PCI address string */
> -	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
> +	if (dev->use_uuid) {
> +#ifdef RTE_LIBRTE_MDEV_BUS
> +		sysfs_path = rte_mdev_get_sysfs_path();
> +		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
> +#else
> +		return -1;
> +#endif
> +	} else {
> +		sysfs_path = rte_pci_get_sysfs_path();
> +		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
> 			loc->domain, loc->bus, loc->devid, loc->function);
> +	}
> +
> 
> 	/* if we're in a secondary process, just find our tailq entry */
> 	TAILQ_FOREACH(vfio_res, vfio_res_list, next) {
> @@ -825,7 +852,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
> 		return -1;
> 	}
> 
> -	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
> +	ret = rte_vfio_setup_device(sysfs_path, pci_addr,
> 					&vfio_dev_fd, &device_info);
> 	if (ret)
> 		return ret;
> diff --git a/drivers/bus/pci/linux/pci_vfio_mdev.c b/drivers/bus/pci/linux/pci_vfio_mdev.c
> new file mode 100644
> index 000000000..92498c2fe
> --- /dev/null
> +++ b/drivers/bus/pci/linux/pci_vfio_mdev.c
> @@ -0,0 +1,305 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#include <string.h>
> +#include <dirent.h>
> +#include <fcntl.h>
> +#include <sys/ioctl.h>
> +#include <linux/pci_regs.h>
> +
> +#include <rte_log.h>
> +#include <rte_pci.h>
> +#include <rte_eal_memconfig.h>
> +#include <rte_malloc.h>
> +#include <rte_devargs.h>
> +#include <rte_memcpy.h>
> +#include <rte_vfio.h>
> +#include <rte_bus_mdev.h>
> +
> +#include "eal_private.h"
> +#include "eal_filesystem.h"
> +
> +#include "private.h"
> +
> +extern struct rte_pci_bus rte_pci_bus;
> +
> +static int
> +get_pci_id(const char *sysfs_base, const char *dev_addr,
> +	   struct rte_pci_id *pci_id)
> +{
> +	int ret = 0;
> +	int iommu_group_num;
> +	int vfio_group_fd;
> +	int vfio_dev_fd;
> +	int container;
> +	int class;
> +	char name[PATH_MAX];
> +	struct vfio_group_status group_status = {
> +		.argsz = sizeof(group_status) };
> +
> +	container = open("/dev/vfio/vfio", O_RDWR);

Should this one use the VFIO_CONTAINER_PATH define in rte_vfio.h?
The define is gated by VFIO_PRESENT in that header.
> +	if (container < 0) {
> +		RTE_LOG(WARNING, EAL, "Failed to open VFIO container\n");
> +		ret = -1;
> +		goto out;
> +	}
> +
> +	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
> +		/* Unknown API version */
> +		RTE_LOG(WARNING, EAL, "Unknown VFIO API version\n");
> +		ret = -1;
> +		goto close_container;
> +	}
> +
> +	if (rte_vfio_get_group_num(sysfs_base, dev_addr,
> +				   &iommu_group_num) <= 0) {
> +		RTE_LOG(WARNING, EAL, "%s not managed by VFIO driver\n",
> +			dev_addr);
> +		ret = -1;
> +		goto close_container;
> +	}
> +
> +	snprintf(name, sizeof(name), "/dev/vfio/%d", iommu_group_num);

We should be testing the return value from snprintf, but that is not done anywhere else in the code.
We need to look at fixing this in a different patch, not here.
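
For that later patch, a rough sketch of what the check could look like. The helper name format_group_path is made up for illustration and is not part of DPDK; the only assumption is the standard snprintf() contract (negative on error, otherwise the length that would have been written).

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a DPDK API): format the VFIO group path for an
 * IOMMU group number.
 * Returns 0 on success, -1 if snprintf() failed or the output was truncated.
 */
static int
format_group_path(char *buf, size_t len, int iommu_group_num)
{
	int n = snprintf(buf, len, "/dev/vfio/%d", iommu_group_num);

	/* n < 0: output error; n >= len: the path did not fit in buf */
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

A caller would then bail out before open() when the helper returns -1, instead of opening a truncated path.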
> +
> +	vfio_group_fd = open(name, O_RDWR);
> +	if (vfio_group_fd < 0) {
> +		ret = -1;
> +		goto close_container;
> +	}
> +
> +	/* if group_fd == 0, that means the device isn't managed by VFIO */
> +	if (vfio_group_fd == 0) {
> +		RTE_LOG(WARNING, EAL, "%s not managed by VFIO driver\n",
> +			dev_addr);
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	if (ioctl(vfio_group_fd, VFIO_GROUP_GET_STATUS, &group_status)) {
> +		RTE_LOG(ERR, EAL, "%s cannot get group status, error %i (%s)\n",
> +			dev_addr, errno, strerror(errno));
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
> +		RTE_LOG(ERR, EAL, "%s VFIO group is not viable!\n", dev_addr);
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
> +		if (ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
> +			    &container)) {
> +			RTE_LOG(ERR, EAL, "%s cannot add VFIO group to container, error %i (%s)\n",
> +				dev_addr, errno, strerror(errno));
> +			ret = -1;
> +			goto close_group;
> +		}
> +	}
> +
> +	if (ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
> +		RTE_LOG(ERR, EAL, "%s cannot set iommu, error %i (%s)\n",
> +			dev_addr, errno, strerror(errno));
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	vfio_dev_fd = ioctl(vfio_group_fd, VFIO_GROUP_GET_DEVICE_FD, dev_addr);
> +	if (vfio_dev_fd < 0) {
> +		/* if we cannot get a device fd, this implies a problem with
> +		 * the VFIO group or the container not having IOMMU configured.
> +		 */
> +		RTE_LOG(ERR, EAL, "Getting a vfio_dev_fd for %s failed errno %d\n",
> +			dev_addr, errno);
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	/* vendor_id */
> +	if (pread64(vfio_dev_fd, &pci_id->vendor_id, sizeof(uint16_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_VENDOR_ID) != sizeof(uint16_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read VendorID from PCI config space\n");
> +		ret = -1;
> +		goto close_device;
> +	}
> +
> +	/* device_id */
> +	if (pread64(vfio_dev_fd, &pci_id->device_id, sizeof(uint16_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_DEVICE_ID) != sizeof(uint16_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read DeviceID from PCI config space\n");
> +		ret = -1;
> +		goto close_device;
> +	}
> +
> +	/* subsystem_vendor_id */
> +	if (pread64(vfio_dev_fd, &pci_id->subsystem_vendor_id, sizeof(uint16_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_SUBSYSTEM_VENDOR_ID) != sizeof(uint16_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read SubVendorID from PCI config space\n");
> +		ret = -1;
> +		goto close_device;
> +	}
> +
> +	/* subsystem_device_id */
> +	if (pread64(vfio_dev_fd, &pci_id->subsystem_device_id, sizeof(uint16_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_SUBSYSTEM_ID) != sizeof(uint16_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read SubDeviceID from PCI config space\n");
> +		ret = -1;
> +		goto close_device;
> +	}
> +
> +	/* class_id */
> +	if (pread64(vfio_dev_fd, &class, sizeof(uint32_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_CLASS_REVISION) != sizeof(uint32_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read ClassID from PCI config space\n");

These should possibly be DEBUG messages, but ERR is OK I guess. To me, filling up the log with a bunch of messages here, when the failure is also flagged and logged at a higher layer, is too many log messages. It would require us to take a look and make this cleaner.
> +		ret = -1;
> +		goto close_device;
> +	}
> +	pci_id->class_id = class >> 8;
> +
> +close_device:
> +	if (close(vfio_dev_fd) < 0) {
> +		RTE_LOG(INFO, EAL, "Error when closing VFIO device for %s\n",
> +			dev_addr);

IMO these should be ERR, DEBUG or WARNING rather than INFO, or there should be no log message at all.
> +		ret = -1;
> +	}
> +
> +close_group:
> +	if (close(vfio_group_fd) < 0) {
> +		RTE_LOG(INFO, EAL, "Error when closing VFIO group for %s\n",
> +			dev_addr);
> +		ret = -1;
> +	}
> +
> +close_container:
> +	if (close(container) < 0) {
> +		RTE_LOG(INFO, EAL, "Error when closing VFIO container\n");
> +		ret = -1;
> +	}
> +
> +out:

Jumping to 4 different exit points makes this function complex. Would it not be better to have one error exit point and test whether the fds need to be closed (with the fds initialized to -1)?
e.g.
	if (pread64(...)) {
		RTE_LOG(ERR, EAL, "Error message");
		goto err_exit;
	}

	return 0;
err_exit:
	if (vfio_dev_fd >= 0 && close(vfio_dev_fd) < 0) {
		
	}
	if (…) {
	}
	return -1;

This should eliminate the variable ret and reduce the lines of code.
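
A minimal sketch of the single-exit idea, using generic files instead of the VFIO fds. It assumes the fds start at -1 and, unlike the outline above, keeps a ret variable so that the success path also goes through the cleanup, since get_pci_id() has to close its descriptors on success as well. The names read_first_byte and close_if_open are made up for illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: close fd only if it was actually opened. */
static void
close_if_open(int fd)
{
	if (fd >= 0)
		close(fd);
}

/*
 * Open two files, read one byte from the second, and release everything
 * at a single label. Returns 0 on success, -1 on any failure.
 */
static int
read_first_byte(const char *path_a, const char *path_b, unsigned char *out)
{
	int fd_a = -1, fd_b = -1;	/* -1 means "not opened" */
	int ret = -1;

	fd_a = open(path_a, O_RDONLY);
	if (fd_a < 0)
		goto out;

	fd_b = open(path_b, O_RDONLY);
	if (fd_b < 0)
		goto out;

	if (read(fd_b, out, 1) != 1)
		goto out;

	ret = 0;
out:
	/* Single exit: only descriptors that were opened get closed. */
	close_if_open(fd_b);
	close_if_open(fd_a);
	return ret;
}
```

Whether close() failures should still be logged or change the return value, as the patch does today, is a separate decision that this sketch leaves out.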
> +	return ret;
> +}
> +
> +static int vfio_pci_probe(struct rte_mdev_driver *mdev_drv __rte_unused,
> +			  struct rte_mdev_device *mdev_dev)
> +{
> +	char name[RTE_UUID_STRLEN];
> +	struct rte_pci_device *dev;
> +	struct rte_bus *bus;
> +	int ret;
> +
> +	bus = rte_bus_find_by_name("pci");
> +	if (bus == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
> +		return -ENOENT;
> +	}
> +
> +	if (bus->plug == NULL) {
> +		RTE_LOG(ERR, EAL, "Function plug not supported by bus (%s)\n",
> +			bus->name);
> +		return -ENOTSUP;
> +	}
> +
> +	dev = malloc(sizeof(*dev));
> +	if (dev == NULL)
> +		return -ENOMEM;

If we are going to add error logs for the tests above, why does this one not get one?
Should we just remove them and check in the calling function instead? Then these could be converted to DEBUG logs or removed.
> +
> +	memset(dev, 0, sizeof(*dev));
> +	dev->device.bus = &rte_pci_bus.bus;
> +	rte_uuid_unparse(mdev_dev->addr, name, sizeof(name));
> +
> +	if (get_pci_id(rte_mdev_get_sysfs_path(), name, &dev->id)) {
> +		free(dev);
> +		return -1;
> +	}
> +
> +	snprintf(dev->name, sizeof(dev->name), "%s", name);

This should use strlcpy().
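
For reference, a small sketch of the semantics I mean. strlcpy() is not available in every libc, so copy_name below is a made-up stand-in with the same contract: copy at most size - 1 bytes, always NUL-terminate, and return the length of the source so that truncation can be detected.

```c
#include <string.h>
#include <stddef.h>

/*
 * Hypothetical stand-in with strlcpy() semantics: copy at most size - 1
 * bytes, always NUL-terminate when size > 0, and return strlen(src) so the
 * caller can detect truncation (return value >= size).
 */
static size_t
copy_name(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size > 0) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}
```

With a 37-byte destination, a 36-character UUID string fits exactly; any return value of 37 or more would mean the name was cut short.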
> +	dev->device.name = dev->name;
> +	dev->kdrv = RTE_KDRV_VFIO;
> +	dev->use_uuid = 1;
> +	rte_uuid_copy(dev->uuid, mdev_dev->addr);
> +
> +	// TODO: dev->device.devargs, etc
> +
> +	memset(&dev->addr, -1, sizeof(dev->addr)); // XXX: TODO

From what I have seen in the past, TODO or FIXME markers should not be in the code. The TODO items should be removed and tracked outside the code if they need to be done later.
> +
> +	/* device is valid, add to the list (sorted) */
> +	if (TAILQ_EMPTY(&rte_pci_bus.device_list)) {
> +		rte_pci_add_device(dev);
> +	} else {
> +		struct rte_pci_device *dev2;
> +		int ret;
> +
> +		TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) {
> +			// XXX

What does this comment mean? Remove it or explain it.
> +			ret = rte_pci_addr_cmp(&dev->addr, &dev2->addr);
> +			if (ret == 0)
> +				ret = strncmp(dev->name, dev2->name,
> +					      sizeof(dev->name));
> +			if (ret > 0)
> +				continue;
> +			if (ret < 0) {
> +				rte_pci_insert_device(dev2, dev);
> +				goto plug;
> +			}
> +			/* already registered */
> +			free(dev);
> +			return 0;
> +		}
> +
> +		rte_pci_add_device(dev);
> +	}
> +
> +plug:
> +	ret = bus->plug(&dev->device);
> +	if (ret != 0) {
> +		rte_pci_remove_device(dev);
> +		free(dev);
> +	} else {
> +		mdev_dev->private = dev;
> +	}

The coding style guide says to drop the {} around single-line statements.
> +	return ret;
> +}
> +
> +static int vfio_pci_remove(struct rte_mdev_device *mdev_dev)
> +{
> +	struct rte_pci_device *dev = mdev_dev->private;
> +	struct rte_bus *bus;
> +	int ret;
> +
> +	if (dev == NULL)
> +		return 0;
> +
> +	bus = rte_bus_find_by_name("pci");
> +	if (bus == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
> +		return -ENOENT;
> +	}
> +
> +	if (bus->unplug == NULL) {
> +		RTE_LOG(ERR, EAL, "Function unplug not supported by bus (%s)\n",
> +			bus->name);
> +		return -ENOTSUP;
> +	}
> +
> +	ret = bus->unplug(&dev->device);
> +	if (ret == 0)
> +		mdev_dev->private = NULL;
> +
> +	return ret;
> +}
> +
> +static struct rte_mdev_driver vfio_pci_drv = {
> +	.dev_api = RTE_MDEV_DEV_API_VFIO_PCI,
> +	.probe = vfio_pci_probe,
> +	.remove = vfio_pci_remove
> +};
> +
> +RTE_MDEV_REGISTER_DRIVER(mdev_vfio_pci, vfio_pci_drv);
> diff --git a/drivers/bus/pci/meson.build b/drivers/bus/pci/meson.build
> index a3140ff97..c3e884657 100644
> --- a/drivers/bus/pci/meson.build
> +++ b/drivers/bus/pci/meson.build
> @@ -11,8 +11,10 @@ sources = files('pci_common.c',
> if host_machine.system() == 'linux'
> 	sources += files('linux/pci.c',
> 			'linux/pci_uio.c',
> -			'linux/pci_vfio.c')
> +			'linux/pci_vfio.c',
> +			'linux/pci_vfio_mdev.c')

If you need the RTE_LIBRTE_MDEV_BUS define, then does pci_vfio_mdev.c need to be built conditionally?
> 	includes += include_directories('linux')
> +	deps += ['bus_mdev']

If this was added for mdev, then it too should be conditional.
> else
> 	sources += files('bsd/pci.c')
> 	includes += include_directories('bsd')
> diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
> index 704b9d71a..6b47333e6 100644
> --- a/drivers/bus/pci/pci_common.c
> +++ b/drivers/bus/pci/pci_common.c
> @@ -124,21 +124,17 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
> {
> 	int ret;
> 	bool already_probed;
> -	struct rte_pci_addr *loc;
> 
> 	if ((dr == NULL) || (dev == NULL))
> 		return -EINVAL;
> 
> -	loc = &dev->addr;
> -
> 	/* The device is not blacklisted; Check if driver supports it */
> 	if (!rte_pci_match(dr, dev))
> 		/* Match of device and driver failed */
> 		return 1;
> 
> -	RTE_LOG(INFO, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
> -			loc->domain, loc->bus, loc->devid, loc->function,
> -			dev->device.numa_node);
> +	RTE_LOG(INFO, EAL, "PCI device %s on NUMA socket %i\n",
> +		dev->name, dev->device.numa_node);
> 
> 	/* no initialization when blacklisted, return without error */
> 	if (dev->device.devargs != NULL &&
> @@ -208,7 +204,6 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
> static int
> rte_pci_detach_dev(struct rte_pci_device *dev)
> {
> -	struct rte_pci_addr *loc;
> 	struct rte_pci_driver *dr;
> 	int ret = 0;
> 
> @@ -216,11 +211,9 @@ rte_pci_detach_dev(struct rte_pci_device *dev)
> 		return -EINVAL;
> 
> 	dr = dev->driver;
> -	loc = &dev->addr;
> 
> -	RTE_LOG(DEBUG, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
> -			loc->domain, loc->bus, loc->devid,
> -			loc->function, dev->device.numa_node);
> +	RTE_LOG(DEBUG, EAL, "PCI device %s on NUMA socket %i\n",
> +		dev->name, dev->device.numa_node);
> 
> 	RTE_LOG(DEBUG, EAL, "  remove driver: %x:%x %s\n", dev->id.vendor_id,
> 			dev->id.device_id, dr->driver.name);
> @@ -387,7 +380,7 @@ rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
> }
> 
> /* Remove a device from PCI bus */
> -static void
> +void
> rte_pci_remove_device(struct rte_pci_device *pci_dev)

I have not looked yet, but did this function get added to the version.map file?
Does converting a static function to a public one require the experimental tag too? Maybe not.
> {
> 	TAILQ_REMOVE(&rte_pci_bus.device_list, pci_dev, next);
> diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
> index 13c3324bb..d5815ee44 100644
> --- a/drivers/bus/pci/private.h
> +++ b/drivers/bus/pci/private.h
> @@ -67,6 +67,15 @@ void rte_pci_add_device(struct rte_pci_device *pci_dev);
> void rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
> 		struct rte_pci_device *new_pci_dev);
> 
> +/**
> + * Remove a PCI device from the PCI Bus.
> + *
> + * @param pci_dev
> + *	PCI device to remove
> + * @return void
> + */
> +void rte_pci_remove_device(struct rte_pci_device *pci_dev);
> +
> /**
>  * Update a pci device object by asking the kernel for the latest information.
>  *
> diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
> index 06e004cd3..465a44935 100644
> --- a/drivers/bus/pci/rte_bus_pci.h
> +++ b/drivers/bus/pci/rte_bus_pci.h
> @@ -51,6 +51,13 @@ TAILQ_HEAD(rte_pci_driver_list, rte_pci_driver);
> 
> struct rte_devargs;
> 
> +/* It's RTE_UUID_STRLEN, which is bigger than PCI_PRI_STR_SIZE. */
> +#define RTE_PCI_NAME_LEN		(36 + 1)
> +
> +// XXX: we can't include rte_uuid.h directly due to the conflicts
> +//      introduced by stdbool.h
> +typedef unsigned char rte_uuid_t[16];

Does this need to have the string 'XXX' in the comment? 'Note' may be a better word.
> +
> /**
>  * A structure describing a PCI device.
>  */
> @@ -58,6 +65,8 @@ struct rte_pci_device {
> 	TAILQ_ENTRY(rte_pci_device) next;   /**< Next probed PCI device. */
> 	struct rte_device device;           /**< Inherit core device */
> 	struct rte_pci_addr addr;           /**< PCI location. */
> +	rte_uuid_t uuid;                    /**< Mdev location. */
> +	uint8_t use_uuid;                   /**< True if uuid field valid. */
> 	struct rte_pci_id id;               /**< PCI ID. */
> 	struct rte_mem_resource mem_resource[PCI_MAX_RESOURCE];
> 					    /**< PCI Memory Resource */
> @@ -65,7 +74,7 @@ struct rte_pci_device {
> 	struct rte_pci_driver *driver;      /**< PCI driver used in probing */
> 	uint16_t max_vfs;                   /**< sriov enable if not zero */
> 	enum rte_kernel_driver kdrv;        /**< Kernel driver passthrough */
> -	char name[PCI_PRI_STR_SIZE+1];      /**< PCI location (ASCII) */
> +	char name[RTE_PCI_NAME_LEN];        /**< PCI/Mdev location (ASCII) */
> 	struct rte_intr_handle vfio_req_intr_handle;
> 				/**< Handler of VFIO request interrupt */
> };
> -- 
> 2.17.1
> 

Regards,
Keith


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC 3/3] bus/pci: add mdev support
  2019-04-03 14:13   ` Wiles, Keith
@ 2019-04-03 14:13     ` Wiles, Keith
  2019-04-04  4:19     ` Tiwei Bie
  1 sibling, 0 replies; 41+ messages in thread
From: Wiles, Keith @ 2019-04-03 14:13 UTC (permalink / raw)
  To: Bie, Tiwei; +Cc: dpdk-dev, Liang, Cunming, Richardson, Bruce, alejandro.lucero

Some minor nits.

> On Apr 3, 2019, at 2:18 AM, Tiwei Bie <tiwei.bie@intel.com> wrote:
> 
> This patch adds the mdev support in PCI bus driver. A mdev
> driver is introduced to probe the mdev devices whose device
> API is "vfio-pci" on the mdev bus.
> 
> PS. There are some hacks in this patch for now.
> 
> Signed-off-by: Cunming Liang <cunming.liang@intel.com>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
> drivers/bus/pci/Makefile              |   3 +
> drivers/bus/pci/linux/Makefile        |   4 +
> drivers/bus/pci/linux/pci_vfio.c      |  35 ++-
> drivers/bus/pci/linux/pci_vfio_mdev.c | 305 ++++++++++++++++++++++++++
> drivers/bus/pci/meson.build           |   4 +-
> drivers/bus/pci/pci_common.c          |  17 +-
> drivers/bus/pci/private.h             |   9 +
> drivers/bus/pci/rte_bus_pci.h         |  11 +-
> 8 files changed, 370 insertions(+), 18 deletions(-)
> create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c
> 
> diff --git a/drivers/bus/pci/Makefile b/drivers/bus/pci/Makefile
> index de53ce1bf..085ec9066 100644
> --- a/drivers/bus/pci/Makefile
> +++ b/drivers/bus/pci/Makefile
> @@ -27,6 +27,9 @@ CFLAGS += -DALLOW_EXPERIMENTAL_API

This define is enabled in 50-70 Makefiles. We can leave it here, but we should refactor it into a common place in the future.
> 
> LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
> LDLIBS += -lrte_ethdev -lrte_pci -lrte_kvargs
> +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
> +LDLIBS += -lrte_bus_mdev
> +endif

See comment below.
> 
> include $(RTE_SDK)/drivers/bus/pci/$(SYSTEM)/Makefile
> SRCS-$(CONFIG_RTE_LIBRTE_PCI_BUS) := $(addprefix $(SYSTEM)/,$(SRCS))
> diff --git a/drivers/bus/pci/linux/Makefile b/drivers/bus/pci/linux/Makefile
> index 90404468b..88bbc2390 100644
> --- a/drivers/bus/pci/linux/Makefile
> +++ b/drivers/bus/pci/linux/Makefile
> @@ -4,3 +4,7 @@
> SRCS += pci.c
> SRCS += pci_uio.c
> SRCS += pci_vfio.c
> +
> +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
> +	SRCS += pci_vfio_mdev.c
> +endif

Do we need a configuration option for MDEV?
Can it be enabled for all builds, or can we reuse an existing configuration option if it only applies to some OSes or architectures?

> diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
> index ebf6ccd3c..c2c4c6a50 100644
> --- a/drivers/bus/pci/linux/pci_vfio.c
> +++ b/drivers/bus/pci/linux/pci_vfio.c
> @@ -13,6 +13,9 @@
> 
> #include <rte_log.h>
> #include <rte_pci.h>
> +#ifdef RTE_LIBRTE_MDEV_BUS
> +#include <rte_bus_mdev.h>
> +#endif
> #include <rte_bus_pci.h>
> #include <rte_eal_memconfig.h>
> #include <rte_malloc.h>
> @@ -20,6 +23,7 @@
> #include <rte_eal.h>
> #include <rte_bus.h>
> #include <rte_spinlock.h>
> +#include <rte_uuid.h>
> 
> #include "eal_filesystem.h"
> 
> @@ -648,6 +652,7 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
> {
> 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
> 	char pci_addr[PATH_MAX] = {0};
> +	const char *sysfs_path;
> 	int vfio_dev_fd;
> 	struct rte_pci_addr *loc = &dev->addr;
> 	int i, ret;
> @@ -663,10 +668,20 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
> #endif
> 
> 	/* store PCI address string */
> -	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
> +	if (dev->use_uuid) {
> +#ifdef RTE_LIBRTE_MDEV_BUS
> +		sysfs_path = rte_mdev_get_sysfs_path();
> +		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
> +#else
> +		return -1;
> +#endif
> +	} else {
> +		sysfs_path = rte_pci_get_sysfs_path();
> +		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
> 			loc->domain, loc->bus, loc->devid, loc->function);
> +	}
> 
> -	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
> +	ret = rte_vfio_setup_device(sysfs_path, pci_addr,
> 					&vfio_dev_fd, &device_info);
> 	if (ret)
> 		return ret;
> @@ -793,6 +808,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
> {
> 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
> 	char pci_addr[PATH_MAX] = {0};
> +	const char *sysfs_path;
> 	int vfio_dev_fd;
> 	struct rte_pci_addr *loc = &dev->addr;
> 	int i, ret;
> @@ -808,8 +824,19 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
> #endif
> 
> 	/* store PCI address string */
> -	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
> +	if (dev->use_uuid) {
> +#ifdef RTE_LIBRTE_MDEV_BUS
> +		sysfs_path = rte_mdev_get_sysfs_path();
> +		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
> +#else
> +		return -1;
> +#endif
> +	} else {
> +		sysfs_path = rte_pci_get_sysfs_path();
> +		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
> 			loc->domain, loc->bus, loc->devid, loc->function);
> +	}
> +
> 
> 	/* if we're in a secondary process, just find our tailq entry */
> 	TAILQ_FOREACH(vfio_res, vfio_res_list, next) {
> @@ -825,7 +852,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
> 		return -1;
> 	}
> 
> -	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
> +	ret = rte_vfio_setup_device(sysfs_path, pci_addr,
> 					&vfio_dev_fd, &device_info);
> 	if (ret)
> 		return ret;
> diff --git a/drivers/bus/pci/linux/pci_vfio_mdev.c b/drivers/bus/pci/linux/pci_vfio_mdev.c
> new file mode 100644
> index 000000000..92498c2fe
> --- /dev/null
> +++ b/drivers/bus/pci/linux/pci_vfio_mdev.c
> @@ -0,0 +1,305 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#include <string.h>
> +#include <dirent.h>
> +#include <fcntl.h>
> +#include <sys/ioctl.h>
> +#include <linux/pci_regs.h>
> +
> +#include <rte_log.h>
> +#include <rte_pci.h>
> +#include <rte_eal_memconfig.h>
> +#include <rte_malloc.h>
> +#include <rte_devargs.h>
> +#include <rte_memcpy.h>
> +#include <rte_vfio.h>
> +#include <rte_bus_mdev.h>
> +
> +#include "eal_private.h"
> +#include "eal_filesystem.h"
> +
> +#include "private.h"
> +
> +extern struct rte_pci_bus rte_pci_bus;
> +
> +static int
> +get_pci_id(const char *sysfs_base, const char *dev_addr,
> +	   struct rte_pci_id *pci_id)
> +{
> +	int ret = 0;
> +	int iommu_group_num;
> +	int vfio_group_fd;
> +	int vfio_dev_fd;
> +	int container;
> +	int class;
> +	char name[PATH_MAX];
> +	struct vfio_group_status group_status = {
> +		.argsz = sizeof(group_status) };
> +
> +	container = open("/dev/vfio/vfio", O_RDWR);

Should this one use the VFIO_CONTAINER_PATH define in rte_vfio.h?
The define is gated by VFIO_PRESENT in that header.
> +	if (container < 0) {
> +		RTE_LOG(WARNING, EAL, "Failed to open VFIO container\n");
> +		ret = -1;
> +		goto out;
> +	}
> +
> +	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
> +		/* Unknown API version */
> +		RTE_LOG(WARNING, EAL, "Unknown VFIO API version\n");
> +		ret = -1;
> +		goto close_container;
> +	}
> +
> +	if (rte_vfio_get_group_num(sysfs_base, dev_addr,
> +				   &iommu_group_num) <= 0) {
> +		RTE_LOG(WARNING, EAL, "%s not managed by VFIO driver\n",
> +			dev_addr);
> +		ret = -1;
> +		goto close_container;
> +	}
> +
> +	snprintf(name, sizeof(name), "/dev/vfio/%d", iommu_group_num);

We should be testing the return value from snprintf, but that is not done anywhere else in the code.
We need to look at fixing this in a different patch, not here.
> +
> +	vfio_group_fd = open(name, O_RDWR);
> +	if (vfio_group_fd < 0) {
> +		ret = -1;
> +		goto close_container;
> +	}
> +
> +	/* if group_fd == 0, that means the device isn't managed by VFIO */
> +	if (vfio_group_fd == 0) {
> +		RTE_LOG(WARNING, EAL, "%s not managed by VFIO driver\n",
> +			dev_addr);
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	if (ioctl(vfio_group_fd, VFIO_GROUP_GET_STATUS, &group_status)) {
> +		RTE_LOG(ERR, EAL, "%s cannot get group status, error %i (%s)\n",
> +			dev_addr, errno, strerror(errno));
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
> +		RTE_LOG(ERR, EAL, "%s VFIO group is not viable!\n", dev_addr);
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
> +		if (ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
> +			    &container)) {
> +			RTE_LOG(ERR, EAL, "%s cannot add VFIO group to container, error %i (%s)\n",
> +				dev_addr, errno, strerror(errno));
> +			ret = -1;
> +			goto close_group;
> +		}
> +	}
> +
> +	if (ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
> +		RTE_LOG(ERR, EAL, "%s cannot set iommu, error %i (%s)\n",
> +			dev_addr, errno, strerror(errno));
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	vfio_dev_fd = ioctl(vfio_group_fd, VFIO_GROUP_GET_DEVICE_FD, dev_addr);
> +	if (vfio_dev_fd < 0) {
> +		/* if we cannot get a device fd, this implies a problem with
> +		 * the VFIO group or the container not having IOMMU configured.
> +		 */
> +		RTE_LOG(ERR, EAL, "Getting a vfio_dev_fd for %s failed errno %d\n",
> +			dev_addr, errno);
> +		ret = -1;
> +		goto close_group;
> +	}
> +
> +	/* vendor_id */
> +	if (pread64(vfio_dev_fd, &pci_id->vendor_id, sizeof(uint16_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_VENDOR_ID) != sizeof(uint16_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read VendorID from PCI config space\n");
> +		ret = -1;
> +		goto close_device;
> +	}
> +
> +	/* device_id */
> +	if (pread64(vfio_dev_fd, &pci_id->device_id, sizeof(uint16_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_DEVICE_ID) != sizeof(uint16_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read DeviceID from PCI config space\n");
> +		ret = -1;
> +		goto close_device;
> +	}
> +
> +	/* subsystem_vendor_id */
> +	if (pread64(vfio_dev_fd, &pci_id->subsystem_vendor_id, sizeof(uint16_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_SUBSYSTEM_VENDOR_ID) != sizeof(uint16_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read SubVendorID from PCI config space\n");
> +		ret = -1;
> +		goto close_device;
> +	}
> +
> +	/* subsystem_device_id */
> +	if (pread64(vfio_dev_fd, &pci_id->subsystem_device_id, sizeof(uint16_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_SUBSYSTEM_ID) != sizeof(uint16_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read SubDeviceID from PCI config space\n");
> +		ret = -1;
> +		goto close_device;
> +	}
> +
> +	/* class_id */
> +	if (pread64(vfio_dev_fd, &class, sizeof(uint32_t),
> +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> +		      PCI_CLASS_REVISION) != sizeof(uint32_t)) {
> +		RTE_LOG(ERR, EAL, "Cannot read ClassID from PCI config space\n");

These should possibly be DEBUG messages, but ERR is OK I guess. To me, filling up the log with a bunch of messages here, when the failure is also flagged and logged at a higher layer, is too many log messages. It would require us to take a look and make this cleaner.
> +		ret = -1;
> +		goto close_device;
> +	}
> +	pci_id->class_id = class >> 8;
> +
> +close_device:
> +	if (close(vfio_dev_fd) < 0) {
> +		RTE_LOG(INFO, EAL, "Error when closing VFIO device for %s\n",
> +			dev_addr);

IMO these should be ERR, DEBUG or WARNING rather than INFO, or there should be no log message at all.
> +		ret = -1;
> +	}
> +
> +close_group:
> +	if (close(vfio_group_fd) < 0) {
> +		RTE_LOG(INFO, EAL, "Error when closing VFIO group for %s\n",
> +			dev_addr);
> +		ret = -1;
> +	}
> +
> +close_container:
> +	if (close(container) < 0) {
> +		RTE_LOG(INFO, EAL, "Error when closing VFIO container\n");
> +		ret = -1;
> +	}
> +
> +out:

Jumping to 4 different exit points makes this function complex. Would it not be better to have one error exit point and test whether the fds need to be closed (with the fds initialized to -1)?
e.g.
	if (pread64(...)) {
		RTE_LOG(ERR, EAL, "Error message");
		goto err_exit;
	}

	return 0;
err_exit:
	if (vfio_dev_fd >= 0 && close(vfio_dev_fd) < 0) {
		
	}
	if (…) {
	}
	return -1;

This should eliminate the variable ret and reduce the lines of code.
> +	return ret;
> +}
> +
> +static int vfio_pci_probe(struct rte_mdev_driver *mdev_drv __rte_unused,
> +			  struct rte_mdev_device *mdev_dev)
> +{
> +	char name[RTE_UUID_STRLEN];
> +	struct rte_pci_device *dev;
> +	struct rte_bus *bus;
> +	int ret;
> +
> +	bus = rte_bus_find_by_name("pci");
> +	if (bus == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
> +		return -ENOENT;
> +	}
> +
> +	if (bus->plug == NULL) {
> +		RTE_LOG(ERR, EAL, "Function plug not supported by bus (%s)\n",
> +			bus->name);
> +		return -ENOTSUP;
> +	}
> +
> +	dev = malloc(sizeof(*dev));
> +	if (dev == NULL)
> +		return -ENOMEM;

If we are going to add error logs for the above tests, why does this one not get one?
Should we just remove them and check in the calling function instead? Then convert these to DEBUG logs or remove them.
> +
> +	memset(dev, 0, sizeof(*dev));
> +	dev->device.bus = &rte_pci_bus.bus;
> +	rte_uuid_unparse(mdev_dev->addr, name, sizeof(name));
> +
> +	if (get_pci_id(rte_mdev_get_sysfs_path(), name, &dev->id)) {
> +		free(dev);
> +		return -1;
> +	}
> +
> +	snprintf(dev->name, sizeof(dev->name), "%s", name);

This should be strlcpy()
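
A minimal sketch of why strlcpy() fits here, assuming BSD strlcpy() semantics: the copy is always NUL-terminated and the return value is the length of the source, so truncation shows up as a return value >= the buffer size. Because strlcpy() is not available in every libc, copy_name() below is a hypothetical stand-in written for illustration; in the patch the call would presumably be strlcpy(dev->name, name, sizeof(dev->name)).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in with strlcpy()-like semantics: copy at most
 * size - 1 bytes, always NUL-terminate when size > 0, and return the
 * full length of src so the caller can detect truncation.
 */
static size_t
copy_name(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size > 0) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}
```

A caller would treat copy_name(dst, src, sizeof(dst)) >= sizeof(dst) as truncation.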
> +	dev->device.name = dev->name;
> +	dev->kdrv = RTE_KDRV_VFIO;
> +	dev->use_uuid = 1;
> +	rte_uuid_copy(dev->uuid, mdev_dev->addr);
> +
> +	// TODO: dev->device.devargs, etc
> +
> +	memset(&dev->addr, -1, sizeof(dev->addr)); // XXX: TODO

I have seen in the past that TODO or FIXME is not something that should be in the code. The TODO items should be removed and tracked outside the code if they need to be done later.
> +
> +	/* device is valid, add to the list (sorted) */
> +	if (TAILQ_EMPTY(&rte_pci_bus.device_list)) {
> +		rte_pci_add_device(dev);
> +	} else {
> +		struct rte_pci_device *dev2;
> +		int ret;
> +
> +		TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) {
> +			// XXX

What does this comment mean? Remove it or explain it.
> +			ret = rte_pci_addr_cmp(&dev->addr, &dev2->addr);
> +			if (ret == 0)
> +				ret = strncmp(dev->name, dev2->name,
> +					      sizeof(dev->name));
> +			if (ret > 0)
> +				continue;
> +			if (ret < 0) {
> +				rte_pci_insert_device(dev2, dev);
> +				goto plug;
> +			}
> +			/* already registered */
> +			free(dev);
> +			return 0;
> +		}
> +
> +		rte_pci_add_device(dev);
> +	}
> +
> +plug:
> +	ret = bus->plug(&dev->device);
> +	if (ret != 0) {
> +		rte_pci_remove_device(dev);
> +		free(dev);
> +	} else {
> +		mdev_dev->private = dev;
> +	}

The coding guide states we remove {} around single line statements.
> +	return ret;
> +}
> +
> +static int vfio_pci_remove(struct rte_mdev_device *mdev_dev)
> +{
> +	struct rte_pci_device *dev = mdev_dev->private;
> +	struct rte_bus *bus;
> +	int ret;
> +
> +	if (dev == NULL)
> +		return 0;
> +
> +	bus = rte_bus_find_by_name("pci");
> +	if (bus == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
> +		return -ENOENT;
> +	}
> +
> +	if (bus->unplug == NULL) {
> +		RTE_LOG(ERR, EAL, "Function unplug not supported by bus (%s)\n",
> +			bus->name);
> +		return -ENOTSUP;
> +	}
> +
> +	ret = bus->unplug(&dev->device);
> +	if (ret == 0)
> +		mdev_dev->private = NULL;
> +
> +	return ret;
> +}
> +
> +static struct rte_mdev_driver vfio_pci_drv = {
> +	.dev_api = RTE_MDEV_DEV_API_VFIO_PCI,
> +	.probe = vfio_pci_probe,
> +	.remove = vfio_pci_remove
> +};
> +
> +RTE_MDEV_REGISTER_DRIVER(mdev_vfio_pci, vfio_pci_drv);
> diff --git a/drivers/bus/pci/meson.build b/drivers/bus/pci/meson.build
> index a3140ff97..c3e884657 100644
> --- a/drivers/bus/pci/meson.build
> +++ b/drivers/bus/pci/meson.build
> @@ -11,8 +11,10 @@ sources = files('pci_common.c',
> if host_machine.system() == 'linux'
> 	sources += files('linux/pci.c',
> 			'linux/pci_uio.c',
> -			'linux/pci_vfio.c')
> +			'linux/pci_vfio.c',
> +			'linux/pci_vfio_mdev.c')

If you need the RTE_LIBRTE_MDEV define then pci_vfio_mdev.c needs to be built conditionally?
> 	includes += include_directories('linux')
> +	deps += ['bus_mdev']

If this was added for mdev then it too should be conditional.
> else
> 	sources += files('bsd/pci.c')
> 	includes += include_directories('bsd')
> diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
> index 704b9d71a..6b47333e6 100644
> --- a/drivers/bus/pci/pci_common.c
> +++ b/drivers/bus/pci/pci_common.c
> @@ -124,21 +124,17 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
> {
> 	int ret;
> 	bool already_probed;
> -	struct rte_pci_addr *loc;
> 
> 	if ((dr == NULL) || (dev == NULL))
> 		return -EINVAL;
> 
> -	loc = &dev->addr;
> -
> 	/* The device is not blacklisted; Check if driver supports it */
> 	if (!rte_pci_match(dr, dev))
> 		/* Match of device and driver failed */
> 		return 1;
> 
> -	RTE_LOG(INFO, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
> -			loc->domain, loc->bus, loc->devid, loc->function,
> -			dev->device.numa_node);
> +	RTE_LOG(INFO, EAL, "PCI device %s on NUMA socket %i\n",
> +		dev->name, dev->device.numa_node);
> 
> 	/* no initialization when blacklisted, return without error */
> 	if (dev->device.devargs != NULL &&
> @@ -208,7 +204,6 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
> static int
> rte_pci_detach_dev(struct rte_pci_device *dev)
> {
> -	struct rte_pci_addr *loc;
> 	struct rte_pci_driver *dr;
> 	int ret = 0;
> 
> @@ -216,11 +211,9 @@ rte_pci_detach_dev(struct rte_pci_device *dev)
> 		return -EINVAL;
> 
> 	dr = dev->driver;
> -	loc = &dev->addr;
> 
> -	RTE_LOG(DEBUG, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
> -			loc->domain, loc->bus, loc->devid,
> -			loc->function, dev->device.numa_node);
> +	RTE_LOG(DEBUG, EAL, "PCI device %s on NUMA socket %i\n",
> +		dev->name, dev->device.numa_node);
> 
> 	RTE_LOG(DEBUG, EAL, "  remove driver: %x:%x %s\n", dev->id.vendor_id,
> 			dev->id.device_id, dr->driver.name);
> @@ -387,7 +380,7 @@ rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
> }
> 
> /* Remove a device from PCI bus */
> -static void
> +void
> rte_pci_remove_device(struct rte_pci_device *pci_dev)

Have not looked yet, but did this function get added to the version.map file?
Does converting a function to a public function require the experimental tag too? Maybe not.
> {
> 	TAILQ_REMOVE(&rte_pci_bus.device_list, pci_dev, next);
> diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
> index 13c3324bb..d5815ee44 100644
> --- a/drivers/bus/pci/private.h
> +++ b/drivers/bus/pci/private.h
> @@ -67,6 +67,15 @@ void rte_pci_add_device(struct rte_pci_device *pci_dev);
> void rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
> 		struct rte_pci_device *new_pci_dev);
> 
> +/**
> + * Remove a PCI device from the PCI Bus.
> + *
> + * @param pci_dev
> + *	PCI device to remove
> + * @return void
> + */
> +void rte_pci_remove_device(struct rte_pci_device *pci_dev);
> +
> /**
>  * Update a pci device object by asking the kernel for the latest information.
>  *
> diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
> index 06e004cd3..465a44935 100644
> --- a/drivers/bus/pci/rte_bus_pci.h
> +++ b/drivers/bus/pci/rte_bus_pci.h
> @@ -51,6 +51,13 @@ TAILQ_HEAD(rte_pci_driver_list, rte_pci_driver);
> 
> struct rte_devargs;
> 
> +/* It's RTE_UUID_STRLEN, which is bigger than PCI_PRI_STR_SIZE. */
> +#define RTE_PCI_NAME_LEN		(36 + 1)
> +
> +// XXX: we can't include rte_uuid.h directly due to the conflicts
> +//      introduced by stdbool.h
> +typedef unsigned char rte_uuid_t[16];

Does this need to have the string 'XXX' in the comment? 'Note' may be a better word.
> +
> /**
>  * A structure describing a PCI device.
>  */
> @@ -58,6 +65,8 @@ struct rte_pci_device {
> 	TAILQ_ENTRY(rte_pci_device) next;   /**< Next probed PCI device. */
> 	struct rte_device device;           /**< Inherit core device */
> 	struct rte_pci_addr addr;           /**< PCI location. */
> +	rte_uuid_t uuid;                    /**< Mdev location. */
> +	uint8_t use_uuid;                   /**< True if uuid field valid. */
> 	struct rte_pci_id id;               /**< PCI ID. */
> 	struct rte_mem_resource mem_resource[PCI_MAX_RESOURCE];
> 					    /**< PCI Memory Resource */
> @@ -65,7 +74,7 @@ struct rte_pci_device {
> 	struct rte_pci_driver *driver;      /**< PCI driver used in probing */
> 	uint16_t max_vfs;                   /**< sriov enable if not zero */
> 	enum rte_kernel_driver kdrv;        /**< Kernel driver passthrough */
> -	char name[PCI_PRI_STR_SIZE+1];      /**< PCI location (ASCII) */
> +	char name[RTE_PCI_NAME_LEN];        /**< PCI/Mdev location (ASCII) */
> 	struct rte_intr_handle vfio_req_intr_handle;
> 				/**< Handler of VFIO request interrupt */
> };
> -- 
> 2.17.1
> 

Regards,
Keith



* Re: [dpdk-dev] [RFC 3/3] bus/pci: add mdev support
  2019-04-03 14:13   ` Wiles, Keith
  2019-04-03 14:13     ` Wiles, Keith
@ 2019-04-04  4:19     ` Tiwei Bie
  2019-04-04  4:19       ` Tiwei Bie
  1 sibling, 1 reply; 41+ messages in thread
From: Tiwei Bie @ 2019-04-04  4:19 UTC (permalink / raw)
  To: Wiles, Keith
  Cc: dpdk-dev, Liang, Cunming, Richardson, Bruce, alejandro.lucero

On Wed, Apr 03, 2019 at 10:13:25PM +0800, Wiles, Keith wrote:
> Some minor nits.
> 
> > On Apr 3, 2019, at 2:18 AM, Tiwei Bie <tiwei.bie@intel.com> wrote:
> > 
> > This patch adds the mdev support in PCI bus driver. A mdev
> > driver is introduced to probe the mdev devices whose device
> > API is "vfio-pci" on the mdev bus.
> > 
> > PS. There are some hacks in this patch for now.
> > 
> > Signed-off-by: Cunming Liang <cunming.liang@intel.com>
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> > drivers/bus/pci/Makefile              |   3 +
> > drivers/bus/pci/linux/Makefile        |   4 +
> > drivers/bus/pci/linux/pci_vfio.c      |  35 ++-
> > drivers/bus/pci/linux/pci_vfio_mdev.c | 305 ++++++++++++++++++++++++++
> > drivers/bus/pci/meson.build           |   4 +-
> > drivers/bus/pci/pci_common.c          |  17 +-
> > drivers/bus/pci/private.h             |   9 +
> > drivers/bus/pci/rte_bus_pci.h         |  11 +-
> > 8 files changed, 370 insertions(+), 18 deletions(-)
> > create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c
> > 
> > diff --git a/drivers/bus/pci/Makefile b/drivers/bus/pci/Makefile
> > index de53ce1bf..085ec9066 100644
> > --- a/drivers/bus/pci/Makefile
> > +++ b/drivers/bus/pci/Makefile
> > @@ -27,6 +27,9 @@ CFLAGS += -DALLOW_EXPERIMENTAL_API
> 
> This define is enabled in 50-70 Makefiles, we can leave this here, but we should refactor this to a common place in the future.
> > 
> > LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
> > LDLIBS += -lrte_ethdev -lrte_pci -lrte_kvargs
> > +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
> > +LDLIBS += -lrte_bus_mdev
> > +endif
> 
> See comment below.
> > 
> > include $(RTE_SDK)/drivers/bus/pci/$(SYSTEM)/Makefile
> > SRCS-$(CONFIG_RTE_LIBRTE_PCI_BUS) := $(addprefix $(SYSTEM)/,$(SRCS))
> > diff --git a/drivers/bus/pci/linux/Makefile b/drivers/bus/pci/linux/Makefile
> > index 90404468b..88bbc2390 100644
> > --- a/drivers/bus/pci/linux/Makefile
> > +++ b/drivers/bus/pci/linux/Makefile
> > @@ -4,3 +4,7 @@
> > SRCS += pci.c
> > SRCS += pci_uio.c
> > SRCS += pci_vfio.c
> > +
> > +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
> > +	SRCS += pci_vfio_mdev.c
> > +endif
> 
> Do we need a configuration option for MDEV?
> Can it be enabled for all builds or reuse a current configuration if only for some OS or arch?

I think it's possible.

> 
[...]
> > +static int
> > +get_pci_id(const char *sysfs_base, const char *dev_addr,
> > +	   struct rte_pci_id *pci_id)
> > +{
> > +	int ret = 0;
> > +	int iommu_group_num;
> > +	int vfio_group_fd;
> > +	int vfio_dev_fd;
> > +	int container;
> > +	int class;
> > +	char name[PATH_MAX];
> > +	struct vfio_group_status group_status = {
> > +		.argsz = sizeof(group_status) };
> > +
> > +	container = open("/dev/vfio/vfio", O_RDWR);
> 
> Should this one use the VFIO_CONTAINER_PATH define in rte_vfio.h?
> The define is gated by VFIO_PRESENT in that header.

Yeah! And all the code in this file should be gated
by VFIO_PRESENT as well.

> > +	if (container < 0) {
> > +		RTE_LOG(WARNING, EAL, "Failed to open VFIO container\n");
> > +		ret = -1;
> > +		goto out;
> > +	}
> > +
> > +	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
> > +		/* Unknown API version */
> > +		RTE_LOG(WARNING, EAL, "Unknown VFIO API version\n");
> > +		ret = -1;
> > +		goto close_container;
> > +	}
> > +
> > +	if (rte_vfio_get_group_num(sysfs_base, dev_addr,
> > +				   &iommu_group_num) <= 0) {
> > +		RTE_LOG(WARNING, EAL, "%s not managed by VFIO driver\n",
> > +			dev_addr);
> > +		ret = -1;
> > +		goto close_container;
> > +	}
> > +
> > +	snprintf(name, sizeof(name), "/dev/vfio/%d", iommu_group_num);
> 
> We should be testing the return value from snprintf, but it is not done anyplace else in the code?
> We need to look at fixing this in a different patch, but not here.
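
A minimal sketch of such a check, assuming C99 snprintf() semantics (a negative return on an encoding error, otherwise the length the full string would have had). format_group_path() is a hypothetical helper for illustration, not something in the patch or in DPDK.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the VFIO group path into 'buf'.
 * Returns 0 on success, -1 on an encoding error or if the result
 * did not fit (snprintf() returned >= size, i.e. it was truncated).
 */
static int
format_group_path(char *buf, size_t size, int group_num)
{
	int n = snprintf(buf, size, "/dev/vfio/%d", group_num);

	if (n < 0 || (size_t)n >= size)
		return -1;
	return 0;
}
```

With a PATH_MAX-sized buffer the truncation branch is unlikely to trigger for this format string, but the check keeps the later open() from ever seeing a clipped path.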
> > +
> > +	vfio_group_fd = open(name, O_RDWR);
> > +	if (vfio_group_fd < 0) {
> > +		ret = -1;
> > +		goto close_container;
> > +	}
[...]
> > +	/* class_id */
> > +	if (pread64(vfio_dev_fd, &class, sizeof(uint32_t),
> > +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> > +		      PCI_CLASS_REVISION) != sizeof(uint32_t)) {
> > +		RTE_LOG(ERR, EAL, "Cannot read ClassID from PCI config space\n");
> 
> These should possibly be DEBUG messages, but ERR is OK I guess. To me, filling up the log with a bunch of messages when the failure is also flagged and logged at a higher layer is too many log messages. It would require us to look and make a cleaner
> > +		ret = -1;
> > +		goto close_device;
> > +	}
> > +	pci_id->class_id = class >> 8;
> > +
> > +close_device:
> > +	if (close(vfio_dev_fd) < 0) {
> > +		RTE_LOG(INFO, EAL, "Error when closing VFIO device for %s\n",
> > +			dev_addr);
> 
> These should be ERR, DEBUG or WARN, not INFO, IMO, or there should be no log message at all.

You are right. I missed this one.

> > +		ret = -1;
> > +	}
> > +
> > +close_group:
> > +	if (close(vfio_group_fd) < 0) {
> > +		RTE_LOG(INFO, EAL, "Error when closing VFIO group for %s\n",
> > +			dev_addr);
> > +		ret = -1;
> > +	}
> > +
> > +close_container:
> > +	if (close(container) < 0) {
> > +		RTE_LOG(INFO, EAL, "Error when closing VFIO container\n");
> > +		ret = -1;
> > +	}
> > +
> > +out:
> 
> Jumping to four different exit points makes this function complex. Would it not be better to have one error exit point and test whether the fds need to be closed?
> e.g.
> 	if (pread64(...)) {
> 		RTE_LOG(ERR, EAL, "Error message");
> 		goto err_exit;
> 	}
> 
> 	return 0;
> err_exit:
> 	if (vfio_dev_fd && close(vfio_dev_fd) < 0) {
> 		
> 	}
> 	if (…) {
> 	}
> 	return -1;
> 
> This should eliminate the variable ret and reduce the lines of code.

Thanks for the suggestion. I can do that.

> > +	return ret;
> > +}
> > +
> > +static int vfio_pci_probe(struct rte_mdev_driver *mdev_drv __rte_unused,
> > +			  struct rte_mdev_device *mdev_dev)
> > +{
> > +	char name[RTE_UUID_STRLEN];
> > +	struct rte_pci_device *dev;
> > +	struct rte_bus *bus;
> > +	int ret;
> > +
> > +	bus = rte_bus_find_by_name("pci");
> > +	if (bus == NULL) {
> > +		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
> > +		return -ENOENT;
> > +	}
> > +
> > +	if (bus->plug == NULL) {
> > +		RTE_LOG(ERR, EAL, "Function plug not supported by bus (%s)\n",
> > +			bus->name);
> > +		return -ENOTSUP;
> > +	}
> > +
> > +	dev = malloc(sizeof(*dev));
> > +	if (dev == NULL)
> > +		return -ENOMEM;
> 
> If we are going to add error logs for the above tests, why does this one not get one?
> Should we just remove them and check in the calling function instead? Then convert these to DEBUG logs or remove them.

Thanks for the suggestion. Will improve the logs.

> > +
> > +	memset(dev, 0, sizeof(*dev));
> > +	dev->device.bus = &rte_pci_bus.bus;
> > +	rte_uuid_unparse(mdev_dev->addr, name, sizeof(name));
> > +
> > +	if (get_pci_id(rte_mdev_get_sysfs_path(), name, &dev->id)) {
> > +		free(dev);
> > +		return -1;
> > +	}
> > +
> > +	snprintf(dev->name, sizeof(dev->name), "%s", name);
> 
> This should be strlcpy()
> > +	dev->device.name = dev->name;
> > +	dev->kdrv = RTE_KDRV_VFIO;
> > +	dev->use_uuid = 1;
> > +	rte_uuid_copy(dev->uuid, mdev_dev->addr);
> > +
> > +	// TODO: dev->device.devargs, etc
> > +
> > +	memset(&dev->addr, -1, sizeof(dev->addr)); // XXX: TODO
> 
> I have seen in the past that TODO or FIXME is not something that should be in the code. The TODO items should be removed and tracked outside the code if they need to be done later.

Sorry for the confusion. There are some quick hacks in this RFC
(especially in this function). I highlighted them with XXX or TODO.
I didn't get rid of them for now, because there are different
possible ways to add mdev support in DPDK, and this RFC is just
to demonstrate one possible way to do it and to hear people's
thoughts/opinions.

PS. All the hacks (including comments starting with //) in this RFC
are temporary. They will be fixed or removed in the formal patch.

> > +
> > +	/* device is valid, add to the list (sorted) */
> > +	if (TAILQ_EMPTY(&rte_pci_bus.device_list)) {
> > +		rte_pci_add_device(dev);
> > +	} else {
> > +		struct rte_pci_device *dev2;
> > +		int ret;
> > +
> > +		TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) {
> > +			// XXX
> 
> What does this comment mean? Remove it or explain it.

It's to indicate that there is a quick hack here. It won't
exist in the formal patch.

> > +			ret = rte_pci_addr_cmp(&dev->addr, &dev2->addr);
> > +			if (ret == 0)
> > +				ret = strncmp(dev->name, dev2->name,
> > +					      sizeof(dev->name));
> > +			if (ret > 0)
> > +				continue;
> > +			if (ret < 0) {
> > +				rte_pci_insert_device(dev2, dev);
> > +				goto plug;
> > +			}
> > +			/* already registered */
> > +			free(dev);
> > +			return 0;
> > +		}
> > +
> > +		rte_pci_add_device(dev);
> > +	}
> > +
> > +plug:
> > +	ret = bus->plug(&dev->device);
> > +	if (ret != 0) {
> > +		rte_pci_remove_device(dev);
> > +		free(dev);
> > +	} else {
> > +		mdev_dev->private = dev;
> > +	}
> 
> The coding guide states we remove {} around single line statements.
> > +	return ret;
> > +}
> > +
> > +static int vfio_pci_remove(struct rte_mdev_device *mdev_dev)
> > +{
> > +	struct rte_pci_device *dev = mdev_dev->private;
> > +	struct rte_bus *bus;
> > +	int ret;
> > +
> > +	if (dev == NULL)
> > +		return 0;
> > +
> > +	bus = rte_bus_find_by_name("pci");
> > +	if (bus == NULL) {
> > +		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
> > +		return -ENOENT;
> > +	}
> > +
> > +	if (bus->unplug == NULL) {
> > +		RTE_LOG(ERR, EAL, "Function unplug not supported by bus (%s)\n",
> > +			bus->name);
> > +		return -ENOTSUP;
> > +	}
> > +
> > +	ret = bus->unplug(&dev->device);
> > +	if (ret == 0)
> > +		mdev_dev->private = NULL;
> > +
> > +	return ret;
> > +}
> > +
> > +static struct rte_mdev_driver vfio_pci_drv = {
> > +	.dev_api = RTE_MDEV_DEV_API_VFIO_PCI,
> > +	.probe = vfio_pci_probe,
> > +	.remove = vfio_pci_remove
> > +};
> > +
> > +RTE_MDEV_REGISTER_DRIVER(mdev_vfio_pci, vfio_pci_drv);
> > diff --git a/drivers/bus/pci/meson.build b/drivers/bus/pci/meson.build
> > index a3140ff97..c3e884657 100644
> > --- a/drivers/bus/pci/meson.build
> > +++ b/drivers/bus/pci/meson.build
> > @@ -11,8 +11,10 @@ sources = files('pci_common.c',
> > if host_machine.system() == 'linux'
> > 	sources += files('linux/pci.c',
> > 			'linux/pci_uio.c',
> > -			'linux/pci_vfio.c')
> > +			'linux/pci_vfio.c',
> > +			'linux/pci_vfio_mdev.c')
> 
> If you need the RTE_LIBRTE_MDEV define then pci_vfio_mdev.c needs to be built conditionally?
> > 	includes += include_directories('linux')
> > +	deps += ['bus_mdev']
> 
> If this was added for mdev then it too should be conditional.

Yeah. There should be a check of dpdk_conf.has('...')

> > else
> > 	sources += files('bsd/pci.c')
> > 	includes += include_directories('bsd')
> > diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
> > index 704b9d71a..6b47333e6 100644
> > --- a/drivers/bus/pci/pci_common.c
> > +++ b/drivers/bus/pci/pci_common.c
> > @@ -124,21 +124,17 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
> > {
> > 	int ret;
> > 	bool already_probed;
> > -	struct rte_pci_addr *loc;
> > 
> > 	if ((dr == NULL) || (dev == NULL))
> > 		return -EINVAL;
> > 
> > -	loc = &dev->addr;
> > -
> > 	/* The device is not blacklisted; Check if driver supports it */
> > 	if (!rte_pci_match(dr, dev))
> > 		/* Match of device and driver failed */
> > 		return 1;
> > 
> > -	RTE_LOG(INFO, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
> > -			loc->domain, loc->bus, loc->devid, loc->function,
> > -			dev->device.numa_node);
> > +	RTE_LOG(INFO, EAL, "PCI device %s on NUMA socket %i\n",
> > +		dev->name, dev->device.numa_node);
> > 
> > 	/* no initialization when blacklisted, return without error */
> > 	if (dev->device.devargs != NULL &&
> > @@ -208,7 +204,6 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
> > static int
> > rte_pci_detach_dev(struct rte_pci_device *dev)
> > {
> > -	struct rte_pci_addr *loc;
> > 	struct rte_pci_driver *dr;
> > 	int ret = 0;
> > 
> > @@ -216,11 +211,9 @@ rte_pci_detach_dev(struct rte_pci_device *dev)
> > 		return -EINVAL;
> > 
> > 	dr = dev->driver;
> > -	loc = &dev->addr;
> > 
> > -	RTE_LOG(DEBUG, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
> > -			loc->domain, loc->bus, loc->devid,
> > -			loc->function, dev->device.numa_node);
> > +	RTE_LOG(DEBUG, EAL, "PCI device %s on NUMA socket %i\n",
> > +		dev->name, dev->device.numa_node);
> > 
> > 	RTE_LOG(DEBUG, EAL, "  remove driver: %x:%x %s\n", dev->id.vendor_id,
> > 			dev->id.device_id, dr->driver.name);
> > @@ -387,7 +380,7 @@ rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
> > }
> > 
> > /* Remove a device from PCI bus */
> > -static void
> > +void
> > rte_pci_remove_device(struct rte_pci_device *pci_dev)
> 
> Have not looked yet, but did this function get added to the version.map file?
> Does converting a function to a public function require the experimental tag too? Maybe not.

This is just to make it a global function declared in private.h,
so that we can call it from other C files inside PCI bus.

> > {
> > 	TAILQ_REMOVE(&rte_pci_bus.device_list, pci_dev, next);
> > diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
> > index 13c3324bb..d5815ee44 100644
> > --- a/drivers/bus/pci/private.h
> > +++ b/drivers/bus/pci/private.h
> > @@ -67,6 +67,15 @@ void rte_pci_add_device(struct rte_pci_device *pci_dev);
> > void rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
> > 		struct rte_pci_device *new_pci_dev);
> > 
> > +/**
> > + * Remove a PCI device from the PCI Bus.
> > + *
> > + * @param pci_dev
> > + *	PCI device to remove
> > + * @return void
> > + */
> > +void rte_pci_remove_device(struct rte_pci_device *pci_dev);
> > +
> > /**
> >  * Update a pci device object by asking the kernel for the latest information.
> >  *
> > diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
> > index 06e004cd3..465a44935 100644
> > --- a/drivers/bus/pci/rte_bus_pci.h
> > +++ b/drivers/bus/pci/rte_bus_pci.h
> > @@ -51,6 +51,13 @@ TAILQ_HEAD(rte_pci_driver_list, rte_pci_driver);
> > 
> > struct rte_devargs;
> > 
> > +/* It's RTE_UUID_STRLEN, which is bigger than PCI_PRI_STR_SIZE. */
> > +#define RTE_PCI_NAME_LEN		(36 + 1)
> > +
> > +// XXX: we can't include rte_uuid.h directly due to the conflicts
> > +//      introduced by stdbool.h
> > +typedef unsigned char rte_uuid_t[16];
> 
> Does this need to have the string 'XXX' in the comment? 'Note' may be a better word.

OK.

Thanks for the reviews/suggestions! Do appreciate it!

Regards,
Tiwei


* Re: [dpdk-dev] [RFC 3/3] bus/pci: add mdev support
  2019-04-04  4:19     ` Tiwei Bie
@ 2019-04-04  4:19       ` Tiwei Bie
  0 siblings, 0 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-04-04  4:19 UTC (permalink / raw)
  To: Wiles, Keith
  Cc: dpdk-dev, Liang, Cunming, Richardson, Bruce, alejandro.lucero

On Wed, Apr 03, 2019 at 10:13:25PM +0800, Wiles, Keith wrote:
> Some minor nits.
> 
> > On Apr 3, 2019, at 2:18 AM, Tiwei Bie <tiwei.bie@intel.com> wrote:
> > 
> > This patch adds the mdev support in PCI bus driver. A mdev
> > driver is introduced to probe the mdev devices whose device
> > API is "vfio-pci" on the mdev bus.
> > 
> > PS. There are some hacks in this patch for now.
> > 
> > Signed-off-by: Cunming Liang <cunming.liang@intel.com>
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> > drivers/bus/pci/Makefile              |   3 +
> > drivers/bus/pci/linux/Makefile        |   4 +
> > drivers/bus/pci/linux/pci_vfio.c      |  35 ++-
> > drivers/bus/pci/linux/pci_vfio_mdev.c | 305 ++++++++++++++++++++++++++
> > drivers/bus/pci/meson.build           |   4 +-
> > drivers/bus/pci/pci_common.c          |  17 +-
> > drivers/bus/pci/private.h             |   9 +
> > drivers/bus/pci/rte_bus_pci.h         |  11 +-
> > 8 files changed, 370 insertions(+), 18 deletions(-)
> > create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c
> > 
> > diff --git a/drivers/bus/pci/Makefile b/drivers/bus/pci/Makefile
> > index de53ce1bf..085ec9066 100644
> > --- a/drivers/bus/pci/Makefile
> > +++ b/drivers/bus/pci/Makefile
> > @@ -27,6 +27,9 @@ CFLAGS += -DALLOW_EXPERIMENTAL_API
> 
> This define is enabled in 50-70 Makefiles, we can leave this here, but we should refactor this to a common place in the future.
> > 
> > LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
> > LDLIBS += -lrte_ethdev -lrte_pci -lrte_kvargs
> > +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
> > +LDLIBS += -lrte_bus_mdev
> > +endif
> 
> See comment below.
> > 
> > include $(RTE_SDK)/drivers/bus/pci/$(SYSTEM)/Makefile
> > SRCS-$(CONFIG_RTE_LIBRTE_PCI_BUS) := $(addprefix $(SYSTEM)/,$(SRCS))
> > diff --git a/drivers/bus/pci/linux/Makefile b/drivers/bus/pci/linux/Makefile
> > index 90404468b..88bbc2390 100644
> > --- a/drivers/bus/pci/linux/Makefile
> > +++ b/drivers/bus/pci/linux/Makefile
> > @@ -4,3 +4,7 @@
> > SRCS += pci.c
> > SRCS += pci_uio.c
> > SRCS += pci_vfio.c
> > +
> > +ifeq ($(CONFIG_RTE_LIBRTE_MDEV_BUS),y)
> > +	SRCS += pci_vfio_mdev.c
> > +endif
> 
> Do we need a configuration option for MDEV?
> Can it be enabled for all builds or reuse a current configuration if only for some OS or arch?

I think it's possible.

> 
[...]
> > +static int
> > +get_pci_id(const char *sysfs_base, const char *dev_addr,
> > +	   struct rte_pci_id *pci_id)
> > +{
> > +	int ret = 0;
> > +	int iommu_group_num;
> > +	int vfio_group_fd;
> > +	int vfio_dev_fd;
> > +	int container;
> > +	int class;
> > +	char name[PATH_MAX];
> > +	struct vfio_group_status group_status = {
> > +		.argsz = sizeof(group_status) };
> > +
> > +	container = open("/dev/vfio/vfio", O_RDWR);
> 
> Should this one use the VFIO_CONTAINER_PATH define in rte_vfio.h?
> The define is gated by VFIO_PRESENT in that header.

Yeah! And all the code in this file should be gated
by VFIO_PRESENT as well.

> > +	if (container < 0) {
> > +		RTE_LOG(WARNING, EAL, "Failed to open VFIO container\n");
> > +		ret = -1;
> > +		goto out;
> > +	}
> > +
> > +	if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
> > +		/* Unknown API version */
> > +		RTE_LOG(WARNING, EAL, "Unknown VFIO API version\n");
> > +		ret = -1;
> > +		goto close_container;
> > +	}
> > +
> > +	if (rte_vfio_get_group_num(sysfs_base, dev_addr,
> > +				   &iommu_group_num) <= 0) {
> > +		RTE_LOG(WARNING, EAL, "%s not managed by VFIO driver\n",
> > +			dev_addr);
> > +		ret = -1;
> > +		goto close_container;
> > +	}
> > +
> > +	snprintf(name, sizeof(name), "/dev/vfio/%d", iommu_group_num);
> 
> We should be testing the return value from snprintf, but it is not done anyplace else in the code?
> We need to look at fixing this in a different patch, but not here.
> > +
> > +	vfio_group_fd = open(name, O_RDWR);
> > +	if (vfio_group_fd < 0) {
> > +		ret = -1;
> > +		goto close_container;
> > +	}
[...]
> > +	/* class_id */
> > +	if (pread64(vfio_dev_fd, &class, sizeof(uint32_t),
> > +		      VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
> > +		      PCI_CLASS_REVISION) != sizeof(uint32_t)) {
> > +		RTE_LOG(ERR, EAL, "Cannot read ClassID from PCI config space\n”);
> 
> These should possible be DEBUG messages, but ERR is ok I guess. To me filling up the log with a bunch of messages when it is also flagged and log at a higher layer to many log messages. It would require us to look and make a cleaner
> > +		ret = -1;
> > +		goto close_device;
> > +	}
> > +	pci_id->class_id = class >> 8;
> > +
> > +close_device:
> > +	if (close(vfio_dev_fd) < 0) {
> > +		RTE_LOG(INFO, EAL, "Error when closing VFIO device for %s\n",
> > +			dev_addr);
> 
> These should be ERR, DEBUG or WARN not INFO IMO or no log message at all.

You are right. I missed this one.

> > +		ret = -1;
> > +	}
> > +
> > +close_group:
> > +	if (close(vfio_group_fd) < 0) {
> > +		RTE_LOG(INFO, EAL, "Error when closing VFIO group for %s\n",
> > +			dev_addr);
> > +		ret = -1;
> > +	}
> > +
> > +close_container:
> > +	if (close(container) < 0) {
> > +		RTE_LOG(INFO, EAL, "Error when closing VFIO container\n");
> > +		ret = -1;
> > +	}
> > +
> > +out:
> 
> Jumping to 4 different exit points makes this function complex, would it not be better to have one error exit point and test if the fds need to be closed
> e.g.
> 	if (pread64(...)) {
> 		RTE_LOG(ERR, EAL, “Error message”);
> 		goto err_exit;
> 	}
> 
> 	return 0;
> err_exit:
> 	if (vfio_dev_fd && close(vfio_dev_fd) < 0) {
> 		
> 	}
> 	if (…) {
> 	}
> 	return -1;
> 
> This should eliminate the variable ret and reduce the lines of code.

Thanks for the suggestion. I can do that.

> > +	return ret;
> > +}
> > +
> > +static int vfio_pci_probe(struct rte_mdev_driver *mdev_drv __rte_unused,
> > +			  struct rte_mdev_device *mdev_dev)
> > +{
> > +	char name[RTE_UUID_STRLEN];
> > +	struct rte_pci_device *dev;
> > +	struct rte_bus *bus;
> > +	int ret;
> > +
> > +	bus = rte_bus_find_by_name("pci");
> > +	if (bus == NULL) {
> > +		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
> > +		return -ENOENT;
> > +	}
> > +
> > +	if (bus->plug == NULL) {
> > +		RTE_LOG(ERR, EAL, "Function plug not supported by bus (%s)\n",
> > +			bus->name);
> > +		return -ENOTSUP;
> > +	}
> > +
> > +	dev = malloc(sizeof(*dev));
> > +	if (dev == NULL)
> > +		return -ENOMEM;
> 
> If going to add error logs for the above tests, why does this one not get one?
> Should we just remove them and check in the calling function instead? Then convert these to DEBUG logs or remove them.

Thanks for the suggestion. Will improve the logs.

> > +
> > +	memset(dev, 0, sizeof(*dev));
> > +	dev->device.bus = &rte_pci_bus.bus;
> > +	rte_uuid_unparse(mdev_dev->addr, name, sizeof(name));
> > +
> > +	if (get_pci_id(rte_mdev_get_sysfs_path(), name, &dev->id)) {
> > +		free(dev);
> > +		return -1;
> > +	}
> > +
> > +	snprintf(dev->name, sizeof(dev->name), "%s", name);
> 
> This should be strlcpy()
> > +	dev->device.name = dev->name;
> > +	dev->kdrv = RTE_KDRV_VFIO;
> > +	dev->use_uuid = 1;
> > +	rte_uuid_copy(dev->uuid, mdev_dev->addr);
> > +
> > +	// TODO: dev->device.devargs, etc
> > +
> > +	memset(&dev->addr, -1, sizeof(dev->addr)); // XXX: TODO
> 
> I have seen in the past that TODO or FIXME is not something that should be in the code. The TODO items should be removed and tracked outside the code if needed to be done later.

Sorry for the confusion. There are some quick hacks in this RFC
(especially in this function). I highlighted them with XXX or TODO.
I didn't get rid of them for now, because there are different
possible ways to add the mdev support in DPDK, and this RFC is just
to demonstrate one possible way that we can do and to hear people's
thoughts/opinions.

PS. All the hacks (including comments starting with //) in this RFC
are temporary. They will be fixed or removed in the formal patch.

> > +
> > +	/* device is valid, add to the list (sorted) */
> > +	if (TAILQ_EMPTY(&rte_pci_bus.device_list)) {
> > +		rte_pci_add_device(dev);
> > +	} else {
> > +		struct rte_pci_device *dev2;
> > +		int ret;
> > +
> > +		TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) {
> > +			// XXX
> 
> What does this comment mean? remove it or explain it.

It's to indicate that there is a quick hack here. It won't
exist in the formal patch.
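
The sorted insertion that this loop performs can be sketched on its own.
The following is a minimal standalone example, not the code from the RFC:
struct demo_dev and demo_insert_sorted() are hypothetical names, the list
is ordered by name only (the RFC compares the PCI address first), and the
empty-list case is folded into the fall-through append.

```c
#include <assert.h>
#include <string.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct rte_pci_device: a name plus list linkage. */
struct demo_dev {
	TAILQ_ENTRY(demo_dev) next;
	char name[37];
};

TAILQ_HEAD(demo_dev_list, demo_dev);

/*
 * Insert dev so that the list stays sorted by name.
 * Returns 1 if dev was inserted, 0 if an entry with the same name is
 * already on the list (the caller keeps ownership of dev in that case).
 */
static int
demo_insert_sorted(struct demo_dev_list *list, struct demo_dev *dev)
{
	struct demo_dev *cur;

	assert(list != NULL && dev != NULL);

	TAILQ_FOREACH(cur, list, next) {
		int ret = strncmp(dev->name, cur->name, sizeof(dev->name));

		if (ret > 0)
			continue;	/* dev sorts after cur: keep walking */
		if (ret == 0)
			return 0;	/* already registered */
		TAILQ_INSERT_BEFORE(cur, dev, next);
		return 1;
	}

	/* Empty list, or dev sorts after every existing entry. */
	TAILQ_INSERT_TAIL(list, dev, next);
	return 1;
}
```

Because TAILQ_FOREACH simply does not iterate over an empty list, a
separate TAILQ_EMPTY() branch is not strictly needed in this sketch.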

> > +			ret = rte_pci_addr_cmp(&dev->addr, &dev2->addr);
> > +			if (ret == 0)
> > +				ret = strncmp(dev->name, dev2->name,
> > +					      sizeof(dev->name));
> > +			if (ret > 0)
> > +				continue;
> > +			if (ret < 0) {
> > +				rte_pci_insert_device(dev2, dev);
> > +				goto plug;
> > +			}
> > +			/* already registered */
> > +			free(dev);
> > +			return 0;
> > +		}
> > +
> > +		rte_pci_add_device(dev);
> > +	}
> > +
> > +plug:
> > +	ret = bus->plug(&dev->device);
> > +	if (ret != 0) {
> > +		rte_pci_remove_device(dev);
> > +		free(dev);
> > +	} else {
> > +		mdev_dev->private = dev;
> > +	}
> 
> The coding guide states we remove {} around single line statements.
> > +	return ret;
> > +}
> > +
> > +static int vfio_pci_remove(struct rte_mdev_device *mdev_dev)
> > +{
> > +	struct rte_pci_device *dev = mdev_dev->private;
> > +	struct rte_bus *bus;
> > +	int ret;
> > +
> > +	if (dev == NULL)
> > +		return 0;
> > +
> > +	bus = rte_bus_find_by_name("pci");
> > +	if (bus == NULL) {
> > +		RTE_LOG(ERR, EAL, "Cannot find bus pci\n");
> > +		return -ENOENT;
> > +	}
> > +
> > +	if (bus->unplug == NULL) {
> > +		RTE_LOG(ERR, EAL, "Function unplug not supported by bus (%s)\n",
> > +			bus->name);
> > +		return -ENOTSUP;
> > +	}
> > +
> > +	ret = bus->unplug(&dev->device);
> > +	if (ret == 0)
> > +		mdev_dev->private = NULL;
> > +
> > +	return ret;
> > +}
> > +
> > +static struct rte_mdev_driver vfio_pci_drv = {
> > +	.dev_api = RTE_MDEV_DEV_API_VFIO_PCI,
> > +	.probe = vfio_pci_probe,
> > +	.remove = vfio_pci_remove
> > +};
> > +
> > +RTE_MDEV_REGISTER_DRIVER(mdev_vfio_pci, vfio_pci_drv);
> > diff --git a/drivers/bus/pci/meson.build b/drivers/bus/pci/meson.build
> > index a3140ff97..c3e884657 100644
> > --- a/drivers/bus/pci/meson.build
> > +++ b/drivers/bus/pci/meson.build
> > @@ -11,8 +11,10 @@ sources = files('pci_common.c',
> > if host_machine.system() == 'linux'
> > 	sources += files('linux/pci.c',
> > 			'linux/pci_uio.c',
> > -			'linux/pci_vfio.c')
> > +			'linux/pci_vfio.c',
> > +			'linux/pci_vfio_mdev.c')
> 
> If you need the RTE_LIBRTE_MDEV define then pci_vfio_mdev.c needs to be built conditionally?
> > 	includes += include_directories('linux')
> > 	deps += ['bus_mdev']
> 
> If this was added for mdev then it too should be conditional.

Yeah. There should be a check of dpdk_conf.has('...')

> > else
> > 	sources += files('bsd/pci.c')
> > 	includes += include_directories('bsd')
> > diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
> > index 704b9d71a..6b47333e6 100644
> > --- a/drivers/bus/pci/pci_common.c
> > +++ b/drivers/bus/pci/pci_common.c
> > @@ -124,21 +124,17 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
> > {
> > 	int ret;
> > 	bool already_probed;
> > -	struct rte_pci_addr *loc;
> > 
> > 	if ((dr == NULL) || (dev == NULL))
> > 		return -EINVAL;
> > 
> > -	loc = &dev->addr;
> > -
> > 	/* The device is not blacklisted; Check if driver supports it */
> > 	if (!rte_pci_match(dr, dev))
> > 		/* Match of device and driver failed */
> > 		return 1;
> > 
> > -	RTE_LOG(INFO, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
> > -			loc->domain, loc->bus, loc->devid, loc->function,
> > -			dev->device.numa_node);
> > +	RTE_LOG(INFO, EAL, "PCI device %s on NUMA socket %i\n",
> > +		dev->name, dev->device.numa_node);
> > 
> > 	/* no initialization when blacklisted, return without error */
> > 	if (dev->device.devargs != NULL &&
> > @@ -208,7 +204,6 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
> > static int
> > rte_pci_detach_dev(struct rte_pci_device *dev)
> > {
> > -	struct rte_pci_addr *loc;
> > 	struct rte_pci_driver *dr;
> > 	int ret = 0;
> > 
> > @@ -216,11 +211,9 @@ rte_pci_detach_dev(struct rte_pci_device *dev)
> > 		return -EINVAL;
> > 
> > 	dr = dev->driver;
> > -	loc = &dev->addr;
> > 
> > -	RTE_LOG(DEBUG, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
> > -			loc->domain, loc->bus, loc->devid,
> > -			loc->function, dev->device.numa_node);
> > +	RTE_LOG(DEBUG, EAL, "PCI device %s on NUMA socket %i\n",
> > +		dev->name, dev->device.numa_node);
> > 
> > 	RTE_LOG(DEBUG, EAL, "  remove driver: %x:%x %s\n", dev->id.vendor_id,
> > 			dev->id.device_id, dr->driver.name);
> > @@ -387,7 +380,7 @@ rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
> > }
> > 
> > /* Remove a device from PCI bus */
> > -static void
> > +void
> > rte_pci_remove_device(struct rte_pci_device *pci_dev)
> 
> Have not looked yet, but did this function get added to the version.map file?
> Does converting a function to public function require experimental tag too, maybe not?

This is just to make it a global function declared in private.h,
so that we can call it from other C files inside PCI bus.

> > {
> > 	TAILQ_REMOVE(&rte_pci_bus.device_list, pci_dev, next);
> > diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
> > index 13c3324bb..d5815ee44 100644
> > --- a/drivers/bus/pci/private.h
> > +++ b/drivers/bus/pci/private.h
> > @@ -67,6 +67,15 @@ void rte_pci_add_device(struct rte_pci_device *pci_dev);
> > void rte_pci_insert_device(struct rte_pci_device *exist_pci_dev,
> > 		struct rte_pci_device *new_pci_dev);
> > 
> > +/**
> > + * Remove a PCI device from the PCI Bus.
> > + *
> > + * @param pci_dev
> > + *	PCI device to remove
> > + * @return void
> > + */
> > +void rte_pci_remove_device(struct rte_pci_device *pci_dev);
> > +
> > /**
> >  * Update a pci device object by asking the kernel for the latest information.
> >  *
> > diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
> > index 06e004cd3..465a44935 100644
> > --- a/drivers/bus/pci/rte_bus_pci.h
> > +++ b/drivers/bus/pci/rte_bus_pci.h
> > @@ -51,6 +51,13 @@ TAILQ_HEAD(rte_pci_driver_list, rte_pci_driver);
> > 
> > struct rte_devargs;
> > 
> > +/* It's RTE_UUID_STRLEN, which is bigger than PCI_PRI_STR_SIZE. */
> > +#define RTE_PCI_NAME_LEN		(36 + 1)
> > +
> > +// XXX: we can't include rte_uuid.h directly due to the conflicts
> > +//      introduced by stdbool.h
> > +typedef unsigned char rte_uuid_t[16];
> 
> Does this need to have the string 'XXX' in the comment? 'Note' may be a better word.

OK.

Thanks for the reviews/suggestions! Do appreciate it!

Regards,
Tiwei

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK
  2019-04-03  7:18 [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Tiwei Bie
                   ` (3 preceding siblings ...)
  2019-04-03  7:18 ` [dpdk-dev] [RFC 3/3] bus/pci: add mdev support Tiwei Bie
@ 2019-04-08  8:44 ` Alejandro Lucero
  2019-04-08  8:44   ` Alejandro Lucero
  2019-04-08  9:36   ` Tiwei Bie
  2019-07-15  7:52 ` [dpdk-dev] [RFC v2 0/5] " Tiwei Bie
  5 siblings, 2 replies; 41+ messages in thread
From: Alejandro Lucero @ 2019-04-08  8:44 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: dev, Liang, Cunming, Bruce Richardson

On Wed, Apr 3, 2019 at 8:19 AM Tiwei Bie <tiwei.bie@intel.com> wrote:

> Hi everyone,
>
> This is a draft implementation of the mdev (Mediated device [1])
> bus support in DPDK. Mdev is a way to virtualize devices in Linux
> kernel. Based on the device-api (mdev_type/device_api), there could
> be different types of mdev devices (e.g. vfio-pci). In this RFC,
> one mdev bus is introduced to scan the mdev devices in the system
> and do the probe based on the device-api.
>
> Take the mdev devices whose device-api is "vfio-pci" as an example,
> in this RFC, these devices will be probed by a mdev driver provided
> by PCI bus, which will plug them to the PCI bus. And they will be
> probed with the drivers registered on the PCI bus based on VendorID/
> DeviceID/... then.
>
>                      +----------+
>                      | mdev bus |
>                      +----+-----+
>                           |
>          +----------------+----+------+------+
>          |                     |      |      |
>    mdev_vfio_pci               ......
> (device-api: vfio-pci)
>
> There are also other ways to add mdev device support in DPDK (e.g.
> let PCI bus scan /sys/bus/mdev/devices directly). Comments would be
> appreciated!
>
>
Hi Tiwei,

Thanks for the patchset. I was close to sending a patchset with the same mdev
support, but I'm glad to see your patchset first because I think it is
interesting to see another view of how to implement this.

After going through your patch I was a bit confused about how the mdev
device to mdev driver match was done. But then I realized the approach you
are following is different to my implementation, likely due to having
different purposes. If I understand the idea behind it, you want to have
the same PCI PMD drivers working with PCI devices created from mediated
devices. That is the reason there is just one mdev driver, the one for
the vfio-pci mediated device type.
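
The matching described above, where an mdev device is paired with an mdev
driver by its device-api string, can be sketched roughly as follows. This
is only an illustration under stated assumptions: the demo_* types and
functions are hypothetical and simplified, and are not the rte_mdev_*
definitions from the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types: not the rte_mdev_* definitions. */
struct demo_mdev_device {
	const char *uuid;
	const char *device_api;	/* e.g. contents of mdev_type/device_api */
};

struct demo_mdev_driver {
	const char *device_api;
	int (*probe)(struct demo_mdev_device *dev);
};

static int
demo_vfio_pci_probe(struct demo_mdev_device *dev)
{
	(void)dev;	/* a real driver would plug the device into the PCI bus */
	return 0;
}

/* A single registered driver, the one handling "vfio-pci" devices. */
static const struct demo_mdev_driver demo_drivers[] = {
	{ .device_api = "vfio-pci", .probe = demo_vfio_pci_probe },
};

/* Return the driver whose device_api matches the device, or NULL if none. */
static const struct demo_mdev_driver *
demo_mdev_match(const struct demo_mdev_device *dev)
{
	size_t i;

	assert(dev != NULL && dev->device_api != NULL);

	for (i = 0; i < sizeof(demo_drivers) / sizeof(demo_drivers[0]); i++) {
		if (strcmp(dev->device_api, demo_drivers[i].device_api) == 0)
			return &demo_drivers[i];
	}
	return NULL;
}
```

In this sketch the PCI PMDs never register with the mdev layer at all; the
one "vfio-pci" entry stands in for the bridge that hands devices over to
the PCI bus, where the usual VendorID/DeviceID match would then take place.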

My approach was different and I thought having specific PMD mdev support was
necessary, with the PMD required to register a mdev driver. I can see,
after reading your patch, that it is perfectly possible to have the same
PMDs for "pure" PCI devices and PCI devices made from mediated devices, and,
if the PMD needs to do something different due to the mediated device's
intrinsics, to support that explicitly per PMD. I have specific ioctl
calls between the PMD and the mediating driver, but this can also be done
with your approach.

I'm working on having a mediated PF, which is a different purpose than the
Intel scalable I/O idea, so I will merge this patchset with my code and see
if it works.

Thanks!


> [1]
> https://github.com/torvalds/linux/blob/master/Documentation/vfio-mediated-device.txt
>
> Thanks,
> Tiwei
>
> Tiwei Bie (3):
>   eal: add a helper for reading string from sysfs
>   bus/mdev: add mdev bus support
>   bus/pci: add mdev support
>
>  config/common_base                        |   5 +
>  config/common_linux                       |   1 +
>  drivers/bus/Makefile                      |   1 +
>  drivers/bus/mdev/Makefile                 |  41 +++
>  drivers/bus/mdev/linux/Makefile           |   6 +
>  drivers/bus/mdev/linux/mdev.c             | 117 ++++++++
>  drivers/bus/mdev/mdev.c                   | 310 ++++++++++++++++++++++
>  drivers/bus/mdev/meson.build              |  15 ++
>  drivers/bus/mdev/private.h                |  90 +++++++
>  drivers/bus/mdev/rte_bus_mdev.h           | 141 ++++++++++
>  drivers/bus/mdev/rte_bus_mdev_version.map |  12 +
>  drivers/bus/meson.build                   |   2 +-
>  drivers/bus/pci/Makefile                  |   3 +
>  drivers/bus/pci/linux/Makefile            |   4 +
>  drivers/bus/pci/linux/pci_vfio.c          |  35 ++-
>  drivers/bus/pci/linux/pci_vfio_mdev.c     | 305 +++++++++++++++++++++
>  drivers/bus/pci/meson.build               |   4 +-
>  drivers/bus/pci/pci_common.c              |  17 +-
>  drivers/bus/pci/private.h                 |   9 +
>  drivers/bus/pci/rte_bus_pci.h             |  11 +-
>  lib/librte_eal/common/eal_filesystem.h    |   7 +
>  lib/librte_eal/freebsd/eal/eal.c          |  22 ++
>  lib/librte_eal/linux/eal/eal.c            |  22 ++
>  lib/librte_eal/rte_eal_version.map        |   1 +
>  mk/rte.app.mk                             |   1 +
>  25 files changed, 1163 insertions(+), 19 deletions(-)
>  create mode 100644 drivers/bus/mdev/Makefile
>  create mode 100644 drivers/bus/mdev/linux/Makefile
>  create mode 100644 drivers/bus/mdev/linux/mdev.c
>  create mode 100644 drivers/bus/mdev/mdev.c
>  create mode 100644 drivers/bus/mdev/meson.build
>  create mode 100644 drivers/bus/mdev/private.h
>  create mode 100644 drivers/bus/mdev/rte_bus_mdev.h
>  create mode 100644 drivers/bus/mdev/rte_bus_mdev_version.map
>  create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c
>
> --
> 2.17.1
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK
  2019-04-08  8:44 ` [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Alejandro Lucero
  2019-04-08  8:44   ` Alejandro Lucero
@ 2019-04-08  9:36   ` Tiwei Bie
  2019-04-08  9:36     ` Tiwei Bie
  2019-04-10 10:02     ` Francois Ozog
  1 sibling, 2 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-04-08  9:36 UTC (permalink / raw)
  To: Alejandro Lucero; +Cc: dev, Liang, Cunming, Bruce Richardson

On Mon, Apr 08, 2019 at 09:44:07AM +0100, Alejandro Lucero wrote:
> On Wed, Apr 3, 2019 at 8:19 AM Tiwei Bie <tiwei.bie@intel.com> wrote:
> > Hi everyone,
> >
> > This is a draft implementation of the mdev (Mediated device [1])
> > bus support in DPDK. Mdev is a way to virtualize devices in Linux
> > kernel. Based on the device-api (mdev_type/device_api), there could
> > be different types of mdev devices (e.g. vfio-pci). In this RFC,
> > one mdev bus is introduced to scan the mdev devices in the system
> > and do the probe based on the device-api.
> >
> > Take the mdev devices whose device-api is "vfio-pci" as an example,
> > in this RFC, these devices will be probed by a mdev driver provided
> > by PCI bus, which will plug them to the PCI bus. And they will be
> > probed with the drivers registered on the PCI bus based on VendorID/
> > DeviceID/... then.
> >
> >                      +----------+
> >                      | mdev bus |
> >                      +----+-----+
> >                           |
> >          +----------------+----+------+------+
> >          |                     |      |      |
> >    mdev_vfio_pci               ......
> > (device-api: vfio-pci)
> >
> > There are also other ways to add mdev device support in DPDK (e.g.
> > let PCI bus scan /sys/bus/mdev/devices directly). Comments would be
> > appreciated!
> 
> Hi Tiwei,
> 
> Thanks for the patchset. I was close to sending a patchset with the same mdev
> support, but I'm glad to see your patchset first because I think it is
> interesting to see another view of how to implement this.
> 
> After going through your patch I was a bit confused about how the mdev device
> to mdev driver match was done. But then I realized the approach you are
> following is different to my implementation, likely due to having different
> purposes. If I understand the idea behind it, you want to have the same PCI PMD
> drivers working with PCI devices created from mediated devices.

Exactly!

> That
> is the reason there is just one mdev driver, the one for the vfio-pci mediated
> device type.
> 
> My approach was different and I thought having specific PMD mdev support was
> necessary, with the PMD required to register a mdev driver. I can see, after
> reading your patch, that it is perfectly possible to have the same PMDs for
> "pure" PCI devices and PCI devices made from mediated devices, and, if the PMD
> needs to do something different due to the mediated device's intrinsics, to
> support that explicitly per PMD. I have specific ioctl calls between the PMD
> and the mediating driver, but this can also be done with your approach.
> 
> I'm working on having a mediated PF, which is a different purpose than the Intel
> scalable I/O idea, so I will merge this patchset with my code and see if it
> works.

Cool! Thanks!

> 
> Thanks!
>  
> 
> > [1] https://github.com/torvalds/linux/blob/master/Documentation/vfio-mediated-device.txt
> > 
> > Thanks,
> > Tiwei
> > 
> > Tiwei Bie (3):
> >   eal: add a helper for reading string from sysfs
> >   bus/mdev: add mdev bus support
> >   bus/pci: add mdev support
> > 
> >  config/common_base                        |   5 +
> >  config/common_linux                       |   1 +
> >  drivers/bus/Makefile                      |   1 +
> >  drivers/bus/mdev/Makefile                 |  41 +++
> >  drivers/bus/mdev/linux/Makefile           |   6 +
> >  drivers/bus/mdev/linux/mdev.c             | 117 ++++++++
> >  drivers/bus/mdev/mdev.c                   | 310 ++++++++++++++++++++++
> >  drivers/bus/mdev/meson.build              |  15 ++
> >  drivers/bus/mdev/private.h                |  90 +++++++
> >  drivers/bus/mdev/rte_bus_mdev.h           | 141 ++++++++++
> >  drivers/bus/mdev/rte_bus_mdev_version.map |  12 +
> >  drivers/bus/meson.build                   |   2 +-
> >  drivers/bus/pci/Makefile                  |   3 +
> >  drivers/bus/pci/linux/Makefile            |   4 +
> >  drivers/bus/pci/linux/pci_vfio.c          |  35 ++-
> >  drivers/bus/pci/linux/pci_vfio_mdev.c     | 305 +++++++++++++++++++++
> >  drivers/bus/pci/meson.build               |   4 +-
> >  drivers/bus/pci/pci_common.c              |  17 +-
> >  drivers/bus/pci/private.h                 |   9 +
> >  drivers/bus/pci/rte_bus_pci.h             |  11 +-
> >  lib/librte_eal/common/eal_filesystem.h    |   7 +
> >  lib/librte_eal/freebsd/eal/eal.c          |  22 ++
> >  lib/librte_eal/linux/eal/eal.c            |  22 ++
> >  lib/librte_eal/rte_eal_version.map        |   1 +
> >  mk/rte.app.mk                             |   1 +
> >  25 files changed, 1163 insertions(+), 19 deletions(-)
> >  create mode 100644 drivers/bus/mdev/Makefile
> >  create mode 100644 drivers/bus/mdev/linux/Makefile
> >  create mode 100644 drivers/bus/mdev/linux/mdev.c
> >  create mode 100644 drivers/bus/mdev/mdev.c
> >  create mode 100644 drivers/bus/mdev/meson.build
> >  create mode 100644 drivers/bus/mdev/private.h
> >  create mode 100644 drivers/bus/mdev/rte_bus_mdev.h
> >  create mode 100644 drivers/bus/mdev/rte_bus_mdev_version.map
> >  create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c
> > 
> > --
> > 2.17.1
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK
  2019-04-08  9:36   ` Tiwei Bie
  2019-04-08  9:36     ` Tiwei Bie
@ 2019-04-10 10:02     ` Francois Ozog
  2019-04-10 10:02       ` Francois Ozog
  1 sibling, 1 reply; 41+ messages in thread
From: Francois Ozog @ 2019-04-10 10:02 UTC (permalink / raw)
  To: Tiwei Bie, dev
  Cc: Alejandro Lucero, Liang, Cunming, Bruce Richardson,
	Ilias Apalodimas, brouer

Hi all,

I presented an approach at FOSDEM
(https://archive.fosdem.org/2018/schedule/event/netmdev/) and am
happy someone is picking this up.

If we step back a little, the mdev concept is to allow userland to be
given direct control over the hardware data path on a device still
controlled by the kernel.
From a code base perspective, this can shrink PMD code size by a
significant amount: only 10% of the PMD code is actual data path, the
rest being device control!
The concept is perfect for DPDK, SPDK and many other scenarios (AI
accelerators).
Should the work be triggered by the DPDK community, it should be
applicable to a broader set of communities: SPDK, VPP, ODP, AF_XDP....

We bumped into many sharing (between kernel and userland) complexities
particularly when a single PCI device controls two ports.
So let's assume we try to solve a subset of the cases: coherent IO
memory and a dedicated PCI space (by whatever mechanism) per port.

What are the "things to solve"?

1) enumeration: enumerating and "capturing" an mdev device (this patch, I assume)
2) bifurcation: designating the queues to capture in userland (possibly
all) with a hardware driven rule (flow director or more generic)
3) memory management: dealing with rings and buffer management on rx
and tx paths

The bifurcation can be as simple as: all queues in userland, or quite
rich: TCP port 80 goes to userland while the rest (ICMP...) goes to the
kernel. If the kernel gets some of the traffic, there will be a routing
information sharing problem to solve. We had a few experiments here.
The conclusion is that it's doable, but many corner cases make it a lot
of work. And it would be nice if the queue selection could be made very
generic (and not tied to flow director).
Let's say this is for further study for now.

Let's focus on memory management of VFIO exposed devices.
I haven't refreshed my knowledge of the VFIO framework so you may want
to correct a few points...
First of all, DPDK is made to switch packets, particularly between ports.
With VFIO, this means all devices are in the same virtual IOVA space,
which is tricky to implement in the kernel.
There are a few strategies to do that, all requiring significant mdev
extensions and more probably a kernel infrastructure change. The good
news is it can be done in such a way that selected drivers implement
the change, not requiring all the drivers to be touched.
Another big question is: does the kernel allocate the memory and
userland get a map to it, or does userland allocate the memory
and the kernel just maintain the IOVA mapping?
I would favor kernel allocation and userland gets a map to it (in the
unified IOVA). One reason being that memory allocation strategy can be
very different from hardware to hardware:
- driver allocates packet buffers and populate a single ring of packet per queue
- driver allocates packet buffers of different sizes and populate
multiple rings per queue (for instance rings of 128, 256, 1024, 2048
byte arrays per queue)
- driver allocates an unstructured memory area (say 32MB) and give it
to hardware (no prepopulation of rings).
So the userland framework (DPDK, SPDK, ODP, VPP, AF_XDP,
proprietary...) can just query for queues and rings to the kernel
driver that knows what has to be done for the driver. The userland
framework just has to create the relevant objects (queues, rings,
packet buffers) to the provided kernel information.
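
As an illustration of what "query the kernel driver" could return, here
is a minimal sketch in C of a per-queue description covering the three
strategies listed above. Every name in it (buf_layout, ring_info,
queue_mem_info, queue_mem_bytes, MAX_RINGS_PER_QUEUE) is hypothetical;
no such interface exists in VFIO or mdev today.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_RINGS_PER_QUEUE 4	/* arbitrary limit for this sketch */

/* The three allocation strategies described in the text. */
enum buf_layout {
	BUF_LAYOUT_SINGLE_RING,	/* one ring of equal-size buffers */
	BUF_LAYOUT_MULTI_RING,	/* several rings, one buffer size each */
	BUF_LAYOUT_UNSTRUCTURED,	/* one flat area handed to hardware */
};

/* One ring as allocated by the kernel driver. */
struct ring_info {
	uint64_t iova;		/* address in the unified IOVA space */
	uint32_t nb_desc;	/* number of buffers in the ring */
	uint32_t buf_size;	/* size of each buffer in bytes */
};

/* What the userland framework would get back for one queue. */
struct queue_mem_info {
	enum buf_layout layout;
	uint32_t nb_rings;
	struct ring_info rings[MAX_RINGS_PER_QUEUE];
	uint64_t area_iova;	/* only used for BUF_LAYOUT_UNSTRUCTURED */
	uint64_t area_len;
};

/* Total packet buffer bytes the userland side has to map for a queue. */
static uint64_t
queue_mem_bytes(const struct queue_mem_info *q)
{
	uint64_t total = 0;
	uint32_t i;

	if (q->layout == BUF_LAYOUT_UNSTRUCTURED)
		return q->area_len;
	for (i = 0; i < q->nb_rings && i < MAX_RINGS_PER_QUEUE; i++)
		total += (uint64_t)q->rings[i].nb_desc * q->rings[i].buf_size;
	return total;
}
```

With a description like this, the framework never needs to know which
strategy a given device uses; it only walks what the kernel reports.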

Exposing VFIO devices to DPDK and other frameworks is a major topic,
and I suggest that, while enumeration is being done, a broader
discussion on the data path itself happens in parallel.
The data path discussion is about memory management (above) and packet
descriptors. Exposing hardware-dependent structures to userland is
not the most widely accepted wisdom.
So I would rather assume the hardware natively produces hardware-,
vendor- and OS-independent descriptors. Candidates can be: DPDK mbuf,
VPP vlib_buf or virtio 1.1. I would favor a packet descriptor that
supports a combination of inline offloads (VxLAN + IPSec + TSO...): if
virtio 1.1 could be extended with some DPDK mbuf fields, that would be
perfect ;-) That looks like science fiction, but I know that on some
smartNICs and other hardware, the hardware-produced packet descriptor
format can be flexible....

Cheers

FF



On Mon, 8 Apr 2019 at 11:36, Tiwei Bie <tiwei.bie@intel.com> wrote:
>
> On Mon, Apr 08, 2019 at 09:44:07AM +0100, Alejandro Lucero wrote:
> > On Wed, Apr 3, 2019 at 8:19 AM Tiwei Bie <tiwei.bie@intel.com> wrote:
> > > Hi everyone,
> > >
> > > This is a draft implementation of the mdev (Mediated device [1])
> > > bus support in DPDK. Mdev is a way to virtualize devices in Linux
> > > kernel. Based on the device-api (mdev_type/device_api), there could
> > > be different types of mdev devices (e.g. vfio-pci). In this RFC,
> > > one mdev bus is introduced to scan the mdev devices in the system
> > > and do the probe based on the device-api.
> > >
> > > Take the mdev devices whose device-api is "vfio-pci" as an example,
> > > in this RFC, these devices will be probed by a mdev driver provided
> > > by PCI bus, which will plug them to the PCI bus. And they will be
> > > probed with the drivers registered on the PCI bus based on VendorID/
> > > DeviceID/... then.
> > >
> > >                      +----------+
> > >                      | mdev bus |
> > >                      +----+-----+
> > >                           |
> > >          +----------------+----+------+------+
> > >          |                     |      |      |
> > >    mdev_vfio_pci               ......
> > > (device-api: vfio-pci)
> > >
> > > There are also other ways to add mdev device support in DPDK (e.g.
> > > let PCI bus scan /sys/bus/mdev/devices directly). Comments would be
> > > appreciated!
> >
> > Hi Tiwei,
> >
> > Thanks for the patchset. I was close to sending a patchset with the same mdev
> > support, but I'm glad to see your patchset first because I think it is
> > interesting to see another view of how to implement this.
> >
> > After going through your patch I was a bit confused about how the mdev device
> > to mdev driver match was done. But then I realized the approach you are
> > following is different from my implementation, likely due to having different
> > purposes. If I understand the idea behind it, you want to have the same PCI
> > PMD drivers working with PCI devices created from mediated devices.
>
> Exactly!
>
> > That
> > is the reason there is just one mdev driver, the one for vfio-pci mediated
> > devices type.
> >
> > My approach was different, and I thought having specific PMD mdev support was
> > necessary, with the PMD being required to register an mdev driver. I can see,
> > after reading your patch, that it is perfectly possible to have the same PMDs for
> > "pure" PCI devices and PCI devices made from mediated devices, and if the PMD
> > needs to do something different due to the mediated device's intrinsics, then
> > to support that explicitly per PMD. I have specific ioctl calls between the PMD
> > and the mediating driver, but this can also be done with your approach.
> >
> > I'm working on having a mediated PF, which is a different purpose than the Intel
> > scalable I/O idea, so I will merge this patchset with my code and see if it
> > works.
>
> Cool! Thanks!
>
> >
> > Thanks!
> >
> >
> > > [1] https://github.com/torvalds/linux/blob/master/Documentation/vfio-mediated-device.txt
> > >
> > > Thanks,
> > > Tiwei
> > >
> > > Tiwei Bie (3):
> > >   eal: add a helper for reading string from sysfs
> > >   bus/mdev: add mdev bus support
> > >   bus/pci: add mdev support
> > >
> > >  config/common_base                        |   5 +
> > >  config/common_linux                       |   1 +
> > >  drivers/bus/Makefile                      |   1 +
> > >  drivers/bus/mdev/Makefile                 |  41 +++
> > >  drivers/bus/mdev/linux/Makefile           |   6 +
> > >  drivers/bus/mdev/linux/mdev.c             | 117 ++++++++
> > >  drivers/bus/mdev/mdev.c                   | 310 ++++++++++++++++++++++
> > >  drivers/bus/mdev/meson.build              |  15 ++
> > >  drivers/bus/mdev/private.h                |  90 +++++++
> > >  drivers/bus/mdev/rte_bus_mdev.h           | 141 ++++++++++
> > >  drivers/bus/mdev/rte_bus_mdev_version.map |  12 +
> > >  drivers/bus/meson.build                   |   2 +-
> > >  drivers/bus/pci/Makefile                  |   3 +
> > >  drivers/bus/pci/linux/Makefile            |   4 +
> > >  drivers/bus/pci/linux/pci_vfio.c          |  35 ++-
> > >  drivers/bus/pci/linux/pci_vfio_mdev.c     | 305 +++++++++++++++++++++
> > >  drivers/bus/pci/meson.build               |   4 +-
> > >  drivers/bus/pci/pci_common.c              |  17 +-
> > >  drivers/bus/pci/private.h                 |   9 +
> > >  drivers/bus/pci/rte_bus_pci.h             |  11 +-
> > >  lib/librte_eal/common/eal_filesystem.h    |   7 +
> > >  lib/librte_eal/freebsd/eal/eal.c          |  22 ++
> > >  lib/librte_eal/linux/eal/eal.c            |  22 ++
> > >  lib/librte_eal/rte_eal_version.map        |   1 +
> > >  mk/rte.app.mk                             |   1 +
> > >  25 files changed, 1163 insertions(+), 19 deletions(-)
> > >  create mode 100644 drivers/bus/mdev/Makefile
> > >  create mode 100644 drivers/bus/mdev/linux/Makefile
> > >  create mode 100644 drivers/bus/mdev/linux/mdev.c
> > >  create mode 100644 drivers/bus/mdev/mdev.c
> > >  create mode 100644 drivers/bus/mdev/meson.build
> > >  create mode 100644 drivers/bus/mdev/private.h
> > >  create mode 100644 drivers/bus/mdev/rte_bus_mdev.h
> > >  create mode 100644 drivers/bus/mdev/rte_bus_mdev_version.map
> > >  create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c
> > >
> > > --
> > > 2.17.1
> >
> >



--
François-Frédéric Ozog | Director Linaro Edge & Fog Computing Group
T: +33.67221.6485
francois.ozog@linaro.org | Skype: ffozog

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [dpdk-dev] [RFC v2 0/5] Add mdev (Mediated device) support in DPDK
  2019-04-03  7:18 [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Tiwei Bie
                   ` (4 preceding siblings ...)
  2019-04-08  8:44 ` [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Alejandro Lucero
@ 2019-07-15  7:52 ` Tiwei Bie
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 1/5] bus/pci: introduce an internal representation of PCI device Tiwei Bie
                     ` (4 more replies)
  5 siblings, 5 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-07-15  7:52 UTC (permalink / raw)
  To: dev
  Cc: ferruh.yigit, anatoly.burakov, bruce.richardson, keith.wiles,
	david.marchand, alejandro.lucero, cunming.liang

Hi everyone,

This is a draft implementation of mdev (Mediated device [1])
support in the DPDK PCI bus driver. Mdev is a way to virtualize devices
in the Linux kernel. Depending on the device-api (mdev_type/device_api),
there can be different types of mdev devices (e.g. vfio-pci).
In this RFCv2, the PCI bus driver is extended to support scanning
and probing the mdev devices whose device-api is "vfio-pci".

                     +---------+
                     | PCI bus |
                     +----+----+
                          |
         +--------+-------+-------+--------+
         |        |               |        |
  Physical PCI devices ...   Mediated PCI devices ...

RFCv2:
- Let PCI bus scan mediated PCI devices directly
- Address Keith's comments
- Merge below patch into this series (David)
   http://patches.dpdk.org/patch/55927/
- Add internal representation of PCI device (David)
- Minor fixes and improvements

[1] https://github.com/torvalds/linux/blob/master/Documentation/vfio-mediated-device.txt

Thanks,
Tiwei

Tiwei Bie (5):
  bus/pci: introduce an internal representation of PCI device
  bus/pci: avoid depending on private value in kernel source
  bus/pci: introduce helper for MMIO read and write
  eal: add a helper for reading string from sysfs
  bus/pci: add mdev support

 drivers/bus/pci/bsd/pci.c               |  36 ++-
 drivers/bus/pci/linux/Makefile          |   1 +
 drivers/bus/pci/linux/pci.c             | 105 ++++++--
 drivers/bus/pci/linux/pci_init.h        |  29 ++-
 drivers/bus/pci/linux/pci_uio.c         |  22 ++
 drivers/bus/pci/linux/pci_vfio.c        | 314 ++++++++++++++++++++----
 drivers/bus/pci/linux/pci_vfio_mdev.c   | 236 ++++++++++++++++++
 drivers/bus/pci/meson.build             |   3 +-
 drivers/bus/pci/pci_common.c            |  31 +--
 drivers/bus/pci/private.h               |  22 ++
 drivers/bus/pci/rte_bus_pci.h           |  65 ++++-
 drivers/bus/pci/rte_bus_pci_version.map |   7 +
 lib/librte_eal/common/eal_filesystem.h  |  10 +
 lib/librte_eal/freebsd/eal/eal.c        |  22 ++
 lib/librte_eal/linux/eal/eal.c          |  39 ++-
 lib/librte_eal/rte_eal_version.map      |   1 +
 16 files changed, 838 insertions(+), 105 deletions(-)
 create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [dpdk-dev] [RFC v2 1/5] bus/pci: introduce an internal representation of PCI device
  2019-07-15  7:52 ` [dpdk-dev] [RFC v2 0/5] " Tiwei Bie
@ 2019-07-15  7:52   ` Tiwei Bie
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 2/5] bus/pci: avoid depending on private value in kernel source Tiwei Bie
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-07-15  7:52 UTC (permalink / raw)
  To: dev
  Cc: ferruh.yigit, anatoly.burakov, bruce.richardson, keith.wiles,
	david.marchand, alejandro.lucero, cunming.liang

This patch introduces an internal representation of the PCI device,
which will be used to store internal information that doesn't have
to be exposed, e.g. the VFIO region sizes/offsets.

In this patch, the internal structure is simply a wrapper around the
rte_pci_device structure. More fields will be added in the coming
patches.
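
The wrapper pattern can be sketched in isolation as below. This is a
simplified, self-contained illustration rather than the code of this
patch: struct pci_device and struct pci_device_internal are stand-ins
for rte_pci_device and rte_pci_device_internal, the private_flags field
is invented for the example, and CONTAINER_OF is a local macro written
out here instead of the DPDK container_of().

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the public device structure (fields trimmed). */
struct pci_device {
	unsigned int vendor_id;
	unsigned int device_id;
};

/* Stand-in for the internal wrapper; private fields live here. */
struct pci_device_internal {
	struct pci_device device;
	unsigned long private_flags;	/* hypothetical private field */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define PCI_DEVICE_INTERNAL(ptr) \
	CONTAINER_OF(ptr, struct pci_device_internal, device)

/* Allocate the wrapper, but hand out only the public part. */
static struct pci_device *
pci_device_alloc(void)
{
	struct pci_device_internal *pdev = calloc(1, sizeof(*pdev));

	return pdev == NULL ? NULL : &pdev->device;
}

/* Free through the wrapper, since that is what was allocated. */
static void
pci_device_free(struct pci_device *dev)
{
	if (dev != NULL)
		free(PCI_DEVICE_INTERNAL(dev));
}
```

The point to note is that once the wrapper is what gets allocated, every
free() must go through the wrapper pointer as well, which is why the
free(dev) calls become free(pdev) in the diff below.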

Suggested-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/bus/pci/bsd/pci.c    | 14 ++++++++------
 drivers/bus/pci/linux/pci.c  | 25 ++++++++++++++-----------
 drivers/bus/pci/pci_common.c |  2 +-
 drivers/bus/pci/private.h    | 12 ++++++++++++
 4 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/drivers/bus/pci/bsd/pci.c b/drivers/bus/pci/bsd/pci.c
index a2de70910..636868f38 100644
--- a/drivers/bus/pci/bsd/pci.c
+++ b/drivers/bus/pci/bsd/pci.c
@@ -213,16 +213,18 @@ pci_uio_map_resource_by_index(struct rte_pci_device *dev, int res_idx,
 static int
 pci_scan_one(int dev_pci_fd, struct pci_conf *conf)
 {
+	struct rte_pci_device_internal *pdev;
 	struct rte_pci_device *dev;
 	struct pci_bar_io bar;
 	unsigned i, max;
 
-	dev = malloc(sizeof(*dev));
-	if (dev == NULL) {
+	pdev = malloc(sizeof(*pdev));
+	if (pdev == NULL)
 		return -1;
-	}
 
-	memset(dev, 0, sizeof(*dev));
+	memset(pdev, 0, sizeof(*pdev));
+
+	dev = &pdev->device;
 	dev->device.bus = &rte_pci_bus.bus;
 
 	dev->addr.domain = conf->pc_sel.pc_domain;
@@ -308,7 +310,7 @@ pci_scan_one(int dev_pci_fd, struct pci_conf *conf)
 				memmove(dev2->mem_resource,
 					dev->mem_resource,
 					sizeof(dev->mem_resource));
-				free(dev);
+				free(pdev);
 			}
 			return 0;
 		}
@@ -318,7 +320,7 @@ pci_scan_one(int dev_pci_fd, struct pci_conf *conf)
 	return 0;
 
 skipdev:
-	free(dev);
+	free(pdev);
 	return 0;
 }
 
diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index 33c8ea7e9..dfab7b81b 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -219,22 +219,25 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 {
 	char filename[PATH_MAX];
 	unsigned long tmp;
+	struct rte_pci_device_internal *pdev;
 	struct rte_pci_device *dev;
 	char driver[PATH_MAX];
 	int ret;
 
-	dev = malloc(sizeof(*dev));
-	if (dev == NULL)
+	pdev = malloc(sizeof(*pdev));
+	if (pdev == NULL)
 		return -1;
 
-	memset(dev, 0, sizeof(*dev));
+	memset(pdev, 0, sizeof(*pdev));
+
+	dev = &pdev->device;
 	dev->device.bus = &rte_pci_bus.bus;
 	dev->addr = *addr;
 
 	/* get vendor id */
 	snprintf(filename, sizeof(filename), "%s/vendor", dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	dev->id.vendor_id = (uint16_t)tmp;
@@ -242,7 +245,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	/* get device id */
 	snprintf(filename, sizeof(filename), "%s/device", dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	dev->id.device_id = (uint16_t)tmp;
@@ -251,7 +254,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	snprintf(filename, sizeof(filename), "%s/subsystem_vendor",
 		 dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	dev->id.subsystem_vendor_id = (uint16_t)tmp;
@@ -260,7 +263,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	snprintf(filename, sizeof(filename), "%s/subsystem_device",
 		 dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	dev->id.subsystem_device_id = (uint16_t)tmp;
@@ -269,7 +272,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	snprintf(filename, sizeof(filename), "%s/class",
 		 dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	/* the least 24 bits are valid: class, subclass, program interface */
@@ -309,7 +312,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	snprintf(filename, sizeof(filename), "%s/resource", dirname);
 	if (pci_parse_sysfs_resource(filename, dev) < 0) {
 		RTE_LOG(ERR, EAL, "%s(): cannot parse resource\n", __func__);
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 
@@ -318,7 +321,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	ret = pci_get_kernel_driver_by_path(filename, driver, sizeof(driver));
 	if (ret < 0) {
 		RTE_LOG(ERR, EAL, "Fail to get kernel driver\n");
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 
@@ -382,7 +385,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 						RTE_LOG(ERR, EAL, "Unexpected device scan at %s!\n",
 							filename);
 				}
-				free(dev);
+				free(pdev);
 			}
 			return 0;
 		}
diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
index d2af472ef..8b9deca8b 100644
--- a/drivers/bus/pci/pci_common.c
+++ b/drivers/bus/pci/pci_common.c
@@ -523,7 +523,7 @@ pci_unplug(struct rte_device *dev)
 	if (ret == 0) {
 		rte_pci_remove_device(pdev);
 		rte_devargs_remove(dev->devargs);
-		free(pdev);
+		free(RTE_PCI_DEVICE_INTERNAL(pdev));
 	}
 	return ret;
 }
diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
index 8a5524052..3e2abd818 100644
--- a/drivers/bus/pci/private.h
+++ b/drivers/bus/pci/private.h
@@ -10,6 +10,14 @@
 #include <rte_pci.h>
 #include <rte_bus_pci.h>
 
+/*
+ * Convert struct rte_pci_device to struct rte_pci_device_internal
+ */
+#define RTE_PCI_DEVICE_INTERNAL(ptr) \
+	container_of(ptr, struct rte_pci_device_internal, device)
+#define RTE_PCI_DEVICE_INTERNAL_CONST(ptr) \
+	container_of(ptr, const struct rte_pci_device_internal, device)
+
 extern struct rte_pci_bus rte_pci_bus;
 
 struct rte_pci_driver;
@@ -17,6 +25,10 @@ struct rte_pci_device;
 
 extern struct rte_pci_bus rte_pci_bus;
 
+struct rte_pci_device_internal {
+	struct rte_pci_device device;
+};
+
 /**
  * Probe the PCI bus
  *
-- 
2.17.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [dpdk-dev] [RFC v2 2/5] bus/pci: avoid depending on private value in kernel source
  2019-07-15  7:52 ` [dpdk-dev] [RFC v2 0/5] " Tiwei Bie
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 1/5] bus/pci: introduce an internal representation of PCI device Tiwei Bie
@ 2019-07-15  7:52   ` Tiwei Bie
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 3/5] bus/pci: introduce helper for MMIO read and write Tiwei Bie
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-07-15  7:52 UTC (permalink / raw)
  To: dev
  Cc: ferruh.yigit, anatoly.burakov, bruce.richardson, keith.wiles,
	david.marchand, alejandro.lucero, cunming.liang

The value 40 used in VFIO_GET_REGION_ADDR() is a private value
(VFIO_PCI_OFFSET_SHIFT) defined in the Linux kernel source [1]. It
is not part of the VFIO API, and we should not depend on it.

[1] https://github.com/torvalds/linux/blob/6fbc7275c7a9/drivers/vfio/pci/vfio_pci_private.h#L19
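
For reference, the public VFIO API reports each region's size and file
offset through the VFIO_DEVICE_GET_REGION_INFO ioctl. The sketch below
shows one way to query it; vfio_query_region is a hypothetical helper
name used only for this illustration and is not a function added by
this patch.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Ask the kernel where a region lives inside the VFIO device fd.
 * Returns 0 and fills size/offset on success, -1 on failure
 * (in which case size and offset are left untouched).
 */
static int
vfio_query_region(int vfio_dev_fd, uint32_t index,
		  uint64_t *size, uint64_t *offset)
{
	struct vfio_region_info info;

	memset(&info, 0, sizeof(info));
	info.argsz = sizeof(info);
	info.index = index;

	if (ioctl(vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &info) != 0)
		return -1;

	*size = info.size;
	*offset = info.offset;
	return 0;
}
```

Calling it with VFIO_PCI_CONFIG_REGION_INDEX yields the offset to add to
a config space read or write, instead of computing index << 40.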

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/bus/pci/linux/pci.c      |   4 +-
 drivers/bus/pci/linux/pci_init.h |   4 +-
 drivers/bus/pci/linux/pci_vfio.c | 176 ++++++++++++++++++++++++-------
 drivers/bus/pci/private.h        |  10 ++
 4 files changed, 154 insertions(+), 40 deletions(-)

diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index dfab7b81b..00bfbb301 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -639,7 +639,7 @@ int rte_pci_read_config(const struct rte_pci_device *device,
 		return pci_uio_read_config(intr_handle, buf, len, offset);
 #ifdef VFIO_PRESENT
 	case RTE_KDRV_VFIO:
-		return pci_vfio_read_config(intr_handle, buf, len, offset);
+		return pci_vfio_read_config(device, buf, len, offset);
 #endif
 	default:
 		rte_pci_device_name(&device->addr, devname,
@@ -663,7 +663,7 @@ int rte_pci_write_config(const struct rte_pci_device *device,
 		return pci_uio_write_config(intr_handle, buf, len, offset);
 #ifdef VFIO_PRESENT
 	case RTE_KDRV_VFIO:
-		return pci_vfio_write_config(intr_handle, buf, len, offset);
+		return pci_vfio_write_config(device, buf, len, offset);
 #endif
 	default:
 		rte_pci_device_name(&device->addr, devname,
diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index c2e603a37..c6542a8f9 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -64,9 +64,9 @@ int pci_uio_ioport_unmap(struct rte_pci_ioport *p);
 #endif
 
 /* access config space */
-int pci_vfio_read_config(const struct rte_intr_handle *intr_handle,
+int pci_vfio_read_config(const struct rte_pci_device *dev,
 			 void *buf, size_t len, off_t offs);
-int pci_vfio_write_config(const struct rte_intr_handle *intr_handle,
+int pci_vfio_write_config(const struct rte_pci_device *dev,
 			  const void *buf, size_t len, off_t offs);
 
 int pci_vfio_ioport_map(struct rte_pci_device *dev, int bar,
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index ee3123965..2dc4a9299 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -49,35 +49,82 @@ static struct rte_tailq_elem rte_vfio_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_vfio_tailq)
 
+static int
+pci_vfio_get_region(const struct rte_pci_device *dev, int index,
+		    uint64_t *size, uint64_t *offset)
+{
+	const struct rte_pci_device_internal *pdev =
+		RTE_PCI_DEVICE_INTERNAL_CONST(dev);
+
+	if (index >= VFIO_PCI_NUM_REGIONS || index >= RTE_MAX_PCI_REGIONS)
+		return -1;
+
+	if (pdev->region[index].size == 0 && pdev->region[index].offset == 0)
+		return -1;
+
+	*size   = pdev->region[index].size;
+	*offset = pdev->region[index].offset;
+
+	return 0;
+}
+
 int
-pci_vfio_read_config(const struct rte_intr_handle *intr_handle,
+pci_vfio_read_config(const struct rte_pci_device *dev,
 		    void *buf, size_t len, off_t offs)
 {
-	return pread64(intr_handle->vfio_dev_fd, buf, len,
-	       VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) + offs);
+	uint64_t size, offset;
+	int fd;
+
+	fd = dev->intr_handle.vfio_dev_fd;
+
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+				&size, &offset) != 0)
+		return -1;
+
+	if ((uint64_t)len + offs > size)
+		return -1;
+
+	return pread64(fd, buf, len, offset + offs);
 }
 
 int
-pci_vfio_write_config(const struct rte_intr_handle *intr_handle,
+pci_vfio_write_config(const struct rte_pci_device *dev,
 		    const void *buf, size_t len, off_t offs)
 {
-	return pwrite64(intr_handle->vfio_dev_fd, buf, len,
-	       VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) + offs);
+	uint64_t size, offset;
+	int fd;
+
+	fd = dev->intr_handle.vfio_dev_fd;
+
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+				&size, &offset) != 0)
+		return -1;
+
+	if ((uint64_t)len + offs > size)
+		return -1;
+
+	return pwrite64(fd, buf, len, offset + offs);
 }
 
 /* get PCI BAR number where MSI-X interrupts are */
 static int
-pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
+pci_vfio_get_msix_bar(const struct rte_pci_device *dev, int fd,
+		      struct pci_msix_table *msix_table)
 {
 	int ret;
 	uint32_t reg;
 	uint16_t flags;
 	uint8_t cap_id, cap_offset;
+	uint64_t size, offset;
+
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+				&size, &offset) != 0) {
+		RTE_LOG(ERR, EAL, "Cannot get offset of CONFIG region.\n");
+		return -1;
+	}
 
 	/* read PCI capability pointer from config space */
-	ret = pread64(fd, &reg, sizeof(reg),
-			VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-			PCI_CAPABILITY_LIST);
+	ret = pread64(fd, &reg, sizeof(reg), offset + PCI_CAPABILITY_LIST);
 	if (ret != sizeof(reg)) {
 		RTE_LOG(ERR, EAL, "Cannot read capability pointer from PCI "
 				"config space!\n");
@@ -90,9 +137,7 @@ pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 	while (cap_offset) {
 
 		/* read PCI capability ID */
-		ret = pread64(fd, &reg, sizeof(reg),
-				VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-				cap_offset);
+		ret = pread64(fd, &reg, sizeof(reg), offset + cap_offset);
 		if (ret != sizeof(reg)) {
 			RTE_LOG(ERR, EAL, "Cannot read capability ID from PCI "
 					"config space!\n");
@@ -105,8 +150,7 @@ pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 		/* if we haven't reached MSI-X, check next capability */
 		if (cap_id != PCI_CAP_ID_MSIX) {
 			ret = pread64(fd, &reg, sizeof(reg),
-					VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-					cap_offset);
+					offset + cap_offset);
 			if (ret != sizeof(reg)) {
 				RTE_LOG(ERR, EAL, "Cannot read capability pointer from PCI "
 						"config space!\n");
@@ -122,8 +166,7 @@ pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 		else {
 			/* table offset resides in the next 4 bytes */
 			ret = pread64(fd, &reg, sizeof(reg),
-					VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-					cap_offset + 4);
+					offset + cap_offset + 4);
 			if (ret != sizeof(reg)) {
 				RTE_LOG(ERR, EAL, "Cannot read table offset from PCI config "
 						"space!\n");
@@ -131,8 +174,7 @@ pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 			}
 
 			ret = pread64(fd, &flags, sizeof(flags),
-					VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-					cap_offset + 2);
+					offset + cap_offset + 2);
 			if (ret != sizeof(flags)) {
 				RTE_LOG(ERR, EAL, "Cannot read table flags from PCI config "
 						"space!\n");
@@ -152,14 +194,19 @@ pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 
 /* set PCI bus mastering */
 static int
-pci_vfio_set_bus_master(int dev_fd, bool op)
+pci_vfio_set_bus_master(const struct rte_pci_device *dev, int dev_fd, bool op)
 {
+	uint64_t size, offset;
 	uint16_t reg;
 	int ret;
 
-	ret = pread64(dev_fd, &reg, sizeof(reg),
-			VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-			PCI_COMMAND);
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+				&size, &offset) != 0) {
+		RTE_LOG(ERR, EAL, "Cannot get offset of CONFIG region.\n");
+		return -1;
+	}
+
+	ret = pread64(dev_fd, &reg, sizeof(reg), offset + PCI_COMMAND);
 	if (ret != sizeof(reg)) {
 		RTE_LOG(ERR, EAL, "Cannot read command from PCI config space!\n");
 		return -1;
@@ -171,10 +218,7 @@ pci_vfio_set_bus_master(int dev_fd, bool op)
 	else
 		reg &= ~(PCI_COMMAND_MASTER);
 
-	ret = pwrite64(dev_fd, &reg, sizeof(reg),
-			VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-			PCI_COMMAND);
-
+	ret = pwrite64(dev_fd, &reg, sizeof(reg), offset + PCI_COMMAND);
 	if (ret != sizeof(reg)) {
 		RTE_LOG(ERR, EAL, "Cannot write command to PCI config space!\n");
 		return -1;
@@ -405,14 +449,21 @@ pci_vfio_disable_notifier(struct rte_pci_device *dev)
 #endif
 
 static int
-pci_vfio_is_ioport_bar(int vfio_dev_fd, int bar_index)
+pci_vfio_is_ioport_bar(const struct rte_pci_device *dev,
+		       int vfio_dev_fd, int bar_index)
 {
+	uint64_t size, offset;
 	uint32_t ioport_bar;
 	int ret;
 
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+				&size, &offset) != 0) {
+		RTE_LOG(ERR, EAL, "Cannot get offset of CONFIG region.\n");
+		return -1;
+	}
+
 	ret = pread64(vfio_dev_fd, &ioport_bar, sizeof(ioport_bar),
-			  VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX)
-			  + PCI_BASE_ADDRESS_0 + bar_index*4);
+			  offset + PCI_BASE_ADDRESS_0 + bar_index*4);
 	if (ret != sizeof(ioport_bar)) {
 		RTE_LOG(ERR, EAL, "Cannot read command (%x) from config space!\n",
 			PCI_BASE_ADDRESS_0 + bar_index*4);
@@ -431,7 +482,7 @@ pci_rte_vfio_setup_device(struct rte_pci_device *dev, int vfio_dev_fd)
 	}
 
 	/* set bus mastering for the device */
-	if (pci_vfio_set_bus_master(vfio_dev_fd, true)) {
+	if (pci_vfio_set_bus_master(dev, vfio_dev_fd, true)) {
 		RTE_LOG(ERR, EAL, "Cannot set up bus mastering!\n");
 		return -1;
 	}
@@ -645,11 +696,40 @@ pci_vfio_msix_is_mappable(int vfio_dev_fd, int msix_region)
 	return ret;
 }
 
+static int
+pci_vfio_fill_regions(struct rte_pci_device *dev, int vfio_dev_fd,
+		      struct vfio_device_info *device_info)
+{
+	struct rte_pci_device_internal *pdev = RTE_PCI_DEVICE_INTERNAL(dev);
+	struct vfio_region_info *reg = NULL;
+	int nb_maps, i, ret;
+
+	nb_maps = RTE_MIN((int)device_info->num_regions,
+			VFIO_PCI_CONFIG_REGION_INDEX + 1);
+
+	for (i = 0; i < nb_maps; i++) {
+		ret = pci_vfio_get_region_info(vfio_dev_fd, &reg, i);
+		if (ret < 0) {
+			RTE_LOG(DEBUG, EAL, "%s cannot get device region info error %i (%s)\n",
+				dev->name, errno, strerror(errno));
+			return -1;
+		}
+
+		pdev->region[i].size = reg->size;
+		pdev->region[i].offset = reg->offset;
+
+		free(reg);
+	}
+
+	return 0;
+}
 
 static int
 pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 {
+	struct rte_pci_device_internal *pdev = RTE_PCI_DEVICE_INTERNAL(dev);
 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+	struct vfio_region_info *reg = NULL;
 	char pci_addr[PATH_MAX] = {0};
 	int vfio_dev_fd;
 	struct rte_pci_addr *loc = &dev->addr;
@@ -690,11 +770,22 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 	/* map BARs */
 	maps = vfio_res->maps;
 
+	ret = pci_vfio_get_region_info(vfio_dev_fd, &reg,
+				       VFIO_PCI_CONFIG_REGION_INDEX);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "%s cannot get device region info error %i (%s)\n",
+			dev->name, errno, strerror(errno));
+		goto err_vfio_res;
+	}
+	pdev->region[VFIO_PCI_CONFIG_REGION_INDEX].size = reg->size;
+	pdev->region[VFIO_PCI_CONFIG_REGION_INDEX].offset = reg->offset;
+	free(reg);
+
 	vfio_res->msix_table.bar_index = -1;
 	/* get MSI-X BAR, if any (we have to know where it is because we can't
 	 * easily mmap it when using VFIO)
 	 */
-	ret = pci_vfio_get_msix_bar(vfio_dev_fd, &vfio_res->msix_table);
+	ret = pci_vfio_get_msix_bar(dev, vfio_dev_fd, &vfio_res->msix_table);
 	if (ret < 0) {
 		RTE_LOG(ERR, EAL, "  %s cannot get MSI-X BAR number!\n",
 				pci_addr);
@@ -715,7 +806,6 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 	}
 
 	for (i = 0; i < (int) vfio_res->nb_maps; i++) {
-		struct vfio_region_info *reg = NULL;
 		void *bar_addr;
 
 		ret = pci_vfio_get_region_info(vfio_dev_fd, &reg, i);
@@ -726,8 +816,11 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 			goto err_vfio_res;
 		}
 
+		pdev->region[i].size = reg->size;
+		pdev->region[i].offset = reg->offset;
+
 		/* chk for io port region */
-		ret = pci_vfio_is_ioport_bar(vfio_dev_fd, i);
+		ret = pci_vfio_is_ioport_bar(dev, vfio_dev_fd, i);
 		if (ret < 0) {
 			free(reg);
 			goto err_vfio_res;
@@ -833,6 +926,10 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 	if (ret)
 		return ret;
 
+	ret = pci_vfio_fill_regions(dev, vfio_dev_fd, &device_info);
+	if (ret)
+		return ret;
+
 	/* map BARs */
 	maps = vfio_res->maps;
 
@@ -938,7 +1035,7 @@ pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 		return -1;
 	}
 
-	if (pci_vfio_set_bus_master(dev->intr_handle.vfio_dev_fd, false)) {
+	if (pci_vfio_set_bus_master(dev, dev->intr_handle.vfio_dev_fd, false)) {
 		RTE_LOG(ERR, EAL, "  %s cannot unset bus mastering for PCI device!\n",
 				pci_addr);
 		return -1;
@@ -1016,14 +1113,21 @@ int
 pci_vfio_ioport_map(struct rte_pci_device *dev, int bar,
 		    struct rte_pci_ioport *p)
 {
+	uint64_t size, offset;
+
 	if (bar < VFIO_PCI_BAR0_REGION_INDEX ||
 	    bar > VFIO_PCI_BAR5_REGION_INDEX) {
 		RTE_LOG(ERR, EAL, "invalid bar (%d)!\n", bar);
 		return -1;
 	}
 
+	if (pci_vfio_get_region(dev, bar, &size, &offset) != 0) {
+		RTE_LOG(ERR, EAL, "Cannot get offset of region %d.\n", bar);
+		return -1;
+	}
+
 	p->dev = dev;
-	p->base = VFIO_GET_REGION_ADDR(bar);
+	p->base = offset;
 	return 0;
 }
 
diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
index 3e2abd818..c09185b86 100644
--- a/drivers/bus/pci/private.h
+++ b/drivers/bus/pci/private.h
@@ -10,6 +10,8 @@
 #include <rte_pci.h>
 #include <rte_bus_pci.h>
 
+#define RTE_MAX_PCI_REGIONS	9
+
 /*
  * Convert struct rte_pci_device to struct rte_pci_device_internal
  */
@@ -25,8 +27,16 @@ struct rte_pci_device;
 
 extern struct rte_pci_bus rte_pci_bus;
 
+struct rte_pci_region {
+	uint64_t size;
+	uint64_t offset;
+};
+
 struct rte_pci_device_internal {
 	struct rte_pci_device device;
+
+	/* PCI regions provided by e.g. VFIO. */
+	struct rte_pci_region region[RTE_MAX_PCI_REGIONS];
 };
 
 /**
-- 
2.17.1



* [dpdk-dev] [RFC v2 3/5] bus/pci: introduce helper for MMIO read and write
  2019-07-15  7:52 ` [dpdk-dev] [RFC v2 0/5] " Tiwei Bie
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 1/5] bus/pci: introduce an internal representation of PCI device Tiwei Bie
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 2/5] bus/pci: avoid depending on private value in kernel source Tiwei Bie
@ 2019-07-15  7:52   ` Tiwei Bie
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 4/5] eal: add a helper for reading string from sysfs Tiwei Bie
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 5/5] bus/pci: add mdev support Tiwei Bie
  4 siblings, 0 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-07-15  7:52 UTC (permalink / raw)
  To: dev
  Cc: ferruh.yigit, anatoly.burakov, bruce.richardson, keith.wiles,
	david.marchand, alejandro.lucero, cunming.liang

The MMIO regions of a mediated PCI device may not be mmap-able.
In that case, the application has to access these regions through
explicit read and write calls. This patch introduces
rte_pci_mmio_read() and rte_pci_mmio_write() for that purpose.
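
As a rough illustration of the read/write access path, the sketch
below bounds-checks an access against a region's size and then issues
pread()/pwrite() at the region's offset within a file descriptor. It is
not the code from this patch: struct region_window, region_check(),
region_read(), region_write() and demo_roundtrip() are hypothetical
names, a temporary file stands in for the VFIO device fd, plain
pread()/pwrite() replace pread64()/pwrite64(), and the bounds check is
written to avoid integer wrap-around.

```c
#define _XOPEN_SOURCE 700 /* exposes pread(), pwrite() and fileno() */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-region size/offset pair. */
struct region_window {
	uint64_t size;   /* length of the region in bytes */
	uint64_t offset; /* start of the region within the device fd */
};

/* Return 0 if [offs, offs + len) lies inside the region, -1 otherwise. */
static int
region_check(const struct region_window *r, size_t len, off_t offs)
{
	assert(r != NULL);
	if (offs < 0 || (uint64_t)offs > r->size ||
	    (uint64_t)len > r->size - (uint64_t)offs)
		return -1;
	return 0;
}

/* Read len bytes at offs inside the region; bytes read or -1. */
static ssize_t
region_read(int fd, const struct region_window *r,
	    void *buf, size_t len, off_t offs)
{
	if (region_check(r, len, offs) != 0)
		return -1;
	return pread(fd, buf, len, (off_t)(r->offset + (uint64_t)offs));
}

/* Write len bytes at offs inside the region; bytes written or -1. */
static ssize_t
region_write(int fd, const struct region_window *r,
	     const void *buf, size_t len, off_t offs)
{
	if (region_check(r, len, offs) != 0)
		return -1;
	return pwrite(fd, buf, len, (off_t)(r->offset + (uint64_t)offs));
}

/* Usage sketch: a temporary file plays the role of the device fd. */
static int
demo_roundtrip(void)
{
	struct region_window bar = { .size = 16, .offset = 32 };
	uint32_t in = 0x12345678, out = 0;
	FILE *f = tmpfile();
	int fd, ok;

	if (f == NULL)
		return -1;
	fd = fileno(f);
	ok = region_write(fd, &bar, &in, sizeof(in), 4) == (ssize_t)sizeof(in) &&
	     region_read(fd, &bar, &out, sizeof(out), 4) == (ssize_t)sizeof(out) &&
	     out == in &&
	     /* 14 + 4 > 16, so this access is rejected before any I/O */
	     region_read(fd, &bar, &out, sizeof(out), 14) == -1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A driver following this pattern would pay a system call per access, so
it probably only makes sense for regions that really cannot be mmap-ed.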

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/bus/pci/bsd/pci.c               | 22 ++++++++++++
 drivers/bus/pci/linux/pci.c             | 46 ++++++++++++++++++++++++
 drivers/bus/pci/linux/pci_init.h        | 10 ++++++
 drivers/bus/pci/linux/pci_uio.c         | 22 ++++++++++++
 drivers/bus/pci/linux/pci_vfio.c        | 36 +++++++++++++++++++
 drivers/bus/pci/rte_bus_pci.h           | 48 +++++++++++++++++++++++++
 drivers/bus/pci/rte_bus_pci_version.map |  7 ++++
 7 files changed, 191 insertions(+)

diff --git a/drivers/bus/pci/bsd/pci.c b/drivers/bus/pci/bsd/pci.c
index 636868f38..d4d2c9016 100644
--- a/drivers/bus/pci/bsd/pci.c
+++ b/drivers/bus/pci/bsd/pci.c
@@ -527,6 +527,28 @@ int rte_pci_write_config(const struct rte_pci_device *dev,
 	return -1;
 }
 
+/* Read PCI MMIO space. */
+int rte_pci_mmio_read(const struct rte_pci_device *dev, int bar,
+		      void *buf, size_t len, off_t offset)
+{
+	if (bar >= PCI_MAX_RESOURCE || dev->mem_resource[bar].addr == NULL ||
+			(uint64_t)offset + len > dev->mem_resource[bar].len)
+		return -1;
+	memcpy(buf, (uint8_t *)dev->mem_resource[bar].addr + offset, len);
+	return len;
+}
+
+/* Write PCI MMIO space. */
+int rte_pci_mmio_write(const struct rte_pci_device *dev, int bar,
+		       const void *buf, size_t len, off_t offset)
+{
+	if (bar >= PCI_MAX_RESOURCE || dev->mem_resource[bar].addr == NULL ||
+			(uint64_t)offset + len > dev->mem_resource[bar].len)
+		return -1;
+	memcpy((uint8_t *)dev->mem_resource[bar].addr + offset, buf, len);
+	return len;
+}
+
 int
 rte_pci_ioport_map(struct rte_pci_device *dev, int bar,
 		struct rte_pci_ioport *p)
diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index 00bfbb301..bdfc8c5ff 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -674,6 +674,52 @@ int rte_pci_write_config(const struct rte_pci_device *device,
 	}
 }
 
+/* Read PCI MMIO space. */
+int rte_pci_mmio_read(const struct rte_pci_device *device, int bar,
+		void *buf, size_t len, off_t offset)
+{
+	char devname[RTE_DEV_NAME_MAX_LEN] = "";
+
+	switch (device->kdrv) {
+	case RTE_KDRV_IGB_UIO:
+	case RTE_KDRV_UIO_GENERIC:
+		return pci_uio_mmio_read(device, bar, buf, len, offset);
+#ifdef VFIO_PRESENT
+	case RTE_KDRV_VFIO:
+		return pci_vfio_mmio_read(device, bar, buf, len, offset);
+#endif
+	default:
+		rte_pci_device_name(&device->addr, devname,
+				    RTE_DEV_NAME_MAX_LEN);
+		RTE_LOG(ERR, EAL,
+			"Unknown driver type for %s\n", devname);
+		return -1;
+	}
+}
+
+/* Write PCI MMIO space. */
+int rte_pci_mmio_write(const struct rte_pci_device *device, int bar,
+		const void *buf, size_t len, off_t offset)
+{
+	char devname[RTE_DEV_NAME_MAX_LEN] = "";
+
+	switch (device->kdrv) {
+	case RTE_KDRV_IGB_UIO:
+	case RTE_KDRV_UIO_GENERIC:
+		return pci_uio_mmio_write(device, bar, buf, len, offset);
+#ifdef VFIO_PRESENT
+	case RTE_KDRV_VFIO:
+		return pci_vfio_mmio_write(device, bar, buf, len, offset);
+#endif
+	default:
+		rte_pci_device_name(&device->addr, devname,
+				    RTE_DEV_NAME_MAX_LEN);
+		RTE_LOG(ERR, EAL,
+			"Unknown driver type for %s\n", devname);
+		return -1;
+	}
+}
+
 #if defined(RTE_ARCH_X86)
 static int
 pci_ioport_map(struct rte_pci_device *dev, int bar __rte_unused,
diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index c6542a8f9..158a16977 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -35,6 +35,11 @@ int pci_uio_read_config(const struct rte_intr_handle *intr_handle,
 int pci_uio_write_config(const struct rte_intr_handle *intr_handle,
 			 const void *buf, size_t len, off_t offs);
 
+int pci_uio_mmio_read(const struct rte_pci_device *dev, int bar,
+		      void *buf, size_t len, off_t offset);
+int pci_uio_mmio_write(const struct rte_pci_device *dev, int bar,
+		       const void *buf, size_t len, off_t offset);
+
 int pci_uio_ioport_map(struct rte_pci_device *dev, int bar,
 		       struct rte_pci_ioport *p);
 void pci_uio_ioport_read(struct rte_pci_ioport *p,
@@ -69,6 +74,11 @@ int pci_vfio_read_config(const struct rte_pci_device *dev,
 int pci_vfio_write_config(const struct rte_pci_device *dev,
 			  const void *buf, size_t len, off_t offs);
 
+int pci_vfio_mmio_read(const struct rte_pci_device *dev, int bar,
+		       void *buf, size_t len, off_t offset);
+int pci_vfio_mmio_write(const struct rte_pci_device *dev, int bar,
+			const void *buf, size_t len, off_t offset);
+
 int pci_vfio_ioport_map(struct rte_pci_device *dev, int bar,
 		        struct rte_pci_ioport *p);
 void pci_vfio_ioport_read(struct rte_pci_ioport *p,
diff --git a/drivers/bus/pci/linux/pci_uio.c b/drivers/bus/pci/linux/pci_uio.c
index f240fe4f2..623273541 100644
--- a/drivers/bus/pci/linux/pci_uio.c
+++ b/drivers/bus/pci/linux/pci_uio.c
@@ -45,6 +45,28 @@ pci_uio_write_config(const struct rte_intr_handle *intr_handle,
 	return pwrite(intr_handle->uio_cfg_fd, buf, len, offset);
 }
 
+int
+pci_uio_mmio_read(const struct rte_pci_device *dev, int bar,
+		  void *buf, size_t len, off_t offset)
+{
+	if (bar >= PCI_MAX_RESOURCE || dev->mem_resource[bar].addr == NULL ||
+			(uint64_t)offset + len > dev->mem_resource[bar].len)
+		return -1;
+	memcpy(buf, (uint8_t *)dev->mem_resource[bar].addr + offset, len);
+	return len;
+}
+
+int
+pci_uio_mmio_write(const struct rte_pci_device *dev, int bar,
+		   const void *buf, size_t len, off_t offset)
+{
+	if (bar >= PCI_MAX_RESOURCE || dev->mem_resource[bar].addr == NULL ||
+			(uint64_t)offset + len > dev->mem_resource[bar].len)
+		return -1;
+	memcpy((uint8_t *)dev->mem_resource[bar].addr + offset, buf, len);
+	return len;
+}
+
 static int
 pci_uio_set_bus_master(int dev_fd)
 {
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index 2dc4a9299..204698be0 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -1164,6 +1164,42 @@ pci_vfio_ioport_unmap(struct rte_pci_ioport *p)
 	return -1;
 }
 
+int
+pci_vfio_mmio_read(const struct rte_pci_device *dev, int bar,
+		   void *buf, size_t len, off_t offs)
+{
+	uint64_t size, offset;
+	int fd;
+
+	fd = dev->intr_handle.vfio_dev_fd;
+
+	if (pci_vfio_get_region(dev, bar, &size, &offset) != 0)
+		return -1;
+
+	if ((uint64_t)len + offs > size)
+		return -1;
+
+	return pread64(fd, buf, len, offset + offs);
+}
+
+int
+pci_vfio_mmio_write(const struct rte_pci_device *dev, int bar,
+		    const void *buf, size_t len, off_t offs)
+{
+	uint64_t size, offset;
+	int fd;
+
+	fd = dev->intr_handle.vfio_dev_fd;
+
+	if (pci_vfio_get_region(dev, bar, &size, &offset) != 0)
+		return -1;
+
+	if ((uint64_t)len + offs > size)
+		return -1;
+
+	return pwrite64(fd, buf, len, offset + offs);
+}
+
 int
 pci_vfio_is_enabled(void)
 {
diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
index 06e004cd3..86527b421 100644
--- a/drivers/bus/pci/rte_bus_pci.h
+++ b/drivers/bus/pci/rte_bus_pci.h
@@ -285,6 +285,54 @@ int rte_pci_read_config(const struct rte_pci_device *device,
 int rte_pci_write_config(const struct rte_pci_device *device,
 		const void *buf, size_t len, off_t offset);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read from an MMIO PCI resource.
+ *
+ * @param device
+ *   A pointer to a rte_pci_device structure describing the device
+ *   to use
+ * @param bar
+ *   Index of the MMIO PCI resource (BAR) we want to access.
+ * @param buf
+ *   A data buffer where the bytes should be read into
+ * @param len
+ *   The length of the data buffer.
+ * @param offset
+ *   The offset into the MMIO space described by @p bar
+ * @return
+ *  Number of bytes read on success, negative on error.
+ */
+__rte_experimental
+int rte_pci_mmio_read(const struct rte_pci_device *device, int bar,
+		void *buf, size_t len, off_t offset);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Write to an MMIO PCI resource.
+ *
+ * @param device
+ *   A pointer to a rte_pci_device structure describing the device
+ *   to use
+ * @param bar
+ *   Index of the MMIO PCI resource (BAR) we want to access.
+ * @param buf
+ *   A data buffer containing the bytes to be written
+ * @param len
+ *   The length of the data buffer.
+ * @param offset
+ *   The offset into the MMIO space described by @p bar
+ * @return
+ *  Number of bytes written on success, negative on error.
+ */
+__rte_experimental
+int rte_pci_mmio_write(const struct rte_pci_device *device, int bar,
+		const void *buf, size_t len, off_t offset);
+
 /**
  * A structure used to access io resources for a pci device.
  * rte_pci_ioport is arch, os, driver specific, and should not be used outside
diff --git a/drivers/bus/pci/rte_bus_pci_version.map b/drivers/bus/pci/rte_bus_pci_version.map
index 27e9c4f10..141bdf48e 100644
--- a/drivers/bus/pci/rte_bus_pci_version.map
+++ b/drivers/bus/pci/rte_bus_pci_version.map
@@ -16,3 +16,10 @@ DPDK_17.11 {
 
 	local: *;
 };
+
+EXPERIMENTAL {
+	global:
+
+	rte_pci_mmio_read;
+	rte_pci_mmio_write;
+};
-- 
2.17.1



* [dpdk-dev] [RFC v2 4/5] eal: add a helper for reading string from sysfs
  2019-07-15  7:52 ` [dpdk-dev] [RFC v2 0/5] " Tiwei Bie
                     ` (2 preceding siblings ...)
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 3/5] bus/pci: introduce helper for MMIO read and write Tiwei Bie
@ 2019-07-15  7:52   ` Tiwei Bie
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 5/5] bus/pci: add mdev support Tiwei Bie
  4 siblings, 0 replies; 41+ messages in thread
From: Tiwei Bie @ 2019-07-15  7:52 UTC (permalink / raw)
  To: dev
  Cc: ferruh.yigit, anatoly.burakov, bruce.richardson, keith.wiles,
	david.marchand, alejandro.lucero, cunming.liang

This patch adds a helper, rte_eal_parse_sysfs_str(), for reading a
string from a sysfs file.
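
For illustration only, a standalone sketch of this kind of helper is
shown below. It reads the first line of a file with fgets(), as the
helper in this patch does, and additionally strips the trailing
newline, which the patch itself does not do. read_first_line() and
demo_device_api() are hypothetical names, and the demo reads a
temporary file rather than a real sysfs attribute.

```c
#define _XOPEN_SOURCE 700 /* exposes mkstemp() */
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: read the first line of path into buf (size sz). */
static int
read_first_line(const char *path, char *buf, size_t sz)
{
	FILE *f;

	assert(path != NULL && buf != NULL);
	if (sz == 0 || sz > INT_MAX) /* fgets() takes an int size */
		return -1;

	f = fopen(path, "r");
	if (f == NULL)
		return -1;

	if (fgets(buf, (int)sz, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Attribute files usually end with a newline; drop it. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Usage sketch: a temporary file stands in for a sysfs attribute. */
static int
demo_device_api(void)
{
	char path[] = "/tmp/device_api_XXXXXX";
	char val[32];
	int fd = mkstemp(path);
	int ret;

	if (fd < 0)
		return -1;
	if (write(fd, "vfio-pci\n", 9) != 9) {
		close(fd);
		unlink(path);
		return -1;
	}
	close(fd);

	ret = read_first_line(path, val, sizeof(val));
	unlink(path);
	if (ret != 0)
		return -1;
	return strcmp(val, "vfio-pci") == 0 ? 0 : -1;
}
```

Stripping the newline in the helper keeps callers from repeating it
before every string comparison; whether the real helper should do so is
a design question for review.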

Signed-off-by: Cunming Liang <cunming.liang@intel.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 lib/librte_eal/common/eal_filesystem.h | 10 ++++++++++
 lib/librte_eal/freebsd/eal/eal.c       | 22 ++++++++++++++++++++++
 lib/librte_eal/linux/eal/eal.c         | 22 ++++++++++++++++++++++
 lib/librte_eal/rte_eal_version.map     |  1 +
 4 files changed, 55 insertions(+)

diff --git a/lib/librte_eal/common/eal_filesystem.h b/lib/librte_eal/common/eal_filesystem.h
index 5d21f07c2..be4c51ebb 100644
--- a/lib/librte_eal/common/eal_filesystem.h
+++ b/lib/librte_eal/common/eal_filesystem.h
@@ -104,4 +104,14 @@ eal_get_hugefile_path(char *buffer, size_t buflen, const char *hugedir, int f_id
  * Used to read information from files on /sys */
 int eal_parse_sysfs_value(const char *filename, unsigned long *val);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Function to read a line from a file on the filesystem.
+ * Used to read information from files on /sys
+ */
+__rte_experimental
+int rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz);
+
 #endif /* EAL_FILESYSTEM_H */
diff --git a/lib/librte_eal/freebsd/eal/eal.c b/lib/librte_eal/freebsd/eal/eal.c
index d53f0fe69..78720685f 100644
--- a/lib/librte_eal/freebsd/eal/eal.c
+++ b/lib/librte_eal/freebsd/eal/eal.c
@@ -209,6 +209,28 @@ eal_parse_sysfs_value(const char *filename, unsigned long *val)
 	return 0;
 }
 
+int
+rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
+{
+	FILE *f;
+
+	f = fopen(filename, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
+			__func__, filename);
+		return -1;
+	}
+
+	if (fgets(buf, sz, f) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs file %s\n",
+			__func__, filename);
+		fclose(f);
+		return -1;
+	}
+
+	fclose(f);
+	return 0;
+}
 
 /* create memory configuration in shared/mmap memory. Take out
  * a write lock on the memsegs, so we can auto-detect primary/secondary.
diff --git a/lib/librte_eal/linux/eal/eal.c b/lib/librte_eal/linux/eal/eal.c
index 2e5499f9b..44bad45d3 100644
--- a/lib/librte_eal/linux/eal/eal.c
+++ b/lib/librte_eal/linux/eal/eal.c
@@ -295,6 +295,28 @@ eal_parse_sysfs_value(const char *filename, unsigned long *val)
 	return 0;
 }
 
+int
+rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
+{
+	FILE *f;
+
+	f = fopen(filename, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
+			__func__, filename);
+		return -1;
+	}
+
+	if (fgets(buf, sz, f) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs file %s\n",
+			__func__, filename);
+		fclose(f);
+		return -1;
+	}
+
+	fclose(f);
+	return 0;
+}
 
 /* create memory configuration in shared/mmap memory. Take out
  * a write lock on the memsegs, so we can auto-detect primary/secondary.
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index 1892d9ea9..a9559176b 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -331,6 +331,7 @@ EXPERIMENTAL {
 	rte_dev_hotplug_handle_enable;
 	rte_dev_iterator_init;
 	rte_dev_iterator_next;
+	rte_eal_parse_sysfs_str;
 	rte_extmem_attach;
 	rte_extmem_detach;
 	rte_extmem_register;
-- 
2.17.1



* [dpdk-dev] [RFC v2 5/5] bus/pci: add mdev support
  2019-07-15  7:52 ` [dpdk-dev] [RFC v2 0/5] " Tiwei Bie
                     ` (3 preceding siblings ...)
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 4/5] eal: add a helper for reading string from sysfs Tiwei Bie
@ 2019-07-15  7:52   ` Tiwei Bie
  2021-06-01  3:06     ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Chenbo Xia
  4 siblings, 1 reply; 41+ messages in thread
From: Tiwei Bie @ 2019-07-15  7:52 UTC (permalink / raw)
  To: dev
  Cc: ferruh.yigit, anatoly.burakov, bruce.richardson, keith.wiles,
	david.marchand, alejandro.lucero, cunming.liang

This patch adds mdev (mediated device) support to the PCI bus
driver. With this patch, the PCI bus driver is able to scan and
probe the mediated PCI devices (i.e. the mediated devices whose
device API is "vfio-pci") in the system.

Mediated PCI devices differ from physical PCI devices in several
ways:

- Mediated PCI devices have to be accessed through the VFIO API;
- The regions of a mediated PCI device may not be mmap-able, in
  which case drivers need to call the read/write functions to
  access them;
- Mediated PCI devices use a UUID as the device address.
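
To make the UUID addressing concrete, the sketch below checks that a
string has the canonical 8-4-4-4-12 UUID layout and builds the path of
the device's mdev_type/device_api attribute under
/sys/bus/mdev/devices. It is an assumption-laden illustration, not the
code in this patch: uuid_str_is_valid() and mdev_device_api_path() are
hypothetical names, and the patch itself relies on rte_uuid_unparse()
and pci_mdev_get_sysfs_path().

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Directory where the kernel lists mediated devices. */
#define MDEV_SYSFS_DEVICES "/sys/bus/mdev/devices"

/* Return 1 if s has the canonical 8-4-4-4-12 hexadecimal UUID layout. */
static int
uuid_str_is_valid(const char *s)
{
	size_t i;

	if (strlen(s) != 36)
		return 0;
	for (i = 0; i < 36; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			if (s[i] != '-')
				return 0;
		} else if (!isxdigit((unsigned char)s[i])) {
			return 0;
		}
	}
	return 1;
}

/* Build "<devices>/<uuid>/mdev_type/device_api"; 0 on success, -1 on error. */
static int
mdev_device_api_path(const char *uuid, char *out, size_t sz)
{
	int n;

	assert(uuid != NULL && out != NULL);
	if (!uuid_str_is_valid(uuid))
		return -1;

	n = snprintf(out, sz, "%s/%s/mdev_type/device_api",
		     MDEV_SYSFS_DEVICES, uuid);
	/* Treat truncation as an error rather than using a partial path. */
	return (n < 0 || (size_t)n >= sz) ? -1 : 0;
}
```

A PCI address such as "0000:00:1f.0" fails the layout check, so a
caller could use the same test to tell the two address kinds apart.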

Signed-off-by: Cunming Liang <cunming.liang@intel.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/bus/pci/linux/Makefile        |   1 +
 drivers/bus/pci/linux/pci.c           |  30 +++-
 drivers/bus/pci/linux/pci_init.h      |  15 +-
 drivers/bus/pci/linux/pci_vfio.c      | 104 ++++++++++--
 drivers/bus/pci/linux/pci_vfio_mdev.c | 236 ++++++++++++++++++++++++++
 drivers/bus/pci/meson.build           |   3 +-
 drivers/bus/pci/pci_common.c          |  29 ++--
 drivers/bus/pci/rte_bus_pci.h         |  17 +-
 lib/librte_eal/linux/eal/eal.c        |  17 +-
 9 files changed, 404 insertions(+), 48 deletions(-)
 create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c

diff --git a/drivers/bus/pci/linux/Makefile b/drivers/bus/pci/linux/Makefile
index 90404468b..c17ab2484 100644
--- a/drivers/bus/pci/linux/Makefile
+++ b/drivers/bus/pci/linux/Makefile
@@ -4,3 +4,4 @@
 SRCS += pci.c
 SRCS += pci_uio.c
 SRCS += pci_vfio.c
+SRCS += pci_vfio_mdev.c
diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index bdfc8c5ff..5be898803 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -31,7 +31,7 @@
 
 extern struct rte_pci_bus rte_pci_bus;
 
-static int
+int
 pci_get_kernel_driver_by_path(const char *filename, char *dri_name,
 			      size_t len)
 {
@@ -71,7 +71,7 @@ rte_pci_map_device(struct rte_pci_device *dev)
 	switch (dev->kdrv) {
 	case RTE_KDRV_VFIO:
 #ifdef VFIO_PRESENT
-		if (pci_vfio_is_enabled())
+		if (pci_vfio_is_enabled(dev))
 			ret = pci_vfio_map_resource(dev);
 #endif
 		break;
@@ -100,7 +100,7 @@ rte_pci_unmap_device(struct rte_pci_device *dev)
 	switch (dev->kdrv) {
 	case RTE_KDRV_VFIO:
 #ifdef VFIO_PRESENT
-		if (pci_vfio_is_enabled())
+		if (pci_vfio_is_enabled(dev))
 			pci_vfio_unmap_resource(dev);
 #endif
 		break;
@@ -348,6 +348,15 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 		int ret;
 
 		TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) {
+			/*
+			 * Insert physical PCI devices before all mediated
+			 * PCI devices.
+			 */
+			if (dev2->is_mdev) {
+				rte_pci_insert_device(dev2, dev);
+				return 0;
+			}
+
 			ret = rte_pci_addr_cmp(&dev->addr, &dev2->addr);
 			if (ret > 0)
 				continue;
@@ -471,8 +480,14 @@ rte_pci_scan(void)
 		return 0;
 
 #ifdef VFIO_PRESENT
-	if (!pci_vfio_is_enabled())
-		RTE_LOG(DEBUG, EAL, "VFIO PCI modules not loaded\n");
+	if (!rte_vfio_is_enabled("vfio_pci"))
+		RTE_LOG(DEBUG, EAL, "VFIO PCI module not loaded\n");
+
+	if (!rte_vfio_is_enabled("vfio_mdev"))
+		RTE_LOG(DEBUG, EAL, "VFIO MDEV module not loaded\n");
+
+	if (pci_scan_mdev() != 0)
+		return -1;
 #endif
 
 	dir = opendir(rte_pci_get_sysfs_path());
@@ -788,7 +803,7 @@ rte_pci_ioport_map(struct rte_pci_device *dev, int bar,
 	switch (dev->kdrv) {
 #ifdef VFIO_PRESENT
 	case RTE_KDRV_VFIO:
-		if (pci_vfio_is_enabled())
+		if (pci_vfio_is_enabled(dev))
 			ret = pci_vfio_ioport_map(dev, bar, p);
 		break;
 #endif
@@ -877,8 +892,7 @@ rte_pci_ioport_unmap(struct rte_pci_ioport *p)
 	switch (p->dev->kdrv) {
 #ifdef VFIO_PRESENT
 	case RTE_KDRV_VFIO:
-		if (pci_vfio_is_enabled())
-			ret = pci_vfio_ioport_unmap(p);
+		ret = -1;
 		break;
 #endif
 	case RTE_KDRV_IGB_UIO:
diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index 158a16977..12739ba51 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -17,6 +17,9 @@
 extern void *pci_map_addr;
 void *pci_find_max_end_va(void);
 
+int pci_get_kernel_driver_by_path(const char *filename, char *dri_name,
+				  size_t len);
+
 /* parse one line of the "resource" sysfs file (note that the 'line'
  * string is modified)
  */
@@ -91,7 +94,17 @@ int pci_vfio_ioport_unmap(struct rte_pci_ioport *p);
 int pci_vfio_map_resource(struct rte_pci_device *dev);
 int pci_vfio_unmap_resource(struct rte_pci_device *dev);
 
-int pci_vfio_is_enabled(void);
+int pci_vfio_is_enabled(struct rte_pci_device *dev);
+
+int pci_vfio_fill_regions(struct rte_pci_device *dev, int vfio_dev_fd,
+			  struct vfio_device_info *device_info);
+
+int pci_vfio_get_pci_id(struct rte_pci_device *dev, int vfio_dev_fd,
+			struct rte_pci_id *pci_id);
+
+const char *pci_mdev_get_sysfs_path(void);
+
+int pci_scan_mdev(void);
 
 #endif
 
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index 204698be0..7cea57ff9 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -21,6 +21,7 @@
 #include <rte_bus.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
+#include <rte_uuid.h>
 
 #include "eal_filesystem.h"
 
@@ -696,7 +697,7 @@ pci_vfio_msix_is_mappable(int vfio_dev_fd, int msix_region)
 	return ret;
 }
 
-static int
+int
 pci_vfio_fill_regions(struct rte_pci_device *dev, int vfio_dev_fd,
 		      struct vfio_device_info *device_info)
 {
@@ -731,6 +732,7 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
 	struct vfio_region_info *reg = NULL;
 	char pci_addr[PATH_MAX] = {0};
+	const char *sysfs_base;
 	int vfio_dev_fd;
 	struct rte_pci_addr *loc = &dev->addr;
 	int i, ret;
@@ -746,10 +748,16 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 #endif
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->is_mdev) {
+		sysfs_base = pci_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+	} else {
+		sysfs_base = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
 
-	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
+	ret = rte_vfio_setup_device(sysfs_base, pci_addr,
 					&vfio_dev_fd, &device_info);
 	if (ret)
 		return ret;
@@ -889,6 +897,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 {
 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
 	char pci_addr[PATH_MAX] = {0};
+	const char *sysfs_base;
 	int vfio_dev_fd;
 	struct rte_pci_addr *loc = &dev->addr;
 	int i, ret;
@@ -904,8 +913,14 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 #endif
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->is_mdev) {
+		sysfs_base = pci_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+	} else {
+		sysfs_base = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
 
 	/* if we're in a secondary process, just find our tailq entry */
 	TAILQ_FOREACH(vfio_res, vfio_res_list, next) {
@@ -921,7 +936,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 		return -1;
 	}
 
-	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
+	ret = rte_vfio_setup_device(sysfs_base, pci_addr,
 					&vfio_dev_fd, &device_info);
 	if (ret)
 		return ret;
@@ -1011,6 +1026,7 @@ find_and_unmap_vfio_resource(struct mapped_pci_res_list *vfio_res_list,
 static int
 pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 {
+	const char *sysfs_base;
 	char pci_addr[PATH_MAX] = {0};
 	struct rte_pci_addr *loc = &dev->addr;
 	struct mapped_pci_resource *vfio_res = NULL;
@@ -1018,8 +1034,14 @@ pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 	int ret;
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->is_mdev) {
+		sysfs_base = pci_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+	} else {
+		sysfs_base = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
 
 #ifdef HAVE_VFIO_DEV_REQ_INTERFACE
 	ret = pci_vfio_disable_notifier(dev);
@@ -1041,7 +1063,7 @@ pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 		return -1;
 	}
 
-	ret = rte_vfio_release_device(rte_pci_get_sysfs_path(), pci_addr,
+	ret = rte_vfio_release_device(sysfs_base, pci_addr,
 				  dev->intr_handle.vfio_dev_fd);
 	if (ret < 0) {
 		RTE_LOG(ERR, EAL,
@@ -1068,6 +1090,7 @@ pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 static int
 pci_vfio_unmap_resource_secondary(struct rte_pci_device *dev)
 {
+	const char *sysfs_base;
 	char pci_addr[PATH_MAX] = {0};
 	struct rte_pci_addr *loc = &dev->addr;
 	struct mapped_pci_resource *vfio_res = NULL;
@@ -1075,10 +1098,16 @@ pci_vfio_unmap_resource_secondary(struct rte_pci_device *dev)
 	int ret;
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->is_mdev) {
+		sysfs_base = pci_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+	} else {
+		sysfs_base = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
 
-	ret = rte_vfio_release_device(rte_pci_get_sysfs_path(), pci_addr,
+	ret = rte_vfio_release_device(sysfs_base, pci_addr,
 				  dev->intr_handle.vfio_dev_fd);
 	if (ret < 0) {
 		RTE_LOG(ERR, EAL,
@@ -1201,8 +1230,61 @@ pci_vfio_mmio_write(const struct rte_pci_device *dev, int bar,
 }
 
 int
-pci_vfio_is_enabled(void)
+pci_vfio_is_enabled(struct rte_pci_device *dev)
 {
-	return rte_vfio_is_enabled("vfio_pci");
+	return rte_vfio_is_enabled(dev->is_mdev ? "vfio_mdev" : "vfio_pci");
 }
+
+int
+pci_vfio_get_pci_id(struct rte_pci_device *dev, int vfio_dev_fd,
+		    struct rte_pci_id *pci_id)
+{
+	uint64_t size, offset;
+	int class;
+
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+				&size, &offset) != 0) {
+		RTE_LOG(DEBUG, EAL, "Cannot get offset of CONFIG region.\n");
+		return -1;
+	}
+
+	/* vendor_id */
+	if (pread64(vfio_dev_fd, &pci_id->vendor_id, sizeof(uint16_t),
+		    offset + PCI_VENDOR_ID) != sizeof(uint16_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read VendorID from PCI config space\n");
+		return -1;
+	}
+
+	/* device_id */
+	if (pread64(vfio_dev_fd, &pci_id->device_id, sizeof(uint16_t),
+		    offset + PCI_DEVICE_ID) != sizeof(uint16_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read DeviceID from PCI config space\n");
+		return -1;
+	}
+
+	/* subsystem_vendor_id */
+	if (pread64(vfio_dev_fd, &pci_id->subsystem_vendor_id, sizeof(uint16_t),
+		    offset + PCI_SUBSYSTEM_VENDOR_ID) != sizeof(uint16_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read SubVendorID from PCI config space\n");
+		return -1;
+	}
+
+	/* subsystem_device_id */
+	if (pread64(vfio_dev_fd, &pci_id->subsystem_device_id, sizeof(uint16_t),
+		    offset + PCI_SUBSYSTEM_ID) != sizeof(uint16_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read SubDeviceID from PCI config space\n");
+		return -1;
+	}
+
+	/* class_id */
+	if (pread64(vfio_dev_fd, &class, sizeof(uint32_t),
+		    offset + PCI_CLASS_REVISION) != sizeof(uint32_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read ClassID from PCI config space\n");
+		return -1;
+	}
+	pci_id->class_id = class >> 8;
+
+	return 0;
+}
+
 #endif
diff --git a/drivers/bus/pci/linux/pci_vfio_mdev.c b/drivers/bus/pci/linux/pci_vfio_mdev.c
new file mode 100644
index 000000000..dab7e9b35
--- /dev/null
+++ b/drivers/bus/pci/linux/pci_vfio_mdev.c
@@ -0,0 +1,236 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <string.h>
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/pci_regs.h>
+
+#include <rte_log.h>
+#include <rte_pci.h>
+#include <rte_eal_memconfig.h>
+#include <rte_malloc.h>
+#include <rte_devargs.h>
+#include <rte_memcpy.h>
+#include <rte_vfio.h>
+#include <rte_uuid.h>
+
+#include "eal_private.h"
+#include "eal_filesystem.h"
+
+#include "private.h"
+#include "pci_init.h"
+
+#ifdef VFIO_PRESENT
+
+extern struct rte_pci_bus rte_pci_bus;
+
+#define SYSFS_MDEV_DEVICES "/sys/bus/mdev/devices"
+
+const char *pci_mdev_get_sysfs_path(void)
+{
+	const char *path = NULL;
+
+	path = getenv("SYSFS_MDEV_DEVICES");
+	if (path == NULL)
+		return SYSFS_MDEV_DEVICES;
+
+	return path;
+}
+
+static int
+is_pci_device(const char *dirname)
+{
+	char device_api[PATH_MAX];
+	char filename[PATH_MAX];
+	char *ptr;
+
+	/* get device_api */
+	snprintf(filename, sizeof(filename), "%s/mdev_type/device_api",
+		 dirname);
+
+	if (rte_eal_parse_sysfs_str(filename, device_api,
+				    sizeof(device_api)) < 0) {
+		return -1;
+	}
+
+	ptr = strchr(device_api, '\n');
+	if (ptr != NULL)
+		*ptr = '\0';
+
+	return strcmp(device_api, "vfio-pci") == 0;
+}
+
+static int
+pci_scan_one_mdev(const char *dirname, const rte_uuid_t addr)
+{
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+	char name[RTE_UUID_STRLEN];
+	char filename[PATH_MAX];
+	char path[PATH_MAX];
+	char driver[PATH_MAX];
+	char *ptr;
+	struct rte_pci_device_internal *pdev;
+	struct rte_pci_device *dev;
+	bool need_release = false;
+	const char *sysfs_base;
+	unsigned long tmp;
+	int vfio_dev_fd;
+	int ret;
+
+	sysfs_base = pci_mdev_get_sysfs_path();
+
+	pdev = malloc(sizeof(*pdev));
+	if (pdev == NULL)
+		return -1;
+
+	memset(pdev, 0, sizeof(*pdev));
+
+	dev = &pdev->device;
+	dev->device.bus = &rte_pci_bus.bus;
+	rte_uuid_unparse(addr, name, sizeof(name));
+
+	/* parse driver */
+	snprintf(filename, sizeof(filename), "%s/driver", dirname);
+	ret = pci_get_kernel_driver_by_path(filename, driver, sizeof(driver));
+	if (ret < 0) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to get kernel driver\n", name);
+		goto err;
+	}
+
+	if (ret != 0 || strcmp(driver, "vfio_mdev") != 0) {
+		RTE_LOG(DEBUG, EAL, "%s: unsupported mdev driver\n", name);
+		goto err;
+	}
+
+	dev->kdrv = RTE_KDRV_VFIO;
+
+	dev->is_mdev = 1;
+	rte_uuid_copy(dev->uuid, addr);
+
+	snprintf(filename, sizeof(filename), "%s/%s", sysfs_base, name);
+
+	/* Get the path of the parent device. */
+	if (realpath(filename, path) == NULL) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to get parent device\n", name);
+		goto err;
+	}
+
+	ptr = strrchr(path, '/');
+	if (ptr == NULL) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to parse parent device\n",
+			name);
+		goto err;
+	}
+	*ptr = '\0';
+
+	/* get numa node, default to 0 if not present */
+	snprintf(filename, sizeof(filename), "%s/numa_node", path);
+
+	if (access(filename, F_OK) != -1) {
+		if (eal_parse_sysfs_value(filename, &tmp) == 0)
+			dev->device.numa_node = tmp;
+		else
+			dev->device.numa_node = -1;
+	} else {
+		dev->device.numa_node = 0;
+	}
+
+	pci_name_set(dev);
+
+	if (rte_vfio_setup_device(sysfs_base, name, &vfio_dev_fd,
+				  &device_info) != 0) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to setup device\n", name);
+		goto err;
+	}
+
+	need_release = true;
+
+	if (pci_vfio_fill_regions(dev, vfio_dev_fd, &device_info) != 0) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to get regions\n", name);
+		goto err;
+	}
+
+	if (pci_vfio_get_pci_id(dev, vfio_dev_fd, &dev->id) != 0) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to access the device\n", name);
+		goto err;
+	}
+
+	/* device is valid, add to the list (sorted) */
+	if (TAILQ_EMPTY(&rte_pci_bus.device_list)) {
+		rte_pci_add_device(dev);
+	} else {
+		struct rte_pci_device *dev2;
+		int ret;
+
+		TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) {
+			/*
+			 * Insert mediated PCI devices after all physical
+			 * PCI devices.
+			 */
+			if (!dev2->is_mdev)
+				continue;
+			ret = rte_uuid_compare(dev->uuid, dev2->uuid);
+			if (ret > 0)
+				continue;
+			if (ret < 0)
+				rte_pci_insert_device(dev2, dev);
+			else /* already registered */
+				free(pdev);
+			return 0;
+		}
+
+		rte_pci_add_device(dev);
+	}
+
+	return 0;
+
+err:
+	if (need_release)
+		rte_vfio_release_device(sysfs_base, name, vfio_dev_fd);
+	free(pdev);
+	return 1;
+}
+
+int
+pci_scan_mdev(void)
+{
+	struct dirent *e;
+	DIR *dir;
+	char dirname[PATH_MAX];
+	rte_uuid_t addr;
+
+	dir = opendir(pci_mdev_get_sysfs_path());
+	if (dir == NULL) {
+		RTE_LOG(DEBUG, EAL, "%s(): opendir failed: %s\n",
+			__func__, strerror(errno));
+		return 0;
+	}
+
+	while ((e = readdir(dir)) != NULL) {
+		if (e->d_name[0] == '.')
+			continue;
+
+		if (rte_uuid_parse(e->d_name, addr) != 0)
+			continue;
+
+		snprintf(dirname, sizeof(dirname), "%s/%s",
+			 pci_mdev_get_sysfs_path(), e->d_name);
+
+		if (!is_pci_device(dirname))
+			continue;
+
+		if (pci_scan_one_mdev(dirname, addr) < 0)
+			goto error;
+	}
+	closedir(dir);
+	return 0;
+
+error:
+	closedir(dir);
+	return -1;
+}
+
+#endif /* VFIO_PRESENT */
diff --git a/drivers/bus/pci/meson.build b/drivers/bus/pci/meson.build
index a312ecc03..890b7bda0 100644
--- a/drivers/bus/pci/meson.build
+++ b/drivers/bus/pci/meson.build
@@ -11,7 +11,8 @@ sources = files('pci_common.c',
 if is_linux
 	sources += files('linux/pci.c',
 			'linux/pci_uio.c',
-			'linux/pci_vfio.c')
+			'linux/pci_vfio.c',
+			'linux/pci_vfio_mdev.c')
 	includes += include_directories('linux')
 else
 	sources += files('bsd/pci.c')
diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
index 8b9deca8b..ec314cb07 100644
--- a/drivers/bus/pci/pci_common.c
+++ b/drivers/bus/pci/pci_common.c
@@ -25,6 +25,7 @@
 #include <rte_common.h>
 #include <rte_devargs.h>
 #include <rte_vfio.h>
+#include <rte_uuid.h>
 
 #include "private.h"
 
@@ -61,8 +62,10 @@ pci_name_set(struct rte_pci_device *dev)
 	struct rte_devargs *devargs;
 
 	/* Each device has its internal, canonical name set. */
-	rte_pci_device_name(&dev->addr,
-			dev->name, sizeof(dev->name));
+	if (dev->is_mdev)
+		rte_uuid_unparse(dev->uuid, dev->name, sizeof(dev->name));
+	else
+		rte_pci_device_name(&dev->addr, dev->name, sizeof(dev->name));
 	devargs = pci_devargs_lookup(dev);
 	dev->device.devargs = devargs;
 	/* In blacklist mode, if the device is not blacklisted, no
@@ -124,21 +127,17 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
 {
 	int ret;
 	bool already_probed;
-	struct rte_pci_addr *loc;
 
 	if ((dr == NULL) || (dev == NULL))
 		return -EINVAL;
 
-	loc = &dev->addr;
-
 	/* The device is not blacklisted; Check if driver supports it */
 	if (!rte_pci_match(dr, dev))
 		/* Match of device and driver failed */
 		return 1;
 
-	RTE_LOG(INFO, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
-			loc->domain, loc->bus, loc->devid, loc->function,
-			dev->device.numa_node);
+	RTE_LOG(INFO, EAL, "PCI device %s on NUMA socket %i\n",
+		dev->name, dev->device.numa_node);
 
 	/* no initialization when blacklisted, return without error */
 	if (dev->device.devargs != NULL &&
@@ -208,7 +207,6 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
 static int
 rte_pci_detach_dev(struct rte_pci_device *dev)
 {
-	struct rte_pci_addr *loc;
 	struct rte_pci_driver *dr;
 	int ret = 0;
 
@@ -216,11 +214,9 @@ rte_pci_detach_dev(struct rte_pci_device *dev)
 		return -EINVAL;
 
 	dr = dev->driver;
-	loc = &dev->addr;
 
-	RTE_LOG(DEBUG, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
-			loc->domain, loc->bus, loc->devid,
-			loc->function, dev->device.numa_node);
+	RTE_LOG(DEBUG, EAL, "PCI device %s on NUMA socket %i\n",
+		dev->name, dev->device.numa_node);
 
 	RTE_LOG(DEBUG, EAL, "  remove driver: %x:%x %s\n", dev->id.vendor_id,
 			dev->id.device_id, dr->driver.name);
@@ -297,10 +293,9 @@ rte_pci_probe(void)
 			ret = pci_probe_all_drivers(dev);
 		if (ret < 0) {
 			if (ret != -EEXIST) {
-				RTE_LOG(ERR, EAL, "Requested device "
-					PCI_PRI_FMT " cannot be used\n",
-					dev->addr.domain, dev->addr.bus,
-					dev->addr.devid, dev->addr.function);
+				RTE_LOG(ERR, EAL,
+					"Requested device %s cannot be used\n",
+					dev->name);
 				rte_errno = errno;
 				failed++;
 			}
diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
index 86527b421..47e669e9c 100644
--- a/drivers/bus/pci/rte_bus_pci.h
+++ b/drivers/bus/pci/rte_bus_pci.h
@@ -51,13 +51,26 @@ TAILQ_HEAD(rte_pci_driver_list, rte_pci_driver);
 
 struct rte_devargs;
 
+/*
+ * NOTE: we can't include rte_uuid.h directly due to the conflicts
+ *      introduced by stdbool.h
+ */
+typedef unsigned char rte_uuid_t[16];
+
+/* It's RTE_UUID_STRLEN, which is bigger than PCI_PRI_STR_SIZE. */
+#define RTE_PCI_NAME_LEN		(36 + 1)
+
 /**
  * A structure describing a PCI device.
  */
 struct rte_pci_device {
 	TAILQ_ENTRY(rte_pci_device) next;   /**< Next probed PCI device. */
 	struct rte_device device;           /**< Inherit core device */
-	struct rte_pci_addr addr;           /**< PCI location. */
+	union {
+		struct rte_pci_addr addr;   /**< PCI location. */
+		rte_uuid_t uuid;            /**< Mdev location. */
+	};
+	uint8_t is_mdev;                    /**< True for mediated PCI device */
 	struct rte_pci_id id;               /**< PCI ID. */
 	struct rte_mem_resource mem_resource[PCI_MAX_RESOURCE];
 					    /**< PCI Memory Resource */
@@ -65,7 +78,7 @@ struct rte_pci_device {
 	struct rte_pci_driver *driver;      /**< PCI driver used in probing */
 	uint16_t max_vfs;                   /**< sriov enable if not zero */
 	enum rte_kernel_driver kdrv;        /**< Kernel driver passthrough */
-	char name[PCI_PRI_STR_SIZE+1];      /**< PCI location (ASCII) */
+	char name[RTE_PCI_NAME_LEN];        /**< PCI/Mdev location (ASCII) */
 	struct rte_intr_handle vfio_req_intr_handle;
 				/**< Handler of VFIO request interrupt */
 };
diff --git a/lib/librte_eal/linux/eal/eal.c b/lib/librte_eal/linux/eal/eal.c
index 44bad45d3..942148180 100644
--- a/lib/librte_eal/linux/eal/eal.c
+++ b/lib/librte_eal/linux/eal/eal.c
@@ -1068,6 +1068,15 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+#ifdef VFIO_PRESENT
+	if (rte_eal_vfio_setup() < 0) {
+		rte_eal_init_alert("Cannot init VFIO");
+		rte_errno = EAGAIN;
+		rte_atomic32_clear(&run_once);
+		return -1;
+	}
+#endif
+
 	if (rte_bus_scan()) {
 		rte_eal_init_alert("Cannot scan the buses for devices");
 		rte_errno = ENODEV;
@@ -1151,14 +1160,6 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
-#ifdef VFIO_PRESENT
-	if (rte_eal_vfio_setup() < 0) {
-		rte_eal_init_alert("Cannot init VFIO");
-		rte_errno = EAGAIN;
-		rte_atomic32_clear(&run_once);
-		return -1;
-	}
-#endif
 	/* in secondary processes, memory init may allocate additional fbarrays
 	 * not present in primary processes, so to avoid any potential issues,
 	 * initialize memzones first.
-- 
2.17.1



* [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK
  2019-07-15  7:52   ` [dpdk-dev] [RFC v2 5/5] bus/pci: add mdev support Tiwei Bie
@ 2021-06-01  3:06     ` Chenbo Xia
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 1/6] bus/pci: introduce an internal representation of PCI device Chenbo Xia
                         ` (6 more replies)
  0 siblings, 7 replies; 41+ messages in thread
From: Chenbo Xia @ 2021-06-01  3:06 UTC (permalink / raw)
  To: dev, thomas, cunming.liang, jingjing.wu
  Cc: anatoly.burakov, ferruh.yigit, mdr, nhorman, bruce.richardson,
	david.marchand, stephen, konstantin.ananyev

Hi everyone,

This is a draft implementation of the mdev (Mediated device [1])
support in the DPDK PCI bus driver. Mdev is a way to virtualize devices
in the Linux kernel. Based on the device-api (mdev_type/device_api),
there could be different types of mdev devices (e.g. vfio-pci).
In this patchset, the PCI bus driver is extended to support scanning
and probing the mdev devices whose device-api is "vfio-pci".

                     +---------+
                     | PCI bus |
                     +----+----+
                          |
         +--------+-------+-------+--------+
         |        |               |        |
  Physical PCI devices ...   Mediated PCI devices ...
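
As a rough illustration of the scan step, the device-api check could be
sketched in plain C as below. This is not code from the patchset: the
helper names device_api_is_vfio_pci() and mdev_is_vfio_pci() are
hypothetical, and stdio is used instead of the EAL sysfs helpers.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: compare the raw attribute text with "vfio-pci". */
static int
device_api_is_vfio_pci(const char *raw)
{
	/* sysfs attributes end with a newline; ignore it when comparing */
	size_t len = strcspn(raw, "\n");

	return len == strlen("vfio-pci") &&
	       strncmp(raw, "vfio-pci", len) == 0;
}

/*
 * Hypothetical helper: return 1 if <mdev_dir>/mdev_type/device_api says
 * "vfio-pci", 0 if it says something else, -1 if it cannot be read.
 */
static int
mdev_is_vfio_pci(const char *mdev_dir)
{
	char path[4096];
	char api[64];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/mdev_type/device_api", mdev_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (f == NULL)
		return -1;

	if (fgets(api, sizeof(api), f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);

	return device_api_is_vfio_pci(api);
}
```

A scan loop would call mdev_is_vfio_pci() on each UUID-named entry under
/sys/bus/mdev/devices and skip the entries that do not return 1.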

The first four patches in this patchset mainly prepare for mdev bus
support. The remaining two patches are the key implementation of mdev bus.

There are several options for implementing mdev bus in DPDK:

1: Embed mdev bus in current pci bus

   This patchset takes this option as an example. Mdev has several
   device types: pci/platform/amba/ccw/ap. Of these, DPDK currently only
   cares about pci devices, so we could embed the mdev bus into the
   current pci bus. The pci bus with mdev support will then scan/plug/
   unplug/.. not only normal pci devices but also mediated pci devices.

2: A new mdev bus that scans mediated pci devices and probes an mdev driver
   to plug pci devices into the pci bus

   If we take this option, a new mdev bus will be implemented to scan
   mediated pci devices, and a new mdev driver for pci devices will be
   implemented in the pci bus to plug mediated pci devices into the pci bus.

   Our RFC v1 takes this option:
   http://patchwork.dpdk.org/project/dpdk/cover/20190403071844.21126-1-tiwei.bie@intel.com/

   Note that for either option 1 or 2, device drivers do not see the
   implementation difference; they only use the structs/functions exposed
   by the pci bus. Mediated pci devices differ from normal pci devices in
   two ways: 1. Mediated pci devices use a UUID as address, while normal
   ones use BDF. 2. Mediated pci devices may have some capabilities that
   normal pci devices do not have. For example, mediated pci devices could
   have regions with the sparse mmap capability, which allows a region to
   have multiple mmap areas. Another example is that mediated pci devices
   may have regions, or parts of regions, that are not mmaped but still
   need to be accessed. These differences will change the current ABI
   (i.e., struct rte_pci_device). Please check the 5th and 6th patches for
   details.

3: A brand new mdev bus that does everything

   This option will implement a new, standalone mdev bus. It does not
   need any changes in the current pci bus, only some shared code (the
   linux vfio part) from the pci bus. Drivers of devices that support mdev
   will register themselves as mdev drivers and no longer rely on the pci
   bus. This option, IMHO, will make the code clean. The only potential
   problem may be code duplication, which could be solved by making the
   linux vfio code of the pci bus common and shared.
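
To make the addressing difference (UUID vs BDF) a bit more concrete, here
is a minimal sketch of a device location that can hold either form. It is
only an illustration under assumptions: struct bdf_addr, struct
dev_location and dev_location_name() are hypothetical stand-ins, not the
DPDK types or functions.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a PCI domain:bus:device.function address. */
struct bdf_addr {
	uint32_t domain;
	uint8_t bus;
	uint8_t devid;
	uint8_t function;
};

/* Hypothetical location: is_mdev selects which union member is valid. */
struct dev_location {
	int is_mdev;
	union {
		struct bdf_addr bdf;    /* normal pci device */
		unsigned char uuid[16]; /* mediated pci device */
	} addr;
};

/* A UUID string (36 chars) is longer than a BDF string. */
#define DEV_NAME_LEN (36 + 1)

/* Format the canonical name; returns the snprintf() result. */
static int
dev_location_name(const struct dev_location *loc, char *buf, size_t len)
{
	const unsigned char *u = loc->addr.uuid;

	if (loc->is_mdev)
		return snprintf(buf, len,
			"%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
			"%02x%02x%02x%02x%02x%02x",
			u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
			u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);

	return snprintf(buf, len, "%04x:%02x:%02x.%x",
			(unsigned int)loc->addr.bdf.domain,
			(unsigned int)loc->addr.bdf.bus,
			(unsigned int)loc->addr.bdf.devid,
			(unsigned int)loc->addr.bdf.function);
}
```

With a name buffer sized for the longer form, log messages can print the
name instead of assuming that every device has a BDF.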

Comments on the above three options are welcome and appreciated!

Thanks!
Chenbo

----------------------------------------------------------------------------
RFC v3:
- Add sparse mmap support
- Minor fixes and improvements

RFC v2:
- Let PCI bus scan mediated PCI devices directly
- Address Keith's comments
- Merge below patch into this series (David)
   http://patches.dpdk.org/patch/55927/
- Add internal representation of PCI device (David)
- Minor fixes and improvements

[1] https://github.com/torvalds/linux/blob/master/Documentation/driver-api/vfio-mediated-device.rst

Chenbo Xia (1):
  bus/pci: add sparse mmap support for mediated PCI devices

Tiwei Bie (5):
  bus/pci: introduce an internal representation of PCI device
  bus/pci: avoid depending on private value in kernel source
  bus/pci: introduce helper for MMIO read and write
  eal: add a helper for reading string from sysfs
  bus/pci: add mdev support

 drivers/bus/pci/bsd/pci.c             |  36 +-
 drivers/bus/pci/linux/pci.c           | 107 ++++-
 drivers/bus/pci/linux/pci_init.h      |  29 +-
 drivers/bus/pci/linux/pci_uio.c       |  22 +
 drivers/bus/pci/linux/pci_vfio.c      | 586 ++++++++++++++++++++++----
 drivers/bus/pci/linux/pci_vfio_mdev.c | 277 ++++++++++++
 drivers/bus/pci/meson.build           |   1 +
 drivers/bus/pci/pci_common.c          |  86 ++--
 drivers/bus/pci/pci_params.c          |  36 +-
 drivers/bus/pci/private.h             |  40 ++
 drivers/bus/pci/rte_bus_pci.h         |  83 +++-
 drivers/bus/pci/version.map           |   4 +
 lib/eal/common/eal_filesystem.h       |  10 +
 lib/eal/freebsd/eal.c                 |  22 +
 lib/eal/linux/eal.c                   |  39 +-
 lib/eal/version.map                   |   3 +
 16 files changed, 1224 insertions(+), 157 deletions(-)
 create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c

-- 
2.17.1



* [dpdk-dev] [RFC v3 1/6] bus/pci: introduce an internal representation of PCI device
  2021-06-01  3:06     ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Chenbo Xia
@ 2021-06-01  3:06       ` Chenbo Xia
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 2/6] bus/pci: avoid depending on private value in kernel source Chenbo Xia
                         ` (5 subsequent siblings)
  6 siblings, 0 replies; 41+ messages in thread
From: Chenbo Xia @ 2021-06-01  3:06 UTC (permalink / raw)
  To: dev, thomas, cunming.liang, jingjing.wu
  Cc: anatoly.burakov, ferruh.yigit, mdr, nhorman, bruce.richardson,
	david.marchand, stephen, konstantin.ananyev, Tiwei Bie

From: Tiwei Bie <tiwei.bie@intel.com>

This patch introduces an internal representation of the PCI device
which will be used to store internal information that doesn't have
to be exposed, e.g. the VFIO region sizes/offsets.

In this patch, the internal structure is simply a wrapper of the
rte_pci_device structure. More fields will be added in the coming
patches.
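
For readers less familiar with the wrapper pattern, a minimal standalone
sketch follows. The types and names (struct pub_device, struct
internal_device, DEVICE_INTERNAL) are hypothetical and simplified. The
embedded member is deliberately not placed first here, to show that the
conversion does not rely on offset 0; that layout is an assumption of the
sketch, not what this patch does.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical public structure, the part drivers get to see. */
struct pub_device {
	char name[37];
};

/* Hypothetical bus-private wrapper around the public structure. */
struct internal_device {
	uint64_t region_size;     /* example of private data (assumption) */
	struct pub_device device; /* public part handed to drivers */
};

/* Recover the wrapper from a pointer to its embedded public part. */
#define DEVICE_INTERNAL(ptr) \
	((struct internal_device *)((char *)(ptr) - \
		offsetof(struct internal_device, device)))

/* Allocate the wrapper, hand out only the public part. */
static struct pub_device *
device_alloc(void)
{
	struct internal_device *idev = calloc(1, sizeof(*idev));

	return idev == NULL ? NULL : &idev->device;
}

/* Free the wrapper, never the embedded member itself. */
static void
device_free(struct pub_device *dev)
{
	if (dev != NULL)
		free(DEVICE_INTERNAL(dev));
}
```

The point of the pattern is that allocation and free always operate on
the wrapper, while everything outside the bus only sees the public part.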

Suggested-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: Chenbo Xia <chenbo.xia@intel.com>
---
 drivers/bus/pci/bsd/pci.c    | 14 +++++++++-----
 drivers/bus/pci/linux/pci.c  | 27 ++++++++++++++++-----------
 drivers/bus/pci/pci_common.c |  2 +-
 drivers/bus/pci/private.h    | 12 ++++++++++++
 4 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/drivers/bus/pci/bsd/pci.c b/drivers/bus/pci/bsd/pci.c
index 4b8a208781..20ce979f60 100644
--- a/drivers/bus/pci/bsd/pci.c
+++ b/drivers/bus/pci/bsd/pci.c
@@ -212,16 +212,20 @@ pci_uio_map_resource_by_index(struct rte_pci_device *dev, int res_idx,
 static int
 pci_scan_one(int dev_pci_fd, struct pci_conf *conf)
 {
+	struct rte_pci_device_internal *pdev;
 	struct rte_pci_device *dev;
 	struct pci_bar_io bar;
 	unsigned i, max;
 
-	dev = malloc(sizeof(*dev));
-	if (dev == NULL) {
+	pdev = malloc(sizeof(*pdev));
+	if (pdev == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memory for internal pci device\n");
 		return -1;
 	}
 
-	memset(dev, 0, sizeof(*dev));
+	memset(pdev, 0, sizeof(*pdev));
+
+	dev = &pdev->device;
 	dev->device.bus = &rte_pci_bus.bus;
 
 	dev->addr.domain = conf->pc_sel.pc_domain;
@@ -307,7 +311,7 @@ pci_scan_one(int dev_pci_fd, struct pci_conf *conf)
 				memmove(dev2->mem_resource,
 					dev->mem_resource,
 					sizeof(dev->mem_resource));
-				free(dev);
+				free(pdev);
 			}
 			return 0;
 		}
@@ -317,7 +321,7 @@ pci_scan_one(int dev_pci_fd, struct pci_conf *conf)
 	return 0;
 
 skipdev:
-	free(dev);
+	free(pdev);
 	return 0;
 }
 
diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index 0dc99e9cb2..6dbba10657 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -218,22 +218,27 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 {
 	char filename[PATH_MAX];
 	unsigned long tmp;
+	struct rte_pci_device_internal *pdev;
 	struct rte_pci_device *dev;
 	char driver[PATH_MAX];
 	int ret;
 
-	dev = malloc(sizeof(*dev));
-	if (dev == NULL)
+	pdev = malloc(sizeof(*pdev));
+	if (pdev == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memory for internal pci device\n");
 		return -1;
+	}
+
+	memset(pdev, 0, sizeof(*pdev));
 
-	memset(dev, 0, sizeof(*dev));
+	dev = &pdev->device;
 	dev->device.bus = &rte_pci_bus.bus;
 	dev->addr = *addr;
 
 	/* get vendor id */
 	snprintf(filename, sizeof(filename), "%s/vendor", dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	dev->id.vendor_id = (uint16_t)tmp;
@@ -241,7 +246,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	/* get device id */
 	snprintf(filename, sizeof(filename), "%s/device", dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	dev->id.device_id = (uint16_t)tmp;
@@ -250,7 +255,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	snprintf(filename, sizeof(filename), "%s/subsystem_vendor",
 		 dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	dev->id.subsystem_vendor_id = (uint16_t)tmp;
@@ -259,7 +264,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	snprintf(filename, sizeof(filename), "%s/subsystem_device",
 		 dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	dev->id.subsystem_device_id = (uint16_t)tmp;
@@ -268,7 +273,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	snprintf(filename, sizeof(filename), "%s/class",
 		 dirname);
 	if (eal_parse_sysfs_value(filename, &tmp) < 0) {
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 	/* the least 24 bits are valid: class, subclass, program interface */
@@ -308,7 +313,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	snprintf(filename, sizeof(filename), "%s/resource", dirname);
 	if (pci_parse_sysfs_resource(filename, dev) < 0) {
 		RTE_LOG(ERR, EAL, "%s(): cannot parse resource\n", __func__);
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 
@@ -317,7 +322,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 	ret = pci_get_kernel_driver_by_path(filename, driver, sizeof(driver));
 	if (ret < 0) {
 		RTE_LOG(ERR, EAL, "Fail to get kernel driver\n");
-		free(dev);
+		free(pdev);
 		return -1;
 	}
 
@@ -386,7 +391,7 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 						pci_name_set(dev2);
 					}
 				}
-				free(dev);
+				free(pdev);
 			}
 			return 0;
 		}
diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
index ee7f966358..1c368c254c 100644
--- a/drivers/bus/pci/pci_common.c
+++ b/drivers/bus/pci/pci_common.c
@@ -571,7 +571,7 @@ pci_unplug(struct rte_device *dev)
 	if (ret == 0) {
 		rte_pci_remove_device(pdev);
 		rte_devargs_remove(dev->devargs);
-		free(pdev);
+		free(RTE_PCI_DEVICE_INTERNAL(pdev));
 	}
 	return ret;
 }
diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
index 4cd9d14ec7..49a29d45cf 100644
--- a/drivers/bus/pci/private.h
+++ b/drivers/bus/pci/private.h
@@ -12,11 +12,23 @@
 #include <rte_os_shim.h>
 #include <rte_pci.h>
 
+/*
+ * Convert struct rte_pci_device to struct rte_pci_device_internal
+ */
+#define RTE_PCI_DEVICE_INTERNAL(ptr) \
+	container_of(ptr, struct rte_pci_device_internal, device)
+#define RTE_PCI_DEVICE_INTERNAL_CONST(ptr) \
+	container_of(ptr, const struct rte_pci_device_internal, device)
+
 extern struct rte_pci_bus rte_pci_bus;
 
 struct rte_pci_driver;
 struct rte_pci_device;
 
+struct rte_pci_device_internal {
+	struct rte_pci_device device;
+};
+
 /**
  * Scan the content of the PCI bus, and the devices in the devices
  * list
-- 
2.17.1



* [dpdk-dev] [RFC v3 2/6] bus/pci: avoid depending on private value in kernel source
  2021-06-01  3:06     ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Chenbo Xia
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 1/6] bus/pci: introduce an internal representation of PCI device Chenbo Xia
@ 2021-06-01  3:06       ` Chenbo Xia
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 3/6] bus/pci: introduce helper for MMIO read and write Chenbo Xia
                         ` (4 subsequent siblings)
  6 siblings, 0 replies; 41+ messages in thread
From: Chenbo Xia @ 2021-06-01  3:06 UTC (permalink / raw)
  To: dev, thomas, cunming.liang, jingjing.wu
  Cc: anatoly.burakov, ferruh.yigit, mdr, nhorman, bruce.richardson,
	david.marchand, stephen, konstantin.ananyev, Tiwei Bie

From: Tiwei Bie <tiwei.bie@intel.com>

The value 40 used in VFIO_GET_REGION_ADDR() is a private value
(VFIO_PCI_OFFSET_SHIFT) defined in the Linux kernel source [1]. It
is not part of the VFIO API, and we should not depend on it.

[1] https://github.com/torvalds/linux/blob/v5.12/drivers/vfio/pci/vfio_pci_private.h
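
The idea can be sketched roughly as follows: instead of computing
"index << 40", keep the size and offset that the kernel reported for each
region and translate every access through that table. This is a
standalone sketch under assumptions: struct region_table, MAX_REGIONS and
the function names are hypothetical, and the table is assumed to have
been filled earlier from the VFIO region info query
(VFIO_DEVICE_GET_REGION_INFO).

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 9 /* hypothetical table size */

struct region {
	uint64_t size;   /* region size as reported by the kernel */
	uint64_t offset; /* offset of the region within the device fd */
};

struct region_table {
	struct region r[MAX_REGIONS];
};

/* Look up a region; an all-zero entry means "not reported". */
static int
region_lookup(const struct region_table *t, int index,
	      uint64_t *size, uint64_t *offset)
{
	if (index < 0 || index >= MAX_REGIONS)
		return -1;

	if (t->r[index].size == 0 && t->r[index].offset == 0)
		return -1;

	*size = t->r[index].size;
	*offset = t->r[index].offset;
	return 0;
}

/* Translate an access of len bytes at offs inside a region to a fd offset. */
static int
region_file_offset(const struct region_table *t, int index,
		   uint64_t offs, size_t len, uint64_t *file_offs)
{
	uint64_t size, base;

	if (region_lookup(t, index, &size, &base) != 0)
		return -1;

	/* reject accesses that run past the region, without overflowing */
	if (offs > size || len > size - offs)
		return -1;

	*file_offs = base + offs;
	return 0;
}
```

The resulting offset is what would be passed to pread64()/pwrite64() on
the device fd, so the code keeps working if the kernel ever lays the
regions out differently.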

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/bus/pci/linux/pci.c      |   4 +-
 drivers/bus/pci/linux/pci_init.h |   4 +-
 drivers/bus/pci/linux/pci_vfio.c | 176 ++++++++++++++++++++++++-------
 drivers/bus/pci/private.h        |   9 ++
 4 files changed, 153 insertions(+), 40 deletions(-)

diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index 6dbba10657..8f1fddbf20 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -647,7 +647,7 @@ int rte_pci_read_config(const struct rte_pci_device *device,
 		return pci_uio_read_config(intr_handle, buf, len, offset);
 #ifdef VFIO_PRESENT
 	case RTE_PCI_KDRV_VFIO:
-		return pci_vfio_read_config(intr_handle, buf, len, offset);
+		return pci_vfio_read_config(device, buf, len, offset);
 #endif
 	default:
 		rte_pci_device_name(&device->addr, devname,
@@ -671,7 +671,7 @@ int rte_pci_write_config(const struct rte_pci_device *device,
 		return pci_uio_write_config(intr_handle, buf, len, offset);
 #ifdef VFIO_PRESENT
 	case RTE_PCI_KDRV_VFIO:
-		return pci_vfio_write_config(intr_handle, buf, len, offset);
+		return pci_vfio_write_config(device, buf, len, offset);
 #endif
 	default:
 		rte_pci_device_name(&device->addr, devname,
diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index dcea726186..9f6659ba6e 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -66,9 +66,9 @@ int pci_uio_ioport_unmap(struct rte_pci_ioport *p);
 #endif
 
 /* access config space */
-int pci_vfio_read_config(const struct rte_intr_handle *intr_handle,
+int pci_vfio_read_config(const struct rte_pci_device *dev,
 			 void *buf, size_t len, off_t offs);
-int pci_vfio_write_config(const struct rte_intr_handle *intr_handle,
+int pci_vfio_write_config(const struct rte_pci_device *dev,
 			  const void *buf, size_t len, off_t offs);
 
 int pci_vfio_ioport_map(struct rte_pci_device *dev, int bar,
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index 07706f7338..012e7f72c1 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -43,35 +43,82 @@ static struct rte_tailq_elem rte_vfio_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_vfio_tailq)
 
+static int
+pci_vfio_get_region(const struct rte_pci_device *dev, int index,
+		    uint64_t *size, uint64_t *offset)
+{
+	const struct rte_pci_device_internal *pdev =
+		RTE_PCI_DEVICE_INTERNAL_CONST(dev);
+
+	if (index >= VFIO_PCI_NUM_REGIONS || index >= RTE_MAX_PCI_REGIONS)
+		return -1;
+
+	if (pdev->region[index].size == 0 && pdev->region[index].offset == 0)
+		return -1;
+
+	*size   = pdev->region[index].size;
+	*offset = pdev->region[index].offset;
+
+	return 0;
+}
+
 int
-pci_vfio_read_config(const struct rte_intr_handle *intr_handle,
+pci_vfio_read_config(const struct rte_pci_device *dev,
 		    void *buf, size_t len, off_t offs)
 {
-	return pread64(intr_handle->vfio_dev_fd, buf, len,
-	       VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) + offs);
+	uint64_t size, offset;
+	int fd;
+
+	fd = dev->intr_handle.vfio_dev_fd;
+
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+				&size, &offset) != 0)
+		return -1;
+
+	if ((uint64_t)len + offs > size)
+		return -1;
+
+	return pread64(fd, buf, len, offset + offs);
 }
 
 int
-pci_vfio_write_config(const struct rte_intr_handle *intr_handle,
+pci_vfio_write_config(const struct rte_pci_device *dev,
 		    const void *buf, size_t len, off_t offs)
 {
-	return pwrite64(intr_handle->vfio_dev_fd, buf, len,
-	       VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) + offs);
+	uint64_t size, offset;
+	int fd;
+
+	fd = dev->intr_handle.vfio_dev_fd;
+
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+				&size, &offset) != 0)
+		return -1;
+
+	if ((uint64_t)len + offs > size)
+		return -1;
+
+	return pwrite64(fd, buf, len, offset + offs);
 }
 
 /* get PCI BAR number where MSI-X interrupts are */
 static int
-pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
+pci_vfio_get_msix_bar(const struct rte_pci_device *dev, int fd,
+		struct pci_msix_table *msix_table)
 {
 	int ret;
 	uint32_t reg;
 	uint16_t flags;
 	uint8_t cap_id, cap_offset;
+	uint64_t size, offset;
+
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+		&size, &offset) != 0) {
+		RTE_LOG(ERR, EAL, "Cannot get offset of CONFIG region.\n");
+		return -1;
+	}
 
 	/* read PCI capability pointer from config space */
-	ret = pread64(fd, &reg, sizeof(reg),
-			VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-			PCI_CAPABILITY_LIST);
+	ret = pread64(fd, &reg, sizeof(reg), offset + PCI_CAPABILITY_LIST);
 	if (ret != sizeof(reg)) {
 		RTE_LOG(ERR, EAL,
 			"Cannot read capability pointer from PCI config space!\n");
@@ -84,9 +131,7 @@ pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 	while (cap_offset) {
 
 		/* read PCI capability ID */
-		ret = pread64(fd, &reg, sizeof(reg),
-				VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-				cap_offset);
+		ret = pread64(fd, &reg, sizeof(reg), offset + cap_offset);
 		if (ret != sizeof(reg)) {
 			RTE_LOG(ERR, EAL,
 				"Cannot read capability ID from PCI config space!\n");
@@ -99,8 +144,7 @@ pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 		/* if we haven't reached MSI-X, check next capability */
 		if (cap_id != PCI_CAP_ID_MSIX) {
 			ret = pread64(fd, &reg, sizeof(reg),
-					VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-					cap_offset);
+				offset + cap_offset);
 			if (ret != sizeof(reg)) {
 				RTE_LOG(ERR, EAL,
 					"Cannot read capability pointer from PCI config space!\n");
@@ -116,8 +160,7 @@ pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 		else {
 			/* table offset resides in the next 4 bytes */
 			ret = pread64(fd, &reg, sizeof(reg),
-					VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-					cap_offset + 4);
+				offset + cap_offset + 4);
 			if (ret != sizeof(reg)) {
 				RTE_LOG(ERR, EAL,
 					"Cannot read table offset from PCI config space!\n");
@@ -125,8 +168,7 @@ pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 			}
 
 			ret = pread64(fd, &flags, sizeof(flags),
-					VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-					cap_offset + 2);
+				offset + cap_offset + 2);
 			if (ret != sizeof(flags)) {
 				RTE_LOG(ERR, EAL,
 					"Cannot read table flags from PCI config space!\n");
@@ -178,14 +220,19 @@ pci_vfio_enable_bus_memory(int dev_fd)
 
 /* set PCI bus mastering */
 static int
-pci_vfio_set_bus_master(int dev_fd, bool op)
+pci_vfio_set_bus_master(const struct rte_pci_device *dev, int dev_fd, bool op)
 {
+	uint64_t size, offset;
 	uint16_t reg;
 	int ret;
 
-	ret = pread64(dev_fd, &reg, sizeof(reg),
-			VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-			PCI_COMMAND);
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+		&size, &offset) != 0) {
+		RTE_LOG(ERR, EAL, "Cannot get offset of CONFIG region.\n");
+		return -1;
+	}
+
+	ret = pread64(dev_fd, &reg, sizeof(reg), offset + PCI_COMMAND);
 	if (ret != sizeof(reg)) {
 		RTE_LOG(ERR, EAL, "Cannot read command from PCI config space!\n");
 		return -1;
@@ -197,10 +244,7 @@ pci_vfio_set_bus_master(int dev_fd, bool op)
 	else
 		reg &= ~(PCI_COMMAND_MASTER);
 
-	ret = pwrite64(dev_fd, &reg, sizeof(reg),
-			VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
-			PCI_COMMAND);
-
+	ret = pwrite64(dev_fd, &reg, sizeof(reg), offset + PCI_COMMAND);
 	if (ret != sizeof(reg)) {
 		RTE_LOG(ERR, EAL, "Cannot write command to PCI config space!\n");
 		return -1;
@@ -429,14 +473,21 @@ pci_vfio_disable_notifier(struct rte_pci_device *dev)
 #endif
 
 static int
-pci_vfio_is_ioport_bar(int vfio_dev_fd, int bar_index)
+pci_vfio_is_ioport_bar(const struct rte_pci_device *dev, int vfio_dev_fd,
+	int bar_index)
 {
+	uint64_t size, offset;
 	uint32_t ioport_bar;
 	int ret;
 
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+		&size, &offset) != 0) {
+		RTE_LOG(ERR, EAL, "Cannot get offset of CONFIG region.\n");
+		return -1;
+	}
+
 	ret = pread64(vfio_dev_fd, &ioport_bar, sizeof(ioport_bar),
-			  VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX)
-			  + PCI_BASE_ADDRESS_0 + bar_index*4);
+			  offset + PCI_BASE_ADDRESS_0 + bar_index * 4);
 	if (ret != sizeof(ioport_bar)) {
 		RTE_LOG(ERR, EAL, "Cannot read command (%x) from config space!\n",
 			PCI_BASE_ADDRESS_0 + bar_index*4);
@@ -460,7 +511,7 @@ pci_rte_vfio_setup_device(struct rte_pci_device *dev, int vfio_dev_fd)
 	}
 
 	/* set bus mastering for the device */
-	if (pci_vfio_set_bus_master(vfio_dev_fd, true)) {
+	if (pci_vfio_set_bus_master(dev, vfio_dev_fd, true)) {
 		RTE_LOG(ERR, EAL, "Cannot set up bus mastering!\n");
 		return -1;
 	}
@@ -690,11 +741,40 @@ pci_vfio_msix_is_mappable(int vfio_dev_fd, int msix_region)
 	return ret;
 }
 
+static int
+pci_vfio_fill_regions(struct rte_pci_device *dev, int vfio_dev_fd,
+		      struct vfio_device_info *device_info)
+{
+	struct rte_pci_device_internal *pdev = RTE_PCI_DEVICE_INTERNAL(dev);
+	struct vfio_region_info *reg = NULL;
+	int nb_maps, i, ret;
+
+	nb_maps = RTE_MIN((int)device_info->num_regions,
+			VFIO_PCI_CONFIG_REGION_INDEX + 1);
+
+	for (i = 0; i < nb_maps; i++) {
+		ret = pci_vfio_get_region_info(vfio_dev_fd, &reg, i);
+		if (ret < 0) {
+			RTE_LOG(DEBUG, EAL, "%s cannot get device region info error %i (%s)\n",
+				dev->name, errno, strerror(errno));
+			return -1;
+		}
+
+		pdev->region[i].size = reg->size;
+		pdev->region[i].offset = reg->offset;
+
+		free(reg);
+	}
+
+	return 0;
+}
 
 static int
 pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 {
+	struct rte_pci_device_internal *pdev = RTE_PCI_DEVICE_INTERNAL(dev);
 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+	struct vfio_region_info *reg = NULL;
 	char pci_addr[PATH_MAX] = {0};
 	int vfio_dev_fd;
 	struct rte_pci_addr *loc = &dev->addr;
@@ -735,11 +815,22 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 	/* map BARs */
 	maps = vfio_res->maps;
 
+	ret = pci_vfio_get_region_info(vfio_dev_fd, &reg,
+		VFIO_PCI_CONFIG_REGION_INDEX);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "%s cannot get device region info error %i (%s)\n",
+			dev->name, errno, strerror(errno));
+		goto err_vfio_res;
+	}
+	pdev->region[VFIO_PCI_CONFIG_REGION_INDEX].size = reg->size;
+	pdev->region[VFIO_PCI_CONFIG_REGION_INDEX].offset = reg->offset;
+	free(reg);
+
 	vfio_res->msix_table.bar_index = -1;
 	/* get MSI-X BAR, if any (we have to know where it is because we can't
 	 * easily mmap it when using VFIO)
 	 */
-	ret = pci_vfio_get_msix_bar(vfio_dev_fd, &vfio_res->msix_table);
+	ret = pci_vfio_get_msix_bar(dev, vfio_dev_fd, &vfio_res->msix_table);
 	if (ret < 0) {
 		RTE_LOG(ERR, EAL, "%s cannot get MSI-X BAR number!\n",
 				pci_addr);
@@ -760,7 +851,6 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 	}
 
 	for (i = 0; i < vfio_res->nb_maps; i++) {
-		struct vfio_region_info *reg = NULL;
 		void *bar_addr;
 
 		ret = pci_vfio_get_region_info(vfio_dev_fd, &reg, i);
@@ -771,8 +861,11 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 			goto err_vfio_res;
 		}
 
+		pdev->region[i].size = reg->size;
+		pdev->region[i].offset = reg->offset;
+
 		/* chk for io port region */
-		ret = pci_vfio_is_ioport_bar(vfio_dev_fd, i);
+		ret = pci_vfio_is_ioport_bar(dev, vfio_dev_fd, i);
 		if (ret < 0) {
 			free(reg);
 			goto err_vfio_res;
@@ -882,6 +975,10 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 	if (ret)
 		return ret;
 
+	ret = pci_vfio_fill_regions(dev, vfio_dev_fd, &device_info);
+	if (ret)
+		return ret;
+
 	/* map BARs */
 	maps = vfio_res->maps;
 
@@ -988,7 +1085,7 @@ pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 		return -1;
 	}
 
-	if (pci_vfio_set_bus_master(dev->intr_handle.vfio_dev_fd, false)) {
+	if (pci_vfio_set_bus_master(dev, dev->intr_handle.vfio_dev_fd, false)) {
 		RTE_LOG(ERR, EAL, "%s cannot unset bus mastering for PCI device!\n",
 				pci_addr);
 		return -1;
@@ -1064,14 +1161,21 @@ int
 pci_vfio_ioport_map(struct rte_pci_device *dev, int bar,
 		    struct rte_pci_ioport *p)
 {
+	uint64_t size, offset;
+
 	if (bar < VFIO_PCI_BAR0_REGION_INDEX ||
 	    bar > VFIO_PCI_BAR5_REGION_INDEX) {
 		RTE_LOG(ERR, EAL, "invalid bar (%d)!\n", bar);
 		return -1;
 	}
 
+	if (pci_vfio_get_region(dev, bar, &size, &offset) != 0) {
+		RTE_LOG(ERR, EAL, "Cannot get offset of region %d.\n", bar);
+		return -1;
+	}
+
 	p->dev = dev;
-	p->base = VFIO_GET_REGION_ADDR(bar);
+	p->base = offset;
 	return 0;
 }
 
diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
index 49a29d45cf..8b5fa70641 100644
--- a/drivers/bus/pci/private.h
+++ b/drivers/bus/pci/private.h
@@ -12,6 +12,8 @@
 #include <rte_os_shim.h>
 #include <rte_pci.h>
 
+#define RTE_MAX_PCI_REGIONS	9
+
 /*
  * Convert struct rte_pci_device to struct rte_pci_device_internal
  */
@@ -25,8 +27,15 @@ extern struct rte_pci_bus rte_pci_bus;
 struct rte_pci_driver;
 struct rte_pci_device;
 
+struct rte_pci_region {
+	uint64_t size;
+	uint64_t offset;
+};
+
 struct rte_pci_device_internal {
 	struct rte_pci_device device;
+	/* PCI regions provided by e.g. VFIO. */
+	struct rte_pci_region region[RTE_MAX_PCI_REGIONS];
 };
 
 /**
-- 
2.17.1



* [dpdk-dev] [RFC v3 3/6] bus/pci: introduce helper for MMIO read and write
  2021-06-01  3:06     ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Chenbo Xia
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 1/6] bus/pci: introduce an internal representation of PCI device Chenbo Xia
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 2/6] bus/pci: avoid depending on private value in kernel source Chenbo Xia
@ 2021-06-01  3:06       ` Chenbo Xia
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs Chenbo Xia
                         ` (3 subsequent siblings)
  6 siblings, 0 replies; 41+ messages in thread
From: Chenbo Xia @ 2021-06-01  3:06 UTC (permalink / raw)
  To: dev, thomas, cunming.liang, jingjing.wu
  Cc: anatoly.burakov, ferruh.yigit, mdr, nhorman, bruce.richardson,
	david.marchand, stephen, konstantin.ananyev, Tiwei Bie

From: Tiwei Bie <tiwei.bie@intel.com>

The MMIO regions of a mediated PCI device may not be mmap-able.
In this case, the application has to access these regions with
explicit read and write calls.
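
As a rough illustration (not part of the patch), the explicit access
path amounts to a bounds-checked pread()/pwrite() at the region's
offset within the device file descriptor. The sketch below is
self-contained and uses a temporary file in place of a VFIO device
fd. The names struct region, region_check(), region_read(),
region_write() and demo_roundtrip() are hypothetical, and the
overflow-safe form of the bounds check is an assumption of this
sketch rather than what the patch does:

```c
#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for one cached region (size and fd offset). */
struct region {
	uint64_t size;   /* length of the region in bytes */
	uint64_t offset; /* start of the region within the device fd */
};

/* Return 0 if [offs, offs + len) lies inside the region, -1 otherwise. */
int
region_check(const struct region *r, size_t len, off_t offs)
{
	if (offs < 0 || (uint64_t)offs > r->size)
		return -1;
	if ((uint64_t)len > r->size - (uint64_t)offs)
		return -1;
	return 0;
}

/* Read len bytes at offs inside the region; -1 if out of bounds. */
ssize_t
region_read(int fd, const struct region *r, void *buf, size_t len, off_t offs)
{
	if (region_check(r, len, offs) != 0)
		return -1;
	return pread(fd, buf, len, (off_t)r->offset + offs);
}

/* Write len bytes at offs inside the region; -1 if out of bounds. */
ssize_t
region_write(int fd, const struct region *r, const void *buf, size_t len,
	     off_t offs)
{
	if (region_check(r, len, offs) != 0)
		return -1;
	return pwrite(fd, buf, len, (off_t)r->offset + offs);
}

/* Exercise the helpers against a temporary file instead of a VFIO fd. */
int
demo_roundtrip(void)
{
	struct region r = { .size = 16, .offset = 64 };
	uint32_t in = 0xdeadbeef, out = 0;
	FILE *f = tmpfile();
	int ok;

	if (f == NULL)
		return -1;
	ok = region_write(fileno(f), &r, &in, sizeof(in), 4) ==
			(ssize_t)sizeof(in) &&
	     region_read(fileno(f), &r, &out, sizeof(out), 4) ==
			(ssize_t)sizeof(out) &&
	     out == in &&
	     /* bytes 14..17 would run past the 16-byte region */
	     region_read(fileno(f), &r, &out, sizeof(out), 14) == -1;
	fclose(f);
	return ok ? 0 : -1;
}
```

The subtraction-based check avoids wrapping when len + offs would
overflow, and a negative offset is rejected up front.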

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/bus/pci/bsd/pci.c        | 22 +++++++++++++++
 drivers/bus/pci/linux/pci.c      | 46 ++++++++++++++++++++++++++++++
 drivers/bus/pci/linux/pci_init.h | 10 +++++++
 drivers/bus/pci/linux/pci_uio.c  | 22 +++++++++++++++
 drivers/bus/pci/linux/pci_vfio.c | 36 ++++++++++++++++++++++++
 drivers/bus/pci/rte_bus_pci.h    | 48 ++++++++++++++++++++++++++++++++
 drivers/bus/pci/version.map      |  4 +++
 7 files changed, 188 insertions(+)

diff --git a/drivers/bus/pci/bsd/pci.c b/drivers/bus/pci/bsd/pci.c
index 20ce979f60..781f65c637 100644
--- a/drivers/bus/pci/bsd/pci.c
+++ b/drivers/bus/pci/bsd/pci.c
@@ -494,6 +494,28 @@ int rte_pci_write_config(const struct rte_pci_device *dev,
 	return -1;
 }
 
+/* Read PCI MMIO space. */
+int rte_pci_mmio_read(const struct rte_pci_device *dev, int bar,
+		      void *buf, size_t len, off_t offset)
+{
+	if (bar >= PCI_MAX_RESOURCE || dev->mem_resource[bar].addr == NULL ||
+			(uint64_t)offset + len > dev->mem_resource[bar].len)
+		return -1;
+	memcpy(buf, (uint8_t *)dev->mem_resource[bar].addr + offset, len);
+	return len;
+}
+
+/* Write PCI MMIO space. */
+int rte_pci_mmio_write(const struct rte_pci_device *dev, int bar,
+		       const void *buf, size_t len, off_t offset)
+{
+	if (bar >= PCI_MAX_RESOURCE || dev->mem_resource[bar].addr == NULL ||
+			(uint64_t)offset + len > dev->mem_resource[bar].len)
+		return -1;
+	memcpy((uint8_t *)dev->mem_resource[bar].addr + offset, buf, len);
+	return len;
+}
+
 int
 rte_pci_ioport_map(struct rte_pci_device *dev, int bar,
 		struct rte_pci_ioport *p)
diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index 8f1fddbf20..4805f277c5 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -682,6 +682,52 @@ int rte_pci_write_config(const struct rte_pci_device *device,
 	}
 }
 
+/* Read PCI MMIO space. */
+int rte_pci_mmio_read(const struct rte_pci_device *device, int bar,
+		void *buf, size_t len, off_t offset)
+{
+	char devname[RTE_DEV_NAME_MAX_LEN] = "";
+
+	switch (device->kdrv) {
+	case RTE_PCI_KDRV_IGB_UIO:
+	case RTE_PCI_KDRV_UIO_GENERIC:
+		return pci_uio_mmio_read(device, bar, buf, len, offset);
+#ifdef VFIO_PRESENT
+	case RTE_PCI_KDRV_VFIO:
+		return pci_vfio_mmio_read(device, bar, buf, len, offset);
+#endif
+	default:
+		rte_pci_device_name(&device->addr, devname,
+				    RTE_DEV_NAME_MAX_LEN);
+		RTE_LOG(ERR, EAL,
+			"Unknown driver type for %s\n", devname);
+		return -1;
+	}
+}
+
+/* Write PCI MMIO space. */
+int rte_pci_mmio_write(const struct rte_pci_device *device, int bar,
+		const void *buf, size_t len, off_t offset)
+{
+	char devname[RTE_DEV_NAME_MAX_LEN] = "";
+
+	switch (device->kdrv) {
+	case RTE_PCI_KDRV_IGB_UIO:
+	case RTE_PCI_KDRV_UIO_GENERIC:
+		return pci_uio_mmio_write(device, bar, buf, len, offset);
+#ifdef VFIO_PRESENT
+	case RTE_PCI_KDRV_VFIO:
+		return pci_vfio_mmio_write(device, bar, buf, len, offset);
+#endif
+	default:
+		rte_pci_device_name(&device->addr, devname,
+				    RTE_DEV_NAME_MAX_LEN);
+		RTE_LOG(ERR, EAL,
+			"Unknown driver type for %s\n", devname);
+		return -1;
+	}
+}
+
 int
 rte_pci_ioport_map(struct rte_pci_device *dev, int bar,
 		struct rte_pci_ioport *p)
diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index 9f6659ba6e..6853fa88a3 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -37,6 +37,11 @@ int pci_uio_read_config(const struct rte_intr_handle *intr_handle,
 int pci_uio_write_config(const struct rte_intr_handle *intr_handle,
 			 const void *buf, size_t len, off_t offs);
 
+int pci_uio_mmio_read(const struct rte_pci_device *dev, int bar,
+		      void *buf, size_t len, off_t offset);
+int pci_uio_mmio_write(const struct rte_pci_device *dev, int bar,
+		       const void *buf, size_t len, off_t offset);
+
 int pci_uio_ioport_map(struct rte_pci_device *dev, int bar,
 		       struct rte_pci_ioport *p);
 void pci_uio_ioport_read(struct rte_pci_ioport *p,
@@ -71,6 +76,11 @@ int pci_vfio_read_config(const struct rte_pci_device *dev,
 int pci_vfio_write_config(const struct rte_pci_device *dev,
 			  const void *buf, size_t len, off_t offs);
 
+int pci_vfio_mmio_read(const struct rte_pci_device *dev, int bar,
+		       void *buf, size_t len, off_t offset);
+int pci_vfio_mmio_write(const struct rte_pci_device *dev, int bar,
+			const void *buf, size_t len, off_t offset);
+
 int pci_vfio_ioport_map(struct rte_pci_device *dev, int bar,
 		        struct rte_pci_ioport *p);
 void pci_vfio_ioport_read(struct rte_pci_ioport *p,
diff --git a/drivers/bus/pci/linux/pci_uio.c b/drivers/bus/pci/linux/pci_uio.c
index 39ebeac2a0..2482635058 100644
--- a/drivers/bus/pci/linux/pci_uio.c
+++ b/drivers/bus/pci/linux/pci_uio.c
@@ -45,6 +45,28 @@ pci_uio_write_config(const struct rte_intr_handle *intr_handle,
 	return pwrite(intr_handle->uio_cfg_fd, buf, len, offset);
 }
 
+int
+pci_uio_mmio_read(const struct rte_pci_device *dev, int bar,
+		  void *buf, size_t len, off_t offset)
+{
+	if (bar >= PCI_MAX_RESOURCE || dev->mem_resource[bar].addr == NULL ||
+			(uint64_t)offset + len > dev->mem_resource[bar].len)
+		return -1;
+	memcpy(buf, (uint8_t *)dev->mem_resource[bar].addr + offset, len);
+	return len;
+}
+
+int
+pci_uio_mmio_write(const struct rte_pci_device *dev, int bar,
+		   const void *buf, size_t len, off_t offset)
+{
+	if (bar >= PCI_MAX_RESOURCE || dev->mem_resource[bar].addr == NULL ||
+			(uint64_t)offset + len > dev->mem_resource[bar].len)
+		return -1;
+	memcpy((uint8_t *)dev->mem_resource[bar].addr + offset, buf, len);
+	return len;
+}
+
 static int
 pci_uio_set_bus_master(int dev_fd)
 {
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index 012e7f72c1..3ecd984215 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -1212,6 +1212,42 @@ pci_vfio_ioport_unmap(struct rte_pci_ioport *p)
 	return -1;
 }
 
+int
+pci_vfio_mmio_read(const struct rte_pci_device *dev, int bar,
+		   void *buf, size_t len, off_t offs)
+{
+	uint64_t size, offset;
+	int fd;
+
+	fd = dev->intr_handle.vfio_dev_fd;
+
+	if (pci_vfio_get_region(dev, bar, &size, &offset) != 0)
+		return -1;
+
+	if ((uint64_t)len + offs > size)
+		return -1;
+
+	return pread64(fd, buf, len, offset + offs);
+}
+
+int
+pci_vfio_mmio_write(const struct rte_pci_device *dev, int bar,
+		    const void *buf, size_t len, off_t offs)
+{
+	uint64_t size, offset;
+	int fd;
+
+	fd = dev->intr_handle.vfio_dev_fd;
+
+	if (pci_vfio_get_region(dev, bar, &size, &offset) != 0)
+		return -1;
+
+	if ((uint64_t)len + offs > size)
+		return -1;
+
+	return pwrite64(fd, buf, len, offset + offs);
+}
+
 int
 pci_vfio_is_enabled(void)
 {
diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
index 64886b4731..dc26811b0a 100644
--- a/drivers/bus/pci/rte_bus_pci.h
+++ b/drivers/bus/pci/rte_bus_pci.h
@@ -310,6 +310,54 @@ int rte_pci_read_config(const struct rte_pci_device *device,
 int rte_pci_write_config(const struct rte_pci_device *device,
 		const void *buf, size_t len, off_t offset);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read from an MMIO PCI resource.
+ *
+ * @param device
+ *   A pointer to a rte_pci_device structure describing the device
+ *   to use
+ * @param bar
+ *   Index of the MMIO PCI resource (BAR) we want to access.
+ * @param buf
+ *   A data buffer where the bytes should be read into
+ * @param len
+ *   The length of the data buffer.
+ * @param offset
+ *   The offset into MMIO space described by @bar
+ * @return
+ *  Number of bytes read on success, negative on error.
+ */
+__rte_experimental
+int rte_pci_mmio_read(const struct rte_pci_device *device, int bar,
+		void *buf, size_t len, off_t offset);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Write to an MMIO PCI resource.
+ *
+ * @param device
+ *   A pointer to a rte_pci_device structure describing the device
+ *   to use
+ * @param bar
+ *   Index of the MMIO PCI resource (BAR) we want to access.
+ * @param buf
+ *   A data buffer containing the bytes to be written
+ * @param len
+ *   The length of the data buffer.
+ * @param offset
+ *   The offset into MMIO space described by @bar
+ * @return
+ *  Number of bytes written on success, negative on error.
+ */
+__rte_experimental
+int rte_pci_mmio_write(const struct rte_pci_device *device, int bar,
+		const void *buf, size_t len, off_t offset);
+
 /**
  * A structure used to access io resources for a pci device.
  * rte_pci_ioport is arch, os, driver specific, and should not be used outside
diff --git a/drivers/bus/pci/version.map b/drivers/bus/pci/version.map
index f33ed0abd1..02e4219aab 100644
--- a/drivers/bus/pci/version.map
+++ b/drivers/bus/pci/version.map
@@ -21,4 +21,8 @@ EXPERIMENTAL {
 	global:
 
 	rte_pci_find_ext_capability;
+
+	# added in 21.08
+	rte_pci_mmio_read;
+	rte_pci_mmio_write;
 };
-- 
2.17.1



* [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs
  2021-06-01  3:06     ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Chenbo Xia
                         ` (2 preceding siblings ...)
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 3/6] bus/pci: introduce helper for MMIO read and write Chenbo Xia
@ 2021-06-01  3:06       ` Chenbo Xia
  2021-06-01  5:37         ` Stephen Hemminger
                           ` (2 more replies)
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 5/6] bus/pci: add mdev support Chenbo Xia
                         ` (2 subsequent siblings)
  6 siblings, 3 replies; 41+ messages in thread
From: Chenbo Xia @ 2021-06-01  3:06 UTC (permalink / raw)
  To: dev, thomas, cunming.liang, jingjing.wu
  Cc: anatoly.burakov, ferruh.yigit, mdr, nhorman, bruce.richardson,
	david.marchand, stephen, konstantin.ananyev, Tiwei Bie

From: Tiwei Bie <tiwei.bie@intel.com>

This patch adds a helper for reading a string from a sysfs file.
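
For readers who want to try the idea outside of DPDK, a minimal
standalone sketch follows. The function name read_first_line() is
hypothetical; unlike the helper in this patch, the sketch also strips
the trailing newline, which is an assumption about what callers
usually want:

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a file into buf (at most sz - 1 characters).
 * Hypothetical helper: returns 0 on success, -1 on any error, and
 * removes the trailing newline if one was read.
 */
int
read_first_line(const char *path, char *buf, size_t sz)
{
	FILE *f;

	if (path == NULL || buf == NULL || sz == 0)
		return -1;

	f = fopen(path, "r");
	if (f == NULL)
		return -1;

	if (fgets(buf, (int)sz, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Cut the string at the first newline, if any. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

On a Linux system with mediated devices, such a helper would
presumably be pointed at an attribute like mdev_type/device_api
under /sys/bus/mdev/devices/<uuid>/.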

Signed-off-by: Cunming Liang <cunming.liang@intel.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 lib/eal/common/eal_filesystem.h | 10 ++++++++++
 lib/eal/freebsd/eal.c           | 22 ++++++++++++++++++++++
 lib/eal/linux/eal.c             | 22 ++++++++++++++++++++++
 lib/eal/version.map             |  3 +++
 4 files changed, 57 insertions(+)

diff --git a/lib/eal/common/eal_filesystem.h b/lib/eal/common/eal_filesystem.h
index 5d21f07c20..be4c51ebb2 100644
--- a/lib/eal/common/eal_filesystem.h
+++ b/lib/eal/common/eal_filesystem.h
@@ -104,4 +104,14 @@ eal_get_hugefile_path(char *buffer, size_t buflen, const char *hugedir, int f_id
  * Used to read information from files on /sys */
 int eal_parse_sysfs_value(const char *filename, unsigned long *val);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Function to read a line from a file on the filesystem.
+ * Used to read information from files on /sys
+ */
+__rte_experimental
+int rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz);
+
 #endif /* EAL_FILESYSTEM_H */
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index f4d1676754..002f07f4da 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -169,6 +169,28 @@ eal_parse_sysfs_value(const char *filename, unsigned long *val)
 	return 0;
 }
 
+int
+rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
+{
+	FILE *f;
+
+	f = fopen(filename, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
+			__func__, filename);
+		return -1;
+	}
+
+	if (fgets(buf, sz, f) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs file %s\n",
+			__func__, filename);
+		fclose(f);
+		return -1;
+	}
+
+	fclose(f);
+	return 0;
+}
 
 /* create memory configuration in shared/mmap memory. Take out
  * a write lock on the memsegs, so we can auto-detect primary/secondary.
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index ba19fc6347..d5917a48ca 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -260,6 +260,28 @@ eal_parse_sysfs_value(const char *filename, unsigned long *val)
 	return 0;
 }
 
+int
+rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
+{
+	FILE *f;
+
+	f = fopen(filename, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
+			__func__, filename);
+		return -1;
+	}
+
+	if (fgets(buf, sz, f) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs file %s\n",
+			__func__, filename);
+		fclose(f);
+		return -1;
+	}
+
+	fclose(f);
+	return 0;
+}
 
 /* create memory configuration in shared/mmap memory. Take out
  * a write lock on the memsegs, so we can auto-detect primary/secondary.
diff --git a/lib/eal/version.map b/lib/eal/version.map
index fe5c3dac98..3d7fce26a4 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -423,6 +423,9 @@ EXPERIMENTAL {
 	rte_version_release; # WINDOWS_NO_EXPORT
 	rte_version_suffix; # WINDOWS_NO_EXPORT
 	rte_version_year; # WINDOWS_NO_EXPORT
+
+	# added in 21.08
+	rte_eal_parse_sysfs_str; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
-- 
2.17.1



* [dpdk-dev] [RFC v3 5/6] bus/pci: add mdev support
  2021-06-01  3:06     ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Chenbo Xia
                         ` (3 preceding siblings ...)
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs Chenbo Xia
@ 2021-06-01  3:06       ` Chenbo Xia
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 6/6] bus/pci: add sparse mmap support for mediated PCI devices Chenbo Xia
  2021-06-11  7:15       ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Thomas Monjalon
  6 siblings, 0 replies; 41+ messages in thread
From: Chenbo Xia @ 2021-06-01  3:06 UTC (permalink / raw)
  To: dev, thomas, cunming.liang, jingjing.wu
  Cc: anatoly.burakov, ferruh.yigit, mdr, nhorman, bruce.richardson,
	david.marchand, stephen, konstantin.ananyev, Tiwei Bie

From: Tiwei Bie <tiwei.bie@intel.com>

This patch adds mdev (Mediated device) support to the PCI bus
driver. With this patch, the PCI bus driver is able to scan
and probe the mediated PCI devices (i.e. the mediated devices
whose device API is "vfio-pci") in the system.

Mediated PCI devices differ from physical PCI devices in several
ways:

- Mediated PCI devices have to be accessed through the VFIO API;
- The regions in mediated PCI devices may not be mmap-able, in
  which case drivers need to call the read/write functions to
  access them;
- Mediated PCI devices use a UUID as the device address.
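
To illustrate the addressing difference, the standalone sketch below
tells a canonical UUID string apart from a full PCI address string.
It is not taken from the patch (which relies on rte_uuid for UUID
handling); looks_like_uuid() and looks_like_pci_bdf() are
hypothetical names, and both checks are purely syntactic:

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if s looks like a canonical 36-character UUID string. */
int
looks_like_uuid(const char *s)
{
	size_t i;

	if (s == NULL || strlen(s) != 36)
		return 0;
	for (i = 0; i < 36; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			/* the four group separators */
			if (s[i] != '-')
				return 0;
		} else if (!isxdigit((unsigned char)s[i])) {
			return 0;
		}
	}
	return 1;
}

/* Return 1 if s looks like a full PCI address, e.g. 0000:3b:00.1. */
int
looks_like_pci_bdf(const char *s)
{
	/* 'h' = hex digit, 'o' = function number 0-7, others literal */
	static const char pattern[] = "hhhh:hh:hh.o";
	size_t i;

	if (s == NULL || strlen(s) != sizeof(pattern) - 1)
		return 0;
	for (i = 0; pattern[i] != '\0'; i++) {
		if (pattern[i] == 'h') {
			if (!isxdigit((unsigned char)s[i]))
				return 0;
		} else if (pattern[i] == 'o') {
			if (s[i] < '0' || s[i] > '7')
				return 0;
		} else if (s[i] != pattern[i]) {
			return 0;
		}
	}
	return 1;
}
```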

Signed-off-by: Cunming Liang <cunming.liang@intel.com>
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
Signed-off-by: Chenbo Xia <chenbo.xia@intel.com>
---
 drivers/bus/pci/linux/pci.c           |  30 ++-
 drivers/bus/pci/linux/pci_init.h      |  15 +-
 drivers/bus/pci/linux/pci_vfio.c      | 147 ++++++++++++--
 drivers/bus/pci/linux/pci_vfio_mdev.c | 277 ++++++++++++++++++++++++++
 drivers/bus/pci/meson.build           |   1 +
 drivers/bus/pci/pci_common.c          |  84 +++++---
 drivers/bus/pci/pci_params.c          |  36 +++-
 drivers/bus/pci/private.h             |  17 ++
 drivers/bus/pci/rte_bus_pci.h         |  17 +-
 lib/eal/linux/eal.c                   |  17 +-
 10 files changed, 571 insertions(+), 70 deletions(-)
 create mode 100644 drivers/bus/pci/linux/pci_vfio_mdev.c

diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index 4805f277c5..29dd9ba26f 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -30,7 +30,7 @@
 
 extern struct rte_pci_bus rte_pci_bus;
 
-static int
+int
 pci_get_kernel_driver_by_path(const char *filename, char *dri_name,
 			      size_t len)
 {
@@ -70,7 +70,7 @@ rte_pci_map_device(struct rte_pci_device *dev)
 	switch (dev->kdrv) {
 	case RTE_PCI_KDRV_VFIO:
 #ifdef VFIO_PRESENT
-		if (pci_vfio_is_enabled())
+		if (pci_vfio_is_enabled(dev))
 			ret = pci_vfio_map_resource(dev);
 #endif
 		break;
@@ -99,7 +99,7 @@ rte_pci_unmap_device(struct rte_pci_device *dev)
 	switch (dev->kdrv) {
 	case RTE_PCI_KDRV_VFIO:
 #ifdef VFIO_PRESENT
-		if (pci_vfio_is_enabled())
+		if (pci_vfio_is_enabled(dev))
 			pci_vfio_unmap_resource(dev);
 #endif
 		break;
@@ -347,6 +347,15 @@ pci_scan_one(const char *dirname, const struct rte_pci_addr *addr)
 		int ret;
 
 		TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) {
+			/*
+			 * Insert physical PCI devices before all mediated
+			 * PCI devices.
+			 */
+			if (dev2->is_mdev) {
+				rte_pci_insert_device(dev2, dev);
+				return 0;
+			}
+
 			ret = rte_pci_addr_cmp(&dev->addr, &dev2->addr);
 			if (ret > 0)
 				continue;
@@ -465,8 +474,14 @@ rte_pci_scan(void)
 		return 0;
 
 #ifdef VFIO_PRESENT
-	if (!pci_vfio_is_enabled())
-		RTE_LOG(DEBUG, EAL, "VFIO PCI modules not loaded\n");
+	if (!rte_vfio_is_enabled("vfio_pci"))
+		RTE_LOG(DEBUG, EAL, "VFIO PCI module not loaded\n");
+
+	if (!rte_vfio_is_enabled("vfio_mdev"))
+		RTE_LOG(DEBUG, EAL, "VFIO MDEV module not loaded\n");
+
+	if (pci_scan_mdev() != 0)
+		return -1;
 #endif
 
 	dir = opendir(rte_pci_get_sysfs_path());
@@ -737,7 +752,7 @@ rte_pci_ioport_map(struct rte_pci_device *dev, int bar,
 	switch (dev->kdrv) {
 #ifdef VFIO_PRESENT
 	case RTE_PCI_KDRV_VFIO:
-		if (pci_vfio_is_enabled())
+		if (pci_vfio_is_enabled(dev))
 			ret = pci_vfio_ioport_map(dev, bar, p);
 		break;
 #endif
@@ -801,8 +816,7 @@ rte_pci_ioport_unmap(struct rte_pci_ioport *p)
 	switch (p->dev->kdrv) {
 #ifdef VFIO_PRESENT
 	case RTE_PCI_KDRV_VFIO:
-		if (pci_vfio_is_enabled())
-			ret = pci_vfio_ioport_unmap(p);
+		ret = -1;
 		break;
 #endif
 	case RTE_PCI_KDRV_IGB_UIO:
diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index 6853fa88a3..0c0191b6d5 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -19,6 +19,9 @@
 extern void *pci_map_addr;
 void *pci_find_max_end_va(void);
 
+int pci_get_kernel_driver_by_path(const char *filename, char *dri_name,
+				  size_t len);
+
 /* parse one line of the "resource" sysfs file (note that the 'line'
  * string is modified)
  */
@@ -93,7 +96,17 @@ int pci_vfio_ioport_unmap(struct rte_pci_ioport *p);
 int pci_vfio_map_resource(struct rte_pci_device *dev);
 int pci_vfio_unmap_resource(struct rte_pci_device *dev);
 
-int pci_vfio_is_enabled(void);
+int pci_vfio_is_enabled(struct rte_pci_device *dev);
+
+int pci_vfio_fill_regions(struct rte_pci_device *dev, int vfio_dev_fd,
+			  struct vfio_device_info *device_info);
+
+int pci_vfio_get_pci_id(struct rte_pci_device *dev, int vfio_dev_fd,
+			struct rte_pci_id *pci_id);
+
+const char *pci_mdev_get_sysfs_path(void);
+
+int pci_scan_mdev(void);
 
 #endif
 
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index 3ecd984215..00ba5db03a 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -21,6 +21,7 @@
 #include <rte_bus.h>
 #include <rte_spinlock.h>
 #include <rte_tailq.h>
+#include <rte_uuid.h>
 
 #include "eal_filesystem.h"
 
@@ -741,7 +742,7 @@ pci_vfio_msix_is_mappable(int vfio_dev_fd, int msix_region)
 	return ret;
 }
 
-static int
+int
 pci_vfio_fill_regions(struct rte_pci_device *dev, int vfio_dev_fd,
 		      struct vfio_device_info *device_info)
 {
@@ -776,6 +777,7 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
 	struct vfio_region_info *reg = NULL;
 	char pci_addr[PATH_MAX] = {0};
+	const char *sysfs_base;
 	int vfio_dev_fd;
 	struct rte_pci_addr *loc = &dev->addr;
 	int i, ret;
@@ -791,11 +793,17 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 #endif
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->is_mdev) {
+		sysfs_base = pci_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+	} else {
+		sysfs_base = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
 
-	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
-					&vfio_dev_fd, &device_info);
+	ret = rte_vfio_setup_device(sysfs_base, pci_addr, &vfio_dev_fd,
+		&device_info);
 	if (ret)
 		return ret;
 
@@ -806,7 +814,13 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 			"Cannot store VFIO mmap details\n");
 		goto err_vfio_dev_fd;
 	}
-	memcpy(&vfio_res->pci_addr, &dev->addr, sizeof(vfio_res->pci_addr));
+
+	vfio_res->is_mdev = dev->is_mdev;
+	if (dev->is_mdev)
+		memcpy(&vfio_res->uuid, &dev->uuid, sizeof(vfio_res->uuid));
+	else
+		memcpy(&vfio_res->pci_addr, &dev->addr,
+			sizeof(vfio_res->pci_addr));
 
 	/* get number of registers (up to BAR5) */
 	vfio_res->nb_maps = RTE_MIN((int) device_info.num_regions,
@@ -938,6 +952,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 {
 	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
 	char pci_addr[PATH_MAX] = {0};
+	const char *sysfs_base;
 	int vfio_dev_fd;
 	struct rte_pci_addr *loc = &dev->addr;
 	int i, ret;
@@ -953,15 +968,29 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 #endif
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->is_mdev) {
+		sysfs_base = pci_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+	} else {
+		sysfs_base = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
 
 	/* if we're in a secondary process, just find our tailq entry */
 	TAILQ_FOREACH(vfio_res, vfio_res_list, next) {
-		if (rte_pci_addr_cmp(&vfio_res->pci_addr,
-						 &dev->addr))
+		if (dev->is_mdev != vfio_res->is_mdev)
 			continue;
-		break;
+
+		if (!dev->is_mdev && !rte_pci_addr_cmp(&vfio_res->pci_addr,
+			&dev->addr))
+			break;
+
+		if (dev->is_mdev && !rte_uuid_compare(vfio_res->uuid,
+			dev->uuid))
+			break;
+
+		continue;
 	}
 	/* if we haven't found our tailq entry, something's wrong */
 	if (vfio_res == NULL) {
@@ -970,8 +999,8 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 		return -1;
 	}
 
-	ret = rte_vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
-					&vfio_dev_fd, &device_info);
+	ret = rte_vfio_setup_device(sysfs_base, pci_addr, &vfio_dev_fd,
+		&device_info);
 	if (ret)
 		return ret;
 
@@ -1030,9 +1059,18 @@ find_and_unmap_vfio_resource(struct mapped_pci_res_list *vfio_res_list,
 
 	/* Get vfio_res */
 	TAILQ_FOREACH(vfio_res, vfio_res_list, next) {
-		if (rte_pci_addr_cmp(&vfio_res->pci_addr, &dev->addr))
+		if (dev->is_mdev != vfio_res->is_mdev)
 			continue;
-		break;
+
+		if (!dev->is_mdev && !rte_pci_addr_cmp(&vfio_res->pci_addr,
+			&dev->addr))
+			break;
+
+		if (dev->is_mdev && !rte_uuid_compare(vfio_res->uuid,
+			dev->uuid))
+			break;
+
+		continue;
 	}
 
 	if  (vfio_res == NULL)
@@ -1061,6 +1099,7 @@ find_and_unmap_vfio_resource(struct mapped_pci_res_list *vfio_res_list,
 static int
 pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 {
+	const char *sysfs_base;
 	char pci_addr[PATH_MAX] = {0};
 	struct rte_pci_addr *loc = &dev->addr;
 	struct mapped_pci_resource *vfio_res = NULL;
@@ -1068,8 +1107,14 @@ pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 	int ret;
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->is_mdev) {
+		sysfs_base = pci_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+	} else {
+		sysfs_base = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
 
 #ifdef HAVE_VFIO_DEV_REQ_INTERFACE
 	ret = pci_vfio_disable_notifier(dev);
@@ -1091,8 +1136,8 @@ pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 		return -1;
 	}
 
-	ret = rte_vfio_release_device(rte_pci_get_sysfs_path(), pci_addr,
-				  dev->intr_handle.vfio_dev_fd);
+	ret = rte_vfio_release_device(sysfs_base, pci_addr,
+		dev->intr_handle.vfio_dev_fd);
 	if (ret < 0) {
 		RTE_LOG(ERR, EAL, "Cannot release VFIO device\n");
 		return ret;
@@ -1117,6 +1162,7 @@ pci_vfio_unmap_resource_primary(struct rte_pci_device *dev)
 static int
 pci_vfio_unmap_resource_secondary(struct rte_pci_device *dev)
 {
+	const char *sysfs_base;
 	char pci_addr[PATH_MAX] = {0};
 	struct rte_pci_addr *loc = &dev->addr;
 	struct mapped_pci_resource *vfio_res = NULL;
@@ -1124,11 +1170,17 @@ pci_vfio_unmap_resource_secondary(struct rte_pci_device *dev)
 	int ret;
 
 	/* store PCI address string */
-	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+	if (dev->is_mdev) {
+		sysfs_base = pci_mdev_get_sysfs_path();
+		rte_uuid_unparse(dev->uuid, pci_addr, sizeof(pci_addr));
+	} else {
+		sysfs_base = rte_pci_get_sysfs_path();
+		snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
 			loc->domain, loc->bus, loc->devid, loc->function);
+	}
 
-	ret = rte_vfio_release_device(rte_pci_get_sysfs_path(), pci_addr,
-				  dev->intr_handle.vfio_dev_fd);
+	ret = rte_vfio_release_device(sysfs_base, pci_addr,
+		dev->intr_handle.vfio_dev_fd);
 	if (ret < 0) {
 		RTE_LOG(ERR, EAL, "Cannot release VFIO device\n");
 		return ret;
@@ -1249,8 +1301,61 @@ pci_vfio_mmio_write(const struct rte_pci_device *dev, int bar,
 }
 
 int
-pci_vfio_is_enabled(void)
+pci_vfio_is_enabled(struct rte_pci_device *dev)
 {
-	return rte_vfio_is_enabled("vfio_pci");
+	return rte_vfio_is_enabled(dev->is_mdev ? "vfio_mdev" : "vfio_pci");
 }
+
+int
+pci_vfio_get_pci_id(struct rte_pci_device *dev, int vfio_dev_fd,
+		    struct rte_pci_id *pci_id)
+{
+	uint64_t size, offset;
+	uint32_t class;
+
+	if (pci_vfio_get_region(dev, VFIO_PCI_CONFIG_REGION_INDEX,
+				&size, &offset) != 0) {
+		RTE_LOG(DEBUG, EAL, "Cannot get offset of CONFIG region.\n");
+		return -1;
+	}
+
+	/* vendor_id */
+	if (pread64(vfio_dev_fd, &pci_id->vendor_id, sizeof(uint16_t),
+		    offset + PCI_VENDOR_ID) != sizeof(uint16_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read VendorID from PCI config space\n");
+		return -1;
+	}
+
+	/* device_id */
+	if (pread64(vfio_dev_fd, &pci_id->device_id, sizeof(uint16_t),
+		    offset + PCI_DEVICE_ID) != sizeof(uint16_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read DeviceID from PCI config space\n");
+		return -1;
+	}
+
+	/* subsystem_vendor_id */
+	if (pread64(vfio_dev_fd, &pci_id->subsystem_vendor_id, sizeof(uint16_t),
+		    offset + PCI_SUBSYSTEM_VENDOR_ID) != sizeof(uint16_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read SubVendorID from PCI config space\n");
+		return -1;
+	}
+
+	/* subsystem_device_id */
+	if (pread64(vfio_dev_fd, &pci_id->subsystem_device_id, sizeof(uint16_t),
+		    offset + PCI_SUBSYSTEM_ID) != sizeof(uint16_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read SubDeviceID from PCI config space\n");
+		return -1;
+	}
+
+	/* class_id */
+	if (pread64(vfio_dev_fd, &class, sizeof(uint32_t),
+		    offset + PCI_CLASS_REVISION) != sizeof(uint32_t)) {
+		RTE_LOG(DEBUG, EAL, "Cannot read ClassID from PCI config space\n");
+		return -1;
+	}
+	pci_id->class_id = class >> 8;
+
+	return 0;
+}
+
 #endif
diff --git a/drivers/bus/pci/linux/pci_vfio_mdev.c b/drivers/bus/pci/linux/pci_vfio_mdev.c
new file mode 100644
index 0000000000..ef25749a0d
--- /dev/null
+++ b/drivers/bus/pci/linux/pci_vfio_mdev.c
@@ -0,0 +1,277 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2021 Intel Corporation
+ */
+
+#include <string.h>
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/pci_regs.h>
+
+#include <rte_log.h>
+#include <rte_pci.h>
+#include <rte_eal_memconfig.h>
+#include <rte_malloc.h>
+#include <rte_devargs.h>
+#include <rte_memcpy.h>
+#include <rte_vfio.h>
+#include <rte_uuid.h>
+
+#include "eal_private.h"
+#include "eal_filesystem.h"
+
+#include "private.h"
+#include "pci_init.h"
+
+#ifdef VFIO_PRESENT
+
+extern struct rte_pci_bus rte_pci_bus;
+
+#define SYSFS_MDEV_DEVICES "/sys/bus/mdev/devices"
+
+const char *pci_mdev_get_sysfs_path(void)
+{
+	const char *path = NULL;
+
+	path = getenv("SYSFS_MDEV_DEVICES");
+	if (path == NULL)
+		return SYSFS_MDEV_DEVICES;
+
+	return path;
+}
+
+static int
+is_pci_device(const char *dirname)
+{
+	char device_api[PATH_MAX];
+	char filename[PATH_MAX];
+	char *ptr;
+
+	/* get device_api */
+	snprintf(filename, sizeof(filename), "%s/mdev_type/device_api",
+		 dirname);
+
+	if (rte_eal_parse_sysfs_str(filename, device_api,
+				    sizeof(device_api)) < 0) {
+		return 0;
+	}
+
+	ptr = strchr(device_api, '\n');
+	if (ptr != NULL)
+		*ptr = '\0';
+
+	return strcmp(device_api, "vfio-pci") == 0;
+}
+
+static int
+pci_scan_one_mdev(const char *dirname, const rte_uuid_t addr)
+{
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+	char name[RTE_UUID_STRLEN];
+	char filename[PATH_MAX];
+	char path[PATH_MAX];
+	char driver[PATH_MAX];
+	char *ptr;
+	struct rte_pci_device_internal *pdev;
+	struct rte_pci_device *dev;
+	bool need_release = false;
+	const char *sysfs_base;
+	unsigned long tmp;
+	int vfio_dev_fd;
+	int ret;
+
+	sysfs_base = pci_mdev_get_sysfs_path();
+
+	pdev = malloc(sizeof(*pdev));
+	if (pdev == NULL)
+		return -1;
+
+	memset(pdev, 0, sizeof(*pdev));
+
+	dev = &pdev->device;
+	dev->device.bus = &rte_pci_bus.bus;
+	rte_uuid_unparse(addr, name, sizeof(name));
+
+	/* parse driver */
+	snprintf(filename, sizeof(filename), "%s/driver", dirname);
+	ret = pci_get_kernel_driver_by_path(filename, driver, sizeof(driver));
+	if (ret < 0) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to get kernel driver\n", name);
+		goto err;
+	}
+
+	if (ret != 0 || strcmp(driver, "vfio_mdev") != 0) {
+		RTE_LOG(DEBUG, EAL, "%s: unsupported mdev driver\n", name);
+		goto err;
+	}
+
+	dev->kdrv = RTE_PCI_KDRV_VFIO;
+
+	dev->is_mdev = 1;
+	rte_uuid_copy(dev->uuid, addr);
+
+	snprintf(filename, sizeof(filename), "%s/%s", sysfs_base, name);
+
+	/* Get the path of the parent device. */
+	if (realpath(filename, path) == NULL) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to get parent device\n", name);
+		goto err;
+	}
+
+	ptr = strrchr(path, '/');
+	if (ptr == NULL) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to parse parent device\n",
+			name);
+		goto err;
+	}
+	*ptr = '\0';
+
+	/* get numa node, default to 0 if not present */
+	snprintf(filename, sizeof(filename), "%s/numa_node", path);
+
+	if (access(filename, F_OK) != -1) {
+		if (eal_parse_sysfs_value(filename, &tmp) == 0)
+			dev->device.numa_node = tmp;
+		else
+			dev->device.numa_node = -1;
+	} else {
+		dev->device.numa_node = 0;
+	}
+
+	pci_name_set(dev);
+
+	if (rte_vfio_setup_device(sysfs_base, name, &vfio_dev_fd,
+				  &device_info) != 0) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to setup device\n", name);
+		goto err;
+	}
+
+	need_release = true;
+
+	if (pci_vfio_fill_regions(dev, vfio_dev_fd, &device_info) != 0) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to get regions\n", name);
+		goto err;
+	}
+
+	if (pci_vfio_get_pci_id(dev, vfio_dev_fd, &dev->id) != 0) {
+		RTE_LOG(DEBUG, EAL, "%s: failed to access the device\n", name);
+		goto err;
+	}
+
+	/* device is valid, add to the list (sorted) */
+	if (TAILQ_EMPTY(&rte_pci_bus.device_list)) {
+		rte_pci_add_device(dev);
+	} else {
+		struct rte_pci_device *dev2;
+		int ret;
+
+		TAILQ_FOREACH(dev2, &rte_pci_bus.device_list, next) {
+			/*
+			 * Insert mediated PCI devices after all physical
+			 * PCI devices.
+			 */
+			if (!dev2->is_mdev)
+				continue;
+			ret = rte_uuid_compare(dev->uuid, dev2->uuid);
+			if (ret > 0)
+				continue;
+			if (ret < 0)
+				rte_pci_insert_device(dev2, dev);
+			else {/* already registered */
+				if (!rte_dev_is_probed(&dev2->device)) {
+					dev2->kdrv = dev->kdrv;
+					dev2->max_vfs = dev->max_vfs;
+					pci_name_set(dev2);
+					memmove(dev2->mem_resource,
+						dev->mem_resource,
+						sizeof(dev->mem_resource));
+				} else {
+					/**
+					 * If device is plugged and driver is
+					 * probed already, (This happens when
+					 * we call rte_dev_probe which will
+					 * scan all devices on the bus) we don't
+					 * need to do anything here unless...
+					 **/
+					if (dev2->kdrv != dev->kdrv ||
+						dev2->max_vfs != dev->max_vfs ||
+						memcmp(&dev2->id, &dev->id,
+							sizeof(dev2->id)))
+						/*
+						 * This should not happen.
+						 * But it is still possible if
+						 * we unbind a device from
+						 * vfio or uio before hotplug
+						 * remove and rebind it with
+						 * a different configuration.
+						 * So we just print out the
+						 * error as an alarm.
+						 */
+						RTE_LOG(ERR, EAL, "Unexpected device scan at %s!\n",
+							filename);
+					else if (dev2->device.devargs !=
+						 dev->device.devargs) {
+						rte_devargs_remove(dev2->device.devargs);
+						pci_name_set(dev2);
+					}
+				}
+				free(pdev);
+			}
+			return 0;
+		}
+
+		rte_pci_add_device(dev);
+	}
+
+	return 0;
+
+err:
+	if (need_release)
+		rte_vfio_release_device(sysfs_base, name, vfio_dev_fd);
+	free(pdev);
+	return 1;
+}
+
+int
+pci_scan_mdev(void)
+{
+	struct dirent *e;
+	DIR *dir;
+	char dirname[PATH_MAX];
+	rte_uuid_t addr;
+
+	dir = opendir(pci_mdev_get_sysfs_path());
+	if (dir == NULL) {
+		RTE_LOG(DEBUG, EAL, "%s(): opendir failed: %s\n",
+			__func__, strerror(errno));
+		return 0;
+	}
+
+	while ((e = readdir(dir)) != NULL) {
+		if (e->d_name[0] == '.')
+			continue;
+
+		if (rte_uuid_parse(e->d_name, addr) != 0)
+			continue;
+
+		if (rte_mdev_ignore_device(addr))
+			continue;
+
+		snprintf(dirname, sizeof(dirname), "%s/%s",
+			 pci_mdev_get_sysfs_path(), e->d_name);
+
+		if (!is_pci_device(dirname))
+			continue;
+
+		if (pci_scan_one_mdev(dirname, addr) < 0)
+			goto error;
+	}
+	closedir(dir);
+	return 0;
+
+error:
+	closedir(dir);
+	return -1;
+}
+
+#endif /* VFIO_PRESENT */
diff --git a/drivers/bus/pci/meson.build b/drivers/bus/pci/meson.build
index 81c7e94c00..fb7a9a1fa8 100644
--- a/drivers/bus/pci/meson.build
+++ b/drivers/bus/pci/meson.build
@@ -11,6 +11,7 @@ if is_linux
             'linux/pci.c',
             'linux/pci_uio.c',
             'linux/pci_vfio.c',
+            'linux/pci_vfio_mdev.c',
     )
     includes += include_directories('linux')
 endif
diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
index 1c368c254c..1984dbdba0 100644
--- a/drivers/bus/pci/pci_common.c
+++ b/drivers/bus/pci/pci_common.c
@@ -24,6 +24,7 @@
 #include <rte_common.h>
 #include <rte_devargs.h>
 #include <rte_vfio.h>
+#include <rte_uuid.h>
 
 #include "private.h"
 
@@ -57,15 +58,34 @@ pci_devargs_lookup(const struct rte_pci_addr *pci_addr)
 	return NULL;
 }
 
+static struct rte_devargs *
+mdev_devargs_lookup(const rte_uuid_t mdev_addr)
+{
+	struct rte_devargs *devargs;
+	rte_uuid_t id;
+
+	RTE_EAL_DEVARGS_FOREACH("pci", devargs) {
+		devargs->bus->parse(devargs->name, &id);
+		if (!rte_uuid_compare(mdev_addr, id))
+			return devargs;
+	}
+	return NULL;
+}
+
 void
 pci_name_set(struct rte_pci_device *dev)
 {
 	struct rte_devargs *devargs;
 
 	/* Each device has its internal, canonical name set. */
-	rte_pci_device_name(&dev->addr,
-			dev->name, sizeof(dev->name));
-	devargs = pci_devargs_lookup(&dev->addr);
+	if (dev->is_mdev) {
+		rte_uuid_unparse(dev->uuid, dev->name, sizeof(dev->name));
+		devargs = mdev_devargs_lookup(dev->uuid);
+	} else {
+		rte_pci_device_name(&dev->addr, dev->name, sizeof(dev->name));
+		devargs = pci_devargs_lookup(&dev->addr);
+	}
+
 	dev->device.devargs = devargs;
 
 	/* When using a blocklist, only blocked devices will have
@@ -166,21 +186,17 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
 {
 	int ret;
 	bool already_probed;
-	struct rte_pci_addr *loc;
 
 	if ((dr == NULL) || (dev == NULL))
 		return -EINVAL;
 
-	loc = &dev->addr;
-
 	/* The device is not blocked; Check if driver supports it */
 	if (!rte_pci_match(dr, dev))
 		/* Match of device and driver failed */
 		return 1;
 
-	RTE_LOG(DEBUG, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
-			loc->domain, loc->bus, loc->devid, loc->function,
-			dev->device.numa_node);
+	RTE_LOG(DEBUG, EAL, "PCI device %s on NUMA socket %i\n",
+		dev->name, dev->device.numa_node);
 
 	/* no initialization when marked as blocked, return without error */
 	if (dev->device.devargs != NULL &&
@@ -235,10 +251,9 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
 		}
 	}
 
-	RTE_LOG(INFO, EAL, "Probe PCI driver: %s (%x:%x) device: "PCI_PRI_FMT" (socket %i)\n",
+	RTE_LOG(INFO, EAL, "Probe PCI driver: %s (%x:%x) device: %s (socket %i)\n",
 			dr->driver.name, dev->id.vendor_id, dev->id.device_id,
-			loc->domain, loc->bus, loc->devid, loc->function,
-			dev->device.numa_node);
+			dev->name, dev->device.numa_node);
 	/* call the driver probe() function */
 	ret = dr->probe(dr, dev);
 	if (already_probed)
@@ -266,7 +281,6 @@ rte_pci_probe_one_driver(struct rte_pci_driver *dr,
 static int
 rte_pci_detach_dev(struct rte_pci_device *dev)
 {
-	struct rte_pci_addr *loc;
 	struct rte_pci_driver *dr;
 	int ret = 0;
 
@@ -274,11 +288,9 @@ rte_pci_detach_dev(struct rte_pci_device *dev)
 		return -EINVAL;
 
 	dr = dev->driver;
-	loc = &dev->addr;
 
-	RTE_LOG(DEBUG, EAL, "PCI device "PCI_PRI_FMT" on NUMA socket %i\n",
-			loc->domain, loc->bus, loc->devid,
-			loc->function, dev->device.numa_node);
+	RTE_LOG(DEBUG, EAL, "PCI device %s on NUMA socket %i\n",
+		dev->name, dev->device.numa_node);
 
 	RTE_LOG(DEBUG, EAL, "  remove driver: %x:%x %s\n", dev->id.vendor_id,
 			dev->id.device_id, dr->driver.name);
@@ -345,10 +357,9 @@ pci_probe(void)
 		ret = pci_probe_all_drivers(dev);
 		if (ret < 0) {
 			if (ret != -EEXIST) {
-				RTE_LOG(ERR, EAL, "Requested device "
-					PCI_PRI_FMT " cannot be used\n",
-					dev->addr.domain, dev->addr.bus,
-					dev->addr.devid, dev->addr.function);
+				RTE_LOG(ERR, EAL,
+					"Requested device %s cannot be used\n",
+					dev->name);
 				rte_errno = errno;
 				failed++;
 			}
@@ -395,11 +406,20 @@ pci_parse(const char *name, void *addr)
 {
 	struct rte_pci_addr *out = addr;
 	struct rte_pci_addr pci_addr;
+	rte_uuid_t mdev_addr;
 	bool parse;
 
 	parse = (rte_pci_addr_parse(name, &pci_addr) == 0);
 	if (parse && addr != NULL)
 		*out = pci_addr;
+
+	if (parse)
+		return 0;
+
+	parse = (rte_uuid_parse(name, mdev_addr) == 0);
+	if (parse && addr != NULL)
+		memcpy(addr, &mdev_addr, sizeof(mdev_addr));
+
 	return parse == false;
 }
 
@@ -622,11 +642,9 @@ pci_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova, size_t len)
 	return -1;
 }
 
-bool
-rte_pci_ignore_device(const struct rte_pci_addr *pci_addr)
+static bool
+devargs_ignore_device(struct rte_devargs *devargs)
 {
-	struct rte_devargs *devargs = pci_devargs_lookup(pci_addr);
-
 	switch (rte_pci_bus.bus.conf.scan_mode) {
 	case RTE_BUS_SCAN_ALLOWLIST:
 		if (devargs && devargs->policy == RTE_DEV_ALLOWED)
@@ -641,6 +659,22 @@ rte_pci_ignore_device(const struct rte_pci_addr *pci_addr)
 	return true;
 }
 
+bool
+rte_pci_ignore_device(const struct rte_pci_addr *pci_addr)
+{
+	struct rte_devargs *devargs = pci_devargs_lookup(pci_addr);
+
+	return devargs_ignore_device(devargs);
+}
+
+bool
+rte_mdev_ignore_device(const rte_uuid_t mdev_addr)
+{
+	struct rte_devargs *devargs = mdev_devargs_lookup(mdev_addr);
+
+	return devargs_ignore_device(devargs);
+}
+
 enum rte_iova_mode
 rte_pci_get_iommu_class(void)
 {
diff --git a/drivers/bus/pci/pci_params.c b/drivers/bus/pci/pci_params.c
index 3192e9c967..231e57213e 100644
--- a/drivers/bus/pci/pci_params.c
+++ b/drivers/bus/pci/pci_params.c
@@ -2,12 +2,15 @@
  * Copyright 2018 Gaëtan Rivet
  */
 
+#include <string.h>
+
 #include <rte_bus.h>
 #include <rte_bus_pci.h>
 #include <rte_dev.h>
 #include <rte_errno.h>
 #include <rte_kvargs.h>
 #include <rte_pci.h>
+#include <rte_uuid.h>
 
 #include "private.h"
 
@@ -35,6 +38,19 @@ pci_addr_kv_cmp(const char *key __rte_unused,
 	return -abs(rte_pci_addr_cmp(addr1, addr2));
 }
 
+static int
+mdev_addr_kv_cmp(const char *key __rte_unused,
+		const char *value,
+		void *_addr2)
+{
+	rte_uuid_t addr1;
+	unsigned char *addr2 = _addr2;
+
+	if (rte_uuid_parse(value, addr1))
+		return -1;
+	return -abs(rte_uuid_compare(addr1, addr2));
+}
+
 static int
 pci_dev_match(const struct rte_device *dev,
 	      const void *_kvlist)
@@ -47,11 +63,21 @@ pci_dev_match(const struct rte_device *dev,
 		return 0;
 	pdev = RTE_DEV_TO_PCI_CONST(dev);
 	/* if any field does not match. */
-	if (rte_kvargs_process(kvlist, pci_params_keys[RTE_PCI_PARAM_ADDR],
-			       &pci_addr_kv_cmp,
-			       (void *)(intptr_t)&pdev->addr))
-		return 1;
-	return 0;
+	if (!pdev->is_mdev) {
+		if (rte_kvargs_process(kvlist,
+			pci_params_keys[RTE_PCI_PARAM_ADDR], &pci_addr_kv_cmp,
+			(void *)(intptr_t)&pdev->addr))
+			return 1;
+		else
+			return 0;
+	} else {
+		if (rte_kvargs_process(kvlist,
+			pci_params_keys[RTE_PCI_PARAM_ADDR], &mdev_addr_kv_cmp,
+			(void *)(intptr_t)&pdev->uuid))
+			return 1;
+		else
+			return 0;
+	}
 }
 
 void *
diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
index 8b5fa70641..3515c086aa 100644
--- a/drivers/bus/pci/private.h
+++ b/drivers/bus/pci/private.h
@@ -64,6 +64,18 @@ pci_name_set(struct rte_pci_device *dev);
  */
 bool rte_pci_ignore_device(const struct rte_pci_addr *pci_addr);
 
+/**
+ * Validate whether a mediated PCI device with given uuid should be
+ * ignored or not.
+ *
+ * @param mdev_addr
+ *	MDEV address of device to be validated
+ * @return
+ *	true: if device is to be ignored,
+ *	false: if device is to be scanned,
+ */
+bool rte_mdev_ignore_device(const rte_uuid_t mdev_addr);
+
 /**
  * Add a PCI device to the PCI Bus (append to PCI Device list). This function
  * also updates the bus references of the PCI Device (and the generic device
@@ -114,6 +126,11 @@ struct pci_msix_table {
 struct mapped_pci_resource {
 	TAILQ_ENTRY(mapped_pci_resource) next;
 
+	union {
+		struct rte_pci_addr addr;
+		rte_uuid_t uuid;
+	};
+	uint8_t is_mdev;
 	struct rte_pci_addr pci_addr;
 	char path[PATH_MAX];
 	int nb_maps;
diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
index dc26811b0a..fb7d934bd0 100644
--- a/drivers/bus/pci/rte_bus_pci.h
+++ b/drivers/bus/pci/rte_bus_pci.h
@@ -51,6 +51,15 @@ TAILQ_HEAD(rte_pci_driver_list, rte_pci_driver);
 
 struct rte_devargs;
 
+/*
+ * NOTE: we can't include rte_uuid.h directly due to the conflicts
+ *      introduced by stdbool.h
+ */
+typedef unsigned char rte_uuid_t[16];
+
+/* It's RTE_UUID_STRLEN, which is bigger than PCI_PRI_STR_SIZE. */
+#define RTE_PCI_NAME_LEN		(36 + 1)
+
 enum rte_pci_kernel_driver {
 	RTE_PCI_KDRV_UNKNOWN = 0,  /* may be misc UIO or bifurcated driver */
 	RTE_PCI_KDRV_IGB_UIO,      /* igb_uio for Linux */
@@ -67,7 +76,11 @@ enum rte_pci_kernel_driver {
 struct rte_pci_device {
 	TAILQ_ENTRY(rte_pci_device) next;   /**< Next probed PCI device. */
 	struct rte_device device;           /**< Inherit core device */
-	struct rte_pci_addr addr;           /**< PCI location. */
+	union {
+		struct rte_pci_addr addr;   /**< PCI location. */
+		rte_uuid_t uuid;            /**< Mdev location. */
+	};
+	uint8_t is_mdev;                    /**< True for mediated PCI device */
 	struct rte_pci_id id;               /**< PCI ID. */
 	struct rte_mem_resource mem_resource[PCI_MAX_RESOURCE];
 					    /**< PCI Memory Resource */
@@ -75,7 +88,7 @@ struct rte_pci_device {
 	struct rte_pci_driver *driver;      /**< PCI driver used in probing */
 	uint16_t max_vfs;                   /**< sriov enable if not zero */
 	enum rte_pci_kernel_driver kdrv;    /**< Kernel driver passthrough */
-	char name[PCI_PRI_STR_SIZE+1];      /**< PCI location (ASCII) */
+	char name[RTE_PCI_NAME_LEN];        /**< PCI/Mdev location (ASCII) */
 	struct rte_intr_handle vfio_req_intr_handle;
 				/**< Handler of VFIO request interrupt */
 };
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index d5917a48ca..323f13107e 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1089,6 +1089,15 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+#ifdef VFIO_PRESENT
+	if (rte_eal_vfio_setup() < 0) {
+		rte_eal_init_alert("Cannot init VFIO");
+		rte_errno = EAGAIN;
+		__atomic_store_n(&run_once, 0, __ATOMIC_RELAXED);
+		return -1;
+	}
+#endif
+
 	if (rte_bus_scan()) {
 		rte_eal_init_alert("Cannot scan the buses for devices");
 		rte_errno = ENODEV;
@@ -1194,14 +1203,6 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
-#ifdef VFIO_PRESENT
-	if (rte_eal_vfio_setup() < 0) {
-		rte_eal_init_alert("Cannot init VFIO");
-		rte_errno = EAGAIN;
-		__atomic_store_n(&run_once, 0, __ATOMIC_RELAXED);
-		return -1;
-	}
-#endif
 	/* in secondary processes, memory init may allocate additional fbarrays
 	 * not present in primary processes, so to avoid any potential issues,
 	 * initialize memzones first.
-- 
2.17.1



* [dpdk-dev] [RFC v3 6/6] bus/pci: add sparse mmap support for mediated PCI devices
  2021-06-01  3:06     ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Chenbo Xia
                         ` (4 preceding siblings ...)
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 5/6] bus/pci: add mdev support Chenbo Xia
@ 2021-06-01  3:06       ` Chenbo Xia
  2021-06-11  7:15       ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Thomas Monjalon
  6 siblings, 0 replies; 41+ messages in thread
From: Chenbo Xia @ 2021-06-01  3:06 UTC (permalink / raw)
  To: dev, thomas, cunming.liang, jingjing.wu
  Cc: anatoly.burakov, ferruh.yigit, mdr, nhorman, bruce.richardson,
	david.marchand, stephen, konstantin.ananyev

This patch adds sparse mmap support to the PCI bus. Sparse mmap is a
VFIO capability that allows multiple mmap areas within one VFIO
region. Mediated PCI devices can use this capability to let the mdev
parent driver keep control over accesses to the non-mmappable parts
of a region.

Signed-off-by: Chenbo Xia <chenbo.xia@intel.com>
---
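A sparse-mapped BAR is described as a list of (addr, offset, size)
areas, so a consumer has to translate a BAR offset into a pointer and
notice when the offset falls into an unmapped hole. Below is a minimal
sketch of that lookup. It is not part of this patch: the struct and
function names are hypothetical and only mirror the layout of
rte_mem_map_area and rte_sparse_mem_map introduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the patch's struct rte_mem_map_area. */
struct map_area {
	void *addr;      /* process address of the mmap'ed window */
	uint64_t offset; /* offset of the window inside the region */
	uint64_t size;   /* length of the window in bytes */
};

/* Hypothetical mirror of the patch's struct rte_sparse_mem_map. */
struct sparse_map {
	uint64_t size;   /* size of the whole region (BAR) */
	uint32_t nr_maps;
	struct map_area *areas;
};

/*
 * Translate the range [off, off + len) of a region into a pointer.
 * Returns NULL when the range is empty, lies outside the region, or
 * is not fully covered by a single mmap'ed area (i.e. it touches a
 * hole that is not directly mappable).
 */
static void *
sparse_map_ptr(const struct sparse_map *m, uint64_t off, uint64_t len)
{
	uint32_t i;

	if (len == 0 || off >= m->size || len > m->size - off)
		return NULL;

	for (i = 0; i < m->nr_maps; i++) {
		const struct map_area *a = &m->areas[i];
		uint64_t rel;

		if (off < a->offset)
			continue;
		rel = off - a->offset;
		/* Comparisons are written to avoid 64-bit overflow. */
		if (rel < a->size && len <= a->size - rel)
			return (uint8_t *)a->addr + rel;
	}
	return NULL;
}
```

A NULL result would mean the access cannot go through the mapping; one
plausible fallback (an assumption, not something this patch does) is
to read or write that range through the VFIO device fd instead.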
 drivers/bus/pci/linux/pci_vfio.c | 229 +++++++++++++++++++++++++++----
 drivers/bus/pci/private.h        |   2 +
 drivers/bus/pci/rte_bus_pci.h    |  18 ++-
 3 files changed, 218 insertions(+), 31 deletions(-)

diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index 00ba5db03a..e68eccb63f 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -654,6 +654,82 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res,
 	return 0;
 }
 
+static int
+pci_vfio_sparse_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res,
+		struct vfio_region_sparse_mmap_area *vfio_areas,
+		uint32_t nr_areas, int bar_index, int additional_flags,
+		int numa_node)
+{
+	struct pci_map *map = &vfio_res->maps[bar_index];
+	struct rte_mem_map_area *area;
+	struct vfio_region_sparse_mmap_area *sparse;
+	void *bar_addr;
+	uint32_t i, j;
+
+	map->nr_areas = nr_areas;
+
+	if (map->size == 0) {
+		RTE_LOG(DEBUG, EAL, "Bar size is 0, skip BAR%d\n", bar_index);
+		return 0;
+	}
+
+	if (!map->nr_areas) {
+		RTE_LOG(DEBUG, EAL, "Skip bar %d with no sparse mmap areas\n",
+			bar_index);
+		map->areas = NULL;
+		return 0;
+	}
+
+	if (map->areas == NULL) {
+		map->areas = rte_zmalloc_socket(NULL,
+				sizeof(*map->areas) * nr_areas,
+				RTE_CACHE_LINE_SIZE, numa_node);
+		if (map->areas == NULL) {
+			RTE_LOG(ERR, EAL,
+				"Cannot alloc memory for sparse map areas\n");
+			return -1;
+		}
+	}
+
+	for (i = 0; i < map->nr_areas; i++) {
+		area = &map->areas[i];
+		sparse = &vfio_areas[i];
+
+		bar_addr = mmap(map->addr, sparse->size, 0, MAP_PRIVATE |
+				MAP_ANONYMOUS | additional_flags, -1, 0);
+		if (bar_addr != MAP_FAILED) {
+			area->addr = pci_map_resource(bar_addr, vfio_dev_fd,
+				map->offset + sparse->offset, sparse->size,
+				RTE_MAP_FORCE_ADDRESS);
+			if (area->addr == NULL) {
+				munmap(bar_addr, sparse->size);
+				RTE_LOG(ERR, EAL, "Failed to map pci BAR%d\n",
+					bar_index);
+				goto err_map;
+			}
+
+			area->offset = sparse->offset;
+			area->size = sparse->size;
+		} else {
+			RTE_LOG(ERR, EAL, "Failed to create inaccessible mapping for BAR%d\n",
+				bar_index);
+			goto err_map;
+		}
+	}
+
+	return 0;
+
+err_map:
+	for (j = 0; j < i; j++) {
+		pci_unmap_resource(map->areas[j].addr, map->areas[j].size);
+		map->areas[j].offset = 0;
+		map->areas[j].size = 0;
+	}
+	rte_free(map->areas);
+	map->nr_areas = 0;
+	return -1;
+}
+
 /*
  * region info may contain capability headers, so we need to keep reallocating
  * the memory until we match allocated memory size with argsz.
@@ -770,6 +846,31 @@ pci_vfio_fill_regions(struct rte_pci_device *dev, int vfio_dev_fd,
 	return 0;
 }
 
+static void
+clean_up_pci_resource(struct mapped_pci_resource *vfio_res)
+{
+	struct pci_map *map;
+	uint32_t i, j;
+
+	for (i = 0; i < PCI_MAX_RESOURCE; i++) {
+		map = &vfio_res->maps[i];
+		if (map->nr_areas > 1) {
+			for (j = 0; j < map->nr_areas; j++)
+				pci_unmap_resource(map->areas[j].addr,
+					map->areas[j].size);
+		} else {
+			/*
+			 * We do not need to be aware of MSI-X BAR mappings.
+			 * Using current maps array is enough.
+			 */
+			if (map->addr)
+				pci_unmap_resource(map->addr, map->size);
+		}
+	}
+
+	rte_free(map->areas);
+}
+
 static int
 pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 {
@@ -866,6 +967,8 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 
 	for (i = 0; i < vfio_res->nb_maps; i++) {
 		void *bar_addr;
+		struct vfio_info_cap_header *hdr;
+		struct vfio_region_info_cap_sparse_mmap *sparse;
 
 		ret = pci_vfio_get_region_info(vfio_dev_fd, &reg, i);
 		if (ret < 0) {
@@ -911,15 +1014,59 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 		maps[i].size = reg->size;
 		maps[i].path = NULL; /* vfio doesn't have per-resource paths */
 
-		ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i, 0);
-		if (ret < 0) {
-			RTE_LOG(ERR, EAL, "%s mapping BAR%i failed: %s\n",
-					pci_addr, i, strerror(errno));
-			free(reg);
-			goto err_vfio_res;
-		}
+		hdr = pci_vfio_info_cap(reg, VFIO_REGION_INFO_CAP_SPARSE_MMAP);
+
+		if (dev->is_mdev && hdr != NULL) {
+			sparse = container_of(hdr,
+				struct vfio_region_info_cap_sparse_mmap,
+				header);
+
+			ret = pci_vfio_sparse_mmap_bar(vfio_dev_fd, vfio_res,
+				sparse->areas, sparse->nr_areas, i, 0,
+				dev->device.numa_node);
+			if (ret < 0) {
+				RTE_LOG(ERR, EAL, "%s sparse mapping BAR%i failed: %s\n",
+						pci_addr, i, strerror(errno));
+				free(reg);
+				goto err_vfio_res;
+			}
 
-		dev->mem_resource[i].addr = maps[i].addr;
+			dev->sparse_mem[i].size = reg->size;
+			dev->sparse_mem[i].nr_maps = vfio_res->maps[i].nr_areas;
+			dev->sparse_mem[i].areas = vfio_res->maps[i].areas;
+		} else {
+			ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i, 0);
+			if (ret < 0) {
+				RTE_LOG(ERR, EAL, "%s mapping BAR%i failed: %s\n",
+						pci_addr, i, strerror(errno));
+				free(reg);
+				goto err_vfio_res;
+			}
+
+			if (dev->is_mdev) {
+				struct pci_map *mdev_map = &maps[i];
+				mdev_map->nr_areas = 1;
+				mdev_map->areas = rte_zmalloc_socket(NULL,
+					sizeof(*mdev_map->areas),
+					RTE_CACHE_LINE_SIZE,
+					dev->device.numa_node);
+				if (maps[i].areas == NULL) {
+					RTE_LOG(ERR, EAL, "Cannot allocate memory for sparse map areas\n");
+					free(reg);
+					goto err_vfio_res;
+				}
+				mdev_map->areas[0].addr = maps[i].addr;
+				mdev_map->areas[0].offset = 0;
+				mdev_map->areas[0].size = reg->size;
+				dev->sparse_mem[i].size = reg->size;
+				dev->sparse_mem[i].nr_maps = 1;
+				dev->sparse_mem[i].areas = mdev_map->areas;
+			} else {
+				maps[i].nr_areas = 0;
+				maps[i].areas = NULL;
+				dev->mem_resource[i].addr = maps[i].addr;
+			}
+		}
 
 		free(reg);
 	}
@@ -940,6 +1087,7 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev)
 
 	return 0;
 err_vfio_res:
+	clean_up_pci_resource(vfio_res);
 	rte_free(vfio_res);
 err_vfio_dev_fd:
 	rte_vfio_release_device(rte_pci_get_sysfs_path(),
@@ -960,7 +1108,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 	struct mapped_pci_res_list *vfio_res_list =
 		RTE_TAILQ_CAST(rte_vfio_tailq.head, mapped_pci_res_list);
 
-	struct pci_map *maps;
+	struct pci_map *maps, *cur;
 
 	dev->intr_handle.fd = -1;
 #ifdef HAVE_VFIO_DEV_REQ_INTERFACE
@@ -1012,14 +1160,49 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev)
 	maps = vfio_res->maps;
 
 	for (i = 0; i < vfio_res->nb_maps; i++) {
-		ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i, MAP_FIXED);
-		if (ret < 0) {
-			RTE_LOG(ERR, EAL, "%s mapping BAR%i failed: %s\n",
-					pci_addr, i, strerror(errno));
-			goto err_vfio_dev_fd;
+		cur = &maps[i];
+		if (cur->nr_areas > 1) {
+			struct vfio_region_sparse_mmap_area *areas;
+			uint32_t j;
+
+			areas = malloc(sizeof(*areas) * cur->nr_areas);
+			if (areas == NULL) {
+				RTE_LOG(ERR, EAL, "Failed to alloc vfio areas for %s\n",
+					pci_addr);
+				goto err_vfio_dev_fd;
+			}
+
+			for (j = 0; j < cur->nr_areas; j++) {
+				areas[j].offset = cur->areas[j].offset;
+				areas[j].size = cur->areas[j].size;
+			}
+
+			ret = pci_vfio_sparse_mmap_bar(vfio_dev_fd, vfio_res,
+				areas, cur->nr_areas, i, MAP_FIXED,
+				dev->device.numa_node);
+			if (ret < 0) {
+				RTE_LOG(ERR, EAL, "%s sparse mapping BAR%i failed: %s\n",
+						pci_addr, i, strerror(errno));
+				free(areas);
+				goto err_vfio_dev_fd;
+			}
+
+			free(areas);
+		} else {
+			ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res,
+				i, MAP_FIXED);
+			if (ret < 0) {
+				RTE_LOG(ERR, EAL, "%s mapping BAR%i failed: %s\n",
+						pci_addr, i, strerror(errno));
+				goto err_vfio_dev_fd;
+			}
+
+			if (dev->is_mdev)
+				cur->areas[0].addr = cur->addr;
+			else
+				dev->mem_resource[i].addr = cur->addr;
 		}
 
-		dev->mem_resource[i].addr = maps[i].addr;
 	}
 
 	/* we need save vfio_dev_fd, so it can be used during release */
@@ -1054,8 +1237,6 @@ find_and_unmap_vfio_resource(struct mapped_pci_res_list *vfio_res_list,
 			const char *pci_addr)
 {
 	struct mapped_pci_resource *vfio_res = NULL;
-	struct pci_map *maps;
-	int i;
 
 	/* Get vfio_res */
 	TAILQ_FOREACH(vfio_res, vfio_res_list, next) {
@@ -1079,19 +1260,7 @@ find_and_unmap_vfio_resource(struct mapped_pci_res_list *vfio_res_list,
 	RTE_LOG(INFO, EAL, "Releasing PCI mapped resource for %s\n",
 		pci_addr);
 
-	maps = vfio_res->maps;
-	for (i = 0; i < vfio_res->nb_maps; i++) {
-
-		/*
-		 * We do not need to be aware of MSI-X table BAR mappings as
-		 * when mapping. Just using current maps array is enough
-		 */
-		if (maps[i].addr) {
-			RTE_LOG(INFO, EAL, "Calling pci_unmap_resource for %s at %p\n",
-				pci_addr, maps[i].addr);
-			pci_unmap_resource(maps[i].addr, maps[i].size);
-		}
-	}
+	clean_up_pci_resource(vfio_res);
 
 	return vfio_res;
 }
diff --git a/drivers/bus/pci/private.h b/drivers/bus/pci/private.h
index 3515c086aa..8d94d8acf8 100644
--- a/drivers/bus/pci/private.h
+++ b/drivers/bus/pci/private.h
@@ -110,6 +110,8 @@ struct pci_map {
 	uint64_t offset;
 	uint64_t size;
 	uint64_t phaddr;
+	uint32_t nr_areas;
+	struct rte_mem_map_area *areas;
 };
 
 struct pci_msix_table {
diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
index fb7d934bd0..ddc913f121 100644
--- a/drivers/bus/pci/rte_bus_pci.h
+++ b/drivers/bus/pci/rte_bus_pci.h
@@ -70,6 +70,18 @@ enum rte_pci_kernel_driver {
 	RTE_PCI_KDRV_NET_UIO,      /* NetUIO for Windows */
 };
 
+struct rte_mem_map_area {
+	void *addr;
+	uint64_t offset;
+	uint64_t size;
+};
+
+struct rte_sparse_mem_map {
+	uint64_t size;
+	uint32_t nr_maps;
+	struct rte_mem_map_area *areas;
+};
+
 /**
  * A structure describing a PCI device.
  */
@@ -82,8 +94,12 @@ struct rte_pci_device {
 	};
 	uint8_t is_mdev;                    /**< True for mediated PCI device */
 	struct rte_pci_id id;               /**< PCI ID. */
-	struct rte_mem_resource mem_resource[PCI_MAX_RESOURCE];
+	union {
+		struct rte_mem_resource mem_resource[PCI_MAX_RESOURCE];
 					    /**< PCI Memory Resource */
+		struct rte_sparse_mem_map sparse_mem[PCI_MAX_RESOURCE];
+					    /**< Sparse Memory Map for Mdev */
+	};
 	struct rte_intr_handle intr_handle; /**< Interrupt handle */
 	struct rte_pci_driver *driver;      /**< PCI driver used in probing */
 	uint16_t max_vfs;                   /**< sriov enable if not zero */
-- 
2.17.1
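
For readers following the sparse mmap change, here is a minimal sketch of how a region offset could be resolved once a BAR is described by several mapped areas. The structs are local stand-ins that mirror the fields of rte_mem_map_area and rte_sparse_mem_map from the patch, and sparse_map_lookup() is a hypothetical helper that is not part of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the fields added by the patch (illustrative). */
struct map_area {
	void *addr;       /* virtual address of this mmap'ed area */
	uint64_t offset;  /* offset of the area within the region */
	uint64_t size;    /* length of the area in bytes */
};

struct sparse_map {
	uint64_t size;           /* total size of the region */
	uint32_t nr_maps;        /* number of entries in areas[] */
	struct map_area *areas;
};

/*
 * Hypothetical helper: return the virtual address backing region offset
 * `off`, or NULL if the offset is outside the region or in an unmapped gap.
 */
static void *
sparse_map_lookup(const struct sparse_map *map, uint64_t off)
{
	uint32_t i;

	assert(map != NULL);
	if (off >= map->size)
		return NULL;

	for (i = 0; i < map->nr_maps; i++) {
		const struct map_area *a = &map->areas[i];

		/* off lies in [a->offset, a->offset + a->size) */
		if (off >= a->offset && off - a->offset < a->size)
			return (uint8_t *)a->addr + (off - a->offset);
	}
	return NULL;
}
```

An offset that falls in a gap returns NULL here; a real driver would presumably have to access such ranges through the VFIO device fd rather than through a mapping.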


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs Chenbo Xia
@ 2021-06-01  5:37         ` Stephen Hemminger
  2021-06-08  5:47           ` Xia, Chenbo
  2021-06-01  5:39         ` Stephen Hemminger
  2021-06-11  7:19         ` Thomas Monjalon
  2 siblings, 1 reply; 41+ messages in thread
From: Stephen Hemminger @ 2021-06-01  5:37 UTC (permalink / raw)
  To: Chenbo Xia
  Cc: dev, thomas, cunming.liang, jingjing.wu, anatoly.burakov,
	ferruh.yigit, mdr, nhorman, bruce.richardson, david.marchand,
	konstantin.ananyev, Tiwei Bie

On Tue,  1 Jun 2021 11:06:42 +0800
Chenbo Xia <chenbo.xia@intel.com> wrote:

> +int
> +rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
> +{
> +	FILE *f;
> +
> +	f = fopen(filename, "r");
> +	if (f == NULL) {
> +		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
> +			__func__, filename);
> +		return -1;
> +	}
> +
> +	if (fgets(buf, sz, f) == NULL) {
> +		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs file %s\n",
> +			__func__, filename);
> +		fclose(f);
> +		return -1;
> +	}
> +
> +	fclose(f);
> +	return 0;
> +}

It would be helpful if the function removed the trailing newline.
	*strchrnul(buf, '\n') = '\0';
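
The suggestion relies on strchrnul(), a GNU extension that returns a pointer to the first newline or, if there is none, to the terminating NUL. A portable equivalent can be sketched with the standard strcspn(); the helper name strip_trailing_newline() is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: cut `buf` at its first newline, if any. */
static void
strip_trailing_newline(char *buf)
{
	assert(buf != NULL);
	/*
	 * strcspn() returns the length of the prefix that contains no '\n'.
	 * Without a newline it returns strlen(buf), so the store below just
	 * rewrites the existing terminator.
	 */
	buf[strcspn(buf, "\n")] = '\0';
}
```

Calling it on the buffer right after a successful fgets() would give callers a string they can compare directly against values such as "vfio-pci".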

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs Chenbo Xia
  2021-06-01  5:37         ` Stephen Hemminger
@ 2021-06-01  5:39         ` Stephen Hemminger
  2021-06-08  5:48           ` Xia, Chenbo
  2021-06-11  7:19         ` Thomas Monjalon
  2 siblings, 1 reply; 41+ messages in thread
From: Stephen Hemminger @ 2021-06-01  5:39 UTC (permalink / raw)
  To: Chenbo Xia
  Cc: dev, thomas, cunming.liang, jingjing.wu, anatoly.burakov,
	ferruh.yigit, mdr, nhorman, bruce.richardson, david.marchand,
	konstantin.ananyev, Tiwei Bie

On Tue,  1 Jun 2021 11:06:42 +0800
Chenbo Xia <chenbo.xia@intel.com> wrote:

>  
> +int
> +rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
> +{
> +	FILE *f;
> +
> +	f = fopen(filename, "r");
> +	if (f == NULL) {
> +		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
> +			__func__, filename);

Helpful to decode errno.
		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s:%s\n",
			__func__, filename, strerror(errno));
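
A self-contained sketch of the same idea outside DPDK follows. read_first_line() is a hypothetical stand-in for the helper under review: it logs with fprintf(stderr, ...) instead of RTE_LOG, and returning a negative errno value is an assumption made for illustration, not the convention used by the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical reader: copy the first line of `path` into `buf`.
 * Returns 0 on success or a negative errno value on failure.
 */
static int
read_first_line(const char *path, char *buf, size_t sz)
{
	FILE *f;
	int err;

	assert(path != NULL && buf != NULL && sz > 0);

	f = fopen(path, "r");
	if (f == NULL) {
		err = errno; /* save it before another call can overwrite it */
		fprintf(stderr, "%s(): cannot open %s: %s\n",
			__func__, path, strerror(err));
		return -err;
	}

	if (fgets(buf, (int)sz, f) == NULL) {
		/* Either a read error or an empty file; report both as EIO. */
		fprintf(stderr, "%s(): cannot read %s\n", __func__, path);
		fclose(f);
		return -EIO;
	}

	fclose(f);
	return 0;
}
```

Saving errno into a local before logging matters because the logging call itself may change it.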


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs
  2021-06-01  5:37         ` Stephen Hemminger
@ 2021-06-08  5:47           ` Xia, Chenbo
  0 siblings, 0 replies; 41+ messages in thread
From: Xia, Chenbo @ 2021-06-08  5:47 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, thomas, Liang, Cunming, Wu, Jingjing, Burakov, Anatoly,
	Yigit, Ferruh, mdr, nhorman, Richardson, Bruce, david.marchand,
	Ananyev, Konstantin, Tiwei Bie

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, June 1, 2021 1:38 PM
> To: Xia, Chenbo <chenbo.xia@intel.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; Liang, Cunming
> <cunming.liang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Burakov,
> Anatoly <anatoly.burakov@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>;
> mdr@ashroe.eu; nhorman@tuxdriver.com; Richardson, Bruce
> <bruce.richardson@intel.com>; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Tiwei Bie <tiwei.bie@intel.com>
> Subject: Re: [RFC v3 4/6] eal: add a helper for reading string from sysfs
> 
> On Tue,  1 Jun 2021 11:06:42 +0800
> Chenbo Xia <chenbo.xia@intel.com> wrote:
> 
> > +int
> > +rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
> > +{
> > +	FILE *f;
> > +
> > +	f = fopen(filename, "r");
> > +	if (f == NULL) {
> > +		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
> > +			__func__, filename);
> > +		return -1;
> > +	}
> > +
> > +	if (fgets(buf, sz, f) == NULL) {
> > +		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs file %s\n",
> > +			__func__, filename);
> > +		fclose(f);
> > +		return -1;
> > +	}
> > +
> > +	fclose(f);
> > +	return 0;
> > +}
> 
> It would be helpful if the function removed the trailing newline.
> 	*strchrnul(buf, '\n') = '\0';

Makes sense. Will fix it.

Thanks,
Chenbo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs
  2021-06-01  5:39         ` Stephen Hemminger
@ 2021-06-08  5:48           ` Xia, Chenbo
  0 siblings, 0 replies; 41+ messages in thread
From: Xia, Chenbo @ 2021-06-08  5:48 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, thomas, Liang, Cunming, Wu, Jingjing, Burakov, Anatoly,
	Yigit, Ferruh, mdr, nhorman, Richardson, Bruce, david.marchand,
	Ananyev, Konstantin, Tiwei Bie

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, June 1, 2021 1:39 PM
> To: Xia, Chenbo <chenbo.xia@intel.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; Liang, Cunming
> <cunming.liang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Burakov,
> Anatoly <anatoly.burakov@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>;
> mdr@ashroe.eu; nhorman@tuxdriver.com; Richardson, Bruce
> <bruce.richardson@intel.com>; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Tiwei Bie <tiwei.bie@intel.com>
> Subject: Re: [RFC v3 4/6] eal: add a helper for reading string from sysfs
> 
> On Tue,  1 Jun 2021 11:06:42 +0800
> Chenbo Xia <chenbo.xia@intel.com> wrote:
> 
> >
> > +int
> > +rte_eal_parse_sysfs_str(const char *filename, char *buf, unsigned long sz)
> > +{
> > +	FILE *f;
> > +
> > +	f = fopen(filename, "r");
> > +	if (f == NULL) {
> > +		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s\n",
> > +			__func__, filename);
> 
> Helpful to decode errno.
> 		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs file %s:%s\n",
> 			__func__, filename, strerror(errno));

Yes. Will fix.

Thanks,
Chenbo


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK
  2021-06-01  3:06     ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Chenbo Xia
                         ` (5 preceding siblings ...)
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 6/6] bus/pci: add sparse mmap support for mediated PCI devices Chenbo Xia
@ 2021-06-11  7:15       ` Thomas Monjalon
  2021-06-15  2:49         ` Xia, Chenbo
  6 siblings, 1 reply; 41+ messages in thread
From: Thomas Monjalon @ 2021-06-11  7:15 UTC (permalink / raw)
  To: Chenbo Xia
  Cc: dev, cunming.liang, jingjing.wu, anatoly.burakov, ferruh.yigit,
	mdr, nhorman, bruce.richardson, david.marchand, stephen,
	konstantin.ananyev

01/06/2021 05:06, Chenbo Xia:
> Hi everyone,
> 
> This is a draft implementation of the mdev (Mediated device [1])
> support in DPDK PCI bus driver. Mdev is a way to virtualize devices
> in Linux kernel. Based on the device-api (mdev_type/device_api),
> there could be different types of mdev devices (e.g. vfio-pci).

Could you please illustrate this with a use case for mdev in DPDK?
What does it enable that is not possible today?

> In this patchset, the PCI bus driver is extended to support scanning
> and probing the mdev devices whose device-api is "vfio-pci".
> 
>                      +---------+
>                      | PCI bus |
>                      +----+----+
>                           |
>          +--------+-------+-------+--------+
>          |        |               |        |
>   Physical PCI devices ...   Mediated PCI devices ...
> 
> The first four patches in this patchset are mainly preparation of mdev
> bus support. The left two patches are the key implementation of mdev bus.
> 
> The implementation of mdev bus in DPDK has several options:
> 
> 1: Embed mdev bus in current pci bus
> 
>    This patchset takes this option for an example. Mdev has several
>    device types: pci/platform/amba/ccw/ap. DPDK currently only cares
>    pci devices in all mdev device types so we could embed the mdev bus
>    into current pci bus. Then pci bus with mdev support will scan/plug/
>    unplug/.. not only normal pci devices but also mediated pci devices.

I think it is a different bus.
It would be cleaner to not touch the PCI bus.
Having a separate bus will allow an easy way to identify a device
with the new generic devargs syntax, example:
	bus=mdev,uuid=XXX
or more complex:
	bus=mdev,uuid=XXX/class=crypto/driver=qat,foo=bar
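
To make the layered syntax concrete, here is a rough sketch of pulling one key out of such a string, assuming layers are separated by '/' and key=value pairs by ','. devargs_lookup() is a hypothetical helper, not the DPDK devargs parser, and it deliberately ignores which layer a key belongs to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical lookup: find `key` in a layered devargs string. Copies the
 * value into `out` and returns 0, or returns -1 if the key is absent or
 * the value does not fit in `out`.
 */
static int
devargs_lookup(const char *devargs, const char *key, char *out, size_t outsz)
{
	const char *p = devargs;
	size_t klen;

	assert(devargs != NULL && key != NULL && out != NULL && outsz > 0);
	klen = strlen(key);

	while (*p != '\0') {
		size_t tok = strcspn(p, "/,"); /* length of this key=value token */

		if (tok > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
			size_t vlen = tok - klen - 1;

			if (vlen >= outsz)
				return -1;
			memcpy(out, p + klen + 1, vlen);
			out[vlen] = '\0';
			return 0;
		}
		p += tok;
		if (*p != '\0')
			p++; /* skip the '/' or ',' separator */
	}
	return -1;
}
```

With the second example string above, looking up "bus" yields "mdev" and looking up "driver" yields "qat".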

> 2: A new mdev bus that scans mediated pci devices and probes mdev driver to
>    plug-in pci devices to pci bus
> 
>    If we took this option, a new mdev bus will be implemented to scan
>    mediated pci devices and a new mdev driver for pci devices will be
>    implemented in pci bus to plug-in mediated pci devices to pci bus.
> 
>    Our RFC v1 takes this option:
>    http://patchwork.dpdk.org/project/dpdk/cover/20190403071844.21126-1-tiwei.bie@intel.com/
> 
>    Note that: for either option 1 or 2, device drivers do not know the
>    implementation difference but only use structs/functions exposed by
>    pci bus. Mediated pci devices are different from normal pci devices
>    on: 1. Mediated pci devices use UUID as address but normal ones use BDF.
>    2. Mediated pci devices may have some capabilities that normal pci
>    devices do not have. For example, mediated pci devices could have
>    regions that have sparse mmap capability, which allows a region to have
>    multiple mmap areas. Another example is mediated pci devices may have
>    regions/part of regions not mmaped but need to access them. Above
>    difference will change the current ABI (i.e., struct rte_pci_device).
>    Please check 5th and 6th patch for details.
> 
> 3. A brand new mdev bus that does everything
> 
>    This option will implement a new and standalone mdev bus. This option
>    does not need any changes in current pci bus but only needs some shared
>    code (linux vfio part) in pci bus. Drivers of devices that support mdev
>    will register itself as a mdev driver and do not rely on pci bus anymore.
>    This option, IMHO, will make the code clean. The only potential problem
>    may be code duplication, which could be solved by making code of linux
>    vfio part of pci bus common and shared.

Yes I prefer this third option.
We can find an elegant way of sharing some VFIO code between buses.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs
  2021-06-01  3:06       ` [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs Chenbo Xia
  2021-06-01  5:37         ` Stephen Hemminger
  2021-06-01  5:39         ` Stephen Hemminger
@ 2021-06-11  7:19         ` Thomas Monjalon
  2 siblings, 0 replies; 41+ messages in thread
From: Thomas Monjalon @ 2021-06-11  7:19 UTC (permalink / raw)
  To: Tiwei Bie, Chenbo Xia
  Cc: dev, cunming.liang, jingjing.wu, anatoly.burakov, ferruh.yigit,
	mdr, nhorman, bruce.richardson, david.marchand, stephen,
	konstantin.ananyev

01/06/2021 05:06, Chenbo Xia:
> From: Tiwei Bie <tiwei.bie@intel.com>
> 
> This patch adds a helper for reading string from sysfs.
> 
> Signed-off-by: Cunming Liang <cunming.liang@intel.com>
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
>  lib/eal/common/eal_filesystem.h | 10 ++++++++++
>  lib/eal/freebsd/eal.c           | 22 ++++++++++++++++++++++
>  lib/eal/linux/eal.c             | 22 ++++++++++++++++++++++
>  lib/eal/version.map             |  3 +++
>  4 files changed, 57 insertions(+)

3 separate comments:

1/ How much code is portable between Linux and FreeBSD?
I guess the path will be different?

2/ Please think about a Windows stub.

3/ Instead of EAL, we should start lib/sysfs/
I have other ideas of sysfs functions for PMD use.





^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK
  2021-06-11  7:15       ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Thomas Monjalon
@ 2021-06-15  2:49         ` Xia, Chenbo
  2021-06-15  7:48           ` Thomas Monjalon
  0 siblings, 1 reply; 41+ messages in thread
From: Xia, Chenbo @ 2021-06-15  2:49 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Liang, Cunming, Wu, Jingjing, Burakov, Anatoly, Yigit,
	Ferruh, mdr, nhorman, Richardson, Bruce, david.marchand, stephen,
	Ananyev, Konstantin

Hi Thomas,

> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, June 11, 2021 3:16 PM
> To: Xia, Chenbo <chenbo.xia@intel.com>
> Cc: dev@dpdk.org; Liang, Cunming <cunming.liang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>; Burakov, Anatoly <anatoly.burakov@intel.com>; Yigit,
> Ferruh <ferruh.yigit@intel.com>; mdr@ashroe.eu; nhorman@tuxdriver.com;
> Richardson, Bruce <bruce.richardson@intel.com>; david.marchand@redhat.com;
> stephen@networkplumber.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in
> DPDK
> 
> 01/06/2021 05:06, Chenbo Xia:
> > Hi everyone,
> >
> > This is a draft implementation of the mdev (Mediated device [1])
> > support in DPDK PCI bus driver. Mdev is a way to virtualize devices
> > in Linux kernel. Based on the device-api (mdev_type/device_api),
> > there could be different types of mdev devices (e.g. vfio-pci).
> 
> Could you please illustrate this with a use case for mdev in DPDK?
> What does it enable which is not possible today?

The main purpose is to let DPDK drive mdev-based devices, which is not
possible today.

Take PCI devices as an example. Currently DPDK can only drive devices on
the physical PCI bus, which the kernel exposes to applications under
/sys/bus/pci.

But some PCI devices use vfio-mdev as a software framework to expose
mediated devices to applications under /sys/bus/mdev. A device can
virtualize itself this way to let multiple applications share one physical
device. For example, Intel Scalable IOV is known to use vfio-mdev as the
software framework for Scalable IOV enabled devices (and Intel net/crypto/raw
devices support this technology). For those mdev-based devices, DPDK needs
bus-layer support to scan/plug/probe/.. them, which is the main work of this
patchset. Other devices also use the vfio-mdev framework: AFAIK, Nvidia's GPU
was the first to use mdev, and Intel's GPU virtualization uses it as well.
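
As a rough illustration of the scan step, the sketch below walks a devices directory (for example /sys/bus/mdev/devices) and counts the entries whose mdev_type/device_api file matches a given string. count_mdev_by_api() is a hypothetical helper written for this note; it is not code from the patchset and does no UUID parsing or probing.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical scan: count entries under `devices_dir` whose
 * mdev_type/device_api file reads `api` (for example "vfio-pci").
 * Returns the count, or -1 if the directory cannot be opened.
 */
static int
count_mdev_by_api(const char *devices_dir, const char *api)
{
	char path[4096];
	char line[64];
	struct dirent *e;
	int count = 0;
	DIR *d;

	assert(devices_dir != NULL && api != NULL);

	d = opendir(devices_dir);
	if (d == NULL)
		return -1;

	while ((e = readdir(d)) != NULL) {
		FILE *f;

		if (e->d_name[0] == '.')
			continue; /* skip ".", ".." and hidden entries */

		snprintf(path, sizeof(path), "%s/%s/mdev_type/device_api",
			 devices_dir, e->d_name);
		f = fopen(path, "r");
		if (f == NULL)
			continue; /* no readable device_api for this entry */

		if (fgets(line, sizeof(line), f) != NULL) {
			line[strcspn(line, "\n")] = '\0';
			if (strcmp(line, api) == 0)
				count++;
		}
		fclose(f);
	}
	closedir(d);
	return count;
}
```

Taking the directory as a parameter keeps the sketch testable against a temporary tree instead of the real sysfs.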

> 
> > In this patchset, the PCI bus driver is extended to support scanning
> > and probing the mdev devices whose device-api is "vfio-pci".
> >
> >                      +---------+
> >                      | PCI bus |
> >                      +----+----+
> >                           |
> >          +--------+-------+-------+--------+
> >          |        |               |        |
> >   Physical PCI devices ...   Mediated PCI devices ...
> >
> > The first four patches in this patchset are mainly preparation of mdev
> > bus support. The left two patches are the key implementation of mdev bus.
> >
> > The implementation of mdev bus in DPDK has several options:
> >
> > 1: Embed mdev bus in current pci bus
> >
> >    This patchset takes this option for an example. Mdev has several
> >    device types: pci/platform/amba/ccw/ap. DPDK currently only cares
> >    pci devices in all mdev device types so we could embed the mdev bus
> >    into current pci bus. Then pci bus with mdev support will scan/plug/
> >    unplug/.. not only normal pci devices but also mediated pci devices.
> 
> I think it is a different bus.
> It would be cleaner to not touch the PCI bus.
> Having a separate bus will allow an easy way to identify a device
> with the new generic devargs syntax, example:
> 	bus=mdev,uuid=XXX
> or more complex:
> 	bus=mdev,uuid=XXX/class=crypto/driver=qat,foo=bar

OK, I agree that it is cleaner not to touch the PCI bus. There may also need
to be a 'type=pci' key, as mdev defines several types (pci/ap/platform/ccw/...).

> 
> > 2: A new mdev bus that scans mediated pci devices and probes mdev driver to
> >    plug-in pci devices to pci bus
> >
> >    If we took this option, a new mdev bus will be implemented to scan
> >    mediated pci devices and a new mdev driver for pci devices will be
> >    implemented in pci bus to plug-in mediated pci devices to pci bus.
> >
> >    Our RFC v1 takes this option:
> >    http://patchwork.dpdk.org/project/dpdk/cover/20190403071844.21126-1-tiwei.bie@intel.com/
> >
> >    Note that: for either option 1 or 2, device drivers do not know the
> >    implementation difference but only use structs/functions exposed by
> >    pci bus. Mediated pci devices are different from normal pci devices
> >    on: 1. Mediated pci devices use UUID as address but normal ones use BDF.
> >    2. Mediated pci devices may have some capabilities that normal pci
> >    devices do not have. For example, mediated pci devices could have
> >    regions that have sparse mmap capability, which allows a region to have
> >    multiple mmap areas. Another example is mediated pci devices may have
> >    regions/part of regions not mmaped but need to access them. Above
> >    difference will change the current ABI (i.e., struct rte_pci_device).
> >    Please check 5th and 6th patch for details.
> >
> > 3. A brand new mdev bus that does everything
> >
> >    This option will implement a new and standalone mdev bus. This option
> >    does not need any changes in current pci bus but only needs some shared
> >    code (linux vfio part) in pci bus. Drivers of devices that support mdev
> >    will register itself as a mdev driver and do not rely on pci bus anymore.
> >    This option, IMHO, will make the code clean. The only potential problem
> >    may be code duplication, which could be solved by making code of linux
> >    vfio part of pci bus common and shared.
> 
> Yes I prefer this third option.
> We can find an elegant way of sharing some VFIO code between buses.

Yes, I have not thought about the details of the code sharing but will try to make
it elegant.

Thanks,
Chenbo



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK
  2021-06-15  2:49         ` Xia, Chenbo
@ 2021-06-15  7:48           ` Thomas Monjalon
  2021-06-15 10:44             ` Xia, Chenbo
  2021-06-15 11:57             ` Jason Gunthorpe
  0 siblings, 2 replies; 41+ messages in thread
From: Thomas Monjalon @ 2021-06-15  7:48 UTC (permalink / raw)
  To: Xia, Chenbo
  Cc: dev, Liang, Cunming, Wu, Jingjing, Burakov, Anatoly, Yigit,
	Ferruh, mdr, nhorman, Richardson, Bruce, david.marchand, stephen,
	Ananyev, Konstantin, jgg, parav, xuemingl

15/06/2021 04:49, Xia, Chenbo:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 01/06/2021 05:06, Chenbo Xia:
> > > Hi everyone,
> > >
> > > This is a draft implementation of the mdev (Mediated device [1])
> > > support in DPDK PCI bus driver. Mdev is a way to virtualize devices
> > > in Linux kernel. Based on the device-api (mdev_type/device_api),
> > > there could be different types of mdev devices (e.g. vfio-pci).
> > 
> > Could you please illustrate this with a use case for mdev in DPDK?
> > What does it enable which is not possible today?
> 
> The main purpose is for DPDK to drive mdev-based devices, which is not
> possible today.
> 
> I'd take PCI devices for an example. Currently DPDK can only drive devices
> of physical pci bus under /sys/bus/pci and kernel exposes the pci devices
> to APP in that way.
> 
> But there are PCI devices using vfio-mdev as a software framework to expose
> Mdev to APP under /sys/bus/mdev. Devices could choose this way of virtualizing
> itself to let multiple APPs share one physical device. For example, Intel
> Scalable IOV technology is known to use vfio-mdev as SW framework for Scalable
> IOV enabled devices (and Intel net/crypto/raw devices support this tech). For
> those mdev-based devices, DPDK needs support on the bus layer to scan/plug/probe/..
> them, which is the main effort this patchset does. There are also other devices
> using the vfio-mdev framework, AFAIK, Nvidia's GPU is the first one using mdev
> and Intel's GPU virtualization also uses it.

Yes, I think mdev was designed for virtualization.
Using mdev for Scalable IOV without virtualization
may be seen as an abuse by the Linux maintainers,
as they currently seem to prefer the auxiliary bus (which is a real bus).

Mellanox got pushback when trying to use mdev for the same purpose
(Scalable Function, also called Sub-Function) in the kernel.
The Linux community decided to use the auxiliary bus instead.

Any other feedback on the choice mdev vs aux?
Is there any kernel code supporting this mdev model for Intel devices?

> > > In this patchset, the PCI bus driver is extended to support scanning
> > > and probing the mdev devices whose device-api is "vfio-pci".
> > >
> > >                      +---------+
> > >                      | PCI bus |
> > >                      +----+----+
> > >                           |
> > >          +--------+-------+-------+--------+
> > >          |        |               |        |
> > >   Physical PCI devices ...   Mediated PCI devices ...
> > >
> > > The first four patches in this patchset are mainly preparation of mdev
> > > bus support. The left two patches are the key implementation of mdev bus.
> > >
> > > The implementation of mdev bus in DPDK has several options:
> > >
> > > 1: Embed mdev bus in current pci bus
> > >
> > >    This patchset takes this option for an example. Mdev has several
> > >    device types: pci/platform/amba/ccw/ap. DPDK currently only cares
> > >    pci devices in all mdev device types so we could embed the mdev bus
> > >    into current pci bus. Then pci bus with mdev support will scan/plug/
> > >    unplug/.. not only normal pci devices but also mediated pci devices.
> > 
> > I think it is a different bus.
> > It would be cleaner to not touch the PCI bus.
> > Having a separate bus will allow an easy way to identify a device
> > with the new generic devargs syntax, example:
> > 	bus=mdev,uuid=XXX
> > or more complex:
> > 	bus=mdev,uuid=XXX/class=crypto/driver=qat,foo=bar
> 
> OK. Agree on cleaner to not touch PCI bus. And there may also be a 'type=pci'
> as mdev has several types in its definition (pci/ap/platform/ccw/...).
> 
> > > 2: A new mdev bus that scans mediated pci devices and probes mdev driver to
> > >    plug-in pci devices to pci bus
> > >
> > >    If we took this option, a new mdev bus will be implemented to scan
> > >    mediated pci devices and a new mdev driver for pci devices will be
> > >    implemented in pci bus to plug-in mediated pci devices to pci bus.
> > >
> > >    Our RFC v1 takes this option:
> > >    http://patchwork.dpdk.org/project/dpdk/cover/20190403071844.21126-1-tiwei.bie@intel.com/
> > >
> > >    Note that: for either option 1 or 2, device drivers do not know the
> > >    implementation difference but only use structs/functions exposed by
> > >    pci bus. Mediated pci devices are different from normal pci devices
> > >    on: 1. Mediated pci devices use UUID as address but normal ones use BDF.
> > >    2. Mediated pci devices may have some capabilities that normal pci
> > >    devices do not have. For example, mediated pci devices could have
> > >    regions that have sparse mmap capability, which allows a region to have
> > >    multiple mmap areas. Another example is mediated pci devices may have
> > >    regions/part of regions not mmaped but need to access them. Above
> > >    difference will change the current ABI (i.e., struct rte_pci_device).
> > >    Please check 5th and 6th patch for details.
> > >
> > > 3. A brand new mdev bus that does everything
> > >
> > >    This option will implement a new and standalone mdev bus. This option
> > >    does not need any changes in current pci bus but only needs some shared
> > >    code (linux vfio part) in pci bus. Drivers of devices that support mdev
> > >    will register itself as a mdev driver and do not rely on pci bus anymore.
> > >    This option, IMHO, will make the code clean. The only potential problem
> > >    may be code duplication, which could be solved by making code of linux
> > >    vfio part of pci bus common and shared.
> > 
> > Yes I prefer this third option.
> > We can find an elegant way of sharing some VFIO code between buses.
> 
> Yes, I have not thought about the details of the code sharing but will try to make
> it elegant.

Great, thanks.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK
  2021-06-15  7:48           ` Thomas Monjalon
@ 2021-06-15 10:44             ` Xia, Chenbo
  2021-06-15 11:57             ` Jason Gunthorpe
  1 sibling, 0 replies; 41+ messages in thread
From: Xia, Chenbo @ 2021-06-15 10:44 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Liang, Cunming, Wu, Jingjing, Burakov, Anatoly, Yigit,
	Ferruh, mdr, nhorman, Richardson, Bruce, david.marchand, stephen,
	Ananyev, Konstantin, jgg, parav, xuemingl

Hi Thomas,

> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Tuesday, June 15, 2021 3:48 PM
> To: Xia, Chenbo <chenbo.xia@intel.com>
> Cc: dev@dpdk.org; Liang, Cunming <cunming.liang@intel.com>; Wu, Jingjing
> <jingjing.wu@intel.com>; Burakov, Anatoly <anatoly.burakov@intel.com>; Yigit,
> Ferruh <ferruh.yigit@intel.com>; mdr@ashroe.eu; nhorman@tuxdriver.com;
> Richardson, Bruce <bruce.richardson@intel.com>; david.marchand@redhat.com;
> stephen@networkplumber.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> jgg@nvidia.com; parav@nvidia.com; xuemingl@nvidia.com
> Subject: Re: [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in
> DPDK
> 
> 15/06/2021 04:49, Xia, Chenbo:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > 01/06/2021 05:06, Chenbo Xia:
> > > > Hi everyone,
> > > >
> > > > This is a draft implementation of the mdev (Mediated device [1])
> > > > support in DPDK PCI bus driver. Mdev is a way to virtualize devices
> > > > in Linux kernel. Based on the device-api (mdev_type/device_api),
> > > > there could be different types of mdev devices (e.g. vfio-pci).
> > >
> > > Could you please illustrate this with a use case for mdev in DPDK?
> > > What does it enable which is not possible today?
> >
> > The main purpose is for DPDK to drive mdev-based devices, which is not
> > possible today.
> >
> > I'd take PCI devices for an example. Currently DPDK can only drive devices
> > of physical pci bus under /sys/bus/pci and kernel exposes the pci devices
> > to APP in that way.
> >
> > But there are PCI devices using vfio-mdev as a software framework to expose
> > Mdev to APP under /sys/bus/mdev. Devices could choose this way of virtualizing
> > itself to let multiple APPs share one physical device. For example, Intel
> > Scalable IOV technology is known to use vfio-mdev as SW framework for Scalable
> > IOV enabled devices (and Intel net/crypto/raw devices support this tech). For
> > those mdev-based devices, DPDK needs support on the bus layer to scan/plug/probe/..
> > them, which is the main effort this patchset does. There are also other devices
> > using the vfio-mdev framework, AFAIK, Nvidia's GPU is the first one using mdev
> > and Intel's GPU virtualization also uses it.
> 
> Yes mdev was designed for virtualization I think.
> The use of mdev for Scalable IOV without virtualization
> may be seen as an abuse by Linux maintainers,
> as they currently seem to prefer the auxiliary bus (which is a real bus).
> 
> Mellanox got a push back when trying to use mdev for the same purpose
> (Scalable Function, also called Sub-Function) in the kernel.
> The Linux community decided to use the auxiliary bus.
> 
> Any other feedback on the choice mdev vs aux?

OK. Thanks for the info. Much appreciated.

I can look into this choice and get back to you later.

> Is there any kernel code supporting this mdev model for Intel devices?

Right now there is only the Intel GPU. But I think you care more about devices
that DPDK could drive: a DMA device (named ioat in DPDK, under raw/ioat) is on
its way upstream (https://www.spinics.net/lists/kvm/msg244417.html)

Thanks,
Chenbo

> 
> > > > In this patchset, the PCI bus driver is extended to support scanning
> > > > and probing the mdev devices whose device-api is "vfio-pci".
> > > >
> > > >                      +---------+
> > > >                      | PCI bus |
> > > >                      +----+----+
> > > >                           |
> > > >          +--------+-------+-------+--------+
> > > >          |        |               |        |
> > > >   Physical PCI devices ...   Mediated PCI devices ...
> > > >
> > > > The first four patches in this patchset are mainly preparation of mdev
> > > > bus support. The left two patches are the key implementation of mdev bus.
> > > >
> > > > The implementation of mdev bus in DPDK has several options:
> > > >
> > > > 1: Embed mdev bus in current pci bus
> > > >
> > > >    This patchset takes this option for an example. Mdev has several
> > > >    device types: pci/platform/amba/ccw/ap. DPDK currently only cares
> > > >    pci devices in all mdev device types so we could embed the mdev bus
> > > >    into current pci bus. Then pci bus with mdev support will scan/plug/
> > > >    unplug/.. not only normal pci devices but also mediated pci devices.
> > >
> > > I think it is a different bus.
> > > It would be cleaner to not touch the PCI bus.
> > > Having a separate bus will allow an easy way to identify a device
> > > with the new generic devargs syntax, example:
> > > 	bus=mdev,uuid=XXX
> > > or more complex:
> > > 	bus=mdev,uuid=XXX/class=crypto/driver=qat,foo=bar
> >
> > OK. Agree on cleaner to not touch PCI bus. And there may also be a
> > 'type=pci' as mdev has several types in its definition
> > (pci/ap/platform/ccw/...).
> >
> > > > 2: A new mdev bus that scans mediated pci devices and probes mdev
> > > >    driver to plug-in pci devices to pci bus
> > > >
> > > >    If we took this option, a new mdev bus will be implemented to scan
> > > >    mediated pci devices and a new mdev driver for pci devices will be
> > > >    implemented in pci bus to plug-in mediated pci devices to pci bus.
> > > >
> > > >    Our RFC v1 takes this option:
> > > >    http://patchwork.dpdk.org/project/dpdk/cover/20190403071844.21126-1-tiwei.bie@intel.com/
> > > >
> > > >    Note that: for either option 1 or 2, device drivers do not know the
> > > >    implementation difference but only use structs/functions exposed by
> > > >    pci bus. Mediated pci devices are different from normal pci devices
> > > >    on: 1. Mediated pci devices use UUID as address but normal ones use
> > > >    BDF. 2. Mediated pci devices may have some capabilities that normal
> > > >    pci devices do not have. For example, mediated pci devices could
> > > >    have regions that have sparse mmap capability, which allows a region
> > > >    to have multiple mmap areas. Another example is mediated pci devices
> > > >    may have
> > > >    regions/part of regions not mmaped but need to access them. Above
> > > >    difference will change the current ABI (i.e., struct rte_pci_device).
> > > >    Please check 5th and 6th patch for details.
> > > >
> > > > 3: A brand new mdev bus that does everything
> > > >
> > > >    This option will implement a new and standalone mdev bus. This option
> > > >    does not need any changes in current pci bus but only needs some
> > > >    shared code (linux vfio part) in pci bus. Drivers of devices that
> > > >    support mdev will register itself as a mdev driver and do not rely
> > > >    on pci bus anymore. This option, IMHO, will make the code clean. The
> > > >    only potential problem may be code duplication, which could be
> > > >    solved by making code of linux vfio part of pci bus common and
> > > >    shared.
> > >
> > > Yes I prefer this third option.
> > > We can find an elegant way of sharing some VFIO code between buses.
> >
> > Yes, I have not thought about the details of the code sharing but will
> > try to make it elegant.
> 
> Great, thanks.
> 
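
To make the devargs example quoted above concrete (bus=mdev,uuid=XXX,
optionally followed by /class=... and /driver=... layers), here is a
minimal C sketch of pulling the uuid out of the bus layer and checking
that it has UUID form rather than BDF form. The function names
(bus_layer_get, is_mdev_uuid) are hypothetical and are not DPDK APIs, and
the sketch assumes layers are separated by '/' and key=value pairs by ','.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers for illustration only; these are not DPDK APIs. */

/* Copy the value of `key` from the first (bus) layer of `args` into `out`.
 * Returns true if the key was found and its value fit in `out`. */
static bool
bus_layer_get(const char *args, const char *key, char *out, size_t outlen)
{
	size_t keylen;
	const char *p;
	const char *end;

	assert(args != NULL && key != NULL && out != NULL);
	keylen = strlen(key);
	p = args;
	end = args + strcspn(args, "/");	/* end of the bus layer */

	while (p < end) {
		size_t pairlen = strcspn(p, ",/");	/* one key=value pair */

		if (pairlen > keylen && strncmp(p, key, keylen) == 0 &&
		    p[keylen] == '=') {
			size_t vallen = pairlen - keylen - 1;

			if (vallen >= outlen)
				return false;
			memcpy(out, p + keylen + 1, vallen);
			out[vallen] = '\0';
			return true;
		}
		p += pairlen;
		if (p < end)
			p++;	/* skip the ',' separator */
	}
	return false;
}

/* True if s is a canonical 8-4-4-4-12 hexadecimal UUID string. */
static bool
is_mdev_uuid(const char *s)
{
	size_t i;

	if (s == NULL || strlen(s) != 36)
		return false;
	for (i = 0; i < 36; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			if (s[i] != '-')
				return false;
		} else if (!isxdigit((unsigned char)s[i])) {
			return false;
		}
	}
	return true;
}
```

Keys that appear only in later layers (for example foo=bar in the driver
layer) are deliberately not matched, since a bus would only look at its
own layer.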


* Re: [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK
  2021-06-15  7:48           ` Thomas Monjalon
  2021-06-15 10:44             ` Xia, Chenbo
@ 2021-06-15 11:57             ` Jason Gunthorpe
  1 sibling, 0 replies; 41+ messages in thread
From: Jason Gunthorpe @ 2021-06-15 11:57 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Xia, Chenbo, dev, Liang, Cunming, Wu, Jingjing, Burakov, Anatoly,
	Yigit, Ferruh, mdr, nhorman, Richardson, Bruce, david.marchand,
	stephen, Ananyev, Konstantin, parav, xuemingl

On Tue, Jun 15, 2021 at 09:48:24AM +0200, Thomas Monjalon wrote:
> 15/06/2021 04:49, Xia, Chenbo:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > 01/06/2021 05:06, Chenbo Xia:
> > > > Hi everyone,
> > > >
> > > > This is a draft implementation of the mdev (Mediated device [1])
> > > > support in DPDK PCI bus driver. Mdev is a way to virtualize devices
> > > > in Linux kernel. Based on the device-api (mdev_type/device_api),
> > > > there could be different types of mdev devices (e.g. vfio-pci).
> > > 
> > > > Could you please illustrate with a use of mdev in DPDK?
> > > What does it enable which is not possible today?
> > 
> > The main purpose is for DPDK to drive mdev-based devices, which is not
> > possible today.
> > 
> > I'd take PCI devices for an example. Currently DPDK can only drive devices
> > of physical pci bus under /sys/bus/pci and kernel exposes the pci devices
> > to APP in that way.
> > 
> > But there are PCI devices using vfio-mdev as a software framework to expose
> > Mdev to APP under /sys/bus/mdev. Devices could choose this way of virtualizing
> > itself to let multiple APPs share one physical device. For example, Intel
> > Scalable IOV technology is known to use vfio-mdev as SW framework for Scalable
> > IOV enabled devices (and Intel net/crypto/raw devices support this tech). For
> > those mdev-based devices, DPDK needs support on the bus layer to scan/plug/probe/..
> > them, which is the main effort this patchset does. There are also other devices
> > using the vfio-mdev framework, AFAIK, Nvidia's GPU is the first one using mdev
> > and Intel's GPU virtualization also uses it.
> 
> Yes mdev was designed for virtualization I think.
> The use of mdev for Scalable IOV without virtualization
> may be seen as an abuse by Linux maintainers,
> as they currently seem to prefer the auxiliary bus (which is a real bus).
> 
> Mellanox got a push back when trying to use mdev for the same purpose
> (Scalable Function, also called Sub-Function) in the kernel.
> The Linux community decided to use the auxiliary bus.
> 
> Any other feedback on the choice mdev vs aux?
> Is there any kernel code supporting this mdev model for Intel devices?

IMHO until a kernel networking driver is accepted that uses mdev this
is all just dead code in dpdk and shouldn't be merged.

I think it is unlikely that future networking drivers will use mdev.

> > > > 2: A new mdev bus that scans mediated pci devices and probes mdev driver to
> > > >    plug-in pci devices to pci bus

And we are likely not doing 'mediated pci devices' at all..

Jason

end of thread, other threads:[~2021-06-16  6:08 UTC | newest]

Thread overview: 41+ messages
2019-04-03  7:18 [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Tiwei Bie
2019-04-03  7:18 ` Tiwei Bie
2019-04-03  7:18 ` [dpdk-dev] [RFC 1/3] eal: add a helper for reading string from sysfs Tiwei Bie
2019-04-03  7:18   ` Tiwei Bie
2019-04-03  7:18 ` [dpdk-dev] [RFC 2/3] bus/mdev: add mdev bus support Tiwei Bie
2019-04-03  7:18   ` Tiwei Bie
2019-04-03  7:18 ` [dpdk-dev] [RFC 3/3] bus/pci: add mdev support Tiwei Bie
2019-04-03  7:18   ` Tiwei Bie
2019-04-03 14:13   ` Wiles, Keith
2019-04-03 14:13     ` Wiles, Keith
2019-04-04  4:19     ` Tiwei Bie
2019-04-04  4:19       ` Tiwei Bie
2019-04-08  8:44 ` [dpdk-dev] [RFC 0/3] Add mdev (Mediated device) support in DPDK Alejandro Lucero
2019-04-08  8:44   ` Alejandro Lucero
2019-04-08  9:36   ` Tiwei Bie
2019-04-08  9:36     ` Tiwei Bie
2019-04-10 10:02     ` Francois Ozog
2019-04-10 10:02       ` Francois Ozog
2019-07-15  7:52 ` [dpdk-dev] [RFC v2 0/5] " Tiwei Bie
2019-07-15  7:52   ` [dpdk-dev] [RFC v2 1/5] bus/pci: introduce an internal representation of PCI device Tiwei Bie
2019-07-15  7:52   ` [dpdk-dev] [RFC v2 2/5] bus/pci: avoid depending on private value in kernel source Tiwei Bie
2019-07-15  7:52   ` [dpdk-dev] [RFC v2 3/5] bus/pci: introduce helper for MMIO read and write Tiwei Bie
2019-07-15  7:52   ` [dpdk-dev] [RFC v2 4/5] eal: add a helper for reading string from sysfs Tiwei Bie
2019-07-15  7:52   ` [dpdk-dev] [RFC v2 5/5] bus/pci: add mdev support Tiwei Bie
2021-06-01  3:06     ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Chenbo Xia
2021-06-01  3:06       ` [dpdk-dev] [RFC v3 1/6] bus/pci: introduce an internal representation of PCI device Chenbo Xia
2021-06-01  3:06       ` [dpdk-dev] [RFC v3 2/6] bus/pci: avoid depending on private value in kernel source Chenbo Xia
2021-06-01  3:06       ` [dpdk-dev] [RFC v3 3/6] bus/pci: introduce helper for MMIO read and write Chenbo Xia
2021-06-01  3:06       ` [dpdk-dev] [RFC v3 4/6] eal: add a helper for reading string from sysfs Chenbo Xia
2021-06-01  5:37         ` Stephen Hemminger
2021-06-08  5:47           ` Xia, Chenbo
2021-06-01  5:39         ` Stephen Hemminger
2021-06-08  5:48           ` Xia, Chenbo
2021-06-11  7:19         ` Thomas Monjalon
2021-06-01  3:06       ` [dpdk-dev] [RFC v3 5/6] bus/pci: add mdev support Chenbo Xia
2021-06-01  3:06       ` [dpdk-dev] [RFC v3 6/6] bus/pci: add sparse mmap support for mediated PCI devices Chenbo Xia
2021-06-11  7:15       ` [dpdk-dev] [RFC v3 0/6] Add mdev (Mediated device) support in DPDK Thomas Monjalon
2021-06-15  2:49         ` Xia, Chenbo
2021-06-15  7:48           ` Thomas Monjalon
2021-06-15 10:44             ` Xia, Chenbo
2021-06-15 11:57             ` Jason Gunthorpe
