DPDK patches and discussions
* [dpdk-dev] [PATCH 0/3] add vDPA sample driver
@ 2018-02-04 14:55 Xiao Wang
  2018-02-04 14:55 ` [dpdk-dev] [PATCH 1/3] bus/pci: expose API for vDPA Xiao Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Xiao Wang @ 2018-02-04 14:55 UTC (permalink / raw)
  To: dev
  Cc: jianfeng.tan, tiwei.bie, maxime.coquelin, yliu, cunming.liang,
	dan.daly, zhihong.wang, Xiao Wang

This patch set has dependency on the vhost lib patch:
http://dpdk.org/dev/patchwork/patch/34872/

This patch set shows a reference sample of making a vDPA device driver.
The driver uses a QEMU-emulated virtio-net PCI device as the vDPA device,
and makes it serve as a backend for a virtio-net PCI device in a nested VM.

The key driver ops implemented are:

* vdpa_virtio_eng_init
Map the virtio PCI device into userspace with VFIO, read the device
capabilities, and initialize internal data.

* vdpa_virtio_eng_uninit
Release the mapped device.

* vdpa_virtio_info_query
Device capability reporting, e.g. queue number, features.

* vdpa_virtio_dev_config
With the guest virtio information provided by the vhost lib, this
function configures the device and IOMMU to set up the vhost datapath,
which includes: Rx/Tx vrings, VFIO interrupts, and kick relay.

* vdpa_virtio_dev_close
Undo the configuration done previously by dev_conf.

This driver requires the virtio device to support VIRTIO_F_IOMMU_PLATFORM,
because the buffer addresses written in the descriptors are IOVAs.

Because the vDPA driver needs to set up an MSI-X vector to interrupt the
guest, only vfio-pci is supported currently.

Below are setup steps for your reference:

1. Make sure your kernel vhost module and QEMU support vIOMMU.
   - OS: CentOS 7.4
   - QEMU: 2.10.1
   - Guest OS: CentOS 7.2
   - Nested VM OS: CentOS 7.2

2. Enable the VT-x feature for the vCPU in the VM.
   modprobe kvm_intel nested=1

3. Start a VM with a virtio-net-pci device.
   ./qemu-2.10.1/x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu host \
   <snip>
   -machine q35 \
   -device intel-iommu \
   -netdev tap,id=mytap,ifname=vdpa,vhostforce=on \
   -device virtio-net-pci,netdev=mytap,mac=00:aa:bb:cc:dd:ee,\
   disable-modern=off,disable-legacy=on,iommu_platform=on \

4. Bind vfio-pci to the virtio-net-pci device
   a) login to VM;
   b) modprobe vfio-pci
   c) rmmod vfio_iommu_type1
   d) modprobe vfio_iommu_type1 allow_unsafe_interrupts=1
   e) ./usertools/dpdk-devbind.py -b vfio-pci 00:03.0

5. Start vdpa sample
   ./examples/vdpa/build/vdpa -c 0x2 -n 4 --socket-mem 1024 --no-pci \
    --vdev "net_vdpa_virtio_pci0,bdf=0000:00:03.0" -- --bdf 0000:00:03.0 \
    --iface /tmp/vhost-user- --devcnt 1  --queue 1

6. Start nested VM
   ./qemu-2.10.1/x86_64-softmmu/qemu-system-x86_64 -cpu host -enable-kvm \
   <snip>
   -mem-prealloc \
   -chardev socket,id=char0,path=/tmp/vhost-user-0 \
   -netdev type=vhost-user,id=vdpa,chardev=char0,vhostforce \
   -device virtio-net-pci,netdev=vdpa,mac=00:aa:bb:cc:dd:ee \

7. Log in to the nested VM, and verify that the virtio device in the
   nested VM can communicate with the tap device on the host.

Xiao Wang (3):
  bus/pci: expose API for vDPA
  net/vdpa_virtio_pci: introduce vdpa sample driver
  examples/vdpa: add a new sample for vdpa

 config/common_base                                 |    6 +
 config/common_linuxapp                             |    1 +
 drivers/bus/pci/Makefile                           |    1 +
 drivers/bus/pci/linux/pci.c                        |   10 +-
 drivers/bus/pci/linux/pci_init.h                   |   22 +-
 drivers/bus/pci/linux/pci_vfio.c                   |    5 +-
 drivers/bus/pci/rte_bus_pci_version.map            |   13 +
 drivers/net/Makefile                               |    1 +
 drivers/net/vdpa_virtio_pci/Makefile               |   31 +
 .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527 ++++++++++++++++++++
 .../rte_vdpa_virtio_pci_version.map                |    4 +
 examples/vdpa/Makefile                             |   32 +
 examples/vdpa/main.c                               |  387 +++++
 mk/rte.app.mk                                      |    1 +
 14 files changed, 2030 insertions(+), 11 deletions(-)
 create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
 create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
 create mode 100644 drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
 create mode 100644 examples/vdpa/Makefile
 create mode 100644 examples/vdpa/main.c

-- 
2.15.1


* [dpdk-dev] [PATCH 1/3] bus/pci: expose API for vDPA
  2018-02-04 14:55 [dpdk-dev] [PATCH 0/3] add vDPA sample driver Xiao Wang
@ 2018-02-04 14:55 ` Xiao Wang
  2018-02-04 14:55 ` [dpdk-dev] [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver Xiao Wang
  2018-02-04 14:55 ` [dpdk-dev] [PATCH 3/3] examples/vdpa: add a new sample for vdpa Xiao Wang
  2 siblings, 0 replies; 8+ messages in thread
From: Xiao Wang @ 2018-02-04 14:55 UTC (permalink / raw)
  To: dev
  Cc: jianfeng.tan, tiwei.bie, maxime.coquelin, yliu, cunming.liang,
	dan.daly, zhihong.wang, Xiao Wang

Some existing PCI APIs are helpful for vDPA device setup; expose them
for the later driver patch.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 drivers/bus/pci/Makefile                |  1 +
 drivers/bus/pci/linux/pci.c             | 10 +++-------
 drivers/bus/pci/linux/pci_init.h        | 22 +++++++++++++++++++++-
 drivers/bus/pci/linux/pci_vfio.c        |  5 ++---
 drivers/bus/pci/rte_bus_pci_version.map | 13 +++++++++++++
 5 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/drivers/bus/pci/Makefile b/drivers/bus/pci/Makefile
index f3df1c4ce..e45bee024 100644
--- a/drivers/bus/pci/Makefile
+++ b/drivers/bus/pci/Makefile
@@ -45,6 +45,7 @@ ifneq ($(CONFIG_RTE_EXEC_ENV_BSDAPP),)
 SYSTEM := bsd
 endif
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += -I$(RTE_SDK)/drivers/bus/pci/$(SYSTEM)
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/$(SYSTEM)app/eal
diff --git a/drivers/bus/pci/linux/pci.c b/drivers/bus/pci/linux/pci.c
index abde64119..06b811d5e 100644
--- a/drivers/bus/pci/linux/pci.c
+++ b/drivers/bus/pci/linux/pci.c
@@ -32,7 +32,7 @@
 
 extern struct rte_pci_bus rte_pci_bus;
 
-static int
+int
 pci_get_kernel_driver_by_path(const char *filename, char *dri_name)
 {
 	int count;
@@ -168,8 +168,7 @@ pci_parse_one_sysfs_resource(char *line, size_t len, uint64_t *phys_addr,
 	return 0;
 }
 
-/* parse the "resource" sysfs file */
-static int
+int
 pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev)
 {
 	FILE *f;
@@ -372,10 +371,7 @@ pci_update_device(const struct rte_pci_addr *addr)
 	return pci_scan_one(filename, addr);
 }
 
-/*
- * split up a pci address into its constituent parts.
- */
-static int
+int
 parse_pci_addr_format(const char *buf, int bufsize, struct rte_pci_addr *addr)
 {
 	/* first split on ':' */
diff --git a/drivers/bus/pci/linux/pci_init.h b/drivers/bus/pci/linux/pci_init.h
index c2e603a37..9e06cb57d 100644
--- a/drivers/bus/pci/linux/pci_init.h
+++ b/drivers/bus/pci/linux/pci_init.h
@@ -6,6 +6,7 @@
 #define EAL_PCI_INIT_H_
 
 #include <rte_vfio.h>
+#include <stdbool.h>
 
 /** IO resource type: */
 #define IORESOURCE_IO         0x00000100
@@ -15,7 +16,7 @@
  * Helper function to map PCI resources right after hugepages in virtual memory
  */
 extern void *pci_map_addr;
-void *pci_find_max_end_va(void);
+void *__rte_experimental pci_find_max_end_va(void);
 
 /* parse one line of the "resource" sysfs file (note that the 'line'
  * string is modified)
@@ -83,6 +84,25 @@ int pci_vfio_unmap_resource(struct rte_pci_device *dev);
 
 int pci_vfio_is_enabled(void);
 
+/* parse sysfs file path */
+int __rte_experimental
+pci_get_kernel_driver_by_path(const char *filename, char *dri_name);
+
+/* parse the "resource" sysfs file */
+int __rte_experimental
+pci_parse_sysfs_resource(const char *filename, struct rte_pci_device *dev);
+
+/* split up a pci address into its constituent parts */
+int __rte_experimental
+parse_pci_addr_format(const char *buf, int bufsize, struct rte_pci_addr *addr);
+
+/* get PCI BAR info for MSI-X interrupts */
+int __rte_experimental
+pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table);
+
+/* enable DMA and reset device */
+int __rte_experimental
+pci_rte_vfio_setup_device(struct rte_pci_device *dev, int vfio_dev_fd);
 #endif
 
 #endif /* EAL_PCI_INIT_H_ */
diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index aeeaa9ed8..6d0486a7d 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -59,8 +59,7 @@ pci_vfio_write_config(const struct rte_intr_handle *intr_handle,
 	       VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) + offs);
 }
 
-/* get PCI BAR number where MSI-X interrupts are */
-static int
+int
 pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table)
 {
 	int ret;
@@ -295,7 +294,7 @@ pci_vfio_is_ioport_bar(int vfio_dev_fd, int bar_index)
 	return (ioport_bar & PCI_BASE_ADDRESS_SPACE_IO) != 0;
 }
 
-static int
+int
 pci_rte_vfio_setup_device(struct rte_pci_device *dev, int vfio_dev_fd)
 {
 	if (pci_vfio_setup_interrupts(dev, vfio_dev_fd) != 0) {
diff --git a/drivers/bus/pci/rte_bus_pci_version.map b/drivers/bus/pci/rte_bus_pci_version.map
index 27e9c4f10..fd806ad33 100644
--- a/drivers/bus/pci/rte_bus_pci_version.map
+++ b/drivers/bus/pci/rte_bus_pci_version.map
@@ -16,3 +16,16 @@ DPDK_17.11 {
 
 	local: *;
 };
+
+EXPERIMENTAL {
+	global:
+
+	pci_map_addr;
+	pci_find_max_end_va;
+	pci_get_kernel_driver_by_path;
+	pci_parse_sysfs_resource;
+	pci_vfio_get_msix_bar;
+	pci_rte_vfio_setup_device;
+	parse_pci_addr_format;
+
+} DPDK_17.11;
-- 
2.15.1


* [dpdk-dev] [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-04 14:55 [dpdk-dev] [PATCH 0/3] add vDPA sample driver Xiao Wang
  2018-02-04 14:55 ` [dpdk-dev] [PATCH 1/3] bus/pci: expose API for vDPA Xiao Wang
@ 2018-02-04 14:55 ` Xiao Wang
  2018-02-06 14:24   ` Maxime Coquelin
  2018-02-04 14:55 ` [dpdk-dev] [PATCH 3/3] examples/vdpa: add a new sample for vdpa Xiao Wang
  2 siblings, 1 reply; 8+ messages in thread
From: Xiao Wang @ 2018-02-04 14:55 UTC (permalink / raw)
  To: dev
  Cc: jianfeng.tan, tiwei.bie, maxime.coquelin, yliu, cunming.liang,
	dan.daly, zhihong.wang, Xiao Wang

This driver is a reference sample of making a vDPA device driver based
on the vhost lib. The driver uses a standard virtio-net PCI device as
the vDPA device, which can serve as a backend for a virtio-net PCI
device in a nested VM.

The key driver ops implemented are:

* vdpa_virtio_eng_init
Map the virtio PCI device into userspace with VFIO, read the device
capabilities, and initialize internal data.

* vdpa_virtio_eng_uninit
Release the mapped device.

* vdpa_virtio_info_query
Device capability reporting, e.g. queue number, features.

* vdpa_virtio_dev_config
With the guest virtio information provided by the vhost lib, this
function configures the device and IOMMU to set up the vhost datapath,
which includes: Rx/Tx vrings, VFIO interrupts, and kick relay.

* vdpa_virtio_dev_close
Undo the configuration done previously by dev_conf.

This driver requires the virtio device to support VIRTIO_F_IOMMU_PLATFORM,
because the buffer addresses written in the descriptors are IOVAs.

Because the vDPA driver needs to set up an MSI-X vector to interrupt the
guest, only vfio-pci is supported currently.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 config/common_base                                 |    6 +
 config/common_linuxapp                             |    1 +
 drivers/net/Makefile                               |    1 +
 drivers/net/vdpa_virtio_pci/Makefile               |   31 +
 .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527 ++++++++++++++++++++
 .../rte_vdpa_virtio_pci_version.map                |    4 +
 mk/rte.app.mk                                      |    1 +
 7 files changed, 1571 insertions(+)
 create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
 create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
 create mode 100644 drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map

diff --git a/config/common_base b/config/common_base
index ad03cf433..aaa775129 100644
--- a/config/common_base
+++ b/config/common_base
@@ -791,6 +791,12 @@ CONFIG_RTE_LIBRTE_VHOST_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_PMD_VHOST=n
 
+#
+# Compile VDPA VIRTIO PCI driver
+# To compile, CONFIG_RTE_LIBRTE_VHOST should be enabled.
+#
+CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI=n
+
 #
 # Compile the test application
 #
diff --git a/config/common_linuxapp b/config/common_linuxapp
index ff98f2355..83446090c 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -15,6 +15,7 @@ CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
+CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index e1127326b..0a45ef603 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -53,6 +53,7 @@ endif # $(CONFIG_RTE_LIBRTE_SCHED)
 
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_VHOST) += vhost
+DIRS-$(CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI) += vdpa_virtio_pci
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 
 ifeq ($(CONFIG_RTE_LIBRTE_MRVL_PMD),y)
diff --git a/drivers/net/vdpa_virtio_pci/Makefile b/drivers/net/vdpa_virtio_pci/Makefile
new file mode 100644
index 000000000..147d7a7a3
--- /dev/null
+++ b/drivers/net/vdpa_virtio_pci/Makefile
@@ -0,0 +1,31 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_vdpa_virtio_pci.a
+
+LDLIBS += -lpthread
+LDLIBS += -lrte_eal -lrte_mempool -lrte_pci
+LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_vhost
+LDLIBS += -lrte_bus_vdev -lrte_bus_pci
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -I$(RTE_SDK)/lib/librte_eal/linuxapp/eal
+CFLAGS += -I$(RTE_SDK)/drivers/bus/pci/linux
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
+EXPORT_MAP := rte_vdpa_virtio_pci_version.map
+
+LIBABIVER := 1
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI) += rte_eth_vdpa_virtio_pci.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c b/drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
new file mode 100644
index 000000000..5e63b15e6
--- /dev/null
+++ b/drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
@@ -0,0 +1,1527 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <fcntl.h>
+#include <linux/pci_regs.h>
+#include <linux/virtio_net.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_pci.h>
+#include <sys/ioctl.h>
+#include <sys/epoll.h>
+#include <sys/mman.h>
+
+#include <rte_vfio.h>
+#include <rte_mbuf.h>
+#include <rte_ethdev.h>
+#include <rte_ethdev_vdev.h>
+#include <rte_malloc.h>
+#include <rte_memcpy.h>
+#include <rte_bus_pci.h>
+#include <rte_bus_vdev.h>
+#include <rte_kvargs.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+#include <rte_io.h>
+#include <rte_cycles.h>
+#include <rte_spinlock.h>
+#include <eal_vfio.h>
+#include <pci_init.h>
+
+#define MAX_QUEUES		1
+#define VIRTIO_F_IOMMU_PLATFORM	33
+#define MSIX_IRQ_SET_BUF_LEN (sizeof(struct vfio_irq_set) + \
+		sizeof(int) * (MAX_QUEUES * 2 + 1))
+
+#define ETH_VDPA_VIRTIO_PCI_BDF_ARG	"bdf"
+
+static const char *const valid_arguments[] = {
+	ETH_VDPA_VIRTIO_PCI_BDF_ARG,
+	NULL
+};
+
+static struct ether_addr base_eth_addr = {
+	.addr_bytes = {
+		0x56 /* V */,
+		0x44 /* D */,
+		0x50 /* P */,
+		0x41 /* A */,
+		0x00,
+		0x00
+	}
+};
+
+struct virtio_pci_info {
+	struct rte_pci_device pdev;
+	uint64_t    req_features;
+	uint32_t    notify_off_multiplier;
+	struct virtio_pci_common_cfg *common_cfg;
+	uint8_t     *isr;
+	uint16_t    *notify_base;
+	struct virtio_net_device_config *dev_cfg;
+	uint16_t    *notify_addr[MAX_QUEUES * 2];
+	int vfio_container_fd;
+	int vfio_group_fd;
+	int vfio_dev_fd;
+	pthread_t tid;	/* thread for notify relay */
+	int epfd;
+};
+
+struct vdpa_virtio_pci_internal {
+	char *dev_name;
+	uint16_t max_queues;
+	uint16_t max_devices;
+	uint64_t features;
+	struct rte_vdpa_eng_addr eng_addr;
+	int eid;
+	int vid;
+	struct virtio_pci_info vpci;
+	rte_atomic32_t started;
+	rte_atomic32_t dev_attached;
+	rte_atomic32_t running;
+	rte_spinlock_t lock;
+};
+
+struct internal_list {
+	TAILQ_ENTRY(internal_list) next;
+	struct rte_eth_dev *eth_dev;
+};
+
+TAILQ_HEAD(internal_list_head, internal_list);
+static struct internal_list_head internal_list =
+	TAILQ_HEAD_INITIALIZER(internal_list);
+
+static pthread_mutex_t internal_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static struct rte_eth_link vdpa_link = {
+		.link_speed = 10000,
+		.link_duplex = ETH_LINK_FULL_DUPLEX,
+		.link_status = ETH_LINK_DOWN
+};
+
+static struct internal_list *
+find_internal_resource_by_eid(int eid)
+{
+	int found = 0;
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		internal = list->eth_dev->data->dev_private;
+		if (eid == internal->eid) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static struct internal_list *
+find_internal_resource_by_eng_addr(struct rte_vdpa_eng_addr *addr)
+{
+	int found = 0;
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+
+	pthread_mutex_lock(&internal_list_lock);
+
+	TAILQ_FOREACH(list, &internal_list, next) {
+		internal = list->eth_dev->data->dev_private;
+		if (addr == &internal->eng_addr) {
+			found = 1;
+			break;
+		}
+	}
+
+	pthread_mutex_unlock(&internal_list_lock);
+
+	if (!found)
+		return NULL;
+
+	return list;
+}
+
+static int
+check_pci_dev(struct rte_pci_device *dev)
+{
+	char filename[PATH_MAX];
+	char dev_dir[PATH_MAX];
+	char driver[PATH_MAX];
+	int ret;
+
+	snprintf(dev_dir, sizeof(dev_dir), "%s/" PCI_PRI_FMT,
+			rte_pci_get_sysfs_path(),
+			dev->addr.domain, dev->addr.bus,
+			dev->addr.devid, dev->addr.function);
+	if (access(dev_dir, R_OK) != 0) {
+		RTE_LOG(ERR, PMD, "%s not exist\n", dev_dir);
+		return -1;
+	}
+
+	/* parse resources */
+	snprintf(filename, sizeof(filename), "%s/resource", dev_dir);
+	if (pci_parse_sysfs_resource(filename, dev) < 0) {
+		RTE_LOG(ERR, PMD, "cannot parse resource: %s\n", filename);
+		return -1;
+	}
+
+	/* parse driver */
+	snprintf(filename, sizeof(filename), "%s/driver", dev_dir);
+	ret = pci_get_kernel_driver_by_path(filename, driver);
+	if (ret != 0) {
+		RTE_LOG(ERR, PMD, "Fail to get kernel driver: %s\n", filename);
+		return -1;
+	}
+
+	if (strcmp(driver, "vfio-pci") != 0) {
+		RTE_LOG(ERR, PMD, "kernel driver %s is not vfio-pci\n", driver);
+		return -1;
+	}
+	return 0;
+}
+
+static int
+vdpa_vfio_get_group_fd(int iommu_group_no)
+{
+	char filename[PATH_MAX];
+	int vfio_group_fd;
+
+	snprintf(filename, sizeof(filename), VFIO_GROUP_FMT, iommu_group_no);
+	vfio_group_fd = open(filename, O_RDWR);
+	if (vfio_group_fd < 0) {
+		if (errno != ENOENT) {
+			RTE_LOG(ERR, PMD, "cannot open %s: %s\n", filename,
+				strerror(errno));
+			return -1;
+		}
+		return 0;
+	}
+
+	return vfio_group_fd;
+}
+
+static int
+vfio_setup_device(const char *sysfs_base, const char *dev_addr,
+		  int *vfio_dev_fd, struct vfio_device_info *device_info,
+		  struct virtio_pci_info *vpci)
+{
+	struct vfio_group_status group_status = {
+		.argsz = sizeof(group_status)
+	};
+	int vfio_container_fd = -1;
+	int vfio_group_fd = -1;
+	int iommu_group_no;
+	int ret;
+
+	vfio_container_fd = vfio_get_container_fd();
+
+	/* check if we have VFIO driver enabled */
+	if (vfio_container_fd < 0) {
+		RTE_LOG(ERR, PMD, "failed to open VFIO container\n");
+		return -1;
+	}
+
+	ret = ioctl(vfio_container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "VFIO_TYPE1_IOMMU not supported\n");
+		goto err;
+	}
+
+	/* get group number */
+	ret = vfio_get_group_no(sysfs_base, dev_addr, &iommu_group_no);
+	if (ret <= 0) {
+		RTE_LOG(ERR, PMD, "%s: unable to find IOMMU group\n", dev_addr);
+		goto err;
+	}
+
+	/* get the actual group fd */
+	vfio_group_fd = vdpa_vfio_get_group_fd(iommu_group_no);
+	RTE_LOG(INFO, PMD, "\n%s group no %d group fd %d\n",
+			dev_addr, iommu_group_no, vfio_group_fd);
+	if (vfio_group_fd <= 0)
+		goto err;
+
+	/* check if the group is viable */
+	ret = ioctl(vfio_group_fd, VFIO_GROUP_GET_STATUS, &group_status);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "%s cannot get group status, error: %s\n",
+				dev_addr, strerror(errno));
+		goto err;
+	} else if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+		RTE_LOG(ERR, PMD, "%s VFIO group is not viable\n", dev_addr);
+		goto err;
+	}
+
+	/* check if group does not have a container yet */
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
+		/* add group to a container */
+		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
+				&vfio_container_fd);
+		if (ret) {
+			RTE_LOG(ERR, PMD, "cannot add VFIO group to container, "
+					"error: %s\n", strerror(errno));
+			goto err;
+		}
+		RTE_LOG(INFO, PMD, "vfio_group_fd %d set container_fd %d\n",
+				vfio_group_fd, vfio_container_fd);
+	} else {
+		RTE_LOG(ERR, PMD, "%s has a container already\n", dev_addr);
+		goto err;
+	}
+
+	ret = ioctl(vfio_container_fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "%s set IOMMU type failed, error: %s\n",
+				dev_addr, strerror(errno));
+		goto err;
+	}
+
+	/* get a file descriptor for the device */
+	*vfio_dev_fd = ioctl(vfio_group_fd, VFIO_GROUP_GET_DEVICE_FD, dev_addr);
+	if (*vfio_dev_fd < 0) {
+		RTE_LOG(ERR, PMD, "%s cannot get vfio_dev_fd, error: %s\n",
+				dev_addr, strerror(errno));
+		goto err;
+	}
+
+	ret = ioctl(*vfio_dev_fd, VFIO_DEVICE_GET_INFO, device_info);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "%s cannot get device info, error: %s\n",
+				dev_addr, strerror(errno));
+		close(*vfio_dev_fd);
+		goto err;
+	}
+
+	vpci->vfio_container_fd = vfio_container_fd;
+	vpci->vfio_group_fd = vfio_group_fd;
+	return 0;
+
+err:
+	if (vfio_container_fd >= 0)
+		close(vfio_container_fd);
+	if (vfio_group_fd > 0)
+		close(vfio_group_fd);
+	return -1;
+}
+
+static int
+virtio_pci_vfio_map_resource(struct virtio_pci_info *vpci)
+{
+	struct rte_pci_device *pdev = &vpci->pdev;
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+	char pci_addr[PATH_MAX] = {0};
+	struct rte_pci_addr *loc = &pdev->addr;
+	int i, ret, nb_maps;
+	int vfio_dev_fd;
+	uint32_t ioport_bar;
+	struct pci_msix_table msix_table;
+
+	/* store PCI address string */
+	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
+			loc->domain, loc->bus, loc->devid, loc->function);
+
+	ret = vfio_setup_device(rte_pci_get_sysfs_path(), pci_addr,
+			&vfio_dev_fd, &device_info, vpci);
+	if (ret)
+		return ret;
+
+	ret = pci_vfio_get_msix_bar(vfio_dev_fd, &msix_table);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "%s cannot get MSI-X BAR number\n", pci_addr);
+		goto fail;
+	}
+
+	/* get number of regions (up to BAR5) */
+	nb_maps = RTE_MIN((int)device_info.num_regions,
+				VFIO_PCI_BAR5_REGION_INDEX + 1);
+
+	/* map BARs */
+	for (i = 0; i < nb_maps; i++) {
+		struct vfio_region_info reg = { .argsz = sizeof(reg) };
+		void *bar_addr;
+
+		reg.index = i;
+		ret = ioctl(vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);
+		if (ret) {
+			RTE_LOG(ERR, PMD, "%s cannot get region info, "
+					"error: %s\n",
+					pci_addr, strerror(errno));
+			goto fail;
+		}
+
+		ret = pread(vfio_dev_fd, &ioport_bar, sizeof(ioport_bar),
+			    VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) +
+			    PCI_BASE_ADDRESS_0 + i * 4);
+		if (ret != sizeof(ioport_bar)) {
+			RTE_LOG(ERR, PMD, "cannot read command (%x) from "
+				"config space\n", PCI_BASE_ADDRESS_0 + i * 4);
+			goto fail;
+		}
+
+		/* check for io port region */
+		if (ioport_bar & PCI_BASE_ADDRESS_SPACE_IO)
+			continue;
+
+		/* skip non-mmapable BARs */
+		if ((reg.flags & VFIO_REGION_INFO_FLAG_MMAP) == 0)
+			continue;
+
+		if (i == msix_table.bar_index)
+			continue;
+
+		/* try mapping somewhere close to the end of hugepages */
+		if (pci_map_addr == NULL)
+			pci_map_addr = pci_find_max_end_va();
+
+		bar_addr = pci_map_addr;
+		pci_map_addr = RTE_PTR_ADD(bar_addr, (size_t)reg.size);
+
+		/* reserve the address using an inaccessible mapping */
+		bar_addr = mmap(bar_addr, reg.size, 0, MAP_PRIVATE |
+				MAP_ANONYMOUS, -1, 0);
+		if (bar_addr != MAP_FAILED) {
+			void *map_addr = NULL;
+			if (reg.size)
+				map_addr = pci_map_resource(bar_addr,
+						vfio_dev_fd,
+						reg.offset, reg.size,
+						MAP_FIXED);
+
+			if (map_addr == MAP_FAILED || !map_addr) {
+				munmap(bar_addr, reg.size);
+				bar_addr = MAP_FAILED;
+			}
+		}
+
+		if (bar_addr == MAP_FAILED) {
+			RTE_LOG(ERR, PMD, "%s mapping BAR%d failed: %s\n",
+					pci_addr, i, strerror(errno));
+			goto fail;
+		}
+		pdev->mem_resource[i].addr = bar_addr;
+	}
+
+	if (pci_rte_vfio_setup_device(pdev, vfio_dev_fd) < 0) {
+		RTE_LOG(ERR, PMD, "%s failed to set up device\n", pci_addr);
+		goto fail;
+	}
+
+	vpci->vfio_dev_fd = vfio_dev_fd;
+	return 0;
+
+fail:
+	close(vfio_dev_fd);
+	return -1;
+}
+
+static void *
+get_cap_addr(struct rte_pci_device *dev, struct virtio_pci_cap *cap)
+{
+	uint8_t bar = cap->bar;
+	uint32_t length = cap->length;
+	uint32_t offset = cap->offset;
+	uint8_t *base;
+
+	if (bar > 5) {
+		RTE_LOG(ERR, PMD, "invalid bar: %u\n", bar);
+		return NULL;
+	}
+
+	if (offset + length < offset) {
+		RTE_LOG(ERR, PMD, "offset(%u) + length(%u) overflows\n",
+			offset, length);
+		return NULL;
+	}
+
+	if (offset + length > dev->mem_resource[bar].len) {
+		RTE_LOG(ERR, PMD, "invalid cap: overflows bar size: %u > %lu\n",
+			offset + length, dev->mem_resource[bar].len);
+		return NULL;
+	}
+
+	base = dev->mem_resource[bar].addr;
+	if (base == NULL) {
+		RTE_LOG(ERR, PMD, "bar %u base addr is NULL", bar);
+		return NULL;
+	}
+
+	return base + offset;
+}
+
+static int
+virtio_pci_map(struct virtio_pci_info *vpci)
+{
+	uint8_t pos;
+	struct virtio_pci_cap cap;
+	struct rte_pci_device *dev = &vpci->pdev;
+	int ret;
+
+	if (virtio_pci_vfio_map_resource(vpci)) {
+		RTE_LOG(ERR, PMD, "failed to map pci device\n");
+		return -1;
+	}
+
+	ret = rte_pci_read_config(dev, &pos, sizeof(pos), PCI_CAPABILITY_LIST);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to read pci capability list\n");
+		return -1;
+	}
+
+	while (pos) {
+		ret = rte_pci_read_config(dev, &cap, sizeof(cap), pos);
+		if (ret < 0) {
+			RTE_LOG(ERR, PMD, "failed to read cap at pos: %x", pos);
+			break;
+		}
+
+		if (cap.cap_vndr != PCI_CAP_ID_VNDR)
+			goto next;
+
+		RTE_LOG(INFO, PMD, "cfg type: %u, bar: %u, offset: %u, "
+				"len: %u\n", cap.cfg_type, cap.bar,
+				cap.offset, cap.length);
+
+		switch (cap.cfg_type) {
+		case VIRTIO_PCI_CAP_COMMON_CFG:
+			vpci->common_cfg = get_cap_addr(dev, &cap);
+			break;
+		case VIRTIO_PCI_CAP_NOTIFY_CFG:
+			rte_pci_read_config(dev, &vpci->notify_off_multiplier,
+						4, pos + sizeof(cap));
+			vpci->notify_base = get_cap_addr(dev, &cap);
+			break;
+		case VIRTIO_PCI_CAP_ISR_CFG:
+			vpci->isr = get_cap_addr(dev, &cap);
+			break;
+		case VIRTIO_PCI_CAP_DEVICE_CFG:
+			vpci->dev_cfg = get_cap_addr(dev, &cap);
+			break;
+		}
+next:
+		pos = cap.cap_next;
+	}
+
+	if (vpci->common_cfg == NULL || vpci->notify_base == NULL ||
+			vpci->isr == NULL || vpci->dev_cfg == NULL) {
+		RTE_LOG(ERR, PMD, "capability incomplete\n");
+		return -1;
+	}
+
+	RTE_LOG(INFO, PMD, "capability mapping:\ncommon cfg: %p\n"
+			"notify base: %p\nisr cfg: %p\ndevice cfg: %p\n"
+			"multiplier: %u\n",
+			vpci->common_cfg, vpci->notify_base,
+			vpci->isr, vpci->dev_cfg,
+			vpci->notify_off_multiplier);
+
+	return 0;
+}
+
+static int
+virtio_pci_dma_map(struct vdpa_virtio_pci_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to get VM memory layout\n");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vpci.vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct vfio_iommu_type1_dma_map dma_map;
+		struct rte_vhost_mem_region *reg;
+		reg = &mem->regions[i];
+
+		RTE_LOG(INFO, PMD, "region %u: HVA 0x%lx, GPA 0x%lx, "
+			"size 0x%lx\n", i, reg->host_user_addr,
+			reg->guest_phys_addr, reg->size);
+
+		memset(&dma_map, 0, sizeof(dma_map));
+		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+		dma_map.vaddr = reg->host_user_addr;
+		dma_map.size = reg->size;
+		dma_map.iova = reg->guest_phys_addr;
+		dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
+				VFIO_DMA_MAP_FLAG_WRITE;
+
+		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+		if (ret) {
+			RTE_LOG(ERR, PMD, "cannot set up DMA remapping, "
+				"error: %s\n", strerror(errno));
+			goto exit;
+		}
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static int
+virtio_pci_dma_unmap(struct vdpa_virtio_pci_internal *internal)
+{
+	uint32_t i;
+	int ret = 0;
+	struct rte_vhost_memory *mem = NULL;
+	int vfio_container_fd;
+
+	ret = rte_vhost_get_mem_table(internal->vid, &mem);
+	if (ret < 0) {
+		RTE_LOG(ERR, PMD, "failed to get VM memory layout\n");
+		goto exit;
+	}
+
+	vfio_container_fd = internal->vpci.vfio_container_fd;
+
+	for (i = 0; i < mem->nregions; i++) {
+		struct vfio_iommu_type1_dma_unmap dma_unmap;
+		struct rte_vhost_mem_region *reg;
+		reg = &mem->regions[i];
+
+		memset(&dma_unmap, 0, sizeof(dma_unmap));
+		dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+		dma_unmap.size = reg->size;
+		dma_unmap.iova = reg->guest_phys_addr;
+		dma_unmap.flags = 0;
+
+		ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA,
+				&dma_unmap);
+		if (ret) {
+			RTE_LOG(ERR, PMD, "cannot unset DMA remapping, "
+				"error: %s\n", strerror(errno));
+			goto exit;
+		}
+	}
+
+exit:
+	if (mem)
+		free(mem);
+	return ret;
+}
+
+static uint8_t
+virtio_get_status(struct virtio_pci_info *vpci)
+{
+	return rte_read8(&vpci->common_cfg->device_status);
+}
+
+static void
+virtio_set_status(struct virtio_pci_info *vpci, uint8_t status)
+{
+	rte_write8(status, &vpci->common_cfg->device_status);
+}
+
+static void
+vdpa_virtio_reset(struct virtio_pci_info *vpci)
+{
+	/* 0 means reset */
+	virtio_set_status(vpci, 0);
+
+	/* flush status write */
+	while (virtio_get_status(vpci))
+		rte_delay_ms(1);
+}
+
+static void
+vdpa_virtio_set_status(struct virtio_pci_info *vpci, uint8_t status)
+{
+	if (status != 0)
+		status |= virtio_get_status(vpci);
+
+	virtio_set_status(vpci, status);
+	virtio_get_status(vpci);
+}
+
+static uint64_t
+virtio_get_features(struct virtio_pci_info *vpci)
+{
+	uint32_t features_lo, features_hi;
+	struct virtio_pci_common_cfg *cfg = vpci->common_cfg;
+
+	rte_write32(0, &cfg->device_feature_select);
+	features_lo = rte_read32(&cfg->device_feature);
+
+	rte_write32(1, &cfg->device_feature_select);
+	features_hi = rte_read32(&cfg->device_feature);
+
+	return ((uint64_t)features_hi << 32) | features_lo;
+}
+
+static void
+vdpa_set_features(struct virtio_pci_info *vpci, uint64_t features)
+{
+	struct virtio_pci_common_cfg *cfg = vpci->common_cfg;
+
+	/* enable device DMA with iova */
+	features |= (1ULL << VIRTIO_F_IOMMU_PLATFORM);
+
+	rte_write32(0, &cfg->guest_feature_select);
+	rte_write32(features & ((1ULL << 32) - 1), &cfg->guest_feature);
+
+	rte_write32(1, &cfg->guest_feature_select);
+	rte_write32(features >> 32, &cfg->guest_feature);
+}
+
+static int
+vdpa_virtio_config_features(struct virtio_pci_info *vpci, uint64_t req_features)
+{
+	uint64_t host_features;
+
+	host_features = virtio_get_features(vpci);
+	vpci->req_features = req_features & host_features;
+
+	vdpa_set_features(vpci, vpci->req_features);
+	vdpa_virtio_set_status(vpci, VIRTIO_CONFIG_S_FEATURES_OK);
+
+	if (!(virtio_get_status(vpci) & VIRTIO_CONFIG_S_FEATURES_OK)) {
+		RTE_LOG(ERR, PMD, "failed to set FEATURES_OK status\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static uint64_t
+qva_to_gpa(int vid, uint64_t qva)
+{
+	struct rte_vhost_memory *mem = NULL;
+	struct rte_vhost_mem_region *reg;
+	uint32_t i;
+	uint64_t gpa = 0;
+
+	if (rte_vhost_get_mem_table(vid, &mem) < 0)
+		goto exit;
+
+	for (i = 0; i < mem->nregions; i++) {
+		reg = &mem->regions[i];
+
+		if (qva >= reg->host_user_addr &&
+				qva < reg->host_user_addr + reg->size) {
+			gpa = qva - reg->host_user_addr + reg->guest_phys_addr;
+			break;
+		}
+	}
+
+exit:
+	if (gpa == 0)
+		rte_panic("failed to get gpa\n");
+	if (mem)
+		free(mem);
+	return gpa;
+}
+
+static void
+io_write64_twopart(uint64_t val, uint32_t *lo, uint32_t *hi)
+{
+	rte_write32(val & ((1ULL << 32) - 1), lo);
+	rte_write32(val >> 32, hi);
+}
+
+static int
+vdpa_virtio_dev_enable(struct vdpa_virtio_pci_internal *internal)
+{
+	struct virtio_pci_info *vpci;
+	struct virtio_pci_common_cfg *cfg;
+	uint64_t desc_addr, avail_addr, used_addr;
+	uint32_t i, nr_vring;
+	uint16_t notify_off;
+	struct rte_vhost_vring vq;
+
+	vpci = &internal->vpci;
+	cfg = vpci->common_cfg;
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	rte_write16(0, &cfg->msix_config);
+	if (rte_read16(&cfg->msix_config) == VIRTIO_MSI_NO_VECTOR) {
+		RTE_LOG(ERR, PMD, "msix vec alloc failed for device config\n");
+		return -1;
+	}
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vq);
+		desc_addr = qva_to_gpa(internal->vid, (uint64_t)vq.desc);
+		avail_addr = qva_to_gpa(internal->vid, (uint64_t)vq.avail);
+		used_addr = qva_to_gpa(internal->vid, (uint64_t)vq.used);
+
+		rte_write16(i, &cfg->queue_select);
+		io_write64_twopart(desc_addr, &cfg->queue_desc_lo,
+				&cfg->queue_desc_hi);
+		io_write64_twopart(avail_addr, &cfg->queue_avail_lo,
+				&cfg->queue_avail_hi);
+		io_write64_twopart(used_addr, &cfg->queue_used_lo,
+				&cfg->queue_used_hi);
+		rte_write16(vq.size, &cfg->queue_size);
+
+		rte_write16(i + 1, &cfg->queue_msix_vector);
+		if (rte_read16(&cfg->queue_msix_vector) ==
+				VIRTIO_MSI_NO_VECTOR) {
+			RTE_LOG(ERR, PMD, "queue %u, msix vec alloc failed\n",
+					i);
+			return -1;
+		}
+
+		notify_off = rte_read16(&cfg->queue_notify_off);
+		vpci->notify_addr[i] = (void *)((uint8_t *)vpci->notify_base +
+				notify_off * vpci->notify_off_multiplier);
+		rte_write16(1, &cfg->queue_enable);
+
+		RTE_LOG(INFO, PMD, "queue %u addresses:\n"
+				"desc_addr: 0x%" PRIx64 "\tavail_addr: 0x%"
+				PRIx64 "\tused_addr: 0x%" PRIx64 "\n"
+				"queue size: %u\t\tnotify addr: %p\tnotify offset: %u\n",
+				i, desc_addr, avail_addr, used_addr,
+				vq.size, vpci->notify_addr[i], notify_off);
+	}
+
+	return 0;
+}
+
+static void
+vdpa_virtio_dev_disable(struct vdpa_virtio_pci_internal *internal)
+{
+	uint32_t i, nr_vring;
+	struct virtio_pci_info *vpci;
+	struct virtio_pci_common_cfg *cfg;
+
+	vpci = &internal->vpci;
+	cfg = vpci->common_cfg;
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	rte_write16(VIRTIO_MSI_NO_VECTOR, &cfg->msix_config);
+	for (i = 0; i < nr_vring; i++) {
+		rte_write16(i, &cfg->queue_select);
+		rte_write16(0, &cfg->queue_enable);
+		rte_write16(VIRTIO_MSI_NO_VECTOR, &cfg->queue_msix_vector);
+	}
+}
+
+static int
+vdpa_virtio_pci_start(struct vdpa_virtio_pci_internal *internal)
+{
+	struct virtio_pci_info *vpci;
+	uint64_t features;
+
+	vpci = &internal->vpci;
+
+	rte_vhost_get_negotiated_features(internal->vid, &features);
+
+	/* Reset the device, though not strictly necessary at startup. */
+	vdpa_virtio_reset(vpci);
+
+	/* Tell the host we've noticed this device. */
+	vdpa_virtio_set_status(vpci, VIRTIO_CONFIG_S_ACKNOWLEDGE);
+
+	/* Tell the host we know how to drive the device. */
+	vdpa_virtio_set_status(vpci, VIRTIO_CONFIG_S_DRIVER);
+
+	if (vdpa_virtio_config_features(vpci, features) < 0)
+		return -1;
+
+	if (vdpa_virtio_dev_enable(internal) < 0)
+		return -1;
+
+	vdpa_virtio_set_status(vpci, VIRTIO_CONFIG_S_DRIVER_OK);
+	return 0;
+}
+
+static void
+vdpa_virtio_pci_stop(struct vdpa_virtio_pci_internal *internal)
+{
+	struct virtio_pci_info *vpci;
+
+	vpci = &internal->vpci;
+	vdpa_virtio_dev_disable(internal);
+	vdpa_virtio_reset(vpci);
+}
+
+static int
+vdpa_enable_vfio_intr(struct vdpa_virtio_pci_internal *internal)
+{
+	int ret;
+	uint32_t i, nr_vring;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	int *fd_ptr;
+	struct virtio_pci_info *vpci;
+	struct rte_vhost_vring vring;
+
+	vpci = &internal->vpci;
+	nr_vring = rte_vhost_get_vring_num(internal->vid);
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = nr_vring + 1;
+	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
+			 VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+	fd_ptr = (int *)&irq_set->data;
+	fd_ptr[RTE_INTR_VEC_ZERO_OFFSET] = vpci->pdev.intr_handle.fd;
+
+	for (i = 0; i < nr_vring; i++) {
+		rte_vhost_get_vhost_vring(internal->vid, i, &vring);
+		fd_ptr[RTE_INTR_VEC_RXTX_OFFSET + i] = vring.callfd;
+	}
+
+	ret = ioctl(vpci->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Error enabling MSI-X interrupts: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vdpa_disable_vfio_intr(struct vdpa_virtio_pci_internal *internal)
+{
+	int ret;
+	char irq_set_buf[MSIX_IRQ_SET_BUF_LEN];
+	struct vfio_irq_set *irq_set;
+	struct virtio_pci_info *vpci;
+
+	vpci = &internal->vpci;
+
+	irq_set = (struct vfio_irq_set *)irq_set_buf;
+	irq_set->argsz = sizeof(irq_set_buf);
+	irq_set->count = 0;
+	irq_set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
+	irq_set->start = 0;
+
+	ret = ioctl(vpci->vfio_dev_fd, VFIO_DEVICE_SET_IRQS, irq_set);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "Error disabling MSI-X interrupts: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static void *
+notify_relay(void *arg)
+{
+	int i, kickfd, epfd, nfds = 0;
+	struct virtio_pci_info *vpci;
+	uint32_t qid, q_num;
+	struct epoll_event events[MAX_QUEUES * 2];
+	struct epoll_event ev;
+	uint64_t buf;
+	int nbytes;
+	struct rte_vhost_vring vring;
+	struct vdpa_virtio_pci_internal *internal = arg;
+
+	vpci = &internal->vpci;
+	q_num = rte_vhost_get_vring_num(internal->vid);
+
+	epfd = epoll_create(MAX_QUEUES * 2);
+	if (epfd < 0) {
+		RTE_LOG(ERR, PMD, "failed to create epoll instance\n");
+		return NULL;
+	}
+	vpci->epfd = epfd;
+
+	for (qid = 0; qid < q_num; qid++) {
+		ev.events = EPOLLIN | EPOLLPRI;
+		rte_vhost_get_vhost_vring(internal->vid, qid, &vring);
+		ev.data.u64 = qid | (uint64_t)vring.kickfd << 32;
+		if (epoll_ctl(epfd, EPOLL_CTL_ADD, vring.kickfd, &ev) < 0) {
+			RTE_LOG(ERR, PMD, "epoll add error, %s\n",
+					strerror(errno));
+			return NULL;
+		}
+	}
+
+	for (;;) {
+		nfds = epoll_wait(epfd, events, q_num, -1);
+		if (nfds < 0) {
+			if (errno == EINTR)
+				continue;
+			RTE_LOG(ERR, PMD, "epoll_wait return fail\n");
+			return NULL;
+		}
+
+		for (i = 0; i < nfds; i++) {
+			qid = events[i].data.u32;
+			kickfd = (uint32_t)(events[i].data.u64 >> 32);
+			do {
+				nbytes = read(kickfd, &buf, 8);
+				if (nbytes < 0) {
+					if (errno == EINTR ||
+					    errno == EWOULDBLOCK ||
+					    errno == EAGAIN)
+						continue;
+					RTE_LOG(INFO, PMD, "Error reading "
+						"kickfd: %s\n",
+						strerror(errno));
+				}
+				break;
+			} while (1);
+
+			rte_write16(qid, vpci->notify_addr[qid]);
+		}
+	}
+
+	return NULL;
+}
+
+static int
+setup_notify_relay(struct vdpa_virtio_pci_internal *internal)
+{
+	int ret;
+
+	ret = pthread_create(&internal->vpci.tid, NULL, notify_relay,
+			     (void *)internal);
+	if (ret) {
+		RTE_LOG(ERR, PMD, "failed to create notify relay pthread\n");
+		return -1;
+	}
+	return 0;
+}
+
+static int
+unset_notify_relay(struct vdpa_virtio_pci_internal *internal)
+{
+	struct virtio_pci_info *vpci;
+	void *status;
+
+	vpci = &internal->vpci;
+	if (vpci->tid) {
+		pthread_cancel(vpci->tid);
+		pthread_join(vpci->tid, &status);
+	}
+	vpci->tid = 0;
+
+	if (vpci->epfd >= 0)
+		close(vpci->epfd);
+	vpci->epfd = -1;
+
+	return 0;
+}
+
+static int
+update_datapath(struct rte_eth_dev *eth_dev)
+{
+	struct vdpa_virtio_pci_internal *internal;
+	int ret;
+
+	internal = eth_dev->data->dev_private;
+	rte_spinlock_lock(&internal->lock);
+
+	if (!rte_atomic32_read(&internal->running) &&
+	    (rte_atomic32_read(&internal->started) &&
+	     rte_atomic32_read(&internal->dev_attached))) {
+		ret = virtio_pci_dma_map(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_enable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = setup_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_virtio_pci_start(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 1);
+	} else if (rte_atomic32_read(&internal->running) &&
+		   (!rte_atomic32_read(&internal->started) ||
+		    !rte_atomic32_read(&internal->dev_attached))) {
+		vdpa_virtio_pci_stop(internal);
+
+		ret = unset_notify_relay(internal);
+		if (ret)
+			goto err;
+
+		ret = vdpa_disable_vfio_intr(internal);
+		if (ret)
+			goto err;
+
+		ret = virtio_pci_dma_unmap(internal);
+		if (ret)
+			goto err;
+
+		rte_atomic32_set(&internal->running, 0);
+	}
+
+	rte_spinlock_unlock(&internal->lock);
+	return 0;
+err:
+	rte_spinlock_unlock(&internal->lock);
+	return ret;
+}
+
+static int
+vdpa_virtio_dev_config(int vid)
+{
+	int eid;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct vdpa_virtio_pci_internal *internal;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+
+	eth_dev->data->dev_link.link_status = ETH_LINK_UP;
+
+	rte_atomic32_set(&internal->dev_attached, 1);
+	update_datapath(eth_dev);
+
+	return 0;
+}
+
+static int
+vdpa_virtio_dev_close(int vid)
+{
+	int eid;
+	struct internal_list *list;
+	struct rte_eth_dev *eth_dev;
+	struct vdpa_virtio_pci_internal *internal;
+
+	eid = rte_vhost_get_vdpa_eid(vid);
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	eth_dev = list->eth_dev;
+	internal = eth_dev->data->dev_private;
+
+	eth_dev->data->dev_link.link_status = ETH_LINK_DOWN;
+
+	rte_atomic32_set(&internal->dev_attached, 0);
+	update_datapath(eth_dev);
+
+	return 0;
+}
+
+static void
+vfio_close_fds(struct virtio_pci_info *vpci)
+{
+	if (vpci->vfio_dev_fd >= 0)
+		close(vpci->vfio_dev_fd);
+	if (vpci->vfio_group_fd >= 0)
+		close(vpci->vfio_group_fd);
+	if (vpci->vfio_container_fd >= 0)
+		close(vpci->vfio_container_fd);
+
+	vpci->vfio_dev_fd = -1;
+	vpci->vfio_group_fd = -1;
+	vpci->vfio_container_fd = -1;
+}
+
+static int
+vdpa_virtio_eng_init(int eid, struct rte_vdpa_eng_addr *addr)
+{
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+	struct virtio_pci_info *vpci;
+	uint64_t features;
+
+	list = find_internal_resource_by_eng_addr(addr);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine addr\n");
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	vpci = &internal->vpci;
+
+	vpci->vfio_dev_fd = -1;
+	vpci->vfio_group_fd = -1;
+	vpci->vfio_container_fd = -1;
+
+	if (check_pci_dev(&vpci->pdev) < 0)
+		return -1;
+
+	if (virtio_pci_map(vpci) < 0)
+		goto err;
+
+	internal->eid = eid;
+	internal->max_devices = 1;
+	internal->max_queues = MAX_QUEUES;
+	features = virtio_get_features(&internal->vpci);
+	if ((features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)) == 0) {
+		RTE_LOG(ERR, PMD, "VIRTIO_F_IOMMU_PLATFORM feature is required "
+				"to support DMA with IOVA\n");
+		goto err;
+	}
+
+	/* We need the nested VM's driver to use GPA */
+	internal->features = (features & ~(1ULL << VIRTIO_F_IOMMU_PLATFORM)) |
+			  (1ULL << RTE_VHOST_USER_F_PROTOCOL_FEATURES);
+	return 0;
+
+err:
+	vfio_close_fds(vpci);
+	return -1;
+}
+
+static int
+vdpa_virtio_eng_uninit(int eid)
+{
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id %d\n", eid);
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	vfio_close_fds(&internal->vpci);
+	return 0;
+}
+
+#define VDPA_SUPPORTED_PROTOCOL_FEATURES \
+		(1ULL << RTE_VHOST_USER_PROTOCOL_F_REPLY_ACK)
+static int
+vdpa_virtio_info_query(int eid, struct rte_vdpa_eng_attr *attr)
+{
+	struct internal_list *list;
+	struct vdpa_virtio_pci_internal *internal;
+
+	list = find_internal_resource_by_eid(eid);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine id: %d\n", eid);
+		return -1;
+	}
+
+	internal = list->eth_dev->data->dev_private;
+	attr->dev_num = internal->max_devices;
+	attr->queue_num = internal->max_queues;
+	attr->features = internal->features;
+	attr->protocol_features = VDPA_SUPPORTED_PROTOCOL_FEATURES;
+
+	return 0;
+}
+
+struct rte_vdpa_eng_driver vdpa_virtio_pci_driver = {
+	.name = "vdpa_virtio_pci",
+	.eng_ops = {
+		.eng_init = vdpa_virtio_eng_init,
+		.eng_uninit = vdpa_virtio_eng_uninit,
+		.info_query = vdpa_virtio_info_query,
+	},
+	.dev_ops = {
+		.dev_conf = vdpa_virtio_dev_config,
+		.dev_close = vdpa_virtio_dev_close,
+		.vring_state_set = NULL,
+		.feature_set = NULL,
+		.migration_done = NULL,
+	},
+};
+
+RTE_VDPA_REGISTER_DRIVER(vdpa_virtio_pci, vdpa_virtio_pci_driver);
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	struct vdpa_virtio_pci_internal *internal;
+
+	internal = dev->data->dev_private;
+	rte_atomic32_set(&internal->started, 1);
+	update_datapath(dev);
+
+	return 0;
+}
+
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	struct vdpa_virtio_pci_internal *internal;
+
+	internal = dev->data->dev_private;
+	rte_atomic32_set(&internal->started, 0);
+	update_datapath(dev);
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev)
+{
+	struct vdpa_virtio_pci_internal *internal;
+	struct internal_list *list;
+
+	internal = dev->data->dev_private;
+	eth_dev_stop(dev);
+
+	list = find_internal_resource_by_eng_addr(&internal->eng_addr);
+	if (list == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid engine addr\n");
+		return;
+	}
+
+	rte_vdpa_unregister_engine(internal->eid);
+
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_REMOVE(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+	rte_free(list);
+
+	rte_free(dev->data->mac_addrs);
+	free(internal->dev_name);
+	rte_free(internal);
+
+	dev->data->dev_private = NULL;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct vdpa_virtio_pci_internal *internal;
+
+	internal = dev->data->dev_private;
+	if (internal == NULL) {
+		RTE_LOG(ERR, PMD, "Invalid device specified\n");
+		return;
+	}
+
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)-1;
+	dev_info->max_rx_queues = internal->max_queues;
+	dev_info->max_tx_queues = internal->max_queues;
+	dev_info->min_rx_bufsize = 0;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev __rte_unused,
+		   uint16_t rx_queue_id __rte_unused,
+		   uint16_t nb_rx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_rxconf *rx_conf __rte_unused,
+		   struct rte_mempool *mb_pool __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev __rte_unused,
+		   uint16_t tx_queue_id __rte_unused,
+		   uint16_t nb_tx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static uint16_t
+eth_vdpa_virtio_pci_rx(void *q __rte_unused,
+		       struct rte_mbuf **bufs __rte_unused,
+		       uint16_t nb_bufs __rte_unused)
+{
+	return 0;
+}
+
+static uint16_t
+eth_vdpa_virtio_pci_tx(void *q __rte_unused,
+		       struct rte_mbuf **bufs __rte_unused,
+		       uint16_t nb_bufs __rte_unused)
+{
+	return 0;
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+		int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static const struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+};
+
+static int
+eth_dev_vdpa_virtio_pci_create(struct rte_vdev_device *dev,
+		struct rte_pci_addr *pci_addr)
+{
+	const char *name = rte_vdev_device_name(dev);
+	struct rte_eth_dev *eth_dev = NULL;
+	struct ether_addr *eth_addr = NULL;
+	struct vdpa_virtio_pci_internal *internal = NULL;
+	struct internal_list *list = NULL;
+	struct rte_eth_dev_data *data = NULL;
+
+	list = rte_zmalloc_socket(name, sizeof(*list), 0,
+			dev->device.numa_node);
+	if (list == NULL)
+		goto error;
+
+	/* reserve an ethdev entry */
+	eth_dev = rte_eth_vdev_allocate(dev, sizeof(*internal));
+	if (eth_dev == NULL)
+		goto error;
+
+	eth_addr = rte_zmalloc_socket(name, sizeof(*eth_addr), 0,
+			dev->device.numa_node);
+	if (eth_addr == NULL)
+		goto error;
+
+	*eth_addr = base_eth_addr;
+	eth_addr->addr_bytes[5] = eth_dev->data->port_id;
+
+	internal = eth_dev->data->dev_private;
+	internal->dev_name = strdup(name);
+	if (internal->dev_name == NULL)
+		goto error;
+
+	internal->eng_addr.pci_addr = *pci_addr;
+	internal->vpci.pdev.addr = *pci_addr;
+	rte_spinlock_init(&internal->lock);
+
+	list->eth_dev = eth_dev;
+	pthread_mutex_lock(&internal_list_lock);
+	TAILQ_INSERT_TAIL(&internal_list, list, next);
+	pthread_mutex_unlock(&internal_list_lock);
+
+	data = eth_dev->data;
+	data->nb_rx_queues = MAX_QUEUES;
+	data->nb_tx_queues = MAX_QUEUES;
+	data->dev_link = vdpa_link;
+	data->mac_addrs = eth_addr;
+	data->dev_flags = RTE_ETH_DEV_INTR_LSC;
+	eth_dev->dev_ops = &ops;
+
+	/* assign rx and tx ops, could be used as vDPA fallback */
+	eth_dev->rx_pkt_burst = eth_vdpa_virtio_pci_rx;
+	eth_dev->tx_pkt_burst = eth_vdpa_virtio_pci_tx;
+
+	if (rte_vdpa_register_engine(vdpa_virtio_pci_driver.name,
+				&internal->eng_addr) < 0)
+		goto error;
+
+	return 0;
+
+error:
+	rte_free(list);
+	rte_free(eth_addr);
+	if (internal && internal->dev_name)
+		free(internal->dev_name);
+	rte_free(internal);
+	if (eth_dev)
+		rte_eth_dev_release_port(eth_dev);
+
+	return -1;
+}
+
+static int
+get_pci_addr(const char *key __rte_unused, const char *value, void *extra_args)
+{
+	if (value == NULL || extra_args == NULL)
+		return -1;
+
+	return parse_pci_addr_format(value, strlen(value), extra_args);
+}
+
+static int
+rte_vdpa_virtio_pci_probe(struct rte_vdev_device *dev)
+{
+	struct rte_kvargs *kvlist = NULL;
+	int ret = 0;
+	struct rte_pci_addr pci_addr;
+
+	RTE_LOG(INFO, PMD, "Initializing vdpa_virtio_pci for %s\n",
+		rte_vdev_device_name(dev));
+
+	kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments);
+	if (kvlist == NULL)
+		return -1;
+
+	if (rte_kvargs_count(kvlist, ETH_VDPA_VIRTIO_PCI_BDF_ARG) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_VDPA_VIRTIO_PCI_BDF_ARG,
+				&get_pci_addr, &pci_addr);
+		if (ret < 0)
+			goto out_free;
+	} else {
+		ret = -1;
+		goto out_free;
+	}
+
+	eth_dev_vdpa_virtio_pci_create(dev, &pci_addr);
+
+out_free:
+	rte_kvargs_free(kvlist);
+	return ret;
+}
+
+static int
+rte_vdpa_virtio_pci_remove(struct rte_vdev_device *dev)
+{
+	const char *name;
+	struct rte_eth_dev *eth_dev = NULL;
+
+	name = rte_vdev_device_name(dev);
+	RTE_LOG(INFO, PMD, "Un-Initializing vdpa_virtio_pci for %s\n", name);
+
+	/* find an ethdev entry */
+	eth_dev = rte_eth_dev_allocated(name);
+	if (eth_dev == NULL)
+		return -ENODEV;
+
+	eth_dev_close(eth_dev);
+	rte_free(eth_dev->data);
+	rte_eth_dev_release_port(eth_dev);
+
+	return 0;
+}
+
+static struct rte_vdev_driver vdpa_virtio_pci_drv = {
+	.probe = rte_vdpa_virtio_pci_probe,
+	.remove = rte_vdpa_virtio_pci_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_vdpa_virtio_pci, vdpa_virtio_pci_drv);
+RTE_PMD_REGISTER_ALIAS(net_vdpa_virtio_pci, eth_vdpa_virtio_pci);
+RTE_PMD_REGISTER_PARAM_STRING(net_vdpa_virtio_pci,
+	"bdf=<bdf>");
diff --git a/drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map b/drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
new file mode 100644
index 000000000..33d237913
--- /dev/null
+++ b/drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
@@ -0,0 +1,4 @@
+EXPERIMENTAL {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3eb41d176..44e87f4d9 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -171,6 +171,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD) += -lrte_pmd_vdev_netvsc
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD)     += -lrte_pmd_virtio
 ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
+_LDLIBS-$(CONFIG_RTE_LIBRTE_VDPA_VIRTIO_PCI)      += -lrte_vdpa_virtio_pci
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
 
-- 
2.15.1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [dpdk-dev] [PATCH 3/3] examples/vdpa: add a new sample for vdpa
  2018-02-04 14:55 [dpdk-dev] [PATCH 0/3] add vDPA sample driver Xiao Wang
  2018-02-04 14:55 ` [dpdk-dev] [PATCH 1/3] bus/pci: expose API for vDPA Xiao Wang
  2018-02-04 14:55 ` [dpdk-dev] [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver Xiao Wang
@ 2018-02-04 14:55 ` Xiao Wang
  2 siblings, 0 replies; 8+ messages in thread
From: Xiao Wang @ 2018-02-04 14:55 UTC (permalink / raw)
  To: dev
  Cc: jianfeng.tan, tiwei.bie, maxime.coquelin, yliu, cunming.liang,
	dan.daly, zhihong.wang, Xiao Wang

This patch adds a sample which creates vhost-user sockets based on a
vDPA driver. The vDPA driver helps set up the vhost datapath, so this
app does not need to spend a dedicated worker thread on vhost
enqueue/dequeue operations.

Below are setup steps for your reference:

1. Make sure your kernel vhost module and QEMU support vIOMMU.
   - OS: CentOS 7.4
   - QEMU: 2.10.1
   - Guest OS: CentOS 7.2
   - Nested VM OS: CentOS 7.2

2. Enable the VT-x feature for vCPUs in the VM.
   modprobe kvm_intel nested=1

3. Start a VM with a virtio-net-pci device.
   ./qemu-2.10.1/x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu host \
   <snip>
   -machine q35 \
   -device intel-iommu \
   -netdev tap,id=mytap,ifname=vdpa,vhostforce=on \
   -device virtio-net-pci,netdev=mytap,mac=00:aa:bb:cc:dd:ee,\
   disable-modern=off,disable-legacy=on,iommu_platform=on \

4. Bind vfio-pci to the virtio-net-pci device
   a) login to VM;
   b) modprobe vfio-pci
   c) rmmod vfio_iommu_type1
   d) modprobe vfio_iommu_type1 allow_unsafe_interrupts=1
   e) ./usertools/dpdk-devbind.py -b vfio-pci 00:03.0

5. Start vdpa sample
   ./examples/vdpa/build/vdpa -c 0x2 -n 4 --socket-mem 1024 --no-pci \
    --vdev "net_vdpa_virtio_pci0,bdf=0000:00:03.0" -- --bdf 0000:00:03.0 \
    --iface /tmp/vhost-user- --devcnt 1  --queue 1

6. Start nested VM
   ./qemu-2.10.1/x86_64-softmmu/qemu-system-x86_64 -cpu host -enable-kvm \
   <snip>
   -mem-prealloc \
   -chardev socket,id=char0,path=/tmp/vhost-user-0 \
   -netdev type=vhost-user,id=vdpa,chardev=char0,vhostforce \
   -device virtio-net-pci,netdev=vdpa,mac=00:aa:bb:cc:dd:ee \

7. Log in to the nested VM, and verify that the virtio device in the
   nested VM can communicate with the tap device on the host.

Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 examples/vdpa/Makefile |  32 ++++
 examples/vdpa/main.c   | 387 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 419 insertions(+)
 create mode 100644 examples/vdpa/Makefile
 create mode 100644 examples/vdpa/main.c

diff --git a/examples/vdpa/Makefile b/examples/vdpa/Makefile
new file mode 100644
index 000000000..42672a2bc
--- /dev/null
+++ b/examples/vdpa/Makefile
@@ -0,0 +1,32 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Intel Corporation
+
+ifeq ($(RTE_SDK),)
+$(error "Please define RTE_SDK environment variable")
+endif
+
+# Default target, can be overridden by command line or environment
+RTE_TARGET ?= x86_64-native-linuxapp-gcc
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+ifneq ($(CONFIG_RTE_EXEC_ENV),"linuxapp")
+$(info This application can only operate in a linuxapp environment, \
+please change the definition of the RTE_TARGET environment variable)
+all:
+else
+
+# binary name
+APP = vdpa
+
+# all source are stored in SRCS-y
+SRCS-y := main.c
+
+CFLAGS += -O2 -D_FILE_OFFSET_BITS=64
+CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -D_GNU_SOURCE
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
+include $(RTE_SDK)/mk/rte.extapp.mk
+
+endif
diff --git a/examples/vdpa/main.c b/examples/vdpa/main.c
new file mode 100644
index 000000000..1c9143469
--- /dev/null
+++ b/examples/vdpa/main.c
@@ -0,0 +1,387 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Intel Corporation
+ */
+
+#include <getopt.h>
+#include <signal.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <rte_ethdev.h>
+#include <rte_malloc.h>
+#include <rte_vhost.h>
+#include <rte_vdpa.h>
+
+#define NUM_MBUFS 8191
+#define MBUF_CACHE_SIZE 250
+
+#define RX_RING_SIZE 128
+#define TX_RING_SIZE 128
+
+#define MAX_PATH_LEN 128
+#define MAX_VDPA_SAMPLE_PORTS 1024
+
+struct vdpa_port {
+	char ifname[MAX_PATH_LEN];
+	int eid;
+	int did;
+	int vid;
+};
+
+struct vdpa_port vports[MAX_VDPA_SAMPLE_PORTS];
+
+struct rte_vdpa_eng_attr attr;
+struct rte_vdpa_eng_addr dev_id;
+char iface[MAX_PATH_LEN];
+int queue;
+int devcnt;
+
+/* display usage */
+static void
+vdpa_usage(const char *prgname)
+{
+	printf("%s [EAL options]"
+		" -- --bdf B:D:F --iface <path> --devcnt ND  --queue NQ\n"
+		" --bdf B:D:F, the PCI device used for vdpa\n"
+		" --iface <path>: The path of the socket file\n"
+		" --devcnt ND: number of vhost sockets to be created, default 1\n"
+		" --queue NQ: number of queue pairs to be configured, default 1\n",
+		prgname);
+}
+
+static int
+get_unsigned(const char *str, int base)
+{
+	unsigned long num;
+	char *end = NULL;
+
+	errno = 0;
+	num = strtoul(str, &end, base);
+	if (str[0] == '\0' || end == NULL || *end != '\0' || errno != 0)
+		return -1;
+
+	return num;
+}
+
+static int
+parse_args(int argc, char **argv)
+{
+	static const char *short_option = "";
+	static struct option long_option[] = {
+		{"bdf", required_argument, NULL, 0},
+		{"queue", required_argument, NULL, 0},
+		{"devcnt", required_argument, NULL, 0},
+		{"iface", required_argument, NULL, 0},
+		{NULL, 0, 0, 0},
+	};
+	char str[MAX_PATH_LEN];
+	int opt, idx;
+	int num[4] = {0};
+	int i, j;
+	char *prgname = argv[0];
+
+	while ((opt = getopt_long(argc, argv, short_option, long_option, &idx))
+			!= EOF) {
+		switch (opt) {
+		case 0:
+			if (strncmp(long_option[idx].name, "bdf",
+						MAX_PATH_LEN) == 0) {
+				strcpy(str, optarg);
+				i = strlen(str) - 1;
+				j = 3;
+				while (i > 0 && j >= 0) {
+					while (i > 0 && str[i - 1] != ':'
+							&& str[i - 1] != '.')
+						i--;
+					num[j--] = get_unsigned(&str[i], 16);
+					i--;
+					if (i >= 0)
+						str[i] = '\0';
+				}
+				dev_id.pci_addr.domain = num[0];
+				dev_id.pci_addr.bus = num[1];
+				dev_id.pci_addr.devid = num[2];
+				dev_id.pci_addr.function = num[3];
+				printf("bdf %04x:%02x:%02x.%02x\n",
+						dev_id.pci_addr.domain,
+						dev_id.pci_addr.bus,
+						dev_id.pci_addr.devid,
+						dev_id.pci_addr.function);
+			} else if (strncmp(long_option[idx].name, "queue",
+						MAX_PATH_LEN) == 0) {
+				queue = get_unsigned(optarg, 10);
+				printf("queue %d\n", queue);
+			} else if (strncmp(long_option[idx].name, "devcnt",
+						MAX_PATH_LEN) == 0) {
+				devcnt = get_unsigned(optarg, 10);
+				printf("devcnt %d\n", devcnt);
+			} else if (strncmp(long_option[idx].name, "iface",
+						MAX_PATH_LEN) == 0) {
+				strncpy(iface, optarg, MAX_PATH_LEN - 1);
+				printf("iface %s\n", iface);
+			}
+
+			break;
+
+		default:
+			vdpa_usage(prgname);
+			return -1;
+		}
+	}
+
+	if (queue <= 0 || devcnt <= 0 || *iface == '\0') {
+		vdpa_usage(prgname);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+data_init(void)
+{
+	devcnt = 1;
+	queue = 1;
+	memset(&dev_id, 0, sizeof(dev_id));
+	memset(iface, 0, MAX_PATH_LEN * sizeof(iface[0]));
+	memset(vports, 0, MAX_VDPA_SAMPLE_PORTS * sizeof(vports[0]));
+
+	return;
+}
+
+static void
+signal_handler(int signum)
+{
+	uint16_t portid, nb_ports;
+
+	if (signum == SIGINT || signum == SIGTERM) {
+		printf("\nSignal %d received, preparing to exit...\n",
+				signum);
+		nb_ports = rte_eth_dev_count();
+		for (portid = 0; portid < nb_ports; portid++) {
+			printf("Closing port %d...\n", portid);
+			rte_eth_dev_stop(portid);
+			rte_eth_dev_close(portid);
+		}
+		exit(0);
+	}
+}
+
+static int
+new_device(int vid)
+{
+	char ifname[MAX_PATH_LEN];
+	int i;
+
+	rte_vhost_get_ifname(vid, ifname, sizeof(ifname));
+	for (i = 0; i < MAX_VDPA_SAMPLE_PORTS; i++) {
+		if (strcmp(ifname, vports[i].ifname) == 0) {
+			printf("\nport %s connected, eid: %d, did %d\n",
+					ifname, vports[i].eid, vports[i].did);
+			vports[i].vid = vid;
+			break;
+		}
+	}
+
+	if (i >= MAX_VDPA_SAMPLE_PORTS)
+		return -1;
+
+	return 0;
+}
+
+static void
+destroy_device(int vid)
+{
+	char ifname[MAX_PATH_LEN];
+	int i;
+
+	rte_vhost_get_ifname(vid, ifname, sizeof(ifname));
+	for (i = 0; i < MAX_VDPA_SAMPLE_PORTS; i++) {
+		if (strcmp(ifname, vports[i].ifname) == 0) {
+			printf("\nport %s disconnected, eid: %d, did %d\n",
+					ifname, vports[i].eid, vports[i].did);
+			vports[i].vid = vid;
+			break;
+		}
+	}
+
+	return;
+}
+
+static const struct vhost_device_ops vdpa_sample_devops = {
+	.new_device = new_device,
+	.destroy_device = destroy_device,
+	.vring_state_changed = NULL,
+	.features_changed = NULL,
+	.new_connection = NULL,
+	.destroy_connection = NULL,
+};
+
+static const struct rte_eth_conf port_conf_default = {
+	.rxmode = {
+		.max_rx_pkt_len = ETHER_MAX_LEN,
+		.ignore_offload_bitfield = 1,
+	},
+};
+
+static inline int
+port_init(uint16_t port, struct rte_mempool *mbuf_pool)
+{
+	uint16_t rx_rings = 1, tx_rings = 1;
+	uint16_t nb_rxd = RX_RING_SIZE;
+	uint16_t nb_txd = TX_RING_SIZE;
+	int retval;
+	uint16_t q;
+	struct rte_eth_dev_info dev_info;
+	struct rte_eth_txconf txconf;
+	struct ether_addr addr;
+
+	if (port >= rte_eth_dev_count())
+		return -1;
+
+	rte_eth_dev_info_get(port, &dev_info);
+
+	/* Configure the Ethernet device. */
+	retval = rte_eth_dev_configure(port, rx_rings, tx_rings,
+			&port_conf_default);
+	if (retval < 0)
+		return retval;
+
+	/* Allocate and set up 1 Rx queue per Ethernet port. */
+	for (q = 0; q < rx_rings; q++) {
+		retval = rte_eth_rx_queue_setup(port, q, nb_rxd,
+				rte_eth_dev_socket_id(port), NULL, mbuf_pool);
+		if (retval < 0)
+			return retval;
+	}
+
+	txconf = dev_info.default_txconf;
+	/* Allocate and set up 1 Tx queue per Ethernet port. */
+	for (q = 0; q < tx_rings; q++) {
+		retval = rte_eth_tx_queue_setup(port, q, nb_txd,
+				rte_eth_dev_socket_id(port), &txconf);
+		if (retval < 0)
+			return retval;
+	}
+
+	/* Start the Ethernet port. */
+	retval = rte_eth_dev_start(port);
+	if (retval < 0)
+		return retval;
+
+	/* Display the port MAC address. */
+	rte_eth_macaddr_get(port, &addr);
+	printf("Port %u MAC: %02" PRIx8 " %02" PRIx8 " %02" PRIx8
+			   " %02" PRIx8 " %02" PRIx8 " %02" PRIx8 "\n",
+			port,
+			addr.addr_bytes[0], addr.addr_bytes[1],
+			addr.addr_bytes[2], addr.addr_bytes[3],
+			addr.addr_bytes[4], addr.addr_bytes[5]);
+
+	return 0;
+}
+
+int
+main(int argc, char *argv[])
+{
+	char ifname[MAX_PATH_LEN];
+	uint16_t nb_ports, portid;
+	struct rte_mempool *mbuf_pool;
+	char ch;
+	int i, eid, did;
+	int ret;
+	uint64_t flags = 0;
+
+	ret = rte_eal_init(argc, argv);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "eal init failed\n");
+	argc -= ret;
+	argv += ret;
+
+	signal(SIGINT, signal_handler);
+	signal(SIGTERM, signal_handler);
+
+	nb_ports = rte_eth_dev_count();
+	mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS * nb_ports,
+		MBUF_CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
+	if (mbuf_pool == NULL)
+		rte_exit(EXIT_FAILURE, "cannot create mbuf pool\n");
+
+	/* Initialize all ports. */
+	for (portid = 0; portid < nb_ports; portid++)
+		if (port_init(portid, mbuf_pool) != 0)
+			rte_exit(EXIT_FAILURE, "Cannot init port %d\n",
+					portid);
+
+	data_init();
+
+	ret = parse_args(argc, argv);
+	if (ret < 0)
+		rte_exit(EXIT_FAILURE, "invalid argument\n");
+
+	eid = rte_vdpa_find_engine_id(&dev_id);
+	if (eid < 0)
+		rte_exit(EXIT_FAILURE, "no vDPA engine found\n");
+
+	printf("\nuse engine %d to create vhost socket\n", eid);
+	rte_vdpa_info_query(eid, &attr);
+	if (devcnt > (int)attr.dev_num)
+		rte_exit(EXIT_FAILURE, "not enough devices in engine\n");
+
+	if (queue > (int)attr.queue_num)
+		rte_exit(EXIT_FAILURE, "not enough queues in engine\n");
+
+	for (i = 0; i < RTE_MIN(MAX_VDPA_SAMPLE_PORTS, devcnt); i++) {
+		snprintf(ifname, sizeof(ifname), "%s%d", iface, i);
+		did = i;
+		vports[i].eid = eid;
+		vports[i].did = did;
+		strcpy(vports[i].ifname, ifname);
+
+		ret = rte_vhost_driver_register(ifname, flags);
+		if (ret != 0)
+			rte_exit(EXIT_FAILURE,
+					"register driver failed: %s\n",
+					ifname);
+
+		ret = rte_vhost_driver_callback_register(ifname,
+				&vdpa_sample_devops);
+		if (ret != 0)
+			rte_exit(EXIT_FAILURE,
+					"register driver ops failed: %s\n",
+					ifname);
+
+		rte_vhost_driver_set_vdpa_eid(ifname, eid);
+		rte_vhost_driver_set_vdpa_did(ifname, did);
+		/*
+		 * Configure vhost port with vDPA device's maximum capability.
+		 * App has the flexibility to change the features, queue num.
+		 */
+		rte_vhost_driver_set_queue_num(ifname, attr.queue_num);
+		rte_vhost_driver_set_features(ifname, attr.features);
+		rte_vhost_driver_set_protocol_features(ifname,
+				attr.protocol_features);
+
+		if (rte_vhost_driver_start(ifname) < 0)
+			rte_exit(EXIT_FAILURE,
+					"start vhost driver failed: %s\n",
+					ifname);
+	}
+
+	printf("enter \'q\' to quit\n");
+	while (scanf("%c", &ch) == 1) {
+		if (ch == 'q')
+			break;
+		while (ch != '\n' && scanf("%c", &ch) == 1)
+			;
+		printf("enter \'q\' to quit\n");
+	}
+
+	for (portid = 0; portid < nb_ports; portid++) {
+		printf("Closing port %d...\n", portid);
+		rte_eth_dev_stop(portid);
+		rte_eth_dev_close(portid);
+	}
+
+	return 0;
+}
-- 
2.15.1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-04 14:55 ` [dpdk-dev] [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver Xiao Wang
@ 2018-02-06 14:24   ` Maxime Coquelin
  2018-02-08  2:23     ` Wang, Xiao W
  0 siblings, 1 reply; 8+ messages in thread
From: Maxime Coquelin @ 2018-02-06 14:24 UTC (permalink / raw)
  To: Xiao Wang, dev
  Cc: jianfeng.tan, tiwei.bie, yliu, cunming.liang, dan.daly, zhihong.wang

Hi Xiao,

On 02/04/2018 03:55 PM, Xiao Wang wrote:
> This driver is a reference sample of making vDPA device driver based
> on vhost lib, this driver uses a standard virtio-net PCI device as
> vDPA device, it can serve as a backend for a virtio-net pci device
> in nested VM.
> 
> The key driver ops implemented are:
> 
> * vdpa_virtio_eng_init
> Map the virtio PCI device into userspace with VFIO, read device
> capabilities and initialize internal data.
> 
> * vdpa_virtio_eng_uninit
> Release the mapped device.
> 
> * vdpa_virtio_info_query
> Report device capabilities, e.g. queue number, features.
> 
> * vdpa_virtio_dev_config
> With the guest virtio information provided by vhost lib, this
> function configures the device and IOMMU to set up the vhost datapath,
> which includes: Rx/Tx vrings, VFIO interrupts, kick relay.
> 
> * vdpa_virtio_dev_close
> Undo what was previously configured by dev_conf.
> 
> This driver requires that the virtio device support
> VIRTIO_F_IOMMU_PLATFORM, because the buffer addresses written in the
> descriptors are IOVAs.
> 
> Because the vDPA driver needs to set up MSI-X vectors to interrupt the
> guest, only vfio-pci is currently supported.
> 
> Signed-off-by: Xiao Wang<xiao.w.wang@intel.com>
> ---
>   config/common_base                                 |    6 +
>   config/common_linuxapp                             |    1 +
>   drivers/net/Makefile                               |    1 +
>   drivers/net/vdpa_virtio_pci/Makefile               |   31 +
>   .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527 ++++++++++++++++++++
>   .../rte_vdpa_virtio_pci_version.map                |    4 +
>   mk/rte.app.mk                                      |    1 +
>   7 files changed, 1571 insertions(+)
>   create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
>   create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
>   create mode 100644 drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map

Is there a specific constraint that makes you expose PCI functions and
duplicate a lot of vfio code into the driver?

Wouldn't it be better (if possible) to use RTE_PMD_REGISTER_PCI() & co. 
to benefit from all the existing infrastructure?

Maxime

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-06 14:24   ` Maxime Coquelin
@ 2018-02-08  2:23     ` Wang, Xiao W
  2018-02-08  9:08       ` Maxime Coquelin
  0 siblings, 1 reply; 8+ messages in thread
From: Wang, Xiao W @ 2018-02-08  2:23 UTC (permalink / raw)
  To: Maxime Coquelin, dev
  Cc: Tan, Jianfeng, Bie, Tiwei, yliu, Liang, Cunming, Daly, Dan, Wang,
	Zhihong

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Tuesday, February 6, 2018 10:24 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Tan, Jianfeng <jianfeng.tan@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>;
> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Daly, Dan
> <dan.daly@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Subject: Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
> 
> Hi Xiao,
> 
> On 02/04/2018 03:55 PM, Xiao Wang wrote:
> > This driver is a reference sample of making vDPA device driver based
> > on vhost lib, this driver uses a standard virtio-net PCI device as
> > vDPA device, it can serve as a backend for a virtio-net pci device
> > in nested VM.
> >
> > The key driver ops implemented are:
> >
> > * vdpa_virtio_eng_init
> > Map the virtio PCI device into userspace with VFIO, read device
> > capabilities and initialize internal data.
> >
> > * vdpa_virtio_eng_uninit
> > Release the mapped device.
> >
> > * vdpa_virtio_info_query
> > Report device capabilities, e.g. queue number, features.
> >
> > * vdpa_virtio_dev_config
> > With the guest virtio information provided by vhost lib, this
> > function configures the device and IOMMU to set up the vhost datapath,
> > which includes: Rx/Tx vrings, VFIO interrupts, kick relay.
> >
> > * vdpa_virtio_dev_close
> > Undo what was previously configured by dev_conf.
> >
> > This driver requires that the virtio device support
> > VIRTIO_F_IOMMU_PLATFORM, because the buffer addresses written in the
> > descriptors are IOVAs.
> >
> > Because the vDPA driver needs to set up MSI-X vectors to interrupt the
> > guest, only vfio-pci is currently supported.
> >
> > Signed-off-by: Xiao Wang<xiao.w.wang@intel.com>
> > ---
> >   config/common_base                                 |    6 +
> >   config/common_linuxapp                             |    1 +
> >   drivers/net/Makefile                               |    1 +
> >   drivers/net/vdpa_virtio_pci/Makefile               |   31 +
> >   .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527
> ++++++++++++++++++++
> >   .../rte_vdpa_virtio_pci_version.map                |    4 +
> >   mk/rte.app.mk                                      |    1 +
> >   7 files changed, 1571 insertions(+)
> >   create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
> >   create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
> >   create mode 100644
> drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
> 
> Is there a specific constraint that makes you expose PCI functions and
> duplicate a lot of vfio code into the driver?

The existing vfio code doesn't fit vDPA well: this vDPA driver needs to program the IOMMU for a vDPA device with a VM's memory table,
while eal/vfio uses a single struct vfio_cfg that takes all regular devices, adds them to one vfio container, and programs the IOMMU with the DPDK process's memory table.

By doing the PCI VFIO initialization itself, this driver avoids touching the global vfio_cfg structure.

> 
> Wouldn't it be better (if possible) to use RTE_PMD_REGISTER_PCI() & co.
> to benefit from all the existing infrastructure?

RTE_PMD_REGISTER_PCI() & co would turn this driver into a PCI driver (for a physical device), which would conflict with the virtio PMD.
So I made the vDPA device driver a vdev driver.

> 
> Maxime

Thanks for the comments,
Xiao

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-08  2:23     ` Wang, Xiao W
@ 2018-02-08  9:08       ` Maxime Coquelin
  2018-02-12 15:36         ` Wang, Xiao W
  0 siblings, 1 reply; 8+ messages in thread
From: Maxime Coquelin @ 2018-02-08  9:08 UTC (permalink / raw)
  To: Wang, Xiao W, dev
  Cc: Tan, Jianfeng, Bie, Tiwei, yliu, Liang, Cunming, Daly, Dan, Wang,
	Zhihong

Hi Xiao,

On 02/08/2018 03:23 AM, Wang, Xiao W wrote:
> Hi Maxime,
> 
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Tuesday, February 6, 2018 10:24 PM
>> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
>> Cc: Tan, Jianfeng <jianfeng.tan@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>;
>> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Daly, Dan
>> <dan.daly@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
>> Subject: Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
>>
>> Hi Xiao,
>>
>> On 02/04/2018 03:55 PM, Xiao Wang wrote:
>>> This driver is a reference sample of making vDPA device driver based
>>> on vhost lib, this driver uses a standard virtio-net PCI device as
>>> vDPA device, it can serve as a backend for a virtio-net pci device
>>> in nested VM.
>>>
>>> The key driver ops implemented are:
>>>
>>> * vdpa_virtio_eng_init
>>> Map the virtio PCI device into userspace with VFIO, read device
>>> capabilities and initialize internal data.
>>>
>>> * vdpa_virtio_eng_uninit
>>> Release the mapped device.
>>>
>>> * vdpa_virtio_info_query
>>> Report device capabilities, e.g. queue number, features.
>>>
>>> * vdpa_virtio_dev_config
>>> With the guest virtio information provided by vhost lib, this
>>> function configures the device and IOMMU to set up the vhost datapath,
>>> which includes: Rx/Tx vrings, VFIO interrupts, kick relay.
>>>
>>> * vdpa_virtio_dev_close
>>> Undo what was previously configured by dev_conf.
>>>
>>> This driver requires that the virtio device support
>>> VIRTIO_F_IOMMU_PLATFORM, because the buffer addresses written in the
>>> descriptors are IOVAs.
>>>
>>> Because the vDPA driver needs to set up MSI-X vectors to interrupt the
>>> guest, only vfio-pci is currently supported.
>>>
>>> Signed-off-by: Xiao Wang<xiao.w.wang@intel.com>
>>> ---
>>>    config/common_base                                 |    6 +
>>>    config/common_linuxapp                             |    1 +
>>>    drivers/net/Makefile                               |    1 +
>>>    drivers/net/vdpa_virtio_pci/Makefile               |   31 +
>>>    .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527
>> ++++++++++++++++++++
>>>    .../rte_vdpa_virtio_pci_version.map                |    4 +
>>>    mk/rte.app.mk                                      |    1 +
>>>    7 files changed, 1571 insertions(+)
>>>    create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
>>>    create mode 100644 drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
>>>    create mode 100644
>> drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
>>
>> Is there a specific constraint that makes you expose PCI functions and
>> duplicate a lot of vfio code into the driver?
> 
> The existing vfio code doesn't fit vDPA well: this vDPA driver needs to program the IOMMU for a vDPA device with a VM's memory table,
> while eal/vfio uses a single struct vfio_cfg that takes all regular devices, adds them to one vfio container, and programs the IOMMU with the DPDK process's memory table.
> 
> By doing the PCI VFIO initialization itself, this driver avoids touching the global vfio_cfg structure.

Ok, I get it.
So I think what you have to do is to extend eal/vfio for this case.
Or at least, have a vdpa layer to perform this, else every offload
driver will have to duplicate the code.

>>
>> Wouldn't it be better (if possible) to use RTE_PMD_REGISTER_PCI() & co.
>> to benefit from all the existing infrastructure?
> 
> RTE_PMD_REGISTER_PCI() & co would turn this driver into a PCI driver (for a physical device), which would conflict with the virtio PMD.
> So I made the vDPA device driver a vdev driver.

Yes, but it is a PCI device, not a virtual device. You have to extend
the EAL to support this new class of devices/drivers. Think of it as in
the kernel, where a NIC device can be bound either to its NIC driver,
VFIO or UIO.

If I look at patch 3, you have to set --no-pci, or at least, I think,
blacklist the Virtio device.

I wonder whether real vDPA cards will either support vDPA mode or behave
like a regular NIC, as in the Virtio case in your example.
If this is the case, maybe the vDPA code for a NIC could be in the same
driver as the "NIC" mode.
A new driver flag could be introduced in struct rte_pci_driver to specify
that the driver supports vDPA.
Then, in EAL arguments, if a vhost vdev specifies that it wants the
Virtio device at PCI addr 00:01:00 as offload, the PCI layer could probe
this device in "vdpa" mode.

Also, I don't know if this will be possible with real vDPA cards, but we
could have the application doing packet switching between vhost-user
vdev and the Virtio device. And at some point, at runtime, switch into
vDPA mode. This use-case would be much easier to implement if vDPA
relied on the existing PCI layer.

I may not be very clear, so don't hesitate to ask questions.
But generally, I think vDPA has to fit in existing DPDK architecture,
and not try to live outside of it.

Thanks,
Maxime
>>
>> Maxime
> 
> Thanks for the comments,
> Xiao
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [dpdk-dev] [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
  2018-02-08  9:08       ` Maxime Coquelin
@ 2018-02-12 15:36         ` Wang, Xiao W
  0 siblings, 0 replies; 8+ messages in thread
From: Wang, Xiao W @ 2018-02-12 15:36 UTC (permalink / raw)
  To: Maxime Coquelin, dev
  Cc: Tan, Jianfeng, Bie, Tiwei, yliu, Liang, Cunming, Daly, Dan, Wang,
	Zhihong

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Thursday, February 8, 2018 5:09 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Tan, Jianfeng <jianfeng.tan@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>;
> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Daly, Dan
> <dan.daly@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> Subject: Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
> 
> Hi Xiao,
> 
> On 02/08/2018 03:23 AM, Wang, Xiao W wrote:
> > Hi Maxime,
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> >> Sent: Tuesday, February 6, 2018 10:24 PM
> >> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> >> Cc: Tan, Jianfeng <jianfeng.tan@intel.com>; Bie, Tiwei
> <tiwei.bie@intel.com>;
> >> yliu@fridaylinux.org; Liang, Cunming <cunming.liang@intel.com>; Daly, Dan
> >> <dan.daly@intel.com>; Wang, Zhihong <zhihong.wang@intel.com>
> >> Subject: Re: [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver
> >>
> >> Hi Xiao,
> >>
> >> On 02/04/2018 03:55 PM, Xiao Wang wrote:
> >>> This driver is a reference sample of making vDPA device driver based
> >>> on vhost lib, this driver uses a standard virtio-net PCI device as
> >>> vDPA device, it can serve as a backend for a virtio-net pci device
> >>> in nested VM.
> >>>
> >>> The key driver ops implemented are:
> >>>
> >>> * vdpa_virtio_eng_init
> >>> Map the virtio PCI device into userspace with VFIO, read device
> >>> capabilities and initialize internal data.
> >>>
> >>> * vdpa_virtio_eng_uninit
> >>> Release the mapped device.
> >>>
> >>> * vdpa_virtio_info_query
> >>> Report device capabilities, e.g. queue number, features.
> >>>
> >>> * vdpa_virtio_dev_config
> >>> With the guest virtio information provided by vhost lib, this
> >>> function configures the device and IOMMU to set up the vhost datapath,
> >>> which includes: Rx/Tx vrings, VFIO interrupts, kick relay.
> >>>
> >>> * vdpa_virtio_dev_close
> >>> Undo what was previously configured by dev_conf.
> >>>
> >>> This driver requires that the virtio device support
> >>> VIRTIO_F_IOMMU_PLATFORM, because the buffer addresses written in the
> >>> descriptors are IOVAs.
> >>>
> >>> Because the vDPA driver needs to set up MSI-X vectors to interrupt the
> >>> guest, only vfio-pci is currently supported.
> >>>
> >>> Signed-off-by: Xiao Wang<xiao.w.wang@intel.com>
> >>> ---
> >>>    config/common_base                                 |    6 +
> >>>    config/common_linuxapp                             |    1 +
> >>>    drivers/net/Makefile                               |    1 +
> >>>    drivers/net/vdpa_virtio_pci/Makefile               |   31 +
> >>>    .../net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c  | 1527
> >> ++++++++++++++++++++
> >>>    .../rte_vdpa_virtio_pci_version.map                |    4 +
> >>>    mk/rte.app.mk                                      |    1 +
> >>>    7 files changed, 1571 insertions(+)
> >>>    create mode 100644 drivers/net/vdpa_virtio_pci/Makefile
> >>>    create mode 100644
> drivers/net/vdpa_virtio_pci/rte_eth_vdpa_virtio_pci.c
> >>>    create mode 100644
> >> drivers/net/vdpa_virtio_pci/rte_vdpa_virtio_pci_version.map
> >>
> >> Is there a specific constraint that makes you expose PCI functions and
> >> duplicate a lot of vfio code into the driver?
> >
> > The existing vfio code doesn't fit vDPA well: this vDPA driver needs to
> > program the IOMMU for a vDPA device with a VM's memory table, while
> > eal/vfio uses a single struct vfio_cfg that takes all regular devices,
> > adds them to one vfio container, and programs the IOMMU with the DPDK
> > process's memory table.
> >
> > By doing the PCI VFIO initialization itself, this driver avoids touching
> > the global vfio_cfg structure.
> 
> Ok, I get it.
> So I think what you have to do is to extend eal/vfio for this case.
> Or at least, have a vdpa layer to perform this, else every offload
> driver will have to duplicate the code.

I think I need to extend eal/vfio to provide container-based APIs, such as creating a container,
binding a vfio group fd to a container, DMAR programming, etc.

> 
> >>
> >> Wouldn't it be better (if possible) to use RTE_PMD_REGISTER_PCI() & co.
> >> to benefit from all the existing infrastructure?
> >
> > RTE_PMD_REGISTER_PCI() & co would turn this driver into a PCI driver
> > (for a physical device), which would conflict with the virtio PMD.
> > So I made the vDPA device driver a vdev driver.
> 
> Yes, but it is a PCI device, not a virtual device. You have to extend
> the EAL to support this new class of devices/drivers. Think of it as in
> the kernel, where a NIC device can be bound either to its NIC driver,
> VFIO or UIO.
> 
> If I look at patch 3, you have to set --no-pci, or at least, I think,
> blacklist the Virtio device.
> 
> I wonder whether real vDPA cards will either support vDPA mode or behave
> like a regular NIC, as in the Virtio case in your example.
> If this is the case, maybe the vDPA code for a NIC could be in the same
> driver as the "NIC" mode.
> A new driver flag could be introduced in struct rte_pci_driver to specify
> that the driver supports vDPA.
> Then, in EAL arguments, if a vhost vdev specifies that it wants the
> Virtio device at PCI addr 00:01:00 as offload, the PCI layer could probe
> this device in "vdpa" mode.

Considering that we could have a pool of vDPA devices, we need a port supporting port representors;
it defines the control domain to which these vDPA devices belong. We can have a vdev port for this
purpose, and this vdev helps to register vDPA ports via the port-representor library (patch submitted).

                                   +------+
                                   | vdev |
+---+                              |------|
|app|--register representor port-->|broker|-->add port with vDPA device 0/1/2...
+---+                              +------+

I plan to submit a vdpa driver patch for a real vDPA card; that card will have a different sub device_id/vendor_id,
so we won't have a conflict issue with that driver.

> 
> Also, I don't know if this will be possible with real vDPA cards, but we
> could have the application doing packet switching between vhost-user
> vdev and the Virtio device. And at some point, at runtime, switch into
> vDPA mode. This use-case would be much easier to implement if vDPA
> relied on existing PCI layer.

In vDPA mode, each vhost-user datapath is performed by a vDPA device.
After switching over to normal SW packet switching, there will typically be many vhost-user ports and one uplink port.

Thanks,
Xiao

> 
> I may not be very clear, so don't hesitate to ask questions.
> But generally, I think vDPA has to fit in existing DPDK architecture,
> and not try to live outside of it.
> 
> Thanks,
> Maxime
> >>
> >> Maxime
> >
> > Thanks for the comments,
> > Xiao
> >

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-02-12 15:36 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-04 14:55 [dpdk-dev] [PATCH 0/3] add vDPA sample driver Xiao Wang
2018-02-04 14:55 ` [dpdk-dev] [PATCH 1/3] bus/pci: expose API for vDPA Xiao Wang
2018-02-04 14:55 ` [dpdk-dev] [PATCH 2/3] net/vdpa_virtio_pci: introduce vdpa sample driver Xiao Wang
2018-02-06 14:24   ` Maxime Coquelin
2018-02-08  2:23     ` Wang, Xiao W
2018-02-08  9:08       ` Maxime Coquelin
2018-02-12 15:36         ` Wang, Xiao W
2018-02-04 14:55 ` [dpdk-dev] [PATCH 3/3] examples/vdpa: add a new sample for vdpa Xiao Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).