DPDK patches and discussions
* [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
@ 2016-08-23  8:10 Yuanhan Liu
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling Yuanhan Liu
                   ` (9 more replies)
  0 siblings, 10 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23  8:10 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

This patch set enables vhost Tx zero copy. Most of the work is in
patch 4: vhost: add Tx zero copy.

The basic idea of Tx zero copy is that, instead of copying data from
the desc buf, we let the mbuf reference the desc buf addr directly.

The major issue behind that is how and when to update the used ring.
You could check the commit log of patch 4 for more details.
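
To make the idea concrete, here is a rough sketch that is not taken from
the patches. It contrasts the copy path with the zero copy path using a
hypothetical "toy_mbuf" (all names below are invented for illustration and
are not DPDK APIs). It also shows why the desc buf must stay untouched
while a zero copy mbuf is alive.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf: a data pointer plus a length. */
struct toy_mbuf {
	uint8_t  own_buf[2048];	/* private storage, used by the copy path */
	uint8_t *data;		/* where the packet data actually lives */
	uint32_t data_len;
};

/* Copy path: the data is duplicated, so the desc buf can be returned
 * to the guest driver right away. */
static void
toy_dequeue_copy(struct toy_mbuf *m, const uint8_t *desc_buf, uint32_t len)
{
	assert(len <= sizeof(m->own_buf));
	memcpy(m->own_buf, desc_buf, len);
	m->data = m->own_buf;
	m->data_len = len;
}

/* Zero copy path: the mbuf only points at the desc buf, so the desc
 * buf must not be reused until the mbuf has been consumed. */
static void
toy_dequeue_zero_copy(struct toy_mbuf *m, uint8_t *desc_buf, uint32_t len)
{
	m->data = desc_buf;
	m->data_len = len;
}
```

If the guest rewrites the desc buf after a zero copy dequeue, the mbuf
sees the new bytes, which is exactly the stale-reference problem the
used ring handling has to avoid.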

Patch 5 introduces a new flag, RTE_VHOST_USER_TX_ZERO_COPY, to enable
Tx zero copy, which is disabled by default.

A few TODOs are left, including handling a desc buf that spans two
physical pages, updating the release notes, etc. Those will be fixed
in a later version. For now, here is a simple version that hopefully
shows the idea clearly.

I did some quick tests; the performance gain is quite impressive.

For a simple dequeue workload (running rxonly in vhost-pmd and running
txonly in guest testpmd), it yields a 40+% performance boost for a
packet size of 1400B.

For the VM2VM iperf test case, it's even better: about a 70% boost.

---
Yuanhan Liu (6):
  vhost: simplify memory regions handling
  vhost: get guest/host physical address mappings
  vhost: introduce last avail idx for Tx
  vhost: add Tx zero copy
  vhost: add a flag to enable Tx zero copy
  examples/vhost: add an option to enable Tx zero copy

 doc/guides/prog_guide/vhost_lib.rst |   7 +-
 examples/vhost/main.c               |  19 ++-
 lib/librte_vhost/rte_virtio_net.h   |   1 +
 lib/librte_vhost/socket.c           |   5 +
 lib/librte_vhost/vhost.c            |  12 ++
 lib/librte_vhost/vhost.h            | 103 +++++++++----
 lib/librte_vhost/vhost_user.c       | 297 +++++++++++++++++++++++-------------
 lib/librte_vhost/virtio_net.c       | 188 +++++++++++++++++++----
 8 files changed, 472 insertions(+), 160 deletions(-)

-- 
1.9.0

* [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
@ 2016-08-23  8:10 ` Yuanhan Liu
  2016-08-23  9:17   ` Maxime Coquelin
  2016-08-24  7:26   ` Xu, Qian Q
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 2/6] vhost: get guest/host physical address mappings Yuanhan Liu
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23  8:10 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

For historical reasons (vhost-cuse came before vhost-user), some
fields for maintaining the vhost-user memory mappings (such as the
mmapped address and size, which we need in order to unmap on destroy)
are kept in the "orig_region_map" struct, a structure that is defined
only in the vhost-user source file.

The right way to go is to remove the structure and move all those fields
into the virtio_memory_region struct. But we simply couldn't do that
before, because it would have broken the ABI.

Now, thanks to the ABI refactoring, that is no longer a blocking issue.
So this patch removes orig_region_map and redefines
virtio_memory_region to include all necessary info.

With that, we can simplify the guest/host address conversion a bit.
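
As a standalone illustration of the simplified conversion, here is a
minimal sketch that mirrors the lookup in the diff below. It uses a
reduced, hypothetical "toy_region" (the mmap fields and fd are left out)
and takes the region array as a parameter instead of going through
struct virtio_net, so it compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced copy of the new region layout (mmap_addr, mmap_size and fd
 * are omitted because the lookups do not need them). */
struct toy_region {
	uint64_t guest_phys_addr;
	uint64_t guest_user_addr;
	uint64_t host_user_addr;
	uint64_t size;
};

/* Guest physical address -> host virtual address; 0 means no match. */
static uint64_t
toy_gpa_to_vva(const struct toy_region *regs, uint32_t nregions, uint64_t gpa)
{
	uint32_t i;

	assert(regs != NULL || nregions == 0);
	for (i = 0; i < nregions; i++) {
		const struct toy_region *reg = &regs[i];

		/* half-open range: [guest_phys_addr, guest_phys_addr + size) */
		if (gpa >= reg->guest_phys_addr &&
		    gpa <  reg->guest_phys_addr + reg->size)
			return gpa - reg->guest_phys_addr + reg->host_user_addr;
	}
	return 0;
}

/* QEMU virtual address -> host virtual address; same shape, other base. */
static uint64_t
toy_qva_to_vva(const struct toy_region *regs, uint32_t nregions, uint64_t qva)
{
	uint32_t i;

	assert(regs != NULL || nregions == 0);
	for (i = 0; i < nregions; i++) {
		const struct toy_region *reg = &regs[i];

		if (qva >= reg->guest_user_addr &&
		    qva <  reg->guest_user_addr + reg->size)
			return qva - reg->guest_user_addr + reg->host_user_addr;
	}
	return 0;
}
```

Both conversions become "offset within the region plus the host base",
which is what storing host_user_addr per region buys us.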

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost.h      |  49 ++++++------
 lib/librte_vhost/vhost_user.c | 172 +++++++++++++++++-------------------------
 2 files changed, 90 insertions(+), 131 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index c2dfc3c..df2107b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -143,12 +143,14 @@ struct virtio_net {
  * Information relating to memory regions including offsets to
  * addresses in QEMUs memory file.
  */
-struct virtio_memory_regions {
-	uint64_t guest_phys_address;
-	uint64_t guest_phys_address_end;
-	uint64_t memory_size;
-	uint64_t userspace_address;
-	uint64_t address_offset;
+struct virtio_memory_region {
+	uint64_t guest_phys_addr;
+	uint64_t guest_user_addr;
+	uint64_t host_user_addr;
+	uint64_t size;
+	void	 *mmap_addr;
+	uint64_t mmap_size;
+	int fd;
 };
 
 
@@ -156,12 +158,8 @@ struct virtio_memory_regions {
  * Memory structure includes region and mapping information.
  */
 struct virtio_memory {
-	/* Base QEMU userspace address of the memory file. */
-	uint64_t base_address;
-	uint64_t mapped_address;
-	uint64_t mapped_size;
 	uint32_t nregions;
-	struct virtio_memory_regions regions[0];
+	struct virtio_memory_region regions[0];
 };
 
 
@@ -200,26 +198,23 @@ extern uint64_t VHOST_FEATURES;
 #define MAX_VHOST_DEVICE	1024
 extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
 
-/**
- * Function to convert guest physical addresses to vhost virtual addresses.
- * This is used to convert guest virtio buffer addresses.
- */
+/* Convert guest physical Address to host virtual address */
 static inline uint64_t __attribute__((always_inline))
-gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
+gpa_to_vva(struct virtio_net *dev, uint64_t gpa)
 {
-	struct virtio_memory_regions *region;
-	uint32_t regionidx;
-	uint64_t vhost_va = 0;
-
-	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
-		region = &dev->mem->regions[regionidx];
-		if ((guest_pa >= region->guest_phys_address) &&
-			(guest_pa <= region->guest_phys_address_end)) {
-			vhost_va = region->address_offset + guest_pa;
-			break;
+	struct virtio_memory_region *reg;
+	uint32_t i;
+
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+		if (gpa >= reg->guest_phys_addr &&
+		    gpa <  reg->guest_phys_addr + reg->size) {
+			return gpa - reg->guest_phys_addr +
+			       reg->host_user_addr;
 		}
 	}
-	return vhost_va;
+
+	return 0;
 }
 
 struct virtio_net_device_ops const *notify_ops;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index eee99e9..d2071fd 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -74,18 +74,6 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
 	[VHOST_USER_SEND_RARP]  = "VHOST_USER_SEND_RARP",
 };
 
-struct orig_region_map {
-	int fd;
-	uint64_t mapped_address;
-	uint64_t mapped_size;
-	uint64_t blksz;
-};
-
-#define orig_region(ptr, nregions) \
-	((struct orig_region_map *)RTE_PTR_ADD((ptr), \
-		sizeof(struct virtio_memory) + \
-		sizeof(struct virtio_memory_regions) * (nregions)))
-
 static uint64_t
 get_blk_size(int fd)
 {
@@ -99,18 +87,17 @@ get_blk_size(int fd)
 static void
 free_mem_region(struct virtio_net *dev)
 {
-	struct orig_region_map *region;
-	unsigned int idx;
+	uint32_t i;
+	struct virtio_memory_region *reg;
 
 	if (!dev || !dev->mem)
 		return;
 
-	region = orig_region(dev->mem, dev->mem->nregions);
-	for (idx = 0; idx < dev->mem->nregions; idx++) {
-		if (region[idx].mapped_address) {
-			munmap((void *)(uintptr_t)region[idx].mapped_address,
-					region[idx].mapped_size);
-			close(region[idx].fd);
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+		if (reg->host_user_addr) {
+			munmap(reg->mmap_addr, reg->mmap_size);
+			close(reg->fd);
 		}
 	}
 }
@@ -120,7 +107,7 @@ vhost_backend_cleanup(struct virtio_net *dev)
 {
 	if (dev->mem) {
 		free_mem_region(dev);
-		free(dev->mem);
+		rte_free(dev->mem);
 		dev->mem = NULL;
 	}
 	if (dev->log_addr) {
@@ -286,25 +273,23 @@ numa_realloc(struct virtio_net *dev, int index __rte_unused)
  * used to convert the ring addresses to our address space.
  */
 static uint64_t
-qva_to_vva(struct virtio_net *dev, uint64_t qemu_va)
+qva_to_vva(struct virtio_net *dev, uint64_t qva)
 {
-	struct virtio_memory_regions *region;
-	uint64_t vhost_va = 0;
-	uint32_t regionidx = 0;
+	struct virtio_memory_region *reg;
+	uint32_t i;
 
 	/* Find the region where the address lives. */
-	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
-		region = &dev->mem->regions[regionidx];
-		if ((qemu_va >= region->userspace_address) &&
-			(qemu_va <= region->userspace_address +
-			region->memory_size)) {
-			vhost_va = qemu_va + region->guest_phys_address +
-				region->address_offset -
-				region->userspace_address;
-			break;
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+
+		if (qva >= reg->guest_user_addr &&
+		    qva <  reg->guest_user_addr + reg->size) {
+			return qva - reg->guest_user_addr +
+			       reg->host_user_addr;
 		}
 	}
-	return vhost_va;
+
+	return 0;
 }
 
 /*
@@ -391,11 +376,13 @@ static int
 vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 {
 	struct VhostUserMemory memory = pmsg->payload.memory;
-	struct virtio_memory_regions *pregion;
-	uint64_t mapped_address, mapped_size;
-	unsigned int idx = 0;
-	struct orig_region_map *pregion_orig;
+	struct virtio_memory_region *reg;
+	void *mmap_addr;
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
 	uint64_t alignment;
+	uint32_t i;
+	int fd;
 
 	/* Remove from the data plane. */
 	if (dev->flags & VIRTIO_DEV_RUNNING) {
@@ -405,14 +392,12 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 
 	if (dev->mem) {
 		free_mem_region(dev);
-		free(dev->mem);
+		rte_free(dev->mem);
 		dev->mem = NULL;
 	}
 
-	dev->mem = calloc(1,
-		sizeof(struct virtio_memory) +
-		sizeof(struct virtio_memory_regions) * memory.nregions +
-		sizeof(struct orig_region_map) * memory.nregions);
+	dev->mem = rte_zmalloc("vhost-mem-table", sizeof(struct virtio_memory) +
+		sizeof(struct virtio_memory_region) * memory.nregions, 0);
 	if (dev->mem == NULL) {
 		RTE_LOG(ERR, VHOST_CONFIG,
 			"(%d) failed to allocate memory for dev->mem\n",
@@ -421,22 +406,17 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 	}
 	dev->mem->nregions = memory.nregions;
 
-	pregion_orig = orig_region(dev->mem, memory.nregions);
-	for (idx = 0; idx < memory.nregions; idx++) {
-		pregion = &dev->mem->regions[idx];
-		pregion->guest_phys_address =
-			memory.regions[idx].guest_phys_addr;
-		pregion->guest_phys_address_end =
-			memory.regions[idx].guest_phys_addr +
-			memory.regions[idx].memory_size;
-		pregion->memory_size =
-			memory.regions[idx].memory_size;
-		pregion->userspace_address =
-			memory.regions[idx].userspace_addr;
-
-		/* This is ugly */
-		mapped_size = memory.regions[idx].memory_size +
-			memory.regions[idx].mmap_offset;
+	for (i = 0; i < memory.nregions; i++) {
+		fd  = pmsg->fds[i];
+		reg = &dev->mem->regions[i];
+
+		reg->guest_phys_addr = memory.regions[i].guest_phys_addr;
+		reg->guest_user_addr = memory.regions[i].userspace_addr;
+		reg->size            = memory.regions[i].memory_size;
+		reg->fd              = fd;
+
+		mmap_offset = memory.regions[i].mmap_offset;
+		mmap_size   = reg->size + mmap_offset;
 
 		/* mmap() without flag of MAP_ANONYMOUS, should be called
 		 * with length argument aligned with hugepagesz at older
@@ -446,67 +426,51 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		 * to avoid failure, make sure in caller to keep length
 		 * aligned.
 		 */
-		alignment = get_blk_size(pmsg->fds[idx]);
+		alignment = get_blk_size(fd);
 		if (alignment == (uint64_t)-1) {
 			RTE_LOG(ERR, VHOST_CONFIG,
 				"couldn't get hugepage size through fstat\n");
 			goto err_mmap;
 		}
-		mapped_size = RTE_ALIGN_CEIL(mapped_size, alignment);
+		mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment);
 
-		mapped_address = (uint64_t)(uintptr_t)mmap(NULL,
-			mapped_size,
-			PROT_READ | PROT_WRITE, MAP_SHARED,
-			pmsg->fds[idx],
-			0);
+		mmap_addr = mmap(NULL, mmap_size,
+				 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 
-		RTE_LOG(INFO, VHOST_CONFIG,
-			"mapped region %d fd:%d to:%p sz:0x%"PRIx64" "
-			"off:0x%"PRIx64" align:0x%"PRIx64"\n",
-			idx, pmsg->fds[idx], (void *)(uintptr_t)mapped_address,
-			mapped_size, memory.regions[idx].mmap_offset,
-			alignment);
-
-		if (mapped_address == (uint64_t)(uintptr_t)MAP_FAILED) {
+		if (mmap_addr == MAP_FAILED) {
 			RTE_LOG(ERR, VHOST_CONFIG,
-				"mmap qemu guest failed.\n");
+				"mmap region %u failed.\n", i);
 			goto err_mmap;
 		}
 
-		pregion_orig[idx].mapped_address = mapped_address;
-		pregion_orig[idx].mapped_size = mapped_size;
-		pregion_orig[idx].blksz = alignment;
-		pregion_orig[idx].fd = pmsg->fds[idx];
-
-		mapped_address +=  memory.regions[idx].mmap_offset;
+		reg->mmap_addr = mmap_addr;
+		reg->mmap_size = mmap_size;
+		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;
 
-		pregion->address_offset = mapped_address -
-			pregion->guest_phys_address;
-
-		if (memory.regions[idx].guest_phys_addr == 0) {
-			dev->mem->base_address =
-				memory.regions[idx].userspace_addr;
-			dev->mem->mapped_address =
-				pregion->address_offset;
-		}
-
-		LOG_DEBUG(VHOST_CONFIG,
-			"REGION: %u GPA: %p QEMU VA: %p SIZE (%"PRIu64")\n",
-			idx,
-			(void *)(uintptr_t)pregion->guest_phys_address,
-			(void *)(uintptr_t)pregion->userspace_address,
-			 pregion->memory_size);
+		RTE_LOG(INFO, VHOST_CONFIG,
+			"guest memory region %u, size: 0x%" PRIx64 "\n"
+			"\t guest physical addr: 0x%" PRIx64 "\n"
+			"\t guest virtual  addr: 0x%" PRIx64 "\n"
+			"\t host  virtual  addr: 0x%" PRIx64 "\n"
+			"\t mmap addr : 0x%" PRIx64 "\n"
+			"\t mmap size : 0x%" PRIx64 "\n"
+			"\t mmap align: 0x%" PRIx64 "\n"
+			"\t mmap off  : 0x%" PRIx64 "\n",
+			i, reg->size,
+			reg->guest_phys_addr,
+			reg->guest_user_addr,
+			reg->host_user_addr,
+			(uint64_t)(uintptr_t)mmap_addr,
+			mmap_size,
+			alignment,
+			mmap_offset);
 	}
 
 	return 0;
 
 err_mmap:
-	while (idx--) {
-		munmap((void *)(uintptr_t)pregion_orig[idx].mapped_address,
-				pregion_orig[idx].mapped_size);
-		close(pregion_orig[idx].fd);
-	}
-	free(dev->mem);
+	free_mem_region(dev);
+	rte_free(dev->mem);
 	dev->mem = NULL;
 	return -1;
 }
-- 
1.9.0

* [dpdk-dev] [PATCH 2/6] vhost: get guest/host physical address mappings
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling Yuanhan Liu
@ 2016-08-23  8:10 ` Yuanhan Liu
  2016-08-23  9:58   ` Maxime Coquelin
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 3/6] vhost: introduce last avail idx for Tx Yuanhan Liu
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23  8:10 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

Record the guest/host physical address mappings, so that we can
convert a guest physical address to a host physical address. This will
be used in the later Tx zero copy implementation.
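
The mapping is recorded one host page at a time, since a region that is
contiguous in guest physical space need not be contiguous in host
physical space. Below is a rough sketch of just the splitting
arithmetic, with invented names ("toy_page", "toy_split_region") and
without the rte_mem_virt2phy() lookup, so it can run standalone. Unlike
the loop in this patch, the sketch clamps the last chunk, which is an
assumption on my side for regions whose tail is not page aligned.

```c
#include <assert.h>
#include <stdint.h>

/* One recorded chunk; the host physical address is left out here. */
struct toy_page {
	uint64_t guest_phys_addr;
	uint64_t size;
};

/* Split [gpa, gpa + reg_size) at page_size boundaries.
 * Returns the number of chunks written to pages[]. */
static uint32_t
toy_split_region(uint64_t gpa, uint64_t reg_size, uint64_t page_size,
		 struct toy_page *pages, uint32_t max_pages)
{
	uint32_t n = 0;

	/* the mask trick below needs a nonzero power of two */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	while (reg_size > 0 && n < max_pages) {
		/* bytes left until the next page boundary */
		uint64_t size = page_size - (gpa & (page_size - 1));

		if (size > reg_size)
			size = reg_size;

		pages[n].guest_phys_addr = gpa;
		pages[n].size = size;
		n++;

		gpa += size;
		reg_size -= size;
	}
	return n;
}
```

Only the first and last chunks can be shorter than page_size; every
chunk in between starts on a page boundary.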

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost.h      | 30 +++++++++++++++
 lib/librte_vhost/vhost_user.c | 86 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index df2107b..2d52987 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -114,6 +114,12 @@ struct vhost_virtqueue {
  #define VIRTIO_F_VERSION_1 32
 #endif
 
+struct guest_page {
+	uint64_t guest_phys_addr;
+	uint64_t host_phys_addr;
+	uint64_t size;
+};
+
 /**
  * Device structure contains all configuration information relating
  * to the device.
@@ -137,6 +143,10 @@ struct virtio_net {
 	uint64_t		log_addr;
 	struct ether_addr	mac;
 
+	uint32_t		nr_guest_pages;
+	uint32_t		max_guest_pages;
+	struct guest_page       *guest_pages;
+
 } __rte_cache_aligned;
 
 /**
@@ -217,6 +227,26 @@ gpa_to_vva(struct virtio_net *dev, uint64_t gpa)
 	return 0;
 }
 
+/* Convert guest physical address to host physical address */
+static inline phys_addr_t __attribute__((always_inline))
+gpa_to_hpa(struct virtio_net *dev, uint64_t gpa, uint64_t size)
+{
+	uint32_t i;
+	struct guest_page *page;
+
+	for (i = 0; i < dev->nr_guest_pages; i++) {
+		page = &dev->guest_pages[i];
+
+		if (gpa >= page->guest_phys_addr &&
+		    gpa + size < page->guest_phys_addr + page->size) {
+			return gpa - page->guest_phys_addr +
+			       page->host_phys_addr;
+		}
+	}
+
+	return 0;
+}
+
 struct virtio_net_device_ops const *notify_ops;
 struct virtio_net *get_device(int vid);
 
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index d2071fd..045d4f0 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -372,6 +372,81 @@ vhost_user_set_vring_base(struct virtio_net *dev,
 	return 0;
 }
 
+static void
+add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
+		   uint64_t host_phys_addr, uint64_t size)
+{
+	struct guest_page *page;
+
+	if (dev->nr_guest_pages == dev->max_guest_pages) {
+		dev->max_guest_pages *= 2;
+		dev->guest_pages = realloc(dev->guest_pages,
+					dev->max_guest_pages * sizeof(*page));
+	}
+
+	page = &dev->guest_pages[dev->nr_guest_pages++];
+	page->guest_phys_addr = guest_phys_addr;
+	page->host_phys_addr  = host_phys_addr;
+	page->size = size;
+}
+
+static void
+add_guest_pages(struct virtio_net *dev, struct virtio_memory_region *reg,
+		uint64_t page_size)
+{
+	uint64_t reg_size = reg->size;
+	uint64_t host_user_addr  = reg->host_user_addr;
+	uint64_t guest_phys_addr = reg->guest_phys_addr;
+	uint64_t host_phys_addr;
+	uint64_t size;
+	uint32_t pre_read;
+
+	pre_read = *((uint32_t *)(uintptr_t)host_user_addr);
+	host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)host_user_addr);
+	size = page_size - (guest_phys_addr & (page_size - 1));
+	size = RTE_MIN(size, reg_size);
+
+	add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size);
+	host_user_addr  += size;
+	guest_phys_addr += size;
+	reg_size -= size;
+
+	while (reg_size > 0) {
+		pre_read += *((uint32_t *)(uintptr_t)host_user_addr);
+		host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)host_user_addr);
+		add_one_guest_page(dev, guest_phys_addr, host_phys_addr, page_size);
+
+		host_user_addr  += page_size;
+		guest_phys_addr += page_size;
+		reg_size -= page_size;
+	}
+
+	/* FIXME */
+	RTE_LOG(INFO, VHOST_CONFIG, ":: %u ::\n", pre_read);
+}
+
+/* TODO: enable it only in debug mode? */
+static void
+dump_guest_pages(struct virtio_net *dev)
+{
+	uint32_t i;
+	struct guest_page *page;
+
+	for (i = 0; i < dev->nr_guest_pages; i++) {
+		page = &dev->guest_pages[i];
+
+		RTE_LOG(INFO, VHOST_CONFIG,
+			"guest physical page region %u\n"
+			"\t guest_phys_addr: %" PRIx64 "\n"
+			"\t host_phys_addr : %" PRIx64 "\n"
+			"\t size           : %" PRIx64 "\n",
+			i,
+			page->guest_phys_addr,
+			page->host_phys_addr,
+			page->size);
+	}
+}
+
 static int
 vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 {
@@ -396,6 +471,13 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		dev->mem = NULL;
 	}
 
+	dev->nr_guest_pages = 0;
+	if (!dev->guest_pages) {
+		dev->max_guest_pages = 8;
+		dev->guest_pages = malloc(dev->max_guest_pages *
+						sizeof(struct guest_page));
+	}
+
 	dev->mem = rte_zmalloc("vhost-mem-table", sizeof(struct virtio_memory) +
 		sizeof(struct virtio_memory_region) * memory.nregions, 0);
 	if (dev->mem == NULL) {
@@ -447,6 +529,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		reg->mmap_size = mmap_size;
 		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;
 
+		add_guest_pages(dev, reg, alignment);
+
 		RTE_LOG(INFO, VHOST_CONFIG,
 			"guest memory region %u, size: 0x%" PRIx64 "\n"
 			"\t guest physical addr: 0x%" PRIx64 "\n"
@@ -466,6 +550,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 			mmap_offset);
 	}
 
+	dump_guest_pages(dev);
+
 	return 0;
 
 err_mmap:
-- 
1.9.0

* [dpdk-dev] [PATCH 3/6] vhost: introduce last avail idx for Tx
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling Yuanhan Liu
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 2/6] vhost: get guest/host physical address mappings Yuanhan Liu
@ 2016-08-23  8:10 ` Yuanhan Liu
  2016-08-23 12:27   ` Maxime Coquelin
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 4/6] vhost: add Tx zero copy Yuanhan Liu
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23  8:10 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

So far, we retrieve both the used ring idx and the avail ring idx from
the last_used_idx var; that isn't a problem because the used ring is
updated immediately after those avail entries are consumed.

But that's no longer true when Tx zero copy is enabled: the used ring is
updated only when the mbuf is consumed. Thus, we need another var to
note the last avail ring idx we have consumed.

Therefore, last_avail_idx is introduced.
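
Here is a small sketch of how the two indexes can drift apart once the
used ring update is deferred. "toy_vq" and its helpers are hypothetical;
only the names last_avail_idx and last_used_idx come from this patch.
The indexes are free-running 16-bit counters, as in the vring, and are
masked with (size - 1) only when used as a ring slot.

```c
#include <assert.h>
#include <stdint.h>

struct toy_vq {
	uint16_t size;		 /* ring size, a power of two */
	uint16_t avail_idx;	 /* stand-in for vq->avail->idx */
	uint16_t last_avail_idx; /* next avail entry to consume */
	uint16_t last_used_idx;	 /* next used slot to fill */
};

/* Entries the driver has made available that we have not consumed. */
static uint16_t
toy_free_entries(const struct toy_vq *vq)
{
	/* uint16_t subtraction handles index wrap-around */
	return (uint16_t)(vq->avail_idx - vq->last_avail_idx);
}

/* Consume up to count avail entries; the used ring is NOT touched. */
static uint16_t
toy_consume(struct toy_vq *vq, uint16_t count)
{
	uint16_t n = toy_free_entries(vq);

	if (count < n)
		n = count;
	vq->last_avail_idx = (uint16_t)(vq->last_avail_idx + n);
	return n;
}

/* Later: return count entries to the driver via the used ring. */
static void
toy_complete(struct toy_vq *vq, uint16_t count)
{
	/* cannot return more than what is consumed but still in flight */
	assert(count <= (uint16_t)(vq->last_avail_idx - vq->last_used_idx));
	vq->last_used_idx = (uint16_t)(vq->last_used_idx + count);
}
```

In the copy path the two indexes always move together, which is why a
single variable used to be enough.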

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/virtio_net.c | 19 +++++++++++--------
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2d52987..8565fa1 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -70,7 +70,7 @@ struct vhost_virtqueue {
 	struct vring_used	*used;
 	uint32_t		size;
 
-	/* Last index used on the available ring */
+	uint16_t		last_avail_idx;
 	volatile uint16_t	last_used_idx;
 #define VIRTIO_INVALID_EVENTFD		(-1)
 #define VIRTIO_UNINITIALIZED_EVENTFD	(-2)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 8a151af..1c2ee47 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -846,16 +846,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 		}
 	}
 
-	avail_idx =  *((volatile uint16_t *)&vq->avail->idx);
-	free_entries = avail_idx - vq->last_used_idx;
+	free_entries = *((volatile uint16_t *)&vq->avail->idx) -
+			vq->last_avail_idx;
 	if (free_entries == 0)
 		goto out;
 
 	LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
 
-	/* Prefetch available ring to retrieve head indexes. */
-	used_idx = vq->last_used_idx & (vq->size - 1);
-	rte_prefetch0(&vq->avail->ring[used_idx]);
+	/* Prefetch available and used ring */
+	avail_idx = vq->last_avail_idx & (vq->size - 1);
+	used_idx  = vq->last_used_idx  & (vq->size - 1);
+	rte_prefetch0(&vq->avail->ring[avail_idx]);
 	rte_prefetch0(&vq->used->ring[used_idx]);
 
 	count = RTE_MIN(count, MAX_PKT_BURST);
@@ -865,8 +866,9 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < count; i++) {
-		used_idx = (vq->last_used_idx + i) & (vq->size - 1);
-		desc_indexes[i] = vq->avail->ring[used_idx];
+		avail_idx = (vq->last_avail_idx + i) & (vq->size - 1);
+		used_idx  = (vq->last_used_idx  + i) & (vq->size - 1);
+		desc_indexes[i] = vq->avail->ring[avail_idx];
 
 		vq->used->ring[used_idx].id  = desc_indexes[i];
 		vq->used->ring[used_idx].len = 0;
@@ -900,7 +902,8 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	rte_smp_wmb();
 	rte_smp_rmb();
 	vq->used->idx += i;
-	vq->last_used_idx += i;
+	vq->last_avail_idx += i;
+	vq->last_used_idx  += i;
 	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
 			sizeof(vq->used->idx));
 
-- 
1.9.0

* [dpdk-dev] [PATCH 4/6] vhost: add Tx zero copy
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
                   ` (2 preceding siblings ...)
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 3/6] vhost: introduce last avail idx for Tx Yuanhan Liu
@ 2016-08-23  8:10 ` Yuanhan Liu
  2016-08-23 14:04   ` Maxime Coquelin
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable " Yuanhan Liu
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23  8:10 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

The basic idea of Tx zero copy is that, instead of copying data from
the desc buf, we let the mbuf reference the desc buf addr directly.

Doing so, however, has one major issue: we can't update the used ring
at the end of rte_vhost_dequeue_burst. Because we don't do the copy
here, an update of the used ring would let the driver reclaim the
desc buf. As a result, DPDK might reference a stale memory region.

To update the used ring properly, this patch does several tricks:

- When an mbuf references a desc buf, its refcnt is incremented by 1.

  This pins the mbuf, so that an mbuf free from DPDK won't actually
  free it; instead, refcnt is decremented by 1.

- We chain all those mbufs together (by tailq).

  And we check the list every time on entry to rte_vhost_dequeue_burst,
  to see whether the mbuf has been freed (when refcnt equals 1). If so,
  it means we are the last user of this mbuf and it is safe to update
  the used ring.

- "struct zcopy_mbuf" is introduced, to associate an mbuf with the
  right desc idx.

Tx zero copy is introduced for performance reasons, and some rough tests
show about a 40% performance boost for packet size 1400B. For small
packets (e.g. 64B), it actually slows down a bit. That is expected
because this patch introduces some extra work, which outweighs the
benefit of saving a few bytes of copying.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost.c      |   2 +
 lib/librte_vhost/vhost.h      |  21 ++++++
 lib/librte_vhost/vhost_user.c |  41 +++++++++-
 lib/librte_vhost/virtio_net.c | 169 +++++++++++++++++++++++++++++++++++++-----
 4 files changed, 214 insertions(+), 19 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 46095c3..ab25649 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -141,6 +141,8 @@ init_vring_queue(struct vhost_virtqueue *vq, int qp_idx)
 	/* always set the default vq pair to enabled */
 	if (qp_idx == 0)
 		vq->enabled = 1;
+
+	TAILQ_INIT(&vq->zmbuf_list);
 }
 
 static void
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 8565fa1..718133e 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -36,6 +36,7 @@
 #include <stdint.h>
 #include <stdio.h>
 #include <sys/types.h>
+#include <sys/queue.h>
 #include <unistd.h>
 #include <linux/vhost.h>
 
@@ -61,6 +62,19 @@ struct buf_vector {
 	uint32_t desc_idx;
 };
 
+/*
+ * A structure to hold some fields needed in zero copy code path,
+ * mainly for associating an mbuf with the right desc_idx.
+ */
+struct zcopy_mbuf {
+	struct rte_mbuf *mbuf;
+	uint32_t desc_idx;
+	uint16_t in_use;
+
+	TAILQ_ENTRY(zcopy_mbuf) next;
+};
+TAILQ_HEAD(zcopy_mbuf_list, zcopy_mbuf);
+
 /**
  * Structure contains variables relevant to RX/TX virtqueues.
  */
@@ -85,6 +99,12 @@ struct vhost_virtqueue {
 
 	/* Physical address of used ring, for logging */
 	uint64_t		log_guest_addr;
+
+	uint16_t		nr_zmbuf;
+	uint16_t		zmbuf_size;
+	uint16_t		last_zmbuf_idx;
+	struct zcopy_mbuf	*zmbufs;
+	struct zcopy_mbuf_list	zmbuf_list;
 } __rte_cache_aligned;
 
 /* Old kernels have no such macro defined */
@@ -147,6 +167,7 @@ struct virtio_net {
 	uint32_t		max_guest_pages;
 	struct guest_page       *guest_pages;
 
+	int			tx_zero_copy;
 } __rte_cache_aligned;
 
 /**
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 045d4f0..189b57b 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -180,7 +180,22 @@ static int
 vhost_user_set_vring_num(struct virtio_net *dev,
 			 struct vhost_vring_state *state)
 {
-	dev->virtqueue[state->index]->size = state->num;
+	struct vhost_virtqueue *vq = dev->virtqueue[state->index];
+
+	vq->size = state->num;
+
+	if (dev->tx_zero_copy) {
+		vq->last_zmbuf_idx = 0;
+		vq->zmbuf_size = vq->size * 2;
+		vq->zmbufs = rte_zmalloc(NULL, vq->zmbuf_size *
+					 sizeof(struct zcopy_mbuf), 0);
+		if (vq->zmbufs == NULL) {
+			RTE_LOG(WARNING, VHOST_CONFIG,
+				"failed to allocate mem for zero copy; "
+				"zero copy is force disabled\n");
+			dev->tx_zero_copy = 0;
+		}
+	}
 
 	return 0;
 }
@@ -649,11 +664,32 @@ vhost_user_set_vring_kick(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 	vq->kickfd = file.fd;
 
 	if (virtio_is_ready(dev) && !(dev->flags & VIRTIO_DEV_RUNNING)) {
+		if (dev->tx_zero_copy) {
+			RTE_LOG(INFO, VHOST_CONFIG,
+				"Tx zero copy is enabled\n");
+		}
+
 		if (notify_ops->new_device(dev->vid) == 0)
 			dev->flags |= VIRTIO_DEV_RUNNING;
 	}
 }
 
+static void
+free_zmbufs(struct vhost_virtqueue *vq)
+{
+	struct zcopy_mbuf *zmbuf, *next;
+
+	for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
+	     zmbuf != NULL; zmbuf = next) {
+		next = TAILQ_NEXT(zmbuf, next);
+
+		rte_pktmbuf_free(zmbuf->mbuf);
+		TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
+	}
+
+	rte_free(vq->zmbufs);
+}
+
 /*
  * when virtio is stopped, qemu will send us the GET_VRING_BASE message.
  */
@@ -682,6 +718,9 @@ vhost_user_get_vring_base(struct virtio_net *dev,
 
 	dev->virtqueue[state->index]->kickfd = VIRTIO_UNINITIALIZED_EVENTFD;
 
+	if (dev->tx_zero_copy)
+		free_zmbufs(dev->virtqueue[state->index]);
+
 	return 0;
 }
 
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 1c2ee47..d7e0335 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -678,6 +678,43 @@ make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
 	return 0;
 }
 
+static inline struct zcopy_mbuf * __attribute__((always_inline))
+get_zmbuf(struct vhost_virtqueue *vq)
+{
+	uint16_t i;
+	uint16_t last;
+	int tries = 0;
+
+	/* search [last_zmbuf_idx, zmbuf_size) */
+	i = vq->last_zmbuf_idx;
+	last = vq->zmbuf_size;
+
+again:
+	for (; i < last; i++) {
+		if (vq->zmbufs[i].in_use == 0) {
+			vq->last_zmbuf_idx = i + 1;
+			vq->zmbufs[i].in_use = 1;
+			return &vq->zmbufs[i];
+		}
+	}
+
+	tries++;
+	if (tries == 1) {
+		/* search [0, last_zmbuf_idx) */
+		i = 0;
+		last = vq->last_zmbuf_idx;
+		goto again;
+	}
+
+	return NULL;
+}
+
+static inline void __attribute__((always_inline))
+put_zmbuf(struct zcopy_mbuf *zmbuf)
+{
+	zmbuf->in_use = 0;
+}
+
 static inline int __attribute__((always_inline))
 copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		  struct rte_mbuf *m, uint16_t desc_idx,
@@ -701,6 +738,27 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	if (unlikely(!desc_addr))
 		return -1;
 
+	if (dev->tx_zero_copy) {
+		struct zcopy_mbuf *zmbuf;
+
+		zmbuf = get_zmbuf(vq);
+		if (!zmbuf)
+			return -1;
+		zmbuf->mbuf = m;
+		zmbuf->desc_idx = desc_idx;
+
+		/*
+		 * Pin lock the mbuf; we will check later to see whether
+		 * the mbuf is freed (when we are the last user) or not.
+		 * If that's the case, we then could update the used ring
+		 * safely.
+		 */
+		rte_mbuf_refcnt_update(m, 1);
+
+		vq->nr_zmbuf += 1;
+		TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
+	}
+
 	hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr);
 	rte_prefetch0(hdr);
 
@@ -733,9 +791,28 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	mbuf_avail  = m->buf_len - RTE_PKTMBUF_HEADROOM;
 	while (1) {
 		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
-		rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
-			(void *)((uintptr_t)(desc_addr + desc_offset)),
-			cpy_len);
+		if (dev->tx_zero_copy) {
+			cur->data_len = cpy_len;
+			cur->data_off = 0;
+			cur->buf_addr = (void *)(uintptr_t)desc_addr;
+			/*
+			 * TODO: we need handle the case a desc buf
+			 * acrosses two pages.
+			 */
+			cur->buf_physaddr = gpa_to_hpa(dev, desc->addr +
+						desc_offset, cpy_len);
+
+			/*
+			 * In zero copy mode, one mbuf can only reference data
+			 * from all or part of a single desc buf.
+			 */
+			mbuf_avail = cpy_len;
+		} else {
+			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
+							   mbuf_offset),
+				(void *)((uintptr_t)(desc_addr + desc_offset)),
+				cpy_len);
+		}
 
 		mbuf_avail  -= cpy_len;
 		mbuf_offset += cpy_len;
@@ -796,6 +873,49 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
+static inline void __attribute__((always_inline))
+update_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		 uint32_t used_idx, uint32_t desc_idx)
+{
+	vq->used->ring[used_idx].id  = desc_idx;
+	vq->used->ring[used_idx].len = 0;
+	vhost_log_used_vring(dev, vq,
+			offsetof(struct vring_used, ring[used_idx]),
+			sizeof(vq->used->ring[used_idx]));
+}
+
+static inline void __attribute__((always_inline))
+update_used_idx(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		uint32_t count)
+{
+	if (count == 0)
+		return;
+
+	rte_smp_wmb();
+	rte_smp_rmb();
+
+	vq->used->idx += count;
+	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+			sizeof(vq->used->idx));
+
+	/* Kick guest if required. */
+	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
+			&& (vq->callfd >= 0))
+		eventfd_write(vq->callfd, (eventfd_t)1);
+}
+
+static inline bool __attribute__((always_inline))
+mbuf_is_consumed(struct rte_mbuf *m)
+{
+	while (m) {
+		if (rte_mbuf_refcnt_read(m) > 1)
+			return false;
+		m = m->next;
+	}
+
+	return true;
+}
+
 uint16_t
 rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
@@ -823,6 +943,30 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	if (unlikely(vq->enabled == 0))
 		return 0;
 
+	if (dev->tx_zero_copy) {
+		struct zcopy_mbuf *zmbuf, *next;
+		int nr_updated = 0;
+
+		for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
+		     zmbuf != NULL; zmbuf = next) {
+			next = TAILQ_NEXT(zmbuf, next);
+
+			if (mbuf_is_consumed(zmbuf->mbuf)) {
+				used_idx = vq->last_used_idx++ & (vq->size - 1);
+				update_used_ring(dev, vq, used_idx,
+						 zmbuf->desc_idx);
+				nr_updated += 1;
+
+				TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
+				rte_pktmbuf_free(zmbuf->mbuf);
+				put_zmbuf(zmbuf);
+				vq->nr_zmbuf -= 1;
+			}
+		}
+
+		update_used_idx(dev, vq, nr_updated);
+	}
+
 	/*
 	 * Construct a RARP broadcast packet, and inject it to the "pkts"
 	 * array, to looks like that guest actually send such packet.
@@ -870,11 +1014,8 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 		used_idx  = (vq->last_used_idx  + i) & (vq->size - 1);
 		desc_indexes[i] = vq->avail->ring[avail_idx];
 
-		vq->used->ring[used_idx].id  = desc_indexes[i];
-		vq->used->ring[used_idx].len = 0;
-		vhost_log_used_vring(dev, vq,
-				offsetof(struct vring_used, ring[used_idx]),
-				sizeof(vq->used->ring[used_idx]));
+		if (dev->tx_zero_copy == 0)
+			update_used_ring(dev, vq, used_idx, desc_indexes[i]);
 	}
 
 	/* Prefetch descriptor index. */
@@ -898,19 +1039,11 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 			break;
 		}
 	}
-
-	rte_smp_wmb();
-	rte_smp_rmb();
-	vq->used->idx += i;
 	vq->last_avail_idx += i;
 	vq->last_used_idx  += i;
-	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
-			sizeof(vq->used->idx));
 
-	/* Kick guest if required. */
-	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-			&& (vq->callfd >= 0))
-		eventfd_write(vq->callfd, (eventfd_t)1);
+	if (dev->tx_zero_copy == 0)
+		update_used_idx(dev, vq, i);
 
 out:
 	if (unlikely(rarp_mbuf != NULL)) {
-- 
1.9.0


* [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
                   ` (3 preceding siblings ...)
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 4/6] vhost: add Tx zero copy Yuanhan Liu
@ 2016-08-23  8:10 ` Yuanhan Liu
  2016-09-06  9:00   ` Xu, Qian Q
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 6/6] examples/vhost: add an option " Yuanhan Liu
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23  8:10 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

Add a new flag ``RTE_VHOST_USER_TX_ZERO_COPY`` to explicitly enable
Tx zero copy. If not given, Tx zero copy is disabled by default.
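
As a rough illustration of the opt-in semantics, here is a minimal
sketch of how the flag bits could be decoded. The flag values are the
ones this patch defines; "struct socket_opts" and decode_flags() are
hypothetical names used only for this example and are not part of the
library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in rte_virtio_net.h by this patch. */
#define RTE_VHOST_USER_CLIENT		(1ULL << 0)
#define RTE_VHOST_USER_NO_RECONNECT	(1ULL << 1)
#define RTE_VHOST_USER_TX_ZERO_COPY	(1ULL << 2)

/* Hypothetical stand-in for the per-socket settings; not the real struct. */
struct socket_opts {
	bool is_client;
	bool reconnect;
	bool tx_zero_copy;
};

static struct socket_opts
decode_flags(uint64_t flags)
{
	struct socket_opts o;

	o.is_client = (flags & RTE_VHOST_USER_CLIENT) != 0;
	/* Reconnect only applies in client mode and is on unless disabled. */
	o.reconnect = o.is_client && !(flags & RTE_VHOST_USER_NO_RECONNECT);
	/* Tx zero copy stays off unless the caller asks for it. */
	o.tx_zero_copy = (flags & RTE_VHOST_USER_TX_ZERO_COPY) != 0;

	return o;
}
```

Passing 0 as flags leaves tx_zero_copy false, which matches the
"disabled by default" behaviour described above.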

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 doc/guides/prog_guide/vhost_lib.rst |  7 ++++++-
 lib/librte_vhost/rte_virtio_net.h   |  1 +
 lib/librte_vhost/socket.c           |  5 +++++
 lib/librte_vhost/vhost.c            | 10 ++++++++++
 lib/librte_vhost/vhost.h            |  1 +
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index 6b0c6b2..15c2bf7 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -79,7 +79,7 @@ The following is an overview of the Vhost API functions:
   ``/dev/path`` character device file will be created. For vhost-user server
   mode, a Unix domain socket file ``path`` will be created.
 
-  Currently two flags are supported (these are valid for vhost-user only):
+  Currently supported flags are (these are valid for vhost-user only):
 
   - ``RTE_VHOST_USER_CLIENT``
 
@@ -97,6 +97,11 @@ The following is an overview of the Vhost API functions:
     This reconnect option is enabled by default. However, it can be turned off
     by setting this flag.
 
+  - ``RTE_VHOST_USER_TX_ZERO_COPY``
+
+    Tx zero copy will be enabled when this flag is set. It is disabled by
+    default.
+
 * ``rte_vhost_driver_session_start()``
 
   This function starts the vhost session loop to handle vhost messages. It
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 9caa622..5e437c6 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -53,6 +53,7 @@
 
 #define RTE_VHOST_USER_CLIENT		(1ULL << 0)
 #define RTE_VHOST_USER_NO_RECONNECT	(1ULL << 1)
+#define RTE_VHOST_USER_TX_ZERO_COPY	(1ULL << 2)
 
 /* Enum for virtqueue management. */
 enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index bf03f84..5c3962d 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -62,6 +62,7 @@ struct vhost_user_socket {
 	int connfd;
 	bool is_server;
 	bool reconnect;
+	bool tx_zero_copy;
 };
 
 struct vhost_user_connection {
@@ -203,6 +204,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	size = strnlen(vsocket->path, PATH_MAX);
 	vhost_set_ifname(vid, vsocket->path, size);
 
+	if (vsocket->tx_zero_copy)
+		vhost_enable_tx_zero_copy(vid);
+
 	RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
 
 	vsocket->connfd = fd;
@@ -499,6 +503,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 	memset(vsocket, 0, sizeof(struct vhost_user_socket));
 	vsocket->path = strdup(path);
 	vsocket->connfd = -1;
+	vsocket->tx_zero_copy = flags & RTE_VHOST_USER_TX_ZERO_COPY;
 
 	if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
 		vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index ab25649..5461e5b 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -290,6 +290,16 @@ vhost_set_ifname(int vid, const char *if_name, unsigned int if_len)
 	dev->ifname[sizeof(dev->ifname) - 1] = '\0';
 }
 
+void
+vhost_enable_tx_zero_copy(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->tx_zero_copy = 1;
+}
 
 int
 rte_vhost_get_numa_node(int vid)
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 718133e..3081180 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -279,6 +279,7 @@ void vhost_destroy_device(int);
 int alloc_vring_queue_pair(struct virtio_net *dev, uint32_t qp_idx);
 
 void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
+void vhost_enable_tx_zero_copy(int vid);
 
 /*
  * Backend-specific cleanup. Defined by vhost-cuse and vhost-user.
-- 
1.9.0


* [dpdk-dev] [PATCH 6/6] examples/vhost: add an option to enable Tx zero copy
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
                   ` (4 preceding siblings ...)
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable " Yuanhan Liu
@ 2016-08-23  8:10 ` Yuanhan Liu
  2016-08-23  9:31   ` Thomas Monjalon
  2016-08-23 14:14   ` Maxime Coquelin
  2016-08-23 14:18 ` [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Maxime Coquelin
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23  8:10 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

Add an option, --tx-zero-copy, to enable Tx zero copy.

One thing worth noting when using Tx zero copy is that nb_tx_desc has
to be small enough that the eth driver hits the mbuf free threshold
easily and thus frees mbufs more frequently.

The reason is that, when Tx zero copy is enabled, the guest Tx used
vring is updated only when the corresponding mbuf is freed. If mbufs
are not freed frequently, the guest Tx vring could be starved.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 examples/vhost/main.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 9974f0b..e3437ad 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -130,6 +130,7 @@ static uint32_t enable_tx_csum;
 static uint32_t enable_tso;
 
 static int client_mode;
+static int tx_zero_copy;
 
 /* Specify timeout (in useconds) between retries on RX. */
 static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US;
@@ -297,6 +298,17 @@ port_init(uint8_t port)
 
 	rx_ring_size = RTE_TEST_RX_DESC_DEFAULT;
 	tx_ring_size = RTE_TEST_TX_DESC_DEFAULT;
+
+	/*
+	 * When Tx zero copy is enabled, the guest Tx used vring is updated
+	 * only when the corresponding mbuf is freed. Thus, nb_tx_desc
+	 * (tx_ring_size here) must be small enough that the driver hits
+	 * the free threshold easily and frees mbufs promptly. Otherwise,
+	 * the guest Tx vring would be starved.
+	 */
+	if (tx_zero_copy)
+		tx_ring_size = 64;
+
 	tx_rings = (uint16_t)rte_lcore_count();
 
 	retval = validate_num_devices(MAX_DEVICES);
@@ -474,7 +486,8 @@ us_vhost_usage(const char *prgname)
 	"		--socket-file: The path of the socket file.\n"
 	"		--tx-csum [0|1] disable/enable TX checksum offload.\n"
 	"		--tso [0|1] disable/enable TCP segment offload.\n"
-	"		--client register a vhost-user socket as client mode.\n",
+	"		--client register a vhost-user socket as client mode.\n"
+	"		--tx-zero-copy enables Tx zero copy\n",
 	       prgname);
 }
 
@@ -500,6 +513,7 @@ us_vhost_parse_args(int argc, char **argv)
 		{"tx-csum", required_argument, NULL, 0},
 		{"tso", required_argument, NULL, 0},
 		{"client", no_argument, &client_mode, 1},
+		{"tx-zero-copy", no_argument, &tx_zero_copy, 1},
 		{NULL, 0, 0, 0},
 	};
 
@@ -1531,6 +1545,9 @@ main(int argc, char *argv[])
 	if (client_mode)
 		flags |= RTE_VHOST_USER_CLIENT;
 
+	if (tx_zero_copy)
+		flags |= RTE_VHOST_USER_TX_ZERO_COPY;
+
 	/* Register vhost user driver to handle vhost messages. */
 	for (i = 0; i < nb_sockets; i++) {
 		ret = rte_vhost_driver_register
-- 
1.9.0


* Re: [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling Yuanhan Liu
@ 2016-08-23  9:17   ` Maxime Coquelin
  2016-08-24  7:26   ` Xu, Qian Q
  1 sibling, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23  9:17 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
> Due to history reason (that vhost-cuse comes before vhost-user), some
> fields for maintaining the vhost-user memory mappings (such as mmapped
> address and size, with those we then can unmap on destroy) are kept in
> "orig_region_map" struct, a structure that is defined only in vhost-user
> source file.
>
> The right way to go is to remove the structure and move all those fields
> into virtio_memory_region struct. But we simply can't do that before,
> because it breaks the ABI.
>
> Now, thanks to the ABI refactoring, it's never been a blocking issue
> any more. And here it goes: this patch removes orig_region_map and
> redefines virtio_memory_region, to include all necessary info.
>
> With that, we can simplify the guest/host address convert a bit.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/vhost.h      |  49 ++++++------
>  lib/librte_vhost/vhost_user.c | 172 +++++++++++++++++-------------------------
>  2 files changed, 90 insertions(+), 131 deletions(-)
>

Thanks for explaining the history behind this.
FWIW, the change looks good to me:

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


* Re: [dpdk-dev] [PATCH 6/6] examples/vhost: add an option to enable Tx zero copy
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 6/6] examples/vhost: add an option " Yuanhan Liu
@ 2016-08-23  9:31   ` Thomas Monjalon
  2016-08-23 12:33     ` Yuanhan Liu
  2016-08-23 14:14   ` Maxime Coquelin
  1 sibling, 1 reply; 75+ messages in thread
From: Thomas Monjalon @ 2016-08-23  9:31 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Maxime Coquelin

2016-08-23 16:10, Yuanhan Liu:
> One thing worth noting while using Tx zero copy is the nb_tx_desc has
> to be small enough so that the eth driver will hit the mbuf free
> threshold easily and thus free mbuf more frequently.
> 
> The reason behind that is, when Tx zero copy is enabled, guest Tx used
> vring will be updated only when corresponding mbuf is freed. If mbuf is
> not freed frequently, the guest Tx vring could be starved.

I think you should explain this behaviour in the doc of the vhost flag.


* Re: [dpdk-dev] [PATCH 2/6] vhost: get guest/host physical address mappings
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 2/6] vhost: get guest/host physical address mappings Yuanhan Liu
@ 2016-08-23  9:58   ` Maxime Coquelin
  2016-08-23 12:32     ` Yuanhan Liu
  0 siblings, 1 reply; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23  9:58 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
> So that we can convert a guest physical address to host physical
> address, which will be used in later Tx zero copy implementation.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/vhost.h      | 30 +++++++++++++++
>  lib/librte_vhost/vhost_user.c | 86 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 116 insertions(+)
>
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index df2107b..2d52987 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -114,6 +114,12 @@ struct vhost_virtqueue {
>   #define VIRTIO_F_VERSION_1 32
>  #endif
>
> +struct guest_page {
> +	uint64_t guest_phys_addr;
> +	uint64_t host_phys_addr;
> +	uint64_t size;
> +};
> +
>  /**
>   * Device structure contains all configuration information relating
>   * to the device.
> @@ -137,6 +143,10 @@ struct virtio_net {
>  	uint64_t		log_addr;
>  	struct ether_addr	mac;
>
> +	uint32_t		nr_guest_pages;
> +	uint32_t		max_guest_pages;
> +	struct guest_page       *guest_pages;
> +
>  } __rte_cache_aligned;
>
>  /**
> @@ -217,6 +227,26 @@ gpa_to_vva(struct virtio_net *dev, uint64_t gpa)
>  	return 0;
>  }
>
> +/* Convert guest physical address to host physical address */
> +static inline phys_addr_t __attribute__((always_inline))
> +gpa_to_hpa(struct virtio_net *dev, uint64_t gpa, uint64_t size)
> +{
> +	uint32_t i;
> +	struct guest_page *page;
> +
> +	for (i = 0; i < dev->nr_guest_pages; i++) {
> +		page = &dev->guest_pages[i];
> +
> +		if (gpa >= page->guest_phys_addr &&
> +		    gpa + size < page->guest_phys_addr + page->size) {
Shouldn't it be '<=' here?

> +			return gpa - page->guest_phys_addr +
> +			       page->host_phys_addr;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  struct virtio_net_device_ops const *notify_ops;
>  struct virtio_net *get_device(int vid);
>
> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> index d2071fd..045d4f0 100644
> --- a/lib/librte_vhost/vhost_user.c
> +++ b/lib/librte_vhost/vhost_user.c
> @@ -372,6 +372,81 @@ vhost_user_set_vring_base(struct virtio_net *dev,
>  	return 0;
>  }
>
> +static void
> +add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
> +		   uint64_t host_phys_addr, uint64_t size)
> +{
> +	struct guest_page *page;
> +
> +	if (dev->nr_guest_pages == dev->max_guest_pages) {
> +		dev->max_guest_pages *= 2;
> +		dev->guest_pages = realloc(dev->guest_pages,
> +					dev->max_guest_pages * sizeof(*page));

Maybe realloc return could be checked?

> +	}
> +
> +	page = &dev->guest_pages[dev->nr_guest_pages++];
> +	page->guest_phys_addr = guest_phys_addr;
> +	page->host_phys_addr  = host_phys_addr;
> +	page->size = size;
> +}
> +
> +static void
> +add_guest_pages(struct virtio_net *dev, struct virtio_memory_region *reg,
> +		uint64_t page_size)
> +{
> +	uint64_t reg_size = reg->size;
> +	uint64_t host_user_addr  = reg->host_user_addr;
> +	uint64_t guest_phys_addr = reg->guest_phys_addr;
> +	uint64_t host_phys_addr;
> +	uint64_t size;
> +	uint32_t pre_read;
> +
> +	pre_read = *((uint32_t *)(uintptr_t)host_user_addr);
> +	host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)host_user_addr);
> +	size = page_size - (guest_phys_addr & (page_size - 1));
> +	size = RTE_MIN(size, reg_size);
> +
> +	add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size);
> +	host_user_addr  += size;
> +	guest_phys_addr += size;
> +	reg_size -= size;
> +
> +	while (reg_size > 0) {
> +		pre_read += *((uint32_t *)(uintptr_t)host_user_addr);
> +		host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)host_user_addr);
> +		add_one_guest_page(dev, guest_phys_addr, host_phys_addr, page_size);
> +
> +		host_user_addr  += page_size;
> +		guest_phys_addr += page_size;
> +		reg_size -= page_size;
> +	}
> +
> +	/* FIXME */
> +	RTE_LOG(INFO, VHOST_CONFIG, ":: %u ::\n", pre_read);
For my information, what is the purpose of pre_read?

> +}
> +
> +/* TODO: enable it only in debug mode? */
> +static void
> +dump_guest_pages(struct virtio_net *dev)
> +{
> +	uint32_t i;
> +	struct guest_page *page;
> +
> +	for (i = 0; i < dev->nr_guest_pages; i++) {
> +		page = &dev->guest_pages[i];
> +
> +		RTE_LOG(INFO, VHOST_CONFIG,
> +			"guest physical page region %u\n"
> +			"\t guest_phys_addr: %" PRIx64 "\n"
> +			"\t host_phys_addr : %" PRIx64 "\n"
> +			"\t size           : %" PRIx64 "\n",
> +			i,
> +			page->guest_phys_addr,
> +			page->host_phys_addr,
> +			page->size);
> +	}
> +}
> +
>  static int
>  vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
>  {
> @@ -396,6 +471,13 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
>  		dev->mem = NULL;
>  	}
>
> +	dev->nr_guest_pages = 0;
> +	if (!dev->guest_pages) {
> +		dev->max_guest_pages = 8;
> +		dev->guest_pages = malloc(dev->max_guest_pages *
> +						sizeof(struct guest_page));
> +	}
> +
>  	dev->mem = rte_zmalloc("vhost-mem-table", sizeof(struct virtio_memory) +
>  		sizeof(struct virtio_memory_region) * memory.nregions, 0);
>  	if (dev->mem == NULL) {
> @@ -447,6 +529,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
>  		reg->mmap_size = mmap_size;
>  		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;
>
> +		add_guest_pages(dev, reg, alignment);
> +
>  		RTE_LOG(INFO, VHOST_CONFIG,
>  			"guest memory region %u, size: 0x%" PRIx64 "\n"
>  			"\t guest physical addr: 0x%" PRIx64 "\n"
> @@ -466,6 +550,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
>  			mmap_offset);
>  	}
>
> +	dump_guest_pages(dev);
> +
>  	return 0;
>
>  err_mmap:
>


* Re: [dpdk-dev] [PATCH 3/6] vhost: introduce last avail idx for Tx
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 3/6] vhost: introduce last avail idx for Tx Yuanhan Liu
@ 2016-08-23 12:27   ` Maxime Coquelin
  0 siblings, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23 12:27 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
> So far, we retrieve both the used ring avail ring idx by last_used_idx
> var; it won't be a problem because we used ring is updated immediately
> after those avail entries are consumed.
>
> But that's not true when Tx zero copy is enabled, that used ring is updated
> only when the mbuf is consumed. Thus, we need use another var to note
> the last avail ring idx we have consumed.
>
> Therefore, last_avail_idx is introduced.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/vhost.h      |  2 +-
>  lib/librte_vhost/virtio_net.c | 19 +++++++++++--------
>  2 files changed, 12 insertions(+), 9 deletions(-)

Looks good to me:
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


* Re: [dpdk-dev] [PATCH 2/6] vhost: get guest/host physical address mappings
  2016-08-23  9:58   ` Maxime Coquelin
@ 2016-08-23 12:32     ` Yuanhan Liu
  2016-08-23 13:25       ` Maxime Coquelin
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23 12:32 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev

On Tue, Aug 23, 2016 at 11:58:42AM +0200, Maxime Coquelin wrote:
> >
> >+/* Convert guest physical address to host physical address */
> >+static inline phys_addr_t __attribute__((always_inline))
> >+gpa_to_hpa(struct virtio_net *dev, uint64_t gpa, uint64_t size)
> >+{
> >+	uint32_t i;
> >+	struct guest_page *page;
> >+
> >+	for (i = 0; i < dev->nr_guest_pages; i++) {
> >+		page = &dev->guest_pages[i];
> >+
> >+		if (gpa >= page->guest_phys_addr &&
> >+		    gpa + size < page->guest_phys_addr + page->size) {
> Shouldn't be '<=' here?

Oops, you are right.
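
To make the boundary case concrete, here is a small standalone sketch
of the lookup with the '<=' comparison. It mirrors the guest_page
layout from the patch, but gpa_to_hpa_sketch() and its explicit
pages/nr_pages arguments are assumptions made so that the example
compiles on its own; it is not the library function.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified copy of the guest_page layout from the patch. */
struct guest_page {
	uint64_t guest_phys_addr;
	uint64_t host_phys_addr;
	uint64_t size;
};

/*
 * Return the host physical address for [gpa, gpa + size) if the whole
 * range fits inside one page, or 0 if no single page covers it.
 */
static uint64_t
gpa_to_hpa_sketch(const struct guest_page *pages, uint32_t nr_pages,
		  uint64_t gpa, uint64_t size)
{
	uint32_t i;

	for (i = 0; i < nr_pages; i++) {
		const struct guest_page *page = &pages[i];

		/* '<=' so that a buffer ending exactly at the page end matches. */
		if (gpa >= page->guest_phys_addr &&
		    gpa + size <= page->guest_phys_addr + page->size) {
			return gpa - page->guest_phys_addr +
			       page->host_phys_addr;
		}
	}

	return 0;
}
```

With '<', a buffer whose last byte is the last byte of the page would
be rejected even though it lies entirely inside that page.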

> >+			return gpa - page->guest_phys_addr +
> >+			       page->host_phys_addr;
> >+		}
> >+	}
> >+
> >+	return 0;
> >+}
> >+
> > struct virtio_net_device_ops const *notify_ops;
> > struct virtio_net *get_device(int vid);
> >
> >diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> >index d2071fd..045d4f0 100644
> >--- a/lib/librte_vhost/vhost_user.c
> >+++ b/lib/librte_vhost/vhost_user.c
> >@@ -372,6 +372,81 @@ vhost_user_set_vring_base(struct virtio_net *dev,
> > 	return 0;
> > }
> >
> >+static void
> >+add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
> >+		   uint64_t host_phys_addr, uint64_t size)
> >+{
> >+	struct guest_page *page;
> >+
> >+	if (dev->nr_guest_pages == dev->max_guest_pages) {
> >+		dev->max_guest_pages *= 2;
> >+		dev->guest_pages = realloc(dev->guest_pages,
> >+					dev->max_guest_pages * sizeof(*page));
> 
> Maybe realloc return could be checked?

Yes, I should have done that. Besides, I also forgot to free it
anywhere. Will fix it.
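
For reference, a checked version could look roughly like the sketch
below. "struct page_table" is a hypothetical container standing in for
the three fields added to virtio_net, and both function names are made
up for this example; the actual fix may well look different.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct guest_page {
	uint64_t guest_phys_addr;
	uint64_t host_phys_addr;
	uint64_t size;
};

/* Hypothetical container standing in for the fields added to virtio_net. */
struct page_table {
	uint32_t nr_guest_pages;
	uint32_t max_guest_pages;
	struct guest_page *guest_pages;
};

/* Returns 0 on success, -1 if the array could not be grown. */
static int
add_one_guest_page_checked(struct page_table *t, uint64_t gpa,
			   uint64_t hpa, uint64_t size)
{
	struct guest_page *page;

	if (t->nr_guest_pages == t->max_guest_pages) {
		uint32_t new_max = t->max_guest_pages ?
				   t->max_guest_pages * 2 : 8;
		struct guest_page *tmp;

		/* Keep the old pointer until realloc is known to have worked. */
		tmp = realloc(t->guest_pages, new_max * sizeof(*tmp));
		if (tmp == NULL)
			return -1;

		t->guest_pages = tmp;
		t->max_guest_pages = new_max;
	}

	page = &t->guest_pages[t->nr_guest_pages++];
	page->guest_phys_addr = gpa;
	page->host_phys_addr  = hpa;
	page->size = size;

	return 0;
}

/* Release the array, e.g. when the device is destroyed. */
static void
free_guest_pages(struct page_table *t)
{
	free(t->guest_pages);
	t->guest_pages = NULL;
	t->nr_guest_pages = 0;
	t->max_guest_pages = 0;
}
```

Assigning the realloc result to a temporary first means a failure does
not lose (and leak) the existing array.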

> 
> >+	}
> >+
> >+	page = &dev->guest_pages[dev->nr_guest_pages++];
> >+	page->guest_phys_addr = guest_phys_addr;
> >+	page->host_phys_addr  = host_phys_addr;
> >+	page->size = size;
> >+}
> >+
> >+static void
> >+add_guest_pages(struct virtio_net *dev, struct virtio_memory_region *reg,
> >+		uint64_t page_size)
> >+{
> >+	uint64_t reg_size = reg->size;
> >+	uint64_t host_user_addr  = reg->host_user_addr;
> >+	uint64_t guest_phys_addr = reg->guest_phys_addr;
> >+	uint64_t host_phys_addr;
> >+	uint64_t size;
> >+	uint32_t pre_read;
> >+
> >+	pre_read = *((uint32_t *)(uintptr_t)host_user_addr);
> >+	host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)host_user_addr);
> >+	size = page_size - (guest_phys_addr & (page_size - 1));
> >+	size = RTE_MIN(size, reg_size);
> >+
> >+	add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size);
> >+	host_user_addr  += size;
> >+	guest_phys_addr += size;
> >+	reg_size -= size;
> >+
> >+	while (reg_size > 0) {
> >+		pre_read += *((uint32_t *)(uintptr_t)host_user_addr);
> >+		host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)host_user_addr);
> >+		add_one_guest_page(dev, guest_phys_addr, host_phys_addr, page_size);
> >+
> >+		host_user_addr  += page_size;
> >+		guest_phys_addr += page_size;
> >+		reg_size -= page_size;
> >+	}
> >+
> >+	/* FIXME */
> >+	RTE_LOG(INFO, VHOST_CONFIG, ":: %u ::\n", pre_read);
> For my information, what is the purpose of pre_read?

Again, I put a FIXME here but forgot to add an explanation.

Here is the thing: the read makes sure the kernel populates the
corresponding PTE, so that rte_mem_virt2phy() returns a proper
physical address; otherwise, an invalid value is returned.

I can't simply do the read without actually referencing/consuming the
result. Otherwise, the compiler will treat it as a no-op and remove it.

An ugly RTE_LOG makes sure the read operation is not eliminated.
I'm looking for a more proper way to achieve that. Maybe I can add a new
field in the virtio_net structure and store it there.

Or, do you have better ideas?
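
One generic C technique for keeping such a read alive, offered only as
a sketch and not as what the patch does, is to read through a
volatile-qualified pointer. touch_pages() below is a hypothetical
helper; it assumes addr is page-aligned and that the caller passes the
real page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read one byte in every page of [addr, addr + len). A read through a
 * volatile-qualified pointer is an observable access, so the compiler
 * has to perform each load even though the value is discarded.
 * Assumes addr is page-aligned. Returns the number of pages touched.
 */
static size_t
touch_pages(const void *addr, size_t len, size_t page_size)
{
	const volatile uint8_t *p = addr;
	size_t off;
	size_t n = 0;

	for (off = 0; off < len; off += page_size) {
		(void)p[off];
		n++;
	}

	return n;
}
```

This avoids both the log message and the extra field, at the cost of
one explicit pass over the region.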

	--yliu


* Re: [dpdk-dev] [PATCH 6/6] examples/vhost: add an option to enable Tx zero copy
  2016-08-23  9:31   ` Thomas Monjalon
@ 2016-08-23 12:33     ` Yuanhan Liu
  0 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23 12:33 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Maxime Coquelin

On Tue, Aug 23, 2016 at 11:31:08AM +0200, Thomas Monjalon wrote:
> 2016-08-23 16:10, Yuanhan Liu:
> > One thing worth noting while using Tx zero copy is the nb_tx_desc has
> > to be small enough so that the eth driver will hit the mbuf free
> > threshold easily and thus free mbuf more frequently.
> > 
> > The reason behind that is, when Tx zero copy is enabled, guest Tx used
> > vring will be updated only when corresponding mbuf is freed. If mbuf is
> > not freed frequently, the guest Tx vring could be starved.
> 
> I think you should explain this behaviour in the doc of the vhost flag.

Agreed. Will do it in v2.

	--yliu


* Re: [dpdk-dev] [PATCH 2/6] vhost: get guest/host physical address mappings
  2016-08-23 12:32     ` Yuanhan Liu
@ 2016-08-23 13:25       ` Maxime Coquelin
  2016-08-23 13:49         ` Yuanhan Liu
  0 siblings, 1 reply; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23 13:25 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev



On 08/23/2016 02:32 PM, Yuanhan Liu wrote:
>>> +
>>> > >+	/* FIXME */
>>> > >+	RTE_LOG(INFO, VHOST_CONFIG, ":: %u ::\n", pre_read);
>> > For my information, what is the purpose of pre_read?
> Again, I put a FIXME here, but I forgot to add some explanation.
>
> Here is the thing: the read will make sure the kernel populate the
> corresponding PTE entry, so that rte_mem_virt2phy() will return proper
> physical address, otherwise, invalid value is returned.
>
> I can't simply do the read but do not actually reference/consume it.
> Otherwise, the compiler will treat it as some noops and remove it.
>
> An ugly RTE_LOG will make sure the read operation is not eliminated.
> I'm seeking a more proper way to achieve that. Maybe I can add a new
> field in virtio_net structure and store it there.
>
> Or, do you have better ideas?

This behavior is pretty twisted, no?
Shouldn't it be rte_mem_virt2phy()'s role to ensure that a valid value
is returned?

I have no better idea for now, but I will think about it.

Regards,
Maxime


* Re: [dpdk-dev] [PATCH 2/6] vhost: get guest/host physical address mappings
  2016-08-23 13:25       ` Maxime Coquelin
@ 2016-08-23 13:49         ` Yuanhan Liu
  2016-08-23 14:05           ` Maxime Coquelin
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23 13:49 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev

On Tue, Aug 23, 2016 at 03:25:33PM +0200, Maxime Coquelin wrote:
> 
> 
> On 08/23/2016 02:32 PM, Yuanhan Liu wrote:
> >>>+
> >>>> >+	/* FIXME */
> >>>> >+	RTE_LOG(INFO, VHOST_CONFIG, ":: %u ::\n", pre_read);
> >>> For my information, what is the purpose of pre_read?
> >Again, I put a FIXME here, but I forgot to add some explanation.
> >
> >Here is the thing: the read will make sure the kernel populate the
> >corresponding PTE entry, so that rte_mem_virt2phy() will return proper
> >physical address, otherwise, invalid value is returned.
> >
> >I can't simply do the read but do not actually reference/consume it.
> >Otherwise, the compiler will treat it as some noops and remove it.
> >
> >An ugly RTE_LOG will make sure the read operation is not eliminated.
> >I'm seeking a more proper way to achieve that. Maybe I can add a new
> >field in virtio_net structure and store it there.
> >
> >Or, do you have better ideas?
> 
> This behavior is pretty twisted, no?

I have to say, yes, kind of.

> Shouldn't be rte_mem_virt2phy() role to ensure returning a valid value?

Not exactly. I think rte_mem_virt2phy() is mostly meant to fetch the
physical address of huge pages. And for those huge pages, EAL makes
sure they are populated: it used to do a zero memset to achieve
that. Since 5ce3ace1de45 ("eal: remove unnecessary hugepage
zero-filling"), it uses the MAP_POPULATE option instead.

So, thank you for reminding me of the MAP_POPULATE option.
I just had a quick try, and it worked like a charm :)
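
For readers not familiar with the option, here is a minimal
Linux-only sketch of what passing MAP_POPULATE looks like.
map_populated() is a hypothetical helper, and it uses an anonymous
private mapping only to stay self-contained; the vhost-user code maps
guest memory from a file descriptor instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map 'len' bytes of anonymous memory and ask the kernel to prefault
 * the pages at mmap() time (MAP_POPULATE is Linux-specific).
 * Returns NULL on failure.
 */
static void *
map_populated(size_t len)
{
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	return addr == MAP_FAILED ? NULL : addr;
}
```

With the pages prefaulted by the kernel, no dummy read is needed
before looking up physical addresses.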

	--yliu

> I have no better idea for now, but I will think about it.
> 
> Regards,
> Maxime


* Re: [dpdk-dev] [PATCH 4/6] vhost: add Tx zero copy
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 4/6] vhost: add Tx zero copy Yuanhan Liu
@ 2016-08-23 14:04   ` Maxime Coquelin
  2016-08-23 14:31     ` Yuanhan Liu
  0 siblings, 1 reply; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23 14:04 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
> The basic idea of Tx zero copy is, instead of copying data from the
> desc buf, here we let the mbuf reference the desc buf addr directly.
>
> Doing so, however, has one major issue: we can't update the used ring
> at the end of rte_vhost_dequeue_burst. Because we don't do the copy
> here, an update of the used ring would let the driver to reclaim the
> desc buf. As a result, DPDK might reference a stale memory region.
>
> To update the used ring properly, this patch does several tricks:
>
> - when mbuf references a desc buf, refcnt is added by 1.
>
>   This is to pin lock the mbuf, so that a mbuf free from the DPDK
>   won't actually free it, instead, refcnt is subtracted by 1.
>
> - We chain all those mbuf together (by tailq)
>
>   And we check it every time on the rte_vhost_dequeue_burst entrance,
>   to see if the mbuf is freed (when refcnt equals to 1). If that
>   happens, it means we are the last user of this mbuf and we are
>   safe to update the used ring.
>
> - "struct zcopy_mbuf" is introduced, to associate an mbuf with the
>   right desc idx.
>
> Tx zero copy is introduced for performance reason, and some rough tests
> show about 40% perfomance boost for packet size 1400B. FOr small packets,
> (e.g. 64B), it actually slows a bit down. That is expected because this
> patch introduces some extra works, and it outweighs the benefit from
> saving few bytes copy.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/vhost.c      |   2 +
>  lib/librte_vhost/vhost.h      |  21 ++++++
>  lib/librte_vhost/vhost_user.c |  41 +++++++++-
>  lib/librte_vhost/virtio_net.c | 169 +++++++++++++++++++++++++++++++++++++-----
>  4 files changed, 214 insertions(+), 19 deletions(-)
>
...

>  rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
> @@ -823,6 +943,30 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>  	if (unlikely(vq->enabled == 0))
>  		return 0;
>
> +	if (dev->tx_zero_copy) {
> +		struct zcopy_mbuf *zmbuf, *next;
> +		int nr_updated = 0;
> +
> +		for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
> +		     zmbuf != NULL; zmbuf = next) {
> +			next = TAILQ_NEXT(zmbuf, next);
> +
> +			if (mbuf_is_consumed(zmbuf->mbuf)) {
> +				used_idx = vq->last_used_idx++ & (vq->size - 1);
> +				update_used_ring(dev, vq, used_idx,
> +						 zmbuf->desc_idx);
> +				nr_updated += 1;
> +
> +				TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
> +				rte_pktmbuf_free(zmbuf->mbuf);
> +				put_zmbuf(zmbuf);
> +				vq->nr_zmbuf -= 1;
> +			}
Shouldn't you break the loop here as soon as a mbuf is not consumed?
Indeed, they might not be consumed sequentially, and would cause
last_used_idx to be incremented whereas it shouldn't.


* Re: [dpdk-dev] [PATCH 2/6] vhost: get guest/host physical address mappings
  2016-08-23 13:49         ` Yuanhan Liu
@ 2016-08-23 14:05           ` Maxime Coquelin
  0 siblings, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23 14:05 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev



On 08/23/2016 03:49 PM, Yuanhan Liu wrote:
> On Tue, Aug 23, 2016 at 03:25:33PM +0200, Maxime Coquelin wrote:
>>
>>
>> On 08/23/2016 02:32 PM, Yuanhan Liu wrote:
>>>>> +
>>>>>>> +	/* FIXME */
>>>>>>> +	RTE_LOG(INFO, VHOST_CONFIG, ":: %u ::\n", pre_read);
>>>>> For my information, what is the purpose of pre_read?
>>> Again, I put a FIXME here, but I forgot to add some explanation.
>>>
>>> Here is the thing: the read will make sure the kernel populate the
>>> corresponding PTE entry, so that rte_mem_virt2phy() will return proper
>>> physical address, otherwise, invalid value is returned.
>>>
>>> I can't simply do the read but do not actually reference/consume it.
>>> Otherwise, the compiler will treat it as some noops and remove it.
>>>
>>> An ugly RTE_LOG will make sure the read operation is not eliminated.
>>> I'm seeking a more proper way to achieve that. Maybe I can add a new
>>> field in virtio_net structure and store it there.
>>>
>>> Or, do you have better ideas?
>>
>> This behavior is pretty twisted, no?
>
> I have to say, yes, kind of.
>
>> Shouldn't be rte_mem_virt2phy() role to ensure returning a valid value?
>
> Not exactly. I think rte_mem_virt2phy() is more likely to fetch the
> physical address of huge pages. And for those huge pages, EAL makes
> sure they will be populated: it used to do a zero memset before to
> achieve that. Since 5ce3ace1de45 ("eal: remove unnecessary hugepage
> zero-filling"), it uses MAP_POPULATE option instead.
>
> So, thank you that you just remind me of the MAP_POPULATE option.
> I just had a quick try, it worked like a charm :)

Excellent! :)
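
For readers following along, here is a minimal sketch of the MAP_POPULATE
approach discussed above. It assumes an anonymous private mapping for the
sake of being self-contained, whereas vhost maps the guest memory file it
receives; map_prefaulted() is a hypothetical helper, not a DPDK function.
The flag asks the kernel to prefault the page tables at mmap() time, so no
dummy read is needed afterwards to get the pages populated.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_POPULATE and MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map len bytes of anonymous memory and ask the
 * kernel to populate (prefault) the page tables right away.
 * Returns NULL on failure.
 */
static void *
map_prefaulted(size_t len)
{
	void *p;

	assert(len > 0);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

The caller releases the region with munmap() as usual. Note that
MAP_POPULATE is Linux-specific.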


* Re: [dpdk-dev] [PATCH 6/6] examples/vhost: add an option to enable Tx zero copy
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 6/6] examples/vhost: add an option " Yuanhan Liu
  2016-08-23  9:31   ` Thomas Monjalon
@ 2016-08-23 14:14   ` Maxime Coquelin
  2016-08-23 14:45     ` Yuanhan Liu
  1 sibling, 1 reply; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23 14:14 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
> Add an option, --tx-zero-copy, to enable Tx zero copy.
>
> One thing worth noting while using Tx zero copy is the nb_tx_desc has
> to be small enough so that the eth driver will hit the mbuf free
> threshold easily and thus free mbuf more frequently.
>
> The reason behind that is, when Tx zero copy is enabled, guest Tx used
> vring will be updated only when corresponding mbuf is freed. If mbuf is
> not freed frequently, the guest Tx vring could be starved.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  examples/vhost/main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> index 9974f0b..e3437ad 100644
> --- a/examples/vhost/main.c
> +++ b/examples/vhost/main.c
> @@ -130,6 +130,7 @@ static uint32_t enable_tx_csum;
>  static uint32_t enable_tso;
>
>  static int client_mode;
> +static int tx_zero_copy;
>
>  /* Specify timeout (in useconds) between retries on RX. */
>  static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US;
> @@ -297,6 +298,17 @@ port_init(uint8_t port)
>
>  	rx_ring_size = RTE_TEST_RX_DESC_DEFAULT;
>  	tx_ring_size = RTE_TEST_TX_DESC_DEFAULT;
> +
> +	/*
> +	 * When Tx zero copy is enabled, guest Tx used vring will be updated
> +	 * only when corresponding mbuf is freed. Thus, the nb_tx_desc
> +	 * (tx_ring_size here) must be small enough so that the driver will
> +	 * hit the free threshold easily and free mbufs timely. Otherwise,
> +	 * guest Tx vring would be starved.
> +	 */
> +	if (tx_zero_copy)
> +		tx_ring_size = 64;

I have a concern about more complex applications, where the mbufs might
not be consumed sequentially.
If one mbuf gets stuck for a while, whereas all others are consumed,
we would face starvation.
For example, the packet is to be routed to a VM, which is paused,
and the routing thread keeps retrying to enqueue the packet for a while.

Anyway, this feature is optional and off by default, so having the
feature applied is not a blocker.

Thanks!
Maxime


* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
                   ` (5 preceding siblings ...)
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 6/6] examples/vhost: add an option " Yuanhan Liu
@ 2016-08-23 14:18 ` Maxime Coquelin
  2016-08-23 14:42   ` Yuanhan Liu
  2016-08-29  8:32 ` Xu, Qian Q
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23 14:18 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
> This patch set enables vhost Tx zero copy. The majority work goes to
> patch 4: vhost: add Tx zero copy.
>
> The basic idea of Tx zero copy is, instead of copying data from the
> desc buf, here we let the mbuf reference the desc buf addr directly.
>
> The major issue behind that is how and when to update the used ring.
> You could check the commit log of patch 4 for more details.
>
> Patch 5 introduces a new flag, RTE_VHOST_USER_TX_ZERO_COPY, to enable
> Tx zero copy, which is disabled by default.
>
> Few more TODOs are left, including handling a desc buf that is across
> two physical pages, updating release note, etc. Those will be fixed
> in later version. For now, here is a simple one that hopefully it
> shows the idea clearly.
>
> I did some quick tests, the performance gain is quite impressive.
>
> For a simple dequeue workload (running rxonly in vhost-pmd and running
> txonly in guest testpmd), it yields 40+% performance boost for packet
> size 1400B.
>
> For VM2VM iperf test case, it's even better: about 70% boost.

This is indeed impressive.
Somewhere else, you mention that there is a small regression with small
packets. Do you have some figures to share?

Also, with this feature OFF, do you see some regressions for both small
and bigger packets?

Thanks,
Maxime
>
> ---
> Yuanhan Liu (6):
>   vhost: simplify memory regions handling
>   vhost: get guest/host physical address mappings
>   vhost: introduce last avail idx for Tx
>   vhost: add Tx zero copy
>   vhost: add a flag to enable Tx zero copy
>   examples/vhost: add an option to enable Tx zero copy
>
>  doc/guides/prog_guide/vhost_lib.rst |   7 +-
>  examples/vhost/main.c               |  19 ++-
>  lib/librte_vhost/rte_virtio_net.h   |   1 +
>  lib/librte_vhost/socket.c           |   5 +
>  lib/librte_vhost/vhost.c            |  12 ++
>  lib/librte_vhost/vhost.h            | 103 +++++++++----
>  lib/librte_vhost/vhost_user.c       | 297 +++++++++++++++++++++++-------------
>  lib/librte_vhost/virtio_net.c       | 188 +++++++++++++++++++----
>  8 files changed, 472 insertions(+), 160 deletions(-)
>


* Re: [dpdk-dev] [PATCH 4/6] vhost: add Tx zero copy
  2016-08-23 14:04   ` Maxime Coquelin
@ 2016-08-23 14:31     ` Yuanhan Liu
  2016-08-23 15:40       ` Maxime Coquelin
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23 14:31 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev

On Tue, Aug 23, 2016 at 04:04:30PM +0200, Maxime Coquelin wrote:
> 
> 
> On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
> >The basic idea of Tx zero copy is, instead of copying data from the
> >desc buf, here we let the mbuf reference the desc buf addr directly.
> >
> >Doing so, however, has one major issue: we can't update the used ring
> >at the end of rte_vhost_dequeue_burst. Because we don't do the copy
> >here, an update of the used ring would let the driver to reclaim the
> >desc buf. As a result, DPDK might reference a stale memory region.
> >
> >To update the used ring properly, this patch does several tricks:
> >
> >- when mbuf references a desc buf, refcnt is added by 1.
> >
> >  This is to pin lock the mbuf, so that a mbuf free from the DPDK
> >  won't actually free it, instead, refcnt is subtracted by 1.
> >
> >- We chain all those mbuf together (by tailq)
> >
> >  And we check it every time on the rte_vhost_dequeue_burst entrance,
> >  to see if the mbuf is freed (when refcnt equals to 1). If that
> >  happens, it means we are the last user of this mbuf and we are
> >  safe to update the used ring.
> >
> >- "struct zcopy_mbuf" is introduced, to associate an mbuf with the
> >  right desc idx.
> >
> >Tx zero copy is introduced for performance reason, and some rough tests
> >show about 40% performance boost for packet size 1400B. For small packets,
> >(e.g. 64B), it actually slows a bit down. That is expected because this
> >patch introduces some extra works, and it outweighs the benefit from
> >saving few bytes copy.
> >
> >Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >---
> > lib/librte_vhost/vhost.c      |   2 +
> > lib/librte_vhost/vhost.h      |  21 ++++++
> > lib/librte_vhost/vhost_user.c |  41 +++++++++-
> > lib/librte_vhost/virtio_net.c | 169 +++++++++++++++++++++++++++++++++++++-----
> > 4 files changed, 214 insertions(+), 19 deletions(-)
> >
> ...
> 
> > rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> > 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
> >@@ -823,6 +943,30 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
> > 	if (unlikely(vq->enabled == 0))
> > 		return 0;
> >
> >+	if (dev->tx_zero_copy) {
> >+		struct zcopy_mbuf *zmbuf, *next;
> >+		int nr_updated = 0;
> >+
> >+		for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
> >+		     zmbuf != NULL; zmbuf = next) {
> >+			next = TAILQ_NEXT(zmbuf, next);
> >+
> >+			if (mbuf_is_consumed(zmbuf->mbuf)) {
> >+				used_idx = vq->last_used_idx++ & (vq->size - 1);
> >+				update_used_ring(dev, vq, used_idx,
> >+						 zmbuf->desc_idx);
> >+				nr_updated += 1;
> >+
> >+				TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
> >+				rte_pktmbuf_free(zmbuf->mbuf);
> >+				put_zmbuf(zmbuf);
> >+				vq->nr_zmbuf -= 1;
> >+			}
> Shouldn't you break the loop here as soon as a mbuf is not consumed?

I have thought of that as well, as a micro-optimization. But I was
wondering: what if the mbuf at the head is pinned by the DPDK app? Then
the whole chain would be blocked. This should be rare, but I think
we should consider the worst case.

Besides that, the performance boost I got is decent enough that I think
we could drop this micro-optimization.

> Indeed, they might not be consumed sequentially, and would cause
> last_used_idx to be incremented whereas it shouldn't.

I think the out-of-order used vring update won't be an issue here.
Well, there might be some problems for reconnect: the trick that
commit 0823c1cb0a73 ("vhost: workaround stale vring base") introduced
assumes that the used vring will always be updated in order.
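
To make the point about not breaking out of the loop concrete, here is a
minimal standalone sketch of the reclaim walk. It is not the patch code:
struct fake_mbuf, struct zmbuf and reclaim_consumed() are hypothetical
stand-ins, the used ring is a plain array, and dropping the pinning
reference is modelled by zeroing refcnt instead of calling
rte_pktmbuf_free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct rte_mbuf: only the refcnt matters. */
struct fake_mbuf {
	uint16_t refcnt;
};

/* Plays the role of "struct zcopy_mbuf": ties an mbuf to its desc idx. */
struct zmbuf {
	struct fake_mbuf *mbuf;
	uint16_t desc_idx;
	TAILQ_ENTRY(zmbuf) next;
};

TAILQ_HEAD(zmbuf_list, zmbuf);

/* refcnt == 1 means only the pinning reference taken by vhost is left. */
static int
mbuf_is_consumed(const struct fake_mbuf *m)
{
	return m->refcnt == 1;
}

/*
 * Walk the whole list and hand every consumed entry back through used[].
 * An unconsumed entry is skipped rather than ending the walk, so a stuck
 * mbuf at the head does not block the entries queued behind it.
 * ring_size must be a power of two. Returns the number of entries reclaimed.
 */
static uint16_t
reclaim_consumed(struct zmbuf_list *list, uint16_t *used,
		 uint16_t *last_used_idx, uint16_t ring_size)
{
	struct zmbuf *z, *nxt;
	uint16_t nr_updated = 0;

	assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);

	for (z = TAILQ_FIRST(list); z != NULL; z = nxt) {
		/* Fetch the successor first: z may be unlinked below. */
		nxt = TAILQ_NEXT(z, next);

		if (!mbuf_is_consumed(z->mbuf))
			continue;

		used[(*last_used_idx)++ & (ring_size - 1)] = z->desc_idx;
		TAILQ_REMOVE(list, z, next);
		z->mbuf->refcnt = 0;	/* models dropping the pinning ref */
		nr_updated++;
	}

	return nr_updated;
}
```

With entries for descriptors 5, 9 and 2 queued in that order and the first
mbuf still held by the application, one pass hands 9 and 2 back and leaves
5 on the list; a later pass picks up 5 once it is released. That is also
how the used ring can end up being updated out of order.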

	--yliu


* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-08-23 14:18 ` [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Maxime Coquelin
@ 2016-08-23 14:42   ` Yuanhan Liu
  2016-08-23 14:53     ` Yuanhan Liu
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23 14:42 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev

On Tue, Aug 23, 2016 at 04:18:40PM +0200, Maxime Coquelin wrote:
> 
> 
> On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
> >This patch set enables vhost Tx zero copy. The majority work goes to
> >patch 4: vhost: add Tx zero copy.
> >
> >The basic idea of Tx zero copy is, instead of copying data from the
> >desc buf, here we let the mbuf reference the desc buf addr directly.
> >
> >The major issue behind that is how and when to update the used ring.
> >You could check the commit log of patch 4 for more details.
> >
> >Patch 5 introduces a new flag, RTE_VHOST_USER_TX_ZERO_COPY, to enable
> >Tx zero copy, which is disabled by default.
> >
> >Few more TODOs are left, including handling a desc buf that is across
> >two physical pages, updating release note, etc. Those will be fixed
> >in later version. For now, here is a simple one that hopefully it
> >shows the idea clearly.
> >
> >I did some quick tests, the performance gain is quite impressive.
> >
> >For a simple dequeue workload (running rxonly in vhost-pmd and running
> >txonly in guest testpmd), it yields 40+% performance boost for packet
> >size 1400B.
> >
> >For VM2VM iperf test case, it's even better: about 70% boost.
> 
> This is indeed impressive.
> Somewhere else, you mention that there is a small regression with small
> packets. Do you have some figures to share?

It could be a 15% drop for the PVP case with 64B packets. The test
topology is:

	 nic 0 --> VM Rx --> VM Tx --> nic 0

Put simply, I run the vhost-switch example on the host and testpmd in
the guest.

Though the number looks big, I don't think it's an issue. First of all,
it's disabled by default. Secondly, if you want to enable it, you should
be certain that the packets are normally big; otherwise, you should
not bother to try zero copy.

> Also, with this feature OFF, do you see some regressions for both small
> and bigger packets?

Good question. I didn't check it on purpose, but I did try with it
disabled, and the number I got is pretty much the same as the one I got
without this feature. So, I would say I don't see regressions. Anyway,
I could do more tests to make sure.
	
	--yliu


* Re: [dpdk-dev] [PATCH 6/6] examples/vhost: add an option to enable Tx zero copy
  2016-08-23 14:14   ` Maxime Coquelin
@ 2016-08-23 14:45     ` Yuanhan Liu
  0 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23 14:45 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev

On Tue, Aug 23, 2016 at 04:14:44PM +0200, Maxime Coquelin wrote:
> > /* Specify timeout (in useconds) between retries on RX. */
> > static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US;
> >@@ -297,6 +298,17 @@ port_init(uint8_t port)
> >
> > 	rx_ring_size = RTE_TEST_RX_DESC_DEFAULT;
> > 	tx_ring_size = RTE_TEST_TX_DESC_DEFAULT;
> >+
> >+	/*
> >+	 * When Tx zero copy is enabled, guest Tx used vring will be updated
> >+	 * only when corresponding mbuf is freed. Thus, the nb_tx_desc
> >+	 * (tx_ring_size here) must be small enough so that the driver will
> >+	 * hit the free threshold easily and free mbufs timely. Otherwise,
> >+	 * guest Tx vring would be starved.
> >+	 */
> >+	if (tx_zero_copy)
> >+		tx_ring_size = 64;
> 
> I have a concern about more complex applications, where the mbufs might
> not be consumed sequentially.
> If one mbuf gets stuck for a while, whereas all others are consumed,
> we would face starvation.

I guess that is exactly the worst case I mentioned in another
email. That's why I think we should not break the loop when a head
mbuf is not consumed.

	--yliu
> For example, the packet is to be routed to a VM, which is paused,
> and the routing thread keeps retrying to enqueue the packet for a while.
> 
> Anyway, this feature is optional and off by default, so having the
> feature applied is not a blocker.
> 
> Thanks!
> Maxime


* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-08-23 14:42   ` Yuanhan Liu
@ 2016-08-23 14:53     ` Yuanhan Liu
  2016-08-23 16:41       ` Maxime Coquelin
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-23 14:53 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev

BTW, I really appreciate your efforts on reviewing this patchset.

It would be great if you could take some time to review my other
patchset :)

    [PATCH 0/7] vhost: vhost-cuse removal and code path refactoring

It touches a large part of the code base, so I wish I could apply it
ASAP, so that the chance a later patch will introduce conflicts is small.

	--yliu

On Tue, Aug 23, 2016 at 10:42:11PM +0800, Yuanhan Liu wrote:
> On Tue, Aug 23, 2016 at 04:18:40PM +0200, Maxime Coquelin wrote:
> > 
> > 
> > On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
> > >This patch set enables vhost Tx zero copy. The majority work goes to
> > >patch 4: vhost: add Tx zero copy.
> > >
> > >The basic idea of Tx zero copy is, instead of copying data from the
> > >desc buf, here we let the mbuf reference the desc buf addr directly.
> > >
> > >The major issue behind that is how and when to update the used ring.
> > >You could check the commit log of patch 4 for more details.
> > >
> > >Patch 5 introduces a new flag, RTE_VHOST_USER_TX_ZERO_COPY, to enable
> > >Tx zero copy, which is disabled by default.
> > >
> > >Few more TODOs are left, including handling a desc buf that is across
> > >two physical pages, updating release note, etc. Those will be fixed
> > >in later version. For now, here is a simple one that hopefully it
> > >shows the idea clearly.
> > >
> > >I did some quick tests, the performance gain is quite impressive.
> > >
> > >For a simple dequeue workload (running rxonly in vhost-pmd and running
> > >txonly in guest testpmd), it yields 40+% performance boost for packet
> > >size 1400B.
> > >
> > >For VM2VM iperf test case, it's even better: about 70% boost.
> > 
> > This is indeed impressive.
> > Somewhere else, you mention that there is a small regression with small
> > packets. Do you have some figures to share?
> 
> It could be a 15% drop for the PVP case with 64B packets. The test
> topology is:
> 
> 	 nic 0 --> VM Rx --> VM Tx --> nic 0
> 
> Put simply, I run the vhost-switch example on the host and testpmd in
> the guest.
> 
> Though the number looks big, I don't think it's an issue. First of all,
> it's disabled by default. Secondly, if you want to enable it, you should
> be certain that the packets are normally big; otherwise, you should
> not bother to try zero copy.
> 
> > Also, with this feature OFF, do you see some regressions for both small
> > and bigger packets?
> 
> Good question. I didn't check it on purpose, but I did try with it
> disabled, and the number I got is pretty much the same as the one I got
> without this feature. So, I would say I don't see regressions. Anyway,
> I could do more tests to make sure.
> 	
> 	--yliu


* Re: [dpdk-dev] [PATCH 4/6] vhost: add Tx zero copy
  2016-08-23 14:31     ` Yuanhan Liu
@ 2016-08-23 15:40       ` Maxime Coquelin
  0 siblings, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23 15:40 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev



On 08/23/2016 04:31 PM, Yuanhan Liu wrote:
> On Tue, Aug 23, 2016 at 04:04:30PM +0200, Maxime Coquelin wrote:
>>
>>
>> On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
>>> The basic idea of Tx zero copy is, instead of copying data from the
>>> desc buf, here we let the mbuf reference the desc buf addr directly.
>>>
>>> Doing so, however, has one major issue: we can't update the used ring
>>> at the end of rte_vhost_dequeue_burst. Because we don't do the copy
>>> here, an update of the used ring would let the driver to reclaim the
>>> desc buf. As a result, DPDK might reference a stale memory region.
>>>
>>> To update the used ring properly, this patch does several tricks:
>>>
>>> - when mbuf references a desc buf, refcnt is added by 1.
>>>
>>>  This is to pin lock the mbuf, so that a mbuf free from the DPDK
>>>  won't actually free it, instead, refcnt is subtracted by 1.
>>>
>>> - We chain all those mbuf together (by tailq)
>>>
>>>  And we check it every time on the rte_vhost_dequeue_burst entrance,
>>>  to see if the mbuf is freed (when refcnt equals to 1). If that
>>>  happens, it means we are the last user of this mbuf and we are
>>>  safe to update the used ring.
>>>
>>> - "struct zcopy_mbuf" is introduced, to associate an mbuf with the
>>>  right desc idx.
>>>
>>> Tx zero copy is introduced for performance reason, and some rough tests
>>> show about 40% performance boost for packet size 1400B. For small packets,
>>> (e.g. 64B), it actually slows a bit down. That is expected because this
>>> patch introduces some extra works, and it outweighs the benefit from
>>> saving few bytes copy.
>>>
>>> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>>> ---
>>> lib/librte_vhost/vhost.c      |   2 +
>>> lib/librte_vhost/vhost.h      |  21 ++++++
>>> lib/librte_vhost/vhost_user.c |  41 +++++++++-
>>> lib/librte_vhost/virtio_net.c | 169 +++++++++++++++++++++++++++++++++++++-----
>>> 4 files changed, 214 insertions(+), 19 deletions(-)
>>>
>> ...
>>
>>> rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>>> 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
>>> @@ -823,6 +943,30 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
>>> 	if (unlikely(vq->enabled == 0))
>>> 		return 0;
>>>
>>> +	if (dev->tx_zero_copy) {
>>> +		struct zcopy_mbuf *zmbuf, *next;
>>> +		int nr_updated = 0;
>>> +
>>> +		for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
>>> +		     zmbuf != NULL; zmbuf = next) {
>>> +			next = TAILQ_NEXT(zmbuf, next);
>>> +
>>> +			if (mbuf_is_consumed(zmbuf->mbuf)) {
>>> +				used_idx = vq->last_used_idx++ & (vq->size - 1);
>>> +				update_used_ring(dev, vq, used_idx,
>>> +						 zmbuf->desc_idx);
>>> +				nr_updated += 1;
>>> +
>>> +				TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
>>> +				rte_pktmbuf_free(zmbuf->mbuf);
>>> +				put_zmbuf(zmbuf);
>>> +				vq->nr_zmbuf -= 1;
>>> +			}
>> Shouldn't you break the loop here as soon as a mbuf is not consumed?
>
> I have thought of that as well, as a micro-optimization. But I was
> wondering: what if the mbuf at the head is pinned by the DPDK app? Then
> the whole chain would be blocked. This should be rare, but I think
> we should consider the worst case.
>
> Besides that, the performance boost I got is decent enough that I think
> we could drop this micro-optimization.

Forget my comment, this was a misunderstanding of the code on my side.

Regards,
Maxime


* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-08-23 14:53     ` Yuanhan Liu
@ 2016-08-23 16:41       ` Maxime Coquelin
  0 siblings, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-08-23 16:41 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev



On 08/23/2016 04:53 PM, Yuanhan Liu wrote:
> BTW, I really appreciate your efforts on reviewing this patchset.
>
> It would be great if you could take some time to review my other
> patchset :)
>
>     [PATCH 0/7] vhost: vhost-cuse removal and code path refactoring
>
> It touches a large part of the code base, so I wish I could apply it
> ASAP, so that the chance a later patch will introduce conflicts is small.

Sure, I will try to review it by tomorrow morning (CET).

Regards,
Maxime

>
> 	--yliu
>
> On Tue, Aug 23, 2016 at 10:42:11PM +0800, Yuanhan Liu wrote:
>> On Tue, Aug 23, 2016 at 04:18:40PM +0200, Maxime Coquelin wrote:
>>>
>>>
>>> On 08/23/2016 10:10 AM, Yuanhan Liu wrote:
>>>> This patch set enables vhost Tx zero copy. The majority work goes to
>>>> patch 4: vhost: add Tx zero copy.
>>>>
>>>> The basic idea of Tx zero copy is, instead of copying data from the
>>>> desc buf, here we let the mbuf reference the desc buf addr directly.
>>>>
>>>> The major issue behind that is how and when to update the used ring.
>>>> You could check the commit log of patch 4 for more details.
>>>>
>>>> Patch 5 introduces a new flag, RTE_VHOST_USER_TX_ZERO_COPY, to enable
>>>> Tx zero copy, which is disabled by default.
>>>>
>>>> Few more TODOs are left, including handling a desc buf that is across
>>>> two physical pages, updating release note, etc. Those will be fixed
>>>> in later version. For now, here is a simple one that hopefully it
>>>> shows the idea clearly.
>>>>
>>>> I did some quick tests, the performance gain is quite impressive.
>>>>
>>>> For a simple dequeue workload (running rxonly in vhost-pmd and running
>>>> txonly in guest testpmd), it yields 40+% performance boost for packet
>>>> size 1400B.
>>>>
>>>> For VM2VM iperf test case, it's even better: about 70% boost.
>>>
>>> This is indeed impressive.
>>> Somewhere else, you mention that there is a small regression with small
>>> packets. Do you have some figures to share?
>>
>> It could be a 15% drop for the PVP case with 64B packets. The test
>> topology is:
>>
>> 	 nic 0 --> VM Rx --> VM Tx --> nic 0
>>
>> Put simply, I run the vhost-switch example on the host and testpmd in
>> the guest.
>>
>> Though the number looks big, I don't think it's an issue. First of all,
>> it's disabled by default. Secondly, if you want to enable it, you should
>> be certain that the packets are normally big; otherwise, you should
>> not bother to try zero copy.
>>
>>> Also, with this feature OFF, do you see some regressions for both small
>>> and bigger packets?
>>
>> Good question. I didn't check it on purpose, but I did try with it
>> disabled, and the number I got is pretty much the same as the one I got
>> without this feature. So, I would say I don't see regressions. Anyway,
>> I could do more tests to make sure.
>> 	
>> 	--yliu


* Re: [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling Yuanhan Liu
  2016-08-23  9:17   ` Maxime Coquelin
@ 2016-08-24  7:26   ` Xu, Qian Q
  2016-08-24  7:40     ` Yuanhan Liu
  1 sibling, 1 reply; 75+ messages in thread
From: Xu, Qian Q @ 2016-08-24  7:26 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Maxime Coquelin

I tried to apply the patch on the latest DPDK (see the commit ID below), but it failed because there are no vhost.h and vhost-user.h files. Does this series depend on any other patches?

commit 28d8abaf250c3fb4dcb6416518f4c54b4ae67205
Author: Deirdre O'Connor <deirdre.o.connor@intel.com>
Date:   Mon Aug 22 17:20:08 2016 +0100

    doc: fix patchwork link

    Fixes: 58abf6e77c6b ("doc: add contributors guide")

    Reported-by: Jon Loeliger <jdl@netgate.com>
    Signed-off-by: Deirdre O'Connor <deirdre.o.connor@intel.com>
    Acked-by: John McNamara <john.mcnamara@intel.com>


-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
Sent: Tuesday, August 23, 2016 4:11 PM
To: dev@dpdk.org
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Yuanhan Liu <yuanhan.liu@linux.intel.com>
Subject: [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling

Due to history reason (that vhost-cuse comes before vhost-user), some fields for maintaining the vhost-user memory mappings (such as mmapped address and size, with those we then can unmap on destroy) are kept in "orig_region_map" struct, a structure that is defined only in vhost-user source file.

The right way to go is to remove the structure and move all those fields into virtio_memory_region struct. But we simply can't do that before, because it breaks the ABI.

Now, thanks to the ABI refactoring, it's never been a blocking issue any more. And here it goes: this patch removes orig_region_map and redefines virtio_memory_region, to include all necessary info.

With that, we can simplify the guest/host address convert a bit.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost.h      |  49 ++++++------
 lib/librte_vhost/vhost_user.c | 172 +++++++++++++++++-------------------------
 2 files changed, 90 insertions(+), 131 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index c2dfc3c..df2107b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -143,12 +143,14 @@ struct virtio_net {
  * Information relating to memory regions including offsets to
  * addresses in QEMUs memory file.
  */
-struct virtio_memory_regions {
-	uint64_t guest_phys_address;
-	uint64_t guest_phys_address_end;
-	uint64_t memory_size;
-	uint64_t userspace_address;
-	uint64_t address_offset;
+struct virtio_memory_region {
+	uint64_t guest_phys_addr;
+	uint64_t guest_user_addr;
+	uint64_t host_user_addr;
+	uint64_t size;
+	void	 *mmap_addr;
+	uint64_t mmap_size;
+	int fd;
 };
 
 
@@ -156,12 +158,8 @@ struct virtio_memory_regions {
  * Memory structure includes region and mapping information.
  */
 struct virtio_memory {
-	/* Base QEMU userspace address of the memory file. */
-	uint64_t base_address;
-	uint64_t mapped_address;
-	uint64_t mapped_size;
 	uint32_t nregions;
-	struct virtio_memory_regions regions[0];
+	struct virtio_memory_region regions[0];
 };
 
 
@@ -200,26 +198,23 @@ extern uint64_t VHOST_FEATURES;
 #define MAX_VHOST_DEVICE	1024
 extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
 
-/**
- * Function to convert guest physical addresses to vhost virtual addresses.
- * This is used to convert guest virtio buffer addresses.
- */
+/* Convert guest physical Address to host virtual address */
 static inline uint64_t __attribute__((always_inline))
-gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
+gpa_to_vva(struct virtio_net *dev, uint64_t gpa)
 {
-	struct virtio_memory_regions *region;
-	uint32_t regionidx;
-	uint64_t vhost_va = 0;
-
-	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
-		region = &dev->mem->regions[regionidx];
-		if ((guest_pa >= region->guest_phys_address) &&
-			(guest_pa <= region->guest_phys_address_end)) {
-			vhost_va = region->address_offset + guest_pa;
-			break;
+	struct virtio_memory_region *reg;
+	uint32_t i;
+
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+		if (gpa >= reg->guest_phys_addr &&
+		    gpa <  reg->guest_phys_addr + reg->size) {
+			return gpa - reg->guest_phys_addr +
+			       reg->host_user_addr;
 		}
 	}
-	return vhost_va;
+
+	return 0;
 }
 
 struct virtio_net_device_ops const *notify_ops;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index eee99e9..d2071fd 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -74,18 +74,6 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
 	[VHOST_USER_SEND_RARP]  = "VHOST_USER_SEND_RARP",
 };
 
-struct orig_region_map {
-	int fd;
-	uint64_t mapped_address;
-	uint64_t mapped_size;
-	uint64_t blksz;
-};
-
-#define orig_region(ptr, nregions) \
-	((struct orig_region_map *)RTE_PTR_ADD((ptr), \
-		sizeof(struct virtio_memory) + \
-		sizeof(struct virtio_memory_regions) * (nregions)))
-
 static uint64_t
 get_blk_size(int fd)
 {
@@ -99,18 +87,17 @@ get_blk_size(int fd)
 static void
 free_mem_region(struct virtio_net *dev)
 {
-	struct orig_region_map *region;
-	unsigned int idx;
+	uint32_t i;
+	struct virtio_memory_region *reg;
 
 	if (!dev || !dev->mem)
 		return;
 
-	region = orig_region(dev->mem, dev->mem->nregions);
-	for (idx = 0; idx < dev->mem->nregions; idx++) {
-		if (region[idx].mapped_address) {
-			munmap((void *)(uintptr_t)region[idx].mapped_address,
-					region[idx].mapped_size);
-			close(region[idx].fd);
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+		if (reg->host_user_addr) {
+			munmap(reg->mmap_addr, reg->mmap_size);
+			close(reg->fd);
 		}
 	}
 }
@@ -120,7 +107,7 @@ vhost_backend_cleanup(struct virtio_net *dev)
 {
 	if (dev->mem) {
 		free_mem_region(dev);
-		free(dev->mem);
+		rte_free(dev->mem);
 		dev->mem = NULL;
 	}
 	if (dev->log_addr) {
@@ -286,25 +273,23 @@ numa_realloc(struct virtio_net *dev, int index __rte_unused)
  * used to convert the ring addresses to our address space.
  */
 static uint64_t
-qva_to_vva(struct virtio_net *dev, uint64_t qemu_va)
+qva_to_vva(struct virtio_net *dev, uint64_t qva)
 {
-	struct virtio_memory_regions *region;
-	uint64_t vhost_va = 0;
-	uint32_t regionidx = 0;
+	struct virtio_memory_region *reg;
+	uint32_t i;
 
 	/* Find the region where the address lives. */
-	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
-		region = &dev->mem->regions[regionidx];
-		if ((qemu_va >= region->userspace_address) &&
-			(qemu_va <= region->userspace_address +
-			region->memory_size)) {
-			vhost_va = qemu_va + region->guest_phys_address +
-				region->address_offset -
-				region->userspace_address;
-			break;
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+
+		if (qva >= reg->guest_user_addr &&
+		    qva <  reg->guest_user_addr + reg->size) {
+			return qva - reg->guest_user_addr +
+			       reg->host_user_addr;
 		}
 	}
-	return vhost_va;
+
+	return 0;
 }
 
 /*
@@ -391,11 +376,13 @@ static int
 vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 {
 	struct VhostUserMemory memory = pmsg->payload.memory;
-	struct virtio_memory_regions *pregion;
-	uint64_t mapped_address, mapped_size;
-	unsigned int idx = 0;
-	struct orig_region_map *pregion_orig;
+	struct virtio_memory_region *reg;
+	void *mmap_addr;
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
 	uint64_t alignment;
+	uint32_t i;
+	int fd;
 
 	/* Remove from the data plane. */
 	if (dev->flags & VIRTIO_DEV_RUNNING) {
@@ -405,14 +392,12 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 
 	if (dev->mem) {
 		free_mem_region(dev);
-		free(dev->mem);
+		rte_free(dev->mem);
 		dev->mem = NULL;
 	}
 
-	dev->mem = calloc(1,
-		sizeof(struct virtio_memory) +
-		sizeof(struct virtio_memory_regions) * memory.nregions +
-		sizeof(struct orig_region_map) * memory.nregions);
+	dev->mem = rte_zmalloc("vhost-mem-table", sizeof(struct virtio_memory) +
+		sizeof(struct virtio_memory_region) * memory.nregions, 0);
 	if (dev->mem == NULL) {
 		RTE_LOG(ERR, VHOST_CONFIG,
 			"(%d) failed to allocate memory for dev->mem\n",
@@ -421,22 +406,17 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 	}
 	dev->mem->nregions = memory.nregions;
 
-	pregion_orig = orig_region(dev->mem, memory.nregions);
-	for (idx = 0; idx < memory.nregions; idx++) {
-		pregion = &dev->mem->regions[idx];
-		pregion->guest_phys_address =
-			memory.regions[idx].guest_phys_addr;
-		pregion->guest_phys_address_end =
-			memory.regions[idx].guest_phys_addr +
-			memory.regions[idx].memory_size;
-		pregion->memory_size =
-			memory.regions[idx].memory_size;
-		pregion->userspace_address =
-			memory.regions[idx].userspace_addr;
-
-		/* This is ugly */
-		mapped_size = memory.regions[idx].memory_size +
-			memory.regions[idx].mmap_offset;
+	for (i = 0; i < memory.nregions; i++) {
+		fd  = pmsg->fds[i];
+		reg = &dev->mem->regions[i];
+
+		reg->guest_phys_addr = memory.regions[i].guest_phys_addr;
+		reg->guest_user_addr = memory.regions[i].userspace_addr;
+		reg->size            = memory.regions[i].memory_size;
+		reg->fd              = fd;
+
+		mmap_offset = memory.regions[i].mmap_offset;
+		mmap_size   = reg->size + mmap_offset;
 
 		/* mmap() without flag of MAP_ANONYMOUS, should be called
 		 * with length argument aligned with hugepagesz at older
@@ -446,67 +426,51 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		 * to avoid failure, make sure in caller to keep length
 		 * aligned.
 		 */
-		alignment = get_blk_size(pmsg->fds[idx]);
+		alignment = get_blk_size(fd);
 		if (alignment == (uint64_t)-1) {
 			RTE_LOG(ERR, VHOST_CONFIG,
 				"couldn't get hugepage size through fstat\n");
 			goto err_mmap;
 		}
-		mapped_size = RTE_ALIGN_CEIL(mapped_size, alignment);
+		mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment);
 
-		mapped_address = (uint64_t)(uintptr_t)mmap(NULL,
-			mapped_size,
-			PROT_READ | PROT_WRITE, MAP_SHARED,
-			pmsg->fds[idx],
-			0);
+		mmap_addr = mmap(NULL, mmap_size,
+				 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 
-		RTE_LOG(INFO, VHOST_CONFIG,
-			"mapped region %d fd:%d to:%p sz:0x%"PRIx64" "
-			"off:0x%"PRIx64" align:0x%"PRIx64"\n",
-			idx, pmsg->fds[idx], (void *)(uintptr_t)mapped_address,
-			mapped_size, memory.regions[idx].mmap_offset,
-			alignment);
-
-		if (mapped_address == (uint64_t)(uintptr_t)MAP_FAILED) {
+		if (mmap_addr == MAP_FAILED) {
 			RTE_LOG(ERR, VHOST_CONFIG,
-				"mmap qemu guest failed.\n");
+				"mmap region %u failed.\n", i);
 			goto err_mmap;
 		}
 
-		pregion_orig[idx].mapped_address = mapped_address;
-		pregion_orig[idx].mapped_size = mapped_size;
-		pregion_orig[idx].blksz = alignment;
-		pregion_orig[idx].fd = pmsg->fds[idx];
-
-		mapped_address +=  memory.regions[idx].mmap_offset;
+		reg->mmap_addr = mmap_addr;
+		reg->mmap_size = mmap_size;
+		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;
 
-		pregion->address_offset = mapped_address -
-			pregion->guest_phys_address;
-
-		if (memory.regions[idx].guest_phys_addr == 0) {
-			dev->mem->base_address =
-				memory.regions[idx].userspace_addr;
-			dev->mem->mapped_address =
-				pregion->address_offset;
-		}
-
-		LOG_DEBUG(VHOST_CONFIG,
-			"REGION: %u GPA: %p QEMU VA: %p SIZE (%"PRIu64")\n",
-			idx,
-			(void *)(uintptr_t)pregion->guest_phys_address,
-			(void *)(uintptr_t)pregion->userspace_address,
-			 pregion->memory_size);
+		RTE_LOG(INFO, VHOST_CONFIG,
+			"guest memory region %u, size: 0x%" PRIx64 "\n"
+			"\t guest physical addr: 0x%" PRIx64 "\n"
+			"\t guest virtual  addr: 0x%" PRIx64 "\n"
+			"\t host  virtual  addr: 0x%" PRIx64 "\n"
+			"\t mmap addr : 0x%" PRIx64 "\n"
+			"\t mmap size : 0x%" PRIx64 "\n"
+			"\t mmap align: 0x%" PRIx64 "\n"
+			"\t mmap off  : 0x%" PRIx64 "\n",
+			i, reg->size,
+			reg->guest_phys_addr,
+			reg->guest_user_addr,
+			reg->host_user_addr,
+			(uint64_t)(uintptr_t)mmap_addr,
+			mmap_size,
+			alignment,
+			mmap_offset);
 	}
 
 	return 0;
 
 err_mmap:
-	while (idx--) {
-		munmap((void *)(uintptr_t)pregion_orig[idx].mapped_address,
-				pregion_orig[idx].mapped_size);
-		close(pregion_orig[idx].fd);
-	}
-	free(dev->mem);
+	free_mem_region(dev);
+	rte_free(dev->mem);
 	dev->mem = NULL;
 	return -1;
 }
--
1.9.0
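
For readers following the address translation change, here is a minimal standalone sketch of the lookup that the new gpa_to_vva() performs. It is not the library code: struct mem_region and gpa_to_vva_sketch() are hypothetical stand-ins that keep only the three fields the lookup needs. Note that the old check compared with <= against the end address (start + size), so the first byte past a region was accepted; the half-open range used below rejects it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct virtio_memory_region; only the
 * fields needed for the lookup are kept. */
struct mem_region {
	uint64_t guest_phys_addr;	/* region start in guest physical space */
	uint64_t host_user_addr;	/* where that start is mapped in this process */
	uint64_t size;			/* region length in bytes */
};

/* Translate a guest physical address to a host virtual address.
 * Returns 0 when no region contains the address. */
static uint64_t
gpa_to_vva_sketch(const struct mem_region *regs, uint32_t nregions,
		  uint64_t gpa)
{
	uint32_t i;

	assert(regs != NULL || nregions == 0);

	for (i = 0; i < nregions; i++) {
		const struct mem_region *reg = &regs[i];

		/* Half-open range: [start, start + size) */
		if (gpa >= reg->guest_phys_addr &&
		    gpa <  reg->guest_phys_addr + reg->size) {
			return gpa - reg->guest_phys_addr +
			       reg->host_user_addr;
		}
	}

	return 0;
}
```

With a single region that starts at guest physical 0x100000, has size 0x2000 and is mapped at host address H, gpa_to_vva_sketch(regs, 1, 0x100004) returns H + 4, while 0x102000 (one byte past the end) returns 0.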

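
The mmap bookkeeping in vhost_user_set_mem_table() can be sketched in the same spirit. The helpers below are hypothetical: align_ceil() only mimics what RTE_ALIGN_CEIL does for a power-of-two alignment (as far as I understand the macro), and the other two show how the mapping length and host_user_addr are derived from the region size, the mmap offset and the block size reported for the fd.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next multiple of align.
 * Assumption: align is a non-zero power of two. */
static uint64_t
align_ceil(uint64_t v, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (v + align - 1) & ~(align - 1);
}

/* Length passed to mmap(): the file is mapped from offset 0, so the
 * mapping has to cover mmap_offset plus the region itself, rounded up
 * to the block (hugepage) size. */
static uint64_t
region_mmap_size(uint64_t size, uint64_t mmap_offset, uint64_t blksz)
{
	return align_ceil(size + mmap_offset, blksz);
}

/* The region proper starts mmap_offset bytes into the mapping. */
static uint64_t
region_host_user_addr(uint64_t mmap_addr, uint64_t mmap_offset)
{
	return mmap_addr + mmap_offset;
}
```

Keeping the raw mmap address and the rounded-up length in the region (mmap_addr and mmap_size in the patch) is what lets free_mem_region() call munmap() with the same values that were passed to mmap().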
^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling
  2016-08-24  7:40     ` Yuanhan Liu
@ 2016-08-24  7:36       ` Xu, Qian Q
  0 siblings, 0 replies; 75+ messages in thread
From: Xu, Qian Q @ 2016-08-24  7:36 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Maxime Coquelin

OK, it would be better to state that this patchset depends on another one.

-----Original Message-----
From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] 
Sent: Wednesday, August 24, 2016 3:40 PM
To: Xu, Qian Q <qian.q.xu@intel.com>
Cc: dev@dpdk.org; Maxime Coquelin <maxime.coquelin@redhat.com>
Subject: Re: [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling

Yes, it depends on the vhost-cuse removal patchset I sent last week.

	--yliu


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling
  2016-08-24  7:26   ` Xu, Qian Q
@ 2016-08-24  7:40     ` Yuanhan Liu
  2016-08-24  7:36       ` Xu, Qian Q
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-08-24  7:40 UTC (permalink / raw)
  To: Xu, Qian Q; +Cc: dev, Maxime Coquelin

Yes, it depends on the vhost-cuse removal patchset I sent last week.

	--yliu

On Wed, Aug 24, 2016 at 07:26:07AM +0000, Xu, Qian Q wrote:
> I tried to apply the patch on the latest DPDK (see the commit ID below), but it failed because there are no vhost.h and vhost-user.h files. Does it depend on other patches?
> 
> commit 28d8abaf250c3fb4dcb6416518f4c54b4ae67205
> Author: Deirdre O'Connor <deirdre.o.connor@intel.com>
> Date:   Mon Aug 22 17:20:08 2016 +0100
> 
>     doc: fix patchwork link
> 
>     Fixes: 58abf6e77c6b ("doc: add contributors guide")
> 
>     Reported-by: Jon Loeliger <jdl@netgate.com>
>     Signed-off-by: Deirdre O'Connor <deirdre.o.connor@intel.com>
>     Acked-by: John McNamara <john.mcnamara@intel.com>
> 
> 
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
> Sent: Tuesday, August 23, 2016 4:11 PM
> To: dev@dpdk.org
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Subject: [dpdk-dev] [PATCH 1/6] vhost: simplify memory regions handling
> 
> For historical reasons (vhost-cuse came before vhost-user), some fields for maintaining the vhost-user memory mappings (such as the mmapped address and size, which are needed to unmap on destroy) are kept in the "orig_region_map" struct, a structure that is defined only in the vhost-user source file.
> 
> The right way to go is to remove the structure and move all those fields into the virtio_memory_region struct. But we simply couldn't do that before, because it would have broken the ABI.
> 
> Now, thanks to the ABI refactoring, that is no longer a blocking issue. And here it goes: this patch removes orig_region_map and redefines virtio_memory_region to include all the necessary info.
> 
> With that, we can simplify the guest/host address conversion a bit.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/vhost.h      |  49 ++++++------
>  lib/librte_vhost/vhost_user.c | 172 +++++++++++++++++-------------------------
>  2 files changed, 90 insertions(+), 131 deletions(-)
> 
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index c2dfc3c..df2107b 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -143,12 +143,14 @@ struct virtio_net {
>   * Information relating to memory regions including offsets to
>   * addresses in QEMUs memory file.
>   */
> -struct virtio_memory_regions {
> -	uint64_t guest_phys_address;
> -	uint64_t guest_phys_address_end;
> -	uint64_t memory_size;
> -	uint64_t userspace_address;
> -	uint64_t address_offset;
> +struct virtio_memory_region {
> +	uint64_t guest_phys_addr;
> +	uint64_t guest_user_addr;
> +	uint64_t host_user_addr;
> +	uint64_t size;
> +	void	 *mmap_addr;
> +	uint64_t mmap_size;
> +	int fd;
>  };
>  
>  
> @@ -156,12 +158,8 @@ struct virtio_memory_regions {
>   * Memory structure includes region and mapping information.
>   */
>  struct virtio_memory {
> -	/* Base QEMU userspace address of the memory file. */
> -	uint64_t base_address;
> -	uint64_t mapped_address;
> -	uint64_t mapped_size;
>  	uint32_t nregions;
> -	struct virtio_memory_regions regions[0];
> +	struct virtio_memory_region regions[0];
>  };
>  
>  
> @@ -200,26 +198,23 @@ extern uint64_t VHOST_FEATURES;
>  #define MAX_VHOST_DEVICE	1024
>  extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
>  
> -/**
> - * Function to convert guest physical addresses to vhost virtual addresses.
> - * This is used to convert guest virtio buffer addresses.
> - */
> +/* Convert guest physical Address to host virtual address */
>  static inline uint64_t __attribute__((always_inline))
> -gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
> +gpa_to_vva(struct virtio_net *dev, uint64_t gpa)
>  {
> -	struct virtio_memory_regions *region;
> -	uint32_t regionidx;
> -	uint64_t vhost_va = 0;
> -
> -	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
> -		region = &dev->mem->regions[regionidx];
> -		if ((guest_pa >= region->guest_phys_address) &&
> -			(guest_pa <= region->guest_phys_address_end)) {
> -			vhost_va = region->address_offset + guest_pa;
> -			break;
> +	struct virtio_memory_region *reg;
> +	uint32_t i;
> +
> +	for (i = 0; i < dev->mem->nregions; i++) {
> +		reg = &dev->mem->regions[i];
> +		if (gpa >= reg->guest_phys_addr &&
> +		    gpa <  reg->guest_phys_addr + reg->size) {
> +			return gpa - reg->guest_phys_addr +
> +			       reg->host_user_addr;
>  		}
>  	}
> -	return vhost_va;
> +
> +	return 0;
>  }
>  
>  struct virtio_net_device_ops const *notify_ops;
> diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
> index eee99e9..d2071fd 100644
> --- a/lib/librte_vhost/vhost_user.c
> +++ b/lib/librte_vhost/vhost_user.c
> @@ -74,18 +74,6 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
>  	[VHOST_USER_SEND_RARP]  = "VHOST_USER_SEND_RARP",
>  };
>  
> -struct orig_region_map {
> -	int fd;
> -	uint64_t mapped_address;
> -	uint64_t mapped_size;
> -	uint64_t blksz;
> -};
> -
> -#define orig_region(ptr, nregions) \
> -	((struct orig_region_map *)RTE_PTR_ADD((ptr), \
> -		sizeof(struct virtio_memory) + \
> -		sizeof(struct virtio_memory_regions) * (nregions)))
> -
>  static uint64_t
>  get_blk_size(int fd)
>  {
> @@ -99,18 +87,17 @@ get_blk_size(int fd)
>  static void
>  free_mem_region(struct virtio_net *dev)
>  {
> -	struct orig_region_map *region;
> -	unsigned int idx;
> +	uint32_t i;
> +	struct virtio_memory_region *reg;
>  
>  	if (!dev || !dev->mem)
>  		return;
>  
> -	region = orig_region(dev->mem, dev->mem->nregions);
> -	for (idx = 0; idx < dev->mem->nregions; idx++) {
> -		if (region[idx].mapped_address) {
> -			munmap((void *)(uintptr_t)region[idx].mapped_address,
> -					region[idx].mapped_size);
> -			close(region[idx].fd);
> +	for (i = 0; i < dev->mem->nregions; i++) {
> +		reg = &dev->mem->regions[i];
> +		if (reg->host_user_addr) {
> +			munmap(reg->mmap_addr, reg->mmap_size);
> +			close(reg->fd);
>  		}
>  	}
>  }
> @@ -120,7 +107,7 @@ vhost_backend_cleanup(struct virtio_net *dev)
>  {
>  	if (dev->mem) {
>  		free_mem_region(dev);
> -		free(dev->mem);
> +		rte_free(dev->mem);
>  		dev->mem = NULL;
>  	}
>  	if (dev->log_addr) {
> @@ -286,25 +273,23 @@ numa_realloc(struct virtio_net *dev, int index __rte_unused)
>   * used to convert the ring addresses to our address space.
>   */
>  static uint64_t
> -qva_to_vva(struct virtio_net *dev, uint64_t qemu_va)
> +qva_to_vva(struct virtio_net *dev, uint64_t qva)
>  {
> -	struct virtio_memory_regions *region;
> -	uint64_t vhost_va = 0;
> -	uint32_t regionidx = 0;
> +	struct virtio_memory_region *reg;
> +	uint32_t i;
>  
>  	/* Find the region where the address lives. */
> -	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
> -		region = &dev->mem->regions[regionidx];
> -		if ((qemu_va >= region->userspace_address) &&
> -			(qemu_va <= region->userspace_address +
> -			region->memory_size)) {
> -			vhost_va = qemu_va + region->guest_phys_address +
> -				region->address_offset -
> -				region->userspace_address;
> -			break;
> +	for (i = 0; i < dev->mem->nregions; i++) {
> +		reg = &dev->mem->regions[i];
> +
> +		if (qva >= reg->guest_user_addr &&
> +		    qva <  reg->guest_user_addr + reg->size) {
> +			return qva - reg->guest_user_addr +
> +			       reg->host_user_addr;
>  		}
>  	}
> -	return vhost_va;
> +
> +	return 0;
>  }
>  
>  /*
> @@ -391,11 +376,13 @@ static int
>  vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
>  {
>  	struct VhostUserMemory memory = pmsg->payload.memory;
> -	struct virtio_memory_regions *pregion;
> -	uint64_t mapped_address, mapped_size;
> -	unsigned int idx = 0;
> -	struct orig_region_map *pregion_orig;
> +	struct virtio_memory_region *reg;
> +	void *mmap_addr;
> +	uint64_t mmap_size;
> +	uint64_t mmap_offset;
>  	uint64_t alignment;
> +	uint32_t i;
> +	int fd;
>  
>  	/* Remove from the data plane. */
>  	if (dev->flags & VIRTIO_DEV_RUNNING) {
> @@ -405,14 +392,12 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
>  
>  	if (dev->mem) {
>  		free_mem_region(dev);
> -		free(dev->mem);
> +		rte_free(dev->mem);
>  		dev->mem = NULL;
>  	}
>  
> -	dev->mem = calloc(1,
> -		sizeof(struct virtio_memory) +
> -		sizeof(struct virtio_memory_regions) * memory.nregions +
> -		sizeof(struct orig_region_map) * memory.nregions);
> +	dev->mem = rte_zmalloc("vhost-mem-table", sizeof(struct virtio_memory) +
> +		sizeof(struct virtio_memory_region) * memory.nregions, 0);
>  	if (dev->mem == NULL) {
>  		RTE_LOG(ERR, VHOST_CONFIG,
>  			"(%d) failed to allocate memory for dev->mem\n",
> @@ -421,22 +406,17 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
>  	}
>  	dev->mem->nregions = memory.nregions;
>  
> -	pregion_orig = orig_region(dev->mem, memory.nregions);
> -	for (idx = 0; idx < memory.nregions; idx++) {
> -		pregion = &dev->mem->regions[idx];
> -		pregion->guest_phys_address =
> -			memory.regions[idx].guest_phys_addr;
> -		pregion->guest_phys_address_end =
> -			memory.regions[idx].guest_phys_addr +
> -			memory.regions[idx].memory_size;
> -		pregion->memory_size =
> -			memory.regions[idx].memory_size;
> -		pregion->userspace_address =
> -			memory.regions[idx].userspace_addr;
> -
> -		/* This is ugly */
> -		mapped_size = memory.regions[idx].memory_size +
> -			memory.regions[idx].mmap_offset;
> +	for (i = 0; i < memory.nregions; i++) {
> +		fd  = pmsg->fds[i];
> +		reg = &dev->mem->regions[i];
> +
> +		reg->guest_phys_addr = memory.regions[i].guest_phys_addr;
> +		reg->guest_user_addr = memory.regions[i].userspace_addr;
> +		reg->size            = memory.regions[i].memory_size;
> +		reg->fd              = fd;
> +
> +		mmap_offset = memory.regions[i].mmap_offset;
> +		mmap_size   = reg->size + mmap_offset;
>  
>  		/* mmap() without flag of MAP_ANONYMOUS, should be called
>  		 * with length argument aligned with hugepagesz at older
> @@ -446,67 +426,51 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
>  		 * to avoid failure, make sure in caller to keep length
>  		 * aligned.
>  		 */
> -		alignment = get_blk_size(pmsg->fds[idx]);
> +		alignment = get_blk_size(fd);
>  		if (alignment == (uint64_t)-1) {
>  			RTE_LOG(ERR, VHOST_CONFIG,
>  				"couldn't get hugepage size through fstat\n");
>  			goto err_mmap;
>  		}
> -		mapped_size = RTE_ALIGN_CEIL(mapped_size, alignment);
> +		mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment);
>  
> -		mapped_address = (uint64_t)(uintptr_t)mmap(NULL,
> -			mapped_size,
> -			PROT_READ | PROT_WRITE, MAP_SHARED,
> -			pmsg->fds[idx],
> -			0);
> +		mmap_addr = mmap(NULL, mmap_size,
> +				 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>  
> -		RTE_LOG(INFO, VHOST_CONFIG,
> -			"mapped region %d fd:%d to:%p sz:0x%"PRIx64" "
> -			"off:0x%"PRIx64" align:0x%"PRIx64"\n",
> -			idx, pmsg->fds[idx], (void *)(uintptr_t)mapped_address,
> -			mapped_size, memory.regions[idx].mmap_offset,
> -			alignment);
> -
> -		if (mapped_address == (uint64_t)(uintptr_t)MAP_FAILED) {
> +		if (mmap_addr == MAP_FAILED) {
>  			RTE_LOG(ERR, VHOST_CONFIG,
> -				"mmap qemu guest failed.\n");
> +				"mmap region %u failed.\n", i);
>  			goto err_mmap;
>  		}
>  
> -		pregion_orig[idx].mapped_address = mapped_address;
> -		pregion_orig[idx].mapped_size = mapped_size;
> -		pregion_orig[idx].blksz = alignment;
> -		pregion_orig[idx].fd = pmsg->fds[idx];
> -
> -		mapped_address +=  memory.regions[idx].mmap_offset;
> +		reg->mmap_addr = mmap_addr;
> +		reg->mmap_size = mmap_size;
> +		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr + mmap_offset;
>  
> -		pregion->address_offset = mapped_address -
> -			pregion->guest_phys_address;
> -
> -		if (memory.regions[idx].guest_phys_addr == 0) {
> -			dev->mem->base_address =
> -				memory.regions[idx].userspace_addr;
> -			dev->mem->mapped_address =
> -				pregion->address_offset;
> -		}
> -
> -		LOG_DEBUG(VHOST_CONFIG,
> -			"REGION: %u GPA: %p QEMU VA: %p SIZE (%"PRIu64")\n",
> -			idx,
> -			(void *)(uintptr_t)pregion->guest_phys_address,
> -			(void *)(uintptr_t)pregion->userspace_address,
> -			 pregion->memory_size);
> +		RTE_LOG(INFO, VHOST_CONFIG,
> +			"guest memory region %u, size: 0x%" PRIx64 "\n"
> +			"\t guest physical addr: 0x%" PRIx64 "\n"
> +			"\t guest virtual  addr: 0x%" PRIx64 "\n"
> +			"\t host  virtual  addr: 0x%" PRIx64 "\n"
> +			"\t mmap addr : 0x%" PRIx64 "\n"
> +			"\t mmap size : 0x%" PRIx64 "\n"
> +			"\t mmap align: 0x%" PRIx64 "\n"
> +			"\t mmap off  : 0x%" PRIx64 "\n",
> +			i, reg->size,
> +			reg->guest_phys_addr,
> +			reg->guest_user_addr,
> +			reg->host_user_addr,
> +			(uint64_t)(uintptr_t)mmap_addr,
> +			mmap_size,
> +			alignment,
> +			mmap_offset);
>  	}
>  
>  	return 0;
>  
>  err_mmap:
> -	while (idx--) {
> -		munmap((void *)(uintptr_t)pregion_orig[idx].mapped_address,
> -				pregion_orig[idx].mapped_size);
> -		close(pregion_orig[idx].fd);
> -	}
> -	free(dev->mem);
> +	free_mem_region(dev);
> +	rte_free(dev->mem);
>  	dev->mem = NULL;
>  	return -1;
>  }
> --
> 1.9.0
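
The size and address arithmetic in the reworked loop above can be sketched in isolation. This is only an illustration, not the library code: struct mem_region_sketch, align_ceil_u64(), region_mmap_size() and gpa_to_hva_sketch() are hypothetical names, the alignment is assumed to be a power of two (as hugepage sizes are), and the lookup helper is an assumption about how the stored guest_phys_addr/size/host_user_addr fields would typically be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the per-region state in the patch. */
struct mem_region_sketch {
	uint64_t guest_phys_addr;
	uint64_t size;
	uint64_t host_user_addr;	/* mmap_addr + mmap_offset */
};

/* Round v up to a multiple of align; align must be a power of two. */
static uint64_t
align_ceil_u64(uint64_t v, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (v + align - 1) & ~(align - 1);
}

/* Length handed to mmap(): region size plus offset, rounded up to blksz. */
static uint64_t
region_mmap_size(uint64_t size, uint64_t mmap_offset, uint64_t blksz)
{
	return align_ceil_u64(size + mmap_offset, blksz);
}

/* Translate a guest physical address; returns 0 if no region covers it. */
static uint64_t
gpa_to_hva_sketch(const struct mem_region_sketch *regs, size_t n, uint64_t gpa)
{
	for (size_t i = 0; i < n; i++) {
		const struct mem_region_sketch *r = &regs[i];

		if (gpa >= r->guest_phys_addr &&
		    gpa - r->guest_phys_addr < r->size)
			return r->host_user_addr + (gpa - r->guest_phys_addr);
	}
	return 0;
}
```

Because the file is mapped from offset 0, the mapping has to cover the offset as well as the region itself, which is why the length is computed from size + mmap_offset before rounding up.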

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
                   ` (6 preceding siblings ...)
  2016-08-23 14:18 ` [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Maxime Coquelin
@ 2016-08-29  8:32 ` Xu, Qian Q
  2016-08-29  8:57   ` Xu, Qian Q
  2016-10-09 15:20   ` Yuanhan Liu
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
  2016-10-09 10:46 ` [dpdk-dev] [PATCH 0/6] vhost: add Tx " linhaifeng
  9 siblings, 2 replies; 75+ messages in thread
From: Xu, Qian Q @ 2016-08-29  8:32 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Maxime Coquelin

I just ran a PVP test: the NIC receives packets and forwards them to the vhost PMD and the virtio user interface. I didn't see any performance gain in this scenario. No packet size from 64B to 1518B 
benefited from this patchset; in fact, performance dropped a lot below 1280B and was similar at 1518B. 
The TX/RX desc setting is "txd=64, rxd=128" for the TX-zero-copy enabled case. For the TX-zero-copy disabled case, I just ran default testpmd (txd=512, rxd=128) without the patch. 
Could you help check the NIC2VM case? 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-08-29  8:32 ` Xu, Qian Q
@ 2016-08-29  8:57   ` Xu, Qian Q
  2016-09-23  4:11     ` Yuanhan Liu
  2016-10-09 15:20   ` Yuanhan Liu
  1 sibling, 1 reply; 75+ messages in thread
From: Xu, Qian Q @ 2016-08-29  8:57 UTC (permalink / raw)
  To: Xu, Qian Q, Yuanhan Liu, dev; +Cc: Maxime Coquelin

Btw, some good news: if I run a simple dequeue workload (running rxonly in vhost-pmd and running txonly in guest testpmd), it yields a ~50% performance boost for packet size 1518B, but this case is without a NIC. 
In a similar case, vhost<-->virtio loopback, we can see ~10% performance gains at 1518B without a NIC. 

Some bad news: with the patch applied, I noticed a 3%-7% performance drop with zero-copy=0 compared with current DPDK (e.g. 16.07) in vhost/virtio loopback and in vhost RX only + virtio TX only. It seems the patch 
impacts the zero-copy=0 performance a little. 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy
  2016-08-23  8:10 ` [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable " Yuanhan Liu
@ 2016-09-06  9:00   ` Xu, Qian Q
  2016-09-06  9:42     ` Xu, Qian Q
  2016-09-06  9:55     ` Yuanhan Liu
  0 siblings, 2 replies; 75+ messages in thread
From: Xu, Qian Q @ 2016-09-06  9:00 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Maxime Coquelin

Just curious about the naming: vhost-user Tx zero copy. From the vhost side, it is in fact vhost Rx zero copy;
from the virtio side, it is virtio Tx zero copy. So I wonder why we call it vhost Tx zero copy. 
Any comments? 

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
Sent: Tuesday, August 23, 2016 4:11 PM
To: dev@dpdk.org
Cc: Maxime Coquelin; Yuanhan Liu
Subject: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy

Add a new flag ``RTE_VHOST_USER_TX_ZERO_COPY`` to explicitly enable
Tx zero copy. If not given, Tx zero copy is disabled by default.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 doc/guides/prog_guide/vhost_lib.rst |  7 ++++++-
 lib/librte_vhost/rte_virtio_net.h   |  1 +
 lib/librte_vhost/socket.c           |  5 +++++
 lib/librte_vhost/vhost.c            | 10 ++++++++++
 lib/librte_vhost/vhost.h            |  1 +
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index 6b0c6b2..15c2bf7 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -79,7 +79,7 @@ The following is an overview of the Vhost API functions:
   ``/dev/path`` character device file will be created. For vhost-user server
   mode, a Unix domain socket file ``path`` will be created.
 
-  Currently two flags are supported (these are valid for vhost-user only):
+  Currently supported flags are (these are valid for vhost-user only):
 
   - ``RTE_VHOST_USER_CLIENT``
 
@@ -97,6 +97,11 @@ The following is an overview of the Vhost API functions:
     This reconnect option is enabled by default. However, it can be turned off
     by setting this flag.
 
+  - ``RTE_VHOST_USER_TX_ZERO_COPY``
+
+    Tx zero copy will be enabled when this flag is set. It is disabled by
+    default.
+
 * ``rte_vhost_driver_session_start()``
 
   This function starts the vhost session loop to handle vhost messages. It
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 9caa622..5e437c6 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -53,6 +53,7 @@
 
 #define RTE_VHOST_USER_CLIENT		(1ULL << 0)
 #define RTE_VHOST_USER_NO_RECONNECT	(1ULL << 1)
+#define RTE_VHOST_USER_TX_ZERO_COPY	(1ULL << 2)
 
 /* Enum for virtqueue management. */
 enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index bf03f84..5c3962d 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -62,6 +62,7 @@ struct vhost_user_socket {
 	int connfd;
 	bool is_server;
 	bool reconnect;
+	bool tx_zero_copy;
 };
 
 struct vhost_user_connection {
@@ -203,6 +204,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	size = strnlen(vsocket->path, PATH_MAX);
 	vhost_set_ifname(vid, vsocket->path, size);
 
+	if (vsocket->tx_zero_copy)
+		vhost_enable_tx_zero_copy(vid);
+
 	RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
 
 	vsocket->connfd = fd;
@@ -499,6 +503,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 	memset(vsocket, 0, sizeof(struct vhost_user_socket));
 	vsocket->path = strdup(path);
 	vsocket->connfd = -1;
+	vsocket->tx_zero_copy = flags & RTE_VHOST_USER_TX_ZERO_COPY;
 
 	if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
 		vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index ab25649..5461e5b 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -290,6 +290,16 @@ vhost_set_ifname(int vid, const char *if_name, unsigned int if_len)
 	dev->ifname[sizeof(dev->ifname) - 1] = '\0';
 }
 
+void
+vhost_enable_tx_zero_copy(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->tx_zero_copy = 1;
+}
 
 int
 rte_vhost_get_numa_node(int vid)
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 718133e..3081180 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -279,6 +279,7 @@ void vhost_destroy_device(int);
 int alloc_vring_queue_pair(struct virtio_net *dev, uint32_t qp_idx);
 
 void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
+void vhost_enable_tx_zero_copy(int vid);
 
 /*
  * Backend-specific cleanup. Defined by vhost-cuse and vhost-user.
-- 
1.9.0

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy
  2016-09-06  9:00   ` Xu, Qian Q
@ 2016-09-06  9:42     ` Xu, Qian Q
  2016-09-06 10:02       ` Yuanhan Liu
  2016-09-06  9:55     ` Yuanhan Liu
  1 sibling, 1 reply; 75+ messages in thread
From: Xu, Qian Q @ 2016-09-06  9:42 UTC (permalink / raw)
  To: Xu, Qian Q, Yuanhan Liu, dev; +Cc: Maxime Coquelin

Another thing I'm curious about is the zero-copy setting. If I have 2 vhost interfaces, one set with
zero-copy=0 and the other with zero-copy=1, will vhost treat zero copy as
enabled for all of them, or only for the one? Does vhost allow such usage? Or do we need to 
enforce the same zero-copy value for all vhost interfaces? 

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy
  2016-09-06  9:00   ` Xu, Qian Q
  2016-09-06  9:42     ` Xu, Qian Q
@ 2016-09-06  9:55     ` Yuanhan Liu
  2016-09-07 16:00       ` Thomas Monjalon
  1 sibling, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-06  9:55 UTC (permalink / raw)
  To: Xu, Qian Q; +Cc: dev, Maxime Coquelin

On Tue, Sep 06, 2016 at 09:00:14AM +0000, Xu, Qian Q wrote:
> Just curious about the naming: vhost USER TX Zero copy. In fact, it's Vhost RX zero-copy
> For virtio, it's Virtio TX zero-copy. So, I wonder why we call it as Vhost TX ZERO-COPY, 
> Any comments? 

It's just that "Tx zero copy" looks more natural to me (yes, I took the
name from the virtio point of view).

Besides that, naming it "vhost Rx zero copy" would be a little
weird, given that we have functions like "virtio_dev_rx" in the enqueue
path while here we only touch the dequeue path.

OTOH, I seldom say "vhost-user Tx zero copy"; I normally refer to it
as "Tx zero copy", without mentioning "vhost-user". As for the flag
RTE_VHOST_USER_TX_ZERO_COPY, all vhost-user flags start with the
"RTE_VHOST_USER_" prefix.

	--yliu

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy
  2016-09-06  9:42     ` Xu, Qian Q
@ 2016-09-06 10:02       ` Yuanhan Liu
  2016-09-07  2:43         ` Xu, Qian Q
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-06 10:02 UTC (permalink / raw)
  To: Xu, Qian Q; +Cc: dev, Maxime Coquelin

On Tue, Sep 06, 2016 at 09:42:38AM +0000, Xu, Qian Q wrote:
> Another interesting thing to me is the ZERO-COPY settings. If I have 2 vhost, and 1 is set as
> Zero-copy=0, and another is set zero-copy=1, so the vhost will take it as Zero-copy
> Enabled for all vhost, or for one vhost.


The flag is per vhost-user socket file path.

If you have two vhost interfaces attached to the same socket file
(when it acts as the server), both vhost interfaces will have
zero copy enabled.

If the two vhost interfaces are attached to different socket files,
zero copy will only be enabled for an interface when the option is
given for the corresponding socket file.

> Does the vhost allow such usage? Or we need 

Yes, it's allowed.

> Enforce all vhost zero-copy to be a same number. 

Nope.

	--yliu
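
To illustrate the per-socket behaviour described above, here is a minimal standalone model. The flag values are taken from the patch; struct socket_sketch, struct device_sketch, register_sketch() and add_connection_sketch() are hypothetical stand-ins for the vhost_user_socket state, rte_vhost_driver_register() and vhost_user_add_connection(), not the library API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as defined in the rte_virtio_net.h hunk of this patch. */
#define RTE_VHOST_USER_CLIENT		(1ULL << 0)
#define RTE_VHOST_USER_NO_RECONNECT	(1ULL << 1)
#define RTE_VHOST_USER_TX_ZERO_COPY	(1ULL << 2)

/* Hypothetical model of the per-socket state kept at registration time. */
struct socket_sketch {
	const char *path;
	bool tx_zero_copy;
};

/* Hypothetical model of a device created for one connection. */
struct device_sketch {
	int tx_zero_copy;
};

/* Registration records the flag once, per socket file path. */
static struct socket_sketch
register_sketch(const char *path, uint64_t flags)
{
	struct socket_sketch s;

	assert(path != NULL);
	s.path = path;
	s.tx_zero_copy = (flags & RTE_VHOST_USER_TX_ZERO_COPY) != 0;
	return s;
}

/* Every connection accepted on a socket inherits that socket's setting. */
static struct device_sketch
add_connection_sketch(const struct socket_sketch *s)
{
	struct device_sketch d;

	d.tx_zero_copy = s->tx_zero_copy ? 1 : 0;
	return d;
}
```

In this model, two devices created from a socket registered with RTE_VHOST_USER_TX_ZERO_COPY both get tx_zero_copy = 1, while a device created from a socket registered with flags = 0 gets tx_zero_copy = 0.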

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy
  2016-09-06 10:02       ` Yuanhan Liu
@ 2016-09-07  2:43         ` Xu, Qian Q
  0 siblings, 0 replies; 75+ messages in thread
From: Xu, Qian Q @ 2016-09-07  2:43 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Maxime Coquelin

Thx for the clarification. As to the naming, although it's a little confusing, if people are fine with it, I'm fine. 


-----Original Message-----
From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] 
Sent: Tuesday, September 6, 2016 6:02 PM
To: Xu, Qian Q <qian.q.xu@intel.com>
Cc: dev@dpdk.org; Maxime Coquelin <maxime.coquelin@redhat.com>
Subject: Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy

On Tue, Sep 06, 2016 at 09:42:38AM +0000, Xu, Qian Q wrote:
> Another interesting thing to me is the ZERO-COPY settings. If I have 2 
> vhost, and 1 is set as Zero-copy=0, and another is set zero-copy=1, so 
> the vhost will take it as Zero-copy Enabled for all vhost, or for one vhost.


The flag is per vhost-user socket file path.

If you have two vhost interfaces attached on the same socket files (when it acts as the server), the two vhost interface will both with zero copy enabled.

If the two vhost interfaces are attached on different socket file, zero-copy will be only enabled when the option is given for corresponding socket file.

> Does the vhost allow such usage? Or we need

Yes, it's allowd.

> Enforce all vhost zero-copy to be a same number. 

Nope.

	--yliu

> 
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Xu, Qian Q
> Sent: Tuesday, September 06, 2016 5:00 PM
> To: Yuanhan Liu; dev@dpdk.org
> Cc: Maxime Coquelin
> Subject: Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx 
> zero copy
> 
> Just curious about the naming: vhost USER TX Zero copy. In fact, it's 
> Vhost RX zero-copy For virtio, it's Virtio TX zero-copy. So, I wonder 
> why we call it as Vhost TX ZERO-COPY, Any comments?
> 
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
> Sent: Tuesday, August 23, 2016 4:11 PM
> To: dev@dpdk.org
> Cc: Maxime Coquelin; Yuanhan Liu
> Subject: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero 
> copy
> 
> Add a new flag ``RTE_VHOST_USER_TX_ZERO_COPY`` to explictily enable Tx 
> zero copy. If not given, Tx zero copy is disabled by default.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  doc/guides/prog_guide/vhost_lib.rst |  7 ++++++-
>  lib/librte_vhost/rte_virtio_net.h   |  1 +
>  lib/librte_vhost/socket.c           |  5 +++++
>  lib/librte_vhost/vhost.c            | 10 ++++++++++
>  lib/librte_vhost/vhost.h            |  1 +
>  5 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/prog_guide/vhost_lib.rst 
> b/doc/guides/prog_guide/vhost_lib.rst
> index 6b0c6b2..15c2bf7 100644
> --- a/doc/guides/prog_guide/vhost_lib.rst
> +++ b/doc/guides/prog_guide/vhost_lib.rst
> @@ -79,7 +79,7 @@ The following is an overview of the Vhost API functions:
>    ``/dev/path`` character device file will be created. For vhost-user server
>    mode, a Unix domain socket file ``path`` will be created.
>  
> -  Currently two flags are supported (these are valid for vhost-user only):
> +  Currently supported flags are (these are valid for vhost-user only):
>  
>    - ``RTE_VHOST_USER_CLIENT``
>  
> @@ -97,6 +97,11 @@ The following is an overview of the Vhost API functions:
>      This reconnect option is enabled by default. However, it can be turned off
>      by setting this flag.
>  
> +  - ``RTE_VHOST_USER_TX_ZERO_COPY``
> +
> +    Tx zero copy will be enabled when this flag is set. It is disabled by
> +    default.
> +
>  * ``rte_vhost_driver_session_start()``
>  
>    This function starts the vhost session loop to handle vhost 
> messages. It diff --git a/lib/librte_vhost/rte_virtio_net.h 
> b/lib/librte_vhost/rte_virtio_net.h
> index 9caa622..5e437c6 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -53,6 +53,7 @@
>  
>  #define RTE_VHOST_USER_CLIENT		(1ULL << 0)
>  #define RTE_VHOST_USER_NO_RECONNECT	(1ULL << 1)
> +#define RTE_VHOST_USER_TX_ZERO_COPY	(1ULL << 2)
>  
>  /* Enum for virtqueue management. */
>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM}; diff --git 
> a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c index 
> bf03f84..5c3962d 100644
> --- a/lib/librte_vhost/socket.c
> +++ b/lib/librte_vhost/socket.c
> @@ -62,6 +62,7 @@ struct vhost_user_socket {
>  	int connfd;
>  	bool is_server;
>  	bool reconnect;
> +	bool tx_zero_copy;
>  };
>  
>  struct vhost_user_connection {
> @@ -203,6 +204,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
>  	size = strnlen(vsocket->path, PATH_MAX);
>  	vhost_set_ifname(vid, vsocket->path, size);
>  
> +	if (vsocket->tx_zero_copy)
> +		vhost_enable_tx_zero_copy(vid);
> +
>  	RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
>  
>  	vsocket->connfd = fd;
> @@ -499,6 +503,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
>  	memset(vsocket, 0, sizeof(struct vhost_user_socket));
>  	vsocket->path = strdup(path);
>  	vsocket->connfd = -1;
> +	vsocket->tx_zero_copy = flags & RTE_VHOST_USER_TX_ZERO_COPY;
>  
>  	if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
>  		vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> index ab25649..5461e5b 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -290,6 +290,16 @@ vhost_set_ifname(int vid, const char *if_name, unsigned int if_len)
>  	dev->ifname[sizeof(dev->ifname) - 1] = '\0';
>  }
>  
> +void
> +vhost_enable_tx_zero_copy(int vid)
> +{
> +	struct virtio_net *dev = get_device(vid);
> +
> +	if (dev == NULL)
> +		return;
> +
> +	dev->tx_zero_copy = 1;
> +}
>  
>  int
>  rte_vhost_get_numa_node(int vid)
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 718133e..3081180 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -279,6 +279,7 @@ void vhost_destroy_device(int);
>  int alloc_vring_queue_pair(struct virtio_net *dev, uint32_t qp_idx);
>  
>  void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
> +void vhost_enable_tx_zero_copy(int vid);
>  
>  /*
>   * Backend-specific cleanup. Defined by vhost-cuse and vhost-user.
> --
> 1.9.0


* Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy
  2016-09-06  9:55     ` Yuanhan Liu
@ 2016-09-07 16:00       ` Thomas Monjalon
  2016-09-08  7:21         ` Yuanhan Liu
  0 siblings, 1 reply; 75+ messages in thread
From: Thomas Monjalon @ 2016-09-07 16:00 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Xu, Qian Q, Maxime Coquelin

2016-09-06 17:55, Yuanhan Liu:
> On Tue, Sep 06, 2016 at 09:00:14AM +0000, Xu, Qian Q wrote:
> > Just curious about the naming: vhost USER TX Zero copy. In fact, it's Vhost RX zero-copy.
> > For virtio, it's Virtio TX zero-copy. So, I wonder why we call it Vhost TX ZERO-COPY.
> > Any comments?
>
> It's just that "Tx zero copy" looks more natural to me (yes, I took the
> name from the virtio point of view).
>
> Besides that, naming it "vhost Rx zero copy" would be a little
> weird, given that we have functions like "virtio_dev_rx" in the enqueue
> path while here we only touch the dequeue path.
>
> OTOH, I seldom say "vhost-user Tx zero copy"; I normally mention it
> as "Tx zero copy", without mentioning "vhost-user". As for the flag
> RTE_VHOST_USER_TX_ZERO_COPY: all vhost-user flags start with the
> "RTE_VHOST_USER_" prefix.

I agree that the naming in vhost code is quite confusing.
It would be better to define a terminology and stop mixing virtio/vhost
directions as well as Rx/Tx and enqueue/dequeue.
Or at least, it should be documented.


* Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy
  2016-09-07 16:00       ` Thomas Monjalon
@ 2016-09-08  7:21         ` Yuanhan Liu
  2016-09-08  7:57           ` Thomas Monjalon
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-08  7:21 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Xu, Qian Q, Maxime Coquelin

On Wed, Sep 07, 2016 at 06:00:36PM +0200, Thomas Monjalon wrote:
> 2016-09-06 17:55, Yuanhan Liu:
> > On Tue, Sep 06, 2016 at 09:00:14AM +0000, Xu, Qian Q wrote:
> > > Just curious about the naming: vhost USER TX Zero copy. In fact, it's Vhost RX zero-copy.
> > > For virtio, it's Virtio TX zero-copy. So, I wonder why we call it Vhost TX ZERO-COPY.
> > > Any comments?
> >
> > It's just that "Tx zero copy" looks more natural to me (yes, I took the
> > name from the virtio point of view).
> >
> > Besides that, naming it "vhost Rx zero copy" would be a little
> > weird, given that we have functions like "virtio_dev_rx" in the enqueue
> > path while here we only touch the dequeue path.
> >
> > OTOH, I seldom say "vhost-user Tx zero copy"; I normally mention it
> > as "Tx zero copy", without mentioning "vhost-user". As for the flag
> > RTE_VHOST_USER_TX_ZERO_COPY: all vhost-user flags start with the
> > "RTE_VHOST_USER_" prefix.
> 
> I agree that the naming in vhost code is quite confusing.
> It would be better to define a terminology and stop mixing virtio/vhost
> directions as well as Rx/Tx and enqueue/dequeue.

I think we could/should avoid using Rx/Tx in vhost, but we should keep
enqueue/dequeue: that's how the two key vhost APIs are named.

> Or at least, it should be documented.

Or, how about renaming it to RTE_VHOST_USER_DEQUEUE_ZERO_COPY, to align
with the function name rte_vhost_dequeue_burst?

	--yliu


* Re: [dpdk-dev] [PATCH 5/6] vhost: add a flag to enable Tx zero copy
  2016-09-08  7:21         ` Yuanhan Liu
@ 2016-09-08  7:57           ` Thomas Monjalon
  0 siblings, 0 replies; 75+ messages in thread
From: Thomas Monjalon @ 2016-09-08  7:57 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Xu, Qian Q, Maxime Coquelin

2016-09-08 15:21, Yuanhan Liu:
> On Wed, Sep 07, 2016 at 06:00:36PM +0200, Thomas Monjalon wrote:
> > 2016-09-06 17:55, Yuanhan Liu:
> > > On Tue, Sep 06, 2016 at 09:00:14AM +0000, Xu, Qian Q wrote:
> > > > Just curious about the naming: vhost USER TX Zero copy. In fact, it's Vhost RX zero-copy.
> > > > For virtio, it's Virtio TX zero-copy. So, I wonder why we call it Vhost TX ZERO-COPY.
> > > > Any comments?
> > >
> > > It's just that "Tx zero copy" looks more natural to me (yes, I took the
> > > name from the virtio point of view).
> > >
> > > Besides that, naming it "vhost Rx zero copy" would be a little
> > > weird, given that we have functions like "virtio_dev_rx" in the enqueue
> > > path while here we only touch the dequeue path.
> > >
> > > OTOH, I seldom say "vhost-user Tx zero copy"; I normally mention it
> > > as "Tx zero copy", without mentioning "vhost-user". As for the flag
> > > RTE_VHOST_USER_TX_ZERO_COPY: all vhost-user flags start with the
> > > "RTE_VHOST_USER_" prefix.
> > 
> > I agree that the naming in vhost code is quite confusing.
> > It would be better to define a terminology and stop mixing virtio/vhost
> > directions as well as Rx/Tx and enqueue/dequeue.
> 
> I think we could/should avoid using Rx/Tx in vhost, but we should keep
> enqueue/dequeue: that's how the two key vhost APIs are named.
> 
> > Or at least, it should be documented.
> 
> Or, how about renaming it to RTE_VHOST_USER_DEQUEUE_ZERO_COPY, to align
> with the function name rte_vhost_dequeue_burst?

Seems reasonable, yes.


* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-08-29  8:57   ` Xu, Qian Q
@ 2016-09-23  4:11     ` Yuanhan Liu
  0 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-23  4:11 UTC (permalink / raw)
  To: dev; +Cc: Xu, Qian Q, Maxime Coquelin

On Mon, Aug 29, 2016 at 08:57:52AM +0000, Xu, Qian Q wrote:
> Btw, some good news: if I run a simple dequeue workload (running rxonly in vhost-pmd and running txonly in guest testpmd), it yields a ~50% performance boost for packet size 1518B, but this case is without a NIC.
> In a similar case, vhost<-->virtio loopback, we see ~10% performance gains at 1518B without a NIC.
>
> Some bad news: with the patch, I noticed a 3%-7% performance drop when zero-copy=0, compared with current DPDK (e.g. 16.07), for vhost/virtio loopback and for vhost RX only + virtio TX only. It seems the patch
> impacts the zero-copy=0 performance a little.

There was some follow-up discussion internally: the 3%-7% drop reported
by Qian when zero-copy is not enabled is actually due to fluctuation.
So, a false alarm.

	--yliu


* [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue zero copy support
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
                   ` (7 preceding siblings ...)
  2016-08-29  8:32 ` Xu, Qian Q
@ 2016-09-23  4:13 ` Yuanhan Liu
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 1/7] vhost: simplify memory regions handling Yuanhan Liu
                     ` (7 more replies)
  2016-10-09 10:46 ` [dpdk-dev] [PATCH 0/6] vhost: add Tx " linhaifeng
  9 siblings, 8 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-23  4:13 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

v2: - renamed "tx zero copy" to "dequeue zero copy", to reduce confusion.
    - handle the case where a desc buf might span 2 host phys pages
    - use MAP_POPULATE to let the kernel populate the page tables
    - updated the release notes
    - documented the limitations for the vm2nic case
    - merge 2 contiguous guest phys memory regions
    - and a few more trivial changes; please see them in the corresponding
      patches

This patch set enables vhost dequeue zero copy. Most of the work is in
patch 4: "vhost: add dequeue zero copy".

The basic idea of dequeue zero copy is, instead of copying data from the
desc buf, here we let the mbuf reference the desc buf addr directly.

The major issue behind that is how and when to update the used ring.
You could check the commit log of patch 4 for more details.

Patch 5 introduces a new flag, RTE_VHOST_USER_DEQUEUE_ZERO_COPY, to enable
dequeue zero copy, which is disabled by default.

The performance gain is quite impressive. For a simple dequeue workload
(running rxonly in vhost-pmd and running txonly in guest testpmd), it yields
a 50+% performance boost for packet size 1500B. For the VM2VM iperf test case,
it's even better: about a 70% boost.

For small packets, the performance is worse (this is expected, as the extra
overhead introduced by zero copy outweighs the benefit of saving a copy of
a few bytes).
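
To make the basic idea above concrete, here is a minimal sketch of the two
dequeue strategies: copying the desc buf versus letting the mbuf reference
it. It is an illustration only, not the code from patch 4. The names
"toy_mbuf", "toy_dequeue_copy" and "toy_dequeue_zero_copy" are hypothetical
and are not DPDK APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a packet buffer; this is NOT struct rte_mbuf. */
struct toy_mbuf {
	uint8_t  own_room[2048]; /* private data room, used by the copy path */
	uint8_t *data;           /* where the packet bytes actually live */
	size_t   len;
	int      borrowed;       /* 1 if data points into the guest desc buf */
};

/* Copy path: duplicate the desc buf into the mbuf's own data room. */
static int
toy_dequeue_copy(struct toy_mbuf *m, const uint8_t *desc_buf, size_t len)
{
	assert(m != NULL && desc_buf != NULL);
	if (len > sizeof(m->own_room))
		return -1;

	memcpy(m->own_room, desc_buf, len);
	m->data = m->own_room;
	m->len = len;
	m->borrowed = 0;
	return 0;
}

/* Zero copy path: only record the address; no payload bytes are moved. */
static int
toy_dequeue_zero_copy(struct toy_mbuf *m, uint8_t *desc_buf, size_t len)
{
	assert(m != NULL && desc_buf != NULL);

	m->data = desc_buf;
	m->len = len;
	m->borrowed = 1;
	return 0;
}
```

In the zero copy case the mbuf still points at guest memory after the
dequeue returns, which is why the used ring can only be updated once the
mbuf has been consumed.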

---
Yuanhan Liu (7):
  vhost: simplify memory regions handling
  vhost: get guest/host physical address mappings
  vhost: introduce last avail idx for dequeue
  vhost: add dequeue zero copy
  vhost: add a flag to enable dequeue zero copy
  examples/vhost: add an option to enable dequeue zero copy
  net/vhost: add an option to enable dequeue zero copy

 doc/guides/prog_guide/vhost_lib.rst    |  35 +++-
 doc/guides/rel_notes/release_16_11.rst |  11 ++
 drivers/net/vhost/rte_eth_vhost.c      |  13 ++
 examples/vhost/main.c                  |  19 +-
 lib/librte_vhost/rte_virtio_net.h      |   1 +
 lib/librte_vhost/socket.c              |   5 +
 lib/librte_vhost/vhost.c               |  12 ++
 lib/librte_vhost/vhost.h               | 102 ++++++++---
 lib/librte_vhost/vhost_user.c          | 315 ++++++++++++++++++++++-----------
 lib/librte_vhost/virtio_net.c          | 192 +++++++++++++++++---
 10 files changed, 543 insertions(+), 162 deletions(-)

-- 
1.9.0


* [dpdk-dev] [PATCH v2 1/7] vhost: simplify memory regions handling
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
@ 2016-09-23  4:13   ` Yuanhan Liu
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 2/7] vhost: get guest/host physical address mappings Yuanhan Liu
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-23  4:13 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

For historical reasons (vhost-cuse came before vhost-user), some
fields for maintaining the vhost-user memory mappings (such as the mmapped
address and size, which we need in order to unmap on destroy) are kept in
the "orig_region_map" struct, a structure that is defined only in the
vhost-user source file.

The right way to go is to remove the structure and move all those fields
into the virtio_memory_region struct. But we simply couldn't do that before,
because it would have broken the ABI.

Now, thanks to the ABI refactoring, that is no longer a blocking issue.
And here it goes: this patch removes orig_region_map and
redefines virtio_memory_region to include all necessary info.

With that, we can simplify the guest/host address conversion a bit.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
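
For readers who want to try the simplified address conversion in
isolation, here is a standalone sketch of the region lookup. It assumes a
trimmed-down region struct holding only the three fields the lookup needs.
"struct mem_region" and "toy_gpa_to_vva" are hypothetical names, and the
addresses in any usage are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Only the fields the lookup needs; the struct in the patch has more. */
struct mem_region {
	uint64_t guest_phys_addr; /* start of the region in guest physical space */
	uint64_t host_user_addr;  /* where that start is mapped in this process */
	uint64_t size;            /* region length in bytes */
};

/*
 * Translate a guest physical address to a host virtual address.
 * Each region covers the half-open range [guest_phys_addr,
 * guest_phys_addr + size). Returns 0 when no region contains gpa.
 */
static uint64_t
toy_gpa_to_vva(const struct mem_region *regs, uint32_t nregions, uint64_t gpa)
{
	uint32_t i;

	assert(regs != NULL || nregions == 0);

	for (i = 0; i < nregions; i++) {
		const struct mem_region *reg = &regs[i];

		if (gpa >= reg->guest_phys_addr &&
		    gpa <  reg->guest_phys_addr + reg->size)
			return gpa - reg->guest_phys_addr +
			       reg->host_user_addr;
	}

	return 0;
}
```

Note that the end of the range is exclusive here, as in the new code,
whereas the removed code compared against guest_phys_address_end with "<=".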
 lib/librte_vhost/vhost.h      |  49 ++++++------
 lib/librte_vhost/vhost_user.c | 173 +++++++++++++++++-------------------------
 2 files changed, 91 insertions(+), 131 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index c2dfc3c..df2107b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -143,12 +143,14 @@ struct virtio_net {
  * Information relating to memory regions including offsets to
  * addresses in QEMUs memory file.
  */
-struct virtio_memory_regions {
-	uint64_t guest_phys_address;
-	uint64_t guest_phys_address_end;
-	uint64_t memory_size;
-	uint64_t userspace_address;
-	uint64_t address_offset;
+struct virtio_memory_region {
+	uint64_t guest_phys_addr;
+	uint64_t guest_user_addr;
+	uint64_t host_user_addr;
+	uint64_t size;
+	void	 *mmap_addr;
+	uint64_t mmap_size;
+	int fd;
 };
 
 
@@ -156,12 +158,8 @@ struct virtio_memory_regions {
  * Memory structure includes region and mapping information.
  */
 struct virtio_memory {
-	/* Base QEMU userspace address of the memory file. */
-	uint64_t base_address;
-	uint64_t mapped_address;
-	uint64_t mapped_size;
 	uint32_t nregions;
-	struct virtio_memory_regions regions[0];
+	struct virtio_memory_region regions[0];
 };
 
 
@@ -200,26 +198,23 @@ extern uint64_t VHOST_FEATURES;
 #define MAX_VHOST_DEVICE	1024
 extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
 
-/**
- * Function to convert guest physical addresses to vhost virtual addresses.
- * This is used to convert guest virtio buffer addresses.
- */
+/* Convert guest physical Address to host virtual address */
 static inline uint64_t __attribute__((always_inline))
-gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
+gpa_to_vva(struct virtio_net *dev, uint64_t gpa)
 {
-	struct virtio_memory_regions *region;
-	uint32_t regionidx;
-	uint64_t vhost_va = 0;
-
-	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
-		region = &dev->mem->regions[regionidx];
-		if ((guest_pa >= region->guest_phys_address) &&
-			(guest_pa <= region->guest_phys_address_end)) {
-			vhost_va = region->address_offset + guest_pa;
-			break;
+	struct virtio_memory_region *reg;
+	uint32_t i;
+
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+		if (gpa >= reg->guest_phys_addr &&
+		    gpa <  reg->guest_phys_addr + reg->size) {
+			return gpa - reg->guest_phys_addr +
+			       reg->host_user_addr;
 		}
 	}
-	return vhost_va;
+
+	return 0;
 }
 
 struct virtio_net_device_ops const *notify_ops;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index eee99e9..49585b8 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -74,18 +74,6 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
 	[VHOST_USER_SEND_RARP]  = "VHOST_USER_SEND_RARP",
 };
 
-struct orig_region_map {
-	int fd;
-	uint64_t mapped_address;
-	uint64_t mapped_size;
-	uint64_t blksz;
-};
-
-#define orig_region(ptr, nregions) \
-	((struct orig_region_map *)RTE_PTR_ADD((ptr), \
-		sizeof(struct virtio_memory) + \
-		sizeof(struct virtio_memory_regions) * (nregions)))
-
 static uint64_t
 get_blk_size(int fd)
 {
@@ -99,18 +87,17 @@ get_blk_size(int fd)
 static void
 free_mem_region(struct virtio_net *dev)
 {
-	struct orig_region_map *region;
-	unsigned int idx;
+	uint32_t i;
+	struct virtio_memory_region *reg;
 
 	if (!dev || !dev->mem)
 		return;
 
-	region = orig_region(dev->mem, dev->mem->nregions);
-	for (idx = 0; idx < dev->mem->nregions; idx++) {
-		if (region[idx].mapped_address) {
-			munmap((void *)(uintptr_t)region[idx].mapped_address,
-					region[idx].mapped_size);
-			close(region[idx].fd);
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+		if (reg->host_user_addr) {
+			munmap(reg->mmap_addr, reg->mmap_size);
+			close(reg->fd);
 		}
 	}
 }
@@ -120,7 +107,7 @@ vhost_backend_cleanup(struct virtio_net *dev)
 {
 	if (dev->mem) {
 		free_mem_region(dev);
-		free(dev->mem);
+		rte_free(dev->mem);
 		dev->mem = NULL;
 	}
 	if (dev->log_addr) {
@@ -286,25 +273,23 @@ numa_realloc(struct virtio_net *dev, int index __rte_unused)
  * used to convert the ring addresses to our address space.
  */
 static uint64_t
-qva_to_vva(struct virtio_net *dev, uint64_t qemu_va)
+qva_to_vva(struct virtio_net *dev, uint64_t qva)
 {
-	struct virtio_memory_regions *region;
-	uint64_t vhost_va = 0;
-	uint32_t regionidx = 0;
+	struct virtio_memory_region *reg;
+	uint32_t i;
 
 	/* Find the region where the address lives. */
-	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
-		region = &dev->mem->regions[regionidx];
-		if ((qemu_va >= region->userspace_address) &&
-			(qemu_va <= region->userspace_address +
-			region->memory_size)) {
-			vhost_va = qemu_va + region->guest_phys_address +
-				region->address_offset -
-				region->userspace_address;
-			break;
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+
+		if (qva >= reg->guest_user_addr &&
+		    qva <  reg->guest_user_addr + reg->size) {
+			return qva - reg->guest_user_addr +
+			       reg->host_user_addr;
 		}
 	}
-	return vhost_va;
+
+	return 0;
 }
 
 /*
@@ -391,11 +376,13 @@ static int
 vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 {
 	struct VhostUserMemory memory = pmsg->payload.memory;
-	struct virtio_memory_regions *pregion;
-	uint64_t mapped_address, mapped_size;
-	unsigned int idx = 0;
-	struct orig_region_map *pregion_orig;
+	struct virtio_memory_region *reg;
+	void *mmap_addr;
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
 	uint64_t alignment;
+	uint32_t i;
+	int fd;
 
 	/* Remove from the data plane. */
 	if (dev->flags & VIRTIO_DEV_RUNNING) {
@@ -405,14 +392,12 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 
 	if (dev->mem) {
 		free_mem_region(dev);
-		free(dev->mem);
+		rte_free(dev->mem);
 		dev->mem = NULL;
 	}
 
-	dev->mem = calloc(1,
-		sizeof(struct virtio_memory) +
-		sizeof(struct virtio_memory_regions) * memory.nregions +
-		sizeof(struct orig_region_map) * memory.nregions);
+	dev->mem = rte_zmalloc("vhost-mem-table", sizeof(struct virtio_memory) +
+		sizeof(struct virtio_memory_region) * memory.nregions, 0);
 	if (dev->mem == NULL) {
 		RTE_LOG(ERR, VHOST_CONFIG,
 			"(%d) failed to allocate memory for dev->mem\n",
@@ -421,22 +406,17 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 	}
 	dev->mem->nregions = memory.nregions;
 
-	pregion_orig = orig_region(dev->mem, memory.nregions);
-	for (idx = 0; idx < memory.nregions; idx++) {
-		pregion = &dev->mem->regions[idx];
-		pregion->guest_phys_address =
-			memory.regions[idx].guest_phys_addr;
-		pregion->guest_phys_address_end =
-			memory.regions[idx].guest_phys_addr +
-			memory.regions[idx].memory_size;
-		pregion->memory_size =
-			memory.regions[idx].memory_size;
-		pregion->userspace_address =
-			memory.regions[idx].userspace_addr;
-
-		/* This is ugly */
-		mapped_size = memory.regions[idx].memory_size +
-			memory.regions[idx].mmap_offset;
+	for (i = 0; i < memory.nregions; i++) {
+		fd  = pmsg->fds[i];
+		reg = &dev->mem->regions[i];
+
+		reg->guest_phys_addr = memory.regions[i].guest_phys_addr;
+		reg->guest_user_addr = memory.regions[i].userspace_addr;
+		reg->size            = memory.regions[i].memory_size;
+		reg->fd              = fd;
+
+		mmap_offset = memory.regions[i].mmap_offset;
+		mmap_size   = reg->size + mmap_offset;
 
 		/* mmap() without flag of MAP_ANONYMOUS, should be called
 		 * with length argument aligned with hugepagesz at older
@@ -446,67 +426,52 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		 * to avoid failure, make sure in caller to keep length
 		 * aligned.
 		 */
-		alignment = get_blk_size(pmsg->fds[idx]);
+		alignment = get_blk_size(fd);
 		if (alignment == (uint64_t)-1) {
 			RTE_LOG(ERR, VHOST_CONFIG,
 				"couldn't get hugepage size through fstat\n");
 			goto err_mmap;
 		}
-		mapped_size = RTE_ALIGN_CEIL(mapped_size, alignment);
+		mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment);
 
-		mapped_address = (uint64_t)(uintptr_t)mmap(NULL,
-			mapped_size,
-			PROT_READ | PROT_WRITE, MAP_SHARED,
-			pmsg->fds[idx],
-			0);
+		mmap_addr = mmap(NULL, mmap_size,
+				 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 
-		RTE_LOG(INFO, VHOST_CONFIG,
-			"mapped region %d fd:%d to:%p sz:0x%"PRIx64" "
-			"off:0x%"PRIx64" align:0x%"PRIx64"\n",
-			idx, pmsg->fds[idx], (void *)(uintptr_t)mapped_address,
-			mapped_size, memory.regions[idx].mmap_offset,
-			alignment);
-
-		if (mapped_address == (uint64_t)(uintptr_t)MAP_FAILED) {
+		if (mmap_addr == MAP_FAILED) {
 			RTE_LOG(ERR, VHOST_CONFIG,
-				"mmap qemu guest failed.\n");
+				"mmap region %u failed.\n", i);
 			goto err_mmap;
 		}
 
-		pregion_orig[idx].mapped_address = mapped_address;
-		pregion_orig[idx].mapped_size = mapped_size;
-		pregion_orig[idx].blksz = alignment;
-		pregion_orig[idx].fd = pmsg->fds[idx];
-
-		mapped_address +=  memory.regions[idx].mmap_offset;
+		reg->mmap_addr = mmap_addr;
+		reg->mmap_size = mmap_size;
+		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
+				      mmap_offset;
 
-		pregion->address_offset = mapped_address -
-			pregion->guest_phys_address;
-
-		if (memory.regions[idx].guest_phys_addr == 0) {
-			dev->mem->base_address =
-				memory.regions[idx].userspace_addr;
-			dev->mem->mapped_address =
-				pregion->address_offset;
-		}
-
-		LOG_DEBUG(VHOST_CONFIG,
-			"REGION: %u GPA: %p QEMU VA: %p SIZE (%"PRIu64")\n",
-			idx,
-			(void *)(uintptr_t)pregion->guest_phys_address,
-			(void *)(uintptr_t)pregion->userspace_address,
-			 pregion->memory_size);
+		RTE_LOG(INFO, VHOST_CONFIG,
+			"guest memory region %u, size: 0x%" PRIx64 "\n"
+			"\t guest physical addr: 0x%" PRIx64 "\n"
+			"\t guest virtual  addr: 0x%" PRIx64 "\n"
+			"\t host  virtual  addr: 0x%" PRIx64 "\n"
+			"\t mmap addr : 0x%" PRIx64 "\n"
+			"\t mmap size : 0x%" PRIx64 "\n"
+			"\t mmap align: 0x%" PRIx64 "\n"
+			"\t mmap off  : 0x%" PRIx64 "\n",
+			i, reg->size,
+			reg->guest_phys_addr,
+			reg->guest_user_addr,
+			reg->host_user_addr,
+			(uint64_t)(uintptr_t)mmap_addr,
+			mmap_size,
+			alignment,
+			mmap_offset);
 	}
 
 	return 0;
 
 err_mmap:
-	while (idx--) {
-		munmap((void *)(uintptr_t)pregion_orig[idx].mapped_address,
-				pregion_orig[idx].mapped_size);
-		close(pregion_orig[idx].fd);
-	}
-	free(dev->mem);
+	free_mem_region(dev);
+	rte_free(dev->mem);
 	dev->mem = NULL;
 	return -1;
 }
-- 
1.9.0


* [dpdk-dev] [PATCH v2 2/7] vhost: get guest/host physical address mappings
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 1/7] vhost: simplify memory regions handling Yuanhan Liu
@ 2016-09-23  4:13   ` Yuanhan Liu
  2016-09-26 20:17     ` Maxime Coquelin
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 3/7] vhost: introduce last avail idx for dequeue Yuanhan Liu
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-23  4:13 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

So that we can convert a guest physical address to a host physical
address, which will be used in the later dequeue zero copy implementation.

MAP_POPULATE is set while mmapping guest memory regions, to make
sure the page tables are set up so that rte_mem_virt2phy() can
yield a proper physical address.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---

v2: - use the MAP_POPULATE option to make sure the page tables are
      already set up when getting the phys address

    - do a simple merge if the last 2 pages are contiguous

    - dump guest pages only in debug mode
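
As an illustration of the merge and lookup logic, here is a standalone
sketch. It assumes a fixed-size table instead of the realloc-grown array in
the patch, and it takes host physical addresses as plain arguments rather
than obtaining them from rte_mem_virt2phy(). All "toy_" names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PAGES 8

struct toy_guest_page {
	uint64_t guest_phys_addr;
	uint64_t host_phys_addr;
	uint64_t size;
};

struct toy_page_table {
	uint32_t nr;
	struct toy_guest_page pages[TOY_MAX_PAGES];
};

/* Append one page, or grow the last entry if the host range continues it. */
static void
toy_add_page(struct toy_page_table *t, uint64_t gpa, uint64_t hpa,
	     uint64_t size)
{
	if (t->nr > 0) {
		struct toy_guest_page *last = &t->pages[t->nr - 1];

		if (hpa == last->host_phys_addr + last->size) {
			last->size += size;
			return;
		}
	}

	assert(t->nr < TOY_MAX_PAGES);
	t->pages[t->nr].guest_phys_addr = gpa;
	t->pages[t->nr].host_phys_addr  = hpa;
	t->pages[t->nr].size = size;
	t->nr++;
}

/*
 * Return the host physical address for [gpa, gpa + size) when the whole
 * range falls inside one entry, 0 otherwise. The end comparison is strict,
 * mirroring the patch, so a range ending exactly on an entry boundary is
 * not matched.
 */
static uint64_t
toy_gpa_to_hpa(const struct toy_page_table *t, uint64_t gpa, uint64_t size)
{
	uint32_t i;

	for (i = 0; i < t->nr; i++) {
		const struct toy_guest_page *p = &t->pages[i];

		if (gpa >= p->guest_phys_addr &&
		    gpa + size < p->guest_phys_addr + p->size)
			return gpa - p->guest_phys_addr + p->host_phys_addr;
	}

	return 0;
}
```

A return value of 0 is how a caller can tell that a buffer crosses two
entries that are not contiguous in host physical memory.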
---
 lib/librte_vhost/vhost.h      |  30 +++++++++++++
 lib/librte_vhost/vhost_user.c | 100 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index df2107b..2d52987 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -114,6 +114,12 @@ struct vhost_virtqueue {
  #define VIRTIO_F_VERSION_1 32
 #endif
 
+struct guest_page {
+	uint64_t guest_phys_addr;
+	uint64_t host_phys_addr;
+	uint64_t size;
+};
+
 /**
  * Device structure contains all configuration information relating
  * to the device.
@@ -137,6 +143,10 @@ struct virtio_net {
 	uint64_t		log_addr;
 	struct ether_addr	mac;
 
+	uint32_t		nr_guest_pages;
+	uint32_t		max_guest_pages;
+	struct guest_page       *guest_pages;
+
 } __rte_cache_aligned;
 
 /**
@@ -217,6 +227,26 @@ gpa_to_vva(struct virtio_net *dev, uint64_t gpa)
 	return 0;
 }
 
+/* Convert guest physical address to host physical address */
+static inline phys_addr_t __attribute__((always_inline))
+gpa_to_hpa(struct virtio_net *dev, uint64_t gpa, uint64_t size)
+{
+	uint32_t i;
+	struct guest_page *page;
+
+	for (i = 0; i < dev->nr_guest_pages; i++) {
+		page = &dev->guest_pages[i];
+
+		if (gpa >= page->guest_phys_addr &&
+		    gpa + size < page->guest_phys_addr + page->size) {
+			return gpa - page->guest_phys_addr +
+			       page->host_phys_addr;
+		}
+	}
+
+	return 0;
+}
+
 struct virtio_net_device_ops const *notify_ops;
 struct virtio_net *get_device(int vid);
 
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 49585b8..e651912 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -372,6 +372,91 @@ vhost_user_set_vring_base(struct virtio_net *dev,
 	return 0;
 }
 
+static void
+add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
+		   uint64_t host_phys_addr, uint64_t size)
+{
+	struct guest_page *page, *last_page;
+
+	if (dev->nr_guest_pages == dev->max_guest_pages) {
+		dev->max_guest_pages *= 2;
+		dev->guest_pages = realloc(dev->guest_pages,
+					dev->max_guest_pages * sizeof(*page));
+	}
+
+	if (dev->nr_guest_pages > 0) {
+		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
+		/* merge if the two pages are continuous */
+		if (host_phys_addr == last_page->host_phys_addr +
+				      last_page->size) {
+			last_page->size += size;
+			return;
+		}
+	}
+
+	page = &dev->guest_pages[dev->nr_guest_pages++];
+	page->guest_phys_addr = guest_phys_addr;
+	page->host_phys_addr  = host_phys_addr;
+	page->size = size;
+}
+
+static void
+add_guest_pages(struct virtio_net *dev, struct virtio_memory_region *reg,
+		uint64_t page_size)
+{
+	uint64_t reg_size = reg->size;
+	uint64_t host_user_addr  = reg->host_user_addr;
+	uint64_t guest_phys_addr = reg->guest_phys_addr;
+	uint64_t host_phys_addr;
+	uint64_t size;
+
+	host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)host_user_addr);
+	size = page_size - (guest_phys_addr & (page_size - 1));
+	size = RTE_MIN(size, reg_size);
+
+	add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size);
+	host_user_addr  += size;
+	guest_phys_addr += size;
+	reg_size -= size;
+
+	while (reg_size > 0) {
+		host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)
+						  host_user_addr);
+		add_one_guest_page(dev, guest_phys_addr, host_phys_addr,
+				   page_size);
+
+		host_user_addr  += page_size;
+		guest_phys_addr += page_size;
+		reg_size -= page_size;
+	}
+}
+
+#ifdef RTE_LIBRTE_VHOST_DEBUG
+/* TODO: enable it only in debug mode? */
+static void
+dump_guest_pages(struct virtio_net *dev)
+{
+	uint32_t i;
+	struct guest_page *page;
+
+	for (i = 0; i < dev->nr_guest_pages; i++) {
+		page = &dev->guest_pages[i];
+
+		RTE_LOG(INFO, VHOST_CONFIG,
+			"guest physical page region %u\n"
+			"\t guest_phys_addr: %" PRIx64 "\n"
+			"\t host_phys_addr : %" PRIx64 "\n"
+			"\t size           : %" PRIx64 "\n",
+			i,
+			page->guest_phys_addr,
+			page->host_phys_addr,
+			page->size);
+	}
+}
+#else
+#define dump_guest_pages(dev)
+#endif
+
 static int
 vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 {
@@ -396,6 +481,13 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		dev->mem = NULL;
 	}
 
+	dev->nr_guest_pages = 0;
+	if (!dev->guest_pages) {
+		dev->max_guest_pages = 8;
+		dev->guest_pages = malloc(dev->max_guest_pages *
+						sizeof(struct guest_page));
+	}
+
 	dev->mem = rte_zmalloc("vhost-mem-table", sizeof(struct virtio_memory) +
 		sizeof(struct virtio_memory_region) * memory.nregions, 0);
 	if (dev->mem == NULL) {
@@ -434,8 +526,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		}
 		mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment);
 
-		mmap_addr = mmap(NULL, mmap_size,
-				 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+				 MAP_SHARED | MAP_POPULATE, fd, 0);
 
 		if (mmap_addr == MAP_FAILED) {
 			RTE_LOG(ERR, VHOST_CONFIG,
@@ -448,6 +540,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
 				      mmap_offset;
 
+		add_guest_pages(dev, reg, alignment);
+
 		RTE_LOG(INFO, VHOST_CONFIG,
 			"guest memory region %u, size: 0x%" PRIx64 "\n"
 			"\t guest physical addr: 0x%" PRIx64 "\n"
@@ -467,6 +561,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 			mmap_offset);
 	}
 
+	dump_guest_pages(dev);
+
 	return 0;
 
 err_mmap:
-- 
1.9.0


* [dpdk-dev] [PATCH v2 3/7] vhost: introduce last avail idx for dequeue
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 1/7] vhost: simplify memory regions handling Yuanhan Liu
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 2/7] vhost: get guest/host physical address mappings Yuanhan Liu
@ 2016-09-23  4:13   ` Yuanhan Liu
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy Yuanhan Liu
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-23  4:13 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

So far, we retrieve both the used ring and avail ring idx from the var
last_used_idx; that is not a problem because the used ring is updated
immediately after those avail entries are consumed.

But that's not true when dequeue zero copy is enabled: the used ring is
then updated only when the mbuf is consumed. Thus, we need another var to
note the last avail ring idx we have consumed.

Therefore, last_avail_idx is introduced.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
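
A small standalone sketch of the index bookkeeping may help here. It
assumes free-running 16-bit indexes and a power-of-two ring size, as the
masking with (vq->size - 1) in the diff below implies. "struct toy_vq" and
the helper names are hypothetical and are not the vhost structures.

```c
#include <assert.h>
#include <stdint.h>

struct toy_vq {
	uint16_t size;           /* ring size, must be a power of two */
	uint16_t avail_idx;      /* free-running index published by the guest */
	uint16_t last_avail_idx; /* next avail entry the host will consume */
	uint16_t last_used_idx;  /* next used entry the host will fill */
};

/* Entries the guest has made available and the host has not consumed yet.
 * Unsigned 16-bit subtraction stays correct across index wraparound. */
static uint16_t
toy_free_entries(const struct toy_vq *vq)
{
	return (uint16_t)(vq->avail_idx - vq->last_avail_idx);
}

/* Map a free-running index to a slot in the ring. */
static uint16_t
toy_ring_slot(const struct toy_vq *vq, uint16_t idx)
{
	assert(vq->size != 0 && (vq->size & (vq->size - 1)) == 0);
	return (uint16_t)(idx & (vq->size - 1));
}

/* Dequeue n entries; with zero copy the used ring is not touched yet. */
static void
toy_consume(struct toy_vq *vq, uint16_t n)
{
	vq->last_avail_idx = (uint16_t)(vq->last_avail_idx + n);
}

/* Later, once the mbufs are consumed, hand n entries back to the guest. */
static void
toy_complete(struct toy_vq *vq, uint16_t n)
{
	vq->last_used_idx = (uint16_t)(vq->last_used_idx + n);
}
```

Between toy_consume() and toy_complete() the two indexes differ, which is
exactly the state a single last_used_idx cannot represent.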
 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/vhost_user.c |  6 ++++--
 lib/librte_vhost/virtio_net.c | 19 +++++++++++--------
 3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2d52987..8565fa1 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -70,7 +70,7 @@ struct vhost_virtqueue {
 	struct vring_used	*used;
 	uint32_t		size;
 
-	/* Last index used on the available ring */
+	uint16_t		last_avail_idx;
 	volatile uint16_t	last_used_idx;
 #define VIRTIO_INVALID_EVENTFD		(-1)
 #define VIRTIO_UNINITIALIZED_EVENTFD	(-2)
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index e651912..a92377a 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -343,7 +343,8 @@ vhost_user_set_vring_addr(struct virtio_net *dev, struct vhost_vring_addr *addr)
 			"last_used_idx (%u) and vq->used->idx (%u) mismatches; "
 			"some packets maybe resent for Tx and dropped for Rx\n",
 			vq->last_used_idx, vq->used->idx);
-		vq->last_used_idx     = vq->used->idx;
+		vq->last_used_idx  = vq->used->idx;
+		vq->last_avail_idx = vq->used->idx;
 	}
 
 	vq->log_guest_addr = addr->log_guest_addr;
@@ -367,7 +368,8 @@ static int
 vhost_user_set_vring_base(struct virtio_net *dev,
 			  struct vhost_vring_state *state)
 {
-	dev->virtqueue[state->index]->last_used_idx = state->num;
+	dev->virtqueue[state->index]->last_used_idx  = state->num;
+	dev->virtqueue[state->index]->last_avail_idx = state->num;
 
 	return 0;
 }
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 8a151af..1c2ee47 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -846,16 +846,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 		}
 	}
 
-	avail_idx =  *((volatile uint16_t *)&vq->avail->idx);
-	free_entries = avail_idx - vq->last_used_idx;
+	free_entries = *((volatile uint16_t *)&vq->avail->idx) -
+			vq->last_avail_idx;
 	if (free_entries == 0)
 		goto out;
 
 	LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
 
-	/* Prefetch available ring to retrieve head indexes. */
-	used_idx = vq->last_used_idx & (vq->size - 1);
-	rte_prefetch0(&vq->avail->ring[used_idx]);
+	/* Prefetch available and used ring */
+	avail_idx = vq->last_avail_idx & (vq->size - 1);
+	used_idx  = vq->last_used_idx  & (vq->size - 1);
+	rte_prefetch0(&vq->avail->ring[avail_idx]);
 	rte_prefetch0(&vq->used->ring[used_idx]);
 
 	count = RTE_MIN(count, MAX_PKT_BURST);
@@ -865,8 +866,9 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < count; i++) {
-		used_idx = (vq->last_used_idx + i) & (vq->size - 1);
-		desc_indexes[i] = vq->avail->ring[used_idx];
+		avail_idx = (vq->last_avail_idx + i) & (vq->size - 1);
+		used_idx  = (vq->last_used_idx  + i) & (vq->size - 1);
+		desc_indexes[i] = vq->avail->ring[avail_idx];
 
 		vq->used->ring[used_idx].id  = desc_indexes[i];
 		vq->used->ring[used_idx].len = 0;
@@ -900,7 +902,8 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	rte_smp_wmb();
 	rte_smp_rmb();
 	vq->used->idx += i;
-	vq->last_used_idx += i;
+	vq->last_avail_idx += i;
+	vq->last_used_idx  += i;
 	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
 			sizeof(vq->used->idx));
 
-- 
1.9.0

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
                     ` (2 preceding siblings ...)
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 3/7] vhost: introduce last avail idx for dequeue Yuanhan Liu
@ 2016-09-23  4:13   ` Yuanhan Liu
  2016-09-26 20:45     ` Maxime Coquelin
  2016-10-06 14:37     ` Xu, Qian Q
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 5/7] vhost: add a flag to enable " Yuanhan Liu
                     ` (3 subsequent siblings)
  7 siblings, 2 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-23  4:13 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

The basic idea of dequeue zero copy is that, instead of copying data from
the desc buf, we let the mbuf reference the desc buf addr directly.

Doing so, however, has one major issue: we can't update the used ring
at the end of rte_vhost_dequeue_burst. Because we don't do the copy
here, an update of the used ring would let the driver reclaim the
desc buf. As a result, DPDK might reference a stale memory region.

To update the used ring properly, this patch does several tricks:

- When an mbuf references a desc buf, its refcnt is incremented by 1.

  This pins the mbuf, so that an mbuf free from the DPDK side
  won't actually free it; instead, refcnt is decremented by 1.

- We chain all those mbufs together (by tailq).

  We check the list every time on rte_vhost_dequeue_burst entrance
  to see if an mbuf has been freed (its refcnt equals 1). If so,
  we are the last user of this mbuf and it is safe to update the
  used ring.

- "struct zcopy_mbuf" is introduced to associate an mbuf with the
  right desc idx.
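
The three tricks above can be sketched in a few lines of standalone C.
This is only a toy model under stated assumptions: struct toy_mbuf,
struct toy_zmbuf, struct toy_vq and the toy_* functions are hypothetical
names, a fixed array scan stands in for the tailq, and a plain counter
stands in for the real mbuf refcnt API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 4

struct toy_mbuf {
	uint16_t refcnt;
};

/* Associates a pinned mbuf with the descriptor it borrows data from. */
struct toy_zmbuf {
	struct toy_mbuf *mbuf;
	uint32_t desc_idx;
	int in_use;
};

struct toy_vq {
	struct toy_zmbuf zmbufs[TOY_RING_SIZE];
	uint32_t used_ring[TOY_RING_SIZE];
	uint16_t used_idx;
};

/* Application-side free: only drops one reference in this model. */
static void
toy_mbuf_free(struct toy_mbuf *m)
{
	if (m->refcnt > 0)
		m->refcnt--;
}

/* Dequeue side: pin the mbuf and remember its descriptor index. */
static int
toy_track(struct toy_vq *vq, struct toy_mbuf *m, uint32_t desc_idx)
{
	for (size_t i = 0; i < TOY_RING_SIZE; i++) {
		if (!vq->zmbufs[i].in_use) {
			m->refcnt++;
			vq->zmbufs[i] = (struct toy_zmbuf){ m, desc_idx, 1 };
			return 0;
		}
	}
	return -1; /* no free tracking slot */
}

/*
 * Run at the start of the next dequeue: a refcnt of 1 means only our pin
 * is left, so the descriptor can be returned through the used ring.
 */
static unsigned int
toy_reclaim(struct toy_vq *vq)
{
	unsigned int n = 0;

	for (size_t i = 0; i < TOY_RING_SIZE; i++) {
		struct toy_zmbuf *z = &vq->zmbufs[i];

		if (z->in_use && z->mbuf->refcnt == 1) {
			vq->used_ring[vq->used_idx % TOY_RING_SIZE] = z->desc_idx;
			vq->used_idx++;
			toy_mbuf_free(z->mbuf); /* drop our own pin */
			z->in_use = 0;
			n++;
		}
	}
	return n;
}
```

Note that descriptors are returned in the order the mbufs are released,
not in the order they were dequeued, which is why the desc idx has to be
stored next to each mbuf.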

Dequeue zero copy is introduced for performance reasons, and some rough
tests show about a 50% performance boost for packet size 1500B. For small
packets (e.g. 64B), it actually slows things down a bit (by up to 15%).
That is expected because this patch introduces some extra work, which
outweighs the benefit of saving a copy of a few bytes.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---

v2: - use unlikely/likely for dequeue_zero_copy check, as it's not enabled
      by default and has some limitations in the vm2nic case.

    - handle the case where a desc buf might span 2 host phys pages

    - reset nr_zmbuf to 0 at set_vring_num

    - set zmbuf_size to vq->size, not double that.
---
 lib/librte_vhost/vhost.c      |   2 +
 lib/librte_vhost/vhost.h      |  22 +++++-
 lib/librte_vhost/vhost_user.c |  42 +++++++++-
 lib/librte_vhost/virtio_net.c | 173 +++++++++++++++++++++++++++++++++++++-----
 4 files changed, 219 insertions(+), 20 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 46095c3..ab25649 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -141,6 +141,8 @@ init_vring_queue(struct vhost_virtqueue *vq, int qp_idx)
 	/* always set the default vq pair to enabled */
 	if (qp_idx == 0)
 		vq->enabled = 1;
+
+	TAILQ_INIT(&vq->zmbuf_list);
 }
 
 static void
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 8565fa1..be8a398 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -36,6 +36,7 @@
 #include <stdint.h>
 #include <stdio.h>
 #include <sys/types.h>
+#include <sys/queue.h>
 #include <unistd.h>
 #include <linux/vhost.h>
 
@@ -61,6 +62,19 @@ struct buf_vector {
 	uint32_t desc_idx;
 };
 
+/*
+ * A structure to hold some fields needed in zero copy code path,
+ * mainly for associating an mbuf with the right desc_idx.
+ */
+struct zcopy_mbuf {
+	struct rte_mbuf *mbuf;
+	uint32_t desc_idx;
+	uint16_t in_use;
+
+	TAILQ_ENTRY(zcopy_mbuf) next;
+};
+TAILQ_HEAD(zcopy_mbuf_list, zcopy_mbuf);
+
 /**
  * Structure contains variables relevant to RX/TX virtqueues.
  */
@@ -85,6 +99,12 @@ struct vhost_virtqueue {
 
 	/* Physical address of used ring, for logging */
 	uint64_t		log_guest_addr;
+
+	uint16_t		nr_zmbuf;
+	uint16_t		zmbuf_size;
+	uint16_t		last_zmbuf_idx;
+	struct zcopy_mbuf	*zmbufs;
+	struct zcopy_mbuf_list	zmbuf_list;
 } __rte_cache_aligned;
 
 /* Old kernels have no such macro defined */
@@ -135,6 +155,7 @@ struct virtio_net {
 	/* to tell if we need broadcast rarp packet */
 	rte_atomic16_t		broadcast_rarp;
 	uint32_t		virt_qp_nb;
+	int			dequeue_zero_copy;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];
@@ -146,7 +167,6 @@ struct virtio_net {
 	uint32_t		nr_guest_pages;
 	uint32_t		max_guest_pages;
 	struct guest_page       *guest_pages;
-
 } __rte_cache_aligned;
 
 /**
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index a92377a..ac40408 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -180,7 +180,23 @@ static int
 vhost_user_set_vring_num(struct virtio_net *dev,
 			 struct vhost_vring_state *state)
 {
-	dev->virtqueue[state->index]->size = state->num;
+	struct vhost_virtqueue *vq = dev->virtqueue[state->index];
+
+	vq->size = state->num;
+
+	if (dev->dequeue_zero_copy) {
+		vq->nr_zmbuf = 0;
+		vq->last_zmbuf_idx = 0;
+		vq->zmbuf_size = vq->size;
+		vq->zmbufs = rte_zmalloc(NULL, vq->zmbuf_size *
+					 sizeof(struct zcopy_mbuf), 0);
+		if (vq->zmbufs == NULL) {
+			RTE_LOG(WARNING, VHOST_CONFIG,
+				"failed to allocate mem for zero copy; "
+				"zero copy is force disabled\n");
+			dev->dequeue_zero_copy = 0;
+		}
+	}
 
 	return 0;
 }
@@ -662,11 +678,32 @@ vhost_user_set_vring_kick(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 	vq->kickfd = file.fd;
 
 	if (virtio_is_ready(dev) && !(dev->flags & VIRTIO_DEV_RUNNING)) {
+		if (dev->dequeue_zero_copy) {
+			RTE_LOG(INFO, VHOST_CONFIG,
+				"dequeue zero copy is enabled\n");
+		}
+
 		if (notify_ops->new_device(dev->vid) == 0)
 			dev->flags |= VIRTIO_DEV_RUNNING;
 	}
 }
 
+static void
+free_zmbufs(struct vhost_virtqueue *vq)
+{
+	struct zcopy_mbuf *zmbuf, *next;
+
+	for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
+	     zmbuf != NULL; zmbuf = next) {
+		next = TAILQ_NEXT(zmbuf, next);
+
+		rte_pktmbuf_free(zmbuf->mbuf);
+		TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
+	}
+
+	rte_free(vq->zmbufs);
+}
+
 /*
  * when virtio is stopped, qemu will send us the GET_VRING_BASE message.
  */
@@ -695,6 +732,9 @@ vhost_user_get_vring_base(struct virtio_net *dev,
 
 	dev->virtqueue[state->index]->kickfd = VIRTIO_UNINITIALIZED_EVENTFD;
 
+	if (dev->dequeue_zero_copy)
+		free_zmbufs(dev->virtqueue[state->index]);
+
 	return 0;
 }
 
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 1c2ee47..215542c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -678,6 +678,43 @@ make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
 	return 0;
 }
 
+static inline struct zcopy_mbuf *__attribute__((always_inline))
+get_zmbuf(struct vhost_virtqueue *vq)
+{
+	uint16_t i;
+	uint16_t last;
+	int tries = 0;
+
+	/* search [last_zmbuf_idx, zmbuf_size) */
+	i = vq->last_zmbuf_idx;
+	last = vq->zmbuf_size;
+
+again:
+	for (; i < last; i++) {
+		if (vq->zmbufs[i].in_use == 0) {
+			vq->last_zmbuf_idx = i + 1;
+			vq->zmbufs[i].in_use = 1;
+			return &vq->zmbufs[i];
+		}
+	}
+
+	tries++;
+	if (tries == 1) {
+		/* search [0, last_zmbuf_idx) */
+		i = 0;
+		last = vq->last_zmbuf_idx;
+		goto again;
+	}
+
+	return NULL;
+}
+
+static inline void __attribute__((always_inline))
+put_zmbuf(struct zcopy_mbuf *zmbuf)
+{
+	zmbuf->in_use = 0;
+}
+
 static inline int __attribute__((always_inline))
 copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		  struct rte_mbuf *m, uint16_t desc_idx,
@@ -701,6 +738,27 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	if (unlikely(!desc_addr))
 		return -1;
 
+	if (unlikely(dev->dequeue_zero_copy)) {
+		struct zcopy_mbuf *zmbuf;
+
+		zmbuf = get_zmbuf(vq);
+		if (!zmbuf)
+			return -1;
+		zmbuf->mbuf = m;
+		zmbuf->desc_idx = desc_idx;
+
+		/*
+		 * Pin lock the mbuf; we will check later to see whether
+		 * the mbuf is freed (when we are the last user) or not.
+		 * If that's the case, we then could update the used ring
+		 * safely.
+		 */
+		rte_mbuf_refcnt_update(m, 1);
+
+		vq->nr_zmbuf += 1;
+		TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
+	}
+
 	hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr);
 	rte_prefetch0(hdr);
 
@@ -732,10 +790,33 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	mbuf_offset = 0;
 	mbuf_avail  = m->buf_len - RTE_PKTMBUF_HEADROOM;
 	while (1) {
+		uint64_t hpa;
+
 		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
-		rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
-			(void *)((uintptr_t)(desc_addr + desc_offset)),
-			cpy_len);
+
+		/*
+		 * A desc buf might span two host physical pages that are
+		 * not contiguous. In such a case (gpa_to_hpa returns 0), data
+		 * will be copied even though zero copy is enabled.
+		 */
+		if (unlikely(dev->dequeue_zero_copy && (hpa = gpa_to_hpa(dev,
+					desc->addr + desc_offset, cpy_len)))) {
+			cur->data_len = cpy_len;
+			cur->data_off = 0;
+			cur->buf_addr = (void *)(uintptr_t)desc_addr;
+			cur->buf_physaddr = hpa;
+
+			/*
+			 * In zero copy mode, one mbuf can only reference data
+			 * for all or part of one desc buf.
+			 */
+			mbuf_avail = cpy_len;
+		} else {
+			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
+							   mbuf_offset),
+				(void *)((uintptr_t)(desc_addr + desc_offset)),
+				cpy_len);
+		}
 
 		mbuf_avail  -= cpy_len;
 		mbuf_offset += cpy_len;
@@ -796,6 +877,49 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
+static inline void __attribute__((always_inline))
+update_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		 uint32_t used_idx, uint32_t desc_idx)
+{
+	vq->used->ring[used_idx].id  = desc_idx;
+	vq->used->ring[used_idx].len = 0;
+	vhost_log_used_vring(dev, vq,
+			offsetof(struct vring_used, ring[used_idx]),
+			sizeof(vq->used->ring[used_idx]));
+}
+
+static inline void __attribute__((always_inline))
+update_used_idx(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		uint32_t count)
+{
+	if (count == 0)
+		return;
+
+	rte_smp_wmb();
+	rte_smp_rmb();
+
+	vq->used->idx += count;
+	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+			sizeof(vq->used->idx));
+
+	/* Kick guest if required. */
+	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
+			&& (vq->callfd >= 0))
+		eventfd_write(vq->callfd, (eventfd_t)1);
+}
+
+static inline bool __attribute__((always_inline))
+mbuf_is_consumed(struct rte_mbuf *m)
+{
+	while (m) {
+		if (rte_mbuf_refcnt_read(m) > 1)
+			return false;
+		m = m->next;
+	}
+
+	return true;
+}
+
 uint16_t
 rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
@@ -823,6 +947,30 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	if (unlikely(vq->enabled == 0))
 		return 0;
 
+	if (unlikely(dev->dequeue_zero_copy)) {
+		struct zcopy_mbuf *zmbuf, *next;
+		int nr_updated = 0;
+
+		for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
+		     zmbuf != NULL; zmbuf = next) {
+			next = TAILQ_NEXT(zmbuf, next);
+
+			if (mbuf_is_consumed(zmbuf->mbuf)) {
+				used_idx = vq->last_used_idx++ & (vq->size - 1);
+				update_used_ring(dev, vq, used_idx,
+						 zmbuf->desc_idx);
+				nr_updated += 1;
+
+				TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
+				rte_pktmbuf_free(zmbuf->mbuf);
+				put_zmbuf(zmbuf);
+				vq->nr_zmbuf -= 1;
+			}
+		}
+
+		update_used_idx(dev, vq, nr_updated);
+	}
+
 	/*
 	 * Construct a RARP broadcast packet, and inject it to the "pkts"
 	 * array, to looks like that guest actually send such packet.
@@ -870,11 +1018,8 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 		used_idx  = (vq->last_used_idx  + i) & (vq->size - 1);
 		desc_indexes[i] = vq->avail->ring[avail_idx];
 
-		vq->used->ring[used_idx].id  = desc_indexes[i];
-		vq->used->ring[used_idx].len = 0;
-		vhost_log_used_vring(dev, vq,
-				offsetof(struct vring_used, ring[used_idx]),
-				sizeof(vq->used->ring[used_idx]));
+		if (likely(dev->dequeue_zero_copy == 0))
+			update_used_ring(dev, vq, used_idx, desc_indexes[i]);
 	}
 
 	/* Prefetch descriptor index. */
@@ -898,19 +1043,11 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 			break;
 		}
 	}
-
-	rte_smp_wmb();
-	rte_smp_rmb();
-	vq->used->idx += i;
 	vq->last_avail_idx += i;
 	vq->last_used_idx  += i;
-	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
-			sizeof(vq->used->idx));
 
-	/* Kick guest if required. */
-	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-			&& (vq->callfd >= 0))
-		eventfd_write(vq->callfd, (eventfd_t)1);
+	if (likely(dev->dequeue_zero_copy == 0))
+		update_used_idx(dev, vq, i);
 
 out:
 	if (unlikely(rarp_mbuf != NULL)) {
-- 
1.9.0

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [dpdk-dev] [PATCH v2 5/7] vhost: add a flag to enable dequeue zero copy
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
                     ` (3 preceding siblings ...)
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy Yuanhan Liu
@ 2016-09-23  4:13   ` Yuanhan Liu
  2016-09-26 20:57     ` Maxime Coquelin
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 6/7] examples/vhost: add an option " Yuanhan Liu
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-23  4:13 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

Dequeue zero copy is disabled by default. Add a new flag,
``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``, to explicitly enable it.
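
As a rough illustration of how the flag bits are meant to be decoded,
here is a standalone sketch. The three macro values are the ones from
rte_virtio_net.h in this patch; struct toy_socket and toy_parse_flags()
are hypothetical stand-ins for the real socket bookkeeping and are not
part of the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in rte_virtio_net.h by this series. */
#define RTE_VHOST_USER_CLIENT			(1ULL << 0)
#define RTE_VHOST_USER_NO_RECONNECT		(1ULL << 1)
#define RTE_VHOST_USER_DEQUEUE_ZERO_COPY	(1ULL << 2)

/* Hypothetical per-socket settings derived from the flags. */
struct toy_socket {
	bool is_client;
	bool reconnect;
	bool dequeue_zero_copy;
};

static struct toy_socket
toy_parse_flags(uint64_t flags)
{
	struct toy_socket s = { false, false, false };

	/* Zero copy is opt-in: off unless the bit is set. */
	s.dequeue_zero_copy = (flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY) != 0;

	/* Reconnect only applies in client mode and is on unless disabled. */
	if (flags & RTE_VHOST_USER_CLIENT) {
		s.is_client = true;
		s.reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
	}

	return s;
}
```

A caller would OR the bits together, e.g.
RTE_VHOST_USER_CLIENT | RTE_VHOST_USER_DEQUEUE_ZERO_COPY, and pass the
result as the flags argument of rte_vhost_driver_register().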

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---

v2: - update release log
    - doc dequeue zero copy in detail
---
 doc/guides/prog_guide/vhost_lib.rst    | 35 +++++++++++++++++++++++++++++++++-
 doc/guides/rel_notes/release_16_11.rst | 11 +++++++++++
 lib/librte_vhost/rte_virtio_net.h      |  1 +
 lib/librte_vhost/socket.c              |  5 +++++
 lib/librte_vhost/vhost.c               | 10 ++++++++++
 lib/librte_vhost/vhost.h               |  1 +
 6 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index 6b0c6b2..3fa9dd7 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -79,7 +79,7 @@ The following is an overview of the Vhost API functions:
   ``/dev/path`` character device file will be created. For vhost-user server
   mode, a Unix domain socket file ``path`` will be created.
 
-  Currently two flags are supported (these are valid for vhost-user only):
+  Currently supported flags are (these are valid for vhost-user only):
 
   - ``RTE_VHOST_USER_CLIENT``
 
@@ -97,6 +97,39 @@ The following is an overview of the Vhost API functions:
     This reconnect option is enabled by default. However, it can be turned off
     by setting this flag.
 
+  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``
+
+    Dequeue zero copy will be enabled when this flag is set. It is disabled by
+    default.
+
+    There are some truths (including limitations) you might want to know while
+    setting this flag:
+
+    * zero copy is not good for small packets (typically for packet size below
+      512).
+
+    * zero copy is really good for the VM2VM case. For iperf between two VMs,
+      the boost could be above 70% (when TSO is enabled).
+
+    * for VM2NIC case, the ``nb_tx_desc`` has to be small enough: <= 64 if virtio
+      indirect feature is not enabled and <= 128 if it is enabled.
+
+      This is because when dequeue zero copy is enabled, the guest Tx used
+      vring will be updated only when the corresponding mbuf is freed. Thus,
+      nb_tx_desc has to be small enough so that the PMD driver will run out
+      of available Tx descriptors and free mbufs in time. Otherwise, the guest
+      Tx vring would be starved.
+
+    * Guest memory should be backed by huge pages to achieve better
+      performance. Using 1G page size is the best.
+
+      When dequeue zero copy is enabled, the guest phys address and host phys
+      address mapping has to be established. Using non-huge pages means far
+      more page segments. To make it simple, DPDK vhost does a linear search
+      of those segments, thus the fewer the segments, the quicker we will get
+      the mapping. NOTE: we may speed it up by using radix tree searching in
+      the future.
+
 * ``rte_vhost_driver_session_start()``
 
   This function starts the vhost session loop to handle vhost messages. It
diff --git a/doc/guides/rel_notes/release_16_11.rst b/doc/guides/rel_notes/release_16_11.rst
index 66916af..0c5756e 100644
--- a/doc/guides/rel_notes/release_16_11.rst
+++ b/doc/guides/rel_notes/release_16_11.rst
@@ -36,6 +36,17 @@ New Features
 
      This section is a comment. Make sure to start the actual text at the margin.
 
+  * **Added vhost-user dequeue zero copy support**
+
+    The copy in the dequeue path is saved, which is meant to improve performance.
+    In the VM2VM case, the boost is quite impressive. The bigger the packet size,
+    the bigger performance boost you may get. However, for the VM2NIC case, there
+    are some limitations, and the boost is not as impressive as in the VM2VM case.
+    Performance may even drop quite a bit for small packets.
+
+    For this reason, this feature is disabled by default. It can be enabled when
+    the ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY`` flag is given. Check the vhost section
+    of the programming guide for more information.
 
 Resolved Issues
 ---------------
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index a88aecd..c53ff64 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -53,6 +53,7 @@
 
 #define RTE_VHOST_USER_CLIENT		(1ULL << 0)
 #define RTE_VHOST_USER_NO_RECONNECT	(1ULL << 1)
+#define RTE_VHOST_USER_DEQUEUE_ZERO_COPY	(1ULL << 2)
 
 /* Enum for virtqueue management. */
 enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index bf03f84..967cb65 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -62,6 +62,7 @@ struct vhost_user_socket {
 	int connfd;
 	bool is_server;
 	bool reconnect;
+	bool dequeue_zero_copy;
 };
 
 struct vhost_user_connection {
@@ -203,6 +204,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	size = strnlen(vsocket->path, PATH_MAX);
 	vhost_set_ifname(vid, vsocket->path, size);
 
+	if (vsocket->dequeue_zero_copy)
+		vhost_enable_dequeue_zero_copy(vid);
+
 	RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
 
 	vsocket->connfd = fd;
@@ -499,6 +503,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 	memset(vsocket, 0, sizeof(struct vhost_user_socket));
 	vsocket->path = strdup(path);
 	vsocket->connfd = -1;
+	vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
 
 	if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
 		vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index ab25649..f5f8f92 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -290,6 +290,16 @@ vhost_set_ifname(int vid, const char *if_name, unsigned int if_len)
 	dev->ifname[sizeof(dev->ifname) - 1] = '\0';
 }
 
+void
+vhost_enable_dequeue_zero_copy(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->dequeue_zero_copy = 1;
+}
 
 int
 rte_vhost_get_numa_node(int vid)
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index be8a398..53dbf33 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -278,6 +278,7 @@ void vhost_destroy_device(int);
 int alloc_vring_queue_pair(struct virtio_net *dev, uint32_t qp_idx);
 
 void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
+void vhost_enable_dequeue_zero_copy(int vid);
 
 /*
  * Backend-specific cleanup. Defined by vhost-cuse and vhost-user.
-- 
1.9.0

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [dpdk-dev] [PATCH v2 6/7] examples/vhost: add an option to enable dequeue zero copy
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
                     ` (4 preceding siblings ...)
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 5/7] vhost: add a flag to enable " Yuanhan Liu
@ 2016-09-23  4:13   ` Yuanhan Liu
  2016-09-26 21:05     ` Maxime Coquelin
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 7/7] net/vhost: " Yuanhan Liu
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
  7 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-23  4:13 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

Add an option, --dequeue-zero-copy, to enable dequeue zero copy.

One thing worth noting while using dequeue zero copy is that nb_tx_desc
has to be small enough so that the eth driver will hit the mbuf free
threshold easily and thus free mbufs more frequently.

The reason behind that is, when dequeue zero copy is enabled, the guest Tx
used vring will be updated only when the corresponding mbuf is freed. If
mbufs are not freed frequently, the guest Tx vring could be starved.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 examples/vhost/main.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 195b1db..a79ca5a 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -127,6 +127,7 @@ static uint32_t enable_tx_csum;
 static uint32_t enable_tso;
 
 static int client_mode;
+static int dequeue_zero_copy;
 
 /* Specify timeout (in useconds) between retries on RX. */
 static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US;
@@ -294,6 +295,17 @@ port_init(uint8_t port)
 
 	rx_ring_size = RTE_TEST_RX_DESC_DEFAULT;
 	tx_ring_size = RTE_TEST_TX_DESC_DEFAULT;
+
+	/*
+	 * When dequeue zero copy is enabled, guest Tx used vring will be
+	 * updated only when corresponding mbuf is freed. Thus, the nb_tx_desc
+	 * (tx_ring_size here) must be small enough so that the driver will
+	 * hit the free threshold easily and free mbufs timely. Otherwise,
+	 * guest Tx vring would be starved.
+	 */
+	if (dequeue_zero_copy)
+		tx_ring_size = 64;
+
 	tx_rings = (uint16_t)rte_lcore_count();
 
 	retval = validate_num_devices(MAX_DEVICES);
@@ -470,7 +482,8 @@ us_vhost_usage(const char *prgname)
 	"		--socket-file: The path of the socket file.\n"
 	"		--tx-csum [0|1] disable/enable TX checksum offload.\n"
 	"		--tso [0|1] disable/enable TCP segment offload.\n"
-	"		--client register a vhost-user socket as client mode.\n",
+	"		--client register a vhost-user socket as client mode.\n"
+	"		--dequeue-zero-copy enables dequeue zero copy\n",
 	       prgname);
 }
 
@@ -495,6 +508,7 @@ us_vhost_parse_args(int argc, char **argv)
 		{"tx-csum", required_argument, NULL, 0},
 		{"tso", required_argument, NULL, 0},
 		{"client", no_argument, &client_mode, 1},
+		{"dequeue-zero-copy", no_argument, &dequeue_zero_copy, 1},
 		{NULL, 0, 0, 0},
 	};
 
@@ -1501,6 +1515,9 @@ main(int argc, char *argv[])
 	if (client_mode)
 		flags |= RTE_VHOST_USER_CLIENT;
 
+	if (dequeue_zero_copy)
+		flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+
 	/* Register vhost user driver to handle vhost messages. */
 	for (i = 0; i < nb_sockets; i++) {
 		ret = rte_vhost_driver_register
-- 
1.9.0

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [dpdk-dev] [PATCH v2 7/7] net/vhost: add an option to enable dequeue zero copy
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
                     ` (5 preceding siblings ...)
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 6/7] examples/vhost: add an option " Yuanhan Liu
@ 2016-09-23  4:13   ` Yuanhan Liu
  2016-09-26 21:05     ` Maxime Coquelin
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
  7 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-09-23  4:13 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

Add an option, dequeue-zero-copy, to enable this feature in vhost-pmd.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 drivers/net/vhost/rte_eth_vhost.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index 7539cd4..61b3dfc 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -51,6 +51,7 @@
 #define ETH_VHOST_IFACE_ARG		"iface"
 #define ETH_VHOST_QUEUES_ARG		"queues"
 #define ETH_VHOST_CLIENT_ARG		"client"
+#define ETH_VHOST_DEQUEUE_ZERO_COPY	"dequeue-zero-copy"
 
 static const char *drivername = "VHOST PMD";
 
@@ -58,6 +59,7 @@ static const char *valid_arguments[] = {
 	ETH_VHOST_IFACE_ARG,
 	ETH_VHOST_QUEUES_ARG,
 	ETH_VHOST_CLIENT_ARG,
+	ETH_VHOST_DEQUEUE_ZERO_COPY,
 	NULL
 };
 
@@ -831,6 +833,7 @@ rte_pmd_vhost_devinit(const char *name, const char *params)
 	uint16_t queues;
 	uint64_t flags = 0;
 	int client_mode = 0;
+	int dequeue_zero_copy = 0;
 
 	RTE_LOG(INFO, PMD, "Initializing pmd_vhost for %s\n", name);
 
@@ -867,6 +870,16 @@ rte_pmd_vhost_devinit(const char *name, const char *params)
 			flags |= RTE_VHOST_USER_CLIENT;
 	}
 
+	if (rte_kvargs_count(kvlist, ETH_VHOST_DEQUEUE_ZERO_COPY) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_VHOST_DEQUEUE_ZERO_COPY,
+					 &open_int, &dequeue_zero_copy);
+		if (ret < 0)
+			goto out_free;
+
+		if (dequeue_zero_copy)
+			flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+	}
+
 	eth_dev_vhost_create(name, iface_name, queues, rte_socket_id(), flags);
 
 out_free:
-- 
1.9.0

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/7] vhost: get guest/host physical address mappings
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 2/7] vhost: get guest/host physical address mappings Yuanhan Liu
@ 2016-09-26 20:17     ` Maxime Coquelin
  0 siblings, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-09-26 20:17 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 09/23/2016 06:13 AM, Yuanhan Liu wrote:
> So that we can convert a guest physical address to host physical
> address, which will be used in later Tx zero copy implementation.
>
> MAP_POPULATE is set while mmaping guest memory regions, to make
> sure the page tables are setup and then rte_mem_virt2phy() could
> yield proper physical address.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>
> v2: - use MAP_POPULATE option to make sure the page table will
>       be already setup while getting the phys address
>
>     - do a simple merge if the last 2 pages are continuous
>
>     - dump guest pages only in debug mode
> ---
>  lib/librte_vhost/vhost.h      |  30 +++++++++++++
>  lib/librte_vhost/vhost_user.c | 100 +++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 128 insertions(+), 2 deletions(-)
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy Yuanhan Liu
@ 2016-09-26 20:45     ` Maxime Coquelin
  2016-10-06 14:37     ` Xu, Qian Q
  1 sibling, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-09-26 20:45 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 09/23/2016 06:13 AM, Yuanhan Liu wrote:
> The basic idea of dequeue zero copy is, instead of copying data from
> the desc buf, here we let the mbuf reference the desc buf addr directly.
>
> Doing so, however, has one major issue: we can't update the used ring
> at the end of rte_vhost_dequeue_burst. Because we don't do the copy
> here, an update of the used ring would let the driver to reclaim the
> desc buf. As a result, DPDK might reference a stale memory region.
>
> To update the used ring properly, this patch does several tricks:
>
> - when mbuf references a desc buf, refcnt is added by 1.
>
>   This is to pin lock the mbuf, so that a mbuf free from the DPDK
>   won't actually free it, instead, refcnt is subtracted by 1.
>
> - We chain all those mbuf together (by tailq)
>
>   And we check it every time on the rte_vhost_dequeue_burst entrance,
>   to see if the mbuf is freed (when refcnt equals to 1). If that
>   happens, it means we are the last user of this mbuf and we are
>   safe to update the used ring.
>
> - "struct zcopy_mbuf" is introduced, to associate an mbuf with the
>   right desc idx.
>
> Dequeue zero copy is introduced for performance reason, and some rough
> tests show about 50% performance boost for packet size 1500B. For small
> packets, (e.g. 64B), it actually slows a bit down (well, it could up to
> 15%). That is expected because this patch introduces some extra works,
> and it outweighs the benefit from saving few bytes copy.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>
> v2: - use unlikely/likely for the dequeue_zero_copy check, as it's not enabled
>       by default and has some limitations in the vm2nic case.
>
>     - handle the case where a desc buf might cross 2 host phys pages
>
>     - reset nr_zmbuf to 0 at set_vring_num
>
>     - set zmbuf_size to vq->size rather than double that.
> ---
>  lib/librte_vhost/vhost.c      |   2 +
>  lib/librte_vhost/vhost.h      |  22 +++++-
>  lib/librte_vhost/vhost_user.c |  42 +++++++++-
>  lib/librte_vhost/virtio_net.c | 173 +++++++++++++++++++++++++++++++++++++-----
>  4 files changed, 219 insertions(+), 20 deletions(-)

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


* Re: [dpdk-dev] [PATCH v2 5/7] vhost: add a flag to enable dequeue zero copy
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 5/7] vhost: add a flag to enable " Yuanhan Liu
@ 2016-09-26 20:57     ` Maxime Coquelin
  0 siblings, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-09-26 20:57 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 09/23/2016 06:13 AM, Yuanhan Liu wrote:
> Dequeue zero copy is disabled by default. Add a new flag,
> ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``, to explicitly enable it.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>
> v2: - update release log
>     - doc dequeue zero copy in detail
> ---
>  doc/guides/prog_guide/vhost_lib.rst    | 35 +++++++++++++++++++++++++++++++++-
>  doc/guides/rel_notes/release_16_11.rst | 11 +++++++++++
>  lib/librte_vhost/rte_virtio_net.h      |  1 +
>  lib/librte_vhost/socket.c              |  5 +++++
>  lib/librte_vhost/vhost.c               | 10 ++++++++++
>  lib/librte_vhost/vhost.h               |  1 +
>  6 files changed, 62 insertions(+), 1 deletion(-)

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Thanks,
Maxime


* Re: [dpdk-dev] [PATCH v2 6/7] examples/vhost: add an option to enable dequeue zero copy
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 6/7] examples/vhost: add an option " Yuanhan Liu
@ 2016-09-26 21:05     ` Maxime Coquelin
  0 siblings, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-09-26 21:05 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 09/23/2016 06:13 AM, Yuanhan Liu wrote:
> Add an option, --dequeue-zero-copy, to enable dequeue zero copy.
>
> One thing worth noting while using dequeue zero copy is the nb_tx_desc
> has to be small enough so that the eth driver will hit the mbuf free
> threshold easily and thus free mbuf more frequently.
>
> The reason behind that is, when dequeue zero copy is enabled, guest Tx
> used vring will be updated only when corresponding mbuf is freed. If mbuf
> is not freed frequently, the guest Tx vring could be starved.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  examples/vhost/main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Thanks,
Maxime


* Re: [dpdk-dev] [PATCH v2 7/7] net/vhost: add an option to enable dequeue zero copy
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 7/7] net/vhost: " Yuanhan Liu
@ 2016-09-26 21:05     ` Maxime Coquelin
  0 siblings, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-09-26 21:05 UTC (permalink / raw)
  To: Yuanhan Liu, dev



On 09/23/2016 06:13 AM, Yuanhan Liu wrote:
> Add an option, dequeue-zero-copy, to enable this feature in vhost-pmd.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  drivers/net/vhost/rte_eth_vhost.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


* Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy Yuanhan Liu
  2016-09-26 20:45     ` Maxime Coquelin
@ 2016-10-06 14:37     ` Xu, Qian Q
  2016-10-09  2:03       ` Yuanhan Liu
  1 sibling, 1 reply; 75+ messages in thread
From: Xu, Qian Q @ 2016-10-06 14:37 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Maxime Coquelin

See the bottom. 

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
Sent: Friday, September 23, 2016 5:13 AM
To: dev@dpdk.org
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Yuanhan Liu <yuanhan.liu@linux.intel.com>
Subject: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy

The basic idea of dequeue zero copy is, instead of copying data from the desc buf, here we let the mbuf reference the desc buf addr directly.

Doing so, however, has one major issue: we can't update the used ring at the end of rte_vhost_dequeue_burst. Because we don't do the copy here, an update of the used ring would let the driver reclaim the desc buf. As a result, DPDK might reference a stale memory region.

To update the used ring properly, this patch does several tricks:

- When an mbuf references a desc buf, its refcnt is incremented by 1.

  This pins the mbuf, so that an mbuf free from the DPDK
  won't actually free it; instead, refcnt is decremented by 1.

- We chain all those mbufs together (by tailq).

  We check the chain every time on entry to rte_vhost_dequeue_burst,
  to see if the mbuf is freed (when refcnt equals 1). If that
  happens, it means we are the last user of this mbuf and it is
  safe to update the used ring.

- "struct zcopy_mbuf" is introduced, to associate an mbuf with the
  right desc idx.

Dequeue zero copy is introduced for performance reasons, and some rough tests show about a 50% performance boost for packet size 1500B. For small packets (e.g. 64B), it actually slows things down a bit (by up to 15%). That is expected because this patch introduces some extra work, which outweighs the benefit of saving a copy of only a few bytes.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---

v2: - use unlikely/likely for the dequeue_zero_copy check, as it's not enabled
      by default and has some limitations in the vm2nic case.

    - handle the case where a desc buf might cross 2 host phys pages

    - reset nr_zmbuf to 0 at set_vring_num

    - set zmbuf_size to vq->size rather than double that.
---
 lib/librte_vhost/vhost.c      |   2 +
 lib/librte_vhost/vhost.h      |  22 +++++-
 lib/librte_vhost/vhost_user.c |  42 +++++++++-
 lib/librte_vhost/virtio_net.c | 173 +++++++++++++++++++++++++++++++++++++-----
 4 files changed, 219 insertions(+), 20 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 46095c3..ab25649 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -141,6 +141,8 @@ init_vring_queue(struct vhost_virtqueue *vq, int qp_idx)
 	/* always set the default vq pair to enabled */
 	if (qp_idx == 0)
 		vq->enabled = 1;
+
+	TAILQ_INIT(&vq->zmbuf_list);
 }
 
 static void
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 8565fa1..be8a398 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -36,6 +36,7 @@
 #include <stdint.h>
 #include <stdio.h>
 #include <sys/types.h>
+#include <sys/queue.h>
 #include <unistd.h>
 #include <linux/vhost.h>
 
@@ -61,6 +62,19 @@ struct buf_vector {
 	uint32_t desc_idx;
 };
 
+/*
+ * A structure to hold some fields needed in zero copy code path,
+ * mainly for associating an mbuf with the right desc_idx.
+ */
+struct zcopy_mbuf {
+	struct rte_mbuf *mbuf;
+	uint32_t desc_idx;
+	uint16_t in_use;
+
+	TAILQ_ENTRY(zcopy_mbuf) next;
+};
+TAILQ_HEAD(zcopy_mbuf_list, zcopy_mbuf);
+
 /**
  * Structure contains variables relevant to RX/TX virtqueues.
  */
@@ -85,6 +99,12 @@ struct vhost_virtqueue {
 
 	/* Physical address of used ring, for logging */
 	uint64_t		log_guest_addr;
+
+	uint16_t		nr_zmbuf;
+	uint16_t		zmbuf_size;
+	uint16_t		last_zmbuf_idx;
+	struct zcopy_mbuf	*zmbufs;
+	struct zcopy_mbuf_list	zmbuf_list;
 } __rte_cache_aligned;
 
 /* Old kernels have no such macro defined */
@@ -135,6 +155,7 @@ struct virtio_net {
 	/* to tell if we need broadcast rarp packet */
 	rte_atomic16_t		broadcast_rarp;
 	uint32_t		virt_qp_nb;
+	int			dequeue_zero_copy;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];
@@ -146,7 +167,6 @@ struct virtio_net {
 	uint32_t		nr_guest_pages;
 	uint32_t		max_guest_pages;
 	struct guest_page       *guest_pages;
-
 } __rte_cache_aligned;
 
 /**
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index a92377a..ac40408 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -180,7 +180,23 @@ static int
 vhost_user_set_vring_num(struct virtio_net *dev,
 			 struct vhost_vring_state *state)
 {
-	dev->virtqueue[state->index]->size = state->num;
+	struct vhost_virtqueue *vq = dev->virtqueue[state->index];
+
+	vq->size = state->num;
+
+	if (dev->dequeue_zero_copy) {
+		vq->nr_zmbuf = 0;
+		vq->last_zmbuf_idx = 0;
+		vq->zmbuf_size = vq->size;
+		vq->zmbufs = rte_zmalloc(NULL, vq->zmbuf_size *
+					 sizeof(struct zcopy_mbuf), 0);
+		if (vq->zmbufs == NULL) {
+			RTE_LOG(WARNING, VHOST_CONFIG,
+				"failed to allocate mem for zero copy; "
+				"zero copy is force disabled\n");
+			dev->dequeue_zero_copy = 0;
+		}
+	}
 
 	return 0;
 }
@@ -662,11 +678,32 @@ vhost_user_set_vring_kick(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 	vq->kickfd = file.fd;
 
 	if (virtio_is_ready(dev) && !(dev->flags & VIRTIO_DEV_RUNNING)) {
+		if (dev->dequeue_zero_copy) {
+			RTE_LOG(INFO, VHOST_CONFIG,
+				"Tx zero copy is enabled\n");
+		}
+
 		if (notify_ops->new_device(dev->vid) == 0)
 			dev->flags |= VIRTIO_DEV_RUNNING;
 	}
 }
 
+static void
+free_zmbufs(struct vhost_virtqueue *vq)
+{
+	struct zcopy_mbuf *zmbuf, *next;
+
+	for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
+	     zmbuf != NULL; zmbuf = next) {
+		next = TAILQ_NEXT(zmbuf, next);
+
+		rte_pktmbuf_free(zmbuf->mbuf);
+		TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
+	}
+
+	rte_free(vq->zmbufs);
+}
+
 /*
  * when virtio is stopped, qemu will send us the GET_VRING_BASE message.
  */
@@ -695,6 +732,9 @@ vhost_user_get_vring_base(struct virtio_net *dev,
 
 	dev->virtqueue[state->index]->kickfd = VIRTIO_UNINITIALIZED_EVENTFD;
 
+	if (dev->dequeue_zero_copy)
+		free_zmbufs(dev->virtqueue[state->index]);
+
 	return 0;
 }
 
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 1c2ee47..215542c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -678,6 +678,43 @@ make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
 	return 0;
 }
 
+static inline struct zcopy_mbuf *__attribute__((always_inline))
+get_zmbuf(struct vhost_virtqueue *vq)
+{
+	uint16_t i;
+	uint16_t last;
+	int tries = 0;
+
+	/* search [last_zmbuf_idx, zmbuf_size) */
+	i = vq->last_zmbuf_idx;
+	last = vq->zmbuf_size;
+
+again:
+	for (; i < last; i++) {
+		if (vq->zmbufs[i].in_use == 0) {
+			vq->last_zmbuf_idx = i + 1;
+			vq->zmbufs[i].in_use = 1;
+			return &vq->zmbufs[i];
+		}
+	}
+
+	tries++;
+	if (tries == 1) {
+		/* search [0, last_zmbuf_idx) */
+		i = 0;
+		last = vq->last_zmbuf_idx;
+		goto again;
+	}
+
+	return NULL;
+}
+
+static inline void __attribute__((always_inline))
+put_zmbuf(struct zcopy_mbuf *zmbuf)
+{
+	zmbuf->in_use = 0;
+}
+
 static inline int __attribute__((always_inline))
 copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		  struct rte_mbuf *m, uint16_t desc_idx,
@@ -701,6 +738,27 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	if (unlikely(!desc_addr))
 		return -1;
 
+	if (unlikely(dev->dequeue_zero_copy)) {
+		struct zcopy_mbuf *zmbuf;
+
+		zmbuf = get_zmbuf(vq);
+		if (!zmbuf)
+			return -1;
+		zmbuf->mbuf = m;
+		zmbuf->desc_idx = desc_idx;
+
+		/*
+		 * Pin lock the mbuf; we will check later to see whether
+		 * the mbuf is freed (when we are the last user) or not.
+		 * If that's the case, we then could update the used ring
+		 * safely.
+		 */
+		rte_mbuf_refcnt_update(m, 1);
+
+		vq->nr_zmbuf += 1;
+		TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
+	}
+
 	hdr = (struct virtio_net_hdr *)((uintptr_t)desc_addr);
 	rte_prefetch0(hdr);
 
The function copy_desc_to_mbuf has changed in the dpdk-next-virtio repo. The current dpdk-next-virtio tree is at the commit below:
commit b4f7b43cd9d3b6413f41221051d03a23bc5f5fbe
Author: Zhiyong Yang <zhiyong.yang@intel.com>
Date:   Thu Sep 29 20:35:49 2016 +0800

At that commit the parameter "struct vhost_virtqueue *vq" is gone, so applying your patch on top of it breaks the build: vq is no longer defined there, but the patch still uses it in the function.
Could you check? Thanks.

== Build lib/librte_table
/home/qxu10/dpdk-zero/lib/librte_vhost/virtio_net.c: In function 'copy_desc_to_mbuf':
/home/qxu10/dpdk-zero/lib/librte_vhost/virtio_net.c:745:21: error: 'vq' undeclared (first use in this function)
   zmbuf = get_zmbuf(vq);
                     ^
/home/qxu10/dpdk-zero/lib/librte_vhost/virtio_net.c:745:21: note: each undeclared identifier is reported only once for each function it appears in
/home/qxu10/dpdk-zero/mk/internal/rte.compile-pre.mk:138: recipe for target 'virtio_net.o' failed
make[5]: *** [virtio_net.o] Error 1
/home/qxu10/dpdk-zero/mk/rte.subdir.mk:61: recipe for target 'librte_vhost' failed
make[4]: *** [librte_vhost] Error 2
make[4]: *** Waiting for unfinished jobs....


* Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
  2016-10-06 14:37     ` Xu, Qian Q
@ 2016-10-09  2:03       ` Yuanhan Liu
  2016-10-10 10:12         ` Xu, Qian Q
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09  2:03 UTC (permalink / raw)
  To: Xu, Qian Q; +Cc: dev, Maxime Coquelin

On Thu, Oct 06, 2016 at 02:37:27PM +0000, Xu, Qian Q wrote:
> The function copy_desc_to_mbuf has changed in the dpdk-next-virtio repo. The current dpdk-next-virtio tree is at the commit below:
> commit b4f7b43cd9d3b6413f41221051d03a23bc5f5fbe
> Author: Zhiyong Yang <zhiyong.yang@intel.com>
> Date:   Thu Sep 29 20:35:49 2016 +0800
>
> At that commit the parameter "struct vhost_virtqueue *vq" is gone, so applying your patch on top of it breaks the build: vq is no longer defined there, but the patch still uses it in the function.
> Could you check? Thanks.

I knew that: a rebase is needed, and I have done the rebase (locally);
just haven't sent it out yet.

	--yliu

> 
> == Build lib/librte_table
> /home/qxu10/dpdk-zero/lib/librte_vhost/virtio_net.c: In function 'copy_desc_to_mbuf':
> /home/qxu10/dpdk-zero/lib/librte_vhost/virtio_net.c:745:21: error: 'vq' undeclared (first use in this function)
>    zmbuf = get_zmbuf(vq);
>                      ^
> /home/qxu10/dpdk-zero/lib/librte_vhost/virtio_net.c:745:21: note: each undeclared identifier is reported only once for each function it appears in
> /home/qxu10/dpdk-zero/mk/internal/rte.compile-pre.mk:138: recipe for target 'virtio_net.o' failed
> make[5]: *** [virtio_net.o] Error 1
> /home/qxu10/dpdk-zero/mk/rte.subdir.mk:61: recipe for target 'librte_vhost' failed
> make[4]: *** [librte_vhost] Error 2
> make[4]: *** Waiting for unfinished jobs....
> 


* [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
                     ` (6 preceding siblings ...)
  2016-09-23  4:13   ` [dpdk-dev] [PATCH v2 7/7] net/vhost: " Yuanhan Liu
@ 2016-10-09  7:27   ` Yuanhan Liu
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 1/7] vhost: simplify memory regions handling Yuanhan Liu
                       ` (8 more replies)
  7 siblings, 9 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09  7:27 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

This patch set enables vhost dequeue zero copy. Most of the work is in
patch 4: "vhost: add dequeue zero copy".

The basic idea of dequeue zero copy is, instead of copying data from the
desc buf, here we let the mbuf reference the desc buf addr directly.

The major issue behind that is how and when to update the used ring.
You could check the commit log of patch 4 for more details.

Patch 5 introduces a new flag, RTE_VHOST_USER_DEQUEUE_ZERO_COPY, to enable
dequeue zero copy, which is disabled by default.

The performance gain is quite impressive. For a simple dequeue workload
(running rxonly in vhost-pmd and running txonly in guest testpmd), it yields
a 50+% performance boost for packet size 1500B. For the VM2VM iperf test case,
it's even better: about a 70% boost.

For small packets, the performance is worse (as expected, since the extra
overhead introduced by zero copy outweighs the benefit of saving a copy
of only a few bytes).

v3: - rebase: mainly to resolve conflicts with the Tx indirect patch
    - don't update last_used_idx twice in zero-copy mode
    - handle two missing "Tx -> dequeue" renames in log and usage

v2: - renamed "tx zero copy" to "dequeue zero copy", to reduce confusion.
    - handle the case where a desc buf might cross 2 host phys pages
    - use MAP_POPULATE to let the kernel populate the page tables
    - updated release note
    - documented the limitations for the vm2nic case
    - merge 2 contiguous guest phys memory regions
    - and a few more trivial changes; please see them in the corresponding
      patches

---
Yuanhan Liu (7):
  vhost: simplify memory regions handling
  vhost: get guest/host physical address mappings
  vhost: introduce last avail idx for dequeue
  vhost: add dequeue zero copy
  vhost: add a flag to enable dequeue zero copy
  examples/vhost: add an option to enable dequeue zero copy
  net/vhost: add an option to enable dequeue zero copy

 doc/guides/prog_guide/vhost_lib.rst    |  35 +++-
 doc/guides/rel_notes/release_16_11.rst |  13 ++
 drivers/net/vhost/rte_eth_vhost.c      |  13 ++
 examples/vhost/main.c                  |  19 +-
 lib/librte_vhost/rte_virtio_net.h      |   1 +
 lib/librte_vhost/socket.c              |   5 +
 lib/librte_vhost/vhost.c               |  12 ++
 lib/librte_vhost/vhost.h               | 102 ++++++++---
 lib/librte_vhost/vhost_user.c          | 315 ++++++++++++++++++++++-----------
 lib/librte_vhost/virtio_net.c          | 196 +++++++++++++++++---
 10 files changed, 549 insertions(+), 162 deletions(-)

-- 
1.9.0


* [dpdk-dev] [PATCH v3 1/7] vhost: simplify memory regions handling
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
@ 2016-10-09  7:27     ` Yuanhan Liu
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 2/7] vhost: get guest/host physical address mappings Yuanhan Liu
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09  7:27 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

For historical reasons (vhost-cuse came before vhost-user), some
fields for maintaining the vhost-user memory mappings (such as the mmapped
address and size, which we need in order to unmap on destroy) are kept in
the "orig_region_map" struct, a structure that is defined only in the
vhost-user source file.

The right way to go is to remove the structure and move all those fields
into the virtio_memory_region struct. But we simply couldn't do that before,
because it would have broken the ABI.

Now, thanks to the ABI refactoring, that is no longer a blocking issue.
And here it goes: this patch removes orig_region_map and
redefines virtio_memory_region to include all necessary info.

With that, we can simplify the guest/host address conversion a bit.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/vhost.h      |  49 ++++++------
 lib/librte_vhost/vhost_user.c | 173 +++++++++++++++++-------------------------
 2 files changed, 91 insertions(+), 131 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index c2dfc3c..df2107b 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -143,12 +143,14 @@ struct virtio_net {
  * Information relating to memory regions including offsets to
  * addresses in QEMUs memory file.
  */
-struct virtio_memory_regions {
-	uint64_t guest_phys_address;
-	uint64_t guest_phys_address_end;
-	uint64_t memory_size;
-	uint64_t userspace_address;
-	uint64_t address_offset;
+struct virtio_memory_region {
+	uint64_t guest_phys_addr;
+	uint64_t guest_user_addr;
+	uint64_t host_user_addr;
+	uint64_t size;
+	void	 *mmap_addr;
+	uint64_t mmap_size;
+	int fd;
 };
 
 
@@ -156,12 +158,8 @@ struct virtio_memory_regions {
  * Memory structure includes region and mapping information.
  */
 struct virtio_memory {
-	/* Base QEMU userspace address of the memory file. */
-	uint64_t base_address;
-	uint64_t mapped_address;
-	uint64_t mapped_size;
 	uint32_t nregions;
-	struct virtio_memory_regions regions[0];
+	struct virtio_memory_region regions[0];
 };
 
 
@@ -200,26 +198,23 @@ extern uint64_t VHOST_FEATURES;
 #define MAX_VHOST_DEVICE	1024
 extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
 
-/**
- * Function to convert guest physical addresses to vhost virtual addresses.
- * This is used to convert guest virtio buffer addresses.
- */
+/* Convert guest physical Address to host virtual address */
 static inline uint64_t __attribute__((always_inline))
-gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
+gpa_to_vva(struct virtio_net *dev, uint64_t gpa)
 {
-	struct virtio_memory_regions *region;
-	uint32_t regionidx;
-	uint64_t vhost_va = 0;
-
-	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
-		region = &dev->mem->regions[regionidx];
-		if ((guest_pa >= region->guest_phys_address) &&
-			(guest_pa <= region->guest_phys_address_end)) {
-			vhost_va = region->address_offset + guest_pa;
-			break;
+	struct virtio_memory_region *reg;
+	uint32_t i;
+
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+		if (gpa >= reg->guest_phys_addr &&
+		    gpa <  reg->guest_phys_addr + reg->size) {
+			return gpa - reg->guest_phys_addr +
+			       reg->host_user_addr;
 		}
 	}
-	return vhost_va;
+
+	return 0;
 }
 
 struct virtio_net_device_ops const *notify_ops;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index eee99e9..49585b8 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -74,18 +74,6 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
 	[VHOST_USER_SEND_RARP]  = "VHOST_USER_SEND_RARP",
 };
 
-struct orig_region_map {
-	int fd;
-	uint64_t mapped_address;
-	uint64_t mapped_size;
-	uint64_t blksz;
-};
-
-#define orig_region(ptr, nregions) \
-	((struct orig_region_map *)RTE_PTR_ADD((ptr), \
-		sizeof(struct virtio_memory) + \
-		sizeof(struct virtio_memory_regions) * (nregions)))
-
 static uint64_t
 get_blk_size(int fd)
 {
@@ -99,18 +87,17 @@ get_blk_size(int fd)
 static void
 free_mem_region(struct virtio_net *dev)
 {
-	struct orig_region_map *region;
-	unsigned int idx;
+	uint32_t i;
+	struct virtio_memory_region *reg;
 
 	if (!dev || !dev->mem)
 		return;
 
-	region = orig_region(dev->mem, dev->mem->nregions);
-	for (idx = 0; idx < dev->mem->nregions; idx++) {
-		if (region[idx].mapped_address) {
-			munmap((void *)(uintptr_t)region[idx].mapped_address,
-					region[idx].mapped_size);
-			close(region[idx].fd);
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+		if (reg->host_user_addr) {
+			munmap(reg->mmap_addr, reg->mmap_size);
+			close(reg->fd);
 		}
 	}
 }
@@ -120,7 +107,7 @@ vhost_backend_cleanup(struct virtio_net *dev)
 {
 	if (dev->mem) {
 		free_mem_region(dev);
-		free(dev->mem);
+		rte_free(dev->mem);
 		dev->mem = NULL;
 	}
 	if (dev->log_addr) {
@@ -286,25 +273,23 @@ numa_realloc(struct virtio_net *dev, int index __rte_unused)
  * used to convert the ring addresses to our address space.
  */
 static uint64_t
-qva_to_vva(struct virtio_net *dev, uint64_t qemu_va)
+qva_to_vva(struct virtio_net *dev, uint64_t qva)
 {
-	struct virtio_memory_regions *region;
-	uint64_t vhost_va = 0;
-	uint32_t regionidx = 0;
+	struct virtio_memory_region *reg;
+	uint32_t i;
 
 	/* Find the region where the address lives. */
-	for (regionidx = 0; regionidx < dev->mem->nregions; regionidx++) {
-		region = &dev->mem->regions[regionidx];
-		if ((qemu_va >= region->userspace_address) &&
-			(qemu_va <= region->userspace_address +
-			region->memory_size)) {
-			vhost_va = qemu_va + region->guest_phys_address +
-				region->address_offset -
-				region->userspace_address;
-			break;
+	for (i = 0; i < dev->mem->nregions; i++) {
+		reg = &dev->mem->regions[i];
+
+		if (qva >= reg->guest_user_addr &&
+		    qva <  reg->guest_user_addr + reg->size) {
+			return qva - reg->guest_user_addr +
+			       reg->host_user_addr;
 		}
 	}
-	return vhost_va;
+
+	return 0;
 }
 
 /*
@@ -391,11 +376,13 @@ static int
 vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 {
 	struct VhostUserMemory memory = pmsg->payload.memory;
-	struct virtio_memory_regions *pregion;
-	uint64_t mapped_address, mapped_size;
-	unsigned int idx = 0;
-	struct orig_region_map *pregion_orig;
+	struct virtio_memory_region *reg;
+	void *mmap_addr;
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
 	uint64_t alignment;
+	uint32_t i;
+	int fd;
 
 	/* Remove from the data plane. */
 	if (dev->flags & VIRTIO_DEV_RUNNING) {
@@ -405,14 +392,12 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 
 	if (dev->mem) {
 		free_mem_region(dev);
-		free(dev->mem);
+		rte_free(dev->mem);
 		dev->mem = NULL;
 	}
 
-	dev->mem = calloc(1,
-		sizeof(struct virtio_memory) +
-		sizeof(struct virtio_memory_regions) * memory.nregions +
-		sizeof(struct orig_region_map) * memory.nregions);
+	dev->mem = rte_zmalloc("vhost-mem-table", sizeof(struct virtio_memory) +
+		sizeof(struct virtio_memory_region) * memory.nregions, 0);
 	if (dev->mem == NULL) {
 		RTE_LOG(ERR, VHOST_CONFIG,
 			"(%d) failed to allocate memory for dev->mem\n",
@@ -421,22 +406,17 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 	}
 	dev->mem->nregions = memory.nregions;
 
-	pregion_orig = orig_region(dev->mem, memory.nregions);
-	for (idx = 0; idx < memory.nregions; idx++) {
-		pregion = &dev->mem->regions[idx];
-		pregion->guest_phys_address =
-			memory.regions[idx].guest_phys_addr;
-		pregion->guest_phys_address_end =
-			memory.regions[idx].guest_phys_addr +
-			memory.regions[idx].memory_size;
-		pregion->memory_size =
-			memory.regions[idx].memory_size;
-		pregion->userspace_address =
-			memory.regions[idx].userspace_addr;
-
-		/* This is ugly */
-		mapped_size = memory.regions[idx].memory_size +
-			memory.regions[idx].mmap_offset;
+	for (i = 0; i < memory.nregions; i++) {
+		fd  = pmsg->fds[i];
+		reg = &dev->mem->regions[i];
+
+		reg->guest_phys_addr = memory.regions[i].guest_phys_addr;
+		reg->guest_user_addr = memory.regions[i].userspace_addr;
+		reg->size            = memory.regions[i].memory_size;
+		reg->fd              = fd;
+
+		mmap_offset = memory.regions[i].mmap_offset;
+		mmap_size   = reg->size + mmap_offset;
 
 		/* mmap() without flag of MAP_ANONYMOUS, should be called
 		 * with length argument aligned with hugepagesz at older
@@ -446,67 +426,52 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		 * to avoid failure, make sure in caller to keep length
 		 * aligned.
 		 */
-		alignment = get_blk_size(pmsg->fds[idx]);
+		alignment = get_blk_size(fd);
 		if (alignment == (uint64_t)-1) {
 			RTE_LOG(ERR, VHOST_CONFIG,
 				"couldn't get hugepage size through fstat\n");
 			goto err_mmap;
 		}
-		mapped_size = RTE_ALIGN_CEIL(mapped_size, alignment);
+		mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment);
 
-		mapped_address = (uint64_t)(uintptr_t)mmap(NULL,
-			mapped_size,
-			PROT_READ | PROT_WRITE, MAP_SHARED,
-			pmsg->fds[idx],
-			0);
+		mmap_addr = mmap(NULL, mmap_size,
+				 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 
-		RTE_LOG(INFO, VHOST_CONFIG,
-			"mapped region %d fd:%d to:%p sz:0x%"PRIx64" "
-			"off:0x%"PRIx64" align:0x%"PRIx64"\n",
-			idx, pmsg->fds[idx], (void *)(uintptr_t)mapped_address,
-			mapped_size, memory.regions[idx].mmap_offset,
-			alignment);
-
-		if (mapped_address == (uint64_t)(uintptr_t)MAP_FAILED) {
+		if (mmap_addr == MAP_FAILED) {
 			RTE_LOG(ERR, VHOST_CONFIG,
-				"mmap qemu guest failed.\n");
+				"mmap region %u failed.\n", i);
 			goto err_mmap;
 		}
 
-		pregion_orig[idx].mapped_address = mapped_address;
-		pregion_orig[idx].mapped_size = mapped_size;
-		pregion_orig[idx].blksz = alignment;
-		pregion_orig[idx].fd = pmsg->fds[idx];
-
-		mapped_address +=  memory.regions[idx].mmap_offset;
+		reg->mmap_addr = mmap_addr;
+		reg->mmap_size = mmap_size;
+		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
+				      mmap_offset;
 
-		pregion->address_offset = mapped_address -
-			pregion->guest_phys_address;
-
-		if (memory.regions[idx].guest_phys_addr == 0) {
-			dev->mem->base_address =
-				memory.regions[idx].userspace_addr;
-			dev->mem->mapped_address =
-				pregion->address_offset;
-		}
-
-		LOG_DEBUG(VHOST_CONFIG,
-			"REGION: %u GPA: %p QEMU VA: %p SIZE (%"PRIu64")\n",
-			idx,
-			(void *)(uintptr_t)pregion->guest_phys_address,
-			(void *)(uintptr_t)pregion->userspace_address,
-			 pregion->memory_size);
+		RTE_LOG(INFO, VHOST_CONFIG,
+			"guest memory region %u, size: 0x%" PRIx64 "\n"
+			"\t guest physical addr: 0x%" PRIx64 "\n"
+			"\t guest virtual  addr: 0x%" PRIx64 "\n"
+			"\t host  virtual  addr: 0x%" PRIx64 "\n"
+			"\t mmap addr : 0x%" PRIx64 "\n"
+			"\t mmap size : 0x%" PRIx64 "\n"
+			"\t mmap align: 0x%" PRIx64 "\n"
+			"\t mmap off  : 0x%" PRIx64 "\n",
+			i, reg->size,
+			reg->guest_phys_addr,
+			reg->guest_user_addr,
+			reg->host_user_addr,
+			(uint64_t)(uintptr_t)mmap_addr,
+			mmap_size,
+			alignment,
+			mmap_offset);
 	}
 
 	return 0;
 
 err_mmap:
-	while (idx--) {
-		munmap((void *)(uintptr_t)pregion_orig[idx].mapped_address,
-				pregion_orig[idx].mapped_size);
-		close(pregion_orig[idx].fd);
-	}
-	free(dev->mem);
+	free_mem_region(dev);
+	rte_free(dev->mem);
 	dev->mem = NULL;
 	return -1;
 }
-- 
1.9.0
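
As a reading aid for the patch above, the lookup that the new gpa_to_vva() performs can be sketched without any DPDK dependency. This is only an illustration: struct mem_region is a trimmed, hypothetical copy of the three virtio_memory_region fields the lookup needs, and region_gpa_to_vva() takes the region array directly instead of a struct virtio_net.

```c
#include <stdint.h>

/* Trimmed, hypothetical copy of the fields the lookup needs. */
struct mem_region {
	uint64_t guest_phys_addr;	/* region start in guest physical space */
	uint64_t host_user_addr;	/* where that start is mapped in this process */
	uint64_t size;			/* region length in bytes */
};

/*
 * Translate a guest physical address to a host virtual address.
 * Returns 0 when no region covers gpa, as gpa_to_vva() does.
 */
static uint64_t
region_gpa_to_vva(const struct mem_region *regs, uint32_t nregions,
		  uint64_t gpa)
{
	uint32_t i;

	for (i = 0; i < nregions; i++) {
		const struct mem_region *reg = &regs[i];

		/* Half-open range check: [start, start + size). */
		if (gpa >= reg->guest_phys_addr &&
		    gpa <  reg->guest_phys_addr + reg->size)
			return gpa - reg->guest_phys_addr +
			       reg->host_user_addr;
	}

	return 0;
}
```

One visible difference from the removed code: the old check was "guest_pa <= guest_phys_address_end" with the end computed as start plus size, so it also accepted the first byte past the region; the half-open check does not.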


* [dpdk-dev] [PATCH v3 2/7] vhost: get guest/host physical address mappings
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 1/7] vhost: simplify memory regions handling Yuanhan Liu
@ 2016-10-09  7:27     ` Yuanhan Liu
  2016-11-29  3:10       ` linhaifeng
  2016-11-29 13:14       ` linhaifeng
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 3/7] vhost: introduce last avail idx for dequeue Yuanhan Liu
                       ` (6 subsequent siblings)
  8 siblings, 2 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09  7:27 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

So that we can convert a guest physical address to a host physical
address, which will be used in the later dequeue zero copy implementation.

MAP_POPULATE is set while mmapping guest memory regions, to make
sure the page tables are set up so that rte_mem_virt2phy() can
yield a proper physical address.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---

v2: - use the MAP_POPULATE option to make sure the page tables are
      already set up when getting the phys address

    - do a simple merge if the last 2 pages are contiguous

    - dump guest pages only in debug mode
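
To illustrate why the page merge matters for the size-aware lookup, here is a small standalone sketch. It is not the patch code: struct page_map, struct page_table, page_table_add() and page_table_gpa_to_hpa() are hypothetical names, the table is a fixed-size array rather than a grown buffer, and the merge rule (extend the last entry only when the new page follows it in both guest and host physical space) is an assumption made for this sketch. The range comparison in the lookup mirrors the one in gpa_to_hpa() from this patch.

```c
#include <stdint.h>

#define MAX_PAGES 8	/* arbitrary capacity for this sketch */

struct page_map {
	uint64_t guest_phys_addr;
	uint64_t host_phys_addr;
	uint64_t size;
};

struct page_table {
	uint32_t nr;
	struct page_map pages[MAX_PAGES];
};

/* Record one guest page; returns 0 on success, -1 if the table is full. */
static int
page_table_add(struct page_table *t, uint64_t gpa, uint64_t hpa,
	       uint64_t size)
{
	struct page_map *last;

	if (t->nr > 0) {
		last = &t->pages[t->nr - 1];
		/* Assumed merge rule: contiguous in both address spaces. */
		if (gpa == last->guest_phys_addr + last->size &&
		    hpa == last->host_phys_addr + last->size) {
			last->size += size;
			return 0;
		}
	}

	if (t->nr == MAX_PAGES)
		return -1;

	t->pages[t->nr].guest_phys_addr = gpa;
	t->pages[t->nr].host_phys_addr = hpa;
	t->pages[t->nr].size = size;
	t->nr++;
	return 0;
}

/*
 * Return the host physical address for [gpa, gpa + size) if the range
 * lies within a single table entry, 0 otherwise. The comparison is the
 * same one gpa_to_hpa() uses in the patch.
 */
static uint64_t
page_table_gpa_to_hpa(const struct page_table *t, uint64_t gpa,
		      uint64_t size)
{
	uint32_t i;

	for (i = 0; i < t->nr; i++) {
		const struct page_map *p = &t->pages[i];

		if (gpa >= p->guest_phys_addr &&
		    gpa + size < p->guest_phys_addr + p->size)
			return gpa - p->guest_phys_addr + p->host_phys_addr;
	}

	return 0;
}
```

With this table, a buffer that straddles two guest pages still resolves when the pages were merged into one entry, and yields 0 when the second page lives elsewhere in host physical memory, so a caller can treat 0 as "not addressable as one physical range".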
---
 lib/librte_vhost/vhost.h      |  30 +++++++++++++
 lib/librte_vhost/vhost_user.c | 100 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index df2107b..2d52987 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -114,6 +114,12 @@ struct vhost_virtqueue {
  #define VIRTIO_F_VERSION_1 32
 #endif
 
+struct guest_page {
+	uint64_t guest_phys_addr;
+	uint64_t host_phys_addr;
+	uint64_t size;
+};
+
 /**
  * Device structure contains all configuration information relating
  * to the device.
@@ -137,6 +143,10 @@ struct virtio_net {
 	uint64_t		log_addr;
 	struct ether_addr	mac;
 
+	uint32_t		nr_guest_pages;
+	uint32_t		max_guest_pages;
+	struct guest_page       *guest_pages;
+
 } __rte_cache_aligned;
 
 /**
@@ -217,6 +227,26 @@ gpa_to_vva(struct virtio_net *dev, uint64_t gpa)
 	return 0;
 }
 
+/* Convert guest physical address to host physical address */
+static inline phys_addr_t __attribute__((always_inline))
+gpa_to_hpa(struct virtio_net *dev, uint64_t gpa, uint64_t size)
+{
+	uint32_t i;
+	struct guest_page *page;
+
+	for (i = 0; i < dev->nr_guest_pages; i++) {
+		page = &dev->guest_pages[i];
+
+		if (gpa >= page->guest_phys_addr &&
+		    gpa + size < page->guest_phys_addr + page->size) {
+			return gpa - page->guest_phys_addr +
+			       page->host_phys_addr;
+		}
+	}
+
+	return 0;
+}
+
 struct virtio_net_device_ops const *notify_ops;
 struct virtio_net *get_device(int vid);
 
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index 49585b8..e651912 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -372,6 +372,91 @@ vhost_user_set_vring_base(struct virtio_net *dev,
 	return 0;
 }
 
+static void
+add_one_guest_page(struct virtio_net *dev, uint64_t guest_phys_addr,
+		   uint64_t host_phys_addr, uint64_t size)
+{
+	struct guest_page *page, *last_page;
+
+	if (dev->nr_guest_pages == dev->max_guest_pages) {
+		dev->max_guest_pages *= 2;
+		dev->guest_pages = realloc(dev->guest_pages,
+					dev->max_guest_pages * sizeof(*page));
+	}
+
+	if (dev->nr_guest_pages > 0) {
+		last_page = &dev->guest_pages[dev->nr_guest_pages - 1];
+		/* merge if the two pages are continuous */
+		if (host_phys_addr == last_page->host_phys_addr +
+				      last_page->size) {
+			last_page->size += size;
+			return;
+		}
+	}
+
+	page = &dev->guest_pages[dev->nr_guest_pages++];
+	page->guest_phys_addr = guest_phys_addr;
+	page->host_phys_addr  = host_phys_addr;
+	page->size = size;
+}
+
+static void
+add_guest_pages(struct virtio_net *dev, struct virtio_memory_region *reg,
+		uint64_t page_size)
+{
+	uint64_t reg_size = reg->size;
+	uint64_t host_user_addr  = reg->host_user_addr;
+	uint64_t guest_phys_addr = reg->guest_phys_addr;
+	uint64_t host_phys_addr;
+	uint64_t size;
+
+	host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)host_user_addr);
+	size = page_size - (guest_phys_addr & (page_size - 1));
+	size = RTE_MIN(size, reg_size);
+
+	add_one_guest_page(dev, guest_phys_addr, host_phys_addr, size);
+	host_user_addr  += size;
+	guest_phys_addr += size;
+	reg_size -= size;
+
+	while (reg_size > 0) {
+		host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)
+						  host_user_addr);
+		add_one_guest_page(dev, guest_phys_addr, host_phys_addr,
+				   page_size);
+
+		host_user_addr  += page_size;
+		guest_phys_addr += page_size;
+		reg_size -= page_size;
+	}
+}
+
+#ifdef RTE_LIBRTE_VHOST_DEBUG
+/* TODO: enable it only in debug mode? */
+static void
+dump_guest_pages(struct virtio_net *dev)
+{
+	uint32_t i;
+	struct guest_page *page;
+
+	for (i = 0; i < dev->nr_guest_pages; i++) {
+		page = &dev->guest_pages[i];
+
+		RTE_LOG(INFO, VHOST_CONFIG,
+			"guest physical page region %u\n"
+			"\t guest_phys_addr: %" PRIx64 "\n"
+			"\t host_phys_addr : %" PRIx64 "\n"
+			"\t size           : %" PRIx64 "\n",
+			i,
+			page->guest_phys_addr,
+			page->host_phys_addr,
+			page->size);
+	}
+}
+#else
+#define dump_guest_pages(dev)
+#endif
+
 static int
 vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 {
@@ -396,6 +481,13 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		dev->mem = NULL;
 	}
 
+	dev->nr_guest_pages = 0;
+	if (!dev->guest_pages) {
+		dev->max_guest_pages = 8;
+		dev->guest_pages = malloc(dev->max_guest_pages *
+						sizeof(struct guest_page));
+	}
+
 	dev->mem = rte_zmalloc("vhost-mem-table", sizeof(struct virtio_memory) +
 		sizeof(struct virtio_memory_region) * memory.nregions, 0);
 	if (dev->mem == NULL) {
@@ -434,8 +526,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		}
 		mmap_size = RTE_ALIGN_CEIL(mmap_size, alignment);
 
-		mmap_addr = mmap(NULL, mmap_size,
-				 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+		mmap_addr = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+				 MAP_SHARED | MAP_POPULATE, fd, 0);
 
 		if (mmap_addr == MAP_FAILED) {
 			RTE_LOG(ERR, VHOST_CONFIG,
@@ -448,6 +540,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 		reg->host_user_addr = (uint64_t)(uintptr_t)mmap_addr +
 				      mmap_offset;
 
+		add_guest_pages(dev, reg, alignment);
+
 		RTE_LOG(INFO, VHOST_CONFIG,
 			"guest memory region %u, size: 0x%" PRIx64 "\n"
 			"\t guest physical addr: 0x%" PRIx64 "\n"
@@ -467,6 +561,8 @@ vhost_user_set_mem_table(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 			mmap_offset);
 	}
 
+	dump_guest_pages(dev);
+
 	return 0;
 
 err_mmap:
-- 
1.9.0


* [dpdk-dev] [PATCH v3 3/7] vhost: introduce last avail idx for dequeue
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 1/7] vhost: simplify memory regions handling Yuanhan Liu
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 2/7] vhost: get guest/host physical address mappings Yuanhan Liu
@ 2016-10-09  7:27     ` Yuanhan Liu
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 4/7] vhost: add dequeue zero copy Yuanhan Liu
                       ` (5 subsequent siblings)
  8 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09  7:27 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

So far, we have tracked both the used ring and the avail ring position
with one variable, last_used_idx. That is not a problem, because the
used ring is updated immediately after the avail entries are consumed.

That no longer holds when dequeue zero copy is enabled: the used ring is
updated only when the mbuf is consumed. Thus, we need another variable
to record the last avail ring idx we have consumed.

Therefore, last_avail_idx is introduced.
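
To illustrate why one cursor is not enough, here is a small standalone
sketch of the two indexes drifting apart. The struct and helper names
are hypothetical and stripped down; only the 16-bit wraparound
arithmetic and the power-of-two masking mirror what the vhost code
does.

```c
#include <stdint.h>

/* Hypothetical, stripped-down cursor state for one virtqueue. */
struct ring_cursors {
	uint16_t size;			/* ring size, a power of two */
	uint16_t last_avail_idx;	/* next avail entry to consume */
	uint16_t last_used_idx;		/* next used entry to fill */
};

/* Entries the guest has published that we have not consumed yet. */
static uint16_t
pending_avail(const struct ring_cursors *c, uint16_t guest_avail_idx)
{
	/* uint16_t subtraction stays correct across index wraparound. */
	return (uint16_t)(guest_avail_idx - c->last_avail_idx);
}

/* Consume n avail entries; with zero copy the used side stays put. */
static void
consume(struct ring_cursors *c, uint16_t n, int zero_copy)
{
	c->last_avail_idx = (uint16_t)(c->last_avail_idx + n);
	if (!zero_copy)
		c->last_used_idx = (uint16_t)(c->last_used_idx + n);
}

/* Hand n descriptors back once their mbufs have been released. */
static void
complete(struct ring_cursors *c, uint16_t n)
{
	c->last_used_idx = (uint16_t)(c->last_used_idx + n);
}

/* Map a free-running index to a slot in the ring. */
static uint16_t
ring_slot(const struct ring_cursors *c, uint16_t idx)
{
	return idx & (uint16_t)(c->size - 1);
}
```

With zero copy, consume() advances only the avail cursor, and
complete() catches the used cursor up later; without it, both move
together, which is why a single variable used to be sufficient.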

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/vhost_user.c |  6 ++++--
 lib/librte_vhost/virtio_net.c | 19 +++++++++++--------
 3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2d52987..8565fa1 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -70,7 +70,7 @@ struct vhost_virtqueue {
 	struct vring_used	*used;
 	uint32_t		size;
 
-	/* Last index used on the available ring */
+	uint16_t		last_avail_idx;
 	volatile uint16_t	last_used_idx;
 #define VIRTIO_INVALID_EVENTFD		(-1)
 #define VIRTIO_UNINITIALIZED_EVENTFD	(-2)
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index e651912..a92377a 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -343,7 +343,8 @@ vhost_user_set_vring_addr(struct virtio_net *dev, struct vhost_vring_addr *addr)
 			"last_used_idx (%u) and vq->used->idx (%u) mismatches; "
 			"some packets maybe resent for Tx and dropped for Rx\n",
 			vq->last_used_idx, vq->used->idx);
-		vq->last_used_idx     = vq->used->idx;
+		vq->last_used_idx  = vq->used->idx;
+		vq->last_avail_idx = vq->used->idx;
 	}
 
 	vq->log_guest_addr = addr->log_guest_addr;
@@ -367,7 +368,8 @@ static int
 vhost_user_set_vring_base(struct virtio_net *dev,
 			  struct vhost_vring_state *state)
 {
-	dev->virtqueue[state->index]->last_used_idx = state->num;
+	dev->virtqueue[state->index]->last_used_idx  = state->num;
+	dev->virtqueue[state->index]->last_avail_idx = state->num;
 
 	return 0;
 }
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index a59c39b..70301a5 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -851,16 +851,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 		}
 	}
 
-	avail_idx =  *((volatile uint16_t *)&vq->avail->idx);
-	free_entries = avail_idx - vq->last_used_idx;
+	free_entries = *((volatile uint16_t *)&vq->avail->idx) -
+			vq->last_avail_idx;
 	if (free_entries == 0)
 		goto out;
 
 	LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
 
-	/* Prefetch available ring to retrieve head indexes. */
-	used_idx = vq->last_used_idx & (vq->size - 1);
-	rte_prefetch0(&vq->avail->ring[used_idx]);
+	/* Prefetch available and used ring */
+	avail_idx = vq->last_avail_idx & (vq->size - 1);
+	used_idx  = vq->last_used_idx  & (vq->size - 1);
+	rte_prefetch0(&vq->avail->ring[avail_idx]);
 	rte_prefetch0(&vq->used->ring[used_idx]);
 
 	count = RTE_MIN(count, MAX_PKT_BURST);
@@ -870,8 +871,9 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 
 	/* Retrieve all of the head indexes first to avoid caching issues. */
 	for (i = 0; i < count; i++) {
-		used_idx = (vq->last_used_idx + i) & (vq->size - 1);
-		desc_indexes[i] = vq->avail->ring[used_idx];
+		avail_idx = (vq->last_avail_idx + i) & (vq->size - 1);
+		used_idx  = (vq->last_used_idx  + i) & (vq->size - 1);
+		desc_indexes[i] = vq->avail->ring[avail_idx];
 
 		vq->used->ring[used_idx].id  = desc_indexes[i];
 		vq->used->ring[used_idx].len = 0;
@@ -921,7 +923,8 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	rte_smp_wmb();
 	rte_smp_rmb();
 	vq->used->idx += i;
-	vq->last_used_idx += i;
+	vq->last_avail_idx += i;
+	vq->last_used_idx  += i;
 	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
 			sizeof(vq->used->idx));
 
-- 
1.9.0


* [dpdk-dev] [PATCH v3 4/7] vhost: add dequeue zero copy
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
                       ` (2 preceding siblings ...)
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 3/7] vhost: introduce last avail idx for dequeue Yuanhan Liu
@ 2016-10-09  7:27     ` Yuanhan Liu
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 5/7] vhost: add a flag to enable " Yuanhan Liu
                       ` (4 subsequent siblings)
  8 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09  7:27 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

The basic idea of dequeue zero copy is that, instead of copying data
from the desc buf, we let the mbuf reference the desc buf addr directly.

Doing so, however, has one major issue: we can't update the used ring
at the end of rte_vhost_dequeue_burst. Because we don't do the copy
here, an update of the used ring would let the driver reclaim the
desc buf. As a result, DPDK might reference a stale memory region.

To update the used ring properly, this patch does several tricks:

- When an mbuf references a desc buf, its refcnt is increased by 1.

  This pins the mbuf, so that an mbuf free from DPDK won't actually
  free it; instead, refcnt is decreased by 1.

- We chain all those mbufs together (by tailq).

  We check the chain on every rte_vhost_dequeue_burst entry to see
  whether an mbuf has been freed (its refcnt equals 1). If so, we
  are the last user of this mbuf and it is safe to update the used
  ring.

- "struct zcopy_mbuf" is introduced, to associate an mbuf with the
  right desc idx.
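
The pin-and-reclaim scheme above can be sketched on its own as
follows. This is an illustration under simplifying assumptions, not
the patch code: struct fake_mbuf stands in for rte_mbuf, a plain array
replaces the tailq, and pin(), app_free() and reclaim() are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf: only the refcnt matters here. */
struct fake_mbuf {
	uint16_t refcnt;
};

/* One in-flight zero-copy buffer and the descriptor it borrows. */
struct pinned {
	struct fake_mbuf *m;
	uint32_t desc_idx;
};

/* Pin: take an extra reference so an application free cannot release m. */
static void
pin(struct pinned *p, struct fake_mbuf *m, uint32_t desc_idx)
{
	m->refcnt += 1;
	p->m = m;
	p->desc_idx = desc_idx;
}

/* Application-side free: drops one reference. */
static void
app_free(struct fake_mbuf *m)
{
	m->refcnt -= 1;
}

/*
 * Scan the in-flight list. An entry whose mbuf is back to refcnt 1 is
 * held by us alone: record its descriptor in done[], drop our
 * reference, and compact the list. Returns the number reclaimed,
 * i.e. how many used ring entries may now be published.
 */
static size_t
reclaim(struct pinned *list, size_t *n, uint32_t *done)
{
	size_t kept = 0;
	size_t freed = 0;
	size_t i;

	for (i = 0; i < *n; i++) {
		if (list[i].m->refcnt == 1) {
			done[freed++] = list[i].desc_idx;
			list[i].m->refcnt = 0;
		} else {
			list[kept++] = list[i];
		}
	}

	*n = kept;
	return freed;
}
```

Note that descriptors come back in the order their mbufs are released,
not the order they were dequeued, which is why each entry has to carry
its own desc_idx.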

Dequeue zero copy is introduced for performance reasons. Some rough
tests show about a 50% performance boost for packet size 1500B. For
small packets (e.g. 64B), it actually slows things down a bit (by up
to 15%). That is expected, because this patch introduces some extra
work that outweighs the benefit of saving a copy of a few bytes.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---

v3: - rebase on top of the indirect enabling patch

    - don't update last_used_idx twice for zero-copy

    - handle the missing "tx -> dequeue" rename in log message

v2: - rename "Tx zero copy" to "dequeue zero copy" to reduce confusion

    - use unlikely/likely for the dequeue_zero_copy check, as it's not
      enabled by default and has some limitations in the vm2nic case

    - handle the case where a desc buf might span 2 host phys pages

    - reset nr_zmbuf to 0 at set_vring_num

    - set the zmbuf_size to vq->size, but not the double of it.
---
 lib/librte_vhost/vhost.c      |   2 +
 lib/librte_vhost/vhost.h      |  22 +++++-
 lib/librte_vhost/vhost_user.c |  42 +++++++++-
 lib/librte_vhost/virtio_net.c | 179 +++++++++++++++++++++++++++++++++++++-----
 4 files changed, 224 insertions(+), 21 deletions(-)

diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index 30bb0ce..dbf5d1b 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -142,6 +142,8 @@ init_vring_queue(struct vhost_virtqueue *vq, int qp_idx)
 	/* always set the default vq pair to enabled */
 	if (qp_idx == 0)
 		vq->enabled = 1;
+
+	TAILQ_INIT(&vq->zmbuf_list);
 }
 
 static void
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 8565fa1..be8a398 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -36,6 +36,7 @@
 #include <stdint.h>
 #include <stdio.h>
 #include <sys/types.h>
+#include <sys/queue.h>
 #include <unistd.h>
 #include <linux/vhost.h>
 
@@ -61,6 +62,19 @@ struct buf_vector {
 	uint32_t desc_idx;
 };
 
+/*
+ * A structure to hold some fields needed in zero copy code path,
+ * mainly for associating an mbuf with the right desc_idx.
+ */
+struct zcopy_mbuf {
+	struct rte_mbuf *mbuf;
+	uint32_t desc_idx;
+	uint16_t in_use;
+
+	TAILQ_ENTRY(zcopy_mbuf) next;
+};
+TAILQ_HEAD(zcopy_mbuf_list, zcopy_mbuf);
+
 /**
  * Structure contains variables relevant to RX/TX virtqueues.
  */
@@ -85,6 +99,12 @@ struct vhost_virtqueue {
 
 	/* Physical address of used ring, for logging */
 	uint64_t		log_guest_addr;
+
+	uint16_t		nr_zmbuf;
+	uint16_t		zmbuf_size;
+	uint16_t		last_zmbuf_idx;
+	struct zcopy_mbuf	*zmbufs;
+	struct zcopy_mbuf_list	zmbuf_list;
 } __rte_cache_aligned;
 
 /* Old kernels have no such macro defined */
@@ -135,6 +155,7 @@ struct virtio_net {
 	/* to tell if we need broadcast rarp packet */
 	rte_atomic16_t		broadcast_rarp;
 	uint32_t		virt_qp_nb;
+	int			dequeue_zero_copy;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];
@@ -146,7 +167,6 @@ struct virtio_net {
 	uint32_t		nr_guest_pages;
 	uint32_t		max_guest_pages;
 	struct guest_page       *guest_pages;
-
 } __rte_cache_aligned;
 
 /**
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index a92377a..3074227 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -180,7 +180,23 @@ static int
 vhost_user_set_vring_num(struct virtio_net *dev,
 			 struct vhost_vring_state *state)
 {
-	dev->virtqueue[state->index]->size = state->num;
+	struct vhost_virtqueue *vq = dev->virtqueue[state->index];
+
+	vq->size = state->num;
+
+	if (dev->dequeue_zero_copy) {
+		vq->nr_zmbuf = 0;
+		vq->last_zmbuf_idx = 0;
+		vq->zmbuf_size = vq->size;
+		vq->zmbufs = rte_zmalloc(NULL, vq->zmbuf_size *
+					 sizeof(struct zcopy_mbuf), 0);
+		if (vq->zmbufs == NULL) {
+			RTE_LOG(WARNING, VHOST_CONFIG,
+				"failed to allocate mem for zero copy; "
+				"zero copy is force disabled\n");
+			dev->dequeue_zero_copy = 0;
+		}
+	}
 
 	return 0;
 }
@@ -662,11 +678,32 @@ vhost_user_set_vring_kick(struct virtio_net *dev, struct VhostUserMsg *pmsg)
 	vq->kickfd = file.fd;
 
 	if (virtio_is_ready(dev) && !(dev->flags & VIRTIO_DEV_RUNNING)) {
+		if (dev->dequeue_zero_copy) {
+			RTE_LOG(INFO, VHOST_CONFIG,
+				"dequeue zero copy is enabled\n");
+		}
+
 		if (notify_ops->new_device(dev->vid) == 0)
 			dev->flags |= VIRTIO_DEV_RUNNING;
 	}
 }
 
+static void
+free_zmbufs(struct vhost_virtqueue *vq)
+{
+	struct zcopy_mbuf *zmbuf, *next;
+
+	for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
+	     zmbuf != NULL; zmbuf = next) {
+		next = TAILQ_NEXT(zmbuf, next);
+
+		rte_pktmbuf_free(zmbuf->mbuf);
+		TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
+	}
+
+	rte_free(vq->zmbufs);
+}
+
 /*
  * when virtio is stopped, qemu will send us the GET_VRING_BASE message.
  */
@@ -695,6 +732,9 @@ vhost_user_get_vring_base(struct virtio_net *dev,
 
 	dev->virtqueue[state->index]->kickfd = VIRTIO_UNINITIALIZED_EVENTFD;
 
+	if (dev->dequeue_zero_copy)
+		free_zmbufs(dev->virtqueue[state->index]);
+
 	return 0;
 }
 
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 70301a5..74263a3 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -678,6 +678,12 @@ make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
 	return 0;
 }
 
+static inline void __attribute__((always_inline))
+put_zmbuf(struct zcopy_mbuf *zmbuf)
+{
+	zmbuf->in_use = 0;
+}
+
 static inline int __attribute__((always_inline))
 copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
 		  uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
@@ -735,10 +741,33 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
 	mbuf_offset = 0;
 	mbuf_avail  = m->buf_len - RTE_PKTMBUF_HEADROOM;
 	while (1) {
+		uint64_t hpa;
+
 		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
-		rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *, mbuf_offset),
-			(void *)((uintptr_t)(desc_addr + desc_offset)),
-			cpy_len);
+
+		/*
+		 * A desc buf might across two host physical pages that are
+		 * not continuous. In such case (gpa_to_hpa returns 0), data
+		 * will be copied even though zero copy is enabled.
+		 */
+		if (unlikely(dev->dequeue_zero_copy && (hpa = gpa_to_hpa(dev,
+					desc->addr + desc_offset, cpy_len)))) {
+			cur->data_len = cpy_len;
+			cur->data_off = 0;
+			cur->buf_addr = (void *)(uintptr_t)desc_addr;
+			cur->buf_physaddr = hpa;
+
+			/*
+			 * In zero copy mode, one mbuf can only reference data
+			 * for one or partial of one desc buff.
+			 */
+			mbuf_avail = cpy_len;
+		} else {
+			rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
+							   mbuf_offset),
+				(void *)((uintptr_t)(desc_addr + desc_offset)),
+				cpy_len);
+		}
 
 		mbuf_avail  -= cpy_len;
 		mbuf_offset += cpy_len;
@@ -801,6 +830,80 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
 	return 0;
 }
 
+static inline void __attribute__((always_inline))
+update_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		 uint32_t used_idx, uint32_t desc_idx)
+{
+	vq->used->ring[used_idx].id  = desc_idx;
+	vq->used->ring[used_idx].len = 0;
+	vhost_log_used_vring(dev, vq,
+			offsetof(struct vring_used, ring[used_idx]),
+			sizeof(vq->used->ring[used_idx]));
+}
+
+static inline void __attribute__((always_inline))
+update_used_idx(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		uint32_t count)
+{
+	if (unlikely(count == 0))
+		return;
+
+	rte_smp_wmb();
+	rte_smp_rmb();
+
+	vq->used->idx += count;
+	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+			sizeof(vq->used->idx));
+
+	/* Kick guest if required. */
+	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
+			&& (vq->callfd >= 0))
+		eventfd_write(vq->callfd, (eventfd_t)1);
+}
+
+static inline struct zcopy_mbuf *__attribute__((always_inline))
+get_zmbuf(struct vhost_virtqueue *vq)
+{
+	uint16_t i;
+	uint16_t last;
+	int tries = 0;
+
+	/* search [last_zmbuf_idx, zmbuf_size) */
+	i = vq->last_zmbuf_idx;
+	last = vq->zmbuf_size;
+
+again:
+	for (; i < last; i++) {
+		if (vq->zmbufs[i].in_use == 0) {
+			vq->last_zmbuf_idx = i + 1;
+			vq->zmbufs[i].in_use = 1;
+			return &vq->zmbufs[i];
+		}
+	}
+
+	tries++;
+	if (tries == 1) {
+		/* search [0, last_zmbuf_idx) */
+		i = 0;
+		last = vq->last_zmbuf_idx;
+		goto again;
+	}
+
+	return NULL;
+}
+
+static inline bool __attribute__((always_inline))
+mbuf_is_consumed(struct rte_mbuf *m)
+{
+	while (m) {
+		if (rte_mbuf_refcnt_read(m) > 1)
+			return false;
+		m = m->next;
+	}
+
+	return true;
+}
+
 uint16_t
 rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
@@ -828,6 +931,30 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	if (unlikely(vq->enabled == 0))
 		return 0;
 
+	if (unlikely(dev->dequeue_zero_copy)) {
+		struct zcopy_mbuf *zmbuf, *next;
+		int nr_updated = 0;
+
+		for (zmbuf = TAILQ_FIRST(&vq->zmbuf_list);
+		     zmbuf != NULL; zmbuf = next) {
+			next = TAILQ_NEXT(zmbuf, next);
+
+			if (mbuf_is_consumed(zmbuf->mbuf)) {
+				used_idx = vq->last_used_idx++ & (vq->size - 1);
+				update_used_ring(dev, vq, used_idx,
+						 zmbuf->desc_idx);
+				nr_updated += 1;
+
+				TAILQ_REMOVE(&vq->zmbuf_list, zmbuf, next);
+				rte_pktmbuf_free(zmbuf->mbuf);
+				put_zmbuf(zmbuf);
+				vq->nr_zmbuf -= 1;
+			}
+		}
+
+		update_used_idx(dev, vq, nr_updated);
+	}
+
 	/*
 	 * Construct a RARP broadcast packet, and inject it to the "pkts"
 	 * array, to looks like that guest actually send such packet.
@@ -875,11 +1002,8 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 		used_idx  = (vq->last_used_idx  + i) & (vq->size - 1);
 		desc_indexes[i] = vq->avail->ring[avail_idx];
 
-		vq->used->ring[used_idx].id  = desc_indexes[i];
-		vq->used->ring[used_idx].len = 0;
-		vhost_log_used_vring(dev, vq,
-				offsetof(struct vring_used, ring[used_idx]),
-				sizeof(vq->used->ring[used_idx]));
+		if (likely(dev->dequeue_zero_copy == 0))
+			update_used_ring(dev, vq, used_idx, desc_indexes[i]);
 	}
 
 	/* Prefetch descriptor index. */
@@ -913,25 +1037,42 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 				"Failed to allocate memory for mbuf.\n");
 			break;
 		}
+
 		err = copy_desc_to_mbuf(dev, desc, sz, pkts[i], idx, mbuf_pool);
 		if (unlikely(err)) {
 			rte_pktmbuf_free(pkts[i]);
 			break;
 		}
-	}
 
-	rte_smp_wmb();
-	rte_smp_rmb();
-	vq->used->idx += i;
+		if (unlikely(dev->dequeue_zero_copy)) {
+			struct zcopy_mbuf *zmbuf;
+
+			zmbuf = get_zmbuf(vq);
+			if (!zmbuf) {
+				rte_pktmbuf_free(pkts[i]);
+				break;
+			}
+			zmbuf->mbuf = pkts[i];
+			zmbuf->desc_idx = desc_indexes[i];
+
+			/*
+			 * Pin lock the mbuf; we will check later to see
+			 * whether the mbuf is freed (when we are the last
+			 * user) or not. If that's the case, we then could
+			 * update the used ring safely.
+			 */
+			rte_mbuf_refcnt_update(pkts[i], 1);
+
+			vq->nr_zmbuf += 1;
+			TAILQ_INSERT_TAIL(&vq->zmbuf_list, zmbuf, next);
+		}
+	}
 	vq->last_avail_idx += i;
-	vq->last_used_idx  += i;
-	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
-			sizeof(vq->used->idx));
 
-	/* Kick guest if required. */
-	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
-			&& (vq->callfd >= 0))
-		eventfd_write(vq->callfd, (eventfd_t)1);
+	if (likely(dev->dequeue_zero_copy == 0)) {
+		vq->last_used_idx += i;
+		update_used_idx(dev, vq, i);
+	}
 
 out:
 	if (unlikely(rarp_mbuf != NULL)) {
-- 
1.9.0


* [dpdk-dev] [PATCH v3 5/7] vhost: add a flag to enable dequeue zero copy
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
                       ` (3 preceding siblings ...)
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 4/7] vhost: add dequeue zero copy Yuanhan Liu
@ 2016-10-09  7:27     ` Yuanhan Liu
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 6/7] examples/vhost: add an option " Yuanhan Liu
                       ` (3 subsequent siblings)
  8 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09  7:27 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

Dequeue zero copy is disabled by default. Add a new flag,
``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``, to explicitly enable it.
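
As a rough illustration of how the flag bits combine, here is a
standalone sketch. The three flag values are the ones defined in
rte_virtio_net.h by this series; struct sock_opts and decode_flags()
are hypothetical and only loosely modelled on what socket.c does, so
treat the decoding rules as an assumption rather than the library's
exact behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in rte_virtio_net.h by this series. */
#define RTE_VHOST_USER_CLIENT			(1ULL << 0)
#define RTE_VHOST_USER_NO_RECONNECT		(1ULL << 1)
#define RTE_VHOST_USER_DEQUEUE_ZERO_COPY	(1ULL << 2)

/* Hypothetical per-socket options decoded from the flags argument. */
struct sock_opts {
	bool is_server;
	bool reconnect;
	bool dequeue_zero_copy;
};

static struct sock_opts
decode_flags(uint64_t flags)
{
	struct sock_opts o;

	/* Server mode is the default; the CLIENT bit switches it off. */
	o.is_server = !(flags & RTE_VHOST_USER_CLIENT);
	/* Reconnect only applies to client mode and is on by default. */
	o.reconnect = !o.is_server &&
		      !(flags & RTE_VHOST_USER_NO_RECONNECT);
	/* Zero copy is opt-in. */
	o.dequeue_zero_copy =
		(flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY) != 0;

	return o;
}
```

In a real application the macros come from the header, and the same
flags word is what gets passed as the second argument of
rte_vhost_driver_register(path, flags).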

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---

v2: - update release log
    - doc dequeue zero copy in detail
---
 doc/guides/prog_guide/vhost_lib.rst    | 35 +++++++++++++++++++++++++++++++++-
 doc/guides/rel_notes/release_16_11.rst | 13 +++++++++++++
 lib/librte_vhost/rte_virtio_net.h      |  1 +
 lib/librte_vhost/socket.c              |  5 +++++
 lib/librte_vhost/vhost.c               | 10 ++++++++++
 lib/librte_vhost/vhost.h               |  1 +
 6 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index 6b0c6b2..3fa9dd7 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -79,7 +79,7 @@ The following is an overview of the Vhost API functions:
   ``/dev/path`` character device file will be created. For vhost-user server
   mode, a Unix domain socket file ``path`` will be created.
 
-  Currently two flags are supported (these are valid for vhost-user only):
+  Currently supported flags are (these are valid for vhost-user only):
 
   - ``RTE_VHOST_USER_CLIENT``
 
@@ -97,6 +97,39 @@ The following is an overview of the Vhost API functions:
     This reconnect option is enabled by default. However, it can be turned off
     by setting this flag.
 
+  - ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY``
+
+    Dequeue zero copy will be enabled when this flag is set. It is disabled by
+    default.
+
+    There are some truths (including limitations) you might want to know while
+    setting this flag:
+
+    * zero copy is not good for small packets (typically for packet size below
+      512).
+
+    * zero copy is really good for VM2VM case. For iperf between two VMs, the
+      boost could be above 70% (when TSO is enabled).
+
+    * for VM2NIC case, the ``nb_tx_desc`` has to be small enough: <= 64 if virtio
+      indirect feature is not enabled and <= 128 if it is enabled.
+
+      This is because when dequeue zero copy is enabled, guest Tx used vring will
+      be updated only when corresponding mbuf is freed. Thus, the nb_tx_desc
+      has to be small enough so that the PMD driver will run out of available
+      Tx descriptors and free mbufs timely. Otherwise, guest Tx vring would be
+      starved.
+
+    * Guest memory should be backed by huge pages to achieve better
+      performance. Using 1G page size is the best.
+
+      When dequeue zero copy is enabled, the guest phys address and host phys
+      address mapping has to be established. Using non-huge pages means far
+      more page segments. To make it simple, DPDK vhost does a linear search
+      of those segments, thus the fewer the segments, the quicker we will get
+      the mapping. NOTE: we may speed it by using radix tree searching in
+      future.
+
 * ``rte_vhost_driver_session_start()``
 
   This function starts the vhost session loop to handle vhost messages. It
diff --git a/doc/guides/rel_notes/release_16_11.rst b/doc/guides/rel_notes/release_16_11.rst
index 905186a..2180d8d 100644
--- a/doc/guides/rel_notes/release_16_11.rst
+++ b/doc/guides/rel_notes/release_16_11.rst
@@ -36,6 +36,19 @@ New Features
 
      This section is a comment. Make sure to start the actual text at the margin.
 
+* **Added vhost-user dequeue zero copy support**
+
+  The copy in dequeue path is saved, which is meant to improve the performance.
+  In the VM2VM case, the boost is quite impressive. The bigger the packet size,
+  the bigger performance boost you may get. However, for VM2NIC case, there
+  are some limitations, and the boost is not as impressive as in the VM2VM case.
+  It may even drop quite a bit for small packets.
+
+  For such reason, this feature is disabled by default. It can be enabled when
+  ``RTE_VHOST_USER_DEQUEUE_ZERO_COPY`` flag is given. Check the vhost section
+  at programming guide for more information.
+
+
 * **Added vhost-user indirect descriptors support.**
 
   If indirect descriptor feature is negotiated, each packet sent by the guest
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index a88aecd..c53ff64 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -53,6 +53,7 @@
 
 #define RTE_VHOST_USER_CLIENT		(1ULL << 0)
 #define RTE_VHOST_USER_NO_RECONNECT	(1ULL << 1)
+#define RTE_VHOST_USER_DEQUEUE_ZERO_COPY	(1ULL << 2)
 
 /* Enum for virtqueue management. */
 enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index bf03f84..967cb65 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -62,6 +62,7 @@ struct vhost_user_socket {
 	int connfd;
 	bool is_server;
 	bool reconnect;
+	bool dequeue_zero_copy;
 };
 
 struct vhost_user_connection {
@@ -203,6 +204,9 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	size = strnlen(vsocket->path, PATH_MAX);
 	vhost_set_ifname(vid, vsocket->path, size);
 
+	if (vsocket->dequeue_zero_copy)
+		vhost_enable_dequeue_zero_copy(vid);
+
 	RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
 
 	vsocket->connfd = fd;
@@ -499,6 +503,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
 	memset(vsocket, 0, sizeof(struct vhost_user_socket));
 	vsocket->path = strdup(path);
 	vsocket->connfd = -1;
+	vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
 
 	if ((flags & RTE_VHOST_USER_CLIENT) != 0) {
 		vsocket->reconnect = !(flags & RTE_VHOST_USER_NO_RECONNECT);
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index dbf5d1b..469117a 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -291,6 +291,16 @@ vhost_set_ifname(int vid, const char *if_name, unsigned int if_len)
 	dev->ifname[sizeof(dev->ifname) - 1] = '\0';
 }
 
+void
+vhost_enable_dequeue_zero_copy(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->dequeue_zero_copy = 1;
+}
 
 int
 rte_vhost_get_numa_node(int vid)
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index be8a398..53dbf33 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -278,6 +278,7 @@ void vhost_destroy_device(int);
 int alloc_vring_queue_pair(struct virtio_net *dev, uint32_t qp_idx);
 
 void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
+void vhost_enable_dequeue_zero_copy(int vid);
 
 /*
  * Backend-specific cleanup. Defined by vhost-cuse and vhost-user.
-- 
1.9.0


* [dpdk-dev] [PATCH v3 6/7] examples/vhost: add an option to enable dequeue zero copy
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
                       ` (4 preceding siblings ...)
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 5/7] vhost: add a flag to enable " Yuanhan Liu
@ 2016-10-09  7:27     ` Yuanhan Liu
  2016-10-09  7:28     ` [dpdk-dev] [PATCH v3 7/7] net/vhost: " Yuanhan Liu
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09  7:27 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

Add an option, --dequeue-zero-copy, to enable dequeue zero copy.

One thing worth noting when using dequeue zero copy is that nb_tx_desc
has to be small enough that the eth driver hits the mbuf free threshold
easily and thus frees mbufs more frequently.

The reason is that, when dequeue zero copy is enabled, the guest Tx
used vring is updated only when the corresponding mbuf is freed. If
mbufs are not freed frequently, the guest Tx vring could be starved.
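
A sizing rule along these lines could be sketched as below. The helper
zcopy_nb_tx_desc() and the two cap macros are hypothetical; the values
64 and 128 are the limits quoted in the vhost_lib.rst text of patch
5/7, while the example application in this patch simply uses 64
whenever the option is set.

```c
#include <stdint.h>

/*
 * Caps quoted in the programmer's guide text of patch 5/7: 64
 * descriptors without the virtio indirect feature, 128 with it.
 */
#define ZCOPY_TXD_CAP_DIRECT	64
#define ZCOPY_TXD_CAP_INDIRECT	128

/* Hypothetical helper: clamp a requested nb_tx_desc for zero copy. */
static uint16_t
zcopy_nb_tx_desc(uint16_t requested, int zero_copy, int indirect)
{
	uint16_t cap;

	/* Without zero copy the requested ring size is used as is. */
	if (!zero_copy)
		return requested;

	cap = indirect ? ZCOPY_TXD_CAP_INDIRECT : ZCOPY_TXD_CAP_DIRECT;
	return requested < cap ? requested : cap;
}
```

A smaller Tx ring makes the NIC driver run out of descriptors sooner,
so it frees transmitted mbufs sooner, and the guest gets its
descriptors back sooner.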

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---

v3: - handle missing "Tx --> dequeue" renaming in usage
---
 examples/vhost/main.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 195b1db..91000e8 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -127,6 +127,7 @@ static uint32_t enable_tx_csum;
 static uint32_t enable_tso;
 
 static int client_mode;
+static int dequeue_zero_copy;
 
 /* Specify timeout (in useconds) between retries on RX. */
 static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US;
@@ -294,6 +295,17 @@ port_init(uint8_t port)
 
 	rx_ring_size = RTE_TEST_RX_DESC_DEFAULT;
 	tx_ring_size = RTE_TEST_TX_DESC_DEFAULT;
+
+	/*
+	 * When dequeue zero copy is enabled, guest Tx used vring will be
+	 * updated only when corresponding mbuf is freed. Thus, the nb_tx_desc
+	 * (tx_ring_size here) must be small enough so that the driver will
+	 * hit the free threshold easily and free mbufs timely. Otherwise,
+	 * guest Tx vring would be starved.
+	 */
+	if (dequeue_zero_copy)
+		tx_ring_size = 64;
+
 	tx_rings = (uint16_t)rte_lcore_count();
 
 	retval = validate_num_devices(MAX_DEVICES);
@@ -470,7 +482,8 @@ us_vhost_usage(const char *prgname)
 	"		--socket-file: The path of the socket file.\n"
 	"		--tx-csum [0|1] disable/enable TX checksum offload.\n"
 	"		--tso [0|1] disable/enable TCP segment offload.\n"
-	"		--client register a vhost-user socket as client mode.\n",
+	"		--client register a vhost-user socket as client mode.\n"
+	"		--dequeue-zero-copy enables dequeue zero copy\n",
 	       prgname);
 }
 
@@ -495,6 +508,7 @@ us_vhost_parse_args(int argc, char **argv)
 		{"tx-csum", required_argument, NULL, 0},
 		{"tso", required_argument, NULL, 0},
 		{"client", no_argument, &client_mode, 1},
+		{"dequeue-zero-copy", no_argument, &dequeue_zero_copy, 1},
 		{NULL, 0, 0, 0},
 	};
 
@@ -1501,6 +1515,9 @@ main(int argc, char *argv[])
 	if (client_mode)
 		flags |= RTE_VHOST_USER_CLIENT;
 
+	if (dequeue_zero_copy)
+		flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+
 	/* Register vhost user driver to handle vhost messages. */
 	for (i = 0; i < nb_sockets; i++) {
 		ret = rte_vhost_driver_register
-- 
1.9.0
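
The ring-size argument in the commit log above can be illustrated with a small model. This is only a sketch under stated assumptions: the helper names, the guest vring size and the threshold values used with it are hypothetical and do not come from the patch. It assumes the driver starts reclaiming transmitted mbufs only once fewer than tx_free_thresh descriptors are free, and that each un-freed mbuf holds back at least one guest Tx descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model (an assumption, not driver code): a Tx ring with
 * nb_tx_desc descriptors reclaims transmitted mbufs only when fewer than
 * tx_free_thresh descriptors are free, so up to
 * nb_tx_desc - tx_free_thresh mbufs can sit in the ring un-freed.
 */
static uint16_t
max_pinned_mbufs(uint16_t nb_tx_desc, uint16_t tx_free_thresh)
{
	assert(nb_tx_desc > 0);

	if (tx_free_thresh >= nb_tx_desc)
		return 0;
	return (uint16_t)(nb_tx_desc - tx_free_thresh);
}

/*
 * With dequeue zero copy, the used ring entry of a guest descriptor is
 * written only when its mbuf is freed. If the NIC ring can hold at least
 * as many un-freed mbufs as the guest vring has entries, the guest can
 * run out of descriptors before anything is returned to it.
 */
static bool
guest_ring_can_starve(uint16_t guest_vring_size, uint16_t nb_tx_desc,
		      uint16_t tx_free_thresh)
{
	return max_pinned_mbufs(nb_tx_desc, tx_free_thresh) >= guest_vring_size;
}
```

In this model, a hypothetical 256-entry guest vring can be starved by a 512-entry NIC Tx ring with a threshold of 32, but not by the 64-entry ring the patch selects.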

* [dpdk-dev] [PATCH v3 7/7] net/vhost: add an option to enable dequeue zero copy
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
                       ` (5 preceding siblings ...)
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 6/7] examples/vhost: add an option " Yuanhan Liu
@ 2016-10-09  7:28     ` Yuanhan Liu
  2016-10-11 13:04     ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Xu, Qian Q
  2016-10-12  7:48     ` Yuanhan Liu
  8 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09  7:28 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin, Yuanhan Liu

Add an option, dequeue-zero-copy, to enable this feature in vhost-pmd.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 drivers/net/vhost/rte_eth_vhost.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/vhost/rte_eth_vhost.c b/drivers/net/vhost/rte_eth_vhost.c
index c1d09a0..86c7a4d 100644
--- a/drivers/net/vhost/rte_eth_vhost.c
+++ b/drivers/net/vhost/rte_eth_vhost.c
@@ -51,6 +51,7 @@
 #define ETH_VHOST_IFACE_ARG		"iface"
 #define ETH_VHOST_QUEUES_ARG		"queues"
 #define ETH_VHOST_CLIENT_ARG		"client"
+#define ETH_VHOST_DEQUEUE_ZERO_COPY	"dequeue-zero-copy"
 
 static const char *drivername = "VHOST PMD";
 
@@ -58,6 +59,7 @@ static const char *valid_arguments[] = {
 	ETH_VHOST_IFACE_ARG,
 	ETH_VHOST_QUEUES_ARG,
 	ETH_VHOST_CLIENT_ARG,
+	ETH_VHOST_DEQUEUE_ZERO_COPY,
 	NULL
 };
 
@@ -1132,6 +1134,7 @@ rte_pmd_vhost_probe(const char *name, const char *params)
 	uint16_t queues;
 	uint64_t flags = 0;
 	int client_mode = 0;
+	int dequeue_zero_copy = 0;
 
 	RTE_LOG(INFO, PMD, "Initializing pmd_vhost for %s\n", name);
 
@@ -1168,6 +1171,16 @@ rte_pmd_vhost_probe(const char *name, const char *params)
 			flags |= RTE_VHOST_USER_CLIENT;
 	}
 
+	if (rte_kvargs_count(kvlist, ETH_VHOST_DEQUEUE_ZERO_COPY) == 1) {
+		ret = rte_kvargs_process(kvlist, ETH_VHOST_DEQUEUE_ZERO_COPY,
+					 &open_int, &dequeue_zero_copy);
+		if (ret < 0)
+			goto out_free;
+
+		if (dequeue_zero_copy)
+			flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+	}
+
 	eth_dev_vhost_create(name, iface_name, queues, rte_socket_id(), flags);
 
 out_free:
-- 
1.9.0
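
How a "dequeue-zero-copy=1" device argument ends up as a flag bit can be sketched in plain C. This is a stand-alone illustration, not the PMD code: the patch uses rte_kvargs for the parsing, whereas the hand-rolled parser, the demo_* names and the flag values below are all hypothetical, as is the iface path used with it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag values; the real ones live in rte_virtio_net.h. */
#define DEMO_FLAG_CLIENT            (1ULL << 0)
#define DEMO_FLAG_DEQUEUE_ZERO_COPY (1ULL << 1)

/*
 * Look for "key=<int>" in a comma-separated argument string.
 * Returns 1 and stores the value when the key is present, 0 otherwise.
 */
static int
demo_get_int_arg(const char *params, const char *key, int *val)
{
	size_t klen = strlen(key);
	const char *p = params;

	assert(params != NULL && val != NULL);

	while (p != NULL && *p != '\0') {
		/* p always points at the start of a "key=value" token. */
		if (strncmp(p, key, klen) == 0 && p[klen] == '=') {
			*val = (int)strtol(p + klen + 1, NULL, 0);
			return 1;
		}
		p = strchr(p, ',');
		if (p != NULL)
			p++;
	}
	return 0;
}

/* Build the flag word the same way the probe function in the patch does. */
static uint64_t
demo_vhost_flags(const char *params)
{
	uint64_t flags = 0;
	int client = 0;
	int zero_copy = 0;

	if (demo_get_int_arg(params, "client", &client) && client)
		flags |= DEMO_FLAG_CLIENT;
	if (demo_get_int_arg(params, "dequeue-zero-copy", &zero_copy) &&
	    zero_copy)
		flags |= DEMO_FLAG_DEQUEUE_ZERO_COPY;

	return flags;
}
```

The point of the sketch is that the option is opt-in: a missing key or an explicit "dequeue-zero-copy=0" leaves the flag clear.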

* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-08-23  8:10 [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support Yuanhan Liu
                   ` (8 preceding siblings ...)
  2016-09-23  4:13 ` [dpdk-dev] [PATCH v2 0/7] vhost: add dequeue " Yuanhan Liu
@ 2016-10-09 10:46 ` linhaifeng
  2016-10-10  8:03   ` Yuanhan Liu
  9 siblings, 1 reply; 75+ messages in thread
From: linhaifeng @ 2016-10-09 10:46 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Maxime Coquelin

On 2016/8/23 16:10, Yuanhan Liu wrote:
> The basic idea of Tx zero copy is, instead of copying data from the
> desc buf, here we let the mbuf reference the desc buf addr directly.

Is there a problem when pushing a VLAN tag onto an mbuf that references the desc buf addr directly?
We know that if the guest uses virtio_net (kernel), the skb may have no headroom.

* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-08-29  8:32 ` Xu, Qian Q
  2016-08-29  8:57   ` Xu, Qian Q
@ 2016-10-09 15:20   ` Yuanhan Liu
  1 sibling, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-09 15:20 UTC (permalink / raw)
  To: Xu, Qian Q; +Cc: dev, Maxime Coquelin

On Mon, Aug 29, 2016 at 08:32:55AM +0000, Xu, Qian Q wrote:
> I just ran a PVP test: the NIC receives packets, then forwards them to the vhost PMD and the virtio user interface. I didn't see any performance gain in this scenario. No packet size from 64B to 1518B
> benefited from this patchset; in fact, performance dropped a lot below 1280B and was similar at 1518B.

40G nic?

> The TX/RX desc setting is " txd=64, rxd=128"

Try it with "txd=128", you should be able to set that value since the
vhost Tx indirect patch is merged.

	--yliu

> for the TX-zero-copy enabled case. For the TX-zero-copy disabled case, I just ran default testpmd (txd=512, rxd=128) without the patch.
> Could you help check the NIC2VM case?
> 
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
> Sent: Tuesday, August 23, 2016 4:11 PM
> To: dev@dpdk.org
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Subject: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
> 
> This patch set enables vhost Tx zero copy. The majority work goes to patch 4: vhost: add Tx zero copy.
> 
> The basic idea of Tx zero copy is, instead of copying data from the desc buf, here we let the mbuf reference the desc buf addr directly.
> 
> The major issue behind that is how and when to update the used ring.
> You could check the commit log of patch 4 for more details.
> 
> Patch 5 introduces a new flag, RTE_VHOST_USER_TX_ZERO_COPY, to enable Tx zero copy, which is disabled by default.
> 
> Few more TODOs are left, including handling a desc buf that is across two physical pages, updating release note, etc. Those will be fixed in later version. For now, here is a simple one that hopefully it shows the idea clearly.
> 
> I did some quick tests, the performance gain is quite impressive.
> 
> For a simple dequeue workload (running rxonly in vhost-pmd and running txonly in guest testpmd), it yields 40+% performance boost for packet size 1400B.
> 
> For VM2VM iperf test case, it's even better: about 70% boost.
> 
> ---
> Yuanhan Liu (6):
>   vhost: simplify memory regions handling
>   vhost: get guest/host physical address mappings
>   vhost: introduce last avail idx for Tx
>   vhost: add Tx zero copy
>   vhost: add a flag to enable Tx zero copy
>   examples/vhost: add an option to enable Tx zero copy
> 
>  doc/guides/prog_guide/vhost_lib.rst |   7 +-
>  examples/vhost/main.c               |  19 ++-
>  lib/librte_vhost/rte_virtio_net.h   |   1 +
>  lib/librte_vhost/socket.c           |   5 +
>  lib/librte_vhost/vhost.c            |  12 ++
>  lib/librte_vhost/vhost.h            | 103 +++++++++----
>  lib/librte_vhost/vhost_user.c       | 297 +++++++++++++++++++++++-------------
>  lib/librte_vhost/virtio_net.c       | 188 +++++++++++++++++++----
>  8 files changed, 472 insertions(+), 160 deletions(-)
> 
> --
> 1.9.0

* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-10-09 10:46 ` [dpdk-dev] [PATCH 0/6] vhost: add Tx " linhaifeng
@ 2016-10-10  8:03   ` Yuanhan Liu
  2016-10-14  7:30     ` linhaifeng
  0 siblings, 1 reply; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-10  8:03 UTC (permalink / raw)
  To: linhaifeng; +Cc: dev, Maxime Coquelin

On Sun, Oct 09, 2016 at 06:46:44PM +0800, linhaifeng wrote:
> On 2016/8/23 16:10, Yuanhan Liu wrote:
> > The basic idea of Tx zero copy is, instead of copying data from the
> > desc buf, here we let the mbuf reference the desc buf addr directly.
> 
> Is there a problem when pushing a VLAN tag onto an mbuf that references the desc buf addr directly?

Yes, you can't do that when zero copy is enabled, because of the following
piece of code:

    +               if (unlikely(dev->dequeue_zero_copy && (hpa = gpa_to_hpa(dev,
    +                                       desc->addr + desc_offset, cpy_len)))) {
    +                       cur->data_len = cpy_len;
==> +                       cur->data_off = 0;
    +                       cur->buf_addr = (void *)(uintptr_t)desc_addr;
    +                       cur->buf_physaddr = hpa;

The marked line basically leaves the mbuf with no headroom to use.

	--yliu

> We know that if the guest uses virtio_net (kernel), the skb may have no headroom.
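
Why data_off = 0 rules out pushing a VLAN tag can be shown with a stripped-down model. This is a sketch only: struct demo_mbuf and the demo_* helpers are hypothetical stand-ins, not the DPDK mbuf API, and the 128-byte headroom in the copied case is just an example value. The model treats headroom as the space between the buffer start and the packet data, and refuses a prepend that does not fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal stand-in for an mbuf; not the DPDK structure. */
struct demo_mbuf {
	uint8_t *buf_addr;	/* start of the underlying buffer */
	uint16_t data_off;	/* offset of the packet data in the buffer */
	uint16_t data_len;	/* length of the packet data */
};

/* Headroom is whatever lies between buffer start and packet data. */
static uint16_t
demo_headroom(const struct demo_mbuf *m)
{
	assert(m != NULL);
	return m->data_off;
}

/*
 * Make room for len bytes in front of the packet data.
 * Returns the new data start, or NULL when the headroom is too small.
 */
static uint8_t *
demo_prepend(struct demo_mbuf *m, uint16_t len)
{
	if (len > demo_headroom(m))
		return NULL;

	m->data_off = (uint16_t)(m->data_off - len);
	m->data_len = (uint16_t)(m->data_len + len);
	return m->buf_addr + m->data_off;
}
```

With a copied mbuf whose data starts 128 bytes into the buffer, prepending a 4-byte VLAN tag succeeds; with the zero-copy layout, where buf_addr points straight at the guest buffer and data_off is 0, the same call returns NULL.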

* Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
  2016-10-09  2:03       ` Yuanhan Liu
@ 2016-10-10 10:12         ` Xu, Qian Q
  2016-10-10 10:14           ` Maxime Coquelin
  0 siblings, 1 reply; 75+ messages in thread
From: Xu, Qian Q @ 2016-10-10 10:12 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Maxime Coquelin

Good to know. I will try v3. BTW, on the master branch the vhost PMD seems to be broken: when we run --vdev 'eth_vhost0,xxxx', it reports an error that the driver is not supported. V16.07 is OK, but I haven't had time to do a git bisect.

-----Original Message-----
From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com] 
Sent: Sunday, October 9, 2016 3:03 AM
To: Xu, Qian Q <qian.q.xu@intel.com>
Cc: dev@dpdk.org; Maxime Coquelin <maxime.coquelin@redhat.com>
Subject: Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy

On Thu, Oct 06, 2016 at 02:37:27PM +0000, Xu, Qian Q wrote:
> The function copy_desc_to_mbuf has changed in the dpdk-next-virtio repo. On the current dpdk-next-virtio repo, the commit is as below:
> commit b4f7b43cd9d3b6413f41221051d03a23bc5f5fbe
> Author: Zhiyong Yang <zhiyong.yang@intel.com>
> Date:   Thu Sep 29 20:35:49 2016 +0800
> 
> There you will find that the parameter "struct vhost_virtqueue *vq" has been removed, so if your patch is applied on top of that commit, the build fails: vq is no longer defined, but the function still uses it.
> Could you check? Thx.

I knew that: a rebase is needed, and I have done the rebase (locally); just haven't sent it out yet.

	--yliu

> 
> == Build lib/librte_table
> /home/qxu10/dpdk-zero/lib/librte_vhost/virtio_net.c: In function 'copy_desc_to_mbuf':
> /home/qxu10/dpdk-zero/lib/librte_vhost/virtio_net.c:745:21: error: 'vq' undeclared (first use in this function)
>    zmbuf = get_zmbuf(vq);
>                      ^
> /home/qxu10/dpdk-zero/lib/librte_vhost/virtio_net.c:745:21: note: each 
> undeclared identifier is reported only once for each function it 
> appears in
> /home/qxu10/dpdk-zero/mk/internal/rte.compile-pre.mk:138: recipe for 
> target 'virtio_net.o' failed
> make[5]: *** [virtio_net.o] Error 1
> /home/qxu10/dpdk-zero/mk/rte.subdir.mk:61: recipe for target 
> 'librte_vhost' failed
> make[4]: *** [librte_vhost] Error 2
> make[4]: *** Waiting for unfinished jobs....
> 

* Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
  2016-10-10 10:12         ` Xu, Qian Q
@ 2016-10-10 10:14           ` Maxime Coquelin
  2016-10-10 10:22             ` Xu, Qian Q
  0 siblings, 1 reply; 75+ messages in thread
From: Maxime Coquelin @ 2016-10-10 10:14 UTC (permalink / raw)
  To: Xu, Qian Q, Yuanhan Liu; +Cc: dev

Hi Xu,

On 10/10/2016 12:12 PM, Xu, Qian Q wrote:
> Good to know. I will try v3. BTW, on the master branch the vhost PMD seems to be broken: when we run --vdev 'eth_vhost0,xxxx', it reports an error that the driver is not supported. V16.07 is OK, but I haven't had time to do a git bisect.
The name has been changed to net_vhost0 for consistency with other PMDs.

Maxime

* Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
  2016-10-10 10:14           ` Maxime Coquelin
@ 2016-10-10 10:22             ` Xu, Qian Q
  2016-10-10 10:40               ` Xu, Qian Q
  2016-10-10 11:48               ` Maxime Coquelin
  0 siblings, 2 replies; 75+ messages in thread
From: Xu, Qian Q @ 2016-10-10 10:22 UTC (permalink / raw)
  To: Maxime Coquelin, Yuanhan Liu; +Cc: dev

Oh, thanks for the info. It would be better to have this documented in the R16.11 release notes.

-----Original Message-----
From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] 
Sent: Monday, October 10, 2016 11:14 AM
To: Xu, Qian Q <qian.q.xu@intel.com>; Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy

Hi Xu,

On 10/10/2016 12:12 PM, Xu, Qian Q wrote:
> Good to know. I will try v3. BTW, on the master branch the vhost PMD seems to be broken: when we run --vdev 'eth_vhost0,xxxx', it reports an error that the driver is not supported. V16.07 is OK, but I haven't had time to do a git bisect.
The name has been changed to net_vhost0 for consistency with other PMDs.

Maxime

* Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
  2016-10-10 10:22             ` Xu, Qian Q
@ 2016-10-10 10:40               ` Xu, Qian Q
  2016-10-10 11:48               ` Maxime Coquelin
  1 sibling, 0 replies; 75+ messages in thread
From: Xu, Qian Q @ 2016-10-10 10:40 UTC (permalink / raw)
  To: Xu, Qian Q, Maxime Coquelin, Yuanhan Liu; +Cc: dev

I'm a little concerned about whether changing the name from release to release is the right approach: some users may still use eth_vhost for the driver and find that it no longer works. Don't we also need to keep consistency between releases?
What is the difference between the eth and net names? Other PMDs are not virtual devices. Virtio_user is also a virtual device; do we need to change its name, too?


-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Xu, Qian Q
Sent: Monday, October 10, 2016 11:23 AM
To: Maxime Coquelin <maxime.coquelin@redhat.com>; Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy

Oh, thanks for the info. It would be better to have this documented in the R16.11 release notes.

-----Original Message-----
From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com] 
Sent: Monday, October 10, 2016 11:14 AM
To: Xu, Qian Q <qian.q.xu@intel.com>; Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy

Hi Xu,

On 10/10/2016 12:12 PM, Xu, Qian Q wrote:
> Good to know. I will try v3. BTW, on the master branch the vhost PMD seems to be broken: when we run --vdev 'eth_vhost0,xxxx', it reports an error that the driver is not supported. V16.07 is OK, but I haven't had time to do a git bisect.
The name has been changed to net_vhost0 for consistency with other PMDs.

Maxime

* Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
  2016-10-10 10:22             ` Xu, Qian Q
  2016-10-10 10:40               ` Xu, Qian Q
@ 2016-10-10 11:48               ` Maxime Coquelin
  1 sibling, 0 replies; 75+ messages in thread
From: Maxime Coquelin @ 2016-10-10 11:48 UTC (permalink / raw)
  To: Xu, Qian Q, Yuanhan Liu; +Cc: dev



On 10/10/2016 12:22 PM, Xu, Qian Q wrote:
> Oh, thanks for the info. It would be better to have this documented in the R16.11 release notes.
I'm not the author of this change; I just ran into the same situation as
you did, as a user.
But I agree that if documentation is not aligned, it should be updated.

Maxime

>
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Monday, October 10, 2016 11:14 AM
> To: Xu, Qian Q <qian.q.xu@intel.com>; Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 4/7] vhost: add dequeue zero copy
>
> Hi Xu,
>
> On 10/10/2016 12:12 PM, Xu, Qian Q wrote:
>> Good to know. I will try v3. BTW, on the master branch the vhost PMD seems to be broken: when we run --vdev 'eth_vhost0,xxxx', it reports an error that the driver is not supported. V16.07 is OK, but I haven't had time to do a git bisect.
> The name has been changed to net_vhost0 for consistency with other PMDs.
>
> Maxime
>

* Re: [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
                       ` (6 preceding siblings ...)
  2016-10-09  7:28     ` [dpdk-dev] [PATCH v3 7/7] net/vhost: " Yuanhan Liu
@ 2016-10-11 13:04     ` Xu, Qian Q
  2016-10-12  7:48     ` Yuanhan Liu
  8 siblings, 0 replies; 75+ messages in thread
From: Xu, Qian Q @ 2016-10-11 13:04 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Maxime Coquelin

Tested-by: Qian Xu <qian.q.xu@intel.com>
- Apply patch to dpdk-next-virtio: Pass
- Compile: Pass
- OS: Ubuntu16.04 4.4.0-34-generic
- GCC: 5.4.0

Test Case - Pass: over 20% performance gain for big packets (1024B); the feature is designed to improve big-packet performance.
- Test case: Without NIC, Vhost dequeuer, virtio TXonly, mergeable=on: ~28% performance gain for packet size 1518B; for small packets (64B), performance is similar to zero-copy=0.
- Test case: With Intel FVL 40G NIC, PVP case, txd=128, mergeable=on: for packet sizes of 1K (1024B) and above we see a benefit. For example, 1024B gets an 18% performance gain and 1518B gets a 26% performance gain compared with zero-copy=0. For small packets such as 64B there is a 15% performance drop, which is reasonable, since vhost zero copy is not meant to help small-packet performance.


-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
Sent: Sunday, October 9, 2016 8:28 AM
To: dev@dpdk.org
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>; Yuanhan Liu <yuanhan.liu@linux.intel.com>
Subject: [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support

This patch set enables vhost dequeue zero copy. The majority work goes to patch 4: "vhost: add dequeue zero copy".

The basic idea of dequeue zero copy is, instead of copying data from the desc buf, here we let the mbuf reference the desc buf addr directly.

The major issue behind that is how and when to update the used ring.
You could check the commit log of patch 4 for more details.

Patch 5 introduces a new flag, RTE_VHOST_USER_DEQUEUE_ZERO_COPY, to enable dequeue zero copy, which is disabled by default.

The performance gain is quite impressive. For a simple dequeue workload (running rxonly in vhost-pmd and running txonly in guest testpmd), it yields
50+% performance boost for packet size 1500B. For VM2VM iperf test case,
it's even better: about 70% boost.

For small packets, the performance is worse (it's expected, as the extra overhead introduced by zero copy outweighs the benefits from saving few bytes copy).

v3: - rebase: mainly for removing conflicts with the Tx indirect patch
    - don't update last_used_idx twice for zero-copy mode
    - handle two missing "Tx -> dequeue" renames in log and usage

v2: - renamed "tx zero copy" to "dequeue zero copy", to reduce confusion.
    - handle the case where a desc buf might span 2 host phys pages
    - use MAP_POPULATE to let kernel populate the page tables
    - updated release note
    - doc-ed the limitations for the vm2nic case
    - merge 2 continuous guest phys memory region
    - and few more trivial changes, please see them in the corresponding
      patches

---
Yuanhan Liu (7):
  vhost: simplify memory regions handling
  vhost: get guest/host physical address mappings
  vhost: introduce last avail idx for dequeue
  vhost: add dequeue zero copy
  vhost: add a flag to enable dequeue zero copy
  examples/vhost: add an option to enable dequeue zero copy
  net/vhost: add an option to enable dequeue zero copy

 doc/guides/prog_guide/vhost_lib.rst    |  35 +++-
 doc/guides/rel_notes/release_16_11.rst |  13 ++
 drivers/net/vhost/rte_eth_vhost.c      |  13 ++
 examples/vhost/main.c                  |  19 +-
 lib/librte_vhost/rte_virtio_net.h      |   1 +
 lib/librte_vhost/socket.c              |   5 +
 lib/librte_vhost/vhost.c               |  12 ++
 lib/librte_vhost/vhost.h               | 102 ++++++++---
 lib/librte_vhost/vhost_user.c          | 315 ++++++++++++++++++++++-----------
 lib/librte_vhost/virtio_net.c          | 196 +++++++++++++++++---
 10 files changed, 549 insertions(+), 162 deletions(-)

--
1.9.0

* Re: [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support
  2016-10-09  7:27   ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Yuanhan Liu
                       ` (7 preceding siblings ...)
  2016-10-11 13:04     ` [dpdk-dev] [PATCH v3 0/7] vhost: add dequeue zero copy support Xu, Qian Q
@ 2016-10-12  7:48     ` Yuanhan Liu
  8 siblings, 0 replies; 75+ messages in thread
From: Yuanhan Liu @ 2016-10-12  7:48 UTC (permalink / raw)
  To: dev; +Cc: Maxime Coquelin

On Sun, Oct 09, 2016 at 03:27:53PM +0800, Yuanhan Liu wrote:
> This patch set enables vhost dequeue zero copy. The majority work goes
> to patch 4: "vhost: add dequeue zero copy".

Applied to dpdk-next-virtio.

	--yliu
> 
> The basic idea of dequeue zero copy is, instead of copying data from the
> desc buf, here we let the mbuf reference the desc buf addr directly.
> 
> The major issue behind that is how and when to update the used ring.
> You could check the commit log of patch 4 for more details.
> 
> Patch 5 introduces a new flag, RTE_VHOST_USER_DEQUEUE_ZERO_COPY, to enable
> dequeue zero copy, which is disabled by default.
> 
> The performance gain is quite impressive. For a simple dequeue workload
> (running rxonly in vhost-pmd and running txonly in guest testpmd), it yields
> 50+% performance boost for packet size 1500B. For VM2VM iperf test case,
> it's even better: about 70% boost.
> 
> For small packets, the performance is worse (it's expected, as the extra
> overhead introduced by zero copy outweighs the benefits from saving few
> bytes copy).
> 
> v3: - rebase: mainly for removing conflicts with the Tx indirect patch
>     - don't update last_used_idx twice for zero-copy mode
>     - handle two missing "Tx -> dequeue" renames in log and usage
> 
> v2: - renamed "tx zero copy" to "dequeue zero copy", to reduce confusion.
>     - handle the case where a desc buf might span 2 host phys pages
>     - use MAP_POPULATE to let kernel populate the page tables
>     - updated release note
>     - doc-ed the limitations for the vm2nic case
>     - merge 2 continuous guest phys memory region
>     - and few more trivial changes, please see them in the corresponding
>       patches
> 
> ---
> Yuanhan Liu (7):
>   vhost: simplify memory regions handling
>   vhost: get guest/host physical address mappings
>   vhost: introduce last avail idx for dequeue
>   vhost: add dequeue zero copy
>   vhost: add a flag to enable dequeue zero copy
>   examples/vhost: add an option to enable dequeue zero copy
>   net/vhost: add an option to enable dequeue zero copy
> 
>  doc/guides/prog_guide/vhost_lib.rst    |  35 +++-
>  doc/guides/rel_notes/release_16_11.rst |  13 ++
>  drivers/net/vhost/rte_eth_vhost.c      |  13 ++
>  examples/vhost/main.c                  |  19 +-
>  lib/librte_vhost/rte_virtio_net.h      |   1 +
>  lib/librte_vhost/socket.c              |   5 +
>  lib/librte_vhost/vhost.c               |  12 ++
>  lib/librte_vhost/vhost.h               | 102 ++++++++---
>  lib/librte_vhost/vhost_user.c          | 315 ++++++++++++++++++++++-----------
>  lib/librte_vhost/virtio_net.c          | 196 +++++++++++++++++---
>  10 files changed, 549 insertions(+), 162 deletions(-)
> 
> -- 
> 1.9.0

* Re: [dpdk-dev] [PATCH 0/6] vhost: add Tx zero copy support
  2016-10-10  8:03   ` Yuanhan Liu
@ 2016-10-14  7:30     ` linhaifeng
  0 siblings, 0 replies; 75+ messages in thread
From: linhaifeng @ 2016-10-14  7:30 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Maxime Coquelin

On 2016/10/10 16:03, Yuanhan Liu wrote:
> On Sun, Oct 09, 2016 at 06:46:44PM +0800, linhaifeng wrote:
>> On 2016/8/23 16:10, Yuanhan Liu wrote:
>>> The basic idea of Tx zero copy is, instead of copying data from the
>>> desc buf, here we let the mbuf reference the desc buf addr directly.
>>
>> Is there a problem when pushing a VLAN tag onto an mbuf that references the desc buf addr directly?
> 
> Yes, you can't do that when zero copy is enabled, because of the following
> piece of code:
> 
>     +               if (unlikely(dev->dequeue_zero_copy && (hpa = gpa_to_hpa(dev,
>     +                                       desc->addr + desc_offset, cpy_len)))) {
>     +                       cur->data_len = cpy_len;
> ==> +                       cur->data_off = 0;
>     +                       cur->buf_addr = (void *)(uintptr_t)desc_addr;
>     +                       cur->buf_physaddr = hpa;
> 
> The marked line basically leaves the mbuf with no headroom to use.
> 
> 	--yliu
> 
>> We know that if the guest uses virtio_net (kernel), the skb may have no headroom.
> 
> .
>

It is OK to set data_off to zero.
But we could also use the 128 bytes of headroom when the guest uses the virtio_net PMD, though not with the virtio_net kernel driver.

I think it would be better to add a headroom size to the desc and have the kernel driver support setting the headroom size.

* Re: [dpdk-dev] [PATCH v3 2/7] vhost: get guest/host physical address mappings
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 2/7] vhost: get guest/host physical address mappings Yuanhan Liu
@ 2016-11-29  3:10       ` linhaifeng
  2016-11-29 13:14       ` linhaifeng
  1 sibling, 0 replies; 75+ messages in thread
From: linhaifeng @ 2016-11-29  3:10 UTC (permalink / raw)
  To: dev

On 2016/10/9 15:27, Yuanhan Liu wrote:
> +	dev->nr_guest_pages = 0;
> +	if (!dev->guest_pages) {
> +		dev->max_guest_pages = 8;
> +		dev->guest_pages = malloc(dev->max_guest_pages *
> +						sizeof(struct guest_page));
> +	}
> +
When is guest_pages freed?
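
The lifetime in question can be sketched as an allocate-once array paired with an explicit teardown. This is a hypothetical illustration, not code from the patch: the demo_* names are invented, the struct layout is assumed, and the cleanup function only shows where a matching free() could sit (for example on device destroy); the thread does not say where, or whether, the patch frees the array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the structures touched by the quoted hunk. */
struct demo_guest_page {
	uint64_t guest_phys_addr;
	uint64_t host_phys_addr;
	uint64_t size;
};

struct demo_dev {
	struct demo_guest_page *guest_pages;
	uint32_t nr_guest_pages;
	uint32_t max_guest_pages;
};

/* Mirrors the quoted hunk: allocate on first use, reuse afterwards. */
static int
demo_reset_guest_pages(struct demo_dev *dev)
{
	assert(dev != NULL);

	dev->nr_guest_pages = 0;
	if (dev->guest_pages == NULL) {
		dev->max_guest_pages = 8;
		dev->guest_pages = malloc(dev->max_guest_pages *
					  sizeof(struct demo_guest_page));
		if (dev->guest_pages == NULL)
			return -1;
	}
	return 0;
}

/* Hypothetical teardown: release the array and reset the bookkeeping. */
static void
demo_free_guest_pages(struct demo_dev *dev)
{
	free(dev->guest_pages);
	dev->guest_pages = NULL;
	dev->nr_guest_pages = 0;
	dev->max_guest_pages = 0;
}
```

Because the reset path reuses an existing array, only the final teardown needs to free it.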

* Re: [dpdk-dev] [PATCH v3 2/7] vhost: get guest/host physical address mappings
  2016-10-09  7:27     ` [dpdk-dev] [PATCH v3 2/7] vhost: get guest/host physical address mappings Yuanhan Liu
  2016-11-29  3:10       ` linhaifeng
@ 2016-11-29 13:14       ` linhaifeng
  1 sibling, 0 replies; 75+ messages in thread
From: linhaifeng @ 2016-11-29 13:14 UTC (permalink / raw)
  To: dev

On 2016/10/9 15:27, Yuanhan Liu wrote:
> +static void
> +add_guest_pages(struct virtio_net *dev, struct virtio_memory_region *reg,
> +		uint64_t page_size)
> +{
> +	uint64_t reg_size = reg->size;
> +	uint64_t host_user_addr  = reg->host_user_addr;
> +	uint64_t guest_phys_addr = reg->guest_phys_addr;
> +	uint64_t host_phys_addr;
> +	uint64_t size;
> +
> +	host_phys_addr = rte_mem_virt2phy((void *)(uintptr_t)host_user_addr);
> +	size = page_size - (guest_phys_addr & (page_size - 1));
> +	size = RTE_MIN(size, reg_size);

Have you tried creating a VM with 25G of memory backed by 1G hugepages?
When I tried, vhost crashed; the change below fixed it:

-	size = page_size - (guest_phys_addr & (page_size - 1));
-	size = RTE_MIN(size, reg_size);
+ 	size = reg_size % page_size;
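
To make the difference between the two expressions concrete, both can be evaluated side by side. This is a sketch for comparison only: the function names are hypothetical, and it makes no claim about which form is correct or about the cause of the crash. The first function follows the two lines from the quoted patch (assuming page_size is a power of two), the second follows the replacement proposed above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * First-chunk size as in the quoted patch: the distance from
 * guest_phys_addr to the next page boundary, capped by the region size.
 * Assumes page_size is a power of two.
 */
static uint64_t
first_chunk_patch(uint64_t guest_phys_addr, uint64_t reg_size,
		  uint64_t page_size)
{
	uint64_t size;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	size = page_size - (guest_phys_addr & (page_size - 1));
	return DEMO_MIN(size, reg_size);
}

/* First-chunk size as in the replacement proposed in this message. */
static uint64_t
first_chunk_proposed(uint64_t reg_size, uint64_t page_size)
{
	assert(page_size != 0);

	return reg_size % page_size;
}
```

For a page-aligned 25G region with 1G pages, the patch form gives a first chunk of 1G while the proposed form gives 0; for a 640K region starting at address 0, both give 640K.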
